Processing Units - CPU, GPU, APU, TPU, VPU, FPGA, QPU
- Neuromorphic Computing Chips
- Development ... Notebooks ... AI Pair Programming ... Codeless, Generators, Drag n' Drop ... AIOps/MLOps ... AIaaS/MLaaS
- Architectures for AI ... Generative AI Stack ... Enterprise Architecture (EA) ... Enterprise Portfolio Management (EPM) ... Architecture and Interior Design
- Time ... PNT ... GPS ... Retrocausality ... Delayed Choice Quantum Eraser ... Quantum
- NVIDIA A100 HPC (High-Performance Computing) Accelerator for ChatGPT
- AI accelerator | Wikipedia
- CPUs, GPUs, and Now AI Chips
- Moore’s Law Is Dying. This Brain-Inspired Analogue Chip Is a Glimpse of What’s Next | Shelly Fan
- Artificial Intelligence Is Driving A Silicon Renaissance | Rob Toews - Forbes
- MIT’s Tiny New Brain Chip Aims for AI in Your Pocket | Jason Dorrier - SingularityHub ... Alloying conducting channels for reliable neuromorphic computing | H. Yeon, P. Lin, C. Choi, S. Tan, Y. Park, D. Lee, J. Lee, F. Xu, B. Gao, H. Wu, H. Qian, Y. Nie, S. Kim & J. Kim - Nature Nanotechnology
- NVIDIA Faces a Tough New Rival in Artificial Intelligence Chips | Leo Sun - The Motley Fool
- New Chip Expands the Possibilities for AI | Allison Whitten - QuantaMagazine... an energy-efficient chip called NeuRRAM operating in an analog fashion to save more energy and space.
- AI Accelerators and Machine Learning Algorithms: Co-Design and Evolution | Shashank Prasanna - Medium ... Efficient algorithms and methods in machine learning for AI accelerators — NVIDIA GPUs, Intel Habana Gaudi and AWS Trainium and Inferentia
Neural network accelerator chips, also known as AI accelerators, are specialized hardware designed to accelerate machine learning computations. They are often manycore designs and generally focus on low-precision arithmetic, novel dataflow architectures, or in-memory computing capability. Massively multicore scalar processors are one type of AI accelerator: they use a large number of simple scalar processing cores to execute different types of algorithms. (They should not be confused with superscalar processors, which issue several instructions per clock cycle within a single core.) One advantage of this design is that its simple arithmetic units can be combined in various ways to execute different types of algorithms; another is that it is highly scalable. Different types of processors suit different types of machine learning models: TPUs are well suited to Convolutional Neural Networks (CNNs), GPUs have benefits for some fully connected neural networks, and CPUs can have advantages for Recurrent Neural Networks (RNNs).
- 1 Unit - Heart of AI
- 1.1 GPU - Graphics Processing Unit
- 1.2 APU - Associative Processing Unit
- 1.3 TPU - Tensor Processing Unit / AI Chip (Scalar Accelerators)
- 1.4 AWS Trainium and Inferentia
- 1.5 FPGA - Field Programmable Gate Array
- 1.6 VPU - Vision Processing Unit
- 1.7 Neuromorphic Chip
- 1.8 QPU - Quantum Processing Unit
- 2 Photonic Integrated Circuit (PIC)
Unit - Heart of AI
Central Processing Unit (CPU), Graphics Processing Unit (GPU), Associative Processing Unit (APU), Tensor Processing Unit (TPU), Field Programmable Gate Array (FPGA), Vision Processing Unit (VPU), and Quantum Processing Unit (QPU)
- Deep learning rethink overcomes major obstacle in AI industry; "sub-linear deep learning engine" (SLIDE) is first algorithm for training deep neural nets faster on CPUs than GPUs | Rice University
GPU - Graphics Processing Unit
A Graphics Processing Unit (GPU) is a specialized electronic circuit designed to manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. GPUs were originally designed to accelerate the rendering of 3D graphics, but over time they became more flexible and programmable, enhancing their capabilities. This allowed graphics programmers to create more interesting visual effects and realistic scenes with advanced lighting and shadowing techniques. Other developers also began to tap the power of GPUs to dramatically accelerate additional workloads in high performance computing (HPC), deep learning, and more.
GPUs can process many pieces of data simultaneously, making them useful for machine learning, video editing, gaming, and content creation. A GPU may be integrated into the computer’s CPU or offered as a discrete hardware unit.
Examples: High-performance NVIDIA T4 and NVIDIA V100 GPUs
APU - Associative Processing Unit
An Associative Processing Unit (APU) is a type of AI accelerator that focuses on identification tasks. It is designed to identify patterns in large amounts of data. GSI Technology’s Gemini APU takes associative memory to a new level, bringing greater flexibility and programmability. APUs are similar to associative memories, or ternary content-addressable memory (TCAM), but they are more flexible and programmable. They can handle masking operations and work with variable length words and comparisons. The APU uses a similar structure that combines computation with words in memory.
Ternary Content-Addressable Memory (TCAM) is a type of Content-Addressable Memory (CAM), also known as associative memory, that stores and searches data in parallel based on its content rather than its memory address. Unlike a binary CAM, a TCAM stores and matches data using three states: 0, 1, and X (don’t care). TCAMs are used in networking devices, where they speed up forwarding information base and routing table operations.
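The three-state matching described above can be sketched in software. This is only a minimal illustration, not a model of any real device: the class name `TernaryCAM`, the bit-string patterns, and the rule that the most specific match wins (an assumption chosen to resemble longest-prefix routing lookups) are all hypothetical. Real TCAM hardware compares every entry at once, whereas this sketch loops over them.

```python
class TernaryCAM:
    """Toy software model of a TCAM (hypothetical, for illustration only)."""

    def __init__(self):
        # Each entry is a (pattern, result) pair; a pattern uses '0', '1' and 'X'.
        self.entries = []

    def add(self, pattern, result):
        if set(pattern) - set("01X"):
            raise ValueError("pattern may contain only 0, 1 and X")
        self.entries.append((pattern, result))

    @staticmethod
    def _matches(pattern, key):
        # 'X' is a don't-care bit: it matches both 0 and 1.
        return len(pattern) == len(key) and all(
            p == "X" or p == k for p, k in zip(pattern, key)
        )

    def lookup(self, key):
        # Assumption: the entry with the fewest don't-care bits wins.
        best = None
        for pattern, result in self.entries:
            if self._matches(pattern, key):
                specificity = len(pattern) - pattern.count("X")
                if best is None or specificity > best[0]:
                    best = (specificity, result)
        return None if best is None else best[1]


tcam = TernaryCAM()
tcam.add("1010XXXX", "port A")
tcam.add("101011XX", "port B")
tcam.add("XXXXXXXX", "default")

print(tcam.lookup("10101101"))  # matches all three patterns; the most specific wins
print(tcam.lookup("01000000"))  # only the all-X pattern matches
```

The don't-care state is what lets a single entry stand for a whole range of keys, which is why the technique suits routing tables.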
Content-Addressable Memory (CAM) is a special type of computer memory used in certain very-high-speed searching applications. It is also known as associative memory or associative storage and compares input search data against a table of stored data, and returns the address of matching data. Unlike standard computer memory, random-access memory (RAM), in which the user supplies a memory address and the RAM returns the data word stored at that address, a CAM is designed such that the user supplies a data word and the CAM searches its entire memory to see if that data word is stored anywhere in it. If the data word is found, the CAM returns a list of one or more storage addresses where the word was found. Thus, a CAM is the hardware embodiment of what in software terms would be called an associative array. CAM is frequently used in networking devices where it speeds up forwarding information base and routing table operations. This kind of associative memory is also used in cache memory. In associative cache memory, both address and content are stored side by side. When the address matches, the corresponding content is fetched from cache memory.
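The contrast between RAM (address in, data out) and CAM (data in, addresses out) can be illustrated with a small software model. This is a hedged sketch: the class `ContentAddressableMemory` and its method names are hypothetical, and the dictionary index merely stands in for the parallel comparison that CAM hardware performs across all cells.

```python
from collections import defaultdict


class ContentAddressableMemory:
    """Toy model offering both a RAM-style read and a CAM-style search."""

    def __init__(self, size):
        self.cells = [None] * size      # address -> data word (RAM view)
        self.index = defaultdict(set)   # data word -> addresses (CAM view)

    def write(self, address, word):
        old = self.cells[address]
        if old is not None:
            self.index[old].discard(address)  # forget the overwritten word
        self.cells[address] = word
        self.index[word].add(address)

    def read(self, address):
        # RAM-style access: supply an address, get the stored word back.
        return self.cells[address]

    def search(self, word):
        # CAM-style access: supply a word, get every address that holds it.
        return sorted(self.index.get(word, ()))


cam = ContentAddressableMemory(size=8)
cam.write(0, "cat")
cam.write(3, "dog")
cam.write(5, "cat")

print(cam.read(3))         # lookup by address
print(cam.search("cat"))   # lookup by content
print(cam.search("bird"))  # no match gives an empty list
```

In this sketch `search` plays the role of an associative array lookup, which is the software analogy the paragraph above draws.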
TPU - Tensor Processing Unit / AI Chip (Scalar Accelerators)
- Google Pixel Phone
- Here are the likely specs of the Google Tensor chip in the Pixel 6 | Mishaal Rahman | XDA
- Tensor Processing Unit (TPU) | Wikipedia
- Google AIY Projects Program
- Baidu unveils Kunlun AI chip for edge and cloud computing | Khari Johnson - VentureBeat
- Google Unveils The New Tensor SoC, The Heart Of The New Pixel 6 Line Of Phones | Patrick Moorhead - Forbes
A Tensor Processing Unit (TPU) is an AI accelerator application-specific integrated circuit (ASIC) developed by Google for neural network machine learning, using Google’s own TensorFlow software. TPUs are designed for a high volume of low-precision computation (e.g. as little as 8-bit precision) with more input/output operations per joule, and they omit hardware for rasterisation and texture mapping. Google provides third parties access to TPUs through its Cloud TPU service as part of the Google Cloud Platform and through its notebook-based services Kaggle and Colaboratory.
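To give a feel for what 8-bit precision means in practice, the sketch below quantizes two small vectors to signed 8-bit integers, multiplies and accumulates them as integers, and rescales the result. It is an illustration under stated assumptions, not a description of how a TPU works internally: the symmetric linear scheme, the function names `quantize` and `quantized_dot`, and the sample values are all chosen for this example.

```python
def quantize(values, num_bits=8):
    """Symmetric linear quantization of floats to signed integers (assumed scheme)."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0     # float value of one integer step
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def quantized_dot(a, b):
    """Dot product computed with integer multiply-accumulate, then rescaled."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    accumulator = sum(x * y for x, y in zip(qa, qb))  # pure integer arithmetic
    return accumulator * scale_a * scale_b


weights = [0.5, -1.0, 0.25, 0.75]
inputs = [1.0, 0.5, -2.0, 0.1]

exact = sum(w * x for w, x in zip(weights, inputs))
approx = quantized_dot(weights, inputs)
print(f"float result: {exact:.4f}  int8 result: {approx:.4f}")
```

The integer result is close to, but not identical with, the floating-point one; accepting that small error is what allows much simpler and more energy-efficient arithmetic units.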
Google Tensor: the chip still has four A55s for the small cores, but it has two Arm Cortex-X1 CPUs at 2.8 GHz to handle foreground processing duties. For "medium" cores, we get two 2.25 GHz A76 CPUs. (That's A76, not the A78 everyone else is using; these A76s are the "big" CPU cores from last year.) The “Google Silicon” team gives us a tour of the Pixel 6’s Tensor SoC | Ron Amadeo - Ars Technica
AWS Trainium and Inferentia
- AWS Trainium is a high-performance machine learning training accelerator, purpose-built by AWS for deep learning training of 100B+ parameter models. Each Amazon Elastic Compute Cloud (EC2) Trn1 instance deploys up to 16 AWS Trainium accelerators to deliver a high-performance, low-cost solution for Deep Learning (DL) training in the cloud. Trainium has been optimized for training Natural Language Processing (NLP), computer vision, and recommender models used in a broad set of applications, such as text summarization, code generation, question answering, image and video generation, recommendation, and fraud detection.
- AWS Inferentia is a high-performance machine learning inference accelerator designed by AWS to deliver high performance at the lowest cost for your deep learning (DL) inference applications. The first-generation AWS Inferentia accelerator powers Amazon Elastic Compute Cloud (Amazon EC2) Inf1 instances, which deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable Amazon EC2 instances.
FPGA - Field Programmable Gate Array
A Field-Programmable Gate Array (FPGA) is an integrated circuit designed to be configured by a customer or a designer after manufacturing. The FPGA configuration is generally specified using a hardware description language (HDL), similar to that used for an application-specific integrated circuit (ASIC). FPGAs contain an array of programmable logic blocks, and a hierarchy of reconfigurable interconnects allowing blocks to be wired together. Logic blocks can be configured to perform complex combinational functions, or act as simple logic gates like AND and XOR. In most FPGAs, logic blocks also include memory elements, which may be simple flip-flops or more complete blocks of memory. FPGAs have a remarkable role in embedded system development due to their capability to start system software development simultaneously with hardware, enable system performance simulations at a very early phase of the development, and allow various system trials and design iterations before finalizing the system architecture.
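The idea that one logic block can be configured as an AND gate, an XOR gate, or a more complex function can be sketched with a lookup table (LUT), the structure many FPGAs use inside their logic blocks. The code below is an illustrative software model only: the `LookupTable` class and the `half_adder` wiring are hypothetical names, and a real design would be written in an HDL rather than Python.

```python
class LookupTable:
    """A k-input LUT: its 'configuration' is simply a truth table of 2**k bits."""

    def __init__(self, num_inputs, truth_table):
        if len(truth_table) != 2 ** num_inputs:
            raise ValueError("truth table must have 2**num_inputs entries")
        self.num_inputs = num_inputs
        self.truth_table = list(truth_table)

    def __call__(self, *bits):
        if len(bits) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        index = 0
        for bit in bits:                 # pack the input bits into a table index
            index = (index << 1) | (bit & 1)
        return self.truth_table[index]


# The same kind of block, configured two different ways.
AND2 = LookupTable(2, [0, 0, 0, 1])
XOR2 = LookupTable(2, [0, 1, 1, 0])


def half_adder(a, b):
    """'Wire together' two configured blocks: returns (sum, carry)."""
    return XOR2(a, b), AND2(a, b)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, half_adder(a, b))
```

Reconfiguring the device amounts to loading different truth tables and different wiring, which is why the same silicon can implement very different circuits.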
VPU - Vision Processing Unit
A Vision Processing Unit (VPU) is a type of microprocessor designed to accelerate machine vision tasks. It is a specific type of AI accelerator, aimed at accelerating machine learning and artificial intelligence technologies. VPUs are used in a variety of applications, including computer vision, image recognition, and object detection. VPUs are distinct from video processing units (which are specialized for video encoding and decoding) in their suitability for running machine vision algorithms such as CNN (convolutional neural networks), SIFT (Scale-invariant feature transform), and similar. They may include direct interfaces to take data from cameras (bypassing any off-chip buffers), and have a greater emphasis on on-chip dataflow between many parallel execution units with scratchpad memory, like a manycore DSP. One example of a VPU is the Intel® Movidius™ Vision Processing Unit (VPU), which enables demanding computer vision and AI workloads with efficiency. By coupling highly parallel programmable compute with workload-specific AI hardware acceleration in a unique architecture that minimizes data movement, Movidius VPUs achieve a balance of power efficiency and compute performance.
Neuromorphic Chip
A neuromorphic chip is an electronic system that imitates the function of the human brain or parts of it. Such chips contain artificial neurons and synapses that mimic the activity spikes and the learning process of the brain. They are used for applications that require smarter and more energy-efficient computing, such as image and speech recognition, robotics, medical devices, and data processing. Neuromorphic chips attempt to model in silicon the massively parallel way the brain processes information, as billions of neurons and trillions of synapses respond to sensory inputs such as visual and auditory stimuli. Those neurons also change how they connect with each other in response to changing images, sounds, and the like.
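The "activity spikes" mentioned above are often modeled with a leaky integrate-and-fire neuron. The sketch below is one such textbook-style model, offered as an assumption about what a spiking neuron might look like in software; it does not describe any particular chip, and the function name and parameter values are hypothetical. Learning (changing connections) is not modeled here.

```python
def simulate_lif(input_current, threshold=1.0, leak=0.9, reset=0.0):
    """Leaky integrate-and-fire neuron: returns a list of 0/1 spikes per time step."""
    potential = reset
    spikes = []
    for current in input_current:
        # The membrane potential decays ("leaks") and integrates the new input.
        potential = leak * potential + current
        if potential >= threshold:
            spikes.append(1)      # fire a spike ...
            potential = reset     # ... and reset the potential
        else:
            spikes.append(0)
    return spikes


steady_input = [0.5] * 6
print(simulate_lif(steady_input))   # the neuron fires periodically
print(simulate_lif([0.0] * 6))      # no input, no spikes
```

Because the neuron only produces output when its threshold is crossed, most of the time nothing happens, which is one intuition for why spiking hardware can be energy efficient.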
QPU - Quantum Processing Unit
- List of Quantum Processing Units (QPU) | Wikipedia
- Two-qubit gate: the speediest quantum operation yet | Phys.org
- A Preview of Bristlecone, Google’s New Quantum Processor | Google ... scaled to a square array of 72 Qubits
- A Huge Step Forward in Quantum Computing Was Just Announced: The First-Ever Quantum Circuit | Felicity Nelson - ScienceAlert
A Quantum Processing Unit (QPU) is the central component of a quantum computer or quantum simulator. It is a physical or simulated processor that contains a number of interconnected qubits that can be manipulated to compute quantum algorithms. A QPU uses the behavior of particles, such as electrons or photons, to perform specific types of calculations much faster than the processors in today’s computers. QPUs rely on behaviors like superposition, the ability of a particle to be in many states at once, described by quantum mechanics. By contrast, CPUs, GPUs, and DPUs (data processing units) all apply principles of classical physics to electrical currents. That’s why today’s systems are called classical computers.
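Superposition can be illustrated with a tiny state-vector simulation of a single qubit, in the spirit of the "simulated processor" mentioned above. This is a minimal sketch under stated assumptions: the helper names `hadamard`, `probabilities`, and `measure` are hypothetical, and a real QPU or simulator handles many entangled qubits, noise, and far more gates.

```python
import math
import random


def hadamard(state):
    """Apply the Hadamard gate to a one-qubit state (amp0, amp1)."""
    a, b = state
    s = 1 / math.sqrt(2)
    return (s * (a + b), s * (a - b))


def probabilities(state):
    """Measurement probabilities are the squared magnitudes of the amplitudes."""
    return tuple(abs(amp) ** 2 for amp in state)


def measure(state, rng=random):
    """Sample a classical outcome (0 or 1) from the state."""
    p0, _ = probabilities(state)
    return 0 if rng.random() < p0 else 1


zero = (1 + 0j, 0 + 0j)      # the qubit starts in the definite state |0>
plus = hadamard(zero)        # an equal superposition of |0> and |1>

print(probabilities(zero))   # all of the probability is on outcome 0
print(probabilities(plus))   # roughly half on each outcome
print([measure(plus) for _ in range(10)])  # random mix of 0s and 1s
```

Applying `hadamard` twice returns the qubit to its starting state, a small reminder that amplitudes, unlike ordinary probabilities, can cancel.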
Cerebras Wafer-Scale Engine (WSE)
The Cerebras Wafer-Scale Engine (WSE) is described by Cerebras as the largest chip ever built and as the heart of its deep learning system. According to the company, the WSE is 56x larger than any other chip and delivers more compute, more memory, and more communication bandwidth, enabling AI research at previously impossible speeds and scale.
Summit, or OLCF-4, is a supercomputer developed by IBM for use at Oak Ridge National Laboratory. As of November 2018 it was the fastest supercomputer in the world, capable of 200 petaflops. Its LINPACK benchmark result is 148.6 petaflops. As of November 2018 it was also the 3rd most energy-efficient supercomputer in the world, with a measured power efficiency of 14.668 GFlops/watt. Summit was the first supercomputer to reach exaop (exa operations per second) speed, achieving 1.88 exaops during a genomic analysis, and is expected to reach 3.3 exaops using mixed-precision calculations.
Design: Each of Summit's 4,608 nodes (9,216 IBM POWER9 CPUs and 27,648 Nvidia Tesla GPUs in total) has over 600 GB of coherent memory (6×16 = 96 GB HBM2 plus 2×8×32 = 512 GB DDR4 SDRAM), which is addressable by all CPUs and GPUs, plus 800 GB of non-volatile RAM that can be used as a burst buffer or as extended memory. The POWER9 CPUs and Volta GPUs are connected using Nvidia's high-speed NVLink, which allows for a heterogeneous computing model. To provide a high rate of data throughput, the nodes are connected in a non-blocking fat-tree topology using a dual-rail Mellanox EDR InfiniBand interconnect for both storage and inter-process communications traffic. The interconnect delivers 200 Gb/s of bandwidth between nodes as well as in-network computing acceleration for communications frameworks such as MPI and SHMEM/PGAS. Summit (supercomputer) | Wikipedia
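The node figures quoted above can be cross-checked with a few lines of arithmetic. The numbers come from the text; the variable names are hypothetical, and reading 6×16 as six GPUs with 16 GB of HBM2 each is an interpretation that happens to agree with the per-node GPU count, not something the text states outright.

```python
NODES = 4608
TOTAL_CPUS = 9216
TOTAL_GPUS = 27648

# Per-node counts implied by the system-wide totals.
cpus_per_node = TOTAL_CPUS // NODES
gpus_per_node = TOTAL_GPUS // NODES

# Coherent memory per node, using the factors given in the text.
hbm2_gb = 6 * 16        # assumed reading: 6 GPUs x 16 GB HBM2 each
ddr4_gb = 2 * 8 * 32    # factors as given in the text
coherent_gb = hbm2_gb + ddr4_gb

print(f"{cpus_per_node} CPUs and {gpus_per_node} GPUs per node")
print(f"{hbm2_gb} GB HBM2 + {ddr4_gb} GB DDR4 = {coherent_gb} GB coherent memory")
```

The total of 608 GB is consistent with the "over 600 GB" stated for each node.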
Photonic Integrated Circuit (PIC)
- Microparticles create photonic nanojets for parallel manipulation of cells | Sally Cole Johnson - Laser Focus World ... Photonic nanojet-mediated optical backaction force enables parallel manipulation, which may help researchers design photonic devices with biological materials.
- Lightmatter Aims to Bridge Chiplets With Photonics | Francisco Pires - Tom's Hardware
Lightmatter is creating...
- Envise: described by Lightmatter as the world's first AI accelerator running on photonic cores (computing via light). According to the company, the platform improves latency (128^2 MVPs in a single 2.5GHz clock cycle), reduces total cost of ownership, and reduces power consumption compared to traditional GPUs. The Envise 4S features 16 Envise chips in a 4-U server configuration with only 3kW power consumption. The 4S is a building block for a rack-scale Envise inference system that, per Lightmatter, can run the largest neural networks developed to date with 3 times higher IPS than the Nvidia DGX-A100 and 8 times the IPS/W on BERT-Base SQuAD. Listed features: massive on-chip activation and weight storage, enabling state-of-the-art neural network execution without leaving the processor; a standards-based host and interconnect interface; RISC cores per Envise processor with generic off-load capabilities; an ultra-high-performance out-of-order super-scalar processing architecture; deployment-grade reliability, availability, and serviceability features; and a 400Gbps Lightmatter interconnect fabric per Envise chip, enabling large model scale-out.
- IDIOM: the company's Software Development Kit, which works with common machine learning frameworks such as PyTorch and TensorFlow.
- Passage: Further, Lightmatter has also created the world's first switchable optical interconnect platform, Passage, that unlocks optical speeds, system integration, dynamic workloads and reduced power consumption. Features: Reduce the carbon footprint and operating cost of your datacenter while powering the most advanced neural networks (and the next generation) with a fundamentally new, powerful and efficient computing platform: photonics. Photonics enables multiple operations within the same area. T