How an AI Accelerator Actually Works: Matmul, Memory Hierarchy, and Why Bandwidth Is the Bottleneck
- An AI accelerator is mostly a giant matrix-multiply engine wrapped in a memory hierarchy; the math units are cheap, feeding them is the hard part.
- Tensor cores and systolic arrays get their speed from data reuse: load a value once, use it in many multiplications.
- The roofline model sets a threshold (around 295 FLOPs per byte on an H100); above it you are compute-bound, below it you are limited by memory bandwidth.
- Big matrix multiplies clear that bar; elementwise ops and naive attention do not, which is why they are memory-bound and why operator fusion like FlashAttention wins.
- At cluster scale the same story repeats one level up: the interconnect between chips becomes the new bottleneck.
Strip away the marketing and a modern AI accelerator is a simple idea executed at enormous scale: a giant matrix-multiply engine wrapped in layers of memory. The arithmetic is the easy, cheap part. The hard part, the part that decides how fast the chip really runs, is feeding those math units with data fast enough. This is a tour of how the hardware works, and why bandwidth, not raw compute, usually sets the limit.
The job is one giant matrix multiply
Almost everything a neural network does (running a layer, an attention block, a feed-forward network) reduces to multiplying matrices together. Training and inference are, at the hardware level, billions of these multiply-and-add operations. So the people who design AI chips build them around one question: how do we do matrix multiplication as fast and as cheaply (in energy) as possible?
A matrix multiply of two large matrices takes about 2·M·N·K floating-point operations (for an M×K matrix times a K×N matrix). That number grows fast with size, which turns out to be a gift: it means there is a lot of math to do per piece of data, and that is exactly the regime where hardware can be kept busy.
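To make the growth concrete, here is a minimal sketch that counts the operations and the data elements for a few square sizes. The function names and the chosen sizes are illustrative assumptions, not anything prescribed by a library.

```python
def matmul_flops(m: int, k: int, n: int) -> int:
    """FLOPs for an (m x k) times (k x n) product: one multiply plus one add per term."""
    return 2 * m * n * k


def matmul_elements(m: int, k: int, n: int) -> int:
    """Elements in the two inputs plus the output, each counted once."""
    return m * k + k * n + m * n


for size in (256, 1024, 4096):  # hypothetical square sizes, for illustration only
    flops = matmul_flops(size, size, size)
    elements = matmul_elements(size, size, size)
    print(f"{size:>5}: {flops:.2e} FLOPs over {elements:.2e} elements "
          f"= {flops / elements:.0f} FLOPs per element")
```

For square matrices the ratio works out to 2n/3, so the work per element grows linearly with the matrix size: the math grows cubically while the data grows only quadratically.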
How the hardware actually multiplies matrices
A general-purpose CPU does a few multiplications at a time. An AI accelerator does thousands at once using dedicated blocks: NVIDIA calls them tensor cores, while Google’s TPU uses a systolic array called the MXU. The two differ in detail but share one trick that makes them efficient: data reuse.
Picture a grid of small multiply-and-add units. You load a tile of weights into the grid once, then stream the input data through it. Each value that gets loaded participates in many multiplications before it leaves. That is the whole game. Reading a number from memory costs far more time and energy than multiplying it, so the more times you can reuse each value after loading it, the better the chip performs. A systolic array is just a very regular, very dense way of arranging that reuse in silicon.
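The reuse idea can be sketched in plain Python. This is a toy model under stated assumptions, not how any real tensor core or MXU is programmed: the "grid" is just a copied list, and the function name `weight_stationary_matmul` is hypothetical. It loads the weights once, streams input rows past them, and reports how many multiplications each loaded weight took part in.

```python
def weight_stationary_matmul(inputs, weights):
    """Multiply inputs (M x K) by weights (K x N), loading each weight exactly once.

    Returns the product and the average number of multiplies per loaded weight.
    """
    k, n = len(weights), len(weights[0])
    # "Load" the weight tile into the grid once; it stays put from here on.
    grid = [row[:] for row in weights]
    weight_loads = k * n
    multiplies = 0
    out = []
    # Stream the input rows through the stationary grid.
    for row in inputs:
        out_row = [0] * n
        for i in range(k):
            for j in range(n):
                out_row[j] += row[i] * grid[i][j]
                multiplies += 1
        out.append(out_row)
    return out, multiplies / weight_loads


inputs = [[1, 2], [3, 4], [5, 6]]   # three input rows streamed through
weights = [[1, 0], [0, 1]]          # a 2 x 2 identity, so the output equals the input
product, reuse = weight_stationary_matmul(inputs, weights)
print(product, reuse)
```

In this model the reuse factor equals the number of input rows streamed through, which is why keeping a weight tile in place pays off more the longer the input stream is.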
The memory hierarchy: registers, SRAM, HBM
Because moving data is the expensive part, accelerators surround the math units with a hierarchy of memories. Each level up is faster and closer to the math but much smaller; each level down is larger but slower and farther away.
The figures that get printed on a spec sheet (for example an H100 with about 80 GB of HBM at roughly 3.35 TB/s) describe the bottom tier. That HBM bandwidth is generous, but it is still one to two orders of magnitude slower than the on-chip SRAM sitting right next to the math units. Every trip out to HBM is a tax. The job of a good kernel (the low-level program that runs an operation) is to pull a tile of data into SRAM once, do as much work on it as possible, and only then write the result back.
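The load-once, work, write-back-once pattern can be sketched as a blocked matrix multiply. This is a plain-Python illustration rather than a real GPU kernel: slicing a tile stands in for copying it into SRAM, the read and write counters stand in for HBM traffic, and the name `tiled_matmul` and its `tile` parameter are assumptions made for the example.

```python
def tiled_matmul(a, b, tile=2):
    """Blocked matmul over lists of lists.

    Returns (product, elements_read, elements_written), where the two counters
    model traffic to the large, slow memory.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    reads = writes = 0
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows, cols = min(tile, m - i0), min(tile, n - j0)
            # The accumulator tile stays "on chip" for the whole inner loop.
            acc = [[0.0] * cols for _ in range(rows)]
            for k0 in range(0, k, tile):
                # Pull one tile of each input into fast memory.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                reads += sum(len(r) for r in a_tile) + sum(len(r) for r in b_tile)
                # Do all the work that these two tiles allow.
                for i, a_row in enumerate(a_tile):
                    for kk, a_val in enumerate(a_row):
                        for j, b_val in enumerate(b_tile[kk]):
                            acc[i][j] += a_val * b_val
            # Write the finished output tile back exactly once.
            for i, acc_row in enumerate(acc):
                c[i0 + i][j0:j0 + cols] = acc_row
            writes += rows * cols
    return c, reads, writes


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
product, reads, writes = tiled_matmul(a, b, tile=2)
print(product, reads, writes)
```

Calling the same function with `tile=1` gives the same product but more element reads, which is the point: a larger tile held in fast memory means fewer trips to the slow one.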
Arithmetic intensity and the roofline
There is a clean way to predict whether an operation will be limited by compute or by memory: arithmetic intensity, the number of floating-point operations performed per byte of data moved from memory. The roofline model turns this into a single threshold. Divide the chip’s peak compute by its peak memory bandwidth and you get the ridge point: the intensity above which you can keep the math units full, and below which memory bandwidth caps you.
ridge point = peak compute / peak memory bandwidth
            = 989 TFLOP/s / 3.35 TB/s
            ≈ 295 FLOPs per byte (H100-class)
So on an H100-class chip, an operation needs to do roughly 295 useful operations for every byte it reads just to stop being memory-bound. That sounds like a lot. For a big matrix multiply it is easy; for almost everything else it is not.
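The ridge-point arithmetic, and the roofline it implies, can be written as a few lines of Python. This is a sketch using the two H100-class figures quoted above; the function names are illustrative, and a real chip rarely reaches either peak exactly.

```python
PEAK_FLOPS = 989e12        # FLOP/s, the peak compute figure quoted in the text
PEAK_BANDWIDTH = 3.35e12   # bytes/s, the HBM bandwidth figure quoted in the text


def ridge_point(peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BANDWIDTH):
    """Arithmetic intensity (FLOPs per byte) at which the two limits meet."""
    return peak_flops / peak_bandwidth


def attainable_flops(intensity, peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BANDWIDTH):
    """Roofline: throughput is capped by compute or by bandwidth, whichever is lower."""
    return min(peak_flops, intensity * peak_bandwidth)


print(f"ridge point: {ridge_point():.0f} FLOPs/byte")
for intensity in (1, 10, 100, 1000):  # hypothetical intensities, for illustration
    share = attainable_flops(intensity) / PEAK_FLOPS
    print(f"{intensity:>5} FLOPs/byte -> at most {share:.1%} of peak compute")
```

Under this model an operation at 1 FLOP per byte can use well under one percent of the chip's peak compute, no matter how well the kernel is written.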
| Operation | Arithmetic intensity | Limited by |
|---|---|---|
| Large matrix multiply | Hundreds to thousands of FLOPs/byte | Compute |
| Attention (naive) | Low (single digits) | Memory |
| Elementwise (add, GELU, layer norm) | ~1 FLOP/byte | Memory |
| LLM token generation, batch 1 | ~2 FLOPs/byte | Memory |
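A rough way to reproduce the orders of magnitude in the table is to count FLOPs and bytes directly. The sketch below assumes every operand crosses the memory bus exactly once and uses 2-byte elements by default; the function names, the matrix sizes, and the figure of 4 FLOPs per element for an activation function are all illustrative assumptions, so the outputs are estimates rather than measurements.

```python
RIDGE = 295.0  # FLOPs per byte, the H100-class ridge point from the text


def matmul_intensity(m, k, n, bytes_per_element=2):
    """FLOPs per byte for an (m x k) times (k x n) product, each matrix moved once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def elementwise_intensity(flops_per_element, n_inputs=1, bytes_per_element=2):
    """FLOPs per byte for an op that reads n_inputs values and writes one per element."""
    return flops_per_element / (bytes_per_element * (n_inputs + 1))


cases = {
    "large matmul, 4096 cubed": matmul_intensity(4096, 4096, 4096),
    "batch-1 matrix-vector, 4096 x 4096": matmul_intensity(1, 4096, 4096),
    "elementwise, assumed 4 FLOPs/element": elementwise_intensity(4),
}
for name, value in cases.items():
    side = "compute-bound" if value >= RIDGE else "memory-bound"
    print(f"{name}: {value:.1f} FLOPs/byte ({side})")
```

The batch-1 case comes out near 1 FLOP per byte with 2-byte weights and near 2 with 1-byte weights (pass `bytes_per_element=1`), so the exact number depends on precision, but it stays far below the ridge either way.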
Why most kernels are memory-bound (and what fusion buys you)
Notice the pattern in that table: the one operation that clears the bar is the dense matrix multiply. Most of the other things a model does (adding biases, applying activation functions, normalising, and even attention in its textbook form) have low intensity and are therefore limited by how fast you can shuttle data in and out of HBM. For them the chip’s headline TFLOP number is irrelevant; the memory bus is the bottleneck.
This is where operator fusion earns its keep. Instead of running each small operation separately, with a full round-trip to HBM between every one, you fuse them into a single kernel that keeps the intermediate data in SRAM and writes back only the final result. The most famous example is FlashAttention: classic attention materialises a huge scores matrix in HBM and is badly memory-bound; FlashAttention tiles the computation, keeps each tile in SRAM, and never writes the giant intermediate out at all. Same math, same answer, but a large speedup, purely by respecting the memory hierarchy.
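The tiling trick rests on a running (online) softmax, which can be sketched for a single query in plain Python. This is a simplified illustration of the idea, not the FlashAttention kernel itself: it omits the usual score scaling, masking, and batching, and the function names and toy data are assumptions. The naive version builds the full row of scores; the tiled version only ever holds one tile of scores plus a running maximum, a running sum, and an output accumulator.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materialise every score, then softmax, then the weighted sum."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


def tiled_attention(q, keys, values, tile=2):
    """Same result, but processes keys and values one tile at a time."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what has been accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)  # exp(-inf) is 0.0 on the first tile
        running_sum *= scale
        acc = [x * scale for x in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [x / running_sum for x in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.1]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, tile=2))
```

The two functions agree to floating-point tolerance, which illustrates the "same math, same answer" claim: the saving comes entirely from never storing the full score row.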
When one chip is not enough: interconnect
A frontier model does not fit on a single accelerator, so the same story repeats one level up. Chips are wired together in two ways. Scale-up links a handful of GPUs inside one server with a very fast, dedicated fabric (NVIDIA’s NVLink moves roughly 0.9 TB/s per GPU). Scale-out connects many servers across a data center with a slower network (InfiniBand or Ethernet, tens to a few hundred GB/s).
When chips cooperate they run collective operations: an all-reduce to average gradients across GPUs during training, an all-to-all to route tokens between experts in a mixture-of-experts model, and so on. These collectives move large amounts of data between chips, and the interconnect bandwidth becomes the new ceiling, exactly the way HBM bandwidth was the ceiling inside a single chip. The memory wall does not disappear at scale; it just moves outward.
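A back-of-the-envelope estimate shows how the interconnect sets the ceiling for an all-reduce. The sketch below assumes a ring algorithm, in which each GPU sends roughly 2·(N−1)/N times the payload, and it ignores latency and any overlap with compute. The function name, the 14 GB payload, the 8-GPU group, and the 50 GB/s scale-out figure are hypothetical, and treating the quoted 0.9 TB/s as usable one-way bandwidth is optimistic.

```python
def ring_all_reduce_seconds(payload_bytes, n_gpus, link_bytes_per_s):
    """Bandwidth-only time estimate for a ring all-reduce across n_gpus."""
    if n_gpus < 2:
        return 0.0  # nothing to exchange with a single participant
    traffic_per_gpu = 2 * (n_gpus - 1) / n_gpus * payload_bytes
    return traffic_per_gpu / link_bytes_per_s


payload = 14e9  # bytes; hypothetical gradient buffer
for label, bandwidth in (("scale-up, 0.9 TB/s", 0.9e12), ("scale-out, 50 GB/s", 50e9)):
    seconds = ring_all_reduce_seconds(payload, n_gpus=8, link_bytes_per_s=bandwidth)
    print(f"{label}: about {seconds * 1e3:.0f} ms per all-reduce")
```

The same payload takes far longer over the slower network, so how a job is split across the two fabrics matters as much as the raw compute of each chip.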
The takeaway
An AI accelerator is a matrix-multiply engine, and its speed is governed by a hierarchy of memories, not by the math units alone. Whether any given operation runs fast comes down to one ratio, work done per byte moved, and the roofline tells you which side of the line you are on. Read a chip this way and the behaviour of the whole stack, from a single attention kernel up to a thousand-GPU training cluster, follows the same logic: keep data close to the math, and move as little of it as you can.
- [1] NVIDIA H100 Tensor Core GPU datasheet (tensor cores, HBM3 bandwidth, capacity)
- [2] Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit" (systolic array / MXU)
- [3] Williams, Waterman, Patterson, "Roofline: An Insightful Visual Performance Model"
- [4] Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness"