The True Unit Cost of a Token: From Transistor to API Call
- Generating a token costs ~2N FLOPs for an N-parameter model, but the binding constraint is memory bandwidth, not compute.
- At batch size one, decode is catastrophically inefficient (~29 J/token for a 70B model); batching amortises weight movement almost linearly.
- Well-utilised, a frontier-class token costs roughly $0.35–0.40 per million at the metal; energy is a rounding error next to hardware amortisation.
- Prefill is compute-bound and cheap per token; decode is memory-bound and expensive, which is why input and output are priced differently.
- The levers that matter: route to the smallest adequate model, maximise utilisation, reuse KV-cache, shorten outputs, and quantise.
Every token an AI model produces has a physical cost that exists long before it is priced as an API call. It is paid in transistors switching, joules dissipated as heat, bytes hauled across a memory bus, and square millimetres of silicon amortised over a few short years. This piece walks that cost up the entire stack, from the transistor to the invoice, so you can tell what a token actually costs, and where the spend is wasteful at the metal.
A token is a physical event, not a line item
When an API returns a token, a specific, measurable amount of work has happened on a chip. For a dense transformer with N parameters, generating one token in the forward pass costs roughly 2N floating-point operations, one multiply and one add per weight. A 70-billion-parameter model therefore spends on the order of 140 GFLOP per token, before attention, before overhead.
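The 2N rule is easy to tabulate. The following is a minimal sketch, assuming a dense model and ignoring attention and other overhead as the paragraph above does. The function name and the 7B example size are illustrative choices, not taken from any library or from the text.

```python
def forward_flops_per_token(n_params: float) -> float:
    """Approximate forward-pass FLOPs to generate one token with a dense model.

    Rule of thumb: one multiply and one add per weight, so 2 * N.
    Attention and framework overhead are deliberately left out.
    """
    return 2.0 * n_params


# Only the 70B case comes from the text; 7B is an illustrative comparison.
for name, n_params in [("7B", 7e9), ("70B", 70e9)]:
    gflop = forward_flops_per_token(n_params) / 1e9
    print(f"{name}: ~{gflop:.0f} GFLOP per token")
```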
That arithmetic is the floor. The real cost is dominated not by the multiplies themselves but by the energy and time spent moving the weights to the arithmetic units. This is the single most misunderstood fact about inference economics, and it is where a silicon-grounded view changes the answer.
The silicon layer: FLOPs, joules, and die area
Modern AI accelerators are matrix-multiply engines wrapped in memory. A current-generation datacenter GPU looks roughly like this:
| Spec | NVIDIA H100 SXM | Why it matters |
|---|---|---|
| Dense BF16 throughput | ~990 TFLOP/s | Ceiling on compute-bound work |
| HBM3 capacity | 80 GB | How large a model fits per chip |
| HBM3 bandwidth | ~3.35 TB/s | The real bottleneck for decode |
| Board power (TDP) | ~700 W | Joules per second you pay for |
| Process node | TSMC 4N | Die area and energy per operation |
The energy per operation is set by the process node and the data type. On a leading-edge node a multiply-accumulate in BF16 costs on the order of a picojoule at the arithmetic unit, but fetching the operands from HBM costs one to two orders of magnitude more energy than the math itself. The headline TFLOP number is almost never the constraint. The bus is.
Memory bandwidth is the real bottleneck
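Token generation (decode) is memory-bound, not compute-bound. To produce a single token at batch size one, the accelerator must read essentially every weight in the model from HBM once. For a 70B model in FP16 that is ~140 GB of traffic per token. (140 GB exceeds one H100's 80 GB, so in practice the weights are sharded across at least two cards; treating it as a single chip keeps the arithmetic simple.) At 3.35 TB/s: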
time/token = 140 GB / 3.35 TB/s ≈ 41.8 ms
throughput = 1 / 0.0418 s ≈ 24 tokens/s (batch = 1)
energy/token = 700 W × 0.0418 s ≈ 29 J/token (batch = 1)
Twenty-nine joules to emit one token is catastrophic economics, and it is exactly what a naive, unbatched deployment pays. The fix is batching. Because the weights are read once and can serve every sequence in the batch, the dominant cost amortises almost linearly with batch size:
energy/token ≈ 29 J / B (weights amortised across B sequences)
B = 128 → ~0.23 J/token (from weight movement) + compute
From chip to throughput: utilisation (MFU)
Vendors quote peak FLOPs. Real workloads achieve a fraction of that, captured by Model FLOPs Utilisation (MFU), the ratio of useful FLOPs to the hardware ceiling. Well-tuned training reaches 40–55% MFU; latency-sensitive single-stream decode can sit in the low single digits because the matrix units starve waiting on memory. The gap between a 5%-MFU and a 45%-MFU deployment is a 9× difference in cost per token on identical hardware.
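The batch-size arithmetic and the MFU definition can be tied together with a simple roofline estimate. This is a sketch under loud assumptions, not a measurement: weights are read from HBM once per decode step, KV-cache traffic and attention FLOPs are ignored, the chip draws full board power throughout, and the 70B FP16 model is treated as if it sat behind a single H100's memory bus. The `Accelerator` class and `decode_step` function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    peak_flops: float     # dense BF16 FLOP/s
    hbm_bandwidth: float  # bytes/s
    board_power: float    # watts


# Figures from the H100 SXM spec table earlier in the piece.
H100 = Accelerator(peak_flops=990e12, hbm_bandwidth=3.35e12, board_power=700.0)


def decode_step(n_params: float, bytes_per_param: float, batch: int,
                chip: Accelerator = H100) -> dict:
    """Idealised roofline for one decode step that emits `batch` tokens.

    Step time is whichever is slower: streaming all weights from HBM once,
    or doing 2N FLOPs for each of the `batch` sequences at peak throughput.
    """
    memory_time = n_params * bytes_per_param / chip.hbm_bandwidth
    compute_time = 2.0 * n_params * batch / chip.peak_flops
    step_time = max(memory_time, compute_time)
    return {
        "tokens_per_s": batch / step_time,
        "joules_per_token": chip.board_power * step_time / batch,
        "mfu": compute_time / step_time,  # useful FLOPs / hardware ceiling
        "memory_bound": memory_time >= compute_time,
    }


for b in (1, 32, 128, 512):
    r = decode_step(n_params=70e9, bytes_per_param=2.0, batch=b)
    print(f"B={b:>3}  {r['tokens_per_s']:8.0f} tok/s  "
          f"{r['joules_per_token']:6.2f} J/token  MFU={r['mfu']:.1%}")
```

In this idealised model, utilisation at batch size one is tiny and rises roughly linearly with batch size until the compute roof, which for these figures and 2-byte weights sits near B ≈ 295. Real deployments fall short of this because KV-cache reads grow with batch size and context length, and latency targets cap how large a batch can get.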
Datacenter overhead: power, cooling, PUE, interconnect
The chip does not run in a vacuum. Every watt at the die is multiplied by the facility’s Power Usage Effectiveness (PUE), the ratio of total facility power to IT power. A modern, well-run AI datacenter targets a PUE around 1.2–1.3; older or air-cooled facilities run 1.5 or worse. On top of that sit the interconnect (NVLink, InfiniBand or Ethernet) and the host CPUs, storage, and networking that keep the accelerators fed.
The arithmetic: cost per million tokens
Now we can roll it up. Take one H100 at a representative cloud rate, kept well-utilised on batched decode:
| Input | Value | Note |
|---|---|---|
| GPU rental | $2.50 / hr | Representative on-demand cloud rate |
| Aggregate decode throughput | ~2,000 tok/s | Batched, mid-utilisation |
| Tokens / hour | 7.2M | 2,000 × 3,600 |
| Hardware cost / 1M tokens | ~$0.35 | $2.50 ÷ 7.2 |
| Energy / hour (PUE 1.5) | ~1.05 kWh | 700 W × 1.5 × 1 h |
| Energy cost / 1M tokens | ~$0.012 | at $0.08 / kWh |
At the metal, well-run, this token costs on the order of $0.35–0.40 per million, and energy is a rounding error next to hardware amortisation. Published API prices for frontier models run from roughly $1 to $15 per million output tokens. The gap is made up of margin, model quality, R&D recovery, and the cost of not batching perfectly under latency constraints. The point is not that providers overcharge; it is that the floor is knowable, and the distance between the floor and your bill is a number you can manage.
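The roll-up in the table can be reproduced in a few lines, which makes it easy to substitute your own rental rate, throughput, or electricity price. This is a minimal sketch: the function and parameter names are illustrative, and it treats energy as a separate line item exactly as the table does, without addressing whether a given rental rate already includes power.

```python
def cost_per_million_tokens(gpu_rate_per_hour: float,
                            tokens_per_second: float,
                            board_power_watts: float,
                            pue: float,
                            price_per_kwh: float) -> dict:
    """Hardware and energy cost (USD) per one million generated tokens."""
    millions_per_hour = tokens_per_second * 3600.0 / 1e6
    hardware = gpu_rate_per_hour / millions_per_hour
    # Facility energy drawn in one hour: IT power scaled by PUE, in kWh.
    facility_kwh_per_hour = board_power_watts * pue / 1000.0
    energy = facility_kwh_per_hour * price_per_kwh / millions_per_hour
    return {
        "hardware_usd": hardware,
        "energy_usd": energy,
        "total_usd": hardware + energy,
    }


# Inputs taken from the table: $2.50/hr, 2,000 tok/s, 700 W, PUE 1.5, $0.08/kWh.
result = cost_per_million_tokens(2.50, 2000.0, 700.0, 1.5, 0.08)
for key, value in result.items():
    print(f"{key}: ${value:.3f} per 1M tokens")
```

Halving throughput doubles both lines, which is why utilisation dominates every other input.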
Why prefill and decode change the math
Inference has two phases with opposite cost structures. Prefill (processing the prompt) is compute-bound and embarrassingly parallel: all prompt tokens go through the network at once, so the matrix units run hot and MFU is high. Decode (generating the answer) is sequential and memory-bound, one token at a time. A long prompt with a short answer is cheap per token; a short prompt with a long answer is expensive per token. Pricing that charges input and output tokens differently is tracking exactly this asymmetry. The KV-cache, the stored attention state that grows with context length, is what links the two and what makes very long contexts expensive in memory, not just compute.
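A rough way to see the asymmetry is to estimate time per token in each phase and the memory the KV-cache occupies. The sketch below assumes the H100 figures quoted earlier, a 45% MFU for prefill, and a decode step that stays memory-bound. The layer count, KV-head count, and head dimension are illustrative values for a 70B-class model with grouped-query attention, not figures from the text, and all function names are hypothetical.

```python
PEAK_FLOPS = 990e12      # H100 dense BF16, FLOP/s
HBM_BANDWIDTH = 3.35e12  # H100 HBM3, bytes/s


def prefill_seconds_per_token(n_params: float, mfu: float) -> float:
    """Compute-bound phase: 2N FLOPs per prompt token at a given utilisation."""
    return 2.0 * n_params / (PEAK_FLOPS * mfu)


def decode_seconds_per_token(n_params: float, bytes_per_param: float,
                             batch: int) -> float:
    """Memory-bound phase: one full weight read per step, shared by the batch.

    Only valid while the step is still memory-bound; KV-cache reads ignored.
    """
    return n_params * bytes_per_param / HBM_BANDWIDTH / batch


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: float = 2.0) -> float:
    """Keys plus values (the factor 2), per layer, per KV head, per token."""
    return 2.0 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


prefill = prefill_seconds_per_token(70e9, mfu=0.45)
decode = decode_seconds_per_token(70e9, bytes_per_param=2.0, batch=1)
print(f"prefill: {prefill * 1e3:.2f} ms/token, decode (B=1): {decode * 1e3:.1f} ms/token")

# Assumed shape: 80 layers, 8 KV heads, head dimension 128, 32,768-token context.
cache = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, context_tokens=32768)
print(f"KV-cache for one sequence: {cache / 2**30:.1f} GiB")
```

Batching narrows the per-token gap between the phases, but every extra sequence in the batch brings its own KV-cache, which is why long contexts compete with batch size for HBM.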
What this implies for spend
The silicon-up view yields concrete, checkable guidance:
| Lever | Where the saving comes from |
|---|---|
| Route to the smallest model that clears the bar | Cost scales with parameters read per token; a correct smaller model is strictly cheaper. |
| Maximise batch / utilisation | Weight movement amortises across the batch; idle silicon is pure waste. |
| Cache and reuse prefill (KV reuse) | Avoids recomputing attention state for shared prefixes. |
| Shorten outputs, not just inputs | Decode is the memory-bound, per-token-expensive phase. |
| Quantise where quality allows | Fewer bytes per weight means less HBM traffic, the binding constraint. |
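Two of these levers, model size and quantisation, act on the same quantity: bytes read from HBM per token. The sketch below compares them for unbatched decode, assuming the 3.35 TB/s bandwidth figure from earlier and ignoring dequantisation overhead, scale metadata, and KV-cache traffic. The 8B model and the function name are hypothetical illustrations.

```python
HBM_BANDWIDTH = 3.35e12  # bytes/s, H100 SXM figure quoted earlier


def batch1_decode_ms(n_params: float, bytes_per_param: float) -> float:
    """Milliseconds per token when every weight is read once per token."""
    return n_params * bytes_per_param / HBM_BANDWIDTH * 1e3


scenarios = [
    ("70B, 16-bit weights", 70e9, 2.0),
    ("70B, 8-bit weights", 70e9, 1.0),
    ("70B, 4-bit weights", 70e9, 0.5),
    ("8B (hypothetical), 16-bit weights", 8e9, 2.0),
]

baseline = batch1_decode_ms(70e9, 2.0)
for label, n_params, bytes_per_param in scenarios:
    ms = batch1_decode_ms(n_params, bytes_per_param)
    print(f"{label:<36} {ms:6.2f} ms/token  ({baseline / ms:.1f}x vs baseline)")
```

Whether the smaller or quantised model still clears the quality bar is a separate question that this arithmetic cannot answer.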
The bottom line
A token is not an abstract unit of API consumption. It is a measurable quantity of joules, bytes, and silicon-seconds. Once you can compute the floor, and route, batch, cache, and quantise against it, AI spend stops being a mystery invoice and becomes an engineering quantity you control. That is the whole reason to reason about AI from the silicon up.
- [1] NVIDIA H100 Tensor Core GPU datasheet (throughput, HBM3 bandwidth, TDP)
- [2] Kaplan et al., "Scaling Laws for Neural Language Models" (~2N FLOPs/token per parameter)
- [3] Korthikanti et al., "Reducing Activation Recomputation in Large Transformer Models" (Model FLOPs Utilisation)
- [4] Uptime Institute, Power Usage Effectiveness (PUE) benchmarks