Model & inference economics | 12 min read

The True Unit Cost of a Token: From Transistor to API Call

By WaferZero. Published June 16, 2026

TL;DR

→ Generating a token costs ~2N FLOPs for an N-parameter model, but the binding constraint is memory bandwidth, not compute.
→ At batch size one, decode is catastrophically inefficient (~29 J/token for a 70B model); batching amortises weight movement almost linearly.
→ Well-utilised, a frontier-class token costs roughly $0.35–0.40 per million at the metal; energy is a rounding error next to hardware amortisation.
→ Prefill is compute-bound and cheap per token; decode is memory-bound and expensive, which is why input and output are priced differently.
→ The levers that matter: route to the smallest adequate model, maximise utilisation, reuse KV-cache, shorten outputs, and quantise.

Every token an AI model produces has a physical cost that exists long before it is priced as an API call. It is paid in transistors switching, joules dissipated as heat, bytes hauled across a memory bus, and square millimetres of silicon amortised over a few short years. This piece walks that cost up the entire stack, from the transistor to the invoice, so you can tell what a token actually costs, and where the spend is wasteful at the metal.

A token is a physical event, not a line item

When an API returns a token, a specific, measurable amount of work has happened on a chip. For a dense transformer with N parameters, generating one token in the forward pass costs roughly 2N floating-point operations, one multiply and one add per weight. A 70-billion-parameter model therefore spends on the order of 140 GFLOP per token, before attention, before overhead.

That arithmetic is the floor. The real cost is dominated not by the multiplies themselves but by the energy and time spent moving the weights to the arithmetic units. This is the single most misunderstood fact about inference economics, and it is where a silicon-grounded view changes the answer.

The silicon layer: FLOPs, joules, and die area

Modern AI accelerators are matrix-multiply engines wrapped in memory. A current-generation datacenter GPU looks roughly like this:

Spec	NVIDIA H100 SXM	Why it matters
Dense BF16 throughput	~990 TFLOP/s	Ceiling on compute-bound work
HBM3 capacity	80 GB	How large a model fits per chip
HBM3 bandwidth	~3.35 TB/s	The real bottleneck for decode
Board power (TDP)	~700 W	Joules per second you pay for
Process node	TSMC 4N	Die area and energy per operation

The energy per operation is set by the process node and the data type. On a leading-edge node a multiply-accumulate in BF16 costs on the order of a picojoule at the arithmetic unit, but fetching the operands from HBM costs one to two orders of magnitude more energy than the math itself. The headline TFLOP number is almost never the constraint. The bus is.

Memory bandwidth is the real bottleneck

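Token generation (decode) is memory-bound, not compute-bound. To produce a single token at batch size one, the accelerator must read essentially every weight in the model from HBM once. For a 70B model in FP16 that is ~140 GB of traffic per token. (That exceeds a single H100's 80 GB of HBM, so in practice the weights are sharded across several GPUs or quantised; treat the single-chip figures here as an idealised bound.) At 3.35 TB/s: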

time/token  = 140 GB / 3.35 TB/s  ≈ 41.8 ms
throughput  = 1 / 0.0418 s        ≈ 24 tokens/s   (batch = 1)
energy/token = 700 W × 0.0418 s   ≈ 29 J/token    (batch = 1)

Twenty-nine joules to emit one token is catastrophic economics, and it is exactly what a naive, unbatched deployment pays. The fix is batching. Because the weights are read once and can serve every sequence in the batch, the dominant cost amortises almost linearly with batch size:

energy/token ≈ 29 J / B        (weights amortised across B sequences)
B = 128  →  ~0.23 J/token (from weight movement) + compute
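The arithmetic above can be sketched as a tiny roofline model. This is a simplified estimate, not a benchmark: it assumes the weights are streamed exactly once per decode step, that each sequence costs 2N FLOPs, that the board draws its full TDP throughout, and it ignores KV-cache traffic, attention FLOPs, and multi-GPU sharding. The function name `decode_step` and its parameters are illustrative.

```python
def decode_step(n_params, bytes_per_param, bandwidth_bps, peak_flops,
                board_power_w, batch=1):
    """Idealised estimate for one decode step serving `batch` sequences."""
    weight_bytes = n_params * bytes_per_param
    t_memory = weight_bytes / bandwidth_bps        # stream every weight once
    t_compute = 2 * n_params * batch / peak_flops  # 2N FLOPs per sequence
    t_step = max(t_memory, t_compute)              # roofline: slower side wins
    return {
        "ms_per_step": t_step * 1e3,
        "tokens_per_s": batch / t_step,
        "joules_per_token": board_power_w * t_step / batch,
        "memory_bound": t_memory >= t_compute,
    }


# Spec-table values for an H100 SXM (assumed to hold at the stated peaks).
H100 = dict(bandwidth_bps=3.35e12, peak_flops=990e12, board_power_w=700)

for b in (1, 128, 512):
    r = decode_step(70e9, 2, batch=b, **H100)
    print(f"B={b:4d}  {r['ms_per_step']:6.1f} ms/step  "
          f"{r['tokens_per_s']:8.0f} tok/s  "
          f"{r['joules_per_token']:7.3f} J/tok  "
          f"memory_bound={r['memory_bound']}")
```

At batch 1 this reproduces the ~41.8 ms and ~29 J figures, and at batch 128 the ~0.23 J figure. Under these assumptions the step stays memory-bound until the batch reaches a little under 300 sequences, after which compute time sets the pace and energy per token stops falling.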

From chip to throughput: utilisation (MFU)

Vendors quote peak FLOPs. Real workloads achieve a fraction of that, captured by Model FLOPs Utilisation (MFU), the ratio of useful FLOPs to the hardware ceiling. Well-tuned training reaches 40–55% MFU; latency-sensitive single-stream decode can sit in the low single digits because the matrix units starve waiting on memory. The gap between a 5%-MFU and a 45%-MFU deployment is a 9× difference in cost per token on identical hardware.
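As a rough illustration, MFU can be computed from observed throughput using the 2N-FLOPs-per-token approximation. The helper names below are hypothetical, and the calculation assumes a fixed hourly hardware price, so cost per token scales inversely with MFU.

```python
def mfu(tokens_per_s, n_params, peak_flops):
    """Model FLOPs Utilisation: useful FLOP/s (2N per token) over the peak."""
    return tokens_per_s * 2 * n_params / peak_flops


def relative_cost_per_token(mfu_low, mfu_high):
    """How many times more a token costs at mfu_low than at mfu_high,
    assuming identical hardware billed at the same hourly rate."""
    return mfu_high / mfu_low


PEAK_BF16 = 990e12  # dense BF16 peak from the spec table

# Unbatched decode of a 70B model at ~24 tokens/s
print(f"batch-1 decode MFU: {mfu(24, 70e9, PEAK_BF16):.2%}")

# The 5% versus 45% comparison from the text
print(f"cost ratio: {relative_cost_per_token(0.05, 0.45):.1f}x")
```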

Datacenter overhead: power, cooling, PUE, interconnect

The chip does not run in a vacuum. Every watt at the die is multiplied by the facility’s Power Usage Effectiveness (PUE), the ratio of total facility power to IT power. A modern, well-run AI datacenter targets a PUE around 1.2–1.3; older or air-cooled facilities run 1.5 or worse. On top of that sit the interconnect (NVLink, InfiniBand or Ethernet) and the host CPUs, storage, and networking that keep the accelerators fed.
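A minimal sketch of how PUE scales the energy drawn per accelerator-hour. It counts only the 700 W GPU board as IT power (host CPUs, storage, and networking are left out), and the $0.08/kWh electricity price is an assumed illustrative rate.

```python
def facility_kwh_per_hour(it_power_w, pue):
    """Facility-level energy for one hour of a given IT load."""
    return it_power_w * pue / 1000.0


USD_PER_KWH = 0.08  # assumed electricity price

for pue in (1.2, 1.3, 1.5):
    kwh = facility_kwh_per_hour(700, pue)
    print(f"PUE {pue}: {kwh:.2f} kWh per GPU-hour, "
          f"${kwh * USD_PER_KWH:.4f} per GPU-hour")
```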

Figure: Illustrative breakdown of where the marginal cost of a generated token accrues. Memory movement and idle time dominate; raw compute is a minority of the bill.

The arithmetic: cost per million tokens

Now we can roll it up. Take one H100 at a representative cloud rate, kept well-utilised on batched decode:

Input	Value	Note
GPU rental	$2.50 / hr	Representative on-demand cloud rate
Aggregate decode throughput	~2,000 tok/s	Batched, mid-utilisation
Tokens / hour	7.2M	2,000 × 3,600
Hardware cost / 1M tokens	~$0.35	$2.50 ÷ 7.2
Energy / hour (PUE 1.5)	~1.05 kWh	700 W × 1.5
Energy cost / 1M tokens	~$0.012	at $0.08 / kWh

At the metal, well-run, this token costs on the order of $0.35–0.40 per million, and energy is a rounding error next to hardware amortisation. Published API prices for frontier models run from roughly $1 to $15 per million output tokens. The gap is real margin, model quality, R&D recovery, and the cost of not batching perfectly under latency constraints. The point is not that providers overcharge; it is that the floor is knowable, and the distance between the floor and your bill is a number you can manage.
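The roll-up is simple enough to script, which makes it easy to see how sensitive the floor is to throughput. This sketch uses the table's inputs; the function name is illustrative. It reports hardware and energy separately rather than summing them, because a cloud rental rate may already bundle electricity.

```python
def cost_per_million_tokens(gpu_usd_per_hr, tokens_per_s,
                            board_power_w, pue, usd_per_kwh):
    """Return (hardware_usd, energy_usd) per one million generated tokens."""
    tokens_per_hr = tokens_per_s * 3600
    hardware = gpu_usd_per_hr / tokens_per_hr * 1e6
    kwh_per_hr = board_power_w * pue / 1000.0
    energy = kwh_per_hr * usd_per_kwh / tokens_per_hr * 1e6
    return hardware, energy


hw, en = cost_per_million_tokens(2.50, 2000, 700, 1.5, 0.08)
print(f"hardware ${hw:.3f}, energy ${en:.4f} per 1M tokens")

# Sensitivity: the same GPU at different aggregate decode throughputs
for tps in (250, 500, 1000, 2000, 4000):
    hw, en = cost_per_million_tokens(2.50, tps, 700, 1.5, 0.08)
    print(f"{tps:5d} tok/s -> hardware ${hw:.2f}, energy ${en:.3f}")
```

Halving throughput doubles both terms, which is why utilisation, not the electricity price, is the first thing to check when a bill looks high.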

Why prefill and decode change the math

Inference has two phases with opposite cost structures. Prefill (processing the prompt) is compute-bound and embarrassingly parallel: all prompt tokens go through the network at once, so the matrix units run hot and MFU is high. Decode (generating the answer) is sequential and memory-bound, one token at a time. A long prompt with a short answer is cheap per token; a short prompt with a long answer is expensive per token. Pricing that charges input and output tokens differently is tracking exactly this asymmetry. The KV-cache, the stored attention state that grows with context length, is what links the two and what makes very long contexts expensive in memory, not just compute.
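To make the KV-cache memory cost concrete, here is a minimal sizing sketch. The model shape (80 layers, 8 KV heads with grouped-query attention, head dimension 128, FP16 cache) is an assumption chosen to resemble a 70B-class model; it does not come from the article, and real deployments vary.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Keys and values (the factor 2) for every layer, KV head, and position."""
    return (2 * n_layers * n_kv_heads * head_dim
            * bytes_per_elem * seq_len * batch)


# Assumed 70B-class shape with grouped-query attention (hypothetical config)
cfg = dict(n_layers=80, n_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, **cfg)
print(f"{per_token / 1024:.0f} KiB of cache per token")

for ctx in (4096, 32768, 131072):
    gib = kv_cache_bytes(seq_len=ctx, **cfg) / 2**30
    print(f"context {ctx:6d}: {gib:.2f} GiB per sequence")
```

Under these assumptions a single 32k-token sequence holds about 10 GiB of cache, so long contexts compete directly with batch size for the same HBM.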

What this implies for spend

The silicon-up view yields concrete, checkable guidance:

Lever	Where the saving comes from
Route to the smallest model that clears the bar	Cost scales with parameters read per token; a correct smaller model is strictly cheaper.
Maximise batch / utilisation	Weight movement amortises across the batch; idle silicon is pure waste.
Cache and reuse prefill (KV reuse)	Avoids recomputing attention state for shared prefixes.
Shorten outputs, not just inputs	Decode is the memory-bound, per-token-expensive phase.
Quantise where quality allows	Fewer bytes per weight means less HBM traffic, the binding constraint.
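The first and last levers both act on bytes read per token, which a short sketch can make visible. It reuses the idealised batch-1 bound (weights streamed once at 3.35 TB/s) and ignores any accuracy loss or dequantisation overhead; the 8B model is a hypothetical stand-in for "a smaller model that clears the bar".

```python
def batch1_decode_ms(n_params, bits_per_weight, bandwidth_bps=3.35e12):
    """Idealised time to stream all weights once, in milliseconds."""
    weight_bytes = n_params * bits_per_weight / 8
    return weight_bytes / bandwidth_bps * 1e3


options = [
    ("70B, 16-bit", 70e9, 16),
    ("70B, 8-bit", 70e9, 8),
    ("70B, 4-bit", 70e9, 4),
    ("8B, 16-bit (hypothetical smaller model)", 8e9, 16),
]

for name, n_params, bits in options:
    print(f"{name:42s} {batch1_decode_ms(n_params, bits):6.1f} ms/token")
```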

The bottom line

A token is not an abstract unit of API consumption. It is a measurable quantity of joules, bytes, and silicon-seconds. Once you can compute the floor, and route, batch, cache, and quantise against it, AI spend stops being a mystery invoice and becomes an engineering quantity you control. That is the whole reason to reason about AI from the silicon up.
