Training vs Inference Economics: Why the Cost Curves Diverge, and What It Means for Buyers
- Training is a one-time capital cost (about 6·N·D FLOPs); inference is a marginal cost that recurs with every request and scales with traffic.
- Plotted against usage, training is a flat line and inference is a slope, so for any well-used model, inference is where the lifetime money goes.
- The honest unit cost is training spread over all tokens served plus the marginal cost per token; at high volume the training term vanishes.
- Fine-tuning is capital that lowers the recurring bill: it beats prompting/RAG when volume is high and the task is stable, and loses when volume is low or facts change.
- Self-hosting is a fixed cost that beats per-token APIs only above a utilisation break-even (roughly 12% in the worked example); idle GPUs lose to the API.
Training and inference are both “running AI,” but financially they are nothing alike. One is a capital expense you pay once, up front, before a single user shows up. The other is a marginal cost that recurs with every request, forever. Treat them as the same line item and you will make the wrong call on what to build, what to buy, and when to fine-tune. This piece separates the two and gives you the break-even arithmetic that follows.
Training is a capital event
Training a model is a single, enormous, compute-bound job. A useful rule of thumb is that training costs about 6 FLOPs per parameter per token (a forward pass plus a roughly twice-as-expensive backward pass). So the total compute to train a model with N parameters on D tokens is about 6 · N · D. Because the work is dense matrix multiplication, it runs at high efficiency on the hardware, so this number translates fairly directly into GPU-hours.
Training compute ≈ 6 · N · D
Example: a 70B-parameter model on 15T tokens
= 6 × 70e9 × 15e12 ≈ 6.3e24 FLOPs
On H100s at ~400 TFLOP/s realised (≈40% utilisation):
≈ 1.6e10 GPU-seconds ≈ 4.4M GPU-hours
≈ $8–9M at $2/GPU-hour (illustrative)

The defining feature of that cost is that it is fixed and one-time. You pay it before the model serves anyone, and you pay the same amount whether the model later handles a thousand requests or a trillion. (How you split the budget between a bigger model and more training data is its own question, answered by compute-optimal scaling work like Chinchilla, but the shape of the cost does not change.)
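The 70B-parameter arithmetic can be reproduced in a few lines. This is a minimal sketch, not a costing tool: the function names are hypothetical, and the 400 TFLOP/s realised throughput and $2 per GPU-hour are the same illustrative assumptions used in the worked example.

```python
# Sketch of the 6·N·D rule of thumb. All inputs are illustrative assumptions.

def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


def training_gpu_hours(flops: float, realised_flops_per_sec: float) -> float:
    """GPU-hours needed at a given realised (not peak) per-GPU throughput."""
    return flops / realised_flops_per_sec / 3600.0


flops = training_flops(params=70e9, tokens=15e12)
gpu_hours = training_gpu_hours(flops, realised_flops_per_sec=400e12)
cost_usd = gpu_hours * 2.00  # assumed rental price of $2 per GPU-hour

print(f"compute:   {flops:.2e} FLOPs")
print(f"GPU-hours: {gpu_hours / 1e6:.1f}M")
print(f"cost:      ${cost_usd / 1e6:.1f}M")
```

Changing the realised throughput or the hourly price moves the dollar figure proportionally, but the result is still a single fixed number that does not depend on how much the model is later used.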
Inference is a marginal cost that recurs
Inference is the opposite. Generating an answer costs roughly 2 FLOPs per parameter per token, but as we covered in The True Unit Cost of a Token, the real constraint during generation is memory bandwidth, not compute. What matters here is the financial shape: every request incurs its own cost, so the total scales linearly with traffic and never stops. Double your users and you roughly double your inference bill. There is no point at which it is “paid off.”
| | Training | Inference |
|---|---|---|
| Cost type | Capital expense, one-time | Marginal cost, recurring |
| Scales with | Model size × data (fixed) | Traffic / usage (unbounded) |
| Hardware bound by | Compute (matrix multiply) | Memory bandwidth (decode) |
| Runs at | High utilisation | Whatever your batching achieves |
| Main lever | Data and model-size choices | Routing, batching, caching, model size |
Why the curves diverge
Put the two on the same axes and the picture is stark. Training is a flat line: a fixed amount, spent once. Inference is a slope: it climbs with every token you serve. Early in a model’s life, when usage is small, the training bill dominates and the cost per token served looks enormous. As usage grows, that one-time cost spreads across more and more tokens and shrinks toward zero per token, while inference, the slope, just keeps adding up.
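As a rough sketch of the two curves, the snippet below models cumulative spend as a fixed training cost plus a per-token slope, and finds the volume at which cumulative inference spend overtakes the training bill. The function names and both dollar figures ($8.75M for training, $0.35 per million tokens served) are assumptions chosen for illustration, not measurements.

```python
# Flat line (training) versus slope (inference). Figures are illustrative.

def cumulative_cost(tokens_served: float,
                    training_cost_usd: float,
                    inference_usd_per_token: float) -> float:
    """Total spend so far: one-time training plus per-token inference."""
    return training_cost_usd + tokens_served * inference_usd_per_token


def tokens_until_inference_exceeds_training(training_cost_usd: float,
                                            inference_usd_per_token: float) -> float:
    """Volume at which cumulative inference spend equals the training cost."""
    return training_cost_usd / inference_usd_per_token


TRAINING_COST_USD = 8.75e6          # assumed one-time training bill
INFERENCE_USD_PER_TOKEN = 0.35 / 1e6  # assumed $0.35 per million tokens

crossover = tokens_until_inference_exceeds_training(
    TRAINING_COST_USD, INFERENCE_USD_PER_TOKEN
)
print(f"inference spend passes training spend after {crossover:.2e} tokens")

for tokens in (1e9, 1e12, 1e15):
    total = cumulative_cost(tokens, TRAINING_COST_USD, INFERENCE_USD_PER_TOKEN)
    inference_share = 1.0 - TRAINING_COST_USD / total
    print(f"{tokens:.0e} tokens served: inference is {inference_share:.1%} of lifetime spend")
```

Under these assumptions the crossover sits in the tens of trillions of tokens; past it, the slope accounts for most of the lifetime bill.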
Amortisation: the number buyers actually care about
The honest unit cost of a model is its training cost spread across every token it will ever serve, plus the marginal cost of each token:
cost per token ≈ (training cost / total tokens served) + inference cost per token
Low volume: the first term dominates → effectively paying off training
High volume: the first term vanishes → inference cost per token is all that matters

The practical consequence: if you serve a lot, do not over-index on training or fine-tuning cost, because it amortises to almost nothing. Optimise the recurring term. If you serve a little, the opposite holds, and a large up-front training or fine-tuning spend may never pay back.
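The amortisation formula is simple enough to evaluate directly. The sketch below expresses it per million tokens; the function name is hypothetical, and the $8.75M training cost and $0.35 per million tokens inference cost are illustrative assumptions.

```python
# Amortised unit cost: training spread over tokens served, plus marginal cost.

def cost_per_million_tokens(training_cost_usd: float,
                            total_tokens_served: float,
                            inference_usd_per_million: float) -> float:
    """Amortised training cost per million tokens plus the marginal cost."""
    if total_tokens_served <= 0:
        raise ValueError("total_tokens_served must be positive")
    amortised_training = training_cost_usd / total_tokens_served * 1e6
    return amortised_training + inference_usd_per_million


for tokens in (1e9, 1e12, 1e15):
    unit_cost = cost_per_million_tokens(
        training_cost_usd=8.75e6,        # assumed
        total_tokens_served=tokens,
        inference_usd_per_million=0.35,  # assumed
    )
    print(f"{tokens:.0e} tokens served: ${unit_cost:,.2f} per million tokens")
```

With these inputs the training term is thousands of dollars per million tokens at a billion tokens served and under a cent at a quadrillion, which is the low-volume and high-volume behaviour described above.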
When fine-tuning beats prompting
This framing settles a common argument. Prompting and retrieval (RAG) are pure inference cost: you pay for the extra context tokens on every single call. Fine-tuning is a small training cost (capital, paid once) that can lower the recurring bill, by letting you use shorter prompts, a smaller model, or fewer retries for the same quality.
So the decision is a break-even between a one-time cost and a per-call saving:
| Situation | Usually cheaper |
|---|---|
| High volume, stable task | Fine-tune: the one-time cost amortises, and you cut every call’s cost |
| Low volume, or task still changing | Prompt / RAG: avoid paying training cost you will not amortise |
| Need fresh or private facts at answer time | RAG: knowledge that changes does not belong baked into weights |
| Need a fixed behaviour or format, repeatedly | Fine-tune: stop re-sending the same instructions every call |
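The break-even between a one-time fine-tuning cost and a per-call saving can be sketched as a single division. Every number here is hypothetical: a $5,000 fine-tune, an input price of $3 per million tokens, and a prompt that shrinks from 2,000 to 300 tokens once the instructions are baked into the weights. The function name is also illustrative.

```python
import math

# Fine-tune break-even: how many calls before the one-time cost pays back.

def fine_tune_break_even_calls(fine_tune_cost_usd: float,
                               cost_per_call_before_usd: float,
                               cost_per_call_after_usd: float) -> float:
    """Calls needed for the per-call saving to repay the fine-tuning cost.

    Returns math.inf when fine-tuning does not reduce the per-call cost.
    """
    saving_per_call = cost_per_call_before_usd - cost_per_call_after_usd
    if saving_per_call <= 0:
        return math.inf
    return fine_tune_cost_usd / saving_per_call


INPUT_USD_PER_TOKEN = 3.00 / 1e6  # assumed input price

before = 2_000 * INPUT_USD_PER_TOKEN  # long prompt sent on every call
after = 300 * INPUT_USD_PER_TOKEN     # shorter prompt after fine-tuning

calls = fine_tune_break_even_calls(5_000.0, before, after)
print(f"break-even after about {calls:,.0f} calls")
```

Under these assumptions the fine-tune repays itself after roughly a million calls. A workload that never reaches that volume, or whose task changes before it does, is better served by prompting or RAG.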
Build vs buy: the self-hosting crossover
The same logic decides whether to call an API or run your own GPUs. An API is pure marginal cost: you pay per token, nothing when idle, and the price includes the provider’s margin. Self-hosting is a fixed cost: you rent or buy GPUs by the hour whether or not they are busy, but the per-token cost at full load is far lower. Self-hosting wins only when you keep the hardware busy enough.
Self-host, one H100 at $2.50/hr, ~2,000 tokens/sec when full:
full-load cost ≈ $0.35 per million tokens
API price (output) ≈ $3.00 per million tokens (illustrative)
Effective self-host cost = $0.35 / utilisation
utilisation 100% → $0.35 / M (≈ 9× cheaper than API)
utilisation 20% → $1.75 / M (still cheaper)
utilisation 5% → $7.00 / M (now more expensive than the API)
Break-even here is roughly 12% sustained utilisation.

Below the break-even, idle silicon makes self-hosting the more expensive option, and the API’s “pay only for what you use” model wins outright. Above it, and especially at steady, high-volume traffic, your own (well-batched) hardware pulls away. The mistake to avoid is buying GPUs for a workload that cannot keep them full.
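The self-hosting crossover can be sketched the same way. The function names are hypothetical, and the inputs are the illustrative figures from the worked example: $2.50 per GPU-hour, about 2,000 tokens per second at full load, and an API price of $3.00 per million output tokens.

```python
# Self-host versus API: effective cost as a function of utilisation.

def full_load_usd_per_million(gpu_usd_per_hour: float,
                              tokens_per_sec: float) -> float:
    """Cost per million tokens when the GPU is busy 100% of the time."""
    tokens_per_hour = tokens_per_sec * 3600.0
    return gpu_usd_per_hour / tokens_per_hour * 1e6


def effective_usd_per_million(full_load_cost: float, utilisation: float) -> float:
    """Idle hours are still paid for, so cost scales as 1 / utilisation."""
    if not 0.0 < utilisation <= 1.0:
        raise ValueError("utilisation must be in (0, 1]")
    return full_load_cost / utilisation


def break_even_utilisation(full_load_cost: float,
                           api_usd_per_million: float) -> float:
    """Utilisation at which self-hosting costs the same as the API."""
    return full_load_cost / api_usd_per_million


API_USD_PER_MILLION = 3.00  # assumed API output price
full_load = full_load_usd_per_million(gpu_usd_per_hour=2.50, tokens_per_sec=2_000)

for utilisation in (1.0, 0.20, 0.05):
    cost = effective_usd_per_million(full_load, utilisation)
    verdict = "self-host cheaper" if cost < API_USD_PER_MILLION else "API cheaper"
    print(f"utilisation {utilisation:>4.0%}: ${cost:.2f} per million tokens ({verdict})")

threshold = break_even_utilisation(full_load, API_USD_PER_MILLION)
print(f"break-even utilisation: about {threshold:.0%}")
```

The sketch uses the unrounded full-load cost, so its intermediate figures differ slightly from the rounded $0.35 in the example, but the break-even still lands at roughly 12%.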
What it means for buyers
Three questions answer most of the cost decisions: How much will this be used? Is the cost fixed or per-call? And at our real volume, on which side of the crossover do we sit? Training and fine-tuning are capital; commit that capital only where the volume will amortise it. Inference is the bill that never stops; that is where routing, batching, caching, and right-sizing the model pay back every single day. Keep the two costs separate on the page and the right answer usually becomes obvious.
- [1] Kaplan et al., "Scaling Laws for Neural Language Models" (~6N training and ~2N inference FLOPs per token)
- [2] Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla: balancing model size and data)
- [3] NVIDIA H100 Tensor Core GPU datasheet (throughput, HBM3 bandwidth, for the serving economics)