
Training vs Inference Economics: Why the Cost Curves Diverge, and What It Means for Buyers

By WaferZero | Published June 16, 2026

TL;DR

→ Training is a one-time capital cost (about 6·N·D FLOPs); inference is a marginal cost that recurs with every request and scales with traffic.
→ Plotted against usage, training is a flat line and inference is a slope, so for any well-used model, inference is where the lifetime money goes.
→ The honest unit cost is training spread over all tokens served plus the marginal cost per token; at high volume the training term vanishes.
→ Fine-tuning is capital that lowers the recurring bill: it beats prompting/RAG when volume is high and the task is stable, and loses when volume is low or facts change.
→ Self-hosting is a fixed cost that beats per-token APIs only above a utilisation break-even (roughly 12% in the worked example); idle GPUs lose to the API.

Training and inference are both “running AI,” but financially they are nothing alike. One is a capital expense you pay once, up front, before a single user shows up. The other is a marginal cost that recurs with every request, forever. Treat them as the same line item and you will make the wrong call on what to build, what to buy, and when to fine-tune. This piece separates the two and gives you the break-even arithmetic that follows.

Training is a capital event

Training a model is a single, enormous, compute-bound job. A useful rule of thumb is that training costs about 6 FLOPs per parameter per token (a forward pass plus a roughly twice-as-expensive backward pass). So the total compute to train a model with N parameters on D tokens is about 6 · N · D. Because the work is dense matrix multiplication, it runs at high efficiency on the hardware, so this number translates fairly directly into GPU-hours.

Training compute ≈ 6 · N · D

Example: a 70B-parameter model on 15T tokens
  = 6 × 70e9 × 15e12  ≈ 6.3e24 FLOPs

On H100s at ~400 TFLOP/s realised (≈40% of the ~1,000 TFLOP/s dense BF16 peak):
  ≈ 1.6e10 GPU-seconds ≈ 4.4M GPU-hours
  ≈ $8–9M at $2/GPU-hour   (illustrative)

The defining feature of that cost is that it is fixed and one-time. You pay it before the model serves anyone, and you pay the same amount whether the model later handles a thousand requests or a trillion. (How you split the budget between a bigger model and more training data is its own question, answered by compute-optimal scaling work like Chinchilla, but the shape of the cost does not change.)
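The worked example above is simple enough to reproduce in a few lines. What follows is a minimal sketch of the 6 · N · D rule of thumb, using the same illustrative inputs (70B parameters, 15T tokens, ~400 TFLOP/s realised, $2 per GPU-hour). The function name and argument names are made up for this illustration, and the result is only as good as those assumed inputs.

```python
def training_cost(params, tokens, realised_flops_per_s, usd_per_gpu_hour):
    """Estimate training FLOPs, GPU-hours and dollars via the 6*N*D rule of thumb."""
    total_flops = 6 * params * tokens              # forward + backward pass
    gpu_seconds = total_flops / realised_flops_per_s
    gpu_hours = gpu_seconds / 3600
    return total_flops, gpu_hours, gpu_hours * usd_per_gpu_hour


# Illustrative inputs taken from the worked example, not measured values.
flops, gpu_hours, usd = training_cost(
    params=70e9,
    tokens=15e12,
    realised_flops_per_s=400e12,
    usd_per_gpu_hour=2.00,
)
print(f"{flops:.1e} FLOPs, {gpu_hours / 1e6:.1f}M GPU-hours, ${usd / 1e6:.1f}M")
```

Nothing in this function depends on traffic, which is the point: the output is the same whether the model later serves a thousand requests or a trillion.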

Inference is a marginal cost that recurs

Inference is the opposite. Generating an answer costs roughly 2 FLOPs per parameter per token, but as we covered in “The True Unit Cost of a Token,” the real constraint during generation is memory bandwidth, not compute. What matters here is the financial shape: every request incurs its own cost, so the total scales linearly with traffic and never stops. Double your users and you roughly double your inference bill. There is no point at which it is “paid off.”

	Training	Inference
Cost type	Capital expense, one-time	Marginal cost, recurring
Scales with	Model size × data (fixed)	Traffic / usage (unbounded)
Hardware bound by	Compute (matrix multiply)	Memory bandwidth (decode)
Runs at	High utilisation	Whatever your batching achieves
Main lever	Data and model-size choices	Routing, batching, caching, model size

Why the curves diverge

Put the two on the same axes and the picture is stark. Training is a flat line: a fixed amount, spent once. Inference is a slope: it climbs with every token you serve. Early in a model’s life, when usage is small, the training bill dominates and the cost per token served looks enormous. As usage grows, that one-time cost spreads across more and more tokens and shrinks toward zero per token, while inference, the slope, just keeps adding up.

Figure: Training is a one-time fixed cost; inference rises with usage. Past the crossover, cumulative inference spend overtakes the entire cost of training the model.
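To put a rough number on the crossover, here is a small sketch that finds the volume at which cumulative inference spend equals the one-time training cost. The $8.75M training figure comes from the training example; the $0.35 per million tokens inference cost is an assumption chosen purely for illustration, and the function names are hypothetical.

```python
def cumulative_inference_usd(tokens_served, inference_usd_per_million):
    """Inference spend so far: a straight line through zero (the 'slope')."""
    return tokens_served / 1e6 * inference_usd_per_million


def crossover_tokens(training_usd, inference_usd_per_million):
    """Tokens served at which cumulative inference spend equals training cost."""
    return training_usd / inference_usd_per_million * 1e6


# Assumed, illustrative inputs.
TRAINING_USD = 8.75e6
INFERENCE_USD_PER_MILLION = 0.35

tokens = crossover_tokens(TRAINING_USD, INFERENCE_USD_PER_MILLION)
print(f"crossover at about {tokens / 1e12:.0f} trillion tokens served")
```

Under those assumptions the crossover sits at about 25 trillion tokens. A cheaper inference stack pushes the crossover out; heavier traffic reaches it sooner.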

Amortisation: the number buyers actually care about

The honest unit cost of a model is its training cost spread across every token it will ever serve, plus the marginal cost of each token:

cost per token ≈ (training cost / total tokens served) + inference cost per token

Low volume:  the first term dominates  → effectively paying off training
High volume: the first term vanishes   → inference cost per token is all that matters

The practical consequence: if you serve a lot, do not over-index on training or fine-tuning cost, because it amortises to almost nothing. Optimise the recurring term. If you serve a little, the opposite holds, and a large up-front training or fine-tuning spend may never pay back.
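The amortisation formula can be sketched directly. The example below assumes an $8.75M training cost and a marginal cost of $0.35 per million tokens, both illustrative, and evaluates the blended cost at three hypothetical lifetime volumes. The function name is invented for this sketch.

```python
def cost_per_million_tokens(training_usd, tokens_served, inference_usd_per_million):
    """Amortised training cost plus marginal inference cost, per million tokens."""
    if tokens_served <= 0:
        raise ValueError("tokens_served must be positive")
    amortised_training = training_usd / (tokens_served / 1e6)
    return amortised_training + inference_usd_per_million


# Assumed inputs: $8.75M to train, $0.35 per million tokens to serve.
for tokens_served in (1e9, 1e12, 1e15):
    blended = cost_per_million_tokens(8.75e6, tokens_served, 0.35)
    print(f"{tokens_served:.0e} tokens served -> ${blended:,.2f} per million")
```

At a billion tokens the training term is thousands of dollars per million tokens; at a quadrillion it is under a cent, and the marginal term is effectively the whole price.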

When fine-tuning beats prompting

This framing settles a common argument. Prompting and retrieval (RAG) are pure inference cost: you pay for the extra context tokens on every single call. Fine-tuning is a small training cost (capital, paid once) that can lower the recurring bill, by letting you use shorter prompts, a smaller model, or fewer retries for the same quality.

So the decision is a break-even between a one-time cost and a per-call saving:

Situation	Usually cheaper
High volume, stable task	Fine-tune: the one-time cost amortises, and you cut every call’s cost
Low volume, or task still changing	Prompt / RAG: avoid paying training cost you will not amortise
Need fresh or private facts at answer time	RAG: knowledge that changes does not belong baked into weights
Need a fixed behaviour or format, repeatedly	Fine-tune: stop re-sending the same instructions every call
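The break-even itself is one division: the one-time fine-tuning cost over the saving per call. The sketch below models only the "shorter prompt" saving and ignores model-size and retry effects. Every number in it (a $5,000 fine-tune, 1,500 prompt tokens saved per call, $3 per million input tokens) is a hypothetical placeholder to replace with your own.

```python
import math


def breakeven_calls(finetune_usd, tokens_saved_per_call, usd_per_million_tokens):
    """Number of calls after which a one-time fine-tune has paid for itself."""
    saving_per_call = tokens_saved_per_call / 1e6 * usd_per_million_tokens
    if saving_per_call <= 0:
        return math.inf  # no per-call saving means the fine-tune never pays back
    return finetune_usd / saving_per_call


# Hypothetical inputs for illustration only.
calls = breakeven_calls(
    finetune_usd=5_000,
    tokens_saved_per_call=1_500,
    usd_per_million_tokens=3.00,
)
print(f"break-even after about {calls:,.0f} calls")
```

With these placeholder numbers the fine-tune needs a little over a million calls to pay back, which is trivial for a high-volume product and unreachable for an internal tool used a few hundred times a day.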

Build vs buy: the self-hosting crossover

The same logic decides whether to call an API or run your own GPUs. An API is pure marginal cost: you pay per token, nothing when idle, and the price includes the provider’s margin. Self-hosting is a fixed cost: you rent or buy GPUs by the hour whether or not they are busy, but the per-token cost at full load is far lower. Self-hosting wins only when you keep the hardware busy enough.

Self-host, one H100 at $2.50/hr, ~2,000 tokens/sec when full:
  full-load cost ≈ $0.35 per million tokens
  API price (output)   ≈ $3.00 per million tokens   (illustrative)

Effective self-host cost = $0.35 / utilisation
  utilisation 100%  → $0.35 / M   (≈ 9× cheaper than API)
  utilisation 20%   → $1.75 / M   (still cheaper)
  utilisation 5%    → $7.00 / M   (now more expensive than the API)

Break-even here is roughly 12% sustained utilisation.

Below the break-even, idle silicon makes self-hosting the more expensive option, and the API’s “pay only for what you use” model wins outright. Above it, and especially at steady, high-volume traffic, your own (well-batched) hardware pulls away. The mistake to avoid is buying GPUs for a workload that cannot keep them full.
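The self-hosting arithmetic above can be checked with a short sketch. It uses the same illustrative inputs (one GPU at $2.50 per hour, ~2,000 tokens per second at full load, a $3.00 per million token API price) and makes the simplifying assumption that cost scales as 1 / utilisation. The function names are hypothetical.

```python
def full_load_usd_per_million(gpu_usd_per_hour, tokens_per_second):
    """Self-host cost per million tokens when the GPU is fully busy."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1e6


def effective_usd_per_million(full_load_usd, utilisation):
    """Idle hours are still paid for, so cost scales as 1 / utilisation."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return full_load_usd / utilisation


def breakeven_utilisation(full_load_usd, api_usd_per_million):
    """Utilisation at which self-hosting costs the same per token as the API."""
    return full_load_usd / api_usd_per_million


# Illustrative inputs from the worked example.
full_load = full_load_usd_per_million(gpu_usd_per_hour=2.50, tokens_per_second=2_000)
api_price = 3.00
for utilisation in (1.0, 0.20, 0.05):
    cost = effective_usd_per_million(full_load, utilisation)
    verdict = "cheaper than API" if cost < api_price else "more expensive than API"
    print(f"{utilisation:>4.0%} utilised -> ${cost:.2f} / M ({verdict})")
print(f"break-even utilisation: {breakeven_utilisation(full_load, api_price):.0%}")
```

The sketch leaves out engineering time, redundancy, and the throughput lost to poor batching, all of which push the real break-even higher than the raw division suggests.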

What it means for buyers

Three questions answer most cost decisions: How much will this be used? Is the cost fixed or per-call? And at our real volume, on which side of the crossover do we sit? Training and fine-tuning are capital; commit that spend only where the volume will amortise it. Inference is the bill that never stops; that is where routing, batching, caching, and right-sizing the model pay back every single day. Keep the two costs separate on the page and the right answer usually becomes obvious.
