
Mixture-of-Experts, Speculative Decoding, and KV-Cache: Where the Next Efficiency Gains Come From

By WaferZero · Published June 16, 2026
TL;DR
  • The next inference gains come from working around the memory wall, not from raw scale: mixture-of-experts, speculative decoding, and KV-cache management.
  • MoE activates only a few experts per token, cutting compute, but every expert still sits in memory, so it trades compute for memory and network traffic.
  • Speculative decoding uses a cheap draft model to propose tokens that the big model verifies in one parallel pass, a lossless 2 to 3x speedup that depends on acceptance rate.
  • The KV-cache can grow larger than the model weights for long context and is read every step, which is the real reason long context is expensive; GQA, quantisation, and paged attention shrink it.
  • The three are complementary and stack, but each attacks a different bottleneck, so the saving is workload-specific.

Dense models are running into a wall: making them smarter by making them bigger makes every token proportionally more expensive to serve. The next round of gains is coming not from raw scale but from three techniques that get more useful work out of the same hardware: mixture-of-experts, speculative decoding, and smarter management of the KV-cache. Each one attacks a different bottleneck, and understanding which is which tells you where inference cost is actually going.

Mixture-of-experts: pay for a fraction of the model

A dense model runs every one of its parameters for every token. A mixture-of-experts (MoE) model instead splits the feed-forward layers into many “experts” and adds a small router that picks just a few of them per token. A model might hold 64 experts but activate only two, so its total capacity is huge while its active compute per token stays small. Mixtral 8x7B is the familiar example: about 47B parameters total, but only around 13B active on any given token.

[Figure: 8 experts (numbered 1 to 8) held in memory · top-2 run per token (Mixtral-style)]
An MoE layer keeps many experts in memory but routes each token to only a couple of them. Capacity goes up; compute per token stays low.
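
The routing step described above can be sketched in a few lines. This is a minimal illustration, not any particular model's implementation: the experts are hypothetical scalar functions, the router logits are made-up numbers, and `route_top_k` and `moe_layer` are names invented for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k experts with the highest router score and
    renormalise their weights so they sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs.
    The other experts are never evaluated for this token."""
    out = 0.0
    for idx, weight in route_top_k(router_logits, k):
        out += weight * experts[idx](x)
    return out

# Hypothetical toy setup: 8 "experts" that just scale their input.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]  # made-up router output

chosen = route_top_k(logits, k=2)   # experts 1 and 4 win
y = moe_layer(3.0, experts, logits)
print(chosen, y)
```

Only two of the eight expert functions are called for this token, which is the whole point: the other six cost memory but no compute.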

The catch is that the saving is in compute, not memory. Every expert still has to sit in HBM (the GPU's high-bandwidth memory) in case the router calls it, so an MoE model has a large memory footprint even though it does little math per token. Worse, when the experts are spread across many GPUs, routing tokens to the right expert means an all-to-all shuffle across the interconnect, and load imbalance (some experts getting more tokens than others) leaves some GPUs idle while others queue. MoE is best understood as trading compute for memory and network traffic: exactly the bottlenecks we keep returning to.
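
A rough back-of-the-envelope version of that trade, using the Mixtral-style figures quoted earlier (about 47B total, about 13B active). The assumptions are labeled in the code: 16-bit weights, and the common rule of thumb of roughly 2 FLOPs per active parameter per token. The token-to-expert assignment at the end is invented purely to show how imbalance might be measured.

```python
def weight_memory_gb(total_params, bytes_per_param=2):
    """Memory needed to hold ALL parameters (assumes 16-bit weights)."""
    return total_params * bytes_per_param / 1e9

def flops_per_token(active_params):
    """Rule-of-thumb forward-pass cost: ~2 FLOPs per active parameter."""
    return 2 * active_params

def load_imbalance(assignments, num_experts):
    """Ratio of the busiest expert's token count to the mean count.
    1.0 means perfectly balanced; higher means hot spots."""
    counts = [0] * num_experts
    for expert in assignments:
        counts[expert] += 1
    mean = sum(counts) / num_experts
    return max(counts) / mean

total, active = 47e9, 13e9           # figures quoted in the article
mem = weight_memory_gb(total)        # memory scales with TOTAL parameters
moe_flops = flops_per_token(active)  # compute scales with ACTIVE parameters
dense_flops = flops_per_token(total)
ratio = dense_flops / moe_flops

# Made-up routing decisions for 8 tokens over 4 experts.
imbalance = load_imbalance([0, 0, 0, 1, 2, 0, 3, 1], num_experts=4)

print(f"weights: {mem:.0f} GB, compute saving vs dense: {ratio:.1f}x, "
      f"imbalance: {imbalance:.1f}")
```

Under these assumptions the model needs the memory of a 47B model while doing the arithmetic of a 13B one, and in the toy assignment the busiest expert handles twice the average load.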

Speculative decoding: draft cheap, verify in bulk

Token generation is sequential and memory-bound: each new token requires reading the model’s weights from memory, and you cannot start token N+1 until token N exists. Speculative decoding breaks that serial chain. A small, cheap draft model quickly proposes several tokens ahead; the big model then verifies all of them in a single parallel pass, which reads the weights once for the whole run of drafted tokens and so costs little more than generating a single token. Every guess the big model accepts is a token produced without its own sequential step.

draft model proposes:   "the  cat  sat  on   a"
big model verifies all five in ONE parallel pass:
  accepts "the cat sat on"  (4 tokens), corrects the 5th
=> 5 tokens (4 accepted + 1 corrected) for the cost of one big-model step, plus the draft model's work
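
The example above can be reproduced with a toy greedy version of the loop. This is a sketch under simplifying assumptions: both "models" are hypothetical lookup functions that return a fixed next token for a given position, and the verification is written as a Python loop where a real system would use one batched forward pass. `speculative_step` is a name invented here.

```python
def speculative_step(prefix, draft_next, target_next, gamma=5):
    """One round of greedy speculative decoding.
    Returns the tokens emitted in this round."""
    # 1. The draft model proposes gamma tokens, one after another.
    proposal = []
    ctx = list(prefix)
    for _ in range(gamma):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The target model says what IT would emit at each position.
    #    (A real system gets all gamma + 1 answers from one parallel pass.)
    targets = [target_next(list(prefix) + proposal[:i]) for i in range(gamma + 1)]

    # 3. Accept drafted tokens until the first disagreement, then append
    #    the target's own token (a correction, or a bonus if all matched).
    accepted = []
    for i in range(gamma):
        if proposal[i] != targets[i]:
            break
        accepted.append(proposal[i])
    accepted.append(targets[len(accepted)])
    return accepted

# Hypothetical toy models: each returns a fixed token for a given position.
target_seq = ["the", "cat", "sat", "on", "the", "mat", "."]
draft_seq = ["the", "cat", "sat", "on", "a", "mat", "."]
target_next = lambda ctx: target_seq[len(ctx)]
draft_next = lambda ctx: draft_seq[len(ctx)]

out = speculative_step([], draft_next, target_next, gamma=5)
print(out)  # ['the', 'cat', 'sat', 'on', 'the']
```

Four drafted tokens are accepted and the fifth is replaced by the big model's own choice, so the round emits exactly what plain greedy decoding with the big model would have produced.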

Crucially, this is lossless: the accept-or-reject rule is built so that the final output follows the same distribution the big model would have produced on its own (under greedy decoding, exactly the same tokens), so output quality is unchanged. The speedup (often two to three times) depends mainly on the acceptance rate, which is high for predictable text and low for genuinely surprising content, and on how cheap the draft model is relative to the big one. The cost is running and storing a second model and some added complexity. Self-speculation variants (extra prediction heads on the main model) avoid the separate draft model at the price of training changes.
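
To see how strongly the speedup tracks the acceptance rate, the estimate from the speculative decoding paper [2] can be written out directly. It assumes each drafted token is accepted independently with probability `alpha`; `gamma` is the number of drafted tokens per round and `draft_cost` is the draft model's step time as a fraction of the big model's. The specific values below are illustrative, not measurements.

```python
def expected_tokens(alpha, gamma):
    """Expected tokens emitted per big-model pass, assuming each drafted
    token is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def estimated_speedup(alpha, gamma, draft_cost):
    """Expected wall-clock improvement over plain decoding.
    draft_cost = (time of one draft step) / (time of one big-model step)."""
    return expected_tokens(alpha, gamma) / (gamma * draft_cost + 1)

# Illustrative settings: 4 drafted tokens, draft model 20x cheaper per step.
for alpha in (0.3, 0.6, 0.8):
    print(alpha,
          round(expected_tokens(alpha, 4), 2),
          round(estimated_speedup(alpha, 4, 0.05), 2))
# alpha = 0.8 gives about 3.36 tokens per pass and about a 2.8x speedup;
# alpha = 0.3 gives about 1.43 tokens per pass and about a 1.19x speedup.
```

Under these assumptions, predictable text lands in the quoted two-to-three-times range, while low acceptance rates leave little gain once the draft model's own cost is paid.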

KV-cache: the memory that makes long context expensive

To avoid recomputing attention keys and values for the entire context on every step, a model caches them for every previous token, at every layer. This KV-cache is what makes generation tractable, but its size grows linearly with context length and batch size:

KV-cache bytes ≈ 2 (K and V) × layers × heads × head_dim
                 × sequence_length × batch × bytes_per_value
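
Plugging numbers into the formula makes the scale concrete. The model shape below (32 layers, 32 heads, head dimension 128, 16-bit values, roughly 7B parameters) is an assumed, illustrative configuration rather than a reference to any specific deployment.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value=2):
    """KV-cache size from the formula above. kv_heads equals the number of
    attention heads for standard multi-head attention."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

# Assumed 7B-class shape: 32 layers, 32 heads, head_dim 128, 16-bit values.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch=1)
full = kv_cache_bytes(32, 32, 128, seq_len=32768, batch=1)
weights = 7e9 * 2  # ~7B parameters at 2 bytes each

print(per_token / 2**20, "MiB per token")         # 0.5 MiB per token
print(full / 2**30, "GiB at 32k context")         # 16.0 GiB at 32k context
print(weights / 2**30, "GiB of weights (approx)")
```

With this shape every cached token costs half a mebibyte, so a single 32k-token sequence already needs more memory than the 16-bit weights, and a batch multiplies that again.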

For long contexts the KV-cache can grow larger than the model weights, and it must be read on every decode step, so it competes for exactly the memory bandwidth that already limits generation. This, not compute, is why long context is expensive. The fixes are all about shrinking or managing that cache:

Technique | What it does
--- | ---
Grouped-query attention (GQA / MQA) | Share keys and values across attention heads, shrinking the cache several-fold
Quantised KV-cache | Store keys and values in 8-bit or 4-bit instead of 16-bit
PagedAttention (vLLM) | Manage the cache like virtual-memory pages to cut waste and raise batch size
Sliding-window / sparse attention | Bound how much past context each token attends to
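
A rough sketch of how these techniques change the arithmetic. Everything here is illustrative: the model shape, the 8 KV heads, the 4096-token window, the 16-token block size, and the sequence lengths are all assumptions, and the quantised figures ignore the small overhead of scale factors. The paging functions only count reserved token slots; they are not vLLM's allocator.

```python
import math

def kv_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value):
    """Per-sequence KV-cache size for a given number of cached tokens."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

# Assumed shape: 32 layers, head_dim 128, 32k context.
base = kv_bytes(32, 32, 128, 32768, 2)      # full multi-head, 16-bit
gqa = kv_bytes(32, 8, 128, 32768, 2)        # GQA: 8 KV heads shared by 32 query heads
gqa_int8 = kv_bytes(32, 8, 128, 32768, 1)   # plus 8-bit storage
windowed = kv_bytes(32, 8, 128, min(32768, 4096), 1)  # plus a 4096-token window

def contiguous_slots(seq_lens, max_len):
    """Token slots reserved if every sequence pre-allocates max_len."""
    return len(seq_lens) * max_len

def paged_slots(seq_lens, block_size=16):
    """Token slots reserved if each sequence takes fixed-size blocks
    only as it grows (waste is at most one partial block each)."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)

lens = [100, 2000, 37, 900]  # made-up current lengths of four requests
print(base // gqa, base // gqa_int8, base // windowed)        # 4 8 64
print(contiguous_slots(lens, 4096), paged_slots(lens), sum(lens))  # 16384 3072 3037
```

The first three techniques shrink each sequence's cache; paging instead cuts the slots reserved but never used, which is what frees room for a larger batch.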

How they compose, and where they stop

The three are complementary because they attack different bottlenecks. MoE cuts the active compute per token; speculative decoding cuts the number of sequential big-model passes; KV-cache management cuts the memory and bandwidth that context consumes. A modern serving stack uses all three at once, and their gains largely stack.

The takeaway

When someone says a new model is “cheaper to run,” the useful question is which bottleneck they relieved: compute (MoE), sequential latency (speculative decoding), or context memory (KV-cache tricks). The answer tells you whether the saving will hold for your workload, because a technique that helps predictable, short-context traffic may do little for long-context, hard-reasoning traffic. Efficiency is workload-specific, and reading it correctly is worth real money.

Sources
  [1] Shazeer et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer"
  [2] Leviathan et al., "Fast Inference from Transformers via Speculative Decoding"
  [3] Kwon et al., "Efficient Memory Management for LLM Serving with PagedAttention" (vLLM)
  [4] Ainslie et al., "GQA: Training Generalized Multi-Query Transformer Models"
