The KV Cache

If you understand one data structure in LLM inference, make it the KV cache. Almost every other optimization is downstream of how it behaves.

Why it exists

Transformers attend to every previous token. During training the whole sequence is available. During generation, recomputing attention over the full history at every step would be quadratic and absurd.

The KV cache solves this: each layer's keys and values projected from past tokens are saved. For the next token, the model computes K, V only for the new token, reads the cached K, V for everything before, and runs attention against the concatenation.

You pay the per-token K, V computation once and reuse it for the rest of generation.

Why it dominates memory

For L layers, H heads (dim d), batch B, sequence T:

KV cache size = 2 · B · L · H · d · T · bytes_per_element

It scales linearly with sequence length and batch size. For modern models a 70B serving long contexts spends more VRAM on KV cache than on weights. Common case, not corner case.

Two phases of serving

Prefill — process the prompt in one parallel forward pass to populate the cache. Compute-bound, lots of arithmetic per byte moved. GPUs love this. Dominates TTFT.

Decode — generate one token at a time. Memory-bound: tiny arithmetic per step, but the entire cache must be read from memory. The GPU sits mostly idle waiting on memory. Dominates per-user TPS and the cost of long generations.

Serious engines disaggregate prefill and decode — different resource profiles, mixing wastes both.

What an engine optimizes

Storing compactly. Naive contiguous allocation fragments badly. Paged attention (vLLM's contribution) treats the cache like virtual memory: fixed-size blocks pointed to by a per-request table. Massively increases users per GPU.

Shrinking the cache. The most impactful architectural change is MQA / GQA — multiple query heads share a smaller K, V set. K/V projection FLOPs drop, and cache size drops (less data to read per decode step). GQA is now the default for serving-oriented models.

Quantizing the cache. Storing K, V in INT8 (TurboQuant) or FP8 halves memory. Same accuracy↔memory↔kernel-support trade as weight quant, with potentially bigger upside since the cache is usually the bigger consumer.

Mental model

When LLM serving is slow or expensive, ask in order:

Are we prefill or decode bound?
If decode, are we memory-bandwidth bound? (Almost certainly yes.)
What is the KV cache doing — how big, how fragmented, how reused?