Section 12

The KV cache

Why decode is cheap and memory is expensive

In the naive loop, generating token 1,001 means re-running the model on tokens 1–1,000 and recomputing all 1,000 queries, keys, and values, even though the only new work belongs to the most recent token. That recomputation is pure waste. Once we have the keys and values for token 5, they’re frozen: attention is causal, so token 5 only ever sees tokens 1–5, and nothing that comes later can change them.

So we save them. The collection of saved $K$ and $V$ vectors, indexed by (layer, head, position), is called the KV cache. Every layer has its own KV cache; every request has its own.
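One way to picture that (layer, head, position) indexing is as a small nested container. The sketch below is illustrative only: the class name `RequestKVCache` and its methods are hypothetical, and it uses plain Python lists where a real serving system would use preallocated GPU tensors.

```python
class RequestKVCache:
    """KV cache for one request, indexed by (layer, head, position).

    Hypothetical layout for illustration: nested lists, one entry per
    cached token. Production systems use preallocated GPU tensors.
    """

    def __init__(self, num_layers, num_kv_heads):
        # keys[layer][head] is a list of key vectors, one per position.
        self.keys = [[[] for _ in range(num_kv_heads)] for _ in range(num_layers)]
        self.values = [[[] for _ in range(num_kv_heads)] for _ in range(num_layers)]

    def append(self, layer, k_per_head, v_per_head):
        """Store the new token's key and value for every head of one layer."""
        for head, (k, v) in enumerate(zip(k_per_head, v_per_head)):
            self.keys[layer][head].append(k)
            self.values[layer][head].append(v)

    def seq_len(self, layer=0):
        """Number of positions cached so far in the given layer."""
        return len(self.keys[layer][0])


cache = RequestKVCache(num_layers=2, num_kv_heads=2)

# Token 0: one tiny 2-dimensional key and value per head, for each layer.
for layer in range(2):
    cache.append(
        layer,
        k_per_head=[[1.0, 0.0], [0.0, 1.0]],
        v_per_head=[[0.5, 0.5], [0.2, 0.8]],
    )

print(cache.seq_len())      # positions cached so far
print(cache.keys[1][0][0])  # key stored at (layer 1, head 0, position 0)
```

Entries are only ever appended, never rewritten, which is the property causality gives us.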

Decode after the KV cache

With the KV cache in place, the decode step becomes much cheaper in compute. To generate token $t+1$:

  1. Take the new token’s embedding.
  2. Run it through each layer. At every attention layer:
    • Compute only the new token’s $q$, $k$, $v$: three matrix multiplies on a single token.
    • Append the new $k$ and $v$ to that layer’s KV cache.
    • Compute attention scores: just one row of $QK^\top$, against all cached keys.
    • Combine: weighted sum over all cached values.
  3. After the last layer, compute logits for the one new position and sample.
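The attention part of those steps can be sketched in a few lines. This is a minimal toy under stated assumptions, not a real model: one layer, one attention head, random weights, no MLP or normalization, and plain Python lists instead of tensors. All function names are hypothetical. The loop checks that the cached decode step gives the same output as recomputing every key and value from scratch.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """One row of softmax(q K^T / sqrt(d)) V."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def decode_step(x_new, Wq, Wk, Wv, k_cache, v_cache):
    """Cached decode: project only the new token, append, then attend."""
    q = matvec(Wq, x_new)
    k_cache.append(matvec(Wk, x_new))
    v_cache.append(matvec(Wv, x_new))
    return attend(q, k_cache, v_cache)


def naive_last_row(xs, Wq, Wk, Wv):
    """Naive loop: recompute K and V for every token, attend from the last."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)


def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d = 4
Wq, Wk, Wv = random_matrix(d, d), random_matrix(d, d), random_matrix(d, d)
tokens = random_matrix(6, d)  # six token embeddings of width d

k_cache, v_cache = [], []
for t, x in enumerate(tokens):
    cached_out = decode_step(x, Wq, Wk, Wv, k_cache, v_cache)
    naive_out = naive_last_row(tokens[: t + 1], Wq, Wk, Wv)
    assert all(abs(a - b) < 1e-9 for a, b in zip(cached_out, naive_out))

print(len(k_cache), "keys cached; cached and naive outputs match")
```

Each cached step performs three projections of a single token, while the naive version repeats the key and value projections for every earlier token at every step.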

The MLP also only runs on the single new token. Everything per step is “input length = 1,” but the attention reads grow with the total cached sequence length.

That last point is the new bottleneck. We saved compute, but we created a memory obligation. The KV cache has to be held in HBM for the duration of the request, and it can be enormous.

How big does it get?

The formula is one of the most-quoted in the field:

$$\text{bytes per token, per layer} = 2 \cdot \text{num\_kv\_heads} \cdot \text{head\_dim} \cdot \text{dtype\_bytes}$$

The factor of 2 is because we store both keys and values. Multiply by num_layers to get bytes per token across the whole model, then by sequence length to get bytes per request.

For Llama-3-8B at fp16 with full multi-head attention (32 K/V heads), one token would cost $2 \cdot 32 \cdot 128 \cdot 2 = 16{,}384$ bytes per layer, or 524 KB across 32 layers. An 8,000-token sequence then needs about 4 GB for its KV cache alone. Now imagine 64 such requests sharing the GPU: that is roughly 270 GB, far past the 64 GB of HBM that remains on an 80 GB GPU after the model weights are loaded.
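The arithmetic above is easy to reproduce. The helper names below are hypothetical, and the configuration is the full multi-head variant used in this paragraph (32 layers, 32 K/V heads, head_dim 128, 2 bytes per value), not the model as actually shipped.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache one token occupies across all layers."""
    per_layer = 2 * num_kv_heads * head_dim * dtype_bytes  # 2 = keys + values
    return per_layer * num_layers


def kv_bytes_per_request(seq_len, **model):
    """Bytes of KV cache for one request of seq_len tokens."""
    return seq_len * kv_bytes_per_token(**model)


# Hypothetical full-MHA configuration from the paragraph above.
mha_8b = dict(num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2)

per_token = kv_bytes_per_token(**mha_8b)
per_request = kv_bytes_per_request(8_000, **mha_8b)

print(f"{per_token:,} bytes per token")              # 524,288 bytes per token
print(f"{per_request / 1e9:.2f} GB per request")     # 4.19 GB per request
print(f"{64 * per_request / 1e9:.2f} GB for 64")     # 268.44 GB for 64
```

Gigabytes here are decimal (10^9 bytes), which matches the 524 KB figure in the text.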

This is the wall every serving system runs into. There are two ways out: shrink each token’s cache (GQA, KV quantization) and manage the cache better (paging, eviction, offload). Let’s start with the first.

GQA: the trick every modern model uses

Grouped-Query Attention (GQA) is the answer. Instead of having 32 query heads each paired with its own key/value head, you group, say, 4 query heads to share one K/V head. The number of query heads stays the same (so quality barely budges), but the number of stored K/V pairs drops by 4×, and the KV cache shrinks by exactly that factor.

Llama-3-8B uses 32 query heads but only 8 K/V heads. The KV cache is therefore 4× smaller than it would be with vanilla multi-head attention. The 70B model is even more aggressive: 64 query heads, 8 K/V heads, an 8× reduction.

The extreme version, multi-query attention (MQA), uses only one K/V head shared across all query heads. PaLM used it. It’s the smallest possible cache, with some quality cost.
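The head sharing can be made concrete with a small mapping from query heads to K/V heads. This sketch assumes contiguous grouping (query heads 0 to g-1 read K/V head 0, and so on), which is one common convention rather than the only possible one; the function names are hypothetical, and the 32-head MQA case is an illustrative configuration.

```python
def kv_head_for(q_head, num_q_heads, num_kv_heads):
    """Index of the K/V head that a given query head reads from.

    Assumes contiguous grouping of query heads (an assumption of this sketch).
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def cache_shrink_factor(num_q_heads, num_kv_heads):
    """How much smaller the KV cache is than with one K/V head per query head."""
    return num_q_heads // num_kv_heads


# Llama-3-8B: 32 query heads, 8 K/V heads.
mapping = [kv_head_for(h, 32, 8) for h in range(32)]
print(mapping[:8])                  # [0, 0, 0, 0, 1, 1, 1, 1]

print(cache_shrink_factor(32, 8))   # 4   (Llama-3-8B)
print(cache_shrink_factor(64, 8))   # 8   (the 70B model)
print(cache_shrink_factor(32, 1))   # 32  (MQA with 32 query heads, illustrative)
```

Only the K/V heads are stored in the cache, so the shrink factor is simply the group size.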

Try it

Play with the model preset, the sequence length, and the GQA toggle. Notice that 32k tokens × 70B with GQA off is well into double-digit gigabytes per request. With GQA on you get an order of magnitude back. Now imagine running many requests concurrently.

[Interactive figure: KV cache growth. Per-layer KV cache for one request as the sequence length grows, with a GQA on/off toggle; each bar is one layer (16 of 32 shown, L0 through L30).]

Example readout (8 K/V heads, head_dim 128, fp16):
  Per-token, per-layer bytes: 2 · num_kv_heads · head_dim · dtype_bytes = 2 · 8 · 128 · 2 = 4.0 KB
  Per-token, all 32 layers: 128.0 KB
  For 2,048 tokens × 1 request: 256.0 MB
An H100 SXM has 80 GB of HBM. Once the model weights are loaded (Llama-3-8B ≈ 16 GB at fp16), the KV cache competes for the rest. Try concurrent requests = 16 with seq_len = 8k on the 70B model: you'll quickly run out of room. Paged attention (§15) and prefix caching (§16) are vLLM's answers.

What this implies for serving

Three points to carry forward:

  1. KV cache is a real budget. After the weights are loaded, whatever HBM is left is the KV-cache pool, and that pool sets a hard limit on how many concurrent requests × how long their sequences can be. vLLM tracks this explicitly.

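  2. Long contexts are KV-cache-dominated. At 32k tokens, a handful of concurrent requests already carry more KV cache than model weights: for Llama-3-8B with GQA, each 32k-token request holds about 4 GB, so four of them match the 16 GB of fp16 weights. A 128k or 1M-token context is wildly KV-heavy.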

  3. The KV cache moves with the request. If you migrate a request from one GPU to another, you have to ship its KV cache. That’s why request-routing decisions are sticky: once a request is on a GPU, it stays there.
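The first point can be turned into a back-of-the-envelope budget. This is a rough sketch with hypothetical helper names, using numbers from this section (80 GB of HBM, about 16 GB of fp16 weights, Llama-3-8B with 8 K/V heads). It assumes decimal gigabytes and ignores activations, fragmentation, and framework overhead, so a real pool is smaller.

```python
GB = 10**9  # decimal gigabytes (an assumption of this sketch)


def kv_pool_bytes(hbm_bytes, weight_bytes):
    """HBM left for the KV cache once the weights are loaded (idealized)."""
    return hbm_bytes - weight_bytes


def max_cached_tokens(pool_bytes, bytes_per_token):
    """Total tokens, summed over all concurrent requests, that fit in the pool."""
    return pool_bytes // bytes_per_token


# Llama-3-8B with GQA at fp16: 2 (K and V) * 8 heads * 128 dims * 2 bytes * 32 layers.
bytes_per_token = 2 * 8 * 128 * 2 * 32

pool = kv_pool_bytes(80 * GB, 16 * GB)
tokens = max_cached_tokens(pool, bytes_per_token)

print(tokens)            # 488281 tokens in the pool
print(tokens // 8_000)   # 61 concurrent requests at 8,000 tokens each
print(tokens // 32_000)  # 15 concurrent requests at 32,000 tokens each
```

The pool is shared: the limit is on total cached tokens, so longer sequences directly trade off against concurrency.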

There are more KV-shrinking tricks (KV quantization, KV compression, SSM/hybrid architectures), but they don’t change the picture much. The path forward is managing the cache more cleverly: where it lives, how it’s chunked, whether it’s shared across requests. Before we get to that, we need to be precise about where bytes physically live on a GPU. That’s the next section.