Paged attention
KV cache as a page table
vLLM’s biggest single contribution is taking an idea that operating systems have used to manage RAM since the 1960s — virtual memory with paging — and applying it to the KV cache. The result, paged attention, is what made it practical to serve hundreds of concurrent requests on a single GPU. This section is about why a page table is exactly the right data structure for the KV cache, and how it changes what “running out of memory” means.
The problem with naive contiguous allocation
Before vLLM, the standard approach was: when a request arrives, allocate a contiguous slab of KV memory big enough for its maximum expected sequence length. So if your model supports 4k context, every request reserves 4k tokens worth of KV space immediately — even if it only ends up generating 200 tokens.
This is catastrophically wasteful:
- Internal fragmentation fragmentation Wasted memory from allocations that don’t fit cleanly. Paging trades internal fragmentation (≤1 page per request) for none of the external kind. See in glossary → — almost every request leaves most of its reserved slab unused.
- External fragmentation — when a request finishes, its slab is freed, but the next request might need a slightly different size and can’t reuse the hole cleanly. Memory becomes Swiss cheese over time.
- Hard limits on concurrency — you can fit far fewer concurrent requests in HBM than your actual KV-cache contents would require, because you have to reserve worst-case for each.
The 2023 vLLM paper (“Efficient Memory Management for Large Language Model Serving with PagedAttention”) measured this. Real workloads were leaving 60–80% of allocated KV memory unused.
The page table trick
Here’s the move. Stop allocating contiguous slabs at all. Instead:
- Split HBM’s KV-cache region into a pool of fixed-size pages page A fixed-size slab of KV cache memory (e.g. 16 tokens). The unit vLLM allocates and frees. See in glossary → (typically 16 tokens × all heads × head_dim × 2 for K and V). For Llama-3-8B with GQA, one page is about 1 MB.
- Each request maintains a page table page table Per-sequence mapping from logical position → physical page in the KV cache. Same idea as OS virtual memory, applied to attention. See in glossary → : a small list mapping the request’s logical position (token 0…N) to the physical block in the pool that holds those tokens’ KV.
- When the request grows by another decode step, you allocate one more page (or extend the last partial page).
- When the request finishes, its pages return to the free pool.
Internal fragmentation now caps at less than one page per request — and at 16 tokens per page, that’s at most 16 tokens × ~64 KB ≈ 1 MB wasted, regardless of the request’s total length. External fragmentation goes to zero because every allocation and free is the same size.
Modifying the attention kernel
The catch is that the GPU’s attention kernel now has to do an indirection — every key/value access goes through the page table. The vLLM team wrote a custom CUDA kernel called paged_attention that takes the per-request page table as input and walks it during the QKᵀ dot products and value combinations.
The overhead vs vanilla attention is small (a few percent). The payoff in memory efficiency is enormous: a 3-4× increase in concurrent requests per GPU, regularly. That single change accounts for much of vLLM’s headline throughput numbers.
Try it
Below is a simplified paged allocator: 24 physical blocks of 4 tokens each (real vLLM uses ~16 tokens/page, but 4 fits on screen). Add requests, decode steps, finish them, and watch:
- The block grid on the left shows physical block occupancy. Empty (dashed) blocks are free.
- The page tables on the right show each request’s logical → physical mapping.
- Try toggling prefix caching on, then add multiple requests with
+ With shared prefix. The same physical blocks back several requests’ page tables — you’ll see a teal “shared” badge on those blocks. That’s the next section’s idea.
The free-block queue
Mechanically: vLLM keeps a single FIFO queue of free physical blocks. Every allocation pops from the front; every free returns to the back. No search, no compaction, no defragmentation pass. Allocation is O(1).
The scheduler asks the allocator “do you have K free blocks?” before deciding to admit a request. If not, it either preempts a running request (sending its KV back to host RAM, or just dropping it and re-computing later when it gets re-admitted) or waits.
Preemption and swap
Two preemption modes:
-
Recompute: drop the KV cache for the preempted request entirely; when it comes back, re-run prefill. Cheap to free, expensive to resume. Good when KV is small relative to recompute cost.
-
Swap: copy the KV pages down to host RAM via PCIe; when the request comes back, copy them back up. Slower on each side but avoids the recompute. Good for long contexts where recomputing prefill is expensive.
vLLM chooses dynamically based on policy and request shape.
What this enables
Paged attention alone is “memory hygiene”: better packing, no fragmentation. But because the cache is now content-addressed by (request_id, logical_block) and the physical blocks are identical-shaped, you can also do something else that the old contiguous layout couldn’t: share physical blocks across requests. That’s prefix caching, and it’s our next section.