Section 15

Paged attention

KV cache as a page table

vLLM’s biggest single contribution is taking an idea that operating systems have used to manage RAM since the 1960s — virtual memory with paging — and applying it to the KV cache. The result, paged attention, is what made it practical to serve hundreds of concurrent requests on a single GPU. This section is about why a page table is exactly the right data structure for the KV cache, and how it changes what “running out of memory” means.

The problem with naive contiguous allocation

Before vLLM, the standard approach was: when a request arrives, allocate a contiguous slab of KV memory big enough for its maximum expected sequence length. So if your model supports 4k context, every request reserves 4k tokens worth of KV space immediately — even if it only ends up generating 200 tokens.

This is catastrophically wasteful:

Internal fragmentation — almost every request leaves most of its reserved slab unused.
External fragmentation — when a request finishes, its slab is freed, but the next request might need a slightly different size and can’t reuse the hole cleanly. Memory becomes Swiss cheese over time.
Hard limits on concurrency — you can fit far fewer concurrent requests in HBM than your actual KV-cache contents would require, because you have to reserve worst-case for each.

The 2023 vLLM paper (“Efficient Memory Management for Large Language Model Serving with PagedAttention”) measured this. Real workloads were leaving 60–80% of allocated KV memory unused.

The page table trick

Here’s the move. Stop allocating contiguous slabs at all. Instead:

Split HBM’s KV-cache region into a pool of fixed-size pages (typically 16 tokens × all heads × head_dim × 2 for K and V). For Llama-3-8B with GQA, one page is about 1 MB.
Each request maintains a page table : a small list mapping the request’s logical position (token 0…N) to the physical block in the pool that holds those tokens’ KV.
When the request grows by another decode step, you allocate one more page (or extend the last partial page).
When the request finishes, its pages return to the free pool.

Internal fragmentation now caps at less than one page per request — and at 16 tokens per page, that’s at most 16 tokens × ~64 KB ≈ 1 MB wasted, regardless of the request’s total length. External fragmentation goes to zero because every allocation and free is the same size.

Modifying the attention kernel

The catch is that the GPU’s attention kernel now has to do an indirection — every key/value access goes through the page table. The vLLM team wrote a custom CUDA kernel called paged_attention that takes the per-request page table as input and walks it during the QKᵀ dot products and value combinations.

The overhead vs vanilla attention is small (a few percent). The payoff in memory efficiency is enormous: a 3-4× increase in concurrent requests per GPU, regularly. That single change accounts for much of vLLM’s headline throughput numbers.

Try it

Below is a simplified paged allocator: 24 physical blocks of 4 tokens each (real vLLM uses ~16 tokens/page, but 4 fits on screen). Add requests, decode steps, finish them, and watch:

The block grid on the left shows physical block occupancy. Empty (dashed) blocks are free.
The page tables on the right show each request’s logical → physical mapping.
Try toggling prefix caching on, then add multiple requests with + With shared prefix. The same physical blocks back several requests’ page tables — you’ll see a teal “shared” badge on those blocks. That’s the next section’s idea.

Paged attention allocator

24 physical blocks of 4 tokens each. Add requests, step decoding, finish them — watch the page table track everything.

Prefix caching

Physical KV blocks

Per-request page tables

Click "+ Request" to start.

Used / total

0 / 24

Utilization

Wasted (internal frag)

0.0%

Each request gets its own pages. Toggle prefix caching on and add multiple "shared prefix" requests to see the same physical blocks light up across multiple page tables.

The free-block queue

Mechanically: vLLM keeps a single FIFO queue of free physical blocks. Every allocation pops from the front; every free returns to the back. No search, no compaction, no defragmentation pass. Allocation is O(1).

The scheduler asks the allocator “do you have K free blocks?” before deciding to admit a request. If not, it either preempts a running request (sending its KV back to host RAM, or just dropping it and re-computing later when it gets re-admitted) or waits.

Preemption and swap

Two preemption modes:

Recompute: drop the KV cache for the preempted request entirely; when it comes back, re-run prefill. Cheap to free, expensive to resume. Good when KV is small relative to recompute cost.
Swap: copy the KV pages down to host RAM via PCIe; when the request comes back, copy them back up. Slower on each side but avoids the recompute. Good for long contexts where recomputing prefill is expensive.

vLLM chooses dynamically based on policy and request shape.

What this enables

Paged attention alone is “memory hygiene”: better packing, no fragmentation. But because the cache is now content-addressed by (request_id, logical_block) and the physical blocks are identical-shaped, you can also do something else that the old contiguous layout couldn’t: share physical blocks across requests. That’s prefix caching, and it’s our next section.