GPU memory hierarchy
Where data actually lives on H100s
Every optimization in vLLM ultimately bottoms out in one fact: bytes move at very different speeds depending on where they live. To reason about why prefill is fast, why decode is slow, why batching helps, why KV cache offload is a real option, and why multi-GPU serving requires NVLink, you have to know the rough numbers of the GPU memory hierarchy.
We’ll use the Nvidia H100 SXM5 as our concrete reference because it’s the workhorse of modern inference deployments. Other chips (A100, B100/B200, MI300X) have similar shapes with different magnitudes.
The tiers
From smallest and fastest at the top, to largest and slowest at the bottom:
| Tier | Location | Capacity | Bandwidth | Latency |
|---|---|---|---|---|
| Registers | per-thread, on-die | ~256 KB/SM | ~120 TB/s | 1 cycle |
| L1 / Shared (SRAM) | per-SM scratchpad | ~228 KB/SM | ~33 TB/s | tens of cycles |
| L2 cache | shared across SMs | 60 MB | ~12 TB/s | ~250 cycles |
| HBM3 (VRAM) | on the GPU package | 80 GB | 3.35 TB/s | ~400 cycles |
| NVLink | GPU ↔ peer GPU | n/a | 900 GB/s | µs |
| PCIe Gen5 | GPU ↔ host CPU/RAM | n/a | ~64 GB/s | µs |
| Host DDR5 RAM | on the CPU | 512 GB–2 TB | ~400 GB/s | ~100 ns |
| NVMe SSD | local disk | TBs | ~14 GB/s | ~50 µs |
| NIC (RDMA / InfiniBand) | node ↔ node | n/a | ~50 GB/s | 1–10 µs |
The spread from registers to NIC is about 2,400× in bandwidth (~120 TB/s versus ~50 GB/s) and even more in latency. Every order of magnitude matters.
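To make the spread concrete, here is a small sketch that divides the table's bandwidth figures against each other. The dictionary and the `slowdown` helper are illustrative names, and the numbers are the table's round figures rather than measurements.

```python
# Approximate peak bandwidths from the table above, in GB/s.
# These are round numbers for reasoning, not measured values.
BANDWIDTH_GBPS = {
    "registers": 120_000,
    "sram": 33_000,
    "l2": 12_000,
    "hbm3": 3_350,
    "nvlink": 900,
    "host_ddr5": 400,
    "pcie_gen5": 64,
    "nic": 50,
    "nvme": 14,
}


def slowdown(fast: str, slow: str) -> float:
    """How many times less bandwidth `slow` has compared with `fast`."""
    return BANDWIDTH_GBPS[fast] / BANDWIDTH_GBPS[slow]


if __name__ == "__main__":
    for tier in BANDWIDTH_GBPS:
        ratio = slowdown("registers", tier)
        print(f"{tier:>10}: {ratio:8.0f}x less bandwidth than registers")
```

The same helper answers the comparisons that come up later, for example `slowdown("hbm3", "pcie_gen5")` for the cost of leaving the GPU.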
What lives where
In a running vLLM worker, here’s where each kind of byte typically sits:
- Model weights: HBM. About 16 GB for Llama-3-8B at fp16, ~140 GB for 70B (split across multiple GPUs).
- KV cache: HBM. Whatever is left after weights.
- Activations (intermediate per-step tensors): HBM, briefly. SRAM/registers during a kernel.
- The kernel’s actual operands: SRAM and registers. The matrix multiplies happen here.
- Idle KV pages from paused requests (if offload is enabled): host DDR5 RAM, reached over PCIe.
- Other GPUs’ shards of the same model: in their HBM, reached over NVLink.
- Other nodes’ shards: in their HBM, reached over the NIC via RDMA + GPUDirect.
- Cold model files: on NVMe disk until first load.
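"Whatever is left after weights" can be put into rough numbers. The sketch below estimates how many tokens of KV cache fit on one H100 next to Llama-3-8B. It assumes fp16 KV entries, the Llama-3-8B attention shape (32 layers, 8 KV heads, head dimension 128), and a 90% usable share of HBM, similar in spirit to vLLM's `gpu_memory_utilization` setting. It ignores activations and workspace memory, so treat the result as an upper bound.

```python
# All numbers below are assumptions for illustration.
HBM_BYTES = 80e9          # H100 SXM5 HBM capacity
WEIGHT_BYTES = 16e9       # Llama-3-8B at fp16, as quoted in the list above
USABLE_FRACTION = 0.90    # assumed share of HBM the server may claim

# Llama-3-8B attention shape (grouped-query attention).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2        # fp16


def kv_bytes_per_token() -> int:
    """Bytes of K and V stored for one token across all layers."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


def kv_token_capacity() -> int:
    """Tokens of KV cache that fit in the HBM left over after weights."""
    free_bytes = HBM_BYTES * USABLE_FRACTION - WEIGHT_BYTES
    return int(free_bytes // kv_bytes_per_token())


if __name__ == "__main__":
    print(f"KV bytes per token: {kv_bytes_per_token():,}")
    print(f"KV token capacity:  {kv_token_capacity():,}")
```

Under these assumptions each token costs 128 KiB of KV cache, and roughly 427,000 tokens fit alongside the weights. That pool is shared by every request the worker is serving.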
Try the scenarios
The visual below traces where bytes go in five common operations. Click through them and watch the animated packet follow its hops up and down the hierarchy.
A few things to notice:
- HBM is fast, but not “free.” The H100’s 3.35 TB/s is among the fastest DRAM you can buy, but by the table’s bandwidth figures it’s still about 10× slower than SRAM and about 36× slower than registers. The whole job of a CUDA kernel is to read HBM into SRAM once and then squeeze as much arithmetic out of those bytes as possible before they go away.
- PCIe is a cliff. Going off the GPU is about 50× slower than HBM. This is why KV cache offload exists but isn’t free: pulling a 128 MB KV slab back from host RAM takes about 2 ms, on top of whatever queueing the request did.
- NVLink is why multi-GPU works. For tensor parallelism you do an all-reduce after every layer. On NVLink that’s hundreds of microseconds; on PCIe it would be tens of milliseconds, which would leave the GPUs waiting on communication most of the time. NVLink is what makes “1 model across 8 GPUs” a realistic deployment shape.
- The NIC is the bottleneck for multi-node. Expect ~50 GB/s for the best InfiniBand or RoCE setups, less for cloud-grade interconnects. GPUDirect RDMA avoids a round-trip through host RAM by letting the NIC DMA straight to/from HBM, which is essential at scale.
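The 2 ms figure for the KV slab is just bytes divided by bandwidth, and the same arithmetic works for every link in the table. Here is a minimal sketch; `transfer_ms` is a hypothetical helper, and it counts bandwidth only, ignoring link latency, protocol overhead, and contention.

```python
# Rough link bandwidths from the table above, in bytes per second.
LINK_BYTES_PER_S = {
    "hbm3": 3.35e12,
    "nvlink": 900e9,
    "pcie_gen5": 64e9,
    "nic": 50e9,
}


def transfer_ms(num_bytes: float, link: str) -> float:
    """Bandwidth-only transfer time in milliseconds for one payload."""
    return num_bytes / LINK_BYTES_PER_S[link] * 1e3


if __name__ == "__main__":
    kv_slab = 128e6  # the 128 MB KV slab from the PCIe note above
    for link in LINK_BYTES_PER_S:
        print(f"{link:>10}: {transfer_ms(kv_slab, link):.3f} ms")
```

Real transfers will be slower than this lower bound, especially for small payloads where per-transfer latency dominates.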
Why decode is memory-bound, made concrete
Here’s the back-of-envelope for Llama-3-8B at fp16:
- Weights: 16 GB.
- HBM bandwidth: 3.35 TB/s.
- Time to read all weights once: 16 GB ÷ 3.35 TB/s ≈ 4.8 ms.
- Number of matrix-multiply FLOPs per decode step: ~16 GFLOP (1 token through 8B weights, ×2 ops per param).
- H100 Tensor Core (TC) throughput (BF16): ~990 TFLOPS.
- Time to do the math: 16 GFLOP ÷ 990 TFLOPS ≈ 0.016 ms.
The math takes 16 microseconds. The weights take 4.8 milliseconds to read. The GPU is idle for ~99.7% of decode time, waiting on HBM.
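The same back-of-envelope can be written as a few lines of Python. This is a sketch under the assumptions listed above: a rounded 8B parameter count, peak (not achieved) bandwidth and throughput, and weights as the only memory traffic. The constant and function names are illustrative.

```python
PARAMS = 8e9                  # Llama-3-8B parameter count, rounded
BYTES_PER_PARAM = 2           # fp16
HBM_BYTES_PER_S = 3.35e12     # H100 SXM5 peak HBM bandwidth
TENSOR_FLOPS_PER_S = 990e12   # H100 dense BF16 Tensor Core peak


def weight_read_s() -> float:
    """Seconds to stream every weight from HBM once."""
    return PARAMS * BYTES_PER_PARAM / HBM_BYTES_PER_S


def matmul_s(batch_size: int = 1) -> float:
    """Seconds of Tensor Core time: 2 FLOPs per parameter per token."""
    return 2 * PARAMS * batch_size / TENSOR_FLOPS_PER_S


def idle_fraction() -> float:
    """Share of a single-request decode step spent waiting on HBM."""
    return 1 - matmul_s() / weight_read_s()


if __name__ == "__main__":
    print(f"weight read: {weight_read_s() * 1e3:.2f} ms")
    print(f"matmul:      {matmul_s() * 1e6:.1f} us")
    print(f"idle:        {idle_fraction():.1%}")
```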
So decode is read-heavy work: slow when you only have one user, and far more efficient when you have many, because a single pass over the weights can serve every request in the batch. The scheduler’s job is to fold “many users” into single GPU steps cleanly. That’s continuous batching, and it’s where serving stops being about the model and starts being about systems.
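To see why many users help, extend the estimate with a batch dimension. The sketch below assumes one weight read per step regardless of batch size and takes the step time as the larger of the read time and the math time. It leaves out KV cache reads, which grow with batch size and context length, so the real crossover will differ. All names here are illustrative.

```python
PARAMS = 8e9                  # Llama-3-8B parameter count, rounded
BYTES_PER_PARAM = 2           # fp16
HBM_BYTES_PER_S = 3.35e12     # H100 SXM5 peak HBM bandwidth
TENSOR_FLOPS_PER_S = 990e12   # H100 dense BF16 Tensor Core peak


def decode_step_s(batch_size: int) -> float:
    """Step time as the slower of the weight read and the matmul work."""
    weight_read = PARAMS * BYTES_PER_PARAM / HBM_BYTES_PER_S
    matmul = 2 * PARAMS * batch_size / TENSOR_FLOPS_PER_S
    return max(weight_read, matmul)


def tokens_per_s(batch_size: int) -> float:
    """Aggregate decode throughput across the whole batch."""
    return batch_size / decode_step_s(batch_size)


def break_even_batch() -> float:
    """Batch size at which the math takes as long as the weight read."""
    return TENSOR_FLOPS_PER_S * BYTES_PER_PARAM / (2 * HBM_BYTES_PER_S)


if __name__ == "__main__":
    for batch in (1, 8, 64, 256, 512):
        print(f"batch {batch:>3}: {tokens_per_s(batch):>9.0f} tokens/s")
    print(f"break-even batch: {break_even_batch():.0f}")
```

Under these assumptions the crossover lands near a batch of 300. Below it, adding a request to a step costs almost nothing, because the step time is still set by the weight read.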