Section 13

GPU memory hierarchy

Where data actually lives on H100s

Every optimization in vLLM ultimately bottoms out in one fact: bytes move at very different speeds depending on where they live. To reason about why prefill is fast, why decode is slow, why batching helps, why KV cache offload is a real option, and why multi-GPU serving requires NVLink — you have to know the rough numbers of the GPU memory hierarchy.

We’ll use the Nvidia H100 SXM5 as our concrete reference because it’s the workhorse of modern inference deployments. Other chips (A100, B100/B200, MI300X) have similar shapes with different magnitudes.

Figure: Where the tiers physically live, zooming out from one SM to a rack. Each step is one "zoom out" level, and every row maps to one or more entries in the bandwidth/latency table below.

The tiers

From smallest and fastest at the top, to largest and slowest at the bottom:

Tier	Location	Capacity	Bandwidth	Latency
Registers	per-thread, on-die	~256 KB/SM	~120 TB/s	1 cycle
L1 / Shared (SRAM)	per-SM scratchpad	~228 KB/SM	~33 TB/s	tens of cycles
L2 cache	shared across SMs	50 MB	~12 TB/s	~250 cycles
HBM3 (VRAM)	on the GPU package	80 GB	3.35 TB/s	~400 cycles
NVLink	GPU ↔ peer GPU	—	900 GB/s	µs
PCIe Gen5	GPU ↔ host CPU/RAM	—	~64 GB/s	µs
Host DDR5 RAM	on the CPU	512 GB–2 TB	~400 GB/s	~100 ns
NVMe SSD	local disk	TBs	~14 GB/s	~50 µs
NIC (RDMA / InfiniBand)	node ↔ node	—	~50 GB/s	1–10 µs

The spread from registers to NIC is about 2,000× in bandwidth and even more in latency. Every order of magnitude matters.
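To make the spread concrete, here is a minimal Python sketch that turns the bandwidth column into "how many times slower" ratios. The dictionary and the helper name `slowdown_vs` are illustrative, and the figures are the ballpark numbers from the table, not measurements.

```python
# Bandwidth figures copied from the table above, in GB/s (ballpark, not exact).
BANDWIDTH_GBPS = {
    "Registers": 120_000,
    "L1 / Shared (SRAM)": 33_000,
    "L2 cache": 12_000,
    "HBM3": 3_350,
    "NVLink": 900,
    "Host DDR5 RAM": 400,
    "PCIe Gen5": 64,
    "NIC (RDMA)": 50,
    "NVMe SSD": 14,
}


def slowdown_vs(reference: str, tier: str) -> float:
    """How many times slower `tier` is than `reference`, by bandwidth alone."""
    return BANDWIDTH_GBPS[reference] / BANDWIDTH_GBPS[tier]


if __name__ == "__main__":
    # Print every tier relative to registers, fastest first.
    for tier in BANDWIDTH_GBPS:
        ratio = slowdown_vs("Registers", tier)
        print(f"{tier:<20} {ratio:>8.0f}x slower than registers")
```

The ratio only compares peak bandwidth. Latency differences, which the table also lists, come on top of that.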

What lives where

In a running vLLM worker, here’s where each kind of byte typically sits:

Model weights: HBM. At fp16, 16 GB for Llama-3-8B and ~140 GB for the 70B model (which has to be split across multiple GPUs).
KV cache: HBM. Whatever is left after weights.
Activations (intermediate per-step tensors): HBM, briefly, between kernels. SRAM/registers while a kernel is running.
The kernel’s actual operands: SRAM and registers. The matrix multiplies happen here.
Idle KV pages from paused requests (if offload is enabled): host DDR5 RAM, reached over PCIe.
Other GPUs’ shards of the same model: in their HBM, reached over NVLink.
Other nodes’ shards: in their HBM, reached over the NIC via RDMA + GPUDirect.
Cold model files: on NVMe disk until first load.
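"Whatever is left after weights" can be estimated with a few lines of arithmetic. The sketch below assumes the published Llama-3-8B shape (32 layers, 8 KV heads, head dimension 128) and an fp16 KV cache. The `usable_fraction` parameter is a hypothetical stand-in for whatever headroom the serving engine reserves, and activation workspace is ignored, so treat the result as an upper bound.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_token_budget(hbm_bytes: float, weight_bytes: float, per_token: int,
                    usable_fraction: float = 0.9) -> int:
    """Tokens of KV cache that fit in the HBM left over after the weights.

    usable_fraction is an assumption (headroom kept free for everything else);
    activations and fragmentation are not modeled.
    """
    free_bytes = hbm_bytes * usable_fraction - weight_bytes
    return max(0, int(free_bytes // per_token))


# Assumed Llama-3-8B shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
budget = kv_token_budget(hbm_bytes=80e9, weight_bytes=16e9, per_token=per_token)

if __name__ == "__main__":
    print(f"KV bytes per token: {per_token}")
    print(f"Approximate KV token budget on one 80 GB GPU: {budget}")
```

Under these assumptions each token costs 128 KiB of KV cache, so a single 80 GB GPU holds a few hundred thousand tokens in total, shared across all in-flight requests.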

Where the bytes live (Nvidia H100)

Interactive figure: an animated packet traces the hops bytes take up and down the hierarchy in five common operations. Bandwidth/latency numbers are H100 SXM5 ballpark, not exact.

The same full weight read happens on every decode step, but the math done with those weights is only about one token's worth. This is why decode is memory-bandwidth bound and why batching helps so much: read the weights once, use them N times.


Notice the bandwidth column spans 120 TB/s at the top down to ~50 GB/s at the bottom — over 2,000× from registers to the NIC. Every order-of-magnitude jump down the tier list is where serving systems gain (or lose) most of their performance.

A few things to notice:

HBM is fast, but not “free.” The H100’s 3.35 TB/s is among the fastest DRAM you can buy, but by the table’s numbers it’s still about 10× slower than SRAM and about 35× slower than registers. The whole job of a CUDA kernel is to read HBM into SRAM once and then squeeze as much arithmetic out of those bytes as possible before they go away.
PCIe is a cliff. Going off the GPU is about 50× slower than HBM. This is why KV cache offload exists but isn’t free: pulling a 128 MB KV slab back from host RAM takes about 2 ms at ~64 GB/s, on top of whatever queueing the request did.
NVLink is why multi-GPU works. Tensor parallelism needs all-reduces in every layer. On NVLink that’s hundreds of microseconds; over PCIe, with roughly 14× less bandwidth, it would be milliseconds, leaving the GPUs mostly waiting on communication. NVLink is what makes “1 model across 8 GPUs” a realistic deployment shape.
The NIC is the bottleneck for multi-node. ~50 GB/s for the best InfiniBand or RoCE setups, less for cloud-grade interconnects. GPUDirect RDMA avoids a round-trip through host RAM by letting the NIC DMA straight to/from HBM, which is essential at scale.
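The 2 ms figure for the KV slab is just size divided by bandwidth, and the same division works for every link in the list. Here is a small sketch, with the hypothetical helper `transfer_ms` and the table's ballpark bandwidths. It deliberately ignores per-transfer latency and contention, so real transfers will be slower.

```python
# Ballpark link bandwidths from the table, in GB/s.
LINK_GBPS = {
    "HBM3": 3_350,
    "NVLink": 900,
    "PCIe Gen5": 64,
    "NIC (RDMA)": 50,
}


def transfer_ms(num_bytes: float, link: str) -> float:
    """Bandwidth-only transfer time in milliseconds (no latency, no contention)."""
    bytes_per_second = LINK_GBPS[link] * 1e9
    return num_bytes / bytes_per_second * 1e3


KV_SLAB_BYTES = 128e6  # the 128 MB KV slab from the PCIe example above

if __name__ == "__main__":
    for link in LINK_GBPS:
        print(f"{link:<12} {transfer_ms(KV_SLAB_BYTES, link):8.3f} ms")
```

The same slab that takes about 2 ms over PCIe would take well under a tenth of a millisecond to read from local HBM, which is the cliff in numbers.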

Why decode is memory-bound, made concrete

Here’s the back-of-envelope for Llama-3-8B at fp16:

Weights: 16 GB.
HBM bandwidth: 3.35 TB/s.
Time to read all weights once: $16\ \text{GB} / 3{,}350\ \text{GB/s} \approx 4.8$ ms.
Matrix-multiply work per decode step: ~16 GFLOP (1 token through 8B weights, ×2 ops per param).
H100 Tensor Core (TC) throughput (BF16, dense): ~990 TFLOPS.
Time to do the math: $16\ \text{GFLOP} / 990{,}000\ \text{GFLOP/s} \approx 0.016$ ms.

The math takes 16 microseconds. The weights take 4.8 milliseconds to read. The Tensor Cores sit idle for ~99.7% of each decode step, waiting on HBM.
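The same back-of-envelope can be written as a few lines of Python, which makes it easy to swap in another model size or chip. The constant names are illustrative, the inputs are the ballpark figures from the list above, and the idle fraction assumes the weight read and the math overlap perfectly.

```python
WEIGHT_BYTES = 16e9        # Llama-3-8B at fp16
HBM_BYTES_PER_S = 3.35e12  # H100 SXM5 HBM3 bandwidth, ballpark
FLOP_PER_TOKEN = 16e9      # 2 ops per parameter x 8B parameters
PEAK_FLOP_PER_S = 990e12   # dense BF16 Tensor Core peak, ballpark

# Time to stream every weight out of HBM once, in milliseconds.
read_ms = WEIGHT_BYTES / HBM_BYTES_PER_S * 1e3

# Time to do one token's worth of matrix math at peak throughput.
math_ms = FLOP_PER_TOKEN / PEAK_FLOP_PER_S * 1e3

# Fraction of the step the math units wait, assuming perfect overlap.
idle_fraction = 1 - math_ms / read_ms

if __name__ == "__main__":
    print(f"weight read: {read_ms:.2f} ms")
    print(f"math:        {math_ms:.4f} ms")
    print(f"idle:        {idle_fraction:.1%}")
```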

So: read-heavy work, slow when you only have one user, much faster when you have many. The scheduler’s job is to fold “many users” into single GPU steps cleanly. That’s continuous batching, and it’s where serving stops being about the model and starts being about systems.
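A rough way to see how much batching buys is to model a decode step as the slower of "read the weights once" and "do the math for every sequence in the batch". The sketch below does that with the same ballpark numbers. The function names are hypothetical, and the model ignores KV cache reads, which grow with batch size and context length, so the real crossover arrives earlier than it predicts.

```python
WEIGHT_BYTES = 16e9        # Llama-3-8B at fp16
HBM_BYTES_PER_S = 3.35e12  # H100 SXM5 HBM3 bandwidth, ballpark
FLOP_PER_TOKEN = 16e9      # 2 ops per parameter x 8B parameters
PEAK_FLOP_PER_S = 990e12   # dense BF16 Tensor Core peak, ballpark


def decode_step_ms(batch_size: int) -> float:
    """Step time as max(weight read, batched math). Ignores KV cache traffic."""
    read_ms = WEIGHT_BYTES / HBM_BYTES_PER_S * 1e3
    math_ms = batch_size * FLOP_PER_TOKEN / PEAK_FLOP_PER_S * 1e3
    return max(read_ms, math_ms)


def tokens_per_second(batch_size: int) -> float:
    """Aggregate tokens generated per second across the whole batch."""
    return batch_size / (decode_step_ms(batch_size) / 1e3)


# Batch size at which the math takes as long as the weight read.
crossover_batch = (WEIGHT_BYTES / HBM_BYTES_PER_S) / (FLOP_PER_TOKEN / PEAK_FLOP_PER_S)

if __name__ == "__main__":
    for b in (1, 8, 64, 256, 512):
        print(f"batch {b:>4}: {decode_step_ms(b):6.2f} ms/step, "
              f"{tokens_per_second(b):>10.0f} tok/s")
    print(f"crossover batch size: ~{crossover_batch:.0f}")
```

Below the crossover the step time stays flat, so throughput scales almost linearly with batch size. Past it, the step becomes compute-bound and extra sequences stop being free.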