Section 13

GPU memory hierarchy

Where data actually lives on H100s

Every optimization in vLLM ultimately bottoms out in one fact: bytes move at very different speeds depending on where they live. To reason about why prefill is fast, why decode is slow, why batching helps, why KV cache offload is a real option, and why multi-GPU serving requires NVLink, you have to know the rough numbers of the GPU memory hierarchy.

We’ll use the Nvidia H100 SXM5 as our concrete reference because it’s the workhorse of modern inference deployments. SXM5 stands for Server PCI eXpress Module, 5th generation: Nvidia's proprietary mezzanine board form factor for datacenter GPUs. (Despite the name, the module is not a PCIe add-in card; it still reaches the host CPU over PCIe, but GPU-to-GPU traffic goes over NVLink.) An H100 SXM5 module plugs directly into the motherboard via the SXM socket, which gives it more power (700 W vs ~350 W for PCIe), more NVLink bandwidth (900 GB/s per GPU), and higher HBM bandwidth than the PCIe variant of the same chip. It is standard in HGX/DGX servers and is what you get in most cloud H100 instances. Other chips (A100, B100/B200, MI300X) have similar shapes with different magnitudes.

Where the tiers physically live — zooming out from one SM to a rack
Each step is one "zoom out" level. Every row maps to one or more entries in the bandwidth/latency table below.
  • Single SM: a streaming multiprocessor with 4 partitions. Each partition has a warp scheduler and dispatch unit, a 64 KB register file, FP32, INT32, and FP64 units, and a 4th-gen Tensor Core (FP16 / BF16 / FP8 matmul). The partitions share an L1 instruction cache, TMA, texture units, and a configurable shared SRAM / L1 data cache of up to 228 KB. Totals: ~256 KB registers, 228 KB shared SRAM / L1, 4 Tensor Cores, 128 CUDA cores.
  • Set of SMs: one GPU die with 132 SMs arranged on the silicon (H100), sharing a 50 MB L2 cache.
  • GPU package: the die (132 SMs + L2) plus on-package HBM3 stacks under the metal lid. 80 GB HBM3, 3.35 TB/s memory bandwidth.
  • Single host (HGX node): 8 GPU packages in a full NVLink mesh through NVSwitch (900 GB/s per GPU), plus CPU and host RAM with a PCIe Gen5 link to each GPU. 640 GB HBM total.
  • Single rack: multiple hosts in one cabinet, connected by a NIC fabric and a top-of-rack switch. TBs of HBM in aggregate; InfiniBand / RoCE RDMA at ~50 GB/s per node.

The tiers

From smallest and fastest at the top, to largest and slowest at the bottom:

| Tier | Location | Capacity | Bandwidth | Latency |
|---|---|---|---|---|
| Registers | per-thread, on-die | ~256 KB/SM | ~120 TB/s | 1 cycle |
| L1 / Shared (SRAM) | per-SM scratchpad | ~228 KB/SM | ~33 TB/s | tens of cycles |
| L2 cache | shared across SMs | 50 MB | ~12 TB/s | ~250 cycles |
| HBM3 (VRAM) | on the GPU package | 80 GB | 3.35 TB/s | ~400 cycles |
| NVLink | GPU ↔ peer GPU | n/a (transit) | 900 GB/s | µs |
| PCIe Gen5 | GPU ↔ host CPU/RAM | n/a (transit) | ~64 GB/s | µs |
| Host DDR5 RAM | on the CPU | 512 GB–2 TB | ~400 GB/s | ~100 ns |
| NVMe SSD | local disk | TBs | ~14 GB/s | ~50 µs |
| NIC (RDMA / InfiniBand) | node ↔ node | n/a (transit) | ~50 GB/s | 1–10 µs |

The spread from registers to NIC is about 2,000× in bandwidth and even more in latency. Every order of magnitude matters.
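To make that spread concrete, here is a minimal Python sketch that divides the ballpark bandwidths from the table into each other. The dictionary and the helper name `slowdown_vs` are illustrative only (they are not part of vLLM or any library), and the figures are the same rough H100 SXM5 numbers quoted above.

```python
# Ballpark H100 SXM5 bandwidths from the tier table, in GB/s.
# These are rough figures for reasoning, not measurements.
BANDWIDTH_GBPS = {
    "Registers": 120_000,
    "L1 / Shared (SRAM)": 33_000,
    "L2 cache": 12_000,
    "HBM3": 3_350,
    "NVLink": 900,
    "Host DDR5 RAM": 400,
    "PCIe Gen5": 64,
    "NIC (RDMA)": 50,
    "NVMe SSD": 14,
}


def slowdown_vs(reference: str) -> dict:
    """Return how many times slower each tier is than the reference tier."""
    ref = BANDWIDTH_GBPS[reference]
    return {tier: ref / bw for tier, bw in BANDWIDTH_GBPS.items()}


if __name__ == "__main__":
    for tier, factor in slowdown_vs("Registers").items():
        bw = BANDWIDTH_GBPS[tier]
        print(f"{tier:<20} {bw:>9,} GB/s  {factor:>8,.0f}x slower than registers")
```

Calling `slowdown_vs("HBM3")` instead gives the view that matters most for serving: how much each off-chip hop costs relative to HBM.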

What lives where

In a running vLLM worker, here’s where each kind of byte typically sits:

  • Model weights: HBM. At fp16, about 16 GB for Llama-3-8B and ~140 GB for 70B (split across multiple GPUs).
  • KV cache: HBM. Whatever is left after weights.
  • Activations (intermediate per-step tensors): HBM, briefly. SRAM/registers during a kernel.
  • The kernel’s actual operands: SRAM and registers. The matrix multiplies happen here.
  • Idle KV pages from paused requests (if offload is enabled): host DDR5 RAM, reached over PCIe.
  • Other GPUs’ shards of the same model: in their HBM, reached over NVLink.
  • Other nodes’ shards: in their HBM, reached over the NIC via RDMA + GPUDirect.
  • Cold model files: on NVMe disk until first load.
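The first two bullets imply a simple budget: whatever HBM the weights do not use is available for KV cache. The sketch below estimates that budget in tokens. Everything specific in it is an assumption for illustration and does not come from the text above: the Llama-3-8B shape (32 layers, 8 KV heads, head dimension 128), an fp16 KV cache, and a 90% usable fraction of HBM. Activations and runtime overhead are ignored, so treat the result as an upper bound.

```python
GB = 1e9  # decimal gigabytes, matching the figures in the text


def weight_bytes(n_params, bytes_per_param=2):
    """Bytes needed for the weights (2 bytes per parameter at fp16)."""
    return n_params * bytes_per_param


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_token_budget(hbm_bytes, weights, per_token, usable_fraction=0.9):
    """Tokens of KV cache that fit in the HBM left over after the weights."""
    free = hbm_bytes * usable_fraction - weights
    return max(0, int(free // per_token))


if __name__ == "__main__":
    # Assumed Llama-3-8B shape; these values are not taken from the article.
    weights = weight_bytes(8e9)
    per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
    budget = kv_token_budget(80 * GB, weights, per_token)
    print(f"weights:      {weights / GB:.0f} GB")
    print(f"KV per token: {per_token / 1024:.0f} KiB")
    print(f"KV budget:    ~{budget:,} tokens")
```

Under these assumptions the per-token KV footprint is 128 KiB, so a single 80 GB GPU holds on the order of a few hundred thousand cached tokens once the 16 GB of weights are loaded.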

Try the scenarios

[Interactive figure: Where the bytes live (Nvidia H100). An animated packet traces the hops bytes take up and down the hierarchy in five common operations. Bandwidth/latency numbers are H100 SXM5 ballpark, not exact.]
The exact same weight read happens on every decode step, but the math done with those weights is only about one token's worth. This is why decode is memory-bandwidth bound and why batching helps so much: read the weights once, use them N times.
Notice the bandwidth column spans 120 TB/s at the top down to ~50 GB/s at the bottom — over 2,000× from registers to the NIC. Every order-of-magnitude jump down the tier list is where serving systems gain (or lose) most of their performance.

A few things to notice:

  • HBM is fast, but not “free.” The H100’s 3.35 TB/s is among the fastest DRAM you can buy, but it’s still about 10× slower than SRAM and about 35× slower than registers. The whole job of a CUDA kernel is to read HBM into SRAM once and then squeeze as much arithmetic out of those bytes as possible before they are evicted.

  • PCIe is a cliff. Going off the GPU over PCIe Gen5 (~64 GB/s) is about 50× slower than HBM. This is why KV cache offload exists but isn’t free: pulling a 128 MB KV slab back from host RAM takes about 2 ms at that rate, on top of whatever queueing the request did.

  • NVLink is why multi-GPU works. Tensor parallelism needs an all-reduce after every layer. On NVLink that’s hundreds of microseconds; on PCIe it would be tens of milliseconds, leaving the GPU idle for most of each step. NVLink is what makes “1 model across 8 GPUs” a realistic deployment shape.

  • The NIC is the bottleneck for multi-node. Expect ~50 GB/s for the best InfiniBand or RoCE setups, less for cloud-grade interconnects. GPUDirect RDMA avoids a round-trip through host RAM by letting the NIC DMA straight to/from HBM, which is essential at scale.
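The 2 ms figure for the KV slab generalizes to any payload and any link. Here is a small sketch, using the same ballpark bandwidths, that estimates bandwidth-only transfer time. The function name `transfer_ms` and the link table are hypothetical helpers for this illustration; real transfers also pay latency, queueing, and protocol overhead, which this ignores.

```python
# Ballpark link bandwidths in GB/s, taken from the tier table.
LINK_GBPS = {
    "HBM3": 3_350,
    "NVLink": 900,
    "PCIe Gen5": 64,
    "NIC (RDMA)": 50,
}


def transfer_ms(size_bytes: float, link: str) -> float:
    """Bandwidth-only transfer time in milliseconds (no latency or queueing)."""
    return size_bytes / (LINK_GBPS[link] * 1e9) * 1e3


if __name__ == "__main__":
    kv_slab = 128e6  # the 128 MB KV slab from the PCIe bullet
    for link in LINK_GBPS:
        print(f"{link:<12} {transfer_ms(kv_slab, link):7.3f} ms")
```

The same slab that costs about 2 ms over PCIe costs roughly 0.14 ms over NVLink and about 0.04 ms from local HBM, which is the gap the bullets above describe.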

Why decode is memory-bound, made concrete

Here’s the back-of-envelope for Llama-3-8B at fp16:

  • Weights: 16 GB.
  • HBM bandwidth: 3.35 TB/s.
  • Time to read all weights once: $16\,\text{GB} / 3{,}350\,\text{GB/s} \approx 4.8$ ms.
  • Number of matrix-multiply FLOPs per decode step: ~16 GFLOP (1 token through 8B weights, ×2 ops per param).
  • H100 Tensor Core (TC) throughput (BF16): ~990 TFLOPS.
  • Time to do the math: $16\,\text{GFLOP} / 990{,}000\,\text{GFLOP/s} \approx 0.016$ ms.

At batch size 1, the math takes about 16 microseconds while the weights take 4.8 milliseconds to read. The Tensor Cores are idle for ~99.7% of decode time, waiting on HBM.
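The same arithmetic can be written as a short function so you can plug in other models or chips. This is a back-of-envelope sketch, not a profiler: it assumes every weight is read from HBM exactly once per step, that weight reads and math overlap perfectly (so the step takes as long as the read), and it ignores KV cache reads and kernel launch overhead. The function name and defaults are illustrative.

```python
def decode_step_estimate(n_params=8e9, bytes_per_param=2,
                         hbm_gbps=3_350, tensor_tflops=990):
    """Estimate one single-sequence decode step.

    Returns (read_ms, math_ms, idle_fraction), where idle_fraction is the
    share of the step the Tensor Cores spend waiting on HBM.
    """
    weight_bytes = n_params * bytes_per_param           # 16 GB at fp16
    read_ms = weight_bytes / (hbm_gbps * 1e9) * 1e3     # time to stream weights
    flops = 2 * n_params                                # one multiply-add per param
    math_ms = flops / (tensor_tflops * 1e12) * 1e3      # time to do the matmuls
    idle_fraction = 1 - math_ms / read_ms
    return read_ms, math_ms, idle_fraction


if __name__ == "__main__":
    read_ms, math_ms, idle = decode_step_estimate()
    print(f"read weights: {read_ms:.2f} ms")
    print(f"do the math:  {math_ms * 1e3:.1f} us")
    print(f"idle:         {idle:.1%}")
```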

So: read-heavy work, slow when you only have one user, much faster when you have many. The scheduler’s job is to fold “many users” into single GPU steps cleanly. That’s continuous batching, and it’s where serving stops being about the model and starts being about systems.
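A simple roofline-style model shows how far batching can take you. The sketch below assumes one decode step costs the larger of the weight-read time and the total math time for the batch, using the Llama-3-8B numbers above. It deliberately ignores KV cache reads, which grow with batch size and context length, so real crossover points arrive earlier. The names are illustrative, not vLLM APIs.

```python
# Per-step costs for Llama-3-8B at fp16 on the ballpark H100 numbers.
READ_MS = 16e9 / 3.35e12 * 1e3          # stream 16 GB of weights once per step
MATH_MS_PER_SEQ = 16e9 / 990e12 * 1e3   # ~16 GFLOP of matmul per sequence


def step_ms(batch: int) -> float:
    """Step time: memory-bound until the batch's math exceeds the read time."""
    return max(READ_MS, batch * MATH_MS_PER_SEQ)


def tokens_per_second(batch: int) -> float:
    """Each sequence in the batch emits one token per step."""
    return batch / (step_ms(batch) / 1e3)


def crossover_batch() -> float:
    """Batch size at which math time equals weight-read time."""
    return READ_MS / MATH_MS_PER_SEQ


if __name__ == "__main__":
    for batch in (1, 8, 64, 256, 512):
        print(f"batch {batch:>3}: {step_ms(batch):6.2f} ms/step, "
              f"{tokens_per_second(batch):>9,.0f} tok/s")
    print(f"compute-bound above ~{crossover_batch():.0f} sequences")
```

In this model throughput scales almost linearly with batch size until a few hundred sequences share each weight read, after which the step becomes compute-bound and adding sequences only lengthens the step.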