Section 19

Scaling out

Tensor parallelism vs pipeline parallelism

For models that fit on one GPU, the previous sections cover the whole story. But Llama-3-70B at fp16 is 140 GB, and an H100 has 80 GB of HBM. There is no choice: you need multiple GPUs to hold the weights, and you need them to communicate at every step. How you split the model across GPUs determines what the math looks like and what interconnect you need.

There are two main strategies, with combinations on top of them.

Tensor parallelism: split every matrix

Tensor parallelism (TP) shards every weight matrix across $N$ GPUs. With $N = 4$ , each GPU holds one quarter of every weight in every layer. The forward pass works like this:

Each GPU receives the same input activations (the residual stream).
Each GPU multiplies by its slice of the weight matrix, producing its slice of the output.
After certain operations (specifically: at the end of attention’s output projection and the MLP’s down-projection), the slices need to be summed across all GPUs to reconstruct the full output activations.

That summation is an all-reduce : every GPU sends its slice, every GPU receives the sum.

For a Llama-style transformer block, you do two all-reduces per layer (one after attention, one after the MLP). At 32 layers and ~100 µs per all-reduce on NVLink, that’s ~6 ms of pure interconnect time per token — comparable to the compute time itself. NVLink barely keeps up. PCIe alone (~64 GB/s) cannot keep up. This is why you need NVLink for TP.

Pipeline parallelism: split by layer

Pipeline parallelism (PP) takes a different cut: each GPU owns a contiguous block of layers. For a 32-layer model on 4 GPUs, GPU 0 has layers 0–7, GPU 1 has layers 8–15, and so on. A token’s activations flow through GPU 0, then over the interconnect to GPU 1, then to GPU 2, …, then back to GPU 0 for the next token.

The interconnect cost per token is much smaller — you ship the residual stream once per stage boundary, not all-reduce per layer. This makes PP a good fit for cross-node scaling (over RDMA, where bandwidth is lower than NVLink).

The catch is pipeline bubbles: if GPU 0 is processing token T while GPU 1 is processing T-1 and GPU 2 is processing T-2, that’s great. But when you’re filling the pipeline (after a flush or at the very start), some GPUs are idle. With careful “microbatching” you minimize this, but PP is fundamentally more complex than TP and gets worse latency per-token (each token has to traverse every pipeline stage).

Combining them: TP within a node, PP across nodes

Production multi-node setups usually do both:

Inside a node (8 GPUs sharing NVSwitch): TP=8. The all-reduces ride NVLink.
Across nodes (multiple racks): PP=N. Activations cross the InfiniBand fabric once per stage boundary via GPUDirect RDMA.

For very large models like Llama-3-405B (810 GB at fp16, requires 11 H100s of HBM minimum), TP=8 × PP=2 is typical. For DeepSeek-V3 at 671B params: TP=8 × PP=4 or wider.

Data parallelism for throughput

The third axis is data parallelism (DP): just replicate the model and serve different requests on different replicas. This adds throughput linearly until you exhaust GPUs, with zero interconnect cost between replicas (they don’t talk to each other for inference; they share nothing).

In a multi-tenant deployment, DP is the most common shape on top of TP: “8 GPUs per replica via TP, 4 replicas via DP.” A load balancer routes requests to replicas; within each replica vLLM does its KV management.

What this means for memory

Each parallelism mode changes where bytes live:

TP: each GPU holds (model weights / N) + a full KV cache for the requests it’s serving.
PP: each GPU holds (model weights / N) + a KV cache for the layers it owns across all active requests.
DP: each GPU holds a full copy of the model + its own KV cache.

The KV cache split is subtle. With TP, the KV cache is also sharded — each GPU stores the K/V values for its share of the heads. With PP, each GPU stores all heads but only for its layers.

The interconnect, revisited

Reviewing the memory hierarchy with these parallelism strategies in mind:

Where the bytes live (Nvidia H100)

Pick a scenario; an animated packet follows the hops bytes actually take. Bandwidth/latency numbers are H100 SXM5 ballpark — not exact.

The exact same weight read happens for every decode step — but the math done with those weights is only ~1 token worth. This is why decode is memory-bandwidth bound and why batching helps so much: read the weights once, use them N times.

Registers

per-thread, on-die

~256 KB / SM

~120 TB/s

~1 cycle

L1 / Shared (SRAM)

per-SM scratchpad

~228 KB / SM

~33 TB/s

~30 cycles

L2 cache

shared across SMs

60 MB

~12 TB/s

~250 cyc

HBM3 (VRAM)

on-package DRAM

80 GB

3.35 TB/s

~400 cyc

NVLink → peer GPU

GPU ↔ GPU

— (transit)

900 GB/s

~µs

PCIe Gen5 ↔ host

GPU ↔ CPU/RAM

— (transit)

~64 GB/s

~µs

Host RAM (DDR5)

on the CPU

512 GB–2 TB

~400 GB/s

~100 ns

NVMe SSD

local disk

4–60 TB

~14 GB/s

~50 µs

NIC (RDMA / IB)

node ↔ node

— (transit)

~50 GB/s

~1–10 µs

Step 1 of 2

HBM3 (VRAM) → L2 cache

3.35 TB/s · ~5 ms per layer pass

Notice the bandwidth column spans 120 TB/s at the top down to ~50 GB/s at the bottom — over 2,000× from registers to the NIC. Every order-of-magnitude jump down the tier list is where serving systems gain (or lose) most of their performance.

Pick the “Tensor-parallel all-reduce across 8 GPUs” scenario to see what TP looks like at the byte level: HBM → NVLink → peer GPU HBM, twice per layer. Pick “Multi-node: weights / activations cross the NIC” for PP: HBM → NIC → fabric → peer node’s HBM.

The takeaway: you do not pick a parallelism strategy by elegance. You pick it by the speed of the slowest interconnect that has to participate in the per-step critical path.

All-reduce per layer needs NVLink (or it kills you).
Activations once per stage can ride RDMA fabric.
Different replicas can be on different continents.

With one or many GPUs, all the same scheduling tricks from sections 14–18 still apply — vLLM does paged attention, continuous batching, prefix caching, chunked prefill, and speculative decoding in the TP/PP/DP combos as well. The model is just bigger; the playbook is the same.

One last thing before we wrap up: a closing look at the throughput-vs-latency tradeoff and the knobs you actually turn in production.