Section 21

Recap

And further reading

If you read straight through, you’ve covered roughly the same material a senior ML systems engineer would expect from a new hire: what an LLM is, what attention does, why decode is memory-bound, why a KV cache matters, and how every clever idea in modern serving is some flavor of “manage that cache better.”

This page is a short recap and a handful of pointers if you want to go deeper.

The whole story in 15 bullets

Text becomes a list of integer token IDs via a BPE tokenizer. Each ID indexes into a fixed vocabulary (about 128k entries here; the size varies by model).
Each ID is mapped to a vector (“embedding”) of size $d_{\text{model}}$ (4096-ish) by looking up the matching row of a large embedding matrix.
Position information is injected, typically via RoPE (rotating query/key vectors inside attention).
Each transformer layer does two things: attention (cross-token mixing) and an MLP (per-token nonlinear processing). Each is wrapped in a residual connection and RMSNorm.
Attention computes scores $Q K^\top$, applies the causal mask and then a softmax, and takes the resulting weighted sum of $V$. Multi-head attention splits this across many parallel “heads.”
Stacking 32 to 128 of these blocks, plus an embedding lookup and an LM head, gives you the model.
To generate, you take the final position’s logits, apply a sampling strategy (greedy / temperature / top-p), get a token, append, repeat.
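The first pass over the whole prompt (prefill) is compute-bound and cheap per token, because every prompt token shares the same read of the weights. Every subsequent single-token pass (decode) is memory-bound: the GPU spends most of its time reading weights and the KV cache from HBM.
To avoid recomputing keys and values for past tokens at every decode step, we cache them: the KV cache. It can be enormous, growing linearly with context length and batch size; modern models use GQA (several query heads sharing one key/value head) to shrink it.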
The serving story is dominated by HBM bandwidth and HBM capacity. The rest of the memory hierarchy (SRAM, PCIe, NVLink, RDMA NIC) determines what kinds of parallelism work.
Multiple users share one GPU step via continuous batching: requests drop in and out of the batch at every decode step.
KV memory is managed via paged attention: HBM is split into fixed-size pages, each request has a page table, allocation is O(1), and internal fragmentation is bounded by the last, partially filled page of each sequence.
Pages can be shared across requests via prefix caching: requests whose prompts share a prefix reuse the same physical KV blocks. The first request computes the prefill for that prefix; later ones skip it and prefill only their own suffix.
Long prompts are processed via chunked prefill so they don’t block decoders.
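Speculative decoding lets a cheap draft model propose K tokens that the target verifies in one pass. Whenever drafts are accepted you get more than one token per target-model step, and with the standard accept/reject rule the output distribution still matches the target model’s, so there is no quality cost.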
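To put rough numbers on the KV cache and paging bullets, here is a minimal sketch of the sizing arithmetic. The configuration (32 layers, 32 query heads, 8 KV heads, head dimension 128, fp16, 16-token pages) is a hypothetical Llama-like setup chosen for illustration, not a figure taken from any particular model or serving engine, and the function names are made up for this example.

```python
# Sketch of KV-cache sizing; all configuration numbers are illustrative assumptions.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor 2: each layer stores one key and one value vector per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def pages_needed(n_tokens, page_size=16):
    """Fixed-size pages required to hold n_tokens (ceiling division)."""
    return -(-n_tokens // page_size)


def wasted_slots(n_tokens, page_size=16):
    """Unused token slots in the last page; always less than page_size."""
    return pages_needed(n_tokens, page_size) * page_size - n_tokens


# Hypothetical config: 32 layers, head_dim 128, fp16 (2 bytes per element).
mha = kv_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)  # one KV head per query head
gqa = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)   # 4 query heads share a KV head

context = 4096
print(f"MHA: {mha // 1024} KiB/token, {mha * context / 2**30:.2f} GiB at {context} tokens")
print(f"GQA: {gqa // 1024} KiB/token, {gqa * context / 2**30:.2f} GiB at {context} tokens")
print(f"1000 tokens -> {pages_needed(1000)} pages, {wasted_slots(1000)} slots unused")
```

Under these assumptions GQA cuts the per-token footprint by the ratio of query heads to KV heads (4x here), and paging wastes less than one page per sequence no matter how long the sequence grows.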

When a single GPU isn’t enough, tensor parallelism (every matrix split across GPUs, all-reduce per layer over NVLink) and pipeline parallelism (layers split across GPUs, activations forwarded once per stage) carry the load, with data parallelism stacking replicas on top.
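The tensor-parallel half of that sentence can be illustrated without any GPUs. The sketch below splits one linear layer’s weight matrix along its input dimension across two simulated devices, lets each compute a partial product, and sums the partials, which is the role the all-reduce plays. It is plain Python with made-up helper names (`matmul`, `row_parallel_linear`), not the API of any real framework.

```python
# Simulated row-parallel linear layer; "devices" are just loop iterations here.

def matmul(a, b):
    """Plain-Python product of a (m x k) and b (k x n), as nested lists."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def row_parallel_linear(x, w, n_shards):
    """Compute x @ w with w split by rows (input dimension) across n_shards."""
    k = len(w)
    assert k % n_shards == 0, "input dimension must divide evenly across shards"
    step = k // n_shards

    partials = []
    for s in range(n_shards):
        lo, hi = s * step, (s + 1) * step
        x_shard = [row[lo:hi] for row in x]  # each device sees a slice of the activations
        w_shard = w[lo:hi]                   # and holds only its rows of the weights
        partials.append(matmul(x_shard, w_shard))

    # Stand-in for the all-reduce: sum the partial outputs elementwise.
    return [
        [sum(p[i][j] for p in partials) for j in range(len(w[0]))]
        for i in range(len(x))
    ]


x = [[1, 2, 3, 4]]                    # one token, hidden size 4
w = [[1, 0], [0, 1], [2, 1], [1, 2]]  # 4 x 2 weight matrix

full = matmul(x, w)
sharded = row_parallel_linear(x, w, n_shards=2)
print(full, sharded)
```

Each simulated device touches only half of the weights, which is the point: the per-device memory traffic shrinks, at the price of one communication step to combine the partial results.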

Underneath all of this, the same ratio governs everything: arithmetic intensity, the number of FLOPs you do per byte of memory you read. Every optimization in the essay is some flavor of pushing that ratio up.
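As a rough illustration of that ratio, the sketch below counts FLOPs per byte of weights read for a single matrix multiply, ignoring activation and KV-cache traffic. The hardware figures (1000 TFLOP/s peak, 3 TB/s of HBM bandwidth) are round hypothetical numbers, not the spec of any particular GPU, and the function names are invented for this example.

```python
import math

# Roofline-style estimate; only weight traffic is counted (a simplifying assumption).

def arithmetic_intensity(batch_tokens, bytes_per_param=2):
    """FLOPs per byte of weights read for y = x @ W over batch_tokens tokens."""
    # 2 FLOPs (multiply + add) per weight per token; each weight is read once per step.
    return 2 * batch_tokens / bytes_per_param


def is_memory_bound(batch_tokens, peak_flops, mem_bandwidth, bytes_per_param=2):
    """True if the step sits below the hardware's FLOPs-per-byte ridge point."""
    ridge = peak_flops / mem_bandwidth
    return arithmetic_intensity(batch_tokens, bytes_per_param) < ridge


def min_tokens_to_saturate(peak_flops, mem_bandwidth, bytes_per_param=2):
    """Smallest token count per step at which compute, not bandwidth, is the limit."""
    ridge = peak_flops / mem_bandwidth
    return math.ceil(ridge * bytes_per_param / 2)


PEAK_FLOPS = 1e15  # hypothetical: 1000 TFLOP/s
HBM_BW = 3e12      # hypothetical: 3 TB/s

decode_bound = is_memory_bound(1, PEAK_FLOPS, HBM_BW)      # one token per step
prefill_bound = is_memory_bound(2048, PEAK_FLOPS, HBM_BW)  # 2048 prompt tokens per step
break_even = min_tokens_to_saturate(PEAK_FLOPS, HBM_BW)

print(f"decode memory-bound: {decode_bound}, prefill memory-bound: {prefill_bound}")
print(f"tokens per step needed to become compute-bound: {break_even}")
```

Under these assumptions a single-token decode step does about one FLOP per weight byte, far below the ridge point, while a few hundred tokens per step are enough to cross it. Batching more tokens against each read of the weights is the lever the serving techniques in this recap keep pulling.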

Further reading