Recap
And further reading
If you read straight through, you’ve covered roughly the same material a senior ML systems engineer would expect from a new hire: what an LLM is, what attention does, why decode is memory-bound, why a KV cache matters, and how every clever idea in modern serving is some flavor of “manage that cache better.”
This page is a short recap and a handful of pointers if you want to go deeper.
The whole story in 15 bullets
- Text becomes a list of integer token IDs via a BPE tokenizer. Each ID indexes into a 128k-entry vocabulary.
- Each ID is mapped to a vector (“embedding”) of size $d_{\text{model}}$ (4096-ish), looked up as one row of a giant embedding matrix.
- Position information is injected, typically via RoPE (rotating query/key vectors inside attention).
- Each transformer layer does two things: attention (cross-token mixing) and MLP (per-token nonlinear processing). Both are wrapped in a residual connection and RMSNorm.
- Attention computes scores $QK^\top / \sqrt{d_k}$, applies a causal mask and softmax, then takes a weighted sum of $V$. Multi-head attention splits this across many parallel “heads.”
- Stacking 32-128 of these blocks, plus an embedding lookup and an LM head, is the model.
- To generate, you take the final position’s logits, apply a sampling strategy (greedy / temperature / top-p), get a token, append, repeat.
- The first pass over the whole prompt (prefill) is compute-bound: all prompt tokens are processed in parallel, so it is fast per token. Every subsequent single-token pass (decode) is memory-bound: the GPU spends most of its time reading weights from HBM.
- To avoid redoing prefill work for past tokens during decode, we cache the keys and values: the KV cache. It can be enormous; modern models use GQA to shrink it.
- The serving story is dominated by HBM bandwidth and HBM capacity. The rest of the memory hierarchy (SRAM, PCIe, NVLink, RDMA NIC) determines what kinds of parallelism work.
- Multiple users share one GPU step via continuous batching: requests drop in and out of the batch at every decode step.
- KV memory is managed via paged attention: HBM is split into fixed-size pages, each request has a page table, allocation is O(1) and fragmentation is bounded.
- Pages can be shared across requests via prefix caching: requests whose prompts start with the same tokens reuse the same physical KV blocks. The first request computes the prefill for that prefix; later ones skip it and only pay for their own suffix.
- Long prompts are processed via chunked prefill so they don’t block decoders.
- Speculative decoding lets a cheap draft model propose K tokens that the target verifies in one pass: more tokens per target-model step on net, with the output distribution unchanged when the standard accept/reject rule is used.
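The paged-attention and prefix-caching bullets above can be made concrete with a toy allocator. This is a minimal sketch, not vLLM's implementation: `PagePool`, `build_page_table`, and `PAGE_SIZE = 16` are hypothetical names and values chosen for illustration, and pages are keyed by the full token prefix (a simplifying assumption; a real system would use something cheaper, such as hashes).

```python
from dataclasses import dataclass, field

PAGE_SIZE = 16  # tokens per page; an illustrative value, not a vLLM default


@dataclass
class PagePool:
    """Toy KV page allocator: fixed-size pages, a free list, reference counts."""
    num_pages: int
    free: list = field(default_factory=list)
    refcount: dict = field(default_factory=dict)
    prefix_index: dict = field(default_factory=dict)  # token prefix -> page id

    def __post_init__(self):
        self.free = list(range(self.num_pages))

    def allocate(self) -> int:
        page = self.free.pop()  # O(1): take any page off the free list
        self.refcount[page] = 1
        return page

    def build_page_table(self, tokens: list) -> list:
        """Return the page ids covering `tokens`, reusing cached full pages."""
        table = []
        for end in range(PAGE_SIZE, len(tokens) + 1, PAGE_SIZE):
            # A full page is only reusable if everything before it matches too,
            # so the key is the whole prefix up to the end of that page.
            key = tuple(tokens[:end])
            page = self.prefix_index.get(key)
            if page is None:
                page = self.allocate()
                self.prefix_index[key] = page
            else:
                self.refcount[page] += 1  # shared: no new memory, no new prefill
            table.append(page)
        if len(tokens) % PAGE_SIZE:
            table.append(self.allocate())  # partial last page stays private
        return table


pool = PagePool(num_pages=8)
system_prompt = list(range(32))  # two full pages of shared prefix
table_a = pool.build_page_table(system_prompt + [100, 101])
table_b = pool.build_page_table(system_prompt + [200])
```

Here both requests point at the same two physical pages for the shared 32-token prefix, and each gets one private page for its own tail, so four pages are used instead of six. Freeing and eviction are left out to keep the sketch short.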
When a single GPU isn’t enough, tensor parallelism (every matrix split across GPUs, all-reduce per layer over NVLink) and pipeline parallelism (layers split across GPUs, activations forwarded once per stage) carry the load, with data parallelism stacking replicas on top.
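As a rough illustration of why tensor parallelism needs an all-reduce, here is a sketch of one row-parallel matrix multiply in plain Python. The "devices" are just loop iterations, and `matvec`, `all_reduce_sum`, and `row_parallel_matvec` are hypothetical helper names; the sketch assumes the input dimension divides evenly across devices.

```python
def matvec(x, w):
    """Multiply a length-n vector x by an n-by-m matrix w (a list of rows)."""
    m = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(m)]


def all_reduce_sum(partials):
    """Element-wise sum of every device's partial output."""
    return [sum(values) for values in zip(*partials)]


def row_parallel_matvec(x, w, num_devices):
    """Each device holds a slice of w's rows and the matching slice of x."""
    shard = len(x) // num_devices  # assumes len(x) is divisible by num_devices
    partials = []
    for d in range(num_devices):
        lo, hi = d * shard, (d + 1) * shard
        # The local result has the full output width but is only a partial sum.
        partials.append(matvec(x[lo:hi], w[lo:hi]))
    return all_reduce_sum(partials)


x = [1, 2, 3, 4]
w = [[1, 0], [0, 1], [2, 2], [1, 3]]
full = matvec(x, w)
sharded = row_parallel_matvec(x, w, num_devices=2)
```

Each device stores and reads only its share of the weights, but no device has the complete answer until the partial sums are combined. That combining step is the per-layer communication that makes a fast interconnect matter.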
Underneath all of this, one ratio governs everything: arithmetic intensity, the number of FLOPs you do per byte of memory you read. Every optimization in the essay is some flavor of pushing that ratio up.
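A back-of-the-envelope version of that ratio for a single weight matrix is sketched below. It makes simplifying assumptions: only weight reads are counted (activations and KV cache traffic are ignored), weights are 2 bytes each, and the hardware figures are round illustrative numbers rather than the spec of any particular GPU. The function names are hypothetical.

```python
def arithmetic_intensity(batch_tokens, d_in, d_out, bytes_per_weight=2):
    """FLOPs per byte of weights read for one (batch_tokens x d_in) @ (d_in x d_out)."""
    flops = 2 * batch_tokens * d_in * d_out      # one multiply + one add per weight per token
    bytes_read = bytes_per_weight * d_in * d_out  # weights are read once per pass
    return flops / bytes_read


def is_memory_bound(intensity, peak_flops, mem_bandwidth):
    """Memory-bound when the kernel does fewer FLOPs per byte than the hardware can feed."""
    return intensity < peak_flops / mem_bandwidth


# Illustrative round numbers (assumptions, not a real GPU's datasheet).
PEAK_FLOPS = 1e15      # FLOP/s
MEM_BANDWIDTH = 3e12   # bytes/s

decode_one_user = arithmetic_intensity(batch_tokens=1, d_in=4096, d_out=4096)
prefill_2k = arithmetic_intensity(batch_tokens=2048, d_in=4096, d_out=4096)
```

Under these assumptions the intensity is simply the number of tokens in the pass: 1 FLOP per byte for single-user decode, 2048 for a 2k-token prefill, against a hardware balance point of roughly 333. Batching more decode requests into one step raises the ratio because the same weight bytes are reused across every token in the batch.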
Further reading
The vLLM ecosystem and the foundational papers are remarkably accessible. A small reading list:
The papers
- Vaswani et al., Attention Is All You Need (2017). The Transformer.
- Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention (2023). The vLLM paper.
- Dao et al., FlashAttention (2022), and Dao, FlashAttention-2 (2023). The attention kernel everyone uses.
- Yu et al., Orca: A Distributed Serving System for Transformer-Based Generative Models (2022). Continuous batching, before vLLM made it the default.
- Leviathan et al., Fast Inference from Transformers via Speculative Decoding (2023). The formal accept/reject protocol that most implementations follow.
- Chen et al., Accelerating Large Language Model Decoding with Speculative Sampling (2023). Concurrent independent discovery.
- Su et al., RoFormer: Enhanced Transformer with Rotary Position Embedding (2021). RoPE.
- Ainslie et al., GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (2023). The K/V sharing trick.
- Cai et al., Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads (2024).
- Li et al., EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (2024).
Codebases
- vllm-project/vllm — the canonical reference implementation.
- pytorch/torchchat — minimal PyTorch decode loop, easy to read.
- karpathy/llama2.c — an entire Llama inference engine in ~700 lines of C. Probably the best single piece of code for internalizing §1–§10.
- karpathy/llm.c — same idea, training-focused.
Posts to read next
- Aleksa Gordić’s vLLM internals blog — the more advanced cousin of this essay, written by someone who’s worked deeply on vLLM. The structural inspiration for this site.
- The Annotated Transformer — Harvard NLP’s classic walkthrough of the original Transformer paper, with PyTorch code.
- Lilian Weng — Large Language Model Inference Optimization — broad survey, very readable.
- Horace He — Making Deep Learning Go Brrr From First Principles — the canonical explanation of arithmetic intensity vs memory bandwidth.
Once you have all of this, the productive next step is to clone vLLM, find one of its core files (scheduler.py, worker.py, the paged-attention kernel), and read it. Everything in this essay is in there, with the details and the rough edges that make it production code.
That’s all. Thanks for reading. Now go run something.