LLM & vLLM Inference, from the ground up
A long-form, interactive explainer
This is a walkthrough of how modern Large Language Models actually run: from the moment text becomes a list of tokens, through every matrix multiplication inside a transformer, to the memory-management tricks that production serving systems like vLLM use to keep an NVIDIA GPU saturated.
No prior machine-learning knowledge is required. Every term is defined the first time it shows up. You can hover over any underlined word for a quick tooltip, or jump to the glossary at any time.
There are interactive widgets throughout: a real tokenizer you can type into, an attention heatmap you can hover over, a KV cache that fills up as you step through decoding, a paged-attention allocator you can poke at, and a data-flow visualization of where bytes actually live on an H100. They're meant to be played with, not just looked at.
Contents
Foundations
- 01 What is an LLM? — And what does "inference" mean?
- 02 Tokens — Text → numbers the model can see
- 03 Embeddings — Token IDs → vectors
- 04 Attention — Queries, keys, and values
- 05 Multi-head attention — Many attentions in parallel
- 06 Positional encoding — Telling the model where each token sits
- 07 The MLP block — Per-token nonlinear processing
- 08 A full transformer block — Putting it together
- 09 Stacking into a full model — From embeddings to logits
- 10 Sampling — Logits → the next token