Chinchilla
Compute-optimal training and the 20:1 rule
Paper: Training Compute-Optimal Large Language Models (Chinchilla) — Hoffmann et al., 2022
Chinchilla (Hoffmann et al., 2022, Training Compute-Optimal Large Language Models) is one of the most consequential pre-training papers ever written, because it showed that almost everyone — including GPT-3 — had been training their models wrong. The fix was a single, memorable rule, and it redirected how the entire field spends compute.
The question: how to split a fixed compute budget
Recall the master equation, C ≈ 6ND. For a fixed compute budget C, you can trade parameters N against tokens D: a bigger model trained on less data, or a smaller model trained on more. Kaplan’s laws said go big, putting most of the budget into N. Chinchilla asked the question more carefully, training over 400 models across sizes and token counts, and got a different answer.
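To make the trade-off concrete, here is a minimal sketch that assumes the usual approximation C ≈ 6ND. The budget and the model sizes are illustrative values, not figures from the paper, and `tokens_for_budget` is a hypothetical helper name.

```python
# Sketch: one compute budget buys many different (N, D) pairs.
# Assumes C ≈ 6 * N * D; budget and model sizes are illustrative only.

def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """Training tokens D affordable for a model with n_params at this budget."""
    return compute_flops / (6.0 * n_params)

budget = 1e23  # hypothetical budget in FLOPs
splits = {n: tokens_for_budget(budget, n) for n in (1e9, 1e10, 1e11)}

for n, d in splits.items():
    print(f"N = {n:.0e} params -> D = {d:.2e} tokens ({d / n:.1f} tokens/param)")
```

Every row spends the same FLOPs; only the balance between model size and data changes, which is exactly the choice Chinchilla set out to optimize.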
The answer: scale N and D equally
Chinchilla reached its finding three independent ways (fixing model size and varying data, IsoFLOP profiles, and a parametric fit), and all three agreed: for compute-optimal training, model size and training tokens should scale in equal proportion. Double the model size, and you should double the training tokens too, rather than pouring the extra compute into N alone. Because C ≈ 6ND, that means quadrupling your compute roughly doubles both N and D. In their fits, the optimal exponents were both close to 0.5: N_opt ∝ C^a and D_opt ∝ C^b with a ≈ b ≈ 0.5.
The practical form is the rule everyone now quotes: train on about 20 tokens per parameter.
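The rule turns into a two-line calculation. The sketch below assumes C ≈ 6ND and a fixed ratio D = 20N, so N = sqrt(C / 120). The function name `compute_optimal_split` and the example budgets are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20.0  # rule-of-thumb ratio quoted in the text


def compute_optimal_split(compute_flops: float, ratio: float = TOKENS_PER_PARAM):
    """Return (N, D) with D = ratio * N and 6 * N * D = compute_flops."""
    n_params = math.sqrt(compute_flops / (6.0 * ratio))
    n_tokens = ratio * n_params
    return n_params, n_tokens


# Illustrative budgets: the second is 4x the first.
n1, d1 = compute_optimal_split(1e22)
n4, d4 = compute_optimal_split(4e22)
print(f"1e22 FLOPs: N = {n1:.2e}, D = {d1:.2e}")
print(f"4e22 FLOPs: N = {n4:.2e}, D = {d4:.2e}")
```

With the ratio held fixed, both N and D grow as the square root of compute, so a 4x larger budget doubles each of them.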
The widget is the IsoFLOP picture made interactive. Fix a compute budget, then slide the tokens-per-parameter ratio. The loss curve has a clear minimum near 20:1 — and both extremes lose. A giant model starved of data (the Kaplan regime, far left) and a tiny model drowned in data (far right) are both worse than the balanced middle.
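The same picture can be sketched offline. The code below assumes a parametric loss of the form L(N, D) = E + A / N^alpha + B / D^beta, with constants taken from the parametric fit reported in the Chinchilla paper; this text does not quote them, so treat them as assumptions. The budget is illustrative. The exact ratio at the minimum depends on the fitted constants and on the budget, so it need not land on exactly 20; the point of the sketch is the shape of the curve.

```python
import math

# Assumed constants for L(N, D) = E + A / N**ALPHA + B / D**BETA
# (parametric fit from the Chinchilla paper; not quoted in this text).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for a model of n_params trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def isoflop_sweep(compute_flops: float, ratios):
    """Loss at each tokens-per-parameter ratio, holding 6 * N * D fixed."""
    results = []
    for r in ratios:
        n_params = math.sqrt(compute_flops / (6.0 * r))
        n_tokens = r * n_params
        results.append((r, loss(n_params, n_tokens)))
    return results


budget = 5e23  # illustrative budget in FLOPs
ratios = [10 ** (i / 20) for i in range(-20, 81)]  # 0.1 to 10,000 tokens/param
sweep = isoflop_sweep(budget, ratios)
best_ratio, best_loss = min(sweep, key=lambda pair: pair[1])

print(f"lowest loss {best_loss:.3f} at about {best_ratio:.0f} tokens/param")
print(f"loss at {ratios[0]:.1f} tokens/param:  {sweep[0][1]:.3f}")
print(f"loss at {ratios[-1]:.0f} tokens/param: {sweep[-1][1]:.3f}")
```

The minimum sits in the interior of the sweep and both ends are worse, which matches the widget: a data-starved giant and a data-drowned small model each lose to a balanced split.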
The proof: Chinchilla vs. Gopher
The authors didn’t just fit curves; they bet on them. Taking the exact compute budget used for DeepMind’s Gopher (280B parameters, 300B tokens), their laws predicted the optimal model should be ~4× smaller and trained on ~4× more data. So they trained Chinchilla: 70B parameters on 1.4 trillion tokens. Despite being a quarter of Gopher’s size, Chinchilla uniformly outperformed it, along with GPT-3 and the other large models of the time, including a roughly 7-point jump over Gopher on the MMLU benchmark.
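A quick sanity check on the "same budget" claim, as a sketch under the approximation C ≈ 6ND with the rounded parameter and token counts quoted above. `train_flops` is a hypothetical helper name.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, assuming C ≈ 6 * N * D."""
    return 6.0 * n_params * n_tokens


# (parameters, training tokens), rounded figures from the text
models = {
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

summary = {
    name: {"flops": train_flops(n, d), "tokens_per_param": d / n}
    for name, (n, d) in models.items()
}

for name, row in summary.items():
    print(f"{name:10s} {row['flops']:.2e} FLOPs, "
          f"{row['tokens_per_param']:.1f} tokens/param")
```

Under this approximation the two runs come out at roughly 5.0e23 and 5.9e23 FLOPs, close enough to call the same budget given the rounded inputs, while the tokens-per-parameter ratio moves from about 1 to 20.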
The consequence: GPT-3 was badly under-trained
The implication was stark. GPT-3’s 175B parameters on ~300B tokens is about 1.7 tokens per parameter — more than 10× short of compute-optimal. A Chinchilla-style model at the same compute would have been far smaller and far better. Overnight, “just make it bigger” was replaced by “balance size and data,” and the model sizes of the next generation came down even as capabilities went up.
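The arithmetic behind that claim can be reproduced as a back-of-envelope estimate. The sketch assumes C ≈ 6ND and the 20 tokens-per-parameter rule; the "optimal" figures it prints are what the rule of thumb implies, not numbers reported in the paper.

```python
import math

# GPT-3 figures as quoted in the text
gpt3_params = 175e9
gpt3_tokens = 300e9

gpt3_ratio = gpt3_tokens / gpt3_params        # tokens per parameter
gpt3_flops = 6.0 * gpt3_params * gpt3_tokens  # assumes C ≈ 6 * N * D

# Re-spend the same budget at 20 tokens per parameter (rule of thumb).
target_ratio = 20.0
opt_params = math.sqrt(gpt3_flops / (6.0 * target_ratio))
opt_tokens = target_ratio * opt_params

print(f"GPT-3: {gpt3_ratio:.2f} tokens/param, {gpt3_flops:.2e} FLOPs")
print(f"Same budget at 20:1: {opt_params / 1e9:.0f}B params, "
      f"{opt_tokens / 1e12:.2f}T tokens")
```

On these assumptions the same compute would have supported a model of roughly 51B parameters trained on roughly 1T tokens, less than a third of GPT-3's size with more than three times the data.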
That closes the scaling group. We now have the architecture, the recipe, and the laws that govern how to spend compute. From here, every chapter is a real, recent, production model — and the question shifts from “how big?” to “how do we make every FLOP and every token count?”