Section 16

Chinchilla

Compute-optimal training and the 20:1 rule

Paper: Training Compute-Optimal Large Language Models (Chinchilla) — Hoffmann et al., 2022

Chinchilla (Hoffmann et al., 2022) is one of the most consequential pre-training papers, because it showed that most large models of the time, GPT-3 included, had been trained on far too little data for their size. The fix was a single, memorable rule, and it redirected how the field spends compute.

The question: how to split a fixed compute budget

Recall the master equation, $C \approx 6ND$. For a fixed compute budget $C$, you can trade parameters $N$ against tokens $D$: a bigger model trained on less data, or a smaller model trained on more. Kaplan’s laws said go big, with most of the budget going into $N$. Chinchilla asked the question more carefully, training over 400 models across sizes and token counts, and got a different answer.
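As a rough illustration of this trade-off, the sketch below applies the $C \approx 6ND$ approximation to a few model sizes that all cost the same compute. The function names and the example budget are assumptions made for illustration, not values from the paper.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


def tokens_for_budget(compute: float, n_params: float) -> float:
    """Tokens D affordable at a fixed budget C for a model of size N."""
    return compute / (6.0 * n_params)


# Hypothetical budget, chosen only for illustration.
budget = 1e23

for n_params in (1e9, 10e9, 100e9):
    n_tokens = tokens_for_budget(budget, n_params)
    print(f"N = {n_params:.0e}  D = {n_tokens:.2e}  D/N = {n_tokens / n_params:.1f}")
```

Every row spends the same budget; only the split between $N$ and $D$ changes, which is exactly the choice the paper set out to optimize.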

The answer: scale N and D equally

Chinchilla estimated the optimum three independent ways (fixing model size and varying data, IsoFLOP profiles, and a parametric fit), and all three agreed: for compute-optimal training, model size and training tokens should scale in equal proportion. Since $C \approx 6ND$, that means each grows roughly as the square root of compute: double your compute and both $N$ and $D$ should grow by about $\sqrt{2} \approx 1.4\times$, rather than pouring it all into $N$. Doubling both takes about 4× the compute. In their fit, the optimal exponents (in $N \propto C^{a}$, $D \propto C^{b}$) were $a \approx b \approx 0.5$.
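To see what equal scaling implies numerically, here is a short derivation under the $C \approx 6ND$ approximation. The subscripted names $N_{\text{opt}}$ and $D_{\text{opt}}$ for the compute-optimal values are notation assumed here for illustration.

```latex
\begin{align*}
N_{\text{opt}} &\propto C^{a}, \qquad D_{\text{opt}} \propto C^{b}, \qquad a \approx b \approx 0.5 \\
C \approx 6\, N_{\text{opt}} D_{\text{opt}} &\;\Rightarrow\; a + b = 1 \\
C \to 2C &\;\Rightarrow\; N_{\text{opt}} \to 2^{0.5} N_{\text{opt}} \approx 1.41\, N_{\text{opt}}, \quad D_{\text{opt}} \to 2^{0.5} D_{\text{opt}} \approx 1.41\, D_{\text{opt}} \\
C \to 4C &\;\Rightarrow\; N_{\text{opt}} \to 2\, N_{\text{opt}}, \quad D_{\text{opt}} \to 2\, D_{\text{opt}}
\end{align*}
```

In other words, each factor grows as the square root of compute, so doubling both model size and data costs about four times the budget.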

The practical form is the rule everyone now quotes: train on about 20 tokens per parameter.
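Combined with $C \approx 6ND$, the 20:1 rule pins down both quantities from the budget alone. A minimal sketch, assuming the ratio is exactly 20 and using a hypothetical helper name `compute_optimal`:

```python
import math


def compute_optimal(compute: float, tokens_per_param: float = 20.0):
    """Solve C = 6 * N * D with D = r * N for N and D.

    Substituting gives C = 6 * r * N**2, so N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(compute / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Hypothetical budget of 1e24 FLOPs, for illustration only.
n_opt, d_opt = compute_optimal(1e24)
print(f"N* ~ {n_opt / 1e9:.1f}B params, D* ~ {d_opt / 1e12:.2f}T tokens")
```

The ratio is an empirical rule of thumb rather than a constant of nature, so it is left as a parameter here.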

[Interactive figure: Compute-optimal training (Chinchilla). Predicted loss against tokens per parameter (D / N, log scale, 1 to 1000) at a fixed compute budget, with a sweet spot near 20 tokens per parameter. Example readout: compute-optimal N* = 69.9B, D* = 1.37T (20 tokens/param, loss 1.930).]
The curve is the IsoFLOP slice: at fixed compute, both extremes are worse. A giant model starved of data (left, "Kaplan-style") and a tiny model drowned in data (far right) both lose to the balanced middle near ~20:1. Note how flat the curve is to the right of the minimum: that near-free region is what Llama-style over-training exploits to get a smaller, cheaper-to-serve model for almost the same loss. (Schematic IsoFLOP curve; optimum pinned at the empirical ~20:1 of Hoffmann et al.)

The widget is the IsoFLOP picture made interactive. Fix a compute budget, then slide the tokens-per-parameter ratio. The loss curve has a clear minimum near 20:1 — and both extremes lose. A giant model starved of data (the Kaplan regime, far left) and a tiny model drowned in data (far right) are both worse than the balanced middle.
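For readers without the interactive version, the same IsoFLOP slice can be sketched in a few lines. This is a schematic only: the loss form and every constant below are hypothetical, not the fitted values from Hoffmann et al., and `B` is chosen on purpose so that the minimum falls at 20 tokens per parameter. The budget is taken as $6 \times 280\text{B} \times 300\text{B}$, using the Gopher figures quoted in this section.

```python
import math

# Hypothetical constants for a schematic loss
#   L(N, D) = E + A / N**ALPHA + B / D**ALPHA.
# They are NOT the paper's fit; B is set so that the fixed-compute
# minimum lands at D / N = TARGET_RATIO.
E, A, ALPHA = 1.7, 400.0, 0.3
TARGET_RATIO = 20.0
B = A * TARGET_RATIO ** ALPHA


def schematic_loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params ** ALPHA + B / n_tokens ** ALPHA


def isoflop_slice(compute: float, ratios):
    """Loss at each tokens-per-parameter ratio r, holding C = 6 * N * D fixed."""
    points = []
    for r in ratios:
        n_params = math.sqrt(compute / (6.0 * r))
        points.append((r, schematic_loss(n_params, r * n_params)))
    return points


# Log-spaced ratios from 1 to 1000.
ratios = [10 ** (k / 20) for k in range(61)]
curve = isoflop_slice(6.0 * 280e9 * 300e9, ratios)
best_ratio, best_loss = min(curve, key=lambda p: p[1])
print(f"minimum near D/N = {best_ratio:.1f}")
```

With equal exponents on both terms, the minimum sits at $D/N = (B/A)^{1/\alpha}$, which is why setting `B` this way pins it at the target ratio. This toy curve is symmetric on a log axis, so it does not reproduce the flatter right-hand side described for the real profile.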

The proof: Chinchilla vs. Gopher

The authors didn’t just fit curves; they bet on them. Taking the exact compute budget used for DeepMind’s Gopher (280B parameters, 300B tokens), their laws predicted the optimal model should be ~4× smaller and trained on ~4× more data. So they trained Chinchilla: 70B parameters on 1.4 trillion tokens. Despite being a quarter of Gopher’s size, Chinchilla uniformly outperformed it, along with GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B), including a roughly 7-point jump over Gopher on the MMLU benchmark.
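As a sanity check on the "same budget" claim, the $C \approx 6ND$ approximation can be applied to both models using the parameter and token counts quoted above. This is a back-of-the-envelope sketch, not the paper's own FLOP accounting, so the two figures are expected to agree only roughly.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


gopher = training_flops(280e9, 300e9)
chinchilla = training_flops(70e9, 1.4e12)

print(f"Gopher:     {gopher:.2e} FLOPs, {300e9 / 280e9:.1f} tokens/param")
print(f"Chinchilla: {chinchilla:.2e} FLOPs, {1.4e12 / 70e9:.1f} tokens/param")
print(f"Chinchilla / Gopher compute: {chinchilla / gopher:.2f}")
```

Under this approximation the two budgets land within about 20% of each other, while the tokens-per-parameter ratio moves from roughly 1 to 20.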

The consequence: GPT-3 was badly under-trained

The implication was stark. GPT-3’s 175B parameters on ~300B tokens is about 1.7 tokens per parameter — more than 10× short of compute-optimal. A Chinchilla-style model at the same compute would have been far smaller and far better. Overnight, “just make it bigger” was replaced by “balance size and data,” and the model sizes of the next generation came down even as capabilities went up.
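The arithmetic behind that claim can be checked directly. The sketch below uses the approximate GPT-3 figures quoted above and re-splits the same budget at exactly 20 tokens per parameter; the resulting sizes are an illustration under $C \approx 6ND$, not numbers reported in the paper.

```python
import math

GPT3_PARAMS = 175e9
GPT3_TOKENS = 300e9  # approximate, as quoted in the text

ratio = GPT3_TOKENS / GPT3_PARAMS
compute = 6.0 * GPT3_PARAMS * GPT3_TOKENS

# Same budget re-split at 20 tokens per parameter: C = 6 * 20 * N**2.
n_opt = math.sqrt(compute / (6.0 * 20.0))
d_opt = 20.0 * n_opt

print(f"GPT-3: {ratio:.2f} tokens/param ({20.0 / ratio:.1f}x below 20:1)")
print(f"Same compute at 20:1: N ~ {n_opt / 1e9:.0f}B, D ~ {d_opt / 1e12:.2f}T")
```

Under these assumptions, the balanced model would have well under a third of GPT-3's parameters and more than three times its training tokens.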

That closes the scaling group. We now have the architecture, the recipe, and the laws that govern how to spend compute. From here, every chapter is a real, recent, production model — and the question shifts from “how big?” to “how do we make every FLOP and every token count?”