Section 16

Chinchilla

Compute-optimal training and the 20:1 rule

Paper: Training Compute-Optimal Large Language Models (Chinchilla) — Hoffmann et al., 2022

Chinchilla (Hoffmann et al., 2022, Training Compute-Optimal Large Language Models) is one of the most consequential pre-training papers of the large-model era, because it showed that most large language models of the time, GPT-3 included, had been trained on far too few tokens for their size. The fix was a single, memorable rule, and it redirected how the field spends compute.

The question: how to split a fixed compute budget

Recall the master equation, $C \approx 6ND$. For a fixed compute budget $C$, you can trade parameters $N$ against tokens $D$: a bigger model trained on less data, or a smaller model trained on more. Kaplan’s laws said go big, with most of the budget going into $N$. Chinchilla asked the question more carefully, training over 400 models across sizes and token counts, and got a different answer.
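To make the trade-off concrete, here is a minimal sketch that holds $C$ fixed and solves $C \approx 6ND$ for $D$ at a few model sizes. The function name `tokens_for_budget` and the example budget of 1e23 FLOPs are illustrative assumptions, not values from the paper.

```python
def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """Tokens D affordable for a model with n_params under C ~= 6 * N * D."""
    return compute_flops / (6.0 * n_params)


# Illustrative budget (assumed for this example only).
budget = 1e23

# Every 10x increase in model size costs a 10x cut in training tokens.
for n_params in (1e9, 10e9, 100e9):
    d_tokens = tokens_for_budget(budget, n_params)
    print(f"N = {n_params:.0e} params -> D = {d_tokens:.2e} tokens")
```

Each row spends the same compute; the only question is which row reaches the lowest loss.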

The answer: scale N and D equally

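Chinchilla’s three independent approaches (fixing model size and varying the number of training tokens, IsoFLOP profiles, and a parametric fit of the loss) all agreed: for compute-optimal training, model size and training tokens should scale in equal proportion. Writing the optimum as $N_{\text{opt}} \propto C^{a}$ and $D_{\text{opt}} \propto C^{b}$, the fitted exponents were $a \approx b \approx 0.5$. Because $C \approx 6ND$, that means quadrupling your compute should roughly double both $N$ and $D$, and doubling your compute should grow each by about $\sqrt{2} \approx 1.4\times$, rather than pouring it all into $N$.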
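As a short worked step, here is what those exponents imply for a general budget increase, next to the exponents usually quoted for Kaplan's laws ($a \approx 0.73$, $b \approx 0.27$). The subscript "opt" and the factor $k$ are notation assumed here for illustration.

$$
C \to kC: \qquad N_{\text{opt}} \to k^{a}\, N_{\text{opt}}, \qquad D_{\text{opt}} \to k^{b}\, D_{\text{opt}}, \qquad a + b = 1 \ \ (\text{since } C \approx 6ND)
$$

$$
k = 10: \qquad
\begin{cases}
\text{Chinchilla } (a \approx b \approx 0.5): & N \times 10^{0.5} \approx 3.2, \quad D \times 10^{0.5} \approx 3.2 \\
\text{Kaplan } (a \approx 0.73,\ b \approx 0.27): & N \times 10^{0.73} \approx 5.4, \quad D \times 10^{0.27} \approx 1.9
\end{cases}
$$

For the same tenfold budget, the two recipes therefore disagree sharply on how much of it becomes parameters and how much becomes data.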

The practical form is the rule everyone now quotes: train on about 20 tokens per parameter.
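The rule turns a compute budget into a concrete model size and token count. The sketch below combines $C \approx 6ND$ with $D = 20N$, which gives $N \approx \sqrt{C / 120}$. The function name `compute_optimal_split` is hypothetical, and the ratio of 20 is the approximate rule of thumb, not an exact constant.

```python
import math

TOKENS_PER_PARAM = 20.0  # approximate Chinchilla rule of thumb


def compute_optimal_split(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Return (N, D) with C ~= 6 * N * D and D = tokens_per_param * N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Example budget of 5.75e23 FLOPs (chosen for illustration).
n_opt, d_opt = compute_optimal_split(5.75e23)
print(f"N* ~ {n_opt / 1e9:.1f}B params, D* ~ {d_opt / 1e12:.2f}T tokens")
```

For this budget the split comes out at roughly 69B parameters and 1.4T tokens, close to the configuration Chinchilla itself used.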

[Interactive widget: Compute-optimal training (Chinchilla). Fix a compute budget and split it between model size and tokens; there is a sweet spot near 20 tokens per parameter. Example readout: compute budget C = 5.75e+23 FLOPs, tokens per parameter = 20 (near compute-optimal), model size N = 69.2B, tokens D = 1.38T, predicted loss 1.930. Compute-optimal for this budget: N* = 69.9B, D* = 1.37T (20 tokens/param, loss 1.930).]

Caption: The curve is the IsoFLOP slice: at fixed compute, both extremes are worse. A giant model starved of data (left, "Kaplan-style") and a tiny model drowned in data (far right) both lose to the balanced middle near ~20:1. Note how flat the curve is to the right of the minimum; that near-free region is what Llama-style over-training exploits to get a smaller, cheaper-to-serve model for almost the same loss. (Schematic IsoFLOP curve; optimum pinned at the empirical ~20:1 of Hoffmann et al.)

The widget is the IsoFLOP picture made interactive. Fix a compute budget, then slide the tokens-per-parameter ratio. The loss curve has a clear minimum near 20:1 — and both extremes lose. A giant model starved of data (the Kaplan regime, far left) and a tiny model drowned in data (far right) are both worse than the balanced middle.
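A rough way to reproduce the shape of such a slice is to evaluate a parametric loss $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ along the constraint $C \approx 6ND$. The sketch below uses the fitted constants reported for the parametric fit in Hoffmann et al.; the function names and the list of ratios are assumptions made for illustration, and the printed losses should be read as indicative only.

```python
import math

# Constants reported for the parametric loss fit in Hoffmann et al. (2022).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def parametric_loss(n_params, n_tokens):
    """L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def isoflop_point(compute_flops, tokens_per_param):
    """Return (N, D, loss) on the slice C ~= 6 * N * D with D = ratio * N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens, parametric_loss(n_params, n_tokens)


budget = 5.75e23  # same budget as the widget readout
for ratio in (1, 5, 20, 100, 500, 2000):
    n, d, loss = isoflop_point(budget, ratio)
    print(f"{ratio:>5} tok/param  N={n / 1e9:7.1f}B  D={d / 1e12:6.2f}T  loss={loss:.3f}")
```

Both ends of the printed slice are worse than the middle, and the right-hand side is much flatter than the left. With these particular constants the lowest point falls somewhat to the right of 20:1 (the paper's parametric fit gives slightly different exponents than its other two approaches), so the sketch illustrates the shape of the curve rather than the exact location of the optimum, which the widget pins at the empirical ~20:1.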

The proof: Chinchilla vs. Gopher

The authors didn’t just fit curves; they bet on them. For the compute budget used to train DeepMind’s Gopher (280B parameters, 300B tokens), their laws predicted that the optimal model should be ~4× smaller and trained on ~4× more data. So they trained Chinchilla: 70B parameters on 1.4 trillion tokens. Despite being a quarter of Gopher’s size, Chinchilla uniformly outperformed it, along with GPT-3, Jurassic-1, and Megatron-Turing NLG, across a wide range of downstream tasks, including a gain of more than 7 points over Gopher on the MMLU benchmark.
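The "same budget, different split" claim can be sanity-checked with the $C \approx 6ND$ estimate. This is a back-of-the-envelope sketch; the helper name `train_flops` is an assumption, and the approximation ignores details such as attention FLOPs.

```python
def train_flops(n_params, n_tokens):
    """Approximate training compute, C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


gopher = train_flops(280e9, 300e9)       # 280B params, 300B tokens
chinchilla = train_flops(70e9, 1.4e12)   # 70B params, 1.4T tokens

print(f"Gopher     ~ {gopher:.2e} FLOPs, {300e9 / 280e9:.1f} tokens/param")
print(f"Chinchilla ~ {chinchilla:.2e} FLOPs, {1.4e12 / 70e9:.0f} tokens/param")
print(f"params: {280e9 / 70e9:.1f}x smaller, tokens: {1.4e12 / 300e9:.1f}x more")
```

Under this estimate the two budgets land within about 15% of each other, while the tokens-per-parameter ratio moves from roughly 1 to 20.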

Why Kaplan got it wrong

Kaplan’s experiments used a learning-rate schedule that wasn’t matched to each run’s token budget. As we saw in the optimizer chapter, the cosine schedule should be stretched to end roughly when training ends; otherwise the loss read off partway through an over-long schedule is worse than what a run with a properly matched schedule would reach at that token count. That subtle flaw distorted the measured trade-off between parameters and data, biasing the recipe toward oversized, under-trained models. A scheduling detail rewrote the field’s scaling strategy, a vivid reminder that the systems details and the science are inseparable.
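The scheduling issue is easy to see in a toy cosine schedule. The sketch below is not the configuration used by either paper: `cosine_lr`, the step counts, the absence of warmup, and the 10× decay floor are all placeholder assumptions. It only shows that when the cosine cycle is much longer than the actual run, training stops while the learning rate is still near its peak.

```python
import math


def cosine_lr(step, cycle_steps, peak_lr=1.0, min_lr=0.1):
    """Cosine decay from peak_lr to min_lr over cycle_steps, then hold at min_lr."""
    progress = min(step / cycle_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


train_steps = 10_000

# Cycle length matched to the run: the LR has fully decayed at the last step.
matched = cosine_lr(train_steps, cycle_steps=train_steps)

# Cycle 5x longer than the run: the LR is still close to its peak when training stops.
mismatched = cosine_lr(train_steps, cycle_steps=5 * train_steps)

print(f"final LR, matched cycle:    {matched:.3f}")
print(f"final LR, mismatched cycle: {mismatched:.3f}")
```

A run evaluated in the second situation has not had the low-learning-rate phase that produces its best loss, so comparing it against fully decayed runs is not a like-for-like comparison.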

The consequence: GPT-3 was badly under-trained

The implication was stark. GPT-3’s 175B parameters on ~300B tokens is about 1.7 tokens per parameter — more than 10× short of compute-optimal. A Chinchilla-style model at the same compute would have been far smaller and far better. Overnight, “just make it bigger” was replaced by “balance size and data,” and the model sizes of the next generation came down even as capabilities went up.
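The arithmetic behind those figures, using $C \approx 6ND$ and the approximate 20:1 rule (the starred symbols are notation assumed here for the compute-optimal values):

$$
\frac{D}{N} = \frac{3.0 \times 10^{11}}{1.75 \times 10^{11}} \approx 1.7,
\qquad
C \approx 6ND = 6 \times (1.75 \times 10^{11}) \times (3.0 \times 10^{11}) \approx 3.15 \times 10^{23}\ \text{FLOPs}
$$

$$
D = 20N \ \Rightarrow\ C \approx 120\,N^{2} \ \Rightarrow\ N^{*} \approx \sqrt{C / 120} \approx 5.1 \times 10^{10},
\qquad
D^{*} = 20\,N^{*} \approx 1.0 \times 10^{12}
$$

Under these approximations, the same compute would have bought a model of roughly 51B parameters trained on about 1T tokens.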

The twist that's coming: deliberate over-training

Chinchilla answers “what minimizes loss for a fixed training budget?” But training cost isn’t the only cost — you also pay to run the model, forever, at inference. A smaller model is cheaper to serve. So if you’ll deploy a model widely, it’s often worth over-training a smaller model on far more than 20 tokens per parameter: you spend extra training compute to get a model that’s cheaper for the rest of its life. Notice in the widget how flat the loss curve is to the right of the minimum — that near-free flat region is exactly what makes over-training pay. This is the choice Llama 3 makes, and it’s where the modern era begins.
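A minimal accounting sketch makes the trade concrete. It assumes training costs about $6ND$ FLOPs and inference costs about $2N$ FLOPs per token, and it compares two configurations that are simply assumed to reach similar quality. The 35B model trained on 4T tokens is hypothetical, chosen only for illustration, and all function names are assumptions.

```python
def train_flops(n_params, n_tokens):
    """Training cost under C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


def inference_flops(n_params, tokens_served):
    """Serving cost, assuming ~2 * N FLOPs per token processed."""
    return 2.0 * n_params * tokens_served


def break_even_tokens(big, small):
    """Tokens served at which the small model's lifetime cost matches the big one's.

    big and small are (n_params, n_train_tokens) pairs assumed to be of similar quality.
    """
    extra_training = train_flops(*small) - train_flops(*big)
    saving_per_token = 2.0 * (big[0] - small[0])
    return extra_training / saving_per_token


big = (70e9, 1.4e12)    # roughly 20 tokens per parameter
small = (35e9, 4.0e12)  # hypothetical over-trained model, ~114 tokens per parameter

tokens = break_even_tokens(big, small)
print(f"extra training compute: {train_flops(*small) - train_flops(*big):.2e} FLOPs")
print(f"break-even after ~{tokens:.2e} tokens served")
```

Past the break-even point, every additional token served makes the smaller model the cheaper choice overall, which is why heavily deployed models tend to be trained well beyond 20 tokens per parameter.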

That closes the scaling group. We now have the architecture, the recipe, and the laws that govern how to spend compute. From here, every chapter is a real, recent, production model — and the question shifts from “how big?” to “how do we make every FLOP and every token count?”