Scaling laws
Loss as a power law in size, data, and compute
Paper: Scaling Laws for Neural Language Models — Kaplan et al., 2020
GPT-2 showed that scaling works. Kaplan et al.’s 2020 Scaling Laws for Neural Language Models showed that scaling is predictable — so precisely that you can forecast a model’s loss before you train it. This is the paper that turned pre-training from a craft into something you could put on a spreadsheet and budget against.
Loss is a power law in scale
The central finding is startlingly clean. Train transformers of many sizes on many amounts of data, and the test loss (cross-entropy: the average negative log-probability the model assigns to the actual next token) follows a smooth power law in each of the three scale factors, model size $N$, dataset size $D$, and compute $C$, over many orders of magnitude, as long as the other two factors are not the bottleneck:
$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}$$
with $\alpha_N \approx 0.076$, $N_c \approx 8.8 \times 10^{13}$ (non-embedding parameters) and $\alpha_D \approx 0.095$, $D_c \approx 5.4 \times 10^{13}$ tokens. There’s a matching law for compute with exponent $\alpha_C \approx 0.050$. A power law plots as a straight line on log-log axes: each multiplicative step in scale multiplies the loss by a fixed factor, which is a fixed additive drop in log-loss. With $\alpha_N \approx 0.076$, for instance, 10× more parameters lowers the loss by about 16%.
These are scaling laws: empirical formulas showing that test loss falls as a smooth power law in model size, dataset size, and compute. They let you predict a large model’s performance from small experiments.
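To make "predict a large model from small experiments" concrete, here is a minimal Python sketch of the workflow: fit a straight line to (size, loss) points on log-log axes, then extrapolate. The helper names (`fit_power_law`, `predict_loss`) are made up for illustration, and the "small runs" are synthetic, noise-free points generated from the $L(N)$ form with approximately the constants Kaplan et al. report ($\alpha_N \approx 0.076$, $N_c \approx 8.8 \times 10^{13}$). They are not real measurements.

```python
import math

# Approximate constants reported by Kaplan et al. (2020) for L(N).
ALPHA_N = 0.076
N_C = 8.8e13  # non-embedding parameters


def kaplan_loss_from_params(n_params):
    """L(N) = (N_c / N) ** alpha_N, assuming data and compute are not bottlenecks."""
    return (N_C / n_params) ** ALPHA_N


def fit_power_law(sizes, losses):
    """Fit L = (x_c / x) ** alpha by ordinary least squares on log-log axes.

    In log space the law is a line: log L = alpha * log(x_c) - alpha * log(x).
    Returns (alpha, x_c).
    """
    log_x = [math.log(x) for x in sizes]
    log_l = [math.log(l) for l in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_l = sum(log_l) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxl = sum((u - mean_x) * (v - mean_l) for u, v in zip(log_x, log_l))
    slope = sxl / sxx
    intercept = mean_l - slope * mean_x
    alpha = -slope
    x_c = math.exp(intercept / alpha)
    return alpha, x_c


def predict_loss(alpha, x_c, size):
    """Evaluate the fitted power law at a new scale."""
    return (x_c / size) ** alpha


# Hypothetical small runs: synthetic losses generated from the law itself.
small_sizes = [1e6, 3e6, 1e7, 3e7, 1e8]
small_losses = [kaplan_loss_from_params(n) for n in small_sizes]

alpha_hat, n_c_hat = fit_power_law(small_sizes, small_losses)
forecast = predict_loss(alpha_hat, n_c_hat, 1e10)  # 100x beyond the largest "run"

if __name__ == "__main__":
    print(f"fitted alpha = {alpha_hat:.4f}, fitted N_c = {n_c_hat:.3e}")
    print(f"forecast loss at N = 1e10: {forecast:.3f}")
```

Because the synthetic points lie exactly on the law, the fit recovers the constants almost exactly. Real training runs are noisy, so a real fit would be approximate, and an extrapolation is only as trustworthy as the assumption that the power law keeps holding beyond the measured range.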
A few properties made the result so influential:
- Smoothness. No bumps, no plateaus across the studied range — just clean power laws.
- Universality. The shape barely depends on architectural details (depth vs. width, etc.) within reason; it’s dominated by scale.
- Sample efficiency of large models. Bigger models reach any given loss using fewer tokens. A large model “learns more per example.”
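One way to see the sample-efficiency point in formula form is the joint law Kaplan et al. propose for model size and data together, $L(N, D) = \left[ (N_c/N)^{\alpha_N/\alpha_D} + D_c/D \right]^{\alpha_D}$, solved for the number of tokens needed to hit a target loss. The sketch below is illustrative only: it reuses the paper's approximate single-variable constants, which is an assumption made for simplicity (the paper fits the joint form separately and gets somewhat different values), and `tokens_for_loss` is a hypothetical helper name.

```python
import math

# Assumption: single-variable constants reused for the joint form, for illustration.
ALPHA_N, N_C = 0.076, 8.8e13  # model size (non-embedding parameters)
ALPHA_D, D_C = 0.095, 5.4e13  # dataset size (tokens)


def joint_loss(n_params, n_tokens):
    """L(N, D) = [(N_c / N) ** (alpha_N / alpha_D) + D_c / D] ** alpha_D."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    return (model_term + D_C / n_tokens) ** ALPHA_D


def tokens_for_loss(n_params, target_loss):
    """Invert L(N, D) for D: tokens needed for a model of size N to reach target_loss.

    Returns math.inf when the model is too small to reach the target
    even with unlimited data.
    """
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    gap = target_loss ** (1.0 / ALPHA_D) - model_term
    if gap <= 0:
        return math.inf
    return D_C / gap


if __name__ == "__main__":
    target = 3.0
    for n in (1e7, 1e8, 1e9, 1e10):
        d = tokens_for_loss(n, target)
        print(f"N = {n:.0e}: tokens needed for loss {target} = {d:.3e}")
```

Under these assumed constants, the required token count falls as the model grows, and a model that is too small cannot reach the target at any dataset size.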
Kaplan’s recipe — and the catch
From these laws, Kaplan et al. derived how to optimally spend a fixed compute budget $C \approx 6ND$ (the rule of thumb that training a dense model with $N$ parameters on $D$ tokens costs about $6ND$ floating-point operations, roughly $2ND$ for the forward pass plus $4ND$ for the backward pass). Their answer: pour most of the increase into model size, with data growing only slowly. They estimated $N_{\text{opt}} \propto C^{0.73}$, i.e. as you scale compute, grow the model fast and the dataset gently, training very large models and stopping well before convergence.
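The arithmetic behind this recipe is short enough to sketch. Assuming the $C \approx 6ND$ rule and Kaplan's estimate that the optimal model size grows as $C^{0.73}$, the token count must grow as roughly $C^{0.27}$ for the budget to balance. The function names below are made up for illustration, and the GPT-3 figures (about 175B parameters and 300B training tokens) are the commonly cited approximate values, used only as a worked example.

```python
import math

N_EXPONENT = 0.73               # Kaplan et al.: N_opt grows as C ** 0.73
D_EXPONENT = 1.0 - N_EXPONENT   # implied by C ~ 6ND: D grows as about C ** 0.27


def training_flops(n_params, n_tokens):
    """Rule-of-thumb training cost of a dense model: about 6 * N * D FLOPs."""
    return 6.0 * n_params * n_tokens


def kaplan_scale_up(n_params, n_tokens, compute_multiplier):
    """Grow (N, D) for a compute budget `compute_multiplier` times larger."""
    new_params = n_params * compute_multiplier ** N_EXPONENT
    new_tokens = n_tokens * compute_multiplier ** D_EXPONENT
    return new_params, new_tokens


if __name__ == "__main__":
    # Worked example with commonly cited, approximate GPT-3 figures.
    print(f"GPT-3-scale cost: {training_flops(175e9, 300e9):.2e} FLOPs")

    # Hypothetical starting point: 1B parameters on 20B tokens, then 10x the compute.
    n0, d0 = 1e9, 2e10
    n1, d1 = kaplan_scale_up(n0, d0, 10.0)
    print(f"model grows {n1 / n0:.2f}x, data grows {d1 / d0:.2f}x")
```

Under this allocation, a 10× larger budget goes mostly to parameters: the model grows by a bit more than 5× while the dataset grows by less than 2×.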
This recipe shaped a generation of models, including GPT-3 — make them enormous, don’t worry too much about training on proportionally more tokens. It was also, in one important respect, wrong. A couple of years later, Chinchilla showed Kaplan had badly under-weighted data, largely because of a subtle flaw in how the learning rate was scheduled across the experiments. We’ll see exactly what changed in chapter 16.
First, though, the model that took Kaplan’s “go big” recipe to its logical extreme — and discovered something nobody had predicted from the loss curves alone.