Section 14

Scaling laws

Loss as a power law in size, data, and compute

Paper: Scaling Laws for Neural Language Models — Kaplan et al., 2020

GPT-2 showed that scaling works. Kaplan et al.’s 2020 Scaling Laws for Neural Language Models showed that scaling is predictable — so precisely that you can forecast a model’s loss before you train it. This is the paper that turned pre-training from a craft into something you could put on a spreadsheet and budget against.

Loss is a power law in scale

The central finding is startlingly clean. Train transformers of many sizes on many amounts of data, and the test loss follows a smooth power law in each of the three scale factors (model size $N$, dataset size $D$, and compute $C$) over many orders of magnitude, provided the other two factors are not the bottleneck:

$$L(N) = \left(\frac{N_c}{N}\right)^{0.076} \qquad L(D) = \left(\frac{D_c}{D}\right)^{0.095}$$

with $N_c \approx 8.8\times10^{13}$ (non-embedding parameters) and $D_c \approx 5.4\times10^{13}$ tokens. There’s a matching law for compute with exponent $\approx 0.050$. A power law plots as a straight line on log-log axes: each multiplicative step in scale multiplies the loss by a fixed factor, which is a fixed additive drop in log-loss. With an exponent of 0.076, 10× more parameters multiplies the loss by $10^{-0.076} \approx 0.84$.
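As a quick check on these numbers, the two laws can be evaluated directly. This is a minimal sketch using only the constants quoted above; the function names are illustrative and do not come from the paper.

```python
# Constants as quoted in the text (Kaplan et al., 2020).
N_C = 8.8e13      # critical scale for non-embedding parameters
D_C = 5.4e13      # critical scale for dataset size, in tokens
ALPHA_N = 0.076   # exponent of the model-size law
ALPHA_D = 0.095   # exponent of the dataset-size law


def loss_from_params(n_params: float) -> float:
    """L(N) = (N_c / N) ** alpha_N, assuming data is not the bottleneck."""
    return (N_C / n_params) ** ALPHA_N


def loss_from_tokens(n_tokens: float) -> float:
    """L(D) = (D_c / D) ** alpha_D, assuming model size is not the bottleneck."""
    return (D_C / n_tokens) ** ALPHA_D


def loss_ratio_per_decade(alpha: float) -> float:
    """Factor by which the loss is multiplied when scale grows by 10x."""
    return 10.0 ** (-alpha)


if __name__ == "__main__":
    for n in (1e8, 1e9, 1e10, 1e11):
        print(f"N = {n:.0e}  ->  L(N) = {loss_from_params(n):.3f}")
    print(f"10x params multiplies loss by {loss_ratio_per_decade(ALPHA_N):.3f}")
    print(f"10x tokens multiplies loss by {loss_ratio_per_decade(ALPHA_D):.3f}")
```

The per-decade ratio is the same wherever you start on the curve, which is what a straight line on log-log axes means.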

Figure: Scaling laws (Kaplan et al.). Test loss against model size on log-log axes, where the power law $L(N) = (8.8\times10^{13} / N)^{0.076}$ is a straight line. Example reading: $N = 1.0$B gives loss 2.376.

The line never bends over the studied range: each 10× in scale multiplies the loss by a fixed factor (about 0.84 for model size). The exponents are small (~0.08), so gains are real but slow, which is why frontier labs spend roughly 10× more compute for each increment. These curves let you predict a big model's loss from small ones, and they set up the next question: given a fixed compute budget, how should you split it between N and D?
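Predicting a big model's loss from small ones amounts to fitting a straight line in log-log space and extending it. The sketch below illustrates the idea with ordinary least squares. The "small model" losses are synthetic, generated from the law quoted above rather than taken from real training runs, and the function names are hypothetical.

```python
import math


def fit_power_law(sizes, losses):
    """Fit L = (N_c / N) ** alpha with a least-squares line in log-log space.

    log L = alpha * log N_c - alpha * log N, so the slope is -alpha and
    the intercept is alpha * log N_c. Returns (alpha, n_c).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    alpha = -slope
    n_c = math.exp(intercept / alpha)
    return alpha, n_c


def predict_loss(n_params, alpha, n_c):
    """Evaluate the fitted power law at a new model size."""
    return (n_c / n_params) ** alpha


# Synthetic stand-ins for measured losses of small models (not real data).
small_sizes = [1e6, 1e7, 1e8, 1e9]
small_losses = [(8.8e13 / n) ** 0.076 for n in small_sizes]

alpha, n_c = fit_power_law(small_sizes, small_losses)
predicted = predict_loss(1e11, alpha, n_c)  # 100x beyond the largest fitted size

if __name__ == "__main__":
    print(f"alpha = {alpha:.4f}, N_c = {n_c:.3e}, predicted L(1e11) = {predicted:.3f}")
```

Because the synthetic points lie exactly on the law, the fit recovers it exactly. Real measurements would be noisy, and the extrapolation is only as trustworthy as the assumption that the line stays straight beyond the fitted range.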

These are scaling laws

A few properties made the result so influential:

Smoothness. No bumps, no plateaus across the studied range — just clean power laws.
Universality. The shape barely depends on architectural details (depth vs. width, etc.) within reason; it’s dominated by scale.
Sample efficiency of large models. Bigger models reach any given loss using fewer tokens. A large model “learns more per example.”
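The sample-efficiency claim can be illustrated with the joint form $L(N, D) = \left[(N_c/N)^{\alpha_N/\alpha_D} + D_c/D\right]^{\alpha_D}$ that Kaplan et al. give in the paper. That formula is not stated in this section, and reusing the single-variable constants from above inside it is a simplification made here for illustration, so treat the outputs as qualitative rather than as the paper's fitted values.

```python
import math

# Constants quoted earlier in this section; reusing them in the joint
# formula is an assumption made for illustration only.
N_C, D_C = 8.8e13, 5.4e13
ALPHA_N, ALPHA_D = 0.076, 0.095


def joint_loss(n_params: float, n_tokens: float) -> float:
    """L(N, D) = [(N_c / N) ** (alpha_N / alpha_D) + D_c / D] ** alpha_D."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    return (model_term + D_C / n_tokens) ** ALPHA_D


def tokens_to_reach(target_loss: float, n_params: float) -> float:
    """Invert joint_loss for D: tokens needed to hit target_loss at size N.

    Returns math.inf when the model is too small to reach the target
    even with unlimited data.
    """
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    gap = target_loss ** (1.0 / ALPHA_D) - model_term
    if gap <= 0:
        return math.inf
    return D_C / gap


if __name__ == "__main__":
    for n in (1e9, 1e10, 1e11):
        print(f"N = {n:.0e}: tokens to reach loss 2.5 = {tokens_to_reach(2.5, n):.2e}")
```

Under these assumptions the token count needed for a given loss shrinks as the model grows, which is the sense in which a large model learns more per example.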

Kaplan’s recipe — and the catch

From these laws, Kaplan et al. derived how to spend a fixed compute budget ($C \approx 6ND$) optimally. Their answer: pour most of the increase into model size, with data growing only slowly. They estimated $D \propto C^{0.27}$, which together with $C \approx 6ND$ implies $N \propto C^{0.73}$. In other words, as you scale compute, grow the model fast and the dataset gently, training very large models and stopping well before convergence.
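To make the split concrete, the sketch below scales a baseline run by a compute multiplier using the exponent quoted above. The baseline of 1B parameters and 20B tokens is a hypothetical starting point chosen for illustration, not a configuration from the paper.

```python
DATA_EXPONENT = 0.27                   # D grows as C ** 0.27 (quoted above)
MODEL_EXPONENT = 1.0 - DATA_EXPONENT   # N grows as C ** 0.73, since C is about 6 * N * D


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, C = 6 * N * D."""
    return 6.0 * n_params * n_tokens


def scale_allocation(base_params: float, base_tokens: float, compute_multiplier: float):
    """Split a compute increase between model size and data, Kaplan-style."""
    new_params = base_params * compute_multiplier ** MODEL_EXPONENT
    new_tokens = base_tokens * compute_multiplier ** DATA_EXPONENT
    return new_params, new_tokens


# Hypothetical baseline run, for illustration only.
BASE_PARAMS = 1e9
BASE_TOKENS = 2e10

if __name__ == "__main__":
    for multiplier in (10, 100, 1000):
        n, d = scale_allocation(BASE_PARAMS, BASE_TOKENS, multiplier)
        print(f"{multiplier:>5}x compute: params x{n / BASE_PARAMS:.2f}, "
              f"tokens x{d / BASE_TOKENS:.2f}")
```

With 10× more compute, this recipe grows the model by about 5.4× and the dataset by under 2×. That imbalance is the part of the recipe that was later revised.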

This recipe shaped a generation of models, including GPT-3 — make them enormous, don’t worry too much about training on proportionally more tokens. It was also, in one important respect, wrong. A couple of years later, Chinchilla showed Kaplan had badly under-weighted data, largely because of a subtle flaw in how the learning rate was scheduled across the experiments. We’ll see exactly what changed in chapter 16.

First, though, the model that took Kaplan’s “go big” recipe to its logical extreme — and discovered something nobody had predicted from the loss curves alone.