Section 15

GPT-3

175B parameters and in-context learning

Paper: Language Models are Few-Shot Learners (GPT-3) — Brown et al., 2020

GPT-3 (Brown et al., 2020, Language Models are Few-Shot Learners) is the model that took the scaling hypothesis and Kaplan’s “go big” recipe to their conclusion: 175 billion parameters, more than ten times the size of any previous dense language model. But its lasting contribution wasn’t the size; it was a capability that fell out of scale, one the loss curves never advertised.

The same objective, two orders of magnitude bigger

Architecturally, GPT-3 is barely changed from GPT-2: a decoder-only transformer trained on next-token prediction. It just scaled everything. The flagship is 96 layers, hidden size 12,288, 96 attention heads, a 2,048-token context window, and ~175B parameters. The one architectural tweak is that its layers alternate dense and locally banded sparse attention, which keeps attention costs manageable. Following Kaplan, it was trained on roughly 300B tokens: a lot, but as we’ll see next chapter, too few for its size.
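As a rough sanity check on those figures, the sketch below applies the common approximation of about 12 · n_layers · d_model² weights for a transformer's attention and feed-forward blocks, then adds the embedding tables. The vocabulary size of 50,257 is an assumption carried over from GPT-2's tokenizer, and biases and layer norms are ignored, so treat the result as an estimate, not an exact count.

```python
# Back-of-the-envelope parameter count for the GPT-3 flagship configuration.
N_LAYERS = 96
D_MODEL = 12_288
N_HEADS = 96
CONTEXT = 2_048
VOCAB = 50_257          # assumption: GPT-2's BPE vocabulary size
TRAIN_TOKENS = 300e9    # "roughly 300B tokens"


def approx_params(n_layers, d_model, vocab, context):
    """Estimate weights: transformer blocks plus token and position embeddings."""
    # Per layer: 4*d^2 for attention (Q, K, V, output) + 8*d^2 for a 4x-wide MLP.
    blocks = 12 * n_layers * d_model ** 2
    embeddings = (vocab + context) * d_model
    return blocks + embeddings


total = approx_params(N_LAYERS, D_MODEL, VOCAB, CONTEXT)
head_dim = D_MODEL // N_HEADS
tokens_per_param = TRAIN_TOKENS / total

print(f"approx. parameters: {total / 1e9:.1f}B")
print(f"dimensions per head: {head_dim}")
print(f"training tokens per parameter: {tokens_per_param:.2f}")
```

The estimate lands very close to the quoted 175B, and the last line makes the data budget concrete: fewer than two training tokens per parameter.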

The data was a deliberate mixture: filtered Common Crawl as the bulk, plus higher-quality sources (a WebText-style corpus, two book collections, and Wikipedia) up-weighted in the mixture so the model saw them more often than their raw size would imply. That up-weighting of quality is a direct ancestor of every modern data-mixing strategy.
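To make the up-weighting concrete, here is a minimal sketch of how fixed sampling weights translate into passes over each corpus. The corpus names, sizes, and weights are illustrative placeholders chosen for round numbers, not the paper's actual figures; only the 300B-token budget comes from the text above.

```python
# Hypothetical corpora: sizes in billions of tokens and mixture sampling weights.
# These numbers are illustrative placeholders, NOT the figures from the paper.
corpora = {
    "web_crawl":    {"size_b": 400.0, "weight": 0.60},
    "curated_web":  {"size_b": 20.0,  "weight": 0.20},
    "books":        {"size_b": 60.0,  "weight": 0.15},
    "encyclopedia": {"size_b": 3.0,   "weight": 0.05},
}
TRAIN_TOKENS_B = 300.0  # total training budget, in billions of tokens


def effective_epochs(corpora, train_tokens_b):
    """Number of passes over each corpus when sampling by mixture weight."""
    return {
        name: c["weight"] * train_tokens_b / c["size_b"]
        for name, c in corpora.items()
    }


def upweight_factors(corpora):
    """Mixture weight divided by the corpus's share of the raw token count."""
    total_size = sum(c["size_b"] for c in corpora.values())
    return {
        name: c["weight"] / (c["size_b"] / total_size)
        for name, c in corpora.items()
    }


epochs = effective_epochs(corpora, TRAIN_TOKENS_B)
factors = upweight_factors(corpora)
for name in corpora:
    print(f"{name:13s} epochs={epochs[name]:.2f}  upweight={factors[name]:.2f}x")
```

With weights like these, the small high-quality sources are seen several times while the bulk crawl is not even read once in full, which is the trade the mixture is designed to make.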

The surprise: in-context learning

Here’s what scale unlocked. GPT-3 can perform a brand-new task at inference time, with no gradient updates, simply from a description and a few examples placed in its prompt:

Translate English to French:
sea otter => loutre de mer
cheese => fromage
plush giraffe => ???

The model just continues the pattern. This is in-context learning, and GPT-3 measured it across three regimes: zero-shot (instructions only), one-shot (one example), and few-shot (a handful of examples, typically 10 to 100 in the paper, limited by what fits in the context window). Crucially, the larger the model, the steeper its in-context learning curves: bigger models extract far more from the same few examples.
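The three regimes differ only in how many demonstrations are placed in the prompt. The sketch below assembles such prompts as plain strings; `build_prompt` and its `sep` parameter are hypothetical helpers for illustration, not part of any GPT-3 API, and no model is called.

```python
def build_prompt(instruction, examples, query, sep=" => "):
    """Assemble an in-context prompt: task description, k demonstrations,
    then the query left open for the model to complete."""
    lines = [instruction]
    lines += [f"{source}{sep}{target}" for source, target in examples]
    lines.append(f"{query}{sep.rstrip()}")  # no answer: the model continues here
    return "\n".join(lines)


instruction = "Translate English to French:"
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]

zero_shot = build_prompt(instruction, [], "plush giraffe")        # instructions only
one_shot = build_prompt(instruction, demos[:1], "plush giraffe")  # one demonstration
few_shot = build_prompt(instruction, demos, "plush giraffe")      # several demonstrations

print(few_shot)
```

Nothing about the model changes between the three calls; the "learning" happens entirely in the text the model conditions on.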

Emergence

In-context learning is the headline example of emergent abilities: capabilities largely absent in smaller models that appear, sometimes sharply, past a size threshold. The GPT-3 paper reported many, including arithmetic, word unscrambling, and using novel words in a sentence, all weak-to-absent at small scale and suddenly workable at 175B. (How “sharp” emergence really is became a subject of later debate, but the qualitative point, that new abilities show up with scale, reshaped the field’s ambitions.)

GPT-3 followed Kaplan faithfully: a giant model trained on comparatively modest data. It worked spectacularly — and yet it left a vast amount of performance on the table. Showing exactly how much, and why, is the achievement of Chinchilla.