GPT-3
175B parameters and in-context learning
Paper: Language Models are Few-Shot Learners (GPT-3) — Brown et al., 2020
GPT-3 (Brown et al., 2020, Language Models are Few-Shot Learners) is the model that took the scaling hypothesis and Kaplan’s “go big” recipe to their conclusion: 175 billion parameters, more than ten times any previous dense language model. But its lasting contribution wasn’t the size; it was a capability that fell out of scale, one the loss curves never advertised.
The same objective, two orders of magnitude bigger
Architecturally, GPT-3 is barely changed from GPT-2: a decoder-only transformer trained on next-token prediction. It just scaled everything. The flagship is 96 layers, hidden size 12,288, 96 attention heads, a 2,048-token context window, and ~175B parameters (with alternating dense and locally-banded sparse attention to keep attention costs manageable). Following Kaplan, it was trained on roughly 300B tokens: a lot, but as we’ll see next chapter, too few for its size.
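As a rough sanity check on those figures, the sketch below estimates the parameter count from the layer count and hidden size. It assumes the common approximation of about 12·d² weights per transformer block (4·d² for attention, 8·d² for a 4x-wide MLP), ignores biases and layer norms, and assumes GPT-2's 50,257-token vocabulary. The 6·N·D rule of thumb for training compute is likewise an assumption, not something stated above.

```python
# Back-of-the-envelope estimate of GPT-3's size from its published shape.
# Assumptions (not from the text above): ~12 * d_model^2 weights per block,
# a 50,257-token vocabulary, and the 6 * N * D rule of thumb for training FLOPs.

N_LAYERS = 96
D_MODEL = 12_288
CONTEXT = 2_048
VOCAB = 50_257          # assumed: same BPE vocabulary size as GPT-2
TRAIN_TOKENS = 300e9


def estimate_params(n_layers: int, d_model: int, vocab: int, context: int) -> int:
    attention = 4 * d_model * d_model         # Q, K, V and output projections
    mlp = 8 * d_model * d_model               # two matrices with a 4x hidden width
    blocks = n_layers * (attention + mlp)
    embeddings = (vocab + context) * d_model  # token + learned position tables
    return blocks + embeddings


params = estimate_params(N_LAYERS, D_MODEL, VOCAB, CONTEXT)
tokens_per_param = TRAIN_TOKENS / params
train_flops = 6 * params * TRAIN_TOKENS       # forward + backward, approximate

print(f"parameters:       {params / 1e9:.1f}B")
print(f"tokens per param: {tokens_per_param:.2f}")
print(f"training compute: {train_flops:.2e} FLOPs")
```

Under these assumptions the estimate lands very close to the quoted ~175B, and the model sees fewer than two training tokens per parameter.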
The data was a deliberate mixture: filtered Common Crawl as the bulk, plus curated sources (a WebText-style corpus, two book collections, and Wikipedia), most of them up-weighted in the mixture so the model saw them more often than their raw size would imply. That up-weighting of quality is a direct ancestor of every modern data-mixing strategy.
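To make the up-weighting concrete, the sketch below compares each source's share of the sampling mixture with its share of the raw token pool. The sizes and weights are approximate figures as reported in the GPT-3 paper and should be treated as illustrative; `upweight_factors` is a hypothetical helper written for this example.

```python
# Sketch of how up-weighting shifts a data mixture away from raw size.
# Sizes (billions of tokens) and sampling weights are approximate figures
# as reported in the GPT-3 paper; treat them as illustrative.

SOURCES = {
    # name: (size in billions of tokens, sampling weight in the mix)
    "Common Crawl (filtered)": (410, 0.60),
    "WebText2": (19, 0.22),
    "Books1": (12, 0.08),
    "Books2": (55, 0.08),
    "Wikipedia": (3, 0.03),
}


def upweight_factors(sources):
    """Ratio of each source's mixture share to its share of raw tokens."""
    total_size = sum(size for size, _ in sources.values())
    # The published weights are rounded, so normalise them before comparing.
    total_weight = sum(weight for _, weight in sources.values())
    factors = {}
    for name, (size, weight) in sources.items():
        raw_share = size / total_size
        mix_share = weight / total_weight
        factors[name] = mix_share / raw_share
    return factors


factors = upweight_factors(SOURCES)
for name, factor in factors.items():
    print(f"{name:<24} sampled {factor:5.2f}x relative to its raw size")
```

A factor above 1 means a source is seen more often than its size alone would give it. With these figures, WebText2, Books1, and Wikipedia come out several times over-represented, while Common Crawl and the larger book collection fall below 1.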
The surprise: in-context learning
Here’s what scale unlocked. GPT-3 can perform a brand-new task at inference time, with no gradient updates, simply from a description and a few examples placed in its prompt:
Translate English to French:
sea otter => loutre de mer
cheese => fromage
plush giraffe => ???
The model just continues the pattern. This is in-context learning, and GPT-3 measured it across three regimes: zero-shot (instructions only), one-shot (one example), and few-shot (a handful). Crucially, the larger the model, the steeper its in-context learning curves: bigger models extract far more from the same few examples.
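The three regimes differ only in how many worked examples precede the query. A minimal sketch of that prompt construction follows; `build_prompt` is a hypothetical helper, not part of any library, and the `source => target` layout simply mirrors the example above.

```python
# Minimal sketch of zero-, one-, and few-shot prompt construction.
# build_prompt is a hypothetical helper written for illustration only.

def build_prompt(instruction, examples, query, n_shots):
    """Return a prompt with the first n_shots examples placed before the query."""
    lines = [instruction]
    for source, target in examples[:n_shots]:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # left open for the model to complete
    return "\n".join(lines)


EXAMPLES = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
INSTRUCTION = "Translate English to French:"

zero_shot = build_prompt(INSTRUCTION, EXAMPLES, "plush giraffe", n_shots=0)
one_shot = build_prompt(INSTRUCTION, EXAMPLES, "plush giraffe", n_shots=1)
few_shot = build_prompt(INSTRUCTION, EXAMPLES, "plush giraffe", n_shots=2)

print(few_shot)
```

No weights change between the three calls: the only thing that varies is the text the model conditions on.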
Emergence
In-context learning is the headline example of emergent abilities, capabilities largely absent in smaller models that appear, sometimes sharply, past a size threshold. GPT-3 reported many: arithmetic, word unscrambling, using novel words, all weak-to-absent at small scale and suddenly workable at 175B. (How “sharp” emergence really is became a subject of later debate, but the qualitative point, that new abilities show up with scale, reshaped the field’s ambitions.)
GPT-3 followed Kaplan faithfully: a giant model trained on comparatively modest data. It worked spectacularly — and yet it left a vast amount of performance on the table. Showing exactly how much, and why, is the achievement of Chinchilla.