Section 12

GPT-2

Scale, zero-shot, and the scaling hypothesis

Paper: Language Models are Unsupervised Multitask Learners (GPT-2) — Radford et al., 2019

GPT-2 (Radford et al., 2019) changed almost nothing about the method and almost everything about the field’s ambitions. It is the same decoder-only causal model as GPT-1, trained on the same next-token objective — just bigger, on more and better data. What came out the other side was a result that reframed how people thought about pre-training: a model that could do tasks it was never trained for, with no fine-tuning at all.

Same recipe, more scale

GPT-2 scaled GPT-1 up across the board, releasing four sizes: 117M, 345M, 762M, and 1,542M parameters. The largest (commonly “GPT-2 1.5B”: 48 layers, hidden size 1,600) was more than ten times the size of GPT-1. The context length doubled from 512 to 1,024 tokens. A few architectural refinements came along, and one of them stuck for good:

Two data innovations

WebText. Rather than books, GPT-2 was trained on WebText: about 8 million documents (40 GB of text) scraped from outbound links on Reddit that had at least 3 karma. That karma threshold was a cheap human quality filter, a stand-in for “a person found this link worth sharing.” It’s a direct ancestor of the elaborate quality classifiers in today’s data pipelines, and a clean example of using a proxy signal to curate web data at scale.
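As a rough illustration of the karma filter, the sketch below keeps an outbound URL only if some post sharing it reached the threshold. The `RedditLink` record, the `select_urls` helper, and the URL-level deduplication are assumptions made for this example; they are not the pipeline used to build WebText.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedditLink:
    """Hypothetical record: one outbound link shared in a Reddit post."""
    url: str
    karma: int


MIN_KARMA = 3  # threshold stated in the text


def select_urls(links, min_karma=MIN_KARMA):
    """Return each URL once, in first-seen order, if a post sharing it
    reached min_karma. Deduplicating by URL is an assumption of this sketch."""
    seen = set()
    kept = []
    for link in links:
        if link.karma >= min_karma and link.url not in seen:
            seen.add(link.url)
            kept.append(link.url)
    return kept


links = [
    RedditLink("https://example.com/essay", 57),
    RedditLink("https://example.com/spam", 0),
    RedditLink("https://example.com/essay", 4),      # duplicate URL
    RedditLink("https://example.org/tutorial", 3),   # exactly at threshold
]

print(select_urls(links))
# ['https://example.com/essay', 'https://example.org/tutorial']
```

The point of the design is that the filter never inspects the page itself: the only signal is whether people upvoted the post that linked to it.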

Byte-level BPE. GPT-2 introduced byte-level BPE: run Byte Pair Encoding over raw bytes instead of Unicode characters. Because there are only 256 possible bytes, the base vocabulary is tiny, yet any string (any language, emoji, code, control characters) is representable, so there is never an out-of-vocabulary token. The final vocabulary was 50,257 tokens: 256 byte tokens, 50,000 learned merges, and one end-of-text token. Byte-level (or byte-fallback) tokenization is now near-universal.
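To make the idea concrete, here is a toy byte-level BPE trainer. It is a minimal sketch, not GPT-2's tokenizer: the real one also pre-splits text with a regular expression and restricts which merges are allowed, both of which are omitted here. The function names (`train_bpe`, `merge`, `decode`) are invented for this example.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids, or None."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges starting from raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))  # base tokens are byte values 0..255
    merges = {}                       # (left_id, right_id) -> new token id
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step           # new ids start after the 256 bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Expand token ids back to bytes, then to text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():  # dict keeps training order
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


text = "low lower lowest \U0001F642"
ids, merges = train_bpe(text, num_merges=5)

print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens")
print(decode(ids, merges) == text)  # True: the round trip is lossless
```

The emoji never needs a special case: it is just four bytes, so it is representable before any merge has been learned. Merges only shorten the sequence; they do not change what can be encoded.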

The result that mattered: zero-shot

The headline finding wasn’t a benchmark number; it was a capability. GPT-2 could perform tasks zero-shot: summarize, translate, answer questions, all with no task-specific training and no fine-tuning, simply by being prompted in the right way (e.g., appending “TL;DR:” to elicit a summary). The paper’s framing was that a sufficiently capable language model, in learning to predict diverse internet text, implicitly learns to perform the many tasks demonstrated within that text: “unsupervised multitask learning.”
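The “TL;DR:” trick amounts to nothing more than string construction: the task is specified entirely by the text the model is asked to continue. The sketch below shows that shape. `summarize_zero_shot` and the `generate` callable are hypothetical stand-ins for whatever text-completion function is available; no real model API is assumed.

```python
from typing import Callable


def summarize_zero_shot(article: str, generate: Callable[[str], str]) -> str:
    """Elicit a summary by asking the model to continue after 'TL;DR:'.

    `generate` is any function mapping a prompt to its continuation
    (a hypothetical placeholder here, not a specific library call).
    """
    prompt = article.rstrip() + "\nTL;DR:"
    return generate(prompt).strip()


# Stand-in "model" that records the prompt and returns canned text,
# so the example runs without any model installed.
captured = []


def echo_model(prompt: str) -> str:
    captured.append(prompt)
    return " A short summary would go here."


summary = summarize_zero_shot("GPT-2 was trained on WebText.  \n", echo_model)
print(repr(captured[0]))
print(summary)
```

No weights change and no labeled examples are supplied; swapping the suffix is the only thing that distinguishes one task from another.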

The scaling hypothesis is born

That smooth, scale-driven improvement is the seed of the scaling hypothesis: that you can keep getting more capable models mostly by scaling up size, data, and compute, without fundamentally new architectures. GPT-2 was the existence proof. It left an obvious, urgent question hanging: if scaling works, exactly how does the loss improve as we add parameters and data, and how much should we add?

That question is the entire subject of the next group of chapters. The answer — scaling laws — turned pre-training from alchemy into something you could budget on a spreadsheet.