Section 12

GPT-2

Scale, zero-shot, and the scaling hypothesis

Paper: Language Models are Unsupervised Multitask Learners (GPT-2) — Radford et al., 2019

GPT-2 (Radford et al., 2019) changed almost nothing about the method and almost everything about the field’s ambitions. It is the same decoder-only causal model as GPT-1, trained on the same next-token objective — just bigger, on more and better data. What came out the other side was a result that reframed how people thought about pre-training: a model that could do tasks it was never trained for, with no fine-tuning at all.

Same recipe, more scale

GPT-2 scaled GPT-1 up across the board, releasing four sizes: 117M, 345M, 762M, and 1,542M parameters. The largest (commonly “GPT-2 1.5B”: 48 layers, hidden size 1,600) had over ten times as many parameters as GPT-1, which was roughly the size of the smallest GPT-2. The context length doubled from 512 to 1,024 tokens. A few architectural refinements came along, and one of them stuck for good:

The pre-norm switch

GPT-2 moved LayerNorm to the input of each sub-block (with an extra LayerNorm after the final block) and scaled the initialization of the residual-layer weights by $1/\sqrt{N}$, where $N$ is the number of residual layers. This is the pre-norm arrangement from our precision chapter, and it made deep transformers substantially more stable to train. Nearly every large language model since GPT-2 has used pre-norm. A small change; a permanent one.
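A minimal NumPy sketch of the difference, under stated assumptions: the sub-blocks here are single linear maps standing in for attention and the MLP, LayerNorm has no learned gain or bias, and `base_std=0.02` and the count of two residual layers per block are illustrative choices rather than values taken from the text above.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature axis (learned gain/bias omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def init_residual_proj(rng, d_in, d_out, n_residual_layers, base_std=0.02):
    # Weights that write into the residual stream get their init scaled by 1/sqrt(N).
    # base_std is an assumed value for illustration.
    std = base_std / np.sqrt(n_residual_layers)
    return rng.normal(0.0, std, size=(d_in, d_out))

def pre_norm_block(x, sublayer):
    # Pre-norm: normalize the sub-block's input; the residual path is left untouched.
    return x + sublayer(layer_norm(x))

def post_norm_block(x, sublayer):
    # Post-norm (the GPT-1 arrangement): normalize after the residual add.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
d_model, n_blocks = 16, 4
n_residual_layers = 2 * n_blocks  # assumption: attention + MLP per block
x = rng.normal(size=(3, d_model))

# Stand-in sub-blocks: one linear map each, not real attention or MLP layers.
weights = [
    init_residual_proj(rng, d_model, d_model, n_residual_layers)
    for _ in range(n_residual_layers)
]

h = x
for w in weights:
    h = pre_norm_block(h, lambda z, w=w: z @ w)
h = layer_norm(h)  # the extra LayerNorm after the final block
```

In the pre-norm form the input `x` reaches the output through an unnormalized identity path, which is one common explanation for why gradients stay better behaved as depth grows.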

Two data innovations

WebText. Rather than books, GPT-2 was trained on WebText: about 8 million documents (40 GB of text) scraped from outbound links on Reddit that had at least 3 karma. That karma threshold was a cheap human quality filter: a stand-in for “a person found this link worth sharing.” It’s a direct ancestor of the elaborate quality classifiers in today’s data pipelines, and a clean example of using a proxy signal to curate web data at scale.
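The proxy-signal idea can be sketched in a few lines. Everything here is hypothetical: the `RedditLink` record, the example URLs, and the URL-level de-duplication are illustrative assumptions, not the actual WebText pipeline.

```python
from dataclasses import dataclass

@dataclass
class RedditLink:
    # Hypothetical record: one outbound link and the karma of the post sharing it.
    url: str
    karma: int

def filter_by_karma(links, min_karma=3):
    # Keep each URL once, and only if some post sharing it met the karma threshold.
    seen = set()
    kept = []
    for link in links:
        if link.karma >= min_karma and link.url not in seen:
            seen.add(link.url)
            kept.append(link.url)
    return kept

links = [
    RedditLink("https://example.com/a", 5),
    RedditLink("https://example.com/b", 1),   # below threshold, dropped
    RedditLink("https://example.com/a", 12),  # duplicate URL, dropped
    RedditLink("https://example.com/c", 3),
]
kept = filter_by_karma(links)
```

The filter never looks at page content. The quality judgment is borrowed entirely from the people who voted, which is what makes it cheap.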

Byte-level BPE. GPT-2 introduced byte-level BPE: run Byte Pair Encoding over raw bytes instead of Unicode characters. Because there are only 256 possible bytes, the base vocabulary is tiny, yet any string (any language, emoji, code, control characters) is representable, so there is never an out-of-vocabulary token. The final vocabulary was 50,257: 256 byte tokens, 50,000 learned merges, and one end-of-text token. Byte-level (or byte-fallback) tokenization is now near-universal.
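A toy version of the idea, as a sketch rather than GPT-2's actual tokenizer: it learns merges greedily over UTF-8 bytes from a tiny string, and it omits the pre-tokenization rules the real tokenizer uses to stop merges from crossing character categories. The function names are illustrative.

```python
from collections import Counter

def merge_pair(ids, pair, new_id):
    # Replace every non-overlapping occurrence of `pair` with `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_byte_bpe(text, num_merges):
    # Start from raw UTF-8 bytes: ids 0..255 form the base vocabulary.
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, in the order learned
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = 256 + step
        merges[best] = new_id
        ids = merge_pair(ids, best, new_id)
    return merges

def encode(text, merges):
    ids = list(text.encode("utf-8"))
    # Replay merges in learned order (dicts preserve insertion order).
    for pair, new_id in merges.items():
        ids = merge_pair(ids, pair, new_id)
    return ids

def decode(ids, merges):
    # Expand every id back to the byte string it stands for.
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

merges = train_byte_bpe("low lower lowest low low", num_merges=10)
sample = "lowest 🙂 naïve"  # emoji and accents never appeared in training
ids = encode(sample, merges)
roundtrip = decode(ids, merges)
```

The emoji and the accented character were never seen during training, yet they encode without any unknown-token fallback: they simply remain as their individual bytes.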

The result that mattered: zero-shot

The headline finding wasn’t a benchmark number; it was a capability. GPT-2 could perform tasks zero-shot: summarize, translate, answer questions, all with no task-specific training and no fine-tuning, simply by being prompted in the right way (e.g., appending “TL;DR:” to elicit a summary). The paper’s framing was that a sufficiently capable language model, in learning to predict diverse internet text, implicitly learns to perform the many tasks demonstrated within that text: “unsupervised multitask learning.”
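The mechanics of the “TL;DR:” trick are simple enough to sketch. The `next_token` callable and the stub below are hypothetical stand-ins for a real language model, included only so the example runs; the exact prompt formatting is also an assumption.

```python
def build_summary_prompt(article: str) -> str:
    # The task is specified purely in text: the article, then the cue "TL;DR:".
    return article.rstrip() + "\nTL;DR:"

def complete(prompt, next_token, max_new_tokens=100):
    # Generic autoregressive loop. `next_token` is a hypothetical callable that
    # maps the text so far to the next piece of text, or None to stop.
    pieces = []
    for _ in range(max_new_tokens):
        piece = next_token(prompt + "".join(pieces))
        if piece is None:
            break
        pieces.append(piece)
    return "".join(pieces)

def make_stub(canned_pieces):
    # Stub "model" that replays canned output; a real model would go here.
    it = iter(canned_pieces)
    return lambda text_so_far: next(it, None)

prompt = build_summary_prompt("Some long article text.\n\n")
summary = complete(prompt, make_stub([" A", " short", " summary."]))
```

Nothing in the loop is specific to summarization. Swapping the cue text changes the task, which is the sense in which the prompt, not the training procedure, selects the behavior.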

The scaling hypothesis is born

Across the four model sizes, zero-shot performance rose steadily with capacity; the paper describes the improvement as log-linear across tasks. That smooth, scale-driven improvement is the seed of the scaling hypothesis: that you can keep getting more capable models mostly by scaling up size, data, and compute, without fundamentally new architectures. GPT-2 was the existence proof. It left an obvious, urgent question hanging: if scaling works, exactly how does the loss improve as we add parameters and data, and how much should we add?

That question is the entire subject of the next group of chapters. The answer — scaling laws — turned pre-training from alchemy into something you could budget on a spreadsheet.