Section 10

GPT-1

Generative pre-training, then fine-tune

Paper: Improving Language Understanding by Generative Pre-Training (GPT-1) — Radford et al., 2018

The transformer was an architecture in search of its killer application. GPT-1 — Radford et al.’s 2018 Improving Language Understanding by Generative Pre-Training — found it, and in doing so defined the recipe that the whole field still follows: pre-train a language model on raw text, then adapt it. This is the chapter where our foundations objective becomes a general-purpose method.

The two-stage recipe

GPT-1’s framework has two stages. The first is the one we care about:

Unsupervised pre-training. Train a high-capacity language model on a large corpus with the standard next-token objective: maximize $\sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1})$, where $k$ is the size of the context window. This is exactly the cross-entropy objective from our foundations, now used as the engine of general learning.
Supervised fine-tuning. Adapt the pre-trained model to a labeled task by adding a single linear output layer and continuing training.
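
To make the stage 1 objective concrete, here is a minimal sketch in plain Python. It sums $\log P(u_i \mid u_{i-k}, \ldots, u_{i-1})$ over a token sequence and reports the mean cross-entropy. The names `next_token_log_likelihood`, `predict_logits`, and `uniform_model` are hypothetical stand-ins for illustration; a real model would replace the uniform predictor with a transformer.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def next_token_log_likelihood(tokens, predict_logits, k):
    """Sum of log P(u_i | u_{i-k}, ..., u_{i-1}) over the sequence.

    `predict_logits` is a hypothetical callable mapping a context
    (list of token ids) to one logit per vocabulary entry.
    """
    total = 0.0
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - k):i]  # at most k previous tokens
        log_probs = log_softmax(predict_logits(context))
        total += log_probs[tokens[i]]
    return total


# Stand-in "model" (an assumption for illustration): uniform over 4 tokens.
VOCAB_SIZE = 4


def uniform_model(context):
    return [0.0] * VOCAB_SIZE


tokens = [0, 2, 1, 3, 2]
log_likelihood = next_token_log_likelihood(tokens, uniform_model, k=3)

# Maximizing the log-likelihood is the same as minimizing this loss.
loss = -log_likelihood / (len(tokens) - 1)
print(loss, math.log(VOCAB_SIZE))  # a uniform predictor scores log(vocab size)
```

A predictor that knows nothing scores exactly $\log |V|$ per token, so any drop below that value reflects structure the model has picked up from the text.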

The radical claim was that stage 1 does almost all the work. Where prior approaches built bespoke architectures per task, GPT-1 used one pre-trained model and made only minimal changes to fine-tune it, often just a linear head and some delimiter tokens. This is transfer learning in its modern form: the general skills learned by predicting text transfer to a wide range of downstream tasks, and the paper reports new state-of-the-art results on 9 of the 12 tasks it studied.
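
The "linear head and some delimiter tokens" idea can be sketched as follows. This is an illustration under assumptions, not the authors' code: the token strings `<s>`, `<$>`, and `<e>` and all function names are hypothetical, and the hidden state is a made-up vector standing in for the transformer's output at the final position.

```python
# Hypothetical special tokens: start, delimiter, and extract (end) markers.
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"


def format_classification(text_tokens):
    """Single-text task: wrap the text in start and extract tokens."""
    return [START] + text_tokens + [EXTRACT]


def format_entailment(premise_tokens, hypothesis_tokens):
    """Sentence-pair task: join both texts with a delimiter token."""
    return [START] + premise_tokens + [DELIM] + hypothesis_tokens + [EXTRACT]


def linear_head(hidden, weights, bias):
    """The only new task-specific parameters: logits = W @ hidden + b."""
    return [
        sum(h * w for h, w in zip(hidden, row)) + b
        for row, b in zip(weights, bias)
    ]


sequence = format_entailment(["a", "dog", "runs"], ["an", "animal", "moves"])

# Pretend this is the model's final-layer state at the extract token.
final_hidden = [0.5, -1.0, 2.0]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # 2 classes x 3 hidden units
bias = [0.0, 0.5]
logits = linear_head(final_hidden, weights, bias)
print(sequence)
print(logits)
```

The design point is that the pre-trained network is untouched: structured inputs are flattened into one token sequence, and only the small head is task-specific.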

A decoder-only transformer

Architecturally, GPT-1 took the decoder half of the transformer, a stack of blocks with masked (causal) self-attention in which each position attends only to itself and earlier positions, and dropped the encoder entirely. This decoder-only design is the template for every GPT-family model since. The specifics:
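
As a rough illustration of causal masking, the sketch below turns a square matrix of raw attention scores into attention weights in which position i places zero weight on every later position. The function name and the toy scores are assumptions for illustration; real implementations do this with batched tensor operations.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], keeping only keys j <= i.

    scores[i][j] is the raw score of query position i on key position j.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        visible = scores[i][: i + 1]          # positions 0..i only
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        z = sum(exps)
        row = [e / z for e in exps] + [0.0] * (n - i - 1)  # future gets zero
        weights.append(row)
    return weights


# Toy input: all scores equal, so each row spreads weight evenly
# over the positions it is allowed to see.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
for row in weights:
    print(row)
```

Because no position can look ahead, every position in a training sequence yields a valid next-token prediction at once, which is what makes the language-modeling objective efficient to train.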

12 layers, hidden size 768, 12 attention heads, feed-forward inner dimension 3072.
About 117 million parameters.
GELU activations, learned position embeddings, and BPE with 40,000 merges.
Adam with a peak learning rate of 2.5e-4, linearly warmed up over 2,000 steps then cosine-decayed to zero — the warmup-then-decay schedule from our optimizer chapter, in the wild.
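
The numbers in this list can be sanity-checked with a little arithmetic. The sketch below counts parameters from the listed sizes and evaluates the warmup-then-cosine schedule. Several values are assumptions not stated above: a vocabulary of 40,478 entries (the 40,000 merges plus base symbols), a 512-token context, an output layer that reuses the token embedding matrix, and a hypothetical total of 100,000 training steps.

```python
import math

# Taken from the list above.
N_LAYERS, D_MODEL, D_FF = 12, 768, 3072
PEAK_LR, WARMUP_STEPS = 2.5e-4, 2000

# Assumptions for illustration (not stated in the list above).
VOCAB_SIZE = 40_478     # 40,000 BPE merges plus base symbols
CONTEXT_LEN = 512       # number of learned position embeddings
TOTAL_STEPS = 100_000   # hypothetical length of the training run


def count_parameters():
    """Weights and biases, assuming the output layer shares the embeddings."""
    embeddings = VOCAB_SIZE * D_MODEL + CONTEXT_LEN * D_MODEL
    attention = (D_MODEL * 3 * D_MODEL + 3 * D_MODEL) + (D_MODEL * D_MODEL + D_MODEL)
    feed_forward = (D_MODEL * D_FF + D_FF) + (D_FF * D_MODEL + D_MODEL)
    layer_norms = 2 * (2 * D_MODEL)  # two per block, scale and shift each
    return embeddings + N_LAYERS * (attention + feed_forward + layer_norms)


def learning_rate(step, total_steps=TOTAL_STEPS):
    """Linear warmup from zero to the peak, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * min(progress, 1.0)))


print(f"{count_parameters():,} parameters")  # about 116.5 million here
for step in (0, 1000, 2000, 51_000, 100_000):
    print(step, learning_rate(step))
```

Under these assumptions the count lands at roughly 116.5 million, consistent with the "about 117 million" figure, and roughly a quarter of it sits in the token embeddings.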

The data choice that mattered

GPT-1 was pre-trained on BooksCorpus, a collection of over 7,000 unpublished books. The authors were explicit about why books rather than, say, a shuffled sentence dataset: books contain long stretches of contiguous text, which lets the model learn to condition on long-range structure. A sentence-shuffled corpus of the same size would have destroyed exactly the long-range dependencies the model most needed to learn. This is the first appearance of a theme that never goes away: the structure of your data, not just its quantity, shapes what the model can learn.
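
A toy measurement makes the contiguous-versus-shuffled point concrete. In this sketch, sentences are represented only by their position in a source book, and we ask what fraction of neighbouring sentences in the training stream were also neighbours in the book. The function name, the 200-sentence "book", and the random seed are all assumptions for illustration.

```python
import random


def adjacency_preserved(stream):
    """Fraction of neighbouring items that were consecutive in the source.

    Each item in `stream` is the sentence's original position in the book.
    """
    pairs = list(zip(stream, stream[1:]))
    kept = sum(1 for a, b in pairs if b == a + 1)
    return kept / len(pairs)


# Stand-in for one book: 200 sentences, identified by their original order.
sentence_ids = list(range(200))

contiguous = sentence_ids[:]              # book-style corpus, order intact
shuffled = sentence_ids[:]
random.Random(0).shuffle(shuffled)        # sentence-shuffled corpus

contiguous_score = adjacency_preserved(contiguous)
shuffled_score = adjacency_preserved(shuffled)
print(contiguous_score, shuffled_score)
```

Both corpora contain exactly the same sentences, yet in the shuffled stream the context preceding a token is almost never the text that actually came before it, so there is nothing long-range left to condition on.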

GPT-1 proved the recipe at modest scale. Its contemporary, BERT, made a different bet about the objective — and for understanding tasks, briefly won. That contrast is the next chapter.