GPT-1
Generative pre-training, then fine-tune
Paper: Improving Language Understanding by Generative Pre-Training (GPT-1) — Radford et al., 2018
The transformer was an architecture in search of its killer application. GPT-1 — Radford et al.’s 2018 Improving Language Understanding by Generative Pre-Training — found it, and in doing so defined the recipe that the whole field still follows: pre-train a language model on raw text, then adapt it. This is the chapter where our foundations objective becomes a general-purpose method.
The two-stage recipe
GPT-1’s framework has two stages. The first is the one we care about:
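- Unsupervised pre-training. Train a high-capacity language model on a large corpus with the standard next-token objective: maximize $L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$, the log-probability of each token $u_i$ given the $k$ tokens before it under model parameters $\Theta$. This is exactly the cross-entropy objective from our foundations, now used as the engine of general learning.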
- Supervised fine-tuning. Adapt the pre-trained model to a labeled task by adding a single linear output layer and continuing training.
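As a minimal sketch of the stage 1 objective, the snippet below computes the average next-token negative log-likelihood of a toy sequence. Minimizing this quantity is equivalent to maximizing the summed log-probability of each token given its predecessors. The four-token vocabulary and the two stand-in "models" (`uniform_model`, `copy_model`) are invented for illustration; in GPT-1 the scores would come from the transformer.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def next_token_nll(tokens, logits_fn):
    """Average negative log-likelihood of each token given the tokens before it.

    logits_fn(context) is a hypothetical stand-in for the model: it maps the
    list of earlier token ids to one unnormalized score per vocabulary entry.
    """
    total = 0.0
    for i in range(1, len(tokens)):
        log_probs = log_softmax(logits_fn(tokens[:i]))
        total -= log_probs[tokens[i]]
    return total / (len(tokens) - 1)

VOCAB_SIZE = 4  # toy vocabulary, assumed for this example

def uniform_model(context):
    # Knows nothing: every next token is equally likely.
    return [0.0] * VOCAB_SIZE

def copy_model(context):
    # Strongly prefers repeating the previous token.
    scores = [0.0] * VOCAB_SIZE
    scores[context[-1]] = 5.0
    return scores

sequence = [1, 1, 1, 2, 2, 2, 2, 2]
uniform_loss = next_token_nll(sequence, uniform_model)  # equals log(4)
copy_loss = next_token_nll(sequence, copy_model)        # lower: repeats are common here
```

A model that captures real structure in the sequence gets a lower loss than the uniform baseline, which is the whole training signal in pre-training.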
The radical claim was that stage 1 does almost all the work. Where prior approaches built bespoke architectures per task, GPT-1 used one pre-trained model and made only minimal changes to fine-tune it — often just a linear head and some delimiter tokens. This is transfer learning in its modern form: the general skills learned by predicting text transfer to nearly everything.
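To make "a linear head and some delimiter tokens" concrete, here is a rough sketch of how a two-sequence task such as entailment might be packed into a single input and classified. The special-token ids, the function names, and the tiny hidden vector are all hypothetical; the transformer itself is not modeled, only the parts that fine-tuning adds around it.

```python
# Hypothetical ids for the special tokens added during fine-tuning;
# real values depend on the tokenizer's vocabulary.
START, DELIM, EXTRACT = 40000, 40001, 40002

def format_pair(first_ids, second_ids):
    """Pack two token sequences (e.g. premise and hypothesis) into one input."""
    return [START] + list(first_ids) + [DELIM] + list(second_ids) + [EXTRACT]

def linear_head(final_hidden, weights):
    """Map the hidden state at the last position to one logit per class.

    final_hidden: list of d floats (a stand-in for the transformer's output).
    weights: one list of d floats per class. Together with the special-token
    embeddings, these are the only new task-specific parameters.
    """
    return [sum(w * h for w, h in zip(row, final_hidden)) for row in weights]

tokens = format_pair([11, 12, 13], [21, 22])
logits = linear_head([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
predicted_class = max(range(len(logits)), key=logits.__getitem__)
```

Everything else, from the embeddings to the last transformer block, is reused from pre-training and only nudged by continued training.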
A decoder-only transformer
Architecturally, GPT-1 took the decoder half of the transformer — a stack of blocks with masked (causal) self-attention, so each position sees only earlier ones — and dropped the encoder entirely. This decoder-only design is the template for every GPT-family model since. The specifics:
- 12 layers, hidden size 768, 12 attention heads, feed-forward inner dimension 3072.
- About 117 million parameters.
- GELU activations, learned position embeddings, and BPE with 40,000 merges.
- Adam with a peak learning rate of 2.5e-4, linearly warmed up over 2,000 steps then cosine-decayed to zero — the warmup-then-decay schedule from our optimizer chapter, in the wild.
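The numbers in this list can be sanity-checked with a little arithmetic. The sketch below tallies the parameters implied by the listed dimensions and evaluates the warmup-then-cosine schedule. Several values are assumptions not stated in this chapter: a vocabulary of 40,478 entries (the 40,000 merges plus base symbols), a context length of 512, biases on every linear layer, two layer norms per block, an output layer tied to the token embeddings, and a hypothetical `total_steps` of 100,000.

```python
import math

def count_parameters(n_layers=12, d_model=768, d_ff=3072,
                     vocab_size=40478, n_positions=512):
    """Rough parameter tally for a GPT-1-shaped decoder (assumptions above)."""
    # Token and learned position embeddings; the output layer is assumed tied.
    embeddings = vocab_size * d_model + n_positions * d_model
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers of the feed-forward network, each with a bias.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms per block, each with a gain and a bias vector.
    layer_norms = 2 * 2 * d_model
    return embeddings + n_layers * (attention + feed_forward + layer_norms)

def learning_rate(step, total_steps, peak=2.5e-4, warmup_steps=2000):
    """Linear warmup from zero to peak, then cosine decay to zero."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

total_params = count_parameters()  # 116,534,784, i.e. about 117 million

TOTAL_STEPS = 100_000  # hypothetical training length
lr_start = learning_rate(0, TOTAL_STEPS)             # 0.0
lr_peak = learning_rate(2_000, TOTAL_STEPS)          # 2.5e-4
lr_mid = learning_rate(51_000, TOTAL_STEPS)          # about half the peak
lr_end = learning_rate(TOTAL_STEPS, TOTAL_STEPS)     # about 0.0
```

Under these assumptions the embeddings account for roughly 31 million parameters and the twelve blocks for roughly 85 million, which matches the headline figure.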
The data choice that mattered
GPT-1 was pre-trained on BooksCorpus — over 7,000 unpublished books. The authors were explicit about why books rather than, say, a shuffled sentence dataset: books contain long stretches of contiguous text, which lets the model learn to condition on long-range structure. A sentence-shuffled corpus of the same size would have destroyed exactly the long-range dependencies the model most needed to learn. This is the first appearance of a theme that never goes away: the structure of your data shapes what the model can learn, not just its quantity.
GPT-1 proved the recipe at modest scale. Its contemporary, BERT, made a different bet about the objective — and for understanding tasks, briefly won. That contrast is the next chapter.