Section 09

Attention Is All You Need

The transformer, from a training lens

Paper: Attention Is All You Need — Vaswani et al., 2017

We’ve built the training machine in the abstract. Now we meet the architecture it actually trains: the transformer, introduced in Vaswani et al.’s 2017 paper Attention Is All You Need. Every model in the rest of this explainer is a transformer. This chapter recaps it from a training point of view: what the forward pass computes, and therefore what backprop has to flow through.

The problem it solved

Before 2017, the best sequence models were recurrent (RNNs, LSTMs). They processed text one token at a time, each step depending on the last. That sequential dependency was fatal for training at scale: you couldn’t parallelize within a sequence, so GPUs sat idle. The transformer’s headline contribution was to replace recurrence with attention, which processes all positions at once. As the authors put it, it could be “trained for as little as twelve hours on eight P100 GPUs” — and, more importantly for us, its parallelism is exactly what later let training scale to thousands of GPUs and trillions of tokens.

It’s worth being precise about one historical point: the original transformer was a translation model, trained supervised on sentence pairs. It was not yet a pre-trained language model. The next three chapters — GPT-1, BERT, GPT-2 — are the story of taking this architecture and pointing it at the self-supervised next-token (and masked-token) objectives from our foundations. But the architecture is the shared substrate, so we start here.

Self-attention, from the training side

The heart of the transformer is self-attention. Each token’s vector $x_i$ is projected into three vectors by learned weight matrices: a query $q_i$, a key $k_i$, and a value $v_i$. The output for each token is a weighted sum of all the values, where the weights come from comparing that token’s query to every key:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

The dot product $QK^\top$ scores how much each token should attend to each other token; dividing by $\sqrt{d_k}$ (the scaled dot-product) keeps those scores from growing with dimension and saturating the softmax; the softmax turns each row into attention weights; multiplying by $V$ mixes the values accordingly. The crucial property for our purposes: this is all matrix multiplication, which is what GPUs and backprop are best at. The gradient flows cleanly back through three weight matrices ($W_Q, W_K, W_V$) that the model learns.
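The formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the paper's implementation: the function names (`softmax`, `self_attention`), the toy sizes, and the random weights are all assumptions made for the example, and masking and batching are left out.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a (seq_len, d_model) input."""
    q = x @ w_q                      # queries, (seq_len, d_k)
    k = x @ w_k                      # keys,    (seq_len, d_k)
    v = x @ w_v                      # values,  (seq_len, d_k)
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # (seq_len, seq_len) scaled dot products
    weights = softmax(scores)        # each row sums to 1
    return weights @ v, weights

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Every step is a matrix product or an elementwise operation, so an autodiff framework can differentiate the whole function with respect to `w_q`, `w_k`, and `w_v` without special handling.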

Multi-head attention

A single attention pattern can only express one notion of “who attends to whom.” Real language needs many at once: agreement, coreference, syntax. Multi-head attention runs several attention operations in parallel, each with its own projections into a smaller dimension ($d_k = d_{\text{model}}/h$), then concatenates and projects the results. The original used $h = 8$ heads with $d_k = 64$ (from $d_{\text{model}} = 512$), keeping total compute roughly the same as one full-width head. From a training view, each head is just more learnable matrices, all differentiated together.
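As a rough sketch of the head-splitting described above, the per-head projections can be stored as one $d_{\text{model}} \times d_{\text{model}}$ matrix per role and reshaped into $h$ heads. The function and variable names here are hypothetical, the weights are random, and masking, dropout, and biases are omitted.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_k = d_model // num_heads  # per-head width

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return t.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    # One wide projection per role is equivalent to num_heads narrow ones.
    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (heads, seq, seq)
    heads = softmax(scores) @ v                        # (heads, seq, d_k)
    # Concatenate heads back to (seq_len, d_model), then apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

# Base-model widths from the text; sequence length and weights are illustrative.
d_model, num_heads, seq_len = 512, 8, 10
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (
    rng.normal(size=(d_model, d_model)) * d_model ** -0.5 for _ in range(4)
)
out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, d_model // num_heads)
```

Because the heads share one reshape rather than eight separate matrix products, the projections cost about the same as a single full-width head, which is the compute argument made above.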

The pieces that make depth trainable

Attention alone isn’t the whole block. Three supporting ingredients are what let you actually stack many layers and train them:

A position-wise feed-forward network (two linear layers with a nonlinearity between, inner dimension $d_{\text{ff}} = 2048$ in the base model) processes each token independently after attention. This is where much of the model’s raw capacity — and parameter count — lives.
Residual connections wrap each sub-layer: the output is $\text{input} + \text{Sublayer}(\text{input})$. These give gradients a direct path backward through the whole stack, which is the single most important trick for training deep networks.
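LayerNorm normalizes each token’s activations after every sub-layer (the original applies it to the residual sum, $\text{LayerNorm}(\text{input} + \text{Sublayer}(\text{input}))$), keeping their scale in check.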
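The three ingredients above fit together roughly as follows. This is a minimal sketch of the original post-norm arrangement, with the feed-forward network standing in as the sub-layer. The helper names, the `eps` value, and the random initialization are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-6):
    # Normalize each token's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

def feed_forward(x, w1, b1, w2, b2):
    # Two linear layers with a ReLU between, applied to each position independently.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def sublayer_with_residual(x, sublayer, gain, bias):
    # Post-norm form used in the original paper: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x), gain, bias)

# Base-model widths from the text; everything else is illustrative.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 512, 2048
x = rng.normal(size=(seq_len, d_model))
w1 = rng.normal(size=(d_model, d_ff)) * d_model ** -0.5
w2 = rng.normal(size=(d_ff, d_model)) * d_ff ** -0.5
b1, b2 = np.zeros(d_ff), np.zeros(d_model)
gain, bias = np.ones(d_model), np.zeros(d_model)

ffn = lambda t: feed_forward(t, w1, b1, w2, b2)
y = sublayer_with_residual(x, ffn, gain, bias)
print(y.shape)
```

The `x + sublayer(x)` term is the direct path for gradients: its derivative with respect to `x` includes an identity term regardless of what the sub-layer does.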

Position, and the original training recipe

Because attention is order-agnostic (it’s a weighted sum, blind to sequence position), the transformer adds positional encodings to the embeddings — originally fixed sinusoids of different frequencies. (Learned position embeddings worked about as well; modern models use rotary positions, which we’ll meet with Llama.)
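For reference, the fixed sinusoids in the paper are $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{\text{model}}})$. A small sketch of building that table follows. The function name and `max_len` are illustrative choices, and the code assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    """Return a (max_len, d_model) table of fixed positional encodings."""
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the values 2i, shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get the sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get the cosine
    return pe

pe = sinusoidal_positions(max_len=50, d_model=512)
print(pe.shape)
```

The table is added elementwise to the token embeddings before the first layer. It has no learned parameters, so it contributes nothing to the backward pass.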

The training recipe itself is a useful checkpoint against our foundations chapters — it’s all there, in miniature:

Optimizer: Adam with $\beta_1 = 0.9$, $\beta_2 = 0.98$.
Schedule: linear warmup for 4,000 steps, then inverse-square-root decay. This is the prototype of the warmup-then-decay pattern we now take for granted.
Regularization: dropout of 0.1 and label smoothing of 0.1.
Data: WMT 2014 translation pairs, tokenized with BPE (a ~37k shared vocabulary).
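The schedule in this recipe has a compact closed form in the paper: $lrate = d_{\text{model}}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup^{-1.5})$. A small sketch follows. The function name is hypothetical, and clamping `step` to at least 1 is an added guard against division by zero, not part of the paper's formula.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given optimizer step (linear warmup, then 1/sqrt decay)."""
    step = max(step, 1)  # guard: the formula is undefined at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 2000, 4000, 16000):
    print(step, transformer_lr(step))
```

With the base settings, the two branches of the `min` meet at step 4,000, where the rate peaks at roughly 7e-4. It is half that value at step 2,000 (still warming up) and again at step 16,000 (after decaying).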

The base model was 65M parameters; the “big” model 213M. Tiny by today’s standards — but the architecture scaled essentially unchanged to a thousand times that size. That scalability is the whole reason the next decade happened. The first team to point this architecture at self-supervised pre-training was OpenAI, with GPT-1.
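A back-of-envelope count shows where the base model's parameters sit. This is only a rough estimate under stated assumptions: six encoder and six decoder layers (as in the paper), a single embedding matrix shared between the input embeddings and the output projection, a vocabulary rounded to 37,000, and all biases and LayerNorm parameters ignored.

```python
d_model, d_ff, n_layers, vocab = 512, 2048, 6, 37_000

attention = 4 * d_model * d_model       # W_Q, W_K, W_V and the output projection
ffn = 2 * d_model * d_ff                # the two feed-forward matrices
encoder_layer = attention + ffn         # self-attention + feed-forward
decoder_layer = 2 * attention + ffn     # self-attention, encoder-decoder attention, feed-forward
embedding = vocab * d_model             # assumed shared across embeddings and output layer

total = n_layers * (encoder_layer + decoder_layer) + embedding
print(f"{total / 1e6:.1f}M")  # prints 63.0M
```

Under these assumptions the estimate is about 63M, in the same range as the reported 65M. The remainder presumably comes from the terms ignored here and the exact vocabulary size. Within each encoder layer, the feed-forward matrices hold twice as many parameters as the attention projections.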