Section 09

Attention Is All You Need

The transformer, from a training lens

Paper: Attention Is All You Need — Vaswani et al., 2017

We’ve built the training machine in the abstract. Now we meet the architecture it actually trains: the transformer, introduced in Vaswani et al.’s 2017 paper Attention Is All You Need. Every model in the rest of this explainer is a transformer. This chapter recaps it from a training point of view: what the forward pass computes, and therefore what backprop has to flow through.

The problem it solved

Before 2017, the best sequence models were recurrent (RNNs, LSTMs). They processed text one token at a time, each step depending on the last. That sequential dependency was fatal for training at scale: you couldn’t parallelize within a sequence, so GPUs sat idle. The transformer’s headline contribution was to replace recurrence with attention, which processes all positions at once. As the authors put it, it could be “trained for as little as twelve hours on eight P100 GPUs” — and, more importantly for us, its parallelism is exactly what later let training scale to thousands of GPUs and trillions of tokens.

It’s worth being precise about one historical point: the original transformer was a translation model, trained supervised on sentence pairs. It was not yet a pre-trained language model. The next three chapters — GPT-1, BERT, GPT-2 — are the story of taking this architecture and pointing it at the self-supervised next-token (and masked-token) objectives from our foundations. But the architecture is the shared substrate, so we start here.

Self-attention, from the training side

The heart of the transformer is self-attention. Each token’s vector $x_i$ is projected into three vectors by learned weight matrices: a query $q_i$, a key $k_i$, and a value $v_i$. The output for each token is a weighted sum of all the values, where the weights come from comparing that token’s query to every key:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

The dot product $Q K^\top$ scores how much each token should attend to each other token; dividing by $\sqrt{d_k}$ (hence scaled dot-product attention) keeps those scores from growing with dimension and saturating the softmax; the softmax turns each row into attention weights; multiplying by $V$ mixes the values accordingly. The crucial property for our purposes: apart from the row-wise softmax, this is all matrix multiplication, which is what GPUs and backprop are best at. The gradient flows cleanly back through three weight matrices ($W_Q, W_K, W_V$) that the model learns.
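As a minimal sketch of the formula above, the following NumPy code computes single-head self-attention on a toy input. The sizes, the random weights, and the function names (`softmax`, `self_attention`) are assumptions chosen for illustration, not anything taken from the paper.

```python
import numpy as np

def softmax(scores):
    # Subtract the row max before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over x of shape (seq_len, d_model)."""
    q = x @ w_q                        # queries, (seq_len, d_k)
    k = x @ w_k                        # keys,    (seq_len, d_k)
    v = x @ w_v                        # values,  (seq_len, d_k)
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)    # (seq_len, seq_len) token-to-token scores
    weights = softmax(scores)          # each row is a probability distribution
    return weights @ v, weights

# Toy sizes and random weights, purely illustrative.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Every step here is a matrix product or an element-wise operation, so an autodiff framework could backpropagate through the same computation to reach the three projection matrices.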

Multi-head attention

A single attention pattern can only express one notion of “who attends to whom.” Real language needs many at once: agreement, coreference, syntax. Multi-head attention runs several attention operations in parallel, each with its own projections into a smaller dimension ($d_k = d_{\text{model}}/h$), then concatenates and projects the results. The original used $h = 8$ heads with $d_k = 64$ (so $d_{\text{model}} = 512$), keeping total compute roughly the same as one full-width head. From a training view, each head is just more learnable matrices, all differentiated together.
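The split-attend-concatenate-project sequence can be sketched as follows. This is one common way to lay it out, assumed here for brevity: all heads’ projections are packed into single 512 × 512 matrices and split by reshaping, which is equivalent to giving each head its own 512 × 64 matrix. The names `multi_head_attention`, `split_heads`, and `w_o` are hypothetical.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_k = d_model // num_heads         # per-head width, 512 / 8 = 64 here

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return t.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, seq, seq)
    heads = softmax(scores) @ v                         # (heads, seq, d_k)
    # Concatenate the heads back to (seq_len, d_model), then project.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

d_model, num_heads, seq_len = 512, 8, 5   # seq_len is an arbitrary toy value
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (
    rng.normal(size=(d_model, d_model)) * d_model ** -0.5 for _ in range(4)
)
out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)

# Weights in one attention sub-layer, ignoring biases: W_Q, W_K, W_V, W_O.
attention_params = 4 * d_model * d_model
```

Because the eight heads together span the same 512 dimensions as one full-width head, the projection cost is unchanged; only the score computation is carved into eight smaller pieces.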

The pieces that make depth trainable

Attention alone isn’t the whole block. Three supporting ingredients are what let you actually stack many layers and train them:

  • A position-wise feed-forward network (two linear layers with a nonlinearity between, inner dimension $d_{\text{ff}} = 2048$ in the base model) processes each token independently after attention. This is where much of the model’s raw capacity, and parameter count, lives.
  • Residual connections wrap each sub-layer: the output is $\text{input} + \text{Sublayer}(\text{input})$. These give gradients a direct path backward through the whole stack, which is the single most important trick for training deep networks.
  • LayerNorm rescales each token’s activation vector to zero mean and unit variance, then applies a learned scale and shift. In the original it is applied after each sub-layer’s residual add, keeping activation scale in check.
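The three ingredients above can be sketched together for the feed-forward sub-layer. This assumes the arrangement used in the original paper, LayerNorm(x + Sublayer(x)) with a ReLU nonlinearity; the function names and the random initialization are hypothetical, and the attention sub-layer would be wrapped the same way.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    # Normalize each token's vector to zero mean and unit variance,
    # then apply the learned scale (gain) and shift (bias).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

def feed_forward(x, w1, b1, w2, b2):
    # Position-wise: the same two linear layers are applied to every token.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2   # ReLU in between

def ffn_sublayer(x, ffn_params, gain, bias):
    # Residual add first, then LayerNorm (the post-norm arrangement).
    return layer_norm(x + feed_forward(x, *ffn_params), gain, bias)

d_model, d_ff, seq_len = 512, 2048, 5   # seq_len is an arbitrary toy value
rng = np.random.default_rng(0)
ffn_params = (
    rng.normal(size=(d_model, d_ff)) * d_model ** -0.5, np.zeros(d_ff),
    rng.normal(size=(d_ff, d_model)) * d_ff ** -0.5, np.zeros(d_model),
)
x = rng.normal(size=(seq_len, d_model))
y = ffn_sublayer(x, ffn_params, np.ones(d_model), np.zeros(d_model))

# Weights in one feed-forward sub-layer, ignoring biases.
ffn_weights = 2 * d_model * d_ff
```

With these sizes the feed-forward sub-layer holds about 2.1M weights, roughly twice the 4 × 512 × 512 ≈ 1.0M weights of an attention sub-layer’s four projection matrices, which is the sense in which much of the parameter count lives there.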

Position, and the original training recipe

Because attention is order-agnostic (it’s a weighted sum, blind to sequence position), the transformer adds positional encodings to the embeddings, originally fixed sinusoids of different frequencies. (Learned position embeddings worked about as well; many modern models use rotary positions, which we’ll meet with Llama.)
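For concreteness, here is a sketch of the fixed sinusoids. It follows the formula given in the paper, where even dimensions hold sin(pos / 10000^(2i/d_model)) and odd dimensions the matching cosine; the function name, the toy sizes, and the all-zero stand-in embeddings are assumptions for illustration.

```python
import numpy as np

def sinusoidal_positions(max_len, d_model, base=10000.0):
    """Return a (max_len, d_model) table of encodings; d_model must be even."""
    positions = np.arange(max_len)[:, None]          # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2)
    # Each pair of dimensions gets its own frequency, from fast to slow.
    angles = positions / base ** (even_dims / d_model)
    table = np.zeros((max_len, d_model))
    table[:, 0::2] = np.sin(angles)   # even dimensions
    table[:, 1::2] = np.cos(angles)   # odd dimensions
    return table

pe = sinusoidal_positions(max_len=50, d_model=16)
embeddings = np.zeros((50, 16))   # stand-in for real token embeddings
inputs = embeddings + pe          # position is simply added to each embedding
```

Because the table is fixed rather than learned, it contributes no parameters and no gradients; it only changes what the first layer sees.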

The training recipe itself is a useful checkpoint against our foundations chapters — it’s all there, in miniature:

  • Optimizer: Adam with $\beta_1 = 0.9$, $\beta_2 = 0.98$.
  • Schedule: linear warmup for 4,000 steps, then inverse-square-root decay, the template for the warmup-then-decay pattern we now take for granted.
  • Regularization: dropout of 0.1 and label smoothing of 0.1.
  • Data: WMT 2014 translation pairs, tokenized with BPE (a ~37k shared vocabulary for English-German).
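Two items in this recipe reduce to a few lines of arithmetic. The schedule below follows the formula in the paper, lr = d_model^-0.5 · min(step^-0.5, step · warmup^-1.5). The label-smoothing helper is only one common variant (the smoothing mass is split evenly over the non-target tokens) and is an assumption, as are both function names.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup to warmup_steps, then inverse-square-root decay."""
    step = max(step, 1)   # guard against 0 ** -0.5 on a zero-indexed first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

def smoothed_targets(target_index, vocab_size, smoothing=0.1):
    """Soft target distribution: 1 - smoothing on the true token,
    the remainder shared equally among all other tokens (assumed variant)."""
    off_value = smoothing / (vocab_size - 1)
    targets = [off_value] * vocab_size
    targets[target_index] = 1.0 - smoothing
    return targets

peak = transformer_lr(4000)        # the schedule peaks where the two branches meet
halfway = transformer_lr(2000)     # half the peak: warmup is linear in step
later = transformer_lr(16000)      # half the peak: 4x the steps, 1/sqrt(4) the rate
targets = smoothed_targets(target_index=2, vocab_size=5)
```

With the base model’s settings the peak learning rate works out to roughly 7e-4, reached at step 4,000.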

The base model was 65M parameters; the “big” model 213M. Tiny by today’s standards — but the architecture scaled essentially unchanged to a thousand times that size. That scalability is the whole reason the next decade happened. The first team to point this architecture at self-supervised pre-training was OpenAI, with GPT-1.
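The base model’s size can be roughly sanity-checked from the dimensions in this chapter. The sketch below is a back-of-the-envelope count under several assumptions: six encoder and six decoder layers (as in the paper), one embedding matrix shared across input, output, and the final projection, a vocabulary of exactly 37,000, and no biases or LayerNorm parameters.

```python
d_model, d_ff = 512, 2048
num_layers = 6          # encoder depth and decoder depth (from the paper)
vocab_size = 37_000     # assumed round figure for the "~37k" shared vocabulary

attention = 4 * d_model * d_model      # W_Q, W_K, W_V and the output projection
feed_forward = 2 * d_model * d_ff      # two linear layers

encoder_layer = attention + feed_forward        # self-attention + FFN
decoder_layer = 2 * attention + feed_forward    # self- and cross-attention + FFN
embeddings = vocab_size * d_model               # one shared matrix (assumed)

total = num_layers * (encoder_layer + decoder_layer) + embeddings
print(f"{total / 1e6:.1f}M")   # prints 63.0M
```

That lands in the same range as the reported 65M; the difference depends on details this estimate leaves out, so treat it as a plausibility check rather than an exact reconstruction.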