The MLP block
Per-token nonlinear processing
Attention let tokens look at each other. The other half of every layer does the opposite: it works on each token in isolation. That half is called the MLP MLP Multi-Layer Perceptron — a stack of dense (matrix-multiply + nonlinearity) layers applied per-token. The transformer’s feed-forward block. See in glossary → block — short for Multi-Layer Perceptron, which is a fancy name for the most ordinary kind of neural net there is.
If attention is the social half of a transformer layer, the MLP is the private half — every token gets pulled aside, run through the same little network, and handed back with its representation refined.
What’s actually inside
An MLP, here, is a three-step procedure:
- Take the token’s vector (size — say 4,096).
- Project it up to a much wider vector — the expanded layer (size , usually 3–4× wider).
- Apply an elementwise nonlinearity that decides which entries of that wider vector “fire.”
- Project the result back down to .
That’s it. Two matrix multiplications with a nonlinearity in between. The output replaces the input in the residual stream, and we move on.
A useful way to think about the middle “expanded” layer: it’s a collection of feature detectors. Each entry asks something like “does this token look like a verb of motion?” or “does the residual stream look like it’s building up to a comma?”. The nonlinearity decides whether each detector fires; the down-projection blends the firing detectors back into a new vector, which the rest of the model reads.
The detectors aren’t designed — they’re learned. But people who have probed real models can often find single MLP neurons that correspond to surprisingly clean things: “indented Python code,” “the word ‘because’ is coming up,” “this is a token inside a quoted string.” It’s not always so clean, but the principle holds — the wide middle layer is where the model stores most of what it knows.
Why there has to be a nonlinearity
Stack two matrix multiplies with nothing in between and they collapse into a single matrix multiply. Multiplying by a matrix can only do scaling and rotation, so without the nonlinearity, the MLP couldn’t represent any of the “if A and B, then turn on C” kind of logic that text understanding actually needs. The nonlinearity is the bend. It’s what lets the network be more than a glorified rotation.
The specific bend doesn’t matter much; what matters is that there is one. The original Transformer used ReLU. GPT-2 used GELU GELU Gaussian Error Linear Unit — a smooth nonlinearity used inside the MLP. SiLU/SwiGLU are common modern variants. See in glossary → . Modern Llama-class models use a slightly fancier gated variant called SwiGLU SwiGLU A gated MLP variant (Llama, PaLM): output = SiLU(xW₁) ⊙ (xW₂), then projected. Outperforms plain MLPs at the same param count. See in glossary → , which adds a second up-projection that acts as a per-feature volume knob on the first. SwiGLU is consistently a little better at the same parameter count, which is why it’s now the default — but it’s an incremental improvement on the same three-step picture above.
The MLP holds most of the weights
This is the load-bearing fact for the rest of the essay. For a Llama-3-8B layer:
- Attention: roughly 67 million parameters per layer.
- MLP: roughly 176 million parameters per layer — over 2.5× the attention.
Multiplied by 32 layers, that’s about 5.6 billion of the model’s 8 billion parameters living in MLP blocks. Most of a modern LLM, by parameter count, is feed-forward networks. Attention gets the credit; the MLP carries the weight.
This is what makes decoding memory-bound. Every time the model generates one new token, it has to read most of those MLP weights from HBM HBM High-Bandwidth Memory — the DRAM stack soldered next to the GPU die. H100 SXM has 80 GB at ~3.35 TB/s. See in glossary → — the slow GPU DRAM — and use them for a single token’s worth of math. The matrix units finish in microseconds and then sit waiting for the next chunk of weights to arrive. We’ll come back to this asymmetry in §11 and §13.
Position-wise, not sequence-wise
Worth saying once more, because it matters: the MLP runs the same way at every position, independently. There’s no mixing between tokens here, no sense of order, no past or future. All cross-token mixing happens in attention; all per-token transformation happens in the MLP. A transformer layer alternates those two operations, and that’s the whole recipe.
Speaking of which — we now have all the pieces. Let’s assemble them into a single transformer block.