Section 04

Optimizers & schedules

From SGD to AdamW, warmup and decay

The gradient tells you which way is downhill. The optimizer decides how to actually move. The previous chapter’s widget showed the problem with the naïve choice: one global learning rate can’t suit a surface that is steep in some directions and shallow in others. Modern pre-training solves this with a better update rule and a carefully shaped schedule.

From SGD to Adam

Plain stochastic gradient descent updates every parameter by the same rule, $\theta \leftarrow \theta - \eta\, g$, where $\eta$ is the learning rate and $g$ is the mini-batch gradient. It works, but it crawls in flat directions and oscillates in steep ones.

The first fix is momentum: instead of stepping along the raw gradient, accumulate a running average of recent gradients and step along that. Consistent directions build up speed; oscillating ones cancel out. It’s the difference between a marble skittering down a rough chute and a heavy ball that smooths over the bumps.
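As a rough illustration, the sketch below runs plain SGD and SGD with momentum on a toy lopsided bowl. The surface, the curvatures `A` and `B`, the learning rate, the step count and the momentum coefficient `mu` are all assumptions chosen for the demo, and the momentum variant uses the common accumulate form $v \leftarrow \mu v + g$ (a scaled running average) rather than any particular library's implementation.

```python
# Toy "lopsided bowl": loss = 0.5 * (A*x^2 + B*y^2), steep in x, shallow in y.
# A, B, the learning rate and the step count are illustrative assumptions.
A, B = 100.0, 1.0

def loss(theta):
    x, y = theta
    return 0.5 * (A * x * x + B * y * y)

def grad(theta):
    x, y = theta
    return [A * x, B * y]

def run_sgd(theta, lr, steps):
    """Plain gradient descent: theta <- theta - lr * g."""
    for _ in range(steps):
        g = grad(theta)
        theta = [p - lr * g_i for p, g_i in zip(theta, g)]
    return theta

def run_momentum(theta, lr, steps, mu=0.9):
    """Heavy-ball momentum: accumulate gradients in a velocity, step along it."""
    velocity = [0.0 for _ in theta]
    for _ in range(steps):
        g = grad(theta)
        velocity = [mu * v_i + g_i for v_i, g_i in zip(velocity, g)]
        theta = [p - lr * v_i for p, v_i in zip(theta, velocity)]
    return theta

start = [1.0, 1.0]
sgd_loss = loss(run_sgd(start, lr=0.019, steps=100))
momentum_loss = loss(run_momentum(start, lr=0.019, steps=100))
print(f"SGD loss: {sgd_loss:.5f}  momentum loss: {momentum_loss:.5f}")
```

With these settings plain SGD is held back by the shallow direction, which shrinks by only about 2% per step, while the momentum run ends at a lower loss for the same learning rate.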

The fix that actually dominates LLM training is Adam (Adaptive Moment Estimation). Adam keeps two running averages per parameter:

the first moment $m$ — the mean of recent gradients (this is momentum), and
the second moment $v$ — the mean of recent squared gradients, a measure of how large that parameter’s gradients have been.

The update divides the first moment by the square root of the second. The hats mark bias-corrected averages, which compensate for $m$ and $v$ both starting at zero, and the tiny constant $\epsilon$ (the Greek letter epsilon) just guards against division by zero:

$$\theta \leftarrow \theta - \eta \, \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}$$

The effect is that every parameter gets its own effective step size. A parameter with consistently large gradients (a steep direction) is automatically scaled down; one with tiny gradients (a flat direction) is scaled up. This is exactly the per-axis adaptivity that the lopsided bowl — steep one way, shallow the other — was begging for, and it is why Adam trains transformers so much more reliably than SGD.
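The update above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production optimizer: the function name `adam_step`, the list-based parameter layout and the default values for `beta1`, `beta2` and `eps` (the commonly quoted 0.9, 0.999 and 1e-8) are assumptions for the example.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over parallel lists; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first moment: mean of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second moment: mean of squared gradients
        m_hat = m_i / (1 - beta1 ** t)             # bias correction for the zero start
        v_hat = v_i / (1 - beta2 ** t)
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

# Two parameters whose gradients differ by four orders of magnitude.
theta = [1.0, 1.0]
grads = [100.0, 0.01]
new_theta, m, v = adam_step(theta, grads, m=[0.0, 0.0], v=[0.0, 0.0], t=1)
steps_taken = [old - new for old, new in zip(theta, new_theta)]
print(steps_taken)
```

Both parameters move by roughly the learning rate on this first step, even though one gradient is 10,000 times larger than the other: the division by $\sqrt{\hat{v}}$ cancels the gradient's scale.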

AdamW is the real default

Modern runs use AdamW: Adam with decoupled Weight decay. Plain Adam folds weight decay into the gradient as an L2 penalty, so the decay passes through the same per-parameter scaling as everything else and ends up weakest on exactly the parameters with the largest gradients. AdamW instead shrinks the weights directly, separately from the adaptive step. Nearly every model in this explainer (GPT-3, Llama, the Qwens, the Gemmas) is trained with AdamW. The 2026 frontier introduces a challenger, the Muon optimizer, which we meet with Kimi K2.5.
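To make the distinction concrete, here is a minimal single-parameter sketch that can run either variant. The function name `adam_like_step`, the `decoupled` flag and the numbers in the usage lines are hypothetical, chosen only to expose the difference; real libraries differ in details.

```python
import math

def adam_like_step(theta, grad, m, v, t, lr, wd, decoupled,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """One step for a single scalar parameter; t is the 1-based step count."""
    if decoupled:
        theta = theta * (1 - lr * wd)   # AdamW style: shrink the weight directly
    else:
        grad = grad + wd * theta        # L2 style: decay is folded into the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Zero loss gradient, so any movement comes from weight decay alone.
w_adamw, _, _ = adam_like_step(2.0, 0.0, 0.0, 0.0, t=1, lr=0.1, wd=0.01, decoupled=True)
w_l2, _, _ = adam_like_step(2.0, 0.0, 0.0, 0.0, t=1, lr=0.1, wd=0.01, decoupled=False)
print(w_adamw, w_l2)
```

In the decoupled run the weight shrinks by the small fraction `lr * wd`, as intended. In the L2 run the decay term is normalized by the adaptive denominator and becomes a full learning-rate-sized step, no matter how small `wd` is.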

The optimizer isn't free

Adam’s two moments cost memory: stored in FP32, that’s 8 bytes per parameter on top of the weights, twice the size of FP32 weights and four times the size of 16-bit ones. This is why optimizer states are a headline term in the memory budget, and why sharding them across GPUs (ZeRO / FSDP) is one of the first moves when scaling out.
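The arithmetic is easy to check by hand. The sketch below assumes a hypothetical 7-billion-parameter model and an idealized even split of the optimizer state across 8 GPUs; both numbers are illustrative, and real sharding schemes carry extra overhead.

```python
def adam_state_bytes(num_params, bytes_per_value=4):
    """Bytes for Adam's two moments (m and v): one value each per parameter."""
    return 2 * num_params * bytes_per_value

def weight_bytes(num_params, bytes_per_value):
    """Bytes for the weights alone at a given precision."""
    return num_params * bytes_per_value

n = 7_000_000_000   # hypothetical 7B-parameter model
GB = 1e9            # decimal gigabytes

print(f"Adam moments (FP32): {adam_state_bytes(n) / GB:.0f} GB")
print(f"Weights (FP32):      {weight_bytes(n, 4) / GB:.0f} GB")
print(f"Weights (16-bit):    {weight_bytes(n, 2) / GB:.0f} GB")

# Idealized sharding: each of 8 GPUs holds an equal slice of the moments.
per_gpu_bytes = adam_state_bytes(n) / 8
print(f"Moments per GPU when sharded 8 ways: {per_gpu_bytes / GB:.0f} GB")
```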

The learning-rate schedule

Even with Adam, the learning rate $\eta$ is not held constant — it follows a schedule over the course of training. Two ideas dominate.

Warmup. A freshly initialized model is fragile; its activations and gradients haven’t settled. Hitting it with the full learning rate immediately can blow it up. So we warm up: start near zero and ramp the rate up over the first few thousand steps (typically the first 0.5–2% of training).

Decay. After warmup, the rate decays — almost always a cosine decay — from its peak down to a small floor. High early rates make fast progress while the loss is far from a minimum; the gentle taper at the end lets the model settle into a good region instead of bouncing around it.

[Interactive figure: learning-rate schedule. Warmup ramps the rate up; decay brings it back down, giving the canonical "warmup + cosine" recipe. Default settings: warmup fraction = 5% of steps; final learning rate = 10% of peak.]

The early model is fragile, so we warm up over the first few percent of steps instead of slamming it with the full rate. Then a long cosine decay spends most of training at a high, productive rate and eases down to a small floor for a clean finish. Decaying all the way to zero is common; many modern runs stop around 10% of peak.
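The "warmup + cosine" recipe fits in one small function. This is a sketch under stated assumptions: linear warmup from zero, a cosine taper to a floor expressed as a fraction of the peak, and a hypothetical function name `lr_at` with illustrative values for the peak rate, step count and fractions.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_frac=0.01, floor_frac=0.1):
    """Learning rate at a given step: linear warmup, then cosine decay to a floor."""
    warmup_steps = round(warmup_frac * total_steps)
    floor_lr = floor_frac * peak_lr
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak.
        return peak_lr * step / warmup_steps
    # Decay: progress runs from 0 (end of warmup) to 1 (end of training).
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return floor_lr + (peak_lr - floor_lr) * cosine

TOTAL, PEAK = 10_000, 3e-4   # illustrative values
for s in (0, 50, 100, 5_050, 10_000):
    print(s, lr_at(s, TOTAL, PEAK))
```

Setting `floor_frac=0.0` gives the decay-to-zero variant; `floor_frac=0.1` matches the "stop around 10% of peak" choice described above.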

A few supporting tricks round out the recipe. Gradient clipping caps the global norm of the gradient so an occasional huge batch can’t destabilize training. Weight decay regularizes. And the hyperparameters — peak learning rate, warmup length, batch size, $\beta$ (the Greek letter beta) values — are tuned at small scale and extrapolated up, increasingly with the help of scaling laws.
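Gradient clipping by global norm is simple enough to sketch directly. The version below treats the gradients as a list of flat lists, one per tensor; the function name `clip_by_global_norm` and the example values are assumptions for illustration, and framework implementations differ in small details such as adding a tiny constant to the denominator.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if total_norm <= max_norm:
        return grads, total_norm          # small gradients pass through unchanged
    scale = max_norm / total_norm         # one shared factor preserves the direction
    return [[g * scale for g in tensor] for tensor in grads], total_norm

# Two "tensors" with a combined norm of 5.0, clipped to a norm of 1.0.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(norm_before, clipped)

# Gradients already under the cap are left alone.
small, small_norm = clip_by_global_norm([[0.3], [0.4]], max_norm=1.0)
```

Because every gradient is multiplied by the same factor, clipping shortens the step without changing its direction.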

We now have the full optimization loop: objective, gradient, optimizer, schedule. Before we can run it on real hardware, though, we have to confront the fact that GPUs don’t compute in tidy real numbers — they compute in finite-precision floating point. That is the next chapter.