Section 04

Optimizers & schedules

From SGD to AdamW, warmup and decay

The gradient tells you which way is downhill. The optimizer decides how to actually move. The previous chapter’s widget showed the problem with the naïve choice: one global learning rate can’t suit a surface that is steep in some directions and shallow in others. Modern pre-training solves this with a better update rule and a carefully shaped schedule.

From SGD to Adam

Plain stochastic gradient descent (SGD) updates every parameter by the same rule, $\theta \leftarrow \theta - \eta\, g$, where $g$ is the mini-batch gradient. It works, but it crawls in flat directions and oscillates in steep ones.
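To make the crawl-and-oscillate behaviour concrete, here is a minimal sketch of the update in plain Python. It assumes a toy two-parameter loss (a lopsided quadratic bowl) whose exact gradient stands in for a mini-batch gradient; the names `sgd_step` and `bowl_grad`, the curvatures, and the learning rate are illustrative choices, not values from this chapter.

```python
def sgd_step(theta, grad, lr):
    """One plain SGD update: theta <- theta - lr * grad, same rule for every parameter."""
    return [t - lr * g for t, g in zip(theta, grad)]


def bowl_grad(theta, curvatures=(100.0, 1.0)):
    """Gradient of the toy loss 0.5 * sum(c_i * theta_i**2).

    Assumed for illustration: axis 0 is steep (curvature 100), axis 1 is shallow.
    """
    return [c * t for c, t in zip(curvatures, theta)]


theta = [1.0, 1.0]
history = [theta]
for _ in range(20):
    # One shared learning rate has to serve both axes.
    theta = sgd_step(theta, bowl_grad(theta), lr=0.019)
    history.append(theta)

# Axis 0 overshoots and flips sign on every step (oscillation);
# axis 1 shrinks by under 2% per step (crawl).
```

With this single rate, the steep coordinate bounces from side to side while the shallow one has covered only about a third of the distance to the minimum after 20 steps. Raising the rate to speed up the shallow axis would make the steep axis diverge.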

The first fix is momentum: instead of stepping along the raw gradient, accumulate a running average of recent gradients and step along that. Consistent directions build up speed; oscillating ones cancel out. It’s the difference between a marble skittering down a rough chute and a heavy ball that smooths over the bumps.
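The running average can be sketched in a few lines. This version assumes the exponential-moving-average form of momentum with a decay factor `beta` of 0.9; other formulations accumulate a plain sum instead, and the function name and the synthetic gradients are hypothetical.

```python
def momentum_step(theta, velocity, grad, lr, beta=0.9):
    """SGD with momentum: step along a running average of gradients.

    Assumption: exponential moving average with decay `beta`.
    """
    velocity = [beta * v + (1.0 - beta) * g for v, g in zip(velocity, grad)]
    theta = [t - lr * v for t, v in zip(theta, velocity)]
    return theta, velocity


theta, velocity = [0.0, 0.0], [0.0, 0.0]
for step in range(50):
    # Synthetic gradients: axis 0 flips sign every step, axis 1 is steady.
    grad = [1.0 if step % 2 == 0 else -1.0, 1.0]
    theta, velocity = momentum_step(theta, velocity, grad, lr=0.1)

# velocity[0] stays small because the alternating gradients cancel;
# velocity[1] has built up to nearly the full gradient size.
```

The oscillating axis ends up with a velocity roughly twenty times smaller than the steady axis, which is the cancellation the heavy-ball picture describes.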

The fix that actually dominates LLM training is Adam (Adaptive Moment Estimation). Adam keeps two running averages per parameter:

  • the first moment $m$: the mean of recent gradients (this is momentum), and
  • the second moment $v$: the mean of recent squared gradients, a measure of how large that parameter’s gradients have been.

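The update divides the first moment by the square root of the second. In the standard Adam formulation the hats mark bias-corrected averages, which compensate for both running averages starting at zero, and the tiny constant $\epsilon$ (the Greek letter epsilon) just guards against division by zero:

$$\theta \leftarrow \theta - \eta \, \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}$$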

The effect is that every parameter gets its own effective step size. A parameter with consistently large gradients (a steep direction) is automatically scaled down; one with tiny gradients (a flat direction) is scaled up. This is exactly the per-axis adaptivity that the lopsided bowl — steep one way, shallow the other — was begging for, and it is why Adam trains transformers so much more reliably than SGD.
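A minimal sketch of one Adam step on plain Python lists shows the per-parameter scaling at work. It assumes the hats in the update denote the bias-corrected averages of the original Adam formulation and uses that formulation's commonly quoted defaults for `beta1`, `beta2`, and `eps`; the function name and the example gradients are illustrative.

```python
import math


def adam_step(theta, m, v, grad, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. `step` counts from 1 (needed for bias correction)."""
    # First moment: running mean of gradients (momentum).
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]
    # Second moment: running mean of squared gradients.
    v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction: undo the shrinkage caused by starting m and v at zero.
    m_hat = [mi / (1.0 - beta1 ** step) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** step) for vi in v]
    theta = [
        t - lr * mh / (math.sqrt(vh) + eps)
        for t, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return theta, m, v


# Two parameters whose gradients differ by a factor of 10,000.
theta, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
theta, m, v = adam_step(theta, m, v, grad=[100.0, 0.01], step=1)
# Both parameters move by roughly lr, regardless of gradient magnitude.
```

On this first step the ratio of the corrected moments is close to the sign of the gradient, so the steep and the flat parameter take nearly identical steps. Plain SGD with the same rate would have moved the first parameter 10,000 times further than the second.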

The learning-rate schedule

Even with Adam, the learning rate $\eta$ is not held constant: it follows a schedule over the course of training. Two ideas dominate.

Warmup. A freshly initialized model is fragile; its activations and gradients haven’t settled. Hitting it with the full learning rate immediately can blow it up. So we warm up: start near zero and ramp the rate up over the first few thousand steps (typically the first 0.5–2% of training).

Decay. After warmup, the rate decays, almost always with a cosine decay, from its peak down to a small floor. High early rates make fast progress while the loss is far from a minimum; the gentle taper at the end lets the model settle into a good region instead of bouncing around it.

[Interactive figure: Learning-rate schedule. Warmup ramps the rate up; decay brings it back down, following the canonical "warmup + cosine" recipe. Horizontal axis: training step; the curve marks the warmup phase and the peak rate.]
The early model is fragile, so we warm up over the first few percent of steps instead of slamming it with the full rate. Then a long cosine decay spends most of training at a high, productive rate and eases down to a small floor for a clean finish. Decaying all the way to zero is common; many modern runs stop around 10% of peak.
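The "warmup + cosine" curve is easy to write down as a function of the step count. The sketch below assumes a linear warmup ramp and a floor at 10% of peak; the function name `lr_at`, the peak rate of 3e-4, and the step counts are hypothetical values chosen only for illustration.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_steps, floor_frac=0.1):
    """Learning rate at `step`: linear warmup, then cosine decay to a floor.

    Assumptions: the ramp is linear, and the floor is floor_frac * peak_lr
    (set floor_frac=0.0 to decay all the way to zero).
    """
    if step < warmup_steps:
        # Ramp from near zero up to the peak over the warmup steps.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    floor = floor_frac * peak_lr
    # Half-cosine from peak (progress 0) down to the floor (progress 1).
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))


total_steps, warmup_steps, peak_lr = 10_000, 100, 3e-4  # warmup = 1% of training
schedule = [lr_at(s, total_steps, peak_lr, warmup_steps) for s in range(total_steps + 1)]
```

Here `schedule[0]` is 1% of the peak, the rate reaches the peak at the end of warmup, and the final entry sits at the 10% floor. Halfway through the decay phase the rate is midway between peak and floor, since the half-cosine is symmetric around its midpoint.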

A few supporting tricks round out the recipe. Gradient clipping caps the global norm of the gradient so an occasional huge batch can’t destabilize training. Weight decay regularizes by shrinking the weights slightly toward zero at every step; AdamW, the variant named in this chapter’s title, applies that shrinkage directly to the weights rather than folding it into the adaptive gradient update. And the hyperparameters, such as peak learning rate, warmup length, batch size, and the $\beta$ (the Greek letter beta) values that control Adam’s running averages, are tuned at small scale and extrapolated up, increasingly with the help of scaling laws.
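Two of these tricks fit in a few lines each. The sketch below assumes gradients are stored as a list of per-tensor lists and that clipping rescales everything by one shared factor; the weight-decay helper assumes the decoupled, AdamW-style form in which weights shrink directly, separately from the gradient step. All names and values are illustrative.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients by one factor so their joint L2 norm is at most max_norm.

    Returns the clipped gradients and the norm measured before clipping.
    """
    total = math.sqrt(sum(g * g for group in grads for g in group))
    # Small constant avoids division by zero when every gradient is zero.
    scale = min(1.0, max_norm / (total + 1e-12))
    return [[g * scale for g in group] for group in grads], total


def decoupled_weight_decay(theta, lr, weight_decay):
    """Shrink weights toward zero, independent of the gradient (AdamW-style assumption)."""
    return [t - lr * weight_decay * t for t in theta]


# A spike: global norm 5.0 gets scaled down to 1.0, direction preserved.
clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)

# An ordinary gradient under the threshold passes through unchanged.
untouched, _ = clip_by_global_norm([[0.3], [0.4]], max_norm=1.0)

decayed = decoupled_weight_decay([2.0, -2.0], lr=0.1, weight_decay=0.1)
```

Because the whole gradient is scaled by one factor, clipping changes the length of the step but not its direction, which is why it is measured on the global norm rather than per parameter.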

We now have the full optimization loop: objective, gradient, optimizer, schedule. Before we can run it on real hardware, though, we have to confront the fact that GPUs don’t compute in tidy real numbers — they compute in finite-precision floating point. That is the next chapter.