From next-token to behavior
Likelihood, KL divergence, and entropy
Almost every method in this explainer — supervised fine-tuning, RLHF, DPO, GRPO, the reasoning pipelines — is built from three quantities: a likelihood, a KL divergence, and an entropy. They sound abstract, but each has a sharp, intuitive job. Likelihood is how we say “make the model put more probability on this text.” KL divergence is the leash that keeps a model from wandering too far from where it started. Entropy is the dial that keeps it from collapsing into a single boring answer. Get comfortable with these three now and every later chapter becomes a recombination of pieces you already understand.
We’ll keep things intuitive, but the formulas matter — they’re the actual objects the optimizers manipulate.
Recap: the next-token objective
Pre-training trained the model to predict the next token. Concretely, the model defines a probability distribution over the next token given everything before it. Write the model as a policy $\pi_\theta$ with parameters $\theta$; for a sequence of tokens $y = (y_1, \dots, y_T)$ following a context $x$, the model assigns the sequence a probability by multiplying its per-token predictions together:

$$\pi_\theta(y \mid x) = \prod_{t=1}^{T} \pi_\theta(y_t \mid x, y_{<t})$$
Each factor is a softmax over the vocabulary, read off the final-layer logits. Training minimized the cross-entropy between this distribution and the actual next token in the corpus. That’s the entire pre-training story, and it’s the foundation everything here builds on. (If the perplexity and cross-entropy machinery is hazy, the pre-training explainer covers it in depth.)
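To make the recap concrete, here is a minimal sketch in plain Python of the chain from logits to softmax to cross-entropy to perplexity. The four-token vocabulary, the logit values, and the function names are made up for illustration; a real model produces one logit vector per position over tens of thousands of tokens.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits_per_position, target_ids):
    """Average negative log-probability of the actual next tokens."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical toy setup: a 4-token vocabulary and 3 positions.
logits_per_position = [
    [2.0, 0.5, 0.1, -1.0],  # fairly confident in token 0
    [0.0, 3.0, 0.0, 0.0],   # confident in token 1
    [1.0, 1.0, 1.0, 1.0],   # no preference at all
]
target_ids = [0, 1, 2]      # the tokens that actually came next

loss = cross_entropy(logits_per_position, target_ids)
perplexity = math.exp(loss)  # exponential of the cross-entropy
print(loss, perplexity)
```

The third position contributes the most loss: with equal logits the model spreads probability evenly, so the correct token only gets a quarter of the mass.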
Likelihood, and why SFT is “just more pre-training”
The likelihood of a piece of text under the model is simply the probability the model assigns to it: $\pi_\theta(y \mid x)$, the product above. Because multiplying many small probabilities underflows to zero, we always work in logs, where the product becomes a sum:

$$\log \pi_\theta(y \mid x) = \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t})$$
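The underflow problem is easy to reproduce. The sketch below assumes a hypothetical 2,000-token sequence in which every token gets probability 0.1; the raw product collapses to zero in floating point, while the sum of logs stays perfectly usable.

```python
import math

def sequence_log_likelihood(token_probs):
    """Sum of per-token log-probabilities, i.e. the log of their product."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for a 2,000-token sequence.
token_probs = [0.1] * 2000

product = 1.0
for p in token_probs:
    product *= p  # 0.1 ** 2000 is far below the smallest float64

log_likelihood = sequence_log_likelihood(token_probs)

print(product)         # 0.0
print(log_likelihood)  # roughly -4605.17
```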
This single expression is the engine of supervised fine-tuning (SFT). SFT takes a curated dataset of good (instruction, response) pairs and adjusts the parameters $\theta$ to maximize the log-likelihood of the target responses. We’re telling the model: “make text that looks like this (a helpful assistant answering) more probable.”
The punchline is that SFT is mechanically identical to pre-training. Same loss, same gradients, same optimizer. The only thing that changed is the data: instead of the raw web, we feed it carefully chosen demonstrations of the behavior we want. Instruction tuning is exactly this, applied to instruction-following data. We unpack it properly in Chapter 5; for now, the key idea is that “maximize likelihood of good responses” is the SFT objective.
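As a rough sketch of what “maximize likelihood of good responses” looks like as a loss, the snippet below averages the negative log-probability over response tokens only. Skipping the prompt tokens is an assumption made here because it is a common convention, not something this section specifies; the function name and the numbers are hypothetical.

```python
import math

def sft_loss(token_log_probs, is_response_token):
    """Negative mean log-likelihood over response tokens only."""
    picked = [lp for lp, keep in zip(token_log_probs, is_response_token) if keep]
    return -sum(picked) / len(picked)

# Hypothetical log-probs the model assigned to each token of "prompt + response".
token_log_probs = [math.log(p) for p in [0.9, 0.8, 0.5, 0.25, 0.5]]

# Assumption: the first two tokens are the prompt and are excluded from the loss.
is_response_token = [False, False, True, True, True]

loss = sft_loss(token_log_probs, is_response_token)
print(loss)
```

Minimizing this loss is the same as maximizing the log-likelihood of the response, which is why the pre-training machinery carries over unchanged.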
KL divergence: the leash
Maximizing likelihood is fine when you have demonstrations to imitate. But the reinforcement-learning methods later in this explainer do something riskier: they let the model generate its own text and push it toward higher reward. Left unconstrained, that optimization can drag the model far from sensible English — it can discover degenerate, high-reward gibberish, or simply forget how to write. We need a way to say “improve, but don’t drift too far from the model you started as.”
That measure is the Kullback–Leibler (KL) divergence. Given two distributions $P$ and $Q$ over the same set of outcomes, it is defined as:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{z} P(z) \log \frac{P(z)}{Q(z)}$$

Read it as an expected log-ratio: for outcomes that $P$ considers likely, how much do $P$ and $Q$ disagree? If $P$ and $Q$ are identical, every ratio is $1$, every log is $0$, and the divergence is $0$. The more $P$ puts mass where $Q$ does not, the larger it grows.
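A small numerical check of that reading, using made-up three-outcome distributions: the divergence is zero against an identical distribution, small against a nearby one, and much larger when the mass sits in different places. The sketch assumes the second distribution is nonzero wherever the first one is.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as lists of probabilities.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q_near = [0.6, 0.3, 0.1]  # hypothetical distribution close to p
q_far = [0.1, 0.2, 0.7]   # hypothetical distribution with mass elsewhere

kl_same = kl_divergence(p, p)       # 0.0: every ratio is 1, every log is 0
kl_near = kl_divergence(p, q_near)  # small positive value
kl_far = kl_divergence(p, q_far)    # much larger
print(kl_same, kl_near, kl_far)
```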
In post-training, the second distribution is almost always a frozen copy of the model before RL began: the reference model, $\pi_{\text{ref}}$ (typically the SFT checkpoint). We measure $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$ and add it to the objective as a penalty. The reward pulls the policy toward better behavior; the KL term pulls it back toward the reference. The balance between those two forces is the central knob of RLHF, and we’ll see the exact KL penalty when we get to PPO in practice. For now, hold onto the image: KL is the leash, and the reference model is the post the leash is tied to.
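To give the tug-of-war some numbers, here is a simplified sketch of a KL-penalized reward for a single sampled response. It uses the difference of log-probabilities under the policy and the reference as a single-sample estimate of the KL term, scaled by a coefficient β. This is an illustration under those assumptions, not the exact penalty used later; the function name and all values are hypothetical.

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta):
    """Reward minus beta times a single-sample estimate of KL(policy || reference)."""
    kl_estimate = logp_policy - logp_ref  # log-ratio for this one sampled response
    return reward - beta * kl_estimate

# Hypothetical response the policy now likes far more than the reference did.
drifted = kl_penalized_reward(reward=1.0, logp_policy=-12.0, logp_ref=-20.0, beta=0.1)

# Same reward, but the policy still agrees with the reference: no penalty.
anchored = kl_penalized_reward(reward=1.0, logp_policy=-20.0, logp_ref=-20.0, beta=0.1)

print(drifted, anchored)
```

A larger β shortens the leash: the same drift costs more reward, so the policy has less incentive to move away from the reference.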
This same quantity reappears in a completely different costume in DPO, where the KL-constrained RLHF objective gets solved in closed form and turns into a simple loss on preference pairs. The leash never leaves; it just gets folded into the math.
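As a preview of that costume, the sketch below shows the form in which the DPO loss is usually written for one preference pair, leaving the derivation for later. The reference model appears only through log-probability differences, and β is the same leash strength that scales the KL penalty. The function and argument names are hypothetical, and the inputs are made-up sequence log-probabilities.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta):
    """Loss for one preference pair: -log sigmoid(beta * difference of log-ratios)."""
    chosen_shift = logp_chosen - ref_logp_chosen        # how far the policy moved on the preferred response
    rejected_shift = logp_rejected - ref_logp_rejected  # and on the dispreferred one
    margin = beta * (chosen_shift - rejected_shift)
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Policy identical to the reference: zero margin, loss is log(2).
at_reference = dpo_loss(-12.0, -13.0, -12.0, -13.0, beta=0.1)

# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-10.0, -15.0, -12.0, -13.0, beta=0.1)

print(at_reference, improved)
```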
Entropy: keeping the distribution alive
The last tool is entropy, a measure of how spread-out a distribution is. For the model’s next-token distribution over the vocabulary, it is:

$$H(\pi_\theta) = -\sum_{v \in \mathcal{V}} \pi_\theta(v \mid x, y_{<t}) \log \pi_\theta(v \mid x, y_{<t})$$

where $\mathcal{V}$ is the vocabulary and $(x, y_{<t})$ is the context so far.
High entropy means the model is genuinely uncertain — probability spread across many plausible next tokens. Low entropy means it’s nearly committed to one. (Note the shape: entropy is just the expected negative log-probability the model assigns to its own samples.)
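A quick sketch with made-up four-token distributions shows the two regimes. Entropy is measured in nats here because the code uses natural logs.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # genuinely uncertain
peaked = [0.97, 0.01, 0.01, 0.01]   # nearly committed to one token
one_hot = [1.0, 0.0, 0.0, 0.0]      # fully committed

h_uniform = entropy(uniform)  # log(4), the maximum for four outcomes
h_peaked = entropy(peaked)    # much smaller
h_one_hot = entropy(one_hot)  # zero
print(h_uniform, h_peaked, h_one_hot)
```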
Why do we care during post-training? Because reward optimization has a relentless tendency to reduce entropy. As the policy learns that a particular phrasing scores well, it piles probability onto it, and the distribution sharpens. Pushed too hard, this becomes mode collapse: the model converges on one rigid template and produces it for everything, losing diversity and, often, the ability to explore better answers.
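The sharpening can be imitated with a toy loop. Nothing here is an RL algorithm: repeatedly adding 1.0 to one logit is only a stand-in for updates that keep favoring a single rewarded choice, and the five-token vocabulary is hypothetical. The point is just to watch entropy fall at every step.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [0.0] * 5  # start with no preference among five tokens
entropies = []
for step in range(6):
    entropies.append(entropy(softmax(logits)))
    logits[0] += 1.0  # toy stand-in for an update that favors token 0

for step, h in enumerate(entropies):
    print(step, round(h, 3))
```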
There’s a real tension here, and it runs through the whole field. We want the model to become more confident about good behavior (lower entropy on the right things) while not collapsing into a single mode (preserving enough entropy to explore and to stay interesting). Much of the algorithmic cleverness in later chapters — from PPO’s clipping to GRPO’s group normalization to the entropy-management tricks in modern GRPO refinements — is, at bottom, about navigating that tension.
The toolkit, assembled
Three quantities, three jobs:
- Likelihood ($\log \pi_\theta(y \mid x)$): the quantity SFT pushes up on good text. The “imitate this” signal.
- KL divergence ($D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$): the leash that keeps an optimizing policy anchored to a trusted reference. The “don’t drift” signal.
- Entropy ($H(\pi_\theta)$): the spread we protect to keep the model exploring and diverse. The “don’t collapse” signal.
Nearly every objective in this explainer is some weighted combination of these three, plus a reward. When you meet PPO’s loss or DPO’s loss for the first time and it looks intimidating, come back here: you’ll find it’s these same pieces, rearranged. Next, we turn to why we need anything beyond likelihood at all — the alignment problem, and the limits of pure imitation.