Section 02

From next-token to behavior

Likelihood, KL divergence, and entropy

Almost every method in this explainer — supervised fine-tuning, RLHF, DPO, GRPO, the reasoning pipelines — is built from three quantities: a likelihood, a KL divergence, and an entropy. They sound abstract, but each has a sharp, intuitive job. Likelihood is how we say “make the model put more probability on this text.” KL divergence is the leash that keeps a model from wandering too far from where it started. Entropy is the dial that keeps it from collapsing into a single boring answer. Get comfortable with these three now and every later chapter becomes a recombination of pieces you already understand.

We’ll keep things intuitive, but the formulas matter — they’re the actual objects the optimizers manipulate.

Recap: the next-token objective

Pre-training taught the model to predict the next token. Concretely, the model defines a probability distribution over the next token given everything before it. Write the model as a policy $\pi_\theta$ with parameters $\theta$; for a sequence of tokens $y = (y_1, y_2, \ldots, y_T)$ following a context $x$, the model assigns the sequence a probability by multiplying its per-token predictions together:

$$\pi_\theta(y \mid x) = \prod_{t=1}^{T} \pi_\theta(y_t \mid x, y_{<t})$$

Each factor is a softmax over the vocabulary, read off the final-layer logits. Training minimized the cross-entropy between this distribution and the actual next token in the corpus. That’s the entire pre-training story, and it’s the foundation everything here builds on. (If the perplexity and cross-entropy machinery is hazy, the pre-training explainer covers it in depth.)
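To make the product concrete, here is a minimal sketch in plain Python. The three-token vocabulary, the logit values, and the helper names `softmax` and `sequence_prob` are all made up for illustration; a real model would produce one row of logits per position from its final layer.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_prob(step_logits, token_ids):
    """Multiply the probabilities the model gave to each token that actually occurred."""
    prob = 1.0
    for logits, tok in zip(step_logits, token_ids):
        prob *= softmax(logits)[tok]
    return prob

# Hypothetical logits: one row per position, one column per vocabulary entry.
step_logits = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.0], [-0.5, -0.5, 3.0]]
token_ids = [0, 1, 2]  # the tokens y_1, y_2, y_3 that were observed

p_sequence = sequence_prob(step_logits, token_ids)
```

Each call to `softmax` plays the role of one factor $\pi_\theta(y_t \mid x, y_{<t})$; `p_sequence` is their product.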

Likelihood, and why SFT is “just more pre-training”

The likelihood of a piece of text under the model is simply the probability the model assigns to it: $\pi_\theta(y \mid x)$, the product above. Because multiplying many small probabilities quickly underflows to zero in floating point, we always work in logs, where the product becomes a sum:

$$\log \pi_\theta(y \mid x) = \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t})$$

This single expression is the engine of supervised fine-tuning. SFT takes a curated dataset of good (instruction, response) pairs and adjusts $\theta$ to maximize the log-likelihood of the target responses. We’re telling the model: “make text that looks like this (a helpful assistant answering) more probable.”
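The underflow problem that motivates working in logs is easy to reproduce. The sketch below assumes a hypothetical 2,000-token response where every token has probability 0.1; the numbers are illustrative only.

```python
import math

# Hypothetical per-token probabilities for a 2,000-token response.
token_probs = [0.1] * 2000

# Naive product: 0.1 ** 2000 is far smaller than the smallest positive float64.
naive_likelihood = 1.0
for p in token_probs:
    naive_likelihood *= p

# Log-space: the same quantity as a well-behaved sum.
log_likelihood = sum(math.log(p) for p in token_probs)
```

The naive product collapses to exactly `0.0`, while the log-likelihood stays an ordinary finite number (here, 2000 times `log(0.1)`) that an optimizer can still differentiate and compare.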

The punchline is that SFT is mechanically the same procedure as pre-training: same cross-entropy loss, same gradient computation, same kind of optimizer. What changes is the data: instead of the raw web, we feed it carefully chosen demonstrations of the behavior we want. (In practice the loss is usually computed only on the response tokens, with the prompt serving as context, which matches the conditional $\log \pi_\theta(y \mid x)$ above.) Instruction tuning is exactly this, applied to instruction-following data. We unpack it properly in Chapter 5; for now, the key idea is that “maximize likelihood of good responses” is the SFT objective.

Why cross-entropy training = maximizing likelihood

These are two names for the same gradient. The cross-entropy loss on a target token $y_t$ is $-\log \pi_\theta(y_t \mid x, y_{<t})$ — the negative log-probability the model put on the correct token. Summed over a sequence and averaged over the dataset, minimizing cross-entropy is:

$$\min_\theta \; -\frac{1}{N}\sum_{i=1}^{N} \log \pi_\theta(y^{(i)} \mid x^{(i)})$$

The minus sign and the “min” cancel conceptually: minimizing negative log-likelihood is maximizing log-likelihood. So “train with cross-entropy” and “do maximum-likelihood estimation” describe one operation. Every time we say SFT maximizes likelihood, this is what’s running underneath.
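A small numerical check of that equivalence, under a simplifying assumption: the logits themselves are treated as the trainable parameters, whereas in a real model they are a function of $\theta$. The gradient of the cross-entropy with respect to the logits is the softmax output minus a one-hot vector for the target, so one descent step on the loss should raise the probability of the target token. The function names and the learning rate are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability of the target token."""
    return -math.log(softmax(logits)[target])

def descent_step(logits, target, lr=0.5):
    """One gradient-descent step on the cross-entropy, taken directly on the logits."""
    probs = softmax(logits)
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [z - lr * g for z, g in zip(logits, grad)]

logits = [0.2, -0.3, 1.0, 0.1]  # made-up logits over a 4-token vocabulary
target = 1                       # index of the correct next token

loss_before = cross_entropy(logits, target)
prob_before = softmax(logits)[target]

new_logits = descent_step(logits, target)
loss_after = cross_entropy(new_logits, target)
prob_after = softmax(new_logits)[target]
```

The loss goes down and the target probability goes up in the same step, which is the sense in which the two descriptions name one operation.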

KL divergence: the leash

Maximizing likelihood is fine when you have demonstrations to imitate. But the reinforcement-learning methods later in this explainer do something riskier: they let the model generate its own text and push it toward higher reward. Left unconstrained, that optimization can drag the model far from sensible English — it can discover degenerate, high-reward gibberish, or simply forget how to write. We need a way to say “improve, but don’t drift too far from the model you started as.”

The standard measure of that drift is the Kullback–Leibler (KL) divergence. Given two distributions $p$ and $q$ over the same set of outcomes, it is defined as:

$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{z} p(z) \, \log \frac{p(z)}{q(z)} = \mathbb{E}_{z \sim p}\!\left[\log \frac{p(z)}{q(z)}\right]$$

Read it as an expected log-ratio: for outcomes that $p$ considers likely, how much do $p$ and $q$ disagree? If $p$ and $q$ are identical, every ratio is $1$, every log is $0$, and the divergence is $0$. The more $p$ puts mass where $q$ does not, the larger it grows.
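The definition translates almost word for word into code. The sketch below uses two made-up distributions over three outcomes and a hypothetical helper `kl_divergence`. It also shows a standard property the formula implies but the prose does not spell out: the divergence is not symmetric, so the order of the arguments matters.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as equal-length lists.

    Outcomes with p(z) == 0 contribute nothing by convention.
    Assumes q(z) > 0 wherever p(z) > 0.
    """
    return sum(pz * math.log(pz / qz) for pz, qz in zip(p, q) if pz > 0)

p = [0.90, 0.05, 0.05]   # a sharply peaked distribution
q = [1 / 3, 1 / 3, 1 / 3]  # a uniform distribution

kl_same = kl_divergence(p, p)     # identical distributions
kl_forward = kl_divergence(p, q)  # expectation taken under p
kl_reverse = kl_divergence(q, p)  # expectation taken under q
```

`kl_same` is zero, both of the others are positive, and `kl_forward` differs from `kl_reverse` because each one weights the log-ratio by a different distribution.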

In post-training, the second distribution is almost always a frozen copy of the model from before RL began: the reference model, $\pi_{\mathrm{ref}}$ (typically the SFT checkpoint). We measure $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$ and subtract a weighted copy of it from the reward being maximized, so that drift counts as a penalty. The reward pulls the policy toward better behavior; the KL term pulls it back toward the reference. The balance between those two forces is the central knob of RLHF, and we’ll see the exact KL penalty when we get to PPO in practice. For now, hold onto the image: KL is the leash, and the reference model is the post the leash is tied to.
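One common shape for this trade-off, sketched under assumptions: for a response sampled from the policy, the log-ratio $\log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)$ is a single-sample estimate of the KL term, and a coefficient (called `beta` here, an assumed name) sets the length of the leash. The exact form used with PPO is left to the later chapter; the function name and all numbers below are hypothetical.

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward minus beta times a single-sample estimate of KL(policy || reference).

    policy_logprobs and ref_logprobs are per-token log-probabilities that the
    two models assign to the SAME sampled response.
    """
    log_ratio = sum(policy_logprobs) - sum(ref_logprobs)
    return reward - beta * log_ratio

# A sampled response the policy now likes much more than the reference did.
policy_logprobs = [-0.2, -0.1, -0.3]
ref_logprobs = [-1.0, -0.9, -1.2]

drifted = kl_penalized_reward(1.0, policy_logprobs, ref_logprobs)
anchored = kl_penalized_reward(1.0, ref_logprobs, ref_logprobs)
```

When the policy still agrees with the reference, the log-ratio is zero and the full reward survives (`anchored`). When the policy has drifted, part of the reward is taxed away (`drifted`), and a larger `beta` taxes more.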

This same quantity reappears in a completely different costume in DPO, where the KL-constrained RLHF objective gets solved in closed form and turns into a simple loss on preference pairs. The leash never leaves; it just gets folded into the math.
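As a preview only, here is a sketch of how the reference model stays visible in that loss. It assumes the standard published form of the DPO objective; the argument names, the value of `beta`, and the example log-probabilities are illustrative, and the DPO chapter remains the authoritative treatment.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    Each argument is log pi(y | x) for the chosen or rejected response under
    the policy or the frozen reference. The loss is -log(sigmoid(beta * margin)).
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))

# Policy identical to the reference: no preference has been learned yet.
loss_at_reference = dpo_loss(-5.0, -6.0, -5.0, -6.0)

# Policy raised the chosen response and lowered the rejected one, relative to the reference.
loss_after_learning = dpo_loss(-4.0, -7.0, -5.0, -6.0)
```

Every term is a log-ratio against the reference, which is the same ingredient as the KL penalty: the loss only rewards moving the chosen response up relative to where the reference put it.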

Entropy: keeping the distribution alive

The last tool is entropy, a measure of how spread out a distribution is. For the model’s next-token distribution $\pi_\theta(\cdot \mid x)$ over the vocabulary, it is:

$$H(\pi_\theta) = -\sum_{v} \pi_\theta(v \mid x) \, \log \pi_\theta(v \mid x)$$

High entropy means the model is genuinely uncertain — probability spread across many plausible next tokens. Low entropy means it’s nearly committed to one. (Note the shape: entropy is just the expected negative log-probability the model assigns to its own samples.)
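A quick sketch of both readings, using made-up distributions over a four-token vocabulary. The first part computes entropy from the formula; the second estimates it as the average negative log-probability of samples drawn from the same distribution, with a fixed seed so the run is repeatable.

```python
import math
import random

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain over 4 tokens
peaked = [0.97, 0.01, 0.01, 0.01]   # nearly committed to one token

h_uniform = entropy(uniform)  # equals log(4), the maximum for 4 outcomes
h_peaked = entropy(peaked)

# Entropy as an expectation: average -log p(v) over samples v drawn from p.
skewed = [0.5, 0.25, 0.125, 0.125]
rng = random.Random(0)
samples = rng.choices(range(len(skewed)), weights=skewed, k=20000)
h_sampled = sum(-math.log(skewed[v]) for v in samples) / len(samples)
h_exact = entropy(skewed)
```

The peaked distribution has far lower entropy than the uniform one, and the sampled estimate lands close to the exact value, which is the "expected negative log-probability" reading in action.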

Why do we care during post-training? Because reward optimization has a relentless tendency to reduce entropy. As the policy learns that a particular phrasing scores well, it piles probability onto it, and the distribution sharpens. Pushed too hard, this becomes mode collapse: the model converges on one rigid template and produces it for everything, losing diversity and, often, the ability to explore better answers.
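The sharpening effect shows up even in a toy setting. The sketch below is an assumption-laden stand-in for real RL: a softmax policy over four candidate phrasings with fixed, made-up rewards, updated by exact gradient ascent on expected reward rather than by sampled rollouts. No KL penalty or entropy bonus is applied.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def reward_ascent_step(logits, rewards, lr=1.0):
    """One exact gradient-ascent step on expected reward for a softmax policy.

    d E[r] / d logit_j = p_j * (r_j - E[r]).
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [z + lr * p * (r - expected) for z, p, r in zip(logits, probs, rewards)]

rewards = [1.0, 0.8, 0.2, 0.0]   # hypothetical scores for four phrasings
logits = [0.0, 0.0, 0.0, 0.0]    # start from a uniform policy

entropy_history = []
for _ in range(200):
    entropy_history.append(entropy(softmax(logits)))
    logits = reward_ascent_step(logits, rewards)

final_probs = softmax(logits)
```

Entropy starts at its maximum, `log(4)`, and ends lower as probability piles onto the top-scoring phrasing. Note that the second phrasing scores 0.8, nearly as well, yet unconstrained optimization still drains probability away from it.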

There’s a real tension here, and it runs through the whole field. We want the model to become more confident about good behavior (lower entropy on the right things) while not collapsing into a single mode (preserving enough entropy to explore and to stay interesting). Much of the algorithmic cleverness in later chapters — from PPO’s clipping to GRPO’s group normalization to the entropy-management tricks in modern GRPO refinements — is, at bottom, about navigating that tension.

The toolkit, assembled

Three quantities, three jobs:

Likelihood ($\log \pi_\theta(y \mid x)$): the target SFT pushes up on good text. The “imitate this” signal.
KL divergence ($D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$): the leash that keeps an optimizing policy anchored to a trusted reference. The “don’t drift” signal.
Entropy ($H(\pi_\theta)$): the spread we protect to keep the model exploring and diverse. The “don’t collapse” signal.

Nearly every objective in this explainer is some weighted combination of these three, plus a reward. When you meet PPO’s loss or DPO’s loss for the first time and it looks intimidating, come back here: you’ll find it’s these same pieces, rearranged. Next, we turn to why we need anything beyond likelihood at all — the alignment problem, and the limits of pure imitation.