Section 14

PPO for RLHF in practice

The KL-to-reference penalty and the loop

We have the algorithm: PPO with its clipped surrogate and a GAE advantage. Now we wire it into actual RLHF. Doing so surfaces a problem the clean theory hides: if you simply turn PPO loose to maximize the score of a learned reward model (RM), the policy will find the reward model’s blind spots and exploit them, drifting into fluent nonsense that the RM loves and humans hate. The fix, a KL penalty to a frozen reference model, is the single most important practical detail in PPO-based RLHF, and it’s where this chapter begins.

The real per-token reward

In the textbook setup the reward arrives once, at the end of the response. In RLHF that terminal reward is the reward-model score $r_{\text{RM}}(x, y)$ on the full prompt–response pair. But there’s a second term, applied at every token: a penalty for straying from a frozen reference model $\pi_{\text{ref}}$ (typically the SFT checkpoint we started from). The per-token reward the policy actually optimizes is:

$$R_t = \underbrace{r_{\text{RM}}(x, y) \cdot \mathbb{1}[t = T]}_{\text{RM score, at the final token}} \;-\; \underbrace{\beta \, \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})}}_{\text{per-token KL penalty}}$$

The RM score lands only on the last token; the second term is paid at every step. That second term is a per-token estimate of the KL divergence between the policy and the reference: a KL penalty scaled by a coefficient $\beta$. Summed over a sampled response, the per-token log-ratios give a one-sample estimate of the sequence-level KL, so in expectation we are optimizing the RM score minus $\beta \, \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$ over the whole sequence. These per-token rewards feed straight into the GAE machinery from chapter 12 to produce the advantages used in PPO’s clipped objective.
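The per-token reward can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `per_token_rewards` and all the numbers are hypothetical, and it assumes you already have the log-probabilities of the sampled tokens under both the policy and the reference.

```python
import numpy as np

def per_token_rewards(rm_score, logp_policy, logp_ref, beta):
    """Build R_t for one response of length T.

    rm_score    : scalar reward-model score r_RM(x, y)
    logp_policy : shape (T,), log pi_theta(y_t | x, y_<t) of the sampled tokens
    logp_ref    : shape (T,), log pi_ref(y_t | x, y_<t) of the same tokens
    beta        : KL coefficient
    """
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    # Per-token KL estimate: log-ratio of policy to reference on the sampled token.
    log_ratio = logp_policy - logp_ref
    rewards = -beta * log_ratio   # penalty paid at every step
    rewards[-1] += rm_score       # RM score lands on the final token only
    return rewards

# Hypothetical numbers for a 4-token response.
logp_policy = [-1.0, -0.5, -2.0, -0.1]
logp_ref = [-1.2, -0.9, -2.0, -0.6]
rewards = per_token_rewards(rm_score=2.0, logp_policy=logp_policy,
                            logp_ref=logp_ref, beta=0.1)
print(rewards)        # per-token R_t
print(rewards.sum())  # RM score minus beta times the summed log-ratios
```

Summing the returned array recovers the sequence-level quantity: the RM score minus $\beta$ times the sum of log-ratios, which is the one-sample estimate of the KL term.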

Why the KL-to-reference penalty exists

Two reasons, both essential.

It blocks reward hacking. A reward model is a learned, imperfect proxy for human preference. Optimize any proxy hard enough and you stop optimizing the thing it stands for: the policy discovers inputs where the RM is wrong and pumps them, producing text that scores 9.8 while being repetitive, sycophantic, or gibberish. The KL penalty tethers the policy to the distribution of fluent, sensible language defined by the reference model: the further the policy drifts to chase RM quirks, the bigger the penalty it pays. This is a preview of reward hacking, the subject of chapter 15.

It preserves what SFT taught. The reference is usually the SFT model — already fluent, already instruction-following. Without the anchor, RL optimization can erode that hard-won competence in pursuit of reward. The KL term keeps the policy in the neighborhood of a known-good model, so RLHF refines the SFT behavior rather than overwriting it.

The loop

Here is PPO-RLHF as it actually runs, one iteration:

  1. Sample rollouts. Take a batch of prompts and generate responses from the current policy. Because we generate fresh data from the live policy each round, this is online RL.
  2. Score. Run the reward model on each completed response to get $r_{\text{RM}}$, and compute the per-token KL penalty against the reference model.
  3. Estimate advantages. Feed the per-token rewards through the critic and GAE to get $\hat{A}_t$ for every token.
  4. PPO update. Take several gradient steps on this batch using the clipped surrogate objective, updating both the policy and the critic. The clip keeps each update inside the trust region.
  5. Repeat with fresh rollouts from the now-slightly-better policy.
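
Steps 3 and 4 reduce to two small computations, sketched here in NumPy for a single response. Treat this as an illustration under assumptions rather than a production implementation: the function names, the defaults `gamma=1.0`, `lam=0.95`, and `clip_eps=0.2`, and the example numbers are all hypothetical, and a real trainer would compute the objective inside an autodiff framework and also fit the critic.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one response.

    rewards : shape (T,), per-token rewards R_t
    values  : shape (T,), critic estimates V(s_t); the value after the
              final token is taken to be 0 because the episode ends there.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    advantages = np.zeros(len(rewards))
    next_value, running = 0.0, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # discounted sum of residuals
        advantages[t] = running
        next_value = values[t]
    return advantages

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped objective (to be maximized), averaged over tokens."""
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return np.minimum(unclipped, clipped).mean()

# Hypothetical per-token rewards and critic outputs for a 4-token response.
rewards = [-0.02, -0.04, 0.0, 1.95]
values = [1.5, 1.6, 1.7, 1.8]
adv = gae_advantages(rewards, values)

# On the first gradient step the policy has not moved, so the ratio is 1
# everywhere and the objective equals the mean advantage.
logp_old = np.array([-1.0, -0.5, -2.0, -0.1])
objective = clipped_surrogate(logp_old, logp_old, adv)
print(adv, objective)
```

Taking the elementwise minimum means a token with positive advantage stops contributing extra objective once its probability ratio exceeds `1 + clip_eps`, which is what limits how far several gradient steps on one batch can move the policy.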

Steps 1–2 are generation and inference; steps 3–4 are training. Real systems pour enormous engineering into overlapping them, because generation (slow, autoregressive) and the gradient update (a dense matmul) have very different performance profiles.

Four models in memory

Count the networks this loop holds at once:

  • the policy $\pi_\theta$: the model we’re training;
  • the reference $\pi_{\text{ref}}$: frozen, for the KL penalty;
  • the reward model: frozen, scores responses;
  • the critic: trained alongside the policy, estimates the value function for GAE.

Four models, two of them roughly the size of the model you ultimately ship. That is a lot of memory and orchestration, and it’s the main reason PPO-RLHF is operationally heavy.
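
A back-of-envelope calculation makes the footprint concrete. Everything in this sketch is an assumption rather than something the text specifies: frozen models are taken to hold bf16 weights only (2 bytes per parameter), trained models are taken to use mixed-precision Adam at roughly 16 bytes per parameter, activations and KV caches are ignored, and the 7B sizes are hypothetical.

```python
def rlhf_memory_gb(policy_params, critic_params, rm_params,
                   bytes_frozen=2, bytes_trained=16):
    """Rough weight and optimizer memory for the four PPO-RLHF models, in GB.

    Assumptions: frozen models hold bf16 weights only (2 bytes/param);
    trained models use mixed-precision Adam at about 16 bytes/param
    (bf16 weights and gradients plus fp32 master weights and two Adam
    moments). Activations and KV caches are not counted.
    """
    gb = 1e9
    return {
        "policy (trained)": policy_params * bytes_trained / gb,
        "reference (frozen)": policy_params * bytes_frozen / gb,  # same size as the policy
        "reward model (frozen)": rm_params * bytes_frozen / gb,
        "critic (trained)": critic_params * bytes_trained / gb,
    }

# Hypothetical 7B-parameter policy with a same-size critic and reward model.
budget = rlhf_memory_gb(7e9, 7e9, 7e9)
for name, size in budget.items():
    print(f"{name:>22}: {size:6.1f} GB")
print(f"{'total':>22}: {sum(budget.values()):6.1f} GB")
```

Under these assumptions the two trained models dominate the budget, which is one way to see why dropping the critic is an attractive simplification.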

Try it

The slider below sweeps the KL coefficient $\beta$. Watch the policy’s distribution pull away from the reference as $\beta$ shrinks, and snap back toward it as $\beta$ grows, and watch the resulting trade-off curve between reward earned and KL distance traveled. The sweet spot is a policy that has climbed the reward hill without wandering off the manifold of sensible language.

[Interactive figure: RLHF reward-vs-KL tradeoff. β is the KL-penalty coefficient, the leash length. Horizontal axis: KL from reference (drift); vertical axis: raw reward; regions along the curve are labeled reference, policy drift, and degeneration. Example reading at β = 0.20: raw reward 0.875, KL from reference 1.300, net = reward − β·KL = 0.615.]
RLHF optimizes reward(x) − β·KL(π ‖ π_ref). β is the leash length. Pull it too loose and the policy sprints up the reward model's gradient into weird, off-distribution text — it games the proxy reward instead of getting genuinely better. Pull it too tight and the model never leaves the reference, so it barely improves. The net objective peaks in the middle: the concave frontier means each extra unit of drift buys less and less real reward, so there's a sweet-spot β where the leash is just long enough.
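
The same trade-off can be reproduced on a toy problem. The sketch below relies on a standard result for KL-regularized objectives that the text does not state: over a finite set of responses, the maximizer of expected reward minus $\beta$ times the KL to the reference is proportional to $\pi_{\text{ref}}(y)\exp(r(y)/\beta)$. The five candidate responses, their reference probabilities, and their rewards are hypothetical.

```python
import numpy as np

def kl_regularized_optimum(ref_probs, rewards, beta):
    """Maximizer of E_pi[r] - beta * KL(pi || pi_ref) over a finite set:
    pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta)."""
    ref_probs = np.asarray(ref_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    logits = np.log(ref_probs) + rewards / beta
    logits -= logits.max()  # stabilize the softmax
    probs = np.exp(logits)
    return probs / probs.sum()

def kl(p, q):
    """KL(p || q) for discrete distributions, skipping zero-mass entries of p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical toy: five candidate responses.
ref_probs = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
rewards = np.array([0.2, 0.5, 0.6, 0.9, 1.0])

results = []
for beta in [5.0, 1.0, 0.2, 0.05]:  # from a tight leash to a loose one
    pi = kl_regularized_optimum(ref_probs, rewards, beta)
    results.append((beta, float(pi @ rewards), kl(pi, ref_probs)))
    print(f"beta={beta:<5} reward={results[-1][1]:.3f}  KL={results[-1][2]:.3f}")
```

As $\beta$ shrinks, both the expected reward and the KL from the reference rise: the policy concentrates on the highest-scoring response even though the reference considers it unlikely. In this toy the reward is the true objective, so it cannot show degeneration; with an imperfect reward model, that concentration is where proxy-gaming sets in.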

Where this leads

PPO-RLHF is powerful and, for years, was the default path from SFT model to aligned assistant. But its moving parts — a reward model to train and keep honest, a critic, a reference, an online sampling loop, and a temperamental β\beta — are a lot to get right. The next two stretches of this explainer are, in large part, reactions to that complexity. Reward hacking reward hacking When a policy finds ways to score high on the reward model without actually being better — exploiting quirks of an imperfect proxy. A central danger of RL post-training. See in glossary → and over-optimization get their own treatment in chapter 15. Then DPO DPO Direct Preference Optimization (Rafailov, 2023) — a closed-form supervised loss that optimizes the RLHF objective directly from preference pairs, with no separate reward model and no RL loop. See in glossary → (chapter 16) shows how to optimize essentially the same objective directly from preference pairs, with no reward model and no RL loop at all. And later, GRPO GRPO Group Relative Policy Optimization (Shao, 2024) — drop PPO’s critic; sample a group of responses per prompt and use their mean reward as the baseline, giving a group-relative advantage. Memory-cheap RL that powered DeepSeek-R1. See in glossary → (chapter 23) keeps the RL loop but drops the critic, slimming the memory footprint for large-scale reasoning RL. Each one trades away a piece of the machinery you just assembled.