Section 14

PPO for RLHF in practice

The KL-to-reference penalty and the loop

We have the algorithm: PPO with its clipped surrogate and a GAE advantage. Now we wire it into actual RLHF. Doing so surfaces a problem the clean theory hides: if you simply turn PPO loose to maximize the score from a learned reward model (RM), the policy will find the reward model’s blind spots and exploit them, drifting into fluent nonsense that the RM loves and humans hate. The fix, a KL penalty to a frozen reference model, is the single most important practical detail in PPO-based RLHF, and it’s where this chapter begins.

The real per-token reward

In the textbook setup the reward arrives once, at the end of the response. In RLHF that terminal reward is the reward-model score $r_{\text{RM}}(x, y)$ on the full prompt–response pair. But there’s a second term, applied at every token: a penalty for straying from a frozen reference model $\pi_{\text{ref}}$ (typically the SFT checkpoint we started from). The per-token reward the policy actually optimizes is:

$$
R_t = \underbrace{r_{\text{RM}}(x, y) \cdot \mathbb{1}[t = T]}_{\text{RM score, at the final token}} \;-\; \underbrace{\beta \, \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})}}_{\text{per-token KL penalty}}
$$

The RM score lands only on the last token; the second term is paid at every step. That second term is a per-token, single-sample estimate of the KL divergence between the policy and the reference, scaled by a coefficient $\beta$. Summed over the response and averaged over samples from the policy, the log-ratios give the sequence-level KL, so in expectation we are optimizing the RM score minus $\beta \, \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$ over the whole sequence. These per-token rewards feed straight into the GAE machinery from chapter 12 to produce the advantages PPO clips.
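As a minimal sketch of the formula above, here is the per-token reward computed for one response in plain Python. The function name `per_token_rewards` and the toy numbers are illustrative assumptions, not part of any particular library; it assumes you already have the log-probability of each sampled token under both the policy and the frozen reference.

```python
def per_token_rewards(rm_score, policy_logprobs, ref_logprobs, beta):
    """Return R_t for every response token (hypothetical helper).

    policy_logprobs[t] and ref_logprobs[t] are the log-probabilities of the
    sampled token y_t under the policy and under the frozen reference.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must have the same length")

    # Per-token KL penalty: -beta * (log pi_theta(y_t) - log pi_ref(y_t)).
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]

    # The reward-model score lands only on the final token (t = T).
    if rewards:
        rewards[-1] += rm_score
    return rewards


# Toy three-token response; all numbers are made up for illustration.
policy_lp = [-1.0, -0.5, -2.0]
ref_lp = [-1.5, -0.5, -1.0]
rewards = per_token_rewards(rm_score=2.0, policy_logprobs=policy_lp,
                            ref_logprobs=ref_lp, beta=0.1)
print(rewards)
```

Note that a token the policy likes more than the reference does is penalized, while a token it likes less earns a small bonus; only the sum over many samples behaves like a true (non-negative) KL.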

Why the KL-to-reference penalty exists

Two reasons, both essential.

It blocks reward hacking. A reward model is a learned, imperfect proxy for human preference. Optimize any proxy hard enough and you stop optimizing the thing it stands for: the policy discovers inputs where the RM is wrong and pumps them, producing text that scores 9.8 while being repetitive, sycophantic, or gibberish. The KL penalty tethers the policy to the distribution of fluent, sensible language defined by the reference model: the further the policy drifts to chase RM quirks, the bigger the penalty it pays. This is a preview of reward hacking, the subject of chapter 15.

It preserves what SFT taught. The reference is usually the SFT model — already fluent, already instruction-following. Without the anchor, RL optimization can erode that hard-won competence in pursuit of reward. The KL term keeps the policy in the neighborhood of a known-good model, so RLHF refines the SFT behavior rather than overwriting it.

The loop

Here is PPO-RLHF as it actually runs, one iteration:

1. Sample rollouts. Take a batch of prompts and generate responses from the current policy. Because we generate fresh data from the live policy each round, this is online RL.
2. Score. Run the reward model on each completed response to get $r_{\text{RM}}$, and compute the per-token KL penalty against the reference model.
3. Estimate advantages. Feed the per-token rewards through the critic and GAE to get $\hat{A}_t$ for every token.
4. PPO update. Take several gradient steps on this batch using the clipped surrogate objective, updating both the policy and the critic. The clip keeps each update inside the trust region.
5. Repeat with fresh rollouts from the now-slightly-better policy.
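The advantage-estimation and PPO-update steps above can be sketched for a single response in plain Python. This is a toy under stated assumptions, not a training implementation: the function names, the defaults `gamma=1.0`, `lam=0.95`, and `clip_eps=0.2`, and the convention that the value after the final token is zero are all choices made here for illustration. A real system would compute the same quantities on batched tensors and differentiate the objective with an autodiff framework.

```python
import math


def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one response.

    rewards[t] is the per-token reward R_t; values[t] is the critic's
    estimate V(s_t). Assumption: the value after the last token is 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def clipped_surrogate(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Mean PPO clipped objective over tokens (a quantity to maximize)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_theta / pi_old for this token
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Toy numbers for a three-token response (made up for illustration).
rewards = [-0.05, 0.0, 2.1]       # per-token rewards: KL penalty plus final RM score
values = [1.0, 1.2, 1.5]          # critic estimates at each token
old_lp = [-1.0, -0.5, -2.0]       # log-probs when the rollout was sampled
new_lp = [-0.9, -0.5, -1.7]       # log-probs after some gradient steps

adv = gae_advantages(rewards, values)
print(adv, clipped_surrogate(new_lp, old_lp, adv))
```

Because the objective takes the minimum of the clipped and unclipped terms, a token whose ratio has already moved past the clip range in the direction the advantage favors contributes no further gradient, which is what keeps several gradient steps on one batch from running away.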

Steps 1–2 are generation and inference; steps 3–4 are training. Real systems pour enormous engineering into overlapping them, because the two phases stress hardware differently: generation is slow and autoregressive, producing one token at a time, while the gradient update is dominated by large dense matrix multiplications over the whole batch.

Four models in memory

Count the networks this loop holds at once:

- the policy $\pi_\theta$: the model we’re training;
- the reference $\pi_{\text{ref}}$: frozen, for the KL penalty;
- the reward model: frozen, scores responses;
- the critic: trained alongside the policy, estimates the value function for GAE.

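Four models, at least two of them (the policy and the reference) the size of the model you ultimately ship, and two of them (the policy and the critic) carrying gradients and optimizer state on top of their weights. That is a lot of memory and orchestration, and it’s the main reason PPO-RLHF is operationally heavy.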
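To make the footprint concrete, here is a back-of-envelope estimate in Python. Everything numeric in it is an assumption chosen for illustration: 2 bytes per parameter for frozen models (16-bit weights only), 16 bytes per parameter for trained models (16-bit weights and gradients, plus 32-bit master weights and two 32-bit Adam moments), and a critic the same size as the reward model. It ignores activations, the KV cache used during generation, and any sharding or offloading, so treat the output as a rough lower bound rather than a measurement.

```python
BYTES_FROZEN = 2      # assumption: 16-bit weights only, no gradients
BYTES_TRAINABLE = 16  # assumption: 2 (weights) + 2 (grads) + 4 (master) + 4 + 4 (Adam moments)


def ppo_rlhf_memory_gb(policy_params, reward_params, critic_params=None):
    """Rough weight-and-optimizer memory per model, in GB (1 GB = 1e9 bytes)."""
    if critic_params is None:
        critic_params = reward_params  # assumption: critic sized like the reward model
    sizes_bytes = {
        "policy": policy_params * BYTES_TRAINABLE,    # trained
        "reference": policy_params * BYTES_FROZEN,    # frozen copy of the starting policy
        "reward_model": reward_params * BYTES_FROZEN, # frozen
        "critic": critic_params * BYTES_TRAINABLE,    # trained alongside the policy
    }
    return {name: size / 1e9 for name, size in sizes_bytes.items()}


# Hypothetical setup: a 7B-parameter policy with a 7B-parameter reward model.
estimate = ppo_rlhf_memory_gb(7_000_000_000, 7_000_000_000)
for name, gb in estimate.items():
    print(f"{name:>12}: {gb:6.1f} GB")
print(f"{'total':>12}: {sum(estimate.values()):6.1f} GB")
```

Under these assumptions the two trained models dominate: dropping the critic removes far more memory than dropping a frozen model of the same size.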

Heavy, but it's a trade-off — not categorically 'the most expensive'

It’s tempting to call this the priciest stage of building a model, but that’s misleading. The four-model footprint buys a strong, general, online optimization signal, and the actual cost depends entirely on how many rollouts and prompts you run. The whole arc of the next chapters is about trading against this cost: DPO removes the reward model, the RL loop, and the online sampling; GRPO removes the critic. Think of PPO-RLHF as one point on a cost/capability frontier, not the ceiling.

Try it

The slider below sweeps the KL coefficient $\beta$. Watch the policy’s distribution pull away from the reference as $\beta$ shrinks, and snap back toward it as $\beta$ grows, and watch the resulting trade-off curve between reward earned and KL distance traveled. The sweet spot is a policy that has climbed the reward hill without wandering off the manifold of sensible language.

[Interactive figure: RLHF reward-vs-KL tradeoff. A slider sets the KL-penalty coefficient β, the leash length, from 0.01 (loose) to 1.2 (tight), and shows where the policy settles on the reward-vs-drift frontier. Example reading at β = 0.20 (balanced: real reward gain while staying recognizably on-distribution): raw reward 0.875, KL from reference 1.300, net = reward − β·KL = 0.615. Labels: reference, policy drift, degeneration.]

RLHF optimizes reward(x) − β·KL(π ‖ π_ref). β is the leash length. Pull it too loose and the policy sprints up the reward model's gradient into weird, off-distribution text — it games the proxy reward instead of getting genuinely better. Pull it too tight and the model never leaves the reference, so it barely improves. The net objective peaks in the middle: the concave frontier means each extra unit of drift buys less and less real reward, so there's a sweet-spot β where the leash is just long enough.
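If you want to reproduce the shape of this trade-off offline, a small Python sketch follows. It relies on a standard result: over a fixed set of candidate responses, the policy that maximizes expected reward minus β·KL to the reference is the reference reweighted by exp(reward / β). The three candidate responses, their reference probabilities, and their reward scores are invented for illustration, and the toy takes the reward at face value, so it shows reward and KL both growing as β shrinks but does not model the proxy reward being wrong at high drift.

```python
import math


def kl_regularized_optimum(ref_probs, rewards, beta):
    """Policy maximizing E[reward] - beta * KL(pi || pi_ref) on a finite set."""
    logits = [math.log(q) + r / beta for q, r in zip(ref_probs, rewards)]
    top = max(logits)  # subtract the max before exponentiating, for stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_reward(probs, rewards):
    return sum(p * r for p, r in zip(probs, rewards))


def kl_divergence(probs, ref_probs):
    return sum(p * math.log(p / q) for p, q in zip(probs, ref_probs) if p > 0.0)


def net_objective(reward, kl, beta):
    return reward - beta * kl


# Hypothetical candidates: the reference favors the first, the reward model the last.
ref_probs = [0.70, 0.25, 0.05]
rewards = [0.2, 0.6, 1.0]

for beta in (1.2, 0.2, 0.01):  # tight leash to loose leash
    policy = kl_regularized_optimum(ref_probs, rewards, beta)
    reward = expected_reward(policy, rewards)
    kl = kl_divergence(policy, ref_probs)
    print(f"beta={beta:<5} reward={reward:.3f} KL={kl:.3f} "
          f"net={net_objective(reward, kl, beta):.3f}")
```

With a tight leash the policy stays close to the reference distribution; with a loose one it piles almost all of its probability onto the single highest-scoring response, which is exactly the behavior that becomes dangerous when that score is a flawed proxy.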

Where this leads

PPO-RLHF is powerful and, for years, was the default path from SFT model to aligned assistant. But its moving parts — a reward model to train and keep honest, a critic, a reference, an online sampling loop, and a temperamental $\beta$ — are a lot to get right. The next two stretches of this explainer are, in large part, reactions to that complexity. Reward hacking and over-optimization get their own treatment in chapter 15. Then DPO (chapter 16) shows how to optimize essentially the same objective directly from preference pairs, with no reward model and no RL loop at all. And later, GRPO (chapter 23) keeps the RL loop but drops the critic, slimming the memory footprint for large-scale reasoning RL. Each one trades away a piece of the machinery you just assembled.