TRPO to PPO
Trust regions and the clipped surrogate
We now have a clean, low-variance learning signal — the advantage from the last chapter. The remaining danger is the step size. Policy gradients tell you a direction, not a safe distance, and in RL a single over-eager step can be catastrophic in a way it never is in supervised learning. This chapter is about why, and about the fix that became the workhorse of RLHF: PPO and its clipped surrogate objective. If you internalize one algorithm from this whole explainer, make it this one.
Why one big step can destroy a policy
In supervised learning, the data is fixed. If you overshoot, the loss goes up, the next gradient points back, and you recover. In RL the data is generated by the policy itself. Take too large a step and you change the very distribution that produces your training data. The policy might lurch into a region where it generates garbage, every rollout from that broken policy scores near zero, the advantages collapse to noise, and there is no useful gradient left to climb back out. The feedback loop that was helping you is now actively hurting you. RL training can fall off a cliff and never return.
So we want to improve the policy, but only cautiously, never letting the new policy stray too far from the old one in a single update. We need to measure that “distance,” and the natural ruler is the KL divergence between the old and new policy distributions.
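To make the “ruler” concrete, here is a minimal sketch that measures the KL divergence between an old policy and two candidate updates. The three-action distributions and the helper name `kl_divergence` are illustrative assumptions, not values from any real model:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical action probabilities for a single state (illustrative only).
old_policy = [0.50, 0.30, 0.20]
small_step = [0.45, 0.33, 0.22]   # a cautious update
big_step = [0.05, 0.05, 0.90]     # an over-eager update

kl_small = kl_divergence(old_policy, small_step)
kl_big = kl_divergence(old_policy, big_step)

print(f"KL(old || small step) = {kl_small:.4f}")
print(f"KL(old || big step)   = {kl_big:.4f}")
```

The cautious update barely registers, while the over-eager one is larger by orders of magnitude. A bound on this number is what the next section calls a trust region.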
TRPO: improve, but stay in a trust region
This is the idea behind Trust Region Policy Optimization (TRPO, Schulman 2015). Define a trust region: a neighborhood around the current policy within which we trust our local estimate of “improvement” to be reliable. TRPO maximizes the expected advantage subject to a hard KL constraint:

$$\max_\theta \; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\,\hat{A}_t\right] \quad \text{subject to} \quad \mathbb{E}_t\!\left[\mathrm{KL}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\right)\right] \le \delta$$
Two pieces to unpack. The constraint says: take the biggest improving step you can, but don’t let the new policy diverge from the old one by more than $\delta$ in KL. That is the guard rail against the cliff. The objective contains a new and important ratio.
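As a rough illustration of the constraint alone, the sketch below shrinks a proposed step until the new policy lands inside the KL budget. It assumes a toy softmax policy over three actions and a made-up update direction, and it checks only the KL bound. It is not TRPO’s actual solver, which also works with the Fisher matrix and checks that the surrogate improves. All names here are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def largest_step_in_trust_region(logits, direction, delta,
                                 step=1.0, shrink=0.5, max_tries=30):
    """Halve the step until KL(old || new) <= delta (backtracking)."""
    old_probs = softmax(logits)
    for _ in range(max_tries):
        candidate = [l + step * d for l, d in zip(logits, direction)]
        if kl(old_probs, softmax(candidate)) <= delta:
            return step, candidate
        step *= shrink
    return 0.0, list(logits)  # give up and keep the old policy

old_logits = [0.0, 0.0, 0.0]      # toy policy: uniform over three actions
direction = [4.0, -2.0, -2.0]     # hypothetical "improving" direction
delta = 0.01                      # illustrative KL budget

step, new_logits = largest_step_in_trust_region(old_logits, direction, delta)
achieved_kl = kl(softmax(old_logits), softmax(new_logits))
print(f"accepted step size = {step}, KL = {achieved_kl:.5f}")
```

The full step would overshoot the budget badly, so only a small fraction of it is accepted. That is the guard rail in miniature.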
The importance-sampling ratio
That fraction in the objective,

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},$$

is the importance-sampling ratio. It exists to solve the off-policy problem from chapter 11. We generated our rollouts with the old policy $\pi_{\theta_{\text{old}}}$, but we want to optimize the new policy $\pi_\theta$. Importance sampling corrects for the mismatch: it reweights each old sample by how much more (or less) likely the new policy is to have produced it. If the new policy now favors a good action, $r_t > 1$ and its advantage counts for more; if it has turned away from it, $r_t < 1$.
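In practice the ratio is usually computed from log-probabilities, since that is what rollouts tend to store. A minimal sketch, assuming three sampled actions with made-up probabilities and advantages (the function name `importance_ratios` is hypothetical):

```python
import math

def importance_ratios(new_logprobs, old_logprobs):
    """r_t = pi_new(a_t | s_t) / pi_old(a_t | s_t), computed in log space."""
    return [math.exp(new - old) for new, old in zip(new_logprobs, old_logprobs)]

# Hypothetical log-probabilities of three sampled actions (illustrative only).
old_logprobs = [math.log(0.20), math.log(0.50), math.log(0.10)]  # stored at rollout time
new_logprobs = [math.log(0.30), math.log(0.50), math.log(0.05)]  # recomputed under the current policy
advantages = [1.0, 0.2, -1.0]

ratios = importance_ratios(new_logprobs, old_logprobs)

# Unclipped surrogate: the ratio-weighted advantage, averaged over the batch.
surrogate = sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)

for r, a in zip(ratios, advantages):
    print(f"ratio = {r:.2f}, advantage = {a:+.1f}")
print(f"surrogate = {surrogate:.3f}")
```

The first action became more likely (ratio above 1), the second is unchanged (ratio exactly 1), and the third became less likely (ratio below 1).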
This ratio is what lets us take several gradient steps on one batch of rollouts instead of throwing the data away after a single update, which is a major efficiency win. But it’s also dangerous: if the ratio $r_t$ runs away from 1, we’re extrapolating wildly from samples the new policy would rarely generate, and the estimate becomes meaningless. That’s precisely what the trust region must contain.
TRPO enforces the constraint explicitly, using second-order (Fisher-matrix) machinery. It works, but it is complicated and expensive to implement at scale. The natural question: can we get the same “don’t move too far” behavior with plain first-order gradient steps?
PPO: clip the ratio and call it a day
The answer is Proximal Policy Optimization (PPO, Schulman 2017), and its trick is almost embarrassingly simple. Instead of a hard KL constraint, PPO bakes the “stay close” pressure directly into the objective by clipping the ratio. The clipped surrogate objective is:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\; \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$
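Here $\epsilon$ is a small constant (typically 0.1 or 0.2). The clip function pins the ratio $r_t(\theta)$ to the interval $[1-\epsilon,\ 1+\epsilon]$. We compute the surrogate two ways, once with the true ratio and once with the clipped ratio, and take the minimum. That minimum is what makes the whole thing work, and it’s worth walking through both cases carefully.

Positive advantage (a good action). The objective rises with the ratio until it reaches $1+\epsilon$, then goes flat: once the action is sufficiently more likely, there is nothing more to gain by pushing further. If the ratio has instead dropped below $1-\epsilon$, the unclipped term is the smaller one, so the min keeps it and the gradient still pulls the action’s probability back up.

Negative advantage (a bad action). This is the mirror image: the objective improves as the ratio falls until it reaches $1-\epsilon$, then goes flat. If the ratio has instead climbed above $1+\epsilon$, meaning the policy has made a bad action more likely, the unclipped term is again the smaller one, so the penalty keeps growing and the gradient pushes the probability back down. Clipping removes the incentive to overshoot in the helpful direction, but never the incentive to undo a mistake.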
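A minimal per-sample sketch of the clip-and-min rule in plain Python. The function names and the example ratios are illustrative assumptions, not taken from any particular library:

```python
def clip(x, low, high):
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(x, high))

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)

def batch_objective(ratios, advantages, eps=0.2):
    """Mean clipped surrogate over a batch (a quantity to maximize)."""
    values = [clipped_surrogate(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(values) / len(values)

# Evaluate both advantage signs at a ratio below, at, and above the band.
for advantage in (+1.0, -1.0):
    for ratio in (0.5, 1.0, 1.5):
        value = clipped_surrogate(ratio, advantage)
        print(f"A = {advantage:+.0f}, r = {ratio:.1f} -> objective = {value:+.2f}")
```

With a positive advantage the value at r = 1.5 is capped at 1.2, while at r = 0.5 it is left unclipped. With a negative advantage the pattern flips: r = 0.5 is capped at -0.8, and r = 1.5 is left at -1.5.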
The effect is a cheap, first-order approximation of TRPO’s trust region. There is no KL constraint to solve and no second-order matrix, just a clamp and a min, both one line of code. Because the gradient goes flat outside the clip band, a sample whose ratio has already left the band in the favorable direction stops pushing, so no single batch has much incentive to drag the policy far from the old policy $\pi_{\theta_{\text{old}}}$. That is the stability TRPO bought with far more machinery. That trade, almost all of the robustness for a fraction of the complexity, is why PPO, not TRPO, became the standard.
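One way to see the flat gradient is to estimate the slope of the per-sample objective with respect to the ratio. The sketch below uses a central finite difference; the helper names and test points are assumptions chosen for illustration:

```python
def surrogate(ratio, advantage, eps=0.2):
    """Per-sample clipped objective: min(r * A, clip(r) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

def slope(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference estimate of d(objective) / d(ratio)."""
    up = surrogate(ratio + h, advantage, eps)
    down = surrogate(ratio - h, advantage, eps)
    return (up - down) / (2.0 * h)

cases = [
    (1.0, +1.0, "inside the band"),
    (1.5, +1.0, "good action already pushed past 1 + eps"),
    (0.5, +1.0, "good action made less likely"),
    (0.5, -1.0, "bad action already pushed below 1 - eps"),
    (1.5, -1.0, "bad action made more likely"),
]
for ratio, advantage, label in cases:
    print(f"r = {ratio:.1f}, A = {advantage:+.0f}: slope = {slope(ratio, advantage):+.2f}  ({label})")
```

The slope is zero only where the ratio has already moved past the band in the direction the advantage favors. Where the policy has moved the wrong way, the slope is still nonzero, so the update can push back.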
Try it
The plot below shows the clipped surrogate as a function of the probability ratio $r_t$. Flip the sign of the advantage and slide $\epsilon$. Watch the objective change linearly with $r_t$ and then go flat at the clip boundary; that flat region is the trust region in disguise. Notice how the flat side switches depending on whether the advantage is positive or negative, and how a wider $\epsilon$ permits bigger steps before the brakes engage.
Once the clipped objective clicks — linear improvement up to the band, flat beyond it, and the min keeping a recovery path open — PPO holds no more mysteries. The remaining question is purely practical: how do you wire this into an actual RLHF pipeline, with a reward model, a reference model, and a KL penalty keeping the policy honest? That’s the next chapter.