Section 13

TRPO to PPO

Trust regions and the clipped surrogate

We now have a clean, low-variance learning signal: the advantage $\hat{A}_t$ from the last chapter. The remaining danger is the step size. Policy gradients tell you a direction, not a safe distance, and in RL a single over-eager step can be catastrophic in a way it never is in supervised learning. This chapter is about why, and about the fix that became the workhorse of RLHF: PPO and its clipped surrogate objective. If you internalize one algorithm from this whole explainer, make it this one.

Why one big step can destroy a policy

In supervised learning, the data is fixed. If you overshoot, the loss goes up, the next gradient points back, and you recover. In RL the data is generated by the policy itself. Take too large a step and you change the very distribution that produces your training data. The policy might lurch into a region where it generates garbage, every rollout from that broken policy scores near zero, the advantages collapse to noise, and there is no useful gradient left to climb back out. The feedback loop that was helping you is now actively hurting you. RL training can fall off a cliff and never return.

So we want to improve the policy, but only cautiously, never letting the new policy stray too far from the old one in a single update. We need to measure that “distance,” and the natural ruler is the KL divergence (Kullback–Leibler divergence, a measure of how far one probability distribution is from another) between the old and new policy distributions.

TRPO: improve, but stay in a trust region

This is the idea behind Trust Region Policy Optimization (TRPO, Schulman et al., 2015). Define a trust region: a neighborhood around the current policy, measured in KL divergence, within which we trust our local estimate of “improvement” to be reliable. TRPO maximizes the expected advantage subject to a hard KL constraint:

$$\max_\theta \; \mathbb{E}\!\left[\, \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \, \hat{A} \,\right] \quad \text{subject to} \quad \mathbb{E}\big[\, \mathrm{KL}(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta) \,\big] \le \delta$$

Two pieces to unpack. The constraint says: take the biggest improving step you can, but don’t let the new policy diverge from the old one by more than $\delta$ in KL. That is the guard rail against the cliff. The objective contains a new and important ratio.
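To make the constraint concrete, here is a minimal Python sketch for a single state with three actions. Everything in it is an illustrative assumption rather than TRPO itself: the softmax policy, the made-up advantages, the radius delta = 0.01, and the helper names (softmax, kl_divergence, surrogate, evaluate_step). It only checks whether candidate steps of different sizes land inside the trust region; it does not solve the constrained problem.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def surrogate(p_old, p_new, advantages):
    """E_{a ~ p_old}[(p_new(a) / p_old(a)) * A(a)], summed exactly over actions."""
    return sum(po * (pn / po) * adv for po, pn, adv in zip(p_old, p_new, advantages))

# Hypothetical single-state problem with three actions.
old_logits = [0.0, 0.0, 0.0]
advantages = [1.0, 0.0, -1.0]   # made-up advantage per action
delta = 0.01                    # illustrative trust-region radius
p_old = softmax(old_logits)

def evaluate_step(step_size):
    """Move the logits along the advantages and report (surrogate, KL)."""
    new_logits = [l + step_size * a for l, a in zip(old_logits, advantages)]
    p_new = softmax(new_logits)
    return surrogate(p_old, p_new, advantages), kl_divergence(p_old, p_new)

for step_size in (0.1, 0.5, 2.0):
    gain, kl = evaluate_step(step_size)
    print(f"step={step_size:.1f}  surrogate={gain:.4f}  KL={kl:.4f}  inside={kl <= delta}")
```

Under these assumptions the smallest step stays inside the region while the larger ones leave it, even though they score a higher surrogate value. That is the tension the constraint resolves.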

The importance-sampling ratio

That fraction in the objective,

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},$$

is the importance-sampling ratio. It exists to solve the off-policy problem from chapter 11. We generated our rollouts with the old policy $\pi_{\theta_{\text{old}}}$, but we want to optimize the new policy $\pi_\theta$. Importance sampling corrects for the mismatch: it reweights each old sample by how much more (or less) likely the new policy is to have produced it. If the new policy now favors a good action, $r_t > 1$ and its advantage counts for more; if it has turned away from it, $r_t < 1$ and its advantage counts for less.

This ratio is what lets us take several gradient steps on one batch of rollouts instead of throwing the data away after a single update, which is a major efficiency win. But it’s also dangerous: if $r_t$ runs away from 1, we’re extrapolating wildly from samples the new policy would rarely generate, and the estimate becomes meaningless. That’s precisely what the trust region must contain.
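The reweighting can be checked numerically. The sketch below is a toy setup, not part of any RL library: p_old, p_new, and the per-action payoff are invented numbers, and ratio is a hypothetical helper. It estimates the expected payoff under the new policy using only actions drawn from the old one.

```python
import math
import random

def ratio(logp_new, logp_old):
    """Importance ratio r = pi_new(a|s) / pi_old(a|s), from log-probabilities."""
    return math.exp(logp_new - logp_old)

# Hypothetical single-state policies over three actions.
p_old = [0.5, 0.3, 0.2]
p_new = [0.6, 0.3, 0.1]
payoff = [1.0, 0.0, -1.0]   # made-up payoff per action

# Ground truth: expected payoff under the NEW policy.
exact_new = sum(p * x for p, x in zip(p_new, payoff))

# Monte Carlo estimate using only actions sampled from the OLD policy.
rng = random.Random(0)      # fixed seed so the run is repeatable
n = 100_000
actions = rng.choices(range(3), weights=p_old, k=n)
weighted = [ratio(math.log(p_new[a]), math.log(p_old[a])) * payoff[a] for a in actions]
estimate = sum(weighted) / n

# Naive average of the same samples, with no reweighting.
unweighted = sum(payoff[a] for a in actions) / n

print(f"exact under new policy : {exact_new:.3f}")
print(f"importance-weighted    : {estimate:.3f}")
print(f"unweighted (old policy): {unweighted:.3f}")
```

The unweighted average tracks the old policy's expected payoff, while the ratio-weighted one tracks the new policy's. The ratios here stay close to 1; the further they drift, the noisier such an estimate becomes.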

TRPO enforces the constraint explicitly, using second-order (Fisher-matrix) machinery. It works, but it is complicated and expensive to implement at scale. The natural question: can we get much the same “don’t move too far” behavior with plain first-order gradient steps?

PPO: clip the ratio and call it a day

The answer is Proximal Policy Optimization (PPO, Schulman et al., 2017), and its trick is almost embarrassingly simple. Instead of a hard KL constraint, PPO bakes the “stay close” pressure directly into the objective by clipping the ratio. The clipped surrogate objective is:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\, \min\big(\, r_t(\theta)\, \hat{A}_t, \;\; \mathrm{clip}\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat{A}_t \,\big) \Big]$$

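Here $\epsilon$ is a small constant (typically $0.1$ or $0.2$). The $\mathrm{clip}$ function pins $r_t$ to the interval $[1-\epsilon, 1+\epsilon]$. We compute the surrogate two ways, once with the true ratio and once with the clipped ratio, and take the minimum. That minimum is what makes the whole thing work, and it’s worth walking through both cases carefully.

When $\hat{A}_t > 0$, the objective rises with $r_t$ only up to $1+\epsilon$ and is flat beyond it, so there is nothing to gain from pushing a good action’s probability further. When $\hat{A}_t < 0$, the objective goes flat once $r_t$ drops below $1-\epsilon$, so there is nothing to gain from suppressing a bad action further. On the opposite side of the band in each case, the minimum selects the unclipped term, so a policy that has moved the wrong way still receives a gradient pulling it back.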

The effect is a cheap, first-order approximation of TRPO’s trust region. There is no KL constraint to solve and no second-order matrix, just a clamp and a min, each one line of code. Because the gradient goes flat outside the clip band, no single batch gives the policy any incentive to move far from $\pi_{\theta_{\text{old}}}$, which is much the same stability TRPO bought with far more machinery. That trade (almost all of the robustness, a fraction of the complexity) is why PPO, not TRPO, became the standard.
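As a rough illustration of how little code the clamp and the min need, here is a plain-Python sketch of the batch objective. The function name ppo_clip_objective, its arguments, and the four-sample batch are assumptions made for this example; a real implementation would operate on tensors inside an autodiff framework and would typically minimize the negative of this quantity.

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate over a batch (a quantity to maximize)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        r = math.exp(ln - lo)                           # importance ratio r_t
        r_clipped = max(1.0 - eps, min(r, 1.0 + eps))   # the clamp
        total += min(r * adv, r_clipped * adv)          # the min
    return total / len(advantages)

# Hypothetical batch: the ratios work out to 1.5, 0.5, 1.5, 0.5.
logp_old = [math.log(0.5)] * 4
logp_new = [math.log(0.75), math.log(0.25), math.log(0.75), math.log(0.25)]
advantages = [1.0, 1.0, -1.0, -1.0]

objective = ppo_clip_objective(logp_new, logp_old, advantages)
print(f"clipped surrogate: {objective:.4f}")

# Before any update the new policy equals the old one, every ratio is 1,
# and the objective reduces to the mean advantage.
print(f"at r = 1: {ppo_clip_objective(logp_old, logp_old, advantages):.4f}")
```

In this batch the first and last samples sit in the flat region (ratio 1.5 with a positive advantage, ratio 0.5 with a negative one), so they contribute the capped values 1.2 and -0.8. The middle two are on the recovery side and contribute their unclipped values 0.5 and -1.5.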

Try it

The plot below shows the clipped surrogate as a function of the probability ratio $r_t$. Flip the sign of the advantage and slide $\epsilon$. Watch the objective rise linearly with $r_t$ and then go flat at the clip boundary; that flat region is the trust region in disguise. Notice how the flat side switches depending on whether the advantage is positive or negative, and how a wider $\epsilon$ permits bigger steps before the brakes engage.

[Interactive figure: PPO clipped surrogate objective. Plotted over the probability ratio r = π_new/π_old, with 1−ε and 1+ε marked on the r axis: the raw term r·A (unclipped) versus PPO's clipped objective. The flat region is the cap.]
PPO maximizes min( r·A, clip(r, 1−ε, 1+ε)·A ). When A > 0 (the action was good), the objective stops rising once r > 1+ε, so a single good sample can't shove the policy arbitrarily far. When A < 0 (the action was bad), the cap mirrors to the other side: the objective goes flat once r < 1−ε. In the flat clipped zone the gradient is zero, which is how PPO approximates a trust region without the expensive second-order math of TRPO.
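For readers without the interactive plot, the same curve can be tabulated. This is a sketch with hypothetical helper names (clipped_term, slope); the slope is a finite-difference estimate with respect to r, standing in for the gradient signal a sample would contribute.

```python
def clipped_term(r, adv, eps):
    """Per-sample PPO objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    r_clipped = max(1.0 - eps, min(r, 1.0 + eps))
    return min(r * adv, r_clipped * adv)

def slope(r, adv, eps, h=1e-6):
    """Central finite-difference estimate of d(objective)/dr."""
    return (clipped_term(r + h, adv, eps) - clipped_term(r - h, adv, eps)) / (2 * h)

eps = 0.2   # illustrative clip width
for adv in (+1.0, -1.0):
    print(f"A = {adv:+.0f}")
    for r in (0.5, 0.9, 1.0, 1.1, 1.5):
        value = clipped_term(r, adv, eps)
        print(f"  r={r:.1f}  objective={value:+.2f}  slope={slope(r, adv, eps):+.2f}")
```

With A = +1 the slope is zero only at r = 1.5, beyond 1+ε; with A = −1 it is zero only at r = 0.5, below 1−ε. Everywhere else the slope equals A, which includes the recovery side where the policy has moved the wrong way.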

Once the clipped objective clicks — linear improvement up to the band, flat beyond it, and the min keeping a recovery path open — PPO holds no more mysteries. The remaining question is purely practical: how do you wire this into an actual RLHF pipeline, with a reward model, a reference model, and a KL penalty keeping the policy honest? That’s the next chapter.