Section 13

TRPO to PPO

Trust regions and the clipped surrogate

We now have a clean, low-variance learning signal — the advantage $\hat{A}_t$ from the last chapter. The remaining danger is the step size. Policy gradients tell you a direction, not a safe distance, and in RL a single over-eager step can be catastrophic in a way it never is in supervised learning. This chapter is about why, and about the fix that became the workhorse of RLHF: PPO and its clipped surrogate objective. If you internalize one algorithm from this whole explainer, make it this one.

Why one big step can destroy a policy

In supervised learning, the data is fixed. If you overshoot, the loss goes up, the next gradient points back, and you recover. In RL the data is generated by the policy itself. Take too large a step and you change the very distribution that produces your training data. The policy might lurch into a region where it generates garbage, every rollout from that broken policy scores near zero, the advantages collapse to noise, and there is no useful gradient left to climb back out. The feedback loop that was helping you is now actively hurting you. RL training can fall off a cliff and never return.

So we want to improve the policy, but only cautiously — never letting the new policy stray too far from the old one in a single update. We need to measure that “distance,” and the natural ruler is the KL divergence between the old and new policy distributions.
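As a rough illustration of that ruler, the sketch below computes the KL divergence between an old action distribution and two candidate updates. The three-action distributions are made-up numbers for a single state, and `kl_divergence` is a helper written for this example, not a library function.

```python
import math

def kl_divergence(p_old, p_new):
    """KL(p_old || p_new) for two categorical distributions over the same actions."""
    return sum(p * math.log(p / q) for p, q in zip(p_old, p_new) if p > 0.0)

# Hypothetical action probabilities for one state (three possible actions).
pi_old = [0.50, 0.30, 0.20]
small_step = [0.55, 0.27, 0.18]   # a cautious update
large_step = [0.95, 0.03, 0.02]   # an over-eager update

kl_small = kl_divergence(pi_old, small_step)
kl_large = kl_divergence(pi_old, large_step)

print(f"KL(old || small step) = {kl_small:.4f}")
print(f"KL(old || large step) = {kl_large:.4f}")
```

The cautious update yields a KL that is orders of magnitude smaller than the over-eager one, which is the property a step-size limit can be built on. A real policy would average this quantity over many sampled states rather than evaluating it at one.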

TRPO: improve, but stay in a trust region

This is the idea behind Trust Region Policy Optimization (TRPO; Schulman et al., 2015). Define a trust region: a neighborhood around the current policy within which we trust our local estimate of “improvement” to be reliable. TRPO maximizes the expected advantage subject to a hard KL constraint:

$$\max_\theta \; \mathbb{E}\!\left[\, \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \, \hat{A} \,\right] \quad \text{subject to} \quad \mathbb{E}\big[\, \mathrm{KL}(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta) \,\big] \le \delta$$

Two pieces to unpack. The constraint says: take the biggest improving step you can, but don’t let the new policy diverge from the old one by more than $\delta$ in KL. That is the guard rail against the cliff. The objective contains a new and important ratio.

The importance-sampling ratio

That fraction in the objective,

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)},$$

is the importance-sampling ratio. It exists to solve the off-policy problem from chapter 11. We generated our rollouts with the old policy $\pi_{\theta_{\text{old}}}$, but we want to optimize the new policy $\pi_\theta$. Importance sampling corrects for the mismatch: it reweights each old sample by how much more (or less) likely the new policy is to have produced it. If the new policy now favors an action, $r_t > 1$ and its advantage counts for more; if it has turned away from it, $r_t < 1$ and its advantage counts for less.

This ratio is what lets us take several gradient steps on one batch of rollouts instead of throwing the data away after a single update — a major efficiency win. But it’s also dangerous: if $r_t$ runs away from 1, we’re extrapolating wildly from samples the new policy would rarely generate, and the estimate becomes meaningless. That’s precisely what the trust region must contain.
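A minimal sketch of how the ratio and the unclipped surrogate might be computed for a small batch. It works from log-probabilities, since that is what language-model code usually has on hand and the subtraction is numerically safer than dividing tiny probabilities. The log-probabilities and advantages are invented values, and `importance_ratio` is a name chosen for this example.

```python
import math

def importance_ratio(logp_new, logp_old):
    """r_t = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t), from log-probabilities."""
    return math.exp(logp_new - logp_old)

# Hypothetical log-probs of the actions that were actually sampled by the old policy.
logp_old = [math.log(0.20), math.log(0.50), math.log(0.10)]
# Log-probs of those same actions under the policy being optimized.
logp_new = [math.log(0.30), math.log(0.50), math.log(0.05)]
advantages = [1.0, 0.5, -2.0]

ratios = [importance_ratio(n, o) for n, o in zip(logp_new, logp_old)]

# Unclipped surrogate: the batch mean of r_t * A_t.
surrogate = sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)

for r, a in zip(ratios, advantages):
    print(f"ratio = {r:.2f}, advantage = {a:+.1f}, weighted = {r * a:+.2f}")
print(f"unclipped surrogate = {surrogate:.4f}")
```

The first sample has become 1.5 times as likely, so its advantage is scaled up; the third has become half as likely, so its negative advantage is scaled down.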

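TRPO enforces the constraint with second-order machinery: it approximates the KL term using the Fisher information matrix, solves for the step with conjugate gradient, and then backtracks along that step until the KL bound holds. It works, but it is complicated and expensive to implement at scale. The natural question: can we get the same “don’t move too far” protection with plain first-order gradient steps?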

PPO: clip the ratio and call it a day

The answer is Proximal Policy Optimization (PPO; Schulman et al., 2017), and its trick is almost embarrassingly simple. Instead of a hard KL constraint, PPO bakes the “stay close” pressure directly into the objective by clipping the ratio. The clipped surrogate objective is:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\, \min\big(\, r_t(\theta)\, \hat{A}_t, \;\; \mathrm{clip}\big(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\big)\, \hat{A}_t \,\big) \Big]$$

Here $\epsilon$ is a small constant (typically $0.1$ or $0.2$). The $\mathrm{clip}$ function pins $r_t$ to the interval $[1-\epsilon, 1+\epsilon]$. We compute the surrogate two ways, once with the true ratio and once with the clipped ratio, and take the minimum. That minimum is what makes the whole thing work, and it’s worth walking through both cases carefully.
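The objective translates almost directly into code. The sketch below is a plain-Python version for a batch of precomputed ratios and advantages. A real implementation would use tensors and automatic differentiation, and the function names and sample values here are illustrative assumptions.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))

def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for r, a in zip(ratios, advantages):
        unclipped = r * a
        clipped = clip(r, 1.0 - epsilon, 1.0 + epsilon) * a
        terms.append(min(unclipped, clipped))
    return sum(terms) / len(terms)

# Hypothetical batch: three sampled actions with their ratios and advantages.
ratios = [1.5, 1.0, 0.5]
advantages = [1.0, 0.5, -2.0]

unclipped_mean = sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)
clipped_mean = ppo_clip_objective(ratios, advantages, epsilon=0.2)

print(f"unclipped surrogate: {unclipped_mean:.4f}")
print(f"clipped surrogate:   {clipped_mean:.4f}")
```

Because each term is a minimum that includes the unclipped value, the clipped objective can never exceed the unclipped one; it is a pessimistic estimate of the improvement.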

What the min() does, case by case

Case A: the action was good ($\hat{A}_t > 0$). We’d like to increase its probability, pushing $r_t$ above 1. The clip caps the reward of doing so at $r_t = 1+\epsilon$: once the new policy has raised this action’s probability by a factor of $1+\epsilon$, the objective flattens and there’s no more gradient to push it higher. You get credit for moving in the right direction, but only up to a point, so no single sample is rewarded for blowing the probability up arbitrarily.

Case B: the action was bad ($\hat{A}_t < 0$). We’d like to decrease its probability, pushing $r_t$ below 1. The clip floors it at $r_t = 1-\epsilon$: once the probability has dropped to $1-\epsilon$ times its old value, the objective flattens again and there is nothing more to gain from suppressing the action further.

Why the min, not the clip alone? The $\min$ makes the bound pessimistic in the right direction. When the clipped term is the smaller one (you’ve already moved far enough), the gradient vanishes — good. But if a step somehow overshoots so badly that the unclipped term is smaller (the update is making a good action less likely, or a bad one more likely), the $\min$ selects that unclipped term, and its gradient is live — so the model can always correct a genuinely bad move. Clipping removes the incentive to go too far; the min preserves the incentive to come back.
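The four combinations of advantage sign and ratio position can be checked numerically. This sketch returns both the value of a single clipped term and its slope with respect to $r_t$, so the flat and live regions are visible directly. The helper name and the sample points are assumptions made for the example.

```python
def clipped_term_and_grad(r, advantage, epsilon=0.2):
    """Return (value, d value / d r) for min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = r * advantage
    r_clipped = max(1.0 - epsilon, min(r, 1.0 + epsilon))
    clipped = r_clipped * advantage
    if unclipped <= clipped:
        # The min selects the unclipped term, whose slope in r is A (gradient is live).
        return unclipped, advantage
    # The min selects the clipped term, which is constant in r here (gradient is zero).
    return clipped, 0.0

cases = [
    ("good action, pushed past 1 + eps", 1.5, +1.0),
    ("good action, made less likely", 0.5, +1.0),
    ("bad action, pushed past 1 - eps", 0.5, -1.0),
    ("bad action, made more likely", 1.5, -1.0),
]

for label, r, adv in cases:
    value, grad = clipped_term_and_grad(r, adv)
    print(f"{label:34s} r = {r:.1f}  value = {value:+.2f}  slope = {grad:+.1f}")
```

The two rows where the update has already gone far enough in the desired direction have zero slope. The two rows where the update went the wrong way keep a nonzero slope, which is the recovery path the min preserves.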

The effect is a cheap, first-order approximation of TRPO’s trust region. There is no KL constraint to solve, no second-order matrix, just a clamp and a min, both one line of code. Because the gradient goes flat outside the clip band, a sample whose ratio has left the band stops pulling the policy further from $\pi_{\theta_{\text{old}}}$. This is an incentive rather than a hard bound (the ratio can still end up somewhat outside the band), but in practice it delivers much of the stability TRPO bought with far more machinery. That trade, almost all of the robustness for a fraction of the complexity, is why PPO, not TRPO, became the standard.
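To see the braking effect over several gradient steps on the same data, here is a toy sketch: a three-action softmax policy for a single state, one sampled action with a positive advantage, and repeated gradient ascent on that sample's clipped term. The setup, learning rate, and step count are all assumptions chosen for illustration, and the gradient is written out by hand rather than taken from an autodiff library.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def run_clipped_updates(steps=50, lr=0.05, epsilon=0.2):
    """Gradient ascent on one sample's clipped term; returns the ratio after each step."""
    logits = [0.0, 0.0, 0.0]       # hypothetical 3-action policy for a single state
    action, advantage = 0, 1.0     # one sampled action with a positive advantage
    p_old = softmax(logits)[action]
    history = []
    for _ in range(steps):
        probs = softmax(logits)
        r = probs[action] / p_old
        # For A > 0 the min() selects the flat clipped term once r exceeds 1 + epsilon,
        # so the update is applied only while the ratio is still at or below that cap.
        if r <= 1.0 + epsilon:
            # d(r * A) / d(logit_k) = A * r * (1[k == action] - pi_k) for a softmax policy.
            for k in range(len(logits)):
                indicator = 1.0 if k == action else 0.0
                logits[k] += lr * advantage * r * (indicator - probs[k])
        history.append(softmax(logits)[action] / p_old)
    return history

history = run_clipped_updates()
print(f"ratio after step 1:  {history[0]:.4f}")
print(f"ratio after step 50: {history[-1]:.4f}")
```

In this toy run the ratio climbs step by step, crosses $1+\epsilon$, and then stays put because the gradient has vanished. It comes to rest slightly above the cap rather than exactly on it, since the last live step carries it over the boundary; a larger learning rate would overshoot by more.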

Try it

The plot below shows the clipped surrogate as a function of the probability ratio $r_t$. Flip the sign of the advantage and slide $\epsilon$. Watch the objective rise linearly with $r_t$ and then go flat at the clip boundary; that flat region is the trust region in disguise. Notice how the flat side switches depending on whether the advantage is positive or negative, and how a wider $\epsilon$ permits bigger steps before the brakes engage.

Figure (interactive): PPO clipped surrogate objective, plotted over the probability ratio r = π_new/π_old. Curves: the raw term r·A (unclipped) and PPO's clipped objective; the flat region is the cap. Controls: sign of the advantage and clip ε (shown at 0.20).

PPO maximizes min( r·A, clip(r, 1−ε, 1+ε)·A ). When A > 0 (the action was good), the objective stops rising once r > 1+ε — so a single good sample can't shove the policy arbitrarily far. When A < 0 (the action was bad), the cap is symmetric on the r < 1−ε side. In the flat clipped zone the gradient is zero, which is how PPO approximates a trust region without the expensive second-order math of TRPO.

Once the clipped objective clicks — linear improvement up to the band, flat beyond it, and the min keeping a recovery path open — PPO holds no more mysteries. The remaining question is purely practical: how do you wire this into an actual RLHF pipeline, with a reward model, a reference model, and a KL penalty keeping the policy honest? That’s the next chapter.