Section 12

Value, advantage, baselines

Critics, GAE, and variance reduction

REINFORCE works, but it shudders. The previous chapter ended on its fatal flaw: the gradient is multiplied by the raw reward, so when every response scores, say, $+9$, the algorithm enthusiastically pushes up the probability of everything it just did, learning almost nothing about which responses were actually better. This chapter is about taming that variance: first with a free trick called the baseline, then with the value functions and advantage estimates that power every modern RLHF system.

A free lunch: subtracting a baseline

Here is the key observation. We can subtract any constant $b$ from the reward inside the policy gradient without changing what it estimates:

$$\nabla_\theta J = \mathbb{E}_{y \sim \pi_\theta}\big[\, (r(y) - b) \, \nabla_\theta \log \pi_\theta(y) \,\big]$$

Why is this allowed? Because the extra term we introduced, $\mathbb{E}[\, b \, \nabla_\theta \log \pi_\theta(y) \,]$, is exactly zero. The proof is two lines and worth seeing.
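A sketch of the standard argument, assuming $b$ does not depend on the sampled response $y$ and that the gradient can be exchanged with the sum over responses:

```latex
\begin{aligned}
\mathbb{E}_{y \sim \pi_\theta}\big[\, b \, \nabla_\theta \log \pi_\theta(y) \,\big]
  &= b \sum_y \pi_\theta(y) \, \frac{\nabla_\theta \pi_\theta(y)}{\pi_\theta(y)}
   = b \sum_y \nabla_\theta \pi_\theta(y) \\
  &= b \, \nabla_\theta \sum_y \pi_\theta(y)
   = b \, \nabla_\theta 1
   = 0 .
\end{aligned}
```

The first step uses $\nabla_\theta \log \pi_\theta(y) = \nabla_\theta \pi_\theta(y) / \pi_\theta(y)$; the last uses the fact that probabilities sum to one for every $\theta$.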

So the baseline is a genuine free lunch: it cannot bias the estimator, but a well-chosen $b$ slashes its variance. Intuitively, instead of asking “was this response good?” we ask “was this response better than my typical response?” If $b$ is the average reward, a $+9$ in a sea of $+9$s contributes nothing (no spurious push), while a $+9$ among $+2$s gets a strong upward push and a $+2$ gets pushed down. The signal becomes relative, which is exactly what we want.

The simplest useful baseline is just the mean reward of the current batch of rollouts. That alone helps enormously. As a preview, it is essentially what GRPO (Group Relative Policy Optimization, chapter 23) does: it samples a group of responses per prompt and uses their mean reward as the baseline, which avoids training a separate network at all.
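To see the effect numerically, here is a minimal sketch. It assumes a hypothetical softmax policy over three candidate responses with made-up rewards clustered near $+9$; the names `score` and `grad_stats` are illustrative. Because the policy is tiny, the mean and variance of the single-sample gradient estimator can be computed exactly by enumeration rather than by sampling.

```python
import numpy as np

def score(pi, y):
    """Gradient of log pi(y) with respect to the softmax logits: onehot(y) - pi."""
    g = -pi.copy()
    g[y] += 1.0
    return g

def grad_stats(pi, rewards, b):
    """Exact mean and total variance of the estimator (r(y) - b) * score(y)."""
    samples = np.stack([(rewards[y] - b) * score(pi, y) for y in range(len(pi))])
    mean = pi @ samples                       # E[g], one entry per logit
    second = pi @ (samples ** 2)              # E[g^2], elementwise
    var = float(np.sum(second - mean ** 2))   # variance summed over coordinates
    return mean, var

# Hypothetical policy and rewards: every response scores close to +9.
logits = np.array([0.0, 0.5, -0.5])
pi = np.exp(logits) / np.exp(logits).sum()
rewards = np.array([9.0, 9.5, 8.5])

mean_raw, var_raw = grad_stats(pi, rewards, b=0.0)
mean_bl, var_bl = grad_stats(pi, rewards, b=float(pi @ rewards))

print("same expected gradient:", np.allclose(mean_raw, mean_bl))
print("variance without baseline:", var_raw)
print("variance with mean-reward baseline:", var_bl)
```

The expected gradient is identical in both cases, while the variance drops sharply once the common level of the rewards is subtracted.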

The value function and the critic

We can do better than a single batch-wide constant. The ideal baseline is state-dependent: the reward we expect from a given starting point. That is the value function:

$$V(s) = \mathbb{E}_{y \sim \pi_\theta}\big[\, r \mid \text{state } s \,\big]$$

$V(s)$ answers: “Starting from state $s$ (this prompt, these tokens so far), how much reward do I expect this policy to earn from here on out?” A strong prompt where the model usually does well has a high $V$; a hard one has a low $V$. Subtracting $V(s)$ as the baseline asks the sharpest possible question: did this rollout beat what was expected from this particular state?

We don’t know $V$, so we learn it with a second network, the critic. The critic is typically a copy of the model with a scalar output head, trained by regression to predict the returns the policy actually receives. The policy (the “actor”) proposes; the critic evaluates. This actor–critic split is the structural backbone of PPO (Proximal Policy Optimization).
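The regression idea can be sketched without a neural network. The example below is a toy stand-in, not how an RLHF critic is actually built: it assumes two hypothetical prompts, "easy" and "hard", with made-up mean rewards and unit Gaussian noise, and uses a tabular critic (one scalar per state) in place of a model with a scalar head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expected rewards per prompt; the critic never sees these directly.
true_mean = {"easy": 8.0, "hard": 2.0}

# Tabular critic: one value estimate per state, initialised at zero.
V = {"easy": 0.0, "hard": 0.0}
lr = 0.05

for step in range(2000):
    s = "easy" if rng.random() < 0.5 else "hard"
    observed_return = true_mean[s] + rng.normal()   # noisy return from one rollout
    # One gradient step on the squared error 0.5 * (V[s] - observed_return)^2.
    V[s] -= lr * (V[s] - observed_return)

print(V)
```

Each estimate drifts toward the average return seen from its state, which is the quantity the critic is meant to predict. A real critic shares this objective but generalises across states through its network weights.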

Advantage: better or worse than expected

Put the pieces together and you get the central quantity of policy optimization, the advantage:

$$A(s, a) = Q(s, a) - V(s)$$

where $Q(s,a)$ is the expected reward of taking action $a$ in state $s$ and then continuing to follow the policy. In words: how much better is this specific action than the policy’s average behavior in this state? A positive advantage means “this was a pleasant surprise, so do more of it”; a negative advantage means “worse than usual, so do less.” Replacing the raw reward with the advantage gives the modern policy gradient:

$$\nabla_\theta J = \mathbb{E}\big[\, A(s, a) \, \nabla_\theta \log \pi_\theta(a \mid s) \,\big]$$

For a terminal reward with a learned baseline, this is just $A \approx r(y) - V(s)$, the reward minus the critic’s prediction. The advantage is the cleaned-up, centered learning signal: the variance-inflating level of the reward has been subtracted away, leaving only the informative part.
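As a small illustration of the terminal-reward case, the sketch below uses made-up rewards and critic predictions for three rollouts; `terminal_advantages` is an illustrative helper name, not a library function.

```python
import numpy as np

def terminal_advantages(rewards, values):
    """A ~= r(y) - V(s): terminal reward minus the critic's prediction for that state."""
    return np.asarray(rewards, dtype=float) - np.asarray(values, dtype=float)

# Hypothetical numbers for three rollouts.
rewards = [9.0, 9.0, 2.0]   # terminal reward of each response
values = [9.0, 5.0, 5.0]    # critic's predicted reward for each prompt

adv = terminal_advantages(rewards, values)
print(adv)
```

The first rollout scores $+9$ where $+9$ was expected, so its advantage is zero and it contributes no push. The second scores $+9$ where only $+5$ was expected (advantage $+4$), and the third scores $+2$ against an expected $+5$ (advantage $-3$).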

GAE: trading bias against variance

There’s one more wrinkle, and it’s where the real engineering lives. To compute the advantage we need to estimate returns, and we have a spectrum of ways to do it.

At one extreme, use the actual reward earned over the whole rollout. This is unbiased — it’s what really happened — but high-variance, because a single noisy outcome stands in for the expectation. At the other extreme, lean entirely on the critic’s one-step prediction. This is low-variance (the critic averages over many episodes) but biased (the critic is imperfect). Neither extreme is ideal; we want a dial between them.

That dial is Generalized Advantage Estimation (GAE, Schulman 2016). It is built from the per-step temporal-difference error:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
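To make the formula concrete, here is a worked instance with hypothetical numbers: a step reward $r_t = 1$, discount $\gamma = 0.9$, and critic estimates $V(s_t) = 2.5$ and $V(s_{t+1}) = 2$.

```latex
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
         = 1 + 0.9 \times 2 - 2.5
         = 0.3
```

The step turned out slightly better than the critic expected, so the TD error is small and positive.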

Each $\delta_t$ is a small one-step “surprise”: the reward you just got, plus the discounted value of where you landed, minus the value of where you were. GAE then sums these surprises down the trajectory with an exponentially decaying weight:

$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}$$

Two knobs. The discount $\gamma \in [0,1]$ sets how much future reward counts now. The GAE parameter $\lambda \in [0,1]$ is the bias/variance dial:

  • $\lambda = 0$ collapses the sum to a single term, $\hat{A}_t = \delta_t$, which is pure reliance on the critic. Low variance, higher bias.
  • $\lambda = 1$ recovers the full Monte-Carlo return minus the baseline: unbiased, but high variance.
  • Intermediate $\lambda$ (a value near $0.95$ is typical for RLHF) blends them, keeping most of the variance reduction while paying only a little bias.
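The sum above is usually computed with a backward recursion, $\hat{A}_t = \delta_t + \gamma\lambda \, \hat{A}_{t+1}$, which follows directly from the definition. The sketch below assumes a finite rollout that ends in a terminal state (so the value after the last step is 0); the rewards are those of the six-step rollout used in this chapter's demo, while the value estimates are made up for illustration.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE advantages for one finite rollout.

    `values` has len(rewards) + 1 entries: V(s_0), ..., V(s_T), where the
    last entry is the value after the final step (0.0 if terminal).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]   # per-step TD errors
    adv = np.zeros_like(rewards)
    running = 0.0
    # Backward pass: A_t = delta_t + gamma * lam * A_{t+1}.
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

rewards = [0.0, 1.0, 0.0, 0.0, 2.0, -1.0]
values = [0.5, 0.8, 0.4, 0.6, 1.0, -0.5, 0.0]   # hypothetical critic outputs

for lam in (0.0, 0.95, 1.0):
    print(lam, np.round(gae(rewards, values, gamma=0.9, lam=lam), 3))
```

With `lam=0.0` the output equals the TD errors themselves, and with `lam=1.0` it equals the discounted return from each step minus the critic's estimate at that step, matching the two extremes in the list above.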

Try it

Below, a short rollout with per-step rewards and a learned value estimate. Turn the $\gamma$ and $\lambda$ dials and watch the advantage estimates along the trajectory respond: crank $\lambda$ toward 1 and the estimates get spikier (high variance); pull it toward 0 and they smooth out toward the critic’s one-step view (low variance, more bias). This single picture is the heart of why PPO is stable.

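[Interactive figure: GAE bias / variance dial. A fixed 6-step rollout with per-step rewards $r = 0, 1, 0, 0, 2, -1$ at steps $t_0$ to $t_5$. Set $\gamma$ and $\lambda$; the bars are the GAE advantage $\hat{A}_t$ at each step (captured bar values: 1.26, 1.25, 0.58, 0.52, 0.77, -1.20). $\lambda$ slides between one-step TD and Monte-Carlo.]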
Each step's TD error is $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. GAE sums future $\delta$'s with geometric weights $(\gamma\lambda)^l$. At $\lambda = 0$ only the immediate $\delta$ survives: the estimate leans entirely on the (biased) value function but barely fluctuates. At $\lambda = 1$ the weights decay only through $\gamma$, so you sum the whole noisy discounted return, which is unbiased but high-variance. PPO's usual $\lambda \approx 0.95$ keeps most of the variance reduction while adding only a little bias.

Where this is heading

Baselines, a critic, and GAE give us a low-variance advantage signal $\hat{A}_t$. But we still haven’t fixed the other failure mode of naive policy gradients: taking a step so large that it wrecks the policy in a single update. That is the problem trust regions and clipping solve, and it brings us to PPO, the algorithm at the center of RLHF, in the next chapter.