Section 12

Value, advantage, baselines

Critics, GAE, and variance reduction

REINFORCE works, but it shudders. The previous chapter ended on its fatal flaw: the gradient is multiplied by the raw reward, so when every response scores, say, $+9$, the algorithm enthusiastically pushes up the probability of everything it just did, learning almost nothing about which responses were actually better. This chapter is about taming that variance: first with a free trick called the baseline, then with the value functions and advantage estimates that power every modern RLHF system.

A free lunch: subtracting a baseline

Here is the key observation. We can subtract any constant $b$ from the reward inside the policy gradient without changing what it estimates:

\nabla_\theta J = \mathbb{E}_{y \sim \pi_\theta}\big[\, (r(y) - b) \, \nabla_\theta \log \pi_\theta(y) \,\big]

Why is this allowed? Because the extra term we introduced, $\mathbb{E}[\, b \, \nabla_\theta \log \pi_\theta(y) \,]$, is exactly zero. The proof is two lines and worth seeing.

Why a baseline adds no bias

Pull the constant $b$ out and run the log-derivative trick in reverse:

\mathbb{E}_{y \sim \pi_\theta}\!\big[\, \nabla_\theta \log \pi_\theta(y) \,\big] = \sum_y \pi_\theta(y)\, \frac{\nabla_\theta \pi_\theta(y)}{\pi_\theta(y)} = \sum_y \nabla_\theta \pi_\theta(y) = \nabla_\theta \sum_y \pi_\theta(y) = \nabla_\theta 1 = 0.

The probabilities always sum to one, so their gradient sums to zero. Therefore $\mathbb{E}[\,b\,\nabla_\theta \log \pi_\theta\,] = b \cdot 0 = 0$, and subtracting $b$ leaves the gradient’s expectation untouched. It changes the variance of the estimate, not the direction it points in on average.
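The claim can be checked exactly on a toy problem. The sketch below assumes a softmax policy over four candidate responses with made-up logits and rewards (all of these numbers are hypothetical), enumerates every response instead of sampling, and compares the estimator with and without a baseline.

```python
import numpy as np

# Hypothetical toy policy: softmax over 4 "responses" with made-up logits and rewards.
logits = np.array([0.5, -0.2, 1.0, 0.1])
rewards = np.array([9.0, 8.0, 10.0, 9.0])

probs = np.exp(logits - logits.max())
probs /= probs.sum()

# For a softmax, the gradient of log pi(y) w.r.t. the logits is onehot(y) - probs.
# Row y of `score` is that gradient for response y.
score = np.eye(len(logits)) - probs


def grad_mean_and_variance(b):
    """Exact mean and total variance of the one-sample estimator (r(y) - b) * score[y]."""
    samples = (rewards - b)[:, None] * score      # one row per possible response y
    mean = probs @ samples                        # expectation over y ~ pi
    second_moment = probs @ (samples ** 2)
    variance = (second_moment - mean ** 2).sum()  # summed over gradient coordinates
    return mean, variance


expected_score = probs @ score                    # E[grad log pi], zero up to rounding
mean_raw, var_raw = grad_mean_and_variance(b=0.0)
mean_base, var_base = grad_mean_and_variance(b=probs @ rewards)  # b = expected reward

print("E[score]      :", expected_score)
print("same gradient :", np.allclose(mean_raw, mean_base))
print("variance      :", var_raw, "->", var_base)
```

Both estimators have the same expected gradient, while the variance with the baseline is much smaller, because the rewards here all sit near $+9$ and the baseline removes that shared level.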

So the baseline is a genuine free lunch: it cannot bias the estimator, but a well-chosen $b$ slashes its variance. Intuitively, instead of asking “was this response good?” we ask “was this response better than my typical response?” If $b$ is the average reward, a $+9$ in a sea of $+9$s contributes nothing (no spurious push), while a $+9$ among $+2$s gets a strong upward push and a $+2$ gets pushed down. The signal becomes relative, which is exactly what we want.

The simplest useful baseline is just the mean reward of the current batch of rollouts. That alone helps enormously — and, as a preview, it is essentially what GRPO (chapter 23) does to avoid training a separate network at all.
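As a minimal sketch of the batch-mean baseline, using the $+9$ and $+2$ scores from the example above (the helper name `centered_advantages` is hypothetical, not taken from any library):

```python
import numpy as np


def centered_advantages(rewards):
    """Subtract the batch-mean reward from each rollout's reward."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()


all_nines = centered_advantages([9, 9, 9, 9])  # every rollout scored the same
mixed = centered_advantages([9, 2, 2, 2])      # one rollout stands out

print(all_nines)  # [0. 0. 0. 0.]
print(mixed)      # [ 5.25 -1.75 -1.75 -1.75]
```

One caveat: a batch mean that includes a rollout's own reward is not fully independent of that rollout, so the zero-bias proof applies only approximately. A leave-one-out mean avoids this, and the difference shrinks as the batch grows.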

The value function and the critic

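We can do better than a single batch-wide constant. The proof above only requires that $b$ not depend on the sampled response, so the baseline is free to depend on where the response starts from. The ideal baseline is therefore state-dependent: the reward we expect from a given starting point. That is the value function: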

V(s) = \mathbb{E}_{y \sim \pi_\theta}\big[\, r \mid \text{state } s \,\big]

$V(s)$ answers: “Starting from state $s$ (this prompt, these tokens so far), how much reward do I expect this policy to earn from here on out?” A strong prompt where the model usually does well has a high $V$; a hard one has a low $V$. Subtracting $V(s)$ as the baseline asks the sharpest possible question: did this rollout beat what was expected from this particular state?

We don’t know $V$, so we learn it with a second network, the critic. The critic is typically a copy of the model with a scalar output head, trained by regression to predict the returns the policy actually receives. The policy (the “actor”) proposes; the critic evaluates. This actor–critic split is the structural backbone of PPO.
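To make “trained by regression” concrete, here is a deliberately small sketch. It assumes the critic is just a linear scalar head on top of fixed feature vectors (stand-ins for the model’s hidden states) and fits it to observed returns with mean-squared error. The data, the dimensions, and the `fit_value_head` helper are all hypothetical; a real critic would be a full network trained with an optimizer.

```python
import numpy as np


def fit_value_head(features, returns, steps=200, learning_rate=0.1):
    """Fit V(s) = features @ w + bias to observed returns by gradient descent on MSE."""
    w = np.zeros(features.shape[1])
    bias = 0.0
    for _ in range(steps):
        error = features @ w + bias - returns  # prediction minus observed return
        w -= learning_rate * 2 * (features.T @ error) / len(returns)
        bias -= learning_rate * 2 * error.mean()
    return w, bias


# Hypothetical data: 256 states with 8-dim features and noisy observed returns.
rng = np.random.default_rng(0)
features = rng.normal(size=(256, 8))
true_weights = rng.normal(size=8)
returns = features @ true_weights + rng.normal(scale=0.5, size=256)

initial_loss = np.mean(returns ** 2)  # loss of the all-zero critic
w, bias = fit_value_head(features, returns)
final_loss = np.mean((features @ w + bias - returns) ** 2)

print("MSE before:", initial_loss, "after:", final_loss)
```

The regression target is the return the policy actually received, so the critic tracks the current policy’s performance, not some fixed notion of how good a state is.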

Advantage: better or worse than expected

Put the pieces together and you get the central quantity of policy optimization, the advantage:

A(s, a) = Q(s, a) - V(s)

where $Q(s,a)$ is the expected reward of taking action $a$ in state $s$ and then following the policy from there. In words: how much better is this specific action than the policy’s average behavior in this state? A positive advantage means “this was a pleasant surprise, do more of it”; a negative advantage means “worse than usual, do less.” Replacing the raw reward with the advantage gives the modern policy gradient:

\nabla_\theta J = \mathbb{E}\big[\, A(s, a) \, \nabla_\theta \log \pi_\theta(a \mid s) \,\big]

For a terminal reward with a learned baseline, this is just $A \approx r(y) - V(s)$ — reward minus the critic’s prediction. The advantage is the cleaned-up, centered learning signal: the variance-inflating level of the reward has been subtracted away, leaving only the informative part.
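A two-prompt illustration of the terminal-reward case, with made-up numbers: the critic expects $9$ on an easy prompt and $1$ on a hard one, and the rollouts score $8$ and $3$.

```python
import numpy as np

# Hypothetical numbers: one easy prompt and one hard prompt.
rewards = np.array([8.0, 3.0])          # reward each rollout actually received
value_estimates = np.array([9.0, 1.0])  # critic's prediction V(s) for each prompt

advantages = rewards - value_estimates  # A ≈ r(y) - V(s)
print(advantages)  # [-1.  2.]
```

The rollout with the higher raw reward ends up with the negative advantage: $8$ is a disappointment where $9$ was expected, while $3$ on the hard prompt is a pleasant surprise.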

GAE: trading bias against variance

There’s one more wrinkle, and it’s where the real engineering lives. To compute the advantage we need to estimate returns, and we have a spectrum of ways to do it.

At one extreme, use the actual reward earned over the whole rollout. This is unbiased — it’s what really happened — but high-variance, because a single noisy outcome stands in for the expectation. At the other extreme, lean entirely on the critic’s one-step prediction. This is low-variance (the critic averages over many episodes) but biased (the critic is imperfect). Neither extreme is ideal; we want a dial between them.

That dial is Generalized Advantage Estimation (GAE; Schulman et al., 2016). It is built from the per-step temporal-difference error:

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)

Each $\delta_t$ is a small one-step “surprise”: the reward you just got, plus the discounted value of where you landed, minus the value of where you were. GAE then sums these surprises down the trajectory with an exponentially decaying weight:

\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}
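In practice the sum is rarely evaluated term by term. Factoring one power of $\gamma\lambda$ out of everything after the first term gives a recursion that can be run backward from the end of the rollout. For a finite rollout this assumes $\hat{A}$ is zero past the last step:

\hat{A}_t = \delta_t + \gamma\lambda \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+1+l} = \delta_t + \gamma\lambda \, \hat{A}_{t+1}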

Two knobs. The discount $\gamma \in [0,1]$ sets how much future reward counts now. The GAE parameter $\lambda \in [0,1]$ is the bias/variance dial:

$\lambda = 0$ collapses the sum to a single term, $\hat{A}_t = \delta_t$ — pure reliance on the critic. Low variance, higher bias.
$\lambda = 1$ recovers the full Monte-Carlo return minus the baseline — unbiased, but high variance.
Intermediate $\lambda$ (a value near $0.95$ is typical for RLHF) blends them, keeping most of the variance reduction while paying only a little bias.
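The two endpoints can be verified numerically. The sketch below computes GAE with a backward pass over a made-up 6-step rollout (the rewards and value estimates are hypothetical) and assumes the episode terminates, so the value after the last step is $0$.

```python
import numpy as np


def gae(rewards, values, gamma, lam):
    """GAE advantages for one finite rollout.

    `values` has one more entry than `rewards`: values[t] is V(s_t), and the
    final entry is the value after the last step (0.0 if the episode ended).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]  # per-step TD errors
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running      # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages, deltas


# Hypothetical 6-step rollout; the last value is 0.0 because the episode ends.
rewards = [0.0, 0.0, 1.0, 0.0, 0.0, 5.0]
values = [2.0, 2.5, 2.0, 3.0, 4.0, 4.5, 0.0]
gamma = 0.99

adv_td, deltas = gae(rewards, values, gamma, lam=0.0)
adv_mc, _ = gae(rewards, values, gamma, lam=1.0)
adv_mid, _ = gae(rewards, values, gamma, lam=0.95)

# Discounted Monte-Carlo returns, for comparison with the lam = 1 case.
returns = np.zeros(len(rewards))
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

print("lam=0 equals TD errors   :", np.allclose(adv_td, deltas))
print("lam=1 equals return - V  :", np.allclose(adv_mc, returns - np.array(values[:-1])))
print("lam=0.95                 :", adv_mid)
```

With $\lambda = 1$ the value terms telescope away, which is why the result matches the discounted return minus $V(s_t)$ exactly when the terminal value is zero.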

Try it

Below, a short rollout with per-step rewards and a learned value estimate. Turn the $\gamma$ and $\lambda$ dials and watch the advantage estimates along the trajectory respond: crank $\lambda$ toward 1 and the estimates get spikier (high variance); pull it toward 0 and they smooth out toward the critic’s one-step view (low variance, more bias). This single picture is the heart of why PPO is stable.

[Interactive figure: GAE bias / variance dial. A fixed 6-step rollout with sliders for the discount γ (shown at 0.99) and the GAE parameter λ (shown at 0.95), where λ runs from 0 (one-step TD) to 1 (Monte-Carlo). The bars show the GAE advantage Â_t at each step.]

Each step's TD error is δ_t = r_t + γ·V(s_{t+1}) − V(s_t). GAE sums future δ's with geometric weights (γλ)^k. At λ = 0 only the immediate δ survives: the estimate leans entirely on the (biased) value function but barely fluctuates. At λ = 1 the weights decay only through γ, so you sum the whole noisy discounted return: unbiased but high-variance. PPO's usual λ ≈ 0.95 keeps most of the variance reduction while adding only a little bias.

Where this is heading

Baselines, a critic, and GAE give us a low-variance advantage signal $\hat{A}_t$. But we still haven’t fixed the other failure mode of naive policy gradients: taking a step so large that it wrecks the policy in a single update. That is the problem trust regions and clipping solve, and it brings us to the algorithm at the center of RLHF: PPO, in the next chapter.