Section 11

Policy gradients & REINFORCE

Policy, rollout, return, and the score function

So far we have learned a reward model that scores a response, and we want the language model to produce responses the reward model likes. But the score is not differentiable with respect to the policy in the way the next-token loss is: the response is built from sampled, discrete tokens, so there is no clean chain rule running from “this 300-token answer scored 0.8” back to the logits. We need a new kind of gradient, one that can learn from a number handed back at the end, even a number produced by a black box. That gradient is the policy gradient, and this chapter derives it from scratch.

The LLM as a policy

In reinforcement learning, a policy is a rule for choosing actions. We write it $\pi_\theta(a \mid s)$: given a state $s$, it gives a probability distribution over actions $a$, controlled by parameters $\theta$. The whole vocabulary of RL maps cleanly onto a language model:

The state is the prompt-so-far: the input $x$ plus whatever tokens have already been generated.
An action is choosing the next token.
The policy is the model itself. At each step it emits logits, the softmax turns them into a distribution over the vocabulary, and we sample.

Because the model generates one token at a time and feeds each choice back in, the probability of a complete response $y = (y_1, \ldots, y_T)$ given prompt $x$ factorizes:

\pi_\theta(y \mid x) = \prod_{t=1}^{T} \pi_\theta(y_t \mid x, y_{<t})

This is the same autoregressive factorization you already know from next-token prediction — nothing about the network has changed. What changes is how we score the output and what gradient we push through it.
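The factorization above can be sketched in a few lines of NumPy. This is a minimal illustration, not a real model: `step_logits` is a hypothetical stand-in for the logits a language model would emit at each step, and the sizes `T` and `V` are made up. It shows that the log-probability of a whole response is just the sum of the per-token log-probabilities.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sequence_log_prob(step_logits, tokens):
    """log pi(y | x) = sum_t log pi(y_t | x, y_<t).

    step_logits: shape (T, V); row t holds the logits emitted before
                 choosing token t (hypothetical stand-in for model output).
    tokens:      shape (T,); the sampled token ids y_1..y_T.
    """
    log_probs = log_softmax(step_logits)                   # (T, V)
    per_token = log_probs[np.arange(len(tokens)), tokens]  # (T,)
    return per_token.sum()

rng = np.random.default_rng(0)
T, V = 5, 8                          # toy sequence length and vocabulary size
step_logits = rng.normal(size=(T, V))
tokens = rng.integers(0, V, size=T)

lp = sequence_log_prob(step_logits, tokens)
# Per-token probabilities; their product should equal exp(lp).
probs = np.exp(log_softmax(step_logits))[np.arange(T), tokens]
```

Working in log space is the usual design choice here: a product of hundreds of small probabilities underflows quickly, while a sum of log-probabilities stays well behaved.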

A single act of generating a full response is a rollout. The sequence of states and actions it produces (prompt, token, new state, token, …) is a trajectory. At the end, a scalar reward $r(y)$ (here, the reward-model score) judges the whole thing. The sum of the rewards collected along a trajectory is the return; with a single terminal reward at the end of a response, return and reward coincide, and we’ll just write $r(y)$.

The objective: maximize expected reward

We want a policy whose rollouts tend to score high. Formally, maximize the expected reward over the responses the policy itself generates:

J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[\, r(y) \,\big]

Read it carefully: the expectation is taken over $y$ drawn from the current policy. That is the crux of the difficulty. The thing we are differentiating with respect to, $\theta$, also controls the distribution we are averaging over. Change $\theta$ and the reward $r(y)$ of any fixed response stays the same; what changes is which $y$’s show up at all. Worse, $r$ may be a non-differentiable black box: a reward model, a unit-test pass/fail, a human thumbs-up. We can’t simply backpropagate through $r$.

The log-derivative trick

The way out is a small, beautiful identity. Write the expectation as a sum (over all possible responses) and differentiate:

\nabla_\theta J(\theta) = \nabla_\theta \sum_y \pi_\theta(y) \, r(y) = \sum_y r(y) \, \nabla_\theta \pi_\theta(y)

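Here $\pi_\theta(y)$ is shorthand for $\pi_\theta(y \mid x)$. The reward $r(y)$ doesn’t depend on $\theta$, so it slides outside the gradient. We’re left with $\nabla_\theta \pi_\theta(y)$, which is still awkward: the sum is no longer weighted by $\pi_\theta(y)$, so it is not an expectation we can estimate by sampling. The trick is to multiply and divide by $\pi_\theta(y)$: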

\nabla_\theta \pi_\theta(y) = \pi_\theta(y) \, \frac{\nabla_\theta \pi_\theta(y)}{\pi_\theta(y)} = \pi_\theta(y) \, \nabla_\theta \log \pi_\theta(y)

That last step uses the identity $\nabla \log f = \nabla f / f$. Substituting back, the $\pi_\theta(y)$ out front turns the sum back into an expectation:

\boxed{\,\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta}\big[\, r(y) \, \nabla_\theta \log \pi_\theta(y) \,\big]\,}

This is the policy gradient, also called the score-function estimator (the quantity $\nabla_\theta \log \pi_\theta$ is the “score” in statistics). It is the foundation on which RL for LLMs is built.
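As a concrete instance, consider the simplest policy there is: a single softmax over logits $z$, with the logits themselves playing the role of $\theta$ (an assumption made purely for illustration). Its score has a closed form:

\log \pi_z(a) = z_a - \log \sum_j e^{z_j} \quad\Longrightarrow\quad \frac{\partial \log \pi_z(a)}{\partial z_k} = \mathbb{1}[k = a] - \pi_z(k)

In vector form, the score is the one-hot vector for $a$ minus the probability vector: the sampled action’s logit is pushed up, and every logit is pushed down in proportion to its current probability.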

Why the log-derivative trick is the whole game

We wanted the gradient of an average, but the average is over a distribution that itself depends on $\theta$. Naively that’s intractable: the sum runs over every possible response, far too many to enumerate. The identity $\nabla_\theta \pi_\theta = \pi_\theta \, \nabla_\theta \log \pi_\theta$ rewrites the gradient as an expectation of $r(y)\,\nabla_\theta \log \pi_\theta(y)$, and an expectation is exactly what we can estimate by sampling.

So in practice we draw $N$ rollouts $y^{(1)}, \ldots, y^{(N)}$ from the current policy and average:

\nabla_\theta J \approx \frac{1}{N} \sum_{i=1}^{N} r(y^{(i)}) \, \nabla_\theta \log \pi_\theta(y^{(i)})

Crucially, $r(y)$ appears only as a scalar multiplier. We never differentiate through it. That is precisely why the reward is allowed to be a black box.
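To make the estimator tangible, here is a hedged numerical check on a toy problem small enough that the exact gradient is available. The setup is an assumption for illustration only: a four-action softmax policy whose parameters are its logits, and a made-up reward table. The sketch compares the sampled average of $r \, \nabla_\theta \log \pi_\theta$ against the exact gradient of $J$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.5, 1.0, 0.0])     # logits of a toy 4-action policy
rewards = np.array([1.0, 0.0, 2.0, -1.0])   # made-up black-box reward per action
pi = softmax(theta)

# Exact gradient of J = sum_a pi(a) r(a); computable only because the toy is tiny.
exact = pi * (rewards - pi @ rewards)

# Score-function estimate: sample actions, weight each score by its reward.
N = 200_000
actions = rng.choice(len(pi), size=N, p=pi)
scores = np.eye(len(pi))[actions] - pi      # grad_theta log pi(a) = onehot(a) - pi
estimate = (rewards[actions][:, None] * scores).mean(axis=0)
```

Note that `rewards` is only ever indexed and multiplied, never differentiated, which mirrors the point that the reward can be a black box.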

REINFORCE: push up what worked

Turning that estimator into a learning rule gives the REINFORCE algorithm (Williams, 1992). One step is:

Sample a batch of rollouts from the current policy.
Score each one with the reward function to get $r(y^{(i)})$.
Update by ascending the estimated gradient:

\theta \leftarrow \theta + \eta \, \frac{1}{N} \sum_i r(y^{(i)}) \, \nabla_\theta \log \pi_\theta(y^{(i)})

The intuition is exactly what you’d hope. $\nabla_\theta \log \pi_\theta(y)$ is the direction in parameter space that makes response $y$ more likely. Multiply it by the reward and sum: rollouts with high reward get pushed up in probability, and (if rewards can be negative) low-reward rollouts get pushed down. The policy reshapes itself to put more mass on the responses that scored well. It is supervised learning where the model writes its own training examples and the reward decides how hard to learn from each one: weighted maximum likelihood, with the weights handed down by the reward.

Because the factorization is autoregressive, $\nabla_\theta \log \pi_\theta(y) = \sum_t \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})$ — a sum of per-token log-prob gradients, every one of which you already compute during an ordinary forward/backward pass.
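The sample, score, update loop of REINFORCE can be sketched on a toy bandit, where each "rollout" is a single action. Everything specific here is an assumption for illustration: the function name `reinforce_bandit`, the reward means, the noise scale, and the hyperparameters. The policy is a softmax over logits, so the score is the one-hot vector of the sampled action minus the probability vector.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_bandit(mean_rewards, steps=2000, batch=16, lr=0.1, seed=0):
    """Run plain REINFORCE (no baseline) on a K-armed bandit."""
    rng = np.random.default_rng(seed)
    K = len(mean_rewards)
    theta = np.zeros(K)                        # start from the uniform policy
    for _ in range(steps):
        pi = softmax(theta)
        # Sample a batch of rollouts (single actions) from the current policy.
        actions = rng.choice(K, size=batch, p=pi)
        # Score each one; the noisy reward is treated as a black box.
        r = mean_rewards[actions] + rng.normal(scale=0.5, size=batch)
        # Ascend the estimated gradient: mean of r * grad log pi(a).
        scores = np.eye(K)[actions] - pi       # softmax score: onehot(a) - pi
        theta += lr * (r[:, None] * scores).mean(axis=0)
    return softmax(theta)

final_policy = reinforce_bandit(np.array([0.2, 0.5, 1.0, 0.1]))
```

Fresh actions are drawn from the updated policy on every iteration, so the loop never reuses samples generated under older parameters.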

On-policy vs. off-policy

One subtlety is hiding in the expectation $\mathbb{E}_{y \sim \pi_\theta}$. The data must come from the policy we are currently updating. An algorithm that learns only from its own fresh rollouts is on-policy; REINFORCE is the canonical example. The moment you take a gradient step, $\theta$ moves, your old samples are stale, and strictly speaking you must collect new rollouts before the next step.

The alternative is off-policy learning: reusing data generated by a different (older, or entirely separate) policy. That’s more sample-efficient — you don’t throw rollouts away after a single step — but it requires a correction, because you’re now averaging over the wrong distribution. The fix, importance sampling, is exactly the bridge that leads from REINFORCE to PPO in chapter 13.
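The idea behind the importance-sampling correction can be illustrated on a toy expectation. This is only a sketch of the reweighting principle, not of PPO: the two policies `pi_old` and `pi_new` and the reward table are made-up values. Samples drawn from the old policy are reweighted by the ratio $\pi_{\text{new}}(a) / \pi_{\text{old}}(a)$ so that their average targets the new policy.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([1.0, 0.0, 2.0, -1.0])     # made-up reward per action
pi_old = np.array([0.25, 0.25, 0.25, 0.25])   # policy that generated the data
pi_new = np.array([0.10, 0.20, 0.60, 0.10])   # policy we want to evaluate

# Stale rollouts: sampled from pi_old, not from pi_new.
actions = rng.choice(4, size=100_000, p=pi_old)

# Averaging stale samples directly estimates the expectation under pi_old.
naive = rewards[actions].mean()

# Reweighting each sample by pi_new / pi_old targets pi_new instead.
weights = pi_new[actions] / pi_old[actions]
corrected = (weights * rewards[actions]).mean()

true_old = pi_old @ rewards                   # 0.25 * (1 + 0 + 2 - 1) = 0.5
true_new = pi_new @ rewards                   # 0.1 + 0.0 + 1.2 - 0.1 = 1.2
```

The ratio can become large when the two policies disagree strongly, which inflates the variance of the corrected estimate; the further the policies drift apart, the less the stale data is worth.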

Try it

Below is a tiny world: a “bandit” with a handful of actions, each with its own reward you control. The policy is a single softmax over the actions. Watch REINFORCE in action — sample, reward, update — and see the action distribution migrate toward whatever you reward. Then turn the baseline on and watch the same learning happen far more smoothly.

Policy gradient sandbox (REINFORCE bandit)

Click an arm to sample it. Each pull collects a noisy reward and applies one update: logit ← logit + η·(reward − baseline). Watch the policy shift.

REINFORCE pushes probability toward actions that beat expectation and away from those that underperform. With the baseline ON, the update uses reward − running mean, so a merely-OK arm produces a small signed nudge instead of a big positive shove — that is lower variance, and the policy locks onto arm C faster and more stably. With the baseline OFF, every positive reward inflates its arm, so early lucky pulls on a mediocre arm can derail learning. Toggle and compare from a fresh Reset.

Play with it and one thing jumps out: the updates are noisy. With a small number of samples, a couple of lucky high-reward draws can yank the distribution around — sometimes in the wrong direction — before later samples correct it. That noise is not a quirk of the toy. It is the central weakness of REINFORCE.

The catch: variance

The policy-gradient estimator is unbiased (on average it points the right way), but it has punishing variance. Two reasons. First, we estimate an expectation from only a handful of samples. Second, and more insidiously, the raw reward $r(y)$ scales every update. If all your rewards happen to be large and positive (say every response scores between $+8$ and $+10$), then REINFORCE pushes up the log-prob of every rollout, hard, and only weakly distinguishes the $+10$ from the $+8$. The gradient is dominated by the level of the reward rather than by which responses were better than average. With billions of parameters and long sequences, that variance makes training slow and brittle.
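The effect of the reward level can be measured directly on a toy case. The setup below is an assumption for illustration: a uniform softmax policy over three responses scoring $+8$, $+9$, and $+10$, with a hypothetical helper `batch_gradients`. It compares many independent batch estimates of the gradient using the raw rewards against the same estimates with the mean reward subtracted, the running-mean idea the sandbox calls a baseline.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
pi = softmax(np.zeros(3))                  # uniform toy policy over 3 responses
rewards = np.array([8.0, 9.0, 10.0])       # every response scores in [+8, +10]

def batch_gradients(baseline, batch=8, trials=20_000):
    """Return `trials` independent batch estimates of the policy gradient."""
    actions = rng.choice(3, size=(trials, batch), p=pi)
    scores = np.eye(3)[actions] - pi                    # (trials, batch, 3)
    weights = (rewards[actions] - baseline)[..., None]  # scalar per rollout
    return (weights * scores).mean(axis=1)              # (trials, 3)

raw = batch_gradients(baseline=0.0)
centered = batch_gradients(baseline=pi @ rewards)       # subtract the mean, 9.0

exact = pi * (rewards - pi @ rewards)      # true gradient: [-1/3, 0, 1/3]
raw_var = raw.var(axis=0).sum()            # total variance of the raw estimator
centered_var = centered.var(axis=0).sum()  # total variance after centering
```

Both estimators average to the same gradient, but in this toy the raw version is far noisier, because most of each update is spent pushing up whatever happened to be sampled.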

The good news: there is a clean fix that costs nothing in bias. We can subtract a reference value from the reward — a baseline — so that what multiplies the log-prob becomes “how much better than expected was this rollout?” rather than the raw score. That single idea, and the value functions and advantage estimates built on top of it, is the subject of the next chapter.