Policy gradients & REINFORCE
Policy, rollout, return, and the score function
So far we have learned a reward model that scores a response, and we want the language model to produce responses the reward model likes. But the reward model isn’t differentiable in the way the next-token loss is: there is no clean chain rule running from “this 300-token answer scored 0.8” back to the logits. We need a new kind of gradient, one that can learn from a number handed back at the end, even a number produced by a black box. That gradient is the policy gradient, and this chapter derives it from scratch.
The LLM as a policy
In reinforcement learning, a policy is a rule for choosing actions. We write it $\pi_\theta(a \mid s)$: given a state $s$, it gives a probability distribution over actions $a$, controlled by parameters $\theta$. The whole vocabulary of RL maps cleanly onto a language model:
- The state is the prompt-so-far: the input plus whatever tokens have already been generated.
- An action is choosing the next token.
- The policy is the model itself. At each step it emits logits, the softmax turns them into a distribution over the vocabulary, and we sample.
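Because the model generates one token at a time and feeds each choice back in, the probability of a complete response $y = (y_1, \dots, y_T)$ given prompt $x$ factorizes:
$$\pi_\theta(y \mid x) = \prod_{t=1}^{T} \pi_\theta(y_t \mid x, y_{<t})$$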
This is the same autoregressive factorization you already know from next-token prediction: nothing about the network has changed. What changes is how we score the output and what gradient we push through it.
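The factorization can be checked numerically. The sketch below is illustrative only: `step_logits` stands in for the logit vectors a model would emit at each step (made-up numbers, not real model output), and `log_softmax` and `sequence_log_prob` are hypothetical helper names. It shows that the log-probability of a whole response is the sum of per-token log-probabilities.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def sequence_log_prob(step_logits, token_ids):
    """log pi(y | x) = sum over t of log pi(y_t | x, y_<t).

    step_logits[t] is assumed to be the logit vector emitted at step t,
    already conditioned on the prompt and the earlier tokens.
    """
    return sum(log_softmax(logits)[tok]
               for logits, tok in zip(step_logits, token_ids))

# Toy setup (assumed): 3-token vocabulary, 2 generated tokens.
step_logits = [[2.0, 0.5, -1.0], [0.1, 0.1, 3.0]]
token_ids = [0, 2]
lp = sequence_log_prob(step_logits, token_ids)
print(lp, math.exp(lp))
```

Working in log space turns the product into a sum, which is also what keeps long responses from underflowing.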
A single act of generating a full response is a rollout. The sequence of states and actions it produces (prompt, token, new state, token, …) is a trajectory. At the end, a scalar reward (here, the reward-model score) judges the whole thing. Summed over the trajectory, that’s the return; for a single terminal reward at the end of a response, return and reward coincide, and we’ll just write $R(y)$.
The objective: maximize expected reward
We want a policy whose rollouts tend to score high. Formally, maximize the expected reward over the responses the policy itself generates:
$$J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[R(y)\big]$$
Read it carefully: the expectation is taken over $y$ drawn from the current policy. That is the crux of the difficulty. The thing we are differentiating with respect to, $\theta$, also controls the distribution we are averaging over. Change $\theta$ and you don’t just change how probable a given $y$ is; you change which $y$’s show up in the average at all. Worse, $R$ may be a non-differentiable black box: a reward model, a unit-test pass/fail, a human thumbs-up. We can’t simply backpropagate through $R$.
The log-derivative trick
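The way out is a small, beautiful identity. Write the expectation as a sum (over all possible responses) and differentiate:
$$\nabla_\theta J(\theta) = \nabla_\theta \sum_{y} \pi_\theta(y \mid x)\, R(y) = \sum_{y} R(y)\, \nabla_\theta \pi_\theta(y \mid x)$$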
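The reward $R(y)$ doesn’t depend on $\theta$, so it slides outside the gradient. We’re left with $\nabla_\theta \pi_\theta(y \mid x)$, which is still awkward: a sum weighted by $\nabla_\theta \pi_\theta$ rather than by $\pi_\theta$ is not an expectation we can sample. The trick is to multiply and divide by $\pi_\theta(y \mid x)$:
$$\nabla_\theta \pi_\theta(y \mid x) = \pi_\theta(y \mid x)\, \frac{\nabla_\theta \pi_\theta(y \mid x)}{\pi_\theta(y \mid x)} = \pi_\theta(y \mid x)\, \nabla_\theta \log \pi_\theta(y \mid x)$$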
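That last step uses the identity $\nabla \log f = \nabla f / f$. Substituting back, the $\pi_\theta(y \mid x)$ out front turns the sum back into an expectation:
$$\nabla_\theta J(\theta) = \sum_{y} \pi_\theta(y \mid x)\, R(y)\, \nabla_\theta \log \pi_\theta(y \mid x) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[R(y)\, \nabla_\theta \log \pi_\theta(y \mid x)\big]$$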
This is the policy gradient, also called the score-function estimator (the quantity $\nabla_\theta \log \pi_\theta(y \mid x)$ is the “score” in statistics). It is the entire foundation of RL for LLMs.
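The identity is easy to sanity-check on a policy small enough to enumerate. The sketch below assumes a toy softmax policy over three actions with made-up rewards. All function names are hypothetical. It compares the score-function form of the gradient against a central finite difference of the expected reward.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reward(logits, rewards):
    """J = sum over actions of pi(a) * R(a)."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def score_function_gradient(logits, rewards):
    """E[R(a) * grad log pi(a)], enumerated exactly over all actions.

    For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - pi(k).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        for k in range(len(logits)):
            score_k = (1.0 if k == a else 0.0) - probs[k]
            grad[k] += p_a * r_a * score_k
    return grad

def finite_difference_gradient(logits, rewards, eps=1e-6):
    """Central-difference estimate of dJ / d logit_k, for comparison."""
    grad = []
    for k in range(len(logits)):
        up = list(logits)
        up[k] += eps
        down = list(logits)
        down[k] -= eps
        diff = expected_reward(up, rewards) - expected_reward(down, rewards)
        grad.append(diff / (2 * eps))
    return grad

# Assumed toy values, chosen only for illustration.
logits = [0.3, -1.2, 0.8]
rewards = [1.0, 0.0, 0.5]
g_score = score_function_gradient(logits, rewards)
g_fd = finite_difference_gradient(logits, rewards)
print(g_score)
print(g_fd)
```

The two gradients should agree to within finite-difference error, even though `rewards` is only ever read as a lookup table and never differentiated.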
REINFORCE: push up what worked
Turning that estimator into a learning rule gives the REINFORCE algorithm (Williams, 1992). One step is:
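- Sample a batch of $N$ rollouts $y^{(1)}, \dots, y^{(N)}$ from the current policy $\pi_\theta$.
- Score each one with the reward function to get $R(y^{(i)})$.
- Update $\theta$ by ascending the estimated gradient, with learning rate $\alpha$:
$$\theta \leftarrow \theta + \alpha \, \frac{1}{N} \sum_{i=1}^{N} R(y^{(i)}) \, \nabla_\theta \log \pi_\theta(y^{(i)} \mid x)$$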
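The three steps can be sketched on the smallest possible policy: a single softmax over a few discrete actions, standing in for a full language model. This is a minimal sketch under stated assumptions, not a production training loop. The reward table, batch size, learning rate, and the name `reinforce_step` are all made up for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, reward_fn, batch_size, lr, rng):
    """One REINFORCE update for a softmax policy over discrete actions."""
    probs = softmax(logits)
    actions = range(len(logits))
    grad = [0.0] * len(logits)
    for _ in range(batch_size):
        # 1. Sample a rollout (here: a single action) from the current policy.
        a = rng.choices(actions, weights=probs)[0]
        # 2. Score it with the black-box reward.
        r = reward_fn(a)
        # 3. Accumulate R * grad log pi(a); for softmax logits the score
        #    is one_hot(a) - probs.
        for k in actions:
            grad[k] += r * ((1.0 if k == a else 0.0) - probs[k]) / batch_size
    # Gradient ascent on the logits.
    return [z + lr * g for z, g in zip(logits, grad)]

rng = random.Random(0)
rewards = [0.1, 0.2, 1.0]      # assumed reward per action
logits = [0.0, 0.0, 0.0]       # start from a uniform policy
for _ in range(500):
    logits = reinforce_step(logits, lambda a: rewards[a],
                            batch_size=16, lr=0.5, rng=rng)
final_probs = softmax(logits)
print(final_probs)
```

The reward enters only as a scalar multiplier on each sampled action's score, so `reward_fn` could be any black box that returns a number.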
The intuition is exactly what you’d hope. $\nabla_\theta \log \pi_\theta(y \mid x)$ is the direction in parameter space that makes response $y$ more likely. Multiply it by the reward and sum: rollouts with high reward get pushed up in probability, and (if rewards can be negative) low-reward rollouts get pushed down. The policy reshapes itself to put more mass on the responses that scored well. It is supervised learning where the model writes its own training examples and the reward decides how hard to learn from each one: weighted maximum likelihood, with the weights handed down by the reward.
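Because the factorization is autoregressive, $\nabla_\theta \log \pi_\theta(y \mid x) = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})$, a sum of per-token log-prob gradients, every one of which you already compute during an ordinary forward/backward pass.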
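As a worked illustration, assume the final layer is a plain softmax over logits $z_{t,k}$ (the symbol $z_{t,k}$, the logit for vocabulary entry $k$ at step $t$, is introduced here for the example). Each per-token term then has a closed-form gradient with respect to those logits:

```latex
\frac{\partial \log \pi_\theta(y_t \mid x, y_{<t})}{\partial z_{t,k}}
  = \mathbb{1}[k = y_t] - \pi_\theta(k \mid x, y_{<t})
```

This is the negative of the familiar cross-entropy gradient with respect to the logits, which is why an ordinary backward pass already produces it.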
On-policy vs. off-policy
One subtlety is hiding in the expectation $\mathbb{E}_{y \sim \pi_\theta}$. The data must come from the policy we are currently updating. An algorithm that learns only from its own fresh rollouts is on-policy; REINFORCE is the canonical example. The moment you take a gradient step, $\theta$ moves, your old samples are stale, and strictly speaking you must collect new rollouts before the next step.
The alternative is off-policy learning: reusing data generated by a different (older, or entirely separate) policy. That’s more sample-efficient, since you don’t throw rollouts away after a single step, but it requires a correction, because you’re now averaging over the wrong distribution. The fix, importance sampling, is exactly the bridge that leads from REINFORCE to PPO in chapter 13.
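To make the correction concrete, here is a minimal sketch of the importance-sampling idea on a three-action toy problem. It shows only the reweighting step, not PPO's objective. The two probability tables, the rewards, and the name `importance_weighted_reward` are assumptions made up for the example.

```python
import random

def importance_weighted_reward(samples, old_probs, new_probs, rewards):
    """Estimate E_{a ~ new}[R(a)] from actions sampled under the OLD policy.

    Each sample is reweighted by the ratio new_probs[a] / old_probs[a].
    """
    total = 0.0
    for a in samples:
        total += (new_probs[a] / old_probs[a]) * rewards[a]
    return total / len(samples)

rng = random.Random(0)
old_probs = [0.5, 0.3, 0.2]    # policy that generated the data (assumed)
new_probs = [0.2, 0.3, 0.5]    # policy we actually want to evaluate (assumed)
rewards = [0.0, 0.5, 1.0]

samples = rng.choices(range(3), weights=old_probs, k=20000)

# Averaging stale samples directly estimates the OLD policy's expected reward.
naive = sum(rewards[a] for a in samples) / len(samples)
# Reweighting by the probability ratio targets the NEW policy instead.
corrected = importance_weighted_reward(samples, old_probs, new_probs, rewards)
# Exact expected reward under the new policy, for comparison.
true_new = sum(p * r for p, r in zip(new_probs, rewards))
print(naive, corrected, true_new)
```

The ratio can grow large when the two policies disagree, which inflates the variance of the corrected estimate. That is one reason the old and new policies are usually kept close together.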
Try it
Below is a tiny world: a “bandit” with a handful of actions, each with its own reward you control. The policy is a single softmax over the actions. Watch REINFORCE in action — sample, reward, update — and see the action distribution migrate toward whatever you reward. Then turn the baseline on and watch the same learning happen far more smoothly.
Play with it and one thing jumps out: the updates are noisy. With a small number of samples, a couple of lucky high-reward draws can yank the distribution around — sometimes in the wrong direction — before later samples correct it. That noise is not a quirk of the toy. It is the central weakness of REINFORCE.
The catch: variance
The policy-gradient estimator is unbiased (on average it points the right way), but it has punishing variance. Two reasons. First, we estimate an expectation from only a handful of samples. Second, and more insidiously, the raw reward scales every update. If all your rewards happen to be large and positive, say every response scores within a narrow band of high values, then REINFORCE pushes up the log-prob of every rollout, hard, and only weakly distinguishes the best response in the batch from the worst. The gradient is dominated by the level of the reward rather than by which responses were better than average. With billions of parameters and long sequences, that variance makes training slow and brittle.
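The effect of the reward level can be computed exactly on a toy softmax policy. The sketch below uses made-up rewards and a hypothetical helper, `gradient_mean_and_variance`. It enumerates every action to get the mean and total variance of the single-sample estimator, then repeats the calculation with the same rewards shifted up by a constant.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_mean_and_variance(logits, rewards):
    """Exact mean and total variance of R(a) * grad log pi(a), a ~ pi.

    Total variance is the sum of per-coordinate variances, computed as
    E[|g|^2] - |E[g]|^2 by enumerating all actions.
    """
    probs = softmax(logits)
    n = len(logits)
    mean = [0.0] * n
    second_moment = 0.0
    for a in range(n):
        g = [rewards[a] * ((1.0 if k == a else 0.0) - probs[k])
             for k in range(n)]
        for k in range(n):
            mean[k] += probs[a] * g[k]
        second_moment += probs[a] * sum(x * x for x in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance

logits = [0.0, 0.0, 0.0]
small = [0.0, 0.1, 0.2]                  # assumed rewards
shifted = [r + 10.0 for r in small]      # same ranking, much higher level

mean_small, var_small = gradient_mean_and_variance(logits, small)
mean_shifted, var_shifted = gradient_mean_and_variance(logits, shifted)
print(mean_small, var_small)
print(mean_shifted, var_shifted)
```

The expected gradient is the same in both cases, because a constant times the score averages to zero, but the variance is far larger once every reward sits at a high level.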
The good news: there is a clean fix that costs nothing in bias. We can subtract a reference value from the reward — a baseline — so that what multiplies the log-prob becomes “how much better than expected was this rollout?” rather than the raw score. That single idea, and the value functions and advantage estimates built on top of it, is the subject of the next chapter.