Section 09

Reward models

Bradley–Terry and what an RM really learns

Statistical foundation: Bradley & Terry, Rank Analysis of Incomplete Block Designs (1952)

Step 2 of the RLHF recipe asks for something that sounds almost paradoxical: turn a pile of human “this beats that” judgments into a function that can score any response with a single number, including responses no human ever ranked. That function is the reward model (RM): a model trained on human preference data to output a scalar score for how good a response is, standing in for a human judge so that RL can query it millions of times. This chapter is about what it is, how it’s trained, and, just as important, what it secretly is and isn’t.

What a reward model is, physically

A reward model is, almost always, the SFT model with its head swapped out. Recall that a language model ends in a head that projects the final hidden state into logits over the vocabulary. For a reward model, you throw that head away and bolt on a tiny scalar head: a single linear layer that maps the final hidden state (typically at the last token of the response) to one number, the reward $r_\theta(x, y)$ for response $y$ given prompt $x$.
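As a rough illustration of the scalar head described above, the sketch below takes the hidden vector at the last token and applies one linear map to get a single number. The function name `scalar_head`, the toy hidden states, and the weights are all made up for this example; a real reward model would compute the hidden states with the SFT transformer body and learn the weights.

```python
# Minimal sketch of a scalar reward head (hypothetical names and toy numbers).

def scalar_head(hidden_states, weight, bias=0.0):
    """Map the final hidden state of a response to one scalar reward.

    hidden_states: list of per-token hidden vectors (each a list of floats)
    weight:        list of floats with the same length as one hidden vector
    bias:          scalar offset
    """
    last = hidden_states[-1]  # hidden state at the last token of the response
    if len(last) != len(weight):
        raise ValueError("weight must match the hidden size")
    # A single linear layer with one output unit is just a dot product plus bias.
    return sum(h * w for h, w in zip(last, weight)) + bias


# Toy input: 3 tokens, hidden size 4. In practice these come from the model body.
hidden = [
    [0.1, 0.0, -0.2, 0.3],
    [0.4, -0.1, 0.0, 0.2],
    [0.5, 1.0, -1.0, 2.0],
]
w = [1.0, 0.5, 0.0, -0.25]

reward = scalar_head(hidden, w, bias=0.1)
print(reward)
```

Only the last row of `hidden` affects the result here, which mirrors the "typically at the last token" convention in the text.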

Starting from the SFT model matters. The reward model needs to understand language to judge it, and the SFT model already does; we’re only teaching it a new, much smaller skill — converting that understanding into a quality score. The body is initialized from the SFT checkpoint; only the scalar head starts fresh.

The Bradley–Terry model: from scores to preferences

Here’s the central problem. Our data is comparisons ($y_w$ beat $y_l$), but we want to output a scalar score. We need a bridge that says: given two scores, what’s the probability a human prefers one over the other? The Bradley–Terry model, from 1952, is exactly that bridge.

It says the probability that the winner is preferred to the loser is the logistic sigmoid $\sigma$ of the difference in their scores, which is the softmax in its two-item form:

$$P(y_w \succ y_l) = \sigma\big(r(y_w) - r(y_l)\big) = \frac{1}{1 + e^{-(r(y_w) - r(y_l))}}$$

Read it intuitively. If the two scores are equal, the difference is zero and $\sigma(0) = 0.5$: a coin flip, as it should be. As the winner’s score pulls ahead, the difference grows positive and the probability climbs toward 1. The model is a smooth, probabilistic statement that a bigger score means more likely to be preferred, with the gap controlling how confident the preference is.
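The relationship above is small enough to check by hand. The sketch below, using only the standard library, computes the Bradley–Terry preference probability and also confirms that the sigmoid of the score difference matches a two-item softmax over the raw scores. The function names and the example scores are illustrative choices, not part of any library.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_probability(score_w, score_l):
    """Bradley-Terry: P(winner preferred) = sigmoid(score_w - score_l)."""
    return sigmoid(score_w - score_l)


def two_item_softmax(score_a, score_b):
    """Probability assigned to the first item by a softmax over two scores."""
    ea, eb = math.exp(score_a), math.exp(score_b)
    return ea / (ea + eb)


p_equal = preference_probability(2.0, 2.0)   # equal scores: a coin flip
p_small = preference_probability(1.0, 0.5)   # small gap: mild preference
p_large = preference_probability(4.0, -1.0)  # large gap: near-certain preference

print(p_equal, p_small, p_large)
print(two_item_softmax(1.0, 0.5))  # agrees with p_small up to rounding
```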

Training the reward model

Now training is a one-liner. We have a labeled preference $(x, y_w, y_l)$: we observed the human prefer $y_w$. The Bradley–Terry model gives us the probability our reward function assigns to that observation. We just maximize that probability, which is the same as minimizing its negative log: a maximum-likelihood fit. For a single example:

$$\mathcal{L}(\theta) = -\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$

This is the entire reward-model objective. It is exactly a binary cross-entropy loss on the preference label, and it does precisely what you’d want: it pushes the score of the chosen response up and the score of the rejected response down, until the gap between them is large enough that $\sigma$ of the difference is close to 1. Average it over your whole dataset of triples, run gradient descent, and you have a reward model.
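To make the objective concrete, here is a minimal sketch of the per-example loss and its average over a dataset of triples. It uses the identity $-\log \sigma(d) = \log(1 + e^{-d})$ in a numerically stable form. The lookup table `toy_scores` and the function `toy_reward` are hypothetical stand-ins for a real reward model $r_\theta(x, y)$, and the prompts and responses are placeholders.

```python
import math


def pairwise_loss(r_w, r_l):
    """-log sigmoid(r_w - r_l), computed stably as softplus(-(r_w - r_l))."""
    d = r_w - r_l
    # softplus(z) = max(z, 0) + log1p(exp(-|z|)), applied here with z = -d.
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


def mean_loss(reward_fn, triples):
    """Average the pairwise loss over (prompt, chosen, rejected) triples."""
    total = sum(
        pairwise_loss(reward_fn(x, y_w), reward_fn(x, y_l))
        for x, y_w, y_l in triples
    )
    return total / len(triples)


# Hypothetical stand-in for a reward model: a fixed score per response.
toy_scores = {"good": 1.5, "ok": 0.0, "bad": -1.0}


def toy_reward(prompt, response):
    return toy_scores[response]


triples = [("q1", "good", "bad"), ("q2", "ok", "bad"), ("q3", "good", "ok")]

print(pairwise_loss(0.0, 0.0))  # equal scores: loss is log(2)
print(mean_loss(toy_reward, triples))
```

The loss falls as the chosen response's score pulls ahead and grows roughly linearly when the model ranks the pair the wrong way round, which is the pressure that moves the two scores apart during training.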

What the reward model actually learns — and a subtlety

It’s tempting to think the reward model learns “the quality” of a response in some absolute sense. It doesn’t. Look again at the loss: only the difference $r_\theta(x, y_w) - r_\theta(x, y_l)$ ever appears. If you added the same constant to every score the model produces, every difference, and therefore every preference probability and the entire loss, would be completely unchanged.

This means the reward model is only identified up to a shift. The absolute value of a reward is meaningless; only relative comparisons carry information. A reward of $3.0$ tells you nothing on its own; it’s only “$2.0$ better than that other response” that means anything. Preferences are inherently relative, so the scores that encode them are too. (In practice people often normalize rewards to mean zero per prompt batch precisely to pin down this free constant.)
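The shift invariance is easy to verify numerically. The sketch below adds a constant to every score, checks that the loss on a pair does not change, and then mean-centers the scores to remove the free constant. The score values, the shift of 100, and the helper names are arbitrary choices for illustration; the centering here is over one small set of scores, as a simplified stand-in for per-batch normalization.

```python
import math


def bt_loss(r_w, r_l):
    """-log sigmoid(r_w - r_l); adequate for the moderate gaps used here."""
    return math.log1p(math.exp(-(r_w - r_l)))


def center(scores):
    """Subtract the mean so the scores sum to zero (pins down the free constant)."""
    mean = sum(scores.values()) / len(scores)
    return {k: v - mean for k, v in scores.items()}


rewards = {"a": 3.0, "b": 1.0, "c": -0.5}
shifted = {k: v + 100.0 for k, v in rewards.items()}  # same constant added everywhere

loss_original = bt_loss(rewards["a"], rewards["b"])
loss_shifted = bt_loss(shifted["a"], shifted["b"])
print(loss_original, loss_shifted)  # identical: only the gap of 2.0 matters

centered = center(rewards)
centered_shifted = center(shifted)
print(centered)
print(centered_shifted)  # matches `centered` up to floating-point rounding
```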

And what the model has actually learned is a proxy for human judgment: a compression of thousands of human “this beats that” clicks into a function that generalizes to new responses. It is not a ground-truth quality oracle; it’s a learned approximation, and a flawed one. That flaw is not a footnote. Optimizing too hard against an imperfect proxy leads to reward hacking, where a policy finds ways to score high on the reward model without actually being better. That failure mode gets its own chapter (15). Measuring how good a reward model even is motivated a benchmark of its own, RewardBench, which we cover in chapter 27.

Try it

The widget below makes Bradley–Terry tangible. Set a latent score for each of two responses and watch how the preference probability $P(A \succ B) = \sigma(s_A - s_B)$ responds. Notice the two facts from above: equal scores give exactly 0.5, and only the gap between the scores matters. Shift both by the same amount and the probability doesn’t budge.

[Interactive widget: “Bradley–Terry: scores to preferences”. Two sliders set the latent scores of Response A and Response B (slider ticks at −3, 0, +3), and a plot shows $P(A \succ B) = \sigma(s_A - s_B)$ against the score gap $s_A - s_B$ (axis ticks at −6, 0, +6). Example state: $s_A = 1.20$, $s_B = -0.40$, score gap $1.60$, P(A beats B) = 83.2%, P(B beats A) = 16.8%.]

A reward model gives each response a hidden scalar score, and the chance a human prefers one over the other is the logistic of the score gap. A reward model turns pairwise human preferences into scalar scores via exactly this Bradley–Terry / logistic relationship. Pressing Train on "A > B" applies one gradient step of the loss $-\log \sigma(s_A - s_B)$: it nudges $s_A$ up and $s_B$ down, and the step shrinks as the model becomes more confident that A wins.
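The behavior of the Train button can be approximated offline. For the loss $-\log \sigma(s_A - s_B)$, the gradient with respect to $s_A$ is $-(1 - \sigma(s_A - s_B))$ and the gradient with respect to $s_B$ is its negative, so a gradient-descent step raises $s_A$ and lowers $s_B$ by the same amount. The sketch below starts from the widget's example scores; the learning rate of 0.5 and the function name `train_step` are assumptions, since the widget's actual step size is not stated.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(s_a, s_b, lr=0.5):
    """One gradient-descent step on -log sigmoid(s_a - s_b) for the label 'A > B'.

    lr is an assumed learning rate, not taken from the widget.
    """
    p = sigmoid(s_a - s_b)   # current P(A beats B)
    step = lr * (1.0 - p)    # shrinks as p approaches 1
    return s_a + step, s_b - step, step


s_a, s_b = 1.20, -0.40       # the example state shown in the widget
steps = []
for i in range(5):
    s_a, s_b, step = train_step(s_a, s_b)
    steps.append(step)
    print(f"step {i + 1}: s_A={s_a:.3f} s_B={s_b:.3f} "
          f"update={step:.4f} P(A>B)={sigmoid(s_a - s_b):.3f}")
```

Each update is smaller than the one before it, because the factor $1 - \sigma(s_A - s_B)$ shrinks as the gap widens.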

Where this is going

We now have a learned, automatic stand-in for human preference: a scalar reward we can query on any response. Chapter 10 asks what happens when even the comparisons feeding this model come from an AI rather than a human — the move to RLAIF and Constitutional AI. After that, Section 4 finally opens the other black box from the recipe: the reinforcement learning that turns this reward signal into an improved policy.