Section 09

Reward models

Bradley–Terry and what an RM really learns

Statistical foundation: Bradley & Terry, Rank Analysis of Incomplete Block Designs (1952)

Step 2 of the RLHF recipe asks for something that sounds almost paradoxical: turn a pile of human “this beats that” judgments into a function that can score any response with a single number, including responses no human ever ranked. That function is the reward model, and this chapter is about what it is, how it’s trained, and, just as important, what it secretly is and isn’t.

What a reward model is, physically

A reward model is, almost always, the SFT model with its head swapped out. Recall that a language model ends in a head that projects the final hidden state into logits over the vocabulary. For a reward model, you throw that head away and bolt on a tiny scalar head: a single linear layer that maps the final hidden state (typically at the last token of the response) to one number, the reward $r_\theta(x, y)$ for response $y$ given prompt $x$.

Starting from the SFT model matters. The reward model needs to understand language to judge it, and the SFT model already does; we’re only teaching it a new, much smaller skill — converting that understanding into a quality score. The body is initialized from the SFT checkpoint; only the scalar head starts fresh.
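As a rough illustration of the head swap, the sketch below applies a scalar head to the hidden state at the last token. It uses plain Python lists in place of a real transformer body. The names `scalar_head`, `weight`, and `bias`, and the toy hidden states, are hypothetical and not taken from any particular library.

```python
def scalar_head(hidden_states, weight, bias=0.0):
    """Map the final hidden state of a sequence to one scalar reward.

    hidden_states: list of per-token vectors (assumed to come from the
                   SFT-initialized body; toy numbers here).
    weight, bias:  parameters of the freshly initialized linear head.
    """
    last_token_state = hidden_states[-1]  # state at the last response token
    return sum(w * h for w, h in zip(weight, last_token_state)) + bias


# Toy example: 3 tokens, hidden size 2 (made-up values).
hidden = [[0.1, 0.2], [0.0, -0.3], [1.0, 2.0]]
reward = scalar_head(hidden, weight=[0.5, -0.25])
print(reward)
```

In a real model the head has only as many weights as the hidden size (plus an optional bias), which is why it counts as tiny next to the vocabulary-sized head it replaces.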

The Bradley–Terry model: from scores to preferences

Here’s the central problem. Our data is comparisons — $y_w$ beat $y_l$ — but we want to output a scalar score. We need a bridge that says: given two scores, what’s the probability a human prefers one over the other? The Bradley–Terry model, from 1952, is exactly that bridge.

It says the probability that the winner is preferred to the loser is the softmax — here in its two-item form, the logistic sigmoid $\sigma$ — of the difference in their scores:

$$P(y_w \succ y_l) = \sigma\big(r(y_w) - r(y_l)\big) = \frac{1}{1 + e^{-(r(y_w) - r(y_l))}}$$

Read it intuitively. If the two scores are equal, the difference is zero and $\sigma(0) = 0.5$ — a coin flip, as it should be. As the winner’s score pulls ahead, the difference grows positive and the probability climbs toward 1. The model is a smooth, probabilistic statement that bigger score means more likely to be preferred, with the gap controlling how confident the preference is.
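The same reading can be checked numerically. This is a minimal sketch using only the standard library. The function names `sigmoid` and `preference_probability` and the example scores are illustrative choices.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_probability(r_w, r_l):
    """Bradley-Terry probability that the response scored r_w beats r_l."""
    return sigmoid(r_w - r_l)


print(preference_probability(1.0, 1.0))   # equal scores: exactly 0.5
print(preference_probability(1.2, -0.4))  # gap of 1.6: roughly 0.83
print(preference_probability(-0.4, 1.2))  # roles reversed: the complement
```

The last two probabilities sum to 1, because $\sigma(-z) = 1 - \sigma(z)$.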

Training the reward model

Now training is a one-liner. We have a labeled preference $(x, y_w, y_l)$: we observed the human prefer $y_w$. The Bradley–Terry model gives us the probability our reward function assigns to that observation. We just maximize that probability, which is the same as minimizing its negative log, a maximum-likelihood fit. For a single example:

$$\mathcal{L}(\theta) = -\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$

This is the entire reward-model objective. It is exactly a binary cross-entropy loss on the preference label, with the score difference as the logit and “chosen wins” as the only label, and it does precisely what you’d want: it pushes the score of the chosen response up and the score of the rejected response down, until the gap between them is large enough that $\sigma$ of the difference is close to 1. Average it over your whole dataset of triples, run gradient descent, and you have a reward model.
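A small sketch of this objective, operating directly on already-computed scores rather than on a real network. It assumes the rewriting $-\log\sigma(d) = \log(1 + e^{-d})$ for numerical stability. The names `pairwise_loss` and `mean_loss` and the score pairs are made up for illustration.

```python
import math


def pairwise_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), computed as log(1 + exp(-gap))."""
    gap = r_chosen - r_rejected
    # Branch on the sign of the gap so exp() never sees a large positive input.
    if gap >= 0:
        return math.log1p(math.exp(-gap))
    return -gap + math.log1p(math.exp(gap))


def mean_loss(score_pairs):
    """Average the pairwise loss over (r_chosen, r_rejected) pairs."""
    return sum(pairwise_loss(c, r) for c, r in score_pairs) / len(score_pairs)


# Made-up scores: a tie, a correctly ranked pair, an incorrectly ranked pair.
pairs = [(0.0, 0.0), (2.0, -1.0), (-1.0, 2.0)]
for chosen, rejected in pairs:
    print(chosen, rejected, round(pairwise_loss(chosen, rejected), 4))
print("mean:", round(mean_loss(pairs), 4))
```

A tie costs $\log 2$, a correctly ranked pair costs less, and a pair ranked the wrong way costs roughly the size of the gap. That asymmetry is what drives the chosen score up and the rejected score down.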

Where the loss comes from

The derivation is three short moves. (1) Model the data: assume each comparison follows Bradley–Terry, so $P(y_w \succ y_l) = \sigma(r_\theta(x,y_w) - r_\theta(x,y_l))$. (2) Write the likelihood: the probability of the whole dataset is the product of that term over every observed preference (assuming independence). (3) Take negative log: products become sums, and maximizing the likelihood becomes minimizing $\sum -\log\sigma(\Delta r)$, where $\Delta r = r_\theta(x,y_w) - r_\theta(x,y_l)$ is the score gap for each pair. That’s it: the RM loss is just maximum-likelihood estimation under the Bradley–Terry model, identical in form to logistic regression where the “feature” is the score difference the network itself learns to compute.
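Written out, the three moves look like this. The pair index $i$, the dataset size $N$, and the symbol $\mathcal{D}$ for the dataset are notation introduced here for the sketch. The last line adds the gradient with respect to the score gap.

```latex
\begin{aligned}
\text{(1) model:} \quad & P_\theta(y_w \succ y_l \mid x) = \sigma(\Delta r), \qquad \Delta r = r_\theta(x, y_w) - r_\theta(x, y_l) \\
\text{(2) likelihood:} \quad & P_\theta(\mathcal{D}) = \prod_{i=1}^{N} \sigma(\Delta r_i) \\
\text{(3) negative log:} \quad & -\log P_\theta(\mathcal{D}) = \sum_{i=1}^{N} -\log \sigma(\Delta r_i) \\
\text{gradient:} \quad & \frac{\partial}{\partial \Delta r}\big[-\log \sigma(\Delta r)\big] = -\big(1 - \sigma(\Delta r)\big) = -\sigma(-\Delta r)
\end{aligned}
```

The gradient's magnitude, $1 - \sigma(\Delta r)$, is the probability the model currently assigns to the wrong outcome. Pairs the model already ranks confidently therefore contribute almost nothing to the update.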

What the reward model actually learns — and a subtlety

It’s tempting to think the reward model learns “the quality” of a response in some absolute sense. It doesn’t. Look again at the loss: only the difference $r_\theta(x, y_w) - r_\theta(x, y_l)$ ever appears. If you added the same constant to every score the model produces, every difference — and therefore every preference probability and the entire loss — would be completely unchanged.

This means the reward model is only identified up to a shift. The absolute value of a reward is meaningless; only relative comparisons carry information. A reward of $3.0$ tells you nothing on its own; it’s only “$2.0$ better than that other response” that means anything. Preferences are inherently relative, so the scores that encode them are too. (In practice people often normalize rewards to mean zero per prompt batch precisely to pin down this free constant.)
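The shift invariance is easy to confirm numerically. The sketch below uses made-up rewards and an arbitrary constant. The name `bt_loss` is illustrative, and the mean-centering at the end is one simple way to pin down the free constant, not necessarily the normalization any particular system uses.

```python
import math


def bt_loss(r_w, r_l):
    """Pairwise Bradley-Terry loss, -log sigmoid(r_w - r_l)."""
    return math.log1p(math.exp(-(r_w - r_l)))


chosen, rejected = 3.0, 1.0   # made-up rewards
shift = 100.0                 # arbitrary constant added to both

original = bt_loss(chosen, rejected)
shifted = bt_loss(chosen + shift, rejected + shift)
print(original, shifted)      # identical: only the gap enters the loss

# Mean-centering a batch of rewards fixes the constant
# without changing any pairwise gap.
batch = [3.0, 1.0, -0.5, 2.5]
mean = sum(batch) / len(batch)
centered = [r - mean for r in batch]
print(centered, sum(centered))
```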

And what the model has actually learned is a proxy for human judgment: a compression of thousands of human “this beats that” clicks into a function that generalizes to new responses. It is not truth, not a ground-truth quality oracle; it’s a learned approximation, and a flawed one. That flaw is not a footnote. Optimizing too hard against an imperfect proxy is reward hacking, the failure mode that gets its own chapter (15), and the question of how to measure a reward model’s quality led to a benchmark, RewardBench, which we cover in chapter 27.

Try it

The widget below makes Bradley–Terry tangible. Set a latent score for each of two responses and watch how the preference probability $P(A \succ B) = \sigma(s_A - s_B)$ responds. Notice the two facts from above: equal scores give exactly 0.5, and only the gap between the scores matters — shift both by the same amount and the probability doesn’t budge.

Bradley–Terry: scores to preferences (interactive widget)

A reward model gives each response a hidden scalar score. The chance a human prefers one over the other is the logistic of the score gap, $P(A \succ B) = \sigma(s_A - s_B)$.

Example state (sliders run from −3 to +3): $s_A = 1.20$, $s_B = -0.40$, score gap $s_A - s_B = 1.60$, so $P(A \succ B) \approx 83.2\%$ and $P(B \succ A) \approx 16.8\%$.

A reward model turns pairwise human preferences into scalar scores via exactly this Bradley–Terry / logistic relationship. Pressing Train on "A > B" applies one gradient step of the loss −log σ(s_A − s_B): it nudges s_A up and s_B down, and the step shrinks as the model grows more confident that A wins.
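For readers without the widget, here is a sketch of what such a training step could look like on two bare scores. The learning rate `lr=0.5` and the function name `train_step` are assumptions; the widget's actual step size is not stated.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(s_a, s_b, lr=0.5):
    """One gradient-descent step on -log sigmoid(s_a - s_b) for the label A > B."""
    push = sigmoid(-(s_a - s_b))  # equals 1 - P(A beats B)
    # d(loss)/d(s_a) = -push and d(loss)/d(s_b) = +push, so descend accordingly.
    return s_a + lr * push, s_b - lr * push


s_a, s_b = 1.2, -0.4  # the example scores shown above
for step in range(5):
    p = sigmoid(s_a - s_b)
    print(f"step {step}: s_A={s_a:.3f} s_B={s_b:.3f} P(A>B)={p:.3f}")
    s_a, s_b = train_step(s_a, s_b)
```

Each step moves the two scores by equal and opposite amounts, and that amount is the probability still assigned to B winning, so the updates fade as $P(A \succ B)$ approaches 1.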

Where this is going

We now have a learned, automatic stand-in for human preference: a scalar reward we can query on any response. Chapter 10 asks what happens when even the comparisons feeding this model come from an AI rather than a human — the move to RLAIF and Constitutional AI. After that, Section 4 finally opens the other black box from the recipe: the reinforcement learning that turns this reward signal into an improved policy.