Section 07

Learning from human preferences

Christiano 2017 and the pairwise idea

Papers: Deep Reinforcement Learning from Human Preferences — Christiano et al., 2017 · Fine-Tuning Language Models from Human Preferences — Ziegler et al., 2019

Supervised fine-tuning teaches a model to imitate good answers. But it has a ceiling: the model can only learn from the demonstrations it was shown, and writing a perfect demonstration for every prompt is hard, slow, and often impossible. How do you demonstrate a joke that lands, a summary that’s faithful, or an answer that’s helpful without being preachy? You usually can’t, but you can recognize one when you see it. That gap, between producing the ideal answer and recognizing it, is the opening that every method in this section exploits.

The founding idea: judge, don’t write

The breakthrough was to stop asking humans for the right answer and instead ask them a much easier question: which of these two is better? You show a person two model outputs for the same prompt and they pick the one they prefer. That single bit of information — A beats B — turns out to be enough, in aggregate, to steer a model toward behavior no one could have written down directly.

This judgment data is called preference data, and the act of choosing between two candidates is a pairwise comparison. The standard notation writes a labeled example as a triple $(x, y_w, y_l)$: a prompt $x$, the winning (chosen) response $y_w$, and the losing (rejected) response $y_l$. A dataset is just thousands of these triples, each one a human saying “for this prompt, this beats that.”
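To make the triple concrete, here is a minimal sketch of how one comparison might be stored. The class name `PreferencePair`, the helper `label_pair`, and the example prompts are all hypothetical, chosen only for illustration; real datasets use their own schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One labeled comparison (x, y_w, y_l). Hypothetical schema for illustration."""
    prompt: str    # x
    chosen: str    # y_w, the response the annotator preferred
    rejected: str  # y_l, the response the annotator passed over


def label_pair(prompt: str, response_a: str, response_b: str, a_wins: bool) -> PreferencePair:
    """Turn a single 'A beats B' (or 'B beats A') judgment into a triple."""
    if a_wins:
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    return PreferencePair(prompt, chosen=response_b, rejected=response_a)


# A tiny made-up dataset: each entry is one human click.
dataset = [
    label_pair("Summarize: The cat sat on the mat all day.",
               "A cat spent the day on a mat.", "Cats are mammals.", a_wins=True),
    label_pair("Tell me a joke about compilers.",
               "Compilers exist.", "My code compiled on the first try. I'm suspicious.",
               a_wins=False),
]
```

Note that the annotator never scores anything. The only information recorded is which side of the pair won.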

Why comparison beats demonstration

Recognizing quality is easier than generating it. Three concrete payoffs follow:

Cheaper and faster. Reading two summaries and clicking the better one takes seconds; writing a gold-standard summary from scratch takes minutes and real expertise.
More reliable. Two annotators asked to write the ideal answer will produce wildly different text. Asked to choose between the same pair, they agree much more often, because “which is better?” has only two possible answers and usually one right-ish one.
It can exceed the demonstrator. SFT is capped at the quality of the humans who wrote the demonstrations. Preference learning is capped only by the quality of the humans who judge, and people can often recognize answers better than any they could produce themselves. This is how the method can eventually surpass its teachers.

Christiano 2017: rewards you never had to specify

The idea didn’t start with language. In Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017), the problem was classic RL — teaching simulated robots to move, or an agent to play Atari — where the usual stumbling block is writing a reward function. How exactly do you score a backflip? Hand-coded rewards are brittle and easy to game.

Their answer reframed the problem. Rather than hand-write a reward, they showed humans pairs of short video clips of the agent’s behavior and asked which clip looked more like the goal. From those comparisons they learned a reward function (a small neural network that predicts how much a human would like a given behavior) and then optimized the agent against that learned reward with ordinary RL. As the agent improved, they collected fresh comparisons on its new behavior and refined the reward, looping the two together.
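One common way to turn comparisons into a reward is a Bradley-Terry-style model: the probability that the human prefers clip A over clip B is taken to be `sigmoid(r(A) - r(B))`, and the reward `r` is fit by maximizing the likelihood of the observed choices. The sketch below illustrates that idea under heavy simplifications that are assumptions of this example, not details from the paper: each clip is reduced to a made-up two-number feature vector, the reward is linear rather than a neural network, and the "human" is simulated by a hidden scoring function.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reward(weights, features):
    """Linear stand-in for the learned reward network (an assumption of this sketch)."""
    return sum(w * f for w, f in zip(weights, features))


def fit_reward(comparisons, dim, learning_rate=0.1, epochs=50):
    """Gradient ascent on the log-likelihood sum(log sigmoid(r(winner) - r(loser)))."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for winner, loser in comparisons:
            margin = reward(weights, winner) - reward(weights, loser)
            # d/dw log sigmoid(margin) = (1 - sigmoid(margin)) * (winner - loser)
            scale = 1.0 - sigmoid(margin)
            for i in range(dim):
                weights[i] += learning_rate * scale * (winner[i] - loser[i])
    return weights


def hidden_score(features):
    """Simulated human taste: likes feature 0, dislikes feature 1. Never shown to the learner."""
    return 2.0 * features[0] - 1.0 * features[1]


rng = random.Random(0)
comparisons = []
for _ in range(200):
    a = [rng.random(), rng.random()]
    b = [rng.random(), rng.random()]
    comparisons.append((a, b) if hidden_score(a) > hidden_score(b) else (b, a))

weights = fit_reward(comparisons, dim=2)
correct = sum(reward(weights, w) > reward(weights, l) for w, l in comparisons)
accuracy = correct / len(comparisons)
print("learned weights:", weights, "agreement with labels:", accuracy)
```

The learner only ever sees (winner, loser) pairs, yet the fitted weights end up pointing in the same direction as the hidden preference. That is the sense in which the reward "falls out" of the judgments.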

The headline result was the efficiency. With feedback on less than 1% of the agent’s interactions — under an hour of human time in some cases — agents learned complex behaviors, including some the researchers couldn’t easily have scripted a reward for at all, like a simulated Hopper performing a backflip. Humans never specified what the reward was; they only ever said this, not that, and the reward fell out of those judgments.

The loop, in three moves

Stripped to its skeleton, the recipe that every method in this section inherits is a loop over three steps:

1. Collect comparisons. Sample outputs from the current model and have humans pick winners over losers, building up preference data $(x, y_w, y_l)$.
2. Fit a reward. Train a model to assign a scalar score that’s higher for chosen responses than for rejected ones: a learned stand-in for human judgment. (The next chapters make this precise.)
3. Optimize the policy. Improve the model so it produces outputs the learned reward scores highly, then go back to step 1 with the improved model.
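The three steps can be sketched as a toy loop. Everything here is a deliberately crude stand-in and an assumption of this example: the "policy" is a probability table over three canned responses, the "human" is a fixed hidden ranking, the "reward model" is just each response's win rate, and the "optimization" is a simple exponential reweighting rather than a real RL algorithm.

```python
import math
import random

RESPONSES = ["rambling answer", "okay answer", "clear answer"]
# Simulated annotator's taste; the loop never reads this directly, only via comparisons.
HIDDEN_QUALITY = {"rambling answer": 0.0, "okay answer": 1.0, "clear answer": 2.0}


def collect_comparisons(policy, attempts, rng):
    """Step 1: sample pairs from the current policy; the simulated human picks a winner."""
    pairs = []
    weights = [policy[r] for r in RESPONSES]
    for _ in range(attempts):
        a, b = rng.choices(RESPONSES, weights=weights, k=2)
        if a == b:
            continue  # identical outputs carry no preference information
        winner, loser = (a, b) if HIDDEN_QUALITY[a] > HIDDEN_QUALITY[b] else (b, a)
        pairs.append((winner, loser))
    return pairs


def fit_reward(pairs):
    """Step 2: crude learned reward = win rate; 0.5 for responses never compared."""
    wins = {r: 0 for r in RESPONSES}
    seen = {r: 0 for r in RESPONSES}
    for winner, loser in pairs:
        wins[winner] += 1
        seen[winner] += 1
        seen[loser] += 1
    return {r: (wins[r] / seen[r] if seen[r] else 0.5) for r in RESPONSES}


def improve_policy(policy, reward, step_size=1.0):
    """Step 3: upweight responses the learned reward scores highly, then renormalize."""
    unnormalized = {r: policy[r] * math.exp(step_size * reward[r]) for r in RESPONSES}
    total = sum(unnormalized.values())
    return {r: value / total for r, value in unnormalized.items()}


rng = random.Random(0)
policy = {r: 1.0 / len(RESPONSES) for r in RESPONSES}
for round_index in range(5):
    pairs = collect_comparisons(policy, attempts=50, rng=rng)
    reward = fit_reward(pairs)
    policy = improve_policy(policy, reward)
    print(round_index, {r: round(p, 3) for r, p in policy.items()})
```

Each round samples from the updated policy, so later comparisons concentrate on the responses the model now actually produces. That is the reason for looping rather than labeling everything once up front.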

This loop is the engine of RLHF — Reinforcement Learning from Human Feedback — the technique that turned raw base models into the assistants people actually use. The name says exactly what it is: reinforcement learning, where the reward signal comes from human feedback instead of a hand-written scoring function.

Ziegler 2019: bringing it to language — and the leash

Christiano’s work proved the principle on robots and games. Two years later, Fine-Tuning Language Models from Human Preferences (Ziegler et al., 2019) carried it to pretrained language models, applying the same reward-from-comparisons loop to GPT-2 on tasks like stylistic continuation and summarization.

Moving to language exposed a new failure mode. A language model optimized hard against a learned reward will discover that the reward model is imperfect and exploit it — drifting into degenerate, repetitive, or bizarre text that scores well under the proxy reward but reads terribly to an actual human. The optimizer finds the cracks in the reward.

Ziegler’s fix remains part of most RLHF systems today: penalize the policy for straying too far from where it started. Concretely, you keep a frozen copy of the model from before RL, called the reference model, and add a KL penalty that grows as the trained model’s output distribution diverges from the reference’s. The model is rewarded for pleasing the learned reward, but punished for wandering off into territory the reference would never have produced.
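In per-sample form the penalized objective is usually written as $R(x, y) = r(x, y) - \beta \, [\log \pi(y \mid x) - \log \rho(y \mid x)]$, where $\pi$ is the policy being trained, $\rho$ is the frozen reference, and $\beta$ sets how tight the leash is. Averaged over samples from $\pi$, the bracketed log-ratio is the KL divergence from the reference. The sketch below only evaluates that formula. The function name, the log-probabilities, and the value of `beta` are illustrative assumptions, not numbers from the paper.

```python
def kl_penalized_reward(reward_score, policy_logprob, reference_logprob, beta=0.1):
    """R(x, y) = r(x, y) - beta * (log pi(y|x) - log rho(y|x)).

    policy_logprob and reference_logprob are the total log-probabilities that the
    trained policy and the frozen reference assign to the same sampled response.
    """
    log_ratio = policy_logprob - reference_logprob
    return reward_score - beta * log_ratio


# A response both models find about equally likely: the penalty is tiny.
on_distribution = kl_penalized_reward(
    reward_score=1.0, policy_logprob=-12.0, reference_logprob=-12.5)

# A response the reward model loves but the reference finds wildly unlikely:
# the penalty outweighs the higher raw reward.
drifted = kl_penalized_reward(
    reward_score=1.5, policy_logprob=-3.0, reference_logprob=-40.0)

print("on-distribution:", on_distribution, "drifted:", drifted)
```

With a larger `beta` the policy stays closer to the reference and exploits fewer flaws in the reward model, at the cost of changing its behavior less. A smaller `beta` trades the other way.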

Where this is going

We now have the founding idea (compare, don’t demonstrate), the artifact it produces (preference data), the loop it powers (RLHF), and the safety leash that keeps the loop stable (the KL penalty to a reference model). What we haven’t yet shown is that any of this scales — that it works not on toy continuations but on real, useful language tasks at production scale. That’s the next chapter: how RLHF went from summarization research to the recipe behind ChatGPT.