Learning from human preferences
Christiano 2017 and the pairwise idea
Papers: Deep Reinforcement Learning from Human Preferences — Christiano et al., 2017 · Fine-Tuning Language Models from Human Preferences — Ziegler et al., 2019
Supervised fine-tuning teaches a model to imitate good answers. But it has a ceiling: it can only ever copy the demonstrations it was shown, and writing a perfect demonstration for every prompt is hard, slow, and often impossible. How do you demonstrate a joke that lands, a summary that’s faithful, or an answer that’s helpful without being preachy? You usually can’t — but you can recognize one when you see it. That gap, between producing the ideal answer and recognizing it, is the whole opening for this section of the explainer.
The founding idea: judge, don’t write
The breakthrough was to stop asking humans for the right answer and instead ask them a much easier question: which of these two is better? You show a person two model outputs for the same prompt and they pick the one they prefer. That single bit of information — A beats B — turns out to be enough, in aggregate, to steer a model toward behavior no one could have written down directly.
This judgment data is called preference data: data in which humans (or an AI) compare two or more model responses to the same prompt and mark which is better. It is the training signal for reward models and DPO. The act of choosing between two candidates is a pairwise comparison: a labeler says which of two responses is better rather than scoring each on an absolute scale, which is easier and more reliable for humans and is the basis of the Bradley–Terry model. The standard notation writes a labeled example as a triple (x, y_w, y_l): a prompt x, the winning (chosen) response y_w, and the losing (rejected) response y_l. A dataset is just thousands of these triples, each one a human saying “for this prompt, this beats that.”
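To make the shape of this data concrete, here is a minimal sketch of how one labeled comparison might be stored: a prompt, a chosen response, and a rejected response. The class name `PreferencePair`, its field names, and the example texts are illustrative assumptions, not a format prescribed by either paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One human judgment: for this prompt, `chosen` beats `rejected`.

    Hypothetical container; the field names are assumptions for illustration.
    """
    prompt: str
    chosen: str    # the response the labeler preferred
    rejected: str  # the response the labeler passed over


# A toy dataset with made-up examples; a real one holds thousands of pairs.
dataset = [
    PreferencePair(
        prompt="Summarize: The cat sat on the mat all afternoon.",
        chosen="A cat spent the afternoon sitting on a mat.",
        rejected="Cats are mammals.",
    ),
    PreferencePair(
        prompt="Explain recursion in one sentence.",
        chosen="Recursion is when a function solves a problem by calling itself on smaller pieces.",
        rejected="Recursion is recursion.",
    ),
]
```

Note that nothing in the record says how good either response is in absolute terms. Each entry carries only the relative judgment.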
Christiano 2017: rewards you never had to specify
The idea didn’t start with language. In Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017), the problem was classic RL — teaching simulated robots to move, or an agent to play Atari — where the usual stumbling block is writing a reward function. How exactly do you score a backflip? Hand-coded rewards are brittle and easy to game.
Their answer reframed everything. Rather than hand-write a reward, they showed humans short pairs of video clips of the agent’s behavior and asked which clip looked more like the goal. From those comparisons they learned a reward function — a small neural network that predicts how much a human would like a given behavior — and then optimized the agent against that learned reward with ordinary RL. As the agent improved, they collected fresh comparisons on its new behavior and refined the reward, looping the two together.
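The text above does not spell out how comparisons become a reward. One common choice, and to our understanding the one used in the paper, is a Bradley–Terry-style model: sum the predicted reward over each clip, then treat the probability that a human prefers clip A as a logistic function of the difference. The sketch below assumes that form; the function names are hypothetical.

```python
import math


def clip_preference_probability(step_rewards_a, step_rewards_b):
    """Predicted probability that a human prefers clip A over clip B.

    Assumes a Bradley-Terry-style model over summed per-step rewards.
    """
    total_a = sum(step_rewards_a)
    total_b = sum(step_rewards_b)
    # exp(a) / (exp(a) + exp(b)), rewritten as a logistic of the difference
    return 1.0 / (1.0 + math.exp(total_b - total_a))


def comparison_loss(step_rewards_chosen, step_rewards_rejected):
    """Cross-entropy loss for one labeled comparison.

    Small when the reward model already ranks the chosen clip higher.
    """
    return -math.log(
        clip_preference_probability(step_rewards_chosen, step_rewards_rejected)
    )


# Equal summed rewards mean the model is undecided: probability 0.5.
undecided = clip_preference_probability([0.2, 0.3], [0.5, 0.0])

# Ranking the human's choice higher gives a lower loss than ranking it lower.
agrees = comparison_loss([1.0, 1.0], [0.0, 0.0])
disagrees = comparison_loss([0.0, 0.0], [1.0, 1.0])
```

Training the reward network then amounts to adjusting its per-step predictions so that this loss, averaged over all labeled comparisons, goes down.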
The headline result was the efficiency. With feedback on less than 1% of the agent’s interactions — under an hour of human time in some cases — agents learned complex behaviors, including some the researchers couldn’t easily have scripted a reward for at all, like a simulated Hopper performing a backflip. Humans never specified what the reward was; they only ever said this, not that, and the reward fell out of those judgments.
The loop, in three moves
Stripped to its skeleton, the recipe that every method in this section inherits is a loop over three steps:
- Collect comparisons. Sample outputs from the current model and have humans pick winners over losers, building up preference data.
- Fit a reward. Train a model to assign a scalar score that’s higher for chosen responses than for rejected ones — a learned stand-in for human judgment. (The next chapters make this precise.)
- Optimize the policy. Improve the model so it produces outputs the learned reward scores highly — then go back to step 1 with the improved model.
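The three steps can be sketched as a tiny, fully deterministic toy. Everything here is an assumption made for illustration: the "model" is a softmax over three canned responses, the "human" is a simulated labeler with a hidden quality score, the reward is one number per response fit with a Bradley–Terry-style update, and the policy step simply adds the learned reward to the logits (a natural-gradient-style update for a tabular softmax, standing in for real RL).

```python
import math

RESPONSES = ["rude", "okay", "helpful"]
HIDDEN_QUALITY = [0.0, 1.0, 2.0]  # known only to the simulated labeler


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def run_toy_loop(rounds=10, reward_lr=1.0, policy_lr=1.0):
    n = len(RESPONSES)
    policy_logits = [0.0] * n  # start with a uniform policy
    reward = [0.0] * n         # learned scalar score per response

    for _ in range(rounds):
        probs = softmax(policy_logits)

        # Step 1: collect comparisons. Instead of sampling, weight each pair
        # by how likely the current policy is to produce it.
        comparisons = []
        for a in range(n):
            for b in range(a + 1, n):
                if HIDDEN_QUALITY[a] > HIDDEN_QUALITY[b]:
                    winner, loser = a, b
                else:
                    winner, loser = b, a
                comparisons.append((winner, loser, 2.0 * probs[a] * probs[b]))

        # Step 2: fit the reward. One gradient step per comparison on
        # log sigmoid(reward[winner] - reward[loser]).
        for winner, loser, weight in comparisons:
            p_win = sigmoid(reward[winner] - reward[loser])
            step = reward_lr * weight * (1.0 - p_win)
            reward[winner] += step
            reward[loser] -= step

        # Step 3: optimize the policy toward responses the reward scores highly.
        policy_logits = [
            logit + policy_lr * r for logit, r in zip(policy_logits, reward)
        ]

    policy = dict(zip(RESPONSES, softmax(policy_logits)))
    return policy, dict(zip(RESPONSES, reward))


policy, learned_reward = run_toy_loop()
```

The labeler never reveals `HIDDEN_QUALITY` to the learner. The policy still shifts probability toward "helpful" and away from "rude" because the reward is reconstructed entirely from who beat whom.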
This loop is the engine of RLHF (Reinforcement Learning from Human Feedback), the technique that turned raw base models (models straight out of pre-training that continue text powerfully but have not yet been taught to follow instructions, hold a conversation, or refuse harmful requests) into the assistants people actually use. The name says exactly what it is: reinforcement learning, where the reward signal comes from human feedback instead of a hand-written scoring function.
Ziegler 2019: bringing it to language — and the leash
Christiano’s work proved the principle on robots and games. Two years later, Fine-Tuning Language Models from Human Preferences (Ziegler et al., 2019) carried it to text for the first time, applying the same reward-from-comparisons loop to a GPT-2-scale language model on tasks like stylistic continuation and summarization.
Moving to language exposed a new failure mode. A language model optimized hard against a learned reward will discover that the reward model is imperfect and exploit it — drifting into degenerate, repetitive, or bizarre text that scores well under the proxy reward but reads terribly to an actual human. The optimizer finds the cracks in the reward.
Ziegler’s fix is so important that we still use it in essentially every RLHF system today: penalize the policy for straying too far from where it started. Concretely, you keep a frozen copy of the model from before RL (usually the SFT model), called the reference model, and add a KL penalty: a term that subtracts β times the KL divergence from the reference policy out of the reward, so the penalty grows as the trained model’s output distribution diverges from the reference’s. The model is rewarded for pleasing the learned reward, but punished for wandering off into territory the reference would never have produced.
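As a rough sketch of the arithmetic, the penalized reward for one sampled response can be written as the learned reward minus β times the log-ratio of the policy's probability to the reference's, which is a single-sample estimate of the KL term. The function names, the default β of 0.1, and the example numbers below are assumptions for illustration, not values from the paper.

```python
import math


def kl_divergence(policy_probs, reference_probs):
    """Exact KL(policy || reference) for two categorical distributions.

    Assumes the reference gives nonzero probability wherever the policy does.
    """
    return sum(
        p * math.log(p / q)
        for p, q in zip(policy_probs, reference_probs)
        if p > 0.0
    )


def penalized_reward(learned_reward, policy_logprob, reference_logprob, beta=0.1):
    """Learned reward minus beta times the sampled log-ratio.

    The log-probabilities are those of the full response under each model.
    """
    return learned_reward - beta * (policy_logprob - reference_logprob)


# No drift: identical distributions have zero KL, and the reward is untouched.
no_drift = kl_divergence([0.5, 0.5], [0.5, 0.5])
unchanged = penalized_reward(1.0, policy_logprob=-2.0, reference_logprob=-2.0)

# Drift: the policy now favors a response the reference found unlikely,
# so part of the learned reward is taken back.
drift = kl_divergence([0.9, 0.1], [0.5, 0.5])
reduced = penalized_reward(1.0, policy_logprob=-1.0, reference_logprob=-3.0)
```

A larger β keeps the policy closer to the reference at the cost of slower reward gains. A β of zero removes the leash entirely.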
Where this is going
We now have the founding idea (compare, don’t demonstrate), the artifact it produces (preference data), the loop it powers (RLHF), and the safety leash that keeps the loop stable (the KL penalty to a reference model). What we haven’t yet shown is that any of this scales — that it works not on toy continuations but on real, useful language tasks at production scale. That’s the next chapter: how RLHF went from summarization research to the recipe behind ChatGPT.