RLHF scales to language
Summarization, InstructGPT, and the 3-step recipe
Papers: Learning to Summarize from Human Feedback — Stiennon et al., 2020 · Training Language Models to Follow Instructions with Human Feedback (InstructGPT) — Ouyang et al., 2022
The previous chapter gave us the principle: learn a reward from comparisons, then optimize against it. But a principle that works on Atari and toy text continuations is a long way from a principle that builds a useful assistant. This chapter is the story of how RLHF crossed that gap — first on a single hard language task, then on the open-ended job of following any instruction — and arrived at a three-step recipe that became the template for the entire industry.
Stiennon 2020: the proof that it scales
The first convincing demonstration on a real, hard NLP task was Learning to Summarize from Human Feedback (Stiennon et al., 2020). The task was abstractive summarization of Reddit posts and news articles, a setting where a faithful, concise summary matters and is genuinely hard to write.
The recipe was exactly the loop from the last chapter. Collect human comparisons between candidate summaries, train a reward model (RM) to predict which summary a human would prefer, then fine-tune the summarizer with PPO to maximize that reward (with the KL-to-reference leash holding it steady). The striking part was the result. The RLHF-tuned model didn’t just beat a strong supervised model trained on the same data: human evaluators preferred its summaries over the reference summaries written by people, the very gold standard the supervised model was trained to imitate.
That last point is the punchline of the whole approach made concrete. By learning from judgments of quality rather than demonstrations of it, the model climbed past the ceiling of its human-written training targets. RLHF wasn’t a toy; it scaled to a real language task and, as judged by human preference, outperformed the human-written references on it.
InstructGPT 2022: from one task to following instructions
Summarization is one task. The real prize was a model that follows any instruction — answer this, rewrite that, explain this, refuse that — the behavior we now expect from a chat assistant. That’s what InstructGPT (Ouyang et al., 2022) delivered, and in doing so it crystallized the canonical RLHF recipe into three crisp steps.
Step 1 — Supervised fine-tuning
Start from a pre-trained base model and fine-tune it on a dataset of human-written demonstrations: prompts paired with high-quality responses that show the model what a good answer looks like. This is plain supervised fine-tuning (SFT), the subject of the previous section, and it gives the model a basic grasp of the instruction-following format. The result is the SFT model, the starting point for everything that follows.
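As a rough illustration of the SFT objective, the sketch below computes an ordinary next-token loss from per-token log-probabilities. The function name `sft_loss`, the `is_response_token` mask, and the choice to average only over response tokens are assumptions made for this example (masking the prompt is a common convention, not something the InstructGPT recipe as described here specifies).

```python
def sft_loss(token_logprobs, is_response_token):
    """Mean negative log-likelihood over the demonstrated response tokens.

    token_logprobs: log-probability the model assigned to each target token.
    is_response_token: True where the token belongs to the human-written
        response, False where it belongs to the prompt (assumed mask).
    """
    losses = [-lp for lp, keep in zip(token_logprobs, is_response_token) if keep]
    if not losses:
        raise ValueError("no response tokens to train on")
    return sum(losses) / len(losses)


# Hypothetical four-token sequence: two prompt tokens, two response tokens.
example_logprobs = [-0.25, -0.5, -1.0, -3.0]
example_mask = [False, False, True, True]
example_loss = sft_loss(example_logprobs, example_mask)
```

Minimizing this quantity pushes the model to assign higher probability to the demonstrated response, which is all that "imitating the demonstrations" means at the level of the loss.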
Step 2 — Train a reward model
Sample several responses from the SFT model for each of many prompts, and have human labelers rank them from best to worst. Break those rankings into pairwise comparisons and train a reward model (RM) to assign a higher scalar score to the responses humans preferred. The RM is a learned, automatic proxy for human judgment, and the next chapter is devoted entirely to how it works.
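To make this step concrete, here is a minimal sketch of the two operations it describes: expanding one best-to-worst ranking into pairwise comparisons, and scoring a single comparison with a Bradley–Terry style loss. The function names and the toy scores are hypothetical; a real reward model would produce the scores from a neural network and average the loss over many pairs.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() preserves input order, so the first item of each pair
    is always the higher-ranked response. K responses yield K*(K-1)/2 pairs.
    """
    return list(combinations(ranked_responses, 2))


def bradley_terry_loss(score_preferred, score_rejected):
    """Negative log-probability that the preferred response wins.

    Under Bradley-Terry, P(preferred wins) = sigmoid(margin), so the loss
    is -log(sigmoid(margin)) = log(1 + exp(-margin)).
    """
    margin = score_preferred - score_rejected
    # Two algebraically equal forms, chosen so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical ranking of four sampled responses, best first.
pairs = ranking_to_pairs(["A", "B", "C", "D"])
```

The loss is small when the reward model scores the human-preferred response well above the rejected one, and grows when the scores are tied or inverted, which is what drives the model toward agreeing with the labelers.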
Step 3 — Optimize the policy with PPO
Now run reinforcement learning. The SFT model becomes the policy; it generates responses, the reward model scores them, and PPO nudges the policy toward responses the RM rates highly. Crucially, the objective also includes a KL penalty that pulls the policy back toward the frozen SFT model (the reference model), so it improves on the reward without drifting into degenerate text that merely games the RM.
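The KL-penalized reward can be sketched in a few lines. This is an illustration under stated assumptions, not the InstructGPT implementation: the function name, the default `beta=0.1`, and the choice to sum the penalty over the whole response are all hypothetical (implementations often apply the penalty per token instead). The PPO update itself is omitted.

```python
def kl_penalized_reward(rm_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus beta times an estimate of KL(policy || reference).

    policy_logprobs / reference_logprobs: log-probabilities that the current
    policy and the frozen reference model assign to each token of a response
    sampled from the policy. Summing the per-token log-ratios gives a
    single-sample estimate of the sequence-level KL divergence.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return rm_score - beta * kl_estimate


# Hypothetical response: the policy is more confident than the reference
# on the first two tokens, so part of the RM score is taxed away.
shaped = kl_penalized_reward(
    rm_score=2.0,
    policy_logprobs=[-1.0, -0.5, -2.0],
    reference_logprobs=[-1.5, -1.0, -2.0],
    beta=0.5,
)
```

When the policy assigns the same probabilities as the reference, the penalty is zero and the reward is just the RM score; the further the policy drifts toward text the reference finds unlikely, the more reward it gives up, and `beta` sets that exchange rate.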
The result that shocked everyone
The headline finding of InstructGPT is still one of the most cited facts in post-training. Human labelers preferred the outputs of a 1.3-billion-parameter InstructGPT model to those of the 175-billion-parameter GPT-3 — a base model more than 100× larger. Aligning a small model with human preferences beat scaling a raw model by two orders of magnitude.
The lesson is foundational for everything in this explainer: a base model has the knowledge, but post-training supplies the behavior. GPT-3 could continue text brilliantly; what it lacked was the disposition to take an instruction and actually try to be helpful, honest, and harmless. RLHF didn’t make the model smarter; it made the model’s existing intelligence usable. That realignment, not raw scale, is what turned a powerful but unwieldy text-completer into the assistant that launched a thousand products.
Where this is going
We now have the recipe — SFT, then a reward model, then PPO against it with a KL leash — and the historical proof that it works and scales. The next two chapters open up the two pieces we’ve so far treated as black boxes. First, in chapter 9, the reward model: how a single scalar head, trained with the Bradley–Terry loss, learns to stand in for human judgment. Then the RL machinery itself in Section 4 — what PPO actually is, and why it’s shaped the way it is.