Section 08

RLHF scales to language

Summarization, InstructGPT, and the 3-step recipe

Papers: Learning to Summarize from Human Feedback — Stiennon et al., 2020 · Training Language Models to Follow Instructions with Human Feedback (InstructGPT) — Ouyang et al., 2022

The previous chapter gave us the principle: learn a reward from comparisons, then optimize against it. But a principle that works on Atari and toy text continuations is a long way from a principle that builds a useful assistant. This chapter is the story of how RLHF crossed that gap — first on a single hard language task, then on the open-ended job of following any instruction — and arrived at a three-step recipe that became the template for the entire industry.

Stiennon 2020: the proof that it scales

The first convincing demonstration on a real, hard NLP task was Learning to Summarize from Human Feedback (Stiennon et al., 2020). The task was abstractive summarization of Reddit posts, with the resulting models also tested on news articles: a setting where a faithful, concise summary matters and is genuinely hard to write.

The recipe was exactly the loop from the last chapter. Collect human comparisons between candidate summaries, train a reward model to predict which summary a human would prefer, then fine-tune the summarizer with PPO to maximize that reward (with the KL-to-reference leash holding it steady). The striking part was the result. The RLHF-tuned model didn’t just beat a strong supervised baseline trained on the same dataset; human evaluators preferred its summaries to the reference summaries written by people, the very gold standard the supervised model was trained to imitate.

That last point is the punchline of the whole approach made concrete. By learning from judgments of quality rather than demonstrations of it, the model climbed past the ceiling of its human-written training targets. RLHF wasn’t a toy; it scaled to a real language task and produced summaries that people preferred to the human-written references.

InstructGPT 2022: from one task to following instructions

Summarization is one task. The real prize was a model that follows any instruction — answer this, rewrite that, explain this, refuse that — the behavior we now expect from a chat assistant. That’s what InstructGPT (Ouyang et al., 2022) delivered, and in doing so it crystallized the canonical RLHF recipe into three crisp steps.

Step 1 — Supervised fine-tuning

Start from a pre-trained base model and fine-tune it on a dataset of human-written demonstrations: prompts paired with high-quality responses that show the model what a good answer looks like. This is plain supervised fine-tuning, covered earlier in this explainer, and it gives the model a basic grasp of the instruction-following format. The result is the SFT model, the starting point for everything that follows.
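As a rough illustration of what Step 1 optimizes, the sketch below computes an average negative log-likelihood over the tokens of one demonstration. It is a toy in plain Python rather than a training loop: the function name `sft_loss`, its inputs, and the choice to score only the response tokens (masking out the prompt) are assumptions made for this example, not details taken from the InstructGPT paper.

```python
import math

def sft_loss(token_logprobs, is_response_token):
    """Mean negative log-likelihood over the response tokens only.

    token_logprobs[i]    : log-probability the model gave to the i-th token
                           of the demonstration (hypothetical input).
    is_response_token[i] : True for response tokens, False for prompt tokens.
    """
    # Keep only the response positions; prompt tokens are masked out
    # (an assumption of this sketch).
    losses = [-lp for lp, keep in zip(token_logprobs, is_response_token) if keep]
    if not losses:
        raise ValueError("demonstration contains no response tokens")
    return sum(losses) / len(losses)

# Toy demonstration: two prompt tokens followed by three response tokens.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, False, True, True, True]
loss = sft_loss(logprobs, mask)
```

Lowering this loss means raising the probability the model assigns to the human-written response, which is all that "imitating the demonstrations" amounts to here.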

Step 2 — Train a reward model

Sample several responses from the SFT model for each of many prompts, and have human labelers rank them from best to worst. Break those rankings into pairwise comparisons and train a reward model to assign a higher scalar score to the responses humans preferred. The RM is a learned, automatic proxy for human judgment — and the next chapter is devoted entirely to how it works.
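The two moves in Step 2, splitting a ranking into pairs and scoring each pair, can be sketched in a few lines. This is a minimal illustration, assuming a Bradley-Terry style pairwise loss of the form -log sigmoid(r_preferred - r_rejected); the helper names and the toy scores are hypothetical, and a real reward model would produce the scores from a neural network rather than a lookup table.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always the one ranked higher.
    return list(combinations(ranked_responses, 2))

def pairwise_loss(score_preferred, score_rejected):
    """Negative log-likelihood of the human choice: -log sigmoid(margin)."""
    margin = score_preferred - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)

# Hypothetical labeler ranking of four sampled responses, best first.
ranking = ["A", "B", "C", "D"]
pairs = ranking_to_pairs(ranking)  # 4 responses give 6 comparisons

# Toy scalar scores standing in for reward-model outputs.
toy_scores = {"A": 2.0, "B": 1.0, "C": 0.5, "D": -1.0}
mean_loss = sum(pairwise_loss(toy_scores[w], toy_scores[l]) for w, l in pairs) / len(pairs)
```

The loss is small when the preferred response already scores well above the rejected one and large when the order is reversed, so minimizing it pushes the scalar scores to agree with the human ranking.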

Step 3 — Optimize the policy with PPO

Now run reinforcement learning. The SFT model becomes the policy; it generates responses, the reward model scores them, and PPO nudges the policy toward responses the RM rates highly. Crucially, the objective also includes a KL penalty that pulls the policy back toward the frozen SFT model (the reference model), so it improves on the reward without drifting into degenerate text that merely games the RM.
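The KL leash can be made concrete with a small sketch of the quantity the policy is rewarded on: the reward-model score minus a penalty proportional to how far the policy's log-probabilities have moved from the reference model's. The function name, the per-token inputs, and the value of `beta` are illustrative assumptions, and the PPO update itself is left out.

```python
def kl_shaped_reward(rm_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus a KL-style penalty for one sampled response.

    policy_logprobs[i]    : log-prob of the i-th generated token under the policy.
    reference_logprobs[i] : log-prob of the same token under the frozen SFT model.
    beta                  : penalty strength (arbitrary illustrative value).
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Summed log-ratio over the response: a single-sample estimate of the
    # KL divergence between the policy and the reference model.
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return rm_score - beta * log_ratio

# No drift: the policy matches the reference, so the penalty is zero.
unchanged = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])

# Drift: the policy is more confident than the reference on its own sample,
# so part of the reward-model score is taxed away.
drifted = kl_shaped_reward(1.0, [-0.5, -1.0], [-1.0, -2.0])
```

A larger `beta` keeps the policy closer to the SFT model at the cost of slower gains in reward; a smaller one allows more movement and more room to exploit flaws in the reward model.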

The result that shocked everyone

The headline finding of InstructGPT is still one of the most cited facts in post-training. Human labelers preferred the outputs of a 1.3-billion-parameter InstructGPT model to those of the 175-billion-parameter GPT-3 — a base model more than 100× larger. Aligning a small model with human preferences beat scaling a raw model by two orders of magnitude.

The lesson is foundational for everything in this explainer: a base model has the knowledge, but post-training supplies the behavior. GPT-3 could continue text brilliantly; what it lacked was the disposition to take an instruction and actually try to be helpful, honest, and harmless. RLHF didn’t make the model smarter; it made the model’s existing intelligence usable. That realignment, not raw scale, is what turned a powerful but unwieldy text-completer into the assistant that launched a thousand products.

Where this is going

We now have the recipe (SFT, then a reward model, then PPO against it with a KL leash) and the historical proof that it works and scales. The next two chapters open up the two pieces we’ve so far treated as black boxes. First, in chapter 9, the reward model: how a single scalar head, trained with the Bradley–Terry loss, learns to stand in for human judgment. Then, in the chapter after that, the RL machinery itself: what PPO actually is, and why it’s shaped the way it is.