Rejection-sampling alignment
RAFT, RRHF, and best-of-N fine-tuning
Paper: RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment — Dong et al., 2023
Here is the simplest alignment idea in this entire explainer, and it might be the most underrated. Forget reward models in a loop, forget PPO, forget even the closed-form algebra of DPO. Just do this: ask the model to generate several answers, keep the best ones, and fine-tune on them. Repeat. That’s the whole method, and it’s hiding inside the post-training pipeline of nearly every frontier model.
Generate, filter, fine-tune
The recipe goes by a few names but the loop is always the same:
- For each prompt, sample N candidate answers from the current policy.
- Score each candidate with some reward signal: a reward model, a human rating, or, as we’ll see, a verifier.
- Keep the best (the top-scoring one, or all that clear a threshold) and fine-tune the model on those winners with a plain supervised cross-entropy loss.
- Iterate: the improved model generates better candidates next round.
This is rejection sampling, a classical statistical idea (propose samples, accept the good ones, reject the rest) repurposed for alignment. Keeping the single best of N candidates is exactly best-of-N sampling; the twist here is that instead of just returning the best answer at inference time, you train on it, baking the best-of-N behavior back into the weights.
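The loop described above can be sketched in a few lines. This is a minimal illustration, not any paper's reference implementation: `Sampler`, `Reward`, `collect_winners`, `rejection_sampling_loop`, and the toy stand-ins are all hypothetical names, and `toy_fine_tune` merely memorizes the winners where a real system would run an SFT job.

```python
import itertools
from typing import Callable, List, Tuple

Sampler = Callable[[str], str]          # prompt -> one sampled answer
Reward = Callable[[str, str], float]    # (prompt, answer) -> scalar score
FineTune = Callable[[Sampler, List[Tuple[str, str]]], Sampler]


def best_of_n(prompt: str, sample: Sampler, reward: Reward, n: int) -> str:
    """Sample n candidates for one prompt and return the highest-scoring one."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: reward(prompt, answer))


def collect_winners(prompts: List[str], sample: Sampler,
                    reward: Reward, n: int) -> List[Tuple[str, str]]:
    """Generate and filter: one (prompt, best answer) pair per prompt."""
    return [(p, best_of_n(p, sample, reward, n)) for p in prompts]


def rejection_sampling_loop(prompts: List[str], sample: Sampler, reward: Reward,
                            fine_tune: FineTune, n: int = 4,
                            rounds: int = 3) -> Sampler:
    """Repeat generate -> filter -> fine-tune for a fixed number of rounds."""
    for _ in range(rounds):
        winners = collect_winners(prompts, sample, reward, n)
        sample = fine_tune(sample, winners)  # the next round samples from the updated policy
    return sample


# Toy stand-ins (assumptions for illustration only).
_answers = itertools.cycle(["ok", "a longer answer", "mid one"])


def toy_sample(prompt: str) -> str:
    return next(_answers)


def toy_reward(prompt: str, answer: str) -> float:
    return float(len(answer))  # pretend that longer answers score higher


def toy_fine_tune(sample: Sampler, winners: List[Tuple[str, str]]) -> Sampler:
    lookup = dict(winners)  # "training" here is just memorizing the winners
    return lambda prompt: lookup.get(prompt, sample(prompt))


winners = collect_winners(["q1", "q2"], toy_sample, toy_reward, n=3)
policy = rejection_sampling_loop(["q1", "q2"], toy_sample, toy_reward,
                                 toy_fine_tune, n=3, rounds=2)
print(winners)
print(policy("q1"))
```

Swapping `best_of_n` for a threshold filter gives the "keep all that clear a threshold" variant; nothing else in the loop has to change.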
RAFT (Reward rAnked FineTuning, Dong et al. 2023) is the clean formulation of this loop: sample a batch, rank by reward, fine-tune on the top-ranked, repeat. RRHF (Yuan et al. 2023) is a close cousin from the same year: it scores a set of candidates and uses a ranking loss to make the model prefer the higher-reward ones, blending the rejection-sampling idea with a contrast between candidates.
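To make "a ranking loss plus the rejection-sampling idea" concrete, here is a rough sketch of an RRHF-style objective on plain Python floats. It follows the general shape of the method (a hinge penalty whenever a lower-reward candidate is more likely than a higher-reward one, plus a supervised term on the best candidate). The function name, the length normalization, and the unweighted sum of the two terms are assumptions made for illustration, so treat it as a reading aid rather than the paper's exact loss.

```python
from typing import List


def rrhf_style_loss(seq_logprobs: List[float], lengths: List[int],
                    rewards: List[float]) -> float:
    """Ranking penalty over candidate pairs plus an SFT term on the best candidate.

    seq_logprobs[i]: summed token log-probability of candidate i under the model
    lengths[i]:      number of tokens in candidate i
    rewards[i]:      scalar reward for candidate i
    """
    # Length-normalized log-probabilities, so long answers are not penalized
    # simply for being long (an assumption of this sketch).
    scores = [lp / n for lp, n in zip(seq_logprobs, lengths)]

    # Hinge penalty for every pair where the lower-reward candidate
    # currently has the higher model score.
    rank_loss = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if rewards[i] < rewards[j]:
                rank_loss += max(0.0, scores[i] - scores[j])

    # Supervised term: negative log-likelihood of the highest-reward candidate.
    best = max(range(len(rewards)), key=lambda k: rewards[k])
    sft_loss = -seq_logprobs[best]

    return rank_loss + sft_loss


# The model prefers candidate 0, but the reward prefers candidate 1,
# so the ranking term is active.
misranked = rrhf_style_loss([-4.0, -6.0], [4, 4], [0.2, 0.9])
# Model and reward agree, so only the supervised term remains.
agreed = rrhf_style_loss([-4.0, -6.0], [4, 4], [0.9, 0.2])
print(misranked, agreed)
```

Unlike plain rejection sampling, the ranking term gives the losers a say: a rejected candidate contributes gradient whenever the model still scores it above a better one.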
Why it’s so appealing
Rejection-sampling alignment is offline in the same sense DPO is (each round trains on a frozen batch of pre-scored samples), but it’s even more bare-bones, and the simplicity buys real advantages:
- Dead simple. It’s just sampling plus standard supervised fine-tuning. Any team that can run generation and an SFT job can do it; there’s no new optimizer, no special loss to debug.
- Stable. There’s no importance sampling, no clipped surrogate, no critic, no on-policy sampling loop to diverge. You can’t blow up a training run the way PPO can; the worst case is that it simply doesn’t improve.
- Trivially parallel and reusable. Generation is embarrassingly parallel, the filtered data is just text you can inspect, mix, dedup, and reuse, and you can throw a stronger reward model at it later without changing the training code.
The contrast with the methods around it is sharp. Against PPO, rejection sampling drops the entire RL apparatus (no clipping, no critic, no KL-penalized policy gradient) at the cost of being off-policy and only as good as your best-of-N samples. Against DPO, the difference is what signal it consumes: DPO needs pairwise preferences and learns from both the chosen and the rejected; rejection sampling needs only a score, and in its basic form learns from the positives only (or best-vs-rest), never explicitly pushing down the losers. Less information per example, but a much easier signal to obtain.
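The difference in training signal shows up in the shape of the data each method trains on. The sketch below takes one prompt's scored candidates and builds both views. The helper names are hypothetical, and deriving preference pairs from scalar scores is an assumption made only to put the two formats side by side; DPO datasets usually come from direct pairwise judgments.

```python
from itertools import combinations
from typing import List, Tuple

Scored = List[Tuple[str, float]]  # (answer, score) pairs for a single prompt


def to_sft_examples(prompt: str, scored: Scored,
                    threshold: float) -> List[Tuple[str, str]]:
    """Rejection-sampling view: keep only the answers that clear the threshold."""
    return [(prompt, answer) for answer, score in scored if score >= threshold]


def to_preference_pairs(prompt: str, scored: Scored) -> List[Tuple[str, str, str]]:
    """DPO-style view: (prompt, chosen, rejected) triples; ties are skipped."""
    pairs = []
    for (a, score_a), (b, score_b) in combinations(scored, 2):
        if score_a == score_b:
            continue
        chosen, rejected = (a, b) if score_a > score_b else (b, a)
        pairs.append((prompt, chosen, rejected))
    return pairs


scored = [("A", 0.9), ("B", 0.4), ("C", 0.1)]
sft_examples = to_sft_examples("q", scored, threshold=0.5)
preference_pairs = to_preference_pairs("q", scored)
print(sft_examples)
print(preference_pairs)
```

In the first view the losers "B" and "C" simply vanish from the training set; in the second they survive as explicit negatives.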
The bridge to reasoning
Now change one thing about the recipe and watch the whole reasoning era fall out of it. In everything above, the “reward” was a learned reward model: soft, fuzzy, and (per the reward-hacking chapter) exploitable. But suppose the task is math or code, where you can check the answer. Now the filter isn’t a reward model at all; it’s a verifier that returns a hard, unhackable correct/incorrect.
With a verifier as the filter, “sample N, keep the ones that got the right answer, fine-tune, repeat” becomes a method for bootstrapping reasoning. The model generates many chains of thought, you keep only the ones that reached the verified answer, and you fine-tune on those correct reasoning traces. The model learns to reason better by training on its own successful reasoning. This is exactly the idea behind STaR (Self-Taught Reasoner, Zelikman et al., 2022), the subject of the next chapter: rejection-sampling fine-tuning with a verifier in the loop.
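Here is a minimal sketch of that verifier-as-filter step. The `Answer:` marker convention and every function name are assumptions chosen for illustration; real verifiers for math or code are usually more forgiving about formatting, or run unit tests instead of comparing strings.

```python
from typing import List, Tuple


def extract_final_answer(trace: str) -> str:
    """Return the text after the last 'Answer:' marker, or '' if there is none.

    The marker is a hypothetical output convention for this sketch.
    """
    marker = "Answer:"
    if marker not in trace:
        return ""
    return trace.rsplit(marker, 1)[1].strip()


def verify(trace: str, gold: str) -> bool:
    """Rule-based check: does the trace end in the known correct answer?"""
    return extract_final_answer(trace) == gold.strip()


def filter_correct_traces(problem: str, traces: List[str],
                          gold: str) -> List[Tuple[str, str]]:
    """Keep only the reasoning traces that reach the verified answer."""
    return [(problem, t) for t in traces if verify(t, gold)]


traces = [
    "2 + 2 is 4, and 4 times 3 is 12. Answer: 12",
    "2 + 2 * 3 is 2 + 6, which is 8. Answer: 8",
    "I think it is twelve.",
]
kept = filter_correct_traces("(2 + 2) * 3", traces, gold="12")
print(kept)
```

Note that the filter checks only the final answer: a trace with flawed reasoning that happens to land on "12" would pass just the same.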
And it doesn’t stop there. Keep that same verifier but, instead of fine-tuning on filtered samples once per round, optimize against it online with a policy-gradient method, and you’ve arrived at RL from verifiable rewards (RLVR), the engine behind DeepSeek-R1 and the modern reasoning models, which we reach in chapter 22. Rejection sampling is the gateway: the same generate-and-filter skeleton, with the reward swapped for a verifier and the fine-tune swapped for an RL update.
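One way to see the "fine-tune swapped for an RL update" step is as a change in the weight each sample's log-likelihood gradient receives. The sketch below compares the two weightings for one prompt's samples under a 0/1 verifier reward. The function names and the simple mean baseline are assumptions for illustration; practical RL updates add more machinery on top of this.

```python
from typing import List


def rejection_sampling_weights(rewards: List[float]) -> List[float]:
    """Filtered SFT: verified-correct samples get weight 1, all others are dropped."""
    return [1.0 if r > 0 else 0.0 for r in rewards]


def policy_gradient_weights(rewards: List[float]) -> List[float]:
    """Policy gradient with a mean baseline (an assumption of this sketch):
    each sample is weighted by how much better it did than the average."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


verifier_rewards = [1.0, 0.0, 0.0, 1.0]  # correct / incorrect for four samples
rs_weights = rejection_sampling_weights(verifier_rewards)
pg_weights = policy_gradient_weights(verifier_rewards)
print(rs_weights)
print(pg_weights)
```

The negative weights are the visible difference: the policy-gradient update pushes probability away from the failed samples, while rejection sampling only ignores them.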
That’s the thread for the rest of this explainer. We came into Section 5 trying to avoid an exploitable reward-model loop, and we leave it having found the simplest possible alternative — and discovered that, with the reward replaced by a verifier, that same simple idea is the on-ramp to reasoning. Section 6 takes the on-ramp.