Section 18

Rejection-sampling alignment

RAFT, RRHF, and best-of-N fine-tuning

Paper: RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment — Dong et al., 2023

Here is the simplest alignment idea in this entire explainer, and it might be the most underrated. Forget reward models in a loop, forget PPO, forget even the closed-form algebra of DPO. Just do this: ask the model to generate several answers, keep the best ones, and fine-tune on them. Repeat. That’s the whole method, and it’s hiding inside the post-training pipeline of nearly every frontier model.

Generate, filter, fine-tune

The recipe goes by a few names but the loop is always the same:

For each prompt, sample $N$ candidate answers from the current policy.
Score each candidate with some reward signal: a reward model, a human rating, or, as we’ll see, a verifier.
Keep the best (the top-scoring one, or all that clear a threshold) and fine-tune the model on those winners with a plain supervised cross-entropy loss.
Iterate: the improved model generates better candidates next round.
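The four steps above can be sketched in a few lines of Python. This is a toy under loud assumptions, not RAFT's actual implementation: the "model" is just a softmax over a handful of canned answers per prompt, `REWARD_TABLE` and `toy_reward` are hypothetical stand-ins for a real reward signal, and "fine-tuning" is a single cross-entropy gradient step on the logits of the best-of-N winner.

```python
import math
import random

# Hypothetical reward signal: a fixed score per (prompt, answer) pair.
REWARD_TABLE = {"greet": {"hi": 0.2, "hello there": 0.9, "go away": 0.0}}

def toy_reward(prompt, answer):
    return REWARD_TABLE[prompt][answer]

def softmax(logits):
    """Turn a dict of answer -> logit into a dict of answer -> probability."""
    m = max(logits.values())
    exps = {a: math.exp(v - m) for a, v in logits.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def sample_n(logits, n, rng):
    """Step 1: sample N candidate answers from the current policy."""
    probs = softmax(logits)
    answers = list(probs)
    return rng.choices(answers, weights=[probs[a] for a in answers], k=n)

def sft_step(logits, target, lr):
    """Step 3: one gradient step on the cross-entropy loss -log p(target)."""
    probs = softmax(logits)
    for a in logits:
        grad = probs[a] - (1.0 if a == target else 0.0)
        logits[a] -= lr * grad

def rejection_sampling_round(policy, reward, n, lr, rng):
    """Steps 1-3 for every prompt: sample, score, keep the best, fine-tune."""
    for prompt, logits in policy.items():
        candidates = sample_n(logits, n, rng)
        best = max(candidates, key=lambda a: reward(prompt, a))  # step 2
        sft_step(logits, best, lr)

def run(rounds=30, n=8, lr=0.5, seed=0):
    """Step 4: iterate, starting from a uniform policy (all logits zero)."""
    rng = random.Random(seed)
    policy = {p: {a: 0.0 for a in answers} for p, answers in REWARD_TABLE.items()}
    for _ in range(rounds):
        rejection_sampling_round(policy, toy_reward, n, lr, rng)
    return {p: softmax(logits) for p, logits in policy.items()}
```

Calling `run()` returns the final answer distribution per prompt. Note that the policy is never told the reward values directly: it only ever sees which answer won each round, and the probability mass drifts toward the high-reward answer anyway.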

This is rejection sampling — a classical statistical idea (propose samples, accept the good ones, reject the rest) repurposed for alignment. Keeping the single best of $N$ candidates is exactly best-of-N sampling; the twist here is that instead of just returning the best answer at inference time, you train on it, baking the best-of-N behavior back into the weights.

RAFT (Reward-rAnked FineTuning, Dong et al. 2023) is the clean formulation of this loop: sample a batch, rank by reward, fine-tune on the top-ranked, repeat. RRHF (Yuan et al. 2023) is a close cousin from the same year — it scores a set of candidates and uses a ranking loss to make the model prefer the higher-reward ones, blending the rejection-sampling idea with a contrast between candidates.
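To make the RRHF contrast concrete, here is a minimal sketch of a loss in that style, based on our reading of the RRHF paper: each candidate is scored by its length-normalized log-probability under the model, a hinge-style ranking term penalizes every pair where the lower-reward candidate has the higher model score, and a standard cross-entropy term is added for the best candidate. The function names and the plain-list inputs are assumptions made for illustration; consult the paper for the exact formulation.

```python
def length_normalized_logprob(token_logprobs):
    """Model score for a candidate: mean per-token log-probability."""
    return sum(token_logprobs) / len(token_logprobs)

def rrhf_style_loss(candidates):
    """candidates: list of (reward, token_logprobs) for one prompt.

    Returns ranking loss + SFT loss on the highest-reward candidate.
    """
    rewards = [r for r, _ in candidates]
    scores = [length_normalized_logprob(lp) for _, lp in candidates]

    # Ranking term: for every pair where candidate i has the lower reward,
    # pay a penalty if the model still scores i above j.
    rank_loss = 0.0
    for i in range(len(candidates)):
        for j in range(len(candidates)):
            if rewards[i] < rewards[j]:
                rank_loss += max(0.0, scores[i] - scores[j])

    # SFT term: negative log-likelihood of the best-rewarded candidate.
    best = max(range(len(candidates)), key=lambda k: rewards[k])
    sft_loss = -sum(candidates[best][1])

    return rank_loss + sft_loss
```

For example, with `[(0.9, [-0.1, -0.2]), (0.1, [-0.05, -0.05])]` the model currently prefers the low-reward candidate (mean log-probability -0.05 versus -0.15), so the ranking term contributes 0.1 on top of the 0.3 SFT term. Unlike the pure keep-the-winner loop, the losing candidate contributes to the gradient here.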

Why it’s so appealing

Rejection-sampling alignment is offline in the same sense DPO is — each round trains on a frozen batch of pre-scored samples — but it’s even more bare-bones, and the simplicity buys real advantages:

Dead simple. It’s just sampling plus standard supervised fine-tuning. Any team that can run generation and an SFT job can do it; there’s no new optimizer, no special loss to debug.
Stable. There’s no importance sampling, no clipped surrogate, no critic, no on-policy sampling loop to diverge. A run is far less likely to blow up the way PPO can; the usual failure mode is that it simply stops improving.
Trivially parallel and reusable. Generation is embarrassingly parallel, the filtered data is just text you can inspect, mix, dedup, and reuse, and you can throw a stronger reward model at it later without changing the training code.

The contrast with the methods around it is sharp. Against PPO, rejection sampling drops the entire RL apparatus (no clipping, no critic, no KL-penalized policy gradient) at the cost of being off-policy and only as good as your best-of-N samples. Against DPO, the difference is what signal it consumes: DPO needs pairwise preferences and learns from both the chosen and the rejected; rejection sampling needs only a score, and in its basic form learns from the positives only (or best-vs-rest), never explicitly pushing down the losers. That is less information per example, but a much easier signal to obtain.
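The difference in signal shows up in the shape of the training data. The sketch below takes the same scored candidates and builds either positives-only SFT examples or a DPO-style (prompt, chosen, rejected) triple. Both helper names and the best-versus-worst pairing rule are assumptions chosen for illustration, not a prescribed recipe.

```python
def to_sft_examples(prompt, scored, threshold):
    """Rejection-sampling view: keep only candidates that clear the threshold.

    scored: list of (answer, score). Losers are simply dropped.
    """
    return [(prompt, answer) for answer, score in scored if score >= threshold]

def to_preference_pairs(prompt, scored):
    """DPO-style view: pair the best candidate against the worst one.

    Returns a list with one (prompt, chosen, rejected) triple, or an empty
    list when all scores tie and no preference can be expressed.
    """
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    best, worst = ranked[0], ranked[-1]
    if best[1] == worst[1]:
        return []
    return [(prompt, best[0], worst[0])]
```

With `scored = [("a", 0.9), ("b", 0.2), ("c", 0.7)]` and a threshold of 0.5, the SFT view keeps `a` and `c` and never mentions `b`, while the pairwise view produces `("q", "a", "b")` and so carries an explicit signal about what to move away from.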

The bridge to reasoning

Now change one thing about the recipe and watch the whole reasoning era fall out of it. In everything above, the “reward” was typically a learned reward model: soft, fuzzy, and (per the reward-hacking chapter) exploitable. But suppose the task is math or code, where you can check the answer. Now the filter isn’t a reward model at all. It’s a verifier that returns a hard correct/incorrect, which is far harder to game than a learned score.

With a verifier as the filter, “sample N, keep the ones that got the right answer, fine-tune, repeat” becomes a method for bootstrapping reasoning. The model generates many chains of thought, you keep only the ones that reached the verified answer, and you fine-tune on those correct reasoning traces. The model learns to reason better by training on its own successful reasoning. This is the core idea behind STaR (Self-Taught Reasoner), the subject of the next chapter: rejection-sampling fine-tuning with a verifier in the loop.
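Swapping the reward model for a verifier changes only the filter step. The sketch below assumes, purely for illustration, that every reasoning trace ends with a line of the form `Answer: <integer>` and that each problem comes with a known gold answer; the function names and the dictionary layout are hypothetical.

```python
import re

def extract_final_answer(trace):
    """Pull the integer out of a trailing 'Answer: <integer>', or return None."""
    match = re.search(r"Answer:\s*(-?\d+)\s*$", trace)
    return int(match.group(1)) if match else None

def verify(trace, gold):
    """Hard correct/incorrect check: does the final answer equal the gold one?"""
    return extract_final_answer(trace) == gold

def filter_correct_traces(problems, sampled_traces):
    """Keep only the traces that reach the verified answer.

    problems: dict of problem id -> (question, gold answer)
    sampled_traces: dict of problem id -> list of sampled reasoning traces
    Returns a list of prompt/completion records ready for supervised fine-tuning.
    """
    dataset = []
    for pid, (question, gold) in problems.items():
        for trace in sampled_traces.get(pid, []):
            if verify(trace, gold):
                dataset.append({"prompt": question, "completion": trace})
    return dataset
```

One caveat is visible in the code: the verifier inspects only the final answer, so a trace with flawed reasoning that happens to land on the right number still passes the filter.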

And it doesn’t stop there. Keep that same verifier but, instead of fine-tuning on filtered samples once per round, optimize against it online with a policy-gradient method, and you’ve arrived at RL from verifiable rewards — the engine behind DeepSeek-R1 and the modern reasoning models, which we reach in chapter 22. Rejection sampling is the gateway: the same generate-and-filter skeleton, with the reward swapped for a verifier and the fine-tune swapped for an RL update.
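For a feel of what "swapping the fine-tune for an RL update" means, here is a toy policy-gradient loop against a verifier. It is a sketch under heavy assumptions and not the training recipe of any particular model: the policy is a softmax over a few candidate answers to a single problem, the verifier is an equality check against `correct_index`, and the update is plain REINFORCE with the mean reward of each sampled group as a baseline.

```python
import math
import random

def softmax(logits):
    """Convert a list of logits into a list of probabilities."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_with_verifier(correct_index, num_answers=4, steps=200,
                            samples_per_step=8, lr=0.5, seed=0):
    """Online policy-gradient training against a binary verifier reward."""
    rng = random.Random(seed)
    logits = [0.0] * num_answers  # start from a uniform policy
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a fresh group of answers from the current policy (on-policy).
        actions = rng.choices(range(num_answers), weights=probs, k=samples_per_step)
        # Verifier reward: 1 for the correct answer, 0 otherwise.
        rewards = [1.0 if a == correct_index else 0.0 for a in actions]
        baseline = sum(rewards) / len(rewards)
        # REINFORCE: accumulate advantage * d log pi(a) / d logit_k,
        # where d log pi(a) / d logit_k = 1[k == a] - probs[k].
        grad = [0.0] * num_answers
        for a, r in zip(actions, rewards):
            advantage = r - baseline
            for k in range(num_answers):
                grad[k] += advantage * ((1.0 if k == a else 0.0) - probs[k])
        # Gradient ascent on expected reward, averaged over the group.
        for k in range(num_answers):
            logits[k] += lr * grad[k] / samples_per_step
    return softmax(logits)
```

The skeleton is the same as the rejection-sampling loop (sample a group, score it, update), but two things differ: samples are drawn fresh from the current policy at every step, and incorrect samples now push their own probability down instead of being discarded.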

That’s the thread for the rest of this explainer. We came into Section 5 trying to avoid an exploitable reward-model loop, and we leave it having found the simplest possible alternative — and discovered that, with the reward replaced by a verifier, that same simple idea is the on-ramp to reasoning. Section 6 takes the on-ramp.