Direct Preference Optimization
Collapsing RLHF into a single loss
Paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafailov et al., 2023
By 2023, the RLHF recipe was the standard but also the headache. You train a reward model, then run a delicate PPO loop against it, juggling a policy, a reference model, a reward model, and a critic all at once: four networks, a sampling loop, and every instability we’ve spent the last chapters cataloguing. Then a paper arrived with a startling claim: you can throw all of that away. No reward model. No RL loop. Just one supervised loss on your preference pairs. This is Direct Preference Optimization (DPO), and it reorganized the field almost overnight.
The trick in one sentence
DPO’s insight is that the reward model and the policy are not really two separate things. The RLHF objective already implies a relationship between them — and once you write that relationship down, you can express the reward entirely in terms of the policy, substitute it into the preference loss, and the reward model simply vanishes. The policy, it turns out, is secretly a reward model. Let’s derive it, because the derivation is short and the payoff is the whole chapter.
Step 1: the optimal RLHF policy has a closed form
Recall the RLHF objective from the PPO chapter: maximize reward while staying close to the reference, with a KL penalty of strength $\beta$ holding you near $\pi_{\text{ref}}$:
$$\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D}}\Big[\, \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big) \Big]$$
This particular objective has a known closed-form solution. The policy that maximizes it is the reference distribution, reweighted by the exponentiated reward:
$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\Big(\frac{1}{\beta}\, r(x, y)\Big)$$
where $Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x)\, \exp\!\big(\tfrac{1}{\beta}\, r(x, y)\big)$ is a normalizing constant (the partition function) that makes it a valid distribution. Intuitively: start from the reference, then up-weight high-reward answers and down-weight low-reward ones, with $\beta$ controlling how aggressively (a smaller $\beta$ reweights more sharply). This is just the familiar fact that the KL-regularized reward objective is solved by a softmax-like Boltzmann distribution.
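To make the reweighting concrete, here is a minimal sketch on a single prompt with three possible answers. The numbers in `pi_ref` and `rewards`, the value of `beta`, and the helper names `optimal_policy` and `objective` are all illustrative assumptions, not anything from the paper. The sketch builds the reweighted policy and compares its objective value against a few other hand-picked distributions.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """Reweight the reference by exp(r / beta), then normalize by Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)  # partition function Z(x) for this single prompt
    return [w / z for w in weights], z

def objective(pi, pi_ref, rewards, beta):
    """Expected reward minus beta * KL(pi || pi_ref), for one prompt."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl

# Toy setup (assumed values): three candidate answers to one prompt.
pi_ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

pi_star, z = optimal_policy(pi_ref, rewards, beta)
best = objective(pi_star, pi_ref, rewards, beta)

# A few alternative policies to compare against the closed form.
candidates = [pi_ref, [1 / 3, 1 / 3, 1 / 3], [0.1, 0.2, 0.7]]
for pi in candidates:
    print(pi, objective(pi, pi_ref, rewards, beta), "<=", best)

# A smaller beta concentrates more mass on the highest-reward answer.
sharper, _ = optimal_policy(pi_ref, rewards, beta=0.1)
print(pi_star[2], sharper[2])
```

In this toy case the objective value at `pi_star` works out to `beta * log(z)`, which is the maximum value the objective can take. The sum over answers is only tractable here because there are three of them; for a language model the answer space is far too large to enumerate.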
Step 2: invert it to read off the reward
Here’s the move. That equation relates the optimal policy to the reward. So solve it for the reward. Take logs and rearrange:
$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$
Read that carefully: it is the heart of DPO. The reward of any answer is just $\beta$ times the log-ratio between the optimal policy and the reference, plus a term $\beta \log Z(x)$ that depends only on the prompt $x$, not on the answer $y$. The policy encodes the reward. If you know the optimal policy and the reference, you already know the reward up to a per-prompt constant. We call this quantity the implicit reward: $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$.
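The inversion can be checked numerically. The sketch below uses the same kind of toy setup (one prompt, three answers, assumed values for `pi_ref`, `rewards`, and `beta`): it builds the optimal policy from a known reward, then recovers the reward from the log-ratio alone and looks at what is left over.

```python
import math

# Toy setup (assumed values): one prompt, three candidate answers.
beta = 0.5
pi_ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]

# Forward direction: reward -> optimal policy.
weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]

# Inverse direction: policy -> implicit reward, using only the log-ratio.
implicit = [beta * math.log(p / q) for p, q in zip(pi_star, pi_ref)]

# The gap between true and implicit reward is the same for every answer.
offsets = [r - h for r, h in zip(rewards, implicit)]
print(offsets)             # three equal values
print(beta * math.log(z))  # the per-prompt constant beta * log Z(x)

# Differences between answers are therefore preserved exactly.
print(rewards[2] - rewards[0], implicit[2] - implicit[0])
```

The implicit reward is off from the true reward by the same amount for every answer, so any quantity that depends only on reward differences between answers to the same prompt can be computed from the policy and the reference alone.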
Step 3: substitute into the preference likelihood
Now bring in the Bradley–Terry model from the reward-models chapter. It says the probability that a “winning” answer $y_w$ beats a “losing” one $y_l$ is a logistic function of their reward difference:
$$p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$
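Substitute our expression for $r(x, y)$ from Step 2. The magic: the troublesome $\beta \log Z(x)$ term depends only on $x$, so it is identical for $y_w$ and $y_l$ and cancels in the difference. The intractable partition function, the thing that normally forces you into RL, disappears completely. What’s left is written purely in terms of the policy and the reference:
$$p(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)$$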
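The cancellation is easy to confirm on a toy example. In this sketch the function names and all numeric values are assumptions chosen for illustration: one function computes the Bradley–Terry probability from explicit rewards, the other computes it from policy and reference log-probabilities only, without ever touching the partition function.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def pref_prob_from_rewards(r_w, r_l):
    """Bradley-Terry: P(winner beats loser) from explicit rewards."""
    return sigmoid(r_w - r_l)

def pref_prob_from_policies(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Same probability, written with log-ratios; no Z(x) is needed."""
    return sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))

# Toy setup (assumed values): one prompt, three candidate answers.
beta = 0.5
pi_ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
pi_star = [w / sum(weights) for w in weights]

w, l = 2, 0  # treat answer 2 as the winner and answer 0 as the loser
p_rewards = pref_prob_from_rewards(rewards[w], rewards[l])
p_policies = pref_prob_from_policies(
    math.log(pi_star[w]), math.log(pi_star[l]),
    math.log(pi_ref[w]), math.log(pi_ref[l]),
    beta,
)
print(p_rewards, p_policies)  # the two routes agree
```

Both routes give the same preference probability, even though the second one never evaluates the reward or the normalizer directly.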
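Maximize the likelihood of the observed preferences under a trainable policy $\pi_\theta$ standing in for $\pi^*$ (equivalently, minimize its negative log) and you have the DPO loss:
$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$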
That’s it. A plain supervised loss over preference triples $(x, y_w, y_l)$: no reward model, no sampling, no RL loop, no critic. You compute four log-probabilities (the chosen and rejected answers under both $\pi_\theta$ and the frozen $\pi_{\text{ref}}$), form the difference, push it through $-\log \sigma$, and backpropagate.
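As a rough sketch of the arithmetic, the loss can be written in a few lines of plain Python. The function names, argument names, and the default `beta=0.1` are assumptions made for illustration. Each input is assumed to be a list of sequence-level log-probabilities (the sum of token log-probabilities for one answer), one entry per preference pair. A real trainer would do the same computation on tensors in an autodiff framework so that gradients flow into the policy.

```python
import math

def log_sigmoid(t):
    """Stable log(sigmoid(t)), computed as -softplus(-t)."""
    return -(max(-t, 0.0) + math.log1p(math.exp(-abs(t))))

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Mean DPO loss over a batch of preference pairs.

    Each argument holds one sequence log-probability per pair.
    """
    losses = []
    for pc, pr, rc, rr in zip(policy_chosen_logps, policy_rejected_logps,
                              ref_chosen_logps, ref_rejected_logps):
        chosen_reward = beta * (pc - rc)    # implicit reward of chosen answer
        rejected_reward = beta * (pr - rr)  # implicit reward of rejected answer
        losses.append(-log_sigmoid(chosen_reward - rejected_reward))
    return sum(losses) / len(losses)

# Illustrative log-probabilities for two preference pairs (assumed values).
ref_chosen, ref_rejected = [-12.0, -8.0], [-11.0, -9.0]

# Policy identical to the reference: every margin is 0, so loss = log 2.
at_init = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)

# Policy that has raised the chosen answers and lowered the rejected ones.
improved = dpo_loss([-10.0, -7.0], [-13.0, -10.0], ref_chosen, ref_rejected)

print(at_init, improved)  # improved is below log 2 (about 0.6931)
```

Training typically starts with the policy initialized from the reference, so the loss begins at $\log 2$ and falls as the chosen log-ratio grows relative to the rejected one.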
Try it
Toggle between the two-stage RLHF pipeline (reward model + PPO loop) and the single DPO loss on the same preference pair. Watch the implicit reward move the chosen and rejected log-probabilities apart, and see that it lands in the same place the RM-plus-PPO route was trying to reach — just without ever building the reward model.
What DPO trades away
DPO is simpler, cheaper, and far more stable, which is why it became the default for open post-training and shows up in Llama 3, Tülu, Zephyr, and many other open models. But the simplification is not free, and the trade is worth naming precisely.
DPO is offline. It learns only from the fixed set of preference pairs you collected up front; it never generates new samples and never gets fresh feedback on them. PPO, by contrast, is on-policy: at every step it samples from the current policy and scores those samples, so it can discover and reinforce good behaviors that weren’t in any human-written dataset. DPO cannot explore beyond its data. If the best answer to a prompt was never in the preference set, DPO has no way to find it.
This makes DPO sensitive to distribution shift. As training pushes $\pi_\theta$ away from the distribution that generated the preference pairs, those pairs describe a region the policy has already left, and the implicit-reward signal gets stale. PPO re-samples and stays current; DPO is stuck with the snapshot it was handed. In practice this shows up as DPO over-fitting to quirks of the dataset, and it’s a recurring motivation both for the variants in the next chapter and for iterating DPO on freshly generated, freshly labeled pairs.
The next chapter tours the DPO zoo — IPO, KTO, ORPO, SimPO — each one targeting a specific weakness we just identified: the deterministic-preference overfitting, the need for paired data, the separate SFT stage, and the lingering length bias from the previous chapter.