The DPO zoo
IPO, KTO, ORPO, and SimPO
DPO was so clean that within a year the literature filled with descendants — each one a small surgical edit to the loss, each fixing one specific weakness we flagged at the end of the last chapter. Together they form what people half-jokingly call the DPO zoo. You don’t need to memorize the menagerie, but you should know the four that matter and exactly which problem each one solves.
IPO: stop trusting deterministic preferences
The first crack in DPO is subtle. Its loss pushes the implicit-reward gap between chosen and rejected ever wider, and when a preference pair is labeled with certainty (the chosen answer always wins, never the reverse), there’s nothing stopping that gap from running off to infinity. The model can drive the rejected answer’s probability toward zero, and the KL regularization, in this limit, fails to hold it back. The result is over-fitting to the exact preference labels you happened to collect.
IPO (Identity Preference Optimization, Azar et al. 2023) fixes this by replacing the logistic loss with a squared loss that targets a finite margin rather than an ever-growing one. Instead of “make the gap as large as possible,” IPO says “make the implicit-reward gap equal to a fixed, finite target set by the regularization strength, and no larger.” That bounded target keeps the regularization meaningful even when preferences are deterministic, so IPO is harder to over-fit and degrades more gracefully on small or noisy preference sets.
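To make the contrast concrete, here is a minimal sketch of the two per-pair losses as functions of the gap. It assumes the gap is the chosen-minus-rejected difference of log-ratios against the reference model, and that IPO's target is 1/(2τ), as in Azar et al.'s formulation. The function names and the values of `beta` and `tau` are illustrative, not taken from any library.

```python
import math

def dpo_loss(gap: float, beta: float = 0.1) -> float:
    """Logistic DPO loss for one pair.

    gap: chosen-minus-rejected difference of log(pi_theta / pi_ref).
    """
    # -log(sigmoid(beta * gap)) written as log(1 + exp(-beta * gap))
    return math.log1p(math.exp(-beta * gap))

def ipo_loss(gap: float, tau: float = 0.1) -> float:
    """Squared IPO loss for one pair: distance from the finite target 1/(2*tau)."""
    target = 1.0 / (2.0 * tau)
    return (gap - target) ** 2

# Sweep the gap: the DPO loss keeps shrinking as the gap grows,
# while the IPO loss is smallest at the target and rises past it.
for gap in (0.0, 5.0, 10.0, 50.0):
    print(f"gap={gap:5.1f}  dpo={dpo_loss(gap):.4f}  ipo={ipo_loss(gap):.2f}")
```

With these illustrative settings the IPO target sits at a gap of 5, so any push beyond that is penalized rather than rewarded.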
KTO: drop the pairs entirely
DPO needs paired data: for every prompt, a chosen and a rejected answer, judged against each other. But a lot of real feedback isn’t paired — it’s a thumbs-up or thumbs-down on a single response, with no matched counterpart. Collecting clean pairs is expensive; collecting binary labels is cheap and abundant.
KTO (Kahneman–Tversky Optimization, Ethayarajh et al. 2024) throws out the pairing requirement. It works on unpaired binary good/bad labels, and it borrows its loss shape from prospect theory, Kahneman and Tversky’s model of how humans weigh gains and losses asymmetrically (losses loom larger). Each example is scored relative to a reference point, with desirable and undesirable outputs handled by separate, asymmetric terms. The practical win is data availability: KTO lets you align on the messy, plentiful thumbs-up/thumbs-down signal that real products actually generate, rather than the curated comparison data DPO demands.
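A minimal sketch of a KTO-style per-example loss follows, assuming the value-function form from Ethayarajh et al.: a sigmoid of the log-ratio measured against a reference point, with separate weights for desirable and undesirable examples. In the paper the reference point is a KL estimate computed over a batch; here `z_ref` is a plain constant for illustration, and all names and default values are hypothetical.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(log_ratio: float, desirable: bool, z_ref: float = 0.0,
             beta: float = 0.1, lambda_d: float = 1.0,
             lambda_u: float = 1.0) -> float:
    """Per-example KTO-style loss on a single, unpaired response.

    log_ratio: log pi_theta(y|x) - log pi_ref(y|x) for that response.
    z_ref: reference point (assumed constant here; a KL estimate in the paper).
    """
    if desirable:
        # Value grows as the response becomes more likely than the reference point.
        value = lambda_d * sigmoid(beta * (log_ratio - z_ref))
        return lambda_d - value
    # Value grows as the response becomes less likely than the reference point.
    value = lambda_u * sigmoid(beta * (z_ref - log_ratio))
    return lambda_u - value

# One thumbs-up and one thumbs-down example, with no pairing between them.
print(kto_loss(log_ratio=2.0, desirable=True))
print(kto_loss(log_ratio=2.0, desirable=False, lambda_u=1.5))
```

Setting `lambda_u` above `lambda_d` is one way to express the "losses loom larger" asymmetry; which weights work in practice depends on the ratio of good to bad examples in your data.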
ORPO: fold SFT and preference into one stage
Both DPO and KTO still assume you’ve already run supervised fine-tuning (SFT), and both still need a frozen reference model sitting in memory for every forward pass. That’s two training stages and two copies of the model.
ORPO (Odds-Ratio Preference Optimization, Hong et al. 2024) collapses both. It adds an odds-ratio penalty term directly onto the ordinary SFT loss: alongside the standard next-token objective on the chosen answer, it applies a term that increases the odds of the chosen response relative to the rejected one. Because the odds ratio is a self-contained contrast between chosen and rejected, ORPO needs no reference model at all. The payoff is a single-stage, reference-free recipe: one pass over your data does instruction-following and preference alignment together, with one model held in memory instead of DPO’s two.
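The shape of the ORPO objective can be sketched as below. This assumes, following Hong et al., that a sequence's likelihood is taken as the exponential of its average per-token log-probability and that the odds-ratio term is a log-sigmoid of the log odds ratio. The function names and the weight `lam` are illustrative.

```python
import math

def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    # log(p) - log(1 - p), with log(1 - p) computed via log1p for stability
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float,
              lam: float = 0.1) -> float:
    """SFT loss on the chosen answer plus a weighted odds-ratio penalty."""
    # Ordinary next-token negative log-likelihood on the chosen answer.
    sft = -avg_logp_chosen
    # How much more likely (in odds) the chosen answer is than the rejected one.
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    penalty = math.log1p(math.exp(-log_odds_ratio))
    return sft + lam * penalty

# No reference-model quantities appear anywhere in the computation.
print(orpo_loss(avg_logp_chosen=-0.5, avg_logp_rejected=-2.0))
```

Both terms are computed from the policy's own log-probabilities, which is why a single model and a single training stage suffice.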
SimPO: kill the length bias, kill the reference model
The last lingering problem is one we met two chapters ago: length bias. DPO’s implicit reward is built from sums of per-token log-probabilities, for the policy and the reference alike, so longer sequences accumulate larger magnitudes. The loss has a built-in thumb on the scale for length, exactly the hack we want to avoid.
SimPO (Simple Preference Optimization, Meng et al. 2024) makes two changes. First, it length-normalizes the implicit reward by dividing by the number of tokens, so the reward is an average log-probability rather than a sum, neutralizing the length advantage. Second, like ORPO it drops the reference model entirely, and it adds an explicit target margin the chosen answer must clear. The result is a strikingly simple, memory-light loss that, on many benchmarks, matches or beats DPO and produces noticeably less length-inflated output.
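Both changes fit in a few lines. The sketch below assumes the loss form from Meng et al.: a logistic loss on the difference of length-averaged log-probabilities, scaled by `beta`, minus a target margin `gamma`. The default values are illustrative rather than recommended settings.

```python
import math

def simpo_loss(token_logps_chosen, token_logps_rejected,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO-style loss for one pair, from per-token log-probs of the policy only."""
    # Length-normalized rewards: average, not sum, of per-token log-probs.
    avg_chosen = sum(token_logps_chosen) / len(token_logps_chosen)
    avg_rejected = sum(token_logps_rejected) / len(token_logps_rejected)
    # The chosen answer must beat the rejected one by the target margin gamma.
    margin = beta * (avg_chosen - avg_rejected) - gamma
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

chosen = [-0.2] * 20
short_rejected = [-0.5] * 10
long_rejected = [-0.5] * 100  # same per-token quality, ten times longer

# Summed log-probs differ by a factor of ten; the averaged reward does not.
print(sum(short_rejected), sum(long_rejected))
print(simpo_loss(chosen, short_rejected), simpo_loss(chosen, long_rejected))
```

Because only averages enter the loss, padding a response with more tokens of the same per-token quality leaves the loss unchanged.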
The zoo at a glance
| Method | Reference-free? | Paired data? | Key idea |
|---|---|---|---|
| DPO | No | Yes | Implicit reward = $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$; logistic preference loss |
| IPO | No | Yes | Squared loss to a bounded margin; resists deterministic-preference over-fitting |
| KTO | No | No (binary) | Prospect-theory loss on unpaired good/bad labels |
| ORPO | Yes | Yes | Odds-ratio term folds SFT + preference into one stage |
| SimPO | Yes | Yes | Length-normalized, reference-free reward + target margin |
Try it
Pick a variant and watch how its loss responds as you vary the preference margin and the response length. Notice how IPO’s squared loss bottoms out at a finite target instead of pushing forever, how SimPO’s length-normalized curve refuses to reward sheer length, and which methods need that frozen reference curve at all.
Where this leaves us
The offline-preference family — DPO and its zoo — gives you alignment without a reward model and without an RL loop: cheap, stable, and reference-free in its most modern forms. What it still requires is preference data: someone, human or model, deciding which of two answers is better. The next chapter steps to an even simpler RL-free idea that needs only a score, not a comparison — rejection-sampling alignment — and it turns out to be the bridge straight into the reasoning era.