Section 17

The DPO zoo

IPO, KTO, ORPO, and SimPO

DPO was so clean that within a year the literature filled with descendants — each one a small surgical edit to the loss, each fixing one specific weakness we flagged at the end of the last chapter. Together they form what people half-jokingly call the DPO zoo. You don’t need to memorize the menagerie, but you should know the four that matter and exactly which problem each one solves.

IPO: stop trusting deterministic preferences

The first crack in DPO is subtle. Its loss pushes the implicit-reward gap between chosen and rejected ever wider, and when a preference pair is labeled with certainty (always $y_w \succ y_l$, never the reverse), nothing stops that gap from running off to infinity. The model can drive $\pi_\theta(y_l)$ toward zero, and the KL regularization, in this limit, fails to hold it back. The result is over-fitting to the exact preference labels you happened to collect.

IPO (Identity Preference Optimization, Azar et al. 2023) fixes this by replacing the logistic loss with a squared loss that targets a finite margin rather than an ever-growing one. Instead of “make the gap as large as possible,” IPO says “make the gap between the chosen and rejected log-likelihood ratios (policy over reference) equal to $\tfrac{1}{2\beta}$, and no larger.” That bounded target keeps the regularization meaningful even when preferences are deterministic, so IPO is harder to over-fit and degrades more gracefully on small or noisy preference sets.
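
To make the contrast concrete, here is a minimal sketch of the two per-pair losses as functions of a single number, `gap`, taken to be the chosen-minus-rejected difference of log-likelihood ratios $\log\frac{\pi_\theta(y)}{\pi_{\text{ref}}(y)}$. The function names and the value of `beta` are illustrative assumptions, not anything from a particular library.

```python
import math


def neg_log_sigmoid(z: float) -> float:
    """Numerically stable -log(sigmoid(z)) = log(1 + exp(-z))."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_pair_loss(gap: float, beta: float) -> float:
    """Logistic loss on the implicit-reward gap, beta * gap."""
    return neg_log_sigmoid(beta * gap)


def ipo_pair_loss(gap: float, beta: float) -> float:
    """Squared distance between the log-ratio gap and the target 1 / (2 * beta)."""
    return (gap - 1.0 / (2.0 * beta)) ** 2


beta = 0.1  # illustrative value only
for gap in (0.0, 5.0, 20.0, 100.0):
    print(f"gap={gap:6.1f}  dpo={dpo_pair_loss(gap, beta):.4f}  "
          f"ipo={ipo_pair_loss(gap, beta):.4f}")
```

The logistic loss keeps shrinking as `gap` grows, so gradient descent always has a reason to widen it further. The squared loss is zero at `gap = 1 / (2 * beta)` and rises again on either side, which is what stops the runaway.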

KTO: drop the pairs entirely

DPO needs paired data: for every prompt, a chosen and a rejected answer, judged against each other. But a lot of real feedback isn’t paired — it’s a thumbs-up or thumbs-down on a single response, with no matched counterpart. Collecting clean pairs is expensive; collecting binary labels is cheap and abundant.

KTO (Kahneman–Tversky Optimization, Ethayarajh et al. 2024) throws out the pairing requirement. It works on unpaired binary good/bad labels, and it borrows its loss shape from prospect theory — Kahneman and Tversky’s model of how humans weigh gains and losses asymmetrically (losses loom larger). Each example is scored relative to a reference point, with desirable and undesirable outputs handled by separate, asymmetric terms. The practical win is enormous: KTO lets you align on the messy, plentiful thumbs-up/thumbs-down signal that real products actually generate, rather than the curated comparison data DPO demands.
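
A minimal sketch of the per-example loss shape, following the form given in the KTO paper as best I can reproduce it: a sigmoid value function around a reference point, weighted differently for desirable and undesirable examples. The function name and default hyperparameters are hypothetical, and the reference point `z0` is simply passed in as a number here; the paper estimates it from a KL term between policy and reference, which this sketch does not attempt.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def kto_example_loss(log_ratio: float, desirable: bool, z0: float,
                     beta: float = 0.1,
                     lambda_d: float = 1.0, lambda_u: float = 1.0) -> float:
    """Loss for ONE unpaired example.

    log_ratio: log pi_theta(y|x) - log pi_ref(y|x) for this response.
    desirable: True for a thumbs-up label, False for a thumbs-down.
    z0:        reference point (assumed given; see lead-in).
    lambda_d, lambda_u: weights on desirable / undesirable examples.
    """
    if desirable:
        # Loss falls as the response becomes more likely than the reference point.
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - z0)))
    # Loss falls as the response becomes less likely than the reference point.
    return lambda_u * (1.0 - sigmoid(beta * (z0 - log_ratio)))


print(kto_example_loss(2.0, desirable=True, z0=0.0))
print(kto_example_loss(2.0, desirable=False, z0=0.0))
```

Note that each call sees a single response and a single binary label; no matched counterpart is needed. Setting `lambda_u` above `lambda_d` is one way to make bad examples weigh more heavily than good ones.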

ORPO: fold SFT and preference into one stage

Both DPO and KTO still assume you’ve already run supervised fine-tuning, and both still need a frozen reference model sitting in memory, with its own forward pass on every batch. That’s two training stages and two copies of the model.

ORPO (Odds-Ratio Preference Optimization, Hong et al. 2024) collapses both. It adds an odds-ratio penalty term directly onto the ordinary SFT loss: alongside the standard next-token objective on the chosen answer, a term that increases the odds of the chosen response relative to the rejected one. Because the odds ratio is a self-contained contrast between chosen and rejected under the policy itself, ORPO needs no reference model at all. The payoff is a single-stage, reference-free recipe: one pass over your data does instruction-following and preference alignment together, without the second copy of the model weights that a DPO setup keeps in memory.
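
A minimal sketch of how the two terms combine for one preference pair. It assumes the sequence likelihood $p$ is the exponentiated average token log-probability, so that the odds $p/(1-p)$ are well defined; the function names, the weight `lam`, and the token log-probabilities are all made up for illustration.

```python
import math


def neg_log_sigmoid(z: float) -> float:
    """Numerically stable -log(sigmoid(z))."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_pair_loss(chosen_token_logps: list[float],
                   rejected_token_logps: list[float],
                   lam: float = 0.1) -> float:
    """SFT loss on the chosen answer plus a weighted odds-ratio penalty.

    Only policy log-probabilities appear: there is no reference model.
    """
    avg_w = sum(chosen_token_logps) / len(chosen_token_logps)
    avg_l = sum(rejected_token_logps) / len(rejected_token_logps)
    sft_term = -avg_w  # ordinary next-token negative log-likelihood
    log_odds_ratio = log_odds(avg_w) - log_odds(avg_l)
    preference_term = neg_log_sigmoid(log_odds_ratio)
    return sft_term + lam * preference_term


chosen = [-0.5, -0.4]          # hypothetical per-token log-probs
rejected = [-2.0, -1.5, -1.8]  # hypothetical per-token log-probs
print(orpo_pair_loss(chosen, rejected))
```

With `lam=0` this reduces to plain SFT on the chosen answer, which is the sense in which the preference signal is folded into the SFT stage rather than run after it.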

SimPO: kill the length bias, kill the reference model

The last lingering problem is one we met two chapters ago: length bias. DPO’s implicit reward is a sum of per-token log-probability ratios, so longer sequences accumulate larger magnitudes. The loss has a built-in thumb on the scale for length, exactly the hack we want to avoid.
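
Written out under the usual autoregressive factorization (the token index $t$ and the length $|y|$ are notation introduced here for illustration), the implicit reward is:

$$
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} = \beta \sum_{t=1}^{|y|} \Big[ \log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{\text{ref}}(y_t \mid x, y_{<t}) \Big]
$$

The number of terms grows with $|y|$ and nothing divides it back out, so the attainable size of the reward scales with response length.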

SimPO (Simple Preference Optimization, Meng et al. 2024) makes two changes. First, it length-normalizes the implicit reward: dividing by the number of tokens makes the reward an average log-probability rather than a sum, neutralizing the length advantage. Second, like ORPO it drops the reference model entirely (reference-free), and adds an explicit target margin $\gamma$ by which the chosen answer’s reward must exceed the rejected answer’s. The result is a strikingly simple, memory-light loss that, on the benchmarks its authors report, matches or beats DPO and produces noticeably less length-inflated output.
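
A minimal sketch of both changes for one preference pair, assuming a logistic loss on the margin-shifted difference of length-normalized rewards. The function names, the default `beta` and `gamma`, and the token log-probabilities are illustrative assumptions.

```python
import math


def neg_log_sigmoid(z: float) -> float:
    """Numerically stable -log(sigmoid(z))."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def length_normalized_reward(token_logps: list[float], beta: float) -> float:
    """beta times the AVERAGE per-token log-probability (no reference model)."""
    return beta * sum(token_logps) / len(token_logps)


def simpo_pair_loss(chosen_token_logps: list[float],
                    rejected_token_logps: list[float],
                    beta: float = 2.0, gamma: float = 1.0) -> float:
    """Logistic loss on (reward_chosen - reward_rejected - gamma)."""
    r_w = length_normalized_reward(chosen_token_logps, beta)
    r_l = length_normalized_reward(rejected_token_logps, beta)
    return neg_log_sigmoid(r_w - r_l - gamma)


short = [-0.5] * 4   # hypothetical: 4 tokens at log-prob -0.5 each
long = [-0.5] * 40   # same per-token quality, ten times longer
rejected = [-1.0] * 4

# Same per-token log-probs, different lengths: the reward does not change.
print(length_normalized_reward(short, 2.0), length_normalized_reward(long, 2.0))
print(simpo_pair_loss(short, rejected), simpo_pair_loss(long, rejected))
```

Because the reward is an average, padding a response with more tokens of the same quality leaves it unchanged, whereas a summed reward would be ten times larger in magnitude for `long` than for `short`.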

The zoo at a glance

Method	Reference-free?	Paired data?	Key idea
DPO	No	Yes	Implicit reward = $\beta\log\frac{\pi_\theta}{\pi_{\text{ref}}}$; logistic preference loss
IPO	No	Yes	Squared loss to a bounded margin; resists deterministic-preference over-fitting
KTO	No	No (binary)	Prospect-theory loss on unpaired good/bad labels
ORPO	Yes	Yes	Odds-ratio term folds SFT + preference into one stage
SimPO	Yes	Yes	Length-normalized, reference-free reward + target margin $\gamma$

Try it

Pick a variant and watch how its loss responds as you vary the preference margin and the response length. Notice how IPO’s squared loss bottoms out at a finite target instead of pushing forever, how SimPO’s length-normalized curve refuses to reward sheer length, and which methods need that frozen reference curve at all.

[Interactive figure: The DPO variant zoo. Each variant tweaks one ingredient of direct preference optimization; selecting one shows its loss shape and trade-offs. Default panel (DPO): logistic loss on the implicit reward margin, −log σ(margin), the original direct-preference objective; reference-free: no; needs paired data: yes. Axes: loss vs. reward margin (qualitative).]

The DPO "zoo": each variant changes one ingredient — the loss shape, the reference model, or paired-vs-unpaired data — to fix a specific weakness. IPO swaps the logistic for a squared loss to a target margin; KTO drops the need for pairs; ORPO and SimPO remove the reference model. The curves above are qualitative illustrations of each loss's character, not the exact published formulas.

The two themes the whole zoo is chasing

Step back and almost every variant is pulling on one of two levers. Drop the reference model (ORPO, SimPO): it removes the second copy of the weights from memory and one moving part from the pipeline, and it turns out you often don’t need it. Fix the length bias (SimPO’s normalization, and length-controlled evaluation everywhere): the summed log-probabilities quietly reward verbosity, the exact reward hack from chapter 15. IPO and KTO add a third: be robust to the shape and supply of your data, with deterministic labels for IPO and unpaired labels for KTO. The zoo isn’t four rival religions; it’s four edits to one loss, each targeting a named flaw.

Where this leaves us

The offline-preference family — DPO and its zoo — gives you alignment without a reward model and without an RL loop: cheap, stable, and reference-free in its most modern forms. What it still requires is preference data: someone, human or model, deciding which of two answers is better. The next chapter steps to an even simpler RL-free idea that needs only a score, not a comparison — rejection-sampling alignment — and it turns out to be the bridge straight into the reasoning era.