Section 25

Scaling open post-training

Tülu 3, Llama 3, Qwen, and Kimi

Papers: Tülu 3: Pushing Frontiers in Open Language Model Post-Training — Lambert et al., 2024 · The Llama 3 Herd of Models — Grattafiori et al., 2024 · Qwen2.5 / Qwen3 — Qwen Team, 2024–2025 · Kimi k1.5 — Moonshot AI, 2025

For most of this explainer we’ve studied techniques in isolation — a reward model here, a DPO loss there, GRPO over there. But a frontier assistant isn’t one technique; it’s a pipeline of them, stacked in a specific order, fed by carefully curated data, and run for several rounds. This chapter is about how the open ecosystem went from publishing individual tricks to publishing entire, reproducible recipes — and how those recipes quietly converged on the same shape.

Tülu 3: the open recipe, fully spelled out

The clearest place to start is Tülu 3 (Allen AI, Lambert et al., 2024), because it’s the rare report that releases everything: the data, the code, the evaluation suite, and the exact pipeline. And that pipeline is the canonical modern stack in three stages:

1. SFT on a large, deliberately mixed instruction dataset, balanced across chat, math, code, safety, and precise instruction-following.
2. DPO on preference data: the cheap, stable offline preference optimization we covered earlier.
3. RLVR (Reinforcement Learning with Verifiable Rewards) as the final polish, where a verifier (a math answer-checker, a unit-test harness) replaces the learned reward model entirely on tasks where correctness is checkable.

Tülu 3’s contribution wasn’t a new algorithm; it was naming and operationalizing RLVR as a distinct stage, plus showing that meticulous data curation and a strong evaluation harness matter as much as the loss function. It became the open post-training reference manual — the thing people read to learn the order and the data, not just the math.
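To make the RLVR idea concrete, here is a minimal sketch of a verifiable reward for math-style prompts. The function names, the "last number in the response" extraction rule, and the `bonus` value are all illustrative assumptions, not Tülu 3's actual verifier.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the last number that appears in the response, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response)
    return matches[-1] if matches else None


def verifiable_reward(response: str, gold_answer: str, bonus: float = 1.0) -> float:
    """Binary reward: `bonus` if the extracted answer matches the gold answer, else 0.

    No learned reward model is involved; the check is a plain program.
    The extraction heuristic and the bonus value are assumptions for illustration.
    """
    predicted = extract_final_answer(response)
    if predicted is None:
        return 0.0
    try:
        return bonus if float(predicted) == float(gold_answer) else 0.0
    except ValueError:
        # The gold answer was not numeric, so this toy checker cannot verify it.
        return 0.0
```

The same shape works for code tasks if the body of `verifiable_reward` runs a unit-test harness instead of comparing numbers. What matters is that the reward comes from a check that can be executed, not from a model's judgment.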

Llama 3: offline preference at production scale

Llama 3’s post-training (Grattafiori et al., 2024) is the same SFT-then-preference idea, but pushed to industrial scale and run iteratively. Each round looks like: take the current best model, generate many candidate responses, score them, keep the best via rejection sampling, fine-tune on those (SFT), then run DPO on the resulting preference pairs, and repeat for several rounds with fresh data each time. Notably, Llama 3 leaned on DPO rather than PPO for its preference stage at scale, betting on the stability and simplicity of offline methods over the moving target of an online RL loop. It’s the production-grade proof that you can get very far with rejection sampling and DPO, no critic in sight.
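One round of that loop can be sketched as follows. This is a toy under stated assumptions: `generate` and `score` are hypothetical callables standing in for the current model and a scorer, and pairing the best candidate against the worst is an illustrative choice, not a description of how Llama 3 built its preference data.

```python
from typing import Callable, Dict, List, Tuple


def rejection_sampling_round(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # hypothetical: current policy
    score: Callable[[str, str], float],         # hypothetical: reward model or checker
    num_candidates: int = 4,
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Build SFT examples and preference pairs for one round."""
    sft_examples: List[Dict[str, str]] = []
    preference_pairs: List[Dict[str, str]] = []
    for prompt in prompts:
        candidates = generate(prompt, num_candidates)
        if not candidates:
            continue
        ranked = sorted(candidates, key=lambda c: score(prompt, c), reverse=True)
        best, worst = ranked[0], ranked[-1]
        # Rejection sampling: only the top-scoring candidate becomes SFT data.
        sft_examples.append({"prompt": prompt, "completion": best})
        # Assumed pairing rule: best vs. worst feeds the DPO stage.
        if best != worst:
            preference_pairs.append(
                {"prompt": prompt, "chosen": best, "rejected": worst}
            )
    return sft_examples, preference_pairs


# Toy stand-ins so the sketch runs end to end.
def toy_generate(prompt: str, n: int) -> List[str]:
    return [f"{prompt} -> draft {i}" for i in range(n)]


def toy_score(prompt: str, response: str) -> float:
    return float(response[-1])  # later drafts score higher


sft_data, dpo_pairs = rejection_sampling_round(["q1", "q2"], toy_generate, toy_score)
```

In a real pipeline the two returned lists would feed an SFT step and a DPO step, and the updated model would become `generate` for the next round.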

Qwen2.5 and Qwen3: multi-stage SFT + RL, and two minds in one model

The Qwen series (Qwen Team, 2024–2025) shows the recipe with more RL ambition. Qwen2.5 runs large-scale SFT followed by multi-stage RL combining offline (DPO-style) and online (GRPO-style) optimization. Qwen3 adds the headline trick of the reasoning era: a single model with thinking and non-thinking modes. The same weights can either emit a long internal chain-of-thought before answering or respond directly, with the user (or a budget) choosing which — letting one deployment trade latency for accuracy on demand. Getting both behaviors into one model without them interfering is a post-training data-and-RL balancing act, and Qwen3’s recipe is the leading open demonstration of it.
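The two modes are easiest to picture as two output shapes from the same model. The toy below only illustrates that interface: the delimiter strings, the function name, and the use of "number of reasoning steps" as a stand-in for a token budget are assumptions for illustration, not Qwen3's implementation.

```python
from typing import List, Optional

# Assumed delimiters, used here only to mark where the reasoning sits.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"


def render_response(
    reasoning_steps: List[str],
    answer: str,
    thinking: bool = True,
    budget: Optional[int] = None,
) -> str:
    """Format one response in thinking or non-thinking mode.

    `budget` caps how many reasoning steps are kept before the answer,
    a simplified stand-in for a thinking-token budget.
    """
    if not thinking:
        # Non-thinking mode: answer directly, no visible reasoning.
        return answer
    kept = reasoning_steps if budget is None else reasoning_steps[:budget]
    return f"{THINK_OPEN}{' '.join(kept)}{THINK_CLOSE}{answer}"


steps = ["Let x be the unknown.", "Then 2x = 10.", "So x = 5."]
fast = render_response(steps, "x = 5", thinking=False)
slow = render_response(steps, "x = 5", thinking=True, budget=2)
```

The latency-for-accuracy trade in the text corresponds to how much goes between the delimiters: nothing in the fast path, up to the budget in the slow one.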

Kimi k1.5: long-CoT RL with a length leash

Kimi k1.5 (Moonshot AI, 2025) is the parallel-frontier reasoning recipe, developed alongside DeepSeek-R1 but with its own emphasis. Its signature is explicit length control during RL: long chains of thought are powerful but expensive and prone to rambling, so Kimi shapes the reward to favor getting there efficiently, not just getting there. It’s online policy optimization on verifiable reasoning tasks with verbosity held on a leash: the same length-reward concern that DAPO’s overlong shaping addresses, here promoted to a first-class design goal.
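A length-aware reward of this kind might look like the sketch below. The specific formula (a term running linearly from +0.5 for the shortest sample in a group to -0.5 for the longest, with incorrect samples never earning a positive length term) and the `weight` parameter are assumptions chosen for illustration. Check the Kimi k1.5 report for the exact formulation it uses.

```python
from typing import List


def length_shaped_rewards(
    correct: List[bool], lengths: List[int], weight: float = 1.0
) -> List[float]:
    """Add a length term to a 0/1 correctness reward for one prompt's samples.

    lam runs from +0.5 (shortest sample) to -0.5 (longest sample).
    Correct samples receive lam; incorrect samples receive min(0, lam),
    so a wrong answer is never rewarded just for being short.
    """
    if len(correct) != len(lengths):
        raise ValueError("correct and lengths must have the same length")
    if not lengths:
        return []
    shortest, longest = min(lengths), max(lengths)
    rewards: List[float] = []
    for is_correct, length in zip(correct, lengths):
        if longest == shortest:
            lam = 0.0  # all samples equally long: no length signal
        else:
            lam = 0.5 - (length - shortest) / (longest - shortest)
        length_term = lam if is_correct else min(0.0, lam)
        rewards.append(float(is_correct) + weight * length_term)
    return rewards


# Three samples for one prompt: short-correct, long-correct, short-wrong.
example = length_shaped_rewards([True, True, False], [100, 300, 100])
```

Between two correct samples the shorter one scores higher, which is the pressure toward efficiency that the text describes. Normalizing within the sample group keeps the term comparable across prompts with very different natural lengths.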

The recipes side by side

Recipe	SFT	Preference stage	Verifiable-reward RL	Distinguishing move
Tülu 3	mixed instruction data	DPO	RLVR (verifier rewards)	fully open data + code + evals
Llama 3	iterative, from rejection sampling	DPO, multiple rounds	—	offline preference at scale
Qwen2.5 / 3	multi-stage	DPO + online RL	GRPO-style RL	thinking / non-thinking modes
Kimi k1.5	cold-start	online policy opt	long-CoT RL	explicit length control

Read the columns, not the rows: everyone does SFT, everyone does some preference optimization, and three of the four recipes add a verifiable-reward RL stage on top. The order is the same; the dials differ.

Distillation: the small-model shortcut

One more finding reshaped the open ecosystem, and it’s the surprising one. When DeepSeek-R1 distilled its long-CoT reasoning into smaller dense models (Qwen and Llama backbones) via plain supervised distillation, meaning just SFT on R1’s reasoning traces, those distilled models often outperformed the same small model trained directly with GRPO. The reasoning ability of a large RL-trained model transfers through its outputs more cheaply than it can be discovered from scratch by a small model’s own RL. For most teams, the practical lesson is stark: if a strong reasoning teacher exists, distill first and run expensive RL only where you must.
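The data side of supervised distillation is simple enough to sketch. In the toy below, the record fields (`id`, `question`, `answer`), the `is_correct` callable, and the step of filtering traces by correctness are illustrative assumptions. The text only says the student is fine-tuned on the teacher's reasoning traces.

```python
from typing import Callable, Dict, List


def build_distillation_set(
    problems: List[Dict[str, str]],
    teacher_traces: Dict[str, List[str]],    # problem id -> sampled teacher outputs
    is_correct: Callable[[str, str], bool],  # hypothetical answer checker
    max_per_problem: int = 1,
) -> List[Dict[str, str]]:
    """Turn teacher reasoning traces into plain SFT examples for a student."""
    examples: List[Dict[str, str]] = []
    for problem in problems:
        kept = 0
        for trace in teacher_traces.get(problem["id"], []):
            if kept >= max_per_problem:
                break
            # Assumed filter: keep only traces whose final answer checks out.
            if is_correct(trace, problem["answer"]):
                examples.append(
                    {"prompt": problem["question"], "completion": trace}
                )
                kept += 1
    return examples


toy_problems = [{"id": "p1", "question": "What is 2 + 2?", "answer": "4"}]
toy_traces = {"p1": ["2 + 2 = 5", "Two plus two: 2 + 2 = 4", "So the result is 4"]}
distill_data = build_distillation_set(
    toy_problems, toy_traces, lambda trace, gold: trace.strip().endswith(gold)
)
```

The student then trains on `distill_data` with an ordinary SFT loss over the full trace. No reward model, sampling loop, or policy-gradient machinery is needed, which is where the cost advantage comes from.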

Where this is going

These recipes optimize the model’s response — one answer to one prompt, however long the chain of thought inside it. The next frontier breaks that frame: models that take many actions — searching, running code, calling tools — and get rewarded for the outcome of a whole multi-step interaction. That’s agentic and tool-use RL, and it’s where the field is heading next.