Section 25

Scaling open post-training

Tülu 3, Llama 3, Qwen, and Kimi

Papers: Tülu 3: Pushing Frontiers in Open Language Model Post-Training — Lambert et al., 2024 · The Llama 3 Herd of Models — Grattafiori et al., 2024 · Qwen2.5 / Qwen3 — Qwen Team, 2024–2025 · Kimi k1.5 — Moonshot AI, 2025

For most of this explainer we’ve studied techniques in isolation: a reward model here, a DPO loss there, GRPO over there. But a frontier assistant isn’t one technique; it’s a pipeline of them, stacked in a specific order, fed by carefully curated data, and run for several rounds. This section is about how the open ecosystem went from publishing individual tricks to publishing entire, reproducible recipes, and how those recipes quietly converged on the same shape.

Tülu 3: the open recipe, fully spelled out

The clearest place to start is Tülu 3 (Allen AI, Lambert et al., 2024), because it’s the rare report that releases everything: the data, the code, the evaluation suite, and the exact pipeline. That pipeline is the canonical modern stack in three stages:

  1. SFT (supervised fine-tuning) on a large, deliberately mixed instruction dataset, balanced across chat, math, code, safety, and precise instruction following.
  2. DPO (Direct Preference Optimization) on preference data: the cheap, stable offline preference optimization we covered earlier.
  3. RLVR (Reinforcement Learning from Verifiable Rewards) as the final polish, where a verifier (a math answer-checker, a unit-test harness) replaces the learned reward model entirely on tasks where correctness is checkable.
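To make the third stage concrete, here is a minimal sketch of a verifier used as a reward. It assumes responses end with a line of the form "Answer: ..." and that an exact string match against the answer key counts as correct; the function names, the answer format, and the `bonus` value are illustrative assumptions, not Tülu 3's actual implementation.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", response)
    return matches[-1].strip() if matches else None


def verifiable_reward(response: str, gold: str, bonus: float = 1.0) -> float:
    """Binary reward: `bonus` if the extracted answer equals the key, else 0.

    No learned reward model is involved; the checker is a fixed rule.
    """
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0  # unparseable responses earn nothing
    return bonus if answer == gold.strip() else 0.0


print(verifiable_reward("2 + 2 is 4.\nAnswer: 4", "4"))          # 1.0
print(verifiable_reward("I think it is five.\nAnswer: 5", "4"))  # 0.0
```

A real math or code verifier would normalize answers or run unit tests instead of comparing strings, but the shape is the same: a deterministic check maps each response to a reward.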

Tülu 3’s contribution wasn’t a new algorithm; it was naming and operationalizing RLVR as a distinct stage, plus showing that meticulous data curation and a strong evaluation harness matter as much as the loss function. It became the open post-training reference manual — the thing people read to learn the order and the data, not just the math.

Llama 3: offline preference at production scale

Llama 3’s post-training (Grattafiori et al., 2024) is the same SFT-then-preference idea, pushed to industrial scale and run iteratively. Each round takes the current best model, generates many candidate responses, scores them, keeps the best via rejection sampling, fine-tunes on those (SFT), and then runs DPO on preference pairs; the cycle repeats for several rounds, with fresh data each time. Notably, Llama 3 leaned on DPO rather than PPO for its preference stage at scale, betting on the stability and simplicity of offline methods over the moving target of an online RL loop. It’s the production-grade proof that you can get very far with rejection sampling and DPO, no critic in sight.
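One round of the loop described above can be sketched as follows. This is a simplified illustration under stated assumptions: `generate` and `score` are hypothetical callables standing in for the current policy and whatever scorer ranks its outputs, and building each preference pair from the best and worst candidate is a choice made for this sketch, not a description of how Llama 3 sourced its pairs.

```python
from typing import Callable, List, Tuple

SftExample = Tuple[str, str]        # (prompt, chosen response)
PrefPair = Tuple[str, str, str]     # (prompt, chosen, rejected)


def rejection_sampling_round(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    num_candidates: int = 4,
) -> Tuple[List[SftExample], List[PrefPair]]:
    """Sample candidates per prompt, keep the best for SFT, pair best vs. worst."""
    sft_examples: List[SftExample] = []
    preference_pairs: List[PrefPair] = []
    for prompt in prompts:
        candidates = generate(prompt, num_candidates)
        # Score each candidate once, then rank from best to worst.
        scored = sorted(
            ((score(prompt, c), c) for c in candidates),
            key=lambda pair: pair[0],
            reverse=True,
        )
        (best_score, best), (worst_score, worst) = scored[0], scored[-1]
        sft_examples.append((prompt, best))
        # Only emit a pair when the scorer actually separates the two.
        if best_score > worst_score:
            preference_pairs.append((prompt, best, worst))
    return sft_examples, preference_pairs


# Toy stand-ins so the sketch runs: longer responses score higher.
def toy_generate(prompt: str, n: int) -> List[str]:
    return [prompt + " " + "x" * i for i in range(1, n + 1)]


def toy_score(prompt: str, response: str) -> float:
    return float(len(response))


sft, pairs = rejection_sampling_round(["a"], toy_generate, toy_score, 3)
print(sft)    # [('a', 'a xxx')]
print(pairs)  # [('a', 'a xxx', 'a x')]
```

The SFT examples feed the fine-tuning step and the pairs feed DPO; running the function again with the newly trained model as `generate` gives the next round.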

Qwen2.5 and Qwen3: multi-stage SFT + RL, and two minds in one model

The Qwen series (Qwen Team, 2024–2025) shows the recipe with more RL ambition. Qwen2.5 runs large-scale SFT followed by multi-stage RL combining offline (DPO-style) and online (GRPO-style) optimization. Qwen3 adds the headline trick of the reasoning era: a single model with thinking and non-thinking modes. The same weights can either emit a long internal chain of thought before answering or respond directly, with the user (or a budget) choosing which, so one deployment can trade latency for accuracy on demand. Getting both behaviors into one model without them interfering is a post-training data-and-RL balancing act, and Qwen3’s recipe is the leading open demonstration of it.
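On the data side, one way to picture the two-mode idea is a training set in which each example appears once per mode, with a flag in the input selecting the behavior. The sketch below is hypothetical throughout: the flag strings, the `<think>` tag layout, and the 50/50 mix are placeholders chosen for illustration, not Qwen3's actual chat template or data mixture.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Placeholder markers: assumptions for this sketch, not any model's real template.
THINK_FLAG = "[thinking: on]"
NO_THINK_FLAG = "[thinking: off]"


@dataclass
class Example:
    prompt: str
    reasoning: str
    answer: str


def format_pair(example: Example, thinking: bool) -> Tuple[str, str]:
    """Return (model_input, training_target) for the requested mode."""
    flag = THINK_FLAG if thinking else NO_THINK_FLAG
    model_input = f"{flag}\n{example.prompt}"
    # Keep the same tag structure in both modes; the thought is empty when off.
    thought = example.reasoning if thinking else ""
    target = f"<think>{thought}</think>\n{example.answer}"
    return model_input, target


def build_mixed_dataset(examples: List[Example]) -> List[Tuple[str, str]]:
    """Emit every example in both modes so one model sees both behaviors."""
    dataset: List[Tuple[str, str]] = []
    for ex in examples:
        dataset.append(format_pair(ex, thinking=True))
        dataset.append(format_pair(ex, thinking=False))
    return dataset


ex = Example("What is 2+3?", "2 plus 3 equals 5.", "5")
for model_input, target in build_mixed_dataset([ex]):
    print(repr(model_input), "->", repr(target))
```

At inference time the same flag would select the mode; the balancing act mentioned above is about choosing the mix and the later RL so that neither mode degrades the other.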

Kimi k1.5: long-CoT RL with a length leash

Kimi k1.5 (Moonshot AI, 2025) is the parallel-frontier reasoning recipe, developed alongside DeepSeek-R1 but with its own emphasis. Its signature is explicit length control during RL: long chains of thought are powerful but expensive and prone to rambling, so Kimi shapes the reward to favor getting there efficiently, not just getting there. It’s online policy optimization on verifiable reasoning tasks with verbosity held on a leash: the same length-reward concern that DAPO’s overlong shaping addresses, here promoted to a first-class design goal.
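A length-shaped reward of this kind can be sketched in a few lines. The formula below is an illustrative choice, not a claim about Kimi's exact implementation: within a group of responses sampled for one prompt, a length term runs from +0.5 for the shortest to -0.5 for the longest; correct answers receive the full term, and incorrect answers can only be penalized by it.

```python
from typing import List


def length_shaped_rewards(lengths: List[int], correct: List[bool]) -> List[float]:
    """Correctness reward plus a relative length term over one sampled group.

    lengths[i] is the token count of response i; correct[i] is the verifier's
    verdict. All weights here are assumptions made for this sketch.
    """
    if len(lengths) != len(correct):
        raise ValueError("lengths and correct must have the same size")
    min_len, max_len = min(lengths), max(lengths)
    span = max_len - min_len
    rewards = []
    for length, ok in zip(lengths, correct):
        # +0.5 for the shortest response, -0.5 for the longest, 0 if all equal.
        term = 0.0 if span == 0 else 0.5 - (length - min_len) / span
        if ok:
            rewards.append(1.0 + term)
        else:
            # A wrong answer never gains from being short.
            rewards.append(0.0 + min(0.0, term))
    return rewards


print(length_shaped_rewards([100, 300, 500], [True, True, False]))
# [1.5, 1.0, -0.5]
```

Because the term is relative to the group, it pushes toward the shorter of the correct solutions actually sampled rather than toward a fixed token budget.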

The recipes side by side

Recipe | SFT | Preference stage | Verifiable-reward RL | Distinguishing move
Tülu 3 | mixed instruction data | DPO | RLVR (verifier rewards) | fully open data + code + evals
Llama 3 | iterative, from rejection sampling | DPO, multiple rounds | (none) | offline preference at scale
Qwen2.5 / 3 | multi-stage | DPO + online RL | GRPO-style RL | thinking / non-thinking modes
Kimi k1.5 | cold-start | online policy opt | long-CoT RL | explicit length control

Read the columns, not the rows: everyone does SFT, everyone does some preference optimization, and the frontier players add a verifiable-reward RL stage on top. The order is the same; the dials differ.

Distillation: the small-model shortcut

One more finding reshaped the open ecosystem, and it’s the surprising one. When DeepSeek-R1 distilled its long-CoT reasoning into smaller dense models (Qwen and Llama backbones) via plain supervised distillation, meaning nothing more than SFT on R1’s reasoning traces, those small models often outperformed what running GRPO directly on the same small model achieved. The reasoning ability of a large RL-trained model transfers through its outputs more cheaply than it can be discovered from scratch by a small model’s own RL. For most teams, the practical lesson is stark: if a strong reasoning teacher exists, distill first and run expensive RL only where you must.
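In data terms, this kind of distillation is just dataset construction: collect the teacher's reasoning traces and use them as ordinary SFT targets for the student. The sketch below assumes a hypothetical `teacher_generate` callable and an `is_correct` check; filtering out traces whose final answer is wrong is an assumption of this sketch, not a detail confirmed by the text above.

```python
from typing import Callable, Dict, List


def build_distillation_set(
    problems: List[Dict[str, str]],
    teacher_generate: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
) -> List[Dict[str, str]]:
    """Turn teacher reasoning traces into (prompt, response) SFT examples."""
    sft_examples = []
    for problem in problems:
        trace = teacher_generate(problem["question"])
        # Keep only traces that reach the known answer (sketch assumption).
        if is_correct(trace, problem["answer"]):
            sft_examples.append({"prompt": problem["question"], "response": trace})
    return sft_examples


# Toy stand-ins so the sketch runs end to end.
def toy_teacher(question: str) -> str:
    canned = {
        "2+2?": "Two plus two is four. Answer: 4",
        "3*3?": "Three times three is six. Answer: 6",  # deliberately wrong
    }
    return canned[question]


def toy_check(trace: str, gold: str) -> bool:
    return trace.strip().endswith(f"Answer: {gold}")


problems = [
    {"question": "2+2?", "answer": "4"},
    {"question": "3*3?", "answer": "9"},
]
distilled = build_distillation_set(problems, toy_teacher, toy_check)
print(len(distilled))  # 1
```

The student is then trained on `distilled` with the usual next-token objective; no reward, rollout, or policy-gradient machinery is needed on the student side.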

Where this is going

These recipes optimize the model’s response — one answer to one prompt, however long the chain of thought inside it. The next frontier breaks that frame: models that take many actions — searching, running code, calling tools — and get rewarded for the outcome of a whole multi-step interaction. That’s agentic and tool-use RL, and it’s where the field is heading next.