LLM Post-training, from the ground up

A long-form, interactive explainer

A pre-trained language model is a brilliant autocomplete and a terrible assistant. It will happily continue your prompt with more questions, drift off topic, or produce something confidently wrong. Post-training is the second half of the story — the set of techniques that take that raw next-token predictor and turn it into something helpful, honest, and (lately) able to reason.

We follow the field in the order it actually developed: supervised fine-tuning and instruction tuning, then reinforcement learning from human feedback (reward models, PPO, and the whole policy-gradient family), then the offline-preference wave that collapsed RLHF into a single loss (DPO and its cousins), and finally the reasoning era — RL from verifiable rewards, o1 and DeepSeek-R1, GRPO and its 2026 refinements, and agentic tool-use RL.

Only basic machine-learning knowledge is assumed, though a little familiarity with how a transformer is pre-trained helps. Every term is defined the first time it appears; hover over any underlined word for a tooltip, or jump to the glossary at any time. There are interactive widgets throughout: a PPO clipping objective you can bend, a KL leash you can loosen, a reward you can hack until it collapses, and a GRPO group you can resample.

Scope note: this explainer is about post-training only. How the base model was built is covered in the sibling LLM Pre-training explainer; how the finished model runs on a GPU is covered in LLM & vLLM Inference.

Reading time: ~3–4 hours, 27 sections

Contents

Foundations & framing

  1. 01 What is post-training? — Turning a base model into an assistant
  2. 02 From next-token to behavior — Likelihood, KL divergence, and entropy
  3. 03 The alignment problem — Helpful, honest, harmless — and why imitation isn’t enough

Instruction tuning & supervised fine-tuning

  1. 04 Instruction tuning is born — FLAN, T0, and zero-shot generalization
  2. 05 The SFT stage in practice — Demonstrations, chat templates, and data quality
  3. 06 Synthetic & self-generated data — Self-Instruct, Alpaca, and distillation

RLHF & the preference era

  1. 07 Learning from human preferences — Christiano 2017 and the pairwise idea
  2. 08 RLHF scales to language — Summarization, InstructGPT, and the 3-step recipe
  3. 09 Reward models — Bradley–Terry and what an RM really learns
  4. 10 RLAIF & Constitutional AI — AI feedback and scalable oversight

RL fundamentals & PPO

  1. 11 Policy gradients & REINFORCE — Policy, rollout, return, and the score function
  2. 12 Value, advantage, baselines — Critics, GAE, and variance reduction
  3. 13 TRPO to PPO — Trust regions and the clipped surrogate
  4. 14 PPO for RLHF in practice — The KL-to-reference penalty and the loop

Offline & direct preference optimization

  1. 15 Reward hacking & over-optimization — Goodhart’s law and why more RL can hurt
  2. 16 Direct Preference Optimization — Collapsing RLHF into a single loss
  3. 17 The DPO zoo — IPO, KTO, ORPO, and SimPO
  4. 18 Rejection-sampling alignment — RAFT, RRHF, and best-of-N fine-tuning

RLVR & the reasoning era

  1. 19 Bootstrapping reasoning — STaR, self-consistency, and rejection-sampling FT
  2. 20 Process vs outcome rewards — Let’s Verify Step by Step, PRM vs ORM
  3. 21 Inference scaling & o1 — Test-time compute and the reasoning model
  4. 22 RL from verifiable rewards — Verifiers, graders, and RLVR
  5. 23 GRPO & DeepSeek-R1 — Group-relative advantage and critic-free RL

Modern algorithms, agentic RL & the frontier

  1. 24 GRPO refinements — DAPO, Dr.GRPO, VAPO, RLOO, REINFORCE++
  2. 25 Scaling open post-training — Tülu 3, Llama 3, Qwen, and Kimi
  3. 26 Agentic & tool-use RL — Multi-turn trajectories and the 2026 frontier
  4. 27 Recap — The pipeline reassembled, and further reading