LLM Post-training, from the ground up
A long-form, interactive explainer
A pre-trained language model is a brilliant autocomplete and a terrible assistant. It will happily continue your prompt with more questions, drift off topic, or produce something confidently wrong. Post-training is the second half of the story — the set of techniques that take that raw next-token predictor and turn it into something helpful, honest, and (lately) able to reason.
We follow the field in the order it actually developed: supervised fine-tuning and instruction tuning, then reinforcement learning from human feedback (reward models, PPO, and the whole policy-gradient family), then the offline-preference wave that collapsed RLHF into a single loss (DPO and its cousins), and finally the reasoning era — RL from verifiable rewards, o1 and DeepSeek-R1, GRPO and its 2026 refinements, and agentic tool-use RL.
Only basic machine-learning knowledge is assumed, though a little familiarity with how a transformer is pre-trained helps. Every term gets defined the first time it shows up; hover any underlined word for a tooltip, or jump to the glossary at any time. There are interactive widgets throughout: a PPO clipping objective you can bend, a KL leash you can loosen, a reward you can hack until it collapses, and a GRPO group you can resample.
Scope note: this explainer is about post-training only. How the base model was built is covered in the sibling LLM Pre-training explainer; how the finished model runs on a GPU is covered in LLM & vLLM Inference.
Contents
Foundations & framing
Instruction tuning & supervised fine-tuning
RLHF & the preference era
RL fundamentals & PPO
Offline & direct preference optimization
RLVR & the reasoning era
- 19 Bootstrapping reasoning — STaR, self-consistency, and rejection-sampling FT
- 20 Process vs outcome rewards — Let’s Verify Step by Step, PRM vs ORM
- 21 Inference scaling & o1 — Test-time compute and the reasoning model
- 22 RL from verifiable rewards — Verifiers, graders, and RLVR
- 23 GRPO & DeepSeek-R1 — Group-relative advantage and critic-free RL