LLM Post-training, from the ground up

A long-form, interactive explainer

A pre-trained language model is a brilliant autocomplete and a terrible assistant. It will happily continue your prompt with more questions, drift off topic, or produce something confidently wrong. Post-training is the second half of the story — the set of techniques that take that raw next-token predictor and turn it into something helpful, honest, and (lately) able to reason.

We follow the field in the order it actually developed: supervised fine-tuning and instruction tuning; then reinforcement learning from human feedback (reward models, PPO, and the whole policy-gradient family); then the offline-preference wave that collapsed RLHF into a single loss (DPO and its cousins); and finally the reasoning era (RL from verifiable rewards, o1 and DeepSeek-R1, GRPO and its 2026 refinements, and agentic tool-use RL).

Only basic machine-learning knowledge is assumed, though a little familiarity with how a transformer is pre-trained helps. Every term is defined the first time it appears; hover over any underlined word for a tooltip, or jump to the glossary at any time. There are interactive widgets throughout: a PPO clipping objective you can bend, a KL leash you can loosen, a reward you can hack until it collapses, and a GRPO group you can resample.

Scope note: this explainer is about post-training only. How the base model was built is covered in the sibling LLM Pre-training explainer; how the finished model runs on a GPU is covered in LLM & vLLM Inference.

Reading time: ~3–4 hours, 27 sections

Published June 21, 2026

Foundations & framing

Instruction tuning & supervised fine-tuning

RLHF & the preference era

RL fundamentals & PPO

Offline & direct preference optimization

RLVR & the reasoning era

Modern algorithms, agentic RL & the frontier

27 Recap — The pipeline reassembled, and further reading