Section 05

The SFT stage in practice

Demonstrations, chat templates, and data quality

Paper: Training language models to follow instructions with human feedback (InstructGPT) — Ouyang et al., 2022

The previous chapter established why instruction tuning works. This one is about the unglamorous machinery of actually doing it in a production pipeline — the part labeled “Step 1: SFT” in every modern post-training diagram. There is very little new math here. The interesting questions turn out to be about data and plumbing: what format you feed the model, which tokens you compute loss on, and how few examples you can get away with.

SFT is just next-token training, on curated data

When InstructGPT (Ouyang et al., 2022) laid out the three-step recipe (SFT → reward model → PPO) that became the template for ChatGPT and everything after, the first step was deliberately ordinary. Supervised fine-tuning takes the pre-trained base model and continues training it with the exact same objective it was pre-trained on: next-token prediction, maximizing the likelihood of the target text.

The only thing that changes is the diet. Instead of raw web text, the model sees curated (prompt, response) demonstrations: high-quality examples of a prompt followed by the ideal answer, often written or edited by human labelers. For a demonstration with prompt $x$ and response $y = (y_1, \ldots, y_T)$, the loss is the familiar negative log-likelihood:

$$\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log p_\theta\big(y_t \mid x,\, y_{<t}\big)$$

That is it. No reward, no sampling, no RL. SFT is teaching by demonstration: here is a good answer; make answers like this more probable.
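As a rough illustration of the objective above, the sketch below computes the loss for a single demonstration in plain Python. The function name `sft_loss` and the toy probabilities are assumptions made for this example; in a real pipeline the per-token probabilities would come from the model's softmax output.

```python
import math

def sft_loss(token_probs):
    """Negative log-likelihood of one response.

    token_probs[t] is the probability the model assigned to the correct
    response token y_t, given the prompt x and the earlier tokens y_<t.
    """
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs)

# Toy numbers (hypothetical): a three-token response scored by two models.
confident = sft_loss([0.9, 0.8, 0.95])  # model already favors the demonstration
unsure = sft_loss([0.2, 0.1, 0.3])      # model finds the demonstration unlikely
print(f"confident: {confident:.3f}, unsure: {unsure:.3f}")
```

The loss is lower when the model already assigns high probability to the demonstrated tokens, so gradient descent on it pushes the model toward making such answers more probable.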

The chat template: turning a document model into a conversation

A base model only knows how to continue a document. To make it act like a turn-taking assistant, we need a fixed convention for marking who is speaking, called a chat template. The template wraps each message in special tokens: reserved symbols, added to the vocabulary, that act as role and turn markers the model learns to recognize and emit.

A typical exchange, serialized, looks like this:
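As an illustration only, here is a small Python sketch that produces such a serialization. The markers `<|system|>`, `<|user|>`, `<|assistant|>`, and `<|end|>`, the function `render_chat`, and its `add_generation_prompt` flag are all assumptions for this example; every real model family defines its own template and tokens.

```python
# Hypothetical role markers; each real model family defines its own template.
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END_OF_TURN = "<|end|>"  # hypothetical end-of-turn marker

def render_chat(messages, add_generation_prompt=False):
    """Serialize a list of {"role", "content"} dicts into one string."""
    parts = []
    for message in messages:
        marker = ROLE_TOKENS[message["role"]]  # raises KeyError on unknown roles
        parts.append(f"{marker}\n{message['content']}{END_OF_TURN}\n")
    if add_generation_prompt:
        # At inference time, end with an open assistant turn for the model to fill.
        parts.append(f"{ROLE_TOKENS['assistant']}\n")
    return "".join(parts)

conversation = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]
print(render_chat(conversation))
# <|system|>
# You are a concise assistant.<|end|>
# <|user|>
# What is the capital of France?<|end|>
# <|assistant|>
# Paris.<|end|>
```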

That opening <|system|> block is the system prompt: a special, usually hidden instruction that sets the assistant’s persona, rules, and constraints for the whole conversation. Because the model is trained with system prompts in front of many different behaviors, it learns to treat them as standing orders that override the user’s turn-by-turn requests.

Loss masking: only learn the assistant’s words

Here is the single most important practical detail, and the one most often gotten wrong. The serialized example above contains the system prompt, the user’s question, and the assistant’s reply. If you computed the language-modeling loss over all of those tokens, you would be teaching the model to generate user questions and system prompts too — which is not its job, and which dilutes the signal you actually care about.

The fix is loss masking (sometimes “prompt masking”): you compute the next-token loss only on the assistant’s response tokens, and mask out (zero the loss on) the system and user tokens. The prompt tokens are still fed in as context — the model conditions on them — but they contribute nothing to the gradient. Concretely, the masked objective trains only on the spans the assistant is responsible for producing:

$$\mathcal{L}_{\text{SFT}} = -\sum_{t \in \text{assistant}} \log p_\theta\big(y_t \mid \text{context}_{<t}\big)$$

Get this wrong and your model tends to get worse, for a subtle reason: it spends capacity modeling text it will never need to write.
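The masking described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: `build_labels`, `masked_nll`, the token ids, and the probabilities are all hypothetical. The sentinel value -100 is borrowed from PyTorch, where it is the default `ignore_index` of `CrossEntropyLoss`.

```python
import math

IGNORE_INDEX = -100  # same sentinel PyTorch's CrossEntropyLoss ignores by default

def build_labels(segments):
    """Turn (role, token_ids) segments into (input_ids, labels).

    Every token is kept in input_ids so the model can condition on it,
    but only assistant tokens keep a real label; the rest are masked.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels

def masked_nll(target_probs, labels):
    """Sum -log p over unmasked positions only.

    target_probs[i] is the probability the model gave to token i itself,
    given all earlier tokens (the usual one-step shift is assumed done).
    """
    return -sum(
        math.log(p)
        for p, label in zip(target_probs, labels)
        if label != IGNORE_INDEX
    )

# Toy example with made-up token ids and probabilities.
segments = [("system", [1, 2]), ("user", [3, 4, 5]), ("assistant", [6, 7])]
input_ids, labels = build_labels(segments)
probs = [0.5, 0.5, 0.5, 0.5, 0.5, 0.9, 0.8]
print(labels)
print(masked_nll(probs, labels))
```

In this sketch the five system and user positions contribute nothing to the loss, whatever probability the model gave them. One design choice worth noting: the assistant's end-of-turn token is normally left unmasked, since that is how the model learns when to stop.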

Data quality over quantity

The instinct from pre-training (more tokens is more better) does not carry over cleanly to SFT. The landmark demonstration was LIMA (“Less Is More for Alignment,” Zhou et al., 2023): the authors fine-tuned a strong base model (LLaMA 65B) on just 1,000 carefully curated prompt–response pairs and got a model competitive with ones tuned on far larger instruction datasets.

The interpretation they proposed has become a working assumption of the field: a base model has already learned almost everything it knows during pre-training; SFT mostly teaches it the format and style in which to expose that knowledge — how to be helpful, how to lay out an answer, when to stop. That is a thin layer, and a thin layer is best taught by a small set of pristine examples. A few thousand excellent demonstrations consistently beat a noisy ocean of mediocre ones, because every sloppy example teaches the model to be sloppy too.

With the supervised stage understood, the next question is where all those high-quality demonstrations come from when you don’t have an army of labelers. That is the story of synthetic and self-generated data — the next chapter.