Section 05

The SFT stage in practice

Demonstrations, chat templates, and data quality

Paper: Training language models to follow instructions with human feedback (InstructGPT) — Ouyang et al., 2022

The previous chapter established why instruction tuning works. This one is about the unglamorous machinery of actually doing it in a production pipeline — the part labeled “Step 1: SFT” in every modern post-training diagram. There is very little new math here. The interesting questions turn out to be about data and plumbing: what format you feed the model, which tokens you compute loss on, and how few examples you can get away with.

SFT is just next-token training, on curated data

When InstructGPT (Ouyang et al., 2022) laid out the three-step recipe (SFT → reward model → PPO) that became the template for ChatGPT and everything after, the first step was deliberately ordinary. Supervised fine-tuning takes the pre-trained base model and continues training it with the same objective it was pre-trained on: next-token prediction, maximizing the likelihood of the target text.

The only thing that changes is the diet. Instead of raw web text, the model sees curated (prompt, response) demonstrations: high-quality examples of a prompt followed by the ideal answer, often written or edited by human labelers. For a demonstration with prompt $x$ and response $y = (y_1, \ldots, y_T)$, the loss is the familiar negative log-likelihood:

$$\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log p_\theta\big(y_t \mid x,\, y_{<t}\big)$$

That is it. No reward, no sampling, no RL. SFT is teaching by demonstration: here is a good answer; make answers like this more probable.
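To make the sum concrete, here is a minimal sketch in plain Python that evaluates the loss above for a toy response. The vocabulary size, the logits, and the helper names `log_softmax` and `sft_loss` are illustrative assumptions, not anything taken from InstructGPT; a real pipeline would get the logits from the model and run this on tensors.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_z for v in logits]

def sft_loss(step_logits, target_ids):
    """Sum over response positions t of -log p(y_t | x, y_<t)."""
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total

# Hypothetical toy setup: vocabulary of 3 tokens, response of length T = 2.
# Each row stands in for the model's logits at one response position,
# already conditioned on the prompt x and the earlier response tokens.
target_ids = [0, 1]
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
nudged_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]

uniform_loss = sft_loss(uniform_logits, target_ids)  # equals 2 * ln(3)
nudged_loss = sft_loss(nudged_logits, target_ids)    # lower: targets are more probable
print(uniform_loss, nudged_loss)
```

Raising the logit of each demonstrated token lowers the loss, which is all that "make answers like this more probable" means at the level of arithmetic.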

The chat template: turning a document model into a conversation

A base model only knows how to continue a document. To make it act like a turn-taking assistant, we need a fixed convention for marking who is speaking: a chat template. The template wraps each message in special tokens: reserved symbols, added to the vocabulary, that act as role and turn markers the model learns to recognize and emit.

A typical exchange, serialized, looks like this:

A tiny chat template

<|system|>
You are a helpful, concise assistant.
<|user|>
What is the capital of France?
<|assistant|>
Paris.<|end|>

That opening <|system|> block is the system prompt: a special, usually hidden instruction that sets the assistant’s persona, rules, and constraints for the whole conversation. Because the model is trained with system prompts in front of many different behaviors, it learns to treat them as standing orders that generally take precedence over the user’s turn-by-turn requests.
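The serialization in the tiny template above can be sketched as a small function. This is an illustration only: the function name `render_chat`, the message dictionaries, and the `add_generation_prompt` flag are assumptions made for the example, and the marker strings are the toy ones from this section. Real model families each define their own tokens and layout.

```python
# Toy role markers copied from the example template in this section.
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END_TOKEN = "<|end|>"

def render_chat(messages, add_generation_prompt=False):
    """Serialize a list of {"role", "content"} messages into one string."""
    lines = []
    for msg in messages:
        lines.append(ROLE_TOKENS[msg["role"]])
        text = msg["content"]
        if msg["role"] == "assistant":
            # In this toy template only assistant turns carry the end marker.
            text += END_TOKEN
        lines.append(text)
    if add_generation_prompt:
        # At inference time, end with an open assistant turn for the model to fill.
        lines.append(ROLE_TOKENS["assistant"])
    return "\n".join(lines)

conversation = [
    {"role": "system", "content": "You are a helpful, concise assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]
training_text = render_chat(conversation)
inference_prompt = render_chat(conversation[:2], add_generation_prompt=True)
print(training_text)
```

The same function covers both uses of a template: a full conversation becomes a training document, and a conversation cut off before the reply becomes the prompt the model continues.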

Loss masking: only learn the assistant’s words

Here is the single most important practical detail, and the one most often gotten wrong. The serialized example above contains the system prompt, the user’s question, and the assistant’s reply. If you computed the language-modeling loss over all of those tokens, you would be teaching the model to generate user questions and system prompts too — which is not its job, and which dilutes the signal you actually care about.

The fix is loss masking (sometimes “prompt masking”): you compute the next-token loss only on the assistant’s response tokens, and mask out (zero the loss on) the system and user tokens. The prompt tokens are still fed in as context — the model conditions on them — but they contribute nothing to the gradient. Concretely, the masked objective trains only on the spans the assistant is responsible for producing:

$$\mathcal{L}_{\text{SFT}} = -\sum_{t \in \text{assistant}} \log p_\theta\big(y_t \mid \text{context}_{<t}\big)$$

Get this wrong and your model can get measurably worse, for a subtle reason: it spends capacity modeling text it will never need to write.
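A minimal sketch of the masking step, in plain Python, may help. It assumes a toy word-level "tokenizer", the toy markers from the template earlier in this section, and a made-up log-probability for every token; the helper names `build_example` and `masked_nll` are hypothetical. It also assumes the role markers count as context while the closing <|end|> counts as part of the assistant's reply, so the model is trained to stop.

```python
import math

def build_example(messages):
    """Return (tokens, mask); mask[t] is 1 only for tokens the assistant writes."""
    tokens, mask = [], []
    for role, words in messages:
        # Assumption: the role marker is always context, never a training target.
        tokens.append(f"<|{role}|>")
        mask.append(0)
        # Assumption: the end marker belongs to the assistant's reply.
        body = words + ["<|end|>"] if role == "assistant" else words
        tokens.extend(body)
        mask.extend([1 if role == "assistant" else 0] * len(body))
    return tokens, mask

def masked_nll(token_log_probs, mask):
    """Negative log-likelihood summed over unmasked (assistant) positions only."""
    return -sum(lp * m for lp, m in zip(token_log_probs, mask))

messages = [
    ("system", ["You", "are", "concise."]),
    ("user", ["Capital", "of", "France?"]),
    ("assistant", ["Paris."]),
]
tokens, mask = build_example(messages)

# Made-up values: pretend the model gave every token probability 0.5
# given everything before it.
log_probs = [math.log(0.5)] * len(tokens)

masked = masked_nll(log_probs, mask)                  # only "Paris." and "<|end|>" count
unmasked = masked_nll(log_probs, [1] * len(tokens))   # every token counts
print(mask, masked, unmasked)
```

Every token still appears in the input, but only the two assistant positions contribute to the loss. In PyTorch-based pipelines the same effect is commonly achieved by setting the labels of masked positions to -100, the default `ignore_index` of `torch.nn.CrossEntropyLoss`.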

Data quality over quantity

The instinct from pre-training (more tokens is more better) does not carry over cleanly to SFT. The landmark demonstration was LIMA (“Less Is More for Alignment,” Zhou et al., 2023): the authors fine-tuned a strong base model, LLaMA 65B, on just 1,000 carefully curated prompt–response pairs and got a model that human raters often judged competitive with systems tuned on far more data, including an Alpaca-style baseline trained on 52,000 examples.

The interpretation they proposed, which the paper calls the Superficial Alignment Hypothesis, has become a working assumption of the field: a base model has already learned almost everything it knows during pre-training; SFT mostly teaches it the format and style in which to expose that knowledge: how to be helpful, how to lay out an answer, when to stop. That is a thin layer, and a thin layer is best taught by a small set of pristine examples. A few thousand excellent demonstrations can beat a noisy ocean of mediocre ones, because every sloppy example teaches the model to be sloppy too.

Why SFT alone plateaus

SFT can only ever push the model toward imitating its demonstrations. That ceiling is real and it has a name in this explainer: imitation. Three consequences follow.

1. It cannot exceed its demonstrators. If every demonstration is human-written, the model is chasing average-human quality, not the best possible answer.

2. It gets no signal about what’s bad. SFT only ever shows good answers; it never tells the model “this response was worse than that one.” It learns what to do, never what to avoid.

3. It can amplify confident wrongness. Train on demonstrations that state facts crisply, and the model learns the style of confidence even on questions it should be unsure about — a contributor to sycophancy and hallucination.

This plateau is exactly the gap that learning from human preferences — comparisons, rewards, and RL — was invented to close.

With the supervised stage understood, the next question is where all those high-quality demonstrations come from when you don’t have an army of labelers. That is the story of synthetic and self-generated data — the next chapter.