Section 06

Synthetic & self-generated data

Self-Instruct, Alpaca, and distillation

Papers: Self-Instruct: Aligning Language Models with Self-Generated Instructions (Wang et al., 2022) · Stanford Alpaca (Taori et al., 2023)

The last chapter ended on a quiet problem: SFT needs thousands of high-quality (prompt, response) demonstrations, and writing those by hand is slow and expensive. InstructGPT had OpenAI’s labeler workforce; almost nobody else did. The technique that broke this bottleneck — and launched the entire open instruction-tuned-model wave of 2023 — was to stop writing demonstrations by hand and start generating them with a language model.

Self-Instruct: bootstrapping data from the model itself

Self-Instruct (Wang et al., 2022) asked a provocative question: can a model generate its own instruction-tuning data? The pipeline is a bootstrap loop:

Seed. Start with a small pool of human-written seed tasks — the original paper used 175.
Generate. Prompt the model to write new instructions in the style of the seeds, then to generate input–output pairs for each new instruction.
Filter. Throw out instructions too similar to existing ones (to keep diversity), and drop malformed or low-quality generations.
Fine-tune. Add the survivors to the task pool, repeat the generate-and-filter steps until the pool is large enough, then fine-tune the model on the accumulated set.

Run this loop and a model effectively teaches itself to follow instructions, manufacturing tens of thousands of self-generated examples from fewer than two hundred human ones. The diversity filter in step 3 (Filter) is what makes it work: without it, the model generates near-duplicates and the dataset collapses to a handful of patterns, the exact failure mode the previous chapter warned about.
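The generate-and-filter part of the loop can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the Self-Instruct paper reports filtering on ROUGE-L similarity with a 0.7 threshold, but the whitespace tokenization, the helper names (rouge_l, is_novel, bootstrap), and the toy_propose stand-in for the language-model call are all assumptions made for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 on lowercased whitespace tokens (a simplification)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def is_novel(instruction, pool, threshold=0.7):
    """Keep an instruction only if it is not too close to anything in the pool."""
    return all(rouge_l(instruction, existing) < threshold for existing in pool)


def bootstrap(seed_tasks, propose, rounds=2, per_round=5, threshold=0.7):
    """Grow the task pool: propose candidates, drop empty or near-duplicate ones."""
    pool = list(seed_tasks)
    for _ in range(rounds):
        for candidate in propose(pool, per_round):
            if candidate.strip() and is_novel(candidate, pool, threshold):
                pool.append(candidate)
    return pool


# Hypothetical data: two seeds and a canned list standing in for model output.
SEEDS = [
    "Translate the sentence into French.",
    "Classify the review as positive or negative.",
]
CANNED = [
    "Translate this sentence into French.",      # near-duplicate of a seed
    "Write a haiku about the ocean.",
    "Summarize the paragraph in one sentence.",
    "",                                          # malformed generation
]


def toy_propose(pool, n):
    """Stand-in for prompting a model with examples drawn from the pool."""
    return CANNED[:n]


pool = bootstrap(SEEDS, toy_propose)
print(len(pool), "tasks in pool")
```

With these toy inputs the pool grows from two tasks to four: the reworded translation task and the empty string are rejected, and on the second round everything is rejected because it already exists in the pool. Remove the is_novel check and the same four candidates would be appended again every round, which is the collapse the filter is there to prevent.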

Alpaca: distillation from a stronger teacher, for $600

Stanford Alpaca (Taori et al., 2023) took Self-Instruct and made one pragmatic change that turned a clever idea into a movement: instead of having the target model generate its own data, it used a stronger model as the teacher. The recipe was almost comically cheap:

Take a capable base model (LLaMA-7B).
Use Self-Instruct’s bootstrap, but have GPT-3.5 (text-davinci-003) generate the 52,000 instruction-following examples.
Fine-tune LLaMA-7B on those 52K examples.

Total cost: under $600 (roughly $500 of API calls and a few hours of GPU time). The resulting model behaved qualitatively like a much larger, much more expensive assistant. Alpaca’s release (weights, data, and training code) set off a Cambrian explosion of open instruction-tuned models built on the same template.
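To make the fine-tuning step of the recipe concrete, here is a hedged sketch of how one teacher-generated record might be turned into a (prompt, target) pair for SFT. The template wording is modeled on the one distributed with the Alpaca data and should be treated as an approximation; the record fields (instruction, input, output) and the function name to_training_pair are assumptions for illustration.

```python
# Templates modeled on the Alpaca prompt format (approximate wording).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def to_training_pair(record):
    """Turn one teacher-generated record into (prompt, target) strings."""
    if record.get("input", "").strip():
        prompt = PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=record["instruction"])
    return prompt, record["output"]


# Hypothetical records in the shape a teacher model might return.
records = [
    {"instruction": "Give three tips for staying healthy.", "input": "",
     "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Sleep well."},
    {"instruction": "Translate to French.", "input": "Good morning.",
     "output": "Bonjour."},
]

pairs = [to_training_pair(r) for r in records]
for prompt, target in pairs:
    print(prompt + target)
    print("-" * 40)
```

The student is then trained with an ordinary next-token loss on the target text that follows the prompt. Nothing in this step knows or cares that the targets came from a model rather than a person.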

This is distillation by another name

Alpaca is knowledge distillation wearing instruction-tuning clothes. In classic distillation, a small “student” learns to match a large “teacher.” Here the teacher (GPT-3.5) never exposes its weights or probabilities; it only produces text. So the student learns from the teacher’s outputs: the demonstrations themselves are the synthetic data. You are pouring a strong model’s behavior into a cheaper model through the narrow straw of generated examples. It is astonishingly effective, and as we’ll see, astonishingly limited.
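The contrast can be written down as two objectives. The notation is not from the original text and is introduced here only for illustration: p_T is the teacher, p_S the student with parameters θ, τ a softmax temperature, x a prompt, and y a teacher-written response.

```latex
% Classic (logit-level) distillation: the student matches the teacher's
% full, temperature-softened distribution, which requires access to it.
\mathcal{L}_{\text{KD}}(\theta)
  = \mathbb{E}_{x}\Big[\, \mathrm{KL}\big( p_T^{\tau}(\cdot \mid x) \,\big\|\, p_S^{\tau}(\cdot \mid x;\theta) \big) \Big]

% Alpaca-style (output-level) distillation: only sampled text y ~ p_T is
% visible, so the student runs ordinary SFT on the teacher's samples.
\mathcal{L}_{\text{SFT}}(\theta)
  = -\,\mathbb{E}_{x,\; y \sim p_T(\cdot \mid x)}
      \Big[ \sum_{t=1}^{|y|} \log p_S\big(y_t \mid x, y_{<t};\theta\big) \Big]
```

The second objective is a sampled estimate of the cross-entropy between teacher and student. With one response per prompt, the student sees a single draw where the first objective would have shown it the whole distribution, which is one way to read the "narrow straw."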

The benefits, and the catch

The upside is obvious and was the whole point: synthetic instruction data is cheap, fast, and scalable. You can generate a million examples overnight, target any domain you like, and never schedule a human labeler. For getting an open model to behave like an assistant, it works remarkably well.

But there are three structural problems, and they matter more the more you rely on this data.

You inherit the teacher’s flaws. Every bias, factual error, and stylistic tic in the teacher’s outputs gets baked into the student. Distillation copies the bad with the good.
You cannot exceed the teacher. A student trained purely to imitate a teacher’s outputs is chasing the teacher’s ceiling, never surpassing it. Pure imitation is a copy operation, not a creativity operation.
Legal and terms-of-service issues. Most commercial model APIs explicitly forbid using their outputs to train competing models. Alpaca-style distillation lives in a gray (often clearly prohibited) zone, which is why the open-data projects that followed worked hard to use permissively licensed or human-sourced generations.

The imitation gap

There is a subtle trap that early enthusiasm for Alpaca-style models papered over. Imitating a stronger model’s outputs is not the same as matching its capabilities. A small model fine-tuned on a frontier model’s answers learns to mimic the style — the confident tone, the formatting, the helpful preamble — far faster than it learns the underlying competence. The result sounds like the teacher while being meaningfully worse at hard tasks, and shallow benchmarks (or a human skimming a few replies) often fail to notice. This is the imitation gap: cheap distillation buys you a convincing impression of capability, and the gap between impression and reality grows precisely on the inputs that matter most. The lesson is that you cannot distill your way past the teacher — to genuinely exceed your demonstrations, you need a signal beyond imitation.

A signal beyond imitation — twice

That last point is the thread that ties this whole section to the rest of the explainer. Imitation has a ceiling, and there are two ways past it.

The first is to stop imitating and start optimizing a preference or reward signal — which answer is better? — rather than copy this answer. That is RLHF and the preference era, where the next several chapters go.

The second is more surprising, and it’s a forward reference worth planting now: self-generated data makes a triumphant return in the reasoning era. There, instead of distilling from a stronger external teacher, a model generates many candidate solutions, keeps only the ones that are verifiably correct, and fine-tunes on those — rejection sampling and STaR-style bootstrapping (Chapter 19). The difference is decisive: a correctness filter lets a model learn from its own best outputs and genuinely improve, sidestepping the imitation gap entirely. Self-generated data isn’t a dead end — it just needed a way to tell good from bad.
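A toy version of that generate, verify, keep loop is sketched below. It is an illustration under loud assumptions: the "model" is a noisy adder invented for the example, verify checks against a known ground truth, and the names sample_candidates and build_sft_set are hypothetical. Real systems sample from an actual model and verify with unit tests, answer keys, or proof checkers.

```python
import random


def sample_candidates(problem, k, rng):
    """Hypothetical stand-in for sampling k solutions from the model.

    Returns (reasoning, answer) pairs; the toy model is sometimes off by one.
    """
    a, b = problem
    candidates = []
    for _ in range(k):
        answer = a + b + rng.choice([-1, 0, 0, 1])
        candidates.append((f"{a} + {b} = {answer}", answer))
    return candidates


def verify(problem, answer):
    """Correctness filter: compare against a checkable ground truth."""
    a, b = problem
    return answer == a + b


def build_sft_set(problems, k=8, seed=0):
    """Keep at most one verified solution per problem as a training example."""
    rng = random.Random(seed)
    kept = []
    for problem in problems:
        for reasoning, answer in sample_candidates(problem, k, rng):
            if verify(problem, answer):
                kept.append({
                    "prompt": f"What is {problem[0]} + {problem[1]}?",
                    "response": reasoning,
                })
                break  # one verified sample is enough for this sketch
    return kept


problems = [(2, 3), (10, 7), (4, 4)]
sft_set = build_sft_set(problems)
for example in sft_set:
    print(example)
```

Every example that survives is correct by construction, so fine-tuning on the kept set pushes the model toward its own best outputs rather than toward a teacher's average ones. The filter, not the generator, is where the new signal enters.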