Section 06

Synthetic & self-generated data

Self-Instruct, Alpaca, and distillation

Papers: Self-Instruct: Aligning Language Models with Self-Generated Instructions (Wang et al., 2022) · Stanford Alpaca (Taori et al., 2023)

The last chapter ended on a quiet problem: SFT needs thousands of high-quality (prompt, response) demonstrations, and writing those by hand is slow and expensive. InstructGPT had OpenAI’s labeler workforce; almost nobody else did. The technique that broke this bottleneck — and launched the entire open instruction-tuned-model wave of 2023 — was to stop writing demonstrations by hand and start generating them with a language model.

Self-Instruct: bootstrapping data from the model itself

Self-Instruct (Wang et al., 2022) asked a provocative question: can a model generate its own instruction-tuning data? The pipeline is a bootstrap loop:

  1. Seed. Start with a small pool of human-written seed tasks — the original paper used 175.
  2. Generate. Prompt the model to write new instructions in the style of the seeds, then to generate input–output pairs for each new instruction.
  3. Filter. Throw out instructions too similar to existing ones (to keep diversity), and drop malformed or low-quality generations.
  4. Fine-tune. Add the survivors to the task pool, repeat steps 2 and 3 until the pool is large enough, then fine-tune the model on the accumulated set.
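The four steps above can be sketched as a short loop. This is a minimal illustration, not the paper's code: `propose` and `keep` are hypothetical callables standing in for the model prompt and the filter, the toy versions exist only so the sketch runs without a model, and fine-tuning itself is left out.

```python
import random
from typing import Callable, List


def bootstrap_instructions(
    seed_tasks: List[str],
    propose: Callable[[List[str]], List[str]],
    keep: Callable[[str, List[str]], bool],
    rounds: int = 3,
    examples_per_prompt: int = 4,
    rng_seed: int = 0,
) -> List[str]:
    """Grow a task pool from seeds by repeated generate-then-filter rounds."""
    rng = random.Random(rng_seed)
    pool = list(seed_tasks)                    # step 1: seed
    for _ in range(rounds):
        k = min(examples_per_prompt, len(pool))
        demos = rng.sample(pool, k)            # in-context examples for the prompt
        for candidate in propose(demos):       # step 2: generate
            if keep(candidate, pool):          # step 3: filter
                pool.append(candidate)
    return pool                                # step 4 would fine-tune on this pool


# Toy stand-ins (assumptions) so the sketch runs without a real model.
def toy_propose(demos: List[str]) -> List[str]:
    # Emits one variation per demo, plus a deliberate exact duplicate.
    return [d + " in French" for d in demos] + [demos[0]]


def toy_keep(candidate: str, pool: List[str]) -> bool:
    # Rejects empty strings and exact repeats; a real filter would be fuzzier.
    return bool(candidate.strip()) and candidate not in pool


seeds = ["Summarize the text", "Translate the sentence", "List three synonyms"]
pool = bootstrap_instructions(seeds, toy_propose, toy_keep, rounds=2, examples_per_prompt=2)
print(len(seeds), "seed tasks grew to", len(pool))
```

Passing the generator and the filter in as arguments keeps the loop itself independent of any particular model or similarity measure.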

Run this loop and a model effectively teaches itself to follow instructions, manufacturing tens of thousands of self-generated examples from fewer than two hundred human-written seeds. The diversity filter in step 3 is what makes it work: without it, the model generates near-duplicates and the dataset collapses to a handful of patterns, the exact failure mode the previous chapter warned about.
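One way to implement a similarity filter of this kind is to score word overlap between a candidate and every instruction already in the pool. The sketch below uses a ROUGE-L-style F1 built on the longest common subsequence; the whitespace tokenization, the function names, and the 0.7 default threshold are assumptions for illustration.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def overlap_score(x: str, y: str) -> float:
    """ROUGE-L-style F1 over lowercased whitespace tokens, in [0, 1]."""
    a, b = x.lower().split(), y.lower().split()
    if not a or not b:
        return 0.0
    lcs = lcs_length(a, b)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)


def is_novel(candidate: str, pool: List[str], threshold: float = 0.7) -> bool:
    """Keep a candidate only if it overlaps every pooled instruction by less than `threshold`."""
    return all(overlap_score(candidate, existing) < threshold for existing in pool)


pool = ["Translate the sentence into French"]
for candidate in ["Translate the sentence into German", "Write a haiku about autumn"]:
    print(candidate, "->", "keep" if is_novel(candidate, pool) else "drop")
```

Lowering the threshold makes the filter stricter and the pool more varied, at the cost of discarding more generations.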

Alpaca: distillation from a stronger teacher, for $600

Stanford Alpaca (Taori et al., 2023) took Self-Instruct and made one pragmatic change that turned a clever idea into a movement: instead of having the target model generate its own data, it used a stronger model as the teacher. The recipe was almost comically cheap:

  • Take a capable base model (LLaMA-7B).
  • Use Self-Instruct’s bootstrap, but have GPT-3.5 (text-davinci-003) generate the 52,000 instruction-following examples.
  • Fine-tune LLaMA-7B on those 52K examples.

Total cost: under $600 (roughly $500 of API calls and a few hours of GPU time). The resulting model behaved qualitatively like a much larger, much more expensive assistant. Alpaca’s release (weights, data, and training code) set off a Cambrian explosion of open instruction-tuned models built on the same template.
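The data-collection half of this recipe can be sketched in a few lines. Everything here is illustrative: `teacher` is a hypothetical callable standing in for an API call to the stronger model, `fake_teacher` is a stub so the code runs offline, and the text template is an assumption rather than the format Alpaca used.

```python
import json
from typing import Callable, Dict, List


def build_distillation_set(
    instructions: List[str],
    teacher: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Ask the teacher to answer each instruction and keep non-empty answers."""
    records = []
    for instruction in instructions:
        output = teacher(instruction).strip()
        if output:  # drop empty generations
            records.append({"instruction": instruction, "output": output})
    return records


def to_training_text(record: Dict[str, str]) -> str:
    """Flatten one record into a single SFT string (hypothetical template)."""
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Response:\n{record['output']}"
    )


def fake_teacher(instruction: str) -> str:
    # Stand-in for a call to a stronger model; returns nothing for one input
    # so the empty-output filter has something to drop.
    return "" if "skip" in instruction else f"(teacher answer to: {instruction})"


records = build_distillation_set(["Name a prime number", "skip me"], fake_teacher)
print("\n".join(json.dumps(r) for r in records))  # one JSON object per line
print(to_training_text(records[0]))
```

The student is then fine-tuned on the flattened strings exactly as it would be on human-written demonstrations; only the source of the responses has changed.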

The benefits, and the catch

The upside is obvious and was the whole point: synthetic instruction data is cheap, fast, and scalable. You can generate a million examples overnight, target any domain you like, and never schedule a human labeler. For getting an open model to behave like an assistant, it works remarkably well.

But there are three structural problems, and they matter more the more you rely on this data.

  • You inherit the teacher’s flaws. Every bias, factual error, and stylistic tic in the teacher’s outputs gets baked into the student. Distillation copies the bad with the good.
  • You cannot exceed the teacher. A student trained purely to imitate a teacher’s outputs is chasing the teacher’s ceiling, never surpassing it. Pure imitation is a copy operation, not a creativity operation.
  • Legal and terms-of-service issues. Most commercial model APIs explicitly forbid using their outputs to train competing models. Alpaca-style distillation lives in a gray (often clearly prohibited) zone, which is why the open-data projects that followed worked hard to use permissively licensed or human-sourced generations.

A signal beyond imitation — twice

The second of those problems is the thread that ties this whole section to the rest of the explainer. Imitation has a ceiling, and there are two ways past it.

The first is to stop imitating and start optimizing a preference or reward signal: the question becomes “which answer is better?” rather than “copy this answer.” That is RLHF and the preference era, where the next several chapters go.

The second is more surprising, and it’s a forward reference worth planting now: self-generated data makes a triumphant return in the reasoning era. There, instead of distilling from a stronger external teacher, a model generates many candidate solutions, keeps only the ones that are verifiably correct, and fine-tunes on those. This is rejection sampling and STaR-style bootstrapping (Chapter 19). The difference is decisive: a correctness filter lets a model learn from its own best outputs and genuinely improve, sidestepping the imitation gap entirely. Self-generated data isn’t a dead end; it just needed a way to tell good from bad.
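As a rough sketch of that generate, verify, and keep pattern, the toy example below samples several answers per problem and retains only one that passes a checker. The names `sample` and `is_correct`, the noisy arithmetic solver, and the choice of eight candidates are all assumptions for illustration; a real setup would sample from the model and verify with unit tests or an answer key.

```python
import random
from typing import Callable, Dict, List


def rejection_sample(
    problems: List[str],
    sample: Callable[[str, random.Random], str],
    is_correct: Callable[[str, str], bool],
    n_candidates: int = 8,
    rng_seed: int = 0,
) -> List[Dict[str, str]]:
    """Sample several answers per problem and keep one verified answer, if any."""
    rng = random.Random(rng_seed)
    kept = []
    for problem in problems:
        candidates = [sample(problem, rng) for _ in range(n_candidates)]
        verified = [c for c in candidates if is_correct(problem, c)]
        if verified:  # problems with no verified answer contribute nothing
            kept.append({"prompt": problem, "response": verified[0]})
    return kept


# Toy stand-ins (assumptions): problems look like "a+b".
def noisy_adder(problem: str, rng: random.Random) -> str:
    a, b = map(int, problem.split("+"))
    return str(a + b + rng.choice([-1, 0, 0, 1]))  # sometimes off by one


def check_sum(problem: str, answer: str) -> bool:
    a, b = map(int, problem.split("+"))
    return answer == str(a + b)


data = rejection_sample(["2+3", "10+7"], noisy_adder, check_sum)
print(data)
```

The verifier is what does the work here: every kept example is correct by construction, so the fine-tuning set can be better than the model's average output.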