Bootstrapping reasoning
STaR, self-consistency, and rejection-sampling FT
Papers: Chain-of-Thought Prompting — Wei et al., 2022 · Self-Consistency — Wang et al., 2022 · STaR: Self-Taught Reasoner — Zelikman et al., 2022
Everything so far has been about alignment — turning a base model into a helpful, honest assistant. This section turns to a different and, in 2024–2025, electrifying question: can we make a model genuinely reason? Not just retrieve an answer, but work toward it, step by step, the way a person scratches out a derivation on paper? The reasoning revolution that produced o1 and DeepSeek-R1 didn’t appear from nowhere. It was built on three deceptively simple pre-RL ideas, all from 2022, that this chapter assembles. By the end you’ll see the gap they leave — and the gap is exactly the shape of reinforcement learning.
Chain-of-thought: write the steps down
The first idea is almost embarrassingly simple. Ask a model a hard multi-step question (a word problem, a logic puzzle) and force it to emit the final answer immediately, and it often flubs it. Ask it to “think step by step” first, and accuracy jumps. That’s chain-of-thought (CoT) prompting, introduced by Wei et al. (2022): you prompt the model to produce a sequence of intermediate reasoning steps before the answer, either by showing a few worked examples or just by appending the magic phrase (the zero-shot variant, usually credited to Kojima et al., 2022).
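To make the two prompting styles concrete, here is a minimal sketch of how such prompts might be assembled. The helper names, the "Q:/A:" layout, and the worked example are illustrative assumptions, not a format prescribed by either paper.

```python
# Illustrative prompt builders. The function names, the "Q:/A:" layout,
# and the demo problem are hypothetical choices made for this sketch.

def few_shot_cot_prompt(examples, question):
    """Build a few-shot CoT prompt.

    examples: list of (question, reasoning, answer) worked demonstrations.
    The prompt ends with an open "A:" so the model continues with its own steps.
    """
    parts = []
    for q, reasoning, answer in examples:
        parts.append(f"Q: {q}\nA: {reasoning} The answer is {answer}.")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def zero_shot_cot_prompt(question):
    """Build a zero-shot CoT prompt by appending the trigger phrase."""
    return f"Q: {question}\nA: Let's think step by step."


demo = [(
    "A farm has 3 boxes of 6 eggs. 2 eggs break. How many eggs are left?",
    "3 boxes of 6 eggs is 3 * 6 = 18 eggs. After 2 break, 18 - 2 = 16 remain.",
    "16",
)]
question = "A shelf holds 4 rows of 5 books. 3 books are removed. How many remain?"

print(few_shot_cot_prompt(demo, question))
print()
print(zero_shot_cot_prompt(question))
```

Either string would then be sent to whatever completion API you use; the only difference is whether the reasoning pattern is demonstrated or merely requested.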
Why does this work at all? A transformer does a fixed amount of computation per token. A question that needs ten logical steps cannot be solved in the single forward pass that produces one answer token — there simply isn’t enough serial compute. By emitting reasoning tokens first, the model gives itself more forward passes to work with, and each intermediate conclusion is written back into the context for the next step to read. The chain of thought is, quite literally, scratch space: a place to externalize intermediate state that the architecture can’t hold internally. This is the seed of everything that follows.
Self-consistency: sample many, vote
Chain-of-thought is a single shot of reasoning, and a single shot can go wrong — one arithmetic slip dooms the whole chain. Wang et al. (2022) noticed that a hard problem usually has one correct answer but many paths to it, and many different wrong paths to wrong answers. So instead of generating one chain, generate a whole batch — sampling with temperature so they diverge — and then take the majority vote over the final answers.
This is self-consistency, and it works remarkably well. The correct answer tends to be a fixed point that multiple independent reasoning paths converge on, while errors scatter. If five of eight samples land on “18” and the other three disagree with each other, “18” is very likely right. Crucially, self-consistency buys accuracy by spending more test-time compute: you run the model many times instead of once. It’s the first clear instance of a trade we’ll see again and again: pay more compute at inference, get more accuracy, no retraining required. (For why those extra generations cost real GPU time, see the LLM & vLLM Inference explainer.)
Sample several chains of thought for a word problem and the majority vote takes shape: individual chains disagree, but the aggregate sharpens as you add samples, and the gains taper off (the third correct vote matters less than the first).
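The voting step itself is only a few lines. The sketch below pairs it with a toy simulation so the effect of adding samples can be seen without a real model. `toy_sample_answer` is a hypothetical stand-in for "sample one chain at nonzero temperature and read off its final answer", and the 50% per-chain accuracy and the list of wrong answers are made-up assumptions.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return (winning answer, its vote share). Ties go to the answer seen first."""
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


def toy_sample_answer(rng, p_correct=0.5, correct="18",
                      wrong=("15", "16", "17", "20", "21")):
    """Hypothetical stand-in for sampling one chain and extracting its answer.

    With probability p_correct the chain is right; otherwise it lands on one
    of several wrong answers, so errors scatter while correct chains agree.
    """
    return correct if rng.random() < p_correct else rng.choice(wrong)


def vote_accuracy(n_samples, trials=2000, seed=0):
    """Estimate how often the majority vote over n_samples chains is correct."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        answers = [toy_sample_answer(rng) for _ in range(n_samples)]
        hits += majority_vote(answers)[0] == "18"
    return hits / trials


# The example from the text: five of eight chains agree, the rest scatter.
print(majority_vote(["18", "17", "18", "18", "20", "18", "21", "18"]))

# Vote accuracy as the number of sampled chains grows (toy numbers only).
for n in (1, 3, 5, 9, 17):
    print(n, round(vote_accuracy(n), 3))
```

In this toy setup the accuracy climbs quickly over the first few samples and then flattens, which is the tapering described above. Real gains depend on how often the model is right and on how correlated its mistakes are.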
STaR: bootstrap your own reasoning
Chain-of-thought and self-consistency are inference-time tricks — they make a fixed model perform better right now. STaR (Zelikman et al., 2022) asks the next question: can a model learn to reason better by training on its own good chains?
The Self-Taught Reasoner loop is a bootstrap, and it should feel familiar from the previous chapter on rejection sampling:
- Generate. For each problem in a training set (with known answers), prompt the model to produce a chain-of-thought rationale and an answer.
- Filter. Keep only the rationales whose final answer is correct. Throw the rest away.
- Fine-tune. Fine-tune the model on the surviving (problem → correct rationale → answer) examples.
- Repeat. The improved model generates better rationales next round, solving problems it couldn’t before.
The model teaches itself, using nothing but a problem set with answer keys — no human ever writes a single rationale. STaR added one clever patch, rationalization: for problems the model can’t solve, give it the correct answer as a hint, let it generate a rationale that reaches that answer, and train on those too — so the hardest problems still contribute signal.
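One round of the loop, including the rationalization fallback, can be sketched as follows. This is a structural outline under stated assumptions rather than the paper's implementation: `generate` and `fine_tune` are hypothetical callables you would supply, and the toy versions below stand in for a real model so the control flow can be run end to end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    question: str
    answer: str  # the answer key


@dataclass(frozen=True)
class Example:
    question: str
    rationale: str
    answer: str


def star_round(model, problems, generate, fine_tune):
    """Run one generate -> filter -> fine-tune round.

    generate(model, question, hint) -> (rationale, answer)   # hypothetical
    fine_tune(model, examples) -> new model                  # hypothetical
    """
    kept = []
    for p in problems:
        # Generate: first attempt without any hint.
        rationale, answer = generate(model, p.question, None)
        if answer != p.answer:
            # Rationalization: retry with the correct answer as a hint.
            rationale, answer = generate(model, p.question, p.answer)
            if answer != p.answer:
                continue  # Filter: still wrong, discard.
        # The stored example carries no hint, only question, rationale, answer.
        kept.append(Example(p.question, rationale, answer))
    return fine_tune(model, kept), kept


# Toy stand-ins: the "model" is just a dict of questions it can already solve.
def toy_generate(model, question, hint):
    if question in model:
        return f"worked steps for {question}", model[question]
    if hint is not None:
        return f"steps that reach {hint}", hint
    return "confused steps", "?"


def toy_fine_tune(model, examples):
    updated = dict(model)
    updated.update({e.question: e.answer for e in examples})
    return updated


problems = [Problem("2+2", "4"), Problem("3*6-2", "16")]
model = {"2+2": "4"}
model, kept = star_round(model, problems, toy_generate, toy_fine_tune)
print(len(kept), sorted(model))
```

Calling `star_round` repeatedly gives the outer loop. Whether each round fine-tunes the current model or restarts from the base checkpoint is left to `fine_tune` here.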
The thing to notice is what STaR uses to filter. It doesn’t ask a human, and it doesn’t ask a learned reward model. It just checks: did the answer match the answer key? Correctness itself is the filter.
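For numeric problems, such a check can be a few lines of string handling. The sketch below assumes completions end with a phrase like "The answer is 16."; that format, and both function names, are assumptions made for illustration.

```python
import re

# Assumed output convention: the chain states "... answer is <number>".
_ANSWER_RE = re.compile(r"answer is\s*(-?\d[\d,]*(?:\.\d+)?)")


def extract_final_answer(completion):
    """Return the last stated numeric answer as a string, or None if absent."""
    matches = _ANSWER_RE.findall(completion)
    if not matches:
        return None
    return matches[-1].replace(",", "")  # "1,200" -> "1200"


def is_correct(completion, answer_key):
    """Outcome-only filter: compare the final answer with the key, nothing else."""
    predicted = extract_final_answer(completion)
    return predicted is not None and float(predicted) == float(answer_key)


print(is_correct("3 * 6 = 18. 18 - 2 = 16. The answer is 16.", "16"))
print(is_correct("3 * 6 = 18. 18 - 2 = 15. The answer is 15.", "16"))
```

The function never looks at the steps, only at the final number, which is exactly what makes this filter cheap.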
What these three share — and what they’re missing
Step back and the lineage is clear. Chain-of-thought says reason in tokens. Self-consistency says reason many times and aggregate. STaR says train on the reasoning that worked. Together they establish that reasoning is something a model can produce, that more reasoning compute buys accuracy, and that a model can improve its own reasoning from a correctness signal alone.
But STaR is still offline and crude. It keeps a rationale only if the final answer is right — a coarse, all-or-nothing filter. A chain with nine perfect steps and one fatal slip is discarded whole; a chain that stumbles into the right answer by a lucky cancellation of two errors is kept and trained on. STaR can’t tell a good chain from a lucky one, and it can’t give partial credit. It’s a sledgehammer where we’d like a scalpel.
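A tiny worked case shows both failure modes. The chains below are made up for illustration: each step is written as (left operand, operator, right operand, claimed result), which is an assumed representation and not something STaR itself uses.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

# Problem: 3 boxes of 6 eggs, 2 break. Answer key: 16.
ANSWER_KEY = 16

# Two errors that cancel: 3 * 6 is not 20, and 20 - 2 is not 16.
lucky_chain = [(3, "*", 6, 20), (20, "-", 2, 16)]

# One correct step, then a single slip at the end.
near_miss_chain = [(3, "*", 6, 18), (18, "-", 2, 15)]


def outcome_ok(chain, key):
    """The outcome filter: look only at the last claimed result."""
    return chain[-1][3] == key


def bad_steps(chain):
    """Indices of steps whose claimed result does not match the arithmetic."""
    return [i for i, (a, op, b, claimed) in enumerate(chain)
            if OPS[op](a, b) != claimed]


for name, chain in [("lucky", lucky_chain), ("near miss", near_miss_chain)]:
    print(name, "kept:", outcome_ok(chain, ANSWER_KEY), "bad steps:", bad_steps(chain))
```

The outcome filter keeps the lucky chain, in which every step is wrong, and discards the near miss, in which only the last step is. Checking steps is easy here because they are plain arithmetic; for free-form reasoning it is much harder.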
That limitation splits into the two threads that drive the rest of this section. One thread asks: can we score the reasoning itself, step by step, instead of just the final answer? That’s process versus outcome rewards, the next chapter. The other asks: can we turn the correctness signal into a proper RL reward and optimize against it directly, rather than just filtering and fine-tuning? That thread runs through inference scaling and o1, RL from verifiable rewards, and culminates in GRPO and DeepSeek-R1. STaR set the table. Now we eat.