Section 19

Bootstrapping reasoning

STaR, self-consistency, and rejection-sampling FT

Papers: Chain-of-Thought Prompting — Wei et al., 2022 · Self-Consistency — Wang et al., 2022 · STaR: Self-Taught Reasoner — Zelikman et al., 2022

Everything so far has been about alignment, that is, turning a base model into a helpful, honest assistant. This section turns to a different and, in 2024–2025, electrifying question: can we make a model genuinely reason? Not just retrieve an answer, but work toward it, step by step, the way a person scratches out a derivation on paper? The reasoning revolution that produced o1 and DeepSeek-R1 didn’t appear from nowhere. It was built on three deceptively simple pre-RL ideas, all from 2022, that this section assembles. By the end you’ll see the gap they leave, and that gap is exactly the shape of reinforcement learning.

Chain-of-thought: write the steps down

The first idea is almost embarrassingly simple. Ask a model a hard multi-step question (a word problem, a logic puzzle) and if you force it to emit the final answer immediately, it often flubs it. Ask it to “think step by step” first, and accuracy often jumps, especially for larger models. That’s chain-of-thought (CoT) prompting, introduced by Wei et al. (2022): you prompt the model to produce a sequence of intermediate reasoning steps before the answer by showing it a few worked examples. Follow-up work (Kojima et al., 2022) found that, for large enough models, simply appending a phrase such as “Let’s think step by step” can trigger the same behavior with no examples at all.
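As a rough illustration, the prompting styles differ only in how the prompt string is assembled. The sketch below is hypothetical: the function names, the `Q:`/`A:` layout, and the worked exemplar are assumptions made for illustration, not the exact prompts used in the papers.

```python
def direct_prompt(question: str) -> str:
    """Ask for the answer immediately, leaving no room for intermediate steps."""
    return f"Q: {question}\nA: The answer is"


def few_shot_cot_prompt(question: str, exemplars: list[tuple[str, str]]) -> str:
    """Prepend worked (question, reasoning-plus-answer) pairs for the model to imitate."""
    shots = "".join(f"Q: {q}\nA: {worked}\n\n" for q, worked in exemplars)
    return f"{shots}Q: {question}\nA:"


def zero_shot_cot_prompt(question: str) -> str:
    """No exemplars; a trigger phrase alone invites step-by-step reasoning."""
    return f"Q: {question}\nA: Let's think step by step."


# Illustrative exemplar (made up for this sketch).
exemplars = [(
    "A shelf holds 4 rows of 5 books. 3 are removed. How many remain?",
    "4 rows of 5 books is 20 books. 20 - 3 = 17. The answer is 17.",
)]
question = "A baker bakes 3 trays of 8 muffins, then sells 6. How many are left?"
prompt = few_shot_cot_prompt(question, exemplars)
print(prompt)
```

The only difference between the three prompts is what precedes the model's turn; the model and its weights are untouched.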

Why does this work at all? A transformer does a fixed amount of computation per token. A question that needs ten logical steps cannot be solved in the single forward pass that produces one answer token — there simply isn’t enough serial compute. By emitting reasoning tokens first, the model gives itself more forward passes to work with, and each intermediate conclusion is written back into the context for the next step to read. The chain of thought is, quite literally, scratch space: a place to externalize intermediate state that the architecture can’t hold internally. This is the seed of everything that follows.

Self-consistency: sample many, vote

Chain-of-thought is a single shot of reasoning, and a single shot can go wrong — one arithmetic slip dooms the whole chain. Wang et al. (2022) noticed that a hard problem usually has one correct answer but many paths to it, and many different wrong paths to wrong answers. So instead of generating one chain, generate a whole batch — sampling with temperature so they diverge — and then take the majority vote over the final answers.

This is self-consistency, and it works remarkably well. The correct answer tends to be a point of agreement that multiple independent reasoning paths converge on, while errors scatter. If five of eight samples land on “18” and the other three disagree with each other, “18” is very likely right. Crucially, self-consistency buys accuracy by spending more test-time compute: you run the model $N$ times instead of once. It’s the first clear instance of a trade we’ll see again and again: pay more compute at inference, get more accuracy, no retraining required. (For why those extra generations cost real GPU time, see the LLM & vLLM Inference explainer.)
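A back-of-the-envelope calculation shows why voting helps and why the gains flatten. The sketch below assumes each chain is independently correct with the same probability `p`, which real samples only approximate, and computes the chance that a strict majority of `n` chains is correct. A strict majority always wins a plurality vote, so this is a conservative lower bound on the vote's accuracy. The function name and the value `p = 0.6` are illustrative choices.

```python
from math import comb


def strict_majority_prob(n: int, p: float) -> float:
    """P(more than half of n independent chains are correct), each correct w.p. p."""
    need = n // 2 + 1  # smallest count that is a strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


# Odd n avoids ties; print the bound as the sample budget grows.
for n in (1, 5, 15, 45):
    print(n, round(strict_majority_prob(n, 0.6), 3))
```

With `p` above one half the bound climbs as `n` grows, but each additional batch of samples adds less than the one before. With `p` below one half the bound falls instead, although a plurality vote can still succeed in practice when the wrong answers scatter across many different values.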

Try it

Below are five sampled chains of thought for one word problem, together with the resulting vote. Notice how individual chains disagree while the aggregate settles on one answer, and keep in mind that the gains taper off as samples are added: each extra vote for an answer that is already ahead changes the outcome less than the one before.

Self-consistency: sample many chains, then vote

One question, many independent reasoning paths. Individual chains slip up; the majority vote does not.

Question: A baker bakes 3 trays of 8 muffins, then sells 6 of them. How many muffins are left?

Reasoning paths sampled (N = 5):

chain #1 → 18: 3×8 = 24, then 24 − 6 = 18
chain #2 → 24: computed 3×8 = 24, forgot to subtract
chain #3 → 18: 24 muffins baked, minus 6 sold = 18
chain #4 → 18: 3 trays × 8 = 24; 24 − 6 = 18
chain #5 → 16: 3×8 = 22 (slip), 22 − 6 = 16

Vote tally: 18 gets 3 votes; 24 gets 1 vote; 16 gets 1 vote.

Majority vote: 18 (3/5), which is correct. Individual chains were right 3/5 of the time.

Self-consistency trades test-time compute for accuracy: sample many independent chains of thought and take the most common final answer. Each chain can wander or slip on arithmetic, but the errors tend to land on different wrong answers while the correct answer keeps recurring, so the mode of the vote can land on the right answer even when many single samples do not. With N = 1 the vote is just one chain; as N grows, the majority stabilizes.
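The voting step itself is only a few lines. The sketch below takes the final answers of already-sampled chains (here, the five from the muffin example) and returns the most common one; the function name `majority_vote` is an illustrative choice, and the sampling and answer extraction are assumed to have happened elsewhere.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> tuple[str, int]:
    """Return the most common final answer and the number of chains that gave it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    tally = Counter(answers)
    # most_common(1) returns the top entry; ties go to the answer seen first.
    winner, votes = tally.most_common(1)[0]
    return winner, votes


# Final answers of the five chains from the muffin example.
sampled = ["18", "24", "18", "18", "16"]
winner, votes = majority_vote(sampled)
print(f"{winner} ({votes}/{len(sampled)})")  # prints: 18 (3/5)
```

Answers are compared as exact strings here, so in practice they would need normalizing first (for example, treating "18" and "18.0" as the same answer).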

STaR: bootstrap your own reasoning

Chain-of-thought and self-consistency are inference-time tricks — they make a fixed model perform better right now. STaR (Zelikman et al., 2022) asks the next question: can a model learn to reason better by training on its own good chains?

The Self-Taught Reasoner loop is a bootstrap, and it should feel familiar from the previous chapter on rejection sampling:

Generate. For each problem in a training set (with known answers), prompt the model to produce a chain-of-thought rationale and an answer.
Filter. Keep only the rationales whose final answer is correct. Throw the rest away.
Fine-tune. Fine-tune the model on the surviving (problem → correct rationale → answer) examples.
Repeat. The improved model generates better rationales next round, solving problems it couldn’t before.

The model teaches itself from a problem set with answer keys. Apart from the handful of worked examples used as a few-shot prompt to get the first round started, no human writes the rationales. STaR added one clever patch, rationalization: for problems the model can’t solve, give it the correct answer as a hint, let it generate a rationale that reaches that answer, and train on those too, so the hardest problems still contribute signal.
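The loop above, including rationalization, can be sketched in a few lines. This is a minimal outline under stated assumptions, not the authors' code: `generate` and `fine_tune` are hypothetical callables supplied by the caller, and the toy versions at the bottom exist only so the sketch runs end to end. Each round fine-tunes starting from the base model rather than the previous round's model, which is how the STaR paper describes its procedure; treat that detail as one reasonable reading.

```python
from typing import Callable, Optional

Example = tuple[str, str, str]  # (question, rationale, answer)
# Hypothetical interface: generate(model, question, hint) -> (rationale, answer)
Generate = Callable[[object, str, Optional[str]], tuple[str, str]]


def star_round(model, problems, generate: Generate, use_hints: bool = True) -> list[Example]:
    """Generate rationales and keep only those whose final answer matches the key."""
    kept: list[Example] = []
    for question, gold in problems:
        rationale, answer = generate(model, question, None)
        if answer != gold and use_hints:
            # Rationalization: retry with the correct answer supplied as a hint.
            rationale, answer = generate(model, question, gold)
        if answer == gold:  # correctness is the only filter
            kept.append((question, rationale, answer))
    return kept


def star(base_model, problems, generate: Generate, fine_tune, rounds: int = 3):
    """Repeat generate -> filter -> fine-tune for a fixed number of rounds."""
    model = base_model
    for _ in range(rounds):
        kept = star_round(model, problems, generate)
        model = fine_tune(base_model, kept)  # hypothetical trainer; restarts from base
    return model


# Toy stand-ins (assumptions): the "model" is a dict mapping question -> answer.
def toy_generate(model, question, hint):
    if hint is not None:
        return (f"worked backward from {hint}", hint)
    return ("toy rationale", model.get(question, "?"))


def toy_fine_tune(base, examples):
    return {**base, **{q: a for q, _, a in examples}}


problems = [("2+2", "4"), ("3*8", "24")]
final_model = star({"2+2": "4"}, problems, toy_generate, toy_fine_tune, rounds=2)
print(final_model)
```

In a real setup the training example built from a rationalized chain would omit the hint from the prompt, so the model learns to reach the answer unaided.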

The thing to notice is what STaR uses to filter. It doesn’t ask a human, and it doesn’t ask a learned reward model. It just checks: did the answer match the answer key? Correctness itself is the filter.

Step back and the lineage is clear. Chain-of-thought says reason in tokens. Self-consistency says reason many times and aggregate. STaR says train on the reasoning that worked. Together they establish that reasoning is something a model can produce, that more reasoning compute buys accuracy, and that a model can improve its own reasoning from a correctness signal alone.

But STaR is still offline and crude. It keeps a rationale only if the final answer is right — a coarse, all-or-nothing filter. A chain with nine perfect steps and one fatal slip is discarded whole; a chain that stumbles into the right answer by a lucky cancellation of two errors is kept and trained on. STaR can’t tell a good chain from a lucky one, and it can’t give partial credit. It’s a sledgehammer where we’d like a scalpel.
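The blind spot is easy to reproduce on the muffin problem from earlier. In the sketch below, a chain is a list of arithmetic steps, each recording the result the chain claimed. The representation and the names `outcome_ok` and `invalid_steps` are assumptions for illustration; the step checker stands in for the kind of per-step scoring that an outcome-only filter lacks.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

Step = tuple[int, str, int, int]  # (left, op, right, claimed_result)


def outcome_ok(chain: list[Step], gold: int) -> bool:
    """Outcome-only filter: look at nothing but the last claimed result."""
    return chain[-1][3] == gold


def invalid_steps(chain: list[Step]) -> list[int]:
    """Indices of steps whose claimed result does not match the arithmetic."""
    return [i for i, (a, op, b, claimed) in enumerate(chain) if OPS[op](a, b) != claimed]


clean = [(3, "*", 8, 24), (24, "-", 6, 18)]      # every step right
lucky = [(3, "*", 8, 22), (22, "-", 6, 18)]      # two slips that cancel out
near_miss = [(3, "*", 8, 24), (24, "-", 6, 17)]  # one slip, at the very end

for name, chain in [("clean", clean), ("lucky", lucky), ("near_miss", near_miss)]:
    print(name, outcome_ok(chain, gold=18), invalid_steps(chain))
```

The outcome filter keeps `lucky` even though both of its steps are wrong, and discards `near_miss` even though its first step is sound.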

That limitation splits into the two threads that drive the rest of this section. One thread asks: can we score the reasoning itself, step by step, instead of just the final answer? That’s process versus outcome rewards, the next chapter. The other asks: can we turn the correctness signal into a proper RL reward and optimize against it directly, rather than just filtering and fine-tuning? That thread runs through inference scaling and o1, RL from verifiable rewards, and culminates in GRPO and DeepSeek-R1. STaR set the table. Now we eat.