Section 20

Process vs outcome rewards

Let’s Verify Step by Step, PRM vs ORM

Paper: Let’s Verify Step by Step — Lightman et al., 2023

STaR left us with a blunt instrument: it judges a whole chain of reasoning by its final answer alone. But a wrong final answer doesn’t tell you where the reasoning broke, and a right one doesn’t tell you whether the path was sound or lucky. If we want to grade reasoning, we need to decide what we’re grading — the destination, or the journey. That choice defines the two great families of reward models for reasoning, and a 2023 OpenAI paper showed, surprisingly clearly, that the journey wins.

Two ways to grade a solution

Imagine a model writes a ten-step solution to a math problem. There are two natural ways to score it.

An outcome reward model (ORM) looks only at the final answer. Right answer, reward 1; wrong answer, reward 0 (or some learned score in between). It’s simple, cheap to label — you only need answer keys — and it maps directly onto STaR’s correctness filter. But it’s sparse: a single number at the very end of a long chain.

A process reward model (PRM) looks at every step. It scores each reasoning step as it goes: this step is valid, that step introduced an error, this one is fine again. Training a PRM requires process supervision: data where a human (or a strong model) has labeled the correctness of individual steps, not just the final answer. That’s far more expensive to collect but, as we’ll see, far more informative.
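
To make the difference in shape concrete, here is a minimal sketch of the two supervision signals. The `Solution` type and both function names are hypothetical, and exact string match and hand-written step labels stand in for what a learned ORM or PRM would predict.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Solution:
    steps: List[str]      # reasoning steps, in order
    final_answer: str     # the answer the chain ends on


def outcome_target(solution: Solution, answer_key: str) -> float:
    """Outcome supervision: one number for the whole chain.

    Exact string match is an assumption made for illustration.
    """
    return 1.0 if solution.final_answer.strip() == answer_key.strip() else 0.0


def process_targets(step_labels: List[bool]) -> List[float]:
    """Process supervision: one number per step.

    The labels stand in for human (or strong-model) step judgments.
    """
    return [1.0 if ok else 0.0 for ok in step_labels]


sol = Solution(steps=["set up", "compute", "conclude"], final_answer="78")
print(outcome_target(sol, "78"))              # a single scalar
print(process_targets([True, False, True]))   # one scalar per step
```

The ORM target needs only an answer key; the PRM targets need one judgment per step, which is where the labeling cost comes from.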

The credit-assignment intuition

Why would step-level scoring matter so much? Because of credit assignment — the central difficulty of all reinforcement learning. When a long chain ends in a wrong answer, which step caused it? An ORM can’t say. It saw nine good steps and one bad one and reports a single “wrong” at the end. The error signal is smeared across the whole trajectory; the model has to guess which of its many decisions to blame.

A PRM pinpoints it. If step 6 divided by zero, the PRM flags step 6 with a low score and lets steps 1–5 keep their high scores. The blame lands exactly where the mistake happened. This is the difference between a teacher who writes “wrong” at the bottom of your exam and one who circles the precise line where your algebra slipped. The second teacher makes you a better mathematician far faster.

There’s a subtler benefit too. An ORM can be fooled by false positives — chains that reach the right answer through flawed reasoning (two errors that cancel, a lucky guess). To an ORM those look perfect and get rewarded, teaching the model bad habits. A PRM catches the flawed step even when the final answer happens to be right.
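
Both effects can be shown with a toy grader. This is only a sketch: the step judgments are hand-written booleans rather than model outputs, and `grade` and `first_error` are hypothetical helper names.

```python
from typing import Dict, List, Optional


def first_error(step_ok: List[bool]) -> Optional[int]:
    """Return the 1-based index of the first step judged wrong, or None."""
    for i, ok in enumerate(step_ok, start=1):
        if not ok:
            return i
    return None


def grade(step_ok: List[bool], final_correct: bool) -> Dict[str, object]:
    """Contrast the sparse ORM view with the dense PRM view of one chain."""
    return {
        "orm": 1.0 if final_correct else 0.0,           # one number at the end
        "prm": [1.0 if ok else 0.0 for ok in step_ok],  # one number per step
        "first_error": first_error(step_ok),
    }


# Ten steps, step 6 is wrong, and the final answer is wrong.
broken = grade([True] * 5 + [False] + [True] * 4, final_correct=False)

# False positive: two errors cancel, so the final answer is right.
lucky = grade([True, False, False, True], final_correct=True)

print(broken["orm"], broken["first_error"])
print(lucky["orm"], lucky["first_error"])
```

In the first case the ORM reports only a 0 while the PRM view keeps credit on steps 1 to 5 and localizes the fault at step 6. In the second, the ORM pays out a full reward even though the PRM view shows the chain went wrong at step 2.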

Let’s Verify Step by Step

Lightman et al. (2023), in a paper pointedly titled Let’s Verify Step by Step, ran the head-to-head experiment on the MATH benchmark. They trained an ORM and a PRM on the same base model and used each to rank a large pool of candidate solutions, keeping the top-scored one.

The PRM won, and not by a hair: 78.2% vs 72.4% on a representative MATH subset. Step-level supervision produced a reward model that selected correct solutions substantially more reliably than outcome supervision, a 5.8-point gap that holds up across the difficulty range. The dense signal wasn’t just nicer in theory; it measurably picked better answers.
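
The ranking procedure can be sketched as best-of-N selection. To compare whole solutions, the per-step scores have to be collapsed into one number; the paper describes using the probability that every step is correct, computed as the product of the per-step probabilities. The candidates and probabilities below are invented for illustration, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def prm_solution_score(step_probs: Sequence[float]) -> float:
    """Collapse per-step correctness probabilities into one solution score.

    The product means a single low-confidence step drags the whole chain down.
    """
    return math.prod(step_probs)


def best_of_n(candidates: List[Dict]) -> Dict:
    """Return the candidate whose steps the PRM trusts most."""
    return max(candidates, key=lambda c: prm_solution_score(c["step_probs"]))


# Hypothetical PRM outputs for three sampled solutions.
candidates = [
    {"id": "A", "step_probs": [0.90, 0.90, 0.90]},
    {"id": "B", "step_probs": [0.99, 0.20, 0.99]},        # one doubtful step
    {"id": "C", "step_probs": [0.95, 0.95, 0.95, 0.95]},
]

best = best_of_n(candidates)
print(best["id"])
```

Candidate B has the two most confident individual steps, but its one doubtful step sinks its score, which is the behavior you want from a step-level verifier.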

The paper’s other lasting contribution was the data. To train the PRM, OpenAI collected PRM800K — roughly 800,000 step-level human correctness labels over MATH solutions — and released it. That dataset became the reference corpus for process supervision and seeded a wave of follow-up work on reasoning reward models.

Try it

Below is a multi-step math solution with a wrong step buried in the middle, scored both ways. The ORM sees only the final box; the PRM scores each step and pinpoints exactly where the chain went off the rails.

Process vs outcome reward

A solution with a wrong middle step. An outcome RM sees only the final answer; a process RM scores every step.

Problem

A train travels 240 km in 3 hours, then 150 km in 2 hours. What is its average speed for the whole trip?

Correct answer: 78 km/h
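
As a quick check of the answer key, average speed is total distance over total time. The sketch below also computes the mean of the two leg speeds, a common slip included purely for illustration; it is not necessarily the error in the example solution.

```python
# Each leg is (distance in km, time in hours), taken from the problem statement.
legs = [(240, 3), (150, 2)]

total_distance = sum(d for d, _ in legs)       # 240 + 150 = 390 km
total_time = sum(t for _, t in legs)           # 3 + 2 = 5 h
average_speed = total_distance / total_time    # 390 / 5 = 78.0 km/h

# Illustrative mistake: averaging the per-leg speeds ignores that
# the legs take different amounts of time.
naive_mean = sum(d / t for d, t in legs) / len(legs)   # (80 + 75) / 2 = 77.5

print(average_speed, naive_mean)
```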

Process reward: 3/5 steps correct — first error at step 3. The dense signal credits the good early steps and pinpoints exactly where reasoning diverged.

An outcome reward model only sees the final answer, so credit is sparse: a single wrong arithmetic step at step 3 tanks the whole reward with no indication of why. A process reward model scores each step, so credit is dense: it rewards the correct setup, flags the exact step that went wrong, and is far more useful for teaching a model to reason.

The cost — and the escape hatch

Process supervision sounds strictly better, so why isn’t every reward model a PRM? Cost. Labeling per-step correctness is laborious and requires expertise — someone (or some strong model) has to read every step of every solution and judge it. PRM800K took enormous human effort. An ORM, by contrast, needs only an answer key, which often already exists. Dense supervision buys you better credit assignment, but you pay for it in annotation.

And here’s the escape hatch that reframes the whole debate. In verifiable domains (math with a known numeric answer, code with a test suite, formal proofs a checker can validate) you can get a correct outcome signal for free, with no learned model and no human labeling at all. A ground-truth check can’t be fooled the way a learned ORM can: it never misjudges whether the final answer is right. It may still reward a flawed path that happens to land on the right answer, but the noise of a learned approximation is gone. The expense of process supervision is largely the expense of learning to judge; when judgment is mechanical, much of that cost evaporates.
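
A mechanical outcome check can be very small. The sketch below is an assumption-laden illustration, not any particular system's reward function: it takes the last number in the model's text as its answer, and it treats a code candidate as a plain Python callable run against input/output pairs. All names are hypothetical.

```python
import re
from typing import Callable, Iterable, Optional, Tuple


def extract_final_number(text: str) -> Optional[float]:
    """Take the last number in the text as the final answer (a simplification)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None


def math_reward(completion: str, expected: float, tol: float = 1e-6) -> float:
    """Reward 1.0 if the extracted answer matches the answer key, else 0.0."""
    value = extract_final_number(completion)
    return 1.0 if value is not None and abs(value - expected) <= tol else 0.0


def code_reward(candidate: Callable[[int], int],
                tests: Iterable[Tuple[int, int]]) -> float:
    """Reward 1.0 only if the candidate passes every test; crashes count as 0.0."""
    try:
        return 1.0 if all(candidate(x) == y for x, y in tests) else 0.0
    except Exception:
        return 0.0


print(math_reward("Total is 390 km over 5 h, so the average speed is 78 km/h", 78))
print(code_reward(lambda x: x * 2, [(1, 2), (3, 6)]))
```

Neither function is learned and neither needs a human in the loop, which is the sense in which the signal is free. A real pipeline would need stricter answer parsing and sandboxed code execution.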

That observation is the hinge of this entire section. The next chapters pursue it relentlessly: o1 shows what happens when you let a model reason at length and reward the outcome, RL from verifiable rewards formalizes the free-correctness-signal idea, and GRPO turns it into a training algorithm that powered DeepSeek-R1. We spent this chapter learning why dense step-level rewards are better, and we’re about to spend the chapters that follow discovering that in verifiable domains, a trustworthy sparse reward can be enough.