Process vs outcome rewards
Let’s Verify Step by Step, PRM vs ORM
Paper: Let’s Verify Step by Step — Lightman et al., 2023
STaR left us with a blunt instrument: it judges a whole chain of reasoning by its final answer alone. But a wrong final answer doesn’t tell you where the reasoning broke, and a right one doesn’t tell you whether the path was sound or lucky. If we want to grade reasoning, we need to decide what we’re grading — the destination, or the journey. That choice defines the two great families of reward models for reasoning, and a 2023 OpenAI paper showed, surprisingly clearly, that the journey wins.
Two ways to grade a solution
Imagine a model writes a ten-step solution to a math problem. There are two natural ways to score it.
An outcome reward model (ORM) looks only at the final answer. Right answer, reward 1; wrong answer, reward 0 (or some learned score in between). It’s simple and cheap to label, since you only need answer keys, and it maps directly onto STaR’s correctness filter. But it’s sparse: a single number at the very end of a long chain.
A process reward model (PRM) looks at every step. It scores each reasoning step as it goes: this step is valid, that step introduced an error, this one is fine again. Training a PRM requires process supervision: data where a human (or a strong model) has labeled the correctness of individual steps, not just the final answer. That’s far more expensive to collect but, as we’ll see, far more informative.
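To make the two interfaces concrete, here is a minimal Python sketch. The names `Solution`, `orm_reward`, and `prm_rewards` are invented for illustration, and the per-step labels are passed in by hand as a stand-in for what a trained PRM would predict.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Solution:
    steps: List[str]    # reasoning steps, in order
    final_answer: str   # the answer the chain ends on


def orm_reward(solution: Solution, answer_key: str) -> float:
    """Outcome reward: one number for the whole chain, from the final answer only."""
    return 1.0 if solution.final_answer.strip() == answer_key.strip() else 0.0


def prm_rewards(solution: Solution, step_labels: List[bool]) -> List[float]:
    """Process reward: one number per step.

    step_labels is a hypothetical stand-in for a trained PRM's judgments;
    a real PRM would predict them from the step text.
    """
    if len(step_labels) != len(solution.steps):
        raise ValueError("need exactly one label per step")
    return [1.0 if ok else 0.0 for ok in step_labels]
```

The shape of the return values is the whole difference: the ORM hands back a single float, the PRM a list with one entry per step.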
The credit-assignment intuition
Why would step-level scoring matter so much? Because of credit assignment — the central difficulty of all reinforcement learning. When a long chain ends in a wrong answer, which step caused it? An ORM can’t say. It saw nine good steps and one bad one and reports a single “wrong” at the end. The error signal is smeared across the whole trajectory; the model has to guess which of its many decisions to blame.
A PRM pinpoints it. If step 6 divided by zero, the PRM flags step 6 with a low score and lets steps 1–5 keep their high scores. The blame lands exactly where the mistake happened. This is the difference between a teacher who writes “wrong” at the bottom of your exam and one who circles the precise line where your algebra slipped. The second teacher makes you a better mathematician far faster.
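The step-6 scenario can be sketched in a few lines of Python. The scores below are made up for illustration (they are not output from any real model), and `first_flagged_step` is a hypothetical helper that simply reports the first step scored below a threshold.

```python
from typing import List, Optional


def first_flagged_step(step_scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Return the 1-based index of the first step scored below threshold, or None."""
    for i, score in enumerate(step_scores, start=1):
        if score < threshold:
            return i
    return None


# Hypothetical PRM scores for a ten-step chain whose step 6 divides by zero.
prm_scores = [0.97, 0.95, 0.96, 0.94, 0.93, 0.04, 0.40, 0.35, 0.30, 0.28]

# The ORM's entire output for the same chain: a single "wrong".
orm_score = 0.0

print(first_flagged_step(prm_scores))  # 6
```

The ORM’s `0.0` says only that something went wrong somewhere; the PRM’s list says where to look.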
There’s a subtler benefit too. An ORM can be fooled by false positives — chains that reach the right answer through flawed reasoning (two errors that cancel, a lucky guess). To an ORM those look perfect and get rewarded, teaching the model bad habits. A PRM catches the flawed step even when the final answer happens to be right.
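A toy version of the cancelling-errors case, as a hedged sketch: the chain below is invented, and the step checker is purely mechanical arithmetic standing in for a learned PRM. The target is (3 + 5) * 2 = 16.

```python
import operator
from typing import List, Tuple

# (left operand, operator, right operand, claimed result)
Step = Tuple[int, str, int, int]

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def outcome_reward(steps: List[Step], answer_key: int) -> float:
    """ORM-style check: compare only the last claimed result to the key."""
    return 1.0 if steps[-1][3] == answer_key else 0.0


def process_rewards(steps: List[Step]) -> List[float]:
    """PRM-style check: verify each step's arithmetic on its own terms."""
    return [1.0 if OPS[op](a, b) == claimed else 0.0 for a, op, b, claimed in steps]


# Two mistakes that cancel: 3 + 5 is not 9, and 9 * 2 is not 16.
lucky_chain = [(3, "+", 5, 9), (9, "*", 2, 16)]

print(outcome_reward(lucky_chain, 16))  # 1.0
print(process_rewards(lucky_chain))     # [0.0, 0.0]
```

The outcome check rewards the chain in full, while the step-level check rejects both steps.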
Let’s Verify Step by Step
Lightman et al. (2023), in a paper pointedly titled Let’s Verify Step by Step, ran the head-to-head experiment on the MATH benchmark. They trained an ORM and a PRM on the same base model and used each to rank a large pool of candidate solutions, keeping the top-scored one.
The PRM won, and not by a hair: 78.2% vs 72.4% on a representative MATH subset. Step-level supervision produced a reward model that selected correct solutions substantially more reliably than outcome supervision, a gap of 5.8 percentage points that holds up across the difficulty range. The dense signal wasn’t just nicer in theory; it measurably picked better answers.
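The ranking procedure, usually called best-of-N selection, can be sketched as follows. Two assumptions to flag: collapsing a PRM’s per-step probabilities into one solution score by taking their product is one common choice (the probability that every step is correct), and all candidate data here is invented. The names `prm_solution_score` and `best_of_n` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence


def prm_solution_score(step_probs: Sequence[float]) -> float:
    """Collapse per-step correctness probabilities into one score (assumed: product)."""
    return math.prod(step_probs)


def best_of_n(candidates: List[Dict], score: Callable[[Dict], float]) -> Dict:
    """Return the candidate with the highest score under the given reward model."""
    return max(candidates, key=score)


# Invented candidates: each has a final answer, hypothetical PRM step
# probabilities, and a hypothetical ORM score for the whole solution.
candidates = [
    {"answer": "42", "step_probs": [0.90, 0.90, 0.90], "orm_score": 0.6},
    {"answer": "17", "step_probs": [0.99, 0.20, 0.95], "orm_score": 0.8},
]

prm_pick = best_of_n(candidates, lambda c: prm_solution_score(c["step_probs"]))
orm_pick = best_of_n(candidates, lambda c: c["orm_score"])

print(prm_pick["answer"])  # 42
print(orm_pick["answer"])  # 17
```

With a product, one badly scored step sinks the whole candidate, which is why the second solution loses under the PRM despite its higher ORM score.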
The paper’s other lasting contribution was the data. To train the PRM, OpenAI collected PRM800K — roughly 800,000 step-level human correctness labels over MATH solutions — and released it. That dataset became the reference corpus for process supervision and seeded a wave of follow-up work on reasoning reward models.
Try it
Below is a multi-step math solution with a wrong step buried in the middle. Toggle between ORM and PRM scoring and watch how each assigns credit. The ORM sees only the final box; the PRM lights up each step and pinpoints exactly where the chain went off the rails.
The cost — and the escape hatch
Process supervision sounds strictly better, so why isn’t every reward model a PRM? Cost. Labeling per-step correctness is laborious and requires expertise — someone (or some strong model) has to read every step of every solution and judge it. PRM800K took enormous human effort. An ORM, by contrast, needs only an answer key, which often already exists. Dense supervision buys you better credit assignment, but you pay for it in annotation.
And here’s the escape hatch that reframes the whole debate. In verifiable domains (math with a known numeric answer, code with a test suite, formal proofs a checker can validate) you can get a correct outcome signal for free, with no learned model and no human labeling at all. A ground-truth check can’t be fooled the way a noisy learned approximation can: it never misjudges the final answer, although in math a lucky chain that lands on the right answer still passes. The expense of process supervision is largely the expense of learning to judge; when judgment is mechanical, much of that cost evaporates.
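As a rough illustration of what “mechanical judgment” looks like, here are two tiny checkers in Python. Both are sketches under stated assumptions: `math_reward` assumes answers are plain numeric strings such as `"0.5"` or `"1/2"`, `code_reward` assumes a test suite given as (arguments, expected result) pairs, and neither name comes from any particular library.

```python
from fractions import Fraction
from typing import Callable, Iterable, Tuple


def math_reward(model_answer: str, answer_key: str) -> float:
    """1.0 if the two numeric strings denote the same value, else 0.0."""
    try:
        same = Fraction(model_answer.strip()) == Fraction(answer_key.strip())
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable or malformed answers earn nothing
    return 1.0 if same else 0.0


def code_reward(candidate: Callable, tests: Iterable[Tuple[tuple, object]]) -> float:
    """1.0 only if the candidate returns the expected value on every test case."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return 0.0
        except Exception:
            return 0.0  # a crash counts as a failed test
    return 1.0


print(math_reward("1/2", "0.5"))  # 1.0
print(code_reward(lambda a, b: a + b, [((1, 2), 3), ((0, 0), 0)]))  # 1.0
```

Neither function involves a learned model or a human label; the reward is only as good as the answer key or test suite behind it.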
That observation is the hinge of this entire section. The next chapters pursue it relentlessly: o1 shows what happens when you let a model reason at length and reward the outcome, RL from verifiable rewards (RLVR) formalizes the free-correctness-signal idea, and GRPO turns it into a training algorithm that powered DeepSeek-R1. We spent this chapter learning why dense step-level rewards are better, and we’re about to spend the rest of the section discovering that in verifiable domains, a trustworthy sparse reward can be enough.