RL from verifiable rewards
Verifiers, graders, and RLVR
Paper: Tülu 3: Pushing Frontiers in Open Language Model Post-Training — Lambert et al., 2024
We have a reward problem we’ve been circling for the whole explainer. A learned reward model reward model (RM) A model trained from human preference data to output a scalar score for how good a response is. Stands in for a human judge so RL can query reward millions of times. See in glossary → is a model — and any model can be gamed. Push a policy hard enough against an RM and it finds the cracks: outputs that score high without being good. That’s reward hacking reward hacking When a policy finds ways to score high on the reward model without actually being better — exploiting quirks of an imperfect proxy. A central danger of RL post-training. See in glossary → , the failure we studied back in chapter 15, and it haunts every RLHF pipeline. This chapter is about the idea that, in the right domains, simply deletes the problem — and in doing so, reignited reinforcement learning for LLMs.
The big idea: don’t model the reward, compute it
A learned reward model is a proxy. It was trained to approximate human judgment, so it’s noisy, biased, and — crucially — differentiably exploitable: the policy can discover inputs where the proxy and the truth disagree, and exploit exactly those.
RL from verifiable rewards RLVR Reinforcement Learning from Verifiable Rewards — use an automatic checker (unit tests, an answer key, a math grader) as the reward instead of a learned reward model. No reward hacking of a neural proxy. See in glossary → (RLVR) makes a clean break. Instead of learning a reward, you compute it with an automatic checker — a verifier verifier An automatic, often rule-based checker that returns whether a response is correct (e.g. runs unit tests, compares to a known answer). Provides the reward in RLVR. See in glossary → — that returns a ground-truth-checkable signal:
- Math: does the model’s final answer equal the answer key? Run a symbolic or string match. Reward 1 if it matches, 0 if not.
- Code: does the generated program pass the unit tests? Run the test suite. Reward = fraction (or all/nothing) of tests passed.
- Formal logic / proofs: does a proof checker accept the derivation? Reward by the checker’s verdict.
The reward is typically binary (or close to it), automatic, and — the whole point — correct by construction. There’s no learned approximation between the policy and the truth. The verifier is the truth.
The term RLVR was named and popularized by Lambert et al. in Tülu 3 (2024), the fully-open post-training reference that slotted a verifiable-reward stage alongside SFT and DPO. The technique — RL against a correctness checker — is older and underlies STaR’s filter and o1’s training; Tülu 3 gave it a name, an open recipe, and a place in the standard pipeline.
Why this kills reward hacking
Here is the property that makes RLVR special, stated plainly: you cannot fool a unit test. A learned reward model has a surface the policy can probe for exploits. A verifier has no such surface. The code either passes the tests or it doesn’t; the answer either equals 42 or it doesn’t. There’s no adversarial input that makes a wrong answer register as right, because the check is grounded in ground truth, not in a fallible model of it.
This sidesteps the entire over-optimization reward over-optimization Pushing the policy so hard against a proxy reward that true quality starts to fall even as the proxy keeps rising — the quantitative face of reward hacking (Gao et al., 2022). See in glossary → story from chapter 15. Recall the Goodhart curve: optimize a learned proxy hard enough and true reward eventually falls even as the proxy keeps rising, because the policy is exploiting the gap between proxy and truth. With a verifier there is no gap to exploit — proxy and truth are the same object. You can optimize as hard as you like; the only way to score higher is to actually be more correct.
Try it
Below, submit a math answer or a snippet of code and watch a verifier return a hard, ground-truth reward — then compare it to a noisy learned reward model scoring the same output. Notice how the verifier is crisp and unfoolable while the learned RM wobbles and can be talked into a high score on a wrong answer.
The honest limitations
RLVR is powerful precisely because it’s narrow. Its strengths come with hard edges, and pretending otherwise is how you get burned.
- Only verifiable domains. It works where correctness is mechanically checkable: math, code, formal logic, certain structured tasks. The instant you leave that world — “write a moving poem,” “is this essay persuasive,” “is this medical advice wise and safe” — there’s no checker, and you’re back to learned reward models, preferences, and all their messiness. Verifiable ≠ everything we care about. Much of what we want from an assistant is exactly the un-verifiable part.
- Sparse reward. A verifier typically fires once, at the end: right or wrong. That’s the sparse, hard-credit-assignment regime from chapter 20 — the model gets one bit of feedback for a whole long chain and must figure out which steps to thank. It works because the bit is trustworthy, but it’s still a thin signal that demands clever RL to learn from efficiently.
- Checkable ≠ well-specified. A unit test can have bugs; an answer key can be wrong; “passes the tests” can be satisfied by hard-coding outputs or printing the expected string. The verifier is only as good as its specification. The reward is unhackable relative to its definition — and the definition can still be sloppy.
Tee-up: now we need an algorithm
RLVR gives us a reward we can trust — automatic, unhackable, infinitely scalable in verifiable domains. But a reward is only half a training method. We still need the algorithm that turns “this sample scored 1, that one scored 0” into an updated policy: how to estimate advantages, reduce variance, and keep the update stable, all without the expensive learned critic that PPO drags along.
That’s the final piece, and it’s the climax of the whole RL arc. The next chapter introduces GRPO — a critic-free algorithm built for exactly the binary, sparse, group-sampled rewards that RLVR produces — and then DeepSeek-R1, the open model that combined GRPO with verifiable rewards and, in doing so, reproduced o1 in the open and watched reasoning emerge from nothing but a base model and a correctness checker.