Section 22

RL from verifiable rewards

Verifiers, graders, and RLVR

Paper: Tülu 3: Pushing Frontiers in Open Language Model Post-Training — Lambert et al., 2024

We have a reward problem we’ve been circling for the whole explainer. A learned reward model is a model — and any model can be gamed. Push a policy hard enough against an RM and it finds the cracks: outputs that score high without being good. That’s reward hacking , the failure we studied back in chapter 15, and it haunts every RLHF pipeline. This chapter is about the idea that, in the right domains, simply deletes the problem — and in doing so, reignited reinforcement learning for LLMs.

The big idea: don’t model the reward, compute it

A learned reward model is a proxy. It was trained to approximate human judgment, so it’s noisy, biased, and — crucially — differentiably exploitable: the policy can discover inputs where the proxy and the truth disagree, and exploit exactly those.

RL from verifiable rewards (RLVR) makes a clean break. Instead of learning a reward, you compute it with an automatic checker — a verifier — that returns a ground-truth-checkable signal:

Math: does the model’s final answer equal the answer key? Run a symbolic or string match. Reward 1 if it matches, 0 if not.
Code: does the generated program pass the unit tests? Run the test suite. Reward = fraction (or all/nothing) of tests passed.
Formal logic / proofs: does a proof checker accept the derivation? Reward by the checker’s verdict.

The reward is typically binary (or close to it), automatic, and — the whole point — correct by construction. There’s no learned approximation between the policy and the truth. The verifier is the truth.

The term RLVR was named and popularized by Lambert et al. in Tülu 3 (2024), the fully-open post-training reference that slotted a verifiable-reward stage alongside SFT and DPO. The technique — RL against a correctness checker — is older and underlies STaR’s filter and o1’s training; Tülu 3 gave it a name, an open recipe, and a place in the standard pipeline.

Why this kills reward hacking

Here is the property that makes RLVR special, stated plainly: you cannot fool a unit test. A learned reward model has a surface the policy can probe for exploits. A verifier has no such surface. The code either passes the tests or it doesn’t; the answer either equals 42 or it doesn’t. There’s no adversarial input that makes a wrong answer register as right, because the check is grounded in ground truth, not in a fallible model of it.

This sidesteps the entire over-optimization story from chapter 15. Recall the Goodhart curve: optimize a learned proxy hard enough and true reward eventually falls even as the proxy keeps rising, because the policy is exploiting the gap between proxy and truth. With a verifier there is no gap to exploit — proxy and truth are the same object. You can optimize as hard as you like; the only way to score higher is to actually be more correct.

Try it

Below, submit a math answer or a snippet of code and watch a verifier return a hard, ground-truth reward — then compare it to a noisy learned reward model scoring the same output. Notice how the verifier is crisp and unfoolable while the learned RM wobbles and can be talked into a high score on a wrong answer.

RLVR: verifier vs learned reward model

Type an answer. A verifier checks it against ground truth (binary). A learned RM guesses a fuzzy score that can be gamed.

Problem 1

What is 17 × 23?

Verifier (RLVR)

—

awaiting answer

Learned reward model

—

awaiting answer

RLVR (reinforcement learning from verifiable rewards) replaces a hackable learned reward model with an automatic verifier that checks the answer against ground truth and returns a clean binary reward. There is no neural proxy to game: the learned RM can be fooled by length and confident phrasing, but the verifier only cares whether the answer is actually right. This is why RLVR works best on domains — math, code, formal logic — where correctness is cheaply checkable.

The honest limitations

RLVR is powerful precisely because it’s narrow. Its strengths come with hard edges, and pretending otherwise is how you get burned.

Only verifiable domains. It works where correctness is mechanically checkable: math, code, formal logic, certain structured tasks. The instant you leave that world — “write a moving poem,” “is this essay persuasive,” “is this medical advice wise and safe” — there’s no checker, and you’re back to learned reward models, preferences, and all their messiness. Verifiable ≠ everything we care about. Much of what we want from an assistant is exactly the un-verifiable part.
Sparse reward. A verifier typically fires once, at the end: right or wrong. That’s the sparse, hard-credit-assignment regime from chapter 20 — the model gets one bit of feedback for a whole long chain and must figure out which steps to thank. It works because the bit is trustworthy, but it’s still a thin signal that demands clever RL to learn from efficiently.
Checkable ≠ well-specified. A unit test can have bugs; an answer key can be wrong; “passes the tests” can be satisfied by hard-coding outputs or printing the expected string. The verifier is only as good as its specification. The reward is unhackable relative to its definition — and the definition can still be sloppy.

Tee-up: now we need an algorithm

RLVR gives us a reward we can trust — automatic, unhackable, infinitely scalable in verifiable domains. But a reward is only half a training method. We still need the algorithm that turns “this sample scored 1, that one scored 0” into an updated policy: how to estimate advantages, reduce variance, and keep the update stable, all without the expensive learned critic that PPO drags along.

That’s the final piece, and it’s the climax of the whole RL arc. The next chapter introduces GRPO — a critic-free algorithm built for exactly the binary, sparse, group-sampled rewards that RLVR produces — and then DeepSeek-R1, the open model that combined GRPO with verifiable rewards and, in doing so, reproduced o1 in the open and watched reasoning emerge from nothing but a base model and a correctness checker.