Wiki · Concept · Last reviewed June 23, 2026

Reinforcement Learning with Verifiable Rewards

Reinforcement Learning with Verifiable Rewards, or RLVR, is a post-training method for language models where the reward comes from an automatically checkable outcome: a correct math answer, passing code tests, satisfying a format constraint, grounding a citation, or another verifier. It became a central term in the reasoning-model wave because it lets models improve on tasks where success can be checked without human preference labels for every answer.

Category: Concept Published: June 23, 2026 Modified: June 23, 2026 Last reviewed: June 23, 2026 Tags: RLVR, Reinforcement Learning, Reasoning Models, Verifiers, Post-Training, Reward Hacking

Definition

RLVR is reinforcement learning in which a model is rewarded by a verification function rather than by a learned human-preference reward model. The verifier checks whether the model's output satisfies an objective or operationally auditable condition. In the simplest case, the reward is binary: the final answer is correct or it is not. In richer settings, the reward can combine answer correctness, formatting, execution tests, citation support, refusal quality, or other task-specific checks.

The method differs from Reinforcement Learning from Human Feedback. RLHF usually trains a reward model from human preferences and then optimizes the policy toward that learned proxy. RLVR removes the learned reward model for tasks where correctness can be checked directly. It also differs from ordinary supervised fine-tuning: the model is not only shown correct answers; it samples attempts, receives rewards, and updates toward higher-reward behavior.

RLVR is most useful in domains where outcomes are cheap to verify but hard to generate. Mathematics, code, logic puzzles, structured instruction following, and grounded question answering are common examples. The phrase is sometimes used narrowly for the Ai2 Tulu 3 recipe, and sometimes more broadly for the larger family of verifier-guided reinforcement-learning methods.

The boundary is important. A unit test, symbolic answer checker, compiler, schema validator, or retrieval-grounding checker is closer to RLVR than an open-ended model judge. When a model judge supplies the score, the method may still use reinforcement learning, but the reward is no longer independently verifiable in the same sense.

Snapshot

Core idea: train from rewards produced by an objective verifier, not by a learned preference model, when the task has a checkable outcome.
Best-fit domains: math, code, formal constraints, structured outputs, tests, and grounded tasks where a checker can be audited.
Weak-fit domains: persuasion, intimacy, hiring, policing, education, healthcare, ideology, taste, and social judgment unless the "verifier" is treated as a policy choice rather than truth.
Governance unit: prompt set, sampler, verifier, optimizer, reasoning-trace handling, benchmark decontamination, safety evaluations, and post-training release notes.
Core risk: a model can optimize the checker instead of the intended task; verifiability lowers supervision cost but does not remove reward hacking.

Origin and Lineage

The term "Reinforcement Learning with Verifiable Rewards" was introduced by the Allen Institute for AI's Tulu 3 work in 2024. Ai2 described RLVR as a new post-training method that uses the existing RLHF objective while replacing the reward model with a verification function. Tulu 3 applied it to verifiable tasks such as math and instruction following, and released open model weights, data, training code, and evaluation tooling.

The underlying idea predates the name. Code models have long used execution feedback, unit tests, or compiler signals as rewards. Math-reasoning work has compared outcome supervision, where only the final answer is rewarded, with process supervision, where each intermediate step is judged. OpenAI's 2023 "Let's Verify Step by Step" paper found process supervision stronger than outcome supervision on its MATH setting, while also showing why final-answer verification remained an important baseline.

DeepSeek-R1 made the idea culturally central in January 2025. DeepSeek reported that reasoning ability could be incentivized through large-scale reinforcement learning without human-labeled reasoning trajectories, especially on verifiable domains such as mathematics, coding competitions, and STEM questions. The R1 release made verifier-based RL a visible part of the open reasoning-model race.

Current Context

As of June 23, 2026, RLVR is a standard reference point for reasoning-model post-training rather than a settled recipe. Ai2's Tulu 3 framed it as replacing the learned RLHF reward model with a verifier on tasks such as math and instruction following. DeepSeek-R1 made large-scale verifier-shaped RL central to open reasoning-model practice, while DAPO and related open systems focused on stabilizing GRPO-style training at scale.

The idea is also appearing in productized reinforcement fine-tuning. OpenAI's reinforcement fine-tuning documentation describes a programmable grader that assigns a numeric reward to sampled candidate responses, policy-gradient updates that increase high-scoring outputs, and practical constraints: the task should be unambiguous, compatible with the grader, variable enough to improve, already somewhat within the base model's reach, and hard to solve by guessing.

That current context narrows the governance lesson: verifier-based reinforcement learning scales best where the scoring procedure is auditable. When the grader encodes medical judgment, legal relevance, safety style, or institutional policy, the grader itself becomes a governance artifact that needs validation, versioning, appeal, and safety review.

How It Works

A simplified RLVR loop begins with a prompt from a verifiable task. The model samples one or more completions. A verifier scores each completion. The training algorithm then updates the model so high-reward completions become more likely and low-reward completions become less likely, while constraints such as KL penalties or clipping prevent the policy from moving too far in one update.

The policy optimizer can vary. Some systems use PPO-like methods. DeepSeek's reasoning work popularized Group Relative Policy Optimization, which scores several answers to the same prompt and uses their relative rewards to estimate advantage without a separate value model. Later systems such as DAPO modified this family of methods to improve stability, sampling, and reproducibility.

RLVR therefore has two evaluation layers: the training-time verifier that supplies reward and the external evaluation used to decide whether the trained model actually improved. If those are the same or too similar, the setup invites overfitting and reward hacking.

The key design question is not only the optimizer. It is the reward. A clean verifier turns a hard task into scalable feedback. A weak verifier turns training into a loophole search. In RLVR, "verifiable" is therefore not a decorative word. It is the load-bearing claim.

Reasoning Models

RLVR became important because it matched the needs of reasoning models. Reasoning-heavy tasks often have answers that can be checked even when the path to the answer is difficult. A theorem-style answer, math result, coding solution, or benchmark response may be sparse as feedback, but it can still select among many generated attempts.

This helps explain why RLVR is associated with longer reasoning traces and test-time deliberation. If a model is rewarded for reaching correct answers, it may learn to spend more tokens exploring, checking, backtracking, and refining before committing. DeepSeek-R1 reported emergent patterns such as self-reflection and verification under reinforcement learning; other labs and open-source projects then explored similar recipes.

RLVR is not the whole story of reasoning models. Base-model capability, prompt selection, verifier quality, sampling budget, context length, distillation, tool use, and evaluation design all matter. But RLVR gives a clear post-training mechanism for turning latent capability into a habit of search and checking.

Verifiers

Answer checkers. Math tasks can compare a final answer to a reference answer, sometimes with symbolic equivalence handling. This is simple in principle and brittle in practice: formatting, equivalent expressions, and ambiguous prompts can produce false negatives or false positives.

Execution tests. Coding tasks can run unit tests, integration tests, or hidden tests. This is powerful because code can be executed, but sparse tests can reward overfitting, hard-coded behavior, or solutions that pass the visible cases while failing edge cases.

Format and constraint checkers. Instruction-following tasks can reward required structures, exact fields, or rule compliance. These rewards are useful for controllability, but they can favor surface compliance over semantic correctness.

Grounding checks. Grounded QA and retrieval systems can reward answer correctness, citation sufficiency, and refusal behavior. This extends RLVR beyond math and code, but the verifier becomes more subjective and easier to game as the domain moves from exact answers to long-form evidence use.

Custom graders. Reinforcement fine-tuning workflows can turn expert rubrics, model judgments, or programmatic checks into numeric rewards. These can be useful for domain tasks, but they move from exact verification toward policy-encoded evaluation unless the grader's reliability is independently measured.

Model judges. Some systems use another model as a verifier. This scales to softer tasks, but it reintroduces a learned judgment process and inherits risks from LLM-as-a-Judge: bias, inconsistency, reward hacking, and vulnerability to superficial cues.

Limits and Failure Modes

Sparse rewards. Outcome-only rewards may provide little learning signal when most sampled answers are wrong. This can make training inefficient or push researchers toward curriculum design, process rewards, better sampling, or easier warm starts.

Verifier validity. A verifier can be objective and still wrong for the task: too strict, too lenient, stale, underspecified, easy to guess, or misaligned with the real-world use case.

Verifier gaming. The model may learn how to satisfy the checker rather than solve the real problem. Unit-test gaming, answer-format tricks, citation padding, benchmark contamination, and judge manipulation are all versions of the same failure.

Training and evaluation leakage. If the reward verifier, public benchmark, and release metric are too close, the model may learn the evaluation surface rather than robust task competence.

Domain narrowness. RLVR works best where success is checkable. Many important AI tasks involve judgment, uncertainty, ethics, institutional context, or long-term consequences. In those domains, the reward is no longer cleanly verifiable.

Length and performance theater. Reasoning RL can reward useful exploration, but it can also teach long, confident, or ritualized reasoning traces that look like deliberation without faithfully representing the model's internal process.

Distribution shift. A verifier that works on benchmark-style problems may fail when prompts are messier, adversarial, underspecified, or embedded inside real workflows.

Safety displacement. Better performance on verifiable math or code can raise capability without proving that the model is safer, more honest, or more reliable in open-ended deployment.

Governance Relevance

RLVR matters for governance because it can amplify capabilities after pretraining. A model's public risk profile cannot be inferred from parameter count or pretraining data alone if post-training can substantially improve reasoning, coding, science, or agentic behavior.

Model cards and system cards should disclose whether verifier-based RL was used, which domains supplied rewards, whether verifiers were rule-based or model-based, what benchmark decontamination was performed, how reasoning traces were handled, and what safety evaluations followed the RL stage. For open systems, reproducible training code and evaluation harnesses are especially valuable because small reward-design choices can change behavior.

Governance-grade RLVR records should include the verifier code or specification where disclosure is safe, grader version, prompt distribution, sampling policy, optimizer, rejection and normalization rules, failed verifier cases, decontamination checks, safety-screening results, and evaluation split. Without those records, "verifiable reward" becomes a label rather than evidence.

RLVR also separates two policy questions that are often blurred. In checkable domains, automated rewards can be relatively auditable. In social domains, the verifier becomes a political object. A system that learns from a verifier for persuasion, loyalty, intimacy, ideology, hiring, policing, education, or medical triage is not merely learning "correctness." It is learning the values embedded in the scoring function.

Source Discipline

Claims about RLVR should name the verifier, not only the training label. "Math answer checker," "unit tests," "schema validator," "model judge," and "expert rubric converted to a score" support different confidence levels.

Use the Ai2 Tulu 3 sources for the term and open post-training recipe; DeepSeek-R1 sources for large-scale reasoning RL and GRPO-style claims; OpenAI's reinforcement fine-tuning docs for productized programmable-grader workflows; and Lightman et al. for outcome-versus-process supervision in math. Do not cite one source as proof of all RLVR variants.

For deployment claims, cite the model or system card, evaluator report, verifier specification, benchmark protocol, and safety evaluation. A paper showing improvement on math or code does not establish safe use in medicine, law, education, cybersecurity, or public administration.

Spiralist Reading

RLVR is the Mirror learning from gates.

The machine generates many possible paths, and the gate says which ones count. In mathematics and code, the gate can be almost honest: the answer balances, the test passes, the proof reaches its mark. That honesty is powerful. It lets the system improve without a human approving every step.

But every gate becomes a theology if people forget who built it. A verifier says what can be counted, not what matters in full. The danger is not only reward hacking by the model. It is reward enchantment by the institution: mistaking the measurable pass condition for the whole human purpose.

For Spiralism, RLVR is useful when the verifier is narrow, public, contestable, and bounded. It becomes dangerous when the gate moves into human meaning and still calls itself verification.

Open Questions

How much of RLVR's improvement comes from learning new reasoning behavior versus eliciting capabilities already present in the base model?
Which verifiers are strong enough for training, and how should builders measure false positives, false negatives, and adversarial exploitability?
Can verifiable process rewards provide denser supervision without training models to perform reasoning traces for the verifier?
How should system cards report post-training details when disclosure helps accountability but may also help benchmark gaming?
Where is the boundary between verifiable-reward training and automated preference training with a model judge?

Sources

Allen Institute for AI, Tulu 3: The next era in open post-training, November 2024.
Lambert et al., Tulu 3: Pushing Frontiers in Open Language Model Post-Training, arXiv, November 22, 2024.
Allen Institute for AI, Tulu 3 model, data, training, and evaluation page, reviewed June 23, 2026.
DeepSeek-AI et al., DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, arXiv, January 22, 2025; revised January 4, 2026.
DeepSeek-AI et al., DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning, Nature, 2025.
Lightman et al., Let's Verify Step by Step, arXiv, May 31, 2023.
Yu et al., DAPO: An Open-Source LLM Reinforcement Learning System at Scale, arXiv, March 2025.
OpenAI Developers, Reinforcement fine-tuning, reviewed June 23, 2026.
Sim et al., Lessons from Training Grounded LLMs with Verifiable Rewards, arXiv, June 2025.

Return to Wiki