Reinforcement Learning from Verifiable Rewards (RLVR) is a post-training paradigm that uses automatically verifiable signals – such as correct answers, code execution results, or formal proofs – to train language models through reinforcement learning, replacing expensive human judgment with deterministic rule-based reward functions.
| Aspect | RLHF | RLVR |
|---|---|---|
| Reward source | Human annotators + learned reward model | Rule-based automated verification |
| Reward signal | Subjective quality judgment | Binary/graded correctness (0 or 1) |
| Scalability | Limited by annotation cost | Scales with compute |
| Reward hacking | Vulnerable (exploits learned reward model) | More robust (deterministic checks) |
| Domains | Open-ended generation | Tasks with verifiable outcomes |
| Cost | High (human labelers) | Low (automated verification) |
The key insight: for tasks where correctness can be automatically verified, RLVR eliminates both the expense of human annotation and the reward hacking vulnerabilities inherent in learned reward models.
RLVR applies to domains with clear verification mechanisms: mathematics (exact answer matching), code generation (test execution), and formal theorem proving (proof checkers).
The reward function is typically binary:
$$r(x, y) = \mathbf{1}\!\left[\text{verify}(x, y) = \text{correct}\right]$$
where $x$ is the prompt and $y$ is the model's response. For code generation, this extends to graded rewards based on test pass rates: $r(x, y) = \frac{\text{tests passed}}{\text{total tests}}$.
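As a minimal sketch of these two reward functions, the snippet below implements a binary answer-match reward and a graded pass-rate reward. The `####` answer delimiter and the helper names are illustrative assumptions, not part of any specific system:

```python
def extract_answer(response: str) -> str:
    """Toy extractor: take the text after the last '####' marker (an assumed convention)."""
    return response.split("####")[-1].strip()

def binary_reward(response: str, reference: str) -> float:
    """r(x, y) = 1 if the extracted answer matches the reference, else 0."""
    return 1.0 if extract_answer(response) == reference else 0.0

def graded_reward(test_results: list[bool]) -> float:
    """r(x, y) = tests passed / total tests, for code-generation tasks."""
    return sum(test_results) / len(test_results)
```

Because both functions are deterministic, the same response always earns the same reward, which is exactly what makes the signal hard to game compared with a learned reward model.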
The RLVR training process follows four steps:
```python
# Simplified RLVR training loop (GRPO-style)
from statistics import mean, pstdev

def rlvr_training_step(policy, prompts, verifier, group_size=8, eps=1e-6):
    """One step of RLVR training with GRPO."""
    for prompt in prompts:
        # 1. Sample a group of responses
        responses = [policy.generate(prompt) for _ in range(group_size)]
        # 2. Verify each response with a rule-based reward
        rewards = [verifier.check(prompt, r) for r in responses]
        # 3. Compute group-relative advantages (GRPO)
        mean_r = mean(rewards)
        std_r = pstdev(rewards)
        advantages = [(r - mean_r) / (std_r + eps) for r in rewards]
        # 4. Update policy with clipped objective + KL penalty
        policy.update(responses, advantages, kl_weight=0.01)
```
DeepSeek-R1 is the landmark demonstration of RLVR at scale. It uses Group Relative Policy Optimization (GRPO), which replaces PPO's critic network with group-relative advantage estimation. For a prompt $x$, GRPO samples a group of $G$ responses and computes:
$$\hat{A}_i = \frac{r_i - \mu_G}{\sigma_G}, \quad \mu_G = \frac{1}{G}\sum_{j=1}^{G} r_j, \quad \sigma_G = \sqrt{\frac{1}{G}\sum_{j=1}^{G}(r_j - \mu_G)^2}$$
The policy update uses a clipped surrogate objective with KL penalty:
$$\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\min\!\left(\rho_i \hat{A}_i,\;\text{clip}(\rho_i, 1{-}\epsilon, 1{+}\epsilon)\hat{A}_i\right) - \beta\,D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\right]$$
where the importance ratio $\rho_i = \frac{\pi_\theta(y_i|x)}{\pi_{\text{old}}(y_i|x)}$.
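The advantage normalization and the clipping term can be sketched numerically. This is an illustrative implementation of the two formulas above, with `eps` (clip range) and the numerical-stability constant chosen as typical assumed defaults:

```python
import math

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: A_i = (r_i - mu_G) / sigma_G (population std)."""
    G = len(rewards)
    mu = sum(rewards) / G
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / G)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

def clipped_term(rho: float, advantage: float, eps: float = 0.2) -> float:
    """min(rho * A, clip(rho, 1-eps, 1+eps) * A) from the GRPO surrogate objective."""
    clipped_rho = max(1 - eps, min(1 + eps, rho))
    return min(rho * advantage, clipped_rho * advantage)
```

With binary rewards like `[1, 0, 1, 0]`, the group mean is 0.5 and every correct response gets advantage +1 while every incorrect one gets -1: the group itself is the baseline, with no critic network needed.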
GRPO advantages over PPO:
- No separate critic/value network: the group mean serves as the baseline, roughly halving memory and compute per update.
- Simpler to implement and tune, with no value-function loss to balance against the policy loss.
- Well suited to sparse binary rewards, where learned critics tend to be unstable.
DeepSeek-R1's training pipeline: cold-start SFT on reasoning traces → RLVR with GRPO → rejection sampling → final SFT alignment. The result: open-weight models matching o1-level reasoning.
Code generation is an ideal RLVR domain because verification is unambiguous: a candidate solution either passes its test suite or it does not, and the pass rate provides a natural graded reward signal.
This extends naturally to agentic coding where models iteratively write, test, and debug code across multiple turns.
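A toy verifier in this spirit executes a candidate solution and scores it by test pass rate. This is a deliberately unsafe sketch: a real system would run candidate code in a sandboxed subprocess with timeouts, while plain `exec()` here is for illustration only:

```python
def verify_code(candidate_src: str, fn_name: str, tests: list[tuple]) -> float:
    """Return the fraction of (args, expected) test cases the candidate passes."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # UNSAFE outside a sandbox
        fn = namespace[fn_name]
    except Exception:
        return 0.0  # code that fails to parse or define fn earns zero reward
    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # runtime errors count as failed tests
    return passed / len(tests)
```

Note that the graded score, rather than an all-or-nothing check, gives the policy a gradient toward partially correct solutions early in training.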
Despite RLVR's robustness, models can still exploit verification loopholes, for example by hard-coding expected outputs to pass weak test suites, gaming string-matching answer checkers with formatting tricks, or converging on degenerate solutions the verifier fails to reject.
Mitigation requires continuous monitoring and diverse test suites.