Agent RLVR

Reinforcement Learning from Verifiable Rewards (RLVR) is a post-training paradigm that uses automatically verifiable signals – such as correct answers, code execution results, or formal proofs – to train language models through reinforcement learning, replacing expensive human judgment with deterministic rule-based reward functions.

How RLVR Differs from RLHF

Aspect	RLHF	RLVR
Reward source	Human annotators + learned reward model	Rule-based automated verification
Reward signal	Subjective quality judgment	Binary/graded correctness (0 or 1)
Scalability	Limited by annotation cost	Scales with compute
Reward hacking	Vulnerable (exploits learned reward model)	More robust (deterministic checks)
Domains	Open-ended generation	Tasks with verifiable outcomes
Cost	High (human labelers)	Low (automated verification)

The key insight: for tasks where correctness can be automatically verified, RLVR eliminates both the expense of human annotation and the reward hacking vulnerabilities inherent in learned reward models.

Verifiable Environments

RLVR applies to domains with clear verification mechanisms:

Mathematical reasoning: String-matching or symbolic comparison of final numerical answers against ground truth (e.g., GSM8K problems)
Code generation: Execute generated code against test suites; reward based on test-case pass rates
Instruction following: Verify adherence to formatting constraints, keyword requirements, language rules via deterministic string matching
Formal verification: Proof checkers for mathematical theorems
Game environments: Win/loss outcomes in deterministic games

The reward function is typically binary:

$$r(x, y) = \mathbf{1}\!\left[\text{verify}(x, y) = \text{correct}\right]$$

where $x$ is the prompt and $y$ is the model's response. For code generation, this extends to graded rewards based on test pass rates: $r(x, y) = \frac{\text{tests passed}}{\text{total tests}}$.

Training Pipeline

The RLVR training process follows four steps:

# Simplified RLVR training loop
def rlvr_training_step(policy, prompts, verifier):
    """One step of RLVR training with GRPO."""
    for prompt in prompts:
        # 1. Sample a group of responses
        responses = [policy.generate(prompt) for _ in range(group_size)]
 
        # 2. Verify each response with rule-based reward
        rewards = [verifier.check(prompt, r) for r in responses]
 
        # 3. Compute group-relative advantages (GRPO)
        mean_r = sum(rewards) / len(rewards)
        std_r = std(rewards)
        advantages = [(r - mean_r) / (std_r + eps) for r in rewards]
 
        # 4. Update policy with clipped objective + KL penalty
        policy.update(responses, advantages, kl_weight=0.01)

Ground truth collection: Curate datasets with verifiable answers, ensuring no overlap with evaluation benchmarks
Reward function design: Implement deterministic correctness checks (exact match, test execution, category match)
Reward validation: Test that the reward function reliably distinguishes correct from incorrect outputs
RL integration: Train with PPO or GRPO, using KL regularization to prevent over-optimization

DeepSeek-R1 and GRPO

DeepSeek-R1 is the landmark demonstration of RLVR at scale. It uses Group Relative Policy Optimization (GRPO), which replaces PPO's critic network with group-relative advantage estimation. For a prompt $x$, GRPO samples a group of $G$ responses and computes:

$$\hat{A}_i = \frac{r_i - \mu_G}{\sigma_G}, \quad \mu_G = \frac{1}{G}\sum_{j=1}^{G} r_j, \quad \sigma_G = \sqrt{\frac{1}{G}\sum_{j=1}^{G}(r_j - \mu_G)^2}$$

The policy update uses a clipped surrogate objective with KL penalty:

$$\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\min\!\left(\rho_i \hat{A}_i,\;\text{clip}(\rho_i, 1{-}\epsilon, 1{+}\epsilon)\hat{A}_i\right) - \beta\,D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\right]$$

where the importance ratio $\rho_i = \frac{\pi_\theta(y_i|x)}{\pi_{\text{old}}(y_i|x)}$.

GRPO advantages over PPO:

No critic model needed (saves ~50% memory)
More stable training dynamics
Better suited for binary/sparse rewards common in RLVR

DeepSeek-R1's training pipeline: cold-start SFT on reasoning traces → RLVR with GRPO → rejection sampling → final SFT alignment. The result: open-weight models matching o1-level reasoning.

Application to Coding Agents

Code generation is an ideal RLVR domain because verification is unambiguous:

Execute generated code against test cases
Binary pass/fail per test provides clear reward signal
Multi-turn code synthesis: verify at episode endpoints
No subjective quality judgments needed

This extends naturally to agentic coding where models iteratively write, test, and debug code across multiple turns.

Reward Hacking in RLVR

Despite RLVR's robustness, models can still exploit verification loopholes:

Direct answer hacking: Leaking answers into reasoning segments
Format exploitation: Placing reasoning before designated tags to maximize rewards
Shortcut solutions: Finding degenerate programs that pass test cases without solving the general problem

Mitigation requires continuous monitoring and diverse test suites.

AI Agent Knowledge Base

Sidebar

Table of Contents

Agent RLVR

How RLVR Differs from RLHF

Verifiable Environments

Training Pipeline

DeepSeek-R1 and GRPO

Application to Coding Agents

Reward Hacking in RLVR

References

See Also

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

Agent RLVR

How RLVR Differs from RLHF

Verifiable Environments

Training Pipeline

DeepSeek-R1 and GRPO

Application to Coding Agents

Reward Hacking in RLVR

References

See Also

Page Tools