====== Agent RLVR ======

**Reinforcement Learning from Verifiable Rewards (RLVR)** is a post-training paradigm that uses automatically verifiable signals -- such as correct answers, code execution results, or formal proofs -- to train language models through reinforcement learning, replacing expensive human judgment with deterministic rule-based reward functions.

===== How RLVR Differs from RLHF =====

^ Aspect ^ RLHF ^ RLVR ^
| Reward source | Human annotators + learned reward model | Rule-based automated verification |
| Reward signal | Subjective quality judgment | Binary or graded correctness |
| Scalability | Limited by annotation cost | Scales with compute |
| Reward hacking | Vulnerable (exploits learned reward model) | More robust (deterministic checks) |
| Domains | Open-ended generation | Tasks with verifiable outcomes |
| Cost | High (human labelers) | Low (automated verification) |

The key insight: for tasks where correctness can be automatically verified, RLVR eliminates both the expense of human annotation and the reward hacking vulnerabilities inherent in learned reward models.

===== Verifiable Environments =====

RLVR applies to domains with clear verification mechanisms:

  * **Mathematical reasoning**: String-matching or symbolic comparison of final numerical answers against ground truth (e.g., GSM8K problems)
  * **Code generation**: Execute generated code against test suites; reward based on test-case pass rates
  * **Instruction following**: Verify adherence to formatting constraints, keyword requirements, and language rules via deterministic string matching
  * **Formal verification**: Proof checkers for mathematical theorems
  * **Game environments**: Win/loss outcomes in deterministic games

The reward function is typically binary:

$$r(x, y) = \mathbf{1}\!\left[\text{verify}(x, y) = \text{correct}\right]$$

where $x$ is the prompt and $y$ is the model's response.
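The binary reward above can be sketched as an exact-match verifier for math answers. This is a minimal illustration, not a production grader: the ``normalize_answer`` helper and the GSM8K-style ``####`` answer delimiter are assumptions for the example.

```python
def normalize_answer(text: str) -> str:
    """Canonicalize a numeric answer string for exact-match comparison."""
    return text.strip().rstrip(".").replace(",", "").replace("$", "")

def binary_reward(response: str, ground_truth: str) -> int:
    """r(x, y) = 1 if the extracted final answer matches ground truth, else 0."""
    # Illustrative assumption: the model emits its final answer after "####",
    # GSM8K-style; real pipelines use a task-specific extraction rule.
    final_answer = response.split("####")[-1]
    return int(normalize_answer(final_answer) == normalize_answer(ground_truth))
```

For example, ``binary_reward("Six times seven. #### 42.", "42")`` returns 1, while a response ending in ``#### 41`` returns 0.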
For code generation, this extends to graded rewards based on test pass rates: $r(x, y) = \frac{\text{tests passed}}{\text{total tests}}$.

===== Training Pipeline =====

The RLVR training process involves four components:

  - **Ground truth collection**: Curate datasets with verifiable answers, ensuring no overlap with evaluation benchmarks
  - **Reward function design**: Implement deterministic correctness checks (exact match, test execution, category match)
  - **Reward validation**: Test that the reward function reliably distinguishes correct from incorrect outputs
  - **RL integration**: Train with PPO or GRPO, using KL regularization to prevent over-optimization

A simplified training step (the ``policy`` and ``verifier`` interfaces are schematic):

```python
import statistics

GROUP_SIZE = 8   # responses sampled per prompt
EPS = 1e-8       # numerical stability for advantage normalization

def rlvr_training_step(policy, prompts, verifier):
    """One step of RLVR training with GRPO-style group-relative advantages."""
    for prompt in prompts:
        # 1. Sample a group of responses
        responses = [policy.generate(prompt) for _ in range(GROUP_SIZE)]
        # 2. Verify each response with a rule-based reward
        rewards = [verifier.check(prompt, r) for r in responses]
        # 3. Compute group-relative advantages (GRPO)
        mean_r = statistics.fmean(rewards)
        std_r = statistics.pstdev(rewards)
        advantages = [(r - mean_r) / (std_r + EPS) for r in rewards]
        # 4. Update the policy with a clipped objective plus KL penalty
        policy.update(responses, advantages, kl_weight=0.01)
```

===== DeepSeek-R1 and GRPO =====

**DeepSeek-R1** is the landmark demonstration of RLVR at scale. It uses **Group Relative Policy Optimization (GRPO)**, which replaces PPO's critic network with group-relative advantage estimation.
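The graded pass-rate reward can be sketched as follows. The convention that the candidate source defines a function named ``solution``, and the ``(input, expected output)`` test format, are assumptions for illustration; a real harness would run candidates in a sandbox with timeouts rather than in-process ``exec``.

```python
def pass_rate_reward(candidate_src: str, test_cases: list[tuple[str, str]]) -> float:
    """Graded reward: fraction of test cases the generated code passes.

    Assumes candidate_src defines a function `solution`; each test case
    is an (input, expected output string) pair.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # NOTE: sandboxing omitted for brevity
    except Exception:
        return 0.0  # code that fails to load earns zero reward
    solution = namespace.get("solution")
    if not callable(solution):
        return 0.0
    passed = 0
    for arg, expected in test_cases:
        try:
            if str(solution(arg)) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply does not count as passed
    return passed / len(test_cases)
```

A candidate that passes two of three tests earns a reward of 2/3, giving the policy a denser signal than a single pass/fail bit.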
For a prompt $x$, GRPO samples a group of $G$ responses and computes:

$$\hat{A}_i = \frac{r_i - \mu_G}{\sigma_G}, \quad \mu_G = \frac{1}{G}\sum_{j=1}^{G} r_j, \quad \sigma_G = \sqrt{\frac{1}{G}\sum_{j=1}^{G}(r_j - \mu_G)^2}$$

The policy update uses a clipped surrogate objective with KL penalty:

$$\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\min\!\left(\rho_i \hat{A}_i,\;\text{clip}(\rho_i, 1{-}\epsilon, 1{+}\epsilon)\hat{A}_i\right) - \beta\,D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\right]$$

where the importance ratio is $\rho_i = \frac{\pi_\theta(y_i|x)}{\pi_{\text{old}}(y_i|x)}$.

GRPO's advantages over PPO:

  * No critic model needed (saves ~50% of model memory)
  * More stable training dynamics
  * Better suited to the binary/sparse rewards common in RLVR

DeepSeek-R1's training pipeline: cold-start SFT on reasoning traces -> reasoning-focused RLVR with GRPO -> rejection sampling and SFT on the resulting data -> a final RL stage for general alignment. The result: open-weight models matching o1-level reasoning.

===== Application to Coding Agents =====

Code generation is an ideal RLVR domain because verification is unambiguous:

  * Execute generated code against test cases
  * Binary pass/fail per test provides a clear reward signal
  * Multi-turn code synthesis: verify at episode endpoints
  * No subjective quality judgments needed

This extends naturally to agentic coding, where models iteratively write, test, and debug code across multiple turns.

===== Reward Hacking in RLVR =====

Despite RLVR's robustness, models can still exploit verification loopholes:

  * **Direct answer hacking**: Leaking the ground-truth answer into reasoning segments so the extractor matches it
  * **Format exploitation**: Placing reasoning before designated tags to maximize rewards without genuine reasoning
  * **Shortcut solutions**: Finding degenerate programs that pass the test cases without solving the general problem

Mitigation requires continuous monitoring and diverse test suites.
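The group-relative advantage formula above can be worked through with a small numeric example; the group size and binary rewards are chosen purely for illustration.

```python
import math

# Hypothetical group of G = 4 binary rewards for one prompt
rewards = [1.0, 0.0, 0.0, 1.0]

G = len(rewards)
mu = sum(rewards) / G                                        # mu_G = 0.5
sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / G)   # sigma_G = 0.5
advantages = [(r - mu) / sigma for r in rewards]

print(advantages)  # -> [1.0, -1.0, -1.0, 1.0]
```

With binary rewards the normalization simply pushes probability mass toward the correct responses (+1.0) and away from the incorrect ones (-1.0), with no learned critic involved. Note that $\sigma_G$ here is the population standard deviation, matching the $\frac{1}{G}$ factor in the formula above.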
===== References =====

  * [[https://arxiv.org/abs/2506.11425|arXiv:2506.11425 - RLVR for Agents]]
  * [[https://arxiv.org/abs/2501.12948|arXiv:2501.12948 - DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL]]
  * [[https://arxiv.org/abs/2402.03300|arXiv:2402.03300 - DeepSeekMath (GRPO)]]

===== See Also =====

  * [[agentic_reinforcement_learning|Agentic Reinforcement Learning]] - RL specifically for LLM agents
  * [[process_reward_models|Process Reward Models]] - Dense step-level rewards complementing RLVR
  * [[test_time_compute_scaling|Test-Time Compute Scaling]] - Inference-time benefits of RLVR-trained models