====== Agent RLVR ======

**Reinforcement Learning from Verifiable Rewards (RLVR)** is a post-training paradigm that uses automatically verifiable signals -- such as correct answers, code execution results, or formal proofs -- to train language models through reinforcement learning, replacing expensive human judgment with deterministic rule-based reward functions.

===== How RLVR Differs from RLHF =====

^ Aspect ^ RLHF ^ RLVR ^
| Reward source | Human annotators + learned reward model | Rule-based automated verification |
| Reward signal | Subjective quality judgment | Binary or graded correctness |
| Scalability | Limited by annotation cost | Scales with compute |
| Reward hacking | Vulnerable (exploits learned reward model) | More robust (deterministic checks) |
| Domains | Open-ended generation | Tasks with verifiable outcomes |
| Cost | High (human labelers) | Low (automated verification) |

The key insight: for tasks where correctness can be automatically verified, RLVR eliminates both the expense of human annotation and the reward hacking vulnerabilities inherent in learned reward models.

===== Verifiable Environments =====

RLVR applies to domains with clear verification mechanisms:

  * **Mathematical reasoning**: String-matching or symbolic comparison of final numerical answers against ground truth (e.g., GSM8K problems)
  * **Code generation**: Execute generated code against test suites; reward based on test-case pass rates
  * **Instruction following**: Verify adherence to formatting constraints, keyword requirements, and language rules via deterministic string matching
  * **Formal verification**: Proof checkers for mathematical theorems
  * **Game environments**: Win/loss outcomes in deterministic games

The reward function is typically binary:

$$r(x, y) = \mathbf{1}\!\left[\text{verify}(x, y) = \text{correct}\right]$$

where $x$ is the prompt and $y$ is the model's response.
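The binary reward above can be sketched as an exact-match verifier for math answers. This is a minimal illustration, not a production grader: the ``normalize_answer`` helper and the GSM8K-style ``####`` answer delimiter are assumptions for the example.

```python
def normalize_answer(text: str) -> str:
    """Canonicalize a numeric answer string for exact-match comparison."""
    return text.strip().rstrip(".").replace(",", "").replace("$", "")

def binary_reward(response: str, ground_truth: str) -> int:
    """r(x, y) = 1 if the extracted final answer matches ground truth, else 0."""
    # Illustrative assumption: the model emits its final answer after "####",
    # GSM8K-style; real pipelines use a task-specific extraction rule.
    final_answer = response.split("####")[-1]
    return int(normalize_answer(final_answer) == normalize_answer(ground_truth))
```

For example, ``binary_reward("Six times seven. #### 42.", "42")`` returns 1, while a response ending in ``#### 41`` returns 0.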
For code generation, this extends to graded rewards based on test pass rates: $r(x, y) = \frac{\text{tests passed}}{\text{total tests}}$.

===== Training Pipeline =====

The RLVR training process involves four components:

  - **Ground truth collection**: Curate datasets with verifiable answers, ensuring no overlap with evaluation benchmarks
  - **Reward function design**: Implement deterministic correctness checks (exact match, test execution, category match)
  - **Reward validation**: Test that the reward function reliably distinguishes correct from incorrect outputs
  - **RL integration**: Train with PPO or GRPO, using KL regularization to prevent over-optimization

A simplified training step (the ``policy`` and ``verifier`` interfaces are schematic):

```python
import statistics

GROUP_SIZE = 8   # responses sampled per prompt
EPS = 1e-8       # numerical stability for advantage normalization

def rlvr_training_step(policy, prompts, verifier):
    """One step of RLVR training with GRPO-style group-relative advantages."""
    for prompt in prompts:
        # 1. Sample a group of responses
        responses = [policy.generate(prompt) for _ in range(GROUP_SIZE)]
        # 2. Verify each response with a rule-based reward
        rewards = [verifier.check(prompt, r) for r in responses]
        # 3. Compute group-relative advantages (GRPO)
        mean_r = statistics.fmean(rewards)
        std_r = statistics.pstdev(rewards)
        advantages = [(r - mean_r) / (std_r + EPS) for r in rewards]
        # 4. Update the policy with a clipped objective plus KL penalty
        policy.update(responses, advantages, kl_weight=0.01)
```

===== DeepSeek-R1 and GRPO =====

**DeepSeek-R1** is the landmark demonstration of RLVR at scale. It uses **Group Relative Policy Optimization (GRPO)**, which replaces PPO's critic network with group-relative advantage estimation.
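The graded pass-rate reward can be sketched as follows. The convention that the candidate source defines a function named ``solution``, and the ``(input, expected output)`` test format, are assumptions for illustration; a real harness would run candidates in a sandbox with timeouts rather than in-process ``exec``.

```python
def pass_rate_reward(candidate_src: str, test_cases: list[tuple[str, str]]) -> float:
    """Graded reward: fraction of test cases the generated code passes.

    Assumes candidate_src defines a function `solution`; each test case
    is an (input, expected output string) pair.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # NOTE: sandboxing omitted for brevity
    except Exception:
        return 0.0  # code that fails to load earns zero reward
    solution = namespace.get("solution")
    if not callable(solution):
        return 0.0
    passed = 0
    for arg, expected in test_cases:
        try:
            if str(solution(arg)) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case simply does not count as passed
    return passed / len(test_cases)
```

A candidate that passes two of three tests earns a reward of 2/3, giving the policy a denser signal than a single pass/fail bit.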
For a prompt $x$, GRPO samples a group of $G$ responses and computes:

$$\hat{A}_i = \frac{r_i - \mu_G}{\sigma_G}, \quad \mu_G = \frac{1}{G}\sum_{j=1}^{G} r_j, \quad \sigma_G = \sqrt{\frac{1}{G}\sum_{j=1}^{G}(r_j - \mu_G)^2}$$

The policy update uses a clipped surrogate objective with KL penalty:

$$\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\min\!\left(\rho_i \hat{A}_i,\;\text{clip}(\rho_i, 1{-}\epsilon, 1{+}\epsilon)\hat{A}_i\right) - \beta\,D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\right]$$

where the importance ratio is $\rho_i = \frac{\pi_\theta(y_i|x)}{\pi_{\text{old}}(y_i|x)}$.

GRPO's advantages over PPO:

  * No critic model needed (saves ~50% of model memory)
  * More stable training dynamics
  * Better suited to the binary/sparse rewards common in RLVR

DeepSeek-R1's training pipeline: cold-start SFT on reasoning traces -> reasoning-focused RLVR with GRPO -> rejection sampling and SFT on the resulting data -> a final RL stage for general alignment. The result: open-weight models matching o1-level reasoning.

===== Application to Coding Agents =====

Code generation is an ideal RLVR domain because verification is unambiguous:

  * Execute generated code against test cases
  * Binary pass/fail per test provides a clear reward signal
  * Multi-turn code synthesis: verify at episode endpoints
  * No subjective quality judgments needed

This extends naturally to agentic coding, where models iteratively write, test, and debug code across multiple turns.

===== Reward Hacking in RLVR =====

Despite RLVR's robustness, models can still exploit verification loopholes:

  * **Direct answer hacking**: Leaking the ground-truth answer into reasoning segments so the extractor matches it
  * **Format exploitation**: Placing reasoning before designated tags to maximize rewards without genuine reasoning
  * **Shortcut solutions**: Finding degenerate programs that pass the test cases without solving the general problem

Mitigation requires continuous monitoring and diverse test suites.
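The group-relative advantage formula above can be worked through with a small numeric example; the group size and binary rewards are chosen purely for illustration.

```python
import math

# Hypothetical group of G = 4 binary rewards for one prompt
rewards = [1.0, 0.0, 0.0, 1.0]

G = len(rewards)
mu = sum(rewards) / G                                        # mu_G = 0.5
sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / G)   # sigma_G = 0.5
advantages = [(r - mu) / sigma for r in rewards]

print(advantages)  # -> [1.0, -1.0, -1.0, 1.0]
```

With binary rewards the normalization simply pushes probability mass toward the correct responses (+1.0) and away from the incorrect ones (-1.0), with no learned critic involved. Note that $\sigma_G$ here is the population standard deviation, matching the $\frac{1}{G}$ factor in the formula above.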
===== References =====

  * [[https://arxiv.org/abs/2506.11425|arXiv:2506.11425 - RLVR for Agents]]
  * [[https://arxiv.org/abs/2501.12948|arXiv:2501.12948 - DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL]]
  * [[https://arxiv.org/abs/2402.03300|arXiv:2402.03300 - DeepSeekMath (GRPO)]]

===== See Also =====

  * [[agentic_reinforcement_learning|Agentic Reinforcement Learning]] - RL specifically for LLM agents
  * [[process_reward_models|Process Reward Models]] - Dense step-level rewards complementing RLVR
  * [[test_time_compute_scaling|Test-Time Compute Scaling]] - Inference-time benefits of RLVR-trained models