====== Agentic Reinforcement Learning ======

**Agentic Reinforcement Learning (Agentic RL)** trains LLM agents as autonomous decision-makers in dynamic environments, extending standard LLM RL from single-turn text generation to multi-turn, partially observable Markov decision processes (POMDPs) with long-horizon planning, tool use, and adaptive behavior.

===== Agentic RL vs Standard LLM RL =====

^ Aspect ^ Standard LLM RL ^ Agentic RL ^
| Interaction | Single-turn (prompt -> response) | Multi-turn (observe -> act -> observe -> ...) |
| Observation | Full prompt visible | Partial observability (POMDP) |
| Reward | Immediate (quality of one response) | Delayed, sparse (task completion after many steps) |
| Actions | Token generation | Semantic actions: tool calls, navigation, API requests |
| Planning horizon | Single response | Tens to hundreds of steps |
| State | Stateless per query | Stateful (memory, environment state) |
| Credit assignment | Per-response | Per-step across long trajectories |

===== Core Challenges =====

Agentic RL faces two fundamental challenges that standard LLM RL does not:

**1. Sparse, non-instructive rewards**: Binary 0/1 rewards at the end of long trajectories provide minimal learning signal. The agent must discover which of its many actions contributed to success or failure.

**2. Credit assignment over long horizons**: With trajectories spanning dozens of tool calls, attributing reward to specific actions is extremely difficult. For a trajectory of $T$ steps with terminal reward $R$, the policy gradient has variance proportional to $T$:

$$\text{Var}\!\left[\nabla_\theta \mathcal{L}\right] \propto T \cdot \text{Var}[R]$$

Standard policy gradient estimators have high variance in this setting, motivating the use of dense intermediate rewards and value baselines.

===== Progressive Reward Shaping =====

**Progressive reward shaping** (arXiv:2512.07478) addresses sparse rewards by building agent capabilities incrementally.
The reward function evolves across training stages:

$$R_{\text{prog}}(\tau, \alpha) = \alpha \cdot R_{\text{outcome}}(\tau) + (1 - \alpha) \cdot \frac{1}{T}\sum_{t=1}^{T} r_{\text{step}}(s_t, a_t)$$

where $\alpha$ increases from 0 to 1 over training, gradually shifting from dense step-level rewards to sparse outcome rewards.

  * Start with dense, easy-to-earn intermediate rewards ($\alpha \approx 0$)
  * Gradually shift toward sparse outcome rewards as the agent improves ($\alpha \to 1$)
  * Curriculum over reward complexity mirrors curriculum over task complexity

<code python>
# Progressive reward shaping concept
def progressive_reward(trajectory, stage):
    """Reward function that evolves with training stage."""
    outcome_reward = verify_final_answer(trajectory)
    if stage == "early":
        # Dense rewards: reward correct tool selection, format, partial progress
        step_rewards = [score_step(s) for s in trajectory.steps]
        return 0.7 * mean(step_rewards) + 0.3 * outcome_reward
    elif stage == "middle":
        # Mixed: some step rewards, heavier outcome weight
        key_steps = [score_step(s) for s in trajectory.key_milestones]
        return 0.3 * mean(key_steps) + 0.7 * outcome_reward
    else:  # "late"
        # Sparse: pure outcome reward (RLVR-style)
        return outcome_reward
</code>

The paper introduces **Value-based Sampling Policy Optimization**, which uses a learned value function $V_\psi(s_t)$ to select high-quality training trajectories, improving sample efficiency by filtering out low-value rollouts before policy updates.
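The trajectory filtering idea can be illustrated with a minimal sketch. This is not the paper's implementation: the scoring rule (value of the final state), the ''keep_fraction'' parameter, and the function names are all illustrative assumptions.

<code python>
# Minimal sketch of value-based trajectory selection before a policy update.
# A trajectory is a list of states; value_fn is a stand-in for a learned
# value function V(s) applied here to the trajectory's final state.

def select_trajectories(rollouts, value_fn, keep_fraction=0.5):
    """Keep only the highest-value fraction of rollouts for the update."""
    scored = sorted(rollouts, key=lambda traj: value_fn(traj[-1]), reverse=True)
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]

# Toy usage: states are numbers, value of a state is the number itself.
rollouts = [[0, 1], [0, 5], [0, 3], [0, 2]]
best = select_trajectories(rollouts, value_fn=lambda s: s, keep_fraction=0.5)
# best == [[0, 5], [0, 3]]
</code>

In a real setup the filtered rollouts would then feed into a standard policy-gradient update, so low-value trajectories never consume optimizer steps.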
===== Verl-Tool Framework =====

**Verl-Tool** (submitted to ICLR 2026) provides an open-source framework for holistic agentic RL with tool use:

  * **Tool-integrated training**: Supports RL optimization over full tool-interaction trajectories
  * **Multiple tool types**: Code execution, web search, calculator, API calls
  * **Scalable infrastructure**: Distributed training across multiple GPUs
  * **Flexible reward**: Supports both verifiable rewards (RLVR) and learned reward models

Verl-Tool bridges the gap between RLVR (which works for single-turn verifiable tasks) and full agentic environments requiring multi-turn tool use.

===== RLVR vs Standard RL for Agents =====

**RLVR** (Reinforcement Learning from Verifiable Rewards) uses deterministic reward functions, but is limited to tasks with clear correct answers. **Agentic RL** extends this to:

  * Open-ended tasks without a single correct answer
  * Multi-step tool interaction trajectories
  * Environments with stochastic dynamics
  * Tasks requiring exploration and discovery

The progression RLHF -> RLVR -> Agentic RL represents increasing automation of the reward signal and increasing complexity of the training environment.

===== Tool-Use Training with RL =====

RL trains LLM agents to autonomously select, sequence, and adapt tool use through environment feedback:

  * **Tool selection**: Learn when to call which tool based on the current state
  * **Argument generation**: Learn to construct correct tool arguments from context
  * **Error recovery**: Learn to handle tool failures and retry with different strategies
  * **Composition**: Learn to chain multiple tools to solve complex tasks

Unlike supervised fine-tuning on tool-use demonstrations, RL enables agents to discover novel tool-use strategies through trial and error.
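The error-recovery behavior above can be sketched as a single rollout step: execute the policy's tool call, and on failure feed the error back into the context so the policy can propose a revised call. The names ''policy'' and ''execute_tool'' are hypothetical stand-ins for the agent's LLM and a sandboxed tool runtime, not any specific framework's API.

<code python>
# Minimal sketch of a tool-call step with error recovery. All interfaces
# are illustrative: policy(state) -> action, execute_tool(action) -> dict
# with an "ok" flag and either a result or an "error" message.

def act_with_recovery(policy, execute_tool, state, max_retries=2):
    """Execute the policy's tool call, retrying with the error in context."""
    for _ in range(max_retries + 1):
        action = policy(state)
        result = execute_tool(action)
        if result["ok"]:
            return result
        # Append the failure to the context so the next attempt can adapt
        state = state + [{"error": result["error"], "failed_action": action}]
    return result  # still failing; the caller assigns a low/zero reward
</code>

During RL training, trajectories in which the agent recovers from an error earn higher reward than those that repeat the failing call, which is the pressure that teaches the recovery strategy.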
===== Training Architecture =====

A typical agentic RL training loop optimizes the objective:

$$\mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$$

where $\gamma$ is the discount factor and the expectation is over trajectories $\tau$ sampled from the policy $\pi_\theta$. The training loop proceeds as:

  - **Environment step**: Agent observes state $s_t$ (conversation history, tool results, task description)
  - **Policy step**: LLM generates the next action $a_t \sim \pi_\theta(\cdot | s_t)$ (tool call or text response)
  - **Execution step**: Tool is executed in a sandboxed environment, and the result is appended to the context
  - **Reward step**: At trajectory end, compute the reward (verifiable check, or shaped intermediate reward)
  - **Update step**: Update the policy via PPO, GRPO, or REINFORCE with trajectory-level advantages

===== Key Applications =====

  * **Coding agents**: Write, test, and debug code across multiple files using execution feedback
  * **Research agents**: Search, read, and synthesize information using web tools
  * **Data analysis agents**: Query databases, run computations, generate visualizations
  * **Scientific agents**: Design experiments, run simulations, analyze results
  * **Multi-agent collaboration**: Teams of specialized agents coordinating on complex tasks

===== Recent Developments (2025-2026) =====

  * **Era of experience**: Agents learning through self-generated experience rather than human demonstrations (Silver & Sutton, 2025)
  * **Group reasoning**: Multi-agent RL where agents reason collaboratively
  * **Slow reasoning augmentation**: Combining structured CoT with RL for deeper planning
  * **Scientific agents**: NVIDIA's work on training scientific agents with RL for hypothesis generation and experimental design
  * **Safety and alignment**: Ensuring agentic RL produces agents that respect boundaries and avoid harmful actions

===== References =====

  * [[https://arxiv.org/abs/2512.07478|arXiv:2512.07478 - Enhancing Agentic RL with Progressive Reward Shaping]]
  * [[https://openreview.net/forum?id=oWFtI0cNsE|Verl-Tool: Holistic Agentic RL with Tool Use (ICLR 2026)]]
  * [[https://arxiv.org/abs/2509.02547|arXiv:2509.02547 - Agentic RL Taxonomy and Survey]]

===== See Also =====

  * [[agent_rlvr|Agent RLVR]] - RL from Verifiable Rewards for agents
  * [[process_reward_models|Process Reward Models]] - Step-level rewards for training signal
  * [[speculative_tool_execution|Speculative Tool Execution]] - Optimizing agent tool call latency
  * [[parallel_function_calling|Parallel Function Calling]] - Concurrent tool execution