====== Agentic Reinforcement Learning ======

**Agentic Reinforcement Learning (Agentic RL)** trains LLM agents as autonomous decision-makers in dynamic environments, extending standard LLM RL from single-turn text generation to multi-turn, partially observable Markov decision processes (POMDPs) with long-horizon planning, tool use, and adaptive behavior.

===== Agentic RL vs Standard LLM RL =====

^ Aspect ^ Standard LLM RL ^ Agentic RL ^
| Interaction | Single-turn (prompt -> response) | Multi-turn (observe -> act -> observe -> ...) |
| Observation | Full prompt visible | Partial observability (POMDP) |
| Reward | Immediate (quality of one response) | Delayed, sparse (task completion after many steps) |
| Actions | Token generation | Semantic actions: tool calls, navigation, API requests |
| Planning horizon | Single response | Tens to hundreds of steps |
| State | Stateless per query | Stateful (memory, environment state) |
| Credit assignment | Per-response | Per-step across long trajectories |

===== Core Challenges =====

Agentic RL faces two fundamental challenges that standard LLM RL does not:

**1. Sparse, non-instructive rewards**: Binary 0/1 rewards at the end of long trajectories provide minimal learning signal. The agent must discover which of its many actions contributed to success or failure.

**2. Credit assignment over long horizons**: With trajectories spanning dozens of tool calls, attributing reward to specific actions is extremely difficult. For a trajectory of $T$ steps with terminal reward $R$, the policy gradient has variance proportional to $T$:

$$\text{Var}\!\left[\nabla_\theta \mathcal{L}\right] \propto T \cdot \text{Var}[R]$$

Standard policy gradient estimators have high variance in this setting, motivating the use of dense intermediate rewards and value baselines.

===== Progressive Reward Shaping =====

**Progressive reward shaping** (arXiv:2512.07478) addresses sparse rewards by building agent capabilities incrementally.
The reward function evolves across training stages:

$$R_{\text{prog}}(\tau, \alpha) = \alpha \cdot R_{\text{outcome}}(\tau) + (1 - \alpha) \cdot \frac{1}{T}\sum_{t=1}^{T} r_{\text{step}}(s_t, a_t)$$

where $\alpha$ increases from 0 to 1 over training, gradually shifting from dense step-level rewards to sparse outcome rewards.

  * Start with dense, easy-to-earn intermediate rewards ($\alpha \approx 0$)
  * Gradually shift toward sparse outcome rewards as the agent improves ($\alpha \to 1$)
  * Curriculum over reward complexity mirrors curriculum over task complexity

<code python>
# Progressive reward shaping concept
def progressive_reward(trajectory, stage):
    """Reward function that evolves with training stage."""
    outcome_reward = verify_final_answer(trajectory)
    if stage == "early":
        # Dense rewards: reward correct tool selection, format, partial progress
        step_rewards = [score_step(s) for s in trajectory.steps]
        return 0.7 * mean(step_rewards) + 0.3 * outcome_reward
    elif stage == "middle":
        # Mixed: some step rewards, heavier outcome weight
        key_steps = [score_step(s) for s in trajectory.key_milestones]
        return 0.3 * mean(key_steps) + 0.7 * outcome_reward
    else:  # "late"
        # Sparse: pure outcome reward (RLVR-style)
        return outcome_reward
</code>

The paper introduces **Value-based Sampling Policy Optimization**, which uses a learned value function $V_\psi(s_t)$ to select high-quality training trajectories, improving sample efficiency by filtering out low-value rollouts before policy updates.
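The trajectory filtering idea can be illustrated with a minimal sketch. This is not the paper's implementation: the scoring rule (value of the final state), the ''keep_fraction'' parameter, and the function names are all illustrative assumptions.

<code python>
# Minimal sketch of value-based trajectory selection before a policy update.
# A trajectory is a list of states; value_fn is a stand-in for a learned
# value function V(s) applied here to the trajectory's final state.

def select_trajectories(rollouts, value_fn, keep_fraction=0.5):
    """Keep only the highest-value fraction of rollouts for the update."""
    scored = sorted(rollouts, key=lambda traj: value_fn(traj[-1]), reverse=True)
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]

# Toy usage: states are numbers, value of a state is the number itself.
rollouts = [[0, 1], [0, 5], [0, 3], [0, 2]]
best = select_trajectories(rollouts, value_fn=lambda s: s, keep_fraction=0.5)
# best == [[0, 5], [0, 3]]
</code>

In a real setup the filtered rollouts would then feed into a standard policy-gradient update, so low-value trajectories never consume optimizer steps.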
===== Verl-Tool Framework =====

**Verl-Tool** (submitted to ICLR 2026) provides an open-source framework for holistic agentic RL with tool use:

  * **Tool-integrated training**: Supports RL optimization over full tool-interaction trajectories
  * **Multiple tool types**: Code execution, web search, calculator, API calls
  * **Scalable infrastructure**: Distributed training across multiple GPUs
  * **Flexible reward**: Supports both verifiable rewards (RLVR) and learned reward models

Verl-Tool bridges the gap between RLVR (which works for single-turn verifiable tasks) and full agentic environments requiring multi-turn tool use.

===== RLVR vs Standard RL for Agents =====

**RLVR** (Reinforcement Learning from Verifiable Rewards) uses deterministic reward functions, but is limited to tasks with clear correct answers. **Agentic RL** extends this to:

  * Open-ended tasks without a single correct answer
  * Multi-step tool interaction trajectories
  * Environments with stochastic dynamics
  * Tasks requiring exploration and discovery

The progression RLHF -> RLVR -> Agentic RL represents increasing automation of the reward signal and increasing complexity of the training environment.

===== Tool-Use Training with RL =====

RL trains LLM agents to autonomously select, sequence, and adapt tool use through environment feedback:

  * **Tool selection**: Learn when to call which tool based on the current state
  * **Argument generation**: Learn to construct correct tool arguments from context
  * **Error recovery**: Learn to handle tool failures and retry with different strategies
  * **Composition**: Learn to chain multiple tools to solve complex tasks

Unlike supervised fine-tuning on tool-use demonstrations, RL enables agents to discover novel tool-use strategies through trial and error.
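The error-recovery behavior above can be sketched as a single rollout step: execute the policy's tool call, and on failure feed the error back into the context so the policy can propose a revised call. The names ''policy'' and ''execute_tool'' are hypothetical stand-ins for the agent's LLM and a sandboxed tool runtime, not any specific framework's API.

<code python>
# Minimal sketch of a tool-call step with error recovery. All interfaces
# are illustrative: policy(state) -> action, execute_tool(action) -> dict
# with an "ok" flag and either a result or an "error" message.

def act_with_recovery(policy, execute_tool, state, max_retries=2):
    """Execute the policy's tool call, retrying with the error in context."""
    for _ in range(max_retries + 1):
        action = policy(state)
        result = execute_tool(action)
        if result["ok"]:
            return result
        # Append the failure to the context so the next attempt can adapt
        state = state + [{"error": result["error"], "failed_action": action}]
    return result  # still failing; the caller assigns a low/zero reward
</code>

During RL training, trajectories in which the agent recovers from an error earn higher reward than those that repeat the failing call, which is the pressure that teaches the recovery strategy.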
===== Training Architecture =====

A typical agentic RL training loop optimizes the objective:

$$\mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$$

where $\gamma$ is the discount factor and the expectation is over trajectories $\tau$ sampled from the policy $\pi_\theta$. The training loop proceeds as:

  - **Environment step**: Agent observes state $s_t$ (conversation history, tool results, task description)
  - **Policy step**: LLM generates the next action $a_t \sim \pi_\theta(\cdot | s_t)$ (tool call or text response)
  - **Execution step**: Tool is executed in a sandboxed environment, and the result is appended to the context
  - **Reward step**: At trajectory end, compute the reward (verifiable check, or shaped intermediate reward)
  - **Update step**: Update the policy via PPO, GRPO, or REINFORCE with trajectory-level advantages

===== Key Applications =====

  * **Coding agents**: Write, test, and debug code across multiple files using execution feedback
  * **Research agents**: Search, read, and synthesize information using web tools
  * **Data analysis agents**: Query databases, run computations, generate visualizations
  * **Scientific agents**: Design experiments, run simulations, analyze results
  * **Multi-agent collaboration**: Teams of specialized agents coordinating on complex tasks

===== Recent Developments (2025-2026) =====

  * **Era of experience**: Agents learning through self-generated experience rather than human demonstrations (Silver & Sutton, 2025)
  * **Group reasoning**: Multi-agent RL where agents reason collaboratively
  * **Slow reasoning augmentation**: Combining structured CoT with RL for deeper planning
  * **Scientific agents**: NVIDIA's work on training scientific agents with RL for hypothesis generation and experimental design
  * **Safety and alignment**: Ensuring agentic RL produces agents that respect boundaries and avoid harmful actions

===== References =====

  * [[https://arxiv.org/abs/2512.07478|arXiv:2512.07478 - Enhancing Agentic RL with Progressive Reward Shaping]]
  * [[https://openreview.net/forum?id=oWFtI0cNsE|Verl-Tool: Holistic Agentic RL with Tool Use (ICLR 2026)]]
  * [[https://arxiv.org/abs/2509.02547|arXiv:2509.02547 - Agentic RL Taxonomy and Survey]]

===== See Also =====

  * [[agent_rlvr|Agent RLVR]] - RL from Verifiable Rewards for agents
  * [[process_reward_models|Process Reward Models]] - Step-level rewards for training signal
  * [[speculative_tool_execution|Speculative Tool Execution]] - Optimizing agent tool call latency
  * [[parallel_function_calling|Parallel Function Calling]] - Concurrent tool execution