Agentic Reinforcement Learning (Agentic RL) trains LLM agents as autonomous decision-makers in dynamic environments, extending standard LLM RL from single-turn text generation to multi-turn, partially observable Markov decision processes (POMDPs) with long-horizon planning, tool use, and adaptive behavior.
| Aspect | Standard LLM RL | Agentic RL |
|---|---|---|
| Interaction | Single-turn (prompt → response) | Multi-turn (observe → act → observe → …) |
| Observation | Full prompt visible | Partial observability (POMDP) |
| Reward | Immediate (quality of one response) | Delayed, sparse (task completion after many steps) |
| Actions | Token generation | Semantic actions: tool calls, navigation, API requests |
| Planning horizon | Single response | Tens to hundreds of steps |
| State | Stateless per query | Stateful (memory, environment state) |
| Credit assignment | Per-response | Per-step across long trajectories |
Agentic RL faces two fundamental challenges that standard LLM RL does not:
1. Sparse, non-instructive rewards: Binary 0/1 rewards at the end of long trajectories provide minimal learning signal. The agent must discover which of its many actions contributed to success or failure.
2. Credit assignment over long horizons: With trajectories spanning dozens of tool calls, attributing reward to specific actions is extremely difficult. For a trajectory of $T$ steps with terminal reward $R$, the policy gradient has variance proportional to $T$:
$$\text{Var}\!\left[\nabla_\theta \mathcal{L}\right] \propto T \cdot \text{Var}[R]$$
Standard policy gradient estimators have high variance in this setting, motivating the use of dense intermediate rewards and value baselines.
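The linear growth of gradient variance with horizon can be checked empirically. The sketch below is an illustration (not from any paper): it simulates a REINFORCE-style estimator $g = R \cdot \sum_t \nabla_\theta \log \pi(a_t \mid s_t)$ where the per-step score terms are i.i.d. standard normal and the terminal reward $R$ is an independent 0/1 coin flip, then measures how the estimator's variance scales with $T$.

```python
import numpy as np

def grad_estimate_variance(T, n_samples=100_000, seed=0):
    """Empirical variance of a toy score-function estimator
    g = R * sum_t score_t, with i.i.d. N(0, 1) score terms and an
    independent sparse Bernoulli(0.5) terminal reward R."""
    rng = np.random.default_rng(seed)
    scores = rng.standard_normal((n_samples, T)).sum(axis=1)
    R = rng.integers(0, 2, size=n_samples)  # sparse 0/1 outcome reward
    return np.var(R * scores)

short, long = grad_estimate_variance(20), grad_estimate_variance(200)
print(long / short)  # close to 10: variance grows roughly linearly with T
```

Doubling the horizon doubles the variance, which is exactly why the dense intermediate rewards and value baselines mentioned above matter.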
Progressive reward shaping (arXiv:2512.07478) addresses sparse rewards by building agent capabilities incrementally. The reward function evolves across training stages:
$$R_{\text{prog}}(\tau, \alpha) = \alpha \cdot R_{\text{outcome}}(\tau) + (1 - \alpha) \cdot \frac{1}{T}\sum_{t=1}^{T} r_{\text{step}}(s_t, a_t)$$
where $\alpha$ increases from 0 to 1 over training, gradually shifting from dense step-level rewards to sparse outcome rewards.
```python
# Progressive reward shaping concept
from statistics import mean

def progressive_reward(trajectory, stage):
    """Reward function that evolves with the training stage."""
    outcome_reward = verify_final_answer(trajectory)
    if stage == "early":
        # Dense rewards: correct tool selection, format, partial progress
        step_rewards = [score_step(s) for s in trajectory.steps]
        return 0.7 * mean(step_rewards) + 0.3 * outcome_reward
    elif stage == "middle":
        # Mixed: some step rewards, heavier outcome weight
        key_steps = [score_step(s) for s in trajectory.key_milestones]
        return 0.3 * mean(key_steps) + 0.7 * outcome_reward
    else:  # "late"
        # Sparse: pure outcome reward (RLVR-style)
        return outcome_reward
```
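The continuous $\alpha$ blend from the $R_{\text{prog}}$ equation above can also be implemented directly. The linear ramp below is an assumption for illustration; the source only states that $\alpha$ increases from 0 to 1 over training.

```python
def alpha_schedule(step, total_steps):
    """Linear ramp of alpha from 0 (dense step rewards) to 1 (sparse outcome).
    The linear shape is an illustrative assumption, not the paper's schedule."""
    return min(1.0, step / total_steps)

def progressive_reward_alpha(outcome_reward, step_rewards, alpha):
    """R_prog = alpha * R_outcome + (1 - alpha) * mean(step rewards)."""
    dense = sum(step_rewards) / len(step_rewards)
    return alpha * outcome_reward + (1.0 - alpha) * dense

print(progressive_reward_alpha(1.0, [0.5, 0.5], alpha_schedule(0, 100)))    # 0.5
print(progressive_reward_alpha(1.0, [0.5, 0.5], alpha_schedule(100, 100)))  # 1.0
```

At $\alpha = 0$ only the dense step rewards count; at $\alpha = 1$ the agent trains on the sparse outcome alone.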
The paper introduces Value-based Sampling Policy Optimization, which uses a learned value function $V_\psi(s_t)$ to select high-quality training trajectories, improving sample efficiency by filtering out low-value rollouts before policy updates.
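The paper's exact procedure is more involved; the following is a minimal sketch of value-filtered trajectory selection, assuming a learned $V_\psi$ that can score states (the function names and the mean-over-states scoring rule are illustrative assumptions).

```python
def select_rollouts(rollouts, value_fn, keep_fraction=0.5):
    """Keep the rollouts whose mean estimated state value is highest,
    discarding low-value rollouts before the policy update."""
    def mean_value(states):
        return sum(value_fn(s) for s in states) / len(states)
    ranked = sorted(rollouts, key=mean_value, reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]

# Toy usage: states are numbers and V(s) = s.
batch = [[0.1, 0.2], [0.8, 0.9], [0.4, 0.5], [0.6, 0.7]]
print(select_rollouts(batch, value_fn=lambda s: s))
# → [[0.8, 0.9], [0.6, 0.7]]
```

The policy gradient is then computed only on the surviving rollouts, which is where the sample-efficiency gain comes from.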
Verl-Tool (submitted to ICLR 2026) provides an open-source framework for holistic agentic RL with tool use.
Verl-Tool bridges the gap between RLVR (which works for single-turn verifiable tasks) and full agentic environments requiring multi-turn tool use.
RLVR (Reinforcement Learning from Verifiable Rewards) uses deterministic reward functions, but is limited to tasks with clear, automatically checkable answers. Agentic RL extends this verification-driven training to multi-turn settings, where the verifiable signal arrives only at the end of a long tool-using trajectory.
The progression: RLHF → RLVR → Agentic RL represents increasing automation of the reward signal and increasing complexity of the training environment.
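The RLVR end of this progression is just a deterministic check against ground truth. A minimal example for exact-match answers (the normalization shown is an illustrative choice, not a fixed standard):

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Deterministic 0/1 reward: exact match after light normalization."""
    norm = lambda s: s.strip().lower().rstrip(".")
    return 1.0 if norm(model_answer) == norm(ground_truth) else 0.0

print(verifiable_reward(" 42. ", "42"))  # 1.0
print(verifiable_reward("41", "42"))     # 0.0
```

Because the reward is a pure function of the output, no human preference data or reward model is needed, which is the "increasing automation of the reward signal" noted above.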
RL trains LLM agents to autonomously select, sequence, and adapt tool use through environment feedback.
Unlike supervised fine-tuning on tool-use demonstrations, RL enables agents to discover novel tool-use strategies through trial and error.
A typical agentic RL training loop optimizes the objective:
$$\mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$$
where $\gamma$ is the discount factor, and the expectation is over trajectories $\tau$ sampled from the policy $\pi_\theta$.
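As a concrete instance of the objective, the inner sum for one trajectory is a discounted return:

```python
def discounted_return(rewards, gamma=0.99):
    """sum_t gamma^t * r_t over one trajectory's per-step rewards."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Sparse terminal reward after 3 steps: only the last step pays off.
print(discounted_return([0.0, 0.0, 1.0]))  # 0.99**2 ≈ 0.9801
```

With sparse outcome rewards, longer trajectories earn geometrically discounted credit, which compounds the credit-assignment difficulty discussed earlier.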
The training loop proceeds by repeatedly sampling trajectories from the current policy, scoring them with the reward function, and updating $\theta$ with a policy-gradient step.
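A minimal end-to-end illustration of that loop, using vanilla REINFORCE on a toy multi-step task with a sparse 0/1 terminal reward (the environment, hyperparameters, and single-logit policy are all illustrative assumptions, not any paper's setup):

```python
import math, random

random.seed(0)

T, LR, BATCH = 3, 0.5, 8
theta = 0.0  # single logit: p = P(action = 1) at every step

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for update in range(300):
    grad = 0.0
    for _ in range(BATCH):
        # Rollout: sample T actions from the current policy.
        p = sigmoid(theta)
        actions = [1 if random.random() < p else 0 for _ in range(T)]
        # Sparse outcome reward: success only if every action was correct.
        R = 1.0 if all(actions) else 0.0
        # REINFORCE: R * sum_t d/dtheta log pi(a_t); for a Bernoulli
        # policy the per-step score is (a_t - p).
        grad += R * sum(a - p for a in actions)
    theta += LR * grad / BATCH  # policy-gradient step

print(sigmoid(theta) > 0.9)  # True: the policy learned to pick action 1
```

Even with only a terminal reward, the agent reliably learns this tiny task; the variance and credit-assignment problems discussed earlier only bite once $T$ and the action space grow to realistic agentic scales.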