Agentic Reinforcement Learning

Agentic Reinforcement Learning (Agentic RL) trains LLM agents as autonomous decision-makers in dynamic environments, extending standard LLM RL from single-turn text generation to multi-turn, partially observable Markov decision processes (POMDPs) with long-horizon planning, tool use, and adaptive behavior.

Agentic RL vs Standard LLM RL

| Aspect | Standard LLM RL | Agentic RL |
| --- | --- | --- |
| Interaction | Single-turn (prompt → response) | Multi-turn (observe → act → observe → …) |
| Observation | Full prompt visible | Partial observability (POMDP) |
| Reward | Immediate (quality of one response) | Delayed, sparse (task completion after many steps) |
| Actions | Token generation | Semantic actions: tool calls, navigation, API requests |
| Planning horizon | Single response | Tens to hundreds of steps |
| State | Stateless per query | Stateful (memory, environment state) |
| Credit assignment | Per-response | Per-step across long trajectories |

Core Challenges

Agentic RL faces two fundamental challenges that standard LLM RL does not:

1. Sparse, non-instructive rewards: Binary 0/1 rewards at the end of long trajectories provide minimal learning signal. The agent must discover which of its many actions contributed to success or failure.

2. Credit assignment over long horizons: With trajectories spanning dozens of tool calls, attributing reward to specific actions is extremely difficult. For a trajectory of $T$ steps with terminal reward $R$, the policy gradient has variance proportional to $T$:

$$\text{Var}\!\left[\nabla_\theta \mathcal{L}\right] \propto T \cdot \text{Var}[R]$$

Standard policy gradient estimators have high variance in this setting, motivating the use of dense intermediate rewards and value baselines.
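This scaling can be checked numerically. The toy model below is illustrative and not taken from any cited paper: it draws $T$ Bernoulli actions from a fixed logistic policy (for which the per-step score function is $a_t - p$), pays a terminal 0/1 reward that is independent of the actions, and estimates the variance of the resulting REINFORCE gradient sample.

```python
import random
from statistics import variance

def grad_sample(T, rng, p=0.5):
    """One REINFORCE gradient sample: terminal reward R times the summed
    per-step score function (a_t - p for a logistic policy)."""
    score = sum((1.0 if rng.random() < p else 0.0) - p for _ in range(T))
    R = 1.0 if rng.random() < 0.5 else 0.0  # sparse 0/1 terminal reward
    return R * score

def grad_variance(T, n=20000, seed=0):
    """Monte Carlo estimate of Var[grad] over n simulated trajectories."""
    rng = random.Random(seed)
    return variance(grad_sample(T, rng) for _ in range(n))
```

With $p = 0.5$ each step contributes score variance $0.25$, so the estimator's variance is roughly $0.125 \cdot T$: `grad_variance(100) / grad_variance(10)` comes out close to 10, matching the linear-in-$T$ scaling above.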

Progressive Reward Shaping

Progressive reward shaping (arXiv:2512.07478) addresses sparse rewards by building agent capabilities incrementally. The reward function evolves across training stages:

$$R_{\text{prog}}(\tau, \alpha) = \alpha \cdot R_{\text{outcome}}(\tau) + (1 - \alpha) \cdot \frac{1}{T}\sum_{t=1}^{T} r_{\text{step}}(s_t, a_t)$$

where $\alpha$ increases from 0 to 1 over training, gradually shifting from dense step-level rewards to sparse outcome rewards.

# Progressive reward shaping concept
# (verify_final_answer and score_step are placeholders for task-specific scorers)
from statistics import mean

def progressive_reward(trajectory, stage):
    """Reward function that evolves with training stage."""
    outcome_reward = verify_final_answer(trajectory)  # 0/1 verifiable check

    if stage == "early":
        # Dense rewards: reward correct tool selection, format, partial progress
        step_rewards = [score_step(s) for s in trajectory.steps]
        return 0.7 * mean(step_rewards) + 0.3 * outcome_reward

    elif stage == "middle":
        # Mixed: score only key milestones, weight the outcome more heavily
        key_steps = [score_step(s) for s in trajectory.key_milestones]
        return 0.3 * mean(key_steps) + 0.7 * outcome_reward

    else:  # "late"
        # Sparse: pure outcome reward (RLVR-style)
        return outcome_reward
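The discrete stages above can be tied back to the paper's $\alpha$-interpolation directly. The sketch below assumes a simple linear ramp for $\alpha$; the schedule itself is an assumption for illustration, not taken from the paper.

```python
def progressive_alpha(step, total_steps):
    """Anneal alpha from 0 (dense step rewards) to 1 (pure outcome reward).
    Linear ramp assumed here; any monotone schedule would fit the formula."""
    return min(1.0, step / total_steps)

def r_prog(step_rewards, outcome_reward, alpha):
    """R_prog(tau, alpha) = alpha * R_outcome + (1 - alpha) * mean step reward."""
    dense = sum(step_rewards) / len(step_rewards)
    return alpha * outcome_reward + (1 - alpha) * dense
```

At `alpha = 0` this reduces to the pure step-level average; at `alpha = 1` it recovers the sparse RLVR-style outcome reward.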

The paper introduces Value-based Sampling Policy Optimization, which uses a learned value function $V_\psi(s_t)$ to select high-quality training trajectories, improving sample efficiency by filtering out low-value rollouts before policy updates.
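One way such value-based filtering can look in practice is sketched below; the rollout fields and the advantage-based selection rule are illustrative assumptions, not the paper's exact algorithm.

```python
def select_rollouts(rollouts, value_fn, keep_fraction=0.5):
    """Keep the rollouts with the highest estimated advantage
    R(tau) - V(s_0) before running the policy update (a sketch)."""
    scored = sorted(rollouts,
                    key=lambda r: r["return"] - value_fn(r["start_state"]),
                    reverse=True)
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]
```

Discarding the bottom half of rollouts concentrates gradient updates on informative trajectories, which is the sample-efficiency gain described above.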

Verl-Tool Framework

Verl-Tool (submitted to ICLR 2026) provides an open-source framework for holistic agentic RL with tool use. It bridges the gap between RLVR (which works for single-turn verifiable tasks) and full agentic environments requiring multi-turn tool use.

RLVR vs Standard RL for Agents

RLVR (Reinforcement Learning from Verifiable Rewards) uses deterministic, programmatically checkable reward functions, but is limited to tasks with a clear correct answer. Agentic RL extends this to multi-turn, partially observable settings where reward depends on tool use and long-horizon task completion.

The progression RLHF → RLVR → Agentic RL represents increasing automation of the reward signal and increasing complexity of the training environment.

Tool-Use Training with RL

RL trains LLM agents to autonomously select, sequence, and adapt tool use through environment feedback.

Unlike supervised fine-tuning on tool-use demonstrations, RL enables agents to discover novel tool-use strategies through trial and error.
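A concrete example of the shaped signals such training often starts from is a format-and-selection check on each emitted action. The sketch below scores whether an action is a well-formed call to a registered tool; the JSON call format and the tool registry are hypothetical choices for illustration.

```python
import json

VALID_TOOLS = {"search", "python", "calculator"}  # hypothetical tool registry

def tool_call_reward(action_text):
    """Small shaped reward: 1.0 for a syntactically valid call to a
    registered tool with a dict of arguments, 0.0 otherwise."""
    try:
        call = json.loads(action_text)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    if isinstance(call, dict) and call.get("tool") in VALID_TOOLS \
            and isinstance(call.get("args"), dict):
        return 1.0
    return 0.0
```

Rewards like this are dense and cheap to compute, which is why they typically dominate early training before being annealed away in favor of outcome rewards.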

Training Architecture

A typical agentic RL training loop optimizes the objective:

$$\mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$$

where $\gamma$ is the discount factor, and the expectation is over trajectories $\tau$ sampled from the policy $\pi_\theta$.

The training loop proceeds as:

  1. Environment step: Agent observes state $s_t$ (conversation history, tool results, task description)
  2. Policy step: LLM generates next action $a_t \sim \pi_\theta(\cdot | s_t)$ (tool call or text response)
  3. Execution step: Tool is executed in sandboxed environment, result appended to context
  4. Reward step: At trajectory end, compute reward (verifiable check, or shaped intermediate reward)
  5. Update step: Update policy via PPO, GRPO, or REINFORCE with trajectory-level advantages
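The steps above can be sketched in a few lines. The `env`/`policy` interfaces here are hypothetical stand-ins for a real sandboxed tool environment and an LLM policy; only `discounted_returns` directly implements the objective's $\sum_t \gamma^t r_t$ structure.

```python
def discounted_returns(rewards, gamma=0.99):
    """G_t = sum_k gamma^k * r_{t+k}, computed in one backward pass."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

def rollout(env, policy, max_steps=50):
    """One agentic episode: observe state, generate an action (tool call
    or text), execute it in the environment, repeat until done."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)               # step 2: sample a_t ~ pi(.|s_t)
        state, reward, done = env.step(action)  # steps 3-4: execute, reward
        trajectory.append((state, action, reward))
        if done:
            break
    return trajectory
```

The per-step returns from `discounted_returns` (optionally minus a value baseline) are what the PPO/GRPO/REINFORCE update in step 5 consumes as advantages.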
