Agentic Reinforcement Learning (Agentic RL) trains LLM agents as autonomous decision-makers in dynamic environments, extending standard LLM RL from single-turn text generation to multi-turn, partially observable Markov decision processes (POMDPs) with long-horizon planning, tool use, and adaptive behavior.
| Aspect | Standard LLM RL | Agentic RL |
|---|---|---|
| Interaction | Single-turn (prompt → response) | Multi-turn (observe → act → observe → …) |
| Observation | Full prompt visible | Partial observability (POMDP) |
| Reward | Immediate (quality of one response) | Delayed, sparse (task completion after many steps) |
| Actions | Token generation | Semantic actions: tool calls, navigation, API requests |
| Planning horizon | Single response | Tens to hundreds of steps |
| State | Stateless per query | Stateful (memory, environment state) |
| Credit assignment | Per-response | Per-step across long trajectories |
Agentic RL faces two fundamental challenges that standard LLM RL does not:
1. Sparse, non-instructive rewards: Binary 0/1 rewards at the end of long trajectories provide minimal learning signal. The agent must discover which of its many actions contributed to success or failure.
2. Credit assignment over long horizons: With trajectories spanning dozens of tool calls, attributing reward to specific actions is extremely difficult. For a trajectory of $T$ steps with terminal reward $R$, the policy gradient has variance proportional to $T$:
$$\text{Var}\!\left[\nabla_\theta \mathcal{L}\right] \propto T \cdot \text{Var}[R]$$
Standard policy gradient estimators have high variance in this setting, motivating the use of dense intermediate rewards and value baselines.
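The linear growth of gradient variance with horizon can be checked empirically. The sketch below is an illustration (not from any paper): it simulates a REINFORCE-style estimator $g = R \cdot \sum_t \nabla_\theta \log \pi(a_t \mid s_t)$ where the per-step score terms are i.i.d. standard normal and the terminal reward $R$ is an independent 0/1 coin flip, then measures how the estimator's variance scales with $T$.

```python
import numpy as np

def grad_estimate_variance(T, n_samples=100_000, seed=0):
    """Empirical variance of a toy score-function estimator
    g = R * sum_t score_t, with i.i.d. N(0, 1) score terms and an
    independent sparse Bernoulli(0.5) terminal reward R."""
    rng = np.random.default_rng(seed)
    scores = rng.standard_normal((n_samples, T)).sum(axis=1)
    R = rng.integers(0, 2, size=n_samples)  # sparse 0/1 outcome reward
    return np.var(R * scores)

short, long = grad_estimate_variance(20), grad_estimate_variance(200)
print(long / short)  # close to 10: variance grows roughly linearly with T
```

Doubling the horizon doubles the variance, which is exactly why the dense intermediate rewards and value baselines mentioned above matter.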
Progressive reward shaping (arXiv:2512.07478) addresses sparse rewards by building agent capabilities incrementally. The reward function evolves across training stages:
$$R_{\text{prog}}(\tau, \alpha) = \alpha \cdot R_{\text{outcome}}(\tau) + (1 - \alpha) \cdot \frac{1}{T}\sum_{t=1}^{T} r_{\text{step}}(s_t, a_t)$$
where $\alpha$ increases from 0 to 1 over training, gradually shifting from dense step-level rewards to sparse outcome rewards.
```python
# Progressive reward shaping concept
from statistics import mean

def progressive_reward(trajectory, stage):
    """Reward function that evolves with the training stage."""
    outcome_reward = verify_final_answer(trajectory)
    if stage == "early":
        # Dense rewards: correct tool selection, format, partial progress
        step_rewards = [score_step(s) for s in trajectory.steps]
        return 0.7 * mean(step_rewards) + 0.3 * outcome_reward
    elif stage == "middle":
        # Mixed: some step rewards, heavier outcome weight
        key_steps = [score_step(s) for s in trajectory.key_milestones]
        return 0.3 * mean(key_steps) + 0.7 * outcome_reward
    else:  # "late"
        # Sparse: pure outcome reward (RLVR-style)
        return outcome_reward
```
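The continuous $\alpha$ blend from the $R_{\text{prog}}$ equation above can also be implemented directly. The linear ramp below is an assumption for illustration; the source only states that $\alpha$ increases from 0 to 1 over training.

```python
def alpha_schedule(step, total_steps):
    """Linear ramp of alpha from 0 (dense step rewards) to 1 (sparse outcome).
    The linear shape is an illustrative assumption, not the paper's schedule."""
    return min(1.0, step / total_steps)

def progressive_reward_alpha(outcome_reward, step_rewards, alpha):
    """R_prog = alpha * R_outcome + (1 - alpha) * mean(step rewards)."""
    dense = sum(step_rewards) / len(step_rewards)
    return alpha * outcome_reward + (1.0 - alpha) * dense

print(progressive_reward_alpha(1.0, [0.5, 0.5], alpha_schedule(0, 100)))    # 0.5
print(progressive_reward_alpha(1.0, [0.5, 0.5], alpha_schedule(100, 100)))  # 1.0
```

At $\alpha = 0$ only the dense step rewards count; at $\alpha = 1$ the agent trains on the sparse outcome alone.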
The paper introduces Value-based Sampling Policy Optimization, which uses a learned value function $V_\psi(s_t)$ to select high-quality training trajectories, improving sample efficiency by filtering out low-value rollouts before policy updates.
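The paper's exact procedure is more involved; the following is a minimal sketch of value-filtered trajectory selection, assuming a learned $V_\psi$ that can score states (the function names and the mean-over-states scoring rule are illustrative assumptions).

```python
def select_rollouts(rollouts, value_fn, keep_fraction=0.5):
    """Keep the rollouts whose mean estimated state value is highest,
    discarding low-value rollouts before the policy update."""
    def mean_value(states):
        return sum(value_fn(s) for s in states) / len(states)
    ranked = sorted(rollouts, key=mean_value, reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]

# Toy usage: states are numbers and V(s) = s.
batch = [[0.1, 0.2], [0.8, 0.9], [0.4, 0.5], [0.6, 0.7]]
print(select_rollouts(batch, value_fn=lambda s: s))
# → [[0.8, 0.9], [0.6, 0.7]]
```

The policy gradient is then computed only on the surviving rollouts, which is where the sample-efficiency gain comes from.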
Verl-Tool (submitted to ICLR 2026) provides an open-source framework for holistic agentic RL with tool use.
Verl-Tool bridges the gap between RLVR (which works for single-turn verifiable tasks) and full agentic environments requiring multi-turn tool use.
RLVR (Reinforcement Learning from Verifiable Rewards) uses deterministic reward functions, but is limited to tasks with clear, automatically checkable answers. Agentic RL extends this verification-driven training to multi-turn settings, where the verifiable signal arrives only at the end of a long tool-using trajectory.
The progression: RLHF → RLVR → Agentic RL represents increasing automation of the reward signal and increasing complexity of the training environment.
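The RLVR end of this progression is just a deterministic check against ground truth. A minimal example for exact-match answers (the normalization shown is an illustrative choice, not a fixed standard):

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Deterministic 0/1 reward: exact match after light normalization."""
    norm = lambda s: s.strip().lower().rstrip(".")
    return 1.0 if norm(model_answer) == norm(ground_truth) else 0.0

print(verifiable_reward(" 42. ", "42"))  # 1.0
print(verifiable_reward("41", "42"))     # 0.0
```

Because the reward is a pure function of the output, no human preference data or reward model is needed, which is the "increasing automation of the reward signal" noted above.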
RL trains LLM agents to autonomously select, sequence, and adapt tool use through environment feedback.
Unlike supervised fine-tuning on tool-use demonstrations, RL enables agents to discover novel tool-use strategies through trial and error.
A typical agentic RL training loop optimizes the objective:
$$\mathcal{J}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$$
where $\gamma$ is the discount factor, and the expectation is over trajectories $\tau$ sampled from the policy $\pi_\theta$.
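As a concrete instance of the objective, the inner sum for one trajectory is a discounted return:

```python
def discounted_return(rewards, gamma=0.99):
    """sum_t gamma^t * r_t over one trajectory's per-step rewards."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Sparse terminal reward after 3 steps: only the last step pays off.
print(discounted_return([0.0, 0.0, 1.0]))  # 0.99**2 ≈ 0.9801
```

With sparse outcome rewards, longer trajectories earn geometrically discounted credit, which compounds the credit-assignment difficulty discussed earlier.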
The training loop proceeds by repeatedly sampling trajectories from the current policy, scoring them with the reward function, and updating $\theta$ with a policy-gradient step.
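A minimal end-to-end illustration of that loop, using vanilla REINFORCE on a toy multi-step task with a sparse 0/1 terminal reward (the environment, hyperparameters, and single-logit policy are all illustrative assumptions, not any paper's setup):

```python
import math, random

random.seed(0)

T, LR, BATCH = 3, 0.5, 8
theta = 0.0  # single logit: p = P(action = 1) at every step

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for update in range(300):
    grad = 0.0
    for _ in range(BATCH):
        # Rollout: sample T actions from the current policy.
        p = sigmoid(theta)
        actions = [1 if random.random() < p else 0 for _ in range(T)]
        # Sparse outcome reward: success only if every action was correct.
        R = 1.0 if all(actions) else 0.0
        # REINFORCE: R * sum_t d/dtheta log pi(a_t); for a Bernoulli
        # policy the per-step score is (a_t - p).
        grad += R * sum(a - p for a in actions)
    theta += LR * grad / BATCH  # policy-gradient step

print(sigmoid(theta) > 0.9)  # True: the policy learned to pick action 1
```

Even with only a terminal reward, the agent reliably learns this tiny task; the variance and credit-assignment problems discussed earlier only bite once $T$ and the action space grow to realistic agentic scales.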