Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Agentic Reinforcement Learning (Agentic RL) trains LLM agents as autonomous decision-makers in dynamic environments, extending standard LLM RL from single-turn text generation to multi-turn, partially observable Markov decision processes (POMDPs) with long-horizon planning, tool use, and adaptive behavior.
| Aspect | Standard LLM RL | Agentic RL |
|---|---|---|
| Interaction | Single-turn (prompt → response) | Multi-turn (observe → act → observe → …) |
| Observation | Full prompt visible | Partial observability (POMDP) |
| Reward | Immediate (quality of one response) | Delayed, sparse (task completion after many steps) |
| Actions | Token generation | Semantic actions: tool calls, navigation, API requests |
| Planning horizon | Single response | Tens to hundreds of steps |
| State | Stateless per query | Stateful (memory, environment state) |
| Credit assignment | Per-response | Per-step across long trajectories |
Agentic RL faces two fundamental challenges that standard LLM RL does not:
1. Sparse, non-instructive rewards: Binary 0/1 rewards at the end of long trajectories provide minimal learning signal. The agent must discover which of its many actions contributed to success or failure.
2. Credit assignment over long horizons: With trajectories spanning dozens of tool calls, attributing reward to specific actions is extremely difficult. Standard policy gradient estimators have high variance in this setting.
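To make the credit-assignment problem concrete, consider how a sparse terminal reward propagates backward through a trajectory as a discounted return. This is a minimal sketch (the discount factor `gamma` and the toy reward sequence are illustrative, not from the paper):

```python
def discounted_returns(step_rewards, gamma=0.99):
    """Compute the return G_t = r_t + gamma * G_{t+1} for each step.

    With a sparse terminal reward, early steps receive only an
    exponentially discounted echo of the final outcome, which is
    why long-horizon credit assignment is hard.
    """
    returns = []
    g = 0.0
    for r in reversed(step_rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# Sparse reward: only the last step is rewarded.
# With gamma=0.5 the first step's learning signal is already 8x weaker.
returns = discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.5)
```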
Progressive reward shaping (arXiv:2512.07478) addresses sparse rewards by building agent capabilities incrementally:
```python
# Progressive reward shaping concept
def progressive_reward(trajectory, stage):
    """Reward function that evolves with training stage."""
    outcome_reward = verify_final_answer(trajectory)
    if stage == "early":
        # Dense rewards: reward correct tool selection, format, partial progress
        step_rewards = [score_step(s) for s in trajectory.steps]
        return 0.7 * mean(step_rewards) + 0.3 * outcome_reward
    elif stage == "middle":
        # Mixed: some step rewards, heavier outcome weight
        key_steps = [score_step(s) for s in trajectory.key_milestones]
        return 0.3 * mean(key_steps) + 0.7 * outcome_reward
    else:  # "late"
        # Sparse: pure outcome reward (RLVR-style)
        return outcome_reward
```
The paper introduces Value-based Sampling Policy Optimization, which uses a learned value function to select high-quality training trajectories, improving sample efficiency.
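The core idea of value-based trajectory selection can be sketched as follows. This is an assumed interface, not the paper's exact algorithm: `value_fn` stands in for the learned value function, and the selection rule (top-k by estimated value) is one plausible instantiation:

```python
def select_top_trajectories(trajectories, value_fn, k):
    """Rank candidate rollouts by a learned value estimate and keep
    the top-k for the policy update, improving sample efficiency by
    discarding low-value trajectories before training."""
    ranked = sorted(trajectories, key=value_fn, reverse=True)
    return ranked[:k]

# Toy usage: score trajectories by length as a stand-in value function.
best = select_top_trajectories([[1], [1, 2, 3], [1, 2]], value_fn=len, k=2)
```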
Verl-Tool (submitted to ICLR 2026) is an open-source framework for holistic agentic RL with tool use. It bridges the gap between RLVR, which works for single-turn verifiable tasks, and full agentic environments requiring multi-turn tool use.
RLVR (Reinforcement Learning from Verifiable Rewards) uses deterministic reward functions but is limited to tasks with clear correct answers. Agentic RL extends this paradigm to multi-turn, tool-using settings with partial observability.
The progression RLHF → RLVR → Agentic RL represents increasing automation of the reward signal and increasing complexity of the training environment.
RL trains LLM agents to autonomously select, sequence, and adapt tool use through environment feedback.
Unlike supervised fine-tuning on tool-use demonstrations, RL enables agents to discover novel tool-use strategies through trial and error.
A typical agentic RL training loop:
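A minimal sketch of that loop, assuming a generic environment interface (`reset`/`step`) and a sparse outcome reward; the names `GuessEnv`, `rollout`, and `outcome_reward` are illustrative, not any specific framework's API:

```python
import random

def rollout(env, policy, max_steps=10):
    """Collect one multi-turn trajectory: observe -> act -> observe -> ..."""
    obs = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(obs)
        obs, done = env.step(action)
        trajectory.append((obs, action))
        if done:
            break
    return trajectory

class GuessEnv:
    """Toy environment: the episode ends when the agent emits the target action."""
    def __init__(self, target):
        self.target = target
    def reset(self):
        return "start"
    def step(self, action):
        done = (action == self.target)
        return ("done" if done else "continue"), done

def outcome_reward(trajectory, env):
    # Sparse RLVR-style reward: 1 only if the final action solved the task.
    return 1.0 if trajectory and trajectory[-1][1] == env.target else 0.0

# Training loop skeleton: roll out, score, then update the policy.
env = GuessEnv(target="call_tool")
policy = lambda obs: random.choice(["call_tool", "answer", "search"])
for _ in range(100):
    traj = rollout(env, policy)
    reward = outcome_reward(traj, env)
    # ... policy-gradient update on (traj, reward) would go here ...
```

In a real system the random policy is an LLM, the toy environment is a tool sandbox or browser, and the update step is a policy-gradient method such as PPO or GRPO.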