Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Training LLM agents with reinforcement learning for multi-turn interactive tasks represents a paradigm shift from RLHF-based alignment. Agent-R1 and RAGEN introduce frameworks for end-to-end RL training where agents learn directly from environment rewards across multi-step trajectories, rather than from human preference labels.
Traditional RLHF trains language models using human preference rankings to build a reward model. In contrast, agent RL training uses environment rewards – objective signals derived from task outcomes and intermediate steps:
$$R_{\text{env}}(s_t, a_t, s_{t+1}) = f(\text{tool\_output}, \text{task\_progress}, \text{correctness})$$
This is fundamentally different from RLHF where:
$$R_{\text{RLHF}}(x, y) = \text{RewardModel}(x, y) \approx \text{human\_preference}(x, y)$$
Environment rewards are automated, dense, and specifically designed for interactive agent tasks, making them scalable without human annotation.
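To make the contrast concrete, a composite environment reward can be computed directly from task signals. The sketch below is a hypothetical weighting scheme (the function name, weights, and `tool_output` schema are illustrative, not from Agent-R1 or RAGEN): a small dense bonus for successful tool calls, a progress term, and a sparse outcome term for final-answer correctness.

```python
def environment_reward(tool_output, task_progress, answer, gold_answer):
    """Hypothetical composite environment reward.

    Combines dense intermediate signals with sparse outcome correctness;
    no human preference model is involved.
    """
    reward = 0.0
    # Dense intermediate signal: did the tool call succeed?
    if tool_output.get("status") == "ok":
        reward += 0.1
    # Progress toward the task goal, assumed normalized to [0, 1].
    reward += 0.2 * task_progress
    # Sparse outcome signal: exact-match correctness of the final answer.
    if answer is not None:
        reward += 1.0 if answer.strip() == gold_answer.strip() else 0.0
    return reward
```

Because every term is computed from the environment itself, rewards like this can be evaluated at every step of a rollout with no annotation cost.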
Agent-R1 extends the Markov Decision Process to LLM agents with explicit tool use. States are interaction histories, actions are generated tool calls or responses, and transitions are determined by stochastic environment feedback.
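A minimal sketch of this MDP formulation follows; the class and field names are illustrative, not Agent-R1's actual API. The state is the full interaction history, an action is either a tool call or a final response, and the transition appends the action plus any tool feedback.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """State = full interaction history (prompt, tool calls, tool results)."""
    history: list = field(default_factory=list)

    def append(self, role, content):
        self.history.append({"role": role, "content": content})

@dataclass
class AgentAction:
    """Action = either a tool call or a final response."""
    kind: str       # "tool_call" or "respond"
    payload: str    # serialized tool arguments, or the answer text

def transition(state, action, tool_result):
    """Environment transition: append the action and its (possibly
    stochastic) feedback, yielding the next state."""
    next_state = AgentState(history=list(state.history))
    next_state.append("assistant", action.payload)
    if action.kind == "tool_call":
        next_state.append("tool", tool_result)
    return next_state
```

Note that the transition is non-deterministic from the agent's perspective: the same tool call can return different results depending on the external environment.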
Key Architectural Components:
Policy Optimization: Agent-R1 supports PPO, GRPO, and REINFORCE++ on full multi-turn trajectories with masked advantages:
$$\nabla J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t} \nabla \log \pi_\theta(a_t | s_t) \cdot \hat{A}_t \cdot M_t \right]$$
where $M_t$ is the action mask ensuring only agent tokens receive gradient updates.
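The effect of the mask can be seen in a toy numeric example. The sketch below implements the REINFORCE-style surrogate $-\sum_t \log \pi_\theta(a_t|s_t) \cdot \hat{A}_t \cdot M_t$ in plain Python (the function name and the toy numbers are illustrative); the middle step represents environment feedback and is masked out, so it contributes nothing to the loss.

```python
import math

def masked_pg_loss(log_probs, advantages, masks):
    """Surrogate loss -sum_t log pi(a_t|s_t) * A_t * M_t.
    Steps with mask 0 (environment/tool tokens) contribute no gradient."""
    return -sum(lp * a * m for lp, a, m in zip(log_probs, advantages, masks))

# Toy trajectory: three steps, the middle one is environment feedback.
log_probs  = [math.log(0.5), math.log(0.9), math.log(0.25)]
advantages = [1.0, 1.0, -0.5]
masks      = [1, 0, 1]  # mask out the environment step
loss = masked_pg_loss(log_probs, advantages, masks)
```

Without the mask, gradients would flow through tokens the policy never generated (tool outputs, retrieved documents), corrupting the update.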
RAGEN introduces StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL. It addresses three core challenges discovered through systematic study:
1. The Echo Trap: A recurring failure mode in which reward variance collapses and gradient spikes destabilize training; StarPO-S, a stabilized variant of the framework, is introduced to counter it.
2. Rollout Shaping: Multi-turn RL is sensitive to how rollouts are generated and refreshed during training; effective training depends on careful rollout shaping.
3. Reasoning-Aware Rewards: Without fine-grained reward signals that account for reasoning quality, agents develop shallow strategies or hallucinated reasoning chains.
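One way to operationalize the third point is to shape the outcome reward with small bonuses and penalties tied to the reasoning trace. The sketch below is a hypothetical scheme, not StarPO's actual reward function: it rewards well-formed `<think>` blocks and penalizes degenerate, too-short chains that signal shallow strategies.

```python
import re

def shaped_reward(response, outcome_reward):
    """Hypothetical reasoning-aware shaping: add small format and
    reasoning bonuses on top of the task outcome reward."""
    bonus = 0.0
    think = re.search(r"<think>(.+?)</think>", response, re.DOTALL)
    # Reward the presence of an explicit reasoning block before acting.
    if think:
        bonus += 0.1
        # Penalize degenerate, too-short reasoning chains.
        if len(think.group(1).split()) < 5:
            bonus -= 0.05
    return outcome_reward + bonus
```

The shaping terms are deliberately small relative to the outcome reward, so they guide the form of the reasoning without letting agents farm format bonuses instead of solving the task.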
```python
class AgentRLTrainer:
    def __init__(self, policy_model, critic, env, optimizer, algo="ppo"):
        self.policy = policy_model
        self.critic = critic  # value baseline used in advantage estimation
        self.env = env
        self.optimizer = optimizer
        self.algo = algo

    def collect_trajectory(self, task):
        state = self.env.reset(task)
        trajectory = []
        for step in range(self.env.max_steps):
            action, log_prob = self.policy.generate_action(state)
            next_state, reward, done = self.env.step(action)
            # Mask separates agent-generated tokens from environment feedback
            mask = self.compute_action_mask(state, action)
            trajectory.append({
                "state": state, "action": action, "reward": reward,
                "log_prob": log_prob, "mask": mask, "done": done,
            })
            state = next_state
            if done:
                break
        return trajectory

    def compute_masked_advantage(self, trajectory):
        # Discounted return (gamma = 0.99) minus the critic baseline,
        # masked so only agent actions receive gradient signal
        advantages = []
        G = 0.0
        for step in reversed(trajectory):
            G = step["reward"] + 0.99 * G
            adv = G - self.critic(step["state"])
            advantages.insert(0, adv * step["mask"])
        return advantages

    def update_policy(self, trajectories):
        # compute_policy_loss dispatches on self.algo (PPO, GRPO, REINFORCE++)
        for traj in trajectories:
            advantages = self.compute_masked_advantage(traj)
            loss = self.compute_policy_loss(traj, advantages)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
```
| Method | Avg Exact Match |
|---|---|
| PPO (full masks) | 0.3719 |
| REINFORCE++ | 0.3300 |
| Naive RAG | 0.1328 |
| Base Tool Call | 0.0847 |
Agent-R1 with PPO achieves roughly a 4x improvement over the base tool-call baseline (0.372 vs. 0.085 average exact match) on multi-hop QA benchmarks with external search tools.