Agent RL Training: Agent-R1 and RAGEN

Training LLM agents with reinforcement learning for multi-turn interactive tasks represents a paradigm shift from RLHF-based alignment. Agent-R1 and RAGEN introduce frameworks for end-to-end RL training where agents learn directly from environment rewards across multi-step trajectories, rather than from human preference labels.

The Environment Reward Paradigm

Traditional RLHF trains language models using human preference rankings to build a reward model. In contrast, agent RL training uses environment rewards – objective signals derived from task outcomes and intermediate steps:

$$R_{\text{env}}(s_t, a_t, s_{t+1}) = f(\text{tool\_output}, \text{task\_progress}, \text{correctness})$$

This is fundamentally different from RLHF where:

$$R_{\text{RLHF}}(x, y) = \text{RewardModel}(x, y) \approx \text{human\_preference}(x, y)$$

Environment rewards are automated, dense, and specifically designed for interactive agent tasks, making them scalable without human annotation.
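To make the contrast concrete, here is a minimal sketch of an environment reward combining the three signals in the formula above. The helper name, weights, and signal encoding are illustrative assumptions, not Agent-R1's actual reward function.

```python
# Hypothetical environment reward: combines tool execution success,
# dense intermediate progress, and terminal correctness.
# Weights (0.1 / 0.3 / 1.0) are illustrative assumptions.
def env_reward(tool_output: dict, task_progress: float, correct: bool) -> float:
    r = 0.0
    if tool_output.get("ok", False):  # tool call executed without error
        r += 0.1
    r += 0.3 * task_progress          # dense progress signal in [0, 1]
    if correct:                       # terminal task-outcome bonus
        r += 1.0
    return r
```

Because every term is computed from the task state rather than a learned reward model, this signal requires no human annotation and can be evaluated at every step of a trajectory.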

Agent-R1: MDP Framework for LLM Agents

Agent-R1 extends the Markov Decision Process to LLM agents with explicit tool use. States are interaction histories, actions are generated tool calls or responses, and transitions are determined by stochastic environment feedback.
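The state-as-history view can be sketched as a small data structure; the field names and message convention below are assumptions for illustration, not Agent-R1's actual internal types.

```python
from dataclasses import dataclass, field
from typing import List

# Sketch of the agent MDP: the state is the full interaction history,
# and a transition appends either an agent action or the environment's
# (possibly stochastic) tool feedback to that history.
@dataclass
class AgentState:
    history: List[dict] = field(default_factory=list)

    def append(self, role: str, content: str) -> "AgentState":
        # Returns a new state; "role" distinguishes agent actions
        # from environment/tool messages.
        return AgentState(self.history + [{"role": role, "content": content}])
```

Keeping transitions immutable like this makes it straightforward to replay or branch trajectories during rollout collection.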

Key Architectural Components:

Policy Optimization: Agent-R1 supports PPO, GRPO, and REINFORCE++ on full multi-turn trajectories with masked advantages:

$$\nabla J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t} \nabla \log \pi_\theta(a_t | s_t) \cdot \hat{A}_t \cdot M_t \right]$$

where $M_t$ is the action mask ensuring only agent tokens receive gradient updates.
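A sketch of how such a mask can be built from a token sequence follows; the segment-labeling convention is an assumption for illustration, not Agent-R1's exact format.

```python
# Build the action mask M_t: tokens the policy generated get mask 1,
# while environment/tool-result tokens get mask 0 so they receive no
# gradient. The ("agent" | "env", token_ids) labeling is an assumed
# convention, not Agent-R1's actual data layout.
def build_action_mask(segments):
    """segments: list of (source, token_ids) pairs in trajectory order."""
    mask = []
    for source, token_ids in segments:
        bit = 1 if source == "agent" else 0  # "env" covers tool outputs
        mask.extend([bit] * len(token_ids))
    return mask
```

Masking is what allows the policy gradient to be taken over the full multi-turn sequence while attributing credit only to tokens the agent actually chose.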

RAGEN: Self-Evolution via StarPO

RAGEN introduces StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL. It addresses three core challenges discovered through systematic study:

1. The Echo Trap: A recurring failure mode where reward variance collapses and gradient spikes destabilize training. StarPO-S mitigates this with uncertainty-based rollout filtering (keeping prompts whose rollouts show reward variance), removal of the KL penalty, and asymmetric (clip-higher) policy clipping.

2. Rollout Shaping: Effective multi-turn RL benefits from diverse task instances with multiple rollouts per instance, a moderate number of actions per turn (interaction granularity), and frequent, up-to-date rollouts that reflect the current policy.

3. Reasoning-Aware Rewards: Without fine-grained reward signals that account for reasoning quality, agents develop shallow strategies or hallucinated reasoning chains.
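The Echo Trap's signature, collapsing reward variance within a rollout group, suggests a simple filtering step in the spirit of StarPO-S. The threshold and grouping scheme below are assumptions for illustration, not the paper's exact procedure.

```python
import statistics

# Sketch of StarPO-S-style uncertainty filtering: groups of rollouts
# whose rewards have (near-)zero variance carry no learning signal and
# are symptomatic of the Echo Trap, so they are dropped before the
# policy update. min_std is an assumed hyperparameter.
def filter_rollout_groups(groups, min_std=1e-3):
    """groups: list of reward lists, one list per task prompt."""
    kept = []
    for rewards in groups:
        if len(rewards) > 1 and statistics.pstdev(rewards) > min_std:
            kept.append(rewards)
    return kept
```

Filtering trades rollout throughput for gradient stability: uniform-reward groups contribute zero advantage anyway, so discarding them concentrates updates on informative prompts.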

Code Example: Multi-Turn RL Training Loop

class AgentRLTrainer:
    def __init__(self, policy_model, critic, env, optimizer, algo="ppo", gamma=0.99):
        self.policy = policy_model
        self.critic = critic        # value baseline used in advantage estimation
        self.env = env
        self.optimizer = optimizer
        self.algo = algo
        self.gamma = gamma          # discount factor for returns

    def collect_trajectory(self, task):
        state = self.env.reset(task)
        trajectory = []
        for step in range(self.env.max_steps):
            action, log_prob = self.policy.generate_action(state)
            next_state, reward, done = self.env.step(action)
            mask = self.compute_action_mask(state, action)
            trajectory.append({
                "state": state, "action": action,
                "reward": reward, "log_prob": log_prob,
                "mask": mask, "done": done
            })
            state = next_state
            if done:
                break
        return trajectory

    def compute_masked_advantage(self, trajectory):
        advantages = []
        G = 0.0
        for step in reversed(trajectory):
            G = step["reward"] + self.gamma * G       # discounted return-to-go
            adv = G - self.critic(step["state"])      # subtract value baseline
            advantages.insert(0, adv * step["mask"])  # zero out environment tokens
        return advantages

    def update_policy(self, trajectories):
        for traj in trajectories:
            advantages = self.compute_masked_advantage(traj)
            loss = self.compute_policy_loss(traj, advantages)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
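The trainer leaves compute_policy_loss undefined. For the algo="ppo" path, a masked clipped surrogate is the standard choice; the NumPy sketch below shows the shape of that computation with assumed defaults (eps=0.2, mean over unmasked tokens), not Agent-R1's exact implementation.

```python
import numpy as np

# Sketch of a masked PPO clipped-surrogate loss over per-token log-probs.
# The clipping range eps and the mean reduction over unmasked tokens are
# conventional PPO choices, not Agent-R1 specifics.
def masked_ppo_loss(new_logp, old_logp, advantages, mask, eps=0.2):
    new_logp, old_logp = np.asarray(new_logp), np.asarray(old_logp)
    adv, mask = np.asarray(advantages), np.asarray(mask, dtype=float)
    ratio = np.exp(new_logp - old_logp)            # importance ratio per token
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    per_token = -np.minimum(unclipped, clipped) * mask  # env tokens zeroed
    return per_token.sum() / np.maximum(mask.sum(), 1.0)
```

Because the mask multiplies the per-token loss and the normalizer counts only unmasked tokens, environment/tool tokens contribute neither gradient nor scale to the update.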

Agent-R1 Benchmark Results

Method              Avg. Exact Match
PPO (full masks)    0.3719
REINFORCE++         0.3300
Naive RAG           0.1328
Base Tool Call      0.0847

Agent-R1 with PPO achieves roughly a 4.4x improvement over the base tool-call baseline (0.3719 vs. 0.0847 exact match) on multi-hop QA benchmarks with external search tools.

Training Pipeline Diagram

flowchart TD
    A[Task Sampler] --> B[Environment Reset]
    B --> C[Agent generates action]
    C --> D[Environment step]
    D --> E{Done?}
    E -->|No| C
    E -->|Yes| F[Compute trajectory rewards]
    F --> G[Action & Advantage Masking]
    G --> H[Policy Gradient Update]
    H --> I{Echo Trap detected?}
    I -->|Yes| J[StarPO-S: Filter + Stabilize]
    I -->|No| K[Next training iteration]
    J --> K
    K --> A
