====== Agent RL Training: Agent-R1 and RAGEN ======

Training LLM agents with reinforcement learning for multi-turn interactive tasks represents a paradigm shift from RLHF-based alignment. **Agent-R1** and **RAGEN** introduce frameworks for end-to-end RL training where agents learn directly from environment rewards across multi-step trajectories, rather than from human preference labels.

===== The Environment Reward Paradigm =====

Traditional RLHF trains language models using human preference rankings to build a reward model. In contrast, agent RL training uses **environment rewards** -- objective signals derived from task outcomes and intermediate steps:

$$R_{\text{env}}(s_t, a_t, s_{t+1}) = f(\text{tool\_output}, \text{task\_progress}, \text{correctness})$$

This is fundamentally different from RLHF, where:

$$R_{\text{RLHF}}(x, y) = \text{RewardModel}(x, y) \approx \text{human\_preference}(x, y)$$

Environment rewards are automated, dense, and designed specifically for interactive agent tasks, making them scalable without human annotation.

===== Agent-R1: MDP Framework for LLM Agents =====

Agent-R1 extends the Markov Decision Process formulation to LLM agents with explicit tool use: states are interaction histories, actions are generated tool calls or responses, and transitions are determined by stochastic environment feedback.
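To make the reward definition above concrete, here is a minimal sketch of such an environment reward function. Everything in it (the exact-match check, the step-efficiency term, the 0.1 weight) is an illustrative assumption, not Agent-R1's actual reward:

```python
def environment_reward(tool_output: str, expected_answer: str,
                       steps_used: int, max_steps: int) -> float:
    """Combine outcome correctness with a small progress/efficiency signal.

    Hypothetical sketch: signal names and weights are assumptions.
    """
    # Outcome component: exact-match correctness of the final answer
    correctness = 1.0 if tool_output.strip() == expected_answer.strip() else 0.0
    # Intermediate component: mild bonus for using fewer interaction steps
    efficiency = 1.0 - steps_used / max_steps
    # Dense, automated reward: no human preference labels involved
    return correctness + 0.1 * efficiency

# A correct answer found in 3 of 10 allowed steps scores higher than
# the same answer found in 9 steps.
r_fast = environment_reward("Paris", "Paris", steps_used=3, max_steps=10)
r_slow = environment_reward("Paris", "Paris", steps_used=9, max_steps=10)
```

The point of the shaping term is only that intermediate signals make the reward denser than a single end-of-episode correctness bit; a real deployment would derive it from actual task progress.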
**Key Architectural Components:**

* **Tool Module:** Standardized interfaces for tools (search APIs, calculators) with JSON schema definitions, inspired by OpenAI function calling
* **ToolEnv Module:** Environment simulators that handle state transitions and reward computation
* **Action Masking:** Gradients are applied only to agent-controlled tokens, excluding prompts and environmental text
* **Advantage Masking:** Precise credit assignment distinguishing agent decisions from environment responses

**Policy Optimization:** Agent-R1 supports PPO, GRPO, and REINFORCE++ on full multi-turn trajectories with masked advantages:

$$\nabla J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t} \nabla \log \pi_\theta(a_t | s_t) \cdot \hat{A}_t \cdot M_t \right]$$

where $M_t$ is the action mask ensuring only agent tokens receive gradient updates.

===== RAGEN: Self-Evolution via StarPO =====

RAGEN introduces **StarPO** (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL. It addresses three core challenges discovered through systematic study:

**1. The Echo Trap:** A recurring failure mode in which reward variance collapses and gradient spikes destabilize training. StarPO-S addresses this with:

* Trajectory filtering to remove degenerate rollouts
* Critic incorporation for variance reduction
* Gradient stabilization techniques

**2. Rollout Shaping:** Effective multi-turn RL benefits from:

* Diverse initial states for exploration
* Medium interaction granularity (neither too fine nor too coarse)
* Frequent sampling to prevent distribution drift

**3. Reasoning-Aware Rewards:** Without fine-grained reward signals that account for reasoning quality, agents develop shallow strategies or hallucinated reasoning chains.
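Trajectory filtering, the first of the StarPO-S mitigations listed above, can be sketched in a few lines: rollouts whose rewards have collapsed to a near-constant value carry no learning signal and are dropped. The variance statistic and the threshold here are assumptions for illustration; RAGEN's actual filtering criterion may differ:

```python
import statistics

def filter_degenerate_rollouts(trajectories, min_reward_std=0.05):
    """Keep only rollouts whose per-step rewards still carry signal.

    Hypothetical sketch of StarPO-S-style filtering: the statistic
    (reward standard deviation) and threshold are illustrative.
    """
    kept = []
    for traj in trajectories:
        rewards = [step["reward"] for step in traj]
        # A collapsed (Echo Trap) rollout has near-zero reward variance:
        # every step looks the same, so it provides no gradient signal.
        if len(rewards) > 1 and statistics.stdev(rewards) >= min_reward_std:
            kept.append(traj)
    return kept

healthy = [{"reward": r} for r in [0.0, 0.2, 1.0]]    # varied rewards
collapsed = [{"reward": 0.5} for _ in range(3)]       # zero variance
survivors = filter_degenerate_rollouts([healthy, collapsed])
```

Only the varied-reward rollout survives; the zero-variance one is discarded before the policy update.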
===== Code Example: Multi-Turn RL Training Loop =====

```python
class AgentRLTrainer:
    def __init__(self, policy_model, env, critic, optimizer,
                 algo="ppo", gamma=0.99):
        self.policy = policy_model
        self.env = env
        self.critic = critic        # value function used as a baseline
        self.optimizer = optimizer
        self.algo = algo
        self.gamma = gamma          # discount factor

    def collect_trajectory(self, task):
        """Roll out one multi-turn episode in the environment."""
        state = self.env.reset(task)
        trajectory = []
        for step in range(self.env.max_steps):
            action, log_prob = self.policy.generate_action(state)
            next_state, reward, done = self.env.step(action)
            # Mask out environment-generated tokens so only agent
            # tokens receive gradient updates
            mask = self.compute_action_mask(state, action)
            trajectory.append({
                "state": state, "action": action, "reward": reward,
                "log_prob": log_prob, "mask": mask, "done": done,
            })
            state = next_state
            if done:
                break
        return trajectory

    def compute_masked_advantage(self, trajectory):
        """Discounted return minus critic baseline, masked per step."""
        advantages = []
        G = 0.0
        for step in reversed(trajectory):
            G = step["reward"] + self.gamma * G
            adv = G - self.critic(step["state"])
            advantages.insert(0, adv * step["mask"])
        return advantages

    def update_policy(self, trajectories):
        for traj in trajectories:
            advantages = self.compute_masked_advantage(traj)
            loss = self.compute_policy_loss(traj, advantages)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
```

===== Agent-R1 Benchmark Results =====

^ Method ^ Avg Exact Match ^
| **PPO (full masks)** | **0.3719** |
| REINFORCE++ | 0.3300 |
| Naive RAG | 0.1328 |
| Base Tool Call | 0.0847 |

Agent-R1 with PPO achieves roughly a 4x improvement over the base tool-calling baseline (0.3719 vs. 0.0847), and nearly 3x over naive RAG, on multi-hop QA benchmarks with external search tools.
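The discounted, masked advantage computation in `compute_masked_advantage` can be traced by hand on a tiny trajectory. The constant critic baseline below stands in for the learned value function and is an illustrative assumption:

```python
# Standalone trace of masked advantage computation (gamma = 0.99, as in
# the trainer). A constant baseline replaces the critic for simplicity.

GAMMA = 0.99
BASELINE = 0.2  # assumed constant critic value

def masked_advantages(steps):
    """Discounted return minus baseline, zeroed on environment-token steps."""
    advantages = []
    G = 0.0
    for step in reversed(steps):
        G = step["reward"] + GAMMA * G
        advantages.insert(0, (G - BASELINE) * step["mask"])
    return advantages

# Three-step trajectory: the middle step is an environment response
# (mask = 0), so its advantage is zeroed and contributes no gradient.
steps = [
    {"reward": 0.0, "mask": 1},  # agent tool call
    {"reward": 0.0, "mask": 0},  # environment observation
    {"reward": 1.0, "mask": 1},  # agent final answer
]
advs = masked_advantages(steps)
```

Working backwards: the final step has return 1.0 and advantage 0.8; the environment step's advantage is masked to 0; the first step's return is 0.99² = 0.9801, giving advantage 0.7801. Only the two agent steps would push gradients through the policy.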
===== Training Pipeline Diagram =====

```mermaid
flowchart TD
    A[Task Sampler] --> B[Environment Reset]
    B --> C[Agent generates action]
    C --> D[Environment step]
    D --> E{Done?}
    E -->|No| C
    E -->|Yes| F[Compute trajectory rewards]
    F --> G[Action & Advantage Masking]
    G --> H[Policy Gradient Update]
    H --> I{Echo Trap detected?}
    I -->|Yes| J[StarPO-S: Filter + Stabilize]
    I -->|No| K[Next training iteration]
    J --> K
    K --> A
```

===== Key Insights =====

* Multi-turn agent RL is fundamentally different from single-turn RLHF -- it requires handling stochastic environments, tool interactions, and long-horizon credit assignment
* Action masking is critical: without it, gradients flow through environment tokens, corrupting the optimization signal
* The Echo Trap is a systematic failure mode specific to agent RL that requires dedicated mitigation
* Reasoning-aware rewards are necessary for agents to develop genuine problem-solving strategies rather than shallow heuristics

===== References =====

* [[https://arxiv.org/abs/2511.14460|Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning (arXiv:2511.14460)]]
* [[https://arxiv.org/abs/2504.20073|RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning (arXiv:2504.20073)]]
* [[https://github.com/AgentR1/Agent-R1|Agent-R1 GitHub Repository]]
* [[https://github.com/RAGEN-AI/RAGEN|RAGEN GitHub Repository]]

===== See Also =====

* [[data_science_agents|Data Science Agents: DatawiseAgent]]
* [[agent_resource_management|Agent Resource Management: AgentRM]]
* [[knowledge_graph_world_models|Knowledge Graph World Models: AriGraph]]