====== Agent Trajectory Optimization ======

Agent trajectory optimization focuses on improving the sequences of actions that LLM-based agents take to complete tasks. Rather than optimizing individual responses, these methods optimize entire reasoning trajectories: the chain of observations, thoughts, and actions that leads from a task description to a solution.

===== Overview =====

When LLM agents tackle complex multi-step tasks, they are sensitive to the quality of individual planning steps: a subtle mistake early in a trajectory can cascade into task failure. Traditional reinforcement learning approaches address this through process supervision, rewarding or penalizing every step. However, Process Reward Models (PRMs) are expensive to train because they require extensive per-step trajectory exploration.

Recent research has developed more efficient approaches that focus on relative reward trends across steps, evolutionary refinement of trajectories, and scalable synthesis of training data from diverse agent experiences.

===== Key Methods =====

==== Reward Rising Optimization (RRO) ====

RRO, accepted at COLM 2025, introduces a more efficient approach to process supervision for agent training. Instead of requiring dense per-step reward labels, RRO focuses on the relative reward trend between successive reasoning steps and maintains an increasing reward pattern in collected trajectories.

The method incrementally augments process supervision until it identifies a step exhibiting a positive reward differential (a rising reward) relative to the preceding step. This dynamically expands the search space for next-action candidates while efficiently capturing high-quality training data.

Key results on benchmarks:

  * **WebShop**: 62.91 reward with only 1.86 average trajectories (vs. IPR, which needs 5 trajectories for 61.32)
  * **InterCode-SQL**: 55.08 reward with 1.64 average trajectories
  * By the final reasoning phase, 45.29% of actions showed rising rewards vs.
40.87% for baseline methods

The trajectory-collection loop can be written as conceptual pseudocode; `agent`, `task`, and `reward_model` are assumed interfaces, not an API from the paper:

```python
# Reward Rising Optimization concept (illustrative; the fallback branch
# now records the chosen transition and updates prev_reward).
class RROTrainer:
    def __init__(self, agent, reward_model):
        self.agent = agent
        self.reward_model = reward_model

    def collect_trajectory(self, task):
        trajectory = []
        state = task.initial_state
        prev_reward = 0.0
        while not task.is_complete(state):
            # Generate a small pool of candidate actions
            candidates = self.agent.propose_actions(state, k=5)
            for action in candidates:
                next_state = task.simulate(state, action)
                reward = self.reward_model.score(next_state)
                # Keep only rising-reward transitions
                if reward > prev_reward:
                    trajectory.append((state, action, reward))
                    state = next_state
                    prev_reward = reward
                    break
            else:
                # No rising reward found: expand the search space
                # and take the best available action
                candidates = self.agent.propose_actions(state, k=20)
                scored = [(self.reward_model.score(task.simulate(state, a)), a)
                          for a in candidates]
                best_reward, best = max(scored, key=lambda pair: pair[0])
                trajectory.append((state, best, best_reward))
                state = task.simulate(state, best)
                prev_reward = best_reward
        return trajectory
```

==== SE-Agent (Self-Evolution) ====

SE-Agent introduces an evolutionary mechanism for trajectory optimization with three core operations:

  * **Revision**: the agent revisits and improves previous solution trajectories
  * **Recombination**: cross-trajectory inspiration combines successful elements from different solution paths
  * **Refinement**: iterative improvement of recombined trajectories

This approach expands the search space beyond local optima and mitigates suboptimal reasoning through diverse solution paths. Experimental results across five strong LLMs show up to a 55% relative improvement when SE-Agent is integrated.

==== Monte Carlo Tree Search for Agents ====

MCTS adapts classical game-tree search to agent planning by treating each action as a node in a search tree. The agent explores multiple action paths, simulates outcomes, and backpropagates the results to guide future exploration. While effective, MCTS can produce redundant reasoning and suboptimal outcomes in open-ended agent tasks, which motivated evolutionary alternatives such as SE-Agent.
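The select/expand/simulate/backpropagate cycle can be sketched on a toy task. This is a minimal, generic MCTS with UCB1 selection; the `step`, `reward`, and `is_terminal` callbacks are hypothetical placeholders for an agent environment, not an interface from any of the papers above:

```python
import math
import random

class Node:
    """One state node in the search tree, reached via `action` from `parent`."""
    def __init__(self, state, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action
        self.children = []
        self.visits = 0
        self.value = 0.0  # sum of rollout rewards backpropagated through this node

def mcts(root_state, actions, step, reward, is_terminal, iters=500, c=1.4):
    """Return the action at the root with the most visits after `iters` iterations."""
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # Selection: descend through fully expanded nodes via UCB1
        while node.children and len(node.children) == len(actions):
            node = max(node.children,
                       key=lambda n: n.value / n.visits
                       + c * math.sqrt(math.log(node.visits) / n.visits))
        # Expansion: add one untried child if the node is not terminal
        if not is_terminal(node.state):
            tried = {ch.action for ch in node.children}
            untried = [a for a in actions if a not in tried]
            if untried:
                a = random.choice(untried)
                node = Node(step(node.state, a), parent=node, action=a)
                node.parent.children.append(node)
        # Simulation: random rollout to a terminal state
        state = node.state
        while not is_terminal(state):
            state = step(state, random.choice(actions))
        r = reward(state)
        # Backpropagation: update statistics up to the root
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    return max(root.children, key=lambda n: n.visits).action
```

For example, on a toy line-walk where the state is `(position, depth)`, actions add 1 or 2, and only position 4 within two steps earns reward, the search concentrates visits on the first action 2 (the only route to 4).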
==== Web Agent Trajectory Synthesis ====

Scalable multi-agent pipelines generate diverse trajectory training data, using exploration as a core mechanism. By ensuring broad domain coverage and skill diversity in training datasets, these methods significantly improve agent performance on benchmarks such as Mind2Web-Live and Multimodal-Mind2Web.

==== TrajAgent ====

TrajAgent provides a unified LLM-based framework for trajectory modeling with two key components:

  * **UniEnv**: a unified execution environment with standardized data and model interfaces
  * **AutOpt**: a systematic optimization module that improves model performance by 15.43% over baselines

===== Comparison of Approaches =====

^ Method ^ Key Innovation ^ Sample Efficiency ^ Performance Gain ^
| RRO | Rising-reward filtering | Very high (1.86 trajectories on average) | 62.91 reward on WebShop |
| SE-Agent | Evolutionary operations | Moderate | Up to 55% relative improvement |
| MCTS | Tree-search exploration | Low (many rollouts) | Strong but expensive |
| Trajectory Synthesis | Diverse data generation | N/A (training data) | Benchmark improvements |

===== Core Concepts =====

==== Process Supervision vs. Outcome Supervision ====

Outcome supervision rewards only final task completion, providing a sparse but unambiguous signal. Process supervision rewards intermediate steps, providing a dense signal but requiring expensive annotation. RRO bridges this gap by using relative reward trends rather than absolute step-level labels.

==== Exploration vs. Exploitation ====

Trajectory optimization must balance exploring novel action sequences (which may discover superior strategies) against exploiting known good trajectories (which provide reliable performance). SE-Agent's evolutionary approach balances the two naturally: revision exploits known trajectories, while recombination explores new combinations.
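The rising-reward criterion that lets RRO avoid absolute step-level labels can be illustrated with a minimal sketch. The function name and the list-of-floats interface are illustrative assumptions, not code from the paper: given the per-step rewards of a collected trajectory, only the prefix whose rewards strictly increase is kept for training.

```python
def rising_prefix(step_rewards):
    """Keep the longest prefix of per-step rewards that strictly increases.

    This mirrors the relative-reward-trend idea: a step is accepted only
    if its reward exceeds the previous step's, so no absolute per-step
    annotation is needed.
    """
    kept = []
    prev = float("-inf")
    for r in step_rewards:
        if r <= prev:
            break  # trend broken: truncate here
        kept.append(r)
        prev = r
    return kept
```

For instance, `rising_prefix([0.1, 0.3, 0.6, 0.5, 0.9])` keeps `[0.1, 0.3, 0.6]` and discards everything from the first non-rising step onward.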
==== Self-Evolution ====

The most promising direction in trajectory optimization is self-evolution: agents that improve their own trajectories without external supervision. By maintaining a population of solution strategies and applying evolutionary pressure, these systems can discover novel approaches that neither the base model nor human designers would produce.

===== Applications =====

  * **Web navigation**: optimizing browsing trajectories for task completion
  * **Code generation**: improving multi-step coding sequences
  * **SQL query composition**: refining database interaction patterns
  * **Scientific reasoning**: optimizing experimental design trajectories
  * **Tool use**: learning efficient sequences of tool invocations

===== References =====

  * [[https://openreview.net/forum?id=PhaE8TSM5j|RRO: LLM Agent Optimization Through Rising Reward Trajectories (COLM 2025)]]
  * [[https://arxiv.org/abs/2508.02085|SE-Agent: Self-Evolution Trajectory Optimization (arXiv:2508.02085)]]
  * [[https://github.com/tsinghua-fib-lab/TrajAgent|TrajAgent: Unified Trajectory Modeling Framework]]
  * [[https://aclanthology.org/2025.findings-acl.326.pdf|Web Agent Trajectory Synthesis (ACL 2025)]]

===== See Also =====

  * [[long_horizon_agents|Long-Horizon Agents]]
  * [[continual_learning_agents|Continual Learning Agents]]
  * [[openhands|OpenHands]]
  * [[durable_execution_for_agents|Durable Execution for Agents]]