Agent Trajectory Optimization

Agent trajectory optimization focuses on improving the sequences of actions that LLM-based agents take to complete tasks. Rather than optimizing individual responses, these methods optimize entire reasoning trajectories — the chain of observations, thoughts, and actions that lead from a task description to a solution.

Overview

When LLM agents tackle complex multi-step tasks, they are sensitive to the quality of individual planning steps. A subtle mistake early in a trajectory can cascade into task failure. Traditional reinforcement learning approaches address this through process supervision, rewarding or penalizing every step. However, Process Reward Models (PRMs) are expensive to train because they require extensive per-step trajectory exploration.

Recent research has developed more efficient approaches that focus on relative reward trends across steps, evolutionary refinement of trajectories, and scalable synthesis of training data from diverse agent experiences.

Key Methods

Reward Rising Optimization (RRO)

RRO, accepted at COLM 2025, introduces a more efficient approach to process supervision for agent training. Instead of requiring dense per-step reward labels, RRO focuses on the relative reward trend between successive reasoning steps and maintains an increasing reward pattern in collected trajectories.

The method works by incrementally augmenting process supervision until it identifies a step exhibiting positive reward differentials (rising rewards) relative to its preceding iteration. This dynamically expands the search space for next-action candidates while efficiently capturing high-quality training data.

Key results on benchmarks:

WebShop: 62.91 reward with only 1.86 average trajectories (vs. IPR needing 5 trajectories for 61.32)
InterCode-SQL: 55.08 reward with 1.64 average trajectories
By the final reasoning phase, 45.29% of actions showed rising rewards vs. 40.87% for baseline methods

# Reward Rising Optimization concept
class RROTrainer:
    def __init__(self, agent, reward_model):
        self.agent = agent
        self.reward_model = reward_model
 
    def collect_trajectory(self, task):
        trajectory = []
        state = task.initial_state
        prev_reward = 0
 
        while not task.is_complete(state):
            # Generate candidate actions
            candidates = self.agent.propose_actions(state, k=5)
 
            # Score each candidate
            for action in candidates:
                next_state = task.simulate(state, action)
                reward = self.reward_model.score(next_state)
 
                # Keep only rising-reward transitions
                if reward > prev_reward:
                    trajectory.append((state, action, reward))
                    state = next_state
                    prev_reward = reward
                    break
            else:
                # Expand search if no rising reward found
                candidates = self.agent.propose_actions(state, k=20)
                # Select best available
                best = max(candidates, key=lambda a:
                    self.reward_model.score(task.simulate(state, a)))
                state = task.simulate(state, best)
 
        return trajectory

SE-Agent (Self-Evolution)

SE-Agent introduces an evolutionary mechanism for trajectory optimization with three core operations:

Revision — the agent revisits and improves previous solution trajectories
Recombination — cross-trajectory inspiration combines successful elements from different solution paths
Refinement — iterative improvement of recombined trajectories

This approach expands the search space beyond local optima and mitigates suboptimal reasoning through diverse solution paths. Experimental results across five strong LLMs show up to 55% relative improvement when SE-Agent is integrated.

Monte Carlo Tree Search for Agents

MCTS adapts classical game-tree search to agent planning by treating each action as a tree node. The agent explores multiple action paths, simulates outcomes, and backpropagates results to guide future exploration. While effective, MCTS can lead to redundant reasoning and suboptimal outcomes in open-ended agent tasks, which motivated evolutionary alternatives like SE-Agent.

Web Agent Trajectory Synthesis

Scalable multi-agent pipelines generate diverse trajectory training data through exploration as a core mechanism. By ensuring broad domain coverage and skill diversity in training datasets, these methods significantly improve agent performance on benchmarks like Mind2Web-Live and Multimodal-Mind2Web.

TrajAgent

TrajAgent provides a unified LLM-based framework for trajectory modeling with two key components:

UniEnv — a unified execution environment with standardized data and model interfaces
AutOpt — a systematic optimization module that improves model performance by 15.43% over baselines

Comparison of Approaches

Method	Key Innovation	Sample Efficiency	Performance Gain
RRO	Rising reward filtering	Very high (1.86 traj avg)	62.91 on WebShop
SE-Agent	Evolutionary operations	Moderate	Up to 55% relative improvement
MCTS	Tree search exploration	Low (many rollouts)	Strong but expensive
Trajectory Synthesis	Diverse data generation	N/A (training data)	Benchmark improvements

Core Concepts

Process Supervision vs. Outcome Supervision

Outcome supervision rewards only final task completion, providing sparse but unambiguous signal. Process supervision rewards intermediate steps, providing dense signal but requiring expensive annotation. RRO bridges this gap by using relative reward trends rather than absolute step-level labels.

Exploration vs. Exploitation

Trajectory optimization must balance exploring novel action sequences (which may discover superior strategies) against exploiting known good trajectories (which provide reliable performance). SE-Agent's evolutionary approach naturally balances this through revision (exploitation) and recombination (exploration).

Self-Evolution

The most promising direction in trajectory optimization is self-evolution: agents that improve their own trajectories without external supervision. By maintaining a population of solution strategies and applying evolutionary pressure, these systems can discover novel approaches that neither the base model nor human designers would produce.

Applications

Web navigation — optimizing browsing trajectories for task completion
Code generation — improving multi-step coding sequences
SQL query composition — refining database interaction patterns
Scientific reasoning — optimizing experimental design trajectories
Tool use — learning efficient sequences of tool invocations

Table of Contents