Agent Trajectory Optimization

Agent trajectory optimization focuses on improving the sequences of actions that LLM-based agents take to complete tasks. Rather than optimizing individual responses, these methods optimize entire reasoning trajectories — the chain of observations, thoughts, and actions that lead from a task description to a solution.

Overview

When LLM agents tackle complex multi-step tasks, they are sensitive to the quality of individual planning steps. A subtle mistake early in a trajectory can cascade into task failure. Traditional reinforcement learning approaches address this through process supervision, rewarding or penalizing every step. However, Process Reward Models (PRMs) are expensive to train because they require extensive per-step trajectory exploration.

Recent research has developed more efficient approaches that focus on relative reward trends across steps, evolutionary refinement of trajectories, and scalable synthesis of training data from diverse agent experiences.

Key Methods

Reward Rising Optimization (RRO)

RRO, accepted at COLM 2025, introduces a more efficient approach to process supervision for agent training. Instead of requiring dense per-step reward labels, RRO focuses on the relative reward trend between successive reasoning steps and maintains an increasing reward pattern in collected trajectories.

The method incrementally augments process supervision until it identifies a step exhibiting a positive reward differential (a rising reward) relative to the preceding step. This dynamically expands the search space for next-action candidates while efficiently capturing high-quality training data.

Key results: RRO reaches 62.91 on WebShop while collecting only 1.86 trajectories per task on average (see the comparison table below). The following sketch illustrates the core collection loop; the class and method names (RROTrainer, propose_actions, simulate, score) are illustrative, not taken from the paper.

# Reward Rising Optimization (RRO) concept sketch. The task/agent/reward_model
# interfaces are illustrative, not taken from the paper.
class RROTrainer:
    def __init__(self, agent, reward_model):
        self.agent = agent
        self.reward_model = reward_model

    def collect_trajectory(self, task):
        """Collect a trajectory whose step-level rewards form a rising sequence."""
        trajectory = []
        state = task.initial_state
        prev_reward = 0.0

        while not task.is_complete(state):
            # Generate a small pool of candidate next actions.
            candidates = self.agent.propose_actions(state, k=5)

            for action in candidates:
                next_state = task.simulate(state, action)
                reward = self.reward_model.score(next_state)

                # Keep the first transition whose reward rises.
                if reward > prev_reward:
                    trajectory.append((state, action, reward))
                    state, prev_reward = next_state, reward
                    break
            else:
                # No rising reward found: expand the candidate pool and
                # fall back to the best available transition.
                candidates = self.agent.propose_actions(state, k=20)
                best_reward, best_action, best_state = float("-inf"), None, None
                for action in candidates:
                    next_state = task.simulate(state, action)
                    reward = self.reward_model.score(next_state)
                    if reward > best_reward:
                        best_reward, best_action, best_state = reward, action, next_state
                trajectory.append((state, best_action, best_reward))
                state, prev_reward = best_state, best_reward

        return trajectory
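
A hypothetical call site, with PolicyAgent, ORMScorer, and WebShopTask standing in for real implementations of the interfaces assumed above:

# Hypothetical usage; PolicyAgent, ORMScorer, and WebShopTask are placeholders.
trainer = RROTrainer(agent=PolicyAgent(), reward_model=ORMScorer())
trajectory = trainer.collect_trajectory(WebShopTask("find a red desk lamp under $30"))
for state, action, reward in trajectory:
    print(action, reward)  # rewards rise except at fallback steps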

SE-Agent (Self-Evolution)

SE-Agent introduces an evolutionary mechanism for trajectory optimization with three core operations: revision (rewriting an existing trajectory based on critique), recombination (merging complementary segments of different trajectories), and refinement (polishing the resulting trajectory).

This approach expands the search space beyond local optima and mitigates suboptimal reasoning through diverse solution paths. Experimental results across five strong LLMs show up to 55% relative improvement when SE-Agent is integrated.
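
A minimal sketch of one evolutionary generation, assuming hypothetical revise, recombine, and refine operators implemented by prompting an LLM, plus a trajectory-scoring function:

import random

# Conceptual SE-Agent-style generation step; the operator implementations
# (llm.revise / llm.recombine / llm.refine) are assumptions, not the paper's API.
def evolve_one_generation(population, llm, score, num_offspring=4):
    offspring = []
    for _ in range(num_offspring):
        parent, mate = random.sample(population, 2)
        # Revision: rewrite a trajectory using the model's own critique.
        child = llm.revise(parent)
        # Recombination: splice in complementary segments from a second parent.
        child = llm.recombine(child, mate)
        # Refinement: polish the merged trajectory into a coherent solution.
        offspring.append(llm.refine(child))
    # Selection: keep the highest-scoring trajectories for the next round.
    survivors = sorted(population + offspring, key=score, reverse=True)
    return survivors[:len(population)]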

Monte Carlo Tree Search for Agents

MCTS adapts classical game-tree search to agent planning by treating each action as a tree node. The agent explores multiple action paths, simulates outcomes, and backpropagates results to guide future exploration. While effective, MCTS can lead to redundant reasoning and suboptimal outcomes in open-ended agent tasks, which motivated evolutionary alternatives like SE-Agent.
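
As a reference point, a compact sketch of the UCT selection and backpropagation steps at the heart of classical MCTS; the Node structure here is generic, not tied to any specific agent framework:

import math

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # cumulative reward backed up through this node

    def uct_score(self, c=1.41):
        # Standard UCT: mean value plus an exploration bonus.
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def select(node):
    # Descend the tree, always following the highest-UCT child.
    while node.children:
        node = max(node.children, key=Node.uct_score)
    return node

def backpropagate(node, reward):
    # Propagate the simulated outcome back up to the root.
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent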

Web Agent Trajectory Synthesis

Scalable multi-agent pipelines use exploration as the core mechanism for generating diverse trajectory training data. By ensuring broad domain coverage and skill diversity in the resulting datasets, these methods significantly improve agent performance on benchmarks such as Mind2Web-Live and Multimodal-Mind2Web.
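
One way such a pipeline might enforce diversity, sketched with hypothetical domain/skill tags and a quality score attached to each synthesized trajectory by the exploration agents:

from collections import defaultdict

# Hypothetical diversity filter for synthesized trajectories; the
# domain/skill/quality_score attributes are assumptions for illustration.
def diversify(trajectories, per_bucket=50):
    buckets = defaultdict(list)
    for traj in trajectories:
        buckets[(traj.domain, traj.skill)].append(traj)
    # Cap each (domain, skill) bucket so no single niche dominates training.
    dataset = []
    for bucket in buckets.values():
        bucket.sort(key=lambda t: t.quality_score, reverse=True)
        dataset.extend(bucket[:per_bucket])
    return dataset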

TrajAgent

TrajAgent provides a unified LLM-based framework for trajectory modeling with two key components: UniEnv, an execution environment exposing a unified data and model interface, and TAgent, an agentic workflow that automates trajectory-modeling tasks on top of it.

Comparison of Approaches

Method                 Key Innovation            Sample Efficiency            Performance Gain
RRO                    Rising-reward filtering   Very high (1.86 traj. avg)   62.91 on WebShop
SE-Agent               Evolutionary operations   Moderate                     Up to 55% relative improvement
MCTS                   Tree-search exploration   Low (many rollouts)          Strong but expensive
Trajectory Synthesis   Diverse data generation   N/A (training data)          Benchmark improvements

Core Concepts

Process Supervision vs. Outcome Supervision

Outcome supervision rewards only final task completion, providing sparse but unambiguous signal. Process supervision rewards intermediate steps, providing dense signal but requiring expensive annotation. RRO bridges this gap by using relative reward trends rather than absolute step-level labels.
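
The distinction can be made concrete with a toy reward assignment over a five-step trajectory; all numbers are illustrative only:

# Illustrative only: three supervision signals over the same 5-step trajectory.
step_rewards = [0.1, 0.3, 0.2, 0.5, 0.9]  # dense PRM-style labels (expensive)

# Outcome supervision: one sparse signal for the whole trajectory.
outcome_signal = 1.0 if step_rewards[-1] > 0.5 else 0.0

# Process supervision: a label for every intermediate step.
process_signal = step_rewards

# RRO-style relative trend: only whether each step's reward rose.
rising = [b > a for a, b in zip(step_rewards, step_rewards[1:])]
print(outcome_signal, rising)  # 1.0 [True, False, True, True]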

Exploration vs. Exploitation

Trajectory optimization must balance exploring novel action sequences (which may discover superior strategies) against exploiting known good trajectories (which provide reliable performance). SE-Agent's evolutionary approach naturally balances this through revision (exploitation) and recombination (exploration).

Self-Evolution

The most promising direction in trajectory optimization is self-evolution: agents that improve their own trajectories without external supervision. By maintaining a population of solution strategies and applying evolutionary pressure, these systems can discover novel approaches that neither the base model nor human designers would produce.
