Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Code & Software
Safety & Security
Evaluation
Research
Development
Meta
Agent trajectory optimization focuses on improving the sequences of actions that LLM-based agents take to complete tasks. Rather than optimizing individual responses, these methods optimize entire reasoning trajectories — the chain of observations, thoughts, and actions that lead from a task description to a solution.
When LLM agents tackle complex multi-step tasks, they are sensitive to the quality of individual planning steps. A subtle mistake early in a trajectory can cascade into task failure. Traditional reinforcement learning approaches address this through process supervision, rewarding or penalizing every step. However, Process Reward Models (PRMs) are expensive to train because they require extensive per-step trajectory exploration.
Recent research has developed more efficient approaches that focus on relative reward trends across steps, evolutionary refinement of trajectories, and scalable synthesis of training data from diverse agent experiences.
RRO (Reward Rising Optimization), accepted at COLM 2025, introduces a more efficient approach to process supervision for agent training. Instead of requiring dense per-step reward labels, RRO focuses on the relative reward trend between successive reasoning steps and maintains an increasing reward pattern in collected trajectories.
The method incrementally augments process supervision until it identifies a step exhibiting a positive reward differential (a rising reward) relative to the preceding step. This dynamically expands the search space for next-action candidates while efficiently capturing high-quality training data.
Key benchmark results: RRO reaches 62.91 on WebShop while collecting only about 1.86 trajectories per task on average, making it far more sample-efficient than dense process supervision. The core idea looks like this in pseudocode:
```python
# Reward Rising Optimization (RRO) concept
class RROTrainer:
    def __init__(self, agent, reward_model):
        self.agent = agent
        self.reward_model = reward_model

    def collect_trajectory(self, task):
        trajectory = []
        state = task.initial_state
        prev_reward = 0
        while not task.is_complete(state):
            # Generate candidate actions
            candidates = self.agent.propose_actions(state, k=5)
            # Score each candidate
            for action in candidates:
                next_state = task.simulate(state, action)
                reward = self.reward_model.score(next_state)
                # Keep only rising-reward transitions
                if reward > prev_reward:
                    trajectory.append((state, action, reward))
                    state = next_state
                    prev_reward = reward
                    break
            else:
                # Expand search if no rising reward found
                candidates = self.agent.propose_actions(state, k=20)
                # Select best available candidate
                best = max(
                    candidates,
                    key=lambda a: self.reward_model.score(task.simulate(state, a)),
                )
                state = task.simulate(state, best)
        return trajectory
```
SE-Agent introduces an evolutionary mechanism for trajectory optimization with three core operations: revision, recombination, and refinement.
This approach expands the search space beyond local optima and mitigates suboptimal reasoning through diverse solution paths. Experimental results across five strong LLMs show up to 55% relative improvement when SE-Agent is integrated.
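The evolutionary loop can be sketched abstractly as below. The `revise` and `recombine` operators and the fitness function are toy stand-ins for SE-Agent's LLM-driven operations, not the paper's implementation; selection here plays the role of keeping refined candidates.

```python
def evolve_trajectories(population, score, revise, recombine, generations=3, keep=4):
    """Toy evolutionary loop over candidate trajectories.

    population: list of trajectories (here, lists of action strings)
    score:      fitness function, trajectory -> float
    revise:     mutation operator producing a revised trajectory
    recombine:  crossover operator merging two trajectories
    """
    for _ in range(generations):
        offspring = []
        for traj in population:
            offspring.append(revise(traj))        # revision: local refinement
        for a, b in zip(population, population[1:]):
            offspring.append(recombine(a, b))     # recombination: mix two trajectories
        # Selection: keep only the highest-scoring candidates
        population = sorted(population + offspring, key=score, reverse=True)[:keep]
    return population

# Toy usage: trajectories are action lists; fitness rewards reaching "solve" briefly
pop = [["search"], ["read"], ["search", "read"]]
score = lambda t: t.count("solve") - 0.1 * len(t)
revise = lambda t: t + ["solve"] if "solve" not in t else t
recombine = lambda a, b: a[:1] + b[1:]
best = evolve_trajectories(pop, score, revise, recombine)[0]
print(best)
```

The selection step is what "expands the search space beyond local optima": poor lineages die out while recombined trajectories carry useful prefixes forward.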
MCTS adapts classical game-tree search to agent planning by treating each action as a tree node. The agent explores multiple action paths, simulates outcomes, and backpropagates results to guide future exploration. While effective, MCTS can lead to redundant reasoning and suboptimal outcomes in open-ended agent tasks, which motivated evolutionary alternatives like SE-Agent.
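A minimal MCTS loop on a toy integer-state task illustrates the selection, expansion, simulation, and backpropagation cycle described above; the task and all names are illustrative, not from any cited system.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children, self.visits, self.value = [], 0, 0.0

def uct(node, c=1.4):
    # Upper Confidence bound for Trees: trade off value vs. exploration
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def mcts(root_state, actions, step, reward, iters=200, horizon=4):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # Selection: descend via UCT until reaching a leaf
        while node.children:
            node = max(node.children, key=uct)
        # Expansion: add one child per action once the leaf has been visited
        if node.visits > 0:
            for a in actions:
                node.children.append(Node(step(node.state, a), node, a))
            node = random.choice(node.children)
        # Simulation: random rollout to the horizon
        state = node.state
        for _ in range(horizon):
            state = step(state, random.choice(actions))
        r = reward(state)
        # Backpropagation: update statistics up to the root
        while node:
            node.visits += 1
            node.value += r
            node = node.parent
    # Recommend the most-visited action from the root
    return max(root.children, key=lambda n: n.visits).action

# Toy task: integer state, "inc"/"dec" move it, reward peaks at the target 3
random.seed(0)
best_action = mcts(0, ["inc", "dec"],
                   step=lambda s, a: s + (1 if a == "inc" else -1),
                   reward=lambda s: -abs(s - 3))
print(best_action)
```

The many rollouts per decision are exactly the cost noted above: each recommended action needs hundreds of simulations, which is what evolutionary alternatives like SE-Agent try to avoid.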
Scalable multi-agent pipelines generate diverse trajectory training data through exploration as a core mechanism. By ensuring broad domain coverage and skill diversity in training datasets, these methods significantly improve agent performance on benchmarks like Mind2Web-Live and Multimodal-Mind2Web.
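Such a pipeline can be sketched as a quality filter over exploratory rollouts; the agent interface, reward threshold, and task fields below are hypothetical illustrations, not a specific system's API.

```python
import random

def synthesize_trajectories(tasks, agents, attempts_per_pair=3, min_reward=0.5):
    """Toy data-synthesis loop: several exploration agents attempt each task,
    and only sufficiently successful trajectories are kept as training data."""
    dataset = []
    for task in tasks:
        for agent in agents:
            for _ in range(attempts_per_pair):
                trajectory, reward = agent(task)
                if reward >= min_reward:  # keep only high-quality rollouts
                    dataset.append({"task": task, "trajectory": trajectory,
                                    "reward": reward, "domain": task["domain"]})
    return dataset

# Toy usage: an agent is any callable returning (trajectory, reward)
random.seed(1)
tasks = [{"goal": "book flight", "domain": "travel"},
         {"goal": "find paper", "domain": "research"}]

def noisy_agent(task):
    reward = random.random()  # stand-in for a real rollout's outcome score
    return [f"step toward {task['goal']}"], reward

data = synthesize_trajectories(tasks, [noisy_agent, noisy_agent])
domains = {d["domain"] for d in data}
```

Tracking the `domain` field is one simple way to audit the breadth of coverage that these pipelines rely on.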
TrajAgent provides a unified LLM-based framework for trajectory modeling built around two key components.
| Method | Key Innovation | Sample Efficiency | Performance Gain |
|---|---|---|---|
| RRO | Rising-reward filtering | Very high (avg. 1.86 trajectories) | 62.91 on WebShop |
| SE-Agent | Evolutionary operations | Moderate | Up to 55% relative improvement |
| MCTS | Tree-search exploration | Low (many rollouts) | Strong but expensive |
| Trajectory synthesis | Diverse data generation | N/A (training data) | Benchmark improvements |
Outcome supervision rewards only final task completion, providing sparse but unambiguous signal. Process supervision rewards intermediate steps, providing dense signal but requiring expensive annotation. RRO bridges this gap by using relative reward trends rather than absolute step-level labels.
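The difference between the three signal types is easy to see on a hypothetical trajectory; the per-step rewards below are made up for illustration.

```python
def outcome_signal(step_rewards, success):
    # Sparse: a single label attached to the final step
    return [0.0] * (len(step_rewards) - 1) + [1.0 if success else 0.0]

def process_signal(step_rewards):
    # Dense: every step carries its own (expensively annotated) label
    return list(step_rewards)

def rising_trend_signal(step_rewards):
    # RRO-style relative signal: 1 if this step's reward rose, else 0
    prev = 0.0
    out = []
    for r in step_rewards:
        out.append(1.0 if r > prev else 0.0)
        prev = r
    return out

steps = [0.2, 0.5, 0.4, 0.7]  # hypothetical per-step rewards
print(outcome_signal(steps, success=True))   # [0.0, 0.0, 0.0, 1.0]
print(process_signal(steps))                 # [0.2, 0.5, 0.4, 0.7]
print(rising_trend_signal(steps))            # [1.0, 1.0, 0.0, 1.0]
```

Note that the rising-trend signal needs only comparisons between adjacent steps, not calibrated absolute scores, which is the source of RRO's annotation savings.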
Trajectory optimization must balance exploring novel action sequences (which may discover superior strategies) against exploiting known good trajectories (which provide reliable performance). SE-Agent's evolutionary approach naturally balances this through revision (exploitation) and recombination (exploration).
The most promising direction in trajectory optimization is self-evolution: agents that improve their own trajectories without external supervision. By maintaining a population of solution strategies and applying evolutionary pressure, these systems can discover novel approaches that neither the base model nor human designers would produce.