Language Agent Tree Search (LATS) is a general framework introduced by Zhou et al. (2023) that unifies reasoning, acting, and planning by combining Monte Carlo Tree Search (MCTS) with LLM capabilities. The LLM serves simultaneously as the policy (action generator), value function (state evaluator), and reflection mechanism, enabling systematic exploration of decision trees with environment feedback.
Existing approaches treat reasoning (CoT, ToT) and acting (ReAct) as separate capabilities. Tree-of-Thoughts explores reasoning paths but lacks grounding in environment feedback. ReAct interacts with environments but follows a single trajectory without systematic exploration. LATS bridges this gap by embedding ReAct-style action trajectories within an MCTS search tree, combining the exploration benefits of tree search with the grounding of environment interaction.
LATS adapts the four classical MCTS phases for language agents. Each state $s = [x, a_{1:i}, o_{1:i}]$ represents the input $x$, actions taken $a_{1:i}$, and observations received $o_{1:i}$.
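As a concrete sketch, such a state can be stored in a simple tree-node structure; the `Node` class and its field names below are illustrative, not prescribed by the paper:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """One search-tree node: state s = [x, a_1:i, o_1:i] plus MCTS statistics."""
    action: Optional[str] = None        # a_i that led to this node
    observation: Optional[str] = None   # o_i returned by the environment
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    value: float = 0.0                  # estimated value V(s)
    visits: int = 0                     # visit count N(s)

    def trajectory(self):
        """Reconstruct the (action, observation) history by walking to the root."""
        steps = []
        node = self
        while node is not None:
            if node.action is not None:
                steps.append((node.action, node.observation))
            node = node.parent
        return list(reversed(steps))
```

The input $x$ lives at the root; each child appends one action/observation pair, so a node's trajectory is recoverable from parent pointers alone.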
Starting from the root $s_0$, traverse the tree to a leaf node using the Upper Confidence Bound for Trees (UCT):
$$\text{UCT}(s) = V(s) + c \sqrt{\frac{\ln N(\text{parent}(s))}{N(s)}}$$
where $V(s)$ is the estimated value, $N(s)$ is the visit count, and $c$ is the exploration constant balancing exploitation vs. exploration.
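The selection rule can be computed directly from the formula; this minimal sketch treats an unvisited node as infinitely attractive, a common MCTS convention:

```python
import math

def uct(value_sum: float, visits: int, parent_visits: int, c: float = 1.0) -> float:
    """UCT(s) = V(s) + c * sqrt(ln N(parent(s)) / N(s))."""
    if visits == 0:
        return float("inf")     # always try unexplored children first
    exploit = value_sum / visits                       # empirical mean V(s)
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore
```

Among siblings, a rarely visited node earns a larger exploration bonus, so a lower-value but under-explored branch can still be selected.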
At the selected leaf, sample $n$ new child nodes by prompting the LLM to generate actions (e.g., ReAct reasoning steps or tool calls), adding them to the tree.
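A hedged sketch of this expansion step; the `llm` callable, its prompt format, and the `Node` interface are assumptions standing in for whatever action-generation API is available:

```python
def expand(node, llm, n=5):
    """Sample n candidate actions from the LLM policy and attach them as children.

    `llm` is any callable mapping a prompt string to one action string;
    sampling with temperature > 0 is what yields diverse children.
    """
    children = []
    for _ in range(n):
        action = llm(f"Trajectory so far:\n{node.trajectory()}\nNext action:")
        child = type(node)(action=action, parent=node)
        node.children.append(child)
        children.append(child)
    return children
```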
Assign scalar values to new nodes via two complementary LLM-based heuristics: a self-generated score, obtained by prompting the LLM to rate how promising the trajectory is, and a cross-trajectory consistency check that favors states whose evaluations agree across samples.
For failed trajectories, the LLM generates reflections stored in memory as semantic feedback for future in-context learning.
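The reflection step can be sketched as follows; the prompt wording, the `llm` callable, and the memory-trimming policy are illustrative assumptions:

```python
def reflect_on_failure(trajectory, llm, memory, max_kept=20):
    """Ask the LLM to summarize what went wrong and store it for later reuse.

    Reflections act as verbal "gradients": prepended to future prompts,
    they let subsequent rollouts avoid the same mistake in-context.
    """
    prompt = (
        f"The following trajectory failed:\n{trajectory}\n"
        "In one or two sentences, explain the mistake and how to avoid it:"
    )
    reflection = llm(prompt)
    memory.append(reflection)
    del memory[:-max_kept]   # keep only the most recent reflections
    return reflection
```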
Update values and visit counts along the path from the evaluated node back to the root:
$$V(s) \leftarrow \frac{1}{N(s)} \sum_{i=1}^{N(s)} v_i$$

where $v_i$ are the values observed on the visits through $s$.
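The update above is a running mean, so it can be applied incrementally while walking parent pointers back to the root; this sketch assumes nodes expose `visits`, `value`, and `parent` fields:

```python
def backpropagate(node, reward):
    """Update N(s) and the running mean V(s) from a leaf up to the root."""
    while node is not None:
        node.visits += 1
        # Incremental mean: V <- V + (v_i - V) / N, equivalent to averaging all v_i
        node.value += (reward - node.value) / node.visits
        node = node.parent
```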
```python
import math

class LATS:
    def __init__(self, llm, env, n_expansions=5, max_iterations=50,
                 exploration_constant=1.0):
        self.llm = llm
        self.env = env
        self.n = n_expansions
        self.max_iter = max_iterations
        self.c = exploration_constant
        self.reflections = []  # Memory for failed-trajectory reflections

    def search(self, task):
        root = Node(state=self.env.reset(task))
        for _ in range(self.max_iter):
            # Selection: traverse tree using UCT
            node = self._select(root)
            # Expansion: generate n child actions via LLM
            children = self._expand(node)
            # Evaluation: score children with LLM heuristics
            for child in children:
                obs = self.env.step(child.action)
                child.observation = obs
                if self.env.is_terminal(child):
                    if self.env.is_success(child):
                        return child.trajectory()
                    # Generate reflection on failure
                    reflection = self.llm.reflect(child.trajectory())
                    self.reflections.append(reflection)
                child.value = self._evaluate(child)
            # Backpropagation: update values up the tree
            for child in children:
                self._backpropagate(child)
        return self._best_trajectory(root)

    def _select(self, node):
        while node.children:
            node = max(node.children, key=lambda c: self._uct(c))
        return node

    def _uct(self, node):
        if node.visits == 0:
            return float('inf')
        exploit = node.value / node.visits
        explore = self.c * (math.log(node.parent.visits) / node.visits) ** 0.5
        return exploit + explore

    def _evaluate(self, node):
        prompt = (
            f"Trajectory so far:\n{node.trajectory()}\n"
            f"Past reflections:\n{self.reflections[-3:]}\n"
            f"Rate progress toward task completion (0-1):"
        )
        return float(self.llm.generate(prompt))
```
The LLM plays three unified roles:
| Role | Function | How |
|---|---|---|
| Policy | Generate candidate actions | In-context prompting conditioned on trajectory |
| Value Function | Evaluate state quality | Self-assessment + cross-trajectory consistency |
| Reflection | Learn from failures | Generate verbal “gradients” stored in memory |
This eliminates the need for separate trained components, enabling gradient-free learning from experience through in-context reflection.
LATS achieves state-of-the-art across diverse task types (GPT-4, no fine-tuning):
| Benchmark | Task Type | LATS | Previous SOTA | Improvement |
|---|---|---|---|---|
| HumanEval | Programming | 92.7% pass@1 | 86.5% (ToT) | +6.2% |
| WebShop | Web navigation | 75.9 avg score | Comparable to fine-tuned | Gradient-free |
| HotPotQA | Multi-hop QA | Best EM | RAP, ToT | Fewer nodes |
Efficiency gains are substantial: LATS uses 3.55x fewer nodes than RAP and 12.12x fewer nodes than ToT on HotPotQA.
| Method | Search | Environment Feedback | Reflection | Acting |
|---|---|---|---|---|
| ToT | BFS/DFS | No | No | No |
| RAP | MCTS-like | No | No | No |
| ReAct | None (single path) | Yes | No | Yes |
| Reflexion | None (retry) | Yes | Yes | Yes |
| LATS | MCTS | Yes | Yes | Yes |