Language Agent Tree Search

Language Agent Tree Search (LATS) is a general framework introduced by Zhou et al. (2023) that unifies reasoning, acting, and planning by combining Monte Carlo Tree Search (MCTS) with LLM capabilities. The LLM serves simultaneously as the policy (action generator), value function (state evaluator), and reflection mechanism, enabling systematic exploration of decision trees with environment feedback.

Motivation

Existing approaches treat reasoning (CoT, ToT) and acting (ReAct) as separate capabilities. Tree-of-Thoughts explores reasoning paths but lacks grounding in environment feedback. ReAct interacts with environments but follows a single trajectory without systematic exploration. LATS bridges this gap by embedding ReAct-style action trajectories within an MCTS search tree, combining the exploration benefits of tree search with the grounding of environment interaction.

MCTS Phases Adapted for LLMs

LATS adapts the four classical MCTS phases for language agents. Each state $s = [x, a_{1:i}, o_{1:i}]$ represents the input $x$, actions taken $a_{1:i}$, and observations received $o_{1:i}$.

Selection

Starting from the root $s_0$, traverse the tree to a leaf node using the Upper Confidence Bound for Trees (UCT):

$$\text{UCT}(s) = V(s) + c \sqrt{\frac{\ln N(\text{parent}(s))}{N(s)}}$$

where $V(s)$ is the estimated value, $N(s)$ is the visit count, and $c$ is the exploration constant balancing exploitation vs. exploration.

Expansion

At the selected leaf, sample $n$ new child nodes by prompting the LLM to generate actions (e.g., ReAct reasoning steps or tool calls), adding them to the tree.

Evaluation

Assign scalar values to new nodes via two complementary LLM-based heuristics:

Progress score — LLM self-evaluates how close the current state is to task completion
Self-consistency score — Agreement across similar trajectories in the tree

For failed trajectories, the LLM generates reflections stored in memory as semantic feedback for future in-context learning.

Backpropagation

Update values and visit counts along the path from the evaluated node back to the root:

$$V(s) \leftarrow \frac{1}{N(s)} \sum_{\text{visits}} v_i$$

class LATS:
    def __init__(self, llm, env, n_expansions=5, max_iterations=50,
                 exploration_constant=1.0):
        self.llm = llm
        self.env = env
        self.n = n_expansions
        self.max_iter = max_iterations
        self.c = exploration_constant
        self.reflections = []  # Memory for failed trajectory reflections
 
    def search(self, task):
        root = Node(state=self.env.reset(task))
 
        for _ in range(self.max_iter):
            # Selection: traverse tree using UCT
            node = self._select(root)
 
            # Expansion: generate n child actions via LLM
            children = self._expand(node)
 
            # Evaluation: score children with LLM heuristics
            for child in children:
                obs = self.env.step(child.action)
                child.observation = obs
 
                if self.env.is_terminal(child):
                    if self.env.is_success(child):
                        return child.trajectory()
                    # Generate reflection on failure
                    reflection = self.llm.reflect(child.trajectory())
                    self.reflections.append(reflection)
 
                child.value = self._evaluate(child)
 
            # Backpropagation: update values up the tree
            for child in children:
                self._backpropagate(child)
 
        return self._best_trajectory(root)
 
    def _select(self, node):
        while node.children:
            node = max(node.children, key=lambda c: self._uct(c))
        return node
 
    def _uct(self, node):
        if node.visits == 0:
            return float('inf')
        exploit = node.value / node.visits
        explore = self.c * (math.log(node.parent.visits) / node.visits) ** 0.5
        return exploit + explore
 
    def _evaluate(self, node):
        prompt = (
            f"Trajectory so far:\n{node.trajectory()}\n"
            f"Past reflections:\n{self.reflections[-3:]}\n"
            f"Rate progress toward task completion (0-1):"
        )
        return float(self.llm.generate(prompt))

Key Architectural Innovation

The LLM plays three unified roles:

Role	Function	How
Policy	Generate candidate actions	In-context prompting conditioned on trajectory
Value Function	Evaluate state quality	Self-assessment + cross-trajectory consistency
Reflection	Learn from failures	Generate verbal “gradients” stored in memory

This eliminates the need for separate trained components, enabling gradient-free learning from experience through in-context reflection.

Key Results

LATS achieves state-of-the-art across diverse task types (GPT-4, no fine-tuning):

Benchmark	Task Type	LATS	Previous SOTA	Improvement
HumanEval	Programming	92.7% pass@1	86.5% (ToT)	+6.2%
WebShop	Web navigation	75.9 avg score	Comparable to fine-tuned	Gradient-free
HotPotQA	Multi-hop QA	Best EM	RAP, ToT	Fewer nodes

Efficiency gains are substantial: LATS uses 3.55x fewer nodes than RAP and 12.12x fewer nodes than ToT on HotPotQA.

Comparison with Related Methods

Method	Search	Environment Feedback	Reflection	Acting
ToT	BFS/DFS	No	No	No
RAP	MCTS-like	No	No	No
ReAct	None (single path)	Yes	No	Yes
Reflexion	None (retry)	Yes	Yes	Yes
LATS	MCTS	Yes	Yes	Yes

AI Agent Knowledge Base

Sidebar

Table of Contents

Language Agent Tree Search

Motivation

MCTS Phases Adapted for LLMs

Selection

Expansion

Evaluation

Backpropagation

Key Architectural Innovation

Key Results

Comparison with Related Methods

References

See Also

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

Language Agent Tree Search

Motivation

MCTS Phases Adapted for LLMs

Selection

Expansion

Evaluation

Backpropagation

Key Architectural Innovation

Key Results

Comparison with Related Methods

References

See Also

Page Tools