Tree of Thoughts

Tree of Thoughts (ToT) is a reasoning framework that generalizes Chain-of-Thought prompting by allowing language models to explore multiple reasoning paths simultaneously rather than following a single linear chain. Proposed by Yao et al., 2023 in “Tree of Thoughts: Deliberate Problem Solving with Large Language Models” (NeurIPS 2023), ToT structures problem solving as search over a tree in which each node is a “thought”: a coherent unit of text representing intermediate progress toward the solution. The framework incorporates deliberate search algorithms such as breadth-first and depth-first search, enabling models to evaluate, backtrack from, and prune reasoning branches.

Architecture and Search Strategies

ToT operates through four key components:

1. Thought Decomposition: The problem is broken into intermediate thought steps, where each thought is a meaningful semantic unit (e.g., one equation step in a math puzzle, one line of a poem, one row of a crossword). The granularity of thoughts is task-dependent.

2. Thought Generation: From any given state (node in the tree), the LLM proposes multiple candidate next thoughts (typically 3-5). This can be done via independent sampling or sequential proposal, depending on the diversity needed.

3. State Evaluation: The LLM evaluates how promising each candidate thought is for reaching the solution. Yao et al., 2023 use two strategies: value, where the LLM scores each state independently (e.g., classifying it as sure/maybe/impossible), and vote, where the LLM compares several states side by side and votes for the most promising one.

4. Search Algorithm: Classical search algorithms navigate the tree: breadth-first search (BFS) keeps the b most promising states at each depth, while depth-first search (DFS) pursues the most promising branch first and backtracks when a state is judged infeasible.
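
The vote strategy from step 3 can be sketched independently of any API. In this sketch, `vote_fn` stands in for an LLM call that returns the index of the most promising state, and `stub_vote` is a hypothetical stand-in used only for illustration:

```python
from collections import Counter
from typing import Callable

def vote_evaluate(states: list[str], vote_fn: Callable[[list[str]], int],
                  n_votes: int = 5) -> list[int]:
    """Vote-based state evaluation: sample vote_fn n_votes times, where each
    call picks the index of the most promising state, then tally the votes."""
    tally = Counter(vote_fn(states) for _ in range(n_votes))
    return [tally[i] for i in range(len(states))]

# Hypothetical stub standing in for an LLM vote call: it prefers the state
# whose running arithmetic already equals 24.
def stub_vote(states: list[str]) -> int:
    for i, s in enumerate(states):
        if s.endswith("= 24"):
            return i
    return 0

votes = vote_evaluate(["4 * 6 = 24", "2 + 3 = 5"], stub_vote)
# votes → [5, 0]: the first state receives all five votes
```

Sampling the vote several times and tallying makes the evaluation more robust to the randomness of any single LLM response.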

Python Example

from openai import OpenAI
 
client = OpenAI()
 
def generate_thoughts(problem: str, current_state: str, n: int = 3) -> list[str]:
    """Generate n candidate next-thoughts from the current state."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": (
            f"Problem: {problem}\n"
            f"Progress so far: {current_state}\n"
            f"Propose {n} distinct next steps. Return each on a separate line."
        )}],
    )
    # Drop blank lines so empty strings never become candidate thoughts
    return [t.strip() for t in resp.choices[0].message.content.split("\n") if t.strip()]
 
def evaluate_thought(problem: str, state: str) -> str:
    """Evaluate whether a partial solution is sure/maybe/impossible."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": (
            f"Problem: {problem}\nCurrent state: {state}\n"
            "Rate this state as sure, maybe, or impossible. Reply with one word."
        )}],
    )
    return resp.choices[0].message.content.strip().lower()
 
def tree_of_thoughts_bfs(problem: str, steps: int = 3, beam_width: int = 2):
    """BFS-based Tree of Thoughts: expand, evaluate, keep top-b at each depth."""
    current_states = [""]
    for step in range(steps):
        candidates = []
        for state in current_states:
            thoughts = generate_thoughts(problem, state)
            for thought in thoughts:
                new_state = f"{state} -> {thought}" if state else thought
                score = evaluate_thought(problem, new_state)
                if score != "impossible":
                    candidates.append((new_state, score))
        # Keep top-b: prioritize "sure" over "maybe"
        candidates.sort(key=lambda x: (x[1] != "sure", x[1]))
        current_states = [c[0] for c in candidates[:beam_width]]
        print(f"Step {step + 1}: {len(candidates)} candidates, kept {len(current_states)}")
    return current_states[0] if current_states else None
 
# Example: solve a reasoning problem with ToT
solution = tree_of_thoughts_bfs("Use 2, 3, 4, 6 with +-*/ to make 24")
print(f"Best path: {solution}")
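
The BFS loop above keeps a beam of states per level; the paper's DFS variant instead commits to one branch at a time and backtracks from pruned states. A minimal sketch, with the generator and evaluator passed in as callables so the toy stubs below can stand in for the LLM calls:

```python
def tree_of_thoughts_dfs(problem, state, generate, evaluate,
                         depth=0, max_depth=3):
    """DFS-based ToT: follow one branch at a time; prune states judged
    impossible and backtrack when every child of a node is pruned."""
    if depth == max_depth:
        return state  # reached a full-depth solution path
    for thought in generate(problem, state):
        new_state = f"{state} -> {thought}" if state else thought
        if evaluate(problem, new_state) == "impossible":
            continue  # prune this branch
        result = tree_of_thoughts_dfs(problem, new_state, generate,
                                      evaluate, depth + 1, max_depth)
        if result is not None:
            return result
    return None  # dead end: backtrack to the parent state

# Toy stubs in place of the LLM-backed generate_thoughts/evaluate_thought:
def toy_generate(problem, state):
    return ["A", "B"]

def toy_evaluate(problem, state):
    # Prune every path that begins with "A" to force backtracking.
    return "impossible" if state.startswith("A") else "maybe"

print(tree_of_thoughts_dfs("toy problem", "", toy_generate, toy_evaluate,
                           max_depth=2))
# prints "B -> A": branch A is pruned at the root, so DFS backtracks to B
```

DFS uses less memory than BFS (one path in flight rather than a beam) but can spend longer inside an unpromising subtree before the evaluator prunes it.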

Thought Decomposition and Evaluation

The evaluation step is what distinguishes ToT from simple beam search over tokens. Rather than scoring at the token level, ToT evaluates at the thought level, meaning the LLM assesses whether a partial solution is on a viable path toward the goal. This mirrors how humans reason about problems: considering whether an approach seems promising before investing further effort.

For example, in the Game of 24 task (use four numbers with arithmetic operations to reach exactly 24), the evaluation prompt asks the LLM to assess whether the remaining numbers and intermediate result can still reach 24. The LLM serves as a flexible, general-purpose heuristic function that is more adaptable than hand-coded rules.
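
For Game of 24 the quantity this heuristic estimates is actually computable: a hypothetical brute-force checker (not part of the ToT framework itself) can decide exactly whether a set of remaining numbers is still viable, which is useful for sanity-checking the LLM's sure/maybe/impossible judgments:

```python
from itertools import permutations

def can_reach(nums: list[float], target: float = 24.0,
              eps: float = 1e-6) -> bool:
    """Exact check: can the remaining numbers reach the target with + - * /?
    This is the ground truth the LLM value prompt approximates."""
    if len(nums) == 1:
        return abs(nums[0] - target) < eps
    # Combine any ordered pair of numbers and recurse on the smaller set.
    for a, b in permutations(range(len(nums)), 2):
        rest = [nums[i] for i in range(len(nums)) if i not in (a, b)]
        x, y = nums[a], nums[b]
        candidates = [x + y, x - y, x * y]
        if abs(y) > eps:  # avoid division by zero
            candidates.append(x / y)
        if any(can_reach(rest + [r], target, eps) for r in candidates):
            return True
    return False

print(can_reach([2, 3, 4, 6]))  # True, e.g. (3 - 2) * 4 * 6 = 24
print(can_reach([1, 1, 1, 1]))  # False: no combination reaches 24
```

The exhaustive check is exponential in the number of values, which is affordable for four numbers but not in general; the appeal of the LLM heuristic is that the same prompt pattern transfers to tasks where no such exact checker exists.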

Comparison with Chain-of-Thought

Aspect         | Chain-of-Thought                   | Tree of Thoughts
Structure      | Single linear reasoning path       | Branching tree of multiple paths
Exploration    | No lookahead or backtracking       | Systematic BFS/DFS with pruning
Evaluation     | Implicit (via generation)          | Explicit LLM-based state evaluation
Error recovery | Cannot recover from early mistakes | Backtracks from dead-end branches
Compute cost   | Low (single forward pass)          | Higher (multiple generations + evaluations)

ToT is particularly valuable for tasks where the first reasoning step significantly constrains the solution space and where wrong early choices are difficult to recover from.
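
For contrast with the table above, the CoT baseline is a single generation with no evaluation or backtracking. A minimal sketch, assuming the same OpenAI client as in the ToT example (the model name is illustrative):

```python
def chain_of_thought(client, problem: str, model: str = "gpt-4o") -> str:
    """Single-path CoT baseline: one forward generation, no search."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": (
            f"{problem}\nThink step by step, then state the final answer."
        )}],
    )
    return resp.choices[0].message.content
```

Whatever reasoning this single pass commits to is final; ToT's extra cost buys exactly the ability to revisit that commitment.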

Practical Applications and Results

Yao et al., 2023 evaluated ToT with GPT-4 on three tasks: Game of 24, where ToT with breadth b=5 solved 74% of problems compared with 4% for Chain-of-Thought; Creative Writing, where ToT passages were judged more coherent than CoT passages in both GPT-4 and human evaluations; and 5x5 Mini Crosswords, where DFS-based ToT substantially outperformed CoT on word-level success.

Implementations: The official Princeton NLP repository (princeton-nlp/tree-of-thought-llm) provides Python code supporting custom tasks with BFS/DFS. Long (2023) extended ToT with an RL-trained “ToT Controller” that dynamically decides when to backtrack, replacing the fixed search strategies. More recent work (2024-2025) has focused on reducing ToT's computational overhead through more efficient evaluation, hybrid search strategies, and integration with multi-agent systems.
