Tree of Thoughts

Tree of Thoughts (ToT) is a reasoning framework that generalizes Chain-of-Thought prompting by allowing language models to explore multiple reasoning paths simultaneously rather than following a single linear chain. Proposed by Yao et al., 2023 in “Tree of Thoughts: Deliberate Problem Solving with Large Language Models” (NeurIPS 2023), ToT structures problem solving as search over a tree in which each node is a “thought”: a coherent unit of text representing intermediate progress toward the solution. The framework incorporates deliberate search algorithms such as breadth-first and depth-first search, enabling models to evaluate, backtrack from, and prune reasoning branches.

Architecture and Search Strategies

ToT operates through four key components:

1. Thought Decomposition: The problem is broken into intermediate thought steps, where each thought is a meaningful semantic unit (e.g., one equation step in a math puzzle, one line of a poem, one row of a crossword). The granularity of thoughts is task-dependent.

2. Thought Generation: From any given state (node in the tree), the LLM proposes multiple candidate next thoughts (typically 3-5). This can be done via independent sampling or sequential proposal, depending on the diversity needed.

3. State Evaluation: The LLM evaluates how promising each candidate thought is for reaching the solution. Yao et al., 2023 use two strategies: value, where the LLM scores each state independently (e.g., classifying it as sure/maybe/impossible), and vote, where the LLM compares several states side by side and votes for the most promising one.

4. Search Algorithm: Classical search algorithms navigate the tree: breadth-first search (BFS) keeps the b most promising states at each depth, while depth-first search (DFS) pursues the most promising branch first and backtracks when a state is judged infeasible.
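
The vote strategy from step 3 can be sketched independently of any API. In this sketch, `vote_fn` stands in for an LLM call that returns the index of the most promising state, and `stub_vote` is a hypothetical stand-in used only for illustration:

```python
from collections import Counter
from typing import Callable

def vote_evaluate(states: list[str], vote_fn: Callable[[list[str]], int],
                  n_votes: int = 5) -> list[int]:
    """Vote-based state evaluation: sample vote_fn n_votes times, where each
    call picks the index of the most promising state, then tally the votes."""
    tally = Counter(vote_fn(states) for _ in range(n_votes))
    return [tally[i] for i in range(len(states))]

# Hypothetical stub standing in for an LLM vote call: it prefers the state
# whose running arithmetic already equals 24.
def stub_vote(states: list[str]) -> int:
    for i, s in enumerate(states):
        if s.endswith("= 24"):
            return i
    return 0

votes = vote_evaluate(["4 * 6 = 24", "2 + 3 = 5"], stub_vote)
# votes → [5, 0]: the first state receives all five votes
```

Sampling the vote several times and tallying makes the evaluation more robust to the randomness of any single LLM response.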

Python Example

from openai import OpenAI
 
client = OpenAI()
 
def generate_thoughts(problem: str, current_state: str, n: int = 3) -> list[str]:
    """Generate n candidate next-thoughts from the current state."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": (
            f"Problem: {problem}\n"
            f"Progress so far: {current_state}\n"
            f"Propose {n} distinct next steps. Return each on a separate line."
        )}],
    )
    # Drop blank lines so empty strings never become candidate thoughts
    return [t.strip() for t in resp.choices[0].message.content.split("\n") if t.strip()]
 
def evaluate_thought(problem: str, state: str) -> str:
    """Evaluate whether a partial solution is sure/maybe/impossible."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": (
            f"Problem: {problem}\nCurrent state: {state}\n"
            "Rate this state as sure, maybe, or impossible. Reply with one word."
        )}],
    )
    return resp.choices[0].message.content.strip().lower()
 
def tree_of_thoughts_bfs(problem: str, steps: int = 3, beam_width: int = 2):
    """BFS-based Tree of Thoughts: expand, evaluate, keep top-b at each depth."""
    current_states = [""]
    for step in range(steps):
        candidates = []
        for state in current_states:
            thoughts = generate_thoughts(problem, state)
            for thought in thoughts:
                new_state = f"{state} -> {thought}" if state else thought
                score = evaluate_thought(problem, new_state)
                if score != "impossible":
                    candidates.append((new_state, score))
        # Keep top-b: prioritize "sure" over "maybe"
        candidates.sort(key=lambda x: (x[1] != "sure", x[1]))
        current_states = [c[0] for c in candidates[:beam_width]]
        print(f"Step {step + 1}: {len(candidates)} candidates, kept {len(current_states)}")
    return current_states[0] if current_states else None
 
# Example: solve a reasoning problem with ToT
solution = tree_of_thoughts_bfs("Use 2, 3, 4, 6 with +-*/ to make 24")
print(f"Best path: {solution}")
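
The BFS loop above keeps a beam of states per level; the paper's DFS variant instead commits to one branch at a time and backtracks from pruned states. A minimal sketch, with the generator and evaluator passed in as callables so the toy stubs below can stand in for the LLM calls:

```python
def tree_of_thoughts_dfs(problem, state, generate, evaluate,
                         depth=0, max_depth=3):
    """DFS-based ToT: follow one branch at a time; prune states judged
    impossible and backtrack when every child of a node is pruned."""
    if depth == max_depth:
        return state  # reached a full-depth solution path
    for thought in generate(problem, state):
        new_state = f"{state} -> {thought}" if state else thought
        if evaluate(problem, new_state) == "impossible":
            continue  # prune this branch
        result = tree_of_thoughts_dfs(problem, new_state, generate,
                                      evaluate, depth + 1, max_depth)
        if result is not None:
            return result
    return None  # dead end: backtrack to the parent state

# Toy stubs in place of the LLM-backed generate_thoughts/evaluate_thought:
def toy_generate(problem, state):
    return ["A", "B"]

def toy_evaluate(problem, state):
    # Prune every path that begins with "A" to force backtracking.
    return "impossible" if state.startswith("A") else "maybe"

print(tree_of_thoughts_dfs("toy problem", "", toy_generate, toy_evaluate,
                           max_depth=2))
# prints "B -> A": branch A is pruned at the root, so DFS backtracks to B
```

DFS uses less memory than BFS (one path in flight rather than a beam) but can spend longer inside an unpromising subtree before the evaluator prunes it.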

Thought Decomposition and Evaluation

The evaluation step is what distinguishes ToT from simple beam search over tokens. Rather than scoring at the token level, ToT evaluates at the thought level, meaning the LLM assesses whether a partial solution is on a viable path toward the goal. This mirrors how humans reason about problems: considering whether an approach seems promising before investing further effort.

For example, in the Game of 24 task (use four numbers with arithmetic operations to reach exactly 24), the evaluation prompt asks the LLM to assess whether the remaining numbers and intermediate result can still reach 24. The LLM serves as a flexible, general-purpose heuristic function that is more adaptable than hand-coded rules.
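
For Game of 24 the quantity this heuristic estimates is actually computable: a hypothetical brute-force checker (not part of the ToT framework itself) can decide exactly whether a set of remaining numbers is still viable, which is useful for sanity-checking the LLM's sure/maybe/impossible judgments:

```python
from itertools import permutations

def can_reach(nums: list[float], target: float = 24.0,
              eps: float = 1e-6) -> bool:
    """Exact check: can the remaining numbers reach the target with + - * /?
    This is the ground truth the LLM value prompt approximates."""
    if len(nums) == 1:
        return abs(nums[0] - target) < eps
    # Combine any ordered pair of numbers and recurse on the smaller set.
    for a, b in permutations(range(len(nums)), 2):
        rest = [nums[i] for i in range(len(nums)) if i not in (a, b)]
        x, y = nums[a], nums[b]
        candidates = [x + y, x - y, x * y]
        if abs(y) > eps:  # avoid division by zero
            candidates.append(x / y)
        if any(can_reach(rest + [r], target, eps) for r in candidates):
            return True
    return False

print(can_reach([2, 3, 4, 6]))  # True, e.g. (3 - 2) * 4 * 6 = 24
print(can_reach([1, 1, 1, 1]))  # False: no combination reaches 24
```

The exhaustive check is exponential in the number of values, which is affordable for four numbers but not in general; the appeal of the LLM heuristic is that the same prompt pattern transfers to tasks where no such exact checker exists.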

Comparison with Chain-of-Thought

Aspect         | Chain-of-Thought                   | Tree of Thoughts
Structure      | Single linear reasoning path       | Branching tree of multiple paths
Exploration    | No lookahead or backtracking       | Systematic BFS/DFS with pruning
Evaluation     | Implicit (via generation)          | Explicit LLM-based state evaluation
Error recovery | Cannot recover from early mistakes | Backtracks from dead-end branches
Compute cost   | Low (single forward pass)          | Higher (multiple generations + evaluations)

ToT is particularly valuable for tasks where the first reasoning step significantly constrains the solution space and where wrong early choices are difficult to recover from.
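
For contrast with the table above, the CoT baseline is a single generation with no evaluation or backtracking. A minimal sketch, assuming the same OpenAI client as in the ToT example (the model name is illustrative):

```python
def chain_of_thought(client, problem: str, model: str = "gpt-4o") -> str:
    """Single-path CoT baseline: one forward generation, no search."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": (
            f"{problem}\nThink step by step, then state the final answer."
        )}],
    )
    return resp.choices[0].message.content
```

Whatever reasoning this single pass commits to is final; ToT's extra cost buys exactly the ability to revisit that commitment.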

Practical Applications and Results

Yao et al., 2023 evaluated ToT with GPT-4 on three tasks: Game of 24, where ToT with breadth b=5 solved 74% of problems compared with 4% for Chain-of-Thought; Creative Writing, where ToT passages were judged more coherent than CoT passages in both GPT-4 and human evaluations; and 5x5 Mini Crosswords, where DFS-based ToT substantially outperformed CoT on word-level success.

Implementations: The official Princeton NLP repository (princeton-nlp/tree-of-thought-llm) provides Python code supporting custom tasks with BFS/DFS. Long (2023) extended ToT with an RL-trained “ToT Controller” that dynamically decides when to backtrack, replacing the fixed search strategies. More recent work (2024-2025) has focused on reducing ToT's computational overhead through more efficient evaluation, hybrid search strategies, and integration with multi-agent systems.
