Monte Carlo Tree Search (MCTS) applied to Large Language Model reasoning represents a fundamental paradigm shift from brute-force scaling toward algorithmic efficiency at inference time. By framing LLM reasoning as a tree search problem, MCTS enables structured exploration of solution paths, balancing exploitation of promising reasoning chains with exploration of novel approaches. This technique underpins the “o1-style” deliberative reasoning that has transformed LLM capabilities on complex tasks.
Wei et al. (2025) present a unified framework that deconstructs deliberative tree search into three core components: the Search Mechanism, Reward Formulation, and Transition Function. This formalism resolves a key ambiguity in the field – whether the reward signal serves as a transient heuristic for test-time scaling (TTS) or a durable learning target for self-improvement.
The survey establishes that deliberative search unifies two critical frontiers: test-time scaling, where the search procedure acts as a transient inference-time heuristic, and self-improvement, where search outcomes become durable training targets for the model itself.
MCTS adapts the classical game-playing algorithm to reasoning by modeling the solution process as a tree where:

* each node is a partial reasoning state (the problem statement plus the steps generated so far),
* each edge is a candidate reasoning step proposed by the LLM, and
* leaf nodes are complete reasoning chains that can be scored.
The four phases of MCTS operate as follows:
1. Selection: Starting from the root, traverse the tree by selecting child nodes according to the UCT (Upper Confidence Bound applied to Trees) policy until reaching a leaf or unexpanded node.
2. Expansion: Generate new child nodes by querying the LLM with temperature $T > 0$ to produce diverse candidate continuations. Typically 3-5 candidates are generated per expansion.
3. Simulation: Evaluate the expanded node either by completing the reasoning chain to a terminal state or by using a value network to estimate the expected outcome.
4. Backpropagation: Propagate the evaluation score upward through the tree, updating visit counts $N(s)$ and value estimates $Q(s)$ for all ancestors.
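The expansion step can be sketched as follows; `generate` is a hypothetical LLM call of the form `(prompt, temperature) -> str`, standing in for whatever sampling API is actually used:

```python
def expand_candidates(generate, state: str, k: int = 4, temperature: float = 0.8):
    """Sample k diverse continuations of a partial reasoning chain.

    `generate` is a hypothetical LLM interface: (prompt, temperature) -> str.
    Temperature > 0 makes repeated samples differ from one another.
    """
    candidates = []
    for _ in range(k):
        candidates.append(generate(state, temperature=temperature))
    # Deduplicate so the tree does not branch into identical children,
    # preserving the order in which candidates were sampled.
    return list(dict.fromkeys(candidates))
```

Deduplication matters in practice: at low temperature the model often repeats itself, and identical children waste the simulation budget.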
The UCT formula balances exploitation of known high-value nodes against exploration of less-visited nodes:
$$\text{UCT}(s) = \frac{Q(s)}{N(s)} + C \sqrt{\frac{\ln N(\text{parent}(s))}{N(s)}}$$
where:

* $Q(s)$ is the total accumulated value of node $s$,
* $N(s)$ is the number of times $s$ has been visited,
* $N(\text{parent}(s))$ is the visit count of its parent, and
* $C$ is the exploration constant, commonly set to $\sqrt{2}$.
The exploration term ensures that infrequently visited nodes receive a bonus proportional to $\sqrt{\ln N(\text{parent}(s)) / N(s)}$, which, in the classical UCT analysis, guarantees asymptotic convergence to the optimal action as the number of simulations grows.
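A quick numeric check of the formula (the visit counts and values below are illustrative):

```python
import math

def uct(total_value, visits, parent_visits, c=math.sqrt(2)):
    # Unvisited nodes get infinite priority so they are expanded first.
    if visits == 0:
        return float('inf')
    exploitation = total_value / visits                       # Q(s) / N(s)
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration

# A rarely visited node can outrank a sibling with a higher visit count:
a = uct(total_value=1.6, visits=2, parent_visits=10)   # Q/N = 0.8, large bonus
b = uct(total_value=4.0, visits=8, parent_visits=10)   # Q/N = 0.5, small bonus
```

Here node `a` wins despite having only a quarter of `b`'s visits, which is exactly the exploration pressure the bonus term is designed to provide.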
Several approaches exist for estimating node values in the LLM reasoning context:

* Rollout-based estimation: complete the partial chain to a terminal answer and score it against a verifier or known outcome.
* Learned value networks and process reward models (PRMs): a trained model scores intermediate reasoning states directly, avoiding full rollouts.
* LLM self-evaluation: prompt the model itself to critique or grade the partial chain.
The formal distinction between Search Guidance (transient heuristic for TTS) and Parametric Reward Modeling (durable learning target) is a key contribution of the unified framework.
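The two roles of the reward signal can be sketched side by side; `score` stands in for any value estimator (an assumed interface, here mocked with a toy function):

```python
def guide_search(score, frontier):
    """Transient use (Search Guidance): rank candidate states at
    test time, pick the best, and discard the scores afterwards."""
    return max(frontier, key=score)

def harvest_targets(score, visited_states):
    """Durable use (Parametric Reward Modeling): turn search-time
    evaluations into (state, value) pairs for later training."""
    return [(s, score(s)) for s in visited_states]

# The same scorer serves both roles; only the downstream use differs.
score = len  # toy stand-in: longer reasoning chains score higher
best = guide_search(score, ["a", "abc", "ab"])   # "abc"
pairs = harvest_targets(score, ["a", "ab"])      # [("a", 1), ("ab", 2)]
```

The point of the formalism is that nothing about the scorer itself changes between the two settings; what differs is whether its outputs are thrown away after selection or persisted as learning targets.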
OpenAI's o1 model performs internal chain-of-thought search at inference time; while its implementation is not public, its behavior closely mirrors MCTS principles. The test-time compute scaling law states that for a problem of difficulty $d$, the probability of a correct solution scales as:
$$P(\text{correct} | d) \propto 1 - e^{-\beta \cdot C_{\text{test}}(d)}$$
where $C_{\text{test}}(d)$ is the computational budget allocated at test time and $\beta$ is an efficiency parameter dependent on the search algorithm quality.
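The shape of this law is easy to see numerically; the $\beta$ value below is purely illustrative:

```python
import math

def p_correct(budget: float, beta: float = 0.05) -> float:
    # P(correct | d) = 1 - exp(-beta * C_test(d)); beta is an
    # illustrative efficiency parameter, not a measured constant.
    return 1.0 - math.exp(-beta * budget)

# Diminishing returns: each doubling of compute buys less than the last.
probs = {budget: round(p_correct(budget), 3) for budget in (10, 50, 100, 200)}
```

The exponential saturation means the curve rises steeply at first and then flattens, so the quality of the search algorithm ($\beta$) matters more than raw budget once the easy gains are exhausted.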
```python
import math
from dataclasses import dataclass, field

@dataclass
class MCTSNode:
    state: str
    parent: 'MCTSNode' = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

    def uct_score(self, exploration_constant: float = math.sqrt(2)) -> float:
        # Unvisited nodes get infinite priority so each child is tried once.
        if self.visits == 0:
            return float('inf')
        exploitation = self.value / self.visits
        exploration = exploration_constant * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )
        return exploitation + exploration

    def best_child(self, c: float = math.sqrt(2)) -> 'MCTSNode':
        return max(self.children, key=lambda n: n.uct_score(c))

def select(root: MCTSNode) -> MCTSNode:
    # Selection: descend via UCT until reaching an unexpanded node.
    node = root
    while node.children:
        node = node.best_child()
    return node

def expand(node: MCTSNode, llm) -> MCTSNode:
    # Expansion: `llm.propose(state)` is an assumed interface returning
    # several sampled continuations of the partial reasoning chain.
    for continuation in llm.propose(node.state):
        node.children.append(MCTSNode(state=continuation, parent=node))
    return node.children[0]

def simulate(node: MCTSNode, llm) -> float:
    # Simulation: `llm.evaluate(state)` is an assumed interface that rolls
    # out to a terminal answer (or applies a value model) and returns a score.
    return llm.evaluate(node.state)

def backpropagate(node: MCTSNode, reward: float) -> None:
    # Backpropagation: update visit counts and values for all ancestors.
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent

def mcts_reasoning(llm, root_state: str, num_simulations: int = 100) -> str:
    root = MCTSNode(state=root_state)
    for _ in range(num_simulations):
        node = select(root)
        child = expand(node, llm)
        reward = simulate(child, llm)
        backpropagate(child, reward)
    return root.best_child(c=0).state  # exploit only at final selection
```
MCTS-enhanced LLM reasoning has demonstrated significant gains: