
MCTS for LLM Reasoning

Monte Carlo Tree Search (MCTS) applied to Large Language Model reasoning represents a fundamental paradigm shift from brute-force scaling toward algorithmic efficiency at inference time. By framing LLM reasoning as a tree search problem, MCTS enables structured exploration of solution paths, balancing exploitation of promising reasoning chains with exploration of novel approaches. This technique underpins the “o1-style” deliberative reasoning that has transformed LLM capabilities on complex tasks.

```mermaid
graph TD
    A[Root Problem] --> B[Select via UCT]
    B --> C[Expand Node]
    C --> D[Evaluate with LLM]
    D --> E[Backpropagate Score]
    E --> F{Converged?}
    F -->|No| B
    F -->|Yes| G[Best Reasoning Path]
```

Background

Wei et al. (2025) present a unified framework that deconstructs deliberative tree search into three core components: the Search Mechanism, Reward Formulation, and Transition Function. This formalism resolves a key ambiguity in the field – whether the reward signal serves as a transient heuristic for test-time scaling (TTS) or a durable learning target for self-improvement.

The survey establishes that deliberative search unifies two critical frontiers:

- test-time scaling (TTS), where the reward signal acts as a transient inference-time heuristic
- self-improvement, where search-derived rewards become durable learning targets

The MCTS Framework for LLMs

MCTS adapts the classical game-playing algorithm to reasoning by modeling the solution process as a tree where:

- nodes represent partial reasoning states (the problem plus the steps taken so far)
- edges represent candidate next steps proposed by the LLM
- terminal nodes represent complete solutions that can be scored

The four phases of MCTS operate as follows:

Selection

Starting from the root, traverse the tree by selecting child nodes according to the UCT (Upper Confidence Bound applied to Trees) policy until reaching a leaf or unexpanded node.

Expansion

Generate new child nodes by querying the LLM with temperature $T > 0$ to produce diverse candidate continuations. Typically 3-5 candidates are generated per expansion.
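This expansion step can be sketched as follows; `llm_sample` and `toy_sampler` are hypothetical stand-ins for a real sampling backend:

```python
import random

def expand(state: str, llm_sample, num_candidates: int = 4,
           temperature: float = 0.8) -> list:
    # Sample several diverse continuations of the current reasoning state.
    # `llm_sample(state, temperature)` is an assumed interface, not a real API.
    return [llm_sample(state, temperature) for _ in range(num_candidates)]

def toy_sampler(state: str, temperature: float) -> str:
    # Stub standing in for an LLM call sampled at temperature > 0.
    step = random.choice(["factor", "substitute", "check a case"])
    return f"{state} -> {step}"

children = expand("Solve x^2 - 5x + 6 = 0", toy_sampler)
print(len(children))  # 4 candidate continuations
```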

Simulation (Rollout)

Evaluate the expanded node either by completing the reasoning chain to a terminal state or by using a value network to estimate the expected outcome.

Backpropagation

Propagate the evaluation score upward through the tree, updating visit counts $N(s)$ and value estimates $Q(s)$ for all ancestors.
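The update loop itself is short; a minimal sketch with a simplified node type:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    visits: int = 0
    value: float = 0.0
    parent: Optional["Node"] = None

def backpropagate(node: Optional[Node], reward: float) -> None:
    # Walk from the evaluated node up to the root, updating N(s) and Q(s).
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent

root = Node()
leaf = Node(parent=Node(parent=root))
backpropagate(leaf, 1.0)  # one successful rollout
backpropagate(leaf, 0.0)  # one failed rollout
print(root.visits, root.value)  # 2 1.0
```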

The UCT Formula

The UCT formula balances exploitation of known high-value nodes against exploration of less-visited nodes:

$$\text{UCT}(s) = \frac{Q(s)}{N(s)} + C \sqrt{\frac{\ln N(\text{parent}(s))}{N(s)}}$$

where:

- $Q(s)$ is the cumulative value accumulated at node $s$
- $N(s)$ is the visit count of node $s$
- $C$ is the exploration constant (commonly $\sqrt{2}$)

The exploration term ensures that infrequently visited nodes receive a bonus proportional to $\sqrt{\ln N(\text{parent}(s)) / N(s)}$, guaranteeing asymptotic convergence to optimal play.
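A quick numeric check (the values are illustrative): a rarely visited child can outrank a higher-value, heavily visited sibling because of the exploration bonus:

```python
import math

def uct(q: float, n: int, n_parent: int, c: float = math.sqrt(2)) -> float:
    # Mean value plus exploration bonus, as in the UCT formula.
    return q / n + c * math.sqrt(math.log(n_parent) / n)

# Parent visited 100 times.
a = uct(q=60.0, n=80, n_parent=100)  # strong mean value, well explored
b = uct(q=3.0, n=5, n_parent=100)    # weak mean value, barely explored
print(round(a, 3), round(b, 3))  # 1.089 1.957 -> B is selected next
```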

Value Estimation Methods

Several approaches exist for estimating node values in the LLM reasoning context:

- full rollouts, which complete the reasoning chain to a terminal answer and score it directly
- learned value networks, which estimate the expected outcome from a partial reasoning state
- reward models, which score intermediate steps or final answers with a separately trained model

The formal distinction between Search Guidance (transient heuristic for TTS) and Parametric Reward Modeling (durable learning target) is a key contribution of the unified framework.

Connection to o1-Style Reasoning

OpenAI's o1 model uses internal chain-of-thought search that closely mirrors MCTS principles. The test-time compute scaling law states that for a problem of difficulty $d$, the probability of correct solution scales as:

$$P(\text{correct} | d) \propto 1 - e^{-\beta \cdot C_{\text{test}}(d)}$$

where $C_{\text{test}}(d)$ is the computational budget allocated at test time and $\beta$ is an efficiency parameter dependent on the search algorithm quality.
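As a toy illustration of this relationship (the value of $\beta$ here is arbitrary, not from the source):

```python
import math

def p_correct(budget: float, beta: float = 0.05) -> float:
    # P(correct | d) = 1 - exp(-beta * C_test(d)); beta is illustrative.
    return 1.0 - math.exp(-beta * budget)

# Diminishing returns: success probability saturates as budget grows.
for budget in (10, 50, 100):
    print(budget, round(p_correct(budget), 3))  # 0.393, 0.918, 0.993
```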

Code Example

A self-contained sketch of the full loop (the `llm` object is assumed to expose `propose(state, n)`, returning `n` candidate continuations, and `score(state)`, returning a scalar reward; both interfaces are hypothetical):

```python
import math
import random
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MCTSNode:
    state: str
    parent: Optional['MCTSNode'] = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

    def uct_score(self, exploration_constant: float = math.sqrt(2)) -> float:
        if self.visits == 0:
            return float('inf')  # force every child to be tried once
        exploitation = self.value / self.visits
        exploration = exploration_constant * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )
        return exploitation + exploration

    def best_child(self, c: float = math.sqrt(2)) -> 'MCTSNode':
        return max(self.children, key=lambda n: n.uct_score(c))

def select(node: MCTSNode) -> MCTSNode:
    # Selection: descend via UCT until reaching an unexpanded node.
    while node.children:
        node = node.best_child()
    return node

def expand(node: MCTSNode, llm, num_candidates: int = 4) -> MCTSNode:
    # Expansion: sample diverse continuations with temperature > 0.
    for continuation in llm.propose(node.state, n=num_candidates):
        node.children.append(MCTSNode(state=continuation, parent=node))
    return random.choice(node.children)

def simulate(node: MCTSNode, llm) -> float:
    # Simulation: roll out to an answer (or query a value model) and score it.
    return llm.score(node.state)

def backpropagate(node: Optional[MCTSNode], reward: float) -> None:
    # Backpropagation: update visit counts and values up to the root.
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent

def mcts_reasoning(llm, root_state: str, num_simulations: int = 100) -> str:
    root = MCTSNode(state=root_state)
    for _ in range(num_simulations):
        node = select(root)
        child = expand(node, llm)
        reward = simulate(child, llm)
        backpropagate(child, reward)
    # Final choice: take the most-visited child rather than reusing UCT,
    # so unvisited children (scored inf) cannot win.
    return max(root.children, key=lambda n: n.visits).state
```

Results

MCTS-enhanced LLM reasoning has demonstrated significant accuracy gains over single-pass decoding, particularly on multi-step math and code tasks where intermediate errors compound.
