Monte Carlo Tree Search (MCTS) applied to Large Language Model reasoning represents a fundamental paradigm shift from brute-force scaling toward algorithmic efficiency at inference time. By framing LLM reasoning as a tree search problem, MCTS enables structured exploration of solution paths, balancing exploitation of promising reasoning chains with exploration of novel approaches. This technique underpins the “o1-style” deliberative reasoning that has transformed LLM capabilities on complex tasks.
Wei et al. (2025) present a unified framework that deconstructs deliberative tree search into three core components: the Search Mechanism, Reward Formulation, and Transition Function. This formalism resolves a key ambiguity in the field – whether the reward signal serves as a transient heuristic for test-time scaling (TTS) or a durable learning target for self-improvement.
The survey establishes that deliberative search unifies two critical frontiers: test-time scaling, where the search procedure acts as a transient inference-time heuristic, and self-improvement, where search outcomes become durable training targets for the model itself.
MCTS adapts the classical game-playing algorithm to reasoning by modeling the solution process as a tree where:

* each node is a partial reasoning state (the problem statement plus the steps generated so far),
* each edge is a candidate reasoning step proposed by the LLM, and
* leaf nodes are complete reasoning chains that can be scored.
The four phases of MCTS operate as follows:
1. Selection: Starting from the root, traverse the tree by selecting child nodes according to the UCT (Upper Confidence Bound applied to Trees) policy until reaching a leaf or unexpanded node.
2. Expansion: Generate new child nodes by querying the LLM with temperature $T > 0$ to produce diverse candidate continuations. Typically 3-5 candidates are generated per expansion.
3. Simulation: Evaluate the expanded node either by completing the reasoning chain to a terminal state or by using a value network to estimate the expected outcome.
4. Backpropagation: Propagate the evaluation score upward through the tree, updating visit counts $N(s)$ and value estimates $Q(s)$ for all ancestors.
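The expansion step can be sketched as follows; `generate` is a hypothetical LLM call of the form `(prompt, temperature) -> str`, standing in for whatever sampling API is actually used:

```python
def expand_candidates(generate, state: str, k: int = 4, temperature: float = 0.8):
    """Sample k diverse continuations of a partial reasoning chain.

    `generate` is a hypothetical LLM interface: (prompt, temperature) -> str.
    Temperature > 0 makes repeated samples differ from one another.
    """
    candidates = []
    for _ in range(k):
        candidates.append(generate(state, temperature=temperature))
    # Deduplicate so the tree does not branch into identical children,
    # preserving the order in which candidates were sampled.
    return list(dict.fromkeys(candidates))
```

Deduplication matters in practice: at low temperature the model often repeats itself, and identical children waste the simulation budget.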
The UCT formula balances exploitation of known high-value nodes against exploration of less-visited nodes:
$$\text{UCT}(s) = \frac{Q(s)}{N(s)} + C \sqrt{\frac{\ln N(\text{parent}(s))}{N(s)}}$$
where:

* $Q(s)$ is the total accumulated value of node $s$,
* $N(s)$ is the number of times $s$ has been visited,
* $N(\text{parent}(s))$ is the visit count of its parent, and
* $C$ is the exploration constant, commonly set to $\sqrt{2}$.
The exploration term ensures that infrequently visited nodes receive a bonus proportional to $\sqrt{\ln N(\text{parent}(s)) / N(s)}$, which, in the classical UCT analysis, guarantees asymptotic convergence to the optimal action as the number of simulations grows.
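A quick numeric check of the formula (the visit counts and values below are illustrative):

```python
import math

def uct(total_value, visits, parent_visits, c=math.sqrt(2)):
    # Unvisited nodes get infinite priority so they are expanded first.
    if visits == 0:
        return float('inf')
    exploitation = total_value / visits                       # Q(s) / N(s)
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration

# A rarely visited node can outrank a sibling with a higher visit count:
a = uct(total_value=1.6, visits=2, parent_visits=10)   # Q/N = 0.8, large bonus
b = uct(total_value=4.0, visits=8, parent_visits=10)   # Q/N = 0.5, small bonus
```

Here node `a` wins despite having only a quarter of `b`'s visits, which is exactly the exploration pressure the bonus term is designed to provide.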
Several approaches exist for estimating node values in the LLM reasoning context:

* Rollout-based estimation: complete the partial chain to a terminal answer and score it against a verifier or known outcome.
* Learned value networks and process reward models (PRMs): a trained model scores intermediate reasoning states directly, avoiding full rollouts.
* LLM self-evaluation: prompt the model itself to critique or grade the partial chain.
The formal distinction between Search Guidance (transient heuristic for TTS) and Parametric Reward Modeling (durable learning target) is a key contribution of the unified framework.
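The two roles of the reward signal can be sketched side by side; `score` stands in for any value estimator (an assumed interface, here mocked with a toy function):

```python
def guide_search(score, frontier):
    """Transient use (Search Guidance): rank candidate states at
    test time, pick the best, and discard the scores afterwards."""
    return max(frontier, key=score)

def harvest_targets(score, visited_states):
    """Durable use (Parametric Reward Modeling): turn search-time
    evaluations into (state, value) pairs for later training."""
    return [(s, score(s)) for s in visited_states]

# The same scorer serves both roles; only the downstream use differs.
score = len  # toy stand-in: longer reasoning chains score higher
best = guide_search(score, ["a", "abc", "ab"])   # "abc"
pairs = harvest_targets(score, ["a", "ab"])      # [("a", 1), ("ab", 2)]
```

The point of the formalism is that nothing about the scorer itself changes between the two settings; what differs is whether its outputs are thrown away after selection or persisted as learning targets.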
OpenAI's o1 model performs internal chain-of-thought search at inference time; while its implementation is not public, its behavior closely mirrors MCTS principles. The test-time compute scaling law states that for a problem of difficulty $d$, the probability of a correct solution scales as:
$$P(\text{correct} | d) \propto 1 - e^{-\beta \cdot C_{\text{test}}(d)}$$
where $C_{\text{test}}(d)$ is the computational budget allocated at test time and $\beta$ is an efficiency parameter dependent on the search algorithm quality.
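The shape of this law is easy to see numerically; the $\beta$ value below is purely illustrative:

```python
import math

def p_correct(budget: float, beta: float = 0.05) -> float:
    # P(correct | d) = 1 - exp(-beta * C_test(d)); beta is an
    # illustrative efficiency parameter, not a measured constant.
    return 1.0 - math.exp(-beta * budget)

# Diminishing returns: each doubling of compute buys less than the last.
probs = {budget: round(p_correct(budget), 3) for budget in (10, 50, 100, 200)}
```

The exponential saturation means the curve rises steeply at first and then flattens, so the quality of the search algorithm ($\beta$) matters more than raw budget once the easy gains are exhausted.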
```python
import math
from dataclasses import dataclass, field

@dataclass
class MCTSNode:
    state: str
    parent: 'MCTSNode' = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

    def uct_score(self, exploration_constant: float = math.sqrt(2)) -> float:
        # Unvisited nodes get infinite priority so each child is tried once.
        if self.visits == 0:
            return float('inf')
        exploitation = self.value / self.visits
        exploration = exploration_constant * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )
        return exploitation + exploration

    def best_child(self, c: float = math.sqrt(2)) -> 'MCTSNode':
        return max(self.children, key=lambda n: n.uct_score(c))

def select(root: MCTSNode) -> MCTSNode:
    # Selection: descend via UCT until reaching an unexpanded node.
    node = root
    while node.children:
        node = node.best_child()
    return node

def expand(node: MCTSNode, llm) -> MCTSNode:
    # Expansion: `llm.propose(state)` is an assumed interface returning
    # several sampled continuations of the partial reasoning chain.
    for continuation in llm.propose(node.state):
        node.children.append(MCTSNode(state=continuation, parent=node))
    return node.children[0]

def simulate(node: MCTSNode, llm) -> float:
    # Simulation: `llm.evaluate(state)` is an assumed interface that rolls
    # out to a terminal answer (or applies a value model) and returns a score.
    return llm.evaluate(node.state)

def backpropagate(node: MCTSNode, reward: float) -> None:
    # Backpropagation: update visit counts and values for all ancestors.
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent

def mcts_reasoning(llm, root_state: str, num_simulations: int = 100) -> str:
    root = MCTSNode(state=root_state)
    for _ in range(num_simulations):
        node = select(root)
        child = expand(node, llm)
        reward = simulate(child, llm)
        backpropagate(child, reward)
    return root.best_child(c=0).state  # exploit only at final selection
```
MCTS-enhanced LLM reasoning has demonstrated significant gains: