====== MCTS for LLM Reasoning ======
Monte Carlo Tree Search (MCTS) applied to Large Language Model reasoning represents a fundamental paradigm shift from brute-force scaling toward algorithmic efficiency at inference time. By framing LLM reasoning as a tree search problem, MCTS enables structured exploration of solution paths, balancing exploitation of promising reasoning chains with exploration of novel approaches. This technique underpins the "o1-style" deliberative reasoning that has transformed LLM capabilities on complex tasks.
graph TD
A[Root Problem] --> B[Select via UCT]
B --> C[Expand Node]
C --> D[Evaluate with LLM]
D --> E[Backpropagate Score]
E --> F{Converged?}
F -->|No| B
F -->|Yes| G[Best Reasoning Path]
===== Background =====
Wei et al. (2025) present a unified framework that deconstructs deliberative tree search into three core components: the **Search Mechanism**, **Reward Formulation**, and **Transition Function**. This formalism resolves a key ambiguity in the field -- whether the reward signal serves as a transient heuristic for test-time scaling (TTS) or a durable learning target for self-improvement.
The survey establishes that deliberative search unifies two critical frontiers:
* **Test-Time Scaling (TTS)** -- Deploying on-demand computation to solve hard problems at inference
* **Self-Improvement** -- Using search-generated data to durably enhance model parameters
===== The MCTS Framework for LLMs =====
MCTS adapts the classical game-playing algorithm to reasoning by modeling the solution process as a tree where:
* **Nodes** represent partial reasoning states (e.g., intermediate solution steps)
* **Edges** represent reasoning actions (e.g., generating the next step)
* **Leaf evaluation** uses the LLM itself or a learned value function
The four phases of MCTS operate as follows:
=== Selection ===
Starting from the root, traverse the tree by selecting child nodes according to the UCT (Upper Confidence Bound applied to Trees) policy until reaching a leaf or unexpanded node.
=== Expansion ===
Generate new child nodes by querying the LLM with temperature $T > 0$ to produce diverse candidate continuations. Typically 3-5 candidates are generated per expansion.
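As a sketch, expansion with temperature sampling might look like the following; the `toy_sampler` is a purely illustrative stand-in for a real LLM call:

```python
import random

def expand_candidates(sample_fn, state, n_candidates=4, temperature=0.8):
    # Expansion: query the model n_candidates times at T > 0 so that
    # each call can return a different candidate next step.
    return [sample_fn(state, temperature) for _ in range(n_candidates)]

# Purely illustrative stand-in for an LLM sampler.
def toy_sampler(state, temperature):
    steps = ["isolate x", "subtract 2 from both sides", "check the answer"]
    return state + " -> " + random.choice(steps)

candidates = expand_candidates(toy_sampler, "x + 2 = 5", n_candidates=3)
```

Each candidate becomes a child node; diversity comes entirely from sampling at nonzero temperature.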
=== Simulation (Rollout) ===
Evaluate the expanded node either by completing the reasoning chain to a terminal state or by using a value network to estimate the expected outcome.
=== Backpropagation ===
Propagate the evaluation score upward through the tree, updating visit counts $N(s)$ and value estimates $Q(s)$ for all ancestors.
===== The UCT Formula =====
The UCT formula balances exploitation of known high-value nodes against exploration of less-visited nodes:
$$\text{UCT}(s) = \frac{Q(s)}{N(s)} + C \sqrt{\frac{\ln N(\text{parent}(s))}{N(s)}}$$
where:
* $Q(s) / N(s)$ is the average reward (exploitation term)
* $C \sqrt{\ln N(\text{parent}) / N(s)}$ is the exploration bonus
* $C$ is the exploration constant, typically $C = \sqrt{2}$ but tuned per domain
The exploration term ensures that infrequently visited nodes receive a bonus proportional to $\sqrt{\ln N_{\text{parent}} / N_s}$, guaranteeing asymptotic convergence to optimal play.
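A small worked example makes the trade-off concrete: a rarely visited child can outscore a higher-value, well-explored sibling purely through its exploration bonus (the numbers below are illustrative):

```python
import math

def uct(q, n, n_parent, c=math.sqrt(2)):
    if n == 0:
        return float("inf")  # unvisited nodes are always tried first
    return q / n + c * math.sqrt(math.log(n_parent) / n)

# Parent visited 10 times. Child A: high average value, well explored.
# Child B: lower total value so far, but visited only once.
a = uct(q=7.0, n=9, n_parent=10)  # exploitation-dominated score
b = uct(q=0.9, n=1, n_parent=10)  # large exploration bonus
# b > a: the exploration term steers the next simulation to child B.
```

As child B accumulates visits, its bonus shrinks and the comparison is again dominated by the average-reward term.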
===== Value Estimation Methods =====
Several approaches exist for estimating node values in the LLM reasoning context:
* **Self-Evaluation** -- The LLM rates its own partial solutions on a numerical scale (e.g., 0-100)
* **Outcome Reward Models (ORM)** -- Trained classifiers that predict final answer correctness from intermediate states
* **Process Reward Models (PRM)** -- Step-level reward models providing granular feedback on each reasoning step
* **Contrastive Rewards** -- Pairwise comparison of reasoning paths for interpretable scoring
The formal distinction between //Search Guidance// (transient heuristic for TTS) and //Parametric Reward Modeling// (durable learning target) is a key contribution of the unified framework.
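Self-evaluation, the simplest of these methods, can be sketched as a prompted rating normalized to $[0, 1]$; `score_fn` below is a hypothetical stand-in for an actual LLM call:

```python
def self_evaluate(score_fn, partial_solution):
    # Ask the model to rate the partial solution from 0 to 100, then
    # clamp and normalize to [0, 1] for use as a node value.
    prompt = f"Rate this partial solution from 0 to 100:\n{partial_solution}"
    raw = float(score_fn(prompt))
    return max(0.0, min(100.0, raw)) / 100.0

# Stub scorer standing in for a prompted LLM call.
value = self_evaluate(lambda prompt: 85, "2x = 3, so x = 1.5")
```

The clamping guards against out-of-range model outputs; ORM/PRM approaches replace `score_fn` with a trained reward model rather than a prompt.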
===== Connection to o1-Style Reasoning =====
OpenAI's o1 model is widely believed to rely on internal chain-of-thought search that mirrors MCTS principles, though its exact mechanism has not been disclosed. Under this framing, the test-time compute scaling law states that for a problem of difficulty $d$, the probability of a correct solution scales as:
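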
$$P(\text{correct} | d) \propto 1 - e^{-\beta \cdot C_{\text{test}}(d)}$$
where $C_{\text{test}}(d)$ is the computational budget allocated at test time and $\beta$ is an efficiency parameter dependent on the search algorithm quality.
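Treating the proportionality as an equality for illustration, the law's saturating shape is easy to verify numerically (the $\beta$ and budget values below are arbitrary):

```python
import math

def p_correct(budget, beta):
    # Saturating law: P = 1 - exp(-beta * C_test), with the
    # proportionality read as equality for illustration.
    return 1.0 - math.exp(-beta * budget)

low = p_correct(budget=1.0, beta=1.0)   # ~0.632
high = p_correct(budget=2.0, beta=1.0)  # ~0.865
# Doubling the budget adds only ~0.23: returns diminish near saturation.
```

The practical implication is that search quality (a larger effective $\beta$) matters as much as raw compute once the curve begins to flatten.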
===== Code Example =====
import math
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MCTSNode:
    state: str
    parent: Optional['MCTSNode'] = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

    def uct_score(self, exploration_constant: float = math.sqrt(2)) -> float:
        if self.visits == 0:
            return float('inf')  # unvisited children are tried first
        exploitation = self.value / self.visits
        exploration = exploration_constant * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )
        return exploitation + exploration

    def best_child(self, c: float = math.sqrt(2)) -> 'MCTSNode':
        return max(self.children, key=lambda n: n.uct_score(c))

def select(node: MCTSNode) -> MCTSNode:
    # Selection: descend via UCT until reaching a leaf of the current tree.
    while node.children:
        node = node.best_child()
    return node

def backpropagate(node: MCTSNode, reward: float) -> None:
    # Backpropagation: update visit counts and values for all ancestors.
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent

def mcts_reasoning(llm, root_state: str, num_simulations: int = 100) -> str:
    # `expand(node, llm)` and `simulate(child, llm)` must be supplied by
    # the caller: expansion samples candidate continuations from the LLM,
    # and simulation rolls out (or value-estimates) the new node.
    root = MCTSNode(state=root_state)
    for _ in range(num_simulations):
        node = select(root)
        child = expand(node, llm)
        reward = simulate(child, llm)
        backpropagate(child, reward)
    # Final selection: pick the most-visited child. Re-applying UCT with
    # c=0 would still score any unvisited child as +inf.
    return max(root.children, key=lambda n: n.visits).state
===== Results =====
MCTS-enhanced LLM reasoning has demonstrated significant gains:
* Mistral-7B on GSM8K: 80.7% accuracy (+4.8% over CoT baseline) via step-level preference data
* SC-MCTS* achieves 51.9% speedup per node using speculative decoding
* Outperforms o1-mini by 17.4% on multi-step mathematical reasoning benchmarks
===== References =====
* [[https://arxiv.org/abs/2510.09988|Wei et al. (2025) -- Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey]]
* [[https://github.com/More2Search/Awesome-Search-LLM|Awesome-Search-LLM GitHub Repository]]
* [[https://arxiv.org/abs/2405.00451|AlphaLLM: MCTS for Self-Improving LLMs]]
===== See Also =====
* [[deep_search_agents]] -- Search agents with dynamic planning
* [[agentic_rag]] -- Agentic retrieval-augmented generation
* [[formal_verification_llm_reasoning]] -- Formal verification of LLM reasoning outputs