rStar is a self-play mutual reasoning framework from Microsoft Research (Qi et al. 2024) that dramatically boosts the reasoning capabilities of small language models (SLMs) without fine-tuning or distillation from larger models. By decoupling reasoning into mutual generation and discrimination via augmented Monte Carlo Tree Search (MCTS), rStar enables SLMs to achieve performance rivaling much larger models on mathematical and logical reasoning tasks.
Small language models (1B-8B parameters) typically underperform on complex reasoning tasks compared to frontier models like GPT-4. Common approaches to close this gap — distillation, fine-tuning on large-model outputs — create dependency on superior models. rStar asks: can small models improve themselves through self-play, without any external supervision?
The answer is yes, via a mutual generation-discrimination protocol where two SLMs of similar capability verify each other's reasoning.
A target SLM uses augmented MCTS to explore reasoning trajectories. Each MCTS node represents a partial solution state, and edges represent human-like reasoning actions such as proposing a single step, decomposing the problem, or verifying an intermediate result with code.
The MCTS selection, expansion, simulation, and backpropagation cycle guides the model toward high-quality reasoning trajectories.
A second SLM independently evaluates each generated trajectory for correctness. Only trajectories where both models mutually agree are retained as reliable solutions. This consistency check filters out hallucinations and reasoning errors without external ground truth.
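One way to sketch this mutual-agreement check is as a masked-completion test: the discriminator sees only a prefix of the generator's reasoning and must independently reach the same final answer. The `Trajectory` class, the `mask_ratio` parameter, and the discriminator's `complete` method below are illustrative assumptions, not the paper's actual API:

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    problem: str
    steps: list        # ordered reasoning steps
    final_answer: str


def mutual_consistency(trajectory, discriminator, mask_ratio=0.5):
    """Keep a trajectory only if the discriminator, shown a partial prefix
    of its steps, independently completes it to the same final answer."""
    cut = max(1, int(len(trajectory.steps) * mask_ratio))
    prefix = trajectory.steps[:cut]
    # The discriminator completes the reasoning from the prefix alone
    completed_answer = discriminator.complete(trajectory.problem, prefix)
    return completed_answer == trajectory.final_answer
```

Because both models must converge on the same answer from different amounts of context, a trajectory that passes is unlikely to be a lucky hallucination by either model alone.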
```python
# Simplified rStar self-play MCTS reasoning
import math


class MCTSNode:
    def __init__(self, state, parent=None):
        self.state = state  # partial solution
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

    def ucb1(self, exploration=1.414):
        if self.visits == 0:
            return float('inf')
        exploitation = self.value / self.visits
        exploration_term = exploration * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )
        return exploitation + exploration_term


def select(root):
    # Descend from the root, always following the child with the best UCB1 score
    node = root
    while node.children:
        node = max(node.children, key=lambda c: c.ucb1())
    return node


def backpropagate(node, reward):
    # Propagate the reward back to the root, updating visit counts and values
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent


def rstar_reasoning(problem, generator_slm, discriminator_slm, n_iterations=100):
    root = MCTSNode(state={"problem": problem, "steps": []})
    for _ in range(n_iterations):
        # SELECT: traverse tree using UCB1
        node = select(root)

        # EXPAND: generator SLM proposes next reasoning actions
        for action in generator_slm.propose_actions(node.state):
            child_state = apply_action(node.state, action)  # model-specific helper
            node.children.append(MCTSNode(state=child_state, parent=node))

        # SIMULATE: roll out the most promising child to a terminal state
        leaf = node.children[0]
        trajectory = generator_slm.rollout(leaf.state)

        # DISCRIMINATE: second SLM verifies the trajectory
        is_valid = discriminator_slm.verify(problem, trajectory)

        # BACKPROPAGATE
        reward = 1.0 if is_valid else 0.0
        backpropagate(leaf, reward)

    # Return the best mutually-agreed trajectory
    return best_trajectory(root)  # model-specific helper
```
rStar follows the four standard MCTS phases: selection, expansion, simulation, and backpropagation. Node selection uses the Upper Confidence Bound (UCB1) formula:
$$\text{UCB1}(i) = \frac{w_i}{n_i} + c \cdot \sqrt{\frac{\ln N}{n_i}}$$
where $w_i$ is the accumulated reward, $n_i$ is the visit count for node $i$, $N$ is the parent visit count, and $c$ is the exploration constant (typically $\sqrt{2}$).
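As a concrete illustration of the trade-off, here is a tiny UCB1 helper with toy numbers (the function name and values are this sketch's own, not from the paper). With a parent visited $N = 10$ times, a barely explored child can outscore a well-explored child with decent average reward, which is what drives the search to keep trying alternatives:

```python
import math


def ucb1(wins, visits, parent_visits, c=math.sqrt(2)):
    # Unvisited nodes get an infinite score so every child is tried at least once
    if visits == 0:
        return float("inf")
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)


# Two children of a parent visited N = 10 times:
a = ucb1(wins=3.0, visits=4, parent_visits=10)  # well explored, average reward 0.75
b = ucb1(wins=0.0, visits=1, parent_visits=10)  # barely explored, zero reward so far
# The exploration term dominates for b, so b > a and the search revisits it
```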
Unlike standard MCTS which uses uniform action spaces, rStar defines diverse reasoning actions:
| Action Type | Description |
|---|---|
| Propose one step | Generate a single logical reasoning step |
| Propose remaining steps | Generate all steps to completion |
| Decompose | Break the problem into subproblems |
| Rephrase | Restate the problem from a different angle |
| Code verification | Validate an intermediate result with executable code |
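One way to represent this heterogeneous action space in code is an enum mapped to prompt templates. The templates below are illustrative placeholders for this sketch, not the prompts used in the paper:

```python
from enum import Enum, auto


class Action(Enum):
    # The five rStar action types from the table above
    PROPOSE_ONE_STEP = auto()
    PROPOSE_REMAINING_STEPS = auto()
    DECOMPOSE = auto()
    REPHRASE = auto()
    CODE_VERIFY = auto()


# Hypothetical prompt templates per action -- placeholders, not the paper's prompts
PROMPTS = {
    Action.PROPOSE_ONE_STEP: "Propose the next single reasoning step for: {state}",
    Action.PROPOSE_REMAINING_STEPS: "Complete all remaining steps for: {state}",
    Action.DECOMPOSE: "Break this problem into simpler subproblems: {state}",
    Action.REPHRASE: "Restate the problem from a different angle: {state}",
    Action.CODE_VERIFY: "Write code that checks the intermediate result in: {state}",
}


def build_prompt(action, state):
    # Turn an (action, state) pair into the prompt sent to the generator SLM
    return PROMPTS[action].format(state=state)
```

During expansion, the generator can be queried once per applicable action type, so each node branches over qualitatively different reasoning moves rather than a uniform action set.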
Performance improvements on reasoning benchmarks (pass@1 accuracy):
| Model | GSM8K (Before) | GSM8K (After rStar) | Improvement (pts) |
|---|---|---|---|
| LLaMA2-7B | 12.51% | 63.91% | +51.4 |
| Mistral-7B | 36.46% | 81.88% | +45.4 |
| LLaMA3-8B-Instruct | 74.53% | 91.13% | +16.6 |
rStar also shows strong results on GSM-Hard, MATH, SVAMP, and StrategyQA benchmarks.
A follow-up work, rStar-Math (Guan et al. 2025, arXiv:2501.04519), extends the framework with code-augmented chain-of-thought data synthesis, a process preference model (PPM) for step-level rewards, and an iterative self-evolution recipe.
rStar-Math achieves 90.0% on MATH (Qwen2.5-Math-7B) and solves 53.3% of AIME problems, rivaling OpenAI o1-preview.