rStar is a self-play mutual reasoning framework from Microsoft Research (Qi et al. 2024) that dramatically boosts the reasoning capabilities of small language models (SLMs) without fine-tuning or distillation from larger models. By decoupling reasoning into mutual generation and discrimination via augmented Monte Carlo Tree Search (MCTS), rStar enables SLMs to achieve performance rivaling much larger models on mathematical and logical reasoning tasks.
Small language models (1B-8B parameters) typically underperform on complex reasoning tasks compared to frontier models like GPT-4. Common approaches to close this gap — distillation, fine-tuning on large-model outputs — create dependency on superior models. rStar asks: can small models improve themselves through self-play, without any external supervision?
The answer is yes, via a mutual generation-discrimination protocol where two SLMs of similar capability verify each other's reasoning.
A target SLM uses augmented MCTS to explore reasoning trajectories. Each MCTS node represents a partial solution state, and edges represent human-like reasoning actions such as proposing a single step, decomposing the problem, or verifying an intermediate result with code.
The MCTS selection, expansion, simulation, and backpropagation cycle guides the model toward high-quality reasoning trajectories.
A second SLM independently evaluates each generated trajectory for correctness. Only trajectories where both models mutually agree are retained as reliable solutions. This consistency check filters out hallucinations and reasoning errors without external ground truth.
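One way to sketch this mutual-agreement check is as a masked-completion test: the discriminator sees only a prefix of the generator's reasoning and must independently reach the same final answer. The `Trajectory` class, the `mask_ratio` parameter, and the discriminator's `complete` method below are illustrative assumptions, not the paper's actual API:

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    problem: str
    steps: list        # ordered reasoning steps
    final_answer: str


def mutual_consistency(trajectory, discriminator, mask_ratio=0.5):
    """Keep a trajectory only if the discriminator, shown a partial prefix
    of its steps, independently completes it to the same final answer."""
    cut = max(1, int(len(trajectory.steps) * mask_ratio))
    prefix = trajectory.steps[:cut]
    # The discriminator completes the reasoning from the prefix alone
    completed_answer = discriminator.complete(trajectory.problem, prefix)
    return completed_answer == trajectory.final_answer
```

Because both models must converge on the same answer from different amounts of context, a trajectory that passes is unlikely to be a lucky hallucination by either model alone.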
```python
# Simplified rStar self-play MCTS reasoning
import math


class MCTSNode:
    def __init__(self, state, parent=None):
        self.state = state  # partial solution
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

    def ucb1(self, exploration=1.414):
        if self.visits == 0:
            return float('inf')
        exploitation = self.value / self.visits
        exploration_term = exploration * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )
        return exploitation + exploration_term


def select(root):
    # Descend from the root, always following the child with the best UCB1 score
    node = root
    while node.children:
        node = max(node.children, key=lambda c: c.ucb1())
    return node


def backpropagate(node, reward):
    # Propagate the reward back to the root, updating visit counts and values
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent


def rstar_reasoning(problem, generator_slm, discriminator_slm, n_iterations=100):
    root = MCTSNode(state={"problem": problem, "steps": []})
    for _ in range(n_iterations):
        # SELECT: traverse tree using UCB1
        node = select(root)

        # EXPAND: generator SLM proposes next reasoning actions
        for action in generator_slm.propose_actions(node.state):
            child_state = apply_action(node.state, action)  # model-specific helper
            node.children.append(MCTSNode(state=child_state, parent=node))

        # SIMULATE: roll out the most promising child to a terminal state
        leaf = node.children[0]
        trajectory = generator_slm.rollout(leaf.state)

        # DISCRIMINATE: second SLM verifies the trajectory
        is_valid = discriminator_slm.verify(problem, trajectory)

        # BACKPROPAGATE
        reward = 1.0 if is_valid else 0.0
        backpropagate(leaf, reward)

    # Return the best mutually-agreed trajectory
    return best_trajectory(root)  # model-specific helper
```
rStar follows the four standard MCTS phases: selection, expansion, simulation, and backpropagation. Node selection uses the Upper Confidence Bound (UCB1) formula:
$$\text{UCB1}(i) = \frac{w_i}{n_i} + c \cdot \sqrt{\frac{\ln N}{n_i}}$$
where $w_i$ is the accumulated reward, $n_i$ is the visit count for node $i$, $N$ is the parent visit count, and $c$ is the exploration constant (typically $\sqrt{2}$).
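As a concrete illustration of the trade-off, here is a tiny UCB1 helper with toy numbers (the function name and values are this sketch's own, not from the paper). With a parent visited $N = 10$ times, a barely explored child can outscore a well-explored child with decent average reward, which is what drives the search to keep trying alternatives:

```python
import math


def ucb1(wins, visits, parent_visits, c=math.sqrt(2)):
    # Unvisited nodes get an infinite score so every child is tried at least once
    if visits == 0:
        return float("inf")
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)


# Two children of a parent visited N = 10 times:
a = ucb1(wins=3.0, visits=4, parent_visits=10)  # well explored, average reward 0.75
b = ucb1(wins=0.0, visits=1, parent_visits=10)  # barely explored, zero reward so far
# The exploration term dominates for b, so b > a and the search revisits it
```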
Unlike standard MCTS which uses uniform action spaces, rStar defines diverse reasoning actions:
| Action Type | Description |
|---|---|
| Propose one step | Generate a single logical reasoning step |
| Propose remaining steps | Generate all steps to completion |
| Decompose | Break the problem into subproblems |
| Rephrase | Restate the problem from a different angle |
| Code verification | Validate an intermediate result with executable code |
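One way to represent this heterogeneous action space in code is an enum mapped to prompt templates. The templates below are illustrative placeholders for this sketch, not the prompts used in the paper:

```python
from enum import Enum, auto


class Action(Enum):
    # The five rStar action types from the table above
    PROPOSE_ONE_STEP = auto()
    PROPOSE_REMAINING_STEPS = auto()
    DECOMPOSE = auto()
    REPHRASE = auto()
    CODE_VERIFY = auto()


# Hypothetical prompt templates per action -- placeholders, not the paper's prompts
PROMPTS = {
    Action.PROPOSE_ONE_STEP: "Propose the next single reasoning step for: {state}",
    Action.PROPOSE_REMAINING_STEPS: "Complete all remaining steps for: {state}",
    Action.DECOMPOSE: "Break this problem into simpler subproblems: {state}",
    Action.REPHRASE: "Restate the problem from a different angle: {state}",
    Action.CODE_VERIFY: "Write code that checks the intermediate result in: {state}",
}


def build_prompt(action, state):
    # Turn an (action, state) pair into the prompt sent to the generator SLM
    return PROMPTS[action].format(state=state)
```

During expansion, the generator can be queried once per applicable action type, so each node branches over qualitatively different reasoning moves rather than a uniform action set.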
Performance improvements on reasoning benchmarks (pass@1 accuracy):
| Model | GSM8K (Before) | GSM8K (After rStar) | Improvement (pts) |
|---|---|---|---|
| LLaMA2-7B | 12.51% | 63.91% | +51.4 |
| Mistral-7B | 36.46% | 81.88% | +45.4 |
| LLaMA3-8B-Instruct | 74.53% | 91.13% | +16.6 |
rStar also shows strong results on GSM-Hard, MATH, SVAMP, and StrategyQA benchmarks.
A follow-up work, rStar-Math (Guan et al. 2025, arXiv:2501.04519), extends the framework with code-augmented chain-of-thought data synthesis, a process preference model (PPM) for step-level rewards, and an iterative self-evolution recipe.
rStar-Math achieves 90.0% on MATH (Qwen2.5-Math-7B) and solves 53.3% of AIME problems, rivaling OpenAI o1-preview.