====== rStar Reasoning ======
**rStar** is a self-play mutual reasoning framework from Microsoft Research (Qi et al. 2024) that dramatically boosts the reasoning capabilities of small language models (SLMs) without fine-tuning or distillation from larger models. By decoupling reasoning into mutual generation and discrimination via augmented Monte Carlo Tree Search (MCTS), rStar enables SLMs to achieve performance rivaling much larger models on mathematical and logical reasoning tasks.
===== Motivation =====
Small language models (1B-8B parameters) typically underperform on complex reasoning tasks compared to frontier models like GPT-4. Common approaches to closing this gap, such as distillation or fine-tuning on outputs from larger models, create a dependency on those superior models. rStar asks: //can small models improve themselves through self-play, without any external supervision?//
The answer is yes, via a mutual generation-discrimination protocol where two SLMs of similar capability verify each other's reasoning.
===== Method: MCTS-Based Self-Play =====
=== Generation Phase ===
A target SLM uses augmented MCTS to explore reasoning trajectories. Each MCTS node represents a partial solution state, and edges represent //human-like reasoning actions//:
* **Decomposition**: Break a problem into subproblems
* **Sequential step**: Apply a single reasoning step
* **Verification**: Check intermediate results via code execution
* **Backtracking**: Abandon unpromising paths
The MCTS selection, expansion, simulation, and backpropagation cycle guides the model toward high-quality reasoning trajectories.
=== Discrimination Phase ===
A second SLM independently evaluates each generated trajectory for correctness. Only trajectories where both models //mutually agree// are retained as reliable solutions. This consistency check filters out hallucinations and reasoning errors without external ground truth.
<code python>
# Simplified rStar self-play MCTS reasoning (illustrative sketch)
import math


class MCTSNode:
    def __init__(self, state, parent=None):
        self.state = state          # partial solution
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

    def ucb1(self, exploration=1.414):
        if self.visits == 0:
            return float('inf')
        exploitation = self.value / self.visits
        exploration_term = exploration * math.sqrt(
            math.log(self.parent.visits) / self.visits
        )
        return exploitation + exploration_term


def select(root):
    """Descend the tree, always following the child with the highest UCB1."""
    node = root
    while node.children:
        node = max(node.children, key=lambda c: c.ucb1())
    return node


def backpropagate(node, reward):
    """Propagate the reward from a leaf back up to the root."""
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent


def rstar_reasoning(problem, generator_slm, discriminator_slm, n_iterations=100):
    root = MCTSNode(state={"problem": problem, "steps": []})
    for _ in range(n_iterations):
        # SELECT: traverse the tree using UCB1
        node = select(root)
        # EXPAND: generator SLM proposes next reasoning actions
        for action in generator_slm.propose_actions(node.state):
            child_state = apply_action(node.state, action)  # domain-specific
            node.children.append(MCTSNode(state=child_state, parent=node))
        # SIMULATE: roll out from the first new child to a terminal state
        # (a full implementation would pick the most promising child)
        leaf = node.children[0]
        trajectory = generator_slm.rollout(leaf.state)
        # DISCRIMINATE: second SLM independently verifies the trajectory
        is_valid = discriminator_slm.verify(problem, trajectory)
        # BACKPROPAGATE: reward mutually agreed trajectories
        reward = 1.0 if is_valid else 0.0
        backpropagate(leaf, reward)
    # Return the best mutually agreed trajectory
    return best_trajectory(root)  # placeholder: pick the highest-value path
</code>
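Separate from the MCTS sketch, the core mutual-agreement idea can be shown in a fully self-contained toy: a "generator" proposes several candidate answers and an independent "discriminator" re-derives the answer, and only candidates both agree on survive. All names and the arithmetic task here are invented for illustration.

```python
# Toy mutual generation-discrimination: keep only candidates both "models" agree on.
def toy_generator(problem):
    # Pretend the generator proposes several candidate answers for a + b.
    a, b = problem
    return [a + b, a + b + 1, a * b]  # one correct, two flawed candidates


def toy_discriminator(problem, candidate):
    # An independent check: re-derive the answer and compare.
    a, b = problem
    return candidate == a + b


def mutually_agreed(problem):
    return [c for c in toy_generator(problem) if toy_discriminator(problem, c)]


print(mutually_agreed((3, 4)))  # [7] -- the flawed candidates are filtered out
```

In rStar the discriminator is a second SLM completing masked trajectory prefixes rather than a deterministic checker, but the filtering principle is the same.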
===== MCTS Components =====
The four phases of MCTS as applied in rStar:
<code>
graph TB
    A[Selection] -->|UCB1 policy| B[Expansion]
    B -->|Generator SLM proposes actions| C[Simulation]
    C -->|Rollout to completion| D[Backpropagation]
    D -->|Update node values| A
    C -->|Discriminator SLM verifies| E{Mutually Agreed?}
    E -->|Yes| F[Accept Trajectory]
    E -->|No| G[Reject]
</code>
===== UCB1 Selection Formula =====
Node selection uses the Upper Confidence Bound:
$$\text{UCB1}(i) = \frac{w_i}{n_i} + c \cdot \sqrt{\frac{\ln N}{n_i}}$$
where $w_i$ is the accumulated reward, $n_i$ is the visit count for node $i$, $N$ is the parent visit count, and $c$ is the exploration constant (typically $\sqrt{2}$).
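A quick numeric check of the formula (the reward and visit counts below are arbitrary values chosen for illustration) shows how the exploration term lets a lightly visited node outrank a sibling with a higher mean reward:

```python
import math


def ucb1(w, n, N, c=math.sqrt(2)):
    """UCB1 score for a node with accumulated reward w, visits n, parent visits N."""
    return w / n + c * math.sqrt(math.log(N) / n)


# Two siblings under a parent with N = 10 visits:
# a well-explored node with a good mean reward (0.75) ...
print(round(ucb1(w=6.0, n=8, N=10), 3))   # 1.509
# ... vs. a lightly explored node with a worse mean reward (0.5).
print(round(ucb1(w=1.0, n=2, N=10), 3))   # 2.017 -- selected despite the lower mean
```

As visits accumulate, the exploration term shrinks and selection converges toward the empirically best branches.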
===== Human-Like Reasoning Actions =====
Unlike standard MCTS which uses uniform action spaces, rStar defines diverse reasoning actions:
^ Action Type ^ Description ^
| Propose one step | Generate a single logical reasoning step |
| Propose remaining steps | Generate all steps to completion |
| Decompose | Break the problem into subproblems |
| Rephrase | Restate the problem from a different angle |
| Code verification | Validate an intermediate result with executable code |
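The action space above can be encoded as a simple enum with one prompt template per action; the enum names follow the table, but the prompt wording is invented here for illustration and is not taken from the rStar paper.

```python
from enum import Enum, auto


class ReasoningAction(Enum):
    """The five rStar action types (names follow the table above)."""
    PROPOSE_ONE_STEP = auto()
    PROPOSE_REMAINING_STEPS = auto()
    DECOMPOSE = auto()
    REPHRASE = auto()
    CODE_VERIFICATION = auto()


# Illustrative prompt templates (wording invented, not the paper's prompts).
PROMPTS = {
    ReasoningAction.PROPOSE_ONE_STEP: "Write the next single reasoning step.",
    ReasoningAction.PROPOSE_REMAINING_STEPS: "Write all remaining steps to the final answer.",
    ReasoningAction.DECOMPOSE: "Split the problem into simpler subproblems.",
    ReasoningAction.REPHRASE: "Restate the problem from a different angle.",
    ReasoningAction.CODE_VERIFICATION: "Write code that checks the last intermediate result.",
}

assert len(PROMPTS) == len(ReasoningAction)
```

During expansion, the generator SLM is prompted with the template for each legal action from the current node, yielding a branching factor tied to the action space rather than the raw token vocabulary.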
===== Results =====
Performance improvements on reasoning benchmarks (pass@1 accuracy):
^ Model ^ GSM8K (Before) ^ GSM8K (After rStar) ^ Improvement (pp) ^
| LLaMA2-7B | 12.51% | **63.91%** | +51.4 |
| Mistral-7B | 36.46% | **81.88%** | +45.4 |
| LLaMA3-8B-Instruct | 74.53% | **91.13%** | +16.6 |
rStar also shows strong results on GSM-Hard, MATH, SVAMP, and StrategyQA benchmarks.
===== rStar-Math: Self-Evolved Deep Thinking =====
A follow-up work, **rStar-Math** (Guan et al. 2025, arXiv:2501.04519), extends the framework with:
* **Code-augmented CoT synthesis**: MCTS rollouts generate step-by-step verified trajectories with executable code checks
* **Process Preference Model (PPM)**: Replaces naive step-level scoring with preference-based reward modeling
* **Self-evolution**: Policy SLM and PPM are iteratively evolved over 4 rounds
rStar-Math achieves 90.0% on MATH (Qwen2.5-Math-7B) and solves 53.3% of AIME problems, rivaling OpenAI o1-preview.
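The PPM scores reasoning steps from preference pairs rather than absolute labels. The standard form of such an objective is a Bradley-Terry-style pairwise loss; the minimal sketch below illustrates that general form, not rStar-Math's exact training recipe.

```python
import math


def pairwise_preference_loss(score_preferred, score_rejected):
    """Bradley-Terry-style loss: push the preferred step's score above the rejected one's.

    Computes -log(sigmoid(s_w - s_l)); near zero when s_w >> s_l,
    large when the ranking is reversed.
    """
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# A well-separated pair incurs little loss; a reversed pair incurs a large loss.
print(round(pairwise_preference_loss(2.0, -2.0), 4))   # 0.0181
print(round(pairwise_preference_loss(-2.0, 2.0), 4))   # 4.0181
```

Training on pairs sidesteps the need for calibrated absolute step scores, which MCTS rollout statistics provide only noisily.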
===== References =====
* [[https://arxiv.org/abs/2408.06195|Qi et al. "Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers" (2024). arXiv:2408.06195]]
* [[https://arxiv.org/abs/2501.04519|Guan et al. "rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking" (2025). arXiv:2501.04519]]
* [[https://github.com/zhentingqi/rStar|Official rStar implementation (GitHub)]]
===== See Also =====
* [[monte_carlo_tree_search|Monte Carlo Tree Search]]
* [[self_play|Self-Play in AI]]
* [[chain_of_thought|Chain of Thought]]
* [[process_reward_models|Process Reward Models]]