====== RAP: Reasoning via Planning with LLM as World Model ======
**RAP** (Reasoning via Planning) is a framework introduced by Hao et al. (2023) that repurposes a large language model as **both a world model and a reasoning agent**, guided by **Monte Carlo Tree Search (MCTS)** to explore high-reward reasoning paths.((https://arxiv.org/abs/2305.14992)) Widely cited (925 citations), the paper shows that deliberate planning substantially outperforms autoregressive chain-of-thought reasoning(([[https://arxiv.org/abs/2201.11903|Wei et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (2022)]])) across diverse tasks by enabling strategic exploration, lookahead, and backtracking.
[[https://arxiv.org/abs/2305.14992|arXiv:2305.14992]]
===== Dual Role of the LLM =====
RAP assigns two complementary roles to the same LLM via task-specific prompting:((https://arxiv.org/abs/2305.14992))
==== World Model ====
The LLM predicts the next state $s_{t+1}$ given current state $s_t$ and action $a_t$:
$$s_{t+1} = \text{LLM}_{\text{world}}(s_t, a_t)$$
This enables the agent to simulate future outcomes without actually executing actions, providing lookahead capability analogous to a mental model.
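A minimal sketch of how the world-model role might be elicited through prompting (the prompt wording and the `llm_complete` callable are illustrative assumptions, not the paper's implementation):

```python
def world_model_step(llm_complete, state: str, action: str) -> str:
    """Hypothetical sketch: the same LLM acts as a world model via a
    task-specific prompt, predicting the next state from (state, action)."""
    prompt = (
        "You are simulating a world model.\n"
        f"Current state: {state}\n"
        f"Action taken: {action}\n"
        "Predict the resulting next state:"
    )
    return llm_complete(prompt)
```

Any completion backend can be plugged in as `llm_complete`, which keeps the world-model role decoupled from a specific model API.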
==== Reasoning Agent ====
The LLM generates candidate actions at each state:
$$a_t \sim \pi_{\text{LLM}}(a \mid s_t)$$
Actions and states are flexibly defined per task -- for plan generation, states are world configurations and actions are operations; for math, states are partial solutions and actions are reasoning steps.
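The action-proposal side can be sketched the same way, by sampling the LLM several times per state (again an illustrative assumption; `llm_complete` and the prompt text are hypothetical):

```python
def propose_actions(llm_complete, state: str, n: int = 3) -> list[str]:
    """Hypothetical sketch of the reasoning-agent role: sample n candidate
    actions for the current state by prompting the same LLM repeatedly."""
    prompt = (
        "You are a reasoning agent.\n"
        f"Current state: {state}\n"
        "Propose one possible next action:"
    )
    # Deduplicate while preserving order, since sampled actions may repeat.
    seen, actions = set(), []
    for _ in range(n):
        action = llm_complete(prompt).strip()
        if action not in seen:
            seen.add(action)
            actions.append(action)
    return actions
```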
===== Monte Carlo Tree Search for Reasoning =====
MCTS builds a search tree over the reasoning space through four iterative phases:((https://arxiv.org/abs/2305.14992))
- **Selection**: Traverse the tree from root using UCB1 to balance exploration and exploitation:
$$\text{UCB1}(s, a) = Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}}$$
where $Q(s,a)$ is the estimated action value, $N(s)$ is the visit count for state $s$, and $c$ is the exploration constant.
- **Expansion**: The agent-LLM proposes new actions at leaf nodes
- **Simulation**: The world-model-LLM rolls out future states to compute rewards
- **Backpropagation**: Update value estimates along the traversed path
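The UCB1 rule from the selection phase can be computed directly; the toy numbers below show how the exploration bonus can favor an under-visited child over a higher-value but well-explored one:

```python
import math

def ucb1(q: float, parent_visits: int, child_visits: int, c: float = 1.41) -> float:
    """UCB1 score: exploitation (mean value q) plus an exploration bonus
    that shrinks as the child accumulates visits."""
    if child_visits == 0:
        return float("inf")  # unvisited actions are always tried first
    return q + c * math.sqrt(math.log(parent_visits) / child_visits)

# High-value but well-explored child vs. lower-value, barely-explored child:
well_explored = ucb1(q=0.8, parent_visits=100, child_visits=50)   # ~1.23
under_explored = ucb1(q=0.5, parent_visits=100, child_visits=2)   # ~2.64
```

Here the under-explored child wins the selection step despite its lower estimated value, which is exactly the exploration/exploitation trade-off the formula encodes.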
The final answer is selected from the highest-reward complete reasoning trace, optionally aggregated via majority vote across multiple MCTS runs(([[https://arxiv.org/abs/2203.11171|Wang et al. "Self-Consistency Improves Chain of Thought Reasoning" (2022)]])).
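Answer extraction can be sketched as follows (a minimal illustration under assumed data shapes, not the paper's code): take the highest-reward complete trace from one search, and optionally majority-vote final answers across independent runs:

```python
from collections import Counter

def best_trace(traces):
    """traces: list of (reasoning_steps, reward) pairs from one MCTS run.
    Return the steps of the highest-reward complete trace."""
    return max(traces, key=lambda t: t[1])[0]

def majority_vote(answers):
    """Aggregate final answers from multiple independent MCTS runs."""
    return Counter(answers).most_common(1)[0][0]
```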
===== System Architecture =====
<code mermaid>
graph TD
    A[Input Problem] --> B[Initialize Root State]
    B --> C[MCTS Iteration]
    C --> D[Selection via UCB1]
    D --> E[Leaf Node]
    E --> F{Fully Expanded?}
    F -- No --> G[Expansion: Agent-LLM Proposes Actions]
    G --> H[New Child Nodes]
    F -- Yes --> I[Select Best Child]
    H --> J[Simulation: World-Model Rollout]
    I --> J
    J --> K[Compute Reward]
    K --> L[Backpropagation: Update Q-values]
    L --> M{Budget Exhausted?}
    M -- No --> C
    M -- Yes --> N[Extract Best Reasoning Trace]
    N --> O[Optional: Majority Vote Aggregation]
    O --> P[Final Answer]
</code>
===== Code Example =====
<code python>
# Simplified RAP with MCTS for LLM reasoning
import math

class RAPNode:
    def __init__(self, state, action=None, parent=None):
        self.state = state
        self.action = action          # action that led to this state
        self.parent = parent
        self.children = {}            # action -> RAPNode
        self.visits = 0
        self.total_reward = 0.0

class RAP:
    def __init__(self, llm, exploration_weight=1.41, max_depth=10):
        # llm must provide: generate_actions, world_model,
        # compute_reward, is_terminal
        self.llm = llm
        self.c = exploration_weight
        self.max_depth = max_depth

    def ucb1(self, node, child):
        if child.visits == 0:
            return float("inf")       # unvisited children are tried first
        exploit = child.total_reward / child.visits
        explore = self.c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore

    def select(self, node):
        # Descend via UCB1 until reaching a node with no children.
        while node.children:
            node = max(node.children.values(), key=lambda ch: self.ucb1(node, ch))
        return node

    def expand(self, node):
        # Agent-LLM proposes actions; world-model-LLM predicts next states.
        for action in self.llm.generate_actions(node.state):
            next_state = self.llm.world_model(node.state, action)
            node.children[action] = RAPNode(next_state, action=action, parent=node)

    def simulate(self, node):
        # Greedy rollout with the world model, up to max_depth steps.
        state = node.state
        for _ in range(self.max_depth):
            if self.llm.is_terminal(state):
                break
            action = self.llm.generate_actions(state)[0]
            state = self.llm.world_model(state, action)
        return self.llm.compute_reward(state)

    def backpropagate(self, node, reward):
        # Propagate the rollout reward from the node back to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent

    def best_trace(self, root):
        # Follow highest average-value children to extract the final trace.
        trace, node = [], root
        while node.children:
            node = max(node.children.values(),
                       key=lambda ch: ch.total_reward / max(ch.visits, 1))
            trace.append(node.action)
        return trace

    def search(self, problem, num_iterations=100):
        root = RAPNode(state=problem)
        for _ in range(num_iterations):
            leaf = self.select(root)
            if not self.llm.is_terminal(leaf.state):
                self.expand(leaf)
            node = next(iter(leaf.children.values()), leaf)
            reward = self.simulate(node)
            self.backpropagate(node, reward)
        return self.best_trace(root)
</code>
===== Key Results =====
^ Benchmark ^ Task ^ RAP Improvement ^
| Blocksworld | Embodied plan generation | RAP on LLaMA-33B surpasses CoT on GPT-4 (33% relative improvement) |
| GSM8K | Grade-school math | Higher accuracy via trace ensembling |
| Logical Reasoning | Hypothesis verification | Outperforms with detailed proofs |
* MCTS enables strategic exploration that autoregressive CoT cannot achieve(([[https://aclanthology.org/2023.emnlp-main.507|EMNLP 2023 Proceedings]]))((https://arxiv.org/abs/2305.14992))
* World model allows anticipating consequences before committing to a reasoning step
* Scales effectively: more MCTS iterations yield better reasoning quality
* Compatible with any LLM backbone (the paper's main experiments use LLaMA-33B)
===== See Also =====
* [[llm_tool_makers|LATM: Large Language Models as Tool Makers]]
* [[toolllm|ToolLLM: Mastering 16,000+ Real-World APIs]]
* [[expel_experiential_learning|ExpeL: Experiential Learning Agents]]
===== References =====