====== RAP: Reasoning via Planning with LLM as World Model ======
**RAP** (Reasoning via Planning) is a framework introduced by Hao et al. (2023) that repurposes a large language model as **both a world model and a reasoning agent**, guided by **Monte Carlo Tree Search (MCTS)** to explore high-reward reasoning paths.((https://arxiv.org/abs/2305.14992)) Widely cited (925 citations), the paper shows that deliberate planning substantially outperforms autoregressive chain-of-thought reasoning(([[https://arxiv.org/abs/2201.11903|Wei et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (2022)]])) across diverse tasks by enabling strategic exploration, lookahead, and backtracking.
[[https://arxiv.org/abs/2305.14992|arXiv:2305.14992]]
===== Dual Role of the LLM =====
RAP assigns two complementary roles to the same LLM via task-specific prompting:((https://arxiv.org/abs/2305.14992))
==== World Model ====
The LLM predicts the next state $s_{t+1}$ given current state $s_t$ and action $a_t$:
$$s_{t+1} = \text{LLM}_{\text{world}}(s_t, a_t)$$
This enables the agent to simulate future outcomes without actually executing actions, providing lookahead capability analogous to a mental model.
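A minimal sketch of how the world-model role might be elicited through prompting (the prompt wording and the `llm_complete` callable are illustrative assumptions, not the paper's implementation):

```python
def world_model_step(llm_complete, state: str, action: str) -> str:
    """Hypothetical sketch: the same LLM acts as a world model via a
    task-specific prompt, predicting the next state from (state, action)."""
    prompt = (
        "You are simulating a world model.\n"
        f"Current state: {state}\n"
        f"Action taken: {action}\n"
        "Predict the resulting next state:"
    )
    return llm_complete(prompt)
```

Any completion backend can be plugged in as `llm_complete`, which keeps the world-model role decoupled from a specific model API.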
==== Reasoning Agent ====
The LLM generates candidate actions at each state:
$$a_t \sim \pi_{\text{LLM}}(a \mid s_t)$$
Actions and states are flexibly defined per task -- for plan generation, states are world configurations and actions are operations; for math, states are partial solutions and actions are reasoning steps.
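The action-proposal side can be sketched the same way, by sampling the LLM several times per state (again an illustrative assumption; `llm_complete` and the prompt text are hypothetical):

```python
def propose_actions(llm_complete, state: str, n: int = 3) -> list[str]:
    """Hypothetical sketch of the reasoning-agent role: sample n candidate
    actions for the current state by prompting the same LLM repeatedly."""
    prompt = (
        "You are a reasoning agent.\n"
        f"Current state: {state}\n"
        "Propose one possible next action:"
    )
    # Deduplicate while preserving order, since sampled actions may repeat.
    seen, actions = set(), []
    for _ in range(n):
        action = llm_complete(prompt).strip()
        if action not in seen:
            seen.add(action)
            actions.append(action)
    return actions
```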
===== Monte Carlo Tree Search for Reasoning =====
MCTS builds a search tree over the reasoning space through four iterative phases:((https://arxiv.org/abs/2305.14992))
- **Selection**: Traverse the tree from root using UCB1 to balance exploration and exploitation:
$$\text{UCB1}(s, a) = Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}}$$
where $Q(s,a)$ is the estimated action value, $N(s)$ is the visit count for state $s$, and $c$ is the exploration constant.
- **Expansion**: The agent-LLM proposes new actions at leaf nodes
- **Simulation**: The world-model-LLM rolls out future states to compute rewards
- **Backpropagation**: Update value estimates along the traversed path
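The UCB1 rule from the selection phase can be computed directly; the toy numbers below show how the exploration bonus can favor an under-visited child over a higher-value but well-explored one:

```python
import math

def ucb1(q: float, parent_visits: int, child_visits: int, c: float = 1.41) -> float:
    """UCB1 score: exploitation (mean value q) plus an exploration bonus
    that shrinks as the child accumulates visits."""
    if child_visits == 0:
        return float("inf")  # unvisited actions are always tried first
    return q + c * math.sqrt(math.log(parent_visits) / child_visits)

# High-value but well-explored child vs. lower-value, barely-explored child:
well_explored = ucb1(q=0.8, parent_visits=100, child_visits=50)   # ~1.23
under_explored = ucb1(q=0.5, parent_visits=100, child_visits=2)   # ~2.64
```

Here the under-explored child wins the selection step despite its lower estimated value, which is exactly the exploration/exploitation trade-off the formula encodes.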
The final answer is selected from the highest-reward complete reasoning trace, optionally aggregated via majority vote across multiple MCTS runs(([[https://arxiv.org/abs/2203.11171|Wang et al. "Self-Consistency Improves Chain of Thought Reasoning" (2022)]])).
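Answer extraction can be sketched as follows (a minimal illustration under assumed data shapes, not the paper's code): take the highest-reward complete trace from one search, and optionally majority-vote final answers across independent runs:

```python
from collections import Counter

def best_trace(traces):
    """traces: list of (reasoning_steps, reward) pairs from one MCTS run.
    Return the steps of the highest-reward complete trace."""
    return max(traces, key=lambda t: t[1])[0]

def majority_vote(answers):
    """Aggregate final answers from multiple independent MCTS runs."""
    return Counter(answers).most_common(1)[0][0]
```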
===== System Architecture =====
<code mermaid>
graph TD
    A[Input Problem] --> B[Initialize Root State]
    B --> C[MCTS Iteration]
    C --> D[Selection via UCB1]
    D --> E[Leaf Node]
    E --> F{Fully Expanded?}
    F -- No --> G[Expansion: Agent-LLM Proposes Actions]
    G --> H[New Child Nodes]
    F -- Yes --> I[Select Best Child]
    H --> J[Simulation: World-Model Rollout]
    I --> J
    J --> K[Compute Reward]
    K --> L[Backpropagation: Update Q-values]
    L --> M{Budget Exhausted?}
    M -- No --> C
    M -- Yes --> N[Extract Best Reasoning Trace]
    N --> O[Optional: Majority Vote Aggregation]
    O --> P[Final Answer]
</code>
===== Code Example =====
<code python>
# Simplified RAP with MCTS for LLM reasoning
import math

class RAPNode:
    def __init__(self, state, action=None, parent=None):
        self.state = state
        self.action = action          # action that led to this state
        self.parent = parent
        self.children = {}            # action -> RAPNode
        self.visits = 0
        self.total_reward = 0.0

class RAP:
    def __init__(self, llm, exploration_weight=1.41, max_depth=10):
        # llm must provide: generate_actions, world_model,
        # compute_reward, is_terminal
        self.llm = llm
        self.c = exploration_weight
        self.max_depth = max_depth

    def ucb1(self, node, child):
        if child.visits == 0:
            return float("inf")       # unvisited children are tried first
        exploit = child.total_reward / child.visits
        explore = self.c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore

    def select(self, node):
        # Descend via UCB1 until reaching a node with no children.
        while node.children:
            node = max(node.children.values(), key=lambda ch: self.ucb1(node, ch))
        return node

    def expand(self, node):
        # Agent-LLM proposes actions; world-model-LLM predicts next states.
        for action in self.llm.generate_actions(node.state):
            next_state = self.llm.world_model(node.state, action)
            node.children[action] = RAPNode(next_state, action=action, parent=node)

    def simulate(self, node):
        # Greedy rollout with the world model, up to max_depth steps.
        state = node.state
        for _ in range(self.max_depth):
            if self.llm.is_terminal(state):
                break
            action = self.llm.generate_actions(state)[0]
            state = self.llm.world_model(state, action)
        return self.llm.compute_reward(state)

    def backpropagate(self, node, reward):
        # Propagate the rollout reward from the node back to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent

    def best_trace(self, root):
        # Follow highest average-value children to extract the final trace.
        trace, node = [], root
        while node.children:
            node = max(node.children.values(),
                       key=lambda ch: ch.total_reward / max(ch.visits, 1))
            trace.append(node.action)
        return trace

    def search(self, problem, num_iterations=100):
        root = RAPNode(state=problem)
        for _ in range(num_iterations):
            leaf = self.select(root)
            if not self.llm.is_terminal(leaf.state):
                self.expand(leaf)
            node = next(iter(leaf.children.values()), leaf)
            reward = self.simulate(node)
            self.backpropagate(node, reward)
        return self.best_trace(root)
</code>
===== Key Results =====
^ Benchmark ^ Task ^ RAP Improvement ^
| Blocksworld | Embodied plan generation | RAP on LLaMA-33B surpasses CoT on GPT-4 (33% relative improvement) |
| GSM8K | Grade-school math | Higher accuracy via trace ensembling |
| Logical Reasoning | Hypothesis verification | Outperforms with detailed proofs |
* MCTS enables strategic exploration that autoregressive CoT cannot achieve(([[https://aclanthology.org/2023.emnlp-main.507|EMNLP 2023 Proceedings]]))((https://arxiv.org/abs/2305.14992))
* World model allows anticipating consequences before committing to a reasoning step
* Scales effectively: more MCTS iterations yield better reasoning quality
* Compatible with any LLM backbone (the paper's main experiments use LLaMA-33B)
===== See Also =====
* [[llm_tool_makers|LATM: Large Language Models as Tool Makers]]
* [[toolllm|ToolLLM: Mastering 16,000+ Real-World APIs]]
* [[expel_experiential_learning|ExpeL: Experiential Learning Agents]]
===== References =====