Advanced Reasoning and Planning

Advanced reasoning and planning encompasses the techniques and architectures that enable AI agents to break down complex problems, formulate multi-step strategies, and adapt their approach based on intermediate results. These capabilities are fundamental to building agents that can operate autonomously on open-ended tasks, moving beyond simple prompt-response interactions to exhibit goal-directed behavior.

Chain-of-Thought and Multi-Step Reasoning

Chain-of-Thought (CoT) prompting, introduced by Wei et al., 2022, remains the foundational technique for eliciting step-by-step reasoning from LLMs.¹⁾ By including intermediate reasoning steps in prompts, CoT dramatically improves performance on arithmetic, commonsense, and symbolic reasoning tasks.

Key variants and extensions include:

Zero-Shot CoT (Kojima et al., 2022): Adding “Let's think step by step” triggers reasoning without exemplars²⁾
Self-Consistency (Wang et al., 2023): Samples multiple reasoning paths and selects the most consistent answer via majority voting³⁾
Chain-of-Associated-Thoughts (CoAT) (Pan et al., 2025): Integrates Monte Carlo Tree Search with an association mechanism, enabling models to explore reasoning paths that combine both “fast” and “slow” thinking
Chain-of-X Paradigms (Xia et al., 2025, COLING): A survey documenting extensions beyond CoT including Chain-of-Verification, Chain-of-Knowledge, and Chain-of-Feedback

Modern reasoning models like OpenAI o3, DeepSeek-R1, and Claude 3.7 Sonnet use extended CoT with inference-time compute scaling, where additional computation at generation time yields deeper, more accurate reasoning.

Search-Based Planning Strategies

Tree of Thoughts (ToT), introduced by Yao et al., 2023, organizes reasoning into a tree structure where multiple reasoning paths are explored simultaneously via breadth-first or depth-first search.⁴⁾ Each node represents an intermediate “thought” that is evaluated by the LLM for progress toward the goal.

Graph of Thoughts (GoT), proposed by Besta et al., 2024, ETH Zurich, generalizes CoT and ToT by modeling reasoning as an arbitrary directed graph.⁵⁾ This enables:

Aggregation of multiple partial solutions
Refinement loops where thoughts feed back into earlier stages
Non-linear information flow capturing complex dependencies

Matrix of Thought (MoT) (Tang et al., 2025) re-evaluates the chain-vs-tree tradeoff and proposes structured matrices that capture both sequential and parallel reasoning dimensions.

A comprehensive taxonomy by Besta et al. (2025) titled “Demystifying Chains, Trees, and Graphs of Thoughts” provides a unified framework comparing these topologies across efficiency, accuracy, and cost dimensions.

Hierarchical and Recursive Planning

For complex, long-horizon tasks, agents employ hierarchical decomposition:

Least-to-Most Prompting (Zhou et al., 2022): Decomposes problems into progressively simpler sub-problems, solving from easiest to hardest⁶⁾
Decomposed Prompting (DecomP) (Khot et al., 2023): Routes sub-tasks to specialized modules or tools⁷⁾
Plan-and-Solve (Wang et al., 2023): Explicitly generates a plan before executing each step⁸⁾
Self-Refine (Madaan et al., 2023): Iteratively improves solutions using self-generated feedback⁹⁾

Modern agents like OpenAI Deep Research and Anthropic Claude use hierarchical planning to break hours-long research tasks into manageable sub-tasks, coordinating tool use, memory retrieval, and synthesis.

How Modern Agents Reason

As of 2025, frontier models employ distinct reasoning strategies:

OpenAI o3/o4-mini: Dedicated reasoning models using extended chain-of-thought with reinforcement learning; inference-time compute scaling allows variable reasoning depth
Claude 3.7 Sonnet: Extended thinking mode with persistent memory architectures (Harmony/Compass) for maintaining context across complex tasks
Gemini 2.5 Pro: Hybrid reasoning combining multi-modal inputs with structured tool chains
DeepSeek-R1: Open-weight reasoning model trained with reinforcement learning to incentivize step-by-step verification

Evaluation and Benchmarks

Key benchmarks for evaluating reasoning and planning:

GSM8K (Cobbe et al., 2021): Grade school math word problems; frontier models now achieve >95% accuracy¹⁰⁾
MATH (Hendrycks et al., 2021): Competition-level mathematics; o3 and DeepSeek-R1 exceed 90%¹¹⁾
ARC Challenge (Clark et al., 2018): Science reasoning requiring world knowledge¹²⁾
BIG-Bench Hard (Suzgun et al., 2022): Subset of tasks where LLMs initially struggled¹³⁾
PlanBench (Valmeekam et al., 2023): Evaluates planning capabilities using classical planning domains¹⁴⁾
TravelPlanner (Xie et al., 2024): Real-world planning requiring constraint satisfaction¹⁵⁾

References

¹⁾

Wei, J. et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” arXiv:2201.11903, 2022.

²⁾

Kojima, T. et al. “Large Language Models are Zero-Shot Reasoners.” arXiv:2205.11916, 2022.

³⁾

Wang, X. et al. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” arXiv:2203.11171, 2023.

⁴⁾

Yao, S. et al. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” arXiv:2305.10601, 2023.

⁵⁾

Besta, M. et al. “Graph of Thoughts: Solving Elaborate Problems with Large Language Models.” arXiv:2308.11114, 2024.

⁶⁾

Zhou, D. et al. “Least-to-Most Prompting Enables Complex Reasoning in Large Language Models.” arXiv:2210.11443, 2022.

⁷⁾

Khot, T. et al. “Decomposed Prompting: A Modular Approach for Solving Complex Tasks.” arXiv:2210.02406, 2023.

⁸⁾

Wang, L. et al. “Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models.” arXiv:2305.04091, 2023.

⁹⁾

Madaan, A. et al. “Self-Refine: Iterative Refinement with Self-Feedback.” arXiv:2303.17651, 2023.

¹⁰⁾

Cobbe, K. et al. “Training Verifiers to Solve Math Word Problems.” arXiv:2110.14168, 2021.

¹¹⁾

Hendrycks, D. et al. “Measuring Mathematical Problem Solving With the MATH Dataset.” arXiv:2103.03874, 2021.

¹²⁾

Clark et al. - Think you have Solved Question Answering? Try ARC (2018

¹³⁾

Suzgun, M. et al. “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them.” arXiv:2206.05296, 2022.

¹⁴⁾

Valmeekam, K. et al. “PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change.” arXiv:2305.10918, 2023.

¹⁵⁾

Xie, J. et al. “TravelPlanner: A Benchmark for Real-World Planning with Language Agents.” arXiv:2403.12687, 2024.

AI Agent Knowledge Base

Sidebar

Table of Contents

Advanced Reasoning and Planning

Chain-of-Thought and Multi-Step Reasoning

Search-Based Planning Strategies

Hierarchical and Recursive Planning

How Modern Agents Reason

Evaluation and Benchmarks

See Also

References

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

Advanced Reasoning and Planning

Chain-of-Thought and Multi-Step Reasoning

Search-Based Planning Strategies

Hierarchical and Recursive Planning

How Modern Agents Reason

Evaluation and Benchmarks

See Also

References

Page Tools