Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Agent design patterns are reusable architectural solutions for building AI agents that reason, act, and collaborate effectively. Just as software engineering has its Gang of Four patterns, the emerging field of agentic AI has converged on a set of proven patterns that dramatically improve agent reliability, capability, and efficiency1). This article serves as a definitive index of all major agent design patterns, organized by category.
The patterns below range from single-agent reasoning techniques to complex multi-agent orchestration architectures. Many patterns compose naturally — a multi-agent system might use ReAct for tool calling, planning for task decomposition, and human-in-the-loop gates for safety-critical decisions.
These five foundational patterns — Reflection, Tool Use, Planning, Multi-Agent Collaboration, and Human-in-the-Loop — were identified by Andrew Ng and are widely adopted across the agent-building community2). They are the building blocks from which more complex agent architectures are composed.
Reflection is a pattern where an agent critiques its own output and iteratively improves it. The agent generates an initial response, then evaluates that response against quality criteria, identifies flaws, and produces an improved version. This self-critique loop can run for a fixed number of iterations or until a quality threshold is met. In empirical tests, adding reflection to GPT-4 improved HumanEval coding benchmark scores from 67% to 88%3). Use reflection when output quality matters more than latency, and when the task has verifiable quality criteria. See Critic & Self-Correction for detailed coverage.
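The generate–critique–revise loop described above can be sketched as follows. The `generate`, `critique`, and `revise` helpers are hypothetical stand-ins for real LLM calls; only the loop structure is the point.

```python
# Reflection loop sketch. The three helpers below are deterministic stubs
# standing in for LLM calls; a real agent would prompt a model for each.

def generate(task: str) -> str:
    """Stand-in for an LLM call that drafts an initial answer."""
    return f"draft answer for: {task}"

def critique(answer: str) -> list[str]:
    """Stand-in for an LLM call that lists flaws; empty list = good enough."""
    return ["missing edge-case handling"] if "revised" not in answer else []

def revise(answer: str, flaws: list[str]) -> str:
    """Stand-in for an LLM call that rewrites the answer to fix each flaw."""
    return f"revised ({'; '.join(flaws)}): {answer}"

def reflect(task: str, max_iters: int = 3) -> str:
    answer = generate(task)
    for _ in range(max_iters):
        flaws = critique(answer)
        if not flaws:          # quality threshold met: stop early
            break
        answer = revise(answer, flaws)
    return answer

print(reflect("sort a list"))
```

The fixed `max_iters` bound matters in practice: without it, a critic that never returns an empty flaw list would loop forever.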
The Tool Use pattern enables agents to call external tools — APIs, databases, code interpreters, search engines — within a structured reason-act loop. The agent reasons about what information or action is needed, selects and invokes the appropriate tool, observes the result, and continues reasoning. This grounds the agent in real-world data and extends its capabilities far beyond text generation alone4). Use this pattern whenever the agent needs access to external information or must take actions in the world. See ReAct Framework and Tool Use for implementation details.
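The reason–act–observe loop can be sketched as below. The tool registry and the `decide_next_step` policy are illustrative stubs (a real agent would let the LLM choose the tool and arguments); the toy calculator uses `eval` for brevity only.

```python
# Minimal reason-act loop over a hypothetical tool registry.

TOOLS = {
    "calculator": lambda expr: str(eval(expr)),   # toy tool: arithmetic only
    "search": lambda q: f"top result for '{q}'",  # toy tool: fake web search
}

def decide_next_step(task: str, observations: list[str]):
    """Stand-in for the LLM's reasoning step: pick a tool, or finish."""
    if not observations:
        return ("calculator", task)          # action = (tool name, tool input)
    return ("finish", observations[-1])      # enough information gathered

def run_agent(task: str, max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):
        tool, arg = decide_next_step(task, observations)
        if tool == "finish":
            return arg
        observations.append(TOOLS[tool](arg))  # act, then observe
    return observations[-1]

print(run_agent("2 + 3 * 4"))
```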
Planning is the pattern of decomposing a complex task into a sequence of smaller, manageable subtasks before execution begins. The agent analyzes the overall goal, identifies dependencies between subtasks, determines an execution order, and then works through the plan step by step. Planning separates the “what to do” from the “how to do it,” enabling agents to tackle problems that would be too complex to solve in a single pass5). Use planning for multi-step tasks with dependencies or when the solution path is not immediately obvious. See Planning and Plan-and-Execute Agents.
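A minimal plan-then-execute skeleton, with `make_plan` and `execute_step` as hypothetical stand-ins for the planning and execution LLM calls:

```python
# Plan-and-execute sketch: decompose first, then walk the plan in order,
# passing earlier results forward as context.

def make_plan(goal: str) -> list[str]:
    """Stand-in for a planning LLM call returning ordered subtasks."""
    return [f"research {goal}", f"outline {goal}", f"write {goal}"]

def execute_step(step: str, context: list[str]) -> str:
    """Stand-in for an execution LLM call; sees results of earlier steps."""
    return f"done: {step} (given {len(context)} prior results)"

def plan_and_execute(goal: str) -> list[str]:
    results: list[str] = []
    for step in make_plan(goal):       # dependencies respected via ordering
        results.append(execute_step(step, results))
    return results

for line in plan_and_execute("a blog post"):
    print(line)
```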
Multi-Agent Collaboration involves multiple specialized agents working together to accomplish a goal that no single agent could handle alone. Each agent has a defined role, expertise, or capability, and they coordinate through message passing, shared state, or a supervisor. This mirrors how human organizations divide labor among specialists. Use multi-agent patterns when the problem domain is too broad for one agent, when different subtasks require different tool sets or prompts, or when you need debate and verification between independent perspectives. See Multi-Agent Systems for architectures and frameworks.
Human-in-the-Loop (HITL) is the pattern of incorporating human oversight, approval gates, or feedback injection into the agent's workflow. Rather than running fully autonomously, the agent pauses at defined checkpoints to request human review, confirmation of high-stakes actions, or corrective feedback. This pattern is essential for production deployments where errors carry real consequences6). Use HITL for safety-critical decisions, actions with irreversible consequences, or when building trust during initial deployment. See Human-in-the-Loop.
Reasoning patterns structure how an agent thinks through a problem before producing an answer. They operate at the cognitive level, shaping the internal reasoning process.
Chain of Thought prompts the model to produce intermediate reasoning steps before arriving at a final answer. By making the reasoning process explicit, CoT dramatically improves performance on math, logic, and multi-step problems. Use CoT whenever the task requires multi-step reasoning or when you need to audit how the agent reached its conclusion. See Chain of Thought.
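In its simplest form, CoT is a prompting convention. The instruction phrasing below is illustrative, not a canonical template:

```python
# Illustrative CoT prompt construction; the exact wording is an assumption.

def cot_prompt(question: str) -> str:
    return (
        f"Question: {question}\n"
        "Think step by step, showing each intermediate step, "
        "then give the final answer on a line starting with 'Answer:'."
    )

print(cot_prompt("A train travels 60 km in 40 minutes. What is its speed in km/h?"))
```

Requiring a fixed `Answer:` prefix also makes the final answer easy to parse out of the reasoning trace.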
Tree of Thoughts extends CoT by exploring multiple reasoning paths simultaneously, branching at decision points and evaluating which paths are most promising. The agent can backtrack from dead ends and explore alternatives, mimicking how humans consider multiple approaches to a problem7). Use ToT for problems with large solution spaces or where the first reasoning path may not be optimal. See Tree of Thoughts.
Graph of Thoughts generalizes Tree of Thoughts by allowing reasoning paths to merge, split, and form arbitrary graph structures. Partial solutions from different branches can be combined, enabling more sophisticated reasoning than strictly tree-shaped exploration8). Use GoT for problems where partial solutions can be meaningfully combined. See Graph of Thoughts.
Chain of Draft is an efficiency-oriented variant of CoT where the agent produces minimal, abbreviated reasoning steps rather than verbose explanations. Each intermediate step contains only the essential information needed to advance the reasoning. This preserves the accuracy benefits of CoT while significantly reducing token usage and latency. Use CoD when you need CoT-level reasoning quality but are constrained by cost or speed. See Chain of Draft.
Self-Consistency generates multiple independent reasoning chains for the same problem and selects the most common answer through majority voting. By sampling diverse reasoning paths, this pattern reduces the chance that a single flawed chain of thought produces an incorrect answer9). Use self-consistency when correctness is paramount and you can afford the additional compute cost. See Self-Consistency.
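The sampling-plus-majority-vote mechanism can be sketched as below; `sample_chain` is a stand-in for a temperature>0 LLM call, stubbed here so one chain in five reaches a wrong answer.

```python
# Self-consistency sketch: sample several reasoning chains, majority-vote.
from collections import Counter

def sample_chain(question: str, seed: int) -> str:
    """Stand-in: most chains reach '4'; an occasional flawed chain says '5'."""
    return "5" if seed % 5 == 0 else "4"

def self_consistent_answer(question: str, n_samples: int = 5) -> str:
    answers = [sample_chain(question, seed) for seed in range(n_samples)]
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common            # majority vote across sampled chains

print(self_consistent_answer("what is 2 + 2?"))
```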
ReAct interleaves reasoning traces with action execution, creating a tight loop of thought-action-observation. Unlike pure reasoning patterns, ReAct grounds each reasoning step in real observations from tool use or environment interaction. This prevents the hallucination and reasoning drift that can occur in purely internal reasoning chains. Use ReAct as the default pattern for agents that need to interact with external tools or environments. See ReAct Framework.
Reflexion adds a verbal self-reflection step after task completion, where the agent analyzes what went wrong (or right) and stores these reflections in memory for future attempts. Unlike simple reflection, Reflexion maintains an episodic memory of past failures and successes that persists across task attempts. Use Reflexion for iterative improvement on recurring task types. See Reflexion.
Self-Refine is a single-agent iterative refinement loop: generate, get feedback, refine. The same model both produces output and critiques it, using structured feedback to guide each revision. Unlike Reflexion, Self-Refine operates within a single task attempt rather than across attempts10). Use Self-Refine for tasks where quality improves measurably with iteration, such as code generation or creative writing. See Self-Refine.
Orchestration patterns define how multiple agents or processing stages are coordinated to accomplish complex workflows.
A central supervisor agent receives tasks, delegates them to specialized worker agents, collects results, and synthesizes a final output. The supervisor maintains the overall plan and decides which worker to invoke at each step. This pattern provides clear control flow and is easy to reason about, but the supervisor can become a bottleneck. Use it when you need centralized coordination and a clear chain of command. See Supervisor Pattern.
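The supervisor's plan–delegate–synthesize cycle can be sketched as follows, with hypothetical stub workers and a stubbed planning step:

```python
# Supervisor sketch: plan subtasks, delegate each to a named worker,
# synthesize the collected results. Workers are illustrative stubs.

WORKERS = {
    "researcher": lambda task: f"facts about {task}",
    "writer": lambda task: f"paragraph on {task}",
}

def plan(goal: str) -> list[tuple[str, str]]:
    """Stand-in for the supervisor's planning LLM call."""
    return [("researcher", goal), ("writer", goal)]

def supervise(goal: str) -> str:
    results = [WORKERS[worker](task) for worker, task in plan(goal)]
    return " | ".join(results)    # stand-in for a synthesis LLM call

print(supervise("quantum computing"))
```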
In the swarm pattern, agents operate as equals without a central coordinator. Each agent can communicate with any other agent, and control flows dynamically based on the conversation state. This produces emergent, flexible behavior but can be harder to debug and predict. Use swarm architectures for exploratory tasks or when no single agent has sufficient context to coordinate the others. See Swarm Pattern.
Hierarchical delegation extends the supervisor pattern into multiple levels: a top-level manager delegates to mid-level supervisors, who in turn delegate to specialized workers. This mirrors organizational hierarchies and scales to very complex tasks. Use hierarchical delegation when the problem naturally decomposes into domains and sub-domains. See Hierarchical Delegation.
The pipeline pattern chains agents in a fixed sequence, where each agent's output becomes the next agent's input. Each stage performs a specific transformation or enrichment. Pipelines are simple, predictable, and easy to test. Use them when the task naturally decomposes into ordered stages with clear interfaces. See Pipeline Pattern.
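A pipeline reduces to function composition over an ordered stage list. The three stages below are hypothetical stand-ins for specialized agents:

```python
# Pipeline sketch: each stage is text -> text; output feeds the next stage.
from functools import reduce

def extract(doc: str) -> str:
    return doc.strip().lower()            # stage 1: normalize

def enrich(text: str) -> str:
    return f"{text} [entities: none]"     # stage 2: annotate

def summarise(text: str) -> str:
    return text.split(" [")[0][:40]       # stage 3: condense

PIPELINE = [extract, enrich, summarise]

def run_pipeline(doc: str) -> str:
    # Fold the document through every stage in order.
    return reduce(lambda data, stage: stage(data), PIPELINE, doc)

print(run_pipeline("  Quarterly Revenue Grew 12%  "))
```

Because each stage has a plain function interface, stages can be unit-tested in isolation, which is the main operational advantage the text notes.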
Map-Reduce distributes independent subtasks across multiple agents in parallel (map phase), then aggregates their results into a final output (reduce phase). This pattern excels at processing large datasets or document collections where each item can be analyzed independently. Use Map-Reduce when subtasks are independent and parallelizable. See Map-Reduce Pattern.
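The two phases can be sketched with a thread pool for the parallel map; `analyse` is a stand-in for a per-document agent call:

```python
# Map-reduce sketch: parallel map over independent documents, then aggregate.
from concurrent.futures import ThreadPoolExecutor

def analyse(doc: str) -> int:
    """Stand-in for an agent that scores one document (here: word count)."""
    return len(doc.split())

def map_reduce(docs: list[str]) -> int:
    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(analyse, docs))   # map: independent, parallel
    return sum(partials)                           # reduce: aggregate results

print(map_reduce(["alpha beta", "gamma", "delta epsilon zeta"]))
```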
A router agent analyzes incoming requests and directs them to the most appropriate specialized agent or pipeline based on the request's content, intent, or complexity. This avoids the overhead of engaging all agents for every request. Use routing when you have diverse request types that require different processing strategies. See Router Pattern.
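A classify-then-dispatch router can be sketched as below. The keyword rules stand in for an LLM intent classifier, and the handler names are illustrative:

```python
# Router sketch: classify the request, dispatch to a specialized handler.

def classify(request: str) -> str:
    """Stand-in for an LLM intent classifier (toy keyword rules)."""
    text = request.lower()
    if any(w in text for w in ("refund", "charge", "invoice")):
        return "billing"
    if any(w in text for w in ("error", "crash", "bug")):
        return "support"
    return "general"

HANDLERS = {
    "billing": lambda r: f"billing agent handling: {r}",
    "support": lambda r: f"support agent handling: {r}",
    "general": lambda r: f"general agent handling: {r}",
}

def route(request: str) -> str:
    return HANDLERS[classify(request)](request)

print(route("I was charged twice, please refund me"))
```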
Memory patterns define how agents store and retrieve information across and within interactions.
Short-term memory is the agent's immediate conversational context — the current prompt and recent exchanges that fit within the LLM's context window. It is inherently limited by the model's maximum token count and is lost when the conversation ends. Effective short-term memory management involves summarizing older context, prioritizing recent and relevant information, and structuring prompts to maximize the utility of available tokens. See Short-Term Memory.
Long-term memory persists information across conversations using external storage, typically a vector database. The agent embeds information into vector representations and retrieves relevant memories via semantic similarity search. This enables agents to accumulate knowledge over time, remember user preferences, and build on past interactions. Use long-term memory for personalization, knowledge accumulation, and cross-session continuity. See Long-Term Memory.
Episodic memory stores records of specific past experiences — complete interactions, task attempts, successes, and failures — rather than just extracted facts. The agent can recall how it handled similar situations in the past, what worked, and what did not. This supports learning from experience and avoids repeating mistakes. Use episodic memory when the agent performs recurring tasks and should improve over time. See Episodic Memory.
Working memory provides the agent with an explicit scratchpad for storing intermediate results, partial computations, and temporary state during complex reasoning. Unlike short-term memory, the scratchpad is structured and the agent can read, write, and organize it deliberately. Use working memory for multi-step computations, complex data transformations, or any task where intermediate state must be tracked explicitly. See Working Memory.
Communication patterns define how information flows between agents and humans in an agentic system.
Human-in-the-loop communication establishes structured interaction points where the agent requests human input, confirmation, or correction. This can range from simple approval gates to rich collaborative workflows where the human and agent iterate together. Effective HITL design minimizes human cognitive load while maximizing oversight of critical decisions. See Human-in-the-Loop.
Agent-to-agent messaging enables direct communication between agents through structured message protocols. Messages can carry task assignments, results, queries, or coordination signals. Well-designed messaging protocols include clear message schemas, routing rules, and error handling. Use structured messaging when agents need to coordinate closely or share complex information. See Agent-to-Agent Messaging.
The shared blackboard pattern provides a common knowledge store that all agents can read from and write to. Agents post partial results, observations, and hypotheses to the blackboard, and other agents react to relevant updates. This decouples agents from each other — they interact through the shared state rather than direct messages. Use the blackboard pattern for collaborative problem-solving where multiple agents contribute partial solutions. See Blackboard Pattern.
Event-driven communication uses an event bus or message queue to decouple agent interactions. Agents publish events when they complete actions or detect relevant conditions, and other agents subscribe to events they care about. This pattern enables loose coupling, scalability, and asynchronous processing. Use event-driven architectures for systems with many agents that need to react to changing conditions. See Event-Driven Agents.
Reliability patterns ensure agent systems behave predictably and recover gracefully from failures.
When an LLM call, tool invocation, or API request fails, the agent retries with exponentially increasing delays between attempts. This handles transient failures — rate limits, network blips, temporary service outages — without overwhelming the failing service. Use retry with backoff as a baseline reliability pattern for all external calls. See Retry Patterns.
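The backoff schedule can be sketched as below; `max_attempts` and `base_delay` are illustrative defaults, and the sleep function is injectable so the loop can be tested without real delays:

```python
# Retry-with-exponential-backoff sketch: delays of 0.5s, 1s, 2s, ...
import time

def retry_with_backoff(call, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise                          # out of attempts: surface error
            sleep(base_delay * 2 ** attempt)   # exponentially growing delay

# Toy flaky call: fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(retry_with_backoff(flaky, sleep=lambda _: None))
```

Production versions usually add jitter to the delay so that many clients retrying at once do not hammer the service in synchronized waves.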
Fallback chains define a prioritized list of alternative strategies to try when the primary approach fails. For example, if GPT-4 is unavailable, fall back to Claude; if the primary API fails, try a cached result. Each level in the chain may trade off quality for availability. Use fallback chains for production systems where uptime is critical. See Fallback Chains.
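The prioritized-alternatives idea reduces to a try-in-order loop. The three providers below are hypothetical stand-ins for a primary model, a backup model, and a cache:

```python
# Fallback-chain sketch: return the first strategy that succeeds.

def primary_model(prompt: str) -> str:
    raise RuntimeError("primary provider unavailable")   # simulated outage

def secondary_model(prompt: str) -> str:
    return f"secondary answer to: {prompt}"

def cached_answer(prompt: str) -> str:
    return "stale cached answer"          # last resort: availability > quality

FALLBACK_CHAIN = [primary_model, secondary_model, cached_answer]

def call_with_fallbacks(prompt: str) -> str:
    errors = []
    for provider in FALLBACK_CHAIN:
        try:
            return provider(prompt)
        except Exception as exc:
            errors.append(exc)            # remember why this level failed
    raise RuntimeError(f"all fallbacks failed: {errors}")

print(call_with_fallbacks("summarise this"))
```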
The circuit breaker pattern monitors failure rates for external services and temporarily stops calling a service that is consistently failing. After a cooldown period, the circuit breaker allows a test request through to see if the service has recovered. This prevents cascading failures and wasted resources on calls that are unlikely to succeed. Use circuit breakers when your agent depends on external services with variable reliability. See Circuit Breaker.
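The open/closed state machine can be sketched as below. The threshold and cooldown values are illustrative, and the clock is injectable so the recovery path can be exercised deterministically:

```python
# Circuit-breaker sketch: trip after `threshold` consecutive failures,
# allow a test request again once `cooldown` seconds have elapsed.
import time

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold, self.cooldown, self.clock = threshold, cooldown, clock
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: skipping call")
            self.opened_at = None        # cooldown elapsed: allow a test request
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()   # trip the breaker
            raise
        self.failures = 0                # success resets the failure count
        return result

breaker = CircuitBreaker(threshold=2, cooldown=30.0)
```

Callers then wrap every external request, e.g. `breaker.call(lambda: some_api_request())`, and treat the "circuit open" error as an immediate, cheap failure.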
Guardrails enforce constraints on agent inputs and outputs through validation layers. Input guardrails filter or reject harmful, off-topic, or malformed requests. Output guardrails validate that agent responses meet format requirements, factual accuracy checks, safety criteria, or business rules. Use guardrails in any production deployment to prevent harmful or incorrect agent behavior. See Guardrails & Validation.
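An output guardrail is ultimately a validation function run before the response is released. The rules below (length cap, banned terms, sentence-ending check) are purely illustrative:

```python
# Output-guardrail sketch: return a list of violations; empty list = pass.
import re

BANNED = ("ssn", "credit card number")   # illustrative sensitive-data terms

def validate_output(response: str) -> list[str]:
    violations = []
    if len(response) > 500:
        violations.append("too long")
    if any(term in response.lower() for term in BANNED):
        violations.append("contains sensitive-data reference")
    if not re.search(r"[.!?]$", response.strip()):
        violations.append("does not end with a complete sentence")
    return violations

print(validate_output("Here is your summary."))
```

On violation, a deployment typically either blocks the response, regenerates it, or routes it to human review.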
The dual LLM pattern separates planning from execution using two different models or prompts. A capable, expensive model handles high-level planning and decision-making, while a faster, cheaper model handles routine execution steps. This balances quality and cost — critical thinking gets the best model, while mechanical tasks use an efficient one. Use the dual LLM pattern when cost optimization is important but complex reasoning quality must be preserved. See Dual LLM Pattern.
Efficiency patterns reduce the cost, latency, and resource consumption of agent systems.
Caching stores the results of previous LLM calls or tool invocations for reuse. Exact caching returns stored results when the input matches precisely. Semantic caching uses embedding similarity to return cached results for inputs that are semantically equivalent but not identical. Caching can dramatically reduce costs and latency for repetitive queries. Use caching when the agent handles recurring or similar requests. See Agent Caching.
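Exact caching is a dictionary keyed on the input; a semantic cache would replace the key lookup with an embedding similarity search, but the control flow is the same. `expensive_llm_call` is a hypothetical stand-in:

```python
# Exact-match cache sketch for LLM calls.

calls = {"count": 0}

def expensive_llm_call(prompt: str) -> str:
    calls["count"] += 1               # track how often we really hit the model
    return f"answer to: {prompt}"

_cache: dict[str, str] = {}

def cached_call(prompt: str) -> str:
    if prompt not in _cache:          # miss: pay for the real call once
        _cache[prompt] = expensive_llm_call(prompt)
    return _cache[prompt]             # hit: free and instant

cached_call("What is RAG?")
cached_call("What is RAG?")           # second call served from cache
print(calls["count"])                 # → 1
```

Real deployments also need an eviction and invalidation policy, since a cached answer can go stale when the underlying data changes.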
Speculative execution runs multiple possible next steps in parallel before knowing which one will actually be needed. When the decision point is reached, the pre-computed result for the chosen path is available immediately. This trades compute cost for reduced latency. Use speculative execution when latency is critical and the set of possible next steps is small and predictable. See Speculative Execution.
Budget-aware reasoning constrains the agent's resource consumption — limiting the number of LLM calls, total tokens, tool invocations, or wall-clock time. The agent must reason about how to allocate its budget across subtasks and may choose simpler strategies when resources are scarce. Use budget-aware reasoning in production systems where costs must be controlled or where response time SLAs exist. See Budget-Aware Reasoning.
Parallel tool calling executes multiple independent tool invocations simultaneously rather than sequentially. When the agent identifies that several pieces of information are needed and the requests are independent, it dispatches all calls at once and processes results as they arrive. For calls of similar duration, this can reduce latency by up to a factor equal to the number of parallel calls. Use parallel tool calling whenever the agent needs multiple independent pieces of external data. See Parallel Tool Calling.
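With async tools, the dispatch-all-at-once step is a single `asyncio.gather`. The two tools below are hypothetical stand-ins with simulated I/O latency:

```python
# Parallel tool-calling sketch: two independent calls run concurrently,
# so total latency is ~0.1s rather than ~0.2s sequentially.
import asyncio

async def fetch_weather(city: str) -> str:
    await asyncio.sleep(0.1)          # simulated I/O latency
    return f"weather in {city}: sunny"

async def fetch_news(topic: str) -> str:
    await asyncio.sleep(0.1)
    return f"news about {topic}: all quiet"

async def gather_context() -> list[str]:
    return await asyncio.gather(      # dispatch both calls at once
        fetch_weather("Berlin"),
        fetch_news("AI agents"),
    )

results = asyncio.run(gather_context())
print(results)
```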