Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Patterns for handling failures in AI agent systems, including retry with backoff, fallback chains, graceful degradation, error classification, and self-healing mechanisms.
AI agents fail differently than traditional software. LLM-powered systems face non-deterministic failures, partial successes, cascading errors, silent failures, and rate limiting. Production agent error handling requires patterns designed specifically for these unique failure modes. The goal is not to prevent all errors but to enable autonomous recovery.
Before recovering, classify the error. Agent failures fall into four broad categories: transient infrastructure errors (rate limits, timeouts, 5xx responses), permanent errors (invalid credentials, missing resources), malformed output (responses that fail parsing or validation), and semantic errors (hallucinated or incorrect content).
Classification determines the recovery strategy: rate limits need backoff; auth errors should never be retried; malformed output needs re-prompting; hallucinations need model escalation.
The foundational recovery pattern. Increase the wait time between attempts to avoid retry storms and respect rate limits, and always add jitter to prevent synchronized retries across multiple agents.
Key principles: grow the delay exponentially, cap it at a maximum, add random jitter, and stop after a bounded number of attempts.
```python
import asyncio
import random
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")

class ErrorSeverity(Enum):
    TRANSIENT = "transient"   # retry with backoff
    PERMANENT = "permanent"   # fail immediately
    DEGRADED = "degraded"     # try fallback

def classify_error(error: Exception) -> ErrorSeverity:
    status = getattr(error, "status_code", None)
    if status in (429, 500, 502, 503, 504):
        return ErrorSeverity.TRANSIENT
    if status in (401, 403, 404):
        return ErrorSeverity.PERMANENT
    if "context_length" in str(error).lower():
        return ErrorSeverity.DEGRADED
    return ErrorSeverity.TRANSIENT

async def retry_with_backoff(
    fn: Callable,
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> T:
    for attempt in range(max_retries):
        try:
            return await fn()
        except Exception as e:
            severity = classify_error(e)
            if severity == ErrorSeverity.PERMANENT:
                raise
            if severity == ErrorSeverity.DEGRADED:
                raise  # let fallback chain handle it
            if attempt == max_retries - 1:
                raise
            delay = min(base_delay * (2 ** attempt), max_delay)
            jitter = random.uniform(0, delay * 0.5)
            await asyncio.sleep(delay + jitter)
```
When a primary model or service fails, switch to progressively simpler alternatives. A fallback chain ensures the agent always produces some response.
Design pattern: Primary model → Cheaper/faster model → Cached response → Graceful error message.
```python
class FallbackChain:
    def __init__(self, handlers: list[Callable]):
        self.handlers = handlers

    async def execute(self, *args, **kwargs):
        last_error = None
        for handler in self.handlers:
            try:
                return await retry_with_backoff(
                    lambda h=handler: h(*args, **kwargs)
                )
            except Exception as e:
                last_error = e
                continue
        raise last_error

# Usage: try GPT-4 -> Claude -> cached response
chain = FallbackChain([
    call_gpt4,
    call_claude,
    get_cached_response,
])
result = await chain.execute(prompt="Summarize this document")
```
Graceful degradation operates at three levels: the model (substitute cheaper or smaller models), the feature (disable optional capabilities), and the workflow (complete with partial data).
The key principle is that workflows should complete with available data rather than fail entirely when a single component encounters issues.
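Workflow-level degradation can be sketched as a helper that runs independent sub-tasks and returns whatever succeeded instead of failing the whole request. The `gather_partial` name and its result fields are illustrative, not from the original:

```python
import asyncio

async def gather_partial(tasks: dict):
    """Run independent sub-tasks; keep whatever succeeds.

    `tasks` maps a field name to a coroutine. Failed fields are
    reported as unavailable instead of failing the whole workflow.
    """
    results, errors = {}, {}
    # return_exceptions=True turns failures into values instead of raising
    outcomes = await asyncio.gather(*tasks.values(), return_exceptions=True)
    for name, outcome in zip(tasks, outcomes):
        if isinstance(outcome, Exception):
            errors[name] = str(outcome)  # degrade: note the gap
        else:
            results[name] = outcome
    return {"data": results, "unavailable": list(errors)}
```

The caller decides whether the partial result is still useful, e.g. rendering a report with a "data unavailable" note for the failed sections.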
Self-healing allows agents to resolve errors without manual intervention:
%% Mermaid diagram - render at mermaid.live %%
graph TD
A[Agent Action] --> B{Execute}
B -->|Success| C[Return Result]
B -->|Error| D{Classify Error}
D -->|Transient| E[Retry with Backoff]
D -->|Permanent| F[Fail Fast]
D -->|Degraded| G[Fallback Chain]
E -->|Success| C
E -->|Max Retries| G
G -->|Primary Model| H{Try Next}
H -->|Success| C
H -->|Fail| I[Cheaper Model]
I -->|Success| C
I -->|Fail| J[Cached Response]
J -->|Available| C
J -->|None| K[Human Escalation]
F --> L[Log & Alert]
K --> L
Stop cascading failures by tracking error rates and temporarily disabling failing services. After a cooldown period, allow a test request through to check recovery.
States: Closed (normal) → Open (blocking requests) → Half-Open (testing recovery).
Circuit breakers prevent agents from burning through API budgets during outages and protect downstream services from retry storms.
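The three states can be sketched as a minimal in-process breaker; the threshold and cooldown values below are illustrative defaults, not prescribed by the pattern:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None => circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: normal operation
        if time.monotonic() - self.opened_at >= self.cooldown:
            return True  # half-open: let a test request through
        return False     # open: block requests

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the circuit again

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # trip to open
```

Wrap each external call in `allow_request()` / `record_success()` / `record_failure()`; in a multi-process deployment the state would need to live in shared storage such as Redis rather than in memory.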
Production error recovery requires comprehensive monitoring: track retry counts, fallback activations, circuit-breaker state transitions, and recovery latency so that silent failures surface as alerts rather than going unnoticed.
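A minimal sketch of counting recovery events for dashboards and alerting; the `RecoveryMetrics` class and event names are assumptions for illustration:

```python
from collections import Counter

class RecoveryMetrics:
    """Count recovery events so dashboards can track error-handling health."""

    def __init__(self):
        self.events = Counter()

    def record(self, event: str) -> None:
        # e.g. "success", "retry", "fallback", "circuit_open"
        self.events[event] += 1

    def fallback_rate(self) -> float:
        """Fraction of completed requests served by a fallback handler."""
        total = self.events["success"] + self.events["fallback"]
        return self.events["fallback"] / total if total else 0.0
```

In practice these counters would be exported to a metrics backend (Prometheus, CloudWatch, etc.) with alert thresholds on the fallback and circuit-open rates.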
A financial services case study showed that implementing these patterns achieved a 70% reduction in task failure rates and 50% decrease in mean time to resolution.