====== Agent Error Recovery ======

Patterns for handling failures in AI agent systems, including retry with backoff, fallback chains, graceful degradation, error classification, and self-healing mechanisms.

===== Overview =====

AI agents fail differently than traditional software. LLM-powered systems face non-deterministic failures, partial successes, cascading errors, silent failures, and rate limiting. Production agent error handling requires patterns designed specifically for these failure modes. The goal is not to prevent all errors but to enable autonomous recovery.

===== Error Classification =====

Before recovering, classify the error. Agent failures fall into four categories:

* **Syntactic errors** -- malformed JSON, broken tool-call formatting, invalid output structure
* **Semantic errors** -- valid structure but nonsensical or contradictory content
* **Environmental errors** -- API timeouts, rate limits (429), network failures, service outages
* **Cognitive errors** -- hallucinated tool names, incorrect reasoning, confidence miscalibration

Classification determines the recovery strategy: rate limits need backoff, auth errors need no retry, malformed output needs re-prompting, and hallucinations need model escalation.

===== Retry with Exponential Backoff =====

The foundational recovery pattern: increase the wait time between attempts to avoid retry storms and respect rate limits. Always add jitter to prevent synchronized retries across multiple agents.
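To make the schedule concrete, here is the deterministic part of the delay calculation, using the same defaults as the implementation that follows (1 s base delay, 60 s cap); the seven attempts are just for illustration:

```python
# Exponential backoff schedule with jitter omitted:
# base delay doubles each attempt and is capped at max_delay.
base_delay, max_delay = 1.0, 60.0

schedule = [min(base_delay * (2 ** attempt), max_delay) for attempt in range(7)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```

With jitter, each sleep is then extended by a random 0-50% of the computed delay, so concurrent agents drift apart instead of retrying in lockstep.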
Key principles:

* Differentiate transient vs. permanent errors before retrying
* Add random jitter to prevent a thundering herd
* Set a maximum retry count and total timeout
* Only retry on retryable error classes (429, 500, 503, timeouts)

```python
import asyncio
import random
from enum import Enum
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class ErrorSeverity(Enum):
    TRANSIENT = "transient"  # retry with backoff
    PERMANENT = "permanent"  # fail immediately
    DEGRADED = "degraded"    # try fallback


def classify_error(error: Exception) -> ErrorSeverity:
    status = getattr(error, "status_code", None)
    if status in (429, 500, 502, 503, 504):
        return ErrorSeverity.TRANSIENT
    if status in (401, 403, 404):
        return ErrorSeverity.PERMANENT
    if "context_length" in str(error).lower():
        return ErrorSeverity.DEGRADED
    return ErrorSeverity.TRANSIENT


async def retry_with_backoff(
    fn: Callable[[], Awaitable[T]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> T:
    for attempt in range(max_retries):
        try:
            return await fn()
        except Exception as e:
            severity = classify_error(e)
            if severity == ErrorSeverity.PERMANENT:
                raise
            if severity == ErrorSeverity.DEGRADED:
                raise  # let the fallback chain handle it
            if attempt == max_retries - 1:
                raise
            delay = min(base_delay * (2 ** attempt), max_delay)
            jitter = random.uniform(0, delay * 0.5)
            await asyncio.sleep(delay + jitter)
```

===== Fallback Chains =====

When a primary model or service fails, switch to progressively simpler alternatives. A fallback chain ensures the agent always produces some response.

Design pattern: primary model -> cheaper/faster model -> cached response -> graceful error message.
```python
class FallbackChain:
    def __init__(self, handlers: list[Callable]):
        self.handlers = handlers

    async def execute(self, *args, **kwargs):
        last_error = None
        for handler in self.handlers:
            try:
                return await retry_with_backoff(
                    lambda h=handler: h(*args, **kwargs)
                )
            except Exception as e:
                last_error = e
                continue
        raise last_error


# Usage: try GPT-4 -> Claude -> cached response
chain = FallbackChain([
    call_gpt4,
    call_claude,
    get_cached_response,
])
result = await chain.execute(prompt="Summarize this document")
```

===== Graceful Degradation =====

Operate at three levels:

* **Operation-level** -- retry individual LLM calls or tool invocations
* **Step-level** -- re-plan or substitute alternative actions when a step fails
* **System-level** -- fall back to simpler models, cached responses, or human escalation

The key principle is that workflows should complete with available data rather than failing entirely when a single component encounters issues.

===== Self-Healing Patterns =====

Self-healing allows agents to resolve errors without manual intervention:

* **Try-rewrite-retry** -- feed the error message back to the LLM so it can debug its own output and produce a corrected version
* **Automated resource scaling** -- dynamically allocate resources under load
* **Task redistribution** -- move work from failed components to healthy ones
* **Checkpoint recovery** -- resume from the last successful step rather than restarting from scratch

===== Error Recovery Flow =====

```mermaid
graph TD
    A[Agent Action] --> B{Execute}
    B -->|Success| C[Return Result]
    B -->|Error| D{Classify Error}
    D -->|Transient| E[Retry with Backoff]
    D -->|Permanent| F[Fail Fast]
    D -->|Degraded| G[Fallback Chain]
    E -->|Success| C
    E -->|Max Retries| G
    G -->|Primary Model| H{Try Next}
    H -->|Success| C
    H -->|Fail| I[Cheaper Model]
    I -->|Success| C
    I -->|Fail| J[Cached Response]
    J -->|Available| C
    J -->|None| K[Human Escalation]
    F --> L[Log & Alert]
    K --> L
```

===== Circuit Breakers =====

Stop cascading failures by tracking error
rates and temporarily disabling failing services. After a cooldown period, allow a single test request through to check for recovery.

States: Closed (normal) -> Open (blocking requests) -> Half-Open (testing recovery).

Circuit breakers prevent agents from burning through API budgets during outages and protect downstream services from retry storms.

===== Observability =====

Production error recovery requires comprehensive monitoring:

* Standardized observability with OpenTelemetry for logs, metrics, and traces
* Real-time dashboards for task success/failure rates and rollback frequency
* Automated alerting on anomaly detection
* Structured logging with session IDs, inputs/outputs, and duration

A financial services case study reported a 70% reduction in task failure rates and a 50% decrease in mean time to resolution after implementing these patterns.

===== References =====

* [[https://dev.to/nebulagg/ai-agent-error-handling-4-resilience-patterns-in-python-12of|AI Agent Error Handling: 4 Resilience Patterns in Python]]
* [[https://dev.to/techfind777/building-self-healing-ai-agents-7-error-handling-patterns-that-keep-your-agent-running-at-3-am-5h81|Building Self-Healing AI Agents: 7 Error Handling Patterns]]
* [[https://getathenic.com/blog/ai-agent-retry-strategies-exponential-backoff|AI Agent Retry Strategies: Exponential Backoff and Graceful Degradation]]
* [[https://sparkco.ai/blog/mastering-agent-error-recovery-retry-logic|Mastering Agent Error Recovery: Retry Logic]]
* [[https://www.datagrid.com/blog/exception-handling-frameworks-ai-agents|Exception Handling Frameworks for AI Agents]]
* [[https://www.arunbaby.com/ai-agents/0033-error-handling-recovery/|Error Handling and Recovery - Arun Baby]]

===== See Also =====

* [[tool_result_parsing|Tool Result Parsing]]
* [[agent_memory_persistence|Agent Memory Persistence]]
* [[agent_observability|Agent Observability]]