====== Agent Error Recovery ======
Patterns for handling failures in AI agent systems, including retry with backoff, fallback chains, graceful degradation, error classification, and self-healing mechanisms.
===== Overview =====
AI agents fail differently from traditional software. LLM-powered systems face non-deterministic failures, partial successes, cascading errors, silent failures, and rate limiting. Production agent error handling requires patterns designed for these failure modes: the goal is not to prevent all errors but to enable autonomous recovery. Runtime mechanisms that manage model failures, tool invocation errors, and system faults are essential components of this strategy, though they should be reassessed as model reliability improves to avoid unnecessary overhead.(([[https://cobusgreyling.substack.com/p/architecting-agentic-ai-how-sdks-107|Cobus Greyling - Error Handling and Failure Recovery (2026)]]))
===== Error Classification =====
Before recovering, classify the error. Agent failures fall into four categories:
* **Syntactic Errors**: Malformed JSON, broken tool call formatting, invalid output structure
* **Semantic Errors**: Valid structure but nonsensical or contradictory content
* **Environmental Errors**: API timeouts, rate limits (429), network failures, service outages
* **Cognitive Errors**: Hallucinated tool names, incorrect reasoning, confidence miscalibration
Classification determines the recovery strategy. Rate limits need backoff; auth errors need no retry; malformed output needs re-prompting; hallucinations need model escalation.((Nebulagg. "AI Agent Error Handling: 4 Resilience Patterns in Python." [[https://dev.to/nebulagg/ai-agent-error-handling-4-resilience-patterns-in-python-12of|dev.to]]))
===== Retry with Exponential Backoff =====
The foundational recovery pattern. Increases wait time between attempts to avoid retry storms and respect rate limits. Always add jitter to prevent synchronized retries across multiple agents.((Sparkco AI. "Mastering Agent Error Recovery: Retry Logic." [[https://sparkco.ai/blog/mastering-agent-error-recovery-retry-logic|sparkco.ai]]))
Key principles:
* Differentiate transient vs permanent errors before retrying
* Add random jitter to prevent thundering herd
* Set maximum retry count and total timeout
* Only retry on retryable error classes (429, 500, 503, timeouts)
<code python>
import asyncio
import random
from enum import Enum
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

class ErrorSeverity(Enum):
    TRANSIENT = "transient"  # retry with backoff
    PERMANENT = "permanent"  # fail immediately
    DEGRADED = "degraded"    # try fallback

def classify_error(error: Exception) -> ErrorSeverity:
    status = getattr(error, "status_code", None)
    if status in (429, 500, 502, 503, 504):
        return ErrorSeverity.TRANSIENT
    if status in (401, 403, 404):
        return ErrorSeverity.PERMANENT
    if "context_length" in str(error).lower():
        return ErrorSeverity.DEGRADED
    return ErrorSeverity.TRANSIENT

async def retry_with_backoff(
    fn: Callable[[], Awaitable[T]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> T:
    for attempt in range(max_retries):
        try:
            return await fn()
        except Exception as e:
            severity = classify_error(e)
            if severity == ErrorSeverity.PERMANENT:
                raise
            if severity == ErrorSeverity.DEGRADED:
                raise  # let fallback chain handle it
            if attempt == max_retries - 1:
                raise
            # Exponential growth capped at max_delay, plus jitter
            # to avoid synchronized retries across agents.
            delay = min(base_delay * (2 ** attempt), max_delay)
            jitter = random.uniform(0, delay * 0.5)
            await asyncio.sleep(delay + jitter)
    raise ValueError("max_retries must be at least 1")
</code>
===== Fallback Chains =====
When a primary model or service fails, switch to progressively simpler alternatives. A fallback chain ensures the agent always produces some response.((Getathenic. "AI Agent Retry Strategies: Exponential Backoff and Graceful Degradation." [[https://getathenic.com/blog/ai-agent-retry-strategies-exponential-backoff|getathenic.com]]))
Design pattern: Primary model -> Cheaper/faster model -> Cached response -> Graceful error message.
<code python>
class FallbackChain:
    def __init__(self, handlers: list[Callable]):
        self.handlers = handlers

    async def execute(self, *args, **kwargs):
        last_error = None
        for handler in self.handlers:
            try:
                # Each handler gets its own retry budget before
                # the chain moves on to the next fallback.
                return await retry_with_backoff(
                    lambda h=handler: h(*args, **kwargs)
                )
            except Exception as e:
                last_error = e
                continue
        if last_error is None:
            raise ValueError("FallbackChain has no handlers")
        raise last_error
</code>

Usage: try GPT-4 -> [[claude|Claude]] -> cached response.

<code python>
chain = FallbackChain([
    call_gpt4,
    call_claude,
    get_cached_response,
])
result = await chain.execute(prompt="Summarize this document")
</code>
===== Graceful Degradation =====
Operate at three levels:
* **Operation-level**: Retry individual LLM calls or tool invocations
* **Step-level**: Re-plan or substitute alternative actions when a step fails
* **System-level**: Fall back to simpler models, cached responses, or human escalation
The key principle is that workflows should complete with available data rather than failing entirely when a single component encounters issues.
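As a minimal sketch of the step- and system-level principle (the `run_step` helper and the rate-fetching stubs are illustrative, not from any specific framework):

```python
import asyncio

async def run_step(primary, alternatives, default=None):
    """Step-level degradation: try the primary action, then each
    alternative; fall back to a default partial result instead of
    failing the whole workflow."""
    for action in [primary, *alternatives]:
        try:
            return await action()
        except Exception:
            continue
    return default  # system-level fallback: complete with available data

async def demo():
    async def fetch_live_rates():
        raise TimeoutError("live rates unavailable")
    async def fetch_cached_rates():
        return {"rates": "cached", "stale": True}
    return await run_step(fetch_live_rates, [fetch_cached_rates],
                          default={"rates": None})

result = asyncio.run(demo())  # workflow completes with stale data
```

The workflow here returns stale cached data rather than aborting, which is the degradation trade-off the three levels describe.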
===== Self-Healing Patterns =====
Self-healing allows agents to resolve errors without manual intervention:((TechFind777. "Building Self-Healing AI Agents: 7 Error Handling Patterns." [[https://dev.to/techfind777/building-self-healing-ai-agents-7-error-handling-patterns-that-keep-your-agent-running-at-3-am-5h81|dev.to]]))
* **Try-Rewrite-Retry**: Feed the error message back to the LLM so it can debug its own output and produce a corrected version
* **Automated resource scaling**: Dynamically allocate resources under load
* **Task redistribution**: Move work from failed components to healthy ones
* **Checkpoint recovery**: Resume from the last successful step rather than restarting from scratch
* **Self-Healing Query Loop**: Replace the standard request-response cycle with a continuous state machine that handles errors silently. This architecture recovers automatically when a model exhausts its output budget or fails a task by injecting [[meta|meta]]-messages to resume generation. Context window efficiency is maintained through compaction techniques that trim low-value information while preserving critical state for recovery.(([[https://alphasignalai.substack.com/p/anthropics-512k-line-code-leak-reveals|Alpha Signal AI - Anthropic's 512K Line Code Leak Reveals Self-Healing Query Loop Architecture]]))
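The Try-Rewrite-Retry pattern can be sketched as below; `call_model` is a hypothetical prompt-to-string callable, stubbed here with canned responses rather than a real LLM:

```python
import json

def try_rewrite_retry(call_model, prompt, max_attempts=3):
    """On a parse failure, feed the error message back to the model
    so it can debug and correct its own output."""
    current = prompt
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(current)
        try:
            return json.loads(raw)  # validate the structured output
        except json.JSONDecodeError as e:
            last_error = e
            current = (
                f"{prompt}\n\nYour previous output failed to parse: {e}.\n"
                f"Previous output: {raw}\nReturn corrected JSON only."
            )
    raise last_error

# Stub model: emits trailing-comma JSON once, then a corrected version.
responses = iter(['{"status": "ok",}', '{"status": "ok"}'])
result = try_rewrite_retry(lambda p: next(responses), "Return status as JSON")
```

The same loop generalizes to any validator (schema checks, tool-name lookups) in place of `json.loads`.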
===== Error Recovery Flow =====
<code>
graph TD
    A[Agent Action] --> B{Execute}
    B -->|Success| C[Return Result]
    B -->|Error| D{Classify Error}
    D -->|Transient| E[Retry with Backoff]
    D -->|Permanent| F[Fail Fast]
    D -->|Degraded| G[Fallback Chain]
    E -->|Success| C
    E -->|Max Retries| G
    G -->|Primary Model| H{Try Next}
    H -->|Success| C
    H -->|Fail| I[Cheaper Model]
    I -->|Success| C
    I -->|Fail| J[Cached Response]
    J -->|Available| C
    J -->|None| K[Human Escalation]
    F --> L[Log & Alert]
    K --> L
</code>
===== Circuit Breakers =====
Stop cascading failures by tracking error rates and temporarily disabling failing services. After a cooldown period, allow a test request through to check recovery.((Datagrid. "Exception Handling Frameworks for AI Agents." [[https://www.datagrid.com/blog/exception-handling-frameworks-ai-agents|datagrid.com]]))
States: Closed (normal) -> Open (blocking requests) -> Half-Open (testing recovery).
Circuit breakers prevent agents from burning through API budgets during outages and protect downstream services from retry storms.
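A minimal sketch of this state machine (the threshold and cooldown values are illustrative, not tuned recommendations):

```python
import time

class CircuitBreaker:
    """Closed -> Open -> Half-Open cycle for one downstream service."""

    def __init__(self, failure_threshold=5, cooldown=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True  # closed: normal operation
        if time.monotonic() - self.opened_at >= self.cooldown:
            return True  # half-open: let one test request through
        return False     # open: block to protect the downstream service

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # test request succeeded: close the circuit

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # trip to open
```

The agent calls `allow_request()` before each service call and reports the outcome back, so repeated failures stop burning API budget until the cooldown elapses.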
===== Observability =====
Production error recovery requires comprehensive monitoring:
* Standardized observability with OpenTelemetry for logs, metrics, and traces
* Real-time dashboards for task success/failure rates and rollback frequency
* Automated alerting on [[anomaly_detection|anomaly detection]]
* Structured logging with session IDs, inputs/outputs, and duration
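A structured-logging sketch along these lines (the `traced` decorator and its field names are illustrative, not a standard schema):

```python
import json
import logging
import time
import uuid
from functools import wraps

logger = logging.getLogger("agent")

def traced(fn):
    """Emit one JSON log line per call with a session ID,
    step name, outcome, and duration in milliseconds."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        session_id = kwargs.pop("session_id", None) or str(uuid.uuid4())
        start = time.monotonic()
        record = {"session_id": session_id, "step": fn.__name__}
        try:
            result = fn(*args, **kwargs)
            record["status"] = "success"
            return result
        except Exception as e:
            record["status"] = "error"
            record["error"] = repr(e)
            raise
        finally:
            record["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
            logger.info(json.dumps(record))
    return wrapper

@traced
def plan(task):
    return f"plan for {task}"
```

Emitting one machine-parseable line per step is what makes dashboards for success/failure rates and alerting on anomalies straightforward to build.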
A financial services case study showed that implementing these patterns achieved a 70% reduction in task failure rates and 50% decrease in mean time to resolution.((Arun Baby. "Error Handling and Recovery." [[https://www.arunbaby.com/ai-agents/0033-error-handling-recovery/|arunbaby.com]]))
===== See Also =====
* [[production_reliability_patterns|Production Reliability Patterns]]
* [[error_recovery|Error Recovery and Self-Correction]]
* [[agent_blind_spot_benchmarking|Agent Blind Spot Benchmarking]]
* [[agent_benchmark_blind_spots|Benchmarks for Agent Blind Spots]]
* [[durable_execution_for_agents|Durable Execution for Agents]]
===== References =====