Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Patterns for handling failures in AI agent systems, including retry with backoff, fallback chains, graceful degradation, error classification, and self-healing mechanisms.
AI agents fail differently than traditional software. LLM-powered systems face non-deterministic failures, partial successes, cascading errors, silent failures, and rate limiting. Production agent error handling requires patterns designed specifically for these unique failure modes. The goal is not to prevent all errors but to enable autonomous recovery.
Before recovering, classify the error. Agent failures fall into four broad categories: transient infrastructure errors (rate limits, timeouts, 5xx responses), permanent errors (invalid credentials, missing resources), malformed output (responses that fail parsing or validation), and semantic errors (hallucinated or incorrect content).
Classification determines the recovery strategy: rate limits need backoff; auth errors should never be retried; malformed output needs re-prompting; hallucinations need model escalation.
The foundational recovery pattern. Increase the wait time between attempts to avoid retry storms and respect rate limits, and always add jitter to prevent synchronized retries across multiple agents.
Key principles: grow the delay exponentially, cap it at a maximum, add random jitter, and stop after a bounded number of attempts.
```python
import asyncio
import random
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")

class ErrorSeverity(Enum):
    TRANSIENT = "transient"   # retry with backoff
    PERMANENT = "permanent"   # fail immediately
    DEGRADED = "degraded"     # try fallback

def classify_error(error: Exception) -> ErrorSeverity:
    status = getattr(error, "status_code", None)
    if status in (429, 500, 502, 503, 504):
        return ErrorSeverity.TRANSIENT
    if status in (401, 403, 404):
        return ErrorSeverity.PERMANENT
    if "context_length" in str(error).lower():
        return ErrorSeverity.DEGRADED
    return ErrorSeverity.TRANSIENT

async def retry_with_backoff(
    fn: Callable,
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> T:
    for attempt in range(max_retries):
        try:
            return await fn()
        except Exception as e:
            severity = classify_error(e)
            if severity == ErrorSeverity.PERMANENT:
                raise
            if severity == ErrorSeverity.DEGRADED:
                raise  # let fallback chain handle it
            if attempt == max_retries - 1:
                raise
            delay = min(base_delay * (2 ** attempt), max_delay)
            jitter = random.uniform(0, delay * 0.5)
            await asyncio.sleep(delay + jitter)
```
When a primary model or service fails, switch to progressively simpler alternatives. A fallback chain ensures the agent always produces some response.
Design pattern: Primary model → Cheaper/faster model → Cached response → Graceful error message.
```python
class FallbackChain:
    def __init__(self, handlers: list[Callable]):
        self.handlers = handlers

    async def execute(self, *args, **kwargs):
        last_error = None
        for handler in self.handlers:
            try:
                return await retry_with_backoff(
                    lambda h=handler: h(*args, **kwargs)
                )
            except Exception as e:
                last_error = e
                continue
        raise last_error

# Usage: try GPT-4 -> Claude -> cached response
chain = FallbackChain([
    call_gpt4,
    call_claude,
    get_cached_response,
])
result = await chain.execute(prompt="Summarize this document")
```
Graceful degradation operates at three levels: the model (substitute cheaper or smaller models), the feature (disable optional capabilities), and the workflow (complete with partial data).
The key principle is that workflows should complete with available data rather than fail entirely when a single component encounters issues.
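Workflow-level degradation can be sketched as a helper that runs independent sub-tasks and returns whatever succeeded instead of failing the whole request. The `gather_partial` name and its result fields are illustrative, not from the original:

```python
import asyncio

async def gather_partial(tasks: dict):
    """Run independent sub-tasks; keep whatever succeeds.

    `tasks` maps a field name to a coroutine. Failed fields are
    reported as unavailable instead of failing the whole workflow.
    """
    results, errors = {}, {}
    # return_exceptions=True turns failures into values instead of raising
    outcomes = await asyncio.gather(*tasks.values(), return_exceptions=True)
    for name, outcome in zip(tasks, outcomes):
        if isinstance(outcome, Exception):
            errors[name] = str(outcome)  # degrade: note the gap
        else:
            results[name] = outcome
    return {"data": results, "unavailable": list(errors)}
```

The caller decides whether the partial result is still useful, e.g. rendering a report with a "data unavailable" note for the failed sections.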
Self-healing allows agents to resolve errors without manual intervention:
%% Mermaid diagram - render at mermaid.live %%
graph TD
A[Agent Action] --> B{Execute}
B -->|Success| C[Return Result]
B -->|Error| D{Classify Error}
D -->|Transient| E[Retry with Backoff]
D -->|Permanent| F[Fail Fast]
D -->|Degraded| G[Fallback Chain]
E -->|Success| C
E -->|Max Retries| G
G -->|Primary Model| H{Try Next}
H -->|Success| C
H -->|Fail| I[Cheaper Model]
I -->|Success| C
I -->|Fail| J[Cached Response]
J -->|Available| C
J -->|None| K[Human Escalation]
F --> L[Log & Alert]
K --> L
Stop cascading failures by tracking error rates and temporarily disabling failing services. After a cooldown period, allow a test request through to check recovery.
States: Closed (normal) → Open (blocking requests) → Half-Open (testing recovery).
Circuit breakers prevent agents from burning through API budgets during outages and protect downstream services from retry storms.
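The three states can be sketched as a minimal in-process breaker; the threshold and cooldown values below are illustrative defaults, not prescribed by the pattern:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None => circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: normal operation
        if time.monotonic() - self.opened_at >= self.cooldown:
            return True  # half-open: let a test request through
        return False     # open: block requests

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the circuit again

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # trip to open
```

Wrap each external call in `allow_request()` / `record_success()` / `record_failure()`; in a multi-process deployment the state would need to live in shared storage such as Redis rather than in memory.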
Production error recovery requires comprehensive monitoring: track retry counts, fallback activations, circuit-breaker state transitions, and recovery latency so that silent failures surface as alerts rather than going unnoticed.
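A minimal sketch of counting recovery events for dashboards and alerting; the `RecoveryMetrics` class and event names are assumptions for illustration:

```python
from collections import Counter

class RecoveryMetrics:
    """Count recovery events so dashboards can track error-handling health."""

    def __init__(self):
        self.events = Counter()

    def record(self, event: str) -> None:
        # e.g. "success", "retry", "fallback", "circuit_open"
        self.events[event] += 1

    def fallback_rate(self) -> float:
        """Fraction of completed requests served by a fallback handler."""
        total = self.events["success"] + self.events["fallback"]
        return self.events["fallback"] / total if total else 0.0
```

In practice these counters would be exported to a metrics backend (Prometheus, CloudWatch, etc.) with alert thresholds on the fallback and circuit-open rates.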
A financial services case study showed that implementing these patterns achieved a 70% reduction in task failure rates and 50% decrease in mean time to resolution.