====== Durable Execution for Agents ======

Durable execution is an infrastructure pattern that guarantees agent workflows complete correctly despite crashes, network failures, or long pauses. By persisting state at each step and enabling deterministic recovery, durable execution transforms fragile agent demos into production-ready systems that can run for hours, days, or weeks without losing progress.

```mermaid
graph TD
    A[Agent Runs Step] --> B[Checkpoint State]
    B --> C{Crash?}
    C -->|No| D[Next Step]
    C -->|Yes| E[Resume from Checkpoint]
    E --> D
    D --> F{More Steps?}
    F -->|Yes| A
    F -->|No| G[Workflow Complete]
```

===== Overview =====

Traditional LLM interactions are stateless request-response cycles. Production AI agents, however, execute multi-step workflows that span extended time periods, invoke external tools with side effects, and may pause for human approval. Without durability, a crash at step 9 of a 10-step workflow means restarting from scratch — wasting tokens and compute, and potentially causing duplicate side effects.

Durable execution crossed into the early majority in late 2025, driven by AI agent infrastructure needs. AWS released Durable Functions, Cloudflare shipped Workflows in GA, and Vercel launched its Workflow DevKit. Temporal raised $300M at a $5B valuation in February 2026, with 1.86 trillion lifetime actions from AI-native companies alone.

===== Core Patterns =====

==== Checkpointing ====

Agents periodically save intermediate state to durable storage after each meaningful step. On failure, the system resumes from the last checkpoint rather than restarting.
Two dominant mechanisms exist:

* **Journal-based replay** — records each completed step as an event; on crash, replays the journal to reconstruct state without re-executing completed steps
* **Database checkpointing** — persists full state snapshots after each node in the workflow graph

```python
# Durable execution pattern with checkpointing

class TransientError(Exception):
    """Retryable failure, e.g. a network timeout or rate limit."""

class PermanentFailure(Exception):
    """Non-retryable failure after exhausting retries."""

class DurableAgent:
    def __init__(self, agent, state_store):
        self.agent = agent
        self.store = state_store

    def execute_task(self, task_id, steps):
        # Resume from the last checkpoint if one exists
        state = self.store.load(task_id) or {"completed": [], "results": {}}
        for step in steps:
            if step.id in state["completed"]:
                continue  # Already completed, skip on replay
            result = self._execute_with_retry(step, state)
            state["results"][step.id] = result
            state["completed"].append(step.id)
            # Persist state after each step
            self.store.save(task_id, state)
        return state["results"]

    def _execute_with_retry(self, step, state, max_retries=3):
        for attempt in range(max_retries):
            try:
                return step.execute(state)
            except TransientError:
                continue
        raise PermanentFailure(f"Step {step.id} failed after {max_retries} retries")
```

==== Crash Recovery ====

Systems recover from infrastructure failures by replaying checkpoints and resuming from the last safe point. This works even after extended downtime — an agent that crashes overnight resumes exactly where it stopped, with all prior context intact.

==== Idempotent Tool Calls ====

Tools must be designed for safe retry. If a tool call (database update, API request, payment) partially succeeds before a crash, the system deduplicates on retry to avoid side effects like duplicate emails or double charges.
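The deduplication described above is commonly implemented with idempotency keys derived from the tool call itself. The following is a minimal sketch, not any specific platform's API; the `IdempotentToolRunner` name and in-memory cache are illustrative, and a real system would persist the recorded results durably:

```python
import hashlib
import json

class IdempotentToolRunner:
    """Dedupes tool calls by idempotency key so retries are side-effect free.
    Illustrative sketch: a production system would store results durably."""

    def __init__(self):
        self._results = {}  # idempotency key -> recorded result

    def call(self, tool_name, args, tool_fn):
        # Derive a stable key from the tool name and its arguments
        key = hashlib.sha256(
            json.dumps({"tool": tool_name, "args": args}, sort_keys=True).encode()
        ).hexdigest()
        if key in self._results:
            # Replay after a crash: return the recorded result, no side effect
            return self._results[key]
        result = tool_fn(**args)  # First execution: run the real tool
        self._results[key] = result
        return result
```

On retry after a crash, the same tool name and arguments hash to the same key, so the recorded result is returned instead of, say, sending a second email.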
Techniques include:

* **Idempotency keys** — unique identifiers for each tool invocation enabling deduplication
* **Exactly-once semantics** — the infrastructure guarantees each tool call completes exactly once
* **Compensating transactions** — the Saga pattern adapted to AI workflows provides automatic rollback when multi-step tasks partially fail

==== Human-in-the-Loop Pauses ====

Production agents often require human approval before critical actions. Durable execution's suspend/resume primitives enable workflows to pause for hours or days awaiting approval without losing state. The agent's full context — reasoning history, intermediate results, planned next steps — persists across the pause.

===== Infrastructure Platforms =====

==== Temporal ====

Temporal is the most mature durable execution platform, built on deterministic workflow replay. Every LLM call, tool execution, and API request is captured as a workflow step. On crash, the runtime replays the journal and restores exact agent state.

Key integrations and capabilities for AI agents:

* OpenAI Agents SDK integration (September 2025)
* Pydantic AI first-class support
* Very long-running workflows (weeks to years)
* Customers include OpenAI, Snap, Netflix, JPMorgan Chase

==== Inngest ====

Inngest brings durable execution to serverless environments with a focus on developer experience.

Key capabilities for AI agents:

* **Low-latency patterns** for interactive, user-facing agents (not just background workers)
* **Step-level durability** — each function step is independently retried and checkpointed
* **Suspend/resume** for human-in-the-loop approval flows
* **Concurrency controls** to manage parallel agent execution

Inngest has identified and addressed a key challenge: minimizing inter-step latency overhead, which can accumulate to seconds of waste across many tool calls in an agent loop.
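As a rough illustration of the suspend/resume primitive these platforms expose for human-in-the-loop flows, a workflow can park itself on a pending approval and be re-run later against persisted state. All names here (`WorkflowSuspended`, `ApprovalGate`) are hypothetical, not any platform's actual API:

```python
class WorkflowSuspended(Exception):
    """Raised to park the workflow until an external signal arrives."""

class ApprovalGate:
    """Hypothetical human-in-the-loop gate: suspends until a decision is recorded."""

    def __init__(self, approval_store):
        self.store = approval_store  # durable mapping: workflow_id -> decision

    def require_approval(self, workflow_id, action):
        decision = self.store.get(workflow_id)
        if decision is None:
            # No decision yet: the runtime checkpoints here and re-invokes the
            # workflow when a human records a decision (hours or days later)
            raise WorkflowSuspended(f"awaiting approval for {action}")
        return decision  # True (approved) or False (rejected)

def run_workflow(gate, workflow_id):
    # All state before this point would be restored from checkpoints on resume
    if gate.require_approval(workflow_id, "send_invoice"):
        return "invoice_sent"
    return "cancelled"
```

The first invocation suspends; once a decision lands in the durable store, re-running the workflow passes the gate and continues with full prior context.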
==== Restate ====

Restate enables durable AI loops across frameworks with flexible retries, promise completion for LLM calls, and hybrid FaaS/container support. It targets developers who want durability without vendor lock-in.

==== DBOS ====

DBOS provides durable tools for external interactions, with fault-tolerant async workflows and full observability via traces. It focuses on making individual tool calls crashproof.

==== Dapr ====

Dapr provides durable workflows for agentic AI with adaptive execution, exactly-once semantics, and resilience to catastrophic failures. It supports multi-agent scaling with zero-trust access controls.

===== Architectural Considerations =====

==== State Serialization ====

Agent state must be serializable to durable storage. This includes the conversation history, tool call results, intermediate reasoning, and any accumulated context. Large state objects (images, documents) may need external storage with references.

==== Deterministic Replay ====

For journal-based systems, the agent workflow must be deterministic — given the same inputs and recorded outputs, replay produces identical state. Non-deterministic operations (LLM calls, API requests, timestamps) are recorded and replayed from the journal rather than re-executed.

==== Failure Classification ====

Not all failures are equal.
Durable execution systems distinguish between:

* **Transient failures** (network timeouts, rate limits) — automatically retried with backoff
* **Permanent failures** (invalid input, authorization errors) — escalated or compensated
* **LLM failures** (hallucination, refusal) — may need different retry strategies or model fallback

===== Why Agents Need Durability =====

AI agents break traditional software assumptions in fundamental ways:

* **Probabilistic behavior** — the same prompt can produce different responses, making idempotency more complex than simple caching
* **Compositional architecture** — orchestration, LLM calls, tool invocations, and human approvals each introduce failure points
* **Long execution times** — multi-hour workflows increase the probability of infrastructure failure
* **Expensive operations** — re-executing LLM calls wastes tokens and money
* **Side effects** — sent emails, published content, and financial transactions cannot be simply retried

===== References =====

* [[https://www.inngest.com/blog/durable-execution-key-to-harnessing-ai-agents|Inngest: Durable Execution for AI Agents in Production]]
* [[https://zylos.ai/research/2026-02-17-durable-execution-ai-agents|Zylos Research: Durable Execution Patterns for AI Agents]]
* [[https://www.dbos.dev/blog/durable-execution-crashproof-ai-agents|DBOS: Crashproof AI Agents with Durable Execution]]
* [[https://restate.dev/blog/durable-ai-loops-fault-tolerance-across-frameworks-and-without-handcuffs/|Restate: Durable AI Loops]]
* [[https://temporal.io|Temporal.io]]

===== See Also =====

* [[long_horizon_agents|Long-Horizon Agents]]
* [[openhands|OpenHands]]
* [[agent_trajectory_optimization|Agent Trajectory Optimization]]
* [[continual_learning_agents|Continual Learning Agents]]