Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Safety & Governance
Evaluation
Research
Development
Meta
Core Concepts
Reasoning Techniques
Memory Systems
Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools & Products
Safety & Governance
Evaluation
Research
Development
Meta
Durable execution is an infrastructure pattern that guarantees agent workflows complete correctly despite crashes, network failures, or long pauses. By persisting state at each step and enabling deterministic recovery, durable execution transforms fragile agent demos into production-ready systems that can run for hours, days, or weeks without losing progress.
Traditional LLM interactions are stateless request-response cycles. Production AI agents, however, execute multi-step workflows that span extended time periods, invoke external tools with side effects, and may pause for human approval. Without durability, a crash at step 9 of a 10-step workflow means restarting from scratch — wasting tokens, compute, and potentially causing duplicate side effects.
Durable execution crossed into the early majority in late 2025, driven by AI agent infrastructure needs. AWS released Durable Functions, Cloudflare shipped Workflows in GA, and Vercel launched its Workflow DevKit. Temporal raised $300M at a $5B valuation in February 2026, with 1.86 trillion lifetime actions from AI-native companies alone.
Agents periodically save intermediate state to durable storage after each meaningful step. On failure, the system resumes from the last checkpoint rather than restarting. Two dominant mechanisms exist:
# Durable execution pattern with checkpointing class DurableAgent: def __init__(self, agent, state_store): self.agent = agent self.store = state_store def execute_task(self, task_id, steps): # Resume from last checkpoint if available state = self.store.load(task_id) or {"completed": [], "results": {}} for step in steps: if step.id in state["completed"]: continue # Already completed, skip on replay result = self._execute_with_retry(step, state) state["results"][step.id] = result state["completed"].append(step.id) # Persist state after each step self.store.save(task_id, state) return state["results"] def _execute_with_retry(self, step, state, max_retries=3): for attempt in range(max_retries): try: return step.execute(state) except TransientError: continue raise PermanentFailure(f"Step {step.id} failed after {max_retries} retries")
Systems recover from infrastructure failures by replaying checkpoints and resuming from the last safe point. This works even after extended downtime — an agent that crashes overnight resumes exactly where it stopped, with all prior context intact.
Tools must be designed for safe retry. If a tool call (database update, API request, payment) partially succeeds before a crash, the system deduplicates on retry to avoid side effects like duplicate emails or double charges. Techniques include:
Production agents often require human approval before critical actions. Durable execution's suspend/resume primitives enable workflows to pause for hours or days awaiting approval without losing state. The agent's full context — reasoning history, intermediate results, planned next steps — persists across the pause.
Temporal is the most mature durable execution platform, built on deterministic workflow replay. Every LLM call, tool execution, and API request is captured as a workflow step. On crash, the runtime replays the journal and restores exact agent state.
Key integrations for AI agents:
Inngest brings durable execution to serverless environments with a focus on developer experience. Key capabilities for AI agents:
Inngest has identified and addressed a key challenge: minimizing inter-step latency overhead, which can accumulate to seconds of waste across many tool calls in an agent loop.
Restate enables durable AI loops across frameworks with flexible retries, promise completion for LLM calls, and hybrid FaaS/container support. It targets developers who want durability without vendor lock-in.
DBOS provides durable tools for external interactions with fault-tolerant async workflows and full observability via traces. It focuses on making individual tool calls crashproof.
Dapr provides durable workflows for agentic AI with adaptive execution, exactly-once semantics, and resilience to catastrophic failures. It supports multi-agent scaling with zero-trust access controls.
Agent state must be serializable to durable storage. This includes the conversation history, tool call results, intermediate reasoning, and any accumulated context. Large state objects (images, documents) may need external storage with references.
For journal-based systems, the agent workflow must be deterministic — given the same inputs and recorded outputs, replay produces identical state. Non-deterministic operations (LLM calls, API requests, timestamps) are recorded and replayed from the journal rather than re-executed.
Not all failures are equal. Durable execution systems distinguish between:
AI agents break traditional software assumptions in fundamental ways: