Durable Execution for Agents

Durable execution is an infrastructure pattern that guarantees agent workflows complete correctly despite crashes, network failures, or long pauses.¹⁾²⁾³⁾ By persisting state at each step and enabling deterministic recovery, durable execution transforms fragile agent demos into production-ready systems that can run for hours, days, or weeks without losing progress. This framework enables long-running agents to maintain state across sessions through persistent workspaces, file-as-bus patterns, and resumable execution, supporting sub-agents, session hygiene, thread branching, and checkpoint/snapshot capabilities.⁴⁾ Durable agents are long-running systems capable of persistent execution, session management, and state preservation across task boundaries, maintaining workspace artifacts and enabling fork/resume snapshots for resumable computation.⁵⁾ Modern durable execution runtimes provide first-class support for pause/resume semantics, checkpointing, and replay, addressing deployment complexity through explicit intervention points and persistent state management for consequential actions.⁶⁾ Production infrastructure increasingly leverages subagent architectures and enterprise deployment patterns with streaming output capabilities, enabling agents-as-tools for sophisticated multi-agent systems.⁷⁾

graph TD A[Agent Runs Step] --> B[Checkpoint State] B --> C{Crash?} C -->|No| D[Next Step] C -->|Yes| E[Resume from Checkpoint] E --> D D --> F{More Steps?} F -->|Yes| A F -->|No| G[Workflow Complete]

Overview

Traditional LLM interactions are stateless request-response cycles. Production AI agents, however, execute multi-step workflows that span extended time periods, invoke external tools with side effects, and may pause for human approval. Without durability, a crash at step 9 of a 10-step workflow means restarting from scratch — wasting tokens, compute, and potentially causing duplicate side effects.

Durable execution crossed into the early majority in late 2025, driven by AI agent infrastructure needs. AWS released Durable Functions, Cloudflare shipped Workflows in GA, and Vercel launched its Workflow DevKit. Temporal raised $300M at a $5B valuation in February 2026, with 1.86 trillion lifetime actions from AI-native companies alone.⁸⁾ As a runtime feature, durable execution enables checkpointing, replay, and resumption of agent plans across service restarts or interruptions, with platforms like Cloudflare's Dynamic Workflows and LangChain's primitives making this a first-class feature for production agent deployment.⁹⁾ Emerging infrastructure solutions including Mistral Workflows support enterprise deployment patterns with fault tolerance and observability as core features.¹⁰⁾

Core Patterns

Checkpointing

Agents periodically save intermediate state to durable storage after each meaningful step. On failure, the system resumes from the last checkpoint rather than restarting. Two dominant mechanisms exist:

Journal-based replay — records each completed step as an event; on crash, replays the journal to reconstruct state without re-executing completed steps
Database checkpointing — persists full state snapshots after each node in the workflow graph

Durable execution pattern with checkpointing
class DurableAgent:
    def __init__(self, agent, state_store):
        self.agent = agent
        self.store = state_store
 
    def execute_task(self, task_id, steps):
        # Resume from last checkpoint if available
        state = self.store.load(task_id) or {"completed": [], "results": {}}
 
        for step in steps:
            if step.id in state["completed"]:
                continue  # Already completed, skip on replay
 
            result = self._execute_with_retry(step, state)
            state["results"][step.id] = result
            state["completed"].append(step.id)
 
            # Persist state after each step
            self.store.save(task_id, state)
 
        return state["results"]
 
    def _execute_with_retry(self, step, state, max_retries=3):
        for attempt in range(max_retries):
            try:
                return step.execute(state)
            except TransientError:
                continue
        raise PermanentFailure(f"Step {step.id} failed after {max_retries} retries")

Crash Recovery

Systems recover from infrastructure failures by replaying checkpoints and resuming from the last safe point. This works even after extended downtime — an agent that crashes overnight resumes exactly where it stopped, with all prior context intact.

Idempotent Tool Calls

Tools must be designed for safe retry. If a tool call (database update, API request, payment) partially succeeds before a crash, the system deduplicates on retry to avoid side effects like duplicate emails or double charges. Techniques include:

Idempotency keys — unique identifiers for each tool invocation enabling deduplication
Exactly-once semantics — the infrastructure guarantees each tool call completes exactly once
Compensating transactions — the Saga pattern adapted to AI workflows provides automatic rollback when multi-step tasks partially fail

Human-in-the-Loop Pauses

Production agents often require human approval before critical actions. Durable execution's suspend/resume primitives enable workflows to pause for hours or days awaiting approval without losing state. The agent's full context — reasoning history, intermediate results, planned next steps — persists across the pause.

Infrastructure Platforms

Temporal

Temporal is the most mature durable execution platform, built on deterministic workflow replay. Every LLM call, tool execution, and API request is captured as a workflow step. On crash, the runtime replays the journal and restores exact agent state.

Key integrations for AI agents:

OpenAI Agents SDK integration (September 2025)
Pydantic AI first-class support
Very long-running workflows (weeks to years)
Customers include OpenAI, Snap, Netflix, JPMorgan Chase

Inngest

Inngest brings durable execution to serverless environments with a focus on developer experience. Key capabilities for AI agents:

Low-latency patterns for interactive, user-facing agents (not just background workers)
Step-level durability — each function step is independently retried and checkpointed
Suspend/resume for human-in-the-loop approval flows
Concurrency controls to manage parallel agent execution

Inngest has identified and addressed a key challenge: minimizing inter-step latency overhead, which can accumulate to seconds of waste across many tool calls in an agent loop.

Restate

Restate enables durable AI loops across frameworks with flexible retries, promise completion for LLM calls, and hybrid FaaS/container support.¹¹⁾ It targets developers who want durability without vendor lock-in.

DBOS

DBOS provides durable tools for external interactions with fault-tolerant async workflows and full observability via traces. It focuses on making individual tool calls crashproof.

Dapr

Dapr provides durable workflows for agentic AI with adaptive execution, exactly-once semantics, and resilience to catastrophic failures. It supports multi-agent scaling with zero-trust access controls.

Mistral Workflows

Mistral Workflows deliver production infrastructure with fault tolerance and observability optimized for enterprise agent deployment, with support for subagent architectures and streaming output patterns.¹²⁾

Architectural Considerations

State Serialization

Agent state must be serializable to durable storage. This includes the conversation history, tool call results, intermediate reasoning, and any accumulated context. Large state objects (images, documents) may need external storage with references.

Deterministic Replay

For journal-based systems, the agent workflow must be deterministic — given the same inputs and recorded outputs, replay produces identical state. Non-deterministic operations (LLM calls, API requests, timestamps) are recorded and replayed from the journal rather than re-executed.

Failure Classification

Not all failures are equal. Durable execution systems distinguish between:

Transient failures (network timeouts, rate limits) — automatically retried with backoff
Permanent failures (invalid input, authorization errors) — escalated or compensated
LLM failures (hallucination, refusal) — may need different retry strategies or model fallback

Why Agents Need Durability

AI agents break traditional software assumptions in fundamental ways:

Probabilistic behavior — the same prompt can produce different responses, making idempotency more complex than simple caching
Compositional architecture — orchestration, LLM calls, tool invocations, and human approvals each introduce failure points
Long execution times — multi-hour workflows increase the probability of infrastructure failure
Expensive operations — re-executing LLM calls wastes tokens and money
Side effects — sent emails, published content, and financial transactions cannot be simply retried

References

AI Agent Knowledge Base

Sidebar

Table of Contents

Durable Execution for Agents

Overview

Core Patterns

Checkpointing

Crash Recovery

Idempotent Tool Calls

Human-in-the-Loop Pauses

Infrastructure Platforms

Temporal

Inngest

Restate

DBOS

Dapr

Mistral Workflows

Architectural Considerations

State Serialization

Deterministic Replay

Failure Classification

Why Agents Need Durability

See Also

References

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

Durable Execution for Agents

Overview

Core Patterns

Checkpointing

Crash Recovery

Idempotent Tool Calls

Human-in-the-Loop Pauses

Infrastructure Platforms

Temporal

Inngest

Restate

DBOS

Dapr

Mistral Workflows

Architectural Considerations

State Serialization

Deterministic Replay

Failure Classification

Why Agents Need Durability

See Also

References

Page Tools