====== Durable Execution for Agents ======

Durable execution is an infrastructure pattern that guarantees agent workflows complete correctly despite crashes, network failures, or long pauses. By persisting state at each step and enabling deterministic recovery, durable execution transforms fragile agent demos into production-ready systems that can run for hours, days, or weeks without losing progress.

```mermaid
graph TD
    A[Agent Runs Step] --> B[Checkpoint State]
    B --> C{Crash?}
    C -->|No| D[Next Step]
    C -->|Yes| E[Resume from Checkpoint]
    E --> D
    D --> F{More Steps?}
    F -->|Yes| A
    F -->|No| G[Workflow Complete]
```

===== Overview =====

Traditional LLM interactions are stateless request-response cycles. Production AI agents, however, execute multi-step workflows that span extended time periods, invoke external tools with side effects, and may pause for human approval. Without durability, a crash at step 9 of a 10-step workflow means restarting from scratch — wasting tokens and compute, and potentially causing duplicate side effects.

Durable execution crossed into the early majority in late 2025, driven by AI agent infrastructure needs. AWS released Durable Functions, Cloudflare shipped Workflows in GA, and Vercel launched its Workflow DevKit. Temporal raised $300M at a $5B valuation in February 2026, with 1.86 trillion lifetime actions from AI-native companies alone.

===== Core Patterns =====

==== Checkpointing ====

Agents periodically save intermediate state to durable storage after each meaningful step. On failure, the system resumes from the last checkpoint rather than restarting.
Two dominant mechanisms exist:

* **Journal-based replay** — records each completed step as an event; on crash, replays the journal to reconstruct state without re-executing completed steps
* **Database checkpointing** — persists full state snapshots after each node in the workflow graph

```python
# Durable execution pattern with checkpointing

class TransientError(Exception):
    """Retryable failure, e.g. a network timeout or rate limit."""

class PermanentFailure(Exception):
    """Non-retryable failure after exhausting retries."""

class DurableAgent:
    def __init__(self, agent, state_store):
        self.agent = agent
        self.store = state_store

    def execute_task(self, task_id, steps):
        # Resume from the last checkpoint if one exists
        state = self.store.load(task_id) or {"completed": [], "results": {}}
        for step in steps:
            if step.id in state["completed"]:
                continue  # Already completed, skip on replay
            result = self._execute_with_retry(step, state)
            state["results"][step.id] = result
            state["completed"].append(step.id)
            # Persist state after each step
            self.store.save(task_id, state)
        return state["results"]

    def _execute_with_retry(self, step, state, max_retries=3):
        for attempt in range(max_retries):
            try:
                return step.execute(state)
            except TransientError:
                continue
        raise PermanentFailure(f"Step {step.id} failed after {max_retries} retries")
```

==== Crash Recovery ====

Systems recover from infrastructure failures by replaying checkpoints and resuming from the last safe point. This works even after extended downtime — an agent that crashes overnight resumes exactly where it stopped, with all prior context intact.

==== Idempotent Tool Calls ====

Tools must be designed for safe retry. If a tool call (database update, API request, payment) partially succeeds before a crash, the system deduplicates on retry to avoid side effects like duplicate emails or double charges.
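The deduplication described above is commonly implemented with idempotency keys derived from the tool call itself. The following is a minimal sketch, not any specific platform's API; the `IdempotentToolRunner` name and in-memory cache are illustrative, and a real system would persist the recorded results durably:

```python
import hashlib
import json

class IdempotentToolRunner:
    """Dedupes tool calls by idempotency key so retries are side-effect free.
    Illustrative sketch: a production system would store results durably."""

    def __init__(self):
        self._results = {}  # idempotency key -> recorded result

    def call(self, tool_name, args, tool_fn):
        # Derive a stable key from the tool name and its arguments
        key = hashlib.sha256(
            json.dumps({"tool": tool_name, "args": args}, sort_keys=True).encode()
        ).hexdigest()
        if key in self._results:
            # Replay after a crash: return the recorded result, no side effect
            return self._results[key]
        result = tool_fn(**args)  # First execution: run the real tool
        self._results[key] = result
        return result
```

On retry after a crash, the same tool name and arguments hash to the same key, so the recorded result is returned instead of, say, sending a second email.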
Techniques include:

* **Idempotency keys** — unique identifiers for each tool invocation enabling deduplication
* **Exactly-once semantics** — the infrastructure guarantees each tool call completes exactly once
* **Compensating transactions** — the Saga pattern adapted to AI workflows provides automatic rollback when multi-step tasks partially fail

==== Human-in-the-Loop Pauses ====

Production agents often require human approval before critical actions. Durable execution's suspend/resume primitives enable workflows to pause for hours or days awaiting approval without losing state. The agent's full context — reasoning history, intermediate results, planned next steps — persists across the pause.

===== Infrastructure Platforms =====

==== Temporal ====

Temporal is the most mature durable execution platform, built on deterministic workflow replay. Every LLM call, tool execution, and API request is captured as a workflow step. On crash, the runtime replays the journal and restores exact agent state.

Key integrations and capabilities for AI agents:

* OpenAI Agents SDK integration (September 2025)
* Pydantic AI first-class support
* Very long-running workflows (weeks to years)
* Customers include OpenAI, Snap, Netflix, JPMorgan Chase

==== Inngest ====

Inngest brings durable execution to serverless environments with a focus on developer experience.

Key capabilities for AI agents:

* **Low-latency patterns** for interactive, user-facing agents (not just background workers)
* **Step-level durability** — each function step is independently retried and checkpointed
* **Suspend/resume** for human-in-the-loop approval flows
* **Concurrency controls** to manage parallel agent execution

Inngest has identified and addressed a key challenge: minimizing inter-step latency overhead, which can accumulate to seconds of waste across many tool calls in an agent loop.
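As a rough illustration of the suspend/resume primitive these platforms expose for human-in-the-loop flows, a workflow can park itself on a pending approval and be re-run later against persisted state. All names here (`WorkflowSuspended`, `ApprovalGate`) are hypothetical, not any platform's actual API:

```python
class WorkflowSuspended(Exception):
    """Raised to park the workflow until an external signal arrives."""

class ApprovalGate:
    """Hypothetical human-in-the-loop gate: suspends until a decision is recorded."""

    def __init__(self, approval_store):
        self.store = approval_store  # durable mapping: workflow_id -> decision

    def require_approval(self, workflow_id, action):
        decision = self.store.get(workflow_id)
        if decision is None:
            # No decision yet: the runtime checkpoints here and re-invokes the
            # workflow when a human records a decision (hours or days later)
            raise WorkflowSuspended(f"awaiting approval for {action}")
        return decision  # True (approved) or False (rejected)

def run_workflow(gate, workflow_id):
    # All state before this point would be restored from checkpoints on resume
    if gate.require_approval(workflow_id, "send_invoice"):
        return "invoice_sent"
    return "cancelled"
```

The first invocation suspends; once a decision lands in the durable store, re-running the workflow passes the gate and continues with full prior context.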
==== Restate ====

Restate enables durable AI loops across frameworks with flexible retries, promise completion for LLM calls, and hybrid FaaS/container support. It targets developers who want durability without vendor lock-in.

==== DBOS ====

DBOS provides durable tools for external interactions, with fault-tolerant async workflows and full observability via traces. It focuses on making individual tool calls crashproof.

==== Dapr ====

Dapr provides durable workflows for agentic AI with adaptive execution, exactly-once semantics, and resilience to catastrophic failures. It supports multi-agent scaling with zero-trust access controls.

===== Architectural Considerations =====

==== State Serialization ====

Agent state must be serializable to durable storage. This includes the conversation history, tool call results, intermediate reasoning, and any accumulated context. Large state objects (images, documents) may need external storage with references.

==== Deterministic Replay ====

For journal-based systems, the agent workflow must be deterministic — given the same inputs and recorded outputs, replay produces identical state. Non-deterministic operations (LLM calls, API requests, timestamps) are recorded and replayed from the journal rather than re-executed.

==== Failure Classification ====

Not all failures are equal.
Durable execution systems distinguish between:

* **Transient failures** (network timeouts, rate limits) — automatically retried with backoff
* **Permanent failures** (invalid input, authorization errors) — escalated or compensated
* **LLM failures** (hallucination, refusal) — may need different retry strategies or model fallback

===== Why Agents Need Durability =====

AI agents break traditional software assumptions in fundamental ways:

* **Probabilistic behavior** — the same prompt can produce different responses, making idempotency more complex than simple caching
* **Compositional architecture** — orchestration, LLM calls, tool invocations, and human approvals each introduce failure points
* **Long execution times** — multi-hour workflows increase the probability of infrastructure failure
* **Expensive operations** — re-executing LLM calls wastes tokens and money
* **Side effects** — sent emails, published content, and financial transactions cannot be simply retried

===== References =====

* [[https://www.inngest.com/blog/durable-execution-key-to-harnessing-ai-agents|Inngest: Durable Execution for AI Agents in Production]]
* [[https://zylos.ai/research/2026-02-17-durable-execution-ai-agents|Zylos Research: Durable Execution Patterns for AI Agents]]
* [[https://www.dbos.dev/blog/durable-execution-crashproof-ai-agents|DBOS: Crashproof AI Agents with Durable Execution]]
* [[https://restate.dev/blog/durable-ai-loops-fault-tolerance-across-frameworks-and-without-handcuffs/|Restate: Durable AI Loops]]
* [[https://temporal.io|Temporal.io]]

===== See Also =====

* [[long_horizon_agents|Long-Horizon Agents]]
* [[openhands|OpenHands]]
* [[agent_trajectory_optimization|Agent Trajectory Optimization]]
* [[continual_learning_agents|Continual Learning Agents]]