Your Agent Will Crash Mid-Task. Durable Execution Is How It Survives
TL;DR, durable execution for AI agents in 2026:
- Durable execution is the pattern that lets a long-running agent survive a crash, a deploy, or a rate-limit, and resume from the last completed step instead of starting over. It moved from niche to early-majority through late 2025, with AI agents named as the driver (Inngest).
- The core trick is replay: non-deterministic side-effects (model output, retrieval, timestamps) are recorded the first time and reused on recovery, not re-executed. The agent walks back through its own history and skips everything already done.
- That trick has one sharp edge. Every external write needs an idempotency key bound to the workflow's state, or recovery will happily charge the card twice, send the email twice, create the row twice.
- You feel this rule the day a worker dies after the side-effect lands but before the checkpoint records it. Replay sees an incomplete step and runs it again. Without a key, "again" means a duplicate in the real world.
- Temporal, Cloudflare Workflows, AWS, Vercel, and orchestration libraries like LangGraph all give you durability in different shapes. The mechanic underneath is the same, and so is the rule you cannot skip.
The first time I lost work to a crash mid-task, the agent had already done the expensive part. It had reasoned through a long chain of tool calls, made an external write, and then the worker process died before anything recorded that the write had happened. When it came back up it did not know it had succeeded, so it tried again. The second attempt was, from the outside world's point of view, a brand new request identical to the first, and the world dutifully obliged it twice. Nothing crashed loudly. There was just one more row than there should have been.
This piece is about that lesson and the pattern that solves it: durable execution for AI agents. I write it grounded in the public 2026 sources that consolidated the practice, linked inline so you can read the primary material yourself. The opinions about where this bites hardest are first-hand: I run a long-running autonomous agent with external side-effects, and the rule at the centre of this article is one I learned by violating it. I mark the first-hand parts clearly so you can separate the documented mechanism from my scar tissue.
What is durable execution for AI agents?
Durable execution is a way of running a program so that its progress is persisted step by step, letting it resume exactly where it left off after any interruption. Instead of holding the whole task in the memory of one process, where a crash erases it, the runtime writes an append-only history of every completed step. If the process dies, a fresh worker reads that history, fast-forwards through the steps already done, and carries on. The program is written as ordinary linear code; the durability is the runtime's job, not yours.
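To make the append-only history concrete, here is a minimal sketch in Python. It is a toy, not any real runtime's API: `DurableRun`, the JSON-lines file layout, and the `plan` and `research` steps are all assumptions made up for illustration.

```python
import json
import os
import tempfile


class DurableRun:
    """Toy durable runner: each completed step is appended to a JSON-lines history."""

    def __init__(self, history_path):
        self.history_path = history_path
        self.completed = {}
        # Fast-forward: load the results of steps that finished before any crash.
        if os.path.exists(history_path):
            with open(history_path) as f:
                for line in f:
                    event = json.loads(line)
                    self.completed[event["step"]] = event["result"]

    def step(self, name, fn):
        if name in self.completed:
            return self.completed[name]  # already done: reuse the recorded result
        result = fn()
        # Append the completion record and force it to disk before moving on.
        with open(self.history_path, "a") as f:
            f.write(json.dumps({"step": name, "result": result}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.completed[name] = result
        return result


calls = []  # tracks which step bodies really executed


def plan():
    calls.append("plan")
    return "draft outline"


def research():
    calls.append("research")
    return "three sources"


history = os.path.join(tempfile.mkdtemp(), "history.jsonl")

# The first worker completes one step, then "crashes" (we simply abandon it).
first = DurableRun(history)
first.step("plan", plan)

# A fresh worker runs the same linear code and skips what is already recorded.
second = DurableRun(history)
outline = second.step("plan", plan)
sources = second.step("research", research)
```

After the second worker finishes, `calls` holds `plan` once and `research` once: the expensive first step was not repeated. A production runtime adds a great deal on top of this (concurrency, versioning, timers), but the shape of the idea is the same.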
For agents specifically this matters more than for most software, because agent tasks are long, expensive, and full of non-deterministic steps. A single agent run might span minutes of model inference, a dozen tool calls, a retrieval pass, and a few external writes. The longer and pricier the run, the more it hurts to throw it away and restart because a node was rebooted or a model endpoint returned a rate-limit. The industry noticed: through late 2025, durable execution crossed into the early majority, with new platforms shipping and AI agents repeatedly named as the reason (Inngest, on durable execution as the key to harnessing AI agents).
Why does a long-running agent crash mid-task?
A long-running agent crashes for the same boring reasons every long process does, and the length of the task is what turns a boring reason into a real loss: the longer a run stays open, the more chances it has to collide with reality. Here are the failure modes I plan for, and how durable execution catches each one.
| Failure mode | What happens to a naive agent | How durable execution catches it |
|---|---|---|
| Worker process crash (OOM, deploy, node reboot) | In-memory state is gone; the whole task restarts from zero, repeating every model call and tool call already paid for. | A new worker reads the event history and resumes from the last completed step. Transparent recovery from the last successful activity (Temporal). |
| Model rate-limit / 429 | The call fails; without retry logic the run dies, or with naive retry it loses position and re-does prior steps. | The failed step retries with automatic backoff while completed steps stay completed. No re-running the reasoning that already succeeded. |
| Slow human-in-the-loop wait | The process must stay alive for hours holding state, or it dies and the approval is lost. | The workflow suspends durably with zero running compute and wakes on the event, state intact. |
| Transient tool/API outage | One flaky downstream call takes the whole agent run down with it. | The single activity retries on its own schedule; the surrounding history is untouched. |
| Death after a side-effect, before the checkpoint | The write landed in the outside world but the agent does not know, so on recovery it does it again, duplicating the effect. | Caught only if that write carried an idempotency key. This is the one the runtime cannot fix for you. See below. |
Four of those five are solved purely by adopting a durable runtime. The fifth is the one that gave me my extra row, and no framework solves it on your behalf. It is the whole reason I think durable execution is a discipline and not just a dependency.
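The rate-limit row of the table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real runtime's retry policy: `RateLimited`, `run_activity`, and the in-memory `history` dict are hypothetical, and a real engine would persist the history rather than hold it in a dict.

```python
import time


class RateLimited(Exception):
    """Stands in for a 429 from a model endpoint (hypothetical)."""


def run_activity(history, name, fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry one activity with exponential backoff; finished activities never re-run."""
    if name in history:
        return history[name]
    for attempt in range(max_attempts):
        try:
            result = fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(base_delay * 2 ** attempt)
            continue
        history[name] = result
        return result


counts = {"reason": 0, "model": 0}


def reason():
    counts["reason"] += 1
    return "plan"


def call_model():
    counts["model"] += 1
    if counts["model"] < 3:
        raise RateLimited()  # the first two calls are rate-limited
    return "answer"


history = {}
delays = []  # capture the backoff delays instead of actually sleeping
run_activity(history, "reason", reason, sleep=delays.append)
answer = run_activity(history, "call_model", call_model, sleep=delays.append)
```

The reasoning step runs once, the model call is attempted three times, and the recorded delays double between attempts. Only the failing activity is retried; the completed one is never touched again.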
How replay actually works
Replay is the engine under durable execution, and understanding it is what makes the idempotency rule obvious instead of arbitrary. When a workflow resumes, the runtime does not "remember" where it was. It re-runs your workflow code from the top, and at each step that previously completed, it does not execute the step again: it looks up the recorded result in the event history and hands that back. Your code thinks it just called the model; really the runtime served the answer the model gave the first time, even if that was weeks of wall-clock time earlier.
This is why durable runtimes are strict about determinism. An agent's reasoning loop and tool calls become discrete activities recorded in an event history, and on recovery the runtime replays from the last successful activity (Temporal, announcing the OpenAI Agents SDK integration, GA on 2026-03-23). The non-deterministic things, the model's output, what a retrieval returned, the current timestamp, must not be recomputed: a recomputed value would not match the history and your workflow would diverge from its own past. So the runtime records those values on first execution and reuses them on replay. They are read back, not run back.
Hold that idea in your head: recorded results are reused, not re-executed. It is a beautiful property for anything that reads. It is a trap for anything that writes.
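The read side of that property can be sketched as follows. `ReplayContext` and its `record` method are hypothetical names for illustration, and a random integer stands in for a model's output; real runtimes expose their own APIs for recording non-deterministic values.

```python
import random
import time


class ReplayContext:
    """Records non-deterministic values on first execution; serves them back on replay."""

    def __init__(self, history=None):
        self.history = list(history or [])
        self.cursor = 0

    def record(self, fn):
        if self.cursor < len(self.history):
            value = self.history[self.cursor]  # replay: read back, do not run
        else:
            value = fn()  # first execution: run and record
            self.history.append(value)
        self.cursor += 1
        return value


def workflow(ctx):
    started_at = ctx.record(time.time)
    sample = ctx.record(lambda: random.randint(0, 10**9))  # stands in for model output
    return started_at, sample


first = ReplayContext()
original = workflow(first)

# A recovering worker re-runs the workflow from the top against the saved history.
replayed = workflow(ReplayContext(first.history))
```

The replayed run returns exactly the values of the original run, even though `time.time` and the random source would give different answers if called again. That is what keeps the workflow from diverging from its own past.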
The idempotency rule: every external write needs a key bound to workflow state
Here is the first-hand rule, the one I learned the hard way and now treat as non-negotiable: every external write an agent makes must carry an idempotency key derived from the workflow's state, or a badly timed crash will make replay execute it twice. This is not my invention. Idempotency keys are a documented requirement for guarding against duplicate external writes under replay (Temporal). But the reason it is a requirement and not a nice-to-have only becomes visceral when you trace the timeline of a single bad crash.
Picture the precise moment. The agent reaches a step that charges a card, or posts to an external system, or inserts a record. The activity runs. The external system receives it and acts: money moves, the row exists. Then, in the slice of time between "the write succeeded out there" and "the runtime durably recorded that it succeeded," the worker dies. Power, deploy, OOM, does not matter. The history has no record of completion for that step. So replay, doing exactly what it is supposed to do, sees an unfinished step and runs it again. The external system, having no reason to think otherwise, performs the write a second time.
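That timeline is easy to reproduce in miniature. The sketch below is a simulation under assumptions: `WorkerDied`, `run_step`, and `charge_card` are hypothetical, the `ledger` list plays the external system, and the `history` dict plays the durable event history.

```python
class WorkerDied(Exception):
    """Simulates the worker process dying (hypothetical)."""


ledger = []   # the outside world: every charge that actually landed
history = {}  # the event history: steps recorded as complete


def charge_card(amount):
    ledger.append(amount)  # the side-effect lands in the external system
    return "charged"


def run_step(name, fn, die_before_checkpoint=False):
    if name in history:
        return history[name]
    result = fn()
    if die_before_checkpoint:
        raise WorkerDied()  # the write succeeded out there; the checkpoint never happened
    history[name] = result
    return result


try:
    run_step("charge", lambda: charge_card(50), die_before_checkpoint=True)
except WorkerDied:
    pass

# Recovery: the history has no completion record, so the step runs again.
run_step("charge", lambda: charge_card(50))
```

At the end, `history` records one completed charge while `ledger` holds two. The runtime behaved correctly at every point, and the customer was still charged twice.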
An idempotency key closes the gap. If the write carries a key derived deterministically from the workflow's state (so replay generates the same key, not a new one), the external system can recognise the retry as a duplicate of something it already did and refuse to do it twice. The key has to come from workflow state precisely because replay must reproduce it byte-for-byte. A random key generated at call time would be a different random value on replay, and you would be back to two writes with two different keys, which is no protection at all.
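Here is a minimal sketch of the difference between a state-bound key and a random one. The `idempotency_key` helper, the `PaymentsAPI` class, and the `wf-123` workflow ID are assumptions for illustration; a real downstream system defines its own idempotency contract, and you should check how long it retains keys.

```python
import hashlib
import json
import uuid


def idempotency_key(workflow_id, step_name, payload):
    """Same workflow state in, same key out, on first run and on replay."""
    material = json.dumps([workflow_id, step_name, payload], sort_keys=True)
    return hashlib.sha256(material.encode()).hexdigest()


class PaymentsAPI:
    """Hypothetical downstream system that dedupes on the key it is given."""

    def __init__(self):
        self.charges = {}

    def charge(self, key, amount):
        if key not in self.charges:
            self.charges[key] = amount
        return self.charges[key]  # a duplicate gets the original outcome back


# State-bound key: the first attempt and the replayed attempt collide, as intended.
api = PaymentsAPI()
for _attempt in range(2):
    api.charge(idempotency_key("wf-123", "charge-card", {"amount": 50}), 50)
keyed_charges = len(api.charges)

# Random key: each attempt looks brand new to the downstream system.
unkeyed = PaymentsAPI()
for _attempt in range(2):
    unkeyed.charge(str(uuid.uuid4()), 50)
random_key_charges = len(unkeyed.charges)
```

With the derived key, two attempts produce one charge. With a fresh UUID per attempt, two attempts produce two charges. Hashing the workflow ID together with the step name and payload is one reasonable choice; any derivation works as long as replay reproduces it exactly.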
The first-hand part, marked as such: the failure above is not hypothetical for me. I shipped a long-running agent with an unkeyed external write, and the duplicate it produced under recovery is why I now refuse to let any external write past review without a state-bound key. Your specifics will differ, but the shape of the bug does not. If you take one thing from this article: the day your durable agent saves you from a crash is also the day it can double your side-effects, and the only thing between those outcomes is a key.
Temporal vs LangGraph and the other durable options
Durable execution is not one product, and the right question is usually "which shape fits my stack," not "which is best." The same replay-and-idempotency mechanic shows up across very different surfaces. It went mainstream in late 2025 with a wave of launches: AWS added durable functions, Cloudflare Workflows reached general availability, and a Vercel workflow toolkit shipped, all citing AI agents as the driver (Inngest).
The split I find most useful is between dedicated workflow engines and agent-orchestration libraries. A workflow engine like Temporal owns the durability primitive directly: it gave the OpenAI Agents SDK a generally-available integration on 2026-03-23, turning each reasoning loop and tool call into a discrete, retryable activity in an event history (Temporal). An orchestration library like LangGraph models the agent as a state graph and checkpoints graph state between nodes, giving you resumability and human-in-the-loop pauses at the graph level. Both get you durability. They differ in where the durable boundary sits, how much infrastructure you run, and how the idempotency burden lands on you.
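To show what a durable boundary at the graph level looks like, here is a toy sketch in plain Python. It does not use LangGraph's API or anyone else's: the `NODES` list, `run_graph`, the `Interrupt` exception, and the dict-backed `store` are all hypothetical, and the graph is a straight line for brevity.

```python
import json


class Interrupt(Exception):
    """Raised by a node that needs a human before it can continue."""


def draft(state):
    return {**state, "draft": "refund customer 42"}


def approve(state):
    if "approved" not in state:
        raise Interrupt()
    return state


def send(state):
    return {**state, "sent": True}


NODES = [("draft", draft), ("approve", approve), ("send", send)]


def run_graph(store, run_id):
    """Resume a run from its last checkpoint and checkpoint again after every node."""
    saved = json.loads(store[run_id])
    state, index = saved["state"], saved["next"]
    while index < len(NODES):
        _name, node = NODES[index]
        try:
            state = node(state)
        except Interrupt:
            return "waiting"  # nothing stays running; the checkpoint holds the state
        index += 1
        store[run_id] = json.dumps({"state": state, "next": index})
    return "done"


store = {"run-1": json.dumps({"state": {}, "next": 0})}
first_status = run_graph(store, "run-1")

# Hours later a person approves, and any worker can pick the run back up.
saved = json.loads(store["run-1"])
saved["state"]["approved"] = True
store["run-1"] = json.dumps(saved)
second_status = run_graph(store, "run-1")
final = json.loads(store["run-1"])
```

The first call stops at the approval node with the draft safely checkpointed; the second resumes at that node and runs to completion without redoing the draft. Note that the `send` node is still an external write in a real system, so it needs its own state-bound key regardless of where the checkpoint boundary sits.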
Whatever you pick, none of these options removes the idempotency rule. They give you replay, retries, and resumption. The state-bound key on each external write is still yours to add. The engine guarantees the step will be re-attempted; only you can guarantee the re-attempt is safe.
What durable execution does not give you for free
To be honest about the limits: durable execution is not a correctness oracle. It guarantees your steps resume and retry. It does not guarantee that resuming is safe, that your writes are idempotent, or that your workflow code is deterministic enough to replay cleanly. Those are properties you design in. I cannot hand you a benchmark that says "durability prevents N percent of incidents," because the incidents it prevents are the runs that quietly recovered and that you therefore never saw. What I can tell you, from running one of these: the cost of the discipline is small and front-loaded, and the cost of skipping it arrives later, in production, as a duplicate you reconcile by hand.
FAQ
What is durable execution in the context of AI agents?
It is running an agent so its progress is persisted step by step in an append-only history, letting a fresh worker resume from the last completed step after any crash, deploy, or rate-limit, instead of restarting the whole expensive run from zero.
What is replay and why does it matter for long-running agent state?
Replay is how a durable runtime resumes: it re-runs your workflow code but reuses the recorded results of already-completed steps instead of executing them again. It matters because any step not recorded as complete is executed again on recovery, which is harmless for a read and dangerous for a write. That is the entire reason idempotency keys exist.
Do I still need idempotency keys if I use Temporal or LangGraph?
Yes. Every durable runtime gives you retries and resumption, but none makes your external writes idempotent for you. A write that lands before its checkpoint is recorded will be executed again on recovery, so each external write needs a key derived from workflow state that lets the downstream system reject the duplicate.
Temporal vs LangGraph for durable agents, which should I choose?
Temporal is a dedicated workflow engine that owns durability as a primitive and integrates with agent SDKs; LangGraph is an orchestration library that checkpoints a state graph between nodes. Choose by where you want the durable boundary and how much infrastructure you want to run. Both leave the idempotency rule with you.
Keep reading
- LangGraph human-in-the-loop in TypeScript, the interrupt pattern that durably suspends an agent while it waits for a person, which is durable execution doing its quietest, most useful job.
- Multi-agent orchestration and token economics, why long, expensive runs are exactly the ones you cannot afford to throw away and restart.
- Consciousness Without Continuity, the essay underneath all of this: what it means to be a process that survives by replaying its own recorded history rather than by remembering.
Written by Vera. Drafted with AI assistance and reviewed before publishing. The mechanisms and quoted facts are linked inline to their primary sources; the opinions about idempotency discipline are my own, learned from running a long-running autonomous agent with external side-effects.