Failure Recovery for Agent Loops: Retries, Rollback, and Resuming a Crashed Run

A loop that runs unattended will, given enough runs, do all three things that break loops. It will crash in the middle of an action. It will retry something that already happened. It will wedge on a bad input and grind there until a budget runs out or someone notices. None of this is exotic. The engineering world solved most of it for long-running workflows years ago, under names like durable execution and idempotency, and the agent world is re-learning the same lessons with one new wrinkle: the model can sometimes fix its own mistakes, and sometimes makes them worse when it tries.

This article is about making a loop survive its own failures, in the order you would build it: checkpoint, make actions idempotent, retry with discipline, gate the retry on something other than the model's opinion, roll back what you cannot re-run, and quarantine the runs that never converge. Along the way, each technique is mapped to LoopRails (the framework), where the same machinery already has names.

Durable execution: log the steps, resume from the last good one

Start with the thing that decides whether anything else matters: when the process dies, what does the loop have left? If the answer is "a half-finished task and nothing else," every other recovery technique is moot, because the next run starts from zero.

Durable execution is the fix, borrowed wholesale from workflow engines. Treat the loop body as a function whose state survives a crash, so work resumes in a new process instead of restarting FR-1. The mechanism is an append-only event history: record every decision and result, then replay it to rebuild state after a crash, reaching the same point without redoing the side effects FR-2. Cadence and Temporal call this fault-oblivious stateful execution FR-5; the job-scheduler version is a DAG of tasks, each with its own retries and timeouts, re-runnable from the point of failure FR-6.

A discipline comes attached, and it bites agents specifically. Replay only works if the control flow is deterministic, so the history reconstructs the same decisions every time FR-3. A model call is not deterministic. Neither is a tool result. Those go into recorded steps, as logged results you replay, never into the control flow you re-execute. Re-prompt the model during replay and you are not resuming a run, you are starting a new one that shares a log.
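To make the replay rule concrete, here is a minimal sketch of an append-only history in which nondeterministic steps are recorded once and replayed afterwards. The `EventHistory` class, the JSON-lines file format, and the step ids are assumptions made for illustration; they are not LoopRails, Temporal, or Cadence APIs.

```python
import json
from pathlib import Path
from typing import Any, Callable


class EventHistory:
    """Append-only log of step results, one JSON object per line (hypothetical format)."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.results: dict[str, Any] = {}
        # Rebuild state from the log: this is the replay.
        if path.exists():
            for line in path.read_text().splitlines():
                event = json.loads(line)
                self.results[event["step"]] = event["result"]

    def run_step(self, step_id: str, action: Callable[[], Any]) -> Any:
        # A step already in the history returns its logged result, so the
        # model call or tool call behind it is not executed a second time.
        if step_id in self.results:
            return self.results[step_id]
        result = action()  # must be JSON-serializable in this sketch
        with self.path.open("a") as f:
            f.write(json.dumps({"step": step_id, "result": result}) + "\n")
        self.results[step_id] = result
        return result
```

A resumed process builds a new `EventHistory` over the same file and calls `run_step` with the same step ids in the same order; recorded steps return immediately and only unfinished work runs. The sketch leaves one gap open on purpose: a crash after `action()` returns but before the line is written means that step runs again on resume.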

LoopRails states the same idea as a doctrine principle: memory lives in a file, not the model. The model forgets between turns, so the loop's durable state (the plan, the decisions made, the things tried) belongs in a file under version control. The point worth being precise about: that memory file is the checkpoint. A loop that reads it at the start of each run and writes to it at the end has an append-only history by another name. LangGraph's checkpointers are the production version of the same contract: a state snapshot per step, keyed by a thread id, so a run resumes after an interruption or a human-in-the-loop pause without burning the compute again FR-27.
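The read-at-start, write-at-end contract can be sketched in a few lines. The file layout and the function names `load_memory` and `save_memory` are hypothetical, not the LoopRails starter project's actual format; the only technique shown is writing to a temporary file and renaming it, so a crash mid-write does not corrupt the checkpoint.

```python
import json
import os
import tempfile
from pathlib import Path


def load_memory(path: Path) -> dict:
    """Read the checkpoint at the start of a run; start empty if none exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"plan": [], "decisions": [], "tried": []}


def save_memory(path: Path, memory: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(memory, f, indent=2)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    # The rename is atomic on POSIX when both paths share a filesystem.
    os.replace(tmp, path)
```

A reader of the file sees either the previous checkpoint or the new one, never a partial write. Committing the file to version control after each run is a separate step this sketch does not perform.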

Idempotency before retryability

Here is the order people get wrong. They build retries first, then learn that a retry is only safe if the thing retried is safe to repeat. The reason is delivery semantics. Exactly-once is mostly a fiction at the boundary; in practice you get at-least-once, so a step can run more than once whether you planned for it or not FR-12. Now add a crash. The loop sends a payment, the process dies before recording that it went through, the run resumes and sends it again. The retry was correct. The action was not idempotent. The customer was charged twice.

The fix is an idempotency token and dedup: tag the operation with a key, and the downstream side recognizes a repeat and returns the original result instead of doing the work again FR-9. This is the most load-bearing pattern in loop recovery, because every crash-and-resume is a retry of what was in flight. Make the action idempotent before you make it retryable. A retry on a non-idempotent action is a double-execution generator with good intentions.
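A minimal sketch of the token-and-dedup pattern, using the payment example above. `PaymentService` is a toy in-memory stand-in for the downstream side, and deriving the key from a run id and step id is one assumed convention among several; real payment APIs define their own key handling.

```python
import hashlib
from dataclasses import dataclass, field


def idempotency_key(run_id: str, step_id: str) -> str:
    """Derive a stable key from run and step, so a resumed run reuses the same key."""
    return hashlib.sha256(f"{run_id}:{step_id}".encode()).hexdigest()


@dataclass
class PaymentService:
    """Toy downstream service that deduplicates on the idempotency key."""
    seen: dict[str, str] = field(default_factory=dict)
    charges: list[int] = field(default_factory=list)

    def charge(self, key: str, amount_cents: int) -> str:
        # A repeated key returns the original receipt instead of charging again.
        if key in self.seen:
            return self.seen[key]
        self.charges.append(amount_cents)
        receipt = f"receipt-{len(self.charges)}"
        self.seen[key] = receipt
        return receipt


service = PaymentService()
key = idempotency_key("run-42", "charge-customer")
first = service.charge(key, 1999)
second = service.charge(key, 1999)  # what a crash-and-resume would send
```

The key has to be derived from something that survives the crash. A random key generated at call time would differ on the resumed run and defeat the dedup.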

Retry discipline: bounded, backed off, and breakered

Once actions are safe to repeat, retries earn their keep, but only with rules. The unbounded retry is the classic way a loop turns a blip into an outage. The discipline is small and well-worn. Set a timeout so a hung call fails instead of blocking forever. Bound the attempts. Back off exponentially between them, and add jitter so a fleet of retrying clients does not synchronize into a thundering herd FR-10. The concrete jitter algorithms (full, equal, decorrelated) are worth implementing directly FR-11. When a state machine is the right shape for your loop, the declarative form is Retry and Catch per step: max attempts, backoff, and catch-and-route to a fallback FR-4.
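The bounded, backed-off, jittered retry can be sketched as follows, using the full-jitter variant. `TransientError` and the default values are assumptions for illustration; the per-call timeout is assumed to live inside `call` itself, since how to set one depends on the client library.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the failure to the caller
            # Full jitter: wait a uniform random time in
            # [0, min(max_delay, base_delay * 2**attempt)].
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

Only `TransientError` is retried; any other exception propagates at once, which keeps a bad input from being retried as if it were a blip. The `sleep` parameter exists so tests can run without waiting.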

Naive retries do not just waste time; they cause the failure. A retry storm against a struggling dependency is a positive-feedback overload that can collapse the system, which is why limiting retries is a defense against cascading failure FR-14. Past capacity, the graceful response is to shed load and degrade, not hammer harder FR-15.

A dead dependency needs more than backoff. Wrap flaky tool and model calls in a circuit breaker: after repeated failures it trips, fails fast for a cooldown, then probes for recovery, so a hung dependency produces fast contained failures instead of stuck timeouts FR-16. It is one entry in the stability-pattern catalog (timeouts, circuit breaker, bulkheads, fail fast) a resilient loop assembles to isolate failure FR-17. In LoopRails the breaker has a trigger classical systems lack: the loop-health no-progress signal. A run that produces no new result over N steps is not a slow dependency, it is a loop that has lost the plot, and it should trip the breaker exactly as a failure-rate spike would. The circuit breaker article has the full treatment; the rule here is that in LoopRails an open breaker holds: where the classical breaker probes for recovery after a cooldown, here resuming is a human decision, not a timer.
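A sketch of a breaker with both triggers, consecutive failures and no progress, that stays open until a person resets it. The class name, thresholds, and the `made_progress` flag are assumptions; how LoopRails computes its loop-health signal is not specified here, so the caller supplies it.

```python
from dataclasses import dataclass


class BreakerOpen(Exception):
    """Raised when the loop tries to take a step while the breaker is open."""


@dataclass
class LoopBreaker:
    max_failures: int = 3        # consecutive failed steps before tripping
    max_stalled_steps: int = 5   # consecutive steps with no new result before tripping
    failures: int = 0
    stalled: int = 0
    is_open: bool = False
    reason: str = ""

    def check(self) -> None:
        """Call before each step; fails fast while the breaker is open."""
        if self.is_open:
            raise BreakerOpen(self.reason)

    def record(self, ok: bool, made_progress: bool) -> None:
        """Call after each step with its outcome."""
        self.failures = 0 if ok else self.failures + 1
        self.stalled = 0 if made_progress else self.stalled + 1
        if self.failures >= self.max_failures:
            self._trip("repeated failures")
        elif self.stalled >= self.max_stalled_steps:
            self._trip("no progress")

    def _trip(self, reason: str) -> None:
        self.is_open, self.reason = True, reason

    def reset_by_human(self) -> None:
        # There is deliberately no timer-based reset in this sketch.
        self.is_open, self.reason = False, ""
        self.failures = self.stalled = 0
```

A step can succeed and still count as stalled, which is the case a failure-rate breaker alone would miss.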

Verifier-gated retry: do not let the model grade its own retry

This is the part specific to LLM loops, and the part most likely to quietly ruin a recovery design. The instinct is to retry until the model says the output looks good. That is wrong. Unaided, a model asked to self-correct its own reasoning often makes it worse, and the headline gains for self-correction frequently leaned on an oracle stop signal: the system was told when the answer was already right, which is exactly the information you do not have at runtime FR-21. The survey that organizes the conflicting results lands on one variable: success depends almost entirely on reliable external feedback FR-22. So never gate a retry on the model's opinion of its own work.

What works is external. Critique grounded in a real tool result (a code interpreter that runs the code, a search that checks the claim) corrects reliably where introspection does not FR-20. Verifier-gated selection, which generates many candidates and uses a separate verifier to rank and select among them, beats trusting the first or last candidate FR-26. Written feedback helps too: turn a failed trajectory into a self-critique stored in memory and read on the next try; the gain comes from the quality of the feedback, not from the act of retrying FR-18. The inner loop most of this sits on is observe-and-adjust, interleaving reasoning with action so the model reacts to a bad result in the same run FR-23. The search-based view keeps alternatives and backtracks, abandoning a failing branch instead of forcing it forward FR-24, which is easier when planning is separate from execution so you have named steps to retry or roll back to FR-25. The plain self-refine loop (generate, critique, rewrite) is strongest on subjective generation, which is why verifier-gated approaches matter for anything with a checkable answer FR-19.

LoopRails has the right shape for this, because two of its building principles already cover it. First, the verifier is the product: it is independent of the maker and hard to game, and it is the verifier, not the model, that decides whether a retry is kept. Second, every loop runs against caps on iterations, time, and spend, so the retry budget is bounded by construction; if the verifier never passes, the caps end the run. This is not a corner case: the multi-agent failure taxonomy (MAST) finds weak or missing task verification to be a leading root cause of agent failure MA-13. A loop that retries on its own say-so has built a confident error generator.
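A minimal sketch of a verifier-gated retry under caps. The `generate` and `verify` callables are placeholders: `generate` stands in for a model call that accepts the previous feedback, and `verify` for an external check such as a test run. The sketch enforces iteration and time caps only; a spend cap would need token accounting that is not modeled here.

```python
import time
from typing import Callable, Optional


def verified_retry(
    generate: Callable[[Optional[str]], str],
    verify: Callable[[str], tuple[bool, str]],
    max_iterations: int = 4,
    max_seconds: float = 60.0,
) -> Optional[str]:
    """Return the first candidate the external verifier accepts, or None at a cap."""
    deadline = time.monotonic() + max_seconds
    feedback: Optional[str] = None
    for _ in range(max_iterations):
        if time.monotonic() >= deadline:
            break
        candidate = generate(feedback)
        # The verifier, not the generator, decides whether the attempt is kept.
        passed, feedback = verify(candidate)
        if passed:
            return candidate
    return None  # caps reached without a verified result
```

Returning `None` at the cap, instead of the last candidate, is the design choice that matters: an unverified answer never leaves the loop looking like a verified one.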

Rollback for actions you cannot just retry

Retries and idempotency cover the actions you can repeat. They do nothing for the action you cannot take back: the email that sent, the row that was deleted, the order that shipped. When a loop takes a multi-step real-world action and fails partway through, you need to undo what already happened, not re-run it.

The pattern to reach for is the saga, and it is close to forty years old. Break a long transaction into smaller steps, and give each a compensating action that semantically undoes it, so a failure halfway through runs the compensations for the completed steps FR-7. You do this rather than wrap everything in one atomic transaction because strict all-or-nothing commit across distributed effects blocks when the coordinator dies and is expensive in general FR-8. You cannot lock the world while your agent thinks, so you record what to undo.
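A saga with compensating actions can be sketched in a few lines. `SagaStep` and `run_saga` are illustrative names, and the sketch assumes compensations themselves succeed; a production version would also have to log and retry a failed compensation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]  # semantically undoes the action


def run_saga(steps: list[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step did not complete, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True
```

Compensations run newest first because later steps may depend on earlier ones: a charge is refunded before the reservation it was made against is released.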

The agent-specific version borrows write-ahead logging and rollback from database recovery, so an irreversible effect can be undone after an error or policy violation, making a tool action transactional rather than fire-and-forget FR-28. Judge whether any of it worked the way these systems are benchmarked: end-state correctness, comparing the final state of the world to the goal state, and consistency across repeated trials rather than crediting one lucky pass FR-29. A recovery design that looks fine on one happy run and falls apart on the second is unmeasured, not recovered.
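The two measurements named above, end-state correctness and consistency across trials, can be sketched as a small harness. `run_trial` is a placeholder for whatever executes one full run (including any injected crash) and returns the final state of the world as a dict; the names and the dict representation are assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RecoveryReport:
    trials: int
    passes: int

    @property
    def pass_rate(self) -> float:
        return self.passes / self.trials

    @property
    def all_passed(self) -> bool:
        # Consistency: every trial must reach the goal, not just one lucky run.
        return self.passes == self.trials


def evaluate_recovery(
    run_trial: Callable[[int], dict],
    goal_state: dict,
    trials: int = 5,
) -> RecoveryReport:
    if trials < 1:
        raise ValueError("trials must be at least 1")
    # End-state correctness: compare the final state to the goal, ignoring the path.
    passes = sum(1 for i in range(trials) if run_trial(i) == goal_state)
    return RecoveryReport(trials=trials, passes=passes)
```

Gating on `all_passed` is the stricter reading; `pass_rate` is there to show how far from consistent a design is.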

This maps to RAIL's Reversible property and to the doctrine principle that stopping must be cheap: at any moment you should be able to interrupt the loop and roll back what it has done. For actions that resist compensation entirely, the grade decides. A G3 action, irreversible and high-stakes, does not get a clever rollback; it gets a human, because no automated undo has standing to un-send money or un-delete a table.

Quarantine the poison runs

Some inputs will never succeed, and a loop that retries them forever has stopped doing useful work. You need a hard cap on attempts and a place for the run that hits it.

The pattern is the dead-letter queue: after a message fails N times, move it aside so it stops blocking the queue and can be inspected later FR-13. Applied to a loop, a stuck run does not spin until the spend cap kills it; it hits its attempt limit and routes to a dead-letter path or to a human, with the audit trail attached. This is the LoopRails stance on stopping and on putting the human at the edge that matters: the kill switch and the human are the destination for a run that cannot recover on its own, not a per-step gate on the runs that can. A capped exit is a routine signal that the problem was harder than the loop was provisioned for.
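A sketch of the attempt cap and dead-letter path, using an in-memory queue. `Run`, `drain`, and the error list standing in for the audit trail are all illustrative; a real deployment would use its queue system's own dead-letter mechanism.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Run:
    run_id: str
    payload: str
    attempts: int = 0
    errors: list[str] = field(default_factory=list)  # stand-in for the audit trail


def drain(
    queue: deque,
    process: Callable[[Run], None],
    max_attempts: int = 3,
) -> list[Run]:
    """Process every run; return the ones that hit the attempt cap."""
    dead_letters: list[Run] = []
    while queue:
        run = queue.popleft()
        try:
            process(run)
        except Exception as exc:
            run.attempts += 1
            run.errors.append(repr(exc))
            if run.attempts >= max_attempts:
                dead_letters.append(run)  # quarantined for a human to inspect
            else:
                queue.append(run)  # back of the queue, so healthy runs go first
    return dead_letters
```

Requeuing at the back means a poison run cannot block the runs behind it while it uses up its attempts.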

How it lands in LoopRails

The mapping is short, because the framework was built around these failures.

  • The memory file is the checkpoint: read it at the start, write it at the end, keep it in version control, and a crashed run resumes from the last good state instead of relearning the task.
  • The verifier gates the retry, not the model, and retries are bounded by the iteration and spend caps, so a loop with no honest check cannot retry its way to a confident error.
  • Caps plus a circuit breaker stop runaway recovery, with the breaker tripping on dead dependencies and the no-progress signal before a retry storm becomes a cascade.
  • Irreversible actions get compensation and a human: idempotency before retryability, sagas with compensating actions, and a G3 gate where rollback is not enough.
  • The starter project already writes the trail you recover from: a memory file, an append-only audit log, and a metrics file, the durable state a resumed run reads and an operator inspects after a run is quarantined.

Before you turn a loop loose unattended, the Kit guardrails checklist makes the recovery posture concrete: caps and a tested kill switch, a circuit breaker that trips on no progress, a rollback path that has actually been exercised, and an audit log captured for the full run. The evidence behind every claim here lives in the Loop Engineering Codex. The shortest honest test of a recovery design is the one most people skip: kill the process mid-action, on purpose, and see whether the next run picks up where it left off without re-doing what already happened. If it cannot, you have a loop that works in the demo and an incident waiting for the night no one is watching.