Error Recovery Patterns in AI Agents

Build resilient AI agents with checkpoint/resume, automatic retry, rollback, self-healing, and supervision patterns in Node.js.

Overview

AI agents fail in ways that traditional software does not. When an LLM hallucinates a tool parameter, when a multi-step workflow dies on step seven of twelve, or when a model returns valid JSON that makes no logical sense, you need recovery patterns purpose-built for non-deterministic systems. This article covers battle-tested error recovery strategies for AI agents in Node.js, from checkpoint/resume and rollback to self-healing loops and supervision trees, with a complete working framework you can adapt to production workloads.

Prerequisites

  • Strong working knowledge of Node.js and asynchronous programming
  • Familiarity with AI agent concepts (tool calling, multi-step workflows, prompt chaining)
  • Experience with the OpenAI or Anthropic API (code examples use OpenAI)
  • Understanding of basic error handling patterns (try/catch, retries, circuit breakers)
  • Node.js v18+ installed

Why Agents Fail Differently Than Traditional Software

Traditional software is deterministic. Given the same input, you get the same output. When a REST API call fails, you know exactly what went wrong because HTTP gives you a status code and the server returns a structured error. You retry with the same parameters, and if the transient issue has cleared, it works.

AI agents break every one of those assumptions.

Non-determinism is the default. The same prompt can produce different outputs on every call. A retry might succeed not because the underlying issue resolved, but because the model happened to generate a different response. Conversely, a retry might fail in a completely different way than the original attempt. Temperature, model load, token sampling — all of these introduce variance that makes traditional retry logic insufficient.

Multi-step workflows create cascading failures. An agent that plans a sequence of ten tool calls will accumulate state across those calls. A failure on step six means you have five steps of completed work that may need to be rolled back, preserved, or resumed. The failure itself may be caused by drift in the agent's reasoning that started three steps earlier, making the root cause invisible at the point of failure.

Tool failures are ambiguous. When an agent calls a tool and gets an error, the agent has to interpret that error through the LLM. The model might misinterpret a 404 as "the resource doesn't exist" when it actually means "I constructed the URL wrong." It might retry the same broken call indefinitely, or it might abandon a valid approach after a single transient failure.

Logic errors look like success. The most dangerous agent failures return 200 OK. The model generates syntactically valid output that is semantically wrong — it books the wrong flight, deletes the wrong record, or summarizes a document by fabricating facts. These failures pass every technical health check and only surface when a human reviews the output.

These characteristics demand a fundamentally different approach to error recovery.

Categorizing Agent Failures

Before building recovery mechanisms, you need a taxonomy of what can go wrong. I categorize agent failures into five types:

1. LLM Errors

These are failures in the model layer itself: rate limits, context window overflow, malformed responses, refusals, and hallucinated tool calls.

var LLM_ERROR_TYPES = {
  RATE_LIMIT: "rate_limit",
  CONTEXT_OVERFLOW: "context_overflow",
  MALFORMED_RESPONSE: "malformed_response",
  REFUSAL: "refusal",
  HALLUCINATED_TOOL: "hallucinated_tool",
  TIMEOUT: "timeout"
};

2. Tool Errors

Failures in external tools the agent invokes: API errors, permission denied, resource not found, invalid parameters, and timeouts from slow downstream services.

3. Logic Errors

The model's reasoning goes off-track: incorrect plan generation, wrong tool selection, misinterpretation of tool output, circular reasoning loops, and goal drift across multi-step workflows.

4. State Errors

Corruption or inconsistency in the agent's accumulated state: partial writes, stale data references, conflicting actions, and orphaned resources created by incomplete workflows.

5. Resource Errors

Infrastructure-level failures: out of memory from large context windows, disk full when persisting checkpoints, network partitions between the agent and its tools, and process crashes from unhandled promise rejections.
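
To make this taxonomy actionable, it helps to classify raw errors into these categories up front so the recovery layer can choose a strategy. The sketch below is illustrative: the status codes, message patterns, and error properties it checks (toolName, validationFailed) are assumptions you would replace with whatever your LLM provider and tools actually surface.

var FAILURE_CATEGORIES = {
  LLM: "llm_error",
  TOOL: "tool_error",
  LOGIC: "logic_error",
  STATE: "state_error",
  RESOURCE: "resource_error"
};

// Illustrative classifier. The checks below are assumptions, not a
// definitive mapping; adapt them to your provider's error shapes.
function classifyFailure(error) {
  if (error.status === 429 || /rate limit/i.test(error.message)) {
    return { category: FAILURE_CATEGORIES.LLM, type: LLM_ERROR_TYPES.RATE_LIMIT };
  }
  if (/context length|maximum context/i.test(error.message)) {
    return { category: FAILURE_CATEGORIES.LLM, type: LLM_ERROR_TYPES.CONTEXT_OVERFLOW };
  }
  if (error.toolName) {
    return { category: FAILURE_CATEGORIES.TOOL, tool: error.toolName };
  }
  if (error.code === "ENOMEM" || error.code === "ENOSPC") {
    return { category: FAILURE_CATEGORIES.RESOURCE };
  }
  if (error.validationFailed) {
    return { category: FAILURE_CATEGORIES.LOGIC };
  }
  return { category: "unknown" };
}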

Implementing Retry with Context Adjustment

Naive retry — calling the same prompt again — is sometimes enough for transient LLM errors. But for anything beyond rate limits, you need to adjust the context before retrying. The idea is simple: if the model failed with a given prompt, rephrase or augment the prompt to give it a better chance on the next attempt.

function RetryWithAdjustment(options) {
  this.maxRetries = options.maxRetries || 3;
  this.adjusters = options.adjusters || {};
  this.baseDelay = options.baseDelay || 1000;
}

RetryWithAdjustment.prototype.execute = function (callLLM, context) {
  var self = this;
  var attempt = 0;

  function tryCall(currentContext) {
    attempt++;
    return callLLM(currentContext).catch(function (error) {
      if (attempt >= self.maxRetries) {
        error.attempts = attempt;
        error.finalContext = currentContext;
        throw error;
      }

      var adjustedContext = self.adjustContext(currentContext, error, attempt);
      var delay = self.baseDelay * Math.pow(2, attempt - 1);

      console.log(
        "[Retry] Attempt " + attempt + " failed: " + error.message +
        ". Retrying in " + delay + "ms with adjusted context."
      );

      return new Promise(function (resolve) {
        setTimeout(resolve, delay);
      }).then(function () {
        return tryCall(adjustedContext);
      });
    });
  }

  return tryCall(context);
};

RetryWithAdjustment.prototype.adjustContext = function (context, error, attempt) {
  var adjusted = JSON.parse(JSON.stringify(context));

  // Strategy 1: Add error feedback to the prompt
  if (error.type === LLM_ERROR_TYPES.MALFORMED_RESPONSE) {
    adjusted.messages.push({
      role: "user",
      content: "Your previous response was not valid JSON. " +
        "Please respond with ONLY valid JSON matching the required schema. " +
        "Error: " + error.message
    });
  }

  // Strategy 2: Simplify the request on repeated failures
  if (attempt >= 2 && adjusted.messages.length > 6) {
    adjusted.messages = [
      adjusted.messages[0],
      {
        role: "user",
        content: "Previous attempts failed. Here is a simplified version of the request: " +
          extractCoreRequest(adjusted.messages)
      }
    ];
  }

  // Strategy 3: Lower temperature for more deterministic output
  if (attempt >= 2) {
    adjusted.temperature = Math.max(0, (adjusted.temperature || 0.7) - 0.3);
  }

  // Strategy 4: Truncate context if approaching token limits
  if (error.type === LLM_ERROR_TYPES.CONTEXT_OVERFLOW) {
    adjusted.messages = truncateMiddleMessages(adjusted.messages);
  }

  return adjusted;
};

function extractCoreRequest(messages) {
  var lastUserMessage = "";
  for (var i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === "user") {
      lastUserMessage = messages[i].content;
      break;
    }
  }
  return lastUserMessage;
}

function truncateMiddleMessages(messages) {
  if (messages.length <= 4) return messages;
  var system = messages[0];
  var recent = messages.slice(-3);
  return [system].concat([{
    role: "user",
    content: "[Earlier conversation truncated to fit context window]"
  }]).concat(recent);
}

The key insight is that different failure types demand different adjustments. A malformed response needs stricter formatting instructions. A context overflow needs message truncation. A logic error might need the prompt simplified or the approach changed entirely.
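
Here is a minimal usage sketch, assuming the official openai Node SDK, a model name chosen purely for illustration, and a callLLM wrapper that tags JSON parse failures with the LLM_ERROR_TYPES constants above. None of these specifics come from the framework itself; treat them as placeholders.

var OpenAI = require("openai");
var openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical wrapper: calls the model and classifies malformed output.
function callLLM(context) {
  return openai.chat.completions.create({
    model: "gpt-4o-mini", // placeholder model name
    messages: context.messages,
    temperature: context.temperature
  }).then(function (response) {
    var content = response.choices[0].message.content;
    try {
      return JSON.parse(content);
    } catch (parseError) {
      var err = new Error("Response was not valid JSON: " + parseError.message);
      err.type = LLM_ERROR_TYPES.MALFORMED_RESPONSE;
      throw err;
    }
  });
}

var retrier = new RetryWithAdjustment({ maxRetries: 3, baseDelay: 1000 });

retrier.execute(callLLM, {
  messages: [
    { role: "system", content: "Respond with valid JSON only." },
    { role: "user", content: "Extract the invoice total from this text: ..." }
  ],
  temperature: 0.7
}).then(function (result) {
  console.log("Parsed result:", result);
}).catch(function (error) {
  console.error("Gave up after " + error.attempts + " attempts: " + error.message);
});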

Checkpoint and Resume

For long-running agent workflows, checkpointing is non-negotiable. If an agent crashes on step eight of a twelve-step plan, you should not have to restart from scratch.

var fs = require("fs");
var path = require("path");
var crypto = require("crypto");

function CheckpointManager(options) {
  this.checkpointDir = options.checkpointDir || "./checkpoints";
  this.ttlMs = options.ttlMs || 24 * 60 * 60 * 1000; // 24 hours default

  if (!fs.existsSync(this.checkpointDir)) {
    fs.mkdirSync(this.checkpointDir, { recursive: true });
  }
}

CheckpointManager.prototype.generateId = function (workflowName, input) {
  var hash = crypto.createHash("sha256");
  hash.update(workflowName + ":" + JSON.stringify(input));
  return hash.digest("hex").substring(0, 16);
};

CheckpointManager.prototype.save = function (workflowId, state) {
  var checkpoint = {
    workflowId: workflowId,
    timestamp: Date.now(),
    currentStep: state.currentStep,
    completedSteps: state.completedSteps,
    plan: state.plan,
    results: state.results,
    context: state.context,
    metadata: state.metadata || {}
  };

  var filePath = path.join(this.checkpointDir, workflowId + ".json");
  fs.writeFileSync(filePath, JSON.stringify(checkpoint, null, 2));
  console.log("[Checkpoint] Saved at step " + state.currentStep + " for workflow " + workflowId);
  return checkpoint;
};

CheckpointManager.prototype.load = function (workflowId) {
  var filePath = path.join(this.checkpointDir, workflowId + ".json");

  if (!fs.existsSync(filePath)) {
    return null;
  }

  var checkpoint = JSON.parse(fs.readFileSync(filePath, "utf8"));

  if (Date.now() - checkpoint.timestamp > this.ttlMs) {
    console.log("[Checkpoint] Expired checkpoint for workflow " + workflowId);
    fs.unlinkSync(filePath);
    return null;
  }

  console.log(
    "[Checkpoint] Resuming workflow " + workflowId +
    " from step " + checkpoint.currentStep
  );
  return checkpoint;
};

CheckpointManager.prototype.clear = function (workflowId) {
  var filePath = path.join(this.checkpointDir, workflowId + ".json");
  if (fs.existsSync(filePath)) {
    fs.unlinkSync(filePath);
  }
};

// Usage in an agent workflow
function ResumableAgent(options) {
  this.checkpoints = new CheckpointManager(options);
  this.llm = options.llmClient;
  this.tools = options.tools;
}

ResumableAgent.prototype.run = function (workflowName, input) {
  var self = this;
  var workflowId = this.checkpoints.generateId(workflowName, input);
  var existing = this.checkpoints.load(workflowId);

  var state;
  if (existing) {
    state = existing;
    console.log("[Agent] Resuming from step " + state.currentStep);
  } else {
    state = {
      currentStep: 0,
      completedSteps: [],
      plan: null,
      results: {},
      context: { input: input }
    };
  }

  return self.executePlan(workflowId, state);
};

ResumableAgent.prototype.executePlan = function (workflowId, state) {
  var self = this;

  function executeStep(stepIndex) {
    if (!state.plan || stepIndex >= state.plan.length) {
      self.checkpoints.clear(workflowId);
      return Promise.resolve(state.results);
    }

    var step = state.plan[stepIndex];
    state.currentStep = stepIndex;

    // Skip already-completed steps
    if (state.completedSteps.indexOf(stepIndex) !== -1) {
      return executeStep(stepIndex + 1);
    }

    // Save checkpoint before each step
    self.checkpoints.save(workflowId, state);

    return self.executeToolCall(step, state.context).then(function (result) {
      state.results[step.id] = result;
      state.completedSteps.push(stepIndex);
      state.context.lastResult = result;
      return executeStep(stepIndex + 1);
    });
  }

  if (!state.plan) {
    return self.generatePlan(state.context).then(function (plan) {
      state.plan = plan;
      return executeStep(0);
    });
  }

  return executeStep(state.currentStep);
};

A few critical details here. The checkpoint ID is derived from the workflow name and input, so retrying the same request automatically picks up where it left off. Checkpoints have a TTL so stale state doesn't linger indefinitely. And completed steps are tracked by index so the agent knows exactly where to resume.

Rollback Patterns

Checkpointing saves progress. Rollback undoes it. When an agent creates resources, modifies data, or triggers side effects across multiple steps, a failure partway through can leave the system in an inconsistent state. You need compensating actions.

function RollbackManager() {
  this.undoStack = [];
}

RollbackManager.prototype.record = function (action, undoAction) {
  this.undoStack.push({
    action: action,
    undo: undoAction,
    timestamp: Date.now()
  });
};

RollbackManager.prototype.rollback = function () {
  var self = this;
  var errors = [];

  console.log("[Rollback] Rolling back " + self.undoStack.length + " actions");

  function rollbackNext() {
    if (self.undoStack.length === 0) {
      if (errors.length > 0) {
        console.error("[Rollback] Completed with " + errors.length + " errors:", errors);
      }
      return Promise.resolve({ errors: errors });
    }

    var entry = self.undoStack.pop();
    console.log("[Rollback] Undoing: " + entry.action);

    return entry.undo().catch(function (err) {
      errors.push({ action: entry.action, error: err.message });
      // Continue rolling back even if one undo fails
    }).then(function () {
      return rollbackNext();
    });
  }

  return rollbackNext();
};

// Usage with an agent that creates resources
function executeWithRollback(agent, steps, toolClient) {
  var rollback = new RollbackManager();
  var results = [];

  function executeStep(index) {
    if (index >= steps.length) {
      return Promise.resolve(results);
    }

    var step = steps[index];

    return toolClient.execute(step).then(function (result) {
      results.push(result);

      // Record the compensating action
      if (step.type === "create_record") {
        rollback.record("Created record " + result.id, function () {
          return toolClient.execute({
            type: "delete_record",
            id: result.id
          });
        });
      } else if (step.type === "update_record") {
        var previousValue = result.previousValue;
        rollback.record("Updated record " + step.id, function () {
          return toolClient.execute({
            type: "update_record",
            id: step.id,
            data: previousValue
          });
        });
      } else if (step.type === "send_notification") {
        // Some actions cannot be undone — log them for manual intervention
        rollback.record("Sent notification " + result.notificationId, function () {
          console.warn("[Rollback] Cannot unsend notification " + result.notificationId);
          return Promise.resolve();
        });
      }

      return executeStep(index + 1);
    }).catch(function (error) {
      console.error("[Agent] Step " + index + " failed: " + error.message);
      console.log("[Agent] Initiating rollback of " + results.length + " completed steps");
      return rollback.rollback().then(function () {
        throw error; // Re-throw after rollback
      });
    });
  }

  return executeStep(0);
}

This follows the Saga pattern from distributed systems: each forward action records a compensating action on an undo stack, and on failure the stack unwinds in reverse order. Not every action can be undone — sent emails, published messages, charged credit cards — so the rollback handler needs to account for irreversible actions and flag them for manual review.

Self-Healing Agents

A self-healing agent diagnoses its own errors and corrects course without external intervention. The technique is straightforward: when the agent encounters an error, feed the error back into the LLM along with the original goal, and ask it to devise an alternative approach.

function SelfHealingExecutor(options) {
  this.llm = options.llmClient;
  this.maxHealingAttempts = options.maxHealingAttempts || 3;
  this.tools = options.tools;
}

SelfHealingExecutor.prototype.execute = function (goal, plan) {
  var self = this;
  var healingAttempts = 0;
  var errorHistory = [];

  function attemptExecution(currentPlan) {
    return self.executePlan(currentPlan).catch(function (error) {
      healingAttempts++;
      errorHistory.push({
        attempt: healingAttempts,
        plan: currentPlan,
        error: error.message,
        step: error.failedStep || "unknown"
      });

      if (healingAttempts >= self.maxHealingAttempts) {
        var finalError = new Error(
          "Agent failed after " + healingAttempts + " healing attempts"
        );
        finalError.errorHistory = errorHistory;
        throw finalError;
      }

      console.log("[SelfHeal] Attempt " + healingAttempts + " failed. Requesting diagnosis.");

      return self.diagnoseAndReplan(goal, errorHistory).then(function (newPlan) {
        console.log("[SelfHeal] Generated alternative plan with " + newPlan.length + " steps");
        return attemptExecution(newPlan);
      });
    });
  }

  return attemptExecution(plan);
};

SelfHealingExecutor.prototype.diagnoseAndReplan = function (goal, errorHistory) {
  var diagnosisPrompt = {
    role: "user",
    content: "You are an AI agent that encountered errors while trying to achieve a goal.\n\n" +
      "GOAL: " + goal + "\n\n" +
      "ERROR HISTORY:\n" + JSON.stringify(errorHistory, null, 2) + "\n\n" +
      "Analyze the errors and create a NEW plan that avoids the same failures. " +
      "Use different tools or approaches if the previous ones are unreliable. " +
      "Respond with a JSON array of step objects."
  };

  return this.llm.call([diagnosisPrompt]).then(function (response) {
    return JSON.parse(response.content);
  });
};

Self-healing works well for tool errors and logic errors. If the agent tried to query an API with the wrong parameters, it can read the error message, understand the correct parameters, and retry. If it chose the wrong tool entirely, it can select an alternative. The error history prevents it from repeating the same mistakes.

The danger is that self-healing can mask deeper problems. If the agent keeps "healing" by trying increasingly creative workarounds, it might produce results that technically succeed but are wrong. Set a hard limit on healing attempts and require validation of the final output.
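
One way to enforce that final validation is to wrap the executor with an explicit output check. This is a sketch under assumptions: validateOutput is a helper of your own design (a schema check, a heuristic, or a second LLM call acting as a judge), and the result shape shown is illustrative.

// Sketch: accept a healed result only if it passes a final validation step.
function executeWithValidation(executor, goal, plan, validateOutput) {
  return executor.execute(goal, plan).then(function (result) {
    return Promise.resolve(validateOutput(result)).then(function (verdict) {
      if (!verdict.valid) {
        var err = new Error("Output failed validation: " + verdict.reason);
        err.result = result;
        throw err; // escalate rather than accept a creative workaround
      }
      return result;
    });
  });
}

// Illustrative validator; replace with a check that fits your output schema.
function validateSummary(result) {
  var valid = result && typeof result.summary === "string" && result.summary.length > 0;
  return { valid: valid, reason: valid ? null : "Missing or empty summary" };
}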

Fallback Strategies

When a complex approach fails, fall back to a simpler one. This is the agent equivalent of graceful degradation.

function FallbackChain(strategies) {
  this.strategies = strategies; // Array of { name, execute, validate }
}

FallbackChain.prototype.run = function (input) {
  var self = this;
  var index = 0;
  var errors = [];

  function tryNext() {
    if (index >= self.strategies.length) {
      var err = new Error("All " + self.strategies.length + " strategies exhausted");
      err.strategyErrors = errors;
      return Promise.reject(err);
    }

    var strategy = self.strategies[index];
    index++;

    console.log("[Fallback] Trying strategy: " + strategy.name);

    return strategy.execute(input).then(function (result) {
      if (strategy.validate && !strategy.validate(result)) {
        errors.push({ strategy: strategy.name, error: "Validation failed" });
        return tryNext();
      }
      result._strategy = strategy.name;
      return result;
    }).catch(function (error) {
      errors.push({ strategy: strategy.name, error: error.message });
      console.log("[Fallback] Strategy '" + strategy.name + "' failed: " + error.message);
      return tryNext();
    });
  }

  return tryNext();
};

// Example: Summarization with fallback
var summarizationChain = new FallbackChain([
  {
    name: "full-analysis",
    execute: function (input) {
      return agentSummarize(input, { depth: "deep", tools: ["search", "extract"] });
    },
    validate: function (result) {
      return result.summary && result.summary.length > 100;
    }
  },
  {
    name: "simple-summary",
    execute: function (input) {
      return directLLMCall("Summarize this document: " + input.text);
    },
    validate: function (result) {
      return result.summary && result.summary.length > 50;
    }
  },
  {
    name: "extractive-fallback",
    execute: function (input) {
      // No LLM at all — just extract first and last paragraphs
      var paragraphs = input.text.split("\n\n").filter(Boolean);
      return Promise.resolve({
        summary: paragraphs[0] + "\n\n" + paragraphs[paragraphs.length - 1],
        partial: true
      });
    }
  }
]);

Notice how the chain degrades from a full agentic analysis (multiple tool calls, deep reasoning) to a single LLM call, and finally to a purely mechanical extraction that does not use the LLM at all. Each level trades quality for reliability.

Dead Letter Handling

Some failures are unrecoverable. The input is malformed, the required resource genuinely does not exist, or the task violates a constraint that no amount of retrying will resolve. These need a dead letter queue.

function DeadLetterQueue(options) {
  this.store = options.store; // File, database, or message queue
  this.onDeadLetter = options.onDeadLetter || function () {};
}

DeadLetterQueue.prototype.send = function (workflowId, error, context) {
  var entry = {
    workflowId: workflowId,
    timestamp: new Date().toISOString(),
    error: {
      message: error.message,
      type: error.type || "unknown",
      stack: error.stack
    },
    context: context,
    attempts: error.attempts || 0,
    errorHistory: error.errorHistory || [],
    status: "unresolved"
  };

  this.store.save("dead_letters", entry);
  this.onDeadLetter(entry);

  console.error(
    "[DeadLetter] Workflow " + workflowId + " moved to dead letter queue. " +
    "Error: " + error.message
  );

  return entry;
};

Dead letter entries preserve the full error history and agent context so an engineer can diagnose the failure, fix the underlying issue, and optionally replay the workflow. Without this, unrecoverable failures vanish into log noise.
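
A replay helper can pull an entry back out of the queue once the underlying issue is fixed. The sketch below is an addition to the DeadLetterQueue above and makes two assumptions: the store exposes find and update methods (the queue itself only needs save), and the context passed to send() contains the original goal and input for an agent exposing an execute(goal, input) method.

// Sketch: replay a dead-lettered workflow after the root cause is fixed.
DeadLetterQueue.prototype.replay = function (workflowId, agent) {
  var self = this;
  return Promise.resolve(self.store.find("dead_letters", { workflowId: workflowId }))
    .then(function (entry) {
      if (!entry) {
        throw new Error("No dead letter found for workflow " + workflowId);
      }
      console.log("[DeadLetter] Replaying workflow " + workflowId);
      return agent.execute(entry.context.goal, entry.context.input);
    })
    .then(function (result) {
      return Promise.resolve(
        self.store.update("dead_letters", workflowId, { status: "resolved" })
      ).then(function () {
        return result;
      });
    });
};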

Supervision Trees

Borrowed from Erlang/OTP, a supervision tree manages agent processes and restarts them according to a defined strategy. This is particularly useful when running multiple agents concurrently.

function Supervisor(options) {
  this.name = options.name;
  this.strategy = options.strategy || "one_for_one"; // or "one_for_all", "rest_for_one"
  this.maxRestarts = options.maxRestarts || 5;
  this.windowMs = options.windowMs || 60000;
  this.children = [];
  this.restartLog = [];
}

Supervisor.prototype.addChild = function (child) {
  this.children.push({
    spec: child,
    process: null,
    status: "stopped"
  });
};

Supervisor.prototype.startAll = function () {
  var self = this;
  return Promise.all(this.children.map(function (child, index) {
    return self.startChild(index);
  }));
};

Supervisor.prototype.startChild = function (index) {
  var self = this;
  var child = this.children[index];

  child.status = "running";
  child.process = child.spec.start();

  return child.process.then(function (result) {
    child.status = "completed";
    child.result = result;
    return result;
  }).catch(function (error) {
    child.status = "failed";
    child.error = error;
    console.error(
      "[Supervisor:" + self.name + "] Child '" + child.spec.name +
      "' failed: " + error.message
    );
    return self.handleChildFailure(index, error);
  });
};

Supervisor.prototype.handleChildFailure = function (failedIndex, error) {
  var self = this;
  var now = Date.now();

  // Clean old entries outside the window
  self.restartLog = self.restartLog.filter(function (t) {
    return now - t < self.windowMs;
  });

  self.restartLog.push(now);

  if (self.restartLog.length > self.maxRestarts) {
    var err = new Error(
      "Supervisor '" + self.name + "' exceeded max restarts (" +
      self.maxRestarts + " in " + self.windowMs + "ms)"
    );
    err.failedChildren = self.children
      .filter(function (c) { return c.status === "failed"; })
      .map(function (c) { return c.spec.name; });
    throw err;
  }

  if (self.strategy === "one_for_one") {
    // Restart only the failed child
    return self.startChild(failedIndex);
  } else if (self.strategy === "one_for_all") {
    // Restart all children
    return self.startAll();
  } else if (self.strategy === "rest_for_one") {
    // Restart the failed child and all children after it
    var promises = [];
    for (var i = failedIndex; i < self.children.length; i++) {
      promises.push(self.startChild(i));
    }
    return Promise.all(promises);
  }
};

// Usage
var agentSupervisor = new Supervisor({
  name: "agent-pool",
  strategy: "one_for_one",
  maxRestarts: 5,
  windowMs: 60000
});

agentSupervisor.addChild({
  name: "data-collector",
  start: function () { return dataCollectorAgent.run(); }
});

agentSupervisor.addChild({
  name: "analyzer",
  start: function () { return analyzerAgent.run(); }
});

agentSupervisor.addChild({
  name: "reporter",
  start: function () { return reporterAgent.run(); }
});

The three strategies serve different dependency models. one_for_one restarts only the failed agent — use this when agents are independent. one_for_all restarts everything — use this when agents share state and a partial restart would create inconsistency. rest_for_one restarts the failed agent and everything that depends on it — use this for pipeline architectures.

Circuit Breakers for Flaky Tools

When an external tool fails repeatedly, stop calling it. A circuit breaker prevents an agent from wasting tokens and time on a tool that is clearly down.

function CircuitBreaker(options) {
  this.name = options.name;
  this.failureThreshold = options.failureThreshold || 5;
  this.resetTimeoutMs = options.resetTimeoutMs || 30000;
  this.state = "CLOSED"; // CLOSED, OPEN, HALF_OPEN
  this.failureCount = 0;
  this.lastFailureTime = null;
  this.successCount = 0;
}

CircuitBreaker.prototype.call = function (fn) {
  var self = this;

  if (self.state === "OPEN") {
    if (Date.now() - self.lastFailureTime > self.resetTimeoutMs) {
      self.state = "HALF_OPEN";
      console.log("[CircuitBreaker:" + self.name + "] Transitioning to HALF_OPEN");
    } else {
      return Promise.reject(new Error(
        "Circuit breaker OPEN for tool '" + self.name +
        "'. Tool is temporarily unavailable."
      ));
    }
  }

  return fn().then(function (result) {
    if (self.state === "HALF_OPEN") {
      self.successCount++;
      if (self.successCount >= 2) {
        self.state = "CLOSED";
        self.failureCount = 0;
        self.successCount = 0;
        console.log("[CircuitBreaker:" + self.name + "] Circuit CLOSED — tool recovered");
      }
    }
    return result;
  }).catch(function (error) {
    self.failureCount++;
    self.lastFailureTime = Date.now();
    self.successCount = 0;

    if (self.failureCount >= self.failureThreshold) {
      self.state = "OPEN";
      console.error(
        "[CircuitBreaker:" + self.name + "] Circuit OPEN after " +
        self.failureCount + " failures"
      );
    }

    throw error;
  });
};

// Wrap agent tools with circuit breakers
function wrapToolsWithCircuitBreakers(tools) {
  var wrapped = {};

  Object.keys(tools).forEach(function (toolName) {
    var breaker = new CircuitBreaker({ name: toolName });
    var originalFn = tools[toolName];

    wrapped[toolName] = function () {
      var args = Array.prototype.slice.call(arguments);
      return breaker.call(function () {
        return originalFn.apply(null, args);
      });
    };

    wrapped[toolName].circuitBreaker = breaker;
  });

  return wrapped;
}

When the circuit breaker is open, the agent gets an immediate error telling it the tool is unavailable. A well-designed agent will then select an alternative tool or adjust its plan, rather than waiting for a timeout on every call.
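
You can make that easier for the planner by only advertising tools whose breakers are not open when you build or rebuild a plan. A minimal sketch, assuming the wrapped tools produced by wrapToolsWithCircuitBreakers above (each wrapper exposes its breaker on the circuitBreaker property):

// Sketch: list only tools whose circuit is not OPEN so the planner
// avoids tools that would be rejected immediately.
function availableTools(wrappedTools) {
  return Object.keys(wrappedTools).filter(function (name) {
    return wrappedTools[name].circuitBreaker.state !== "OPEN";
  });
}

// Example: build the tool list for a planning prompt.
function buildToolList(wrappedTools) {
  var usable = availableTools(wrappedTools);
  return "AVAILABLE TOOLS: " + usable.join(", ") + "\n" +
    "(Any tool not listed is temporarily unavailable; do not plan around it.)";
}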

Error Budgets for Agent Workflows

An error budget sets a threshold for acceptable failures within a workflow. Instead of failing on the first error or retrying until some maximum, you define a budget — say, two errors out of ten steps — and let the workflow continue as long as the budget is not exhausted.

function ErrorBudget(options) {
  this.maxErrors = options.maxErrors;
  this.maxErrorRate = options.maxErrorRate || 1.0;
  this.errors = [];
  this.totalSteps = 0;
}

ErrorBudget.prototype.record = function (step, error) {
  this.errors.push({ step: step, error: error.message, timestamp: Date.now() });
};

ErrorBudget.prototype.step = function () {
  this.totalSteps++;
};

ErrorBudget.prototype.isExhausted = function () {
  if (this.errors.length >= this.maxErrors) return true;
  if (this.totalSteps > 0) {
    var errorRate = this.errors.length / this.totalSteps;
    if (errorRate > this.maxErrorRate) return true;
  }
  return false;
};

ErrorBudget.prototype.remaining = function () {
  return this.maxErrors - this.errors.length;
};

This is particularly useful for data processing agents that operate on batches. If you are processing a hundred records and three fail, you probably want the agent to continue and report the failures at the end rather than aborting the entire batch.
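
A batch-processing sketch using the ErrorBudget above. The processRecord argument stands in for whatever per-item agent call you make; it is an assumption, not part of the budget class.

// Sketch: process a batch until it finishes or the error budget is exhausted.
function processBatch(records, processRecord) {
  var budget = new ErrorBudget({ maxErrors: 3 });
  var succeeded = [];
  var failed = [];

  function next(index) {
    if (index >= records.length || budget.isExhausted()) {
      return Promise.resolve({
        succeeded: succeeded,
        failed: failed,
        aborted: budget.isExhausted()
      });
    }

    budget.step();
    return processRecord(records[index]).then(function (result) {
      succeeded.push(result);
    }).catch(function (error) {
      budget.record(index, error);
      failed.push({ index: index, error: error.message });
    }).then(function () {
      return next(index + 1);
    });
  }

  return next(0);
}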

Structured Error Logging

Post-mortem analysis of agent failures requires structured logs, not just stack traces. You need to know what the agent was thinking, what it tried, and where its reasoning diverged from reality.

function AgentLogger(options) {
  this.workflowId = options.workflowId;
  this.entries = [];
  this.output = options.output || console;
}

AgentLogger.prototype.log = function (level, category, message, data) {
  var entry = {
    timestamp: new Date().toISOString(),
    workflowId: this.workflowId,
    level: level,
    category: category,
    message: message,
    data: data || {}
  };

  this.entries.push(entry);
  this.output.log(JSON.stringify(entry));
};

AgentLogger.prototype.logStep = function (stepIndex, stepName, status, details) {
  this.log("info", "step", stepName, {
    stepIndex: stepIndex,
    status: status,
    details: details
  });
};

AgentLogger.prototype.logLLMCall = function (prompt, response, durationMs, tokens) {
  this.log("debug", "llm", "LLM call completed", {
    promptLength: prompt.length,
    responseLength: response.length,
    durationMs: durationMs,
    tokensUsed: tokens
  });
};

AgentLogger.prototype.logToolCall = function (toolName, args, result, error) {
  this.log(error ? "error" : "info", "tool", "Tool call: " + toolName, {
    tool: toolName,
    arguments: args,
    result: error ? null : result,
    error: error ? error.message : null
  });
};

AgentLogger.prototype.logError = function (error, context) {
  this.log("error", "agent", error.message, {
    errorType: error.type || error.constructor.name,
    stack: error.stack,
    context: context,
    errorHistory: error.errorHistory || []
  });
};

AgentLogger.prototype.getTrace = function () {
  return {
    workflowId: this.workflowId,
    entries: this.entries,
    summary: {
      totalEntries: this.entries.length,
      errors: this.entries.filter(function (e) { return e.level === "error"; }).length,
      llmCalls: this.entries.filter(function (e) { return e.category === "llm"; }).length,
      toolCalls: this.entries.filter(function (e) { return e.category === "tool"; }).length,
      duration: this.entries.length > 1
        ? new Date(this.entries[this.entries.length - 1].timestamp) -
          new Date(this.entries[0].timestamp)
        : 0
    }
  };
};

Every LLM call, tool invocation, step transition, and error gets a structured log entry. After a failure, you can pull the full trace and see exactly what happened: what the model was asked, what it responded, which tool it called, what that tool returned, and where the chain broke.

Complete Working Example: Resilient Agent Framework

Here is a complete Node.js framework that ties all these patterns together into a production-ready resilient agent.

var EventEmitter = require("events");
var fs = require("fs");
var path = require("path");
var crypto = require("crypto");

// ============================================================
// ResilientAgent — Full Framework
// ============================================================

function ResilientAgent(options) {
  EventEmitter.call(this);

  this.name = options.name || "agent";
  this.llm = options.llmClient;
  this.tools = wrapToolsWithBreakers(options.tools || {});
  this.checkpoints = new CheckpointStore(options.checkpointDir || "./agent-checkpoints");
  this.logger = new StructuredLogger({ workflowId: null });
  this.deadLetters = options.deadLetterStore || new InMemoryStore();

  // Configuration
  this.maxRetries = options.maxRetries || 3;
  this.maxHealingAttempts = options.maxHealingAttempts || 2;
  this.errorBudget = options.errorBudget || 3;
  this.checkpointInterval = options.checkpointInterval || 1; // every N steps
}

ResilientAgent.prototype = Object.create(EventEmitter.prototype);
ResilientAgent.prototype.constructor = ResilientAgent;

ResilientAgent.prototype.execute = function (goal, input) {
  var self = this;
  // Derive a deterministic ID so retrying the same goal + input resumes from its checkpoint
  var workflowId = crypto.createHash("sha256")
    .update(goal + ":" + JSON.stringify(input))
    .digest("hex")
    .substring(0, 16);
  self.logger.workflowId = workflowId;

  self.logger.log("info", "workflow", "Starting workflow", { goal: goal, input: input });
  self.emit("workflow:start", { workflowId: workflowId, goal: goal });

  // Check for existing checkpoint
  var checkpoint = self.checkpoints.load(workflowId);

  var state;
  if (checkpoint) {
    state = checkpoint;
    self.logger.log("info", "workflow", "Resuming from checkpoint", {
      step: state.currentStep
    });
  } else {
    state = {
      workflowId: workflowId,
      goal: goal,
      input: input,
      currentStep: 0,
      completedSteps: [],
      plan: null,
      results: {},
      rollbackStack: [],
      errorCount: 0,
      healingAttempts: 0
    };
  }

  return self.runWorkflow(state).then(function (results) {
    self.checkpoints.clear(workflowId);
    self.logger.log("info", "workflow", "Workflow completed successfully", {
      steps: state.completedSteps.length,
      errors: state.errorCount
    });
    self.emit("workflow:complete", { workflowId: workflowId, results: results });
    return { success: true, results: results, trace: self.logger.getTrace() };
  }).catch(function (error) {
    self.logger.logError(error, state);

    // Send to dead letter queue
    self.deadLetters.save("dead_letters", {
      workflowId: workflowId,
      goal: goal,
      error: error.message,
      state: state,
      trace: self.logger.getTrace(),
      timestamp: new Date().toISOString()
    });

    self.emit("workflow:failed", { workflowId: workflowId, error: error });
    return { success: false, error: error.message, trace: self.logger.getTrace() };
  });
};

ResilientAgent.prototype.runWorkflow = function (state) {
  var self = this;

  // Step 1: Generate or use existing plan
  var planPromise;
  if (state.plan) {
    planPromise = Promise.resolve(state.plan);
  } else {
    planPromise = self.generatePlan(state.goal, state.input).then(function (plan) {
      state.plan = plan;
      return plan;
    });
  }

  return planPromise.then(function (plan) {
    return self.executeSteps(state);
  });
};

ResilientAgent.prototype.executeSteps = function (state) {
  var self = this;

  function nextStep(index) {
    if (index >= state.plan.length) {
      return Promise.resolve(state.results);
    }

    // Skip completed steps
    if (state.completedSteps.indexOf(index) !== -1) {
      return nextStep(index + 1);
    }

    state.currentStep = index;
    var step = state.plan[index];

    // Checkpoint before risky steps
    if (index % self.checkpointInterval === 0) {
      self.checkpoints.save(state);
    }

    self.logger.logStep(index, step.name, "started", step);

    // Execute with retry and context adjustment
    return self.executeStepWithRetry(step, state).then(function (result) {
      state.results[step.id || index] = result;
      state.completedSteps.push(index);

      // Record rollback action
      if (step.rollback) {
        state.rollbackStack.push({
          stepIndex: index,
          stepName: step.name,
          undo: step.rollback,
          result: result
        });
      }

      self.logger.logStep(index, step.name, "completed", { result: result });
      return nextStep(index + 1);

    }).catch(function (error) {
      state.errorCount++;
      self.logger.logStep(index, step.name, "failed", { error: error.message });

      // Check error budget
      if (state.errorCount > self.errorBudget) {
        // Rollback all completed steps
        return self.rollbackAll(state).then(function () {
          // Attempt self-healing
          if (state.healingAttempts < self.maxHealingAttempts) {
            state.healingAttempts++;
            return self.selfHeal(state, error).then(function (newPlan) {
              state.plan = newPlan;
              state.currentStep = 0;
              state.completedSteps = [];
              state.results = {};
              state.errorCount = 0;
              state.rollbackStack = [];
              return self.executeSteps(state);
            });
          }
          throw error;
        });
      }

      // Budget remaining — skip this step and continue
      self.logger.log("warn", "workflow",
        "Step failed but error budget allows continuation. " +
        "Budget remaining: " + (self.errorBudget - state.errorCount),
        { step: step.name }
      );

      state.results[step.id || index] = { skipped: true, error: error.message };
      state.completedSteps.push(index);
      return nextStep(index + 1);
    });
  }

  return nextStep(state.currentStep);
};

ResilientAgent.prototype.executeStepWithRetry = function (step, state) {
  var self = this;
  var attempts = 0;

  function attempt(adjustedStep) {
    attempts++;

    var toolName = adjustedStep.tool;
    var toolFn = self.tools[toolName];

    if (!toolFn) {
      return Promise.reject(new Error("Unknown tool: " + toolName));
    }

    self.logger.logToolCall(toolName, adjustedStep.args, null, null);

    return toolFn(adjustedStep.args).then(function (result) {
      self.logger.logToolCall(toolName, adjustedStep.args, result, null);
      return result;
    }).catch(function (error) {
      self.logger.logToolCall(toolName, adjustedStep.args, null, error);

      if (attempts >= self.maxRetries) {
        error.attempts = attempts;
        throw error;
      }

      // Adjust and retry
      var delay = 1000 * Math.pow(2, attempts - 1);
      console.log(
        "[" + self.name + "] Step '" + step.name + "' attempt " +
        attempts + " failed. Retrying in " + delay + "ms."
      );

      return new Promise(function (resolve) {
        setTimeout(resolve, delay);
      }).then(function () {
        var adjusted = self.adjustStep(adjustedStep, error, attempts);
        return attempt(adjusted);
      });
    });
  }

  return attempt(step);
};

ResilientAgent.prototype.adjustStep = function (step, error, attempt) {
  var adjusted = JSON.parse(JSON.stringify(step));

  // Add error context so the LLM (if involved) can adapt
  if (!adjusted.metadata) adjusted.metadata = {};
  adjusted.metadata.previousError = error.message;
  adjusted.metadata.attempt = attempt;

  // If args contain a prompt, append error feedback
  if (adjusted.args && adjusted.args.prompt) {
    adjusted.args.prompt += "\n\n[Previous attempt failed with: " + error.message +
      ". Please adjust your approach.]";
  }

  return adjusted;
};

ResilientAgent.prototype.rollbackAll = function (state) {
  var errors = [];

  function rollbackNext(stack) {
    if (stack.length === 0) {
      return Promise.resolve(errors);
    }

    var entry = stack.pop();
    console.log("[Rollback] Undoing step: " + entry.stepName);

    return Promise.resolve().then(function () {
      if (typeof entry.undo === "function") {
        return entry.undo(entry.result);
      }
    }).catch(function (err) {
      errors.push({ step: entry.stepName, error: err.message });
    }).then(function () {
      return rollbackNext(stack);
    });
  }

  return rollbackNext(state.rollbackStack.slice());
};

ResilientAgent.prototype.selfHeal = function (state, error) {
  var self = this;

  self.logger.log("info", "healing", "Attempting self-healing", {
    attempt: state.healingAttempts,
    error: error.message
  });

  var prompt = "You are an AI agent planner. Your previous plan failed.\n\n" +
    "GOAL: " + state.goal + "\n" +
    "AVAILABLE TOOLS: " + Object.keys(self.tools).join(", ") + "\n" +
    "FAILED PLAN: " + JSON.stringify(state.plan, null, 2) + "\n" +
    "ERROR: " + error.message + "\n\n" +
    "Create a new plan that avoids this failure. " +
    "Return a JSON array of steps.";

  return self.llm.call(prompt).then(function (response) {
    return JSON.parse(response);
  });
};

ResilientAgent.prototype.generatePlan = function (goal, input) {
  var self = this;

  var prompt = "Create a step-by-step plan to achieve this goal.\n\n" +
    "GOAL: " + goal + "\n" +
    "INPUT: " + JSON.stringify(input) + "\n" +
    "AVAILABLE TOOLS: " + Object.keys(self.tools).join(", ") + "\n\n" +
    "Return a JSON array of step objects with: name, tool, args, and optional rollback description.";

  return self.llm.call(prompt).then(function (response) {
    return JSON.parse(response);
  });
};

// ============================================================
// Helper Classes
// ============================================================

function CheckpointStore(dir) {
  this.dir = dir;
  if (!fs.existsSync(dir)) fs.mkdirSync(dir, { recursive: true });
}

CheckpointStore.prototype.save = function (state) {
  var filePath = path.join(this.dir, state.workflowId + ".json");
  fs.writeFileSync(filePath, JSON.stringify(state, null, 2));
};

CheckpointStore.prototype.load = function (workflowId) {
  var filePath = path.join(this.dir, workflowId + ".json");
  if (!fs.existsSync(filePath)) return null;
  return JSON.parse(fs.readFileSync(filePath, "utf8"));
};

CheckpointStore.prototype.clear = function (workflowId) {
  var filePath = path.join(this.dir, workflowId + ".json");
  if (fs.existsSync(filePath)) fs.unlinkSync(filePath);
};

function StructuredLogger(options) {
  this.workflowId = options.workflowId;
  this.entries = [];
}

StructuredLogger.prototype.log = function (level, category, message, data) {
  var entry = {
    timestamp: new Date().toISOString(),
    workflowId: this.workflowId,
    level: level,
    category: category,
    message: message,
    data: data || {}
  };
  this.entries.push(entry);
  if (level === "error") {
    console.error(JSON.stringify(entry));
  }
};

StructuredLogger.prototype.logStep = function (index, name, status, details) {
  this.log("info", "step", name + " — " + status, { stepIndex: index, details: details });
};

StructuredLogger.prototype.logToolCall = function (tool, args, result, error) {
  this.log(error ? "error" : "debug", "tool", "Tool: " + tool, {
    args: args,
    result: error ? null : result,
    error: error ? error.message : null
  });
};

StructuredLogger.prototype.logError = function (error, context) {
  this.log("error", "agent", error.message, {
    stack: error.stack,
    context: { step: context.currentStep, goal: context.goal }
  });
};

StructuredLogger.prototype.getTrace = function () {
  return { workflowId: this.workflowId, entries: this.entries };
};

function InMemoryStore() {
  this.data = {};
}

InMemoryStore.prototype.save = function (collection, item) {
  if (!this.data[collection]) this.data[collection] = [];
  this.data[collection].push(item);
};
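
// Note: this helper assumes the CircuitBreaker class from the
// "Circuit Breakers for Flaky Tools" section above is in scope.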

function wrapToolsWithBreakers(tools) {
  var wrapped = {};
  Object.keys(tools).forEach(function (name) {
    var breaker = new CircuitBreaker({ name: name });
    var fn = tools[name];
    wrapped[name] = function (args) {
      return breaker.call(function () { return fn(args); });
    };
  });
  return wrapped;
}

// ============================================================
// Usage
// ============================================================

var agent = new ResilientAgent({
  name: "data-pipeline",
  llmClient: { call: function (prompt) { /* your LLM call */ } },
  tools: {
    fetch_data: function (args) { /* fetch from API */ },
    transform: function (args) { /* transform data */ },
    store: function (args) { /* write to database */ }
  },
  maxRetries: 3,
  maxHealingAttempts: 2,
  errorBudget: 3,
  checkpointDir: "./pipeline-checkpoints"
});

agent.on("workflow:complete", function (event) {
  console.log("Workflow " + event.workflowId + " completed");
});

agent.on("workflow:failed", function (event) {
  console.error("Workflow " + event.workflowId + " failed: " + event.error.message);
});

agent.execute("Process and store daily analytics", { date: "2026-02-11" })
  .then(function (result) {
    console.log("Result:", JSON.stringify(result, null, 2));
  });

module.exports = ResilientAgent;

Common Issues and Troubleshooting

1. Infinite Healing Loops

Symptom:

[SelfHeal] Attempt 1 failed. Requesting diagnosis.
[SelfHeal] Generated alternative plan with 5 steps
[SelfHeal] Attempt 2 failed. Requesting diagnosis.
[SelfHeal] Generated alternative plan with 4 steps
[SelfHeal] Attempt 3 failed. Requesting diagnosis.
Error: Agent failed after 3 healing attempts

Cause: The self-healing mechanism generates plans that hit the same underlying issue. Common when the model does not have enough context about why the previous plans failed.

Fix: Pass the full error history to the replanning prompt, not just the most recent error. Include the failed plans themselves so the model can see what it already tried. Add explicit instructions like "Do NOT use tool X" if that tool is the source of the failure.
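
A sketch of a replanning prompt that carries the full history and explicitly bans a failing tool. The field names match the errorHistory entries built by SelfHealingExecutor above; the bannedTools argument is an assumption you would populate from your own failure analysis.

// Sketch: include every failed plan and error, not just the latest one.
function buildReplanPrompt(goal, errorHistory, bannedTools) {
  var history = errorHistory.map(function (entry, i) {
    return "Attempt " + (i + 1) + ":\n" +
      "  Plan: " + JSON.stringify(entry.plan) + "\n" +
      "  Failed step: " + entry.step + "\n" +
      "  Error: " + entry.error;
  }).join("\n\n");

  var constraints = bannedTools && bannedTools.length > 0
    ? "Do NOT use these tools: " + bannedTools.join(", ") + "\n\n"
    : "";

  return "GOAL: " + goal + "\n\n" +
    "PREVIOUS ATTEMPTS (do not repeat these plans):\n" + history + "\n\n" +
    constraints +
    "Create a NEW plan that takes a different approach. " +
    "Respond with a JSON array of step objects.";
}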

2. Checkpoint Corruption After Schema Changes

Symptom:

TypeError: Cannot read properties of undefined (reading 'length')
    at ResilientAgent.executeSteps (agent.js:142)

Cause: You updated your agent's state schema (added or renamed fields), but an existing checkpoint was saved with the old schema. The agent resumes from the checkpoint and encounters undefined fields.

Fix: Add a schema version to your checkpoints. When loading, compare the checkpoint version against the current version. If they differ, discard the checkpoint and start fresh. Alternatively, write migration logic that backfills missing fields with defaults.
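
A versioned variant of the CheckpointStore methods from the framework. The CHECKPOINT_SCHEMA_VERSION constant is an assumption: bump it whenever the shape of the state object changes.

// Sketch: stamp checkpoints with a schema version and discard mismatches.
var CHECKPOINT_SCHEMA_VERSION = 2; // bump when the state shape changes

CheckpointStore.prototype.save = function (state) {
  state.schemaVersion = CHECKPOINT_SCHEMA_VERSION;
  var filePath = path.join(this.dir, state.workflowId + ".json");
  fs.writeFileSync(filePath, JSON.stringify(state, null, 2));
};

CheckpointStore.prototype.load = function (workflowId) {
  var filePath = path.join(this.dir, workflowId + ".json");
  if (!fs.existsSync(filePath)) return null;

  var checkpoint = JSON.parse(fs.readFileSync(filePath, "utf8"));
  if (checkpoint.schemaVersion !== CHECKPOINT_SCHEMA_VERSION) {
    console.warn("[Checkpoint] Schema mismatch for " + workflowId + ", starting fresh");
    fs.unlinkSync(filePath);
    return null;
  }
  return checkpoint;
};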

3. Circuit Breaker Stays Open During Deployments

Symptom:

Error: Circuit breaker OPEN for tool 'payment_api'. Tool is temporarily unavailable.

Cause: A downstream service went down during a deployment. The circuit breaker tripped (correct behavior), but the agent does not know the deployment is temporary and keeps rejecting calls long after the service recovers.

Fix: Tune the resetTimeoutMs to match your deployment window. For most services, 30 to 60 seconds is appropriate. Use the HALF_OPEN state to probe with a single request before fully closing the circuit. If your deployment process supports it, send a health check signal that resets the circuit breaker proactively.
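
If your deployment pipeline can signal the agent process after a successful health check, a manual reset hook is straightforward. The reset method below is an addition to the CircuitBreaker shown earlier, not part of it.

// Sketch: manual reset, e.g. triggered by a post-deploy health check hook.
CircuitBreaker.prototype.reset = function () {
  this.state = "CLOSED";
  this.failureCount = 0;
  this.successCount = 0;
  this.lastFailureTime = null;
  console.log("[CircuitBreaker:" + this.name + "] Manually reset to CLOSED");
};

function resetAllBreakers(wrappedTools) {
  Object.keys(wrappedTools).forEach(function (name) {
    if (wrappedTools[name].circuitBreaker) {
      wrappedTools[name].circuitBreaker.reset();
    }
  });
}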

4. Rollback Fails on Irreversible Actions

Symptom:

[Rollback] Undoing step: send_confirmation_email
WARN: Cannot unsend notification ntf_abc123
[Rollback] Completed with 1 errors: [{"action":"send_confirmation_email","error":"Irreversible action"}]

Cause: The agent executed a step with side effects that cannot be undone (emails sent, webhooks fired, payments charged), and a subsequent step failed.

Fix: Order your plan so irreversible steps come last. Use a two-phase approach: first execute all reversible steps, validate the results, and only then execute irreversible steps. For truly critical workflows, implement a "prepare and commit" pattern where irreversible actions are staged but not executed until the entire workflow validates successfully.
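
A small sketch of the reordering idea. It assumes each step carries an irreversible flag (something your planner prompt would have to request), a function that executes a list of steps, and a validator for the intermediate results.

// Sketch: run reversible steps first, validate, then commit irreversible ones.
function partitionByReversibility(plan) {
  return {
    reversible: plan.filter(function (s) { return !s.irreversible; }),
    irreversible: plan.filter(function (s) { return s.irreversible; })
  };
}

function executeTwoPhase(executeSteps, plan, validateResults) {
  var phases = partitionByReversibility(plan);

  return executeSteps(phases.reversible).then(function (results) {
    if (!validateResults(results)) {
      throw new Error("Phase 1 failed validation; irreversible steps were not executed");
    }
    // Only now run the side effects that cannot be undone.
    return executeSteps(phases.irreversible).then(function (finalResults) {
      return { phase1: results, phase2: finalResults };
    });
  });
}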

5. Token Budget Exhaustion During Self-Healing

Symptom:

Error: 400 Bad Request - maximum context length exceeded

Cause: Each healing attempt adds the error history, failed plans, and diagnosis to the context. After two or three healing cycles, the accumulated context exceeds the model's context window.

Fix: Summarize the error history instead of including it verbatim. Keep only the most recent two error entries. When regenerating plans, start with a fresh system prompt rather than appending to the existing conversation. Use truncateMiddleMessages as shown in the retry adjustment section.
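
A compact summarizer sketch along those lines: keep the two most recent errors verbatim and collapse everything older into a one-line digest before it goes back into the prompt.

// Sketch: compress older errors so healing attempts do not blow the context.
function summarizeErrorHistory(errorHistory) {
  if (errorHistory.length <= 2) return errorHistory;

  var older = errorHistory.slice(0, -2);
  var recent = errorHistory.slice(-2);
  var digest = older.map(function (e) { return e.error; }).join("; ").substring(0, 200);

  return [{ summary: older.length + " earlier attempts failed (" + digest + ")" }]
    .concat(recent);
}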

Best Practices

  • Classify errors before choosing a recovery strategy. A rate limit needs a backoff delay. A logic error needs a prompt adjustment. A corrupted state needs a rollback. Applying the wrong recovery pattern wastes tokens and time, and can make the failure worse.

  • Set hard limits on every recovery loop. Every retry, healing attempt, and fallback chain must have a maximum iteration count. Without limits, non-deterministic systems can loop indefinitely, burning through API credits and compute while producing no useful output.

  • Checkpoint before every state-changing step, not after. If you checkpoint after a step completes, a crash during the step means you resume with stale state and re-execute the step, potentially creating duplicates. Checkpoint before the step so the resume logic knows the step still needs to execute.

  • Design for partial results. Not every workflow needs to succeed completely or fail completely. An agent that processed 97 of 100 records successfully should report those 97 results along with the 3 failures, not discard everything.

  • Log the model's reasoning, not just the outcome. When debugging agent failures, you need to know what the model was thinking. Log the full prompt, the raw response, and the parsed result for every LLM call. In production, use sampling (log 10% of successful calls, 100% of failures) to manage volume; see the sampling sketch after this list.

  • Test your recovery paths as thoroughly as your happy paths. Inject failures deliberately: mock a tool to fail after the third call, simulate a rate limit on every other request, corrupt a checkpoint file. If you have never seen your rollback logic execute in a test, you cannot trust it in production.

  • Use error budgets over strict pass/fail for batch workflows. When processing a hundred items, define what "acceptable" means. Three failures out of a hundred might be fine. Three failures out of five is not. The budget should scale with the batch size and the business criticality of the data.

  • Keep rollback actions simple and idempotent. A compensating action that itself can fail creates a nested recovery problem. Rollbacks should be straightforward API calls (delete this record, revert this field) that can be safely retried without side effects.

  • Separate the supervision layer from the agent logic. The agent should not know or care that it is being supervised. The supervisor watches from outside, restarts on failure, and escalates when restart limits are exceeded. This separation makes both components easier to test and reason about.

  • Implement dead letter queues from day one. You will have unrecoverable failures. If you do not capture them with full context, you will not be able to diagnose them, fix the underlying issue, or replay them after the fix.
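
As referenced in the logging bullet above, here is a minimal sampling wrapper around the framework's StructuredLogger. The 10% sample rate is an illustrative default, not a recommendation from the framework itself.

// Sketch: log every failed LLM call, sample a fraction of successful ones.
function SamplingLogger(logger, sampleRate) {
  this.logger = logger;
  this.sampleRate = sampleRate || 0.1;
}

SamplingLogger.prototype.logLLMCall = function (prompt, response, error) {
  var isFailure = Boolean(error);
  if (isFailure || Math.random() < this.sampleRate) {
    this.logger.log(isFailure ? "error" : "debug", "llm", "LLM call", {
      prompt: prompt,
      response: response,
      error: error ? error.message : null,
      sampled: !isFailure
    });
  }
};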
