LLM API Error Handling and Retry Patterns

Production patterns for handling LLM API errors including retries, circuit breakers, fallback chains, and graceful degradation in Node.js.

Every production system that depends on LLM APIs will encounter failures. Rate limits, timeouts, server errors, content filtering rejections, and partial streaming failures are not edge cases — they are the normal operating reality when you call external AI services at scale. The difference between a fragile prototype and a production-grade integration comes down to how you handle these failures.

This article covers the full taxonomy of LLM API errors you will encounter, concrete retry and circuit breaker patterns, fallback strategies, and a complete working client wrapper that ties everything together in Node.js.

Prerequisites

  • Node.js 18+ installed
  • Familiarity with HTTP APIs and async/await patterns
  • Basic understanding of LLM API request/response structure (OpenAI, Anthropic, or similar)
  • Experience with axios or node-fetch for HTTP requests

Taxonomy of LLM API Errors

Before you can handle errors intelligently, you need to classify them. Not all errors deserve the same response. Some should be retried immediately, some after a delay, and some should never be retried at all.

Rate Limit Errors (429)

Every LLM provider enforces rate limits — requests per minute, tokens per minute, or both. When you exceed them, you get a 429 status code. These are the most common errors in high-traffic systems, and they are always retryable.

// Typical rate limit response from OpenAI
{
  "error": {
    "message": "Rate limit reached for gpt-4 in organization org-xxx on requests per min (RPM): Limit 500, Used 500, Requested 1.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}

The response often includes a Retry-After header or an x-ratelimit-reset-tokens header that tells you exactly when you can retry. Always check for these headers before falling back to your own backoff calculation.
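As a sketch, a header-aware parser would handle both forms Retry-After can take, numeric seconds and an HTTP-date (header names and formats vary by provider, so verify against your provider's documentation):

function parseRetryAfterHeader(headers) {
  var retryAfter = headers["retry-after"];
  if (!retryAfter) return null;

  // Form 1: delay in seconds, e.g. "Retry-After: 20"
  var seconds = parseFloat(retryAfter);
  if (!isNaN(seconds)) return seconds * 1000;

  // Form 2: HTTP-date, e.g. "Retry-After: Wed, 21 Oct 2025 07:28:00 GMT"
  var resetAt = Date.parse(retryAfter);
  if (!isNaN(resetAt)) return Math.max(0, resetAt - Date.now());

  return null;
}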

Timeout Errors

LLM completions can take anywhere from 200 milliseconds to over 60 seconds depending on the model, prompt length, and output length. Timeouts come in two forms: your client-side timeout firing before the server responds, or the server itself timing out (504 Gateway Timeout). Both are retryable, but you need to be careful about idempotency.
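A minimal sketch of a client-side timeout using Node 18's built-in fetch and AbortSignal (the URL and payload are placeholders, not a specific provider's API):

function completeWithTimeout(url, payload, apiKey, timeoutMs) {
  // AbortSignal.timeout() aborts the request if no response arrives in time;
  // the rejection surfaces as a TimeoutError you can classify as retryable
  return fetch(url, {
    method: "POST",
    headers: {
      "Authorization": "Bearer " + apiKey,
      "Content-Type": "application/json"
    },
    body: JSON.stringify(payload),
    signal: AbortSignal.timeout(timeoutMs)
  }).then(function (res) {
    if (!res.ok) {
      var error = new Error("HTTP " + res.status);
      error.response = { status: res.status };
      throw error;
    }
    return res.json();
  });
}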

Server Errors (500, 502, 503)

LLM providers have outages. OpenAI, Anthropic, Google — all of them have had multi-hour incidents. A 500 means something broke internally. A 502 or 503 usually means overload or deployment issues. These are retryable, but you should back off aggressively because hammering a struggling service makes things worse for everyone.

Invalid Request Errors (400)

These are your fault. The prompt was too long, the model name was wrong, a required parameter was missing, or the JSON was malformed. Never retry 400 errors — fix the request instead.

// Context length exceeded
{
  "error": {
    "message": "This model's maximum context length is 128000 tokens. However, your messages resulted in 131072 tokens.",
    "type": "invalid_request_error",
    "code": "context_length_exceeded"
  }
}

The one exception is context_length_exceeded. You can handle this programmatically by truncating the input and retrying with a shorter prompt.

Content Filtering Errors

LLM providers reject requests that trigger safety filters. These are not retryable with the same input — the content was flagged for a reason. You can sometimes retry with a modified prompt, but that requires careful handling to avoid an infinite loop of filtered content.

Authentication Errors (401, 403)

Your API key is invalid, expired, or lacks permissions. Never retry these. Log them, alert, and fail fast.

Overloaded Model Errors

Some providers return specific error codes when a particular model is overloaded. Anthropic returns a 529 "overloaded" status. These are retryable and are a strong signal to use a fallback model.

Implementing Retry with Exponential Backoff and Jitter

The naive approach to retries — wait a fixed amount and try again — creates thundering herd problems. When a rate limit hits multiple clients simultaneously and they all retry at the same interval, you get a synchronized burst that triggers another rate limit. Exponential backoff with jitter solves this.

var axios = require("axios");

function calculateBackoff(attempt, baseDelay, maxDelay) {
  // Exponential backoff: 1s, 2s, 4s, 8s, 16s...
  var exponentialDelay = baseDelay * Math.pow(2, attempt);
  // Add jitter: random value between 0 and the exponential delay
  var jitter = Math.random() * exponentialDelay;
  // Cap at max delay
  return Math.min(exponentialDelay + jitter, maxDelay);
}

function isRetryableError(error) {
  if (!error.response) {
    // Network error or timeout — retryable
    return true;
  }
  var status = error.response.status;
  // 429 (rate limit), 500, 502, 503, 504, 529 (overloaded)
  var retryableStatuses = [429, 500, 502, 503, 504, 529];
  return retryableStatuses.indexOf(status) !== -1;
}

function getRetryAfterMs(error) {
  if (!error.response || !error.response.headers) {
    return null;
  }
  var retryAfter = error.response.headers["retry-after"];
  if (retryAfter) {
    var seconds = parseFloat(retryAfter);
    if (!isNaN(seconds)) {
      return seconds * 1000;
    }
  }
  return null;
}

function retryWithBackoff(fn, options) {
  var maxRetries = options.maxRetries || 3;
  var baseDelay = options.baseDelay || 1000;
  var maxDelay = options.maxDelay || 60000;

  return new Promise(function (resolve, reject) {
    function attempt(retryCount) {
      fn()
        .then(resolve)
        .catch(function (error) {
          if (retryCount >= maxRetries || !isRetryableError(error)) {
            reject(error);
            return;
          }

          var serverDelay = getRetryAfterMs(error);
          var calculatedDelay = calculateBackoff(retryCount, baseDelay, maxDelay);
          var delay = serverDelay ? Math.max(serverDelay, calculatedDelay) : calculatedDelay;

          console.log(
            "[Retry] Attempt " + (retryCount + 1) + "/" + maxRetries +
            " after " + Math.round(delay) + "ms" +
            " (status: " + (error.response ? error.response.status : "network") + ")"
          );

          setTimeout(function () {
            attempt(retryCount + 1);
          }, delay);
        });
    }
    attempt(0);
  });
}

The key details here: we always respect the server's Retry-After header when present, we add randomized jitter to prevent synchronized retries across clients, and we classify errors before deciding whether to retry.
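Wrapping a real request is then a matter of passing a function that returns the promise. This usage sketch assumes the OpenAI chat completions endpoint shown earlier:

retryWithBackoff(function () {
  return axios.post("https://api.openai.com/v1/chat/completions", {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: "Classify this support ticket: printer on fire" }]
  }, {
    headers: { "Authorization": "Bearer " + process.env.OPENAI_API_KEY },
    timeout: 30000
  });
}, { maxRetries: 3, baseDelay: 1000, maxDelay: 60000 })
  .then(function (res) {
    console.log(res.data.choices[0].message.content);
  })
  .catch(function (error) {
    console.error("Request failed after retries: " + error.message);
  });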

Circuit Breaker Pattern for LLM Endpoints

Retries help with transient failures, but when a provider is genuinely down, retries just waste time and resources. A circuit breaker tracks failure rates and short-circuits requests when the failure rate exceeds a threshold, failing fast instead of waiting for inevitable timeouts.

function CircuitBreaker(options) {
  this.failureThreshold = options.failureThreshold || 5;
  this.resetTimeout = options.resetTimeout || 60000;
  this.monitorWindow = options.monitorWindow || 30000;

  this.state = "CLOSED"; // CLOSED, OPEN, HALF_OPEN
  this.failures = [];
  this.lastFailureTime = null;
  this.successCount = 0;
}

CircuitBreaker.prototype.recordFailure = function () {
  var now = Date.now();
  this.failures.push(now);
  this.lastFailureTime = now;

  // A failure while HALF_OPEN means the service has not recovered; reopen immediately
  if (this.state === "HALF_OPEN") {
    this.state = "OPEN";
    console.log("[CircuitBreaker] Probe failed, state changed back to OPEN");
    return;
  }

  // Only count failures within the monitoring window
  var windowStart = now - this.monitorWindow;
  this.failures = this.failures.filter(function (t) {
    return t > windowStart;
  });

  if (this.failures.length >= this.failureThreshold) {
    this.state = "OPEN";
    console.log("[CircuitBreaker] State changed to OPEN — failing fast for " +
      this.resetTimeout + "ms");
  }
};

CircuitBreaker.prototype.recordSuccess = function () {
  if (this.state === "HALF_OPEN") {
    this.successCount++;
    if (this.successCount >= 2) {
      this.state = "CLOSED";
      this.failures = [];
      this.successCount = 0;
      console.log("[CircuitBreaker] State changed to CLOSED — service recovered");
    }
  }
};

CircuitBreaker.prototype.canExecute = function () {
  if (this.state === "CLOSED") {
    return true;
  }
  if (this.state === "OPEN") {
    var elapsed = Date.now() - this.lastFailureTime;
    if (elapsed >= this.resetTimeout) {
      this.state = "HALF_OPEN";
      this.successCount = 0;
      console.log("[CircuitBreaker] State changed to HALF_OPEN — testing service");
      return true;
    }
    return false;
  }
  // HALF_OPEN — allow limited requests to test recovery
  return true;
};

CircuitBreaker.prototype.execute = function (fn) {
  var self = this;
  if (!this.canExecute()) {
    return Promise.reject(new Error("Circuit breaker is OPEN — request blocked"));
  }
  return fn().then(
    function (result) {
      self.recordSuccess();
      return result;
    },
    function (error) {
      self.recordFailure();
      throw error;
    }
  );
};

In practice, you want one circuit breaker per provider or per model endpoint, not one global breaker. If OpenAI's GPT-4 endpoint is down but GPT-3.5 is fine, you want to fail fast on GPT-4 while still routing to GPT-3.5.
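A small registry keyed by provider and model keeps that separation explicit. This is a sketch of the same idea the complete client below implements; callOpenAI is a hypothetical request function:

function BreakerRegistry(breakerOptions) {
  this.breakerOptions = breakerOptions || {};
  this.breakers = {};
}

BreakerRegistry.prototype.get = function (provider, model) {
  var key = provider + ":" + model;
  if (!this.breakers[key]) {
    this.breakers[key] = new CircuitBreaker(this.breakerOptions);
  }
  return this.breakers[key];
};

// Usage: each endpoint trips independently
var registry = new BreakerRegistry({ failureThreshold: 5, resetTimeout: 60000 });

registry.get("openai", "gpt-4o").execute(function () {
  return callOpenAI("gpt-4o", messages); // hypothetical provider call and message array
});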

Handling Partial and Streaming Failures

Streaming responses (Server-Sent Events) introduce a category of failure that does not exist with regular HTTP requests: the response starts successfully but breaks partway through. You get half a response, then the connection drops. Your retry logic needs to handle this differently from a complete failure.

var https = require("https");

function streamCompletion(url, payload, headers, options) {
  var timeout = options.timeout || 120000;
  var onChunk = options.onChunk || function () {};

  return new Promise(function (resolve, reject) {
    var buffer = "";
    var chunks = [];
    var finished = false;

    var requestOptions = {
      method: "POST",
      headers: headers,
      timeout: timeout
    };

    var parsedUrl = new URL(url);
    requestOptions.hostname = parsedUrl.hostname;
    requestOptions.path = parsedUrl.pathname + parsedUrl.search;

    var req = https.request(requestOptions, function (res) {
      if (res.statusCode !== 200) {
        var errorBody = "";
        res.on("data", function (chunk) { errorBody += chunk; });
        res.on("end", function () {
          var error = new Error("HTTP " + res.statusCode);
          error.response = { status: res.statusCode, data: errorBody, headers: res.headers };
          reject(error);
        });
        return;
      }

      res.on("data", function (chunk) {
        buffer += chunk.toString();
        var lines = buffer.split("\n");
        buffer = lines.pop(); // Keep incomplete line in buffer

        lines.forEach(function (line) {
          if (line.startsWith("data: ")) {
            var data = line.slice(6).trim();
            if (data === "[DONE]") {
              finished = true;
              return;
            }
            try {
              var parsed = JSON.parse(data);
              chunks.push(parsed);
              onChunk(parsed);
            } catch (e) {
              // Malformed chunk — log but continue
              console.warn("[Stream] Malformed SSE chunk: " + data);
            }
          }
        });
      });

      res.on("end", function () {
        if (!finished && chunks.length > 0) {
          // Stream ended prematurely — we have partial data
          var error = new Error("Stream terminated prematurely");
          error.partialData = chunks;
          error.isPartialFailure = true;
          reject(error);
          return;
        }
        resolve(chunks);
      });

      res.on("error", function (err) {
        err.partialData = chunks;
        err.isPartialFailure = chunks.length > 0;
        reject(err);
      });
    });

    req.on("timeout", function () {
      req.destroy();
      var error = new Error("Request timeout after " + timeout + "ms");
      error.code = "ETIMEDOUT";
      error.partialData = chunks;
      error.isPartialFailure = chunks.length > 0;
      reject(error);
    });

    req.write(JSON.stringify(payload));
    req.end();
  });
}

When you detect a partial failure, you have a decision to make: retry the entire request, or use the partial output. If you received 80% of a response and the content is usable, it might be better to use it than to burn another API call. Your application logic should make this call based on context.
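A sketch of that decision, layered on streamCompletion; the minimum-chunk threshold is an assumption to tune per use case:

function streamWithPartialRecovery(url, payload, headers, options) {
  var minUsableChunks = options.minUsableChunks || 20;
  return streamCompletion(url, payload, headers, options).catch(function (error) {
    if (error.isPartialFailure && error.partialData.length >= minUsableChunks) {
      // Enough content arrived to be useful; return it instead of re-spending tokens
      console.log("[Stream] Using partial output (" + error.partialData.length + " chunks)");
      return error.partialData;
    }
    // Too little partial data; retry the whole request once
    console.log("[Stream] Partial output unusable, retrying full request");
    return streamCompletion(url, payload, headers, options);
  });
}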

Timeout Strategies for Long-Running Completions

Different LLM operations have wildly different latency profiles. A short classification task might complete in 500ms. A long-form generation with a large model could take 90 seconds. Setting a single timeout for all operations is a mistake.

var TIMEOUT_PROFILES = {
  classification: { timeout: 10000, description: "Short classification or extraction" },
  summarization: { timeout: 30000, description: "Text summarization" },
  generation: { timeout: 60000, description: "Content generation" },
  longGeneration: { timeout: 120000, description: "Long-form content or complex reasoning" },
  embedding: { timeout: 15000, description: "Embedding generation" }
};

function getTimeoutForTask(taskType, estimatedTokens) {
  var profile = TIMEOUT_PROFILES[taskType] || TIMEOUT_PROFILES.generation;
  var baseTimeout = profile.timeout;

  // Scale timeout based on expected output tokens
  // Roughly 50 tokens/second for large models
  if (estimatedTokens) {
    var estimatedDuration = (estimatedTokens / 50) * 1000;
    // Use the larger of the base timeout or estimated duration + 30% buffer
    return Math.max(baseTimeout, estimatedDuration * 1.3);
  }

  return baseTimeout;
}

A useful pattern is to set an aggressive initial timeout and increase it on retry. If the first attempt times out at 30 seconds, the second attempt might use 60 seconds. This handles the case where the model genuinely needs more time versus the case where something is broken.
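A sketch of that escalation, built on the retry helper from earlier; the doubling factor is an assumption to tune, and callLlm is a hypothetical provider call:

function retryWithEscalatingTimeout(makeRequest, options) {
  var baseTimeout = options.baseTimeout || 30000;
  var attemptNumber = 0;

  return retryWithBackoff(function () {
    // Double the timeout on each attempt: 30s, then 60s, then 120s...
    var timeout = baseTimeout * Math.pow(2, attemptNumber);
    attemptNumber++;
    return makeRequest(timeout);
  }, {
    maxRetries: options.maxRetries || 2,
    baseDelay: options.baseDelay || 1000,
    maxDelay: options.maxDelay || 60000
  });
}

// Usage: makeRequest receives the timeout budget for that specific attempt
retryWithEscalatingTimeout(function (timeout) {
  return callLlm(prompt, "gpt-4o", { timeout: timeout });
}, { baseTimeout: 30000, maxRetries: 2 });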

Graceful Degradation Strategies

When your primary LLM provider fails and retries are exhausted, you need a plan B. Graceful degradation means your application continues to function — maybe with reduced capability — rather than returning a 500 error to your users.

Fallback Model Chains

var FALLBACK_CHAIN = [
  { provider: "openai", model: "gpt-4o", priority: 1 },
  { provider: "anthropic", model: "claude-3-5-sonnet-20241022", priority: 2 },
  { provider: "openai", model: "gpt-4o-mini", priority: 3 },
  { provider: "openai", model: "gpt-3.5-turbo", priority: 4 }
];

function executeFallbackChain(prompt, options) {
  var chain = FALLBACK_CHAIN.slice(); // Copy the chain
  var lastError = null;

  function tryNext() {
    if (chain.length === 0) {
      return Promise.reject(lastError || new Error("All fallback models exhausted"));
    }

    var target = chain.shift();
    console.log("[Fallback] Trying " + target.provider + "/" + target.model);

    // callProvider is a stand-in for your provider-specific request function
    // (the complete client at the end of this article implements one)
    return callProvider(target.provider, target.model, prompt, options)
      .catch(function (error) {
        lastError = error;
        console.log("[Fallback] " + target.provider + "/" + target.model +
          " failed: " + error.message);
        return tryNext();
      });
  }

  return tryNext();
}

Cached Response Fallback

For requests that are not unique — think classification, common questions, or template-based generation — a cache can serve as your final fallback. Stale data is often better than no data.

var NodeCache = require("node-cache");
var crypto = require("crypto");

var responseCache = new NodeCache({ stdTTL: 3600, maxKeys: 10000 });

function getCacheKey(prompt, model) {
  var hash = crypto.createHash("sha256")
    .update(model + ":" + JSON.stringify(prompt))
    .digest("hex")
    .substring(0, 16);
  return "llm:" + hash;
}

function cachedLlmCall(prompt, model, options) {
  var cacheKey = getCacheKey(prompt, model);
  var cached = responseCache.get(cacheKey);

  // callLlm is a stand-in for your underlying provider call
  return callLlm(prompt, model, options)
    .then(function (response) {
      // Cache successful responses
      responseCache.set(cacheKey, {
        response: response,
        timestamp: Date.now(),
        model: model
      });
      return response;
    })
    .catch(function (error) {
      if (cached) {
        console.log("[Cache] Serving stale response from " +
          new Date(cached.timestamp).toISOString());
        cached.response._fromCache = true;
        cached.response._cacheAge = Date.now() - cached.timestamp;
        return cached.response;
      }
      throw error;
    });
}

Simpler Prompt Fallback

When context length errors hit, degrade by simplifying the prompt. Strip examples, reduce system instructions, or truncate input data.

function degradePrompt(messages, maxTokens) {
  var degraded = JSON.parse(JSON.stringify(messages)); // Deep copy
  var estimatedTokens = estimateTokenCount(degraded);

  // Step 1: Remove few-shot examples
  if (estimatedTokens > maxTokens) {
    degraded = degraded.filter(function (msg) {
      return msg.role !== "assistant" || msg === degraded[degraded.length - 1];
    });
    estimatedTokens = estimateTokenCount(degraded);
  }

  // Step 2: Truncate system prompt
  if (estimatedTokens > maxTokens && degraded[0] && degraded[0].role === "system") {
    var systemMsg = degraded[0].content;
    degraded[0].content = systemMsg.substring(0, Math.floor(systemMsg.length * 0.5));
    estimatedTokens = estimateTokenCount(degraded);
  }

  // Step 3: Truncate user content from the middle
  if (estimatedTokens > maxTokens) {
    var lastUserMsg = null;
    for (var i = degraded.length - 1; i >= 0; i--) {
      if (degraded[i].role === "user") {
        lastUserMsg = degraded[i];
        break;
      }
    }
    if (lastUserMsg) {
      var content = lastUserMsg.content;
      var targetLength = Math.floor(content.length * 0.6);
      var half = Math.floor(targetLength / 2);
      lastUserMsg.content = content.substring(0, half) +
        "\n\n[... content truncated for length ...]\n\n" +
        content.substring(content.length - half);
    }
  }

  return degraded;
}

function estimateTokenCount(messages) {
  var text = messages.map(function (m) { return m.content || ""; }).join(" ");
  // Rough estimate: 1 token ≈ 4 characters for English
  return Math.ceil(text.length / 4);
}

Dead Letter Queues for Failed Requests

When all retry and fallback strategies are exhausted, the request still matters. A dead letter queue captures failed requests for later processing — manual review, batch retry when the service recovers, or analysis to identify patterns.

var fs = require("fs");
var path = require("path");

function DeadLetterQueue(options) {
  this.storePath = options.storePath || "./dlq";
  this.maxAge = options.maxAge || 7 * 24 * 60 * 60 * 1000; // 7 days

  if (!fs.existsSync(this.storePath)) {
    fs.mkdirSync(this.storePath, { recursive: true });
  }
}

DeadLetterQueue.prototype.enqueue = function (request, error, metadata) {
  var entry = {
    id: Date.now() + "-" + Math.random().toString(36).substring(2, 10),
    timestamp: new Date().toISOString(),
    request: {
      provider: request.provider,
      model: request.model,
      messages: request.messages,
      options: request.options
    },
    error: {
      message: error.message,
      status: error.response ? error.response.status : null,
      code: error.code || null
    },
    metadata: metadata || {},
    retryCount: 0
  };

  var filePath = path.join(this.storePath, entry.id + ".json");
  fs.writeFileSync(filePath, JSON.stringify(entry, null, 2));
  console.log("[DLQ] Enqueued failed request: " + entry.id);
  return entry.id;
};

DeadLetterQueue.prototype.processQueue = function (handler) {
  var self = this;
  var files = fs.readdirSync(this.storePath).filter(function (f) {
    return f.endsWith(".json");
  });

  var processed = 0;
  var failed = 0;

  return files.reduce(function (chain, file) {
    return chain.then(function () {
      var filePath = path.join(self.storePath, file);
      var entry = JSON.parse(fs.readFileSync(filePath, "utf8"));

      // Skip entries older than maxAge
      var age = Date.now() - new Date(entry.timestamp).getTime();
      if (age > self.maxAge) {
        fs.unlinkSync(filePath);
        return;
      }

      return handler(entry)
        .then(function () {
          fs.unlinkSync(filePath);
          processed++;
        })
        .catch(function () {
          entry.retryCount++;
          fs.writeFileSync(filePath, JSON.stringify(entry, null, 2));
          failed++;
        });
    });
  }, Promise.resolve()).then(function () {
    console.log("[DLQ] Processed: " + processed + ", Failed: " + failed +
      ", Remaining: " + (files.length - processed));
  });
};
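A sketch of wiring the queue into the failure path and draining it on a timer, assuming the executeFallbackChain helper from the fallback section:

var dlq = new DeadLetterQueue({ storePath: "./dlq" });

function completeOrEnqueue(request, options) {
  return executeFallbackChain(request.messages, options).catch(function (error) {
    dlq.enqueue(request, error, { route: options.route });
    throw error; // still surface the failure to the caller
  });
}

// Replay captured requests every 10 minutes
setInterval(function () {
  dlq.processQueue(function (entry) {
    return executeFallbackChain(entry.request.messages, entry.request.options || {});
  });
}, 10 * 60 * 1000);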

Error Classification and Alerting

Not every error deserves the same alert level. Rate limits at 2 AM are normal. Authentication failures are urgent. A good error classification system prevents alert fatigue.

var ERROR_CLASSIFICATIONS = {
  rate_limit: { severity: "warning", alert: false, metric: "llm.rate_limit" },
  timeout: { severity: "warning", alert: false, metric: "llm.timeout" },
  server_error: { severity: "error", alert: true, metric: "llm.server_error" },
  auth_error: { severity: "critical", alert: true, metric: "llm.auth_error" },
  content_filter: { severity: "info", alert: false, metric: "llm.content_filter" },
  context_length: { severity: "warning", alert: false, metric: "llm.context_length" },
  invalid_request: { severity: "error", alert: true, metric: "llm.invalid_request" },
  network_error: { severity: "error", alert: true, metric: "llm.network_error" },
  unknown: { severity: "error", alert: true, metric: "llm.unknown_error" }
};

function classifyError(error) {
  if (!error.response) {
    if (error.code === "ETIMEDOUT" || error.code === "ESOCKETTIMEDOUT") {
      return "timeout";
    }
    return "network_error";
  }

  var status = error.response.status;
  var body = error.response.data || {};
  var errorType = body.error ? body.error.type : "";
  var errorCode = body.error ? body.error.code : "";

  if (status === 429) return "rate_limit";
  if (status === 401 || status === 403) return "auth_error";
  if (status === 529) return "server_error";
  if (status >= 500) return "server_error";
  if (errorCode === "context_length_exceeded") return "context_length";
  if (errorCode === "content_filter" || errorType === "content_filter") return "content_filter";
  if (status === 400) return "invalid_request";

  return "unknown";
}

function handleClassifiedError(error, requestContext) {
  var classification = classifyError(error);
  var config = ERROR_CLASSIFICATIONS[classification];

  var logEntry = {
    classification: classification,
    severity: config.severity,
    message: error.message,
    provider: requestContext.provider,
    model: requestContext.model,
    timestamp: new Date().toISOString()
  };

  // Always log
  console.log("[LLM Error] " + JSON.stringify(logEntry));

  // Emit metric
  if (typeof emitMetric === "function") {
    emitMetric(config.metric, 1, { provider: requestContext.provider });
  }

  // Alert on critical/error with alert flag
  if (config.alert) {
    sendAlert(logEntry);
  }

  return classification;
}

Idempotency Considerations

LLM calls are inherently non-idempotent — calling the same endpoint with the same prompt twice produces different outputs (unless temperature is 0). This matters for retries because you cannot simply compare outputs to detect duplicate processing. Consider these strategies:

  • Request deduplication: Track in-flight requests by a hash of the prompt and parameters. If a retry fires while the original request is still pending, do not send a duplicate.
  • Idempotency keys: Some providers and API gateways support idempotency keys (typically an Idempotency-Key header) that ensure the same request is not processed twice within a window. Check your provider's documentation before relying on this.
  • At-least-once vs. at-most-once: Decide which is worse for your use case. For content generation, duplicate outputs are usually acceptable. For actions triggered by LLM output (sending emails, updating databases), you need deduplication on the action side.

A minimal in-process version of request deduplication keeps a map of in-flight promises keyed by a hash of the prompt and parameters:

var inFlightRequests = {};

function deduplicatedCall(key, fn) {
  if (inFlightRequests[key]) {
    console.log("[Dedup] Reusing in-flight request: " + key);
    return inFlightRequests[key];
  }

  var promise = fn().finally(function () {
    delete inFlightRequests[key];
  });

  inFlightRequests[key] = promise;
  return promise;
}

Unified Multi-Provider Error Handling

When you integrate multiple LLM providers, each one has its own error format, status codes, and quirks. A unified error wrapper normalizes these differences so your application code does not need to handle provider-specific cases.

function LlmApiError(message, options) {
  Error.call(this, message);
  this.name = "LlmApiError";
  this.message = message;
  this.provider = options.provider;
  this.model = options.model;
  this.statusCode = options.statusCode || null;
  this.errorType = options.errorType || "unknown";
  this.retryable = options.retryable || false;
  this.retryAfterMs = options.retryAfterMs || null;
  this.originalError = options.originalError || null;
}

LlmApiError.prototype = Object.create(Error.prototype);
LlmApiError.prototype.constructor = LlmApiError;

function normalizeOpenAIError(error) {
  var status = error.response ? error.response.status : null;
  var body = error.response ? error.response.data : {};
  var errorInfo = body.error || {};

  return new LlmApiError(errorInfo.message || error.message, {
    provider: "openai",
    model: error._model || "unknown",
    statusCode: status,
    errorType: errorInfo.type || "unknown",
    retryable: [429, 500, 502, 503, 504].indexOf(status) !== -1,
    retryAfterMs: getRetryAfterMs(error),
    originalError: error
  });
}

function normalizeAnthropicError(error) {
  var status = error.response ? error.response.status : null;
  var body = error.response ? error.response.data : {};
  var errorInfo = body.error || {};

  return new LlmApiError(errorInfo.message || error.message, {
    provider: "anthropic",
    model: error._model || "unknown",
    statusCode: status,
    errorType: errorInfo.type || "unknown",
    retryable: [429, 500, 502, 503, 529].indexOf(status) !== -1,
    retryAfterMs: getRetryAfterMs(error),
    originalError: error
  });
}
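A small dispatcher keeps the rest of the codebase provider-agnostic; providers beyond these two would need their own normalizers:

function normalizeError(provider, error) {
  if (provider === "openai") return normalizeOpenAIError(error);
  if (provider === "anthropic") return normalizeAnthropicError(error);
  // Unknown provider: wrap with conservative defaults
  return new LlmApiError(error.message, {
    provider: provider,
    model: error._model || "unknown",
    statusCode: error.response ? error.response.status : null,
    errorType: "unknown",
    retryable: isRetryableError(error),
    retryAfterMs: getRetryAfterMs(error),
    originalError: error
  });
}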

Complete Working Example: Resilient LLM Client

Here is the full client wrapper that combines retries, circuit breakers, fallback chains, caching, dead letter queues, and unified error handling into a single cohesive module.

var axios = require("axios");
var crypto = require("crypto");
var EventEmitter = require("events");

// ─── ResilientLlmClient ────────────────────────────────────────

function ResilientLlmClient(config) {
  EventEmitter.call(this);

  this.providers = config.providers || {};
  this.fallbackChain = config.fallbackChain || [];
  this.maxRetries = config.maxRetries || 3;
  this.baseDelay = config.baseDelay || 1000;
  this.maxDelay = config.maxDelay || 60000;
  this.enableCache = config.enableCache !== false;
  this.cacheTTL = config.cacheTTL || 3600;

  // One circuit breaker per provider+model
  this.breakers = {};
  this.cache = {};
  this.cacheTimestamps = {};
  this.stats = {
    requests: 0,
    successes: 0,
    retries: 0,
    fallbacks: 0,
    cacheHits: 0,
    failures: 0
  };
}

ResilientLlmClient.prototype = Object.create(EventEmitter.prototype);
ResilientLlmClient.prototype.constructor = ResilientLlmClient;

// ─── Circuit Breaker Management ─────────────────────────────────

ResilientLlmClient.prototype.getBreaker = function (provider, model) {
  var key = provider + ":" + model;
  if (!this.breakers[key]) {
    this.breakers[key] = new CircuitBreaker({
      failureThreshold: 5,
      resetTimeout: 60000,
      monitorWindow: 30000
    });
  }
  return this.breakers[key];
};

// ─── Cache Management ───────────────────────────────────────────

ResilientLlmClient.prototype.getCacheKey = function (messages, model, options) {
  var payload = JSON.stringify({ messages: messages, model: model, temperature: options.temperature });
  return crypto.createHash("sha256").update(payload).digest("hex").substring(0, 20);
};

ResilientLlmClient.prototype.getFromCache = function (key) {
  var entry = this.cache[key];
  if (!entry) return null;
  var age = Date.now() - this.cacheTimestamps[key];
  if (age > this.cacheTTL * 1000) {
    delete this.cache[key];
    delete this.cacheTimestamps[key];
    return null;
  }
  return entry;
};

ResilientLlmClient.prototype.setCache = function (key, value) {
  this.cache[key] = value;
  this.cacheTimestamps[key] = Date.now();
};

// ─── Provider Dispatch ──────────────────────────────────────────

ResilientLlmClient.prototype.callProvider = function (provider, model, messages, options) {
  var providerConfig = this.providers[provider];
  if (!providerConfig) {
    return Promise.reject(new Error("Unknown provider: " + provider));
  }

  var timeout = options.timeout || 60000;

  if (provider === "openai") {
    return axios.post("https://api.openai.com/v1/chat/completions", {
      model: model,
      messages: messages,
      temperature: options.temperature || 0.7,
      max_tokens: options.maxTokens || 4096
    }, {
      headers: {
        "Authorization": "Bearer " + providerConfig.apiKey,
        "Content-Type": "application/json"
      },
      timeout: timeout
    }).then(function (res) {
      return {
        content: res.data.choices[0].message.content,
        model: model,
        provider: provider,
        usage: res.data.usage,
        _fromCache: false,
        _fromFallback: false
      };
    });
  }

  if (provider === "anthropic") {
    return axios.post("https://api.anthropic.com/v1/messages", {
      model: model,
      max_tokens: options.maxTokens || 4096,
      messages: messages.filter(function (m) { return m.role !== "system"; }),
      system: (messages.find(function (m) { return m.role === "system"; }) || {}).content || ""
    }, {
      headers: {
        "x-api-key": providerConfig.apiKey,
        "Content-Type": "application/json",
        "anthropic-version": "2023-06-01"
      },
      timeout: timeout
    }).then(function (res) {
      return {
        content: res.data.content[0].text,
        model: model,
        provider: provider,
        usage: res.data.usage,
        _fromCache: false,
        _fromFallback: false
      };
    });
  }

  return Promise.reject(new Error("Unsupported provider: " + provider));
};

// ─── Core Request Method ────────────────────────────────────────

ResilientLlmClient.prototype.complete = function (messages, options) {
  var self = this;
  options = options || {};
  var chain = options.fallbackChain || this.fallbackChain;

  if (chain.length === 0) {
    return Promise.reject(new Error("No models configured in fallback chain"));
  }

  self.stats.requests++;

  // Check cache first
  if (self.enableCache && options.temperature === 0) {
    var cacheKey = self.getCacheKey(messages, chain[0].model, options);
    var cached = self.getFromCache(cacheKey);
    if (cached) {
      self.stats.cacheHits++;
      cached._fromCache = true;
      self.emit("cacheHit", { model: chain[0].model });
      return Promise.resolve(cached);
    }
  }

  var chainIndex = 0;
  var lastError = null;

  function tryNextModel() {
    if (chainIndex >= chain.length) {
      self.stats.failures++;
      self.emit("exhausted", { error: lastError });
      return Promise.reject(lastError || new Error("All models in fallback chain failed"));
    }

    var target = chain[chainIndex];
    chainIndex++;

    var breaker = self.getBreaker(target.provider, target.model);

    if (!breaker.canExecute()) {
      console.log("[Client] Circuit open for " + target.provider + "/" + target.model + " — skipping");
      if (chainIndex > 1) self.stats.fallbacks++;
      return tryNextModel();
    }

    return retryWithBackoff(
      function () {
        return breaker.execute(function () {
          return self.callProvider(target.provider, target.model, messages, options);
        });
      },
      {
        maxRetries: self.maxRetries,
        baseDelay: self.baseDelay,
        maxDelay: self.maxDelay
      }
    ).then(function (result) {
      if (chainIndex > 1) {
        result._fromFallback = true;
        self.stats.fallbacks++;
      }
      self.stats.successes++;

      // Cache deterministic responses
      if (self.enableCache && options.temperature === 0) {
        var key = self.getCacheKey(messages, target.model, options);
        self.setCache(key, result);
      }

      self.emit("success", {
        provider: target.provider,
        model: target.model,
        fromFallback: result._fromFallback
      });

      return result;
    }).catch(function (error) {
      lastError = error;
      console.log("[Client] " + target.provider + "/" + target.model +
        " exhausted retries: " + error.message);
      self.emit("modelFailed", {
        provider: target.provider,
        model: target.model,
        error: error.message
      });
      return tryNextModel();
    });
  }

  return tryNextModel();
};

// ─── Stats ──────────────────────────────────────────────────────

ResilientLlmClient.prototype.getStats = function () {
  return JSON.parse(JSON.stringify(this.stats));
};

ResilientLlmClient.prototype.resetStats = function () {
  this.stats = {
    requests: 0, successes: 0, retries: 0,
    fallbacks: 0, cacheHits: 0, failures: 0
  };
};

// ─── Usage ──────────────────────────────────────────────────────

/*
var client = new ResilientLlmClient({
  providers: {
    openai: { apiKey: process.env.OPENAI_API_KEY },
    anthropic: { apiKey: process.env.ANTHROPIC_API_KEY }
  },
  fallbackChain: [
    { provider: "openai", model: "gpt-4o" },
    { provider: "anthropic", model: "claude-3-5-sonnet-20241022" },
    { provider: "openai", model: "gpt-4o-mini" }
  ],
  maxRetries: 3,
  enableCache: true,
  cacheTTL: 1800
});

client.on("success", function (info) {
  console.log("Completed with " + info.provider + "/" + info.model);
});

client.on("modelFailed", function (info) {
  console.log("Model failed: " + info.provider + "/" + info.model);
});

client.complete([
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Explain circuit breakers in distributed systems." }
], {
  temperature: 0.7,
  maxTokens: 2048,
  timeout: 30000
}).then(function (result) {
  console.log("Response:", result.content);
  console.log("Provider:", result.provider);
  console.log("From fallback:", result._fromFallback);
  console.log("Stats:", client.getStats());
}).catch(function (error) {
  console.error("All models failed:", error.message);
});
*/

module.exports = {
  ResilientLlmClient: ResilientLlmClient,
  CircuitBreaker: CircuitBreaker,
  retryWithBackoff: retryWithBackoff,
  classifyError: classifyError,
  isRetryableError: isRetryableError
};

Common Issues and Troubleshooting

1. "Rate limit reached" but traffic is low

Error: Rate limit reached for gpt-4 in organization org-xxx on tokens per min (TPM):
Limit 40000, Used 38500, Requested 4000.

Rate limits apply to both requests per minute and tokens per minute. A single request with a large prompt can consume most of your token budget. Check the x-ratelimit-remaining-tokens response header on successful requests to monitor your token budget. The fix is often to reduce prompt size rather than reduce request frequency.

2. Retry storm after circuit breaker reset

When a circuit breaker transitions from OPEN to HALF_OPEN, all queued requests try to fire simultaneously, which can immediately re-trip the breaker. Fix this by allowing only a single probe request in HALF_OPEN state and queuing others until the probe succeeds.

[CircuitBreaker] State changed to HALF_OPEN — testing service
[Retry] Attempt 1/3 after 1200ms (status: 429)
[Retry] Attempt 1/3 after 980ms (status: 429)
[Retry] Attempt 1/3 after 1100ms (status: 429)
[CircuitBreaker] State changed to OPEN — failing fast for 60000ms

Add a semaphore to the HALF_OPEN state that limits concurrent probe requests to one or two.
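A sketch of that change to the canExecute method of the CircuitBreaker above; recordSuccess and recordFailure would also need to reset probeInFlight when the probe resolves:

CircuitBreaker.prototype.canExecute = function () {
  if (this.state === "CLOSED") {
    return true;
  }
  if (this.state === "OPEN") {
    var elapsed = Date.now() - this.lastFailureTime;
    if (elapsed >= this.resetTimeout) {
      this.state = "HALF_OPEN";
      this.successCount = 0;
      this.probeInFlight = true; // this request becomes the single probe
      console.log("[CircuitBreaker] State changed to HALF_OPEN, sending one probe");
      return true;
    }
    return false;
  }
  // HALF_OPEN: block additional requests while the probe is in flight
  if (this.probeInFlight) {
    return false;
  }
  this.probeInFlight = true;
  return true;
};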

3. Streaming connection drops silently

Error: socket hang up
  at createHangUpError (_http_client.js:322:15)
  code: 'ECONNRESET'

Long-running streaming connections can be terminated by proxies, load balancers, or the provider's infrastructure without a proper close frame. Implement a heartbeat check: if no SSE data arrives within a reasonable interval (e.g., 15 seconds for text generation), assume the connection is dead and reconnect. Track the last received chunk timestamp and set an inactivity timer.
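A sketch of that watchdog; touch() would be called from the res.on("data") handler in streamCompletion and stop() on normal stream end, and the 15-second threshold is an assumption:

function attachInactivityWatchdog(req, onStall, thresholdMs) {
  var lastChunkAt = Date.now();
  var timer = setInterval(function () {
    if (Date.now() - lastChunkAt > thresholdMs) {
      clearInterval(timer);
      req.destroy(); // forces the error path so normal retry logic takes over
      onStall();
    }
  }, 1000);

  return {
    touch: function () { lastChunkAt = Date.now(); },
    stop: function () { clearInterval(timer); }
  };
}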

4. Context length exceeded on retry with different model

Error: This model's maximum context length is 8192 tokens.
However, your messages resulted in 12500 tokens.

When your fallback chain includes models with different context windows, a prompt that fits in GPT-4's 128k window may not fit in GPT-3.5's 16k window. Before calling a fallback model, check the prompt size against that model's context limit and apply prompt degradation (truncation, summarization) if needed.
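A sketch of that pre-flight check; the context window sizes are illustrative and should be confirmed against provider documentation:

var CONTEXT_WINDOWS = {
  "gpt-4o": 128000,
  "gpt-4o-mini": 128000,
  "gpt-3.5-turbo": 16385,
  "claude-3-5-sonnet-20241022": 200000
};

function fitsContextWindow(messages, model, reservedOutputTokens) {
  var limit = CONTEXT_WINDOWS[model] || 8192; // conservative default for unknown models
  var budget = limit - (reservedOutputTokens || 4096);
  return estimateTokenCount(messages) <= budget;
}

// Inside the fallback loop: degrade the prompt before calling a smaller-context model
if (!fitsContextWindow(messages, target.model, options.maxTokens)) {
  messages = degradePrompt(messages,
    (CONTEXT_WINDOWS[target.model] || 8192) - (options.maxTokens || 4096));
}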

5. Authentication errors masquerading as rate limits

Error: You exceeded your current quota. Please check your plan and billing details.

This returns a 429 status code but it is not a transient rate limit — it means your account has no remaining credits. Your retry logic will dutifully retry 3 times, wasting 30+ seconds before failing. Inspect the error message body, not just the status code. If the message mentions "quota" or "billing," classify it as an auth error and fail immediately.
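A small guard in front of the retry decision handles this case; the string match is a heuristic, since there is no dedicated status code for exhausted quota:

function isQuotaExhausted(error) {
  if (!error.response || error.response.status !== 429) return false;
  var body = error.response.data || {};
  var message = (body.error && body.error.message) || "";
  return /quota|billing/i.test(message);
}

// Stricter variant of the earlier isRetryableError that never retries exhausted quota
function isRetryableErrorStrict(error) {
  if (isQuotaExhausted(error)) return false;
  return isRetryableError(error);
}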

Best Practices

  • Always respect Retry-After headers. The server knows its own capacity better than your backoff algorithm. Use the server's suggested delay as a floor, not a ceiling.

  • Separate circuit breakers per model endpoint. A GPT-4 outage should not prevent GPT-3.5-turbo requests. Use the provider-plus-model combination as your breaker key.

  • Set timeouts proportional to expected output size. A 10-token classification needs 5 seconds. A 4000-token generation needs 60 seconds. Use task profiles, not one-size-fits-all timeouts.

  • Log every retry and fallback event with context. Include the provider, model, error classification, retry count, and elapsed time. When something goes wrong at 3 AM, these logs are the only way to reconstruct what happened.

  • Never retry 400-level errors other than 429. Invalid requests, authentication failures, and content filter rejections will never succeed on retry. Retrying them wastes time and money.

  • Cache deterministic responses aggressively. When temperature is 0 and the prompt is identical, the output is functionally deterministic. Cache it. A single cache hit saves both latency and API cost.

  • Monitor your fallback hit rate. If 30% of your requests are falling back to secondary models, your primary model's capacity is insufficient or its circuit breaker thresholds are too aggressive. Track this metric and alert when it exceeds your threshold.

  • Test your failure paths under load. Inject artificial 429s and 500s in staging. Verify that your circuit breakers trip correctly, fallbacks engage, and the dead letter queue captures what it should. Do not discover these bugs in production.

  • Use idempotency keys when the provider supports them. Timeout-induced retries can cause duplicate processing. An idempotency key ensures the provider deduplicates on their end.

  • Build kill switches for expensive models. When costs spike, you want the ability to instantly route all traffic to cheaper models without deploying code. A feature flag that skips expensive models in the fallback chain is essential.
