Failover Strategies for LLM API Dependencies
Build LLM API failover with provider switching, circuit breakers, health checks, and graceful degradation in Node.js.
Overview
If your production system depends on a single LLM provider, you have a single point of failure that will eventually take your application down. LLM APIs are external dependencies with their own outage patterns, rate limits, and degradation modes that differ fundamentally from traditional REST APIs. This article covers how to build a resilient LLM integration layer in Node.js with provider failover, circuit breakers, health checking, and graceful degradation so your application keeps serving users even when your primary AI provider goes dark.
Prerequisites
- Node.js v18 or later
- Working familiarity with LLM provider APIs such as Anthropic and OpenAI
- Understanding of HTTP clients and async patterns in Node.js
- Basic knowledge of circuit breaker and health check patterns
- Experience running Node.js applications in production
Why LLM APIs Are a Critical Dependency
Most teams treat LLM API calls the same way they treat any other HTTP dependency. That is a mistake. When your database goes down, you know exactly what breaks. When your LLM provider goes down, the blast radius is harder to predict because LLM calls are often embedded deep in business logic: content generation, classification, summarization, chat, moderation, extraction.
The problem is compounded by the nature of LLM providers. These are massive, GPU-intensive services running inference across millions of concurrent requests. They experience failure modes that traditional APIs do not: model-specific outages (GPT-4 is down but GPT-3.5 works), capacity-driven rate limiting that hits without warning, regional routing issues, and performance degradation where the API technically responds but takes 30 seconds instead of 3.
If your application has no failover strategy, a single provider outage means your users stare at spinners, get error pages, or worse, lose data mid-workflow.
Types of LLM API Failures
Before building failover, you need to understand what you are failing over from. LLM APIs fail in distinct ways:
Complete outages. The API returns 500, 502, or 503 errors for all requests. These are the easiest to detect and handle. Every major provider has had multi-hour outages.
Partial degradation. The API responds to some requests but not others. You might see intermittent 500 errors, or certain models become unavailable while others keep working. This is the hardest failure mode to detect because naive health checks pass.
Rate limiting. You hit 429 errors because you have exceeded your requests-per-minute or tokens-per-minute quota. This is especially dangerous during traffic spikes and often cascades: your retry logic hammers the API harder, making the rate limiting worse.
Regional issues. A provider's US-East region is degraded but US-West is fine. If your application is pinned to a single region, you are exposed to regional failures you cannot control.
Latency degradation. The API technically works but response times balloon from 2 seconds to 45 seconds. Your health checks pass, your error rates look fine, but your users are having a terrible experience. This is a silent killer.
Model-specific failures. The provider's flagship model goes down but smaller models remain available. OpenAI has had incidents where GPT-4 was unavailable while GPT-3.5-turbo continued to work.
Token or billing issues. Your API key hits a spending limit, your payment fails, or the provider suspends your account. These are operational failures, not infrastructure failures, but they look the same to your application.
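These failure modes call for different responses: a 429 usually means back off or switch providers immediately, a 5xx means fail over, and a 401 or 402 is an operational problem that no amount of retrying will fix. As a rough sketch (the mapping below is an assumption to adapt to each provider's documented error semantics, not a definitive taxonomy), a classifier for axios-style errors might look like this:
// Sketch: classify an axios-style error so failover logic can decide
// whether to switch providers, retry later, or page a human.
function classifyFailure(error) {
  if (error.code === "ECONNABORTED" || error.code === "ETIMEDOUT") {
    return { type: "latency", failover: true };
  }
  if (!error.response) {
    return { type: "network", failover: true };
  }
  var status = error.response.status;
  if (status === 429) return { type: "rate_limit", failover: true };
  if (status === 401 || status === 402 || status === 403) {
    return { type: "billing_or_auth", failover: true, alertOps: true };
  }
  if (status >= 500) return { type: "outage", failover: true };
  return { type: "client_error", failover: false }; // other 4xx: fix the request, do not retry
}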
Implementing Provider Failover
The core idea is simple: maintain a priority-ordered list of LLM providers and route requests to the highest-priority available provider. Here is the basic structure:
var axios = require("axios");
var providers = [
{
name: "anthropic",
baseUrl: "https://api.anthropic.com/v1/messages",
apiKey: process.env.ANTHROPIC_API_KEY,
model: "claude-sonnet-4-20250514",
priority: 1
},
{
name: "openai",
baseUrl: "https://api.openai.com/v1/chat/completions",
apiKey: process.env.OPENAI_API_KEY,
model: "gpt-4o",
priority: 2
},
{
name: "local",
baseUrl: "http://localhost:11434/api/generate",
apiKey: null,
model: "llama3",
priority: 3
}
];
function callProvider(provider, prompt) {
if (provider.name === "anthropic") {
return callAnthropic(provider, prompt);
} else if (provider.name === "openai") {
return callOpenAI(provider, prompt);
} else if (provider.name === "local") {
return callOllama(provider, prompt);
}
throw new Error("Unknown provider: " + provider.name);
}
function callWithFailover(prompt) {
var sorted = providers.slice().sort(function (a, b) {
return a.priority - b.priority;
});
return tryNextProvider(sorted, 0, prompt, []);
}
function tryNextProvider(sorted, index, prompt, errors) {
if (index >= sorted.length) {
var err = new Error("All LLM providers failed");
err.providerErrors = errors;
return Promise.reject(err);
}
var provider = sorted[index];
return callProvider(provider, prompt).catch(function (error) {
errors.push({ provider: provider.name, error: error.message });
console.error("[Failover] " + provider.name + " failed: " + error.message);
return tryNextProvider(sorted, index + 1, prompt, errors);
});
}
This sequential failover is the foundation. When Anthropic fails, we try OpenAI. When OpenAI fails, we fall back to a local Ollama instance. Each failure is logged with the provider name and error for debugging.
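The provider-specific helpers referenced above are thin HTTP wrappers. A minimal sketch of what they might look like, reusing the axios require from the snippet above and the same header conventions shown in the complete example later in this article (verify the request shapes against each provider's current docs):
function callAnthropic(provider, prompt) {
  return axios({
    method: "POST",
    url: provider.baseUrl,
    headers: {
      "x-api-key": provider.apiKey,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json"
    },
    data: {
      model: provider.model,
      max_tokens: 1024,
      messages: [{ role: "user", content: prompt }]
    }
  }).then(function (response) {
    return response.data.content[0].text;
  });
}
function callOpenAI(provider, prompt) {
  return axios({
    method: "POST",
    url: provider.baseUrl,
    headers: {
      "Authorization": "Bearer " + provider.apiKey,
      "Content-Type": "application/json"
    },
    data: {
      model: provider.model,
      max_tokens: 1024,
      messages: [{ role: "user", content: prompt }]
    }
  }).then(function (response) {
    return response.data.choices[0].message.content;
  });
}
function callOllama(provider, prompt) {
  return axios({
    method: "POST",
    url: provider.baseUrl,
    data: { model: provider.model, prompt: prompt, stream: false }
  }).then(function (response) {
    return response.data.response;
  });
}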
Health Checking LLM Endpoints
You do not want to discover that a provider is down when a user request arrives. Proactive health checking lets you skip known-bad providers immediately:
var providerHealth = {};
function checkProviderHealth(provider) {
var startTime = Date.now();
var testPrompt = "Respond with the word OK.";
return callProvider(provider, testPrompt)
.then(function (response) {
var latency = Date.now() - startTime;
providerHealth[provider.name] = {
healthy: true,
latency: latency,
lastCheck: Date.now(),
consecutiveFailures: 0
};
return true;
})
.catch(function (error) {
var existing = providerHealth[provider.name] || { consecutiveFailures: 0 };
providerHealth[provider.name] = {
healthy: false,
latency: null,
lastCheck: Date.now(),
consecutiveFailures: existing.consecutiveFailures + 1,
lastError: error.message
};
return false;
});
}
function startHealthChecks(intervalMs) {
setInterval(function () {
providers.forEach(function (provider) {
checkProviderHealth(provider);
});
}, intervalMs || 30000);
// Run immediately on startup
providers.forEach(function (provider) {
checkProviderHealth(provider);
});
}
Run health checks at least every 30 seconds. The lightweight prompt ("Respond with the word OK.") keeps costs negligible while verifying that authentication, networking, and the model itself are all working. Track latency in the health state so you can make routing decisions based on response time, not just availability.
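For example, a small helper built on the providerHealth map above (a sketch, not the only ranking strategy) can use that latency data to try the fastest healthy provider first instead of always following the static priority order:
// Sketch: order providers by measured health-check latency, unhealthy ones last.
function rankProvidersByLatency() {
  function score(p) {
    var h = providerHealth[p.name];
    if (!h || !h.healthy) return Number.MAX_SAFE_INTEGER; // unknown or unhealthy: try last
    return typeof h.latency === "number" ? h.latency : Number.MAX_SAFE_INTEGER - 1;
  }
  return providers.slice().sort(function (a, b) {
    return score(a) - score(b);
  });
}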
Circuit Breaker Pattern for LLM Providers
A circuit breaker prevents your application from hammering a failing provider. When failures exceed a threshold, the circuit opens and all requests skip that provider until a cooldown period expires:
function CircuitBreaker(options) {
this.failureThreshold = options.failureThreshold || 5;
this.cooldownMs = options.cooldownMs || 60000;
this.state = "closed"; // closed, open, half-open
this.failures = 0;
this.lastFailureTime = null;
this.successCount = 0;
}
CircuitBreaker.prototype.canExecute = function () {
if (this.state === "closed") {
return true;
}
if (this.state === "open") {
var elapsed = Date.now() - this.lastFailureTime;
if (elapsed >= this.cooldownMs) {
this.state = "half-open";
return true;
}
return false;
}
// half-open: allow trial requests through to test recovery
return true;
};
CircuitBreaker.prototype.recordSuccess = function () {
if (this.state === "half-open") {
this.successCount++;
if (this.successCount >= 2) {
this.state = "closed";
this.failures = 0;
this.successCount = 0;
}
} else {
this.failures = 0;
}
};
CircuitBreaker.prototype.recordFailure = function () {
this.failures++;
this.lastFailureTime = Date.now();
this.successCount = 0;
if (this.failures >= this.failureThreshold) {
this.state = "open";
}
};
// Attach a circuit breaker to each provider
var circuitBreakers = {};
providers.forEach(function (provider) {
circuitBreakers[provider.name] = new CircuitBreaker({
failureThreshold: 3,
cooldownMs: 60000
});
});
The half-open state is critical. After the cooldown expires, trial requests are allowed through. Once two consecutive requests succeed, we close the circuit and resume normal traffic. If a trial request fails, we reopen the circuit and wait out another cooldown. This prevents a thundering herd of requests from hitting a provider that is just coming back online.
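Wiring the breakers into provider selection gives the rest of the failover code a single filter to call. The helper below is a sketch; the latency-based failover example later in this article assumes a function with this name and behavior:
// Sketch: skip providers whose circuit is open, then sort by priority.
function getAvailableProviders() {
  return providers
    .filter(function (p) {
      var breaker = circuitBreakers[p.name];
      return !breaker || breaker.canExecute();
    })
    .sort(function (a, b) {
      return a.priority - b.priority;
    });
}
The same filter can also consult the providerHealth map from the previous section so that providers failing health checks are skipped before a request ever reaches them.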
Prompt Translation Between Providers
Different LLM providers have different API formats, system prompt handling, and behavioral quirks. You cannot just swap the endpoint URL and expect the same results. A prompt translation layer normalizes your application's requests into provider-specific formats:
function translatePrompt(provider, messages, options) {
if (provider.name === "anthropic") {
var systemMsg = "";
var userMessages = [];
messages.forEach(function (msg) {
if (msg.role === "system") {
systemMsg = msg.content;
} else {
userMessages.push({ role: msg.role, content: msg.content });
}
});
return {
model: provider.model,
max_tokens: options.maxTokens || 1024,
system: systemMsg || undefined,
messages: userMessages
};
}
if (provider.name === "openai") {
return {
model: provider.model,
max_tokens: options.maxTokens || 1024,
messages: messages
};
}
if (provider.name === "local") {
var combined = messages.map(function (msg) {
return msg.role + ": " + msg.content;
}).join("\n\n");
return {
model: provider.model,
prompt: combined,
stream: false
};
}
throw new Error("No prompt translator for provider: " + provider.name);
}
function normalizeResponse(provider, rawResponse) {
if (provider.name === "anthropic") {
return {
content: rawResponse.data.content[0].text,
model: rawResponse.data.model,
provider: provider.name,
usage: {
inputTokens: rawResponse.data.usage.input_tokens,
outputTokens: rawResponse.data.usage.output_tokens
}
};
}
if (provider.name === "openai") {
return {
content: rawResponse.data.choices[0].message.content,
model: rawResponse.data.model,
provider: provider.name,
usage: {
inputTokens: rawResponse.data.usage.prompt_tokens,
outputTokens: rawResponse.data.usage.completion_tokens
}
};
}
if (provider.name === "local") {
return {
content: rawResponse.data.response,
model: provider.model,
provider: provider.name,
usage: { inputTokens: 0, outputTokens: 0 }
};
}
throw new Error("No response normalizer for provider: " + provider.name);
}
This translation layer is where most teams cut corners and pay for it later. The differences between providers are not just cosmetic. Anthropic requires the system message as a separate top-level field. OpenAI embeds it in the messages array. Local models via Ollama often expect a single prompt string. Getting this wrong means your failover technically works but produces garbage output.
Model Capability Mapping Across Providers
Not all models are equivalent. When failing over from Claude Sonnet to GPT-4o to a local Llama model, you need to understand what you are giving up:
var modelCapabilities = {
"claude-sonnet-4-20250514": {
maxContext: 200000,
maxOutput: 8192,
supportsVision: true,
supportsJson: true,
supportsTools: true,
qualityTier: "high",
costPer1kInput: 0.003,
costPer1kOutput: 0.015
},
"gpt-4o": {
maxContext: 128000,
maxOutput: 4096,
supportsVision: true,
supportsJson: true,
supportsTools: true,
qualityTier: "high",
costPer1kInput: 0.005,
costPer1kOutput: 0.015
},
"llama3": {
maxContext: 8192,
maxOutput: 2048,
supportsVision: false,
supportsJson: false,
supportsTools: false,
qualityTier: "medium",
costPer1kInput: 0,
costPer1kOutput: 0
}
};
function isProviderCapable(provider, requirements) {
var caps = modelCapabilities[provider.model];
if (!caps) return false;
if (requirements.vision && !caps.supportsVision) return false;
if (requirements.json && !caps.supportsJson) return false;
if (requirements.tools && !caps.supportsTools) return false;
if (requirements.minContext && caps.maxContext < requirements.minContext) return false;
if (requirements.minOutput && caps.maxOutput < requirements.minOutput) return false;
return true;
}
When your primary provider fails and your fallback does not support a required feature like vision or tool use, you need to handle that gracefully rather than sending an incompatible request and getting a confusing error back.
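One way to do that is to filter the provider chain by capability before any request is sent. The sketch below reuses isProviderCapable and the tryNextProvider helper from the failover section; treat it as one possible wiring, not the only one:
// Sketch: only fail over between providers that can actually satisfy the request.
function callWithCapabilities(prompt, requirements) {
  var capable = providers
    .filter(function (p) {
      return isProviderCapable(p, requirements || {});
    })
    .sort(function (a, b) {
      return a.priority - b.priority;
    });
  if (capable.length === 0) {
    return Promise.reject(new Error("No available provider supports the required capabilities"));
  }
  return tryNextProvider(capable, 0, prompt, []);
}
// Example: a tool-use request with a large prompt will never be routed to llama3.
// callWithCapabilities(prompt, { tools: true, minContext: 32000 });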
Latency-Based Failover
Sometimes a provider is not down but is unacceptably slow. Latency-based failover switches to an alternative when response time exceeds a threshold:
function callWithTimeout(provider, prompt, timeoutMs) {
return new Promise(function (resolve, reject) {
var timer = setTimeout(function () {
reject(new Error("Provider " + provider.name + " timed out after " + timeoutMs + "ms"));
}, timeoutMs);
callProvider(provider, prompt)
.then(function (result) {
clearTimeout(timer);
resolve(result);
})
.catch(function (error) {
clearTimeout(timer);
reject(error);
});
});
}
function callWithLatencyFailover(prompt, options) {
var timeoutMs = options.timeoutMs || 10000;
var sorted = getAvailableProviders(); // circuit-breaker-aware filter defined in the circuit breaker section
// Check health data for latency-based pre-filtering
var fast = sorted.filter(function (p) {
var health = providerHealth[p.name];
if (!health || !health.healthy) return false;
if (health.latency && health.latency > timeoutMs) return false;
return true;
});
var providersToTry = fast.length > 0 ? fast : sorted;
return tryNextProviderWithTimeout(providersToTry, 0, prompt, timeoutMs, []);
}
function tryNextProviderWithTimeout(providers, index, prompt, timeoutMs, errors) {
if (index >= providers.length) {
var err = new Error("All providers failed or timed out");
err.providerErrors = errors;
return Promise.reject(err);
}
var provider = providers[index];
return callWithTimeout(provider, prompt, timeoutMs).catch(function (error) {
errors.push({ provider: provider.name, error: error.message });
return tryNextProviderWithTimeout(providers, index + 1, prompt, timeoutMs, errors);
});
}
The latency threshold should be tuned to your application's needs. A chatbot might tolerate 15 seconds. A real-time classification endpoint might need a response in under 3 seconds. Set the threshold based on your P95 latency requirements, not your average.
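A single global timeout also works poorly across providers: a cloud API and a CPU-bound local model have very different latency envelopes (see troubleshooting item 4 below). A per-provider timeout map is a simple fix; the values here are illustrative assumptions, not recommendations:
// Sketch: per-provider timeouts, falling back to a global default.
var providerTimeouts = {
  anthropic: 10000, // cloud APIs: tight budget
  openai: 10000,
  local: 120000 // CPU inference can take minutes on long prompts
};
function timeoutFor(provider, defaultMs) {
  return providerTimeouts[provider.name] || defaultMs || 10000;
}
// Inside the failover loop:
// callWithTimeout(provider, prompt, timeoutFor(provider, timeoutMs))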
Graceful Degradation Strategies
When all providers are down, failover has nowhere to go. This is where graceful degradation keeps your application functional:
Cached responses. For common queries, cache previous LLM responses and serve them when all providers fail. This works well for classification, FAQ-style questions, and content that does not change often.
var NodeCache = require("node-cache");
var responseCache = new NodeCache({ stdTTL: 3600 });
function callWithCacheFallback(prompt, cacheKey) {
return callWithFailover(prompt)
.then(function (response) {
responseCache.set(cacheKey, response);
return response;
})
.catch(function (error) {
var cached = responseCache.get(cacheKey);
if (cached) {
console.warn("[Degradation] Serving cached response for: " + cacheKey);
cached._fromCache = true;
return cached;
}
throw error;
});
}
Pre-computed fallbacks. For critical paths like content moderation, maintain a rule-based fallback that runs locally. It will not be as good as the LLM, but it keeps the pipeline moving.
Simpler model fallback. If your primary model (GPT-4o) is down, try a smaller model (GPT-4o-mini) on the same provider before switching providers entirely. Smaller models are often more available during capacity issues.
Queue and retry. For non-real-time workloads, push failed requests into a queue and retry them later. This is ideal for batch processing, email generation, and background tasks.
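A minimal in-memory version of queue-and-retry is sketched below. In production you would back this with a durable queue (Redis, SQS, a database table) so requests survive a restart; the retry interval and attempt cap are arbitrary assumptions:
// Sketch: park failed requests in memory and retry them on a timer.
var retryQueue = [];
function callOrQueue(prompt) {
  return callWithFailover(prompt).catch(function () {
    return new Promise(function (resolve, reject) {
      retryQueue.push({ prompt: prompt, resolve: resolve, reject: reject, attempts: 0 });
    });
  });
}
setInterval(function () {
  var pending = retryQueue.splice(0, retryQueue.length);
  pending.forEach(function (item) {
    callWithFailover(item.prompt)
      .then(item.resolve)
      .catch(function (error) {
        item.attempts++;
        if (item.attempts >= 5) {
          item.reject(error); // give up after five rounds
        } else {
          retryQueue.push(item); // try again on the next tick
        }
      });
  });
}, 60000);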
Cost Implications of Failover
Failover is not free. Different providers have different pricing, and failing over from a cheaper provider to a more expensive one can blow your budget during extended outages:
function estimateCost(provider, inputTokens, outputTokens) {
var caps = modelCapabilities[provider.model];
if (!caps) return 0;
var inputCost = (inputTokens / 1000) * caps.costPer1kInput;
var outputCost = (outputTokens / 1000) * caps.costPer1kOutput;
return inputCost + outputCost;
}
function FailoverCostTracker() {
this.costs = {};
this.alerts = [];
this.hourlyBudget = parseFloat(process.env.LLM_HOURLY_BUDGET) || 50;
}
FailoverCostTracker.prototype.record = function (provider, usage) {
var hour = new Date().toISOString().slice(0, 13);
if (!this.costs[hour]) this.costs[hour] = 0;
var cost = estimateCost(provider, usage.inputTokens, usage.outputTokens);
this.costs[hour] += cost;
if (this.costs[hour] > this.hourlyBudget) {
this.alerts.push({
time: new Date().toISOString(),
hourlyCost: this.costs[hour],
budget: this.hourlyBudget,
message: "Hourly LLM cost exceeded budget during failover"
});
}
return cost;
};
During an outage of your primary provider, your failover might route all traffic to a provider that costs 3x more. Set up cost alerts specifically for failover scenarios so you are not surprised by a $500 bill from an afternoon outage.
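Hooking the tracker into the call path is a small addition. The sketch below assumes your call path returns the normalized response shape from the translation layer (provider name plus a usage object):
var costTracker = new FailoverCostTracker();
// Sketch: record the estimated cost of every successful response.
function recordResponseCost(response) {
  var provider = providers.find(function (p) { return p.name === response.provider; });
  if (!provider || !response.usage) return 0;
  return costTracker.record(provider, response.usage);
}
Call recordResponseCost in the .then handler of every LLM request so failover traffic shows up in the same hourly totals as normal traffic.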
Testing Failover with Chaos Engineering
You cannot trust failover that has never been tested. Chaos engineering for LLM dependencies means deliberately injecting failures:
var chaosEnabled = process.env.CHAOS_TESTING === "true";
var chaosConfig = {
failureRate: 0.3, // 30% of requests fail
latencyMs: 15000, // add 15s latency
targetProvider: "anthropic"
};
function maybeChaos(provider) {
if (!chaosEnabled) return Promise.resolve();
if (provider.name !== chaosConfig.targetProvider) return Promise.resolve();
if (Math.random() < chaosConfig.failureRate) {
return Promise.reject(new Error("[Chaos] Injected failure for " + provider.name));
}
return new Promise(function (resolve) {
setTimeout(resolve, chaosConfig.latencyMs);
});
}
// Wrap your provider calls:
function callProviderWithChaos(provider, prompt) {
return maybeChaos(provider).then(function () {
return callProvider(provider, prompt);
});
}
Run chaos tests in staging with realistic traffic patterns. Test specific scenarios: primary provider returns 429 for 10 minutes, primary provider has 50% error rate, primary and secondary providers both fail, primary provider responds but with 20-second latency. Each scenario exercises a different part of your failover logic.
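One way to make those scenarios repeatable is to encode each one as a named configuration and select it with an environment variable. A sketch; the scenario names and values are assumptions:
// Sketch: named chaos scenarios selected via CHAOS_SCENARIO.
var chaosScenarios = {
  "primary-rate-limited": { providers: ["anthropic"], failureRate: 1.0, latencyMs: 0 },
  "primary-flaky": { providers: ["anthropic"], failureRate: 0.5, latencyMs: 0 },
  "primary-slow": { providers: ["anthropic"], failureRate: 0, latencyMs: 20000 },
  "two-providers-down": { providers: ["anthropic", "openai"], failureRate: 1.0, latencyMs: 0 }
};
function maybeChaosScenario(provider) {
  var scenario = chaosScenarios[process.env.CHAOS_SCENARIO];
  if (!scenario || scenario.providers.indexOf(provider.name) === -1) {
    return Promise.resolve();
  }
  if (Math.random() < scenario.failureRate) {
    return Promise.reject(new Error("[Chaos] " + process.env.CHAOS_SCENARIO + " failure for " + provider.name));
  }
  return new Promise(function (resolve) {
    setTimeout(resolve, scenario.latencyMs);
  });
}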
Monitoring Failover Events
Every failover event should be logged, alerted on, and tracked:
function FailoverLogger() {
this.events = [];
}
FailoverLogger.prototype.logFailover = function (fromProvider, toProvider, reason, latencyMs) {
var event = {
timestamp: new Date().toISOString(),
from: fromProvider,
to: toProvider,
reason: reason,
latencyMs: latencyMs
};
this.events.push(event);
console.warn(
"[Failover Event] " + fromProvider + " -> " + toProvider +
" | Reason: " + reason +
" | Latency: " + latencyMs + "ms"
);
// Send to your monitoring system
if (typeof sendMetric === "function") {
sendMetric("llm.failover", 1, {
from: fromProvider,
to: toProvider,
reason: reason
});
}
return event;
};
FailoverLogger.prototype.getStats = function (windowMs) {
var cutoff = Date.now() - (windowMs || 3600000);
var recent = this.events.filter(function (e) {
return new Date(e.timestamp).getTime() > cutoff;
});
var byProvider = {};
recent.forEach(function (e) {
if (!byProvider[e.from]) byProvider[e.from] = 0;
byProvider[e.from]++;
});
return {
totalFailovers: recent.length,
byProvider: byProvider,
windowMs: windowMs || 3600000
};
};
Key metrics to track: failover rate (failovers per hour), failover duration (how long before the primary recovers), success rate per provider, latency delta (how much slower is the failover provider), and cost delta (how much more expensive is the failover path). Set alerts when your failover rate exceeds a threshold -- frequent failovers indicate a systemic issue, not just a transient glitch.
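The failover-rate alert can be driven directly from getStats. A sketch; the threshold and check interval are arbitrary, and sendAlert stands in for whatever paging integration you use:
var failoverLogger = new FailoverLogger();
// Sketch: alert when more than 10 failovers happened in the last hour.
setInterval(function () {
  var stats = failoverLogger.getStats(3600000);
  if (stats.totalFailovers > 10) {
    console.error("[Alert] " + stats.totalFailovers + " failovers in the last hour", stats.byProvider);
    // sendAlert("llm-failover-rate", stats); // hypothetical paging hook
  }
}, 300000);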
Multi-Region Deployment for LLM-Dependent Services
If your application runs in a single region and your LLM provider has a regional outage, failing over to the same provider from another region rarely helps, because most LLM APIs expose a single global endpoint. The resilience has to come from your side. Multi-region deployment for LLM-dependent services means:
- Deploy your application in multiple regions.
- Use DNS-based routing to direct users to the nearest healthy region.
- Each region maintains its own failover chain with provider priority potentially varying by region (use the provider with the lowest latency from that region as primary).
- If an LLM provider has a regional routing issue, your application in another region may be unaffected.
var regionConfig = {
"us-east": {
providers: ["anthropic", "openai", "local"],
latencyThresholdMs: 8000
},
"eu-west": {
providers: ["openai", "anthropic", "local"],
latencyThresholdMs: 10000
},
"ap-southeast": {
providers: ["anthropic", "openai", "local"],
latencyThresholdMs: 12000
}
};
function getRegionalProviders(region) {
var config = regionConfig[region] || regionConfig["us-east"];
return config.providers.map(function (name, index) {
var provider = providers.find(function (p) { return p.name === name; });
return Object.assign({}, provider, { priority: index + 1 });
});
}
Complete Working Example
Here is a full LLM failover manager that ties together health checking, circuit breakers, prompt translation, provider switching, and failover event logging:
var axios = require("axios");
var EventEmitter = require("events");
// ============================================================
// LLM Failover Manager
// ============================================================
function LLMFailoverManager(options) {
EventEmitter.call(this);
this.providers = options.providers || [];
this.healthCheckIntervalMs = options.healthCheckIntervalMs || 30000;
this.defaultTimeoutMs = options.defaultTimeoutMs || 15000;
this.circuitBreakerThreshold = options.circuitBreakerThreshold || 3;
this.circuitBreakerCooldownMs = options.circuitBreakerCooldownMs || 60000;
this.health = {};
this.circuits = {};
this.failoverLog = [];
this.healthCheckTimer = null;
var self = this;
this.providers.forEach(function (p) {
self.health[p.name] = { healthy: true, latency: null, lastCheck: 0 };
self.circuits[p.name] = {
state: "closed",
failures: 0,
lastFailure: null,
successes: 0
};
});
}
LLMFailoverManager.prototype = Object.create(EventEmitter.prototype);
LLMFailoverManager.prototype.constructor = LLMFailoverManager;
// --- Circuit Breaker Logic ---
LLMFailoverManager.prototype.isCircuitOpen = function (providerName) {
var circuit = this.circuits[providerName];
if (circuit.state === "closed") return false;
if (circuit.state === "open") {
var elapsed = Date.now() - circuit.lastFailure;
if (elapsed >= this.circuitBreakerCooldownMs) {
circuit.state = "half-open";
return false;
}
return true;
}
return false; // half-open: allow trial requests through
};
LLMFailoverManager.prototype.recordSuccess = function (providerName) {
var circuit = this.circuits[providerName];
if (circuit.state === "half-open") {
circuit.successes++;
if (circuit.successes >= 2) {
circuit.state = "closed";
circuit.failures = 0;
circuit.successes = 0;
this.emit("circuit-closed", providerName);
}
} else {
circuit.failures = 0;
}
};
LLMFailoverManager.prototype.recordFailure = function (providerName) {
var circuit = this.circuits[providerName];
circuit.failures++;
circuit.lastFailure = Date.now();
circuit.successes = 0;
if (circuit.failures >= this.circuitBreakerThreshold) {
circuit.state = "open";
this.emit("circuit-open", providerName);
}
};
// --- Health Checking ---
LLMFailoverManager.prototype.startHealthChecks = function () {
var self = this;
this.healthCheckTimer = setInterval(function () {
self.runHealthChecks();
}, this.healthCheckIntervalMs);
this.runHealthChecks();
};
LLMFailoverManager.prototype.stopHealthChecks = function () {
if (this.healthCheckTimer) {
clearInterval(this.healthCheckTimer);
this.healthCheckTimer = null;
}
};
LLMFailoverManager.prototype.runHealthChecks = function () {
var self = this;
this.providers.forEach(function (provider) {
var startTime = Date.now();
var messages = [{ role: "user", content: "Respond with OK." }];
self.callSingleProvider(provider, messages, { maxTokens: 16 })
.then(function () {
self.health[provider.name] = {
healthy: true,
latency: Date.now() - startTime,
lastCheck: Date.now()
};
})
.catch(function (err) {
self.health[provider.name] = {
healthy: false,
latency: null,
lastCheck: Date.now(),
error: err.message
};
});
});
};
// --- Prompt Translation ---
LLMFailoverManager.prototype.translateRequest = function (provider, messages, options) {
var maxTokens = options.maxTokens || 1024;
if (provider.name === "anthropic") {
var systemMsg = "";
var userMsgs = [];
messages.forEach(function (m) {
if (m.role === "system") {
systemMsg = m.content;
} else {
userMsgs.push({ role: m.role, content: m.content });
}
});
return {
url: provider.baseUrl,
headers: {
"x-api-key": provider.apiKey,
"anthropic-version": "2023-06-01",
"content-type": "application/json"
},
body: {
model: provider.model,
max_tokens: maxTokens,
system: systemMsg || undefined,
messages: userMsgs
}
};
}
if (provider.name === "openai") {
return {
url: provider.baseUrl,
headers: {
"Authorization": "Bearer " + provider.apiKey,
"Content-Type": "application/json"
},
body: {
model: provider.model,
max_tokens: maxTokens,
messages: messages
}
};
}
if (provider.name === "local") {
var combined = messages.map(function (m) {
return m.role + ": " + m.content;
}).join("\n\n");
return {
url: provider.baseUrl,
headers: { "Content-Type": "application/json" },
body: {
model: provider.model,
prompt: combined,
stream: false
}
};
}
throw new Error("Unsupported provider: " + provider.name);
};
LLMFailoverManager.prototype.normalizeResponse = function (provider, data) {
if (provider.name === "anthropic") {
return {
content: data.content[0].text,
model: data.model,
provider: provider.name,
inputTokens: data.usage.input_tokens,
outputTokens: data.usage.output_tokens
};
}
if (provider.name === "openai") {
return {
content: data.choices[0].message.content,
model: data.model,
provider: provider.name,
inputTokens: data.usage.prompt_tokens,
outputTokens: data.usage.completion_tokens
};
}
if (provider.name === "local") {
return {
content: data.response,
model: provider.model,
provider: provider.name,
inputTokens: 0,
outputTokens: 0
};
}
throw new Error("No response normalizer for provider: " + provider.name);
};
// --- Single Provider Call ---
LLMFailoverManager.prototype.callSingleProvider = function (provider, messages, options) {
var self = this;
var translated = this.translateRequest(provider, messages, options);
return axios({
method: "POST",
url: translated.url,
headers: translated.headers,
data: translated.body,
timeout: options.timeoutMs || self.defaultTimeoutMs
}).then(function (response) {
return self.normalizeResponse(provider, response.data);
});
};
// --- Main Failover Call ---
LLMFailoverManager.prototype.call = function (messages, options) {
var self = this;
options = options || {};
var available = this.providers
.slice()
.sort(function (a, b) { return a.priority - b.priority; })
.filter(function (p) {
if (self.isCircuitOpen(p.name)) return false;
var h = self.health[p.name];
if (h && !h.healthy && (Date.now() - h.lastCheck < 60000)) return false;
return true;
});
if (available.length === 0) {
// All filtered out, try everyone as last resort
available = this.providers.slice().sort(function (a, b) {
return a.priority - b.priority;
});
}
return self._tryProvider(available, 0, messages, options, []);
};
LLMFailoverManager.prototype._tryProvider = function (list, index, messages, options, errors) {
var self = this;
if (index >= list.length) {
var err = new Error("All LLM providers exhausted");
err.providerErrors = errors;
self.emit("all-providers-failed", errors);
return Promise.reject(err);
}
var provider = list[index];
var startTime = Date.now();
return self.callSingleProvider(provider, messages, options)
.then(function (result) {
self.recordSuccess(provider.name);
result.latencyMs = Date.now() - startTime;
if (index > 0) {
var failoverEvent = {
timestamp: new Date().toISOString(),
from: list[0].name,
to: provider.name,
reason: errors.map(function (e) { return e.provider + ": " + e.error; }).join("; "),
latencyMs: result.latencyMs
};
self.failoverLog.push(failoverEvent);
self.emit("failover", failoverEvent);
}
return result;
})
.catch(function (error) {
self.recordFailure(provider.name);
var errorMsg = error.response
? "HTTP " + error.response.status + ": " + JSON.stringify(error.response.data)
: error.message;
errors.push({ provider: provider.name, error: errorMsg });
self.emit("provider-error", { provider: provider.name, error: errorMsg });
return self._tryProvider(list, index + 1, messages, options, errors);
});
};
// --- Stats ---
LLMFailoverManager.prototype.getStats = function () {
var self = this;
return {
health: Object.assign({}, this.health),
circuits: Object.assign({}, this.circuits),
recentFailovers: this.failoverLog.slice(-50),
providerStatus: this.providers.map(function (p) {
return {
name: p.name,
healthy: self.health[p.name] ? self.health[p.name].healthy : "unknown",
circuitState: self.circuits[p.name].state,
latency: self.health[p.name] ? self.health[p.name].latency : null
};
})
};
};
module.exports = LLMFailoverManager;
Usage in an Express application:
var express = require("express");
var LLMFailoverManager = require("./llm-failover-manager");
var app = express();
app.use(express.json());
var llm = new LLMFailoverManager({
providers: [
{
name: "anthropic",
baseUrl: "https://api.anthropic.com/v1/messages",
apiKey: process.env.ANTHROPIC_API_KEY,
model: "claude-sonnet-4-20250514",
priority: 1
},
{
name: "openai",
baseUrl: "https://api.openai.com/v1/chat/completions",
apiKey: process.env.OPENAI_API_KEY,
model: "gpt-4o",
priority: 2
},
{
name: "local",
baseUrl: "http://localhost:11434/api/generate",
apiKey: null,
model: "llama3",
priority: 3
}
],
healthCheckIntervalMs: 30000,
defaultTimeoutMs: 15000,
circuitBreakerThreshold: 3,
circuitBreakerCooldownMs: 60000
});
llm.on("failover", function (event) {
console.warn("[LLM Failover]", JSON.stringify(event));
});
llm.on("circuit-open", function (provider) {
console.error("[LLM Circuit Open] " + provider);
});
llm.on("all-providers-failed", function (errors) {
console.error("[LLM] All providers failed:", JSON.stringify(errors));
});
llm.startHealthChecks();
app.post("/api/chat", function (req, res) {
var messages = req.body.messages || [];
llm.call(messages, { maxTokens: 1024, timeoutMs: 12000 })
.then(function (result) {
res.json({
response: result.content,
provider: result.provider,
model: result.model,
latencyMs: result.latencyMs
});
})
.catch(function (error) {
res.status(503).json({
error: "All LLM providers are currently unavailable",
details: error.providerErrors || error.message
});
});
});
app.get("/api/llm-status", function (req, res) {
res.json(llm.getStats());
});
app.listen(process.env.PORT || 3000, function () {
console.log("Server running with LLM failover enabled");
});
Common Issues and Troubleshooting
1. Circuit breaker opens too aggressively during transient errors
[LLM Circuit Open] anthropic
Error: All LLM providers exhausted
If a brief network hiccup triggers 3 failures in quick succession, the circuit opens and stays open for the full cooldown period. Fix this by increasing the failure threshold to 5 and adding a time window -- only count failures within the last 60 seconds rather than cumulatively.
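A sketch of that windowed variant: record failure timestamps instead of a bare counter and only count the ones inside the window when deciding whether to open the circuit. canExecute and recordSuccess can stay the same as in the earlier CircuitBreaker.
// Sketch: only failures within the last windowMs count toward opening the circuit.
function WindowedCircuitBreaker(options) {
  this.failureThreshold = options.failureThreshold || 5;
  this.windowMs = options.windowMs || 60000;
  this.cooldownMs = options.cooldownMs || 60000;
  this.state = "closed";
  this.failureTimes = [];
  this.lastFailureTime = null;
}
WindowedCircuitBreaker.prototype.recordFailure = function () {
  var now = Date.now();
  this.lastFailureTime = now;
  this.failureTimes.push(now);
  var cutoff = now - this.windowMs;
  this.failureTimes = this.failureTimes.filter(function (t) { return t > cutoff; });
  if (this.failureTimes.length >= this.failureThreshold) {
    this.state = "open";
  }
};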
2. Prompt translation produces different output quality across providers
Expected JSON output, got: "Sure! Here is the JSON you requested:\n```json\n{\"key\": \"value\"}\n```"
Different models interpret the same prompt differently. OpenAI models tend to wrap JSON in markdown code blocks while Anthropic models return raw JSON more reliably. Add provider-specific prompt adjustments in your translation layer, such as appending "Return only raw JSON with no markdown formatting" for OpenAI.
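It also helps to tolerate the markdown wrapper at parse time, regardless of which provider answered. A small sketch:
// Sketch: strip a markdown code fence (with or without a json tag) before parsing.
function extractJson(text) {
  var cleaned = text.trim();
  var fenced = cleaned.match(/```(?:json)?\s*([\s\S]*?)\s*```/);
  if (fenced) {
    cleaned = fenced[1];
  }
  return JSON.parse(cleaned); // still throws if the model returned non-JSON prose
}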
3. Health checks pass but real requests fail with rate limits
HTTP 429: {"error":{"type":"rate_limit_error","message":"Number of request tokens has exceeded your per-minute rate limit"}}
Your health check uses a tiny prompt that consumes very few tokens, so it passes under the rate limit. But real requests with large prompts blow through the token-per-minute quota. Track your token usage per provider and proactively route away from providers approaching their rate limits before you start getting 429s.
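A sketch of that proactive tracking: keep a rolling per-minute token count per provider and treat a provider as near-limit once it approaches its quota. The limits below are placeholders; use the values from your own account's rate-limit headers or dashboard:
// Sketch: rolling tokens-per-minute counter per provider.
var tokenLimits = { anthropic: 80000, openai: 150000 }; // placeholder TPM quotas
var tokenUsage = {}; // provider name -> array of { time, tokens }
function recordTokens(providerName, tokens) {
  if (!tokenUsage[providerName]) tokenUsage[providerName] = [];
  tokenUsage[providerName].push({ time: Date.now(), tokens: tokens });
}
function tokensUsedLastMinute(providerName) {
  var cutoff = Date.now() - 60000;
  var entries = (tokenUsage[providerName] || []).filter(function (e) { return e.time > cutoff; });
  tokenUsage[providerName] = entries;
  return entries.reduce(function (sum, e) { return sum + e.tokens; }, 0);
}
function isNearRateLimit(providerName) {
  var limit = tokenLimits[providerName];
  if (!limit) return false;
  return tokensUsedLastMinute(providerName) > limit * 0.8; // leave 20% headroom
}
Add isNearRateLimit to your provider filter so traffic shifts away before the 429s start.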
4. Failover to local model times out on long prompts
Error: Provider local timed out after 15000ms
Local models running on CPU are dramatically slower than cloud APIs. A prompt that takes 3 seconds on Claude might take 90 seconds on a local Llama model. Set provider-specific timeouts rather than a global default. For local models, either increase the timeout significantly or truncate the prompt to fit within the local model's practical response window.
5. Memory leak from unbounded failover log
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
If you push every failover event into an array and never trim it, your process will eventually run out of memory during extended outages with high traffic. Cap the log at a fixed size (e.g., 1000 events) and rotate old entries out, or flush events to an external logging system and keep only a short window in memory.
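A sketch of that cap, using the same arbitrary 1000-event limit mentioned above:
// Sketch: keep only the most recent events in memory.
var MAX_FAILOVER_EVENTS = 1000;
function pushCapped(events, event) {
  events.push(event);
  if (events.length > MAX_FAILOVER_EVENTS) {
    events.splice(0, events.length - MAX_FAILOVER_EVENTS); // drop the oldest entries
  }
}
// In FailoverLogger.prototype.logFailover, replace this.events.push(event) with:
// pushCapped(this.events, event);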
Best Practices
Always have at least three levels of fallback. Primary cloud provider, secondary cloud provider, and a local or cached fallback. Two is not enough because simultaneous outages across providers happen more often than you would expect.
Test failover regularly, not just when it breaks. Schedule monthly chaos tests that disable your primary provider in staging and verify the entire failover chain works. Failover code that has not run in six months is failover code that does not work.
Set provider-specific timeouts, not global ones. A 10-second timeout makes sense for a cloud API but is unrealistic for a local model. Configure timeouts per provider based on observed P95 latency for that provider.
Log every failover event with full context. Include which provider failed, why it failed, which provider handled the request, how long it took, and the request's cost. This data is invaluable for post-incident analysis and for tuning your failover thresholds.
Monitor cost during failover events. Set up budget alerts that fire when your hourly or daily LLM spend exceeds a threshold. Failover to a more expensive provider during a multi-hour outage can generate surprising bills.
Cache aggressively for deterministic queries. If you are classifying the same types of content or answering the same FAQ-style questions repeatedly, cache the LLM responses. During a total outage, cached responses keep your application functional.
Use capability checks before routing. Do not send a vision request to a text-only model or a 100k-token prompt to a model with an 8k context window. Check model capabilities before attempting a provider call to avoid wasting time on requests that will fail.
Implement backpressure when all providers are degraded. If every provider is slow or rate-limited, shedding load is better than queuing unbounded requests. Return 503 with a Retry-After header so clients can back off gracefully instead of hammering your failover chain.
Keep your prompt translation layer well-tested. Unit test the translation for every provider with representative prompts. When a provider updates their API format, you want to catch the breakage in CI, not in production at 2 AM during a failover event.
References
- Anthropic API Documentation
- OpenAI API Reference
- Ollama API Documentation
- Circuit Breaker Pattern - Martin Fowler
- Node.js axios HTTP Client
- Release It! by Michael Nygard -- the definitive guide to stability patterns including circuit breakers and bulkheads