Multi-Model Architectures: Router Patterns

Architecture patterns for multi-model LLM systems including routers, cascading, ensembles, and failover chains in Node.js.

Calling a single LLM endpoint for every request is the fast path to a bloated API bill and mediocre results. Production systems that handle diverse workloads need a router layer that selects the right model for each task based on complexity, capability, cost, and latency constraints. This article covers the architecture patterns, trade-offs, and working Node.js implementations for building multi-model systems that are both cost-efficient and high-quality.

Prerequisites

  • Node.js v18+ installed
  • Working familiarity with the OpenAI and Anthropic REST APIs
  • Basic understanding of prompt engineering and token economics
  • An API key for at least two LLM providers (examples use Anthropic and OpenAI)
  • Understanding of Express.js middleware patterns

Why Use Multiple Models

The single-model approach fails in production for three reasons that compound over time.

Cost. Sending a trivial "extract the email address from this text" request to Claude Opus or GPT-4o costs 10-50x more than routing it to Haiku or GPT-4o-mini. At scale, this is the difference between a $500/month API bill and a $15,000 one. Most production workloads follow a power law: 70-80% of requests are simple enough for a small model, 15-20% need a mid-tier model, and only 5-10% truly require the flagship.
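
To make the power-law math concrete, here is a quick back-of-the-envelope sketch. The per-1k-token prices match the pricing table later in this article; the ~1,000 input / ~500 output tokens per request and the monthly volume are assumed averages, not measurements.

// Rough blended-cost comparison for 100,000 requests/month.
// Prices are $ per 1k tokens; token counts are assumed averages.
function perRequestCost(inPer1k, outPer1k) {
  return (1000 / 1000) * inPer1k + (500 / 1000) * outPer1k;
}

var opus = perRequestCost(0.015, 0.075);       // ~$0.0525 per request
var sonnet = perRequestCost(0.003, 0.015);     // ~$0.0105 per request
var haiku = perRequestCost(0.00025, 0.00125);  // ~$0.000875 per request

var allFlagship = 100000 * opus;
var routed = 100000 * (0.75 * haiku + 0.20 * sonnet + 0.05 * opus);

console.log(allFlagship.toFixed(0));  // ~5250
console.log(routed.toFixed(0));       // ~538 -- roughly a 10x reduction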

Capability. No single model dominates every task. Code generation models outperform general-purpose models on programming tasks. Some models handle structured output more reliably. Others excel at creative writing or multi-turn reasoning. A router lets you match the task to the model's strengths.

Latency. A chat interface needs sub-second time-to-first-token. A batch summarization pipeline can tolerate 30 seconds per request. Routing real-time requests to smaller, faster models while sending batch work to larger, slower models keeps your user experience responsive without sacrificing quality where it matters.

The Model Router Pattern

The core architecture has three stages: classify the incoming request, select a model based on that classification and current system state, then execute against the chosen provider.

Request → Classifier → Model Selector → Provider Interface → Response
                           ↑                    ↑
                      Policy Engine         Health Monitor
                    (cost, latency,        (circuit breakers,
                     capability rules)      rate limits)

The classifier examines the request and produces metadata: estimated complexity, task type, required capabilities, and urgency. The model selector applies routing policies against that metadata and the current state of each provider. The provider interface abstracts away the differences between API formats.

This separation of concerns is critical. The classifier does not need to know which models exist. The selector does not need to understand how to call each provider. Each layer changes independently.
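
A minimal sketch of how the three stages compose, assuming classify, select, and execute functions shaped like the ones developed in the rest of this article:

function createRouter(classify, select, execute) {
  return function route(request, callback) {
    var metadata = classify(request);           // stage 1: complexity, task type, urgency
    var decision = select(metadata, request);   // stage 2: policy + provider health -> model
    execute(decision, request, callback);       // stage 3: provider call, failover, metrics
  };
}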

Complexity-Based Routing

The most impactful routing strategy is complexity classification. You are sorting requests into tiers and sending each tier to a model of appropriate power.

var COMPLEXITY_TIERS = {
  simple: {
    models: ['claude-haiku', 'gpt-4o-mini'],
    maxTokens: 500,
    description: 'Extraction, classification, simple Q&A'
  },
  medium: {
    models: ['claude-sonnet', 'gpt-4o'],
    maxTokens: 2000,
    description: 'Summarization, analysis, multi-step reasoning'
  },
  complex: {
    models: ['claude-opus', 'gpt-4o'],
    maxTokens: 4000,
    description: 'Code generation, creative writing, complex reasoning'
  }
};

function classifyComplexity(request) {
  var prompt = request.prompt || '';
  var wordCount = prompt.split(/\s+/).length;
  var hasCode = /```|function\s|class\s|def\s|import\s/.test(prompt);
  var hasMultiStep = /step\s*\d|first.*then|compare.*contrast/i.test(prompt);
  var hasAnalysis = /analyze|evaluate|critique|review|assess/i.test(prompt);
  var hasCreative = /write.*story|compose|draft.*essay|create.*content/i.test(prompt);

  var score = 0;

  // Token length signals complexity
  if (wordCount > 500) score += 2;
  else if (wordCount > 150) score += 1;

  // Task type signals
  if (hasCode) score += 2;
  if (hasMultiStep) score += 2;
  if (hasAnalysis) score += 1;
  if (hasCreative) score += 2;

  // System message complexity
  if (request.systemPrompt && request.systemPrompt.length > 500) score += 1;

  // Conversation depth
  if (request.messages && request.messages.length > 6) score += 1;

  if (score >= 4) return 'complex';
  if (score >= 2) return 'medium';
  return 'simple';
}

This heuristic classifier is fast and deterministic. It does not require an LLM call. For more nuanced classification, you can use the cheapest available model as the classifier itself, but that adds latency and cost to every request. Start with heuristics, measure accuracy, and only add an LLM classifier when the heuristics provably fail on your actual traffic.
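
If the heuristics do fall short, one possible shape for an LLM-backed classifier is sketched below. It assumes the provider interface defined later in this article and falls back to the heuristic classifier on any error or unexpected answer; the model name follows the mapping table in the complete example.

function llmClassifyComplexity(provider, request, callback) {
  var classifierPrompt =
    'Classify the following request as simple, medium, or complex. ' +
    'Reply with exactly one word.\n\n' + (request.prompt || '');

  provider.complete('claude-haiku-3.5', { prompt: classifierPrompt, maxTokens: 5 },
    function(err, response) {
      if (err) return callback(null, classifyComplexity(request));

      var label = (response.text || '').trim().toLowerCase();
      if (label !== 'simple' && label !== 'medium' && label !== 'complex') {
        label = classifyComplexity(request);  // fall back to heuristics
      }
      callback(null, label);
    });
}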

Capability-Based Routing

Some tasks have hard requirements that only certain models meet. Code generation benefits from models trained on code. Structured JSON output works better with models that support native JSON mode. Vision tasks require multimodal models.

var CAPABILITY_MAP = {
  'code-generation': {
    required: ['code'],
    preferred: ['claude-sonnet', 'gpt-4o'],
    fallback: ['claude-haiku']
  },
  'json-extraction': {
    required: ['structured-output'],
    preferred: ['gpt-4o-mini', 'claude-haiku'],
    fallback: ['gpt-4o']
  },
  'image-analysis': {
    required: ['vision'],
    preferred: ['claude-sonnet', 'gpt-4o'],
    fallback: []
  },
  'creative-writing': {
    required: [],
    preferred: ['claude-opus', 'gpt-4o'],
    fallback: ['claude-sonnet']
  },
  'translation': {
    required: [],
    preferred: ['gpt-4o', 'claude-sonnet'],
    fallback: ['gpt-4o-mini']
  }
};

var MODEL_CAPABILITIES = {
  'claude-opus': ['code', 'structured-output', 'vision', 'long-context'],
  'claude-sonnet': ['code', 'structured-output', 'vision', 'long-context'],
  'claude-haiku': ['code', 'structured-output', 'vision'],
  'gpt-4o': ['code', 'structured-output', 'vision', 'function-calling'],
  'gpt-4o-mini': ['code', 'structured-output', 'function-calling']
};

function selectByCapability(taskType, availableModels) {
  var mapping = CAPABILITY_MAP[taskType];
  if (!mapping) return availableModels[0];

  // Filter to models that have all required capabilities
  var eligible = availableModels.filter(function(model) {
    var caps = MODEL_CAPABILITIES[model] || [];
    return mapping.required.every(function(req) {
      return caps.indexOf(req) !== -1;
    });
  });

  if (eligible.length === 0) {
    throw new Error('No available model supports required capabilities: ' + mapping.required.join(', '));
  }

  // Prefer models in the preferred list, in order
  for (var i = 0; i < mapping.preferred.length; i++) {
    if (eligible.indexOf(mapping.preferred[i]) !== -1) {
      return mapping.preferred[i];
    }
  }

  return eligible[0];
}

The key insight is separating "required" capabilities from "preferred" models. A required capability is a hard constraint: if no model supports it, the request fails. Preferred models are soft rankings within the eligible set.
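
For example, given the maps above:

// Picks the first preferred model within the eligible set
selectByCapability('json-extraction', ['claude-sonnet', 'gpt-4o-mini']);
// -> 'gpt-4o-mini'

// Vision is a required capability; with no vision-capable model available, this throws
selectByCapability('image-analysis', ['gpt-4o-mini']);
// -> Error: No available model supports required capabilities: vision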

Cost-Aware Routing with Budget Constraints

Real systems operate under budget constraints. You need to track spend per model and shift traffic to cheaper models when approaching limits.

var MODEL_PRICING = {
  'claude-opus': { inputPer1k: 0.015, outputPer1k: 0.075 },
  'claude-sonnet': { inputPer1k: 0.003, outputPer1k: 0.015 },
  'claude-haiku': { inputPer1k: 0.00025, outputPer1k: 0.00125 },
  'gpt-4o': { inputPer1k: 0.0025, outputPer1k: 0.01 },
  'gpt-4o-mini': { inputPer1k: 0.00015, outputPer1k: 0.0006 }
};

function CostTracker(dailyBudget) {
  this.dailyBudget = dailyBudget;
  this.spending = {};
  this.totalToday = 0;
  this.resetDate = new Date().toISOString().split('T')[0];
}

CostTracker.prototype.checkBudget = function() {
  var today = new Date().toISOString().split('T')[0];
  if (today !== this.resetDate) {
    this.spending = {};
    this.totalToday = 0;
    this.resetDate = today;
  }
  return this.totalToday;
};

CostTracker.prototype.estimateCost = function(model, inputTokens, estimatedOutputTokens) {
  var pricing = MODEL_PRICING[model];
  if (!pricing) return 0;
  return (inputTokens / 1000) * pricing.inputPer1k +
         (estimatedOutputTokens / 1000) * pricing.outputPer1k;
};

CostTracker.prototype.recordUsage = function(model, inputTokens, outputTokens) {
  var cost = this.estimateCost(model, inputTokens, outputTokens);
  if (!this.spending[model]) this.spending[model] = 0;
  this.spending[model] += cost;
  this.totalToday += cost;
  return cost;
};

CostTracker.prototype.getBudgetRatio = function() {
  this.checkBudget();
  return this.totalToday / this.dailyBudget;
};

CostTracker.prototype.shouldDowngrade = function() {
  var ratio = this.getBudgetRatio();
  if (ratio > 0.9) return 'aggressive';  // Only use cheapest models
  if (ratio > 0.7) return 'moderate';     // Avoid flagship models
  return 'none';
};

When the budget ratio crosses thresholds, the router automatically shifts traffic to cheaper models. This prevents surprise bills while gracefully degrading rather than hard-failing.
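
Usage looks like this (the $50 daily budget and token counts are just examples):

var tracker = new CostTracker(50.00);

// Record a completed call: 1,200 input tokens, 800 output tokens on Sonnet
var cost = tracker.recordUsage('claude-sonnet', 1200, 800);

console.log(cost.toFixed(4));            // cost of this single call
console.log(tracker.getBudgetRatio());   // fraction of today's budget consumed
console.log(tracker.shouldDowngrade());  // 'none', 'moderate', or 'aggressive'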

Latency-Based Routing

Different use cases have different latency budgets. A chat interface needs time-to-first-token under 500ms. A background summarization pipeline can wait minutes.

var MODEL_LATENCY_PROFILES = {
  'claude-haiku': { avgTTFT: 200, avgTokensPerSec: 120 },
  'claude-sonnet': { avgTTFT: 400, avgTokensPerSec: 80 },
  'claude-opus': { avgTTFT: 800, avgTokensPerSec: 40 },
  'gpt-4o-mini': { avgTTFT: 180, avgTokensPerSec: 130 },
  'gpt-4o': { avgTTFT: 350, avgTokensPerSec: 70 }
};

function selectByLatency(maxTTFTMs, estimatedOutputTokens, maxTotalMs, candidates) {
  return candidates.filter(function(model) {
    var profile = MODEL_LATENCY_PROFILES[model];
    if (!profile) return false;

    if (profile.avgTTFT > maxTTFTMs) return false;

    var estimatedTotalMs = profile.avgTTFT +
      (estimatedOutputTokens / profile.avgTokensPerSec) * 1000;

    return estimatedTotalMs <= maxTotalMs;
  }).sort(function(a, b) {
    return MODEL_LATENCY_PROFILES[a].avgTTFT - MODEL_LATENCY_PROFILES[b].avgTTFT;
  });
}

The latency profiles should be updated from real measurements in production, not static values. Instrument your provider calls and feed the rolling averages back into the routing table.
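
One way to close that feedback loop, sketched here with an exponential moving average so a single slow response does not swing the profile (the smoothing factor is an arbitrary choice):

function recordLatencyObservation(model, observedTTFTMs, outputTokens, totalMs) {
  var profile = MODEL_LATENCY_PROFILES[model];
  if (!profile) return;

  var alpha = 0.1;  // smoothing factor: higher adapts faster, lower is more stable
  var generationMs = Math.max(totalMs - observedTTFTMs, 1);
  var observedTokensPerSec = (outputTokens / generationMs) * 1000;

  profile.avgTTFT = (1 - alpha) * profile.avgTTFT + alpha * observedTTFTMs;
  profile.avgTokensPerSec =
    (1 - alpha) * profile.avgTokensPerSec + alpha * observedTokensPerSec;
}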

Cascading Patterns

A cascading pattern tries the cheapest model first and escalates to more expensive models only when quality is insufficient. This is the highest-impact pattern for cost optimization.

function CascadingExecutor(providers) {
  this.providers = providers;
}

CascadingExecutor.prototype.execute = function(request, qualityCheck, callback) {
  var cascade = [
    { model: 'claude-haiku', provider: 'anthropic' },
    { model: 'claude-sonnet', provider: 'anthropic' },
    { model: 'claude-opus', provider: 'anthropic' }
  ];

  var attempt = 0;
  var self = this;
  var attempts = [];

  function tryNext() {
    if (attempt >= cascade.length) {
      return callback(new Error('All cascade levels exhausted'), null, attempts);
    }

    var level = cascade[attempt];
    var provider = self.providers[level.provider];
    var startTime = Date.now();

    provider.complete(level.model, request, function(err, response) {
      var elapsed = Date.now() - startTime;

      attempts.push({
        model: level.model,
        elapsed: elapsed,
        success: !err,
        escalated: false
      });

      if (err) {
        attempt++;
        return tryNext();
      }

      // Run quality check on the response
      var quality = qualityCheck(request, response);

      if (quality.acceptable) {
        return callback(null, response, attempts);
      }

      // Mark this attempt as escalated due to quality
      attempts[attempts.length - 1].escalated = true;
      attempts[attempts.length - 1].qualityScore = quality.score;
      attempts[attempts.length - 1].reason = quality.reason;

      attempt++;
      tryNext();
    });
  }

  tryNext();
};

// Quality checker example: confidence-based
function confidenceQualityCheck(request, response) {
  var text = response.text || '';

  // Check for hedging language that indicates low confidence
  var hedgePatterns = [
    /i('m| am) not (sure|certain)/i,
    /i don't (know|have enough)/i,
    /it's (difficult|hard) to say/i,
    /this (might|may|could) (be|not be) (correct|accurate)/i
  ];

  var hedgeCount = hedgePatterns.reduce(function(count, pattern) {
    return count + (pattern.test(text) ? 1 : 0);
  }, 0);

  // Check for refusals or empty responses
  var isRefusal = /i (can't|cannot) (help|assist|provide)/i.test(text);
  var isTooShort = text.length < 50 && request.expectedMinLength > 100;

  var score = 1.0;
  score -= hedgeCount * 0.15;
  if (isRefusal) score -= 0.5;
  if (isTooShort) score -= 0.3;

  return {
    acceptable: score >= 0.6,
    score: score,
    reason: score < 0.6 ? 'Low confidence or insufficient response' : 'OK'
  };
}

The quality check function is the critical design decision. It must be fast (you are adding latency with each cascade level), deterministic (no LLM calls for checking), and aligned with your actual quality requirements. Hedging detection, response length, and structured output validation are good starting points.
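
As a concrete example of a structured-output check, here is a sketch that validates JSON and required fields; the requiredFields property on the request is an assumption about how your application would pass that information in.

function jsonQualityCheck(request, response) {
  var text = response.text || '';
  try {
    var parsed = JSON.parse(text);
    var required = request.requiredFields || [];  // assumed application-level hint
    var missing = required.filter(function(field) {
      return !(field in parsed);
    });
    return {
      acceptable: missing.length === 0,
      score: missing.length === 0 ? 1.0 : 0.3,
      reason: missing.length === 0 ? 'OK' : 'Missing fields: ' + missing.join(', ')
    };
  } catch (e) {
    return { acceptable: false, score: 0, reason: 'Response is not valid JSON' };
  }
}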

Ensemble Patterns

When correctness is critical and cost is secondary, an ensemble pattern sends the same request to multiple models and aggregates the results.

function EnsembleExecutor(providers) {
  this.providers = providers;
}

EnsembleExecutor.prototype.execute = function(request, models, strategy, callback) {
  var self = this;
  var results = [];
  var completed = 0;
  var errors = [];

  models.forEach(function(modelConfig) {
    var provider = self.providers[modelConfig.provider];

    provider.complete(modelConfig.model, request, function(err, response) {
      completed++;

      if (err) {
        errors.push({ model: modelConfig.model, error: err });
      } else {
        results.push({
          model: modelConfig.model,
          response: response,
          weight: modelConfig.weight || 1
        });
      }

      if (completed === models.length) {
        if (results.length === 0) {
          return callback(new Error('All ensemble models failed'), null);
        }
        var aggregated = strategy(results);
        callback(null, aggregated);
      }
    });
  });
};

// Majority vote strategy for classification tasks
function majorityVote(results) {
  var votes = {};

  results.forEach(function(result) {
    var answer = result.response.text.trim().toLowerCase();
    if (!votes[answer]) votes[answer] = 0;
    votes[answer] += result.weight;
  });

  var winner = Object.keys(votes).sort(function(a, b) {
    return votes[b] - votes[a];
  })[0];

  return {
    text: winner,
    confidence: votes[winner] / results.length,
    allVotes: votes
  };
}

// Best-of-N strategy: return the longest/most detailed response
function bestOfN(results) {
  results.sort(function(a, b) {
    return b.response.text.length - a.response.text.length;
  });
  return results[0].response;
}

Ensembles are expensive by definition: a three-model ensemble pays for three completions on every request. Use them for high-stakes decisions where a wrong answer costs far more than the extra API spend: content moderation, medical information, financial calculations, or safety-critical classifications.
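
A usage sketch for a moderation-style classification, assuming the AnthropicProvider and OpenAIProvider instances defined later in this article and the API model names from the mapping table in the complete example:

var ensemble = new EnsembleExecutor({ anthropic: anthropic, openai: openai });

ensemble.execute(
  { prompt: 'Classify this message as spam or not-spam. Reply with one word.', maxTokens: 5 },
  [
    { model: 'claude-haiku-3.5', provider: 'anthropic', weight: 1 },
    { model: 'claude-sonnet-4', provider: 'anthropic', weight: 1 },
    { model: 'gpt-4o-mini', provider: 'openai', weight: 1 }
  ],
  majorityVote,
  function(err, result) {
    if (err) return console.error(err);
    console.log(result.text, result.confidence, result.allVotes);
  }
);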

Provider Failover

Provider outages happen. Rate limits get hit. A production system needs automatic failover across providers.

function ProviderRegistry() {
  this.providers = {};
  this.health = {};
  this.circuitBreakers = {};
}

ProviderRegistry.prototype.register = function(name, provider, config) {
  this.providers[name] = provider;
  this.health[name] = {
    healthy: true,
    lastCheck: Date.now(),
    consecutiveFailures: 0,
    totalRequests: 0,
    totalFailures: 0
  };
  this.circuitBreakers[name] = {
    state: 'closed',  // closed = healthy, open = failing, half-open = testing
    openedAt: null,
    cooldownMs: (config && config.cooldownMs) || 30000,
    failureThreshold: (config && config.failureThreshold) || 5
  };
};

ProviderRegistry.prototype.isAvailable = function(name) {
  var cb = this.circuitBreakers[name];
  if (cb.state === 'closed') return true;
  if (cb.state === 'open') {
    // Check if cooldown has passed
    if (Date.now() - cb.openedAt > cb.cooldownMs) {
      cb.state = 'half-open';
      return true;
    }
    return false;
  }
  // half-open: allow one request through
  return true;
};

ProviderRegistry.prototype.recordSuccess = function(name) {
  var h = this.health[name];
  h.totalRequests++;
  h.consecutiveFailures = 0;
  h.healthy = true;

  var cb = this.circuitBreakers[name];
  if (cb.state === 'half-open') {
    cb.state = 'closed';
  }
};

ProviderRegistry.prototype.recordFailure = function(name) {
  var h = this.health[name];
  h.totalRequests++;
  h.totalFailures++;
  h.consecutiveFailures++;

  var cb = this.circuitBreakers[name];
  if (h.consecutiveFailures >= cb.failureThreshold) {
    cb.state = 'open';
    cb.openedAt = Date.now();
    h.healthy = false;
  }
};

ProviderRegistry.prototype.getAvailableProviders = function() {
  var self = this;
  return Object.keys(this.providers).filter(function(name) {
    return self.isAvailable(name);
  });
};

The circuit breaker pattern prevents cascading failures. When a provider hits its failure threshold, the circuit opens and all requests bypass that provider for a cooldown period. After cooldown, a single test request probes recovery. This avoids hammering a struggling provider with requests that will all fail.
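
A quick walkthrough of the state machine (the provider instance and a failure threshold of 3 are assumed):

var registry = new ProviderRegistry();
registry.register('anthropic', anthropic, { failureThreshold: 3, cooldownMs: 30000 });

registry.recordFailure('anthropic');
registry.recordFailure('anthropic');
registry.recordFailure('anthropic');             // threshold reached: circuit opens

console.log(registry.isAvailable('anthropic'));  // false until the 30s cooldown passes

// After the cooldown, isAvailable() returns true once (half-open). A successful
// request then closes the circuit; another failure re-opens it.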

A/B Testing Across Models

Routing gives you a natural A/B testing framework. Split traffic between models and measure quality, cost, and latency in production.

function ABTestRouter(tests) {
  this.tests = tests;      // { testName: { models: [...], weights: [...], metrics: {} } }
  this.assignments = {};    // userId -> testName -> model
}

ABTestRouter.prototype.assign = function(testName, userId) {
  var key = userId + ':' + testName;
  if (this.assignments[key]) return this.assignments[key];

  var test = this.tests[testName];
  if (!test) return null;

  // Deterministic assignment based on hash
  var hash = simpleHash(key);
  var normalized = (hash % 1000) / 1000;

  var cumulative = 0;
  for (var i = 0; i < test.models.length; i++) {
    cumulative += test.weights[i];
    if (normalized < cumulative) {
      this.assignments[key] = test.models[i];
      return test.models[i];
    }
  }

  this.assignments[key] = test.models[test.models.length - 1];
  return this.assignments[key];
};

ABTestRouter.prototype.recordOutcome = function(testName, model, outcome) {
  var test = this.tests[testName];
  if (!test.metrics[model]) {
    test.metrics[model] = { count: 0, successes: 0, totalLatency: 0, totalCost: 0 };
  }
  var m = test.metrics[model];
  m.count++;
  if (outcome.success) m.successes++;
  m.totalLatency += outcome.latencyMs;
  m.totalCost += outcome.cost;
};

ABTestRouter.prototype.getResults = function(testName) {
  var test = this.tests[testName];
  var results = {};

  Object.keys(test.metrics).forEach(function(model) {
    var m = test.metrics[model];
    results[model] = {
      sampleSize: m.count,
      successRate: m.count > 0 ? m.successes / m.count : 0,
      avgLatencyMs: m.count > 0 ? m.totalLatency / m.count : 0,
      avgCost: m.count > 0 ? m.totalCost / m.count : 0,
      significanceReached: m.count >= 100  // Minimum sample size
    };
  });

  return results;
};

function simpleHash(str) {
  var hash = 0;
  for (var i = 0; i < str.length; i++) {
    var char = str.charCodeAt(i);
    hash = ((hash << 5) - hash) + char;
    hash = hash & hash;  // Convert to 32-bit integer
  }
  return Math.abs(hash);
}

The critical detail is deterministic assignment. A user must always get the same model for the same test, otherwise you are measuring noise. Hash the user ID with the test name to get a stable bucket assignment. Wait for statistical significance (at minimum 100 samples per variant) before drawing conclusions.
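
Wiring it together looks like this; the test name and the 50/50 split are illustrative:

var abRouter = new ABTestRouter({
  'haiku-vs-mini': {
    models: ['claude-haiku', 'gpt-4o-mini'],
    weights: [0.5, 0.5],
    metrics: {}
  }
});

var model = abRouter.assign('haiku-vs-mini', 'user-123');  // stable for this user

// After serving the request with the assigned model:
abRouter.recordOutcome('haiku-vs-mini', model, { success: true, latencyMs: 420, cost: 0.0004 });

console.log(abRouter.getResults('haiku-vs-mini'));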

Implementing a Pluggable Provider Interface

All of the patterns above depend on a clean abstraction over multiple providers. Here is the interface.

var https = require('https');

function AnthropicProvider(apiKey) {
  this.apiKey = apiKey;
  this.name = 'anthropic';
}

AnthropicProvider.prototype.complete = function(model, request, callback) {
  var body = JSON.stringify({
    model: model,
    max_tokens: request.maxTokens || 1024,
    system: request.systemPrompt || '',
    messages: [{ role: 'user', content: request.prompt }]
  });

  var options = {
    hostname: 'api.anthropic.com',
    path: '/v1/messages',
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-api-key': this.apiKey,
      'anthropic-version': '2023-06-01',
      'Content-Length': Buffer.byteLength(body)
    }
  };

  var req = https.request(options, function(res) {
    var data = '';
    res.on('data', function(chunk) { data += chunk; });
    res.on('end', function() {
      try {
        var parsed = JSON.parse(data);
        if (res.statusCode !== 200) {
          return callback(new Error('Anthropic API error: ' + res.statusCode + ' - ' + (parsed.error && parsed.error.message || data)));
        }
        callback(null, {
          text: parsed.content[0].text,
          model: parsed.model,
          inputTokens: parsed.usage.input_tokens,
          outputTokens: parsed.usage.output_tokens,
          provider: 'anthropic'
        });
      } catch (e) {
        callback(e);
      }
    });
  });

  req.on('error', callback);
  req.write(body);
  req.end();
};

function OpenAIProvider(apiKey) {
  this.apiKey = apiKey;
  this.name = 'openai';
}

OpenAIProvider.prototype.complete = function(model, request, callback) {
  var messages = [];
  if (request.systemPrompt) {
    messages.push({ role: 'system', content: request.systemPrompt });
  }
  messages.push({ role: 'user', content: request.prompt });

  var body = JSON.stringify({
    model: model,
    max_tokens: request.maxTokens || 1024,
    messages: messages
  });

  var options = {
    hostname: 'api.openai.com',
    path: '/v1/chat/completions',
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': 'Bearer ' + this.apiKey,
      'Content-Length': Buffer.byteLength(body)
    }
  };

  var req = https.request(options, function(res) {
    var data = '';
    res.on('data', function(chunk) { data += chunk; });
    res.on('end', function() {
      try {
        var parsed = JSON.parse(data);
        if (res.statusCode !== 200) {
          return callback(new Error('OpenAI API error: ' + res.statusCode + ' - ' + (parsed.error && parsed.error.message || data)));
        }
        callback(null, {
          text: parsed.choices[0].message.content,
          model: parsed.model,
          inputTokens: parsed.usage.prompt_tokens,
          outputTokens: parsed.usage.completion_tokens,
          provider: 'openai'
        });
      } catch (e) {
        callback(e);
      }
    });
  });

  req.on('error', callback);
  req.write(body);
  req.end();
};

Both providers implement the same .complete(model, request, callback) interface. The response is normalized to a common shape. This is what makes every pattern above work -- the router does not care whether it is talking to Anthropic, OpenAI, or a local Ollama instance.

Complete Working Example

Here is a full multi-model router that combines complexity classification, cascading execution, provider failover, cost tracking, and a metrics endpoint for performance comparison.

var express = require('express');
var app = express();

app.use(express.json());

// --- Provider Setup ---
var anthropic = new AnthropicProvider(process.env.ANTHROPIC_API_KEY);
var openai = new OpenAIProvider(process.env.OPENAI_API_KEY);

var registry = new ProviderRegistry();
registry.register('anthropic', anthropic, { cooldownMs: 30000, failureThreshold: 3 });
registry.register('openai', openai, { cooldownMs: 30000, failureThreshold: 3 });

// --- Cost Tracking ---
var costTracker = new CostTracker(50.00);  // $50/day budget

// --- Metrics Store ---
var metrics = {
  requests: [],
  byModel: {}
};

function recordMetric(entry) {
  metrics.requests.push(entry);
  if (metrics.requests.length > 10000) {
    metrics.requests = metrics.requests.slice(-5000);
  }
  if (!metrics.byModel[entry.model]) {
    metrics.byModel[entry.model] = {
      count: 0, totalLatency: 0, totalCost: 0, failures: 0
    };
  }
  var m = metrics.byModel[entry.model];
  m.count++;
  m.totalLatency += entry.latencyMs;
  m.totalCost += entry.cost;
  if (!entry.success) m.failures++;
}

// --- Model Mapping ---
var MODEL_TO_PROVIDER = {
  'claude-opus': { provider: 'anthropic', apiModel: 'claude-opus-4' },
  'claude-sonnet': { provider: 'anthropic', apiModel: 'claude-sonnet-4' },
  'claude-haiku': { provider: 'anthropic', apiModel: 'claude-haiku-3.5' },
  'gpt-4o': { provider: 'openai', apiModel: 'gpt-4o' },
  'gpt-4o-mini': { provider: 'openai', apiModel: 'gpt-4o-mini' }
};

// --- Failover Chain ---
var FAILOVER_CHAINS = {
  'claude-opus': ['claude-opus', 'gpt-4o', 'claude-sonnet'],
  'claude-sonnet': ['claude-sonnet', 'gpt-4o', 'claude-haiku'],
  'claude-haiku': ['claude-haiku', 'gpt-4o-mini'],
  'gpt-4o': ['gpt-4o', 'claude-sonnet', 'gpt-4o-mini'],
  'gpt-4o-mini': ['gpt-4o-mini', 'claude-haiku']
};

function executeWithFailover(preferredModel, request, callback) {
  var chain = FAILOVER_CHAINS[preferredModel] || [preferredModel];
  var attempt = 0;

  function tryNext() {
    if (attempt >= chain.length) {
      return callback(new Error('All providers in failover chain exhausted'));
    }

    var modelKey = chain[attempt];
    var mapping = MODEL_TO_PROVIDER[modelKey];
    if (!mapping || !registry.isAvailable(mapping.provider)) {
      attempt++;
      return tryNext();
    }

    var provider = registry.providers[mapping.provider];
    var startTime = Date.now();

    provider.complete(mapping.apiModel, request, function(err, response) {
      var latencyMs = Date.now() - startTime;

      if (err) {
        registry.recordFailure(mapping.provider);
        var cost = 0;
        recordMetric({
          model: modelKey, provider: mapping.provider,
          latencyMs: latencyMs, cost: cost, success: false,
          timestamp: Date.now()
        });
        attempt++;
        return tryNext();
      }

      registry.recordSuccess(mapping.provider);
      var cost = costTracker.recordUsage(
        modelKey, response.inputTokens, response.outputTokens
      );
      recordMetric({
        model: modelKey, provider: mapping.provider,
        latencyMs: latencyMs, cost: cost, success: true,
        timestamp: Date.now()
      });

      response.routingInfo = {
        selectedModel: modelKey,
        failoverAttempt: attempt,
        latencyMs: latencyMs,
        cost: cost
      };

      callback(null, response);
    });
  }

  tryNext();
}

// --- Main Router ---
function routeRequest(request, callback) {
  // Step 1: Classify complexity
  var complexity = classifyComplexity(request);

  // Step 2: Check budget constraints
  var downgrade = costTracker.shouldDowngrade();
  if (downgrade === 'aggressive') {
    complexity = 'simple';
  } else if (downgrade === 'moderate' && complexity === 'complex') {
    complexity = 'medium';
  }

  // Step 3: Select model tier
  var tier = COMPLEXITY_TIERS[complexity];
  var preferredModel = tier.models[0];

  // Step 4: Override with capability requirements if specified
  if (request.taskType) {
    try {
      var available = tier.models.filter(function(m) {
        var mapping = MODEL_TO_PROVIDER[m];
        return mapping && registry.isAvailable(mapping.provider);
      });
      if (available.length > 0) {
        preferredModel = selectByCapability(request.taskType, available);
      }
    } catch (e) {
      // Fall back to complexity-based selection
    }
  }

  // Step 5: Apply latency constraint if specified
  if (request.maxLatencyMs) {
    var fastEnough = selectByLatency(
      request.maxTTFTMs || 500,
      request.maxTokens || 1024,
      request.maxLatencyMs,
      tier.models
    );
    if (fastEnough.length > 0) {
      preferredModel = fastEnough[0];
    }
  }

  // Step 6: Execute with failover
  request.maxTokens = tier.maxTokens;
  executeWithFailover(preferredModel, request, callback);
}

// --- API Routes ---
app.post('/v1/route', function(req, res) {
  var request = {
    prompt: req.body.prompt,
    systemPrompt: req.body.system_prompt,
    messages: req.body.messages,
    maxTokens: req.body.max_tokens,
    taskType: req.body.task_type,
    maxLatencyMs: req.body.max_latency_ms,
    maxTTFTMs: req.body.max_ttft_ms,
    expectedMinLength: req.body.expected_min_length
  };

  routeRequest(request, function(err, response) {
    if (err) {
      return res.status(502).json({ error: err.message });
    }
    res.json({
      text: response.text,
      model: response.model,
      provider: response.provider,
      routing: response.routingInfo,
      usage: {
        input_tokens: response.inputTokens,
        output_tokens: response.outputTokens
      }
    });
  });
});

// --- Metrics Dashboard ---
app.get('/v1/metrics', function(req, res) {
  var dashboard = {
    budget: {
      daily: costTracker.dailyBudget,
      spent: costTracker.totalToday,
      remaining: costTracker.dailyBudget - costTracker.totalToday,
      ratio: costTracker.getBudgetRatio()
    },
    providers: {},
    models: {}
  };

  Object.keys(registry.health).forEach(function(name) {
    var h = registry.health[name];
    var cb = registry.circuitBreakers[name];
    dashboard.providers[name] = {
      healthy: h.healthy,
      circuitState: cb.state,
      totalRequests: h.totalRequests,
      failureRate: h.totalRequests > 0
        ? (h.totalFailures / h.totalRequests * 100).toFixed(2) + '%'
        : '0%'
    };
  });

  Object.keys(metrics.byModel).forEach(function(model) {
    var m = metrics.byModel[model];
    dashboard.models[model] = {
      requests: m.count,
      avgLatencyMs: m.count > 0 ? Math.round(m.totalLatency / m.count) : 0,
      avgCost: m.count > 0 ? (m.totalCost / m.count).toFixed(6) : '0',
      totalCost: m.totalCost.toFixed(4),
      failureRate: m.count > 0
        ? (m.failures / m.count * 100).toFixed(2) + '%'
        : '0%'
    };
  });

  res.json(dashboard);
});

// --- Health Check ---
app.get('/v1/health', function(req, res) {
  var available = registry.getAvailableProviders();
  res.status(available.length > 0 ? 200 : 503).json({
    status: available.length > 0 ? 'healthy' : 'degraded',
    availableProviders: available,
    timestamp: new Date().toISOString()
  });
});

var PORT = process.env.PORT || 3000;
app.listen(PORT, function() {
  console.log('Multi-model router listening on port ' + PORT);
});

This router accepts requests at /v1/route with an optional task_type, max_latency_ms, and other hints. It classifies complexity, checks budget constraints, selects the best model, and executes with automatic failover. The /v1/metrics endpoint exposes a real-time dashboard of cost, latency, and reliability per model.
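
A sample request against a locally running instance (port 3000 and the prompt are illustrative; fetch is built into Node.js 18+):

fetch('http://localhost:3000/v1/route', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    prompt: 'Extract the email address from: Contact us at support@example.com',
    task_type: 'json-extraction',
    max_latency_ms: 3000
  })
})
  .then(function(res) { return res.json(); })
  .then(function(data) {
    console.log(data.routing);  // selected model, failover attempts, latency, cost
    console.log(data.text);
  });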

Monitoring and Comparing Model Performance

The metrics endpoint above gives you the raw data. In production, you want to answer three questions continuously:

  1. Is the router saving money? Compare actual spend against what it would cost to send everything to the flagship model. Track the "savings ratio" daily.

  2. Is quality holding? Log a sample of routed responses and review them weekly. Track user feedback signals (thumbs up/down, regeneration rate) segmented by which model served the response.

  3. Are failovers happening? A spike in failover attempts means a provider is degrading. Alert on circuit breaker state changes and failover rate exceeding 5%.

// Add to the metrics endpoint
function calculateSavingsReport() {
  var actualCost = costTracker.totalToday;
  var flagshipCost = 0;

  metrics.requests.forEach(function(entry) {
    if (entry.success) {
      // Estimate what it would have cost to send everything to claude-opus.
      // Metric entries do not record token counts, so assume an average
      // request of ~500 input and ~1,000 output tokens.
      flagshipCost += costTracker.estimateCost('claude-opus', 500, 1000);
    }
  });

  return {
    actualCost: actualCost.toFixed(4),
    flagshipEquivalent: flagshipCost.toFixed(4),
    savingsPercent: flagshipCost > 0
      ? ((1 - actualCost / flagshipCost) * 100).toFixed(1) + '%'
      : '0%'
  };
}

Common Issues and Troubleshooting

1. Circuit breaker opens prematurely during provider rate limiting.

Error: Anthropic API error: 429 - Rate limit exceeded. Please retry after 30 seconds.

Rate limit errors (HTTP 429) should not count toward the circuit breaker failure threshold. They are not provider failures -- they are traffic management signals. Handle them separately so they never reach recordFailure:

ProviderRegistry.prototype.handleError = function(name, statusCode, error) {
  if (statusCode === 429) {
    // Do not trip circuit breaker, just add backoff delay
    this.health[name].rateLimited = true;
    this.health[name].retryAfter = Date.now() + 30000;
    return;
  }
  this.recordFailure(name);
};

2. Complexity classifier sends everything to the cheapest model.

[WARN] 94% of requests classified as 'simple' in last hour

This happens when your request payloads are short but the tasks are actually complex. The heuristic classifier is biased toward token count. Add domain-specific signals. If your application primarily handles code reviews, even a 50-word request ("review this pull request for security issues") is complex. Override the classifier with task-type hints from your application layer.

3. Cascading pattern doubles or triples latency on complex requests.

[METRICS] Cascade depth: 3, total latency: 8420ms (haiku: 1200ms, sonnet: 2800ms, opus: 4420ms)

If most requests cascade to the final level, the cascade is costing you more than direct routing -- you pay for every failed attempt. Analyze cascade depth distribution. If more than 30% of requests reach level 3, your quality check is too strict or your tier boundaries are wrong. Loosen the quality threshold or route those task types directly to the appropriate tier.
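
A small helper for that analysis, assuming you collect the attempts array that CascadingExecutor passes to its callback for each request:

function cascadeDepthDistribution(allAttempts) {
  var histogram = {};
  allAttempts.forEach(function(attempts) {
    var depth = attempts.length;
    histogram[depth] = (histogram[depth] || 0) + 1;
  });
  return histogram;
}

// e.g. { '1': 712, '2': 205, '3': 83 } -- if level 3 exceeds ~30%, rethink the tiers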

4. Budget enforcement causes hard failures at end of billing period.

Error: Daily budget exhausted. $50.00/$50.00 spent. All requests rejected.

Never hard-fail on budget exhaustion. Instead, aggressively downgrade to the cheapest available model. A response from GPT-4o-mini is infinitely better than a 503 error. Treat the budget as a soft constraint that triggers progressive degradation, not a hard stop.

5. Ensemble voting produces ties with an even number of models.

[WARN] Ensemble vote tie: {"positive": 1, "negative": 1}

Always use an odd number of models in an ensemble. If you must use an even number, assign different weights to break ties or designate one model as the tiebreaker. Alternatively, use the response from the highest-capability model when votes are split.

Best Practices

  • Start with complexity-based routing alone. It delivers the biggest cost savings with the least implementation effort. Add capability routing, cascading, and ensembles incrementally as you identify specific failure modes.

  • Instrument everything from day one. Log every routing decision, model selection, latency, cost, and quality signal. You cannot improve what you do not measure. The cost of logging is negligible compared to the cost of blind model selection.

  • Keep the quality check function fast and deterministic. An LLM-based quality check in a cascading pattern defeats the purpose. Use heuristics: response length, JSON validity, presence of required fields, absence of hedging language. Reserve LLM-based evaluation for offline batch analysis.

  • Treat provider health as a first-class signal. Circuit breakers, rate limit tracking, and latency percentiles should feed back into routing decisions in real time. A provider that responds in 200ms during normal operation but is currently averaging 2000ms should be deprioritized even if it is not "down."

  • Version your routing policies separately from your application code. Store model mappings, tier definitions, failover chains, and cost thresholds in configuration that can be updated without a deployment. This lets you respond to model deprecations, pricing changes, and new model launches without code changes.

  • Set budget alerts at 50%, 75%, and 90% of your daily limit. Do not wait for aggressive downgrade to kick in. Early alerts let you investigate whether a traffic spike, a misconfigured prompt, or a runaway retry loop is burning through your budget.

  • Test failover paths regularly. Disable a provider in your staging environment weekly and verify that traffic shifts cleanly. The worst time to discover that your failover chain has a bug is during a production outage.

  • Use deterministic routing for A/B tests. Hash-based assignment ensures users see consistent behavior. Random assignment introduces noise and delays statistical significance. Always wait for sufficient sample size before concluding that one model outperforms another.
