Model Versioning and Migration Strategies

Manage LLM model versions with canary deployments, shadow testing, quality comparison, and rollback strategies in Node.js.

Overview

Model versioning is the practice of tracking, testing, and controlling which LLM model version each feature in your application uses, and managing transitions between versions without breaking production. If you have shipped anything with GPT-3.5, you have already lived through the pain of a model being deprecated, replaced, or silently degraded. This article covers the engineering systems you need to treat model migrations as a first-class operational concern: configuration registries, canary routing, shadow testing, automated quality comparison, and rollback.

Prerequisites

  • Node.js 18+ with a running Express application
  • Experience calling at least one LLM provider API (OpenAI, Anthropic, etc.)
  • Familiarity with environment variables and configuration management
  • Basic understanding of A/B testing or feature flag concepts
  • A Redis instance (for canary state tracking; in-memory fallback shown)

Why Model Versions Matter

Every few months, providers ship new model versions. Sometimes they announce it. Sometimes they do not. Either way, three things change that directly affect your application.

API deprecations. OpenAI deprecated gpt-3.5-turbo-0301 with 90 days' notice. Anthropic moved from claude-2 to claude-3 with a completely different API shape. If you hardcoded model strings, you get to scramble.

Quality changes. A newer model version is not always better for your specific use case. I have seen gpt-4-turbo produce worse structured JSON output than gpt-4-0613 for certain extraction tasks. Newer models optimize for general benchmarks, not your particular prompt.

Cost changes. GPT-4o is cheaper than GPT-4 Turbo, but the token counts differ. Claude 3.5 Sonnet is cheaper than Claude 3 Opus, but produces different output lengths. A model "upgrade" can double your bill if you are not watching token usage.

The only safe assumption is that every model version change is a breaking change until proven otherwise.

Tracking Which Model Version Each Feature Uses

The first mistake teams make is scattering model identifiers across the codebase. You end up with gpt-4-turbo hardcoded in your summarizer, gpt-4o in your classifier, and text-embedding-ada-002 in your embedding pipeline. When a deprecation hits, you are grepping the entire repo.

Instead, centralize everything into a model registry.

// config/models.js
var models = {
  summarizer: {
    provider: "openai",
    model: "gpt-4o-2024-08-06",
    fallback: "gpt-4o-2024-05-13",
    maxTokens: 1024,
    temperature: 0.3,
    promptVersion: "v2.1",
    deprecationDate: null
  },
  classifier: {
    provider: "anthropic",
    model: "claude-3-5-sonnet-20241022",
    fallback: "claude-3-5-sonnet-20240620",
    maxTokens: 256,
    temperature: 0.0,
    promptVersion: "v1.4",
    deprecationDate: null
  },
  embeddings: {
    provider: "openai",
    model: "text-embedding-3-large",
    fallback: "text-embedding-ada-002",
    dimensions: 1536,
    deprecationDate: null
  },
  spamDetector: {
    provider: "openai",
    model: "gpt-4o-mini-2024-07-18",
    fallback: "gpt-3.5-turbo-0125",
    maxTokens: 64,
    temperature: 0.0,
    promptVersion: "v3.0",
    deprecationDate: null
  }
};

module.exports = models;

Every LLM call in your application references this registry instead of a string literal. When a model version changes, you update one file. When you need to audit what your system is running, you check one file.
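
To make this concrete, here is a minimal sketch of a call site that reads its settings from the registry instead of hardcoding them. It assumes the official openai npm package and an OPENAI_API_KEY environment variable; swap in your own client if you call providers differently.

// services/summarize.js (sketch)
var OpenAI = require("openai");
var models = require("../config/models");

var client = new OpenAI(); // reads OPENAI_API_KEY from the environment

function summarize(text) {
  // Single source of truth for the model identifier and its parameters
  var config = models.summarizer;

  return client.chat.completions
    .create({
      model: config.model,
      max_tokens: config.maxTokens,
      temperature: config.temperature,
      messages: [
        { role: "system", content: "Summarize the following text." },
        { role: "user", content: text }
      ]
    })
    .then(function (response) {
      return response.choices[0].message.content;
    });
}

module.exports = summarize;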

Implementing a Model Configuration Registry

A static config file works for small projects. For production systems with multiple environments and runtime overrides, you need a proper registry.

// lib/modelRegistry.js
var EventEmitter = require("events");
var fs = require("fs");
var path = require("path");

function ModelRegistry(options) {
  EventEmitter.call(this);
  var self = this;

  self.configPath = options.configPath || path.join(__dirname, "../config/models.json");
  self.models = {};
  self.overrides = {};
  self.history = [];
  self.maxHistory = options.maxHistory || 100;

  self._load();
}

ModelRegistry.prototype = Object.create(EventEmitter.prototype);
ModelRegistry.prototype.constructor = ModelRegistry;

ModelRegistry.prototype._load = function () {
  var self = this;
  try {
    var raw = fs.readFileSync(self.configPath, "utf8");
    self.models = JSON.parse(raw);
  } catch (err) {
    console.error("[ModelRegistry] Failed to load config:", err.message);
    self.models = {};
  }
};

ModelRegistry.prototype.get = function (featureKey) {
  var self = this;
  var base = self.models[featureKey];
  if (!base) {
    throw new Error("Unknown model feature key: " + featureKey);
  }
  var override = self.overrides[featureKey];
  if (override) {
    return Object.assign({}, base, override);
  }
  return Object.assign({}, base);
};

ModelRegistry.prototype.setOverride = function (featureKey, overrideConfig) {
  var self = this;
  var previous = self.get(featureKey);

  self.overrides[featureKey] = overrideConfig;

  var entry = {
    timestamp: new Date().toISOString(),
    featureKey: featureKey,
    previous: previous,
    override: overrideConfig,
    action: "override_set"
  };

  self.history.push(entry);
  if (self.history.length > self.maxHistory) {
    self.history.shift();
  }

  self.emit("model_changed", entry);
  return entry;
};

ModelRegistry.prototype.clearOverride = function (featureKey) {
  var self = this;
  delete self.overrides[featureKey];

  var entry = {
    timestamp: new Date().toISOString(),
    featureKey: featureKey,
    action: "override_cleared"
  };

  self.history.push(entry);
  self.emit("model_changed", entry);
  return entry;
};

ModelRegistry.prototype.getHistory = function (featureKey) {
  var self = this;
  if (featureKey) {
    return self.history.filter(function (e) {
      return e.featureKey === featureKey;
    });
  }
  return self.history.slice();
};

ModelRegistry.prototype.listFeatures = function () {
  return Object.keys(this.models);
};

module.exports = ModelRegistry;

This gives you runtime overrides without redeploying, an audit trail of every model change, and event hooks for monitoring.
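
A quick usage sketch: load the registry, read a feature's config, apply a temporary override, and subscribe to change events. The config path and log destination are placeholders for your own setup.

// example: wiring up the registry (paths and logging are placeholders)
var ModelRegistry = require("./lib/modelRegistry");

var registry = new ModelRegistry({
  configPath: __dirname + "/config/models.json"
});

// Forward every model change to your audit log or alerting pipeline
registry.on("model_changed", function (entry) {
  console.log("[Audit]", entry.action, entry.featureKey);
});

// Normal read path
var summarizerConfig = registry.get("summarizer");
console.log("summarizer currently uses", summarizerConfig.model);

// Temporary runtime override, for example during an incident
registry.setOverride("summarizer", { model: summarizerConfig.fallback });

// Later, return to the file-defined config
registry.clearOverride("summarizer");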

Canary Deployments for Model Updates

When you upgrade a model version, do not flip the switch for 100% of traffic at once. Route a small percentage to the new model and compare results.

// lib/canaryRouter.js
var crypto = require("crypto");

function CanaryRouter(options) {
  var self = this;
  self.canaries = {};
  self.redis = options.redis || null;
  self.metricsCollector = options.metricsCollector || null;
}

CanaryRouter.prototype.startCanary = function (featureKey, newModelConfig, percentage) {
  var self = this;
  if (percentage < 0 || percentage > 100) {
    throw new Error("Canary percentage must be between 0 and 100");
  }

  self.canaries[featureKey] = {
    newModel: newModelConfig,
    percentage: percentage,
    startedAt: new Date().toISOString(),
    requestsControl: 0,
    requestsCanary: 0,
    errorsControl: 0,
    errorsCanary: 0,
    latencyControl: [],
    latencyCanary: []
  };

  console.log(
    "[Canary] Started for %s: %d%% traffic to %s",
    featureKey,
    percentage,
    newModelConfig.model
  );

  return self.canaries[featureKey];
};

CanaryRouter.prototype.route = function (featureKey, requestId) {
  var self = this;
  var canary = self.canaries[featureKey];
  if (!canary) {
    return { useCanary: false };
  }

  // Deterministic routing based on request ID so the same user
  // gets the same model within a session
  var hash = crypto.createHash("md5").update(requestId).digest("hex");
  var bucket = parseInt(hash.substring(0, 8), 16) % 100;
  var useCanary = bucket < canary.percentage;

  return {
    useCanary: useCanary,
    bucket: bucket,
    model: useCanary ? canary.newModel : null
  };
};

CanaryRouter.prototype.recordResult = function (featureKey, isCanary, result) {
  var self = this;
  var canary = self.canaries[featureKey];
  if (!canary) return;

  if (isCanary) {
    canary.requestsCanary++;
    if (result.error) canary.errorsCanary++;
    if (result.latency) canary.latencyCanary.push(result.latency);
  } else {
    canary.requestsControl++;
    if (result.error) canary.errorsControl++;
    if (result.latency) canary.latencyControl.push(result.latency);
  }
};

CanaryRouter.prototype.getStats = function (featureKey) {
  var self = this;
  var canary = self.canaries[featureKey];
  if (!canary) return null;

  function median(arr) {
    if (arr.length === 0) return 0;
    var sorted = arr.slice().sort(function (a, b) { return a - b; });
    var mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  }

  return {
    featureKey: featureKey,
    percentage: canary.percentage,
    startedAt: canary.startedAt,
    control: {
      requests: canary.requestsControl,
      errors: canary.errorsControl,
      errorRate: canary.requestsControl > 0
        ? (canary.errorsControl / canary.requestsControl * 100).toFixed(2) + "%"
        : "N/A",
      medianLatency: median(canary.latencyControl)
    },
    canary: {
      requests: canary.requestsCanary,
      errors: canary.errorsCanary,
      errorRate: canary.requestsCanary > 0
        ? (canary.errorsCanary / canary.requestsCanary * 100).toFixed(2) + "%"
        : "N/A",
      medianLatency: median(canary.latencyCanary)
    }
  };
};

CanaryRouter.prototype.promote = function (featureKey) {
  var self = this;
  var canary = self.canaries[featureKey];
  if (!canary) throw new Error("No active canary for " + featureKey);

  var result = {
    promoted: canary.newModel,
    stats: self.getStats(featureKey)
  };

  delete self.canaries[featureKey];
  return result;
};

CanaryRouter.prototype.abort = function (featureKey) {
  var self = this;
  var canary = self.canaries[featureKey];
  if (!canary) throw new Error("No active canary for " + featureKey);

  var result = {
    aborted: canary.newModel,
    stats: self.getStats(featureKey),
    reason: "manual_abort"
  };

  delete self.canaries[featureKey];
  return result;
};

module.exports = CanaryRouter;

Start at 5%. Monitor for 24 hours. If error rates and latency look good, bump to 25%, then 50%, then 100%. Each step should have at least a few hundred requests before you proceed.
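
One way to express that ramp in code is a stepper that only advances when the canary has enough traffic and its error rate is not meaningfully worse than control. This is a sketch: the CanaryRouter above has no setter for the percentage, so it reaches into the router's state directly, and the step sizes and minimum request count are just the numbers suggested above.

// example: stepped canary ramp (sketch; reaches into CanaryRouter state directly)
var STEPS = [5, 25, 50, 100];
var MIN_REQUESTS_PER_STEP = 300;

function advanceCanary(router, featureKey) {
  var stats = router.getStats(featureKey);
  if (!stats) return "no_active_canary";

  if (stats.canary.requests < MIN_REQUESTS_PER_STEP) {
    return "waiting_for_traffic";
  }

  var canaryErr = parseFloat(stats.canary.errorRate) || 0;
  var controlErr = parseFloat(stats.control.errorRate) || 0;
  if (canaryErr > controlErr + 1.0) {
    return "hold"; // leave the decision to a human or the rollback manager
  }

  var currentIndex = STEPS.indexOf(stats.percentage);
  if (currentIndex === -1) return "unknown_step";

  var next = STEPS[currentIndex + 1];
  if (next === undefined) return "ready_to_promote";

  // Bump the live canary percentage in place
  router.canaries[featureKey].percentage = next;
  return "advanced_to_" + next;
}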

Shadow Testing: Run New Models in Parallel

Shadow testing is different from canary testing. With shadow testing, you always serve the result from the current model but also run the new model in the background. You compare the outputs without any risk to users.

// lib/shadowTester.js
var crypto = require("crypto");

function ShadowTester(options) {
  var self = this;
  self.tests = {};
  self.results = [];
  self.maxResults = options.maxResults || 1000;
  self.onComparison = options.onComparison || null;
}

ShadowTester.prototype.startShadow = function (featureKey, shadowModelConfig, sampleRate) {
  var self = this;
  self.tests[featureKey] = {
    shadowModel: shadowModelConfig,
    sampleRate: sampleRate || 0.1, // 10% of requests by default
    startedAt: new Date().toISOString(),
    comparisons: 0
  };
};

ShadowTester.prototype.shouldSample = function (featureKey) {
  var self = this;
  var test = self.tests[featureKey];
  if (!test) return false;
  return Math.random() < test.sampleRate;
};

ShadowTester.prototype.runShadow = function (featureKey, callFn, primaryResult) {
  var self = this;
  var test = self.tests[featureKey];
  if (!test) return;

  var startTime = Date.now();

  // Run the shadow call asynchronously - do not block the response
  callFn(test.shadowModel)
    .then(function (shadowResult) {
      var comparison = {
        id: crypto.randomUUID(),
        featureKey: featureKey,
        timestamp: new Date().toISOString(),
        primaryModel: primaryResult.model,
        shadowModel: test.shadowModel.model,
        primaryOutput: primaryResult.output,
        shadowOutput: shadowResult.output,
        primaryLatency: primaryResult.latency,
        shadowLatency: Date.now() - startTime,
        primaryTokens: primaryResult.usage,
        shadowTokens: shadowResult.usage
      };

      self.results.push(comparison);
      if (self.results.length > self.maxResults) {
        self.results.shift();
      }

      test.comparisons++;

      if (self.onComparison) {
        self.onComparison(comparison);
      }
    })
    .catch(function (err) {
      console.error("[ShadowTest] Shadow call failed for %s: %s", featureKey, err.message);
    });
};

ShadowTester.prototype.getResults = function (featureKey, limit) {
  var self = this;
  var filtered = self.results.filter(function (r) {
    return r.featureKey === featureKey;
  });
  if (limit) {
    return filtered.slice(-limit);
  }
  return filtered;
};

module.exports = ShadowTester;

Shadow testing is the safest way to validate a model migration. The new model's output never reaches users, but you accumulate a dataset of paired comparisons. Run this for a week before you even start a canary.
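
Wiring the shadow tester into a request path looks roughly like this. The llmClient here is a hypothetical client whose call(config, input) resolves to { output, usage }; the point is that the shadow call shares the primary call's exact input but never touches the response.

// example: shadow testing inside a request handler (llmClient is hypothetical)
var ShadowTester = require("./lib/shadowTester");

var shadowTester = new ShadowTester({ maxResults: 5000 });

// Shadow a candidate model for the summarizer at a 10% sample rate
shadowTester.startShadow(
  "summarizer",
  { provider: "openai", model: "gpt-4o-2024-08-06", maxTokens: 1024 },
  0.1
);

function handleSummarize(llmClient, primaryConfig, text) {
  // Closure over the input so the shadow call sees the same request
  var callModel = function (modelConfig) {
    return llmClient.call(modelConfig, text);
  };

  var start = Date.now();

  return callModel(primaryConfig).then(function (result) {
    var primaryResult = {
      model: primaryConfig.model,
      output: result.output,
      usage: result.usage,
      latency: Date.now() - start
    };

    // Fire-and-forget: the user only ever sees the primary result
    if (shadowTester.shouldSample("summarizer")) {
      shadowTester.runShadow("summarizer", callModel, primaryResult);
    }

    return primaryResult;
  });
}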

Automated Quality Comparison Between Model Versions

Having paired outputs from shadow testing is useless without a way to score them. You need automated evaluation.

// lib/qualityComparator.js
function QualityComparator(options) {
  var self = this;
  self.evaluators = options.evaluators || {};
  self.thresholds = options.thresholds || {};
}

QualityComparator.prototype.addEvaluator = function (name, evaluatorFn, threshold) {
  this.evaluators[name] = evaluatorFn;
  this.thresholds[name] = threshold;
};

QualityComparator.prototype.compare = function (primaryOutput, shadowOutput, context) {
  var self = this;
  var results = {};
  var passed = true;

  var evaluatorNames = Object.keys(self.evaluators);
  var promises = evaluatorNames.map(function (name) {
    return Promise.resolve(self.evaluators[name](primaryOutput, shadowOutput, context))
      .then(function (score) {
        results[name] = {
          score: score,
          threshold: self.thresholds[name],
          passed: score >= self.thresholds[name]
        };
        if (!results[name].passed) {
          passed = false;
        }
      });
  });

  return Promise.all(promises).then(function () {
    return {
      passed: passed,
      results: results,
      timestamp: new Date().toISOString()
    };
  });
};

// Built-in evaluators
QualityComparator.lengthSimilarity = function (primary, shadow) {
  var maxLen = Math.max(primary.length, shadow.length);
  if (maxLen === 0) return 1.0;
  return 1.0 - Math.abs(primary.length - shadow.length) / maxLen;
};

QualityComparator.jsonValidity = function (primary, shadow) {
  try {
    JSON.parse(shadow);
    return 1.0;
  } catch (e) {
    return 0.0;
  }
};

QualityComparator.keyOverlap = function (primary, shadow) {
  try {
    var primaryKeys = Object.keys(JSON.parse(primary));
    var shadowKeys = Object.keys(JSON.parse(shadow));
    var intersection = primaryKeys.filter(function (k) {
      return shadowKeys.indexOf(k) !== -1;
    });
    var union = primaryKeys.concat(shadowKeys.filter(function (k) {
      return primaryKeys.indexOf(k) === -1;
    }));
    return union.length > 0 ? intersection.length / union.length : 1.0;
  } catch (e) {
    return 0.0;
  }
};

module.exports = QualityComparator;

Common evaluators I use in production: JSON validity (does the new model still return valid JSON?), key overlap (does the structured output have the same fields?), length similarity (is the output roughly the same size?), and for classification tasks, exact match rate. For open-ended generation, you can use an LLM-as-judge approach where a third model scores both outputs.
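
For example, you can wire the comparator into the shadow tester's onComparison hook so every paired output gets scored as it arrives. The thresholds below are illustrative; tune them per feature.

// example: scoring shadow comparisons as they arrive (illustrative thresholds)
var QualityComparator = require("./lib/qualityComparator");
var ShadowTester = require("./lib/shadowTester");

var comparator = new QualityComparator({});
comparator.addEvaluator("json_validity", QualityComparator.jsonValidity, 1.0);
comparator.addEvaluator("key_overlap", QualityComparator.keyOverlap, 0.9);
comparator.addEvaluator("length_similarity", QualityComparator.lengthSimilarity, 0.7);

var shadowTester = new ShadowTester({
  maxResults: 5000,
  onComparison: function (comparison) {
    comparator
      .compare(comparison.primaryOutput, comparison.shadowOutput, {
        featureKey: comparison.featureKey
      })
      .then(function (evaluation) {
        if (!evaluation.passed) {
          console.warn(
            "[Quality] %s shadow output failed checks: %s",
            comparison.featureKey,
            JSON.stringify(evaluation.results)
          );
        }
      });
  }
});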

Rollback Strategies When a New Model Underperforms

Rollback needs to be instant. You should be able to revert a model change in under 60 seconds. Here is a rollback manager that handles the mechanics.

// lib/rollbackManager.js
function RollbackManager(options) {
  var self = this;
  self.registry = options.registry;
  self.canaryRouter = options.canaryRouter;
  self.snapshots = {};
}

RollbackManager.prototype.snapshot = function (featureKey) {
  var self = this;
  var current = self.registry.get(featureKey);
  self.snapshots[featureKey] = self.snapshots[featureKey] || [];
  self.snapshots[featureKey].push({
    config: JSON.parse(JSON.stringify(current)),
    timestamp: new Date().toISOString()
  });
  return current;
};

RollbackManager.prototype.rollback = function (featureKey, reason) {
  var self = this;
  var snapshotList = self.snapshots[featureKey];
  if (!snapshotList || snapshotList.length === 0) {
    throw new Error("No snapshots available for " + featureKey);
  }

  var previous = snapshotList.pop();

  // Abort any active canary
  try {
    self.canaryRouter.abort(featureKey);
  } catch (e) {
    // No active canary, that is fine
  }

  // Restore the previous config via registry override
  self.registry.setOverride(featureKey, previous.config);

  console.log(
    "[Rollback] %s rolled back to %s (reason: %s)",
    featureKey,
    previous.config.model,
    reason
  );

  return {
    featureKey: featureKey,
    restoredTo: previous.config,
    reason: reason,
    timestamp: new Date().toISOString()
  };
};

RollbackManager.prototype.autoRollbackCheck = function (featureKey, stats, thresholds) {
  var self = this;
  var shouldRollback = false;
  var reasons = [];

  if (stats.canary.errorRate !== "N/A") {
    var canaryErrorRate = parseFloat(stats.canary.errorRate);
    var controlErrorRate = parseFloat(stats.control.errorRate) || 0;
    if (canaryErrorRate > thresholds.maxErrorRate) {
      shouldRollback = true;
      reasons.push("Error rate " + canaryErrorRate + "% exceeds threshold " + thresholds.maxErrorRate + "%");
    }
    if (canaryErrorRate > controlErrorRate * thresholds.errorRateMultiplier) {
      shouldRollback = true;
      reasons.push("Error rate " + thresholds.errorRateMultiplier + "x higher than control");
    }
  }

  if (stats.canary.medianLatency > thresholds.maxLatency) {
    shouldRollback = true;
    reasons.push("Median latency " + stats.canary.medianLatency + "ms exceeds " + thresholds.maxLatency + "ms");
  }

  if (shouldRollback) {
    return self.rollback(featureKey, reasons.join("; "));
  }

  return null;
};

module.exports = RollbackManager;

The autoRollbackCheck method is designed to be called from a monitoring loop. If the canary's error rate spikes or latency goes through the roof, the system rolls back automatically without human intervention.
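
A minimal version of that loop, assuming registry, canaryRouter, and rollbackManager are the instances built earlier (the complete example later in this article does the same thing inside ModelVersionManager):

// example: periodic auto-rollback check (interval and thresholds are illustrative)
var thresholds = { maxErrorRate: 5.0, errorRateMultiplier: 3.0, maxLatency: 10000 };

setInterval(function () {
  registry.listFeatures().forEach(function (featureKey) {
    var stats = canaryRouter.getStats(featureKey);

    // Require a minimum sample before judging the canary
    if (stats && stats.canary.requests >= 50) {
      var result = rollbackManager.autoRollbackCheck(featureKey, stats, thresholds);
      if (result) {
        console.error("[AutoRollback] %s: %s", featureKey, result.reason);
      }
    }
  });
}, 60000);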

Handling Breaking API Changes from Providers

Provider API changes are the most disruptive type of model migration. Prompt format changes, response structure changes, new required parameters — these break your code, not just your output quality.

// lib/providerAdapter.js
var adapters = {
  "openai-chat": {
    formatRequest: function (config, messages, options) {
      return {
        model: config.model,
        messages: messages,
        max_tokens: config.maxTokens || 1024,
        temperature: config.temperature || 0.7
      };
    },
    parseResponse: function (response) {
      return {
        output: response.choices[0].message.content,
        usage: {
          input: response.usage.prompt_tokens,
          output: response.usage.completion_tokens,
          total: response.usage.total_tokens
        },
        finishReason: response.choices[0].finish_reason
      };
    }
  },
  "anthropic-messages": {
    formatRequest: function (config, messages, options) {
      // Anthropic uses a different message format
      var systemMsg = "";
      var chatMessages = [];

      messages.forEach(function (msg) {
        if (msg.role === "system") {
          systemMsg = msg.content;
        } else {
          chatMessages.push({
            role: msg.role,
            content: msg.content
          });
        }
      });

      var request = {
        model: config.model,
        max_tokens: config.maxTokens || 1024,
        messages: chatMessages
      };

      if (systemMsg) {
        request.system = systemMsg;
      }

      return request;
    },
    parseResponse: function (response) {
      return {
        output: response.content[0].text,
        usage: {
          input: response.usage.input_tokens,
          output: response.usage.output_tokens,
          total: response.usage.input_tokens + response.usage.output_tokens
        },
        finishReason: response.stop_reason
      };
    }
  }
};

function ProviderAdapter() {}

ProviderAdapter.prototype.getAdapter = function (provider) {
  var adapter = adapters[provider];
  if (!adapter) {
    throw new Error("Unknown provider: " + provider + ". Available: " + Object.keys(adapters).join(", "));
  }
  return adapter;
};

ProviderAdapter.prototype.registerAdapter = function (name, adapter) {
  if (!adapter.formatRequest || !adapter.parseResponse) {
    throw new Error("Adapter must implement formatRequest and parseResponse");
  }
  adapters[name] = adapter;
};

module.exports = new ProviderAdapter();

By abstracting the provider-specific format behind adapters, a breaking API change from OpenAI or Anthropic means updating one adapter, not every call site in your application. When Anthropic moved from the completions API to the messages API, teams with this pattern handled it in an afternoon. Teams without it spent a week.
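
Here is a sketch of a call path that goes through the adapter layer using Node 18's global fetch. The endpoint URLs, headers, and environment variable names follow each provider's documented HTTP API, but verify them against current docs before relying on this.

// example: one call path through the adapter layer (Node 18+ global fetch)
var providerAdapter = require("./lib/providerAdapter");

function callModel(config, messages) {
  var adapter = providerAdapter.getAdapter(
    config.provider === "anthropic" ? "anthropic-messages" : "openai-chat"
  );
  var body = adapter.formatRequest(config, messages, {});

  var url;
  var headers;
  if (config.provider === "anthropic") {
    url = "https://api.anthropic.com/v1/messages";
    headers = {
      "content-type": "application/json",
      "x-api-key": process.env.ANTHROPIC_API_KEY,
      "anthropic-version": "2023-06-01"
    };
  } else {
    url = "https://api.openai.com/v1/chat/completions";
    headers = {
      "content-type": "application/json",
      "authorization": "Bearer " + process.env.OPENAI_API_KEY
    };
  }

  return fetch(url, { method: "POST", headers: headers, body: JSON.stringify(body) })
    .then(function (res) {
      if (!res.ok) throw new Error("Provider returned HTTP " + res.status);
      return res.json();
    })
    .then(adapter.parseResponse);
}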

Version Pinning vs. Latest Model Strategies

There are two schools of thought, and I firmly recommend version pinning.

Version pinning means you specify gpt-4o-2024-08-06, not gpt-4o. You control exactly when you upgrade, and you can test before switching.

Latest model means you specify gpt-4o and let the provider auto-upgrade you. This sounds convenient until the provider updates the model and your carefully tuned prompts break at 3 AM on a Saturday.

// Example: version pinning config with deprecation tracking
var pinnedModels = {
  summarizer: {
    model: "gpt-4o-2024-08-06",
    pinnedAt: "2024-09-15",
    testedAt: "2024-09-14",
    deprecationDate: "2025-06-01",
    notes: "Tested with 500 sample inputs, 98.2% quality score"
  },
  classifier: {
    model: "claude-3-5-sonnet-20241022",
    pinnedAt: "2024-11-01",
    testedAt: "2024-10-30",
    deprecationDate: null,
    notes: "Classification accuracy 96.7% on test set"
  }
};

// Deprecation warning check - run on startup
function checkDeprecations(models) {
  var now = new Date();
  var warnings = [];

  Object.keys(models).forEach(function (key) {
    var m = models[key];
    if (m.deprecationDate) {
      var deprecation = new Date(m.deprecationDate);
      var daysUntil = Math.ceil((deprecation - now) / (1000 * 60 * 60 * 24));

      if (daysUntil <= 0) {
        warnings.push("[CRITICAL] " + key + " model " + m.model + " is PAST deprecation date!");
      } else if (daysUntil <= 30) {
        warnings.push("[WARNING] " + key + " model " + m.model + " deprecated in " + daysUntil + " days");
      } else if (daysUntil <= 90) {
        warnings.push("[INFO] " + key + " model " + m.model + " deprecated in " + daysUntil + " days");
      }
    }
  });

  return warnings;
}

Run checkDeprecations on every application startup. Pipe the warnings into your alerting system. You do not want to discover a deprecation in a 500 error.
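
A startup hook can be as small as the sketch below. The sendAlert function is a placeholder for whatever alerting integration you use, and it assumes checkDeprecations from the previous snippet is in scope.

// example: deprecation check at startup (sendAlert is a placeholder)
var models = require("./config/models");

function sendAlert(severity, message) {
  // Replace with your pager, Slack, or email integration
  console.log("[Alert:%s] %s", severity, message);
}

checkDeprecations(models).forEach(function (warning) {
  if (warning.indexOf("[CRITICAL]") === 0) {
    sendAlert("critical", warning);
  } else if (warning.indexOf("[WARNING]") === 0) {
    sendAlert("warning", warning);
  } else {
    sendAlert("info", warning);
  }
});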

Migration Checklists for Major Model Updates

A major model update is a project, not a ticket. Here is the checklist I use.

Pre-migration (1-2 weeks before):

  1. Identify all features using the deprecated model (query your registry)
  2. Read the provider's migration guide for API changes
  3. Update provider adapters if the API format changed
  4. Run your existing prompt test suite against the new model
  5. Identify prompt regressions and fix them
  6. Run shadow testing for at least 72 hours
  7. Analyze cost impact with real token counts

Migration (day of):

  1. Take a snapshot of current model configs
  2. Start canary at 5% for each affected feature
  3. Monitor error rates, latency, and quality metrics for 4 hours
  4. Escalate to 25% if metrics hold
  5. Escalate to 50% after another 4 hours
  6. Promote to 100% after 24 hours of stable metrics

Post-migration (1 week after):

  1. Review full quality comparison reports
  2. Compare cost actuals to estimates
  3. Remove old model references and adapters
  4. Update documentation and runbooks
  5. Archive shadow test results for future reference

Prompt Regression Testing Before Migration

Your prompts are code. They need tests.

// tests/promptRegression.js
var assert = require("assert");

function PromptRegressionSuite(options) {
  var self = this;
  self.llmClient = options.llmClient;
  self.testCases = [];
  self.results = [];
}

PromptRegressionSuite.prototype.addTestCase = function (testCase) {
  this.testCases.push(testCase);
};

PromptRegressionSuite.prototype.run = function (modelConfig) {
  var self = this;
  var results = [];

  return self.testCases.reduce(function (chain, testCase) {
    return chain.then(function () {
      return self.llmClient.call(modelConfig, testCase.input)
        .then(function (output) {
          var result = {
            name: testCase.name,
            passed: true,
            checks: []
          };

          testCase.assertions.forEach(function (assertion) {
            var checkResult = { name: assertion.name, passed: false };
            try {
              assertion.check(output);
              checkResult.passed = true;
            } catch (e) {
              checkResult.passed = false;
              checkResult.error = e.message;
              result.passed = false;
            }
            result.checks.push(checkResult);
          });

          results.push(result);
        })
        .catch(function (err) {
          results.push({
            name: testCase.name,
            passed: false,
            error: err.message
          });
        });
    });
  }, Promise.resolve()).then(function () {
    return results;
  });
};

// Example test cases for a summarizer
var suite = new PromptRegressionSuite({ llmClient: null /* inject real client */ });

suite.addTestCase({
  name: "summarizer_returns_json",
  input: {
    messages: [
      { role: "system", content: "Summarize the following text. Return JSON with keys: summary, keyPoints, sentiment." },
      { role: "user", content: "The quarterly earnings report showed a 15% increase in revenue..." }
    ]
  },
  assertions: [
    {
      name: "output_is_valid_json",
      check: function (output) {
        JSON.parse(output); // Throws if invalid
      }
    },
    {
      name: "has_required_keys",
      check: function (output) {
        var parsed = JSON.parse(output);
        assert.ok(parsed.summary, "Missing summary key");
        assert.ok(parsed.keyPoints, "Missing keyPoints key");
        assert.ok(parsed.sentiment, "Missing sentiment key");
      }
    },
    {
      name: "summary_under_200_words",
      check: function (output) {
        var parsed = JSON.parse(output);
        var wordCount = parsed.summary.split(/\s+/).length;
        assert.ok(wordCount <= 200, "Summary is " + wordCount + " words, expected <= 200");
      }
    }
  ]
});

Run this suite against both the current and candidate model. Any test case that passes on the current model but fails on the candidate is a regression that must be fixed before migration.
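
A comparison run might look like the sketch below, assuming the suite above has a real llmClient injected. Because run preserves test case order, results can be paired by index.

// example: diffing regression results between current and candidate models (sketch)
function findRegressions(suite, currentConfig, candidateConfig) {
  return Promise.all([suite.run(currentConfig), suite.run(candidateConfig)]).then(function (both) {
    var currentResults = both[0];
    var candidateResults = both[1];

    // A regression is any test that passes on the current model but fails on the candidate
    return currentResults
      .filter(function (result, i) {
        return result.passed && !candidateResults[i].passed;
      })
      .map(function (result) {
        return result.name;
      });
  });
}

findRegressions(
  suite,
  { model: "gpt-4-turbo-2024-04-09", maxTokens: 1024, temperature: 0.3 },
  { model: "gpt-4o-2024-08-06", maxTokens: 1024, temperature: 0.3 }
).then(function (regressions) {
  if (regressions.length > 0) {
    console.error("Migration blocked by regressions:", regressions.join(", "));
    process.exitCode = 1;
  }
});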

Cost Impact Analysis of Model Version Changes

Model changes affect your bill. Sometimes dramatically. Track it.

// lib/costAnalyzer.js
var PRICING = {
  "gpt-4o-2024-08-06": { input: 2.50, output: 10.00 },      // per 1M tokens
  "gpt-4o-mini-2024-07-18": { input: 0.15, output: 0.60 },
  "gpt-4-turbo-2024-04-09": { input: 10.00, output: 30.00 },
  "claude-3-5-sonnet-20241022": { input: 3.00, output: 15.00 },
  "claude-3-opus-20240229": { input: 15.00, output: 75.00 }
};

function CostAnalyzer() {
  this.usage = {};
}

CostAnalyzer.prototype.recordUsage = function (featureKey, model, inputTokens, outputTokens) {
  var self = this;
  var key = featureKey + ":" + model;
  if (!self.usage[key]) {
    self.usage[key] = { featureKey: featureKey, model: model, inputTokens: 0, outputTokens: 0, calls: 0 };
  }
  self.usage[key].inputTokens += inputTokens;
  self.usage[key].outputTokens += outputTokens;
  self.usage[key].calls++;
};

CostAnalyzer.prototype.estimateMigrationCost = function (featureKey, currentModel, newModel, sampleSize) {
  var self = this;
  var currentKey = featureKey + ":" + currentModel;
  var currentUsage = self.usage[currentKey];

  if (!currentUsage || currentUsage.calls === 0) {
    return { error: "No usage data for " + currentKey };
  }

  var avgInputTokens = currentUsage.inputTokens / currentUsage.calls;
  var avgOutputTokens = currentUsage.outputTokens / currentUsage.calls;

  var currentPricing = PRICING[currentModel];
  var newPricing = PRICING[newModel];

  if (!currentPricing || !newPricing) {
    return { error: "Pricing data missing for one or both models" };
  }

  var projectedCalls = sampleSize || currentUsage.calls;

  var currentCost = (avgInputTokens * projectedCalls / 1000000 * currentPricing.input) +
                    (avgOutputTokens * projectedCalls / 1000000 * currentPricing.output);

  var newCost = (avgInputTokens * projectedCalls / 1000000 * newPricing.input) +
                (avgOutputTokens * projectedCalls / 1000000 * newPricing.output);

  return {
    featureKey: featureKey,
    currentModel: currentModel,
    newModel: newModel,
    projectedCalls: projectedCalls,
    avgInputTokens: Math.round(avgInputTokens),
    avgOutputTokens: Math.round(avgOutputTokens),
    currentMonthlyCost: "$" + currentCost.toFixed(2),
    newMonthlyCost: "$" + newCost.toFixed(2),
    costChange: "$" + (newCost - currentCost).toFixed(2),
    costChangePercent: ((newCost - currentCost) / currentCost * 100).toFixed(1) + "%"
  };
};

module.exports = CostAnalyzer;

Output from a real migration analysis:

{
  "featureKey": "summarizer",
  "currentModel": "gpt-4-turbo-2024-04-09",
  "newModel": "gpt-4o-2024-08-06",
  "projectedCalls": 45000,
  "avgInputTokens": 1250,
  "avgOutputTokens": 380,
  "currentMonthlyCost": "$1075.50",
  "newMonthlyCost": "$312.75",
  "costChange": "-$762.75",
  "costChangePercent": "-70.9%"
}

That is real money. Put this data in front of stakeholders before migration to justify the engineering effort.
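
Recording usage and asking for a projection is only a few lines; the projected call volume below mirrors the sample report above.

// example: recording usage and projecting migration cost
var CostAnalyzer = require("./lib/costAnalyzer");

var costAnalyzer = new CostAnalyzer();

// Call this from your response handler with the provider-reported token counts
costAnalyzer.recordUsage("summarizer", "gpt-4-turbo-2024-04-09", 1250, 380);

// Before a migration, project monthly cost at the expected call volume
var projection = costAnalyzer.estimateMigrationCost(
  "summarizer",
  "gpt-4-turbo-2024-04-09",
  "gpt-4o-2024-08-06",
  45000 // projected monthly calls
);

console.log(JSON.stringify(projection, null, 2));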

Communication Templates for Stakeholders

Engineering work needs buy-in. Here is how I communicate model migrations to non-technical stakeholders.

Pre-migration notification:

Subject: Planned AI Model Update — [Feature Name]

We are upgrading the AI model powering [feature] from [current model] to [new model]. This change will [reduce costs by X% / improve response quality / comply with deprecation timeline].

Timeline: Shadow testing starts [date], canary rollout [date], full deployment [date]. Risk: Low. We have automated rollback that triggers in under 60 seconds if quality degrades. Cost impact: Estimated [savings/increase] of $X/month.

Post-migration report:

Subject: AI Model Update Complete — [Feature Name]

Migration from [old model] to [new model] is complete.

  • Quality score: [X%] (vs. [Y%] on previous model)
  • Error rate: [X%] (unchanged / improved from [Y%])
  • Latency: [X]ms median (vs. [Y]ms previously)
  • Cost: $[X]/month (savings of $[Y]/month)
  • Rollbacks triggered: [0]

Keep it short, lead with the business impact, and always include the numbers.

Maintaining Backward Compatibility During Transitions

During migration windows, your system must handle both old and new model versions simultaneously. This is especially important when you have prompt versions tied to model versions.

// lib/promptManager.js
var prompts = {
  summarizer: {
    "v2.1": {
      compatibleModels: ["gpt-4-turbo-2024-04-09", "gpt-4o-2024-05-13"],
      system: "You are a professional summarizer. Return JSON with keys: summary, keyPoints, sentiment.",
      template: "Summarize the following text:\n\n{{text}}"
    },
    "v2.2": {
      compatibleModels: ["gpt-4o-2024-08-06", "gpt-4o-2024-05-13"],
      system: "You are a professional summarizer. Always return valid JSON.\nRequired keys: summary (string), keyPoints (array of strings), sentiment (one of: positive, negative, neutral).",
      template: "Summarize the following text in a structured format:\n\n{{text}}"
    }
  }
};

function PromptManager() {}

PromptManager.prototype.getPrompt = function (featureKey, promptVersion, model) {
  var versions = prompts[featureKey];
  if (!versions) {
    throw new Error("No prompts found for feature: " + featureKey);
  }

  var prompt = versions[promptVersion];
  if (!prompt) {
    throw new Error("Prompt version " + promptVersion + " not found for " + featureKey);
  }

  if (prompt.compatibleModels.indexOf(model) === -1) {
    console.warn(
      "[PromptManager] Model %s not in compatible list for %s %s. Compatible: %s",
      model,
      featureKey,
      promptVersion,
      prompt.compatibleModels.join(", ")
    );
  }

  return prompt;
};

PromptManager.prototype.findCompatibleVersion = function (featureKey, model) {
  var versions = prompts[featureKey];
  if (!versions) return null;

  var versionKeys = Object.keys(versions);
  // Return the latest compatible version
  for (var i = versionKeys.length - 1; i >= 0; i--) {
    var v = versions[versionKeys[i]];
    if (v.compatibleModels.indexOf(model) !== -1) {
      return versionKeys[i];
    }
  }
  return null;
};

module.exports = new PromptManager();

The prompt manager ensures that when you switch models mid-canary, each request gets the right prompt version for the model it is routed to.
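
In a handler, that lookup combines with the registry config roughly like this; the input text is illustrative and the template substitution is a plain string replace.

// example: resolving the right prompt for whichever model the request was routed to
var promptManager = require("./lib/promptManager");

function buildMessages(featureKey, modelConfig, text) {
  // Prefer the prompt version pinned in the registry config
  var prompt = promptManager.getPrompt(featureKey, modelConfig.promptVersion, modelConfig.model);

  // If that version does not list this model as compatible, fall back to the
  // newest prompt version that does
  if (prompt.compatibleModels.indexOf(modelConfig.model) === -1) {
    var compatibleVersion = promptManager.findCompatibleVersion(featureKey, modelConfig.model);
    if (compatibleVersion) {
      prompt = promptManager.getPrompt(featureKey, compatibleVersion, modelConfig.model);
    }
  }

  return [
    { role: "system", content: prompt.system },
    { role: "user", content: prompt.template.replace("{{text}}", text) }
  ];
}

var messages = buildMessages(
  "summarizer",
  { model: "gpt-4o-2024-08-06", promptVersion: "v2.1" },
  "The quarterly earnings report showed a 15% increase in revenue..."
);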

Complete Working Example

Here is the full model version manager tying everything together.

// modelVersionManager.js
var ModelRegistry = require("./lib/modelRegistry");
var CanaryRouter = require("./lib/canaryRouter");
var ShadowTester = require("./lib/shadowTester");
var QualityComparator = require("./lib/qualityComparator");
var RollbackManager = require("./lib/rollbackManager");
var CostAnalyzer = require("./lib/costAnalyzer");

function ModelVersionManager(options) {
  var self = this;

  self.registry = new ModelRegistry({
    configPath: options.configPath
  });

  self.canaryRouter = new CanaryRouter({
    redis: options.redis || null
  });

  self.shadowTester = new ShadowTester({
    maxResults: 5000,
    onComparison: function (comparison) {
      console.log("[Shadow] Comparison for %s: primary=%s shadow=%s",
        comparison.featureKey, comparison.primaryModel, comparison.shadowModel);
    }
  });

  self.qualityComparator = new QualityComparator({});

  self.rollbackManager = new RollbackManager({
    registry: self.registry,
    canaryRouter: self.canaryRouter
  });

  self.costAnalyzer = new CostAnalyzer();

  self.rollbackThresholds = options.rollbackThresholds || {
    maxErrorRate: 5.0,
    errorRateMultiplier: 3.0,
    maxLatency: 10000
  };

  // Start monitoring loop
  self._monitorInterval = setInterval(function () {
    self._checkAllCanaries();
  }, options.monitorInterval || 60000);
}

ModelVersionManager.prototype.getModelConfig = function (featureKey, requestId) {
  var self = this;
  var baseConfig = self.registry.get(featureKey);

  // Check if there is an active canary
  if (requestId) {
    var routing = self.canaryRouter.route(featureKey, requestId);
    if (routing.useCanary) {
      return {
        config: Object.assign({}, baseConfig, routing.model),
        isCanary: true
      };
    }
  }

  return {
    config: baseConfig,
    isCanary: false
  };
};

ModelVersionManager.prototype.recordResult = function (featureKey, isCanary, result) {
  var self = this;

  self.canaryRouter.recordResult(featureKey, isCanary, result);

  self.costAnalyzer.recordUsage(
    featureKey,
    result.model,
    result.usage ? result.usage.input : 0,
    result.usage ? result.usage.output : 0
  );
};

ModelVersionManager.prototype.startMigration = function (featureKey, newModelConfig) {
  var self = this;

  // Step 1: Snapshot current state
  self.rollbackManager.snapshot(featureKey);

  // Step 2: Start shadow testing at 10% sample rate
  self.shadowTester.startShadow(featureKey, newModelConfig, 0.1);

  console.log("[Migration] Started shadow testing for %s with model %s",
    featureKey, newModelConfig.model);

  return { phase: "shadow", featureKey: featureKey, model: newModelConfig.model };
};

ModelVersionManager.prototype.promoteToCanary = function (featureKey, newModelConfig, percentage) {
  var self = this;

  self.canaryRouter.startCanary(featureKey, newModelConfig, percentage || 5);

  return {
    phase: "canary",
    featureKey: featureKey,
    model: newModelConfig.model,
    percentage: percentage || 5
  };
};

ModelVersionManager.prototype.promoteToFull = function (featureKey) {
  var self = this;
  var result = self.canaryRouter.promote(featureKey);

  // Apply the new model as a registry override
  self.registry.setOverride(featureKey, result.promoted);

  return {
    phase: "promoted",
    featureKey: featureKey,
    model: result.promoted.model,
    stats: result.stats
  };
};

ModelVersionManager.prototype.abortMigration = function (featureKey, reason) {
  var self = this;
  return self.rollbackManager.rollback(featureKey, reason || "manual_abort");
};

ModelVersionManager.prototype.getStatus = function () {
  var self = this;
  var features = self.registry.listFeatures();

  return features.map(function (key) {
    var config = self.registry.get(key);
    var canaryStats = self.canaryRouter.getStats(key);
    var shadowResults = self.shadowTester.getResults(key, 10);

    return {
      featureKey: key,
      currentModel: config.model,
      hasCanary: canaryStats !== null,
      canaryStats: canaryStats,
      recentShadowTests: shadowResults.length
    };
  });
};

ModelVersionManager.prototype._checkAllCanaries = function () {
  var self = this;
  var features = self.registry.listFeatures();

  features.forEach(function (key) {
    var stats = self.canaryRouter.getStats(key);
    if (stats && stats.canary.requests >= 50) {
      var rollbackResult = self.rollbackManager.autoRollbackCheck(
        key,
        stats,
        self.rollbackThresholds
      );
      if (rollbackResult) {
        console.error("[AutoRollback] %s rolled back: %s", key, rollbackResult.reason);
      }
    }
  });
};

ModelVersionManager.prototype.shutdown = function () {
  clearInterval(this._monitorInterval);
};

module.exports = ModelVersionManager;

Usage in an Express application:

// app.js (relevant excerpt)
var express = require("express");
var ModelVersionManager = require("./modelVersionManager");

var app = express();
app.use(express.json()); // parse JSON bodies for the admin endpoints below
var mvm = new ModelVersionManager({
  configPath: __dirname + "/config/models.json",
  rollbackThresholds: {
    maxErrorRate: 5.0,
    errorRateMultiplier: 3.0,
    maxLatency: 10000
  },
  monitorInterval: 30000
});

// API endpoint to check migration status
app.get("/admin/model-status", function (req, res) {
  res.json(mvm.getStatus());
});

// API endpoint to start a migration
app.post("/admin/model-migrate", function (req, res) {
  var featureKey = req.body.featureKey;
  var newModel = req.body.newModelConfig;
  var result = mvm.startMigration(featureKey, newModel);
  res.json(result);
});

// API endpoint to promote canary
app.post("/admin/model-promote", function (req, res) {
  var featureKey = req.body.featureKey;
  var percentage = req.body.percentage;

  if (percentage === 100) {
    var result = mvm.promoteToFull(featureKey);
    res.json(result);
  } else {
    var result = mvm.promoteToCanary(featureKey, req.body.newModelConfig, percentage);
    res.json(result);
  }
});

// API endpoint to rollback
app.post("/admin/model-rollback", function (req, res) {
  var result = mvm.abortMigration(req.body.featureKey, req.body.reason);
  res.json(result);
});

// Example: using the manager in a route handler
app.post("/api/summarize", function (req, res) {
  var requestId = req.headers["x-request-id"] || require("crypto").randomUUID();
  var modelInfo = mvm.getModelConfig("summarizer", requestId);

  var startTime = Date.now();

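  // callLLM is whatever provider call you use, e.g. the adapter-based helper shown earlier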
  callLLM(modelInfo.config, req.body.text)
    .then(function (result) {
      var latency = Date.now() - startTime;

      mvm.recordResult("summarizer", modelInfo.isCanary, {
        model: modelInfo.config.model,
        latency: latency,
        usage: result.usage,
        error: null
      });

      res.json({ summary: result.output });
    })
    .catch(function (err) {
      mvm.recordResult("summarizer", modelInfo.isCanary, {
        model: modelInfo.config.model,
        latency: Date.now() - startTime,
        error: err.message
      });

      res.status(500).json({ error: "Summarization failed" });
    });
});

Common Issues and Troubleshooting

Issue 1: Canary routing is not deterministic across requests from the same user.

[Canary] User abc123 got gpt-4o on request 1, gpt-4-turbo on request 2

This happens when you use Math.random() for canary routing. Use a deterministic hash of the user ID or session ID instead. The CanaryRouter above uses MD5 hashing of the request ID for this reason. Pass a consistent identifier (user ID, session token) as the requestId parameter, not a random UUID.

Issue 2: Shadow test results show massive latency differences but production is fine.

[ShadowTest] Primary latency: 450ms, Shadow latency: 3200ms

Shadow calls run without connection pooling or warm connections because they are fire-and-forget side calls. The shadow model's first call may include cold start latency from a new HTTP connection. Use a dedicated HTTP agent with keep-alive for shadow calls, and discard the first few results as warmup.
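
A sketch of a dedicated keep-alive agent for the shadow path, using the built-in https module; if you call providers through an SDK, check whether it accepts a custom agent option instead.

// example: dedicated keep-alive agent for shadow calls
var https = require("https");

var shadowAgent = new https.Agent({
  keepAlive: true,
  maxSockets: 5 // shadow traffic is low volume; keep the connection footprint small
});

function shadowRequest(options, body, callback) {
  options.agent = shadowAgent; // reuse warm connections across shadow calls

  var req = https.request(options, function (res) {
    var chunks = [];
    res.on("data", function (chunk) { chunks.push(chunk); });
    res.on("end", function () {
      callback(null, Buffer.concat(chunks).toString("utf8"));
    });
  });

  req.on("error", callback);
  req.write(body);
  req.end();
}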

Issue 3: Auto-rollback triggers on a transient provider outage.

[AutoRollback] summarizer rolled back: Error rate 12.0% exceeds threshold 5.0%

A brief provider outage causes errors on the canary path, triggering rollback even though the control path would have had the same errors. Add a minimum request threshold (at least 50 requests for both control and canary) before evaluating, and compare error rates between canary and control rather than checking canary error rate in isolation. The autoRollbackCheck method handles this with the errorRateMultiplier threshold.

Issue 4: JSON parsing errors after model migration because the new model wraps output in markdown code fences.

SyntaxError: Unexpected token ` in JSON at position 0
    at JSON.parse (<anonymous>)

GPT-4o and Claude 3.5 Sonnet sometimes wrap JSON responses in Markdown code fences (```json ... ```), even when instructed not to. Add a response sanitizer in your provider adapter:

function sanitizeJsonResponse(text) {
  var trimmed = text.trim();
  // Strip markdown code fences
  if (trimmed.startsWith("```")) {
    var lines = trimmed.split("\n");
    lines.shift(); // Remove opening fence
    if (lines[lines.length - 1].trim() === "```") {
      lines.pop(); // Remove closing fence
    }
    return lines.join("\n").trim();
  }
  return trimmed;
}

Issue 5: Cost analysis is inaccurate because token counts differ between models for the same input.

Cost projection: -30% ($450/mo savings)
Actual cost change: +5% ($75/mo increase)

Different models tokenize text differently. GPT-4o and GPT-4 Turbo use different tokenizers (o200k_base vs cl100k_base). Your input token count will differ even with identical prompts. Always base cost projections on actual shadow test token counts, not on multiplying current usage by the new model's pricing.

Best Practices

  • Pin model versions explicitly. Never use unversioned model identifiers like gpt-4o in production. Use gpt-4o-2024-08-06 so you control exactly when upgrades happen.

  • Run shadow tests for at least 72 hours before starting a canary. This gives you enough data to catch edge cases that only appear at scale, like rare input patterns that cause the new model to hallucinate or refuse.

  • Keep prompt versions paired with model versions. A prompt tuned for GPT-4 Turbo may perform terribly on GPT-4o because the newer model interprets instructions differently. Version your prompts alongside your models and test each combination.

  • Automate deprecation tracking. Check provider deprecation dates on every application startup. Set alerts for 90 days, 30 days, and 7 days before deprecation. Do not rely on email notifications from providers.

  • Log every model call with the exact model version, token counts, and latency. This data is essential for cost analysis, quality monitoring, and debugging. Store it in a queryable format (not just application logs).

  • Establish rollback thresholds before starting a migration, not after. Decide in advance what error rate, latency, or quality score triggers an automatic rollback. Document these thresholds and get agreement from the team. Deciding in the moment leads to debates while production is degraded.

  • Test with production-like traffic, not synthetic benchmarks. Your prompt regression test suite should include real inputs from production (sanitized of PII). Synthetic test cases miss the long tail of weird user inputs that cause model regressions.

  • Abstract provider APIs behind adapters from day one. Even if you only use OpenAI today, the adapter pattern costs you nothing and saves weeks when you need to add Anthropic, switch providers, or handle an API version change.

  • Communicate model changes to stakeholders proactively. A non-technical product manager should not discover a model change by noticing the output "feels different." Send pre-migration and post-migration reports with concrete metrics.
