Blue-Green Deployments for AI Features

Implement blue-green deployments for AI features with quality-based canary analysis, gradual traffic shifting, and automated rollback in Node.js.

Overview

Blue-green deployments give you two identical production environments so you can switch traffic between them with zero downtime. When your application depends on large language models, though, traditional blue-green strategies fall apart: model changes, prompt updates, and provider configuration shifts affect output quality in ways that standard health checks cannot detect. This article covers how to build a blue-green deployment pipeline designed specifically for AI-powered services, with quality-based canary analysis, gradual traffic shifting, and automated rollback triggers that go far beyond simple HTTP status checks.

Prerequisites

  • Node.js v18+ and npm installed
  • Working knowledge of Express.js and reverse proxy concepts
  • Familiarity with Docker and nginx configuration
  • An AI-powered service already running in production (OpenAI, Anthropic, or similar)
  • Basic understanding of deployment strategies (rolling, canary, blue-green)
  • Access to a cloud platform (DigitalOcean, AWS, or similar) for multi-environment hosting

Why Blue-Green Matters More for AI Features

Traditional web applications have predictable behavior after deployment. If the health check passes and the tests are green, you ship it. AI features break that assumption completely.

When you update a prompt template, swap from GPT-4o to GPT-4o-mini to reduce costs, or upgrade your embedding model, the application still returns 200 status codes. The API still responds within acceptable latency windows. But the output quality may have degraded catastrophically. A summarization endpoint might start producing summaries that miss key points. A classification service might silently drop accuracy from 94% to 71%. A chatbot might start hallucinating facts that were previously grounded.

This is why AI features demand a deployment strategy that treats quality as a first-class deployment signal, not just availability. Blue-green gives you the infrastructure to run both the old and new versions simultaneously, compare their outputs on real traffic, and roll back instantly when quality degrades.

Blue-Green Architecture for LLM-Dependent Services

The standard blue-green architecture runs two identical environments behind a load balancer or reverse proxy. One environment (blue) serves live traffic while the other (green) sits idle or receives the next deployment. When you are ready to switch, you redirect traffic from blue to green.

For AI services, this architecture needs three additions:

  1. A quality evaluation layer that scores responses from both environments against the same inputs
  2. A traffic splitter that can send configurable percentages to each environment rather than all-or-nothing switching
  3. A decision engine that automatically triggers rollback when quality metrics cross defined thresholds

                    ┌──────────────┐
   Requests ───────►│ Traffic      │
                    │ Controller   │
                    └──────┬───────┘
                           │
                ┌──────────┴──────────┐
                │                     │
         ┌──────▼──────┐       ┌──────▼──────┐
         │  Blue Env   │       │  Green Env  │
         │  (current)  │       │  (new)      │
         └──────┬──────┘       └──────┬──────┘
                │                     │
                └──────────┬──────────┘
                           │
                    ┌──────▼──────┐
                    │  Quality    │
                    │  Analyzer   │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │  Rollback   │
                    │  Decision   │
                    └─────────────┘

Implementing Traffic Switching

The core of a blue-green deployment controller is the traffic router. Unlike simple DNS-based switching, an AI-aware router needs to duplicate a sample of requests to both environments for quality comparison while routing the primary response from whichever environment is currently active.

var express = require("express");
var http = require("http");

var BLUE_TARGET = process.env.BLUE_TARGET || "http://localhost:3001";
var GREEN_TARGET = process.env.GREEN_TARGET || "http://localhost:3002";

var trafficConfig = {
  greenPercentage: 0,
  activeEnvironment: "blue",
  shadowEnabled: true
};

function routeRequest(req, res) {
  var roll = Math.random() * 100;
  var target;

  if (roll < trafficConfig.greenPercentage) {
    target = GREEN_TARGET;
  } else {
    target = BLUE_TARGET;
  }

  var parsedTarget = new URL(target);
  var body = req.body ? JSON.stringify(req.body) : null;
  var headers = Object.assign({}, req.headers);

  // express.json() has already consumed the request stream, so re-serialize the
  // parsed body and recompute content-length to keep the proxied request valid
  if (body) {
    headers["content-length"] = Buffer.byteLength(body);
  }
  headers.host = parsedTarget.host;

  var options = {
    hostname: parsedTarget.hostname,
    port: parsedTarget.port,
    path: req.originalUrl,
    method: req.method,
    headers: headers
  };

  var proxyReq = http.request(options, function (proxyRes) {
    res.writeHead(proxyRes.statusCode, proxyRes.headers);
    proxyRes.pipe(res);
  });

  proxyReq.on("error", function (err) {
    console.error("Proxy error to " + target + ":", err.message);
    if (!res.headersSent) {
      res.status(502).json({ error: "Environment unavailable" });
    }
  });

  if (body) {
    proxyReq.write(body);
  }

  proxyReq.end();
}

The shadowEnabled flag is critical. When shadow mode is on, the router sends a copy of sampled requests to the inactive environment without returning those responses to the user. This lets you collect quality data from both environments on identical inputs before shifting any real traffic.
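
Here is one way that sampling could be wired in, sketched on top of the routeRequest function above. The sendShadowRequest helper and the 5% sample rate are illustrative, and the hook into your quality analyzer is left as a placeholder:

var axios = require("axios");

var SHADOW_SAMPLE_RATE = 0.05; // shadow only a small fraction of traffic to limit API spend

function sendShadowRequest(req) {
  // Fire-and-forget copy to whichever environment is currently inactive.
  // The response is recorded for quality comparison, never returned to the user.
  var inactiveTarget = trafficConfig.activeEnvironment === "blue" ? GREEN_TARGET : BLUE_TARGET;

  axios({
    method: req.method,
    url: inactiveTarget + req.originalUrl,
    data: req.body,
    headers: { "Content-Type": "application/json", "X-Shadow-Request": "true" },
    timeout: 120000
  })
    .then(function (response) {
      // Placeholder: hand response.data.output to your quality analyzer here
      console.log("[shadow] " + inactiveTarget + " responded with status " + response.status);
    })
    .catch(function (err) {
      console.warn("[shadow] request failed: " + err.message);
    });
}

// Inside routeRequest, before proxying the primary request:
// if (trafficConfig.shadowEnabled && Math.random() < SHADOW_SAMPLE_RATE) {
//   sendShadowRequest(req);
// }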

Health Checks Specific to AI Features

Standard health checks verify that your service is running and can respond to HTTP requests. For AI services, you need a second tier of health checks that validate output quality.

var axios = require("axios");

var QUALITY_TEST_CASES = [
  {
    input: "Summarize: The Federal Reserve raised interest rates by 25 basis points.",
    expectedKeywords: ["Federal Reserve", "interest rates", "25 basis points"],
    minLength: 20,
    maxLength: 500
  },
  {
    input: "Classify sentiment: I absolutely love this product, best purchase ever!",
    expectedOutput: "positive",
    matchType: "contains"
  }
];

function runQualityHealthCheck(targetUrl, callback) {
  var results = [];
  var completed = 0;

  QUALITY_TEST_CASES.forEach(function (testCase, index) {
    axios.post(targetUrl + "/api/process", { input: testCase.input })
      .then(function (response) {
        var output = response.data.output || "";
        var passed = true;
        var reasons = [];

        if (testCase.expectedKeywords) {
          testCase.expectedKeywords.forEach(function (keyword) {
            if (output.indexOf(keyword) === -1) {
              passed = false;
              reasons.push("Missing keyword: " + keyword);
            }
          });
        }

        if (testCase.minLength && output.length < testCase.minLength) {
          passed = false;
          reasons.push("Output too short: " + output.length + " chars");
        }

        if (testCase.maxLength && output.length > testCase.maxLength) {
          passed = false;
          reasons.push("Output too long: " + output.length + " chars");
        }

        if (testCase.matchType === "contains" && output.toLowerCase().indexOf(testCase.expectedOutput.toLowerCase()) === -1) {
          passed = false;
          reasons.push("Expected output containing: " + testCase.expectedOutput);
        }

        results[index] = { passed: passed, reasons: reasons };
        completed++;

        if (completed === QUALITY_TEST_CASES.length) {
          var passCount = results.filter(function (r) { return r.passed; }).length;
          callback(null, {
            score: passCount / results.length,
            results: results,
            healthy: passCount / results.length >= 0.8
          });
        }
      })
      .catch(function (err) {
        results[index] = { passed: false, reasons: [err.message] };
        completed++;

        if (completed === QUALITY_TEST_CASES.length) {
          var passCount = results.filter(function (r) { return r.passed; }).length;
          callback(null, {
            score: passCount / results.length,
            results: results,
            healthy: false
          });
        }
      });
  });
}

This goes beyond checking if the server is alive. It validates that the AI is producing outputs that meet baseline quality expectations. Run these checks on both environments continuously during a deployment transition.
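
A minimal sketch of what that continuous checking could look like, assuming the BLUE_TARGET and GREEN_TARGET values from the traffic router and an arbitrary one-minute interval; the alert hook is a placeholder for your own pager or rollback wiring:

var HEALTH_CHECK_INTERVAL_MS = 60000;

function startContinuousQualityChecks() {
  // Check both environments on a fixed interval during the deployment window
  return setInterval(function () {
    ["blue", "green"].forEach(function (env) {
      var target = env === "blue" ? BLUE_TARGET : GREEN_TARGET;

      runQualityHealthCheck(target, function (err, result) {
        if (err || !result.healthy) {
          // Placeholder alert hook
          console.warn("[" + env + "] quality health check failing, score: " +
            (result ? result.score.toFixed(2) : "n/a"));
        } else {
          console.log("[" + env + "] quality health check passed, score: " + result.score.toFixed(2));
        }
      });
    });
  }, HEALTH_CHECK_INTERVAL_MS);
}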

Canary Analysis: Comparing Output Quality Between Environments

Canary analysis for AI features means running the same inputs through both blue and green environments and comparing the outputs. This is where you catch quality regressions before they affect all users.

function runCanaryComparison(sampleRequests, blueUrl, greenUrl, callback) {
  var comparisons = [];
  var completed = 0;

  sampleRequests.forEach(function (request, index) {
    var blueResult, greenResult;

    axios.post(blueUrl + "/api/process", request)
      .then(function (res) {
        blueResult = res.data;
        return axios.post(greenUrl + "/api/process", request);
      })
      .then(function (res) {
        greenResult = res.data;

        var comparison = {
          input: request.input,
          blueOutput: blueResult.output,
          greenOutput: greenResult.output,
          blueCost: blueResult.tokenUsage || 0,
          greenCost: greenResult.tokenUsage || 0,
          blueLatency: blueResult.latencyMs || 0,
          greenLatency: greenResult.latencyMs || 0,
          qualityDelta: scoreOutput(greenResult.output) - scoreOutput(blueResult.output)
        };

        comparisons[index] = comparison;
        completed++;

        if (completed === sampleRequests.length) {
          var avgDelta = comparisons.reduce(function (sum, c) {
            return sum + c.qualityDelta;
          }, 0) / comparisons.length;

          var costIncrease = comparisons.reduce(function (sum, c) {
            return sum + (c.greenCost - c.blueCost);
          }, 0) / comparisons.length;

          callback(null, {
            comparisons: comparisons,
            averageQualityDelta: avgDelta,
            averageCostDelta: costIncrease,
            recommendation: avgDelta >= -0.05 ? "proceed" : "rollback"
          });
        }
      })
      .catch(function (err) {
        comparisons[index] = { error: err.message };
        completed++;

        if (completed === sampleRequests.length) {
          callback(null, {
            comparisons: comparisons,
            recommendation: "rollback",
            reason: "Errors during canary comparison"
          });
        }
      });
  });
}

function scoreOutput(output) {
  var score = 0;
  if (output && output.length > 10) score += 0.3;
  if (output && output.length < 5000) score += 0.2;
  if (output && output.indexOf("I cannot") === -1) score += 0.2;
  if (output && output.indexOf("As an AI") === -1) score += 0.15;
  if (output && output.indexOf("undefined") === -1) score += 0.15;
  return score;
}

The scoreOutput function here is intentionally simple. In production, you would replace this with domain-specific evaluation: BLEU scores for translation, ROUGE for summarization, exact match for classification, or LLM-as-judge for open-ended generation. The key insight is that your deployment pipeline needs an automated way to score outputs, not just check status codes.
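
As one illustration of the LLM-as-judge approach, the sketch below asks a model to grade each output on a 1-5 scale and normalizes the result to 0-1. The judge prompt and model choice are assumptions, and because the call is asynchronous you would need to adapt the synchronous scoreOutput usage above to await it:

function scoreOutputWithJudge(input, output) {
  var judgePrompt =
    "Rate the following response to the given input on a scale of 1 to 5 " +
    "for accuracy, relevance, and completeness. Reply with only the number.\n\n" +
    "Input: " + input + "\n\nResponse: " + output;

  return axios.post("https://api.openai.com/v1/chat/completions", {
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: judgePrompt }],
    temperature: 0,
    max_tokens: 5
  }, {
    headers: { "Authorization": "Bearer " + process.env.OPENAI_API_KEY }
  }).then(function (response) {
    var raw = response.data.choices[0].message.content.trim();
    var rating = parseInt(raw, 10);
    if (isNaN(rating)) return 0; // unparseable judge output counts as a failure
    return (rating - 1) / 4;     // map 1-5 onto 0-1
  });
}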

Gradual Traffic Shifting

Never switch 100% of traffic at once. A proper blue-green deployment for AI features follows a ramp schedule:

var RAMP_SCHEDULE = [
  { percentage: 10, durationMinutes: 15, minSamples: 50 },
  { percentage: 25, durationMinutes: 15, minSamples: 100 },
  { percentage: 50, durationMinutes: 30, minSamples: 250 },
  { percentage: 100, durationMinutes: 0, minSamples: 0 }
];

var deploymentState = {
  currentStep: -1,
  stepStartTime: null,
  samplesCollected: 0,
  qualityScores: [],
  costSamples: [],
  errorCount: 0,
  totalRequests: 0,
  // Baselines captured from a quality check against blue before the ramp starts;
  // evaluateCurrentMetrics() below compares green's averages against these values
  // (the numbers shown are placeholders)
  baselineQuality: 0.85,
  baselineCost: 0.002
};

function advanceTrafficRamp() {
  var nextStep = deploymentState.currentStep + 1;

  if (nextStep >= RAMP_SCHEDULE.length) {
    console.log("Deployment complete. Green is now primary.");
    trafficConfig.activeEnvironment = "green";
    return { status: "complete" };
  }

  var step = RAMP_SCHEDULE[nextStep];
  var currentMetrics = evaluateCurrentMetrics();

  if (deploymentState.currentStep >= 0) {
    var currentStepConfig = RAMP_SCHEDULE[deploymentState.currentStep];
    var elapsed = (Date.now() - deploymentState.stepStartTime) / 60000;

    if (elapsed < currentStepConfig.durationMinutes) {
      return {
        status: "waiting",
        minutesRemaining: Math.ceil(currentStepConfig.durationMinutes - elapsed)
      };
    }

    if (deploymentState.samplesCollected < currentStepConfig.minSamples) {
      return {
        status: "waiting",
        reason: "Insufficient samples",
        collected: deploymentState.samplesCollected,
        required: currentStepConfig.minSamples
      };
    }

    if (currentMetrics.shouldRollback) {
      return triggerRollback(currentMetrics.rollbackReason);
    }
  }

  deploymentState.currentStep = nextStep;
  deploymentState.stepStartTime = Date.now();
  deploymentState.samplesCollected = 0;
  trafficConfig.greenPercentage = step.percentage;

  console.log("Advanced to " + step.percentage + "% green traffic");

  return {
    status: "advanced",
    percentage: step.percentage,
    nextAdvanceMinutes: step.durationMinutes
  };
}

Each step has a minimum duration and minimum number of samples before the system will consider advancing. This prevents you from ramping to 100% on five requests that happened to look fine.
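
If you want the ramp to progress without someone calling it by hand, one option is to poll advanceTrafficRamp on a timer. The one-minute interval below is arbitrary; the schedule's own hold times and sample minimums do the real gating:

// Poll the ramp every minute; advanceTrafficRamp() refuses to advance until
// the current step's duration and sample requirements are met.
var rampTimer = setInterval(function () {
  var result = advanceTrafficRamp();

  if (result.status === "complete" || result.status === "rolled_back") {
    clearInterval(rampTimer);
    console.log("Ramp finished with status: " + result.status);
  }
}, 60000);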

Automated Rollback Triggers

Rollback triggers for AI deployments need to monitor three things that traditional deployments ignore: quality degradation, cost spikes, and behavioral anomalies.

var ROLLBACK_THRESHOLDS = {
  qualityDropPercent: 10,
  costIncreasePercent: 50,
  errorRatePercent: 5,
  latencyIncreasePercent: 200,
  refusalRatePercent: 15
};

function evaluateCurrentMetrics() {
  var avgQuality = 0;
  var avgCost = 0;
  var errorRate = 0;
  var refusalRate = 0;

  if (deploymentState.qualityScores.length > 0) {
    avgQuality = deploymentState.qualityScores.reduce(function (a, b) { return a + b; }, 0)
      / deploymentState.qualityScores.length;
  }

  if (deploymentState.costSamples.length > 0) {
    avgCost = deploymentState.costSamples.reduce(function (a, b) { return a + b; }, 0)
      / deploymentState.costSamples.length;
  }

  if (deploymentState.totalRequests > 0) {
    errorRate = (deploymentState.errorCount / deploymentState.totalRequests) * 100;
  }

  var shouldRollback = false;
  var rollbackReason = "";

  if (avgQuality < (1 - ROLLBACK_THRESHOLDS.qualityDropPercent / 100) * deploymentState.baselineQuality) {
    shouldRollback = true;
    rollbackReason = "Quality dropped " + Math.round((1 - avgQuality / deploymentState.baselineQuality) * 100) + "% below baseline";
  }

  if (avgCost > deploymentState.baselineCost * (1 + ROLLBACK_THRESHOLDS.costIncreasePercent / 100)) {
    shouldRollback = true;
    rollbackReason = "Cost increased " + Math.round((avgCost / deploymentState.baselineCost - 1) * 100) + "% above baseline";
  }

  if (errorRate > ROLLBACK_THRESHOLDS.errorRatePercent) {
    shouldRollback = true;
    rollbackReason = "Error rate at " + errorRate.toFixed(1) + "%, threshold is " + ROLLBACK_THRESHOLDS.errorRatePercent + "%";
  }

  return {
    avgQuality: avgQuality,
    avgCost: avgCost,
    errorRate: errorRate,
    refusalRate: refusalRate,
    shouldRollback: shouldRollback,
    rollbackReason: rollbackReason
  };
}

function triggerRollback(reason) {
  console.error("ROLLBACK TRIGGERED: " + reason);
  trafficConfig.greenPercentage = 0;
  trafficConfig.activeEnvironment = "blue";
  deploymentState.currentStep = -1;

  return {
    status: "rolled_back",
    reason: reason,
    timestamp: new Date().toISOString()
  };
}

The refusal rate check is particularly important for LLM-based features. If a new prompt or model version starts refusing legitimate requests at a higher rate, that is a quality regression even though the responses are technically valid HTTP 200s.
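
The evaluateCurrentMetrics function above declares refusalRate but never populates it. The sketch below shows one heuristic way to detect refusals and feed the rate into the same threshold check; the phrase list is an assumption you would tune for your own prompts:

var REFUSAL_PATTERNS = [
  "i can't help with",
  "i cannot help with",
  "i'm unable to",
  "i am unable to",
  "i cannot assist"
];

function looksLikeRefusal(output) {
  var lower = (output || "").toLowerCase();
  return REFUSAL_PATTERNS.some(function (pattern) {
    return lower.indexOf(pattern) !== -1;
  });
}

// Wherever green responses are recorded, also count refusals:
// if (looksLikeRefusal(output)) {
//   deploymentState.refusalCount = (deploymentState.refusalCount || 0) + 1;
// }
//
// Then inside evaluateCurrentMetrics():
// if (deploymentState.totalRequests > 0) {
//   refusalRate = ((deploymentState.refusalCount || 0) / deploymentState.totalRequests) * 100;
// }
// if (refusalRate > ROLLBACK_THRESHOLDS.refusalRatePercent) {
//   shouldRollback = true;
//   rollbackReason = "Refusal rate at " + refusalRate.toFixed(1) + "%, threshold is " +
//     ROLLBACK_THRESHOLDS.refusalRatePercent + "%";
// }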

Database Considerations During Blue-Green

When both environments share a database, you need to handle schema migrations carefully. For AI features, the most common shared state is:

  • Embedding indexes: If your green environment uses a different embedding model, the vector dimensions will not match. You need either separate vector indexes or a migration plan that re-embeds existing data.
  • Conversation history: Both environments must read the same conversation format. If your green environment changes the conversation schema, it needs backward compatibility.
  • Cached responses: AI response caches keyed by input hash will have different outputs per environment. Namespace your caches by environment version.

var ENVIRONMENT_VERSION = process.env.DEPLOY_VERSION || "v1";

function getCacheKey(input) {
  var crypto = require("crypto");
  var inputHash = crypto.createHash("sha256").update(input).digest("hex");
  return ENVIRONMENT_VERSION + ":" + inputHash;
}

function getCachedResponse(redisClient, input, callback) {
  var key = getCacheKey(input);
  redisClient.get(key, function (err, cached) {
    if (err || !cached) {
      callback(null, null);
      return;
    }
    callback(null, JSON.parse(cached));
  });
}

By namespacing cache keys with the deployment version, both environments can run against the same Redis instance without polluting each other's caches.
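
For completeness, here is a matching write path under the same assumptions (a callback-style node_redis client); the one-hour TTL is an arbitrary choice:

function setCachedResponse(redisClient, input, response, callback) {
  var key = getCacheKey(input);
  // Expire after one hour so outputs from an old prompt or model version
  // do not linger indefinitely
  redisClient.set(key, JSON.stringify(response), "EX", 3600, function (err) {
    callback(err || null);
  });
}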

Handling Long-Running AI Requests During Switchover

LLM requests can take 5-30 seconds. If you switch traffic mid-request, in-flight requests to the old environment must complete gracefully. This requires connection draining.

var activeConnections = { blue: 0, green: 0 };

function trackConnection(environment, delta) {
  activeConnections[environment] += delta;
}

function waitForDrain(environment, timeoutMs, callback) {
  var startTime = Date.now();

  var checkInterval = setInterval(function () {
    if (activeConnections[environment] === 0) {
      clearInterval(checkInterval);
      callback(null, { drained: true, elapsed: Date.now() - startTime });
      return;
    }

    if (Date.now() - startTime > timeoutMs) {
      clearInterval(checkInterval);
      console.warn(
        "Drain timeout for " + environment + ". " +
        activeConnections[environment] + " connections remaining."
      );
      callback(null, {
        drained: false,
        remaining: activeConnections[environment],
        elapsed: timeoutMs
      });
      return;
    }
  }, 1000);
}

function safeSwitchover(callback) {
  trafficConfig.greenPercentage = 100;
  console.log("Routing all new traffic to green. Draining blue...");

  waitForDrain("blue", 60000, function (err, result) {
    if (result.drained) {
      console.log("Blue fully drained. Switchover complete.");
      trafficConfig.activeEnvironment = "green";
    } else {
      console.warn("Blue did not fully drain. " + result.remaining + " connections still active.");
      trafficConfig.activeEnvironment = "green";
    }
    callback(null, result);
  });
}

The drain timeout should be set to at least three times your maximum expected AI request duration. For streaming responses, you may need to wait even longer.

Blue-Green for Prompt and Model Configuration Changes

Not every blue-green deployment involves a code change. Some of the riskiest deployments for AI features are pure configuration changes: swapping models, updating system prompts, or changing temperature parameters.

Store your AI configuration outside your code so you can deploy configuration changes through the same blue-green pipeline:

var aiConfig = {
  blue: {
    model: "gpt-4o",
    systemPrompt: "You are a helpful assistant for our e-commerce platform.",
    temperature: 0.3,
    maxTokens: 1024
  },
  green: {
    model: "gpt-4o-mini",
    systemPrompt: "You are a helpful and concise assistant for our e-commerce platform. Keep responses under 200 words.",
    temperature: 0.2,
    maxTokens: 512
  }
};

function getAIConfig(environment) {
  return aiConfig[environment] || aiConfig.blue;
}

function processRequest(input, environment) {
  var config = getAIConfig(environment);

  return {
    model: config.model,
    messages: [
      { role: "system", content: config.systemPrompt },
      { role: "user", content: input }
    ],
    temperature: config.temperature,
    max_tokens: config.maxTokens
  };
}

This lets you test the exact same code with different model configurations, isolating whether quality changes come from your code or your AI settings.

Implementing with DigitalOcean App Platform

DigitalOcean App Platform does not natively support blue-green deployments, but you can implement them with two apps and a load balancer or CDN in front:

# .do/app-blue.yaml
name: myapp-blue
services:
  - name: api
    github:
      repo: myorg/myapp
      branch: main
      deploy_on_push: false
    envs:
      - key: DEPLOY_ENV
        value: "blue"
      - key: AI_MODEL
        value: "gpt-4o"
    http_port: 8080
    instance_count: 2
    instance_size_slug: professional-s

# .do/app-green.yaml
name: myapp-green
services:
  - name: api
    github:
      repo: myorg/myapp
      branch: main
      deploy_on_push: false
    envs:
      - key: DEPLOY_ENV
        value: "green"
      - key: AI_MODEL
        value: "gpt-4o-mini"
    http_port: 8080
    instance_count: 2
    instance_size_slug: professional-s

Use the DigitalOcean API or doctl CLI to trigger deployments on the green app, run your quality checks, and then update your DNS or CDN origin to point at the green app. The blue app stays running as your instant rollback target.
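
A sketch of how that trigger could be scripted from Node against the DigitalOcean API. GREEN_APP_ID and DO_API_TOKEN are placeholders for your own values, and you should confirm the exact request and response shape against the current App Platform API documentation:

var axios = require("axios");

// Create a new deployment on the green app, then run your quality checks
// before pointing DNS or your CDN origin at it.
function deployGreenApp() {
  return axios.post(
    "https://api.digitalocean.com/v2/apps/" + process.env.GREEN_APP_ID + "/deployments",
    {},
    { headers: { "Authorization": "Bearer " + process.env.DO_API_TOKEN } }
  ).then(function (response) {
    console.log("Green deployment created: " + response.data.deployment.id);
    return response.data.deployment;
  });
}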

Implementing with Docker and nginx

For self-managed infrastructure, Docker Compose with nginx gives you full control over traffic routing:

# nginx.conf
upstream blue_backend {
    server blue-app:8080;
}

upstream green_backend {
    server green-app:8080;
}

split_clients "${remote_addr}${uri}" $backend {
    10%   green_backend;
    *     blue_backend;
}

server {
    listen 80;

    location / {
        proxy_pass http://$backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Deploy-Env $backend;
        proxy_read_timeout 120s;
        proxy_connect_timeout 10s;
    }

    location /health {
        proxy_pass http://blue_backend/health;
    }
}

# docker-compose.yml
version: "3.8"
services:
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf
    depends_on:
      - blue-app
      - green-app

  blue-app:
    build: .
    environment:
      - DEPLOY_ENV=blue
      - AI_MODEL=gpt-4o
      - PORT=8080

  green-app:
    build: .
    environment:
      - DEPLOY_ENV=green
      - AI_MODEL=gpt-4o-mini
      - PORT=8080

Adjust the split_clients percentage in nginx.conf and reload nginx to shift traffic. The proxy_read_timeout at 120 seconds is important for AI features since LLM calls can be slow, especially for complex reasoning tasks.
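
If you would rather drive that adjustment from your deployment controller than edit the file by hand, a sketch along these lines regenerates the config from a template and hot-reloads nginx. The template path, the __GREEN_PERCENT__ placeholder, and the docker compose invocation are assumptions about your setup:

var fs = require("fs");
var childProcess = require("child_process");

// Rewrite nginx.conf from a template with a new green percentage, then
// signal nginx to reload its configuration without dropping connections.
function setGreenTrafficPercent(percent) {
  var template = fs.readFileSync("nginx.conf.template", "utf8");
  var rendered = template.replace("__GREEN_PERCENT__", percent + "%");
  fs.writeFileSync("nginx.conf", rendered);

  childProcess.exec("docker compose exec -T nginx nginx -s reload", function (err, stdout, stderr) {
    if (err) {
      console.error("nginx reload failed: " + (stderr || err.message));
      return;
    }
    console.log("nginx reloaded with " + percent + "% traffic to green");
  });
}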

Cost Management

Running two production environments doubles your infrastructure cost during the deployment window. Strategies to manage this:

  • Scale down the idle environment. After a successful switchover, reduce the old environment to a single minimum instance rather than shutting it down completely. This keeps rollback fast without paying for full capacity.
  • Time-box deployments. Set a maximum deployment window (e.g., 2 hours). If canary analysis has not completed by then, either commit or roll back.
  • Share expensive resources. Both environments should use the same database, cache layer, and object storage. Only the application servers need to be duplicated.
  • Monitor AI API costs separately per environment. Tag your LLM API calls with the environment identifier so you can see if the green environment is consuming significantly more tokens.

function callAIProvider(params, environment) {
  var startTime = Date.now();

  return axios.post("https://api.openai.com/v1/chat/completions", params, {
    headers: {
      "Authorization": "Bearer " + process.env.OPENAI_API_KEY,
      "X-Deploy-Env": environment
    }
  }).then(function (response) {
    var usage = response.data.usage || {};
    var cost = estimateCost(params.model, usage.prompt_tokens, usage.completion_tokens);

    logMetric("ai.cost", cost, { environment: environment, model: params.model });
    logMetric("ai.latency", Date.now() - startTime, { environment: environment });
    logMetric("ai.tokens.prompt", usage.prompt_tokens, { environment: environment });
    logMetric("ai.tokens.completion", usage.completion_tokens, { environment: environment });

    return response.data;
  });
}

function estimateCost(model, promptTokens, completionTokens) {
  var rates = {
    "gpt-4o": { prompt: 2.50 / 1000000, completion: 10.00 / 1000000 },
    "gpt-4o-mini": { prompt: 0.15 / 1000000, completion: 0.60 / 1000000 }
  };

  var rate = rates[model] || rates["gpt-4o"];
  return (promptTokens * rate.prompt) + (completionTokens * rate.completion);
}

function logMetric(name, value, tags) {
  console.log(JSON.stringify({
    metric: name,
    value: value,
    tags: tags,
    timestamp: Date.now()
  }));
}

Monitoring During Deployment Transitions

During a blue-green transition, you need real-time visibility into four categories of metrics across both environments:

  1. Availability: Error rates, response times, timeout rates
  2. Quality: Output relevance scores, hallucination detection flags, refusal rates
  3. Cost: Token consumption per request, total API spend per minute
  4. Behavioral: Output length distribution, response format consistency, model-specific anomalies

Build a monitoring dashboard that shows these metrics side-by-side for blue and green:

var deploymentMetrics = {
  blue: { requests: 0, errors: 0, totalLatency: 0, totalCost: 0, qualityScores: [] },
  green: { requests: 0, errors: 0, totalLatency: 0, totalCost: 0, qualityScores: [] }
};

function recordRequestMetric(environment, data) {
  var env = deploymentMetrics[environment];
  env.requests++;

  if (data.error) {
    env.errors++;
  }

  env.totalLatency += data.latencyMs || 0;
  env.totalCost += data.cost || 0;

  if (data.qualityScore !== undefined) {
    env.qualityScores.push(data.qualityScore);
  }
}

function getDeploymentDashboard() {
  var result = {};

  ["blue", "green"].forEach(function (env) {
    var m = deploymentMetrics[env];
    var avgQuality = m.qualityScores.length > 0
      ? m.qualityScores.reduce(function (a, b) { return a + b; }, 0) / m.qualityScores.length
      : null;

    result[env] = {
      requests: m.requests,
      errorRate: m.requests > 0 ? ((m.errors / m.requests) * 100).toFixed(2) + "%" : "N/A",
      avgLatency: m.requests > 0 ? Math.round(m.totalLatency / m.requests) + "ms" : "N/A",
      totalCost: "$" + m.totalCost.toFixed(4),
      avgQuality: avgQuality !== null ? avgQuality.toFixed(3) : "N/A",
      sampleCount: m.qualityScores.length
    };
  });

  return result;
}
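
If these metrics live in the same process as your traffic controller, exposing the dashboard is a single route (the path is arbitrary, and app is assumed to be your Express instance):

// Let an operator or a status page poll the side-by-side metrics during the transition
app.get("/deploy/dashboard", function (req, res) {
  res.json(getDeploymentDashboard());
});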

Complete Working Example

Here is a full Express.js deployment controller that ties everything together. This runs as a standalone service that sits in front of your blue and green application instances.

var express = require("express");
var axios = require("axios");
var crypto = require("crypto");

var app = express();
app.use(express.json());

// --- Configuration ---

var BLUE_URL = process.env.BLUE_URL || "http://localhost:3001";
var GREEN_URL = process.env.GREEN_URL || "http://localhost:3002";
var ADMIN_TOKEN = process.env.ADMIN_TOKEN || "change-me-in-production";
var PORT = process.env.CONTROLLER_PORT || 4000;

var RAMP_STEPS = [
  { percent: 10, holdMinutes: 15, minSamples: 50 },
  { percent: 25, holdMinutes: 15, minSamples: 100 },
  { percent: 50, holdMinutes: 30, minSamples: 250 },
  { percent: 100, holdMinutes: 0, minSamples: 0 }
];

var THRESHOLDS = {
  maxQualityDrop: 0.10,
  maxCostIncrease: 0.50,
  maxErrorRate: 0.05,
  maxLatencyIncrease: 2.0
};

// --- State ---

var state = {
  active: "blue",
  greenPercent: 0,
  rampStep: -1,
  rampStartedAt: null,
  stepStartedAt: null,
  deploying: false,
  baseline: { quality: 0.85, cost: 0.002, latency: 1500 },
  metrics: {
    blue: { requests: 0, errors: 0, latencySum: 0, costSum: 0, qualityScores: [] },
    green: { requests: 0, errors: 0, latencySum: 0, costSum: 0, qualityScores: [] }
  }
};

// --- Auth Middleware ---

function requireAuth(req, res, next) {
  var token = req.headers["x-admin-token"];
  if (token !== ADMIN_TOKEN) {
    return res.status(401).json({ error: "Unauthorized" });
  }
  next();
}

// --- Traffic Routing ---

app.all("/api/*", function (req, res) {
  var target = BLUE_URL;

  if (state.greenPercent > 0 && Math.random() * 100 < state.greenPercent) {
    target = GREEN_URL;
  }

  var env = target === BLUE_URL ? "blue" : "green";
  var startTime = Date.now();

  axios({
    method: req.method,
    url: target + req.originalUrl,
    data: req.body,
    headers: {
      "Content-Type": "application/json",
      "X-Deploy-Env": env,
      "X-Request-Id": crypto.randomUUID()
    },
    timeout: 120000
  })
    .then(function (response) {
      var latency = Date.now() - startTime;
      var cost = response.data._cost || 0;
      var quality = response.data._qualityScore || null;

      state.metrics[env].requests++;
      state.metrics[env].latencySum += latency;
      state.metrics[env].costSum += cost;
      if (quality !== null) {
        state.metrics[env].qualityScores.push(quality);
      }

      delete response.data._cost;
      delete response.data._qualityScore;

      res.status(response.status).json(response.data);
    })
    .catch(function (err) {
      var latency = Date.now() - startTime;
      state.metrics[env].requests++;
      state.metrics[env].errors++;
      state.metrics[env].latencySum += latency;

      console.error("[" + env + "] Request failed:", err.message);
      res.status(502).json({ error: "Backend unavailable", environment: env });
    });
});

// --- Deployment Control Endpoints ---

app.post("/deploy/start", requireAuth, function (req, res) {
  if (state.deploying) {
    return res.status(409).json({ error: "Deployment already in progress" });
  }

  state.deploying = true;
  state.rampStep = -1;
  state.rampStartedAt = Date.now();
  state.metrics.green = { requests: 0, errors: 0, latencySum: 0, costSum: 0, qualityScores: [] };

  console.log("Deployment started. Running baseline quality check...");

  runQualityCheck(BLUE_URL, function (err, baseline) {
    if (err) {
      state.deploying = false;
      return res.status(500).json({ error: "Baseline check failed", details: err.message });
    }

    state.baseline.quality = baseline.score;
    console.log("Baseline quality: " + baseline.score.toFixed(3));

    var advanceResult = advanceRamp();
    res.json({
      status: "deployment_started",
      baseline: state.baseline,
      ramp: advanceResult
    });
  });
});

app.post("/deploy/advance", requireAuth, function (req, res) {
  if (!state.deploying) {
    return res.status(400).json({ error: "No deployment in progress" });
  }

  var result = advanceRamp();
  res.json(result);
});

app.post("/deploy/rollback", requireAuth, function (req, res) {
  var result = executeRollback(req.body.reason || "Manual rollback");
  res.json(result);
});

app.get("/deploy/status", requireAuth, function (req, res) {
  var greenMetrics = computeEnvMetrics("green");
  var blueMetrics = computeEnvMetrics("blue");

  res.json({
    deploying: state.deploying,
    active: state.active,
    greenPercent: state.greenPercent,
    rampStep: state.rampStep,
    rampStepCount: RAMP_STEPS.length,
    elapsedMinutes: state.rampStartedAt
      ? Math.round((Date.now() - state.rampStartedAt) / 60000)
      : 0,
    metrics: { blue: blueMetrics, green: greenMetrics },
    thresholds: THRESHOLDS
  });
});

// --- Ramp Logic ---

function advanceRamp() {
  var nextStep = state.rampStep + 1;

  if (nextStep >= RAMP_STEPS.length) {
    state.active = "green";
    state.deploying = false;
    console.log("Deployment complete. Green is now primary.");
    return { status: "complete", active: "green" };
  }

  if (state.rampStep >= 0) {
    var current = RAMP_STEPS[state.rampStep];
    var elapsed = (Date.now() - state.stepStartedAt) / 60000;

    if (elapsed < current.holdMinutes) {
      return {
        status: "holding",
        currentPercent: state.greenPercent,
        minutesRemaining: Math.ceil(current.holdMinutes - elapsed)
      };
    }

    var greenSamples = state.metrics.green.qualityScores.length;
    if (greenSamples < current.minSamples) {
      return {
        status: "waiting_for_samples",
        collected: greenSamples,
        required: current.minSamples
      };
    }

    var rollbackCheck = checkRollbackConditions();
    if (rollbackCheck.shouldRollback) {
      return executeRollback(rollbackCheck.reason);
    }
  }

  state.rampStep = nextStep;
  state.stepStartedAt = Date.now();
  state.greenPercent = RAMP_STEPS[nextStep].percent;

  console.log("Ramp advanced: " + state.greenPercent + "% traffic to green (step " + (nextStep + 1) + "/" + RAMP_STEPS.length + ")");

  return {
    status: "advanced",
    step: nextStep + 1,
    totalSteps: RAMP_STEPS.length,
    greenPercent: state.greenPercent,
    holdMinutes: RAMP_STEPS[nextStep].holdMinutes
  };
}

function checkRollbackConditions() {
  var greenMetrics = computeEnvMetrics("green");

  if (greenMetrics.avgQuality !== null && greenMetrics.avgQuality < state.baseline.quality * (1 - THRESHOLDS.maxQualityDrop)) {
    return {
      shouldRollback: true,
      reason: "Quality degraded: " + greenMetrics.avgQuality.toFixed(3) +
        " vs baseline " + state.baseline.quality.toFixed(3) +
        " (threshold: " + (THRESHOLDS.maxQualityDrop * 100) + "% drop)"
    };
  }

  if (greenMetrics.avgCost > state.baseline.cost * (1 + THRESHOLDS.maxCostIncrease)) {
    return {
      shouldRollback: true,
      reason: "Cost spike: $" + greenMetrics.avgCost.toFixed(4) +
        " vs baseline $" + state.baseline.cost.toFixed(4)
    };
  }

  if (greenMetrics.errorRate > THRESHOLDS.maxErrorRate) {
    return {
      shouldRollback: true,
      reason: "Error rate: " + (greenMetrics.errorRate * 100).toFixed(1) +
        "% exceeds threshold " + (THRESHOLDS.maxErrorRate * 100) + "%"
    };
  }

  if (greenMetrics.avgLatency > state.baseline.latency * THRESHOLDS.maxLatencyIncrease) {
    return {
      shouldRollback: true,
      reason: "Latency spike: " + Math.round(greenMetrics.avgLatency) +
        "ms vs baseline " + Math.round(state.baseline.latency) + "ms"
    };
  }

  return { shouldRollback: false };
}

function executeRollback(reason) {
  console.error("ROLLBACK: " + reason);
  state.greenPercent = 0;
  state.active = "blue";
  state.deploying = false;
  state.rampStep = -1;

  return {
    status: "rolled_back",
    reason: reason,
    timestamp: new Date().toISOString()
  };
}

// --- Metrics ---

function computeEnvMetrics(env) {
  var m = state.metrics[env];
  var avgQuality = m.qualityScores.length > 0
    ? m.qualityScores.reduce(function (a, b) { return a + b; }, 0) / m.qualityScores.length
    : null;

  return {
    requests: m.requests,
    errors: m.errors,
    errorRate: m.requests > 0 ? m.errors / m.requests : 0,
    avgLatency: m.requests > 0 ? m.latencySum / m.requests : 0,
    avgCost: m.requests > 0 ? m.costSum / m.requests : 0,
    avgQuality: avgQuality,
    qualitySamples: m.qualityScores.length
  };
}

// --- Quality Check ---

function runQualityCheck(targetUrl, callback) {
  var testCases = [
    { input: "Summarize: Node.js is a JavaScript runtime built on Chrome's V8 engine.", minLength: 10 },
    { input: "Classify: This product is terrible and broke immediately.", expected: "negative" },
    { input: "Extract keywords: Machine learning models require large datasets for training.", keywords: ["machine learning", "datasets", "training"] }
  ];

  var passed = 0;
  var completed = 0;

  testCases.forEach(function (tc) {
    axios.post(targetUrl + "/api/process", { input: tc.input }, { timeout: 30000 })
      .then(function (res) {
        var output = (res.data.output || "").toLowerCase();
        var ok = true;

        if (tc.minLength && output.length < tc.minLength) ok = false;
        if (tc.expected && output.indexOf(tc.expected) === -1) ok = false;
        if (tc.keywords) {
          tc.keywords.forEach(function (kw) {
            if (output.indexOf(kw) === -1) ok = false;
          });
        }

        if (ok) passed++;
        completed++;
        if (completed === testCases.length) {
          callback(null, { score: passed / testCases.length, total: testCases.length, passed: passed });
        }
      })
      .catch(function (err) {
        completed++;
        if (completed === testCases.length) {
          callback(null, { score: passed / testCases.length, total: testCases.length, passed: passed });
        }
      });
  });
}

// --- Start ---

app.listen(PORT, function () {
  console.log("Blue-green deployment controller running on port " + PORT);
  console.log("Blue: " + BLUE_URL);
  console.log("Green: " + GREEN_URL);
  console.log("Active: " + state.active);
});

To use this controller:

  1. Start your blue and green application instances on separate ports
  2. Start the controller: BLUE_URL=http://localhost:3001 GREEN_URL=http://localhost:3002 ADMIN_TOKEN=secret node controller.js
  3. Begin deployment: curl -X POST -H "X-Admin-Token: secret" http://localhost:4000/deploy/start
  4. Check status: curl -H "X-Admin-Token: secret" http://localhost:4000/deploy/status
  5. Advance ramp: curl -X POST -H "X-Admin-Token: secret" http://localhost:4000/deploy/advance
  6. Emergency rollback: curl -X POST -H "X-Admin-Token: secret" -d '{"reason":"manual"}' http://localhost:4000/deploy/rollback

Common Issues and Troubleshooting

1. Shadow Requests Doubling Your AI API Costs

Error: Monthly API budget exceeded. Current spend: $2,847.52 (budget: $1,500)

Shadow mode sends duplicate requests to the inactive environment for quality comparison. If you forget to account for this, you will blow through your LLM API budget. Fix this by sampling shadow requests at a configurable rate (e.g., 5% of traffic) rather than shadowing everything:

var SHADOW_SAMPLE_RATE = 0.05;

function shouldShadow() {
  return trafficConfig.shadowEnabled && Math.random() < SHADOW_SAMPLE_RATE;
}

2. Quality Checks Timing Out Against Cold Environments

Error: ECONNREFUSED 127.0.0.1:3002 - connect ECONNREFUSED

or

Error: timeout of 30000ms exceeded

The green environment may not be ready when you start the deployment. AI services often have longer startup times because they need to load tokenizers, warm caches, or establish connection pools to embedding databases. Always run a readiness check before starting canary analysis:

function waitForReady(url, maxWaitMs, callback) {
  var start = Date.now();
  var interval = setInterval(function () {
    axios.get(url + "/health", { timeout: 5000 })
      .then(function () {
        clearInterval(interval);
        callback(null, true);
      })
      .catch(function () {
        if (Date.now() - start > maxWaitMs) {
          clearInterval(interval);
          callback(new Error("Environment not ready after " + maxWaitMs + "ms"));
        }
      });
  }, 3000);
}

3. Inconsistent Quality Scores Due to LLM Non-Determinism

Canary comparison inconclusive: quality delta oscillating between -0.12 and +0.08

LLMs are not deterministic even with temperature set to 0. The same input can produce different outputs across calls, which makes quality comparison noisy. Increase your minimum sample size and use statistical significance testing rather than raw averages:

function isSignificantDegradation(scores, baseline, minSamples) {
  if (scores.length < minSamples) return { significant: false, reason: "insufficient samples" };

  var mean = scores.reduce(function (a, b) { return a + b; }, 0) / scores.length;
  var variance = scores.reduce(function (sum, s) {
    return sum + Math.pow(s - mean, 2);
  }, 0) / scores.length;
  var stdErr = Math.sqrt(variance / scores.length);
  var zScore = (baseline - mean) / stdErr;

  return {
    significant: zScore > 1.96,
    mean: mean,
    baseline: baseline,
    zScore: zScore,
    reason: zScore > 1.96 ? "Statistically significant quality drop" : "Within normal variance"
  };
}

4. Embedding Dimension Mismatch After Model Upgrade

Error: pgvector: expected 1536 dimensions, got 3072
PostgreSQL error: column "embedding" has type vector(1536) but expression has type vector(3072)

If your green environment uses a different embedding model (e.g., switching from text-embedding-ada-002 at 1536 dimensions to text-embedding-3-large at 3072 dimensions), all existing embeddings become incompatible. You cannot simply swap the model. Instead, create a parallel vector column or table for the new model and backfill asynchronously:

// During deployment, use environment-aware column selection
function getEmbeddingColumn(environment) {
  if (environment === "green") {
    return "embedding_v2";
  }
  return "embedding_v1";
}
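
A rough sketch of the asynchronous backfill, assuming a node-postgres pool, the OpenAI embeddings endpoint, and illustrative documents table and embedding_v2 column names:

var axios = require("axios");
var { Pool } = require("pg");

var pool = new Pool(); // connection settings come from the standard PG* environment variables

// Backfill embedding_v2 in small batches so the job can run alongside live traffic.
// Returns the number of rows processed; call repeatedly until it returns 0.
function backfillBatch(batchSize) {
  return pool.query(
    "SELECT id, content FROM documents WHERE embedding_v2 IS NULL LIMIT $1",
    [batchSize]
  ).then(function (result) {
    if (result.rows.length === 0) return 0;

    var updates = result.rows.map(function (row) {
      return axios.post("https://api.openai.com/v1/embeddings", {
        model: "text-embedding-3-large",
        input: row.content
      }, {
        headers: { "Authorization": "Bearer " + process.env.OPENAI_API_KEY }
      }).then(function (response) {
        var vector = response.data.data[0].embedding;
        // pgvector accepts the '[0.1, 0.2, ...]' text format produced by JSON.stringify
        return pool.query(
          "UPDATE documents SET embedding_v2 = $1 WHERE id = $2",
          [JSON.stringify(vector), row.id]
        );
      });
    });

    return Promise.all(updates).then(function () { return result.rows.length; });
  });
}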

5. Memory Exhaustion From Accumulating Quality Metrics

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

If your deployment runs for hours and you are storing every quality score in an in-memory array, you will eventually run out of heap space. Use a circular buffer or periodic aggregation:

function addQualityScore(env, score) {
  var MAX_SCORES = 10000;
  state.metrics[env].qualityScores.push(score);

  if (state.metrics[env].qualityScores.length > MAX_SCORES) {
    var removed = state.metrics[env].qualityScores.splice(0, MAX_SCORES / 2);
    var removedAvg = removed.reduce(function (a, b) { return a + b; }, 0) / removed.length;
    state.metrics[env].aggregatedQuality = state.metrics[env].aggregatedQuality || [];
    state.metrics[env].aggregatedQuality.push({ avg: removedAvg, count: removed.length });
  }
}

Best Practices

  • Always establish a baseline before starting a deployment. Run your quality checks against the current blue environment and record the scores. Without a baseline, you have no reference point for detecting degradation in green.

  • Tag every AI API call with the deployment environment. This lets you separate costs, latency, and error rates by environment in your monitoring system. When something goes wrong at 3 AM, you need to know instantly whether it is a blue or green problem.

  • Set your drain timeout to at least 3x your maximum expected AI response time. Streaming LLM responses to complex prompts with large context windows can take 60 seconds or more. A 30-second drain timeout will forcibly terminate in-flight requests and cause user-visible errors.

  • Never run canary analysis on fewer than 50 samples per ramp step. LLM outputs are inherently variable. Small sample sizes will produce noisy quality metrics that lead to both false positives (unnecessary rollbacks) and false negatives (missed regressions).

  • Use separate cache namespaces per environment. If blue and green share a response cache without environment-scoped keys, green will serve cached blue responses and vice versa. This completely undermines your quality comparison because you are not actually testing the new configuration.

  • Automate the ramp schedule but require manual approval for the final step. Going from 50% to 100% is the point of no return. Having a human review the canary metrics before committing to full traffic prevents automation bugs from completing a bad deployment.

  • Keep the old environment running for at least 24 hours after full switchover. Some quality issues only manifest on specific types of inputs that may not appear during the canary window. A 24-hour buffer gives you time to catch long-tail regressions.

  • Version your prompt templates and model configurations alongside your code. Treat prompt changes as code changes that go through the same blue-green pipeline. A prompt change that introduces subtle hallucinations is just as dangerous as a code bug.

  • Monitor cost per request, not just total cost. If green uses a cheaper model but generates 3x more tokens per request due to a verbose prompt, total cost may still increase. Per-request cost metrics catch this immediately.

  • Test your rollback mechanism regularly. Run practice rollbacks at least monthly. A rollback that has never been tested is a rollback that will fail when you need it most.
