A/B Testing LLM Responses in Production

Build A/B testing for LLM features with experiment frameworks, user bucketing, statistical analysis, and rollout strategies in Node.js.

Overview

Shipping an LLM feature is not the finish line; it is the starting line. Every prompt change, model swap, or temperature tweak can shift response quality in ways that are invisible without rigorous experimentation. This article walks through building a production-grade A/B testing framework for LLM features in Node.js, covering user bucketing, metric collection, statistical analysis, and safe rollout strategies that account for the unique challenge of non-deterministic outputs.

Prerequisites

  • Node.js 18+ and npm installed
  • Familiarity with Express.js and middleware patterns
  • A working LLM integration (OpenAI, Anthropic, or similar)
  • Basic understanding of statistics (means, standard deviations, confidence intervals)
  • Redis or a similar key-value store for experiment state (examples use Redis)

Why A/B Testing Matters for LLM Features

Traditional software is deterministic: you deploy a new algorithm, and it either returns the correct result or it does not. LLMs are different. The same prompt can produce wildly different outputs across calls. Quality is subjective, latency varies with response length, and costs fluctuate with token usage.

There are three categories of changes you will want to test in production:

Model changes -- swapping from Claude Haiku to Claude Sonnet, or from GPT-4o to GPT-4o-mini. These affect quality, latency, and cost simultaneously. You cannot evaluate them in isolation.

Prompt changes -- rewording instructions, adding few-shot examples, restructuring the system prompt, changing output format constraints. Prompt engineering is empirical. What reads well to a human does not always produce better model output.

Parameter changes -- temperature, max tokens, top-p, presence penalty. Small parameter tweaks can dramatically shift the distribution of outputs. A temperature change from 0.7 to 0.3 might improve consistency but reduce creativity for generative tasks.

The only reliable way to evaluate these changes is to expose real users to both variants simultaneously and measure what happens.

Designing Experiments for Non-Deterministic Systems

Standard A/B testing assumes that given the same input, variant A and variant B produce the same output every time. LLMs break this assumption. You need to account for three sources of variance:

  1. Between-variant variance -- the actual difference you are trying to measure
  2. Within-variant variance -- the same variant producing different outputs for the same input
  3. Between-user variance -- different users having different quality expectations and behaviors

This means you need more samples than a typical A/B test. A button color test might reach significance with 500 users per arm. An LLM quality test might need 5,000 or more, depending on your metric variance.
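How many is enough? A standard power calculation gives a rough answer. Here is a minimal sketch using the two-sample normal approximation; sigma is the standard deviation of your quality metric, estimated from historical data, and the defaults assume 95% confidence and 80% power:

// Rough per-arm sample size: n ~= 2 * (z_alpha + z_beta)^2 * sigma^2 / delta^2
function samplesPerArm(sigma, minDetectableEffect, options) {
  options = options || {};
  var zAlpha = options.zAlpha || 1.96; // two-sided, 95% confidence
  var zBeta = options.zBeta || 0.84;   // 80% power
  return Math.ceil(
    2 * Math.pow(zAlpha + zBeta, 2) * Math.pow(sigma, 2) /
    Math.pow(minDetectableEffect, 2)
  );
}

// Ratings on a 5-point scale with stddev around 1.1, aiming to detect a 0.1-point lift:
// samplesPerArm(1.1, 0.1) -> roughly 1,900 users per arm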

Design your experiments with these principles:

  • Run experiments for at least two weeks to capture day-of-week effects
  • Collect multiple quality signals per interaction, not just one
  • Log the full request and response for post-hoc analysis
  • Set a maximum experiment duration to prevent indefinite runs

Implementing a Feature Flag System for LLM Variants

The foundation of any experiment framework is a feature flag system that routes users to variants. Here is a minimal implementation:

var crypto = require("crypto");

function ExperimentConfig(options) {
  this.id = options.id;
  this.name = options.name;
  this.variants = options.variants || [];
  this.active = options.active !== false;
  this.startDate = options.startDate || new Date();
  this.endDate = options.endDate || null;
  this.targetPercent = options.targetPercent || 100;
}

function Variant(options) {
  this.id = options.id;
  this.name = options.name;
  this.weight = options.weight || 50;
  this.config = options.config || {};
}

var summarizationExperiment = new ExperimentConfig({
  id: "exp_summarize_v2",
  name: "Summarization Prompt V2 Test",
  variants: [
    new Variant({
      id: "control",
      name: "Current prompt",
      weight: 50,
      config: {
        model: "claude-sonnet-4-20250514",
        temperature: 0.3,
        systemPrompt: "Summarize the following text concisely.",
        maxTokens: 500
      }
    }),
    new Variant({
      id: "treatment",
      name: "Structured prompt with examples",
      weight: 50,
      config: {
        model: "claude-sonnet-4-20250514",
        temperature: 0.2,
        systemPrompt: "Summarize the following text in 2-3 sentences. Focus on the key takeaway and any actionable information. Do not include filler phrases.",
        maxTokens: 500
      }
    })
  ],
  active: true,
  targetPercent: 100
});

The config object on each variant holds everything the LLM call needs. This keeps experiment configuration separate from application logic.

User Bucketing with Consistent Hashing

Users must see the same variant every time they interact with the feature. Random assignment per request would contaminate your results. Consistent hashing solves this:

var crypto = require("crypto");

function hashUserToVariant(userId, experimentId, variants) {
  var hashInput = userId + ":" + experimentId;
  var hash = crypto.createHash("md5").update(hashInput).digest("hex");
  var hashInt = parseInt(hash.substring(0, 8), 16);
  var normalizedValue = hashInt / 0xffffffff;

  var cumulativeWeight = 0;
  var totalWeight = 0;

  for (var i = 0; i < variants.length; i++) {
    totalWeight += variants[i].weight;
  }

  for (var j = 0; j < variants.length; j++) {
    cumulativeWeight += variants[j].weight / totalWeight;
    if (normalizedValue <= cumulativeWeight) {
      return variants[j];
    }
  }

  return variants[variants.length - 1];
}

function isUserInExperiment(userId, experiment) {
  if (!experiment.active) return false;

  var now = new Date();
  if (experiment.endDate && now > experiment.endDate) return false;

  if (experiment.targetPercent < 100) {
    var hash = crypto.createHash("md5").update(userId + ":eligibility").digest("hex");
    var eligibilityScore = parseInt(hash.substring(0, 8), 16) / 0xffffffff * 100;
    if (eligibilityScore > experiment.targetPercent) return false;
  }

  return true;
}

MD5 is fine here because we are not doing cryptography; we just need a uniform distribution. The key property is that hashUserToVariant("user123", "exp_summarize_v2", variants) always returns the same variant for the same user-experiment pair. If you add or remove variants, assignments change, which is why you should not modify live experiments -- create new ones instead.
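A quick sanity check of that property, using the summarizationExperiment defined above:

var variants = summarizationExperiment.variants;

var first = hashUserToVariant("user123", "exp_summarize_v2", variants);
var second = hashUserToVariant("user123", "exp_summarize_v2", variants);

console.log(first.id === second.id); // always true for the same user and experiment
console.log(hashUserToVariant("user456", "exp_summarize_v2", variants).id); // may land in either variant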

Metrics to Track Per Variant

LLM experiments require a broader set of metrics than typical A/B tests. Track these for every variant:

Quality metrics:

  • Automated quality scores (if you have a rubric or judge model)
  • User thumbs up/down or star ratings
  • Task completion rate (did the user accomplish their goal?)
  • Edit distance (how much did the user modify the LLM output?)

Performance metrics:

  • Time to first token (TTFT)
  • Total response latency (end-to-end)
  • Input token count
  • Output token count

Cost metrics:

  • Cost per request (derived from token counts and model pricing)
  • Cost per successful interaction

Engagement metrics:

  • Copy/use rate (did the user actually use the output?)
  • Retry rate (did the user re-submit the same request?)
  • Abandonment rate (did the user leave mid-generation?)
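
To make the cost metrics concrete, here is a minimal sketch that derives cost per request from token counts and a per-model price table. The prices below are placeholders; substitute your provider's current rates:

// Prices in USD per million tokens. Placeholder values -- look up current
// pricing for every model you run in an experiment.
var MODEL_PRICING = {
  "claude-sonnet-4-20250514": { inputPerMillion: 3.0, outputPerMillion: 15.0 },
  "claude-haiku-4-20250514": { inputPerMillion: 1.0, outputPerMillion: 5.0 }
};

function costPerRequest(model, inputTokens, outputTokens) {
  var pricing = MODEL_PRICING[model];
  if (!pricing) return null;
  return (inputTokens * pricing.inputPerMillion +
          outputTokens * pricing.outputPerMillion) / 1e6;
}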

Here is a metric collector:

var EventEmitter = require("events");

function MetricCollector() {
  this.emitter = new EventEmitter();
  this.buffer = [];
  this.flushInterval = setInterval(this.flush.bind(this), 5000);
}

MetricCollector.prototype.record = function(experimentId, variantId, userId, metrics) {
  var event = {
    experimentId: experimentId,
    variantId: variantId,
    userId: userId,
    timestamp: Date.now(),
    metrics: metrics
  };

  this.buffer.push(event);
  this.emitter.emit("metric", event);

  if (this.buffer.length >= 100) {
    this.flush();
  }
};

MetricCollector.prototype.flush = function() {
  if (this.buffer.length === 0) return;

  var batch = this.buffer.splice(0);

  // Replace with your actual storage: PostgreSQL, ClickHouse, BigQuery, etc.
  writeToDB(batch).catch(function(err) {
    console.error("Failed to flush metrics:", err.message);
    // Re-queue failed batch (with a cap to prevent memory issues)
  });
};

MetricCollector.prototype.shutdown = function() {
  clearInterval(this.flushInterval);
  return this.flush();
};

function writeToDB(batch) {
  // Implementation depends on your storage backend
  return Promise.resolve();
}

Statistical Significance for LLM Experiments

LLM output variance is high. A user might rate the same response a 4 one day and a 3 the next. This means you need to be careful about declaring winners.

Use Welch's t-test for comparing means between two groups with unequal variances:

function welchTTest(groupA, groupB) {
  var meanA = mean(groupA);
  var meanB = mean(groupB);
  var varA = variance(groupA);
  var varB = variance(groupB);
  var nA = groupA.length;
  var nB = groupB.length;

  var se = Math.sqrt(varA / nA + varB / nB);
  if (se === 0) return { tStat: 0, pValue: 1, significant: false };

  var tStat = (meanA - meanB) / se;

  // Welch-Satterthwaite degrees of freedom
  var num = Math.pow(varA / nA + varB / nB, 2);
  var den = Math.pow(varA / nA, 2) / (nA - 1) + Math.pow(varB / nB, 2) / (nB - 1);
  var df = num / den;

  var pValue = tDistributionPValue(Math.abs(tStat), df);

  return {
    tStat: tStat,
    degreesOfFreedom: df,
    pValue: pValue,
    significant: pValue < 0.05,
    meanA: meanA,
    meanB: meanB,
    difference: meanA - meanB,
    // 95% CI via the normal approximation; adequate for the large samples LLM tests need
    confidenceInterval: [
      (meanA - meanB) - 1.96 * se,
      (meanA - meanB) + 1.96 * se
    ]
  };
}

function mean(arr) {
  var sum = 0;
  for (var i = 0; i < arr.length; i++) sum += arr[i];
  return sum / arr.length;
}

function variance(arr) {
  var m = mean(arr);
  var sumSq = 0;
  for (var i = 0; i < arr.length; i++) {
    sumSq += Math.pow(arr[i] - m, 2);
  }
  return sumSq / (arr.length - 1);
}

function tDistributionPValue(t, df) {
  // Approximation using the regularized incomplete beta function
  // For production use, consider the 'jstat' npm package
  var x = df / (df + t * t);
  return betaIncomplete(df / 2, 0.5, x);
}

function betaIncomplete(a, b, x) {
  // Lentz's continued fraction algorithm - simplified
  // Use a statistics library for production-grade accuracy
  var EPSILON = 1e-10;
  var MAX_ITER = 200;
  var result = Math.exp(
    a * Math.log(x) + b * Math.log(1 - x) -
    Math.log(a) - logBeta(a, b)
  );

  var numerator = 1;
  var denominator = 1;

  for (var n = 1; n <= MAX_ITER; n++) {
    var d = -(a + n) * (a + b + n) * x / ((a + 2 * n) * (a + 2 * n + 1));
    denominator = 1 + d / denominator;
    numerator = 1 + d / numerator;
    result *= numerator / denominator;

    if (Math.abs(numerator / denominator - 1) < EPSILON) break;
  }

  return result;
}

function logBeta(a, b) {
  return logGamma(a) + logGamma(b) - logGamma(a + b);
}

function logGamma(x) {
  // Lanczos approximation (coefficients from Numerical Recipes)
  var coefficients = [
    76.18009172947146, -86.50532032941677,
    24.01409824083091, -1.231739572450155,
    0.1208650973866179e-2, -0.5395239384953e-5
  ];
  var y = x;
  var tmp = x + 5.5;
  tmp -= (x + 0.5) * Math.log(tmp);
  var sum = 1.000000000190015;
  for (var j = 0; j < 6; j++) {
    sum += coefficients[j] / ++y;
  }
  return -tmp + Math.log(2.5066282746310005 * sum / x);
}

A few rules of thumb for LLM experiments:

  • Do not peek at results before you have at least 1,000 observations per arm
  • Use a significance threshold of p < 0.05 but also require a meaningful effect size (e.g., at least a 0.2 point improvement on a 5-point scale)
  • Run a power analysis before starting -- if your metric has high variance, you may need weeks of data
  • Consider sequential testing methods (like SPRT) if you want to stop early when results are conclusive
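
To operationalize the second rule (significance plus a meaningful effect size), a small decision gate helps keep analysis honest. The thresholds here are illustrative, not recommendations:

function isShippable(testResult, options) {
  options = options || {};
  var maxPValue = options.maxPValue || 0.05;
  var minEffect = options.minEffect || 0.2;    // e.g. 0.2 points on a 5-point scale
  var minSamples = options.minSamples || 1000; // per arm

  return testResult.pValue < maxPValue &&
    Math.abs(testResult.difference) >= minEffect &&
    options.nPerArm >= minSamples;
}

// Usage with the welchTTest result from above:
// isShippable(welchTTest(treatmentRatings, controlRatings), { nPerArm: 1200 })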

Building the Experiment Framework

Now let us put the pieces together into a cohesive framework:

var crypto = require("crypto");
var Redis = require("ioredis");

function ExperimentFramework(options) {
  this.redis = new Redis(options.redisUrl || "redis://localhost:6379");
  this.experiments = {};
  this.metricCollector = new MetricCollector();
  this.logger = options.logger || console;
}

ExperimentFramework.prototype.registerExperiment = function(config) {
  this.experiments[config.id] = config;
  this.logger.info("Registered experiment: " + config.id);
};

ExperimentFramework.prototype.getAssignment = function(userId, experimentId) {
  var experiment = this.experiments[experimentId];
  if (!experiment) return null;
  if (!isUserInExperiment(userId, experiment)) return null;

  var variant = hashUserToVariant(userId, experimentId, experiment.variants);

  // Log the assignment for analysis
  this.logAssignment(userId, experimentId, variant.id);

  return variant;
};

ExperimentFramework.prototype.logAssignment = function(userId, experimentId, variantId) {
  var key = "exp_assignment:" + experimentId + ":" + userId;
  var self = this;

  this.redis.setnx(key, JSON.stringify({
    variantId: variantId,
    timestamp: Date.now()
  })).then(function(wasSet) {
    if (wasSet) {
      // First assignment -- increment variant counter
      self.redis.hincrby("exp_counts:" + experimentId, variantId, 1);
    }
  }).catch(function(err) {
    self.logger.error("Failed to log assignment:", err.message);
  });
};

ExperimentFramework.prototype.recordOutcome = function(userId, experimentId, metrics) {
  var experiment = this.experiments[experimentId];
  if (!experiment) return;

  var variant = hashUserToVariant(userId, experimentId, experiment.variants);

  this.metricCollector.record(experimentId, variant.id, userId, metrics);
};

ExperimentFramework.prototype.getResults = function(experimentId) {
  var self = this;

  return fetchMetricsFromDB(experimentId).then(function(allMetrics) {
    var experiment = self.experiments[experimentId];
    if (!experiment) return null;

    var results = {};

    for (var i = 0; i < experiment.variants.length; i++) {
      var variant = experiment.variants[i];
      var variantMetrics = allMetrics.filter(function(m) {
        return m.variantId === variant.id;
      });

      results[variant.id] = {
        name: variant.name,
        sampleSize: variantMetrics.length,
        metrics: aggregateMetrics(variantMetrics)
      };
    }

    // Run statistical tests between control and treatments
    if (experiment.variants.length === 2) {
      var controlMetrics = allMetrics.filter(function(m) {
        return m.variantId === "control";
      });
      var treatmentMetrics = allMetrics.filter(function(m) {
        return m.variantId === "treatment";
      });

      results.comparison = runStatisticalTests(controlMetrics, treatmentMetrics);
    }

    return results;
  });
};

function aggregateMetrics(metricsArray) {
  if (metricsArray.length === 0) return {};

  var keys = Object.keys(metricsArray[0].metrics);
  var aggregated = {};

  for (var i = 0; i < keys.length; i++) {
    var key = keys[i];
    var values = metricsArray.map(function(m) { return m.metrics[key]; })
      .filter(function(v) { return typeof v === "number"; });

    if (values.length > 0) {
      aggregated[key] = {
        mean: mean(values),
        median: median(values),
        stddev: Math.sqrt(variance(values)),
        p25: percentile(values, 25),
        p75: percentile(values, 75),
        p95: percentile(values, 95),
        count: values.length
      };
    }
  }

  return aggregated;
}

function median(arr) {
  var sorted = arr.slice().sort(function(a, b) { return a - b; });
  var mid = Math.floor(sorted.length / 2);
  if (sorted.length % 2 === 0) {
    return (sorted[mid - 1] + sorted[mid]) / 2;
  }
  return sorted[mid];
}

function percentile(arr, p) {
  var sorted = arr.slice().sort(function(a, b) { return a - b; });
  var index = (p / 100) * (sorted.length - 1);
  var lower = Math.floor(index);
  var fraction = index - lower;
  if (lower + 1 < sorted.length) {
    return sorted[lower] + fraction * (sorted[lower + 1] - sorted[lower]);
  }
  return sorted[lower];
}

function runStatisticalTests(controlMetrics, treatmentMetrics) {
  if (controlMetrics.length === 0 || treatmentMetrics.length === 0) return {};

  var metricKeys = Object.keys(controlMetrics[0].metrics);
  var comparisons = {};

  for (var i = 0; i < metricKeys.length; i++) {
    var key = metricKeys[i];
    var controlValues = controlMetrics.map(function(m) { return m.metrics[key]; })
      .filter(function(v) { return typeof v === "number"; });
    var treatmentValues = treatmentMetrics.map(function(m) { return m.metrics[key]; })
      .filter(function(v) { return typeof v === "number"; });

    if (controlValues.length > 1 && treatmentValues.length > 1) {
      comparisons[key] = welchTTest(treatmentValues, controlValues);
    }
  }

  return comparisons;
}

function fetchMetricsFromDB(experimentId) {
  // Replace with actual DB query
  return Promise.resolve([]);
}

Logging Experiment Assignments and Outcomes

Every experiment assignment and outcome must be logged durably. You need this data for post-hoc analysis, debugging, and auditing. Here is an Express middleware that ties it all together:

var framework = new ExperimentFramework({ redisUrl: process.env.REDIS_URL });

function experimentMiddleware(experimentId) {
  return function(req, res, next) {
    var userId = req.user ? req.user.id : req.sessionID;
    var assignment = framework.getAssignment(userId, experimentId);

    if (assignment) {
      req.experiment = {
        id: experimentId,
        variant: assignment,
        config: assignment.config
      };

      // Attach helper to record outcomes later
      req.recordExperimentOutcome = function(metrics) {
        framework.recordOutcome(userId, experimentId, metrics);
      };
    } else {
      // User not in experiment -- use default config
      req.experiment = null;
      req.recordExperimentOutcome = function() {};
    }

    next();
  };
}

// Usage in a route
var express = require("express");
var router = express.Router();

router.post("/summarize",
  experimentMiddleware("exp_summarize_v2"),
  function(req, res) {
    var config = req.experiment
      ? req.experiment.config
      : { model: "claude-sonnet-4-20250514", temperature: 0.3, systemPrompt: "Summarize the text.", maxTokens: 500 };

    var startTime = Date.now();

    // callLLM is a placeholder for your LLM provider wrapper; the complete
    // example later in this article calls the Anthropic SDK directly.
    callLLM({
      model: config.model,
      temperature: config.temperature,
      systemPrompt: config.systemPrompt,
      maxTokens: config.maxTokens,
      userMessage: req.body.text
    }).then(function(llmResponse) {
      var latency = Date.now() - startTime;

      // Record performance metrics immediately
      req.recordExperimentOutcome({
        latencyMs: latency,
        inputTokens: llmResponse.usage.input_tokens,
        outputTokens: llmResponse.usage.output_tokens,
        responseLength: llmResponse.text.length
      });

      res.json({
        summary: llmResponse.text,
        experimentId: req.experiment ? req.experiment.id : null,
        variantId: req.experiment ? req.experiment.variant.id : null
      });
    }).catch(function(err) {
      req.recordExperimentOutcome({
        error: true,
        errorType: err.code || "unknown"
      });
      res.status(500).json({ error: "Summarization failed" });
    });
  }
);

Include the experimentId and variantId in your API response so the client can associate user feedback (thumbs up/down, edits) with the correct variant.
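
On the client, that might look like the sketch below. The endpoint mirrors the feedback route in the complete example later in this article; getUserId is a hypothetical helper that returns the same stable identifier the server buckets on:

// Browser-side sketch: send the rating back with the experiment context the API returned.
function sendFeedback(experimentId, variantId, rating) {
  return fetch("/api/feedback", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-user-id": getUserId() // hypothetical helper; must match the server-side user id
    },
    body: JSON.stringify({
      experimentId: experimentId,
      variantId: variantId,
      rating: rating
    })
  });
}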

Multi-Armed Bandit as an Alternative

Fixed 50/50 splits are wasteful when one variant is clearly better. A multi-armed bandit approach dynamically shifts traffic toward the winning variant using Thompson Sampling:

function ThompsonBandit(variants) {
  this.variants = variants;
  this.successes = {};
  this.failures = {};

  for (var i = 0; i < variants.length; i++) {
    this.successes[variants[i].id] = 1; // Beta(1,1) prior
    this.failures[variants[i].id] = 1;
  }
}

ThompsonBandit.prototype.selectVariant = function() {
  var bestSample = -1;
  var bestVariant = null;

  for (var i = 0; i < this.variants.length; i++) {
    var variantId = this.variants[i].id;
    var sample = sampleBeta(
      this.successes[variantId],
      this.failures[variantId]
    );

    if (sample > bestSample) {
      bestSample = sample;
      bestVariant = this.variants[i];
    }
  }

  return bestVariant;
};

ThompsonBandit.prototype.recordResult = function(variantId, success) {
  if (success) {
    this.successes[variantId]++;
  } else {
    this.failures[variantId]++;
  }
};

function sampleBeta(alpha, beta) {
  // Joehnk's algorithm for Beta distribution sampling. Simple, but rejection
  // gets very slow as alpha and beta grow, so for long-running bandits prefer
  // a gamma-based sampler from a statistics library.
  var u, v, x, y;
  do {
    u = Math.random();
    v = Math.random();
    x = Math.pow(u, 1 / alpha);
    y = Math.pow(v, 1 / beta);
  } while (x + y > 1);

  return x / (x + y);
}
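
A usage sketch, reusing the summarizationExperiment variants from earlier and the same callLLM placeholder as the route example (the thumbs-up signal stands in for whatever binary success metric you choose):

var chatBandit = new ThompsonBandit(summarizationExperiment.variants);

function handleBanditRequest(userMessage) {
  // Pick an arm, call the LLM with that arm's config, then report a binary
  // outcome back to the bandit once feedback is available.
  var chosen = chatBandit.selectVariant();

  return callLLM({
    model: chosen.config.model,
    temperature: chosen.config.temperature,
    systemPrompt: chosen.config.systemPrompt,
    maxTokens: chosen.config.maxTokens,
    userMessage: userMessage
  }).then(function(response) {
    // Later, when the user reacts:
    // chatBandit.recordResult(chosen.id, userGaveThumbsUp);
    return response;
  });
}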

The tradeoff: bandits minimize regret (exposure to the worse variant) but make statistical inference harder because allocations are not balanced. Use bandits for optimization and A/B tests for measurement. If you need to publish a paper-quality analysis, stick with fixed splits.

Testing Prompt Variations

Prompt A/B testing is the most common LLM experiment. Here are patterns that work:

Testing wording changes:

var promptVariants = {
  control: {
    systemPrompt: "You are a helpful assistant. Answer the user's question."
  },
  concise: {
    systemPrompt: "You are a helpful assistant. Answer the user's question in 2-3 sentences maximum. Be direct and specific."
  },
  structured: {
    systemPrompt: "You are a helpful assistant. Answer the user's question using this format:\n**Answer:** [direct answer]\n**Details:** [supporting context]\n**Next steps:** [if applicable]"
  }
};

Testing few-shot examples:

var fewShotVariants = {
  zeroShot: {
    systemPrompt: "Classify the following support ticket as: billing, technical, account, or other."
  },
  twoShot: {
    systemPrompt: "Classify the following support ticket as: billing, technical, account, or other.\n\nExamples:\nTicket: \"I was charged twice for my subscription\"\nClassification: billing\n\nTicket: \"The app crashes when I upload a file over 10MB\"\nClassification: technical"
  }
};

Always test one variable at a time. If you change both the prompt wording and the number of few-shot examples simultaneously, you cannot attribute the result to either change.
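
With the ExperimentConfig and Variant helpers from earlier, the three wording variants above could be registered as a single three-arm experiment (the id and weights are illustrative):

var promptWordingExperiment = new ExperimentConfig({
  id: "exp_assistant_prompt_wording",
  name: "Assistant prompt wording test",
  variants: [
    new Variant({ id: "control", weight: 34, config: promptVariants.control }),
    new Variant({ id: "concise", weight: 33, config: promptVariants.concise }),
    new Variant({ id: "structured", weight: 33, config: promptVariants.structured })
  ]
});

framework.registerExperiment(promptWordingExperiment);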

Testing Model Variations

Model swaps are high-impact experiments. A smaller model might deliver 90% of the quality at 20% of the cost:

var modelExperiment = new ExperimentConfig({
  id: "exp_model_swap_chat",
  name: "Chat model downgrade test",
  variants: [
    new Variant({
      id: "control",
      name: "Claude Sonnet",
      weight: 50,
      config: {
        model: "claude-sonnet-4-20250514",
        temperature: 0.5,
        maxTokens: 1000
      }
    }),
    new Variant({
      id: "treatment",
      name: "Claude Haiku",
      weight: 50,
      config: {
        model: "claude-haiku-4-20250514",
        temperature: 0.5,
        maxTokens: 1000
      }
    })
  ]
});

When testing models, make sure to track cost per request as a primary metric. A 5% quality drop with a 70% cost reduction might be the right business decision.

Testing System Prompt Changes

System prompt changes are risky because they affect every interaction. Use a graduated rollout:

function graduatedRollout(experimentId, schedule) {
  // schedule: [{ day: 0, percent: 5 }, { day: 3, percent: 25 }, { day: 7, percent: 50 }, ...]
  var experiment = framework.experiments[experimentId];
  if (!experiment) return;

  var daysSinceStart = Math.floor(
    (Date.now() - experiment.startDate.getTime()) / (1000 * 60 * 60 * 24)
  );

  var targetPercent = 0;
  for (var i = 0; i < schedule.length; i++) {
    if (daysSinceStart >= schedule[i].day) {
      targetPercent = schedule[i].percent;
    }
  }

  experiment.targetPercent = targetPercent;
  framework.logger.info(
    "Experiment " + experimentId + " at " + targetPercent + "% on day " + daysSinceStart
  );
}

// Run every hour
setInterval(function() {
  graduatedRollout("exp_system_prompt_v3", [
    { day: 0, percent: 5 },
    { day: 2, percent: 10 },
    { day: 5, percent: 25 },
    { day: 7, percent: 50 },
    { day: 10, percent: 100 }
  ]);
}, 1000 * 60 * 60);

Start at 5%. If your error rate or negative feedback spikes, you can kill the experiment before it affects most users.
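
A kill switch makes that possible without a deploy. Here is one minimal sketch that reuses the framework instance from the middleware section; polling a Redis flag into memory keeps the synchronous assignment path unchanged (the key name is an assumption):

// In-memory cache of kill flags, refreshed every few seconds from Redis.
var killedExperiments = {};

setInterval(function() {
  Object.keys(framework.experiments).forEach(function(experimentId) {
    framework.redis.get("exp_kill:" + experimentId).then(function(value) {
      killedExperiments[experimentId] = value === "1";
    });
  });
}, 5000);

function isExperimentKilled(experimentId) {
  return killedExperiments[experimentId] === true;
}

// In getAssignment (or the middleware), bail out early:
//   if (isExperimentKilled(experimentId)) return null;
//
// To kill an experiment from a shell (takes effect within a few seconds):
//   redis-cli SET exp_kill:exp_system_prompt_v3 1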

Analyzing Experiment Results with Confidence Intervals

Raw numbers are not enough. You need confidence intervals to understand the range of plausible effects:

function analyzeExperiment(experimentId, metricName) {
  return framework.getResults(experimentId).then(function(results) {
    if (!results || !results.comparison || !results.comparison[metricName]) {
      return { status: "insufficient_data" };
    }

    var comparison = results.comparison[metricName];
    var control = results.control.metrics[metricName];
    var treatment = results.treatment.metrics[metricName];

    var analysis = {
      metric: metricName,
      control: {
        mean: control.mean.toFixed(4),
        stddev: control.stddev.toFixed(4),
        sampleSize: control.count
      },
      treatment: {
        mean: treatment.mean.toFixed(4),
        stddev: treatment.stddev.toFixed(4),
        sampleSize: treatment.count
      },
      difference: comparison.difference.toFixed(4),
      relativeChange: ((comparison.difference / comparison.meanB) * 100).toFixed(2) + "%",
      confidenceInterval: {
        lower: comparison.confidenceInterval[0].toFixed(4),
        upper: comparison.confidenceInterval[1].toFixed(4)
      },
      pValue: comparison.pValue.toFixed(6),
      significant: comparison.significant,
      recommendation: getRecommendation(comparison, control.count, treatment.count)
    };

    return analysis;
  });
}

function getRecommendation(comparison, nControl, nTreatment) {
  var minSampleSize = 500;

  if (nControl < minSampleSize || nTreatment < minSampleSize) {
    return "CONTINUE: Need at least " + minSampleSize + " samples per arm. " +
      "Currently: control=" + nControl + ", treatment=" + nTreatment;
  }

  if (!comparison.significant) {
    if (nControl > 5000) {
      return "NO_DIFFERENCE: No significant difference detected with large sample. " +
        "Consider stopping the experiment.";
    }
    return "CONTINUE: Not yet significant. Continue collecting data.";
  }

  if (comparison.difference > 0) {
    return "SHIP_TREATMENT: Treatment is significantly better (p=" +
      comparison.pValue.toFixed(4) + "). Plan rollout.";
  }

  return "KEEP_CONTROL: Control is significantly better. Reject treatment.";
}

Rolling Out Winning Variants

When an experiment concludes with a winner, do not flip 100% of traffic immediately. Use a staged rollout:

  1. Announce the winner -- document the experiment results, effect sizes, and confidence intervals
  2. Ramp from 50% to 75% -- monitor for a day to confirm metrics hold
  3. Ramp to 95% -- keep 5% on the old variant as a safety net for 48 hours
  4. Ship 100% -- update the default configuration, remove the experiment, archive the data
  5. Clean up -- remove experiment middleware, dead code, and feature flags

Build an automatic rollout controller:

function RolloutController(framework) {
  this.framework = framework;
  this.rolloutStages = [
    { percent: 75, durationHours: 24 },
    { percent: 95, durationHours: 48 },
    { percent: 100, durationHours: 0 }
  ];
}

RolloutController.prototype.startRollout = function(experimentId, winningVariantId) {
  var self = this;
  var experiment = this.framework.experiments[experimentId];

  if (!experiment) {
    throw new Error("Experiment not found: " + experimentId);
  }

  // Set winning variant to 75%, others to share remaining 25%
  var stage = 0;

  function advanceStage() {
    if (stage >= self.rolloutStages.length) return;

    var targetPercent = self.rolloutStages[stage].percent;
    var remaining = 100 - targetPercent;
    var otherCount = experiment.variants.length - 1;

    for (var i = 0; i < experiment.variants.length; i++) {
      if (experiment.variants[i].id === winningVariantId) {
        experiment.variants[i].weight = targetPercent;
      } else {
        experiment.variants[i].weight = remaining / otherCount;
      }
    }

    self.framework.logger.info(
      "Rollout stage " + stage + ": " + winningVariantId +
      " at " + targetPercent + "%"
    );

    stage++;
    var nextDuration = self.rolloutStages[stage - 1].durationHours;
    if (nextDuration > 0 && stage < self.rolloutStages.length) {
      setTimeout(advanceStage, nextDuration * 60 * 60 * 1000);
    }
  }

  advanceStage();
};

Complete Working Example

Here is a self-contained Express application that ties everything together. It includes experiment configuration, the LLM call wrapper, metric tracking, a results endpoint, and an admin dashboard:

var express = require("express");
var crypto = require("crypto");
var Redis = require("ioredis");
var Anthropic = require("@anthropic-ai/sdk");

var app = express();
var redis = new Redis(process.env.REDIS_URL || "redis://localhost:6379");
var anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

app.use(express.json());

// ─── Experiment Registry ───

var experiments = {
  "exp_summarize_v2": {
    id: "exp_summarize_v2",
    name: "Summarization Prompt V2",
    active: true,
    startDate: new Date("2026-02-01"),
    endDate: new Date("2026-02-28"),
    targetPercent: 100,
    variants: [
      {
        id: "control",
        weight: 50,
        config: {
          model: "claude-sonnet-4-20250514",
          temperature: 0.3,
          maxTokens: 500,
          systemPrompt: "Summarize the following text concisely."
        }
      },
      {
        id: "treatment",
        weight: 50,
        config: {
          model: "claude-sonnet-4-20250514",
          temperature: 0.2,
          maxTokens: 500,
          systemPrompt: "Summarize the following text in 2-3 sentences. Focus on the key takeaway and any actionable information. Omit filler phrases."
        }
      }
    ]
  }
};

// ─── Bucketing ───

function getVariantForUser(userId, experimentId) {
  var exp = experiments[experimentId];
  if (!exp || !exp.active) return null;

  var now = new Date();
  if (exp.endDate && now > exp.endDate) return null;

  var hash = crypto.createHash("md5")
    .update(userId + ":" + experimentId)
    .digest("hex");
  var value = parseInt(hash.substring(0, 8), 16) / 0xffffffff;

  var cumulative = 0;
  var totalWeight = 0;
  for (var i = 0; i < exp.variants.length; i++) {
    totalWeight += exp.variants[i].weight;
  }
  for (var j = 0; j < exp.variants.length; j++) {
    cumulative += exp.variants[j].weight / totalWeight;
    if (value <= cumulative) return exp.variants[j];
  }
  return exp.variants[exp.variants.length - 1];
}

// ─── Metric Storage (Redis-backed) ───

function recordMetric(experimentId, variantId, userId, metrics) {
  var entry = JSON.stringify({
    variantId: variantId,
    userId: userId,
    metrics: metrics,
    timestamp: Date.now()
  });
  return redis.rpush("exp_metrics:" + experimentId, entry);
}

function getMetrics(experimentId) {
  return redis.lrange("exp_metrics:" + experimentId, 0, -1)
    .then(function(entries) {
      return entries.map(function(e) { return JSON.parse(e); });
    });
}

// ─── Stats Helpers ───

function computeMean(arr) {
  var s = 0;
  for (var i = 0; i < arr.length; i++) s += arr[i];
  return s / arr.length;
}

function computeVariance(arr) {
  var m = computeMean(arr);
  var s = 0;
  for (var i = 0; i < arr.length; i++) s += Math.pow(arr[i] - m, 2);
  return s / (arr.length - 1);
}

function computeConfidenceInterval(arr, confidence) {
  var z = confidence === 0.99 ? 2.576 : 1.96; // 95% default
  var m = computeMean(arr);
  var se = Math.sqrt(computeVariance(arr) / arr.length);
  return { mean: m, lower: m - z * se, upper: m + z * se, se: se };
}

// ─── Routes ───

app.post("/api/summarize", function(req, res) {
  var userId = req.headers["x-user-id"] || req.ip;
  var text = req.body.text;

  if (!text) {
    return res.status(400).json({ error: "Missing 'text' field" });
  }

  var variant = getVariantForUser(userId, "exp_summarize_v2");
  var config = variant
    ? variant.config
    : { model: "claude-sonnet-4-20250514", temperature: 0.3, maxTokens: 500, systemPrompt: "Summarize the following text." };

  var startTime = Date.now();

  anthropic.messages.create({
    model: config.model,
    max_tokens: config.maxTokens,
    temperature: config.temperature,
    system: config.systemPrompt,
    messages: [{ role: "user", content: text }]
  }).then(function(response) {
    var latency = Date.now() - startTime;
    var outputText = response.content[0].text;

    if (variant) {
      recordMetric("exp_summarize_v2", variant.id, userId, {
        latencyMs: latency,
        inputTokens: response.usage.input_tokens,
        outputTokens: response.usage.output_tokens,
        responseLength: outputText.length
      });
    }

    res.json({
      summary: outputText,
      experiment: variant ? { id: "exp_summarize_v2", variant: variant.id } : null
    });
  }).catch(function(err) {
    if (variant) {
      recordMetric("exp_summarize_v2", variant.id, userId, {
        error: true,
        errorMessage: err.message
      });
    }
    res.status(500).json({ error: "Summarization failed", message: err.message });
  });
});

// User feedback endpoint
app.post("/api/feedback", function(req, res) {
  var userId = req.headers["x-user-id"] || req.ip;
  var experimentId = req.body.experimentId;
  var rating = req.body.rating; // 1-5

  if (!experimentId || !rating) {
    return res.status(400).json({ error: "Missing experimentId or rating" });
  }

  var variant = getVariantForUser(userId, experimentId);
  if (!variant) {
    return res.status(404).json({ error: "Experiment not found or inactive" });
  }

  recordMetric(experimentId, variant.id, userId, {
    userRating: rating,
    feedbackType: "explicit"
  }).then(function() {
    res.json({ recorded: true });
  });
});

// ─── Admin Dashboard ───

app.get("/admin/experiments", function(req, res) {
  var experimentList = Object.keys(experiments).map(function(id) {
    var exp = experiments[id];
    return { id: exp.id, name: exp.name, active: exp.active };
  });
  res.json(experimentList);
});

app.get("/admin/experiments/:id/results", function(req, res) {
  var experimentId = req.params.id;
  var exp = experiments[experimentId];

  if (!exp) {
    return res.status(404).json({ error: "Experiment not found" });
  }

  getMetrics(experimentId).then(function(allMetrics) {
    var results = {};

    for (var i = 0; i < exp.variants.length; i++) {
      var v = exp.variants[i];
      var vMetrics = allMetrics.filter(function(m) { return m.variantId === v.id; });

      var latencies = [];
      var ratings = [];
      var errors = 0;
      var totalCost = 0;

      for (var j = 0; j < vMetrics.length; j++) {
        var m = vMetrics[j].metrics;
        if (m.latencyMs) latencies.push(m.latencyMs);
        if (m.userRating) ratings.push(m.userRating);
        if (m.error) errors++;
        if (m.inputTokens && m.outputTokens) {
          // Approximate cost for Claude Sonnet (adjust per model)
          totalCost += (m.inputTokens * 0.003 + m.outputTokens * 0.015) / 1000;
        }
      }

      results[v.id] = {
        sampleSize: vMetrics.length,
        latency: latencies.length > 0 ? computeConfidenceInterval(latencies) : null,
        rating: ratings.length > 0 ? computeConfidenceInterval(ratings) : null,
        errorRate: vMetrics.length > 0 ? (errors / vMetrics.length * 100).toFixed(2) + "%" : "N/A",
        totalCost: "$" + totalCost.toFixed(4),
        avgCostPerRequest: vMetrics.length > 0 ? "$" + (totalCost / vMetrics.length).toFixed(6) : "N/A"
      };
    }

    // Comparison if both arms have rating data
    var controlRatings = allMetrics
      .filter(function(m) { return m.variantId === "control" && m.metrics.userRating; })
      .map(function(m) { return m.metrics.userRating; });
    var treatmentRatings = allMetrics
      .filter(function(m) { return m.variantId === "treatment" && m.metrics.userRating; })
      .map(function(m) { return m.metrics.userRating; });

    if (controlRatings.length > 1 && treatmentRatings.length > 1) {
      results.ratingComparison = {
        controlMean: computeMean(controlRatings).toFixed(3),
        treatmentMean: computeMean(treatmentRatings).toFixed(3),
        difference: (computeMean(treatmentRatings) - computeMean(controlRatings)).toFixed(3),
        controlCI: computeConfidenceInterval(controlRatings),
        treatmentCI: computeConfidenceInterval(treatmentRatings),
        recommendation: getExperimentRecommendation(controlRatings, treatmentRatings)
      };
    }

    res.json({
      experiment: { id: exp.id, name: exp.name, active: exp.active },
      results: results
    });
  }).catch(function(err) {
    res.status(500).json({ error: err.message });
  });
});

function getExperimentRecommendation(control, treatment) {
  if (control.length < 200 || treatment.length < 200) {
    return "Need more data (minimum 200 per arm)";
  }

  var diff = computeMean(treatment) - computeMean(control);
  var welchSE = Math.sqrt(
    computeVariance(control) / control.length +
    computeVariance(treatment) / treatment.length
  );
  var tStat = diff / welchSE;

  if (Math.abs(tStat) < 1.96) {
    return "No significant difference detected";
  }

  if (diff > 0) {
    return "Treatment is significantly better (t=" + tStat.toFixed(3) + "). Consider shipping.";
  }
  return "Control is significantly better (t=" + tStat.toFixed(3) + "). Keep control.";
}

var PORT = process.env.PORT || 3000;
app.listen(PORT, function() {
  console.log("Experiment server running on port " + PORT);
});

Common Issues and Troubleshooting

1. Users see different variants across sessions

Error: Variant assignment inconsistency for user_abc123
Expected: control, Got: treatment

This happens when the user identifier changes between sessions. If you use session IDs, a new session means a new ID. Fix this by using a stable identifier -- authenticated user ID, or a persistent cookie set on first visit. If you must use session IDs, persist the assignment in Redis:

function getOrCreateAssignment(sessionId, experimentId, variants) {
  var key = "exp_sticky:" + experimentId + ":" + sessionId;
  return redis.get(key).then(function(cached) {
    if (cached) return JSON.parse(cached);
    var variant = hashUserToVariant(sessionId, experimentId, variants);
    redis.setex(key, 86400 * 30, JSON.stringify(variant)); // 30-day TTL
    return variant;
  });
}

2. Metric cardinality explosion in Redis

OOM command not allowed when used memory > 'maxmemory'

Storing every metric event as a separate Redis list entry works for small experiments but blows up at scale. Each entry is a JSON string, and Redis keeps everything in memory. If you are running 10 experiments with 50,000 events each, that is 500,000 Redis entries. Move to a time-series database or append-only log (PostgreSQL with partitioning, ClickHouse, or BigQuery) for anything beyond prototype scale.
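
As a sketch of what the durable path might look like with PostgreSQL and the pg package (the table name and columns are assumptions; partition on the timestamp column for high volume), the writeToDB stub from the metric collector could become:

var pg = require("pg");
var pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });

// Assumed table:
//   CREATE TABLE experiment_metrics (
//     experiment_id text, variant_id text, user_id text,
//     metrics jsonb, recorded_at timestamptz
//   );
function writeToDB(batch) {
  var values = [];
  var params = [];

  batch.forEach(function(event, i) {
    var base = i * 5;
    values.push(
      "($" + (base + 1) + ", $" + (base + 2) + ", $" + (base + 3) +
      ", $" + (base + 4) + ", to_timestamp($" + (base + 5) + " / 1000.0))"
    );
    params.push(
      event.experimentId, event.variantId, event.userId,
      JSON.stringify(event.metrics), event.timestamp
    );
  });

  return pool.query(
    "INSERT INTO experiment_metrics " +
    "(experiment_id, variant_id, user_id, metrics, recorded_at) VALUES " +
    values.join(", "),
    params
  );
}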

3. Experiment results change when you add a new variant

WARNING: Existing user assignments shifted after variant added to exp_summarize_v2
Before: { control: 5023, treatment: 4977 }
After:  { control: 3341, treatment: 3329, treatment_b: 3330 }

Adding a variant to a live experiment reshuffles the hash space. Users who were in "control" may now be in "treatment_b". Never modify live experiments. Instead, end the current experiment, analyze results, and start a new one with the additional variant.

4. Statistical significance flip-flops daily

Day 3: p=0.042 (significant!)
Day 4: p=0.067 (not significant)
Day 5: p=0.038 (significant again!)

This is the peeking problem. Every time you check results, you are implicitly running another statistical test, which inflates your false positive rate. The solution is either to pre-commit to a fixed sample size and only check once, or use a sequential testing method like the SPRT (Sequential Probability Ratio Test) that accounts for repeated checks. At minimum, do not make ship/no-ship decisions until you have at least twice your minimum sample size.
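
One lightweight guard is to pre-register the analysis plan and refuse to render a verdict before the committed sample size is reached; a sketch:

// Decide these numbers before the experiment starts and store them alongside
// the experiment config so nobody moves the goalposts mid-flight.
var analysisPlan = {
  experimentId: "exp_summarize_v2",
  primaryMetric: "userRating",
  minSamplesPerArm: 2000,
  maxPValue: 0.05,
  minEffect: 0.2
};

function readyForDecision(nControl, nTreatment, plan) {
  return nControl >= plan.minSamplesPerArm && nTreatment >= plan.minSamplesPerArm;
}

// In the results endpoint:
//   if (!readyForDecision(controlRatings.length, treatmentRatings.length, analysisPlan)) {
//     return res.json({ status: "collecting", decision: "too early to call" });
//   }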

5. LLM errors disproportionately affect one variant

Variant "treatment" error rate: 12.3%
Variant "control" error rate: 0.8%

If one variant uses a different model or higher token count, it might hit rate limits or timeouts more often. This contaminates your quality metrics -- users in the high-error variant have a worse experience even if the non-error responses are better. Track error rates separately and exclude errored requests from quality analysis. Also implement automatic experiment pausing if any variant's error rate exceeds a threshold:

function checkExperimentHealth(experimentId) {
  return getMetrics(experimentId).then(function(metrics) {
    var exp = experiments[experimentId];
    for (var i = 0; i < exp.variants.length; i++) {
      var v = exp.variants[i];
      var vMetrics = metrics.filter(function(m) { return m.variantId === v.id; });
      var errors = vMetrics.filter(function(m) { return m.metrics.error; });

      if (vMetrics.length > 100 && errors.length / vMetrics.length > 0.1) {
        exp.active = false;
        console.error(
          "EXPERIMENT PAUSED: " + experimentId +
          " variant " + v.id +
          " error rate " + (errors.length / vMetrics.length * 100).toFixed(1) + "%"
        );
        return;
      }
    }
  });
}

Best Practices

  • One variable at a time. Test prompt OR model OR parameters, never multiple simultaneously. Multivariate testing requires exponentially more traffic.

  • Log everything. Store the full prompt, full response, all parameters, and all metadata for every experiment request. You will want to do qualitative analysis of responses, not just look at aggregate numbers.

  • Use a holdback group. Keep 5-10% of users permanently on the pre-experiment baseline. This lets you measure long-term effects and catch regressions that appear after the experiment ends.

  • Set hard guardrails. Define automatic pause conditions before the experiment starts: error rate > 10%, latency p95 > 5 seconds, user rating drops below 3.0. Do not rely on humans watching dashboards.

  • Separate quality metrics from engagement metrics. Users might engage more with a lower-quality variant because it is confusing and requires follow-up questions. High engagement does not always mean high quality.

  • Pre-register your experiments. Before starting, write down your hypothesis, primary metric, minimum detectable effect, required sample size, and decision criteria. This prevents post-hoc rationalization.

  • Account for novelty effects. Users might rate a new prompt style higher simply because it is different. Run experiments for at least two weeks to let the novelty wear off.

  • Version your prompts in source control. Every prompt variant should be committed to your repository with a version identifier. When you analyze results six months later, you need to know exactly what text each variant used.

  • Build a kill switch. Every experiment should be killable in under 60 seconds without a code deploy. Use a feature flag service or a Redis key that your middleware checks on every request.

  • Monitor cost continuously. An experiment that produces better responses at 3x the cost might not be a winner. Include cost-per-request as a primary metric alongside quality.
