Testing LLM Integrations: Strategies and Tools
Testing LLM integrations is one of the most misunderstood problems in modern software engineering. Unlike a database query or a REST endpoint, you are dealing with a system that can return a different answer to the same question every time you ask it. This article walks through practical strategies I have used in production to build reliable, repeatable test suites for LLM-powered features in Node.js applications.
Prerequisites
- Node.js v18+ installed
- Familiarity with Mocha and Chai for testing
- Basic understanding of LLM APIs (OpenAI, Anthropic, etc.)
- An OpenAI or Anthropic API key for integration tests
- npm packages: mocha, chai, sinon, nock, node-fetch
Install the test dependencies:
npm install --save-dev mocha chai sinon nock
npm install node-fetch@2
The Challenge of Testing Non-Deterministic Systems
Traditional unit tests follow a simple contract: given input X, expect output Y. LLMs shatter that contract. Ask GPT to summarize a paragraph and you will get a slightly different result every single time. Setting temperature to 0 helps, but even then, model updates, API version changes, and infrastructure differences can shift outputs.
I have seen teams fall into two traps. The first is not testing at all because "the output changes." The second is writing brittle tests that assert on exact string matches and break every time the model hiccups. Both are wrong.
The right approach is layered testing:
- Unit tests with mocked responses for deterministic logic
- Fixture tests with recorded API responses for integration paths
- Evaluation tests that score output quality rather than assert exact matches
- Regression tests that detect when prompt changes degrade results
Each layer catches different failure modes. Let me walk through every one of them.
Mocking LLM API Responses for Unit Tests
The fastest, cheapest, and most reliable layer. Mock the HTTP call and test everything around it: your prompt construction, response parsing, error handling, retry logic.
// test/unit/summarizer.test.js
var sinon = require("sinon");
var chai = require("chai");
var expect = chai.expect;
var nock = require("nock");
var summarizer = require("../../lib/summarizer");
describe("Summarizer", function () {
afterEach(function () {
nock.cleanAll();
});
it("should parse a valid completion response", function (done) {
var mockResponse = {
id: "chatcmpl-abc123",
object: "chat.completion",
choices: [
{
index: 0,
message: {
role: "assistant",
content: "This article discusses Node.js testing strategies."
},
finish_reason: "stop"
}
],
usage: { prompt_tokens: 50, completion_tokens: 12, total_tokens: 62 }
};
nock("https://api.openai.com")
.post("/v1/chat/completions")
.reply(200, mockResponse);
summarizer.summarize("A long article about testing...").then(function (result) {
expect(result.summary).to.be.a("string");
expect(result.summary).to.include("testing");
expect(result.tokenUsage).to.equal(62);
done();
}).catch(done);
});
it("should handle rate limit errors with retry", function (done) {
nock("https://api.openai.com")
.post("/v1/chat/completions")
.reply(429, { error: { message: "Rate limit exceeded" } })
.post("/v1/chat/completions")
.reply(200, {
choices: [{ message: { content: "Summary here." }, finish_reason: "stop" }],
usage: { total_tokens: 30 }
});
summarizer.summarize("Some text").then(function (result) {
expect(result.summary).to.equal("Summary here.");
done();
}).catch(done);
});
it("should throw on malformed response", function (done) {
nock("https://api.openai.com")
.post("/v1/chat/completions")
.reply(200, { choices: [] });
summarizer.summarize("Some text").then(function () {
done(new Error("Should have thrown"));
}).catch(function (err) {
expect(err.message).to.include("No completion choices");
done();
});
});
});
The key insight: you are not testing the LLM. You are testing your code's ability to handle LLM responses correctly. This layer should cover every error path, every edge case in your parsing logic, every retry scenario. These tests run in milliseconds with zero API cost.
Recording and Replaying API Calls (VCR-Style Testing)
VCR testing records real API responses on the first run, then replays them on subsequent runs. This gives you realistic fixtures without constant API calls. I use nock in recording mode for this.
// test/helpers/vcr.js
var nock = require("nock");
var fs = require("fs");
var path = require("path");
function useFixture(name, options) {
var fixturePath = path.join(__dirname, "../fixtures", name + ".json");
var record = options && options.record;
if (record || !fs.existsSync(fixturePath)) {
// Record mode: let the real request through and save it
nock.recorder.rec({
output_objects: true,
dont_print: true
});
return {
save: function () {
var recordings = nock.recorder.play();
nock.recorder.clear();
nock.restore(); // call nock.activate() afterwards if later tests still need interceptors
// Scrub API keys from recorded fixtures
var scrubbed = recordings.map(function (rec) {
if (rec.reqheaders && rec.reqheaders.authorization) {
rec.reqheaders.authorization = "Bearer sk-test-redacted";
}
return rec;
});
fs.mkdirSync(path.dirname(fixturePath), { recursive: true });
fs.writeFileSync(fixturePath, JSON.stringify(scrubbed, null, 2));
}
};
} else {
// Replay mode: load the saved fixture
var fixtures = JSON.parse(fs.readFileSync(fixturePath, "utf8"));
nock.define(fixtures);
return { save: function () {} };
}
}
module.exports = { useFixture: useFixture };
Usage in tests:
// test/integration/classifier.test.js
var expect = require("chai").expect;
var vcr = require("../helpers/vcr");
var classifier = require("../../lib/classifier");
describe("Classifier (VCR)", function () {
it("should classify a support ticket", function (done) {
var fixture = vcr.useFixture("classify-support-ticket", {
record: process.env.VCR_RECORD === "true"
});
classifier.classify("My order hasn't arrived and it's been 2 weeks")
.then(function (result) {
expect(result.category).to.be.oneOf([
"shipping", "order_status", "complaint"
]);
expect(result.confidence).to.be.above(0.5);
fixture.save();
done();
})
.catch(done);
});
});
To record fresh fixtures:
VCR_RECORD=true npx mocha test/integration/classifier.test.js
To replay:
npx mocha test/integration/classifier.test.js
One critical rule: always scrub API keys from recorded fixtures before committing them. I have seen production keys leak through test fixtures more than once.
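If you want a safety net beyond manual review, a small script can scan fixture files for key-like strings before they are committed. A minimal sketch, assuming fixtures live under test/fixtures (the patterns and path are illustrative; extend them for your providers):
// scripts/check-fixtures.js
var fs = require("fs");
var path = require("path");

var FIXTURE_DIR = path.join(__dirname, "../test/fixtures");
// Rough patterns for OpenAI- and Anthropic-style keys; add your own providers here
var KEY_PATTERNS = [/sk-[A-Za-z0-9]{20,}/, /sk-ant-[A-Za-z0-9_-]{20,}/];

function scan(dir) {
  var violations = [];
  if (!fs.existsSync(dir)) return violations;
  fs.readdirSync(dir).forEach(function (name) {
    var full = path.join(dir, name);
    if (fs.statSync(full).isDirectory()) {
      violations = violations.concat(scan(full));
    } else if (/\.json$/.test(name)) {
      var content = fs.readFileSync(full, "utf8");
      KEY_PATTERNS.forEach(function (pattern) {
        if (pattern.test(content)) violations.push(full);
      });
    }
  });
  return violations;
}

var found = scan(FIXTURE_DIR);
if (found.length > 0) {
  console.error("Possible API keys found in fixtures:\n  " + found.join("\n  "));
  process.exit(1);
}
Wire it into a pre-commit hook or an npm script (node scripts/check-fixtures.js) so it runs before fixtures reach version control.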
Snapshot Testing for LLM Outputs
Snapshot testing captures the full output from an LLM call and saves it. On subsequent runs, you compare the new output against the saved snapshot. The goal is not exact matching but detecting unexpected drift.
// test/helpers/snapshot.js
var fs = require("fs");
var path = require("path");
function compareSnapshot(name, actual, threshold) {
var snapshotDir = path.join(__dirname, "../snapshots");
var snapshotPath = path.join(snapshotDir, name + ".txt");
threshold = threshold || 0.8;
if (process.env.UPDATE_SNAPSHOTS === "true" || !fs.existsSync(snapshotPath)) {
fs.mkdirSync(snapshotDir, { recursive: true });
fs.writeFileSync(snapshotPath, actual);
return { pass: true, similarity: 1.0, message: "Snapshot created" };
}
var expected = fs.readFileSync(snapshotPath, "utf8");
var similarity = computeSimilarity(expected, actual);
return {
pass: similarity >= threshold,
similarity: similarity,
message: similarity >= threshold
? "Output within threshold (" + similarity.toFixed(3) + " >= " + threshold + ")"
: "Output drifted too far (" + similarity.toFixed(3) + " < " + threshold + ")"
};
}
function computeSimilarity(a, b) {
// Simple Jaccard similarity on word sets
var wordsA = a.toLowerCase().split(/\s+/);
var wordsB = b.toLowerCase().split(/\s+/);
var setA = {};
var setB = {};
var intersection = 0;
wordsA.forEach(function (w) { setA[w] = true; });
wordsB.forEach(function (w) { setB[w] = true; });
Object.keys(setA).forEach(function (w) {
if (setB[w]) intersection++;
});
var union = Object.keys(setA).length + Object.keys(setB).length - intersection;
return union === 0 ? 1 : intersection / union;
}
module.exports = { compareSnapshot: compareSnapshot };
This is especially useful for detecting when a model upgrade or prompt change causes output to shift dramatically. A similarity threshold of 0.7-0.8 works well for natural language outputs.
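In a test, the helper slots in after any LLM call, whether it hits the real API, a cache, or a VCR fixture. A sketch using the summarizer from earlier:
// test/snapshot/summary-snapshot.test.js
var expect = require("chai").expect;
var snapshot = require("../helpers/snapshot");
var summarizer = require("../../lib/summarizer");

describe("Summary Snapshot", function () {
  this.timeout(30000);

  it("should stay close to the stored snapshot", function (done) {
    summarizer.summarize("A long article about Node.js testing strategies...")
      .then(function (result) {
        // Compare against the saved snapshot with a 0.75 Jaccard threshold
        var comparison = snapshot.compareSnapshot("summary-nodejs-testing", result.summary, 0.75);
        expect(comparison.pass, comparison.message).to.be.true;
        done();
      })
      .catch(done);
  });
});
Re-run with UPDATE_SNAPSHOTS=true to accept new output as the baseline after an intentional prompt or model change.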
Evaluation Frameworks: Scoring Response Quality Programmatically
This is where testing LLMs gets genuinely interesting. Instead of asserting exact outputs, you define quality criteria and score responses against them.
// lib/evaluator.js
function EvalSuite(name) {
this.name = name;
this.checks = [];
}
EvalSuite.prototype.addCheck = function (name, fn) {
this.checks.push({ name: name, fn: fn });
};
EvalSuite.prototype.evaluate = function (output) {
var results = [];
var totalScore = 0;
this.checks.forEach(function (check) {
var score = check.fn(output);
results.push({ check: check.name, score: score, pass: score >= 0.5 });
totalScore += score;
});
return {
suite: this.name,
overallScore: totalScore / this.checks.length,
results: results,
pass: results.every(function (r) { return r.pass; })
};
};
// Built-in check: response contains required fields when parsed as JSON
function jsonStructureCheck(requiredFields) {
return function (output) {
try {
var parsed = JSON.parse(output);
var found = 0;
requiredFields.forEach(function (field) {
if (parsed[field] !== undefined) found++;
});
return found / requiredFields.length;
} catch (e) {
return 0;
}
};
}
// Built-in check: response length within bounds
function lengthCheck(minWords, maxWords) {
return function (output) {
var wordCount = output.split(/\s+/).length;
if (wordCount < minWords) return wordCount / minWords;
if (wordCount > maxWords) return maxWords / wordCount;
return 1.0;
};
}
// Built-in check: response does not contain forbidden patterns
function noForbiddenPatterns(patterns) {
return function (output) {
var violations = 0;
patterns.forEach(function (pattern) {
if (output.match(pattern)) violations++;
});
return violations === 0 ? 1.0 : 0.0;
};
}
module.exports = {
EvalSuite: EvalSuite,
checks: {
jsonStructure: jsonStructureCheck,
length: lengthCheck,
noForbiddenPatterns: noForbiddenPatterns
}
};
Using this in tests:
// test/eval/summarizer-eval.test.js
var expect = require("chai").expect;
var evaluator = require("../../lib/evaluator");
describe("Summarizer Quality Evaluation", function () {
var suite;
beforeEach(function () {
suite = new evaluator.EvalSuite("summarizer");
suite.addCheck("length", evaluator.checks.length(10, 100));
suite.addCheck("no-hallucination-markers",
evaluator.checks.noForbiddenPatterns([
/as an AI/i,
/I cannot/i,
/I don't have access/i
])
);
suite.addCheck("contains-key-terms", function (output) {
var terms = ["node", "testing", "api"];
var found = 0;
terms.forEach(function (term) {
if (output.toLowerCase().indexOf(term) !== -1) found++;
});
return found / terms.length;
});
});
it("should score a good summary above 0.7", function () {
var goodOutput = "Node.js API testing requires a layered approach with unit tests, " +
"integration tests, and load tests to ensure reliability under production conditions.";
var result = suite.evaluate(goodOutput);
expect(result.overallScore).to.be.above(0.7);
expect(result.pass).to.be.true;
});
it("should flag a poor summary", function () {
var poorOutput = "As an AI, I cannot summarize this effectively.";
var result = suite.evaluate(poorOutput);
expect(result.pass).to.be.false;
});
});
Building a Test Harness for Prompt Regression Testing
When you change a prompt, you need to know if the output quality improved, degraded, or stayed the same. A regression harness runs a set of test cases against both the old and new prompts, then compares scores.
// test/regression/prompt-regression.js
var fs = require("fs");
var path = require("path");
var evaluator = require("../../lib/evaluator");
function PromptRegressionHarness(options) {
this.testCases = options.testCases;
this.evalSuite = options.evalSuite;
this.generateFn = options.generateFn;
this.resultsDir = options.resultsDir || path.join(__dirname, "../regression-results");
}
PromptRegressionHarness.prototype.run = function (promptVersion) {
var self = this;
var results = [];
return self.testCases.reduce(function (chain, testCase) {
return chain.then(function () {
return self.generateFn(testCase.input, promptVersion).then(function (output) {
var evalResult = self.evalSuite.evaluate(output);
results.push({
testCase: testCase.name,
input: testCase.input,
output: output,
score: evalResult.overallScore,
details: evalResult.results
});
});
});
}, Promise.resolve()).then(function () {
var summary = {
promptVersion: promptVersion,
timestamp: new Date().toISOString(),
averageScore: results.reduce(function (sum, r) { return sum + r.score; }, 0) / results.length,
results: results
};
fs.mkdirSync(self.resultsDir, { recursive: true });
var filePath = path.join(self.resultsDir, promptVersion + ".json");
fs.writeFileSync(filePath, JSON.stringify(summary, null, 2));
return summary;
});
};
PromptRegressionHarness.prototype.compare = function (versionA, versionB) {
var fileA = path.join(this.resultsDir, versionA + ".json");
var fileB = path.join(this.resultsDir, versionB + ".json");
var resultsA = JSON.parse(fs.readFileSync(fileA, "utf8"));
var resultsB = JSON.parse(fs.readFileSync(fileB, "utf8"));
var scoreDiff = resultsB.averageScore - resultsA.averageScore;
return {
versionA: versionA,
versionB: versionB,
scoreA: resultsA.averageScore,
scoreB: resultsB.averageScore,
delta: scoreDiff,
improved: scoreDiff > 0.02,
degraded: scoreDiff < -0.02,
stable: Math.abs(scoreDiff) <= 0.02
};
};
module.exports = PromptRegressionHarness;
I use this in CI to block merges that drop the average evaluation score by more than 0.02.
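Wiring the harness up looks roughly like this. It is a sketch: generateFn is assumed to call a summarizer that accepts an explicit prompt version (the complete example at the end of this article has such a signature):
// test/regression/run-regression.js
var evaluator = require("../../lib/evaluator");
var summarizer = require("../../lib/summarizer");
var PromptRegressionHarness = require("./prompt-regression");

var suite = new evaluator.EvalSuite("summarizer-regression");
suite.addCheck("length", evaluator.checks.length(10, 80));
suite.addCheck("no-ai-refs", evaluator.checks.noForbiddenPatterns([/as an AI/i]));

var harness = new PromptRegressionHarness({
  testCases: [
    { name: "technical-article", input: "Node.js uses an event-driven, non-blocking I/O model..." },
    { name: "product-description", input: "The XR-5000 wireless headphones feature active noise cancellation..." }
  ],
  evalSuite: suite,
  generateFn: function (input, promptVersion) {
    return summarizer.summarize(input, { promptVersion: promptVersion })
      .then(function (result) { return result.summary; });
  }
});

// Run both prompt versions, then compare the stored results
harness.run("v1")
  .then(function () { return harness.run("v2"); })
  .then(function () {
    var comparison = harness.compare("v1", "v2");
    console.log(JSON.stringify(comparison, null, 2));
    if (comparison.degraded) process.exit(1); // fail the CI job on a regression
  })
  .catch(function (err) {
    console.error(err);
    process.exit(1);
  });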
Testing Tool Use and Function Calling Flows
Function calling adds a layer of complexity. You need to test that the LLM picks the right tool, passes valid arguments, and that your code handles the tool result correctly.
// test/unit/tool-calling.test.js
var expect = require("chai").expect;
var nock = require("nock");
var agent = require("../../lib/agent");
describe("Tool Calling", function () {
afterEach(function () {
nock.cleanAll();
});
it("should handle a function call response", function (done) {
// First call: LLM decides to call the weather tool
nock("https://api.openai.com")
.post("/v1/chat/completions")
.reply(200, {
choices: [{
message: {
role: "assistant",
content: null,
tool_calls: [{
id: "call_abc123",
type: "function",
function: {
name: "get_weather",
arguments: '{"location": "San Francisco", "units": "fahrenheit"}'
}
}]
},
finish_reason: "tool_calls"
}],
usage: { total_tokens: 45 }
});
// Second call: LLM produces final answer after receiving tool result
nock("https://api.openai.com")
.post("/v1/chat/completions")
.reply(200, {
choices: [{
message: {
role: "assistant",
content: "The weather in San Francisco is 62°F and partly cloudy."
},
finish_reason: "stop"
}],
usage: { total_tokens: 80 }
});
agent.run("What's the weather in SF?").then(function (result) {
expect(result.toolsUsed).to.deep.equal(["get_weather"]);
expect(result.answer).to.include("San Francisco");
expect(result.totalTokens).to.equal(125);
done();
}).catch(done);
});
it("should handle invalid function arguments gracefully", function (done) {
nock("https://api.openai.com")
.post("/v1/chat/completions")
.reply(200, {
choices: [{
message: {
role: "assistant",
content: null,
tool_calls: [{
id: "call_bad",
type: "function",
function: {
name: "get_weather",
arguments: '{"location": ""}' // empty location
}
}]
},
finish_reason: "tool_calls"
}],
usage: { total_tokens: 30 }
});
agent.run("Weather?").then(function (result) {
expect(result.error).to.include("Invalid tool arguments");
done();
}).catch(done);
});
});
Always test the multi-turn flow: LLM requests tool, your code executes tool, LLM receives result, LLM generates final answer. Mock each HTTP call in sequence.
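For context, the agent under test has to implement that loop itself. A minimal sketch of what lib/agent.js might look like, assuming a local get_weather executor and a hand-rolled argument check (both are stand-ins, not a complete agent):
// lib/agent.js (sketch)
var fetch = require("node-fetch");
var config = require("./config");

// Hypothetical local tool implementations keyed by the names exposed to the model
var TOOLS = {
  get_weather: function (args) {
    if (!args.location) throw new Error("Invalid tool arguments: location is required");
    // A real implementation would call a weather service here
    return { location: args.location, temperature: 62, conditions: "partly cloudy" };
  }
};

function callApi(messages) {
  // Tool schemas omitted for brevity; a real request would include a tools array
  return fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer " + config.config.llm.apiKey
    },
    body: JSON.stringify({ model: config.getModel(), messages: messages })
  }).then(function (res) { return res.json(); });
}

function run(userPrompt) {
  var messages = [{ role: "user", content: userPrompt }];
  var toolsUsed = [];
  var totalTokens = 0;

  return callApi(messages).then(function (data) {
    totalTokens += data.usage ? data.usage.total_tokens : 0;
    var message = data.choices[0].message;

    // No tool call: the model answered directly
    if (!message.tool_calls) {
      return { answer: message.content, toolsUsed: toolsUsed, totalTokens: totalTokens };
    }

    // Execute each requested tool and append the results to the conversation
    try {
      messages.push(message);
      message.tool_calls.forEach(function (call) {
        var args = JSON.parse(call.function.arguments);
        var result = TOOLS[call.function.name](args);
        toolsUsed.push(call.function.name);
        messages.push({ role: "tool", tool_call_id: call.id, content: JSON.stringify(result) });
      });
    } catch (err) {
      return { error: err.message, toolsUsed: toolsUsed, totalTokens: totalTokens };
    }

    // Second call: the model turns the tool results into a final answer
    return callApi(messages).then(function (finalData) {
      totalTokens += finalData.usage ? finalData.usage.total_tokens : 0;
      return {
        answer: finalData.choices[0].message.content,
        toolsUsed: toolsUsed,
        totalTokens: totalTokens
      };
    });
  });
}

module.exports = { run: run };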
Load Testing LLM Integrations
LLM APIs have rate limits, and your application needs to handle concurrent requests gracefully. Here is a straightforward load test setup:
// test/load/throughput.test.js
var expect = require("chai").expect;
var summarizer = require("../../lib/summarizer");
describe("Load Test", function () {
this.timeout(120000);
it("should handle 20 concurrent requests without errors", function (done) {
var concurrency = 20;
var inputs = [];
var i;
for (i = 0; i < concurrency; i++) {
inputs.push("Article content number " + i + " about software testing.");
}
var start = Date.now();
Promise.all(inputs.map(function (input) {
return summarizer.summarize(input).then(function (result) {
return { success: true, tokens: result.tokenUsage };
}).catch(function (err) {
return { success: false, error: err.message };
});
})).then(function (results) {
var elapsed = Date.now() - start;
var successes = results.filter(function (r) { return r.success; });
var failures = results.filter(function (r) { return !r.success; });
console.log(" Elapsed: " + elapsed + "ms");
console.log(" Successes: " + successes.length + "/" + concurrency);
console.log(" Failures: " + failures.length);
if (failures.length > 0) {
console.log(" Failure reasons:", failures.map(function (f) { return f.error; }));
}
// Allow up to 10% failure rate for rate limiting
expect(successes.length).to.be.at.least(concurrency * 0.9);
done();
}).catch(done);
});
});
Run load tests against the real API from a staging environment, not against mocked responses in CI. The point is to discover rate-limiting behavior, timeout thresholds, and concurrency bottlenecks.
Testing Streaming Responses
Streaming responses from LLMs require special handling. You need to test that your code correctly assembles chunks, handles mid-stream errors, and fires events properly.
// test/unit/streaming.test.js
var expect = require("chai").expect;
var nock = require("nock");
var streamHandler = require("../../lib/stream-handler");
describe("Streaming Response Handler", function () {
afterEach(function () {
nock.cleanAll();
});
it("should assemble streamed chunks into complete response", function (done) {
var chunks = [
'data: {"choices":[{"delta":{"role":"assistant"},"index":0}]}\n\n',
'data: {"choices":[{"delta":{"content":"Hello"},"index":0}]}\n\n',
'data: {"choices":[{"delta":{"content":" world"},"index":0}]}\n\n',
'data: {"choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}\n\n',
'data: [DONE]\n\n'
].join("");
nock("https://api.openai.com")
.post("/v1/chat/completions")
.reply(200, chunks, {
"Content-Type": "text/event-stream"
});
var collectedChunks = [];
streamHandler.streamCompletion("Say hello", {
onChunk: function (text) {
collectedChunks.push(text);
},
onComplete: function (fullText) {
expect(fullText).to.equal("Hello world");
expect(collectedChunks).to.deep.equal(["Hello", " world"]);
done();
},
onError: function (err) {
done(err);
}
});
});
it("should handle mid-stream disconnection", function (done) {
nock("https://api.openai.com")
.post("/v1/chat/completions")
.replyWithError("Socket hang up");
streamHandler.streamCompletion("Say hello", {
onChunk: function () {},
onComplete: function () {
done(new Error("Should not complete"));
},
onError: function (err) {
expect(err.message).to.include("Socket hang up");
done();
}
});
});
});
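The tests assume a lib/stream-handler module that parses the server-sent event stream. One way it might look with node-fetch v2, as a sketch:
// lib/stream-handler.js (sketch)
var fetch = require("node-fetch");
var config = require("./config");

function streamCompletion(prompt, handlers) {
  var fullText = "";
  var buffer = "";

  fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": "Bearer " + config.config.llm.apiKey
    },
    body: JSON.stringify({
      model: config.getModel(),
      messages: [{ role: "user", content: prompt }],
      stream: true
    })
  }).then(function (response) {
    // node-fetch v2 exposes the body as a Node readable stream
    response.body.on("data", function (data) {
      buffer += data.toString("utf8");
      var events = buffer.split("\n\n");
      buffer = events.pop(); // keep any incomplete event for the next chunk
      events.forEach(function (event) {
        var line = event.trim();
        if (line.indexOf("data: ") !== 0) return;
        var payload = line.slice(6);
        if (payload === "[DONE]") return;
        try {
          var delta = JSON.parse(payload).choices[0].delta;
          if (delta && delta.content) {
            fullText += delta.content;
            handlers.onChunk(delta.content);
          }
        } catch (err) {
          handlers.onError(err);
        }
      });
    });
    response.body.on("end", function () {
      handlers.onComplete(fullText);
    });
    response.body.on("error", function (err) {
      handlers.onError(err);
    });
  }).catch(function (err) {
    handlers.onError(err);
  });
}

module.exports = { streamCompletion: streamCompletion };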
Cost-Controlled Testing
Running full test suites against GPT-4 or Claude Opus gets expensive fast. Here are strategies I use to keep costs under control.
Use cheaper models in test environments:
// lib/config.js
var config = {
llm: {
model: process.env.LLM_MODEL || "gpt-4o",
testModel: process.env.LLM_TEST_MODEL || "gpt-4o-mini",
apiKey: process.env.OPENAI_API_KEY
}
};
function getModel() {
if (process.env.NODE_ENV === "test") {
return config.llm.testModel;
}
return config.llm.model;
}
module.exports = { config: config, getModel: getModel };
Cache test responses to disk:
// test/helpers/response-cache.js
var fs = require("fs");
var path = require("path");
var crypto = require("crypto");
var CACHE_DIR = path.join(__dirname, "../.response-cache");
function getCacheKey(prompt, model) {
var hash = crypto.createHash("sha256");
hash.update(prompt + "|" + model);
return hash.digest("hex").substring(0, 16);
}
function getCached(prompt, model) {
var key = getCacheKey(prompt, model);
var cachePath = path.join(CACHE_DIR, key + ".json");
if (fs.existsSync(cachePath)) {
var cached = JSON.parse(fs.readFileSync(cachePath, "utf8"));
var age = Date.now() - cached.timestamp;
// Cache expires after 7 days
if (age < 7 * 24 * 60 * 60 * 1000) {
return cached.response;
}
}
return null;
}
function setCache(prompt, model, response) {
fs.mkdirSync(CACHE_DIR, { recursive: true });
var key = getCacheKey(prompt, model);
var cachePath = path.join(CACHE_DIR, key + ".json");
fs.writeFileSync(cachePath, JSON.stringify({
prompt: prompt,
model: model,
response: response,
timestamp: Date.now()
}, null, 2));
}
module.exports = { getCached: getCached, setCache: setCache };
Add .response-cache/ to your .gitignore. I have seen test caches grow to hundreds of megabytes in busy projects.
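A thin wrapper in a test helper consults the cache before touching the API. A sketch, assuming the summarizer and config modules used throughout this article:
// test/helpers/cached-summarize.js
var summarizer = require("../../lib/summarizer");
var config = require("../../lib/config");
var cache = require("./response-cache");

// Summarize with the disk cache in front of the API call
function cachedSummarize(text) {
  var model = config.getModel();
  var cached = cache.getCached(text, model);
  if (cached) {
    return Promise.resolve(cached);
  }
  return summarizer.summarize(text, { model: model }).then(function (result) {
    cache.setCache(text, model, result);
    return result;
  });
}

module.exports = { cachedSummarize: cachedSummarize };
Evaluation and regression tests call cachedSummarize instead of summarize; the first run pays for the tokens, and repeats are free until the 7-day TTL expires.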
CI Pipeline Setup for LLM Tests
Setting up LLM tests in CI requires balancing cost, speed, and reliability. Here is a practical GitHub Actions configuration:
# .github/workflows/llm-tests.yml
name: LLM Test Suite
on:
pull_request:
paths:
- 'lib/**'
- 'prompts/**'
- 'test/**'
jobs:
unit-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: '20'
- run: npm ci
- run: npm run test:unit
env:
NODE_ENV: test
fixture-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: '20'
- run: npm ci
- run: npm run test:fixture
env:
NODE_ENV: test
evaluation-tests:
runs-on: ubuntu-latest
if: startsWith(github.event.pull_request.head.ref, 'prompt-') || contains(github.event.pull_request.labels.*.name, 'run-eval')
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: '20'
- run: npm ci
- run: npm run test:eval
env:
NODE_ENV: test
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY_TEST }}
LLM_TEST_MODEL: gpt-4o-mini
LLM_TEST_BUDGET: "5.00"
- uses: actions/upload-artifact@v4
with:
name: eval-results
path: test/regression-results/
Key decisions:
- Unit and fixture tests run on every PR — they are free and fast
- Evaluation tests only run when labeled or on prompt branches — they cost money
- Use a dedicated test API key with spending limits set in the provider dashboard
- Upload evaluation results as artifacts for comparison across runs
Add a test budget tracker:
// test/helpers/budget.js
var totalSpent = 0;
var BUDGET_LIMIT = parseFloat(process.env.LLM_TEST_BUDGET || "5.00");
// Approximate costs per 1K tokens (adjust for current pricing)
var COST_PER_1K = {
"gpt-4o": { input: 0.0025, output: 0.01 },
"gpt-4o-mini": { input: 0.00015, output: 0.0006 },
"claude-3-5-sonnet": { input: 0.003, output: 0.015 }
};
function trackUsage(model, inputTokens, outputTokens) {
var costs = COST_PER_1K[model] || { input: 0.01, output: 0.03 };
var cost = (inputTokens / 1000 * costs.input) + (outputTokens / 1000 * costs.output);
totalSpent += cost;
if (totalSpent > BUDGET_LIMIT) {
throw new Error(
"LLM test budget exceeded: $" + totalSpent.toFixed(4) +
" > $" + BUDGET_LIMIT.toFixed(2) +
". Remaining tests will be skipped."
);
}
return { cost: cost, totalSpent: totalSpent, remaining: BUDGET_LIMIT - totalSpent };
}
function getSpent() {
return totalSpent;
}
module.exports = { trackUsage: trackUsage, getSpent: getSpent };
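To enforce the budget, call trackUsage from whatever wrapper your eval tests use after each API response, and report the total when the suite finishes. A sketch, loaded alongside the eval specs (for example via mocha's --file option):
// test/eval/budget-setup.js
var budget = require("../helpers/budget");

// Call this from your eval helper after every completion response
function recordUsage(model, usage) {
  if (!usage) return;
  var status = budget.trackUsage(
    model,
    usage.prompt_tokens || 0,
    usage.completion_tokens || 0
  );
  console.log("    [budget] $" + status.totalSpent.toFixed(4) +
    " spent, $" + status.remaining.toFixed(4) + " remaining");
}

// Root-level hook: print the total once all eval tests have run
after(function () {
  console.log("Total LLM spend this run: $" + budget.getSpent().toFixed(4));
});

module.exports = { recordUsage: recordUsage };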
Property-Based Testing for LLM Outputs
When you cannot assert exact outputs, assert on structural properties. This approach catches a surprising number of bugs.
// test/properties/output-properties.test.js
var expect = require("chai").expect;
describe("LLM Output Properties", function () {
// These tests run against recorded fixtures or cached responses
var outputs = require("../fixtures/sample-outputs.json");
describe("JSON Classification Output", function () {
outputs.classifications.forEach(function (sample, index) {
it("sample " + index + " should have valid structure", function () {
var parsed;
// Property: output must be valid JSON
expect(function () {
parsed = JSON.parse(sample.output);
}).to.not.throw();
// Property: must have required fields
expect(parsed).to.have.property("category");
expect(parsed).to.have.property("confidence");
// Property: category must be from allowed set
var allowedCategories = [
"billing", "technical", "account", "general", "spam"
];
expect(allowedCategories).to.include(parsed.category);
// Property: confidence must be a number between 0 and 1
expect(parsed.confidence).to.be.a("number");
expect(parsed.confidence).to.be.within(0, 1);
});
});
});
describe("Summary Output", function () {
outputs.summaries.forEach(function (sample, index) {
it("sample " + index + " should meet length constraints", function () {
var wordCount = sample.output.split(/\s+/).length;
// Property: summary should be shorter than input
var inputWordCount = sample.input.split(/\s+/).length;
expect(wordCount).to.be.below(inputWordCount);
// Property: summary should be at least 10 words
expect(wordCount).to.be.at.least(10);
// Property: summary should not exceed 200 words
expect(wordCount).to.be.at.most(200);
});
it("sample " + index + " should not contain meta-commentary", function () {
// Property: output should not contain AI self-references
expect(sample.output).to.not.match(/as an ai/i);
expect(sample.output).to.not.match(/i am a language model/i);
expect(sample.output).to.not.match(/i cannot/i);
});
});
});
});
Property-based tests are the workhorse of LLM testing in production. They are stable across model updates, catch real bugs, and communicate your quality requirements clearly.
Complete Working Example
Here is a full test suite you can drop into a project. It covers mocked unit tests, evaluation scoring, and a fixture-based prompt regression comparison.
// lib/summarizer.js
var fetch = require("node-fetch");
var config = require("./config");
var PROMPT_VERSIONS = {
"v1": "Summarize the following text in 2-3 sentences:",
"v2": "Provide a concise 2-3 sentence summary. Focus on key facts and conclusions:"
};
function summarize(text, options) {
options = options || {};
var model = options.model || config.getModel();
var promptVersion = options.promptVersion || "v2";
var prompt = PROMPT_VERSIONS[promptVersion] || PROMPT_VERSIONS["v2"];
return fetch("https://api.openai.com/v1/chat/completions", {
method: "POST",
headers: {
"Content-Type": "application/json",
"Authorization": "Bearer " + config.config.llm.apiKey
},
body: JSON.stringify({
model: model,
messages: [
{ role: "system", content: prompt },
{ role: "user", content: text }
],
temperature: 0
})
}).then(function (response) {
if (response.status === 429) {
// Simple retry after 1 second
return new Promise(function (resolve) {
setTimeout(resolve, 1000);
}).then(function () {
return summarize(text, options);
});
}
return response.json();
}).then(function (data) {
if (!data.choices || data.choices.length === 0) {
throw new Error("No completion choices returned from API");
}
return {
summary: data.choices[0].message.content,
tokenUsage: data.usage ? data.usage.total_tokens : 0,
model: model
};
});
}
module.exports = { summarize: summarize, PROMPT_VERSIONS: PROMPT_VERSIONS };
// test/complete-suite.test.js
var expect = require("chai").expect;
var nock = require("nock");
var summarizer = require("../lib/summarizer");
var evaluator = require("../lib/evaluator");
// =========================================
// 1. UNIT TESTS (mocked, zero API cost)
// =========================================
describe("Unit Tests", function () {
afterEach(function () {
nock.cleanAll();
});
it("should return summary and token count", function (done) {
nock("https://api.openai.com")
.post("/v1/chat/completions")
.reply(200, {
choices: [{
message: {
content: "Node.js is a JavaScript runtime for server-side development."
},
finish_reason: "stop"
}],
usage: { prompt_tokens: 40, completion_tokens: 12, total_tokens: 52 }
});
summarizer.summarize("Long text about Node.js...")
.then(function (result) {
expect(result.summary).to.be.a("string");
expect(result.summary.length).to.be.above(10);
expect(result.tokenUsage).to.equal(52);
done();
})
.catch(done);
});
it("should retry on 429", function (done) {
this.timeout(5000);
nock("https://api.openai.com")
.post("/v1/chat/completions")
.reply(429, { error: { message: "Rate limit exceeded" } });
nock("https://api.openai.com")
.post("/v1/chat/completions")
.reply(200, {
choices: [{ message: { content: "Retried successfully." }, finish_reason: "stop" }],
usage: { total_tokens: 20 }
});
summarizer.summarize("Test text")
.then(function (result) {
expect(result.summary).to.equal("Retried successfully.");
done();
})
.catch(done);
});
it("should throw on empty choices", function (done) {
nock("https://api.openai.com")
.post("/v1/chat/completions")
.reply(200, { choices: [], usage: { total_tokens: 10 } });
summarizer.summarize("Test")
.then(function () {
done(new Error("Expected error"));
})
.catch(function (err) {
expect(err.message).to.equal("No completion choices returned from API");
done();
});
});
});
// =========================================
// 2. EVALUATION TESTS (scored quality)
// =========================================
describe("Evaluation Tests", function () {
var suite;
beforeEach(function () {
suite = new evaluator.EvalSuite("summarizer-eval");
suite.addCheck("length", evaluator.checks.length(10, 80));
suite.addCheck("no-ai-references",
evaluator.checks.noForbiddenPatterns([/as an AI/i, /language model/i])
);
});
it("should pass evaluation for well-formed summary", function () {
var output = "Express.js middleware chains handle HTTP requests through " +
"a pipeline of functions. Each middleware can modify the request, " +
"response, or terminate the chain by sending a response.";
var result = suite.evaluate(output);
expect(result.overallScore).to.be.above(0.7);
expect(result.pass).to.be.true;
console.log(" Overall score:", result.overallScore.toFixed(3));
result.results.forEach(function (r) {
console.log(" " + r.check + ": " + r.score.toFixed(3) + " (" + (r.pass ? "PASS" : "FAIL") + ")");
});
});
it("should fail evaluation for AI self-reference", function () {
var output = "As an AI language model, I think Express is a framework.";
var result = suite.evaluate(output);
expect(result.pass).to.be.false;
});
});
// =========================================
// 3. PROMPT REGRESSION TESTS
// =========================================
describe("Prompt Regression", function () {
this.timeout(30000);
var testCases = [
{
name: "technical-article",
input: "Node.js uses an event-driven, non-blocking I/O model that makes it " +
"lightweight and efficient. It uses the V8 JavaScript engine and is " +
"ideal for data-intensive real-time applications running across distributed devices."
},
{
name: "product-description",
input: "The XR-5000 wireless headphones feature active noise cancellation, " +
"40-hour battery life, and Bluetooth 5.3 connectivity. Available in " +
"midnight black and arctic white. Retail price $199.99."
}
];
it("should compare prompt v1 vs v2 using fixtures", function () {
// Load pre-recorded outputs for both prompt versions
var v1Outputs = {
"technical-article": "Node.js is a lightweight runtime using event-driven I/O and V8 engine for real-time apps.",
"product-description": "The XR-5000 offers noise cancellation, 40-hour battery, and Bluetooth 5.3 for $199.99."
};
var v2Outputs = {
"technical-article": "Node.js leverages non-blocking I/O and V8 for efficient, real-time distributed applications.",
"product-description": "XR-5000 wireless headphones feature ANC, 40-hour battery, and BT 5.3 at $199.99 in two colors."
};
var suite = new evaluator.EvalSuite("regression");
suite.addCheck("length", evaluator.checks.length(10, 50));
suite.addCheck("no-ai-refs", evaluator.checks.noForbiddenPatterns([/as an AI/i]));
var v1Scores = [];
var v2Scores = [];
testCases.forEach(function (tc) {
v1Scores.push(suite.evaluate(v1Outputs[tc.name]).overallScore);
v2Scores.push(suite.evaluate(v2Outputs[tc.name]).overallScore);
});
var v1Avg = v1Scores.reduce(function (a, b) { return a + b; }, 0) / v1Scores.length;
var v2Avg = v2Scores.reduce(function (a, b) { return a + b; }, 0) / v2Scores.length;
var delta = v2Avg - v1Avg;
console.log(" Prompt v1 avg score: " + v1Avg.toFixed(3));
console.log(" Prompt v2 avg score: " + v2Avg.toFixed(3));
console.log(" Delta: " + delta.toFixed(3));
// v2 should not be worse than v1
expect(delta).to.be.at.least(-0.02);
});
});
Run the full suite:
npx mocha test/complete-suite.test.js --timeout 30000
Expected output:
Unit Tests
✓ should return summary and token count (45ms)
✓ should retry on 429 (1120ms)
✓ should throw on empty choices (12ms)
Evaluation Tests
Overall score: 1.000
length: 1.000 (PASS)
no-ai-references: 1.000 (PASS)
✓ should pass evaluation for well-formed summary
✓ should fail evaluation for AI self-reference
Prompt Regression
Prompt v1 avg score: 1.000
Prompt v2 avg score: 1.000
Delta: 0.000
✓ should compare prompt v1 vs v2 using fixtures
6 passing (1.2s)
Common Issues and Troubleshooting
1. Nock Interceptor Mismatch
Error: Nock: No match for request {
method: 'POST',
url: 'https://api.openai.com/v1/chat/completions',
headers: { 'content-type': 'application/json' }
}
This happens when the request body or headers do not match what nock expects. Fix by using .post("/v1/chat/completions") without a body matcher, or match loosely:
nock("https://api.openai.com")
.post("/v1/chat/completions", function () { return true; })
.reply(200, mockResponse);
2. Fixture Deserialization Error
SyntaxError: Unexpected token u in JSON at position 0
at JSON.parse (<anonymous>)
at Object.getCached (test/helpers/response-cache.js:18:27)
Your cache file is corrupted or was partially written. Delete the .response-cache/ directory so the helper re-fetches and re-caches the response. If the error points at a VCR fixture instead, delete the fixture and re-record it with VCR_RECORD=true. Also check that your fixture file paths are correct and the file is not empty.
3. Test Budget Exceeded in CI
Error: LLM test budget exceeded: $5.0342 > $5.00. Remaining tests will be skipped.
Your evaluation tests are consuming more tokens than budgeted. Solutions: reduce the number of eval test cases, switch to a cheaper model for CI (gpt-4o-mini), or increase the budget with LLM_TEST_BUDGET=10.00. You can also cache responses so repeated CI runs do not re-invoke the API.
4. Streaming Test Timeout
Error: Timeout of 2000ms exceeded. For async tests and hooks, ensure "done()" is called.
Streaming tests need longer timeouts. Add this.timeout(10000) to the test or describe block. Also verify that your stream handler actually calls the onComplete or onError callback in all code paths, including when the connection is reset before any data arrives.
5. Rate Limiting Cascade in Parallel Tests
Error: 429 Too Many Requests
{ "error": { "message": "Rate limit reached for gpt-4o-mini" } }
Running multiple test files in parallel can exhaust your rate limit. Disable mocha's --parallel mode (or set --jobs 1) so test files run sequentially, or add request queuing in your LLM client wrapper with a concurrency limit of 3-5 requests, as in the sketch below.
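A small promise queue is enough to cap concurrency without pulling in a library. A sketch, with the limit and module name chosen for illustration:
// lib/request-queue.js (sketch)
function RequestQueue(concurrency) {
  this.concurrency = concurrency || 3;
  this.active = 0;
  this.pending = [];
}

// Enqueue a function that returns a promise; at most `concurrency` run at once
RequestQueue.prototype.add = function (taskFn) {
  var self = this;
  return new Promise(function (resolve, reject) {
    self.pending.push({ taskFn: taskFn, resolve: resolve, reject: reject });
    self._next();
  });
};

RequestQueue.prototype._next = function () {
  var self = this;
  if (self.active >= self.concurrency || self.pending.length === 0) return;
  var job = self.pending.shift();
  self.active++;
  job.taskFn()
    .then(job.resolve, job.reject)
    .then(function () {
      self.active--;
      self._next();
    });
};

module.exports = RequestQueue;
Route every outgoing completion call through queue.add(function () { return fetch(...); }) in your client wrapper so parallel test files cannot exceed the limit.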
Best Practices
Layer your tests. Unit tests with mocks for logic, VCR fixtures for integration paths, evaluation scoring for quality, regression harnesses for prompt changes. Each layer catches different failures.
Never assert on exact LLM output strings. Assert on structure, length bounds, required fields, forbidden patterns, and quality scores. Exact string assertions will break on every model update.
Record fixtures once, replay everywhere. VCR-style testing gives you realistic data without API costs. Re-record fixtures monthly or when you change models.
Set a hard test budget in CI. Use a budget tracker that throws when spending exceeds the limit. Start with $5 per CI run and adjust. Without a budget, a misconfigured test loop can burn through hundreds of dollars.
Use cheaper models for test environments. GPT-4o-mini or Claude Haiku cost 10-50x less than full models. Your tests should validate behavior patterns, not model intelligence.
Scrub API keys from all fixtures and recordings. Audit your fixture files before every commit. Add a pre-commit hook that scans for API key patterns in JSON files under test/.
Test error paths as thoroughly as happy paths. Rate limits, timeouts, malformed responses, empty choices, invalid JSON, network failures. LLM APIs fail in creative ways. Your tests should cover every one of them.
Keep prompt versions in code, not in dashboards. Version your prompts alongside your tests so you can regression-test across versions. Store them as named strings or template files in your repository.
Cache test responses to disk with TTL expiry. A 7-day cache eliminates redundant API calls during local development. Add the cache directory to .gitignore and document how to clear it.
Run evaluation tests only on prompt-change PRs. Gate expensive tests behind PR labels or branch name patterns. Unit and fixture tests run on every push; eval tests run only when you are changing prompts.
References
- Nock - HTTP server mocking for Node.js
- Sinon.js - Test spies, stubs, and mocks
- Mocha - JavaScript test framework
- OpenAI API Reference - Chat Completions
- Anthropic API Reference - Messages
- DeepEval - LLM Evaluation Framework
- Promptfoo - Prompt Testing and Evaluation
- Braintrust - LLM Evaluation Platform