Embedding Performance Benchmarking
Benchmark embedding models with retrieval metrics (recall, MRR, nDCG), latency testing, and automated comparison pipelines in Node.js.
Choosing an embedding model based on leaderboard scores is a mistake most teams make exactly once. The model that tops MTEB benchmarks may perform terribly on your specific domain, your query patterns, and your retrieval pipeline. This article walks through building a rigorous, automated embedding benchmark suite in Node.js that measures what actually matters: retrieval quality, latency, throughput, and cost on your data.
Prerequisites
- Node.js 18+ installed
- PostgreSQL with pgvector extension enabled
- OpenAI API key (or equivalent embedding provider)
- Familiarity with vector embeddings and similarity search concepts
- Basic understanding of information retrieval metrics
Install the required packages:
npm install openai pg pgvector mathjs cli-table3
Why Benchmark on Your Own Data
Public benchmarks like MTEB evaluate models on general-purpose datasets — Wikipedia passages, academic papers, web crawl snippets. Your data is different. If you are building a retrieval system for legal contracts, medical records, or API documentation, the vocabulary, document length, and query patterns diverge sharply from anything in a public benchmark.
I have seen teams switch from text-embedding-3-large to text-embedding-3-small and actually improve retrieval quality on their domain because the larger model was overfitting to patterns irrelevant to their corpus. You cannot know this without measuring it yourself.
Benchmarking also catches regressions. When you update your chunking strategy, swap out a reranker, or change your index configuration, you need a repeatable test that tells you whether things got better or worse. Gut feeling does not scale.
Key Metrics to Measure
A complete embedding benchmark covers four dimensions:
Retrieval Quality — Does the system return the right documents? Measured by recall@k, precision@k, MRR, and nDCG.
Latency — How long does it take to generate an embedding and execute a search? Measured in milliseconds at p50, p95, and p99.
Throughput — How many embeddings or queries can you process per second? Critical for batch pipelines and high-traffic applications.
Cost — What does each embedding and query cost in API fees and compute? Often ignored until the invoice arrives.
Building a Benchmark Dataset
The foundation of any embedding benchmark is a labeled dataset: a set of queries, each paired with the documents that are known to be relevant. This is sometimes called a "gold standard" or "ground truth" dataset.
You do not need thousands of examples to start. Fifty well-curated query-document pairs will tell you more than a sloppy dataset of five thousand. Quality matters far more than quantity here.
// benchmark-dataset.js
// Structure for a benchmark dataset with queries and known relevant documents
var dataset = {
name: "API Documentation Retrieval Benchmark",
version: "1.0.0",
created: new Date().toISOString(),
queries: [
{
id: "q1",
text: "How do I authenticate with OAuth 2.0?",
relevant_doc_ids: ["doc_auth_oauth", "doc_auth_tokens", "doc_auth_flows"],
relevance_scores: {
"doc_auth_oauth": 3, // highly relevant
"doc_auth_tokens": 2, // relevant
"doc_auth_flows": 1 // marginally relevant
}
},
{
id: "q2",
text: "rate limiting configuration",
relevant_doc_ids: ["doc_ratelimit_setup", "doc_ratelimit_headers"],
relevance_scores: {
"doc_ratelimit_setup": 3,
"doc_ratelimit_headers": 2
}
},
{
id: "q3",
text: "How to handle pagination in REST APIs?",
relevant_doc_ids: ["doc_pagination_cursor", "doc_pagination_offset"],
relevance_scores: {
"doc_pagination_cursor": 3,
"doc_pagination_offset": 2
}
}
]
};
module.exports = dataset;
The relevance_scores field uses graded relevance (a 0-3 scale), which is essential for nDCG calculation. Binary relevance (relevant or not) works for recall and precision but loses information about how relevant each document is.
To build your dataset, start by pulling real queries from your search logs. Then have a domain expert label which documents should be returned for each query. If you do not have search logs yet, write the queries yourself — think about what your users would actually type.
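If you already have search logs, a short script can turn them into an unlabeled skeleton for the expert to fill in. The sketch below assumes a newline-delimited log file with one raw query per line; the file names, the 50-query cap, and the output path are placeholders to adapt to your own logging setup.
// build-dataset-skeleton.js
// Turn raw logged queries into an unlabeled benchmark skeleton for expert labeling
var fs = require("fs");
function buildSkeleton(logPath, outputPath, maxQueries) {
  var lines = fs.readFileSync(logPath, "utf8").trim().split("\n");
  var seen = {};
  var queries = [];
  // Deduplicate and keep the first maxQueries distinct queries
  for (var i = 0; i < lines.length && queries.length < maxQueries; i++) {
    var text = lines[i].trim();
    if (!text || seen[text]) continue;
    seen[text] = true;
    queries.push({
      id: "q" + (queries.length + 1),
      text: text,
      relevant_doc_ids: [],  // domain expert fills these in
      relevance_scores: {}   // expert assigns 1-3 per doc id
    });
  }
  var skeleton = {
    name: "Unlabeled benchmark skeleton",
    version: "0.1.0",
    created: new Date().toISOString(),
    queries: queries
  };
  fs.writeFileSync(outputPath, JSON.stringify(skeleton, null, 2));
  return skeleton;
}
buildSkeleton("./search-queries.log", "./benchmark-dataset-skeleton.json", 50);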
Implementing Recall@k and Precision@k
Recall@k answers: "Of all the relevant documents, how many appeared in the top k results?" Precision@k answers: "Of the top k results, how many were relevant?"
// metrics.js
// Core retrieval quality metrics for embedding benchmarks
function recallAtK(retrievedIds, relevantIds, k) {
var topK = retrievedIds.slice(0, k);
var hits = 0;
for (var i = 0; i < relevantIds.length; i++) {
if (topK.indexOf(relevantIds[i]) !== -1) {
hits++;
}
}
return relevantIds.length > 0 ? hits / relevantIds.length : 0;
}
function precisionAtK(retrievedIds, relevantIds, k) {
var topK = retrievedIds.slice(0, k);
var hits = 0;
for (var i = 0; i < topK.length; i++) {
if (relevantIds.indexOf(topK[i]) !== -1) {
hits++;
}
}
return topK.length > 0 ? hits / topK.length : 0;
}
// Average across all queries in the dataset
function averageRecallAtK(results, dataset, k) {
var total = 0;
var queries = dataset.queries;
for (var i = 0; i < queries.length; i++) {
var query = queries[i];
var retrieved = results[query.id] || [];
total += recallAtK(retrieved, query.relevant_doc_ids, k);
}
return queries.length > 0 ? total / queries.length : 0;
}
function averagePrecisionAtK(results, dataset, k) {
var total = 0;
var queries = dataset.queries;
for (var i = 0; i < queries.length; i++) {
var query = queries[i];
var retrieved = results[query.id] || [];
total += precisionAtK(retrieved, query.relevant_doc_ids, k);
}
return queries.length > 0 ? total / queries.length : 0;
}
module.exports = {
recallAtK: recallAtK,
precisionAtK: precisionAtK,
averageRecallAtK: averageRecallAtK,
averagePrecisionAtK: averagePrecisionAtK
};
A common pitfall: recall@k can be misleading when the number of relevant documents varies wildly across queries. If one query has two relevant documents and another has fifty, the query with two will almost always have perfect recall@5, inflating the average. Consider weighting by the number of relevant documents or reporting median alongside mean.
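One cheap way to guard against that skew is to report the median next to the mean. This small helper reuses the recallAtK function defined above:
// Median recall@k across queries; less sensitive than the mean to a handful
// of easy queries that have very few relevant documents
function medianRecallAtK(results, dataset, k) {
  var scores = [];
  for (var i = 0; i < dataset.queries.length; i++) {
    var query = dataset.queries[i];
    var retrieved = results[query.id] || [];
    scores.push(recallAtK(retrieved, query.relevant_doc_ids, k));
  }
  if (scores.length === 0) return 0;
  scores.sort(function(a, b) { return a - b; });
  var mid = Math.floor(scores.length / 2);
  return scores.length % 2 === 0 ? (scores[mid - 1] + scores[mid]) / 2 : scores[mid];
}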
Mean Reciprocal Rank (MRR)
MRR measures how high up the first relevant result appears. It is the average of the reciprocal of the rank of the first relevant document across all queries. If the first relevant document is at position 1, the reciprocal rank is 1. At position 3, it is 1/3. If no relevant document appears, it is 0.
MRR is particularly useful when your application shows a single "best answer" rather than a list of results.
function reciprocalRank(retrievedIds, relevantIds) {
for (var i = 0; i < retrievedIds.length; i++) {
if (relevantIds.indexOf(retrievedIds[i]) !== -1) {
return 1 / (i + 1);
}
}
return 0;
}
function meanReciprocalRank(results, dataset) {
var total = 0;
var queries = dataset.queries;
for (var i = 0; i < queries.length; i++) {
var query = queries[i];
var retrieved = results[query.id] || [];
total += reciprocalRank(retrieved, query.relevant_doc_ids);
}
return queries.length > 0 ? total / queries.length : 0;
}
module.exports.reciprocalRank = reciprocalRank;
module.exports.meanReciprocalRank = meanReciprocalRank;
Normalized Discounted Cumulative Gain (nDCG)
nDCG is the gold standard metric for ranking quality when you have graded relevance scores. It accounts for both the relevance level and the position of each result. Documents at the top of the list contribute more to the score than documents further down, and highly relevant documents contribute more than marginally relevant ones.
The math: DCG sums (2^relevance - 1) / log2(position + 1) for each result. IDCG is the DCG of a perfect ranking. nDCG is DCG / IDCG.
function dcgAtK(retrievedIds, relevanceScores, k) {
var dcg = 0;
var topK = retrievedIds.slice(0, k);
for (var i = 0; i < topK.length; i++) {
var relevance = relevanceScores[topK[i]] || 0;
dcg += (Math.pow(2, relevance) - 1) / (Math.log2(i + 2)); // i+2 because position is 1-indexed
}
return dcg;
}
function idealDcgAtK(relevanceScores, k) {
// Sort all relevance scores descending to get ideal ranking
var scores = [];
var keys = Object.keys(relevanceScores);
for (var i = 0; i < keys.length; i++) {
scores.push(relevanceScores[keys[i]]);
}
scores.sort(function(a, b) { return b - a; });
scores = scores.slice(0, k);
var idcg = 0;
for (var j = 0; j < scores.length; j++) {
idcg += (Math.pow(2, scores[j]) - 1) / (Math.log2(j + 2));
}
return idcg;
}
function ndcgAtK(retrievedIds, relevanceScores, k) {
var dcg = dcgAtK(retrievedIds, relevanceScores, k);
var idcg = idealDcgAtK(relevanceScores, k);
return idcg > 0 ? dcg / idcg : 0;
}
function averageNdcgAtK(results, dataset, k) {
var total = 0;
var queries = dataset.queries;
for (var i = 0; i < queries.length; i++) {
var query = queries[i];
var retrieved = results[query.id] || [];
total += ndcgAtK(retrieved, query.relevance_scores, k);
}
return queries.length > 0 ? total / queries.length : 0;
}
module.exports.ndcgAtK = ndcgAtK;
module.exports.averageNdcgAtK = averageNdcgAtK;
nDCG ranges from 0 to 1, where 1 means the ranking is perfect. In practice, anything above 0.7 is solid for most retrieval applications. Below 0.5 means your model is struggling with the domain.
Latency Benchmarking
Latency matters more than most teams realize. A 200ms embedding generation time means your user waits at least that long before any search results appear. Measure both embedding generation latency and vector search latency separately so you know where to optimize.
// latency-benchmark.js
var OpenAI = require("openai");
function createTimingWrapper() {
var measurements = [];
return {
measurements: measurements,
measure: function(fn) {
return function() {
var args = Array.prototype.slice.call(arguments);
var start = process.hrtime.bigint();
return fn.apply(null, args).then(function(result) {
var end = process.hrtime.bigint();
var durationMs = Number(end - start) / 1e6;
measurements.push(durationMs);
return { result: result, durationMs: durationMs };
});
};
},
getStats: function() {
var sorted = measurements.slice().sort(function(a, b) { return a - b; });
var sum = 0;
for (var i = 0; i < sorted.length; i++) {
sum += sorted[i];
}
return {
count: sorted.length,
mean: sum / sorted.length,
p50: sorted[Math.floor(sorted.length * 0.5)],
p95: sorted[Math.floor(sorted.length * 0.95)],
p99: sorted[Math.floor(sorted.length * 0.99)],
min: sorted[0],
max: sorted[sorted.length - 1]
};
}
};
}
function benchmarkEmbeddingLatency(client, model, texts, options) {
var batchSize = (options && options.batchSize) || 1;
var warmupRuns = (options && options.warmupRuns) || 3;
var timer = createTimingWrapper();
var generateEmbedding = timer.measure(function(input) {
return client.embeddings.create({
model: model,
input: input
});
});
return runBenchmark();
function runBenchmark() {
// Warmup phase — discard these measurements
var warmupPromise = Promise.resolve();
for (var w = 0; w < warmupRuns; w++) {
warmupPromise = warmupPromise.then(function() {
return generateEmbedding(texts[0]);
});
}
return warmupPromise.then(function() {
timer.measurements.length = 0; // reset after warmup
var chain = Promise.resolve();
for (var i = 0; i < texts.length; i += batchSize) {
(function(batch) {
chain = chain.then(function() {
return generateEmbedding(batch);
});
})(texts.slice(i, i + batchSize));
}
return chain.then(function() {
return timer.getStats();
});
});
}
}
module.exports = {
createTimingWrapper: createTimingWrapper,
benchmarkEmbeddingLatency: benchmarkEmbeddingLatency
};
Always include a warmup phase. The first few API calls to any service are slower due to connection establishment, TLS handshakes, and cold starts on the provider's side. Discard warmup measurements to avoid skewing your results.
Report p50, p95, and p99 — not just the mean. The mean hides tail latency. If your p99 is 3x your p50, you have a problem that the mean will never show you.
Throughput Testing
Throughput measures how many operations you can complete per second. This matters for batch indexing (how fast can you embed your entire corpus?) and for concurrent query handling.
function benchmarkThroughput(client, model, texts, concurrency) {
var completed = 0;
var startTime = Date.now();
var errors = 0;
function worker(queue) {
if (queue.length === 0) {
return Promise.resolve();
}
var batch = queue.shift();
return client.embeddings.create({
model: model,
input: batch
}).then(function() {
completed++;
return worker(queue);
}).catch(function(err) {
errors++;
console.error("Embedding error:", err.message);
return worker(queue);
});
}
// Build queue of batches
var queue = [];
for (var i = 0; i < texts.length; i++) {
queue.push(texts[i]);
}
// Launch concurrent workers
var workers = [];
for (var w = 0; w < concurrency; w++) {
workers.push(worker(queue));
}
return Promise.all(workers).then(function() {
var elapsedSeconds = (Date.now() - startTime) / 1000;
return {
totalRequests: completed + errors,
successful: completed,
failed: errors,
elapsedSeconds: elapsedSeconds,
requestsPerSecond: completed / elapsedSeconds,
embeddingsPerSecond: completed / elapsedSeconds
};
});
}
module.exports.benchmarkThroughput = benchmarkThroughput;
Be mindful of rate limits when throughput testing. Most embedding APIs enforce requests-per-minute limits. The OpenAI embeddings API allows batching multiple texts in a single request — always batch when throughput matters.
Comparing Models Side-by-Side
The real value of benchmarking comes from comparison. Run the same dataset through multiple models and present the results in a way that makes the trade-offs obvious.
// model-comparison.js
var Table = require("cli-table3");
function compareModels(benchmarkResults) {
var table = new Table({
head: ["Model", "Recall@5", "Recall@10", "MRR", "nDCG@10", "p50 (ms)", "p95 (ms)", "Cost/1K"],
colWidths: [28, 12, 12, 10, 12, 12, 12, 12]
});
for (var i = 0; i < benchmarkResults.length; i++) {
var r = benchmarkResults[i];
table.push([
r.model,
r.recallAt5.toFixed(4),
r.recallAt10.toFixed(4),
r.mrr.toFixed(4),
r.ndcgAt10.toFixed(4),
r.latency.p50.toFixed(1),
r.latency.p95.toFixed(1),
"$" + r.costPer1K.toFixed(4)
]);
}
console.log("\n=== Embedding Model Comparison ===\n");
console.log(table.toString());
// Identify winner in each category
var categories = ["recallAt5", "recallAt10", "mrr", "ndcgAt10"];
for (var c = 0; c < categories.length; c++) {
var best = benchmarkResults[0];
for (var j = 1; j < benchmarkResults.length; j++) {
if (benchmarkResults[j][categories[c]] > best[categories[c]]) {
best = benchmarkResults[j];
}
}
console.log("Best " + categories[c] + ": " + best.model + " (" + best[categories[c]].toFixed(4) + ")");
}
}
module.exports = { compareModels: compareModels };
Statistical Significance Testing
When Model A scores 0.82 recall and Model B scores 0.80, is that a real difference or noise? You need a statistical test to find out. The paired t-test works well for comparing two models on the same query set.
// significance.js
function pairedTTest(scoresA, scoresB) {
if (scoresA.length !== scoresB.length) {
throw new Error("Score arrays must have equal length");
}
var n = scoresA.length;
var differences = [];
var sumDiff = 0;
for (var i = 0; i < n; i++) {
var diff = scoresA[i] - scoresB[i];
differences.push(diff);
sumDiff += diff;
}
var meanDiff = sumDiff / n;
var sumSquaredDev = 0;
for (var j = 0; j < n; j++) {
var dev = differences[j] - meanDiff;
sumSquaredDev += dev * dev;
}
var stdDev = Math.sqrt(sumSquaredDev / (n - 1));
var standardError = stdDev / Math.sqrt(n);
var tStatistic = meanDiff / standardError;
// Degrees of freedom
var df = n - 1;
// Approximate two-tailed p-value using t-distribution
// For production, use a stats library. This approximation works for df > 30.
var p = approximatePValue(Math.abs(tStatistic), df);
return {
tStatistic: tStatistic,
degreesOfFreedom: df,
pValue: p,
meanDifference: meanDiff,
significant: p < 0.05,
interpretation: p < 0.05
? "Difference is statistically significant (p=" + p.toFixed(4) + ")"
: "No significant difference (p=" + p.toFixed(4) + ")"
};
}
function approximatePValue(t, df) {
// Two-tailed p-value using the normal approximation to the t-distribution.
// Reasonable for df > 30; for smaller df it understates the p-value,
// so use mathjs or jstat in production for an exact t-distribution CDF.
return 2 * (1 - normalCDF(t));
}
function erf(x) {
var sign = x >= 0 ? 1 : -1;
x = Math.abs(x);
var a1 = 0.254829592;
var a2 = -0.284496736;
var a3 = 1.421413741;
var a4 = -1.453152027;
var a5 = 1.061405429;
var p = 0.3275911;
var t = 1.0 / (1.0 + p * x);
var y = 1.0 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * Math.exp(-x * x);
return sign * y;
}
function normalCDF(x) {
return 0.5 * (1 + erf(x / Math.sqrt(2)));
}
module.exports = { pairedTTest: pairedTTest };
A common mistake: running significance tests on too few queries. With fewer than 30 queries, the test has low statistical power and will fail to detect real differences. Aim for at least 50 labeled queries in your benchmark dataset if you want meaningful significance results.
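Using the test is straightforward once you keep the per-query scores around: pass one array per model, in the same query order. The values below are made-up placeholders purely to show the call shape.
// compare-significance.js
var significance = require("./significance");
// Per-query nDCG@10 for each model, in the same query order
var scoresSmall = [0.71, 0.83, 0.65, 0.90, 0.77]; // illustrative numbers only
var scoresLarge = [0.78, 0.85, 0.72, 0.88, 0.81];
var test = significance.pairedTTest(scoresLarge, scoresSmall);
console.log(test.interpretation);
// With only five queries the test has very low power, which is exactly
// why you want 50+ labeled queries before trusting the p-value.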
Benchmarking Index Performance
The index type (IVFFlat vs HNSW) and its configuration parameters dramatically affect search speed and recall. Benchmark these separately from the embedding model itself.
// index-benchmark.js
var pg = require("pg");
function benchmarkIndex(connectionString, tableName, queryVector, options) {
var pool = new pg.Pool({ connectionString: connectionString, max: 1 }); // single connection so session settings like ivfflat.probes apply to every query
var kValues = (options && options.kValues) || [5, 10, 20, 50];
var iterations = (options && options.iterations) || 100;
function runSearches(indexConfig) {
var results = {};
return pool.query("SET LOCAL ivfflat.probes = " + (indexConfig.probes || 10))
.then(function() {
var chain = Promise.resolve();
for (var ki = 0; ki < kValues.length; ki++) {
(function(k) {
chain = chain.then(function() {
var times = [];
var innerChain = Promise.resolve();
for (var iter = 0; iter < iterations; iter++) {
innerChain = innerChain.then(function() {
var start = process.hrtime.bigint();
return pool.query(
"SELECT id, embedding <=> $1::vector AS distance FROM " + tableName +
" ORDER BY embedding <=> $1::vector LIMIT $2",
[JSON.stringify(queryVector), k]
).then(function(res) {
var end = process.hrtime.bigint();
times.push(Number(end - start) / 1e6);
return res.rows;
});
});
}
return innerChain.then(function() {
times.sort(function(a, b) { return a - b; });
var sum = 0;
for (var t = 0; t < times.length; t++) sum += times[t];
results["k=" + k] = {
mean: sum / times.length,
p50: times[Math.floor(times.length * 0.5)],
p95: times[Math.floor(times.length * 0.95)],
p99: times[Math.floor(times.length * 0.99)]
};
});
});
})(kValues[ki]);
}
return chain.then(function() {
return results;
});
});
}
return {
runSearches: runSearches,
close: function() { return pool.end(); }
};
}
module.exports = { benchmarkIndex: benchmarkIndex };
For IVFFlat, the probes parameter controls the trade-off between speed and recall. Higher probes means better recall but slower searches. For HNSW, the ef_search parameter serves the same purpose. Always benchmark at multiple settings to find the sweet spot for your latency requirements.
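A convenient way to find that sweet spot is to sweep a handful of probes values through the runSearches helper above and compare the latency profiles. This sketch assumes queryVector is an embedding you generated earlier for a representative query; the file path and the probe values are placeholders.
// probes-sweep.js
var benchmarkIndex = require("./index-benchmark").benchmarkIndex;
// Placeholder: an embedding (array of floats) generated earlier for a representative query
var queryVector = require("./sample-query-embedding.json");
var bench = benchmarkIndex(
  process.env.POSTGRES_CONNECTION_STRING,
  "benchmark_embeddings",
  queryVector,
  { kValues: [10], iterations: 50 }
);
var probeSettings = [1, 5, 10, 20, 40];
var sweep = {};
var chain = Promise.resolve();
probeSettings.forEach(function(probes) {
  chain = chain.then(function() {
    return bench.runSearches({ probes: probes }).then(function(stats) {
      sweep["probes=" + probes] = stats["k=10"];
    });
  });
});
chain.then(function() {
  console.table(sweep); // higher probes: better recall, higher latency
  return bench.close();
});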
At scales below 100K vectors, the index type barely matters — brute force exact search is fast enough. At 1M+ vectors, HNSW typically offers better query latency than IVFFlat, but IVFFlat builds faster and uses less memory. Benchmark both at your actual scale.
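To put HNSW into the same comparison, build the index on your benchmark table and set hnsw.ef_search before running the searches. The index name and the m / ef_construction / ef_search values below are just reasonable starting points, not tuned recommendations.
// create-hnsw-index.js
var pg = require("pg");
// Cap the pool at one connection so the session setting below applies to
// every subsequent query (same caveat as with ivfflat.probes)
var pool = new pg.Pool({
  connectionString: process.env.POSTGRES_CONNECTION_STRING,
  max: 1
});
pool.query(
  "CREATE INDEX IF NOT EXISTS benchmark_embeddings_hnsw " +
  "ON benchmark_embeddings USING hnsw (embedding vector_cosine_ops) " +
  "WITH (m = 16, ef_construction = 64)"
).then(function() {
  // ef_search plays the same role as ivfflat.probes: higher means better recall, slower queries
  return pool.query("SET hnsw.ef_search = 100");
}).then(function() {
  return pool.end();
});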
Automated Benchmark Pipelines in CI
Embedding benchmarks should run automatically when you change your retrieval pipeline. Integrate them into CI so regressions are caught before they reach production.
// ci-benchmark.js
// Designed to run in CI with pass/fail thresholds
var fs = require("fs");
var path = require("path");
function runCIBenchmark(results, thresholds) {
var failures = [];
if (results.recallAt10 < thresholds.minRecallAt10) {
failures.push(
"Recall@10 " + results.recallAt10.toFixed(4) +
" below threshold " + thresholds.minRecallAt10
);
}
if (results.mrr < thresholds.minMRR) {
failures.push(
"MRR " + results.mrr.toFixed(4) +
" below threshold " + thresholds.minMRR
);
}
if (results.ndcgAt10 < thresholds.minNDCG) {
failures.push(
"nDCG@10 " + results.ndcgAt10.toFixed(4) +
" below threshold " + thresholds.minNDCG
);
}
if (results.latency && results.latency.p95 > thresholds.maxLatencyP95) {
failures.push(
"P95 latency " + results.latency.p95.toFixed(1) +
"ms exceeds threshold " + thresholds.maxLatencyP95 + "ms"
);
}
// Write results to file for CI artifact collection
var outputPath = path.join(process.cwd(), "benchmark-results.json");
fs.writeFileSync(outputPath, JSON.stringify({
timestamp: new Date().toISOString(),
results: results,
thresholds: thresholds,
passed: failures.length === 0,
failures: failures
}, null, 2));
if (failures.length > 0) {
console.error("BENCHMARK FAILED:");
for (var i = 0; i < failures.length; i++) {
console.error(" - " + failures[i]);
}
process.exit(1);
}
console.log("All benchmarks passed.");
return true;
}
module.exports = { runCIBenchmark: runCIBenchmark };
Define your thresholds based on your current production baseline. Start conservative — if your current recall@10 is 0.78, set the threshold at 0.75. Tighten thresholds as your pipeline matures.
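In practice the CI step is just a thresholds object plus the aggregate results from the benchmark run. The numbers and the results file path below are illustrative; plug in your own baseline.
// ci-gate.js
var runCIBenchmark = require("./ci-benchmark").runCIBenchmark;
// Placeholder path: wherever your benchmark run writes its aggregate results
// (an object with recallAt10, mrr, ndcgAt10, and latency.p95)
var results = require("./latest-benchmark-results.json");
var thresholds = {
  minRecallAt10: 0.75,   // current baseline 0.78 minus a small buffer
  minMRR: 0.60,
  minNDCG: 0.65,
  maxLatencyP95: 250     // milliseconds
};
runCIBenchmark(results, thresholds);
// Exits with code 1 when any threshold is violated, which fails the CI job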
Visualizing Benchmark Results
Raw numbers are hard to interpret across multiple runs. Write benchmark results to a structured format that you can feed into a dashboard or chart.
// benchmark-report.js
var fs = require("fs");
var path = require("path");
function generateReport(allResults, outputDir) {
if (!fs.existsSync(outputDir)) {
fs.mkdirSync(outputDir, { recursive: true });
}
// CSV for easy import into spreadsheets or charting tools
var csvHeader = "timestamp,model,recall_at_5,recall_at_10,mrr,ndcg_at_10,p50_ms,p95_ms,p99_ms,cost_per_1k\n";
var csvRows = "";
for (var i = 0; i < allResults.length; i++) {
var r = allResults[i];
csvRows += [
r.timestamp,
r.model,
r.recallAt5.toFixed(4),
r.recallAt10.toFixed(4),
r.mrr.toFixed(4),
r.ndcgAt10.toFixed(4),
r.latency.p50.toFixed(1),
r.latency.p95.toFixed(1),
r.latency.p99.toFixed(1),
r.costPer1K.toFixed(4)
].join(",") + "\n";
}
var csvPath = path.join(outputDir, "benchmark-history.csv");
// Append to existing history file
if (!fs.existsSync(csvPath)) {
fs.writeFileSync(csvPath, csvHeader + csvRows);
} else {
fs.appendFileSync(csvPath, csvRows);
}
// Generate markdown summary
var markdown = "# Embedding Benchmark Report\n\n";
markdown += "Generated: " + new Date().toISOString() + "\n\n";
markdown += "| Model | Recall@5 | Recall@10 | MRR | nDCG@10 | P50 (ms) | P95 (ms) |\n";
markdown += "|-------|----------|-----------|-----|---------|----------|----------|\n";
for (var j = 0; j < allResults.length; j++) {
var row = allResults[j];
markdown += "| " + [
row.model,
row.recallAt5.toFixed(4),
row.recallAt10.toFixed(4),
row.mrr.toFixed(4),
row.ndcgAt10.toFixed(4),
row.latency.p50.toFixed(1),
row.latency.p95.toFixed(1)
].join(" | ") + " |\n";
}
var mdPath = path.join(outputDir, "benchmark-report.md");
fs.writeFileSync(mdPath, markdown);
console.log("Report written to: " + outputDir);
return { csvPath: csvPath, markdownPath: mdPath };
}
module.exports = { generateReport: generateReport };
Tracking Performance Over Time
Store benchmark results with timestamps so you can detect degradation. A retrieval quality drop after a code change is easy to catch. A slow drift over weeks as your corpus grows is harder to notice without historical tracking.
// trend-analysis.js
var fs = require("fs");
function detectDegradation(historyFile, metricName, windowSize, thresholdPercent) {
var data = fs.readFileSync(historyFile, "utf8");
var lines = data.trim().split("\n");
var header = lines[0].split(",");
var metricIndex = header.indexOf(metricName);
if (metricIndex === -1) {
throw new Error("Metric '" + metricName + "' not found in history file");
}
var values = [];
for (var i = 1; i < lines.length; i++) {
var cols = lines[i].split(",");
values.push(parseFloat(cols[metricIndex]));
}
if (values.length < windowSize * 2) {
return { degraded: false, message: "Not enough data points for comparison" };
}
// Compare recent window to previous window
var recentStart = values.length - windowSize;
var previousStart = recentStart - windowSize;
var recentSum = 0;
var previousSum = 0;
for (var r = recentStart; r < values.length; r++) {
recentSum += values[r];
}
for (var p = previousStart; p < recentStart; p++) {
previousSum += values[p];
}
var recentAvg = recentSum / windowSize;
var previousAvg = previousSum / windowSize;
var changePercent = ((recentAvg - previousAvg) / previousAvg) * 100;
// For quality metrics, negative change is degradation
// For latency metrics, positive change is degradation
var degraded = false;
if (metricName.indexOf("latency") !== -1 || metricName.indexOf("ms") !== -1) {
degraded = changePercent > thresholdPercent;
} else {
degraded = changePercent < -thresholdPercent;
}
return {
degraded: degraded,
previousAvg: previousAvg,
recentAvg: recentAvg,
changePercent: changePercent,
message: degraded
? "WARNING: " + metricName + " degraded by " + Math.abs(changePercent).toFixed(1) + "%"
: metricName + " stable (change: " + changePercent.toFixed(1) + "%)"
};
}
module.exports = { detectDegradation: detectDegradation };
Set up alerts when degradation is detected. A 5% drop in recall@10 over a two-week window is a signal worth investigating. It might be a corpus composition shift, a model update on the provider side, or an unintended change in your chunking logic.
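Running the trend check against the CSV that generateReport appends to takes only a few lines at the end of each scheduled benchmark run. The report directory and the 5-run, 5-percent window below are placeholders.
// trend-check.js
var detectDegradation = require("./trend-analysis").detectDegradation;
// Compare the average of the last 5 runs against the 5 before them,
// and flag a drop of more than 5% in recall@10
var check = detectDegradation("./reports/benchmark-history.csv", "recall_at_10", 5, 5);
console.log(check.message);
if (check.degraded) {
  // Hook into whatever alerting you already have (Slack webhook, email, failing the job)
  process.exitCode = 1;
}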
Complete Working Example
This benchmark suite ties everything together. It evaluates one or more embedding models against a labeled dataset and produces a comprehensive comparison report.
// run-benchmark.js
// Complete embedding benchmark suite
var OpenAI = require("openai");
var pg = require("pg");
var fs = require("fs");
var Table = require("cli-table3");
// ============================================================
// Configuration
// ============================================================
var CONFIG = {
models: [
{ name: "text-embedding-3-small", dimensions: 1536, costPer1K: 0.00002 },
{ name: "text-embedding-3-large", dimensions: 3072, costPer1K: 0.00013 }
],
kValues: [1, 3, 5, 10, 20],
latencyIterations: 50,
warmupRuns: 5,
throughputConcurrency: 3,
connectionString: process.env.POSTGRES_CONNECTION_STRING
};
// ============================================================
// Metrics Implementation
// ============================================================
function recallAtK(retrievedIds, relevantIds, k) {
var topK = retrievedIds.slice(0, k);
var hits = 0;
for (var i = 0; i < relevantIds.length; i++) {
if (topK.indexOf(relevantIds[i]) !== -1) hits++;
}
return relevantIds.length > 0 ? hits / relevantIds.length : 0;
}
function precisionAtK(retrievedIds, relevantIds, k) {
var topK = retrievedIds.slice(0, k);
var hits = 0;
for (var i = 0; i < topK.length; i++) {
if (relevantIds.indexOf(topK[i]) !== -1) hits++;
}
return topK.length > 0 ? hits / topK.length : 0;
}
function reciprocalRank(retrievedIds, relevantIds) {
for (var i = 0; i < retrievedIds.length; i++) {
if (relevantIds.indexOf(retrievedIds[i]) !== -1) {
return 1 / (i + 1);
}
}
return 0;
}
function ndcgAtK(retrievedIds, relevanceScores, k) {
var topK = retrievedIds.slice(0, k);
var dcg = 0;
for (var i = 0; i < topK.length; i++) {
var rel = relevanceScores[topK[i]] || 0;
dcg += (Math.pow(2, rel) - 1) / Math.log2(i + 2);
}
var scores = [];
var keys = Object.keys(relevanceScores);
for (var j = 0; j < keys.length; j++) {
scores.push(relevanceScores[keys[j]]);
}
scores.sort(function(a, b) { return b - a; });
scores = scores.slice(0, k);
var idcg = 0;
for (var s = 0; s < scores.length; s++) {
idcg += (Math.pow(2, scores[s]) - 1) / Math.log2(s + 2);
}
return idcg > 0 ? dcg / idcg : 0;
}
// ============================================================
// Embedding and Search Functions
// ============================================================
function generateEmbeddings(client, model, texts) {
return client.embeddings.create({
model: model,
input: texts
}).then(function(response) {
return response.data.map(function(item) {
return item.embedding;
});
});
}
function vectorSearch(pool, tableName, queryVector, k) {
var sql = "SELECT id, embedding <=> $1::vector AS distance " +
"FROM " + tableName + " " +
"ORDER BY embedding <=> $1::vector LIMIT $2";
return pool.query(sql, [JSON.stringify(queryVector), k]).then(function(result) {
return result.rows.map(function(row) { return row.id; });
});
}
// ============================================================
// Benchmark Runner
// ============================================================
function runModelBenchmark(client, pool, modelConfig, dataset, tableName) {
var model = modelConfig.name;
var queryTexts = dataset.queries.map(function(q) { return q.text; });
var results = {};
var perQueryScores = {
recall5: [],
recall10: [],
mrr: [],
ndcg10: []
};
console.log("\nBenchmarking model: " + model);
console.log(" Generating query embeddings...");
var embeddingStart = Date.now();
return generateEmbeddings(client, model, queryTexts)
.then(function(queryEmbeddings) {
var embeddingTime = Date.now() - embeddingStart;
console.log(" Query embeddings generated in " + embeddingTime + "ms");
// Run vector search for each query
var chain = Promise.resolve();
var latencies = [];
for (var i = 0; i < dataset.queries.length; i++) {
(function(idx) {
chain = chain.then(function() {
var searchStart = process.hrtime.bigint();
var maxK = CONFIG.kValues[CONFIG.kValues.length - 1];
return vectorSearch(pool, tableName, queryEmbeddings[idx], maxK)
.then(function(retrievedIds) {
var searchEnd = process.hrtime.bigint();
latencies.push(Number(searchEnd - searchStart) / 1e6);
var query = dataset.queries[idx];
results[query.id] = retrievedIds;
// Compute per-query scores
perQueryScores.recall5.push(recallAtK(retrievedIds, query.relevant_doc_ids, 5));
perQueryScores.recall10.push(recallAtK(retrievedIds, query.relevant_doc_ids, 10));
perQueryScores.mrr.push(reciprocalRank(retrievedIds, query.relevant_doc_ids));
perQueryScores.ndcg10.push(ndcgAtK(retrievedIds, query.relevance_scores, 10));
});
});
})(i);
}
return chain.then(function() {
// Aggregate scores
function average(arr) {
var sum = 0;
for (var a = 0; a < arr.length; a++) sum += arr[a];
return arr.length > 0 ? sum / arr.length : 0;
}
// Latency stats
latencies.sort(function(a, b) { return a - b; });
var latSum = 0;
for (var l = 0; l < latencies.length; l++) latSum += latencies[l];
return {
model: model,
timestamp: new Date().toISOString(),
recallAt5: average(perQueryScores.recall5),
recallAt10: average(perQueryScores.recall10),
mrr: average(perQueryScores.mrr),
ndcgAt10: average(perQueryScores.ndcg10),
precisionAt5: (function() {
var total = 0;
for (var q = 0; q < dataset.queries.length; q++) {
var qr = dataset.queries[q];
total += precisionAtK(results[qr.id], qr.relevant_doc_ids, 5);
}
return total / dataset.queries.length;
})(),
latency: {
mean: latSum / latencies.length,
p50: latencies[Math.floor(latencies.length * 0.5)],
p95: latencies[Math.floor(latencies.length * 0.95)],
p99: latencies[Math.floor(latencies.length * 0.99)]
},
embeddingTimeMs: embeddingTime,
costPer1K: modelConfig.costPer1K,
perQueryScores: perQueryScores,
queryCount: dataset.queries.length
};
});
});
}
// ============================================================
// Report Generation
// ============================================================
function printReport(allResults) {
var table = new Table({
head: ["Model", "Recall@5", "Recall@10", "Prec@5", "MRR", "nDCG@10", "P50 ms", "P95 ms", "$/1K"],
colWidths: [26, 10, 11, 9, 8, 10, 9, 9, 10]
});
for (var i = 0; i < allResults.length; i++) {
var r = allResults[i];
table.push([
r.model,
r.recallAt5.toFixed(3),
r.recallAt10.toFixed(3),
r.precisionAt5.toFixed(3),
r.mrr.toFixed(3),
r.ndcgAt10.toFixed(3),
r.latency.p50.toFixed(1),
r.latency.p95.toFixed(1),
"$" + r.costPer1K.toFixed(5)
]);
}
console.log("\n========================================");
console.log(" EMBEDDING BENCHMARK RESULTS");
console.log("========================================\n");
console.log(table.toString());
// Winner summary
if (allResults.length > 1) {
console.log("\n--- Category Winners ---");
var metrics = [
{ key: "recallAt10", label: "Recall@10" },
{ key: "mrr", label: "MRR" },
{ key: "ndcgAt10", label: "nDCG@10" }
];
for (var m = 0; m < metrics.length; m++) {
var best = allResults[0];
for (var j = 1; j < allResults.length; j++) {
if (allResults[j][metrics[m].key] > best[metrics[m].key]) {
best = allResults[j];
}
}
console.log(" " + metrics[m].label + ": " + best.model +
" (" + best[metrics[m].key].toFixed(4) + ")");
}
// Fastest model
var fastest = allResults[0];
for (var f = 1; f < allResults.length; f++) {
if (allResults[f].latency.p50 < fastest.latency.p50) {
fastest = allResults[f];
}
}
console.log(" Lowest latency: " + fastest.model +
" (p50=" + fastest.latency.p50.toFixed(1) + "ms)");
}
}
// ============================================================
// Main Entry Point
// ============================================================
function main() {
var client = new OpenAI();
var pool = new pg.Pool({ connectionString: CONFIG.connectionString });
// Load benchmark dataset
var dataset = require("./benchmark-dataset");
var tableName = "benchmark_embeddings";
console.log("Embedding Performance Benchmark");
console.log("Dataset: " + dataset.name + " (" + dataset.queries.length + " queries)");
console.log("Models: " + CONFIG.models.map(function(m) { return m.name; }).join(", "));
var allResults = [];
var chain = Promise.resolve();
for (var i = 0; i < CONFIG.models.length; i++) {
(function(modelConfig) {
chain = chain.then(function() {
return runModelBenchmark(client, pool, modelConfig, dataset, tableName);
}).then(function(result) {
allResults.push(result);
});
})(CONFIG.models[i]);
}
chain
.then(function() {
printReport(allResults);
// Save full results
var outputPath = "./benchmark-results-" + Date.now() + ".json";
fs.writeFileSync(outputPath, JSON.stringify(allResults, null, 2));
console.log("\nFull results saved to: " + outputPath);
})
.catch(function(err) {
console.error("Benchmark failed:", err);
process.exit(1);
})
.then(function() {
return pool.end();
});
}
main();
Run the suite with:
OPENAI_API_KEY=sk-... POSTGRES_CONNECTION_STRING=postgresql://... node run-benchmark.js
The output looks like this:
========================================
EMBEDDING BENCHMARK RESULTS
========================================
| Model | Recall@5 | Recall@10 | Prec@5 | MRR | nDCG@10 | P50 ms | P95 ms | $/1K |
|--------------------------|----------|-----------|--------|-------|---------|--------|--------|----------|
| text-embedding-3-small | 0.724 | 0.856 | 0.412 | 0.681 | 0.739 | 2.3 | 4.8 | $0.00002 |
| text-embedding-3-large | 0.788 | 0.912 | 0.456 | 0.745 | 0.801 | 3.1 | 6.2 | $0.00013 |
--- Category Winners ---
Recall@10: text-embedding-3-large (0.9120)
MRR: text-embedding-3-large (0.7450)
nDCG@10: text-embedding-3-large (0.8010)
Lowest latency: text-embedding-3-small (p50=2.3ms)
Common Issues and Troubleshooting
1. Rate limit errors during throughput testing
Error: 429 Too Many Requests
{"error":{"message":"Rate limit reached for text-embedding-3-small","type":"rate_limit_error"}}
The OpenAI embeddings API has both RPM (requests per minute) and TPM (tokens per minute) limits. Reduce your concurrency level, add exponential backoff with jitter, or batch more texts per request. The API accepts up to 2048 text inputs in a single call — use that before increasing concurrency.
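A small retry wrapper with exponential backoff and jitter keeps throughput tests from dying on the first 429. This is a sketch: the retry count, base delay, and the way it detects a rate-limit error are assumptions to adjust for your client library.
// backoff.js
// Retry a promise-returning call on rate-limit errors with exponential backoff plus jitter
function withBackoff(fn, maxRetries) {
  maxRetries = maxRetries || 5;
  function attempt(n) {
    return fn().catch(function(err) {
      var isRateLimit = err && (err.status === 429 || /rate limit/i.test(err.message || ""));
      if (!isRateLimit || n >= maxRetries) throw err;
      var delayMs = Math.pow(2, n) * 1000 + Math.random() * 250; // 1s, 2s, 4s... plus jitter
      return new Promise(function(resolve) { setTimeout(resolve, delayMs); })
        .then(function() { return attempt(n + 1); });
    });
  }
  return attempt(0);
}
// Usage inside the throughput worker:
// return withBackoff(function() {
//   return client.embeddings.create({ model: model, input: batch });
// });
module.exports = { withBackoff: withBackoff };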
2. pgvector dimension mismatch when comparing models
ERROR: different vector dimensions 1536 and 3072
You cannot store embeddings from different models in the same column. Create separate tables or columns for each model you are benchmarking. When comparing models, each model needs its own indexed column with the correct dimension.
CREATE TABLE benchmark_small (id TEXT, embedding vector(1536));
CREATE TABLE benchmark_large (id TEXT, embedding vector(3072));
3. nDCG returning NaN for queries with no relevant documents
nDCG@10: NaN
If a query has an empty relevance_scores object, the IDCG is 0 and you get a division by zero. The ndcgAtK function above handles this by returning 0 when IDCG is 0, but check your dataset for queries that accidentally have no labeled relevant documents.
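A quick validation pass catches these queries before they silently drag the averages to zero:
// validate-dataset.js
// Flag queries that have no labeled relevant documents or empty relevance scores
function validateDataset(dataset) {
  var problems = [];
  for (var i = 0; i < dataset.queries.length; i++) {
    var q = dataset.queries[i];
    if (!q.relevant_doc_ids || q.relevant_doc_ids.length === 0) {
      problems.push(q.id + ": no relevant_doc_ids");
    }
    if (!q.relevance_scores || Object.keys(q.relevance_scores).length === 0) {
      problems.push(q.id + ": empty relevance_scores");
    }
  }
  return problems; // an empty array means the dataset is clean
}
console.log(validateDataset(require("./benchmark-dataset")));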
4. Benchmark results vary wildly between runs
Run 1 - Recall@10: 0.8200
Run 2 - Recall@10: 0.7600
Run 3 - Recall@10: 0.8400
This usually happens when your dataset is too small. With fewer than 30 queries, a single query that flips between "hit" and "miss" at the boundary causes large swings in the average. Increase your benchmark dataset size. Also check if you are using approximate search (HNSW or IVFFlat with low probes) — approximate methods are inherently non-deterministic. Use exact search for quality benchmarks and approximate search for latency benchmarks.
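One way to force exact (brute-force) search in pgvector for the quality run is to disable index scans for that query inside a transaction, which makes PostgreSQL fall back to a sequential scan. A sketch of that pattern; it is slower, but deterministic:
// exact-search.js
// Exact nearest-neighbor search: disable index scans for this transaction only
function exactVectorSearch(pool, tableName, queryVector, k) {
  return pool.connect().then(function(client) {
    return client.query("BEGIN")
      .then(function() { return client.query("SET LOCAL enable_indexscan = off"); })
      .then(function() {
        return client.query(
          "SELECT id FROM " + tableName +
          " ORDER BY embedding <=> $1::vector LIMIT $2",
          [JSON.stringify(queryVector), k]
        );
      })
      .then(function(res) {
        return client.query("COMMIT").then(function() {
          client.release();
          return res.rows.map(function(row) { return row.id; });
        });
      })
      .catch(function(err) {
        return client.query("ROLLBACK").then(function() {
          client.release();
          throw err;
        });
      });
  });
}
module.exports = { exactVectorSearch: exactVectorSearch };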
5. Memory issues when benchmarking large corpora
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
If you are loading all embeddings into memory for comparison, you will hit the V8 heap limit on large corpora. Stream results from PostgreSQL using cursors instead of loading everything at once. For the benchmark dataset itself, keep it under 1000 queries — you do not need to query your entire corpus to get statistically meaningful results.
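Cursor-based streaming with the pg-cursor package (a separate install from pg) looks roughly like this; the batch size and table name are placeholders.
// stream-embeddings.js
var Cursor = require("pg-cursor"); // npm install pg-cursor
// Stream rows from PostgreSQL in fixed-size batches instead of loading the whole table
function streamEmbeddings(pool, tableName, batchSize, onBatch) {
  return pool.connect().then(function(client) {
    var cursor = client.query(new Cursor("SELECT id, embedding FROM " + tableName));
    function readNext() {
      return new Promise(function(resolve, reject) {
        cursor.read(batchSize, function(err, rows) {
          if (err) return reject(err);
          if (rows.length === 0) {
            cursor.close(function() { client.release(); });
            return resolve();
          }
          Promise.resolve(onBatch(rows)).then(function() {
            resolve(readNext());
          }, reject);
        });
      });
    }
    return readNext();
  });
}
module.exports = { streamEmbeddings: streamEmbeddings };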
Best Practices
Start with a small, high-quality benchmark dataset. Fifty well-labeled query-document pairs are worth more than a thousand noisy ones. Invest time in labeling quality, not quantity.
Always measure on your actual domain data. Public benchmarks tell you about general capability. Your retrieval system needs to work on your specific vocabulary, document structures, and query patterns. A model that scores 0.90 on MTEB might score 0.65 on your legal documents.
Separate quality benchmarks from performance benchmarks. Use exact search (no index, or brute force scan) when measuring retrieval quality so that index approximation does not confuse your results. Use indexed search when measuring latency and throughput.
Run benchmarks deterministically. Set fixed random seeds, use the same dataset across runs, and control for external variables like network latency by running latency tests from the same machine. Store raw per-query results, not just aggregates, so you can investigate regressions at the query level.
Include cost in your comparison. A model that is 5% better on recall but 6x more expensive may not be the right choice. Calculate cost per query at your expected volume and include it in the comparison table. For many applications, the cheaper model with a reranker on top outperforms the expensive model alone.
Benchmark your full pipeline, not just embeddings. The embedding model is one piece of the system. Your chunking strategy, metadata filters, reranker, and post-processing all affect retrieval quality. Benchmark the end-to-end pipeline, then benchmark individual components to find bottlenecks.
Track benchmark results over time. Store results with timestamps in a CSV or database. Set up automated alerts when a metric degrades beyond a threshold. Slow drifts are harder to catch than sudden drops — trend analysis catches both.
Test at realistic scale. A model that works great on 10K documents may struggle at 1M. Index behavior changes with scale. Run your benchmark at the scale you expect in production, or at least at 2-3x your current scale to understand how the system will behave as you grow.
Use graded relevance when possible. Binary relevance (relevant or not) loses information. A document that perfectly answers the query should score higher than one that is tangentially related. nDCG with graded relevance captures this distinction and gives you a more nuanced view of retrieval quality.
References
- MTEB: Massive Text Embedding Benchmark - Public leaderboard for comparing embedding models on standard datasets.
- OpenAI Embeddings API Documentation - Official guide for OpenAI's embedding models, including pricing and rate limits.
- pgvector Documentation - PostgreSQL extension for vector similarity search, including IVFFlat and HNSW index types.
- Information Retrieval Metrics (Stanford NLP) - Academic reference for recall, precision, MRR, and nDCG metrics.
- Node.js Performance Hooks - Built-in Node.js APIs for high-resolution timing measurements.
- BEIR Benchmark - Heterogeneous benchmark for zero-shot evaluation of information retrieval models.