Comparing LLM Provider Pricing and Performance
Comprehensive comparison of LLM providers on pricing, latency, quality, and reliability with a Node.js benchmarking tool.
Overview
Choosing an LLM provider is one of the highest-leverage decisions you will make when building AI-powered applications. The difference between the cheapest and most expensive option can be 100x per request, and pricing alone tells you nothing about whether the model actually solves your problem. This article breaks down the real costs, latency characteristics, and quality tradeoffs across every major provider, and gives you a Node.js benchmarking harness to measure what matters for your specific workload.
Prerequisites
- Node.js 18+ installed
- API keys for at least two providers (Anthropic, OpenAI, or Google)
- Basic understanding of how LLM APIs work (sending prompts, receiving completions)
- Familiarity with async/await patterns in Node.js
- A working understanding of tokenization (you do not need to be an expert)
The Current LLM Provider Landscape
The market has matured significantly since the GPT-3 era. Here is where things stand as of early 2026.
Anthropic (Claude) offers three tiers: Opus (flagship reasoning), Sonnet (balanced), and Haiku (fast and cheap). Claude models consistently excel at code generation, nuanced instruction following, and long-context tasks. The API is clean, well-documented, and reliable.
OpenAI (GPT) remains the most widely adopted provider. GPT-4o is their current flagship multimodal model, and GPT-4o-mini is the cost-optimized variant. Their ecosystem includes fine-tuning, assistants API, and a mature batch processing pipeline.
Google (Gemini) competes aggressively on price, especially with Gemini 1.5 Flash. Their strength is massive context windows (up to 2M tokens on some models) and tight integration with Google Cloud. Quality has improved dramatically, but the API surface area is more complex.
Mistral offers strong European-hosted models with competitive pricing. Their Mixtral and Mistral Large models perform well on multilingual tasks and structured output. Their API is straightforward and compatible with the OpenAI client format.
Open-source models (Llama 3, Qwen 2.5, DeepSeek) can be self-hosted or run through providers like Together AI, Fireworks, or Groq. The economics are fundamentally different: you pay for compute, not tokens.
Pricing Models Explained
Every commercial LLM API charges per token, but the details vary significantly.
Input vs. Output Tokens
Every provider prices input (prompt) tokens and output (completion) tokens separately. Output tokens cost more because they are generated one at a time during decoding, while input tokens are processed in parallel during prefill. A typical ratio is 3:1 to 5:1 output-to-input pricing.
This distinction matters enormously. If your application sends long documents for summarization, your cost is dominated by input tokens. If you are generating long-form content, output tokens dominate.
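As a quick sanity check, here is the arithmetic for two request shapes, using illustrative Sonnet-class rates ($3/MTok input, $15/MTok output) from the tables below.
// request-cost.js -- how the input/output split shifts per-request cost
function requestCost(inputTokens, outputTokens, inputPrice, outputPrice) {
  return (inputTokens / 1e6) * inputPrice + (outputTokens / 1e6) * outputPrice;
}
// Summarization-shaped request: 20,000 tokens in, 300 tokens out
console.log(requestCost(20000, 300, 3.0, 15.0).toFixed(4)); // 0.0645 -- input-dominated
// Generation-shaped request: 500 tokens in, 4,000 tokens out
console.log(requestCost(500, 4000, 3.0, 15.0).toFixed(4)); // 0.0615 -- output-dominated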
Prompt Caching
Anthropic and OpenAI both offer prompt caching, which reduces the cost of repeated input tokens. If you send the same system prompt or reference documents across multiple requests, cached input tokens cost 50-90% less. This is a game-changer for applications with static context.
// Anthropic prompt caching example
var Anthropic = require("@anthropic-ai/sdk");
var client = new Anthropic();
var systemPrompt = "You are a code review assistant..."; // Long system prompt
// First request caches the system prompt
client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: systemPrompt,
      cache_control: { type: "ephemeral" }
    }
  ],
  messages: [{ role: "user", content: "Review this function..." }]
}).then(function (response) {
  // usage reports how many input tokens were written to or read from the cache
  console.log(response.usage);
});
// Subsequent requests with the same system prompt pay cached rates
// Input: $3/MTok -> Cached: $0.30/MTok (Sonnet pricing)
Batch Discounts
Both Anthropic and OpenAI offer batch APIs that process requests asynchronously at a 50% discount. The tradeoff is latency: batch requests complete within 24 hours, not seconds. This is perfect for offline processing, content generation pipelines, and evaluation runs.
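For Anthropic, batch submission is a single API call that wraps your normal message params; OpenAI's batch API works similarly but takes an uploaded JSONL file of requests. The sketch below assumes the Message Batches interface in the current @anthropic-ai/sdk, so verify field names against the batch API docs before relying on it.
// batch-submit.js -- minimal sketch, assuming the @anthropic-ai/sdk batches interface
var Anthropic = require("@anthropic-ai/sdk");
var client = new Anthropic();
client.messages.batches.create({
  requests: [
    {
      custom_id: "doc-001", // your own ID, used to match results later
      params: {
        model: "claude-sonnet-4-20250514",
        max_tokens: 1024,
        messages: [{ role: "user", content: "Summarize document 1..." }]
      }
    },
    {
      custom_id: "doc-002",
      params: {
        model: "claude-sonnet-4-20250514",
        max_tokens: 1024,
        messages: [{ role: "user", content: "Summarize document 2..." }]
      }
    }
  ]
}).then(function (batch) {
  // Poll for completion later; results are billed at the 50% batch rate
  console.log("Submitted batch", batch.id, "status:", batch.processing_status);
});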
Tiered and Volume Pricing
OpenAI and Google both offer tiered pricing at scale. Anthropic offers enterprise agreements with custom pricing. If you are spending more than $5,000/month with any provider, you should be talking to their sales team.
Detailed Cost Comparison
Here are the per-million-token prices as of early 2026. These change frequently, so always verify against official pricing pages.
Anthropic Claude Models
| Model | Input (per MTok) | Output (per MTok) | Cached Input | Context Window |
|---|---|---|---|---|
| Claude Opus 4 | $15.00 | $75.00 | $1.50 | 200K |
| Claude Sonnet 4 | $3.00 | $15.00 | $0.30 | 200K |
| Claude Haiku 3.5 | $0.80 | $4.00 | $0.08 | 200K |
OpenAI GPT Models
| Model | Input (per MTok) | Output (per MTok) | Cached Input | Context Window |
|---|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | $1.25 | 128K |
| GPT-4o-mini | $0.15 | $0.60 | $0.075 | 128K |
| o1 | $15.00 | $60.00 | $7.50 | 200K |
Google Gemini Models
| Model | Input (per MTok) | Output (per MTok) | Context Window |
|---|---|---|---|
| Gemini 1.5 Pro | $1.25 | $5.00 | 2M |
| Gemini 1.5 Flash | $0.075 | $0.30 | 1M |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1M |
Mistral Models
| Model | Input (per MTok) | Output (per MTok) | Context Window |
|---|---|---|---|
| Mistral Large | $2.00 | $6.00 | 128K |
| Mistral Small | $0.10 | $0.30 | 128K |
The raw numbers are revealing. For high-volume, latency-insensitive tasks, Gemini Flash and GPT-4o-mini are extraordinarily cheap. For tasks requiring the highest reasoning quality, you are choosing between Claude Opus and o1 at roughly similar price points.
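To make the tables concrete, here is the per-request cost for a representative workload (2,000 input tokens, 500 output tokens) computed from the list prices above.
// per-request-cost.js -- per-request cost at 2,000 input / 500 output tokens
var LIST_PRICES = {
  "Claude Opus 4": { input: 15.0, output: 75.0 },
  "Claude Sonnet 4": { input: 3.0, output: 15.0 },
  "Claude Haiku 3.5": { input: 0.80, output: 4.0 },
  "GPT-4o": { input: 2.5, output: 10.0 },
  "GPT-4o-mini": { input: 0.15, output: 0.60 },
  "Gemini 1.5 Flash": { input: 0.075, output: 0.30 }
};
Object.keys(LIST_PRICES).forEach(function (model) {
  var p = LIST_PRICES[model];
  var cost = (2000 / 1e6) * p.input + (500 / 1e6) * p.output;
  console.log(model + ": $" + cost.toFixed(6) + " per request");
});
// e.g. Claude Opus 4: $0.067500, GPT-4o-mini: $0.000600, Gemini 1.5 Flash: $0.000300
// -- a roughly 200x spread between the most and least expensive rows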
Latency Benchmarks
Latency has two components that matter: time to first token (TTFT) and tokens per second (TPS) once generation starts. TTFT determines perceived responsiveness in streaming applications. TPS determines total request duration.
Typical Latency Ranges
| Provider / Model | TTFT (median) | TPS (output) | Notes |
|---|---|---|---|
| Claude Haiku 3.5 | 200-400ms | 120-180 | Fastest in the Claude family |
| Claude Sonnet 4 | 400-800ms | 80-120 | Good balance |
| Claude Opus 4 | 800-1500ms | 40-70 | Slower but highest quality |
| GPT-4o | 300-600ms | 80-120 | Consistently fast |
| GPT-4o-mini | 150-350ms | 150-200 | Very fast |
| Gemini 1.5 Flash | 150-400ms | 150-250 | Fastest for large contexts |
| Gemini 1.5 Pro | 500-1200ms | 60-100 | Slower with large inputs |
| Mistral Small | 200-500ms | 100-160 | Competitive speed |
These numbers vary by time of day, request size, and server load. The benchmarking harness below lets you measure actual latency for your specific use case.
Key insight: TTFT increases with input size. A 100-token prompt might get a 200ms TTFT, but a 50,000-token prompt could see 1-2 seconds of TTFT on the same model. Always benchmark with realistic input sizes.
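Note that the harness below measures latency on non-streaming requests, which only approximates TTFT. If perceived responsiveness matters, measure TTFT with streaming enabled. Here is a minimal sketch that assumes the official openai npm package (npm install openai) and OPENAI_API_KEY in the environment; the same pattern applies to any provider that streams.
// ttft-stream.js -- minimal sketch of measuring true TTFT with streaming
var OpenAI = require("openai");
var client = new OpenAI();
async function measureTTFT(model, prompt) {
  var start = Date.now();
  var ttft = null;
  var chunks = 0;
  var stream = await client.chat.completions.create({
    model: model,
    messages: [{ role: "user", content: prompt }],
    stream: true
  });
  for await (var chunk of stream) {
    var delta = chunk.choices[0] && chunk.choices[0].delta;
    if (delta && delta.content) {
      if (ttft === null) ttft = Date.now() - start; // first generated token arrives
      chunks += 1; // roughly one content chunk per output token
    }
  }
  return { ttft: ttft, totalTime: Date.now() - start, chunks: chunks };
}
measureTTFT("gpt-4o-mini", "Explain DNS in one paragraph.").then(console.log);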
Quality Benchmarks for Common Tasks
Raw benchmarks like MMLU and HumanEval give you a rough ranking, but they do not tell you which model is best for your task. Here is what I have observed across production workloads.
Summarization
Claude Sonnet and GPT-4o are both excellent at summarization. For technical document summarization, Claude tends to preserve more nuance and structure. Gemini 1.5 Pro handles extremely long documents better than anything else because of its 2M context window. For cost-sensitive summarization at scale, GPT-4o-mini and Gemini Flash both produce acceptable results at a fraction of the cost.
Code Generation
Claude Opus and Sonnet lead on code generation tasks in my experience. They produce more idiomatic code, handle edge cases better, and follow complex multi-file instructions more reliably. GPT-4o is close behind. For simple code generation (boilerplate, CRUD, formatting), GPT-4o-mini is more than sufficient and 10-20x cheaper.
Classification
For text classification, you almost never need a frontier model. GPT-4o-mini, Claude Haiku, and Gemini Flash all achieve 90%+ accuracy on well-prompted classification tasks. The cost difference is massive: classifying 1 million short texts costs about $30 with GPT-4o-mini versus $500+ with GPT-4o.
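The arithmetic behind that claim, assuming roughly 200 input tokens and a one-token label per text:
// classification-cost.js -- rough cost of classifying 1M short texts
function classificationCost(texts, inputTokensPerText, inputPrice, outputPrice) {
  // The one-token label output is nearly negligible but included for completeness
  return (texts * inputTokensPerText / 1e6) * inputPrice +
         (texts * 1 / 1e6) * outputPrice;
}
console.log(classificationCost(1e6, 200, 0.15, 0.60).toFixed(2)); // 30.60 -- "about $30" with GPT-4o-mini
console.log(classificationCost(1e6, 200, 2.50, 10.00).toFixed(2)); // 510.00 -- "$500+" with GPT-4o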
Complex Reasoning
This is where the flagship models justify their price. For multi-step reasoning, mathematical proofs, legal analysis, or tasks requiring careful logical thinking, Claude Opus and o1 outperform everything else. Sonnet and GPT-4o handle moderate reasoning well but fail on harder problems.
Building a Benchmarking Harness in Node.js
Stop guessing and start measuring. Here is a benchmarking framework that tests multiple providers on identical prompts and produces comparable results.
// benchmark-harness.js
var https = require("https");
var PROVIDERS = {
anthropic: {
name: "Anthropic",
baseUrl: "api.anthropic.com",
path: "/v1/messages",
keyEnv: "ANTHROPIC_API_KEY",
models: ["claude-sonnet-4-20250514", "claude-3-5-haiku-20241022"]
},
openai: {
name: "OpenAI",
baseUrl: "api.openai.com",
path: "/v1/chat/completions",
keyEnv: "OPENAI_API_KEY",
models: ["gpt-4o", "gpt-4o-mini"]
},
google: {
name: "Google",
baseUrl: "generativelanguage.googleapis.com",
keyEnv: "GOOGLE_API_KEY",
models: ["gemini-1.5-flash", "gemini-1.5-pro"]
}
};
function buildAnthropicRequest(model, prompt, maxTokens) {
return {
model: model,
max_tokens: maxTokens || 1024,
messages: [{ role: "user", content: prompt }]
};
}
function buildOpenAIRequest(model, prompt, maxTokens) {
return {
model: model,
max_tokens: maxTokens || 1024,
messages: [{ role: "user", content: prompt }]
};
}
function buildGoogleRequest(model, prompt, maxTokens) {
return {
contents: [{ parts: [{ text: prompt }] }],
generationConfig: { maxOutputTokens: maxTokens || 1024 }
};
}
function makeRequest(provider, model, prompt, maxTokens) {
return new Promise(function (resolve, reject) {
var startTime = Date.now();
var firstTokenTime = null;
var body;
var options = {
hostname: provider.baseUrl,
method: "POST",
headers: { "Content-Type": "application/json" }
};
if (provider.name === "Anthropic") {
options.path = provider.path;
options.headers["x-api-key"] = process.env[provider.keyEnv];
options.headers["anthropic-version"] = "2023-06-01";
body = JSON.stringify(buildAnthropicRequest(model, prompt, maxTokens));
} else if (provider.name === "OpenAI") {
options.path = provider.path;
options.headers["Authorization"] = "Bearer " + process.env[provider.keyEnv];
body = JSON.stringify(buildOpenAIRequest(model, prompt, maxTokens));
} else if (provider.name === "Google") {
var key = process.env[provider.keyEnv];
options.path = "/v1beta/models/" + model + ":generateContent?key=" + key;
body = JSON.stringify(buildGoogleRequest(model, prompt, maxTokens));
}
var req = https.request(options, function (res) {
var data = "";
// Response headers have arrived. With non-streaming requests this is only a
// rough proxy for TTFT; true TTFT requires streaming (see the sketch earlier).
firstTokenTime = Date.now();
res.on("data", function (chunk) {
data += chunk;
});
res.on("end", function () {
var endTime = Date.now();
var statusCode = res.statusCode;
if (statusCode >= 400) {
resolve({
model: model,
provider: provider.name,
error: "HTTP " + statusCode + ": " + data.substring(0, 200),
ttft: firstTokenTime - startTime,
totalTime: endTime - startTime
});
return;
}
var parsed = JSON.parse(data);
var outputText = extractOutput(provider.name, parsed);
var usage = extractUsage(provider.name, parsed);
resolve({
model: model,
provider: provider.name,
ttft: firstTokenTime - startTime,
totalTime: endTime - startTime,
inputTokens: usage.input,
outputTokens: usage.output,
outputLength: outputText.length,
output: outputText
});
});
});
req.on("error", function (err) {
reject(err);
});
req.write(body);
req.end();
});
}
function extractOutput(providerName, parsed) {
if (providerName === "Anthropic") {
return parsed.content[0].text;
} else if (providerName === "OpenAI") {
return parsed.choices[0].message.content;
} else if (providerName === "Google") {
return parsed.candidates[0].content.parts[0].text;
}
return "";
}
function extractUsage(providerName, parsed) {
if (providerName === "Anthropic") {
return {
input: parsed.usage.input_tokens,
output: parsed.usage.output_tokens
};
} else if (providerName === "OpenAI") {
return {
input: parsed.usage.prompt_tokens,
output: parsed.usage.completion_tokens
};
} else if (providerName === "Google") {
var meta = parsed.usageMetadata || {};
return {
input: meta.promptTokenCount || 0,
output: meta.candidatesTokenCount || 0
};
}
return { input: 0, output: 0 };
}
Calculating True Cost Per Task
Per-token pricing is misleading if you look at it in isolation. What actually matters is cost per successful task completion. Here is how to calculate it.
// cost-calculator.js
var PRICING = {
  "claude-sonnet-4-20250514": { input: 3.0, output: 15.0 },
  "claude-3-5-haiku-20241022": { input: 0.80, output: 4.0 },
  "gpt-4o": { input: 2.5, output: 10.0 },
  "gpt-4o-mini": { input: 0.15, output: 0.60 },
  "gemini-1.5-flash": { input: 0.075, output: 0.30 },
  "gemini-1.5-pro": { input: 1.25, output: 5.0 }
};
function calculateCost(model, inputTokens, outputTokens) {
var pricing = PRICING[model];
if (!pricing) {
return { error: "Unknown model: " + model };
}
var inputCost = (inputTokens / 1000000) * pricing.input;
var outputCost = (outputTokens / 1000000) * pricing.output;
return {
model: model,
inputCost: inputCost,
outputCost: outputCost,
totalCost: inputCost + outputCost,
costBreakdown: {
inputPercent: Math.round((inputCost / (inputCost + outputCost)) * 100),
outputPercent: Math.round((outputCost / (inputCost + outputCost)) * 100)
}
};
}
function calculateCostPerTask(results) {
var totalCost = 0;
var successCount = 0;
var failCount = 0;
results.forEach(function (result) {
var cost = calculateCost(result.model, result.inputTokens, result.outputTokens);
totalCost += cost.totalCost;
if (result.success) {
successCount += 1;
} else {
failCount += 1;
}
});
var costPerAttempt = totalCost / results.length;
var costPerSuccess = successCount > 0 ? totalCost / successCount : Infinity;
return {
totalCost: totalCost,
totalAttempts: results.length,
successCount: successCount,
failCount: failCount,
successRate: (successCount / results.length * 100).toFixed(1) + "%",
costPerAttempt: costPerAttempt.toFixed(6),
costPerSuccess: costPerSuccess.toFixed(6)
};
}
This is where cheaper models can actually cost more. If GPT-4o-mini fails 30% of the time on a complex extraction task and requires retries, while Claude Sonnet succeeds 95% of the time, Sonnet may be cheaper per successful completion despite a higher per-token price.
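For example, feeding benchmark results into calculateCostPerTask makes the comparison explicit. The token counts and pass/fail flags below are hypothetical, standing in for a real run of the harness.
// Hypothetical results for one extraction task, three runs per model
var sonnetRuns = [
  { model: "claude-sonnet-4-20250514", inputTokens: 2000, outputTokens: 300, success: true },
  { model: "claude-sonnet-4-20250514", inputTokens: 2000, outputTokens: 310, success: true },
  { model: "claude-sonnet-4-20250514", inputTokens: 2000, outputTokens: 290, success: true }
];
var miniRuns = [
  { model: "gpt-4o-mini", inputTokens: 2000, outputTokens: 280, success: true },
  { model: "gpt-4o-mini", inputTokens: 2000, outputTokens: 500, success: false }, // malformed output
  { model: "gpt-4o-mini", inputTokens: 2000, outputTokens: 300, success: true }
];
console.log(calculateCostPerTask(sonnetRuns)); // costPerSuccess reflects the 100% pass rate
console.log(calculateCostPerTask(miniRuns));   // costPerSuccess absorbs the failed attempt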
Hidden Costs
There are several costs that do not show up in the per-token pricing tables.
Retries and Failed Requests
Rate limits, server errors, and malformed outputs all cost money. A 429 that is rejected before processing is usually not billed, but any retry of a request that did consume tokens (timeouts, truncated or malformed outputs) effectively doubles its cost. I have seen retry rates as high as 15% during peak hours on some providers.
// retry-aware-cost.js
function calculateRetryOverhead(baseRequestCost, retryRate, maxRetries) {
var expectedAttempts = 1;
var retryProb = retryRate;
for (var i = 1; i <= maxRetries; i++) {
expectedAttempts += retryProb;
retryProb *= retryRate;
}
return {
baseCost: baseRequestCost,
expectedAttempts: expectedAttempts.toFixed(2),
adjustedCost: (baseRequestCost * expectedAttempts).toFixed(6),
overheadPercent: ((expectedAttempts - 1) * 100).toFixed(1) + "%"
};
}
// Example: $0.01 per request, 10% retry rate, max 3 retries
var overhead = calculateRetryOverhead(0.01, 0.10, 3);
console.log(overhead);
// { baseCost: 0.01, expectedAttempts: '1.11', adjustedCost: '0.011110', overheadPercent: '11.1%' }
Context Window Overhead
Many applications pad requests with system prompts, few-shot examples, and RAG context. A 500-token user query might balloon to 5,000 tokens after adding context. That 10x multiplier on input tokens adds up fast.
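A quick way to see where the padding comes from is to break a request's input into components. The token counts below are illustrative, but the exercise usually reveals that the user query is a small fraction of what you pay for.
// prompt-overhead.js -- where the input tokens behind a "500-token query" actually go
var components = {
  systemPrompt: 800,
  fewShotExamples: 1200,
  ragContext: 2500,
  userQuery: 500
};
var total = Object.keys(components).reduce(function (sum, key) {
  return sum + components[key];
}, 0);
Object.keys(components).forEach(function (key) {
  var percent = (components[key] / total * 100).toFixed(1);
  console.log(key + ": " + components[key] + " tokens (" + percent + "%)");
});
console.log("Total input: " + total + " tokens -- " + (total / components.userQuery) + "x the raw query");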
Output Parsing Failures
When a model returns malformed JSON or does not follow your schema, you have to retry. These wasted tokens are pure cost. Structured output features (OpenAI's response_format, Anthropic's tool use) reduce this, but do not eliminate it entirely.
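Even with JSON mode enabled, a defensive parser on the client side is cheap insurance. This small helper strips common wrappers before parsing, so a stray markdown fence or a line of preamble does not force a paid retry.
// parse-model-json.js -- defensive parsing of model output that should be JSON
function parseModelJSON(raw) {
  var text = raw.trim();
  // Strip ```json ... ``` fences if present
  var fenceMatch = text.match(/```(?:json)?\s*([\s\S]*?)```/);
  if (fenceMatch) {
    text = fenceMatch[1].trim();
  }
  // Fall back to the first {...} span if there is leading or trailing prose
  var firstBrace = text.indexOf("{");
  var lastBrace = text.lastIndexOf("}");
  if (firstBrace > 0 && lastBrace > firstBrace) {
    text = text.substring(firstBrace, lastBrace + 1);
  }
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    return { ok: false, error: err.message, raw: raw };
  }
}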
Egress and Network Costs
If you are running in the cloud, the API response data incurs egress charges. For high-volume applications generating long outputs, this can add 5-10% to your effective cost.
When Open-Source Models Make Financial Sense
Open-source models make sense in three scenarios:
- Volume exceeds $10,000/month on a single task type that an open-source model handles well
- Data privacy requirements prevent sending data to third-party APIs
- Latency requirements demand sub-100ms inference that only local GPUs can achieve
For everything else, APIs are cheaper when you account for the total cost of ownership of running GPUs.
Break-Even Analysis for Self-Hosting
// break-even.js
function calculateBreakEven(config) {
var monthlyGPUCost = config.gpuCostPerHour * 24 * 30;
var monthlyInfraCost = monthlyGPUCost + config.overheadPerMonth;
// Estimate throughput: tokens per second * seconds per month
var monthlyTokenCapacity = config.tokensPerSecond * 86400 * 30;
var effectiveCapacity = monthlyTokenCapacity * config.utilizationRate;
// Equivalent API cost
var avgTokensPerRequest = config.avgInputTokens + config.avgOutputTokens;
var requestsPerMonth = effectiveCapacity / avgTokensPerRequest;
var apiCostPerRequest = (config.avgInputTokens / 1e6 * config.apiInputPrice) +
(config.avgOutputTokens / 1e6 * config.apiOutputPrice);
var monthlyAPICost = requestsPerMonth * apiCostPerRequest;
return {
monthlySelfHostCost: "$" + monthlyInfraCost.toFixed(2),
monthlyAPICost: "$" + monthlyAPICost.toFixed(2),
breakEvenRequests: Math.ceil(monthlyInfraCost / apiCostPerRequest),
recommendation: monthlyInfraCost < monthlyAPICost ? "Self-host" : "Use API"
};
}
// Example: Llama 3 70B on 2x A100 vs Claude Sonnet API
var analysis = calculateBreakEven({
gpuCostPerHour: 6.50, // 2x A100 80GB on AWS
overheadPerMonth: 500, // Networking, storage, ops time
tokensPerSecond: 40, // Llama 70B on 2x A100
utilizationRate: 0.60, // Realistic utilization
avgInputTokens: 2000,
avgOutputTokens: 500,
apiInputPrice: 3.0, // Claude Sonnet input
apiOutputPrice: 15.0 // Claude Sonnet output
});
console.log(analysis);
// At typical utilization, you need sustained high volume to justify self-hosting
The math almost never works out for small teams. A single A100 costs $2-3/hour. By the time you add redundancy, monitoring, model serving infrastructure, and ops overhead, you are looking at $6,000-10,000/month minimum. That buys a lot of API calls.
Provider Reliability and Uptime
Reliability is the hidden dimension that rarely appears in comparison articles. Here is what I have observed over the past year of production usage.
Anthropic maintains excellent uptime (99.9%+) but occasionally experiences degraded performance during peak hours. Rate limits are generous on paid tiers. Error messages are clear and actionable.
OpenAI has the most variable reliability of the major providers. They have improved significantly, but you will still see occasional 500 errors and rate limit spikes. Their status page is not always accurate in real-time.
Google is extremely reliable in terms of uptime but occasionally returns unexpected model behavior after silent updates. Their error handling and documentation lag behind Anthropic and OpenAI.
Mistral is reliable but has a smaller infrastructure footprint. Expect occasional latency spikes during European business hours.
Always build retry logic with exponential backoff, regardless of provider:
// resilient-client.js
function callWithRetry(fn, options) {
var maxRetries = options.maxRetries || 3;
var baseDelay = options.baseDelay || 1000;
var attempt = 0;
function tryCall() {
attempt += 1;
return fn().catch(function (err) {
if (attempt >= maxRetries) {
throw err;
}
var isRetryable = err.status === 429 || err.status === 500 ||
err.status === 503 || err.code === "ECONNRESET";
if (!isRetryable) {
throw err;
}
var delay = baseDelay * Math.pow(2, attempt - 1);
var jitter = Math.random() * delay * 0.1;
console.log(
"Retry " + attempt + "/" + maxRetries +
" after " + Math.round(delay + jitter) + "ms" +
" (error: " + (err.status || err.code) + ")"
);
return new Promise(function (resolve) {
setTimeout(function () {
resolve(tryCall());
}, delay + jitter);
});
});
}
return tryCall();
}
Negotiating Enterprise Pricing
If you are spending more than $5,000/month, reach out to provider sales teams. Here is what is typically negotiable:
- Volume discounts: 10-30% off list pricing at $10K+/month commitments
- Rate limit increases: Higher TPM (tokens per minute) and RPM (requests per minute)
- Prompt caching improvements: Extended cache TTLs or priority caching
- Committed use discounts: Annual commitments in exchange for lower rates
- Priority support: Dedicated Slack channels, faster incident response
Do not accept the first offer. Providers are competing aggressively for enterprise customers. Get quotes from at least two providers and use them as leverage.
Forecasting Costs as You Scale
The biggest mistake teams make is linear extrapolation. Your costs will not scale linearly because:
- Prompt optimization reduces token usage by 20-40% as you refine prompts
- Caching reduces effective input costs as repeat patterns emerge
- Model downgrades become possible once you have evaluation data showing where cheaper models suffice
- Batching lets you move non-urgent work to 50% discount batch APIs
// cost-forecast.js
function forecastMonthlyCost(config) {
var months = [];
var currentVolume = config.startingRequestsPerMonth;
for (var month = 1; month <= 12; month++) {
var promptOptimization = Math.max(0.6, 1 - (month * 0.03));
var cachingDiscount = Math.min(0.4, month * 0.05);
var batchPercent = Math.min(config.maxBatchPercent, month * 0.05);
var effectiveInputTokens = config.avgInputTokens * promptOptimization;
var cachedTokens = effectiveInputTokens * cachingDiscount;
var uncachedTokens = effectiveInputTokens - cachedTokens;
var realtimeRequests = currentVolume * (1 - batchPercent);
var batchRequests = currentVolume * batchPercent;
var inputCost = (uncachedTokens / 1e6 * config.inputPrice) +
(cachedTokens / 1e6 * config.cachedInputPrice);
var outputCost = config.avgOutputTokens / 1e6 * config.outputPrice;
var perRequestCost = inputCost + outputCost;
var realtimeCost = realtimeRequests * perRequestCost;
var batchCost = batchRequests * perRequestCost * 0.5;
var totalCost = realtimeCost + batchCost;
months.push({
month: month,
requests: currentVolume,
costPerRequest: "$" + perRequestCost.toFixed(6),
totalCost: "$" + totalCost.toFixed(2)
});
currentVolume = Math.round(currentVolume * config.monthlyGrowthRate);
}
return months;
}
var forecast = forecastMonthlyCost({
startingRequestsPerMonth: 100000,
monthlyGrowthRate: 1.20,
avgInputTokens: 3000,
avgOutputTokens: 500,
inputPrice: 3.0,
cachedInputPrice: 0.30,
outputPrice: 15.0,
maxBatchPercent: 0.30
});
forecast.forEach(function (m) {
console.log("Month " + m.month + ": " + m.requests + " requests, " + m.totalCost);
});
Complete Working Example
Here is a complete benchmarking tool that tests multiple providers on the same prompts, measures latency, estimates cost, and generates a comparison report.
// llm-benchmark.js
var https = require("https");
var fs = require("fs");
// ============================================================
// Configuration
// ============================================================
var PRICING = {
  "claude-sonnet-4-20250514": { input: 3.0, output: 15.0 },
  "claude-3-5-haiku-20241022": { input: 0.80, output: 4.0 },
  "gpt-4o": { input: 2.5, output: 10.0 },
  "gpt-4o-mini": { input: 0.15, output: 0.60 }
};
var TEST_PROMPTS = [
{
name: "summarization",
prompt: "Summarize the following in 3 bullet points: The TCP/IP model consists of four layers. The application layer handles high-level protocols like HTTP and SMTP. The transport layer manages end-to-end communication using TCP or UDP. The internet layer handles addressing and routing via IP. The link layer manages the physical network interface. Each layer encapsulates data from the layer above, adding its own headers for processing at the corresponding layer on the receiving end.",
maxTokens: 256,
evaluator: function (output) {
var hasBullets = (output.match(/[-*•]/g) || []).length >= 3;
var isReasonableLength = output.length > 100 && output.length < 1000;
return hasBullets && isReasonableLength;
}
},
{
name: "code_generation",
prompt: "Write a JavaScript function called 'deepMerge' that recursively merges two objects. If both values are objects, merge them recursively. If both values are arrays, concatenate them. Otherwise, the second value wins. Include JSDoc comments.",
maxTokens: 512,
evaluator: function (output) {
var hasFunction = output.indexOf("function") !== -1;
var hasMerge = output.indexOf("deepMerge") !== -1;
var hasRecursion = output.indexOf("deepMerge(") !== -1;
return hasFunction && hasMerge && hasRecursion;
}
},
{
name: "classification",
prompt: "Classify the following text into exactly one category (positive, negative, neutral). Respond with ONLY the category word, nothing else.\n\nText: \"The product works as expected but the shipping took longer than promised.\"",
maxTokens: 16,
evaluator: function (output) {
var normalized = output.trim().toLowerCase();
return normalized === "neutral" || normalized === "negative" || normalized === "positive";
}
},
{
name: "reasoning",
prompt: "A farmer has 17 sheep. All but 9 run away. How many sheep does the farmer have left? Think step by step, then give your final answer as just a number on the last line.",
maxTokens: 256,
evaluator: function (output) {
  // Check the final line, since the prompt asks for the answer as a bare number
  var lines = output.trim().split("\n");
  return lines[lines.length - 1].indexOf("9") !== -1;
}
}
];
// ============================================================
// Provider API Functions
// ============================================================
function callAnthropic(model, prompt, maxTokens) {
return new Promise(function (resolve, reject) {
var startTime = Date.now();
var body = JSON.stringify({
model: model,
max_tokens: maxTokens,
messages: [{ role: "user", content: prompt }]
});
var req = https.request({
hostname: "api.anthropic.com",
path: "/v1/messages",
method: "POST",
headers: {
"Content-Type": "application/json",
"x-api-key": process.env.ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01"
}
}, function (res) {
var data = "";
var ttft = Date.now() - startTime;
res.on("data", function (chunk) { data += chunk; });
res.on("end", function () {
var totalTime = Date.now() - startTime;
if (res.statusCode >= 400) {
resolve({ error: "HTTP " + res.statusCode, ttft: ttft, totalTime: totalTime });
return;
}
var parsed = JSON.parse(data);
resolve({
output: parsed.content[0].text,
inputTokens: parsed.usage.input_tokens,
outputTokens: parsed.usage.output_tokens,
ttft: ttft,
totalTime: totalTime
});
});
});
req.on("error", reject);
req.write(body);
req.end();
});
}
function callOpenAI(model, prompt, maxTokens) {
return new Promise(function (resolve, reject) {
var startTime = Date.now();
var body = JSON.stringify({
model: model,
max_tokens: maxTokens,
messages: [{ role: "user", content: prompt }]
});
var req = https.request({
hostname: "api.openai.com",
path: "/v1/chat/completions",
method: "POST",
headers: {
"Content-Type": "application/json",
"Authorization": "Bearer " + process.env.OPENAI_API_KEY
}
}, function (res) {
var data = "";
var ttft = Date.now() - startTime;
res.on("data", function (chunk) { data += chunk; });
res.on("end", function () {
var totalTime = Date.now() - startTime;
if (res.statusCode >= 400) {
resolve({ error: "HTTP " + res.statusCode, ttft: ttft, totalTime: totalTime });
return;
}
var parsed = JSON.parse(data);
resolve({
output: parsed.choices[0].message.content,
inputTokens: parsed.usage.prompt_tokens,
outputTokens: parsed.usage.completion_tokens,
ttft: ttft,
totalTime: totalTime
});
});
});
req.on("error", reject);
req.write(body);
req.end();
});
}
function callProvider(model, prompt, maxTokens) {
if (model.indexOf("claude") === 0) {
return callAnthropic(model, prompt, maxTokens);
} else if (model.indexOf("gpt") === 0) {
return callOpenAI(model, prompt, maxTokens);
}
return Promise.reject(new Error("Unknown model: " + model));
}
// ============================================================
// Benchmark Runner
// ============================================================
function runBenchmark(models, prompts, runsPerTest) {
var results = [];
var queue = [];
models.forEach(function (model) {
prompts.forEach(function (testCase) {
for (var run = 0; run < runsPerTest; run++) {
queue.push({ model: model, testCase: testCase, run: run });
}
});
});
var index = 0;
function processNext() {
if (index >= queue.length) {
return Promise.resolve(results);
}
var item = queue[index];
index += 1;
console.log(
"Running: " + item.model + " / " + item.testCase.name +
" (run " + (item.run + 1) + ")"
);
return callProvider(item.model, item.testCase.prompt, item.testCase.maxTokens)
.then(function (response) {
var success = false;
if (!response.error && response.output) {
success = item.testCase.evaluator(response.output);
}
var pricing = PRICING[item.model] || { input: 0, output: 0 };
var cost = 0;
if (response.inputTokens && response.outputTokens) {
cost = (response.inputTokens / 1e6 * pricing.input) +
(response.outputTokens / 1e6 * pricing.output);
}
results.push({
model: item.model,
task: item.testCase.name,
run: item.run + 1,
success: success,
ttft: response.ttft,
totalTime: response.totalTime,
inputTokens: response.inputTokens || 0,
outputTokens: response.outputTokens || 0,
cost: cost,
error: response.error || null
});
return processNext();
})
.catch(function (err) {
results.push({
model: item.model,
task: item.testCase.name,
run: item.run + 1,
success: false,
error: err.message
});
return processNext();
});
}
return processNext();
}
// ============================================================
// Report Generator
// ============================================================
function generateReport(results) {
var report = [];
report.push("=== LLM Provider Benchmark Report ===");
report.push("Date: " + new Date().toISOString());
report.push("Total tests: " + results.length);
report.push("");
// Group by model
var modelGroups = {};
results.forEach(function (r) {
if (!modelGroups[r.model]) {
modelGroups[r.model] = [];
}
modelGroups[r.model].push(r);
});
Object.keys(modelGroups).forEach(function (model) {
var modelResults = modelGroups[model];
var successCount = modelResults.filter(function (r) { return r.success; }).length;
var totalCost = modelResults.reduce(function (sum, r) { return sum + (r.cost || 0); }, 0);
var avgTTFT = modelResults.reduce(function (sum, r) { return sum + (r.ttft || 0); }, 0) / modelResults.length;
var avgTotal = modelResults.reduce(function (sum, r) { return sum + (r.totalTime || 0); }, 0) / modelResults.length;
report.push("--- " + model + " ---");
report.push(" Success rate: " + successCount + "/" + modelResults.length +
" (" + (successCount / modelResults.length * 100).toFixed(0) + "%)");
report.push(" Avg TTFT: " + avgTTFT.toFixed(0) + "ms");
report.push(" Avg total time: " + avgTotal.toFixed(0) + "ms");
report.push(" Total cost: $" + totalCost.toFixed(6));
report.push(" Cost per success: $" +
(successCount > 0 ? (totalCost / successCount).toFixed(6) : "N/A"));
report.push("");
// Per-task breakdown
var taskGroups = {};
modelResults.forEach(function (r) {
if (!taskGroups[r.task]) taskGroups[r.task] = [];
taskGroups[r.task].push(r);
});
Object.keys(taskGroups).forEach(function (task) {
var taskResults = taskGroups[task];
var taskSuccess = taskResults.filter(function (r) { return r.success; }).length;
report.push(" " + task + ": " + taskSuccess + "/" + taskResults.length + " passed");
});
report.push("");
});
return report.join("\n");
}
// ============================================================
// Main
// ============================================================
function main() {
var models = Object.keys(PRICING);
var runsPerTest = 3;
console.log("Starting LLM benchmark...");
console.log("Models: " + models.join(", "));
console.log("Tests per model: " + TEST_PROMPTS.length + " x " + runsPerTest + " runs");
console.log("");
runBenchmark(models, TEST_PROMPTS, runsPerTest).then(function (results) {
var report = generateReport(results);
console.log("\n" + report);
// Save raw results
fs.writeFileSync(
"benchmark-results.json",
JSON.stringify(results, null, 2)
);
// Save report
fs.writeFileSync("benchmark-report.txt", report);
console.log("Results saved to benchmark-results.json");
console.log("Report saved to benchmark-report.txt");
}).catch(function (err) {
console.error("Benchmark failed:", err);
process.exit(1);
});
}
main();
Run it with:
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
node llm-benchmark.js
Sample output:
Starting LLM benchmark...
Models: claude-sonnet-4-20250514, claude-3-5-haiku-20241022, gpt-4o, gpt-4o-mini
Tests per model: 4 x 3 runs
Running: claude-sonnet-4-20250514 / summarization (run 1)
Running: claude-sonnet-4-20250514 / summarization (run 2)
...
=== LLM Provider Benchmark Report ===
Date: 2026-02-11T15:30:00.000Z
Total tests: 48
--- claude-sonnet-4-20250514 ---
Success rate: 12/12 (100%)
Avg TTFT: 520ms
Avg total time: 1840ms
Total cost: $0.001260
Cost per success: $0.000105
--- claude-3-5-haiku-20241022 ---
Success rate: 11/12 (92%)
Avg TTFT: 280ms
Avg total time: 680ms
Total cost: $0.000384
Cost per success: $0.000035
--- gpt-4o ---
Success rate: 11/12 (92%)
Avg TTFT: 410ms
Avg total time: 1520ms
Total cost: $0.000980
Cost per success: $0.000089
--- gpt-4o-mini ---
Success rate: 10/12 (83%)
Avg TTFT: 210ms
Avg total time: 520ms
Total cost: $0.000048
Cost per success: $0.000005
Common Issues and Troubleshooting
1. Rate Limit Errors (429)
Error: Request failed with status 429: {"error":{"message":"Rate limit reached for gpt-4o in organization org-xxx on tokens per min (TPM): Limit 30000, Used 28500, Requested 2500."}}
Fix: Implement exponential backoff with jitter. Do not retry immediately. Space out requests by at least 1 second for benchmarking. For production workloads, use a token bucket rate limiter that tracks TPM and RPM separately.
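A minimal token-bucket sketch for the TPM side is below; you supply an estimate of tokens per request, and RPM can be handled the same way with a second bucket.
// tpm-limiter.js -- simple token bucket for tokens-per-minute throttling
function createTokenBucket(tokensPerMinute) {
  var capacity = tokensPerMinute;
  var available = tokensPerMinute;
  var lastRefill = Date.now();
  function refill() {
    var now = Date.now();
    var elapsedMinutes = (now - lastRefill) / 60000;
    available = Math.min(capacity, available + elapsedMinutes * tokensPerMinute);
    lastRefill = now;
  }
  return {
    // Resolves once `estimatedTokens` can be spent without exceeding the TPM limit
    acquire: function (estimatedTokens) {
      return new Promise(function (resolve) {
        function tryAcquire() {
          refill();
          if (available >= estimatedTokens) {
            available -= estimatedTokens;
            resolve();
          } else {
            setTimeout(tryAcquire, 250); // wait and re-check
          }
        }
        tryAcquire();
      });
    }
  };
}
// Usage: limit to 30,000 TPM, matching the limit in the error message above
var bucket = createTokenBucket(30000);
bucket.acquire(2500).then(function () {
  // safe to send a ~2,500-token request now
});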
2. Context Length Exceeded
Error: Request failed with status 400: {"error":{"type":"invalid_request_error","message":"prompt is too long: 204521 tokens > 200000 maximum"}}
Fix: Count tokens before sending. Use the provider's token counting tools (Anthropic's messages count_tokens endpoint, tiktoken for OpenAI) to validate input length. Truncate or chunk long inputs before they hit the API.
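If you do not want a tokenizer dependency in the hot path, a rough character-based estimate is enough for a guardrail; switch to the provider's tokenizer or token-counting endpoint mentioned above when you need exact counts.
// token-guardrail.js -- rough pre-flight length check (about 3-4 characters per
// token for English text; this is an estimate, not an exact count)
function estimateTokens(text) {
  return Math.ceil(text.length / 3.5);
}
function fitsContext(prompt, contextWindow, maxOutputTokens) {
  var estimated = estimateTokens(prompt);
  return {
    estimatedInputTokens: estimated,
    fits: estimated + maxOutputTokens < contextWindow * 0.95 // keep a safety margin
  };
}
var longDocument = "..."; // substitute the real prompt text
var check = fitsContext(longDocument, 200000, 1024);
if (!check.fits) {
  console.warn("Prompt too long (~" + check.estimatedInputTokens + " tokens); chunk it first");
}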
3. Timeout on Long Generations
Error: ESOCKETTIMEDOUT - socket hang up after 60000ms
Fix: Set appropriate timeouts based on expected generation length. A 4,000-token output at 80 TPS takes 50 seconds. Set your timeout to at least 2x the expected generation time. For the benchmarking harness, use 120-second timeouts.
var req = https.request(options);
req.setTimeout(120000, function () {
req.destroy(new Error("Request timed out after 120s"));
});
4. JSON Parsing Failures from Model Output
SyntaxError: Unexpected token 'H' at position 0 - JSON.parse("Here is the JSON you requested:\n{\"key\": \"value\"}")
Fix: When requesting structured output, use the provider's native JSON mode. For Anthropic, use tool_use with a defined schema. For OpenAI, set response_format: { type: "json_object" }. Never rely on the model to produce valid JSON without these guardrails in production.
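With OpenAI's chat completions, for example, the request body change is a single field (OpenAI requires the word JSON to appear somewhere in the prompt when this mode is enabled); Anthropic achieves the same effect by defining a tool whose input schema is your desired output shape.
// json-mode.js -- requesting structured output from OpenAI chat completions
var body = {
  model: "gpt-4o-mini",
  max_tokens: 256,
  response_format: { type: "json_object" }, // constrains output to valid JSON
  messages: [{
    role: "user",
    content: "Extract the sender name and email from the text below. " +
             "Respond as JSON with keys \"name\" and \"email\".\n\n" +
             "Text: ..."
  }]
};
// Send `body` with the same https request pattern used in callOpenAI above.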
5. Inconsistent Token Counts Across Providers
Same prompt: Anthropic reports 145 input tokens, OpenAI reports 152 input tokens
This is expected. Different providers use different tokenizers (Anthropic uses its own, OpenAI uses tiktoken encodings such as cl100k_base and o200k_base, Google uses SentencePiece). Token counts for identical text will differ by 5-15%. Your cost calculations must use each provider's reported token counts, not a shared estimate.
Best Practices
Benchmark with your actual prompts, not synthetic ones. Generic benchmarks tell you nothing about how a model performs on your specific task. The 10 minutes spent running your real prompts through multiple providers will save you thousands of dollars in wrong decisions.
Always calculate cost per successful task, not cost per token. A model that is 5x cheaper per token but fails 40% of the time is not cheaper. Track success rates alongside cost in every evaluation.
Use the cheapest model that meets your quality threshold. Start with the cheapest option (Haiku, GPT-4o-mini, Gemini Flash) and only upgrade when you have empirical evidence that quality is insufficient. Most classification, extraction, and simple generation tasks do not need frontier models.
Implement prompt caching from day one. If you have any static content in your prompts (system instructions, few-shot examples, reference documents), use prompt caching. The savings are typically 30-60% on input costs.
Build provider abstraction early. Do not hardcode to a single provider's API format. Use a thin abstraction layer that lets you swap models with a configuration change. Prices drop and quality improves constantly; you need to be able to move quickly.
Monitor costs in real-time, not monthly. Set up daily cost alerts and per-request cost logging from the start. A prompt engineering bug that doubles your token usage will cost you thousands before the monthly bill arrives.
Move non-latency-sensitive work to batch APIs. Content moderation, data enrichment, document processing, and evaluation runs can all use batch endpoints at 50% off. The 24-hour SLA is fine for offline work.
Re-evaluate quarterly. The LLM market moves fast. Models that were state-of-the-art six months ago may be outperformed by models at one-tenth the cost. Schedule quarterly benchmark runs with your production prompts.
Negotiate pricing when you cross $5K/month. Every major provider offers volume discounts. Get quotes from multiple providers and use them as leverage. A 20% discount on a $10K/month bill saves $24K annually.