LLM Provider Pricing in 2026: What It Actually Costs Per Task
LLM pricing tables lie. Here's a Node.js benchmarking harness to measure what actually matters: cost per successful task.
The cheapest model per token is often the most expensive model per result.
I have watched teams pick GPT-4o-mini for a complex extraction pipeline because the pricing table looked good, then burn through 3x the budget on retries when the model failed 40% of the time. The model that costs $0.15 per million input tokens does not save you money if it cannot do the job.
This article gives you everything you need to make a real decision: current pricing across every major provider (verified March 2026), latency benchmarks, quality tradeoffs by task type, and a complete Node.js benchmarking harness you can run against your own prompts today. Stop guessing. Start measuring.
Prerequisites
- Node.js 18+ installed
- API keys for at least two providers (Anthropic, OpenAI, or Google)
- Basic understanding of how LLM APIs work (sending prompts, receiving completions)
- Familiarity with async/await patterns in Node.js
- A working understanding of tokenization (you do not need to be an expert)
The Provider Landscape Has Shifted Dramatically
If you last compared LLM pricing six months ago, your numbers are wrong. The market moves fast, and the March 2026 landscape looks nothing like late 2025.
Anthropic (Claude) now offers three current-generation tiers: Opus 4.6 (flagship reasoning), Sonnet 4.6 (balanced workhorse), and Haiku 4.5 (fast and cheap). The 4.6 generation brought a massive price drop: Opus 4.6 runs at $5/$25 per million tokens, down from $15/$75 on Opus 4.1. Both 4.6 models include a 1M-token context window at standard pricing. Claude consistently leads on code generation, nuanced instruction following, and agentic workflows.
OpenAI (GPT) has expanded well beyond GPT-4o. GPT-4.1 replaced GPT-4o as the recommended production model with better performance at lower cost ($2/$8 vs $2.50/$10) and a 1M context window. GPT-5 is the new flagship at $1.25/$10. The o-series reasoning models (o3, o4-mini) handle complex multi-step problems. GPT-4o and GPT-4o-mini remain available and widely used.
Google (Gemini) competes aggressively across the board. The Gemini 2.5 family (Pro at $1.25/$10, Flash at $0.15/$0.60, Flash-Lite at $0.10/$0.40) offers exceptional value. Gemini 3.1 Pro is their new premium tier at $2/$12. Every Gemini model supports a 1M-token context window, and the free tier is the most generous in the industry.
Mistral offers strong European-hosted models. Mistral Large handles multilingual tasks and structured output well. Their API is straightforward and compatible with the OpenAI client format.
Open-source models (Llama 3, Qwen 2.5, DeepSeek V3) can be self-hosted or run through providers like Together AI, Fireworks, or Groq. The economics are different: you pay for compute, not tokens.
How LLM Pricing Actually Works
Every commercial LLM API charges per token, but the details determine whether you are paying $30 or $3,000 per month for the same workload.
Input vs. Output Tokens
All providers charge differently for input (prompt) tokens versus output (completion) tokens. Output tokens always cost more because they require sequential generation. Typical ratios run 3:1 to 5:1 output-to-input pricing.
This distinction matters enormously. If your application sends long documents for summarization, input tokens dominate your cost. If you generate long-form content, output tokens dominate. Know your ratio before choosing a model.
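As a quick illustration, here is a tiny sketch comparing the blended per-request cost of two workload shapes at Claude Sonnet 4.6 list prices ($3 input / $15 output per MTok). The token counts are hypothetical; the point is how the input/output ratio decides which price actually matters.
// blended-cost.js -- illustrative only; plug in your own token counts and prices
function blendedCost(inputTokens, outputTokens, inputPricePerMTok, outputPricePerMTok) {
  return (inputTokens / 1e6) * inputPricePerMTok + (outputTokens / 1e6) * outputPricePerMTok;
}
// Summarization-heavy request: 20,000 tokens in, 500 out -> input dominates the bill
console.log(blendedCost(20000, 500, 3.0, 15.0).toFixed(4)); // "0.0675"
// Generation-heavy request: 500 tokens in, 4,000 out -> output dominates the bill
console.log(blendedCost(500, 4000, 3.0, 15.0).toFixed(4)); // "0.0615"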
Prompt Caching Cuts Input Costs by 90%
Anthropic, OpenAI, and Google all offer prompt caching. If you send the same system prompt or reference documents across multiple requests, cached input tokens cost 50-90% less. For applications with static context, this is the single biggest cost lever available.
// Anthropic prompt caching example
var Anthropic = require("@anthropic-ai/sdk");
var client = new Anthropic();
var systemPrompt = "You are a code review assistant..."; // Long system prompt (caching requires a minimum prompt length, roughly 1K+ tokens)
// First request writes the system prompt to the cache
client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: systemPrompt,
      cache_control: { type: "ephemeral" }
    }
  ],
  messages: [{ role: "user", content: "Review this function..." }]
}).then(function (response) {
  // usage.cache_creation_input_tokens and usage.cache_read_input_tokens show what was cached
  console.log(response.usage);
});
// Subsequent requests with the same system prompt pay cached rates
// Sonnet 4.6: Input $3/MTok -> Cached: $0.30/MTok (90% savings)
Batch APIs Save 50% on Async Work
Both Anthropic and OpenAI offer batch APIs that process requests asynchronously at a 50% discount. The tradeoff is latency: batch requests complete within 24 hours, not seconds. This is the right choice for content generation pipelines, data enrichment, evaluation runs, and anything that does not need a real-time response.
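As a minimal sketch of what submitting async work looks like with Anthropic's Message Batches API (method names follow the current @anthropic-ai/sdk; batch support has at times lived under a beta namespace, so verify against your SDK version):
// batch-submit.js -- sketch only
var Anthropic = require("@anthropic-ai/sdk");
var client = new Anthropic();
client.messages.batches.create({
  requests: [
    {
      custom_id: "doc-001",
      params: {
        model: "claude-haiku-4-20250514",
        max_tokens: 512,
        messages: [{ role: "user", content: "Summarize document 001..." }]
      }
    }
    // ...one entry per document; results are billed at 50% of list price
  ]
}).then(function (batch) {
  // Poll the batch status later and download results once processing ends
  console.log("Batch submitted:", batch.id, batch.processing_status);
});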
Volume Pricing
If you are spending more than $5,000/month with any provider, you should be talking to their sales team. OpenAI and Google both offer tiered pricing at scale. Anthropic offers enterprise agreements with custom pricing. Do not leave money on the table.
Current Pricing Tables (Verified March 2026)
These change frequently. Always verify against official pricing pages before making purchasing decisions.
Anthropic Claude Models
| Model | Input (per MTok) | Output (per MTok) | Cached Input | Context Window |
|---|---|---|---|---|
| Claude Opus 4.6 | $5.00 | $25.00 | $0.50 | 200K (1M standard) |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 | 200K (1M standard) |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 | 200K |
Note: Both 4.6 models include 1M context at standard pricing. Legacy Opus 4.1 ($15/$75) is still available but costs 3x more than Opus 4.6 for comparable quality.
OpenAI GPT Models
| Model | Input (per MTok) | Output (per MTok) | Cached Input | Context Window |
|---|---|---|---|---|
| GPT-5 | $1.25 | $10.00 | $0.625 | 128K |
| GPT-4.1 | $2.00 | $8.00 | $0.50 | 1M |
| GPT-4o | $2.50 | $10.00 | $1.25 | 128K |
| GPT-4o-mini | $0.15 | $0.60 | $0.075 | 128K |
| GPT-4.1-mini | $0.40 | $1.60 | $0.10 | 1M |
| o4-mini | $1.10 | $4.40 | $0.275 | 200K |
Google Gemini Models
| Model | Input (per MTok) | Output (per MTok) | Context Window |
|---|---|---|---|
| Gemini 3.1 Pro | $2.00 | $12.00 | 1M |
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M |
| Gemini 2.5 Flash | $0.15 | $0.60 | 1M |
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | 1M |
Note: Gemini Pro models have tiered pricing. Prompts exceeding 200K tokens are charged at 2x the standard rate. All Gemini models support 1M context windows.
Mistral Models
| Model | Input (per MTok) | Output (per MTok) | Context Window |
|---|---|---|---|
| Mistral Large | $2.00 | $6.00 | 128K |
| Mistral Small | $0.10 | $0.30 | 128K |
The raw numbers tell a clear story. For high-volume, latency-insensitive tasks, Gemini 2.5 Flash-Lite and GPT-4o-mini are extraordinarily cheap. For flagship reasoning, Claude Opus 4.6 now costs 67% less than it did six months ago, putting it in striking distance of GPT-5 and Gemini 3.1 Pro. The competitive pressure has been relentless, and the beneficiary is anyone building on these APIs.
Latency Benchmarks
Latency has two components: time to first token (TTFT) and tokens per second (TPS) once generation starts. TTFT determines perceived responsiveness in streaming applications. TPS determines total request duration.
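The harness later in this article approximates TTFT with response-header arrival, which is fine for relative comparisons. To measure true TTFT, stream the response and timestamp the first chunk. Here is a minimal sketch against OpenAI's chat completions endpoint (assumes OPENAI_API_KEY is set):
// streaming-ttft.js -- measure TTFT by timestamping the first streamed chunk
var https = require("https");
function measureStreamingTTFT(model, prompt) {
  return new Promise(function (resolve, reject) {
    var startTime = Date.now();
    var firstChunkTime = null;
    var body = JSON.stringify({
      model: model,
      stream: true,
      max_tokens: 256,
      messages: [{ role: "user", content: prompt }]
    });
    var req = https.request({
      hostname: "api.openai.com",
      path: "/v1/chat/completions",
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + process.env.OPENAI_API_KEY
      }
    }, function (res) {
      res.on("data", function () {
        // The first server-sent-events chunk carries the first generated token(s)
        if (firstChunkTime === null) {
          firstChunkTime = Date.now();
        }
      });
      res.on("end", function () {
        resolve({ ttftMs: firstChunkTime - startTime, totalMs: Date.now() - startTime });
      });
    });
    req.on("error", reject);
    req.write(body);
    req.end();
  });
}
// measureStreamingTTFT("gpt-4o-mini", "Write a haiku about latency.").then(console.log);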
Typical Latency Ranges
| Provider / Model | TTFT (median) | TPS (output) | Notes |
|---|---|---|---|
| Claude Haiku 4.5 | 200-400ms | 120-180 | Fastest in the Claude family |
| Claude Sonnet 4.6 | 400-800ms | 80-120 | Good balance |
| Claude Opus 4.6 | 800-1500ms | 40-70 | Slower but highest quality |
| GPT-4o | 300-600ms | 80-120 | Consistently fast |
| GPT-4o-mini | 150-350ms | 150-200 | Very fast |
| GPT-4.1 | 300-600ms | 80-130 | Similar to GPT-4o |
| Gemini 2.5 Flash | 150-400ms | 150-250 | Fastest for large contexts |
| Gemini 2.5 Pro | 500-1200ms | 60-100 | Slower with large inputs |
| Mistral Small | 200-500ms | 100-160 | Competitive speed |
These numbers vary by time of day, request size, and server load. The benchmarking harness below lets you measure actual latency for your specific use case.
Key insight: TTFT increases with input size. A 100-token prompt might get a 200ms TTFT, but a 50,000-token prompt could see 1-2 seconds of TTFT on the same model. Always benchmark with realistic input sizes.
Which Model Wins for Your Task
Raw benchmarks like MMLU and HumanEval give you a rough ranking. They do not tell you which model is best for your specific workload. Here is what I have observed across production systems.
Summarization
Claude Sonnet 4.6 and GPT-4o are both excellent. For technical document summarization, Claude tends to preserve more nuance and structure. Gemini 2.5 Pro handles extremely long documents better than anything else because of its 1M context window at standard pricing. For cost-sensitive summarization at scale, GPT-4o-mini and Gemini 2.5 Flash both produce acceptable results at a fraction of the cost.
Code Generation
Claude Opus 4.6 and Sonnet 4.6 lead on code generation in my experience. They produce more idiomatic code, handle edge cases better, and follow complex multi-file instructions more reliably. GPT-4.1 is competitive and comes with a 1M context window that is useful for large codebases. For simple code generation (boilerplate, CRUD, formatting), GPT-4o-mini is more than sufficient and dramatically cheaper.
Classification
For text classification, you almost never need a frontier model. GPT-4o-mini, Claude Haiku 4.5, and Gemini 2.5 Flash all achieve 90%+ accuracy on well-prompted classification tasks. The cost difference is staggering: assuming roughly 200 input tokens and a one-word label per text, classifying 1 million short texts costs roughly $30 with GPT-4o-mini versus $500+ with GPT-4o.
Complex Reasoning
This is where flagship models justify their price. For multi-step reasoning, mathematical proofs, legal analysis, or tasks requiring careful logical thinking, Claude Opus 4.6, GPT-5, and the o-series models outperform everything else. Sonnet and GPT-4o handle moderate reasoning well but fail on harder problems. The o4-mini model offers a strong middle ground for reasoning tasks at $1.10/$4.40.
Building a Benchmarking Harness in Node.js
Stop guessing and start measuring. Here is a benchmarking framework that tests multiple providers on identical prompts and produces comparable results.
// benchmark-harness.js
var https = require("https");
var PROVIDERS = {
anthropic: {
name: "Anthropic",
baseUrl: "api.anthropic.com",
path: "/v1/messages",
keyEnv: "ANTHROPIC_API_KEY",
models: ["claude-sonnet-4-20250514", "claude-haiku-4-20250514"]
},
openai: {
name: "OpenAI",
baseUrl: "api.openai.com",
path: "/v1/chat/completions",
keyEnv: "OPENAI_API_KEY",
models: ["gpt-4o", "gpt-4o-mini"]
},
google: {
name: "Google",
baseUrl: "generativelanguage.googleapis.com",
keyEnv: "GOOGLE_API_KEY",
models: ["gemini-1.5-flash", "gemini-1.5-pro"]
}
};
function buildAnthropicRequest(model, prompt, maxTokens) {
return {
model: model,
max_tokens: maxTokens || 1024,
messages: [{ role: "user", content: prompt }]
};
}
function buildOpenAIRequest(model, prompt, maxTokens) {
return {
model: model,
max_tokens: maxTokens || 1024,
messages: [{ role: "user", content: prompt }]
};
}
function buildGoogleRequest(model, prompt, maxTokens) {
return {
contents: [{ parts: [{ text: prompt }] }],
generationConfig: { maxOutputTokens: maxTokens || 1024 }
};
}
function makeRequest(provider, model, prompt, maxTokens) {
return new Promise(function (resolve, reject) {
var startTime = Date.now();
var firstTokenTime = null;
var body;
var options = {
hostname: provider.baseUrl,
method: "POST",
headers: { "Content-Type": "application/json" }
};
if (provider.name === "Anthropic") {
options.path = provider.path;
options.headers["x-api-key"] = process.env[provider.keyEnv];
options.headers["anthropic-version"] = "2023-06-01";
body = JSON.stringify(buildAnthropicRequest(model, prompt, maxTokens));
} else if (provider.name === "OpenAI") {
options.path = provider.path;
options.headers["Authorization"] = "Bearer " + process.env[provider.keyEnv];
body = JSON.stringify(buildOpenAIRequest(model, prompt, maxTokens));
} else if (provider.name === "Google") {
var key = process.env[provider.keyEnv];
options.path = "/v1beta/models/" + model + ":generateContent?key=" + key;
body = JSON.stringify(buildGoogleRequest(model, prompt, maxTokens));
}
var req = https.request(options, function (res) {
var data = "";
firstTokenTime = Date.now();
res.on("data", function (chunk) {
data += chunk;
});
res.on("end", function () {
var endTime = Date.now();
var statusCode = res.statusCode;
if (statusCode >= 400) {
resolve({
model: model,
provider: provider.name,
error: "HTTP " + statusCode + ": " + data.substring(0, 200),
ttft: firstTokenTime - startTime,
totalTime: endTime - startTime
});
return;
}
var parsed = JSON.parse(data);
var outputText = extractOutput(provider.name, parsed);
var usage = extractUsage(provider.name, parsed);
resolve({
model: model,
provider: provider.name,
ttft: firstTokenTime - startTime,
totalTime: endTime - startTime,
inputTokens: usage.input,
outputTokens: usage.output,
outputLength: outputText.length,
output: outputText
});
});
});
req.on("error", function (err) {
reject(err);
});
req.write(body);
req.end();
});
}
function extractOutput(providerName, parsed) {
if (providerName === "Anthropic") {
return parsed.content[0].text;
} else if (providerName === "OpenAI") {
return parsed.choices[0].message.content;
} else if (providerName === "Google") {
return parsed.candidates[0].content.parts[0].text;
}
return "";
}
function extractUsage(providerName, parsed) {
if (providerName === "Anthropic") {
return {
input: parsed.usage.input_tokens,
output: parsed.usage.output_tokens
};
} else if (providerName === "OpenAI") {
return {
input: parsed.usage.prompt_tokens,
output: parsed.usage.completion_tokens
};
} else if (providerName === "Google") {
var meta = parsed.usageMetadata || {};
return {
input: meta.promptTokenCount || 0,
output: meta.candidatesTokenCount || 0
};
}
return { input: 0, output: 0 };
}
Calculating True Cost Per Task
Per-token pricing is misleading in isolation. What actually matters is cost per successful task completion. A model that fails 30% of the time and requires retries can cost more than a model at 5x the per-token price that succeeds on the first attempt.
// cost-calculator.js
var PRICING = {
"claude-sonnet-4-20250514": { input: 3.0, output: 15.0 },
"claude-haiku-4-20250514": { input: 1.0, output: 5.0 },
"gpt-4o": { input: 2.5, output: 10.0 },
"gpt-4o-mini": { input: 0.15, output: 0.60 },
"gemini-2.5-flash": { input: 0.15, output: 0.60 },
"gemini-2.5-pro": { input: 1.25, output: 10.0 }
};
function calculateCost(model, inputTokens, outputTokens) {
var pricing = PRICING[model];
if (!pricing) {
return { error: "Unknown model: " + model };
}
var inputCost = (inputTokens / 1000000) * pricing.input;
var outputCost = (outputTokens / 1000000) * pricing.output;
return {
model: model,
inputCost: inputCost,
outputCost: outputCost,
totalCost: inputCost + outputCost,
costBreakdown: {
inputPercent: Math.round((inputCost / (inputCost + outputCost)) * 100),
outputPercent: Math.round((outputCost / (inputCost + outputCost)) * 100)
}
};
}
function calculateCostPerTask(results) {
var totalCost = 0;
var successCount = 0;
var failCount = 0;
results.forEach(function (result) {
var cost = calculateCost(result.model, result.inputTokens, result.outputTokens);
totalCost += cost.totalCost;
if (result.success) {
successCount += 1;
} else {
failCount += 1;
}
});
var costPerAttempt = totalCost / results.length;
var costPerSuccess = successCount > 0 ? totalCost / successCount : Infinity;
return {
totalCost: totalCost,
totalAttempts: results.length,
successCount: successCount,
failCount: failCount,
successRate: (successCount / results.length * 100).toFixed(1) + "%",
costPerAttempt: costPerAttempt.toFixed(6),
costPerSuccess: costPerSuccess.toFixed(6)
};
}
This is where cheaper models can actually cost more. If GPT-4o-mini fails 30% of the time on a complex extraction task and requires retries, while Claude Sonnet succeeds 95% of the time, Sonnet may be cheaper per successful completion despite a higher per-token price.
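A quick illustration using the functions above, with hypothetical token counts and a hypothetical 30% failure rate:
// Hypothetical: 10 extraction attempts on GPT-4o-mini, ~2,000 tokens in / 300 out each, 3 fail validation
var attempts = [];
for (var i = 0; i < 10; i++) {
  attempts.push({ model: "gpt-4o-mini", inputTokens: 2000, outputTokens: 300, success: i >= 3 });
}
console.log(calculateCostPerTask(attempts));
// costPerSuccess comes out ~43% higher than costPerAttempt, because failed attempts still burn tokens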
Hidden Costs That Will Wreck Your Budget
Several costs never appear in pricing tables. Ignore them and your actual spend will be 20-50% higher than your projections.
Retries and Failed Requests
Rate limits, server errors, and malformed outputs all cost money. Every 429 or 500 error that triggers a retry doubles the cost of that request. I have seen retry rates as high as 15% during peak hours on some providers.
// retry-aware-cost.js
function calculateRetryOverhead(baseRequestCost, retryRate, maxRetries) {
var expectedAttempts = 1;
var retryProb = retryRate;
for (var i = 1; i <= maxRetries; i++) {
expectedAttempts += retryProb;
retryProb *= retryRate;
}
return {
baseCost: baseRequestCost,
expectedAttempts: expectedAttempts.toFixed(2),
adjustedCost: (baseRequestCost * expectedAttempts).toFixed(6),
overheadPercent: ((expectedAttempts - 1) * 100).toFixed(1) + "%"
};
}
// Example: $0.01 per request, 10% retry rate, max 3 retries
var overhead = calculateRetryOverhead(0.01, 0.10, 3);
console.log(overhead);
// { baseCost: 0.01, expectedAttempts: '1.11', adjustedCost: '0.011110', overheadPercent: '11.1%' }
Context Window Overhead
Many applications pad requests with system prompts, few-shot examples, and RAG context. A 500-token user query might balloon to 5,000 tokens after adding context. That 10x multiplier on input tokens adds up fast. And with Gemini and Claude models supporting 1M-token contexts, it is easy to send far more context than you need.
Output Parsing Failures
When a model returns malformed JSON or does not follow your schema, you have to retry. These wasted tokens are pure cost. Structured output features (OpenAI's response_format, Anthropic's tool use) reduce this, but do not eliminate it entirely.
Long-Context Pricing Traps
Some providers charge premium rates when prompts exceed certain thresholds. Gemini Pro models double their input pricing for prompts over 200K tokens, and Anthropic has applied similar long-context surcharges on models where the 1M window is not included at standard pricing. If your application regularly sends large documents, model these surcharges into your cost projections.
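A small sketch of how to fold a long-context surcharge into your estimates, assuming (as Gemini's published tiers do) that the entire prompt is billed at the higher rate once it crosses the threshold:
// long-context-surcharge.js -- illustrative model; check each provider's actual tier rules
function longContextInputCost(inputTokens, basePricePerMTok, thresholdTokens, multiplier) {
  if (inputTokens <= thresholdTokens) {
    return (inputTokens / 1e6) * basePricePerMTok;
  }
  // Assumption: the whole prompt is billed at the surcharged rate above the threshold
  return (inputTokens / 1e6) * basePricePerMTok * multiplier;
}
// A 300K-token prompt on Gemini 2.5 Pro: $0.375 at the base rate, $0.75 with the 2x surcharge
console.log(longContextInputCost(300000, 1.25, 200000, 2).toFixed(3)); // "0.750"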
Egress and Network Costs
If you are running in the cloud, the request payloads you send to the API and any generated output you relay on to your own clients count as network egress. For high-volume applications generating long outputs, this can add 5-10% to your effective cost.
When Open-Source Models Make Financial Sense
Open-source models make sense in three scenarios:
- Volume exceeds $10,000/month on a single task type that an open-source model handles well
- Data privacy requirements prevent sending data to third-party APIs
- Latency requirements demand sub-100ms inference that only local GPUs can achieve
For everything else, APIs are cheaper when you account for the total cost of ownership of running GPUs.
Break-Even Analysis for Self-Hosting
// break-even.js
function calculateBreakEven(config) {
var monthlyGPUCost = config.gpuCostPerHour * 24 * 30;
var monthlyInfraCost = monthlyGPUCost + config.overheadPerMonth;
// Estimate throughput: tokens per second * seconds per month
var monthlyTokenCapacity = config.tokensPerSecond * 86400 * 30;
var effectiveCapacity = monthlyTokenCapacity * config.utilizationRate;
// Equivalent API cost
var avgTokensPerRequest = config.avgInputTokens + config.avgOutputTokens;
var requestsPerMonth = effectiveCapacity / avgTokensPerRequest;
var apiCostPerRequest = (config.avgInputTokens / 1e6 * config.apiInputPrice) +
(config.avgOutputTokens / 1e6 * config.apiOutputPrice);
var monthlyAPICost = requestsPerMonth * apiCostPerRequest;
return {
monthlySelfHostCost: "$" + monthlyInfraCost.toFixed(2),
monthlyAPICost: "$" + monthlyAPICost.toFixed(2),
breakEvenRequests: Math.ceil(monthlyInfraCost / apiCostPerRequest),
recommendation: monthlyInfraCost < monthlyAPICost ? "Self-host" : "Use API"
};
}
// Example: Llama 3 70B on 2x A100 vs Claude Sonnet API
var analysis = calculateBreakEven({
gpuCostPerHour: 6.50, // 2x A100 80GB on AWS
overheadPerMonth: 500, // Networking, storage, ops time
tokensPerSecond: 40, // Llama 70B on 2x A100
utilizationRate: 0.60, // Realistic utilization
avgInputTokens: 2000,
avgOutputTokens: 500,
apiInputPrice: 3.0, // Claude Sonnet input
apiOutputPrice: 15.0 // Claude Sonnet output
});
console.log(analysis);
// At typical utilization, you need sustained high volume to justify self-hosting
The math almost never works out for small teams. A single A100 costs $2-3/hour. By the time you add redundancy, monitoring, model serving infrastructure, and ops overhead, you are looking at $6,000-10,000/month minimum. That buys a lot of API calls, especially at 2026 prices.
Provider Reliability: The Dimension Nobody Publishes
Reliability rarely appears in comparison articles, but it directly affects your cost through retries and degraded user experience.
Anthropic maintains excellent uptime (99.9%+) with occasional degraded performance during peak hours. Rate limits are generous on paid tiers. Error messages are clear and actionable.
OpenAI has the most variable reliability among major providers. They have improved significantly, but 500 errors and rate limit spikes still occur. Their status page does not always reflect real-time conditions.
Google is extremely reliable for uptime but occasionally shows unexpected model behavior after silent updates. Documentation and error handling lag behind Anthropic and OpenAI.
Mistral is reliable with a smaller infrastructure footprint. Expect occasional latency spikes during European business hours.
Always build retry logic with exponential backoff, regardless of provider:
// resilient-client.js
function callWithRetry(fn, options) {
var maxRetries = options.maxRetries || 3;
var baseDelay = options.baseDelay || 1000;
var attempt = 0;
function tryCall() {
attempt += 1;
return fn().catch(function (err) {
if (attempt >= maxRetries) {
throw err;
}
var isRetryable = err.status === 429 || err.status === 500 ||
err.status === 503 || err.code === "ECONNRESET";
if (!isRetryable) {
throw err;
}
var delay = baseDelay * Math.pow(2, attempt - 1);
var jitter = Math.random() * delay * 0.1;
console.log(
"Retry " + attempt + "/" + maxRetries +
" after " + Math.round(delay + jitter) + "ms" +
" (error: " + (err.status || err.code) + ")"
);
return new Promise(function (resolve) {
setTimeout(function () {
resolve(tryCall());
}, delay + jitter);
});
});
}
return tryCall();
}
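A usage sketch wrapping the makeRequest helper from the harness above. Note that makeRequest resolves (rather than rejects) on HTTP errors, so the wrapper converts error responses into rejections that the retry logic understands:
callWithRetry(function () {
  return makeRequest(PROVIDERS.anthropic, "claude-haiku-4-20250514", "Say hello", 32)
    .then(function (result) {
      if (result.error) {
        var err = new Error(result.error);
        err.status = parseInt(result.error.replace("HTTP ", ""), 10); // e.g. "HTTP 429: ..." -> 429
        throw err;
      }
      return result;
    });
}, { maxRetries: 3, baseDelay: 1000 }).then(function (result) {
  console.log("Succeeded:", result.outputTokens, "output tokens");
}).catch(function (err) {
  console.error("Gave up:", err.message);
});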
Negotiating Enterprise Pricing
If you are spending more than $5,000/month, reach out to provider sales teams. Here is what is typically negotiable:
- Volume discounts: 10-30% off list pricing at $10K+/month commitments
- Rate limit increases: Higher TPM (tokens per minute) and RPM (requests per minute)
- Prompt caching improvements: Extended cache TTLs or priority caching
- Committed use discounts: Annual commitments in exchange for lower rates
- Priority support: Dedicated Slack channels, faster incident response
Do not accept the first offer. Providers are competing aggressively for enterprise customers. Get quotes from at least two providers and use them as leverage.
Forecasting Costs as You Scale
The biggest mistake teams make is linear extrapolation. Your costs will not scale linearly because:
- Prompt optimization reduces token usage by 20-40% as you refine prompts
- Caching reduces effective input costs as repeat patterns emerge
- Model downgrades become possible once you have evaluation data showing where cheaper models suffice
- Batching lets you move non-urgent work to 50% discount batch APIs
// cost-forecast.js
function forecastMonthlyCost(config) {
var months = [];
var currentVolume = config.startingRequestsPerMonth;
for (var month = 1; month <= 12; month++) {
var promptOptimization = Math.max(0.6, 1 - (month * 0.03));
var cachingDiscount = Math.min(0.4, month * 0.05);
var batchPercent = Math.min(config.maxBatchPercent, month * 0.05);
var effectiveInputTokens = config.avgInputTokens * promptOptimization;
var cachedTokens = effectiveInputTokens * cachingDiscount;
var uncachedTokens = effectiveInputTokens - cachedTokens;
var realtimeRequests = currentVolume * (1 - batchPercent);
var batchRequests = currentVolume * batchPercent;
var inputCost = (uncachedTokens / 1e6 * config.inputPrice) +
(cachedTokens / 1e6 * config.cachedInputPrice);
var outputCost = config.avgOutputTokens / 1e6 * config.outputPrice;
var perRequestCost = inputCost + outputCost;
var realtimeCost = realtimeRequests * perRequestCost;
var batchCost = batchRequests * perRequestCost * 0.5;
var totalCost = realtimeCost + batchCost;
months.push({
month: month,
requests: currentVolume,
costPerRequest: "$" + perRequestCost.toFixed(6),
totalCost: "$" + totalCost.toFixed(2)
});
currentVolume = Math.round(currentVolume * config.monthlyGrowthRate);
}
return months;
}
var forecast = forecastMonthlyCost({
startingRequestsPerMonth: 100000,
monthlyGrowthRate: 1.20,
avgInputTokens: 3000,
avgOutputTokens: 500,
inputPrice: 3.0,
cachedInputPrice: 0.30,
outputPrice: 15.0,
maxBatchPercent: 0.30
});
forecast.forEach(function (m) {
console.log("Month " + m.month + ": " + m.requests + " requests, " + m.totalCost);
});
Complete Working Example
Here is a complete benchmarking tool that tests multiple providers on the same prompts, measures latency, estimates cost, and generates a comparison report.
// llm-benchmark.js
var https = require("https");
var fs = require("fs");
// ============================================================
// Configuration
// ============================================================
var PRICING = {
"claude-sonnet-4-20250514": { input: 3.0, output: 15.0 },
"claude-haiku-4-20250514": { input: 1.0, output: 5.0 },
"gpt-4o": { input: 2.5, output: 10.0 },
"gpt-4o-mini": { input: 0.15, output: 0.60 }
};
var TEST_PROMPTS = [
{
name: "summarization",
prompt: "Summarize the following in 3 bullet points: The TCP/IP model consists of four layers. The application layer handles high-level protocols like HTTP and SMTP. The transport layer manages end-to-end communication using TCP or UDP. The internet layer handles addressing and routing via IP. The link layer manages the physical network interface. Each layer encapsulates data from the layer above, adding its own headers for processing at the corresponding layer on the receiving end.",
maxTokens: 256,
evaluator: function (output) {
var hasBullets = (output.match(/[-*•]/g) || []).length >= 3;
var isReasonableLength = output.length > 100 && output.length < 1000;
return hasBullets && isReasonableLength;
}
},
{
name: "code_generation",
prompt: "Write a JavaScript function called 'deepMerge' that recursively merges two objects. If both values are objects, merge them recursively. If both values are arrays, concatenate them. Otherwise, the second value wins. Include JSDoc comments.",
maxTokens: 512,
evaluator: function (output) {
var hasFunction = output.indexOf("function") !== -1;
var hasMerge = output.indexOf("deepMerge") !== -1;
var hasRecursion = output.indexOf("deepMerge(") !== -1;
return hasFunction && hasMerge && hasRecursion;
}
},
{
name: "classification",
prompt: "Classify the following text into exactly one category (positive, negative, neutral). Respond with ONLY the category word, nothing else.\n\nText: \"The product works as expected but the shipping took longer than promised.\"",
maxTokens: 16,
evaluator: function (output) {
var normalized = output.trim().toLowerCase();
return normalized === "neutral" || normalized === "negative" || normalized === "positive";
}
},
{
name: "reasoning",
prompt: "A farmer has 17 sheep. All but 9 run away. How many sheep does the farmer have left? Think step by step, then give your final answer as just a number on the last line.",
maxTokens: 256,
evaluator: function (output) {
return output.indexOf("9") !== -1;
}
}
];
// ============================================================
// Provider API Functions
// ============================================================
function callAnthropic(model, prompt, maxTokens) {
return new Promise(function (resolve, reject) {
var startTime = Date.now();
var body = JSON.stringify({
model: model,
max_tokens: maxTokens,
messages: [{ role: "user", content: prompt }]
});
var req = https.request({
hostname: "api.anthropic.com",
path: "/v1/messages",
method: "POST",
headers: {
"Content-Type": "application/json",
"x-api-key": process.env.ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01"
}
}, function (res) {
var data = "";
var ttft = Date.now() - startTime;
res.on("data", function (chunk) { data += chunk; });
res.on("end", function () {
var totalTime = Date.now() - startTime;
if (res.statusCode >= 400) {
resolve({ error: "HTTP " + res.statusCode, ttft: ttft, totalTime: totalTime });
return;
}
var parsed = JSON.parse(data);
resolve({
output: parsed.content[0].text,
inputTokens: parsed.usage.input_tokens,
outputTokens: parsed.usage.output_tokens,
ttft: ttft,
totalTime: totalTime
});
});
});
req.on("error", reject);
req.write(body);
req.end();
});
}
function callOpenAI(model, prompt, maxTokens) {
return new Promise(function (resolve, reject) {
var startTime = Date.now();
var body = JSON.stringify({
model: model,
max_tokens: maxTokens,
messages: [{ role: "user", content: prompt }]
});
var req = https.request({
hostname: "api.openai.com",
path: "/v1/chat/completions",
method: "POST",
headers: {
"Content-Type": "application/json",
"Authorization": "Bearer " + process.env.OPENAI_API_KEY
}
}, function (res) {
var data = "";
var ttft = Date.now() - startTime;
res.on("data", function (chunk) { data += chunk; });
res.on("end", function () {
var totalTime = Date.now() - startTime;
if (res.statusCode >= 400) {
resolve({ error: "HTTP " + res.statusCode, ttft: ttft, totalTime: totalTime });
return;
}
var parsed = JSON.parse(data);
resolve({
output: parsed.choices[0].message.content,
inputTokens: parsed.usage.prompt_tokens,
outputTokens: parsed.usage.completion_tokens,
ttft: ttft,
totalTime: totalTime
});
});
});
req.on("error", reject);
req.write(body);
req.end();
});
}
function callProvider(model, prompt, maxTokens) {
if (model.indexOf("claude") === 0) {
return callAnthropic(model, prompt, maxTokens);
} else if (model.indexOf("gpt") === 0) {
return callOpenAI(model, prompt, maxTokens);
}
return Promise.reject(new Error("Unknown model: " + model));
}
// ============================================================
// Benchmark Runner
// ============================================================
function runBenchmark(models, prompts, runsPerTest) {
var results = [];
var queue = [];
models.forEach(function (model) {
prompts.forEach(function (testCase) {
for (var run = 0; run < runsPerTest; run++) {
queue.push({ model: model, testCase: testCase, run: run });
}
});
});
var index = 0;
function processNext() {
if (index >= queue.length) {
return Promise.resolve(results);
}
var item = queue[index];
index += 1;
console.log(
"Running: " + item.model + " / " + item.testCase.name +
" (run " + (item.run + 1) + ")"
);
return callProvider(item.model, item.testCase.prompt, item.testCase.maxTokens)
.then(function (response) {
var success = false;
if (!response.error && response.output) {
success = item.testCase.evaluator(response.output);
}
var pricing = PRICING[item.model] || { input: 0, output: 0 };
var cost = 0;
if (response.inputTokens && response.outputTokens) {
cost = (response.inputTokens / 1e6 * pricing.input) +
(response.outputTokens / 1e6 * pricing.output);
}
results.push({
model: item.model,
task: item.testCase.name,
run: item.run + 1,
success: success,
ttft: response.ttft,
totalTime: response.totalTime,
inputTokens: response.inputTokens || 0,
outputTokens: response.outputTokens || 0,
cost: cost,
error: response.error || null
});
return processNext();
})
.catch(function (err) {
results.push({
model: item.model,
task: item.testCase.name,
run: item.run + 1,
success: false,
error: err.message
});
return processNext();
});
}
return processNext();
}
// ============================================================
// Report Generator
// ============================================================
function generateReport(results) {
var report = [];
report.push("=== LLM Provider Benchmark Report ===");
report.push("Date: " + new Date().toISOString());
report.push("Total tests: " + results.length);
report.push("");
// Group by model
var modelGroups = {};
results.forEach(function (r) {
if (!modelGroups[r.model]) {
modelGroups[r.model] = [];
}
modelGroups[r.model].push(r);
});
Object.keys(modelGroups).forEach(function (model) {
var modelResults = modelGroups[model];
var successCount = modelResults.filter(function (r) { return r.success; }).length;
var totalCost = modelResults.reduce(function (sum, r) { return sum + (r.cost || 0); }, 0);
var avgTTFT = modelResults.reduce(function (sum, r) { return sum + (r.ttft || 0); }, 0) / modelResults.length;
var avgTotal = modelResults.reduce(function (sum, r) { return sum + (r.totalTime || 0); }, 0) / modelResults.length;
report.push("--- " + model + " ---");
report.push(" Success rate: " + successCount + "/" + modelResults.length +
" (" + (successCount / modelResults.length * 100).toFixed(0) + "%)");
report.push(" Avg TTFT: " + avgTTFT.toFixed(0) + "ms");
report.push(" Avg total time: " + avgTotal.toFixed(0) + "ms");
report.push(" Total cost: $" + totalCost.toFixed(6));
report.push(" Cost per success: $" +
(successCount > 0 ? (totalCost / successCount).toFixed(6) : "N/A"));
report.push("");
// Per-task breakdown
var taskGroups = {};
modelResults.forEach(function (r) {
if (!taskGroups[r.task]) taskGroups[r.task] = [];
taskGroups[r.task].push(r);
});
Object.keys(taskGroups).forEach(function (task) {
var taskResults = taskGroups[task];
var taskSuccess = taskResults.filter(function (r) { return r.success; }).length;
report.push(" " + task + ": " + taskSuccess + "/" + taskResults.length + " passed");
});
report.push("");
});
return report.join("\n");
}
// ============================================================
// Main
// ============================================================
function main() {
var models = Object.keys(PRICING);
var runsPerTest = 3;
console.log("Starting LLM benchmark...");
console.log("Models: " + models.join(", "));
console.log("Tests per model: " + TEST_PROMPTS.length + " x " + runsPerTest + " runs");
console.log("");
runBenchmark(models, TEST_PROMPTS, runsPerTest).then(function (results) {
var report = generateReport(results);
console.log("\n" + report);
// Save raw results
fs.writeFileSync(
"benchmark-results.json",
JSON.stringify(results, null, 2)
);
// Save report
fs.writeFileSync("benchmark-report.txt", report);
console.log("Results saved to benchmark-results.json");
console.log("Report saved to benchmark-report.txt");
}).catch(function (err) {
console.error("Benchmark failed:", err);
process.exit(1);
});
}
main();
Run it with:
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
node llm-benchmark.js
Sample output:
Starting LLM benchmark...
Models: claude-sonnet-4-20250514, claude-haiku-4-20250514, gpt-4o, gpt-4o-mini
Tests per model: 4 x 3 runs
Running: claude-sonnet-4-20250514 / summarization (run 1)
Running: claude-sonnet-4-20250514 / summarization (run 2)
...
=== LLM Provider Benchmark Report ===
Date: 2026-03-15T15:30:00.000Z
Total tests: 48
--- claude-sonnet-4-20250514 ---
Success rate: 12/12 (100%)
Avg TTFT: 520ms
Avg total time: 1840ms
Total cost: $0.001260
Cost per success: $0.000105
--- claude-haiku-4-20250514 ---
Success rate: 11/12 (92%)
Avg TTFT: 280ms
Avg total time: 680ms
Total cost: $0.000384
Cost per success: $0.000035
--- gpt-4o ---
Success rate: 11/12 (92%)
Avg TTFT: 410ms
Avg total time: 1520ms
Total cost: $0.000980
Cost per success: $0.000089
--- gpt-4o-mini ---
Success rate: 10/12 (83%)
Avg TTFT: 210ms
Avg total time: 520ms
Total cost: $0.000048
Cost per success: $0.000005
Common Issues and Troubleshooting
1. Rate Limit Errors (429)
Error: Request failed with status 429: {"error":{"message":"Rate limit reached for gpt-4o in organization org-xxx on tokens per min (TPM): Limit 30000, Used 28500, Requested 2500."}}
Fix: Implement exponential backoff with jitter. Do not retry immediately. Space out requests by at least 1 second for benchmarking. For production workloads, use a token bucket rate limiter that tracks TPM and RPM separately.
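A minimal sketch of a TPM bucket (RPM would be a second bucket checked alongside it); the structure and numbers are illustrative, not a production limiter:
// token-bucket.js -- pace requests so the provider's TPM limit is never exceeded
function TokenBucket(tokensPerMinute) {
  this.capacity = tokensPerMinute;
  this.available = tokensPerMinute;
  this.lastRefill = Date.now();
}
TokenBucket.prototype.tryConsume = function (tokens) {
  var elapsedMinutes = (Date.now() - this.lastRefill) / 60000;
  this.available = Math.min(this.capacity, this.available + elapsedMinutes * this.capacity);
  this.lastRefill = Date.now();
  if (this.available >= tokens) {
    this.available -= tokens; // Budget available: send the request now
    return true;
  }
  return false; // Over budget: caller should wait and try again
};
var bucket = new TokenBucket(30000); // e.g. a 30K TPM limit
console.log(bucket.tryConsume(2500)); // true while under budget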
2. Context Length Exceeded
Error: Request failed with status 400: {"error":{"type":"invalid_request_error","message":"prompt is too long: 204521 tokens > 200000 maximum"}}
Fix: Count tokens before sending. Use the provider's tokenizer library (@anthropic-ai/tokenizer, tiktoken for OpenAI) to validate input length. Truncate or chunk long inputs before they hit the API.
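A sketch of pre-flight counting with the tiktoken npm package. This counts with OpenAI's tokenizer; other providers' counts will differ by 5-15%, and Anthropic also exposes a server-side token-counting endpoint if you prefer exact numbers.
// count-tokens.js -- validate input length before sending
var tiktoken = require("tiktoken");
function fitsInContext(text, contextLimit) {
  var enc = tiktoken.get_encoding("cl100k_base"); // newer GPT-4o-family models use "o200k_base"
  var tokenCount = enc.encode(text).length;
  enc.free(); // the WASM encoder must be freed explicitly
  return { tokenCount: tokenCount, fits: tokenCount <= contextLimit };
}
console.log(fitsInContext("Summarize the attached report...", 128000));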
3. Timeout on Long Generations
Error: ESOCKETTIMEDOUT - socket hang up after 60000ms
Fix: Set appropriate timeouts based on expected generation length. A 4,000-token output at 80 TPS takes 50 seconds. Set your timeout to at least 2x the expected generation time. For the benchmarking harness, use 120-second timeouts.
var req = https.request(options);
req.setTimeout(120000, function () {
req.destroy(new Error("Request timed out after 120s"));
});
4. JSON Parsing Failures from Model Output
SyntaxError: Unexpected token 'H' at position 0 - JSON.parse("Here is the JSON you requested:\n{\"key\": \"value\"}")
Fix: When requesting structured output, use the provider's native JSON mode. For Anthropic, use tool_use with a defined schema. For OpenAI, set response_format: { type: "json_object" }. Never rely on the model to produce valid JSON without these guardrails in production.
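A sketch of the OpenAI request body with JSON mode enabled (JSON mode requires that the prompt itself mention JSON explicitly):
// json-mode-body.js -- send with the same https.request pattern used in the harness
var body = JSON.stringify({
  model: "gpt-4o-mini",
  response_format: { type: "json_object" },
  messages: [
    { role: "system", content: "Respond with a JSON object: {\"category\": \"positive|negative|neutral\"}" },
    { role: "user", content: "The product works but shipping was slow." }
  ]
});
// JSON.parse(choices[0].message.content) will then parse reliably, but still validate the schema yourself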
5. Inconsistent Token Counts Across Providers
Same prompt: Anthropic reports 145 input tokens, OpenAI reports 152 input tokens
This is expected. Different providers use different tokenizers. Token counts for identical text will differ by 5-15%. Your cost calculations must use each provider's reported token counts, not a shared estimate.
The Playbook: Nine Rules for LLM Cost Optimization
Benchmark with your actual prompts, not synthetic ones. Generic benchmarks tell you nothing about your specific task. The 10 minutes spent running your real prompts through multiple providers will save you thousands in wrong decisions.
Always calculate cost per successful task, not cost per token. A model that is 5x cheaper per token but fails 40% of the time is not cheaper. Track success rates alongside cost in every evaluation.
Start cheap and escalate only with evidence. Begin with the cheapest option (Haiku 4.5, GPT-4o-mini, Gemini 2.5 Flash) and only upgrade when you have empirical data that quality is insufficient. Most classification, extraction, and simple generation tasks do not need frontier models.
Implement prompt caching from day one. If you have any static content in your prompts, use prompt caching. The savings are typically 30-60% on input costs, and with some providers, up to 90%.
Build provider abstraction early. Do not hardcode to a single provider's API format. Use a thin abstraction layer that lets you swap models with a configuration change. Prices drop and quality improves constantly.
Monitor costs in real-time, not monthly. Set up daily cost alerts and per-request cost logging from the start. A prompt engineering bug that doubles your token usage will cost you thousands before the monthly bill arrives.
Move non-latency-sensitive work to batch APIs. Content moderation, data enrichment, document processing, and evaluation runs can all use batch endpoints at 50% off. The 24-hour SLA is fine for offline work.
Re-evaluate quarterly. The LLM market moves fast. Claude Opus dropped from $15/$75 to $5/$25 in one generation. GPT-4.1 replaced GPT-4o at a lower price point. Schedule quarterly benchmark runs with your production prompts.
Negotiate when you cross $5K/month. Every major provider offers volume discounts. Get quotes from multiple providers and use them as leverage. A 20% discount on a $10K/month bill saves $24K annually.