Token Optimization: Reducing API Costs
Strategies for reducing LLM API costs through token optimization, caching, model selection, and budget tracking in Node.js applications.
LLM API calls are priced by tokens, and without deliberate optimization, costs escalate fast. A production application handling thousands of requests per day can burn through hundreds of dollars in API fees before anyone notices. This article covers the full stack of token optimization techniques: counting tokens before they leave your server, compressing prompts without losing meaning, caching aggressively, picking the right model for each task, and tracking every token so you know exactly where your money goes.
Prerequisites
- Node.js v18+ installed
- Familiarity with REST APIs and Express.js
- An API key for Anthropic or OpenAI
- Redis installed locally or accessible remotely (for caching examples)
- Basic understanding of how LLMs process text
Install the packages we will use throughout this article (crypto is built into Node.js and needs no install):
npm install express tiktoken @anthropic-ai/sdk redis cheerio
Understanding Tokenization
Before you can optimize tokens, you need to understand what they are. LLMs do not process raw text. They break input into tokens using a scheme called Byte Pair Encoding (BPE). A token might be a whole word, a subword fragment, or even a single character depending on how common it is in the training data.
The word "optimization" becomes two tokens: optim and ization. The word "the" is a single token. A newline character is a token. JSON formatting characters like {, ", and : are each individual tokens. This matters because you pay for every single one.
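If you only need a ballpark figure before reaching for a tokenizer, a character-based heuristic works: English text averages roughly four characters per token. This is an approximation for estimation only, never a billing figure:

```javascript
// Rough token estimate: ~4 characters per token for English text.
// Use only for ballpark figures; real counts come from a tokenizer
// or from the usage field in the API response.
function roughTokenEstimate(text) {
  return Math.ceil(text.length / 4);
}

console.log(roughTokenEstimate("Hello world")); // 3 (tiktoken reports 2)
```

The error grows for code, JSON, and non-English text, where tokens skew shorter. Treat it as a sanity check, not a substitute for real counting.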
Token Counting in Node.js
The tiktoken library lets you count tokens before making an API call. This is essential for budgeting and for avoiding context window overflows.
var { encoding_for_model } = require("tiktoken");
function countTokens(text, model) {
var enc = encoding_for_model(model || "gpt-4");
var tokens = enc.encode(text);
var count = tokens.length;
enc.free();
return count;
}
// Examples showing how token counts vary
var samples = [
"Hello world",
"The quick brown fox jumps over the lazy dog",
JSON.stringify({ name: "Shane", role: "engineer", years: 10 }),
"def fibonacci(n):\n if n <= 1:\n return n\n return fibonacci(n-1) + fibonacci(n-2)"
];
samples.forEach(function(text) {
console.log(countTokens(text) + " tokens: " + text.substring(0, 50));
});
Output:
2 tokens: Hello world
9 tokens: The quick brown fox jumps over the lazy dog
18 tokens: {"name":"Shane","role":"engineer","years":10}
29 tokens: def fibonacci(n):
Notice that the JSON object uses 18 tokens. Those curly braces, colons, and quote marks each consume a token. This is why structured formats are more expensive than plain text, and why prompt format matters.
For Anthropic's Claude models, token counting works differently. Claude uses its own tokenizer, and the API response includes actual token usage in its usage field. Always use the reported usage from the API response for billing accuracy, and use local counting only for estimation and budgeting.
var Anthropic = require("@anthropic-ai/sdk");
var client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
function callClaudeAndTrackUsage(prompt) {
return client.messages.create({
model: "claude-haiku-4-20250414",
max_tokens: 256,
messages: [{ role: "user", content: prompt }]
}).then(function(response) {
console.log("Input tokens:", response.usage.input_tokens);
console.log("Output tokens:", response.usage.output_tokens);
console.log("Total tokens:", response.usage.input_tokens + response.usage.output_tokens);
return response;
});
}
Why Token Counts Matter
Tokens drive two things that determine your operational cost: price and latency.
Price is straightforward. Anthropic charges per million input tokens and per million output tokens, with output tokens costing more. As of early 2026, Claude Haiku input tokens cost roughly $0.25 per million, while Claude Opus input tokens cost around $15 per million. That is a 60x price difference. Output tokens are even more expensive relative to input.
Latency scales with token count. More input tokens means more time to process the prompt. More output tokens means more time waiting for the response to stream. A 500-token prompt with a 200-token response might take 1.5 seconds. A 5,000-token prompt with a 2,000-token response might take 8 seconds. In user-facing applications, this directly impacts the experience.
Here is a quick cost comparison for a typical API call of 1,000 input tokens and 500 output tokens:
Claude Haiku: $0.00025 input + $0.000625 output = $0.000875/call
Claude Sonnet: $0.003 input + $0.0075 output = $0.0105/call
Claude Opus: $0.015 input + $0.0375 output = $0.0525/call
GPT-4o: $0.0025 input + $0.005 output = $0.0075/call
GPT-4o-mini: $0.00015 input + $0.0003 output = $0.00045/call
At 10,000 calls per day, that is the difference between $8.75/day with Haiku and $525/day with Opus.
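Projecting per-call figures to daily spend is simple arithmetic. Here is a sketch with the rates above hardcoded; verify them against current provider pricing before relying on the output:

```javascript
// Project daily API spend from per-call token counts and call volume.
// Rates are dollars per million tokens, taken from the comparison above;
// they drift over time, so check current pricing pages.
function dailyCost(rate, inputTokens, outputTokens, callsPerDay) {
  var perCall = (inputTokens / 1e6) * rate.input + (outputTokens / 1e6) * rate.output;
  return perCall * callsPerDay;
}

var HAIKU = { input: 0.25, output: 1.25 };
var OPUS = { input: 15.0, output: 75.0 };

console.log(dailyCost(HAIKU, 1000, 500, 10000).toFixed(2)); // "8.75"
console.log(dailyCost(OPUS, 1000, 500, 10000).toFixed(2));  // "525.00"
```

Running this for each candidate model before you ship a feature makes the model-selection decision concrete rather than intuitive.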
Prompt Compression Techniques
The fastest way to cut costs is to send fewer tokens. Prompt compression is the art of saying the same thing with fewer words while preserving the information the model needs.
Remove Redundancy
Most prompts contain filler words, repeated instructions, and unnecessary politeness. The model does not care if you say "please" or "kindly." It processes instructions the same way regardless.
// Before: 47 tokens
var verbosePrompt = "Could you please analyze the following customer review " +
"and provide me with a detailed sentiment analysis? I would really " +
"appreciate it if you could classify the sentiment as positive, " +
"negative, or neutral. Here is the review:";
// After: 18 tokens
var compressedPrompt = "Classify this review's sentiment as positive, " +
"negative, or neutral:";
That is a 62% reduction in input tokens for the instruction portion alone. Across thousands of calls, this adds up.
Abbreviate and Restructure
Use terse, structured formats for system prompts. Bullet points and abbreviated instructions work just as well as flowing prose.
// Before: 89 tokens
var verboseSystem = "You are an expert data analyst. When the user gives you " +
"data, you should analyze it carefully and provide insights. Always " +
"format your response as JSON with the following fields: summary " +
"(a brief text summary), metrics (an array of key metrics), and " +
"recommendations (an array of actionable recommendations). Make sure " +
"to be thorough but concise in your analysis.";
// After: 42 tokens
var compressedSystem = "Role: data analyst\n" +
"Output: JSON\n" +
"Fields:\n" +
"- summary: string, brief\n" +
"- metrics: string[], key findings\n" +
"- recommendations: string[], actionable";
Both prompts produce equivalent results. The compressed version costs half as much.
Strip Unnecessary Context
When sending documents or data for analysis, preprocess them. Remove HTML tags, strip whitespace, eliminate headers and footers that add no analytical value.
var cheerio = require("cheerio");
function stripToEssentials(html) {
var $ = cheerio.load(html);
// Remove non-content elements
$("script, style, nav, footer, header, aside").remove();
// Get just the text content
var text = $("body").text();
// Collapse whitespace
text = text.replace(/\s+/g, " ").trim();
return text;
}
// A typical web page might be 15,000 tokens of HTML
// After stripping: 2,000-3,000 tokens of actual content
Response Length Control
Output tokens are more expensive than input tokens across every provider. Controlling response length is one of the highest-leverage optimizations.
max_tokens
Always set max_tokens to the minimum viable length for your use case. If you need a one-word classification, do not leave it at the default.
// Classification task - we need one word
var classificationRequest = {
model: "claude-haiku-4-20250414",
max_tokens: 10,
messages: [{ role: "user", content: "Classify this as spam or not spam: " + emailBody }]
};
// Summarization - constrain to a paragraph
var summaryRequest = {
model: "claude-haiku-4-20250414",
max_tokens: 150,
messages: [{ role: "user", content: "Summarize in 2-3 sentences: " + article }]
};
Stop Sequences
Stop sequences terminate generation when the model outputs a specific string. This prevents the model from rambling past the useful portion of its response.
client.messages.create({
  model: "claude-haiku-4-20250414",
  max_tokens: 500,
  stop_sequences: ["---", "Note:", "Additional"],
  messages: [{
    role: "user",
    content: "Extract the product name and price from this text. " +
      "Format: Name: ..., Price: ...\n\n" + productDescription
  }]
}).then(function(response) {
  console.log(response.content[0].text);
});
Model Selection Based on Task Complexity
This is the single most impactful cost optimization and the one most teams get wrong. Not every task needs your most capable model.
Use Haiku (or GPT-4o-mini) for:
- Classification (sentiment, category, spam detection)
- Data extraction from structured text
- Simple reformatting and text transformation
- Keyword extraction
- Language detection
- Summarization of short documents
Use Sonnet (or GPT-4o) for:
- Complex reasoning over multiple documents
- Code generation with nuanced requirements
- Long-form content creation
- Multi-step analysis
Use Opus for:
- Tasks where accuracy is critical and errors are costly
- Novel research or creative tasks
- Complex code architecture decisions
- Anything where Sonnet consistently produces wrong answers
var MODEL_ROUTING = {
classify: "claude-haiku-4-20250414",
extract: "claude-haiku-4-20250414",
summarize: "claude-haiku-4-20250414",
analyze: "claude-sonnet-4-20250514",
generate_code: "claude-sonnet-4-20250514",
research: "claude-opus-4-20250514"
};
function routeToModel(taskType) {
return MODEL_ROUTING[taskType] || "claude-haiku-4-20250414";
}
function processTask(taskType, content) {
var model = routeToModel(taskType);
console.log("Routing " + taskType + " to " + model);
return client.messages.create({
model: model,
max_tokens: taskType === "classify" ? 20 : 1024,
messages: [{ role: "user", content: content }]
});
}
I have seen teams cut their API costs by 80% just by routing classification and extraction tasks to Haiku instead of running everything through Sonnet.
Caching Strategies
If two users ask similar questions, you should not pay for the same answer twice. Caching is essential for any production LLM application.
Response Caching with Redis
var redis = require("redis");
var crypto = require("crypto");
var redisClient = redis.createClient({ url: process.env.REDIS_URL });
redisClient.connect();
function generateCacheKey(model, messages, maxTokens) {
var payload = JSON.stringify({ model: model, messages: messages, maxTokens: maxTokens });
return "llm:" + crypto.createHash("sha256").update(payload).digest("hex");
}
function cachedLLMCall(options) {
var cacheKey = generateCacheKey(options.model, options.messages, options.max_tokens);
return redisClient.get(cacheKey).then(function(cached) {
if (cached) {
var parsed = JSON.parse(cached);
parsed._cached = true;
console.log("Cache hit - saved " + parsed.usage.input_tokens + " input tokens");
return parsed;
}
return client.messages.create(options).then(function(response) {
var serialized = JSON.stringify(response);
// Cache for 1 hour. Adjust TTL based on how dynamic the content is.
redisClient.setEx(cacheKey, 3600, serialized);
return response;
});
});
}
Prompt Caching (Anthropic-Specific)
Anthropic offers prompt caching at the API level. When you mark portions of your prompt with cache_control, the API caches the processed representation of those tokens. Subsequent calls that share the same cached prefix get a 90% discount on those input tokens.
function callWithPromptCaching(systemPrompt, userMessage) {
return client.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
system: [
{
type: "text",
text: systemPrompt,
cache_control: { type: "ephemeral" }
}
],
messages: [{ role: "user", content: userMessage }]
}).then(function(response) {
console.log("Cache creation tokens:", response.usage.cache_creation_input_tokens || 0);
console.log("Cache read tokens:", response.usage.cache_read_input_tokens || 0);
return response;
});
}
// First call: full price for system prompt tokens (+ small cache write fee)
// Subsequent calls with same system prompt: 90% discount on those tokens
This is extremely powerful for applications where every request shares a large system prompt or a reference document. A 3,000-token system prompt cached across 10,000 calls per day saves roughly $6.75 per day with Haiku and $405 per day with Opus.
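The back-of-envelope math behind that claim is worth having as a function. This sketch assumes Anthropic's published multipliers at the time of writing: the cache write costs 25% extra on the first call, and cache reads are billed at 10% of the base input rate:

```javascript
// Estimate daily savings from prompt-caching a shared prefix.
// Assumes one cache write per day (1.25x base input rate), then
// cache reads at 0.10x for the remaining calls.
function promptCacheSavings(promptTokens, callsPerDay, inputRatePerMillion) {
  var uncached = (promptTokens * callsPerDay / 1e6) * inputRatePerMillion;
  var cached = (promptTokens / 1e6) * inputRatePerMillion * 1.25 +
    (promptTokens * (callsPerDay - 1) / 1e6) * inputRatePerMillion * 0.10;
  return uncached - cached;
}

// 3,000-token system prompt, 10,000 calls/day
console.log(promptCacheSavings(3000, 10000, 0.25).toFixed(2));  // Haiku: "6.75"
console.log(promptCacheSavings(3000, 10000, 15.0).toFixed(2));  // Opus: "404.95"
```

In practice the cache expires (the ephemeral TTL is short), so real traffic incurs more than one write per day; treat these numbers as an upper bound on savings.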
Batching Requests
Both Anthropic and OpenAI offer batch APIs that provide a 50% discount in exchange for accepting longer response times (up to 24 hours). If you have non-time-sensitive workloads, batching is free money.
var fs = require("fs");
function createBatchFile(requests) {
var lines = requests.map(function(req, index) {
return JSON.stringify({
custom_id: "req-" + index,
params: {
model: "claude-haiku-4-20250414",
max_tokens: 256,
messages: [{ role: "user", content: req.prompt }]
}
});
});
var filePath = "/tmp/batch-" + Date.now() + ".jsonl";
fs.writeFileSync(filePath, lines.join("\n"));
return filePath;
}
// Use cases for batching:
// - Nightly content classification
// - Bulk document summarization
// - Periodic data enrichment
// - Test suite evaluation
Context Window Management
Long conversations eat tokens fast. Each message in the conversation history gets sent as input tokens on every subsequent call. A 20-message conversation can easily hit 10,000+ input tokens per turn.
Sliding Window
Keep only the most recent N messages. Simple and effective for chat applications where distant context is less relevant.
function slidingWindow(messages, maxMessages) {
var systemMessages = messages.filter(function(m) { return m.role === "system"; });
var nonSystem = messages.filter(function(m) { return m.role !== "system"; });
if (nonSystem.length <= maxMessages) {
return messages;
}
// Always keep system messages + most recent N messages
var windowed = nonSystem.slice(nonSystem.length - maxMessages);
return systemMessages.concat(windowed);
}
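Message count is a crude proxy for size; a token budget is more precise. This variant is a sketch that drops the oldest non-system messages until the history fits a budget. The countFn parameter is any token-counting function (such as countTokens defined earlier); the word-count stand-in below is for illustration only:

```javascript
// Trim oldest non-system messages until the history fits a token budget.
// countFn is any token-counting function; this sketch counts each
// message's content independently (real APIs add per-message overhead).
function fitTokenBudget(messages, maxTokens, countFn) {
  var system = messages.filter(function(m) { return m.role === "system"; });
  var rest = messages.filter(function(m) { return m.role !== "system"; });
  function total(msgs) {
    return msgs.reduce(function(sum, m) { return sum + countFn(m.content); }, 0);
  }
  while (rest.length > 1 && total(system) + total(rest) > maxTokens) {
    rest.shift(); // drop the oldest non-system message first
  }
  return system.concat(rest);
}

// Demo with a crude word count standing in for a real tokenizer
var wordCount = function(text) { return text.split(/\s+/).length; };
var history = [
  { role: "system", content: "Be brief" },
  { role: "user", content: "first question about something old" },
  { role: "assistant", content: "an old answer" },
  { role: "user", content: "newest question" }
];
var trimmed = fitTokenBudget(history, 10, wordCount);
console.log(trimmed.length); // 3: the system message plus the two newest
```

One design note: trimming can leave the window starting on an assistant message, which some APIs reject; production code should also trim to the nearest user-message boundary.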
Conversation Summarization
For applications where older context matters, summarize the conversation periodically and replace the full history with the summary.
function summarizeAndCompact(messages, threshold) {
if (messages.length < threshold) {
return Promise.resolve(messages);
}
var oldMessages = messages.slice(0, messages.length - 4);
var recentMessages = messages.slice(messages.length - 4);
var summaryPrompt = "Summarize this conversation in 2-3 sentences, " +
"preserving key facts and decisions:\n\n" +
oldMessages.map(function(m) {
return m.role + ": " + m.content;
}).join("\n");
return client.messages.create({
model: "claude-haiku-4-20250414",
max_tokens: 200,
messages: [{ role: "user", content: summaryPrompt }]
}).then(function(response) {
var summary = response.content[0].text;
var compacted = [
{ role: "user", content: "[Previous conversation summary: " + summary + "]" },
{ role: "assistant", content: "Understood, I have the context from our previous discussion." }
].concat(recentMessages);
console.log("Compacted " + oldMessages.length + " messages into 1 summary");
return compacted;
});
}
Token Budgeting and Cost Tracking
In a production application, you need to know how much each feature, each user, and each API route costs. Without this, you are flying blind.
var tokenTracker = {
usage: {},
record: function(feature, model, inputTokens, outputTokens) {
var key = feature + ":" + model;
if (!this.usage[key]) {
this.usage[key] = { input: 0, output: 0, calls: 0, cost: 0 };
}
this.usage[key].input += inputTokens;
this.usage[key].output += outputTokens;
this.usage[key].calls += 1;
this.usage[key].cost += this.calculateCost(model, inputTokens, outputTokens);
},
calculateCost: function(model, inputTokens, outputTokens) {
var rates = {
"claude-haiku-4-20250414": { input: 0.25, output: 1.25 },
"claude-sonnet-4-20250514": { input: 3.0, output: 15.0 },
"claude-opus-4-20250514": { input: 15.0, output: 75.0 },
"gpt-4o": { input: 2.5, output: 10.0 },
"gpt-4o-mini": { input: 0.15, output: 0.6 }
};
var rate = rates[model] || rates["claude-haiku-4-20250414"];
var inputCost = (inputTokens / 1000000) * rate.input;
var outputCost = (outputTokens / 1000000) * rate.output;
return inputCost + outputCost;
},
getReport: function() {
var report = {};
var totalCost = 0;
var keys = Object.keys(this.usage);
keys.forEach(function(key) {
var entry = this.usage[key];
report[key] = {
calls: entry.calls,
inputTokens: entry.input,
outputTokens: entry.output,
cost: "$" + entry.cost.toFixed(4)
};
totalCost += entry.cost;
}.bind(this));
report._totalCost = "$" + totalCost.toFixed(4);
return report;
}
};
Per-User Budgets
Enforce spending limits per user to prevent runaway costs from automated clients or abuse.
var userBudgets = {};
function checkBudget(userId, estimatedCost, dailyLimit) {
var today = new Date().toISOString().split("T")[0];
var key = userId + ":" + today;
if (!userBudgets[key]) {
userBudgets[key] = 0;
}
if (userBudgets[key] + estimatedCost > dailyLimit) {
return {
allowed: false,
spent: userBudgets[key],
limit: dailyLimit,
remaining: dailyLimit - userBudgets[key]
};
}
userBudgets[key] += estimatedCost;
return { allowed: true, spent: userBudgets[key], limit: dailyLimit };
}
Measuring and Monitoring in Production
Every LLM API call should be instrumented. Log the model, token counts, latency, cache status, and cost. Feed this into your existing monitoring stack.
function instrumentedLLMCall(feature, options) {
var startTime = Date.now();
return client.messages.create(options).then(function(response) {
var latency = Date.now() - startTime;
var inputTokens = response.usage.input_tokens;
var outputTokens = response.usage.output_tokens;
// Log structured data for your monitoring system
var logEntry = {
timestamp: new Date().toISOString(),
feature: feature,
model: options.model,
inputTokens: inputTokens,
outputTokens: outputTokens,
latencyMs: latency,
cost: tokenTracker.calculateCost(options.model, inputTokens, outputTokens),
cacheHit: false
};
console.log(JSON.stringify(logEntry));
tokenTracker.record(feature, options.model, inputTokens, outputTokens);
return response;
});
}
Sample log output:
{
"timestamp": "2026-02-11T14:32:01.123Z",
"feature": "sentiment-analysis",
"model": "claude-haiku-4-20250414",
"inputTokens": 342,
"outputTokens": 12,
"latencyMs": 487,
"cost": 0.0000855,
"cacheHit": false
}
Cost Comparison Across Providers
Here is a realistic comparison for a workload of 100,000 classification requests per day, each averaging 500 input tokens and 20 output tokens.
Provider / Model            Input Cost     Output Cost    Daily Total
----------------------------------------------------------------------
Anthropic Claude Haiku      $12.50         $2.50          $15.00
Anthropic Claude Sonnet     $150.00        $30.00         $180.00
OpenAI GPT-4o-mini          $7.50          $1.20          $8.70
OpenAI GPT-4o               $125.00        $20.00         $145.00
Local Llama 3 (8B, GPU)     $0 (compute)   $0 (compute)   ~$8.00*
*Local model cost assumes a single A10G GPU instance at approximately $0.33/hour.
For high-volume classification, a local model or GPT-4o-mini wins on cost. But factor in the engineering time to deploy and maintain local infrastructure. For most teams under 500,000 calls per day, a hosted API with Haiku or GPT-4o-mini is the pragmatic choice.
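The break-even point between hosted and local is worth computing for your own workload. A sketch under the assumptions above (the A10G price and the per-call hosted cost are illustrative, not quotes):

```javascript
// Break-even call volume: fixed daily GPU cost vs per-call hosted cost.
// Both inputs are assumptions; plug in your own measured figures.
function breakEvenCalls(gpuDailyCost, hostedCostPerCall) {
  return Math.ceil(gpuDailyCost / hostedCostPerCall);
}

// A10G at ~$0.33/hr = ~$7.92/day; GPT-4o-mini at $0.000087/call
// (500 input + 20 output tokens at $0.15/$0.60 per million)
console.log(breakEvenCalls(7.92, 0.000087)); // 91035 calls/day
```

Below that volume the GPU sits partly idle and the hosted API is cheaper, before even counting the engineering time to run local inference.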
Complete Working Example: Token Optimization Middleware
Here is a full Express middleware that wraps LLM calls with token counting, budget enforcement, response caching, and cost tracking.
var express = require("express");
var crypto = require("crypto");
var redis = require("redis");
var Anthropic = require("@anthropic-ai/sdk");
var { encoding_for_model } = require("tiktoken");
var app = express();
app.use(express.json());
var anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
var redisClient = redis.createClient({ url: process.env.REDIS_URL || "redis://localhost:6379" });
redisClient.connect();
// ---- Token Counting ----
function estimateTokens(text) {
var enc = encoding_for_model("gpt-4");
var count = enc.encode(text).length;
enc.free();
return count;
}
// ---- Cost Rates (per million tokens) ----
var RATES = {
"claude-haiku-4-20250414": { input: 0.25, output: 1.25 },
"claude-sonnet-4-20250514": { input: 3.0, output: 15.0 }
};
function computeCost(model, inputTokens, outputTokens) {
var rate = RATES[model] || RATES["claude-haiku-4-20250414"];
return ((inputTokens / 1e6) * rate.input) + ((outputTokens / 1e6) * rate.output);
}
// ---- Usage Tracker ----
var usageByRoute = {};
var usageByUser = {};
function trackUsage(route, userId, model, inputTokens, outputTokens, cached) {
var cost = cached ? 0 : computeCost(model, inputTokens, outputTokens);
var today = new Date().toISOString().split("T")[0];
// Track by route
if (!usageByRoute[route]) {
usageByRoute[route] = { calls: 0, tokens: 0, cost: 0, cacheHits: 0 };
}
usageByRoute[route].calls += 1;
usageByRoute[route].tokens += inputTokens + outputTokens;
usageByRoute[route].cost += cost;
if (cached) usageByRoute[route].cacheHits += 1;
// Track by user per day
var userKey = userId + ":" + today;
if (!usageByUser[userKey]) {
usageByUser[userKey] = { calls: 0, cost: 0 };
}
usageByUser[userKey].calls += 1;
usageByUser[userKey].cost += cost;
}
// ---- Budget Enforcement Middleware ----
function enforceBudget(dailyLimitPerUser) {
return function(req, res, next) {
var userId = req.headers["x-user-id"] || "anonymous";
var today = new Date().toISOString().split("T")[0];
var userKey = userId + ":" + today;
var current = usageByUser[userKey] ? usageByUser[userKey].cost : 0;
if (current >= dailyLimitPerUser) {
return res.status(429).json({
error: "Daily token budget exceeded",
spent: "$" + current.toFixed(4),
limit: "$" + dailyLimitPerUser.toFixed(4)
});
}
req.userId = userId;
req.budgetRemaining = dailyLimitPerUser - current;
next();
};
}
// ---- Cached LLM Call ----
function cachedCompletion(options, cacheTTL) {
var cacheKey = "llm:" + crypto.createHash("sha256")
.update(JSON.stringify(options))
.digest("hex");
return redisClient.get(cacheKey).then(function(cached) {
if (cached) {
var parsed = JSON.parse(cached);
parsed._cached = true;
return parsed;
}
return anthropic.messages.create(options).then(function(response) {
var toCache = {
content: response.content,
usage: response.usage,
model: response.model,
_cached: false
};
redisClient.setEx(cacheKey, cacheTTL || 3600, JSON.stringify(toCache));
return toCache;
});
});
}
// ---- API Routes ----
// Sentiment classification - uses Haiku, aggressive caching
app.post("/api/classify", enforceBudget(1.00), function(req, res) {
var text = req.body.text;
if (!text) return res.status(400).json({ error: "text is required" });
var inputEstimate = estimateTokens(text) + 20;
console.log("Estimated input tokens: " + inputEstimate);
var options = {
model: "claude-haiku-4-20250414",
max_tokens: 20,
messages: [{
role: "user",
content: "Classify sentiment as positive, negative, or neutral. " +
"Reply with one word only.\n\n" + text
}]
};
cachedCompletion(options, 7200).then(function(response) {
var inputTokens = response.usage.input_tokens;
var outputTokens = response.usage.output_tokens;
trackUsage("/api/classify", req.userId, options.model,
inputTokens, outputTokens, response._cached);
res.json({
sentiment: response.content[0].text.trim().toLowerCase(),
tokens: { input: inputTokens, output: outputTokens },
cost: "$" + computeCost(options.model, inputTokens, outputTokens).toFixed(6),
cached: response._cached
});
}).catch(function(err) {
res.status(500).json({ error: err.message });
});
});
// Summarization - uses Sonnet for quality, moderate caching
app.post("/api/summarize", enforceBudget(1.00), function(req, res) {
var text = req.body.text;
var maxLength = req.body.maxLength || 150;
if (!text) return res.status(400).json({ error: "text is required" });
var options = {
model: "claude-sonnet-4-20250514",
max_tokens: maxLength,
messages: [{
role: "user",
content: "Summarize in 2-3 sentences:\n\n" + text
}]
};
cachedCompletion(options, 1800).then(function(response) {
var inputTokens = response.usage.input_tokens;
var outputTokens = response.usage.output_tokens;
trackUsage("/api/summarize", req.userId, options.model,
inputTokens, outputTokens, response._cached);
res.json({
summary: response.content[0].text,
tokens: { input: inputTokens, output: outputTokens },
cost: "$" + computeCost(options.model, inputTokens, outputTokens).toFixed(6),
cached: response._cached
});
}).catch(function(err) {
res.status(500).json({ error: err.message });
});
});
// Usage dashboard
app.get("/api/usage", function(req, res) {
res.json({
byRoute: usageByRoute,
timestamp: new Date().toISOString()
});
});
app.listen(3000, function() {
console.log("Token-optimized API server running on port 3000");
});
Sample usage dashboard output after running for a while:
{
"byRoute": {
"/api/classify": {
"calls": 1247,
"tokens": 28914,
"cost": 0.0089,
"cacheHits": 843
},
"/api/summarize": {
"calls": 312,
"tokens": 156230,
"cost": 2.8741,
"cacheHits": 67
}
},
"timestamp": "2026-02-11T18:45:00.000Z"
}
Notice how classification costs almost nothing ($0.009 for 1,247 calls) while summarization is significantly more expensive due to the larger model and longer responses. This data drives decisions about where to optimize further.
Common Issues and Troubleshooting
1. Token Count Mismatch Between Estimation and Billing
Error: Estimated 450 tokens but billed for 523 tokens
This happens because tiktoken uses OpenAI's tokenizer, not Anthropic's. Claude's tokenizer produces different token counts. Use tiktoken only for rough estimation and budget checking. Always use response.usage.input_tokens from the API response for accurate cost tracking, and expect a variance of up to 15-20% between estimated and actual counts.
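Given that variance, pad local estimates before budget checks rather than trusting them exactly. A small sketch; the 20% default margin is an assumption you should calibrate against your own observed estimate-vs-billed gap:

```javascript
// Pad a local token estimate to absorb tokenizer differences.
// The default 1.2 factor is an assumption; calibrate it by comparing
// your estimates against response.usage.input_tokens on real traffic.
function paddedEstimate(estimatedTokens, margin) {
  return Math.ceil(estimatedTokens * (margin || 1.2));
}

console.log(paddedEstimate(450)); // 540, which covers the 523 actually billed
```

Padding errs on the side of rejecting a request slightly early, which is almost always cheaper than blowing past a budget.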
2. Redis Cache Serialization Failures
TypeError: Converting circular structure to JSON
The raw API response objects from the Anthropic SDK may contain circular references or non-serializable properties. Always extract only the fields you need before caching, as shown in the cachedCompletion function above. Never try to JSON.stringify the raw response object directly.
3. Budget Race Conditions Under Load
Budget exceeded: spent $1.47 against limit $1.00
The in-memory budget tracker shown above is not atomic. Under concurrent load, multiple requests can pass the budget check simultaneously. For production, use Redis INCRBYFLOAT with a Lua script for atomic budget checking:
var BUDGET_SCRIPT = [
"local current = tonumber(redis.call('GET', KEYS[1]) or '0')",
"local increment = tonumber(ARGV[1])",
"local limit = tonumber(ARGV[2])",
"if current + increment > limit then",
" return -1",
"end",
"redis.call('INCRBYFLOAT', KEYS[1], increment)",
"redis.call('EXPIRE', KEYS[1], 86400)",
"return current + increment"
].join("\n");
function atomicBudgetCheck(userId, cost, limit) {
var today = new Date().toISOString().split("T")[0];
var key = "budget:" + userId + ":" + today;
return redisClient.eval(BUDGET_SCRIPT, {
keys: [key],
arguments: [cost.toString(), limit.toString()]
}).then(function(result) {
return result >= 0;
});
}
4. Prompt Cache Misses Due to Trailing Whitespace
cache_read_input_tokens: 0 (expected cache hit)
Anthropic's prompt caching is exact-match on the cached content. A single extra space, newline, or different whitespace character in your system prompt will cause a cache miss. Always normalize your prompts before sending:
function normalizePrompt(text) {
return text.replace(/\s+/g, " ").trim();
}
5. Context Window Overflow
Error: prompt is too long: 204,117 tokens > 200,000 token limit
This happens when conversation history grows unchecked. Implement the sliding window or summarization strategy described earlier. Always check total token count before sending and truncate proactively rather than hitting the API limit.
Best Practices
Always set max_tokens explicitly. Never rely on the default. For classification tasks, 10-20 tokens is plenty. For summaries, cap at 200-300. Unbounded output tokens are the most common source of wasted spend.
Route tasks to the cheapest capable model. Start with Haiku for every new feature. Only upgrade to Sonnet or Opus when you can demonstrate that Haiku's quality is insufficient for your specific use case. Track accuracy by model to justify the upgrade.
Cache aggressively with content-aware TTLs. Classification of static content can be cached for days. Summaries of changing data should have shorter TTLs. A 67% cache hit rate (achievable for most applications) cuts your effective cost by two-thirds.
Compress prompts systematically. Audit your system prompts quarterly. Most grow over time as developers add "just one more instruction." Strip them back to the minimum that produces correct output. Run A/B tests to verify that compressed prompts maintain quality.
Track costs per feature, not just in aggregate. Knowing your total monthly LLM spend is not enough. You need to know that sentiment analysis costs $12/month while document summarization costs $800/month. This drives architectural decisions about where to invest in optimization.
Use batch APIs for non-urgent workloads. Any processing that can tolerate a 24-hour delay should go through the batch API for the 50% discount. Nightly content classification, weekly report generation, and bulk data enrichment are all candidates.
Implement circuit breakers for cost spikes. Set daily and hourly spend limits at the application level. If you hit 80% of your daily budget by noon, something is wrong. Alert on anomalies rather than waiting for the end-of-month bill.
Preprocess input data before sending to the LLM. Strip HTML, remove boilerplate, deduplicate content, and truncate to the relevant sections. Every token you remove from the input is a token you do not pay for.
Monitor token-per-request distributions. Track p50, p95, and p99 token counts per endpoint. A sudden increase in the p95 often indicates that a new type of input is hitting your system that your prompts were not optimized for.
References
- Anthropic API Pricing - Current token pricing for Claude models
- Anthropic Prompt Caching Documentation - Official guide to prompt caching
- OpenAI Tokenizer Tool - Visual token counting tool
- tiktoken on npm - Token counting library for Node.js
- Anthropic Batches API - Batch processing for volume discounts
- Redis Documentation - Caching layer used in examples