Caching LLM Responses for Cost Reduction

Multi-layer caching strategies for LLM responses including exact match, semantic caching, Redis, and cost tracking in Node.js.

Every call to a large language model costs money. If your application sends the same prompt twice and pays for it twice, you are burning budget for no reason. A well-designed caching layer can cut your LLM API spend by 40-80% while simultaneously reducing latency from seconds to milliseconds for cached responses.

Prerequisites

  • Node.js v18 or later installed
  • Basic familiarity with Express.js and REST APIs
  • An OpenAI or Anthropic API key for testing
  • Redis installed locally or accessible remotely (for distributed caching sections)
  • npm packages: openai, @anthropic-ai/sdk, ioredis, lru-cache (crypto is a Node.js built-in, not an npm package)

Why Caching Matters for LLM Costs

LLM API pricing is based on token consumption. Every request burns input tokens and output tokens, and the meter runs whether the question is novel or something you asked five minutes ago. Here are real numbers that illustrate the problem:

Model               Input Cost (per 1M tokens)   Output Cost (per 1M tokens)
GPT-4o              $2.50                        $10.00
GPT-4 Turbo         $10.00                       $30.00
Claude 3.5 Sonnet   $3.00                        $15.00
Claude 3 Opus       $15.00                       $75.00

Consider a customer support chatbot handling 50,000 queries per day. If 30% of those queries are repeated or semantically identical questions ("What are your business hours?", "When do you open?", "What time do you close?"), you are paying for 15,000 unnecessary API calls daily. At an average of 500 input and 500 output tokens per request using GPT-4o, that is roughly 15 million tokens per day wasted — about $94 per day or $2,800 per month thrown away.
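The arithmetic above can be sketched in a few lines. The 500/500 input/output split and the 30% repeat rate are illustrative assumptions; the prices come from the table:

```javascript
// GPT-4o prices from the table above, expressed per token
var GPT4O_INPUT = 2.50 / 1e6;
var GPT4O_OUTPUT = 10.00 / 1e6;

// Daily spend on repeated queries that a cache would have absorbed.
// Assumes each request averages `inputTokens` in and `outputTokens` out.
function dailyWaste(queriesPerDay, repeatRate, inputTokens, outputTokens) {
  var wastedCalls = queriesPerDay * repeatRate;
  var costPerCall = inputTokens * GPT4O_INPUT + outputTokens * GPT4O_OUTPUT;
  return wastedCalls * costPerCall;
}

dailyWaste(50000, 0.3, 500, 500); // ≈ $93.75 per day, ~$2,800 per month
```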

Caching eliminates that waste. But LLM caching is more nuanced than traditional HTTP caching because prompts are natural language, responses can be non-deterministic, and cache key strategies need to account for model parameters.

Exact-Match Caching with Normalized Prompt Keys

The simplest form of LLM caching is exact-match: hash the prompt, store the response, return the cached response if the same hash appears again. The critical detail is normalization. Two prompts that differ only by trailing whitespace or capitalization should hit the same cache entry.

var crypto = require("crypto");

function normalizePrompt(prompt) {
  return prompt
    .trim()
    .replace(/\s+/g, " ")
    .toLowerCase();
}

function generateCacheKey(prompt, model, temperature) {
  var normalized = normalizePrompt(prompt);
  var keyData = JSON.stringify({
    prompt: normalized,
    model: model,
    temperature: temperature
  });
  return crypto.createHash("sha256").update(keyData).digest("hex");
}

Notice that the cache key includes the model name and temperature. This is essential. The same prompt sent to GPT-4o and Claude 3.5 Sonnet will produce different responses, and a prompt with temperature 0 will produce different output than one with temperature 0.8. Your cache key must encode every parameter that affects the response.

Semantic Caching with Embeddings

Exact-match caching misses a massive opportunity: semantically identical prompts phrased differently. "How do I reset my password?" and "I forgot my password, how do I change it?" should return the same cached answer. Semantic caching uses embedding vectors to find similar prompts within a configurable similarity threshold.

var OpenAI = require("openai");
var crypto = require("crypto");

var openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

function SemanticCache(options) {
  this.similarityThreshold = options.similarityThreshold || 0.95;
  this.entries = [];
  this.maxEntries = options.maxEntries || 10000;
}

SemanticCache.prototype.getEmbedding = function(text) {
  return openai.embeddings.create({
    model: "text-embedding-3-small",
    input: normalizePrompt(text)
  }).then(function(response) {
    return response.data[0].embedding;
  });
};

SemanticCache.prototype.cosineSimilarity = function(vecA, vecB) {
  var dotProduct = 0;
  var normA = 0;
  var normB = 0;
  for (var i = 0; i < vecA.length; i++) {
    dotProduct += vecA[i] * vecB[i];
    normA += vecA[i] * vecA[i];
    normB += vecB[i] * vecB[i];
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
};

SemanticCache.prototype.get = function(prompt) {
  var self = this;
  return this.getEmbedding(prompt).then(function(queryEmbedding) {
    var bestMatch = null;
    var bestScore = 0;

    for (var i = 0; i < self.entries.length; i++) {
      var score = self.cosineSimilarity(queryEmbedding, self.entries[i].embedding);
      if (score > bestScore) {
        bestScore = score;
        bestMatch = self.entries[i];
      }
    }

    if (bestMatch && bestScore >= self.similarityThreshold) {
      return { hit: true, response: bestMatch.response, similarity: bestScore };
    }
    return { hit: false, embedding: queryEmbedding };
  });
};

SemanticCache.prototype.set = function(prompt, embedding, response) {
  if (this.entries.length >= this.maxEntries) {
    this.entries.shift();
  }
  this.entries.push({
    prompt: prompt,
    embedding: embedding,
    response: response,
    timestamp: Date.now()
  });
};

A word of caution: the embedding call itself costs money (about $0.02 per million tokens for text-embedding-3-small), but it is orders of magnitude cheaper than the LLM completion call. If your average completion costs $0.01, and the embedding lookup costs $0.00002, the economics are overwhelmingly in favor of the embedding check. The similarity threshold is the knob you tune — 0.95 is conservative and safe, 0.90 catches more duplicates but risks returning irrelevant cached responses.
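A back-of-envelope sketch of those economics, using the per-request figures quoted above (both costs are illustrative assumptions, not measured values):

```javascript
var COMPLETION_COST = 0.01;    // assumed average cost of one completion
var EMBEDDING_COST = 0.00002;  // assumed cost of one embedding lookup

// Expected savings per request: every request pays for the embedding,
// but a fraction `hitRate` of them avoid the completion entirely
function expectedSavingsPerRequest(hitRate) {
  return hitRate * COMPLETION_COST - EMBEDDING_COST;
}

expectedSavingsPerRequest(0.3); // ≈ $0.00298 saved per request

// Break-even: the embedding check pays for itself once the hit rate
// exceeds EMBEDDING_COST / COMPLETION_COST = 0.002, i.e. just 0.2%
```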

In-Memory Caching with LRU Eviction

For single-process applications or development environments, an in-memory LRU (Least Recently Used) cache is the fastest option. The lru-cache package handles eviction automatically.

// lru-cache v10+ exposes the class as a named export
var LRUCache = require("lru-cache").LRUCache;

var memoryCache = new LRUCache({
  max: 5000,
  ttl: 1000 * 60 * 60,
  updateAgeOnGet: true,
  sizeCalculation: function(value) {
    return JSON.stringify(value).length;
  },
  maxSize: 50 * 1024 * 1024
});

function getFromMemoryCache(key) {
  var cached = memoryCache.get(key);
  if (cached) {
    return { hit: true, response: cached.response, source: "memory" };
  }
  return { hit: false };
}

function setInMemoryCache(key, response, metadata) {
  memoryCache.set(key, {
    response: response,
    metadata: metadata,
    cachedAt: Date.now()
  });
}

The sizeCalculation function is important. LLM responses vary wildly in size — a code generation response might be 5KB while a summarization response is 200 bytes. Without size-aware eviction, a few large responses could push out thousands of small ones.

Redis Caching for Distributed Systems

In production, you likely have multiple Node.js processes or containers. In-memory caches are not shared across processes, so you need a distributed cache. Redis is the standard choice.

var Redis = require("ioredis");

var redis = new Redis({
  host: process.env.REDIS_HOST || "127.0.0.1",
  port: process.env.REDIS_PORT || 6379,
  password: process.env.REDIS_PASSWORD || undefined,
  maxRetriesPerRequest: 3,
  retryStrategy: function(times) {
    var delay = Math.min(times * 50, 2000);
    return delay;
  }
});

function RedisLLMCache(redisClient, prefix) {
  this.redis = redisClient;
  this.prefix = prefix || "llm:cache:";
}

RedisLLMCache.prototype.get = function(cacheKey) {
  var self = this;
  var fullKey = this.prefix + cacheKey;
  return this.redis.get(fullKey).then(function(data) {
    if (data) {
      var parsed = JSON.parse(data);
      self.redis.hincrby(self.prefix + "stats", "hits", 1);
      return { hit: true, response: parsed.response, metadata: parsed.metadata };
    }
    self.redis.hincrby(self.prefix + "stats", "misses", 1);
    return { hit: false };
  });
};

RedisLLMCache.prototype.set = function(cacheKey, response, metadata, ttlSeconds) {
  var fullKey = this.prefix + cacheKey;
  var data = JSON.stringify({
    response: response,
    metadata: metadata,
    cachedAt: Date.now()
  });
  if (ttlSeconds) {
    return this.redis.setex(fullKey, ttlSeconds, data);
  }
  return this.redis.set(fullKey, data);
};

RedisLLMCache.prototype.getStats = function() {
  return this.redis.hgetall(this.prefix + "stats");
};

Redis gives you TTL expiration for free, persistence across restarts, and shared state across your entire fleet. The tradeoff is network latency — a Redis lookup takes 1-5ms compared to microseconds for in-memory, but that is negligible compared to the 500ms-3s you save by avoiding the LLM API call.

Cache Key Strategies

Building the right cache key is the most important design decision. Here is a comprehensive key builder that accounts for all the variables that affect LLM output:

function buildCacheKey(options) {
  var keyComponents = {
    prompt: normalizePrompt(options.prompt),
    systemPrompt: options.systemPrompt ? normalizePrompt(options.systemPrompt) : "",
    model: options.model,
    temperature: options.temperature || 0,
    maxTokens: options.maxTokens || "default",
    topP: options.topP || 1,
    frequencyPenalty: options.frequencyPenalty || 0,
    presencePenalty: options.presencePenalty || 0
  };

  var keyString = JSON.stringify(keyComponents);
  return crypto.createHash("sha256").update(keyString).digest("hex");
}

The system prompt must be part of the key. If you use different system prompts for different users or contexts, the same user prompt will produce different responses. Temperature is particularly important — at temperature 0, the response is nearly deterministic and caching is safe. At temperature 0.8+, responses vary significantly, and caching a single response may not represent the range of outputs the model would produce. I recommend only caching responses for requests with temperature 0 or very low temperature values.
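That recommendation can be encoded as a small policy helper. The 0.2 cutoff here is a judgment call for illustration, not a provider recommendation:

```javascript
// Hypothetical caching policy: only cache near-deterministic requests
function shouldCacheRequest(options) {
  var temperature = options.temperature || 0;
  return temperature <= 0.2;
}

shouldCacheRequest({ temperature: 0 });   // true  — safe to cache
shouldCacheRequest({ temperature: 0.8 }); // false — output varies, skip cache
```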

TTL Strategies for Different Content Types

Not all cached responses should live forever. Different types of content need different expiration policies:

var TTL_STRATEGIES = {
  factual: 7 * 24 * 60 * 60,       // 7 days — "What is the capital of France?"
  code_generation: 24 * 60 * 60,    // 24 hours — code patterns change slowly
  summarization: 12 * 60 * 60,      // 12 hours — summaries of static content
  current_events: 30 * 60,          // 30 minutes — time-sensitive information
  personalized: 60 * 60,            // 1 hour — user-specific responses
  creative: 0                       // no cache — creative writing should be unique
};

function determineTTL(prompt, options) {
  if (options.temperature > 0.5) {
    return TTL_STRATEGIES.creative;
  }

  var promptLower = prompt.toLowerCase();

  if (promptLower.indexOf("summarize") !== -1 || promptLower.indexOf("summary") !== -1) {
    return TTL_STRATEGIES.summarization;
  }
  if (promptLower.indexOf("code") !== -1 || promptLower.indexOf("function") !== -1) {
    return TTL_STRATEGIES.code_generation;
  }
  if (promptLower.indexOf("today") !== -1 || promptLower.indexOf("current") !== -1 || promptLower.indexOf("latest") !== -1) {
    return TTL_STRATEGIES.current_events;
  }

  return TTL_STRATEGIES.factual;
}

In practice, you will want to classify prompts more robustly — perhaps by tagging them at the application layer rather than parsing the prompt text. But the principle holds: cache aggressively for factual content, conservatively for time-sensitive content, and not at all for creative generation.
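A sketch of that application-layer approach: callers pass an explicit category instead of the cache guessing from prompt text. The category names and the one-hour fallback are assumptions for illustration:

```javascript
// Hypothetical category → TTL map, maintained by the application
var TTL_BY_CATEGORY = {
  faq: 7 * 24 * 60 * 60,        // stable factual content
  summarization: 12 * 60 * 60,
  news: 30 * 60                  // time-sensitive
};

function ttlForCategory(category) {
  // Unknown categories fall back to a conservative one-hour TTL
  return TTL_BY_CATEGORY[category] !== undefined ? TTL_BY_CATEGORY[category] : 3600;
}
```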

Cache Warming for Predictable Queries

If you know which queries your users will ask most frequently, you can pre-populate the cache during off-peak hours. This eliminates cold-start latency and ensures that your highest-traffic prompts never miss.

function warmCache(llmCache, llmClient, commonPrompts) {
  var results = { warmed: 0, skipped: 0, errors: 0 };

  return commonPrompts.reduce(function(chain, promptConfig) {
    return chain.then(function() {
      var cacheKey = buildCacheKey(promptConfig);
      return llmCache.get(cacheKey).then(function(cached) {
        if (cached.hit) {
          results.skipped++;
          return;
        }
        return llmClient.chat.completions.create({
          model: promptConfig.model,
          messages: [
            { role: "system", content: promptConfig.systemPrompt },
            { role: "user", content: promptConfig.prompt }
          ],
          temperature: 0
        }).then(function(response) {
          var responseText = response.choices[0].message.content;
          return llmCache.set(cacheKey, responseText, {
            tokens: response.usage.total_tokens,
            warmed: true
          }, promptConfig.ttl);
        }).then(function() {
          results.warmed++;
        });
      }).catch(function(err) {
        console.error("Cache warming error:", err.message);
        results.errors++;
      });
    });
  }, Promise.resolve()).then(function() {
    return results;
  });
}

// Usage
var commonQueries = [
  { prompt: "What are your business hours?", systemPrompt: "You are a helpful assistant.", model: "gpt-4o", ttl: 86400 },
  { prompt: "How do I reset my password?", systemPrompt: "You are a helpful assistant.", model: "gpt-4o", ttl: 86400 },
  { prompt: "What is your refund policy?", systemPrompt: "You are a helpful assistant.", model: "gpt-4o", ttl: 86400 }
];

warmCache(redisCache, openai, commonQueries).then(function(results) {
  console.log("Cache warming complete:", results);
});

Schedule cache warming as a cron job that runs before your peak traffic window. If peak traffic is 9 AM, warm the cache at 8:30 AM.
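If you do not want a cron dependency, a plain setTimeout works for a single process. msUntil is a hypothetical helper that computes the delay until the next occurrence of a local time:

```javascript
// Milliseconds until the next occurrence of hour:minute local time
function msUntil(hour, minute) {
  var now = new Date();
  var next = new Date(now);
  next.setHours(hour, minute, 0, 0);
  if (next <= now) next.setDate(next.getDate() + 1); // already passed today
  return next.getTime() - now.getTime();
}

// Warm the cache at 8:30 AM ahead of a 9 AM peak:
// setTimeout(function() { warmCache(redisCache, openai, commonQueries); }, msUntil(8, 30));
```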

Anthropic Prompt Caching for System Prompt Reuse

Anthropic offers a provider-side caching feature called prompt caching. This is distinct from application-level caching — it caches the processing of long system prompts on Anthropic's infrastructure, reducing both cost and time-to-first-token. If your system prompt is large (RAG context, detailed instructions, few-shot examples), this can cut input token costs by up to 90% on cache hits.

var Anthropic = require("@anthropic-ai/sdk");

var anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

function callWithPromptCaching(systemPrompt, userMessage) {
  return anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: systemPrompt,
        cache_control: { type: "ephemeral" }
      }
    ],
    messages: [
      { role: "user", content: userMessage }
    ]
  }).then(function(response) {
    console.log("Cache creation input tokens:", response.usage.cache_creation_input_tokens);
    console.log("Cache read input tokens:", response.usage.cache_read_input_tokens);
    return response;
  });
}

The cache_control: { type: "ephemeral" } directive tells Anthropic to cache the system prompt prefix. The first request pays full price plus a small cache write premium. Subsequent requests within the TTL (typically 5 minutes) get a 90% discount on those cached input tokens. This is complementary to application-level caching — you should use both. Anthropic prompt caching reduces cost per API call, while application-level caching eliminates API calls entirely.
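The break-even works out in caching's favor almost immediately. This sketch assumes the multipliers from Anthropic's published pricing for the 5-minute ephemeral cache (write ≈ 1.25× the base input price, read ≈ 0.1×) — verify against the current pricing docs:

```javascript
// Cost of N reuses of a cached system prompt, in units of its base
// input cost (multipliers are assumptions from Anthropic's pricing)
function withPromptCache(reuses) {
  return 1.25 + 0.1 * reuses; // one cache write, then discounted reads
}
function withoutPromptCache(reuses) {
  return 1 + reuses; // full price every time
}

// A single reuse within the TTL already beats the write premium:
// withPromptCache(1) = 1.35 vs withoutPromptCache(1) = 2
```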

Cache Invalidation Strategies

Cache invalidation is one of the two hard problems in computer science. For LLM caches, you need strategies for several scenarios:

function CacheInvalidator(cache) {
  this.cache = cache;
}

// Invalidate a specific prompt
CacheInvalidator.prototype.invalidatePrompt = function(prompt, model, temperature) {
  var cacheKey = buildCacheKey({ prompt: prompt, model: model, temperature: temperature });
  return this.cache.redis.del(this.cache.prefix + cacheKey);
};

// Invalidate all entries matching a pattern (e.g., model upgrade).
// Note: KEYS is O(N) and blocks Redis while scanning; prefer SCAN
// (ioredis scanStream) for large keyspaces in production.
CacheInvalidator.prototype.invalidateByPattern = function(pattern) {
  var self = this;
  var fullPattern = this.cache.prefix + pattern;
  return this.cache.redis.keys(fullPattern).then(function(keys) {
    if (keys.length === 0) return 0;
    return self.cache.redis.del(keys);
  });
};

// Invalidate when underlying data changes
CacheInvalidator.prototype.invalidateByTag = function(tag) {
  var tagKey = this.cache.prefix + "tag:" + tag;
  var self = this;
  return this.cache.redis.smembers(tagKey).then(function(cacheKeys) {
    if (cacheKeys.length === 0) return 0;
    var pipeline = self.cache.redis.pipeline();
    cacheKeys.forEach(function(key) {
      pipeline.del(key);
    });
    pipeline.del(tagKey);
    return pipeline.exec().then(function() {
      return cacheKeys.length;
    });
  });
};

// Tag a cache entry for later group invalidation
CacheInvalidator.prototype.tagEntry = function(cacheKey, tags) {
  var self = this;
  var promises = tags.map(function(tag) {
    var tagKey = self.cache.prefix + "tag:" + tag;
    return self.cache.redis.sadd(tagKey, self.cache.prefix + cacheKey);
  });
  return Promise.all(promises);
};

Common invalidation triggers: model version upgrades (your GPT-4o responses are stale when you switch to a newer snapshot), changes to system prompts, updates to the underlying knowledge base used in RAG prompts, and manual corrections when a cached response is found to be wrong.

Measuring Cache Hit Rates and Cost Savings

You cannot optimize what you do not measure. Build cost tracking directly into your cache layer:

function CostTracker(redisClient, prefix) {
  this.redis = redisClient;
  this.prefix = prefix || "llm:costs:";
}

CostTracker.prototype.recordRequest = function(options) {
  var day = new Date().toISOString().split("T")[0];
  var dayKey = this.prefix + day;
  var pipeline = this.redis.pipeline();

  if (options.cached) {
    pipeline.hincrby(dayKey, "cache_hits", 1);
    pipeline.hincrbyfloat(dayKey, "cost_saved", options.estimatedCost);
  } else {
    pipeline.hincrby(dayKey, "cache_misses", 1);
    pipeline.hincrbyfloat(dayKey, "cost_incurred", options.actualCost);
    pipeline.hincrby(dayKey, "tokens_used", options.totalTokens);
  }

  pipeline.hincrby(dayKey, "total_requests", 1);
  pipeline.expire(dayKey, 90 * 24 * 60 * 60); // Keep 90 days of data
  return pipeline.exec();
};

CostTracker.prototype.getDailySummary = function(date) {
  var dayKey = this.prefix + date;
  return this.redis.hgetall(dayKey).then(function(data) {
    var hits = parseInt(data.cache_hits || 0);
    var misses = parseInt(data.cache_misses || 0);
    var total = hits + misses;
    return {
      date: date,
      totalRequests: total,
      cacheHits: hits,
      cacheMisses: misses,
      hitRate: total > 0 ? (hits / total * 100).toFixed(2) + "%" : "0%",
      costSaved: parseFloat(data.cost_saved || 0).toFixed(4),
      costIncurred: parseFloat(data.cost_incurred || 0).toFixed(4),
      tokensUsed: parseInt(data.tokens_used || 0)
    };
  });
};

A healthy LLM cache should show a hit rate of 30-60% in most applications. If you are below 20%, your cache key strategy may be too specific. If you are consistently above 80%, check that long TTLs are not serving stale responses — a very high hit rate often means traffic is dominated by a handful of prompts that belong in your cache-warming list.

Cache Layers: L1 Memory, L2 Redis, L3 Database

Production systems benefit from a multi-layer cache, similar to CPU cache hierarchies. Each layer trades capacity for speed:

function MultiLayerCache(options) {
  this.l1 = options.memoryCache;   // LRU in-memory: ~1ms, limited size
  this.l2 = options.redisCache;    // Redis: ~5ms, shared across processes
  this.l3 = options.dbCache;       // PostgreSQL/MongoDB: ~20ms, persistent
}

MultiLayerCache.prototype.get = function(cacheKey) {
  var self = this;

  // L1: Check memory first
  var l1Result = this.l1 ? getFromMemoryCache(cacheKey) : { hit: false };
  if (l1Result.hit) {
    return Promise.resolve({ hit: true, response: l1Result.response, source: "l1_memory" });
  }

  // L2: Check Redis
  return this.l2.get(cacheKey).then(function(l2Result) {
    if (l2Result.hit) {
      // Backfill L1
      if (self.l1) {
        setInMemoryCache(cacheKey, l2Result.response, l2Result.metadata);
      }
      return { hit: true, response: l2Result.response, source: "l2_redis" };
    }

    // L3: Check database
    if (!self.l3) return { hit: false };
    return self.l3.get(cacheKey).then(function(l3Result) {
      if (l3Result.hit) {
        // Backfill L1 and L2
        if (self.l1) {
          setInMemoryCache(cacheKey, l3Result.response, l3Result.metadata);
        }
        self.l2.set(cacheKey, l3Result.response, l3Result.metadata, 3600);
        return { hit: true, response: l3Result.response, source: "l3_database" };
      }
      return { hit: false };
    });
  });
};

MultiLayerCache.prototype.set = function(cacheKey, response, metadata, ttl) {
  // Write to all layers
  if (this.l1) {
    setInMemoryCache(cacheKey, response, metadata);
  }
  var promises = [this.l2.set(cacheKey, response, metadata, ttl)];
  if (this.l3) {
    promises.push(this.l3.set(cacheKey, response, metadata));
  }
  return Promise.all(promises);
};

The backfill pattern is key: when L2 hits, populate L1. When L3 hits, populate both L1 and L2. This ensures that frequently accessed entries migrate to faster layers automatically. L3 (database) is optional but valuable for responses you want to survive Redis restarts and provide long-term analytics.

Handling Non-Deterministic Responses in Cache

When temperature is above 0, LLMs produce varied responses. Caching a single response and returning it every time defeats the purpose of non-deterministic generation. There are two approaches:

Approach 1: Cache multiple responses and return randomly

function NonDeterministicCache(redisClient, prefix) {
  this.redis = redisClient;
  this.prefix = prefix || "llm:nd:";
  this.maxVariants = 5;
}

NonDeterministicCache.prototype.get = function(cacheKey) {
  var listKey = this.prefix + cacheKey;
  var self = this;
  return this.redis.llen(listKey).then(function(length) {
    if (length === 0) return { hit: false };
    var randomIndex = Math.floor(Math.random() * length);
    return self.redis.lindex(listKey, randomIndex).then(function(data) {
      return { hit: true, response: JSON.parse(data) };
    });
  });
};

NonDeterministicCache.prototype.set = function(cacheKey, response, ttl) {
  var listKey = this.prefix + cacheKey;
  var self = this;
  return this.redis.lpush(listKey, JSON.stringify(response)).then(function() {
    return self.redis.ltrim(listKey, 0, self.maxVariants - 1);
  }).then(function() {
    if (ttl) return self.redis.expire(listKey, ttl);
  });
};

Approach 2: Only cache deterministic requests

The simpler and often better approach is to only cache requests where temperature is 0. For most production use cases — classification, extraction, summarization, code generation with specific requirements — temperature 0 is appropriate anyway. Reserve non-deterministic generation for creative features and skip caching for those.

Complete Working Example

Here is a production-ready multi-layer LLM cache with exact match, semantic similarity, Redis backend, and cost tracking:

var express = require("express");
var crypto = require("crypto");
var Redis = require("ioredis");
var LRUCache = require("lru-cache").LRUCache; // named export in v10+
var OpenAI = require("openai");

// ---- Configuration ----
var PRICING = {
  "gpt-4o": { input: 2.50 / 1000000, output: 10.00 / 1000000 },
  "gpt-4-turbo": { input: 10.00 / 1000000, output: 30.00 / 1000000 },
  "gpt-3.5-turbo": { input: 0.50 / 1000000, output: 1.50 / 1000000 }
};

var SEMANTIC_THRESHOLD = 0.95;

// ---- Initialize clients ----
var openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
var redis = new Redis(process.env.REDIS_URL || "redis://127.0.0.1:6379");

// ---- L1: In-memory cache ----
var l1Cache = new LRUCache({
  max: 2000,
  ttl: 1000 * 60 * 30,
  updateAgeOnGet: true
});

// ---- Helper functions ----
function normalizePrompt(text) {
  return text.trim().replace(/\s+/g, " ").toLowerCase();
}

function buildKey(prompt, systemPrompt, model, temperature) {
  var payload = JSON.stringify({
    p: normalizePrompt(prompt),
    s: systemPrompt ? normalizePrompt(systemPrompt) : "",
    m: model,
    t: temperature
  });
  return crypto.createHash("sha256").update(payload).digest("hex");
}

function estimateCost(model, inputTokens, outputTokens) {
  var pricing = PRICING[model] || PRICING["gpt-4o"];
  return (inputTokens * pricing.input) + (outputTokens * pricing.output);
}

// ---- Multi-layer cache class ----
function LLMCache() {
  this.statsKey = "llm:cache:stats";
  this.semanticEntries = [];
}

LLMCache.prototype.lookupExact = function(cacheKey) {
  // L1
  var memResult = l1Cache.get(cacheKey);
  if (memResult) {
    return Promise.resolve({ hit: true, data: memResult, source: "l1" });
  }
  // L2
  return redis.get("llm:exact:" + cacheKey).then(function(raw) {
    if (raw) {
      var parsed = JSON.parse(raw);
      l1Cache.set(cacheKey, parsed);
      return { hit: true, data: parsed, source: "l2" };
    }
    return { hit: false };
  });
};

LLMCache.prototype.lookupSemantic = function(promptText) {
  var self = this;
  if (this.semanticEntries.length === 0) {
    return Promise.resolve({ hit: false });
  }
  return openai.embeddings.create({
    model: "text-embedding-3-small",
    input: normalizePrompt(promptText)
  }).then(function(embResponse) {
    var queryVec = embResponse.data[0].embedding;
    var bestScore = 0;
    var bestEntry = null;

    for (var i = 0; i < self.semanticEntries.length; i++) {
      var dot = 0, nA = 0, nB = 0;
      var entryVec = self.semanticEntries[i].embedding;
      for (var j = 0; j < queryVec.length; j++) {
        dot += queryVec[j] * entryVec[j];
        nA += queryVec[j] * queryVec[j];
        nB += entryVec[j] * entryVec[j];
      }
      var sim = dot / (Math.sqrt(nA) * Math.sqrt(nB));
      if (sim > bestScore) {
        bestScore = sim;
        bestEntry = self.semanticEntries[i];
      }
    }

    if (bestEntry && bestScore >= SEMANTIC_THRESHOLD) {
      return { hit: true, data: bestEntry.cached, source: "semantic", similarity: bestScore };
    }
    return { hit: false, embedding: queryVec };
  });
};

LLMCache.prototype.store = function(cacheKey, responseData, embedding, ttl) {
  // L1
  l1Cache.set(cacheKey, responseData);
  // L2
  var redisPromise = ttl
    ? redis.setex("llm:exact:" + cacheKey, ttl, JSON.stringify(responseData))
    : redis.set("llm:exact:" + cacheKey, JSON.stringify(responseData));

  // Semantic index
  if (embedding) {
    this.semanticEntries.push({
      embedding: embedding,
      cached: responseData,
      timestamp: Date.now()
    });
    if (this.semanticEntries.length > 5000) {
      this.semanticEntries = this.semanticEntries.slice(-4000);
    }
  }

  return redisPromise;
};

LLMCache.prototype.recordStats = function(cacheHit, costSaved, costIncurred) {
  var day = new Date().toISOString().split("T")[0];
  var pipeline = redis.pipeline();
  pipeline.hincrby("llm:stats:" + day, "total", 1);
  if (cacheHit) {
    pipeline.hincrby("llm:stats:" + day, "hits", 1);
    pipeline.hincrbyfloat("llm:stats:" + day, "saved", costSaved);
  } else {
    pipeline.hincrby("llm:stats:" + day, "misses", 1);
    pipeline.hincrbyfloat("llm:stats:" + day, "spent", costIncurred);
  }
  pipeline.expire("llm:stats:" + day, 90 * 86400);
  return pipeline.exec();
};

// ---- Main completion function ----
var cache = new LLMCache();

function cachedCompletion(options) {
  var prompt = options.prompt;
  var systemPrompt = options.systemPrompt || "You are a helpful assistant.";
  var model = options.model || "gpt-4o";
  var temperature = options.temperature || 0;
  var maxTokens = options.maxTokens || 1024;

  var cacheKey = buildKey(prompt, systemPrompt, model, temperature);
  var skipCache = temperature > 0.5;

  // Step 1: Try exact match
  return cache.lookupExact(cacheKey).then(function(exactResult) {
    if (exactResult.hit && !skipCache) {
      var savedCost = estimateCost(model, exactResult.data.inputTokens || 200, exactResult.data.outputTokens || 200);
      cache.recordStats(true, savedCost, 0);
      return {
        response: exactResult.data.response,
        cached: true,
        cacheSource: exactResult.source,
        costSaved: savedCost
      };
    }

    // Step 2: Try semantic match
    return cache.lookupSemantic(prompt).then(function(semResult) {
      if (semResult.hit && !skipCache) {
        var savedCost = estimateCost(model, semResult.data.inputTokens || 200, semResult.data.outputTokens || 200);
        cache.recordStats(true, savedCost, 0);
        return {
          response: semResult.data.response,
          cached: true,
          cacheSource: "semantic (similarity: " + semResult.similarity.toFixed(4) + ")",
          costSaved: savedCost
        };
      }

      // Step 3: Call the LLM
      return openai.chat.completions.create({
        model: model,
        messages: [
          { role: "system", content: systemPrompt },
          { role: "user", content: prompt }
        ],
        temperature: temperature,
        max_tokens: maxTokens
      }).then(function(completion) {
        var responseText = completion.choices[0].message.content;
        var inputTokens = completion.usage.prompt_tokens;
        var outputTokens = completion.usage.completion_tokens;
        var actualCost = estimateCost(model, inputTokens, outputTokens);

        var cacheData = {
          response: responseText,
          inputTokens: inputTokens,
          outputTokens: outputTokens,
          model: model,
          cachedAt: Date.now()
        };

        // Store in cache (skip for high-temperature requests)
        if (!skipCache) {
          cache.store(cacheKey, cacheData, semResult.embedding || null, 3600);
        }

        cache.recordStats(false, 0, actualCost);

        return {
          response: responseText,
          cached: false,
          cost: actualCost,
          tokens: { input: inputTokens, output: outputTokens }
        };
      });
    });
  });
}

// ---- Express API ----
var app = express();
app.use(express.json());

app.post("/api/completion", function(req, res) {
  cachedCompletion({
    prompt: req.body.prompt,
    systemPrompt: req.body.systemPrompt,
    model: req.body.model,
    temperature: req.body.temperature,
    maxTokens: req.body.maxTokens
  }).then(function(result) {
    res.json(result);
  }).catch(function(err) {
    console.error("Completion error:", err);
    res.status(500).json({ error: err.message });
  });
});

// Cost tracking dashboard endpoint
app.get("/api/cache/stats", function(req, res) {
  var days = parseInt(req.query.days) || 7;
  // Build the day keys first, then map over them: with a single
  // `var dayKey` inside a for loop, every .then callback would close
  // over the same variable and report the last date for every row
  var dayKeys = [];
  for (var i = 0; i < days; i++) {
    var date = new Date();
    date.setDate(date.getDate() - i);
    dayKeys.push(date.toISOString().split("T")[0]);
  }
  var promises = dayKeys.map(function(dayKey) {
    return redis.hgetall("llm:stats:" + dayKey).then(function(data) {
      return {
        date: dayKey,
        total: parseInt(data.total || 0),
        hits: parseInt(data.hits || 0),
        misses: parseInt(data.misses || 0),
        hitRate: data.total ? ((data.hits || 0) / data.total * 100).toFixed(1) + "%" : "0%",
        costSaved: parseFloat(data.saved || 0).toFixed(4),
        costSpent: parseFloat(data.spent || 0).toFixed(4)
      };
    });
  });
  Promise.all(promises).then(function(stats) {
    var totals = stats.reduce(function(acc, day) {
      acc.totalRequests += day.total;
      acc.totalHits += day.hits;
      acc.totalSaved += parseFloat(day.costSaved);
      acc.totalSpent += parseFloat(day.costSpent);
      return acc;
    }, { totalRequests: 0, totalHits: 0, totalSaved: 0, totalSpent: 0 });
    totals.overallHitRate = totals.totalRequests > 0
      ? (totals.totalHits / totals.totalRequests * 100).toFixed(1) + "%"
      : "0%";
    res.json({ daily: stats, totals: totals });
  }).catch(function(err) {
    res.status(500).json({ error: err.message });
  });
});

app.listen(process.env.PORT || 3000, function() {
  console.log("LLM Cache API running on port " + (process.env.PORT || 3000));
});

Common Issues and Troubleshooting

1. Redis connection refused on startup

Error: connect ECONNREFUSED 127.0.0.1:6379
    at TCPConnectWrap.afterConnect

This happens when Redis is not running or is on a different port. Verify Redis is running with redis-cli ping. In production, ensure your REDIS_URL environment variable includes the correct host, port, and password. Add a graceful fallback so your application still works without Redis, falling back to memory-only caching.
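A minimal sketch of that fallback, assuming a factory like the hypothetical `createFallbackCache` below that wraps an injected ioredis-style client and mirrors writes into an in-process Map:

```javascript
// Sketch: cache wrapper that degrades to an in-memory Map when the
// Redis client is missing or throws. The client is injected so the
// wrapper works with any ioredis-compatible object (or null).
function createFallbackCache(redisClient) {
  var memory = new Map();
  return {
    get: function(key) {
      if (!redisClient) return Promise.resolve(memory.get(key) || null);
      return redisClient.get(key).catch(function() {
        // Redis unreachable: degrade to the local memory copy
        return memory.get(key) || null;
      });
    },
    set: function(key, value, ttlSeconds) {
      memory.set(key, value); // always mirror locally
      if (!redisClient) return Promise.resolve();
      return redisClient.set(key, value, "EX", ttlSeconds).catch(function() {
        // Swallow Redis write failures; the memory copy still serves
      });
    }
  };
}
```

Because every Redis call is wrapped in a catch, a Redis outage degrades hit rates instead of crashing request handlers.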

2. Cache key collisions returning wrong responses

Expected response about Python, got response about JavaScript

This occurs when your cache key does not include all relevant parameters. The most common mistake is omitting the system prompt from the key. If you use different system prompts for different contexts ("You are a Python expert" vs. "You are a JavaScript expert"), the same user prompt will produce different responses. Always include model, temperature, system prompt, and any other parameters that affect output in your cache key hash.

3. Memory leak from unbounded semantic cache entries

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory

Each embedding vector from text-embedding-3-small is 1536 floats, consuming about 12KB per entry. At 100,000 entries, that is 1.2GB of memory. Always cap your semantic cache with a maximum entry count and implement eviction. The example above trims to 4,000 entries when the limit of 5,000 is reached. For larger-scale semantic caching, use a vector database like Pinecone or pgvector instead of in-memory storage.
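The eviction itself can be as simple as an oldest-first trim. A sketch, with illustrative names and the same 5,000/4,000 limits:

```javascript
// Sketch: size-capped in-memory semantic cache. Entries are trimmed
// oldest-first in one splice once the cap is exceeded, keeping memory
// bounded (each 1536-float embedding costs roughly 12KB).
var MAX_ENTRIES = 5000;
var TRIM_TO = 4000;
var semanticCache = [];

function addSemanticEntry(embedding, response) {
  semanticCache.push({ embedding: embedding, response: response, addedAt: Date.now() });
  if (semanticCache.length > MAX_ENTRIES) {
    // Remove the oldest entries in a single splice rather than one by one
    semanticCache.splice(0, semanticCache.length - TRIM_TO);
  }
}
```

Trimming down to 4,000 rather than 4,999 amortizes the cost: the splice runs once per 1,000 inserts instead of on every insert at the cap.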

4. Stale cache serving outdated responses after model upgrade

User reports: "The AI is giving me the same wrong answer it gave last week"

When you upgrade from one model version to another (e.g., gpt-4o-2024-05-13 to gpt-4o-2024-08-06), your cache is full of responses from the old model. If your cache key uses the generic model name gpt-4o rather than the specific snapshot version, all those old cached responses will continue to be served. Either include the specific model version in cache keys, or flush the cache when you upgrade models. A tag-based invalidation system (shown in the invalidation section) makes this straightforward.
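One lightweight way to do the version pinning is an alias map consulted when building cache keys — a sketch (the snapshot value is illustrative; keep it in config, not code):

```javascript
// Sketch: resolve generic model aliases to pinned snapshot versions
// before hashing the cache key, so upgrading the snapshot in one place
// automatically invalidates every key built from the old version.
var MODEL_SNAPSHOTS = {
  "gpt-4o": "gpt-4o-2024-08-06"
};

function pinnedModel(name) {
  // Unknown names pass through unchanged
  return MODEL_SNAPSHOTS[name] || name;
}
```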

5. Semantic cache returning approximate matches that are wrong

Query: "How do I delete my account?"
Cached: "How do I create my account?"  (similarity: 0.94)

A similarity threshold that is too low will match prompts that are related but have opposite intent. "Delete" vs. "create" are semantically similar (both about account management) but functionally opposite. If you see this in production, raise your threshold from 0.90 to 0.95 or higher. You can also add a negative similarity list — pairs of prompts that should never match despite high similarity scores.
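A sketch of that guard, combining a cosine-similarity threshold with a deny list (function and list names are illustrative):

```javascript
// Sketch: accept a semantic-cache match only if similarity clears the
// threshold AND the prompt pair is not on a deny list of known
// opposite-intent lookalikes.
function cosineSimilarity(a, b) {
  var dot = 0, na = 0, nb = 0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

var NEVER_MATCH = [
  ["delete my account", "create my account"]
];

function isAcceptableMatch(query, cachedPrompt, similarity, threshold) {
  if (similarity < threshold) return false;
  var q = query.toLowerCase();
  var c = cachedPrompt.toLowerCase();
  return !NEVER_MATCH.some(function(pair) {
    return (q.includes(pair[0]) && c.includes(pair[1])) ||
           (q.includes(pair[1]) && c.includes(pair[0]));
  });
}
```

The deny list grows from production incidents: every time a bad approximate match is reported, add the offending pair so it can never recur.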

Best Practices

  • Always set temperature to 0 for cacheable requests. Non-deterministic responses cached as a single value misrepresent the model's output distribution. If you need variation, either skip caching or store multiple variants.

  • Include every output-affecting parameter in your cache key. Model name, temperature, max tokens, top_p, frequency penalty, presence penalty, system prompt, and tools/functions all affect the response. Miss one and you serve wrong results.

  • Use TTLs aggressively and differentiate by content type. Factual responses can live for days. Anything referencing current data should expire in minutes. When in doubt, shorter TTLs are safer — a cache miss costs money, but a stale response costs trust.

  • Monitor cache hit rates daily and alert on anomalies. A sudden drop in hit rate might indicate a code change that broke key generation. A sudden spike might indicate a bot sending the same query repeatedly.

  • Combine provider-side and application-side caching. Anthropic's prompt caching reduces the cost of cache misses. Application-level caching eliminates misses entirely for repeated queries. Together they provide defense in depth.

  • Implement graceful degradation. If Redis goes down, fall back to memory-only caching. If the memory cache is full, call the LLM directly. Never let a cache failure become an application failure. Wrap all cache operations in try/catch.

  • Hash your cache keys, do not use raw prompts as keys. Raw prompts can be thousands of characters long, which wastes Redis memory on key storage and slows down lookups. SHA-256 hashing gives you fixed-length 64-character keys.

  • Log cache sources for debugging. When a response comes from cache, record whether it was L1, L2, L3, or semantic. This helps you tune layer sizes and TTLs based on actual access patterns.

  • Separate cache namespaces per environment. Use prefixes like prod:llm:, staging:llm:, dev:llm: to prevent development testing from polluting production caches and vice versa.
