Caching Layers for AI Applications

Build multi-layer caching for AI applications with LRU, Redis, PostgreSQL, semantic matching, and effectiveness monitoring in Node.js.

Overview

Every call to an LLM API costs money and takes anywhere from 500 milliseconds to 30 seconds to resolve. If your AI-powered application serves real users at any kind of scale, caching is not optional — it is the single highest-impact optimization you can make. This article walks through building a production-grade, multi-layer caching system for AI applications in Node.js, covering in-memory LRU, Redis, PostgreSQL persistence, semantic similarity matching, and the monitoring you need to prove it is actually working.

Prerequisites

  • Node.js 18+ installed
  • Basic understanding of caching concepts (TTL, cache keys, eviction)
  • Redis installed locally or accessible via a cloud provider
  • PostgreSQL available for the persistent cache layer
  • Familiarity with LLM APIs (OpenAI, Anthropic, etc.)
  • Working knowledge of Express.js

Install the dependencies we will use throughout (crypto and zlib are Node.js built-in modules and do not need to be installed):

npm install lru-cache ioredis pg openai

Why Caching Is Essential for AI Applications

Traditional web applications deal with database queries that take 5-50 milliseconds. AI applications deal with API calls that take 1-10 seconds and cost $0.01 to $0.50 per request. The economics are fundamentally different.

Here is what a typical production AI application looks like without caching:

User asks: "What are the best practices for error handling in Node.js?"
→ LLM API call: 2.3 seconds, $0.03
→ Total: 2.3 seconds

Same user asks again 5 minutes later:
→ LLM API call: 2.1 seconds, $0.03
→ Total: 2.1 seconds

Different user asks the same question:
→ LLM API call: 2.5 seconds, $0.03
→ Total: 2.5 seconds

Three identical answers. $0.09 spent. 6.9 seconds of cumulative latency. Multiply that by thousands of users asking variations of the same questions and you are burning through your API budget at an alarming rate.

With a proper caching layer:

First request: 2.3 seconds, $0.03 (cache miss)
Second request: 3 milliseconds, $0.00 (cache hit)
Third request: 3 milliseconds, $0.00 (cache hit)

In production, I have seen cache hit rates between 40% and 85% depending on the application type. A customer support chatbot with common questions will hit the high end. A creative writing assistant will hit the low end. Either way, the savings are substantial.

Identifying What to Cache

Not everything in an AI pipeline should be cached the same way. Here is how I categorize cacheable artifacts:

High-value cache targets (cache aggressively):

  • LLM responses to factual questions
  • Generated embeddings for documents and queries
  • Search results from vector databases
  • Classification and categorization results
  • Summarization output for static documents

Medium-value cache targets (cache with shorter TTLs):

  • RAG retrieval results
  • Intermediate chain-of-thought computations
  • Tool call results within agent workflows
  • Translated content

Low-value or uncacheable:

  • Creative generation with high temperature settings
  • Responses that depend on real-time data
  • Personalized outputs that vary per user context
  • Streaming responses mid-generation

The rule of thumb: if the same input should produce a functionally equivalent output, cache it.
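
As a rough sketch, that categorization can be made explicit in a small policy map consulted before each call (the category names and flags are illustrative, not a fixed API):

var CACHE_POLICY = {
  factual_qa:     { cache: true },
  embedding:      { cache: true },
  summary:        { cache: true },
  classification: { cache: true },
  rag_retrieval:  { cache: true },
  tool_result:    { cache: true },
  creative:       { cache: false },  // high-temperature generation
  realtime:       { cache: false },  // depends on live data
  personalized:   { cache: false }   // varies per user context
};

function isCacheable(category) {
  var policy = CACHE_POLICY[category];
  return policy ? policy.cache : false;
}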

L1 Cache: In-Memory with LRU

The fastest cache is the one that lives in your process memory. The lru-cache package gives you an in-memory Least Recently Used cache with size limits, TTLs, and automatic eviction.

var LRUCache = require("lru-cache").LRUCache; // recent lru-cache versions export the class by name

var memoryCache = new LRUCache({
  max: 500,                    // max 500 entries
  maxSize: 50 * 1024 * 1024,  // 50MB total size limit
  sizeCalculation: function (value) {
    return Buffer.byteLength(JSON.stringify(value), "utf8");
  },
  ttl: 1000 * 60 * 30,        // 30 minutes default
  allowStale: false,
  updateAgeOnGet: true         // reset TTL on access
});

function getFromMemory(key) {
  var result = memoryCache.get(key);
  if (result) {
    console.log("[L1 HIT] Key:", key.substring(0, 32) + "...");
    return result;
  }
  return null;
}

function setInMemory(key, value, ttlMs) {
  var options = {};
  if (ttlMs) {
    options.ttl = ttlMs;
  }
  memoryCache.set(key, value, options);
}

L1 is your hot path. It handles repeated requests within the same process and same deployment window. The downside: it does not survive restarts, and in a multi-instance deployment, each instance has its own isolated cache.

L2 Cache: Redis for Distributed Caching

Redis bridges the gap between fast in-memory access and shared state across multiple application instances. Every instance of your Node.js app reads and writes to the same Redis, so a cache entry created by one instance benefits all others.

var Redis = require("ioredis");

var redis = new Redis({
  host: process.env.REDIS_HOST || "127.0.0.1",
  port: parseInt(process.env.REDIS_PORT) || 6379,
  password: process.env.REDIS_PASSWORD || undefined,
  maxRetriesPerRequest: 3,
  retryStrategy: function (times) {
    if (times > 3) return null;
    return Math.min(times * 200, 2000);
  }
});

function getFromRedis(key) {
  return redis.get("ai_cache:" + key).then(function (data) {
    if (data) {
      console.log("[L2 HIT] Key:", key.substring(0, 32) + "...");
      return JSON.parse(data);
    }
    return null;
  });
}

function setInRedis(key, value, ttlSeconds) {
  var serialized = JSON.stringify(value);
  if (ttlSeconds) {
    return redis.setex("ai_cache:" + key, ttlSeconds, serialized);
  }
  return redis.set("ai_cache:" + key, serialized);
}

I prefix all AI cache keys with ai_cache: so they are easy to identify, monitor, and flush independently of other Redis data.
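
That prefix also makes it cheap to see how many AI entries Redis is holding. A small monitoring helper (a sketch using ioredis's scanStream, the same API the invalidation code later in this article relies on):

function countCacheKeys() {
  return new Promise(function (resolve, reject) {
    var stream = redis.scanStream({ match: "ai_cache:*", count: 500 });
    var total = 0;
    stream.on("data", function (keys) { total += keys.length; });
    stream.on("end", function () { resolve(total); });
    stream.on("error", reject);
  });
}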

L3 Cache: Database for Persistent Cache

For expensive computations that should survive infrastructure changes — embeddings you generated, document summaries, classification results — a database-backed cache layer makes sense. PostgreSQL works well because you can add indexes, run analytics on cache usage, and store metadata alongside the cached values.

CREATE TABLE ai_cache (
  id SERIAL PRIMARY KEY,
  cache_key VARCHAR(128) UNIQUE NOT NULL,
  cache_value TEXT NOT NULL,
  model VARCHAR(64),
  token_count INTEGER,
  cost_usd NUMERIC(10, 6),
  hit_count INTEGER DEFAULT 0,
  created_at TIMESTAMP DEFAULT NOW(),
  expires_at TIMESTAMP,
  metadata JSONB DEFAULT '{}'
);

-- cache_key already has an index via its UNIQUE constraint
CREATE INDEX idx_ai_cache_expires ON ai_cache(expires_at);

var pg = require("pg");

var pool = new pg.Pool({
  connectionString: process.env.POSTGRES_CONNECTION_STRING
});

function getFromDatabase(key) {
  var sql = "UPDATE ai_cache SET hit_count = hit_count + 1 " +
            "WHERE cache_key = $1 AND (expires_at IS NULL OR expires_at > NOW()) " +
            "RETURNING cache_value, model, token_count";
  return pool.query(sql, [key]).then(function (result) {
    if (result.rows.length > 0) {
      console.log("[L3 HIT] Key:", key.substring(0, 32) + "...");
      return JSON.parse(result.rows[0].cache_value);
    }
    return null;
  });
}

function setInDatabase(key, value, metadata) {
  var sql = "INSERT INTO ai_cache (cache_key, cache_value, model, token_count, cost_usd, expires_at, metadata) " +
            "VALUES ($1, $2, $3, $4, $5, $6, $7) " +
            "ON CONFLICT (cache_key) DO UPDATE SET " +
            "cache_value = EXCLUDED.cache_value, " +
            "model = EXCLUDED.model, " +
            "token_count = EXCLUDED.token_count, " +
            "cost_usd = EXCLUDED.cost_usd, " +
            "expires_at = EXCLUDED.expires_at, " +
            "metadata = EXCLUDED.metadata, " +
            "hit_count = 0, " +
            "created_at = NOW()";

  var expiresAt = metadata.ttlSeconds
    ? new Date(Date.now() + metadata.ttlSeconds * 1000)
    : null;

  return pool.query(sql, [
    key,
    JSON.stringify(value),
    metadata.model || null,
    metadata.tokenCount || null,
    metadata.costUsd || null,
    expiresAt,
    JSON.stringify(metadata.extra || {})
  ]);
}

The UPDATE ... RETURNING pattern for reads is deliberate — it increments the hit counter atomically while fetching the value, giving you usage analytics for free.
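
Because hit_count and cost_usd live in the same table, a report of which cache entries pay for themselves is one query away. A sketch (the saved-cost figure assumes cost_usd was recorded when the entry was written):

function topCacheEntries(limit) {
  var sql = "SELECT cache_key, model, hit_count, " +
            "hit_count * COALESCE(cost_usd, 0) AS est_cost_saved_usd " +
            "FROM ai_cache " +
            "ORDER BY hit_count DESC " +
            "LIMIT $1";
  return pool.query(sql, [limit || 20]).then(function (result) {
    return result.rows;
  });
}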

Cache Key Strategies for LLM Requests

Building the right cache key is critical. Two requests that look different on the surface might produce functionally identical results, and two requests that look similar might need completely different responses.

For LLM requests, I hash a combination of the input parameters that affect the output:

var crypto = require("crypto");

function buildCacheKey(params) {
  var keyComponents = {
    model: params.model,
    messages: params.messages,
    temperature: params.temperature || 0,
    maxTokens: params.maxTokens || null,
    systemPrompt: params.systemPrompt || "",
    tools: params.tools ? JSON.stringify(params.tools) : null
  };

  // Serialize keys in sorted order manually. Passing an array as the second
  // argument to JSON.stringify would also filter nested properties (like each
  // message's role and content), making different prompts collide on one key.
  var normalized = Object.keys(keyComponents).sort().map(function (k) {
    return k + "=" + JSON.stringify(keyComponents[k]);
  }).join("|");
  var hash = crypto.createHash("sha256").update(normalized).digest("hex");

  return hash;
}

// Two identical prompts produce the same key
var key1 = buildCacheKey({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Explain REST APIs" }],
  temperature: 0
});

var key2 = buildCacheKey({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Explain REST APIs" }],
  temperature: 0
});

console.log(key1 === key2); // true

Key points on this approach:

  • Sort the JSON keys to ensure deterministic serialization
  • Include the model name — a GPT-4o response and a Claude response for the same prompt are different cache entries
  • Include temperature — a temperature of 0 is cacheable, a temperature of 1.0 generally is not
  • Include tools/functions — different tool sets produce different responses
  • Do NOT include timestamps, request IDs, or other non-deterministic fields

Handling Non-Deterministic Responses

Temperature controls randomness. At temperature 0, most LLM providers return deterministic (or near-deterministic) results for the same input. At higher temperatures, responses vary.

My rule:

  • Temperature 0 to 0.2: Cache freely. Responses are stable enough.
  • Temperature 0.3 to 0.7: Cache with shorter TTLs. The first response is as good as any subsequent one for most use cases.
  • Temperature 0.8+: Do not cache unless the use case specifically allows it (e.g., "give me any creative suggestion" where any cached suggestion is fine).

function shouldCache(params) {
  var temperature = params.temperature || 0;

  if (temperature <= 0.2) return { cache: true, ttlMultiplier: 1.0 };
  if (temperature <= 0.7) return { cache: true, ttlMultiplier: 0.3 };
  return { cache: false, ttlMultiplier: 0 };
}

TTL Strategies by Content Type

Different types of AI outputs have different shelf lives. Here is the TTL strategy I use in production:

var TTL_STRATEGIES = {
  // Static factual content - long TTL
  "factual_qa": { l1: 60 * 60, l2: 24 * 60 * 60, l3: 7 * 24 * 60 * 60 },

  // Embeddings - very long TTL (they do not change unless the model changes)
  "embedding": { l1: 4 * 60 * 60, l2: 7 * 24 * 60 * 60, l3: 30 * 24 * 60 * 60 },

  // Document summaries - long TTL
  "summary": { l1: 2 * 60 * 60, l2: 24 * 60 * 60, l3: 14 * 24 * 60 * 60 },

  // Classification results - long TTL
  "classification": { l1: 60 * 60, l2: 48 * 60 * 60, l3: 30 * 24 * 60 * 60 },

  // RAG retrieval - medium TTL
  "rag_retrieval": { l1: 15 * 60, l2: 2 * 60 * 60, l3: 24 * 60 * 60 },

  // Dynamic or personalized - short TTL
  "conversational": { l1: 5 * 60, l2: 30 * 60, l3: null },

  // Default fallback
  "default": { l1: 30 * 60, l2: 4 * 60 * 60, l3: 24 * 60 * 60 }
};

function getTTL(contentType, layer) {
  var strategy = TTL_STRATEGIES[contentType] || TTL_STRATEGIES["default"];
  return strategy[layer] || null;
}

The L1 (memory) TTL is always the shortest because memory is the most constrained resource. L3 (database) is the longest because storage is cheap and a stale cache entry that gets refreshed is better than a cold miss that costs you an API call.
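
A couple of lookups to make the shape concrete:

getTTL("embedding", "l3");       // 2592000 seconds (30 days)
getTTL("conversational", "l3");  // null — skip the persistent layer entirely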

Cache Warming Strategies

For predictable workloads, you can pre-populate the cache before users ever hit it. This is especially effective for applications with a finite set of common queries.

var COMMON_QUERIES = [
  "How do I handle errors in Node.js?",
  "What is the difference between SQL and NoSQL?",
  "Explain microservices architecture",
  "How do I deploy to AWS?",
  "What are design patterns?"
];

function warmCache(aiClient, cacheModule) {
  console.log("[WARM] Starting cache warming for", COMMON_QUERIES.length, "queries");
  var started = Date.now();

  var promises = COMMON_QUERIES.map(function (query, index) {
    return new Promise(function (resolve) {
      // Stagger requests to avoid rate limits
      setTimeout(function () {
        cacheModule.queryWithCache({
          model: "gpt-4o",
          messages: [{ role: "user", content: query }],
          temperature: 0,
          contentType: "factual_qa"
        }, function (p) {
          // The actual LLM call only runs on a cache miss
          return aiClient.chat.completions.create({
            model: p.model,
            messages: p.messages,
            temperature: p.temperature
          });
        }).then(function (result) {
          console.log("[WARM] Cached:", query.substring(0, 50));
          resolve(result);
        }).catch(function (err) {
          console.error("[WARM] Failed:", query.substring(0, 50), err.message);
          resolve(null);
        });
      }, index * 2000); // 2 seconds between requests
    });
  });

  return Promise.all(promises).then(function (results) {
    var elapsed = Date.now() - started;
    var successful = results.filter(function (r) { return r !== null; }).length;
    console.log("[WARM] Complete:", successful + "/" + COMMON_QUERIES.length,
                "in", elapsed + "ms");
  });
}

Run cache warming on application startup or on a schedule. I typically warm caches during off-peak hours using a cron job.
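
A minimal scheduling sketch, assuming the OpenAI client and cache module shown elsewhere in this article (the 24-hour interval is an arbitrary choice; a cron library works just as well):

// Warm once at startup, then re-warm daily
warmCache(openai, aiCache).catch(function (err) {
  console.error("[WARM] Initial warm failed:", err.message);
});

setInterval(function () {
  warmCache(openai, aiCache).catch(function (err) {
    console.error("[WARM] Scheduled warm failed:", err.message);
  });
}, 24 * 60 * 60 * 1000);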

Cache Invalidation Patterns

Cache invalidation is famously one of the two hard problems in computer science. For AI caches, you have three primary strategies:

Time-based (TTL expiration): The simplest and most common. Set a TTL and let entries expire naturally. This is the right default for most AI caching.

Event-based invalidation: When something changes upstream, flush the relevant cache entries. For example, when you update a document in your RAG pipeline, invalidate all cached summaries and embeddings for that document. This only works if a human-readable identifier (such as the document ID) is part of the cache key; a bare SHA-256 hash cannot be pattern-matched, so include that identifier as a key prefix wherever you plan to invalidate by pattern.

function invalidateByPattern(pattern) {
  // Clear L1
  var keysToDelete = [];
  memoryCache.forEach(function (value, key) {
    // "*" means flush everything; otherwise match on a substring of the key
    if (pattern === "*" || key.indexOf(pattern) !== -1) {
      keysToDelete.push(key);
    }
  });
  keysToDelete.forEach(function (key) {
    memoryCache.delete(key);
  });
  console.log("[INVALIDATE] L1: removed", keysToDelete.length, "entries");

  // Clear L2 (Redis) using SCAN to avoid blocking
  return new Promise(function (resolve) {
    var stream = redis.scanStream({
      match: "ai_cache:*" + pattern + "*",
      count: 100
    });
    var pipeline = redis.pipeline();
    var count = 0;

    stream.on("data", function (keys) {
      keys.forEach(function (key) {
        pipeline.del(key);
        count++;
      });
    });

    stream.on("end", function () {
      pipeline.exec().then(function () {
        console.log("[INVALIDATE] L2: removed", count, "entries");
        resolve(count);
      });
    });
  });
}

Manual invalidation: Expose an admin endpoint that lets you flush specific entries or the entire cache. Essential for debugging and emergency situations.

// Express route for manual cache management
app.post("/admin/cache/flush", function (req, res) {
  var pattern = req.body.pattern || "*";

  invalidateByPattern(pattern).then(function (count) {
    res.json({ flushed: count, pattern: pattern });
  });
});

Semantic Caching

Exact cache key matching misses a huge opportunity. "Explain REST APIs" and "What are REST APIs?" should hit the same cache entry. Semantic caching uses embedding similarity to match functionally equivalent queries.

var OpenAI = require("openai");
var openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

function getEmbedding(text) {
  return openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  }).then(function (response) {
    return response.data[0].embedding;
  });
}

function cosineSimilarity(vecA, vecB) {
  var dotProduct = 0;
  var normA = 0;
  var normB = 0;
  for (var i = 0; i < vecA.length; i++) {
    dotProduct += vecA[i] * vecB[i];
    normA += vecA[i] * vecA[i];
    normB += vecB[i] * vecB[i];
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Store embeddings alongside cache entries in PostgreSQL
var SEMANTIC_THRESHOLD = 0.92; // Tune this per use case

function findSemanticMatch(queryEmbedding) {
  // Using pgvector extension for efficient similarity search
  var sql = "SELECT cache_key, cache_value, " +
            "1 - (embedding <=> $1::vector) AS similarity " +
            "FROM ai_cache_semantic " +
            "WHERE 1 - (embedding <=> $1::vector) > $2 " +
            "AND (expires_at IS NULL OR expires_at > NOW()) " +
            "ORDER BY similarity DESC LIMIT 1";

  return pool.query(sql, [
    "[" + queryEmbedding.join(",") + "]",
    SEMANTIC_THRESHOLD
  ]).then(function (result) {
    if (result.rows.length > 0) {
      console.log("[SEMANTIC HIT] Similarity:", result.rows[0].similarity.toFixed(4));
      return JSON.parse(result.rows[0].cache_value);
    }
    return null;
  });
}
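
This query assumes a companion ai_cache_semantic table with a pgvector column, which is not defined above. A minimal setup sketch (the 1536 dimension matches text-embedding-3-small; the exact column set is an assumption):

function createSemanticCacheTable() {
  var sql = "CREATE EXTENSION IF NOT EXISTS vector; " +
            "CREATE TABLE IF NOT EXISTS ai_cache_semantic (" +
            "  id SERIAL PRIMARY KEY, " +
            "  cache_key VARCHAR(128) UNIQUE NOT NULL, " +
            "  cache_value TEXT NOT NULL, " +
            "  embedding vector(1536), " +
            "  created_at TIMESTAMP DEFAULT NOW(), " +
            "  expires_at TIMESTAMP)";
  return pool.query(sql);
}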

Semantic caching adds an embedding lookup to each cache miss, but the cost of one embedding call ($0.00002 for text-embedding-3-small) is far less than a full LLM call. In my experience, semantic caching increases effective hit rates by 15-25% on conversational workloads.

Cache-Aside vs Write-Through vs Write-Behind

Cache-aside is what most people implement by default: check the cache, if miss then call the LLM, then write the result back to cache. This is the right pattern for AI applications 90% of the time because you only cache results that are actually requested.

Write-through writes to the cache and the backing store simultaneously. Useful when you need guaranteed persistence for every AI response (audit trail, compliance).

Write-behind (write-back) writes to the fast cache immediately and asynchronously persists to the slower store. This is what I use for the L3 database layer — the user gets the fast L1/L2 response, and the database write happens in the background.

function cacheResult(key, value, contentType, metadata) {
  var l1TTL = getTTL(contentType, "l1");
  var l2TTL = getTTL(contentType, "l2");
  var l3TTL = getTTL(contentType, "l3");

  // L1: synchronous, always
  if (l1TTL) {
    setInMemory(key, value, l1TTL * 1000);
  }

  // L2: async, do not wait
  if (l2TTL) {
    setInRedis(key, value, l2TTL).catch(function (err) {
      console.error("[L2 WRITE ERROR]", err.message);
    });
  }

  // L3: async write-behind, do not wait
  if (l3TTL) {
    setInDatabase(key, value, {
      model: metadata.model,
      tokenCount: metadata.tokenCount,
      costUsd: metadata.costUsd,
      ttlSeconds: l3TTL,
      extra: metadata.extra
    }).catch(function (err) {
      console.error("[L3 WRITE ERROR]", err.message);
    });
  }
}

The L1 write is synchronous because it is just a memory operation. L2 and L3 writes use fire-and-forget promises. If Redis or PostgreSQL is temporarily unavailable, the user still gets their response — the cache write just silently fails, and the next request will be another miss that triggers a fresh API call.

Measuring Cache Effectiveness

A cache you cannot measure is a cache you cannot tune. Track these metrics:

var cacheStats = {
  l1Hits: 0,
  l2Hits: 0,
  l3Hits: 0,
  semanticHits: 0,
  misses: 0,
  totalRequests: 0,
  costSaved: 0,
  latencySaved: 0,
  errors: 0,
  startTime: Date.now()
};

function recordHit(layer, estimatedCost, latencySavedMs) {
  cacheStats.totalRequests++;
  cacheStats[layer + "Hits"]++;
  cacheStats.costSaved += estimatedCost || 0;
  cacheStats.latencySaved += latencySavedMs || 0;
}

function recordMiss() {
  cacheStats.totalRequests++;
  cacheStats.misses++;
}

function getStats() {
  var total = cacheStats.totalRequests || 1;
  var totalHits = cacheStats.l1Hits + cacheStats.l2Hits +
                  cacheStats.l3Hits + cacheStats.semanticHits;
  var uptimeHours = (Date.now() - cacheStats.startTime) / (1000 * 60 * 60);

  return {
    hitRate: ((totalHits / total) * 100).toFixed(2) + "%",
    l1HitRate: ((cacheStats.l1Hits / total) * 100).toFixed(2) + "%",
    l2HitRate: ((cacheStats.l2Hits / total) * 100).toFixed(2) + "%",
    l3HitRate: ((cacheStats.l3Hits / total) * 100).toFixed(2) + "%",
    semanticHitRate: ((cacheStats.semanticHits / total) * 100).toFixed(2) + "%",
    missRate: ((cacheStats.misses / total) * 100).toFixed(2) + "%",
    totalRequests: cacheStats.totalRequests,
    estimatedCostSaved: "$" + cacheStats.costSaved.toFixed(4),
    estimatedLatencySaved: (cacheStats.latencySaved / 1000).toFixed(1) + "s",
    costSavedPerHour: "$" + (cacheStats.costSaved / uptimeHours).toFixed(4),
    uptimeHours: uptimeHours.toFixed(1),
    errors: cacheStats.errors
  };
}

Expose these stats on an admin endpoint and log them periodically. In the first week of deploying caching on an AI application I built, the metrics showed a 62% hit rate, $47 in API costs saved per day, and a 1.8 second average latency reduction on cache hits. Those numbers made the business case for the caching layer self-evident.
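
Periodic logging can be as simple as a timer (the 15-minute interval here is an assumption):

setInterval(function () {
  console.log("[CACHE STATS]", JSON.stringify(getStats()));
}, 15 * 60 * 1000);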

Cache Compression for Large LLM Responses

LLM responses can be large. A detailed code explanation might be 4,000-8,000 tokens — roughly 15-30 KB of text. When you are storing thousands of these in Redis, memory usage adds up fast. Compression typically reduces LLM text output by 60-75%.

var zlib = require("zlib");

function compressValue(value) {
  return new Promise(function (resolve, reject) {
    var serialized = JSON.stringify(value);
    zlib.gzip(Buffer.from(serialized, "utf8"), function (err, compressed) {
      if (err) return reject(err);
      resolve(compressed.toString("base64"));
    });
  });
}

function decompressValue(compressed) {
  return new Promise(function (resolve, reject) {
    var buffer = Buffer.from(compressed, "base64");
    zlib.gunzip(buffer, function (err, decompressed) {
      if (err) return reject(err);
      resolve(JSON.parse(decompressed.toString("utf8")));
    });
  });
}

// Use compression for L2 (Redis) to save memory
function setInRedisCompressed(key, value, ttlSeconds) {
  return compressValue(value).then(function (compressed) {
    if (ttlSeconds) {
      return redis.setex("ai_cache:" + key, ttlSeconds, compressed);
    }
    return redis.set("ai_cache:" + key, compressed);
  });
}

function getFromRedisCompressed(key) {
  return redis.get("ai_cache:" + key).then(function (data) {
    if (!data) return null;
    return decompressValue(data);
  });
}

I enable compression for L2 and L3 but skip it for L1. The in-memory cache is already limited by entry count and total size, and the decompression overhead defeats the purpose of a fast in-process cache.

Complete Working Example

Here is a full multi-layer caching module that brings everything together. This is production code I have adapted from a real AI application.

// ai-cache.js — Multi-layer caching module for AI applications
var crypto = require("crypto");
var zlib = require("zlib");
var LRUCache = require("lru-cache").LRUCache;
var Redis = require("ioredis");
var pg = require("pg");

// ─── Configuration ───

var CONFIG = {
  l1MaxEntries: 500,
  l1MaxSizeMB: 50,
  redisPrefix: "ai_cache:",
  semanticThreshold: 0.92,
  compressionEnabled: true,
  compressionMinBytes: 1024
};

var TTL_MAP = {
  "factual_qa":     { l1: 3600,  l2: 86400,   l3: 604800  },
  "embedding":      { l1: 14400, l2: 604800,   l3: 2592000 },
  "summary":        { l1: 7200,  l2: 86400,    l3: 1209600 },
  "classification": { l1: 3600,  l2: 172800,   l3: 2592000 },
  "rag_retrieval":  { l1: 900,   l2: 7200,     l3: 86400   },
  "conversational": { l1: 300,   l2: 1800,     l3: null     },
  "default":        { l1: 1800,  l2: 14400,    l3: 86400   }
};

// ─── Stats tracker ───

var stats = {
  l1Hits: 0, l2Hits: 0, l3Hits: 0, semanticHits: 0,
  misses: 0, totalRequests: 0, costSaved: 0,
  latencySaved: 0, errors: 0, startTime: Date.now()
};

// ─── L1: In-Memory LRU ───

var memoryCache = new LRUCache({
  max: CONFIG.l1MaxEntries,
  maxSize: CONFIG.l1MaxSizeMB * 1024 * 1024,
  sizeCalculation: function (value) {
    return Buffer.byteLength(JSON.stringify(value), "utf8");
  },
  ttl: 1000 * 60 * 30,
  updateAgeOnGet: true
});

// ─── L2: Redis ───

var redis = null;

function initRedis(options) {
  redis = new Redis({
    host: (options && options.host) || process.env.REDIS_HOST || "127.0.0.1",
    port: parseInt((options && options.port) || process.env.REDIS_PORT) || 6379,
    password: (options && options.password) || process.env.REDIS_PASSWORD || undefined,
    maxRetriesPerRequest: 3,
    lazyConnect: true,
    retryStrategy: function (times) {
      if (times > 5) return null;
      return Math.min(times * 300, 3000);
    }
  });

  redis.on("error", function (err) {
    console.error("[AI-CACHE] Redis error:", err.message);
    stats.errors++;
  });

  return redis.connect();
}

// ─── L3: PostgreSQL ───

var pgPool = null;

function initPostgres(connectionString) {
  pgPool = new pg.Pool({
    connectionString: connectionString || process.env.POSTGRES_CONNECTION_STRING,
    max: 5
  });

  pgPool.on("error", function (err) {
    console.error("[AI-CACHE] PostgreSQL pool error:", err.message);
    stats.errors++;
  });

  return pgPool.query("SELECT 1").then(function () {
    console.log("[AI-CACHE] PostgreSQL connected");
  });
}

// ─── Compression utilities ───

function compress(value) {
  return new Promise(function (resolve, reject) {
    var serialized = JSON.stringify(value);
    if (!CONFIG.compressionEnabled ||
        serialized.length < CONFIG.compressionMinBytes) {
      resolve({ data: serialized, compressed: false });
      return;
    }
    zlib.gzip(Buffer.from(serialized, "utf8"), function (err, buf) {
      if (err) return reject(err);
      resolve({ data: buf.toString("base64"), compressed: true });
    });
  });
}

function decompress(stored) {
  return new Promise(function (resolve, reject) {
    if (!stored.compressed) {
      resolve(JSON.parse(stored.data));
      return;
    }
    var buffer = Buffer.from(stored.data, "base64");
    zlib.gunzip(buffer, function (err, decompressed) {
      if (err) return reject(err);
      resolve(JSON.parse(decompressed.toString("utf8")));
    });
  });
}

// ─── Cache key generation ───

function buildKey(params) {
  var components = {
    model: params.model || "default",
    messages: params.messages,
    temperature: params.temperature || 0,
    maxTokens: params.maxTokens || null,
    systemPrompt: params.systemPrompt || "",
    tools: params.tools ? JSON.stringify(params.tools) : null
  };

  // Manual sorted serialization; a replacer array would strip nested message
  // fields from the hash input and cause key collisions.
  var sorted = Object.keys(components).sort().map(function (k) {
    return k + "=" + JSON.stringify(components[k]);
  }).join("|");
  return crypto.createHash("sha256").update(sorted).digest("hex");
}

// ─── Temperature-based cache decision ───

function shouldCache(params) {
  var temp = params.temperature || 0;
  if (temp <= 0.2) return { cache: true, ttlMultiplier: 1.0 };
  if (temp <= 0.7) return { cache: true, ttlMultiplier: 0.3 };
  return { cache: false, ttlMultiplier: 0 };
}

// ─── TTL lookup ───

function getTTL(contentType, layer) {
  var strategy = TTL_MAP[contentType] || TTL_MAP["default"];
  return strategy[layer] || null;
}

// ─── Multi-layer lookup ───

function lookup(key, contentType) {
  // L1: synchronous
  var l1Result = memoryCache.get(key);
  if (l1Result) {
    stats.l1Hits++;
    stats.totalRequests++;
    return Promise.resolve({ value: l1Result, layer: "l1" });
  }

  // L2: Redis
  if (!redis) {
    return lookupL3(key, contentType);
  }

  return redis.get(CONFIG.redisPrefix + key).then(function (data) {
    if (data) {
      var stored = JSON.parse(data);
      return decompress(stored).then(function (value) {
        // Promote to L1
        var l1TTL = getTTL(contentType, "l1");
        if (l1TTL) {
          memoryCache.set(key, value, { ttl: l1TTL * 1000 });
        }
        stats.l2Hits++;
        stats.totalRequests++;
        return { value: value, layer: "l2" };
      });
    }
    return lookupL3(key, contentType);
  }).catch(function (err) {
    console.error("[AI-CACHE] L2 read error:", err.message);
    stats.errors++;
    return lookupL3(key, contentType);
  });
}

function lookupL3(key, contentType) {
  if (!pgPool) {
    stats.misses++;
    stats.totalRequests++;
    return Promise.resolve(null);
  }

  var sql = "UPDATE ai_cache SET hit_count = hit_count + 1 " +
            "WHERE cache_key = $1 AND (expires_at IS NULL OR expires_at > NOW()) " +
            "RETURNING cache_value";

  return pgPool.query(sql, [key]).then(function (result) {
    if (result.rows.length > 0) {
      var value = JSON.parse(result.rows[0].cache_value);
      // Promote to L1 and L2
      var l1TTL = getTTL(contentType, "l1");
      var l2TTL = getTTL(contentType, "l2");
      if (l1TTL) memoryCache.set(key, value, { ttl: l1TTL * 1000 });
      if (l2TTL && redis) {
        compress(value).then(function (packed) {
          return redis.setex(CONFIG.redisPrefix + key, l2TTL, JSON.stringify(packed));
        }).catch(function () {});
      }
      stats.l3Hits++;
      stats.totalRequests++;
      return { value: value, layer: "l3" };
    }
    stats.misses++;
    stats.totalRequests++;
    return null;
  }).catch(function (err) {
    console.error("[AI-CACHE] L3 read error:", err.message);
    stats.errors++;
    stats.misses++;
    stats.totalRequests++;
    return null;
  });
}

// ─── Multi-layer write ───

function store(key, value, contentType, metadata) {
  var l1TTL = getTTL(contentType, "l1");
  var l2TTL = getTTL(contentType, "l2");
  var l3TTL = getTTL(contentType, "l3");
  var ttlMultiplier = (metadata && metadata.ttlMultiplier) || 1.0;

  // L1: synchronous
  if (l1TTL) {
    memoryCache.set(key, value, { ttl: Math.round(l1TTL * ttlMultiplier) * 1000 });
  }

  // L2: async fire-and-forget
  if (l2TTL && redis) {
    compress(value).then(function (packed) {
      return redis.setex(
        CONFIG.redisPrefix + key,
        Math.round(l2TTL * ttlMultiplier),
        JSON.stringify(packed)
      );
    }).catch(function (err) {
      console.error("[AI-CACHE] L2 write error:", err.message);
      stats.errors++;
    });
  }

  // L3: async write-behind
  if (l3TTL && pgPool) {
    var expiresAt = new Date(Date.now() + Math.round(l3TTL * ttlMultiplier) * 1000);
    var sql = "INSERT INTO ai_cache (cache_key, cache_value, model, token_count, " +
              "cost_usd, expires_at, metadata) " +
              "VALUES ($1, $2, $3, $4, $5, $6, $7) " +
              "ON CONFLICT (cache_key) DO UPDATE SET " +
              "cache_value = EXCLUDED.cache_value, model = EXCLUDED.model, " +
              "token_count = EXCLUDED.token_count, cost_usd = EXCLUDED.cost_usd, " +
              "expires_at = EXCLUDED.expires_at, metadata = EXCLUDED.metadata, " +
              "hit_count = 0, created_at = NOW()";

    pgPool.query(sql, [
      key,
      JSON.stringify(value),
      (metadata && metadata.model) || null,
      (metadata && metadata.tokenCount) || null,
      (metadata && metadata.costUsd) || null,
      expiresAt,
      JSON.stringify((metadata && metadata.extra) || {})
    ]).catch(function (err) {
      console.error("[AI-CACHE] L3 write error:", err.message);
      stats.errors++;
    });
  }
}

// ─── Main query interface ───

function queryWithCache(params, aiCallFn) {
  var cacheDecision = shouldCache(params);
  var contentType = params.contentType || "default";
  var key = buildKey(params);

  if (!cacheDecision.cache) {
    // Bypass the cache but keep the same return shape as the cached path
    return aiCallFn(params).then(function (result) {
      return { data: result, cached: false, layer: null, key: key };
    });
  }

  return lookup(key, contentType).then(function (hit) {
    if (hit) {
      stats.costSaved += (params.estimatedCost || 0.02);
      stats.latencySaved += (params.estimatedLatencyMs || 2000);
      return {
        data: hit.value,
        cached: true,
        layer: hit.layer,
        key: key
      };
    }

    // Cache miss — call the AI
    var callStart = Date.now();
    return aiCallFn(params).then(function (result) {
      var callDuration = Date.now() - callStart;

      store(key, result, contentType, {
        model: params.model,
        tokenCount: result.usage ? result.usage.total_tokens : null,
        costUsd: params.estimatedCost || null,
        ttlMultiplier: cacheDecision.ttlMultiplier,
        extra: { latencyMs: callDuration }
      });

      return {
        data: result,
        cached: false,
        layer: null,
        key: key,
        latencyMs: callDuration
      };
    });
  });
}

// ─── Stats endpoint ───

function getStats() {
  var total = stats.totalRequests || 1;
  var totalHits = stats.l1Hits + stats.l2Hits + stats.l3Hits + stats.semanticHits;
  var uptimeHours = (Date.now() - stats.startTime) / 3600000;

  return {
    hitRate: ((totalHits / total) * 100).toFixed(2) + "%",
    l1HitRate: ((stats.l1Hits / total) * 100).toFixed(2) + "%",
    l2HitRate: ((stats.l2Hits / total) * 100).toFixed(2) + "%",
    l3HitRate: ((stats.l3Hits / total) * 100).toFixed(2) + "%",
    semanticHitRate: ((stats.semanticHits / total) * 100).toFixed(2) + "%",
    missRate: ((stats.misses / total) * 100).toFixed(2) + "%",
    totalRequests: stats.totalRequests,
    estimatedCostSaved: "$" + stats.costSaved.toFixed(4),
    costSavedPerHour: "$" + (stats.costSaved / (uptimeHours || 1)).toFixed(4),
    latencySaved: (stats.latencySaved / 1000).toFixed(1) + "s",
    memoryEntries: memoryCache.size,
    memorySizeMB: (memoryCache.calculatedSize / (1024 * 1024)).toFixed(2),
    errors: stats.errors,
    uptimeHours: uptimeHours.toFixed(1)
  };
}

// ─── Cleanup expired L3 entries ───

function cleanupExpired() {
  if (!pgPool) return Promise.resolve(0);
  var sql = "DELETE FROM ai_cache WHERE expires_at < NOW()";
  return pgPool.query(sql).then(function (result) {
    console.log("[AI-CACHE] Cleaned up", result.rowCount, "expired entries");
    return result.rowCount;
  });
}

// ─── Exports ───

module.exports = {
  initRedis: initRedis,
  initPostgres: initPostgres,
  buildKey: buildKey,
  queryWithCache: queryWithCache,
  store: store,
  lookup: lookup,
  getStats: getStats,
  cleanupExpired: cleanupExpired
};

Usage in an Express application:

// app.js
var express = require("express");
var OpenAI = require("openai");
var aiCache = require("./ai-cache");

var app = express();
var openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

app.use(express.json());

// Initialize cache layers
aiCache.initRedis().then(function () {
  console.log("[APP] Redis connected");
}).catch(function (err) {
  console.warn("[APP] Redis unavailable, using L1 only:", err.message);
});

aiCache.initPostgres().catch(function (err) {
  console.warn("[APP] PostgreSQL unavailable, L3 disabled:", err.message);
});

// AI query endpoint with caching
app.post("/api/ask", function (req, res) {
  var params = {
    model: "gpt-4o",
    messages: [
      { role: "system", content: "You are a helpful technical assistant." },
      { role: "user", content: req.body.question }
    ],
    temperature: 0,
    contentType: "factual_qa",
    estimatedCost: 0.03,
    estimatedLatencyMs: 2500
  };

  aiCache.queryWithCache(params, function (p) {
    return openai.chat.completions.create({
      model: p.model,
      messages: p.messages,
      temperature: p.temperature
    });
  }).then(function (result) {
    res.json({
      answer: result.data.choices
        ? result.data.choices[0].message.content
        : result.data,
      cached: result.cached,
      cacheLayer: result.layer
    });
  }).catch(function (err) {
    console.error("[APP] Error:", err.message);
    res.status(500).json({ error: "Failed to process request" });
  });
});

// Cache stats endpoint
app.get("/admin/cache/stats", function (req, res) {
  res.json(aiCache.getStats());
});

// Periodic cleanup
setInterval(function () {
  aiCache.cleanupExpired().catch(function (err) {
    console.error("[APP] Cache cleanup failed:", err.message);
  });
}, 60 * 60 * 1000); // Every hour

app.listen(process.env.PORT || 3000, function () {
  console.log("[APP] Server running on port", process.env.PORT || 3000);
});

Sample response when cache is warm:

{
  "answer": "REST APIs follow six architectural constraints...",
  "cached": true,
  "cacheLayer": "l1"
}

Sample stats output after a few hours of production traffic:

{
  "hitRate": "67.34%",
  "l1HitRate": "41.20%",
  "l2HitRate": "18.55%",
  "l3HitRate": "7.59%",
  "semanticHitRate": "0.00%",
  "missRate": "32.66%",
  "totalRequests": 12847,
  "estimatedCostSaved": "$173.8200",
  "costSavedPerHour": "$24.8314",
  "latencySaved": "17218.0s",
  "memoryEntries": 487,
  "memorySizeMB": "14.73",
  "errors": 3,
  "uptimeHours": "7.0"
}

Common Issues and Troubleshooting

1. Redis Connection Refused on Startup

Error: connect ECONNREFUSED 127.0.0.1:6379
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1141:16)

This happens when Redis is not running or the host/port is wrong. The caching module should degrade gracefully — L1 memory cache still works without Redis. Make sure your init code catches this error and continues with reduced functionality instead of crashing the application.

2. Cache Key Collision on Similar but Different Prompts

If two genuinely different prompts produce the same SHA-256 hash (astronomically unlikely) or, more commonly, if you forgot to include a differentiating parameter in the key (like temperature or tools), you get stale wrong results:

Expected: Response for temperature=0.8 creative mode
Got: Response for temperature=0 factual mode (cached earlier)

Fix: Audit your buildKey function. Every parameter that changes the LLM output must be part of the key. Log the key components alongside the hash during development so you can debug collisions.
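
One way to do that logging during development (a sketch; the DEBUG_CACHE_KEYS flag is an assumption, not something the module defines):

function buildKeyDebug(params) {
  var key = buildKey(params);
  if (process.env.DEBUG_CACHE_KEYS === "1") {
    // Log the inputs that produced the hash so unexpected hits can be traced
    console.log("[KEY DEBUG]", key.substring(0, 16), {
      model: params.model,
      temperature: params.temperature || 0,
      tools: params.tools ? params.tools.length : 0,
      firstMessage: params.messages && params.messages[0]
        ? String(params.messages[0].content).substring(0, 60)
        : null
    });
  }
  return key;
}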

3. Memory Leak from Unbounded L1 Cache

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

This happens when you set the LRU cache max too high or forget to set maxSize. Each cached LLM response can be 10-30 KB. 10,000 entries at 20 KB each is 200 MB just for the cache. Set both max (entry count) and maxSize (byte limit) and monitor memoryCache.calculatedSize in your stats endpoint.

4. PostgreSQL Deadlock on High-Concurrency Cache Writes

ERROR: deadlock detected
DETAIL: Process 12345 waits for ShareLock on transaction 67890;
        blocked by process 67891.

The ON CONFLICT DO UPDATE pattern can cause deadlocks under high concurrency when multiple processes try to upsert the same key simultaneously. Fix: wrap the L3 write in a retry loop with exponential backoff, or use advisory locks for hot keys.

function setInDatabaseWithRetry(key, value, metadata, retries) {
  retries = retries || 3;
  return setInDatabase(key, value, metadata).catch(function (err) {
    if (err.code === "40P01" && retries > 0) { // deadlock error code
      var delay = Math.random() * 100 + 50;
      return new Promise(function (resolve) {
        setTimeout(function () {
          resolve(setInDatabaseWithRetry(key, value, metadata, retries - 1));
        }, delay);
      });
    }
    throw err;
  });
}

5. Stale Cache After Model Update

When your LLM provider updates a model (e.g., gpt-4o gets a new snapshot), cached responses from the old version may be outdated. If the model name stays the same but the behavior changes, your cache will serve old responses until TTL expiration.

Fix: Include a cache version in your key generation or flush the cache when you know a model has been updated. You can also use the model's snapshot date as part of the key if the provider exposes it.
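
A minimal way to version keys (the constant name and value are assumptions; bump the value whenever you know the upstream model changed):

var CACHE_KEY_VERSION = "v2";

function buildVersionedKey(params) {
  // Prefixing the hash means old entries simply stop matching after a bump
  return CACHE_KEY_VERSION + ":" + buildKey(params);
}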

6. Gzip Decompression Failure on Corrupted Redis Data

Error: incorrect header check
    at Zlib.zlibOnError [as onerror] (zlib.js:187:17)

This happens when Redis data gets corrupted (partial writes, network issues) or when you change your compression strategy without flushing existing entries. Fix: wrap decompression in a try-catch, delete the corrupted entry, and let the next request repopulate it.

function safeDecompress(key, stored) {
  return decompress(stored).catch(function (err) {
    console.error("[AI-CACHE] Decompression failed, treating as miss:", err.message);
    // Drop the corrupted entry so the next miss repopulates it cleanly
    return redis.del(CONFIG.redisPrefix + key).then(function () {
      return null;
    });
  });
}

Best Practices

  • Always degrade gracefully. Your application must work with zero cache layers available. Redis down? Fall through to PostgreSQL. PostgreSQL down? Make the API call. Never let a cache failure become an application failure.

  • Include the model name and version in every cache key. A cached response from GPT-4o and one from Claude are not interchangeable. When you switch models or model versions, you want fresh cache entries automatically.

  • Set both entry count limits and byte size limits on your LRU cache. One oversized response (a 500 KB code generation result) can blow out your memory budget if you only limit by entry count.

  • Monitor hit rates per layer, not just the aggregate. If L1 hit rate is low but L2 is high, your processes might be restarting too frequently. If L3 is doing all the work, your Redis TTLs might be too short.

  • Use write-behind for L3 (database) writes. Users should never wait for a PostgreSQL INSERT before getting their response. Fire-and-forget the database write and handle failures silently.

  • Run periodic cleanup on your L3 cache. PostgreSQL does not automatically delete expired rows. Schedule a cron job or use setInterval to purge entries where expires_at < NOW() at least once per hour.

  • Start with exact-match caching before implementing semantic caching. Exact matching is simpler, faster, and has zero risk of returning a wrong result. Only add semantic caching when your metrics show that similar-but-not-identical queries are a significant portion of your misses.

  • Compress L2 and L3 values, but not L1. The decompression overhead on L1 defeats the purpose of an in-memory cache. Redis and PostgreSQL benefit significantly from smaller payloads — especially Redis, where memory is typically your most constrained resource.

  • Log cache key hashes alongside request metadata during development. When something is cached that should not be (or vice versa), you need to trace exactly which parameters contributed to the key. Remove this logging in production to avoid noise.

  • Estimate and track cost savings from day one. The cache stats endpoint is not just for engineers — it produces numbers that justify infrastructure investment to your team and stakeholders. "$170 saved per day" is a compelling metric.
