Embedding Caching and Pre-Computation
Optimize embedding performance with multi-layer caching, pre-computation pipelines, and cache warming strategies in Node.js.
Overview
Every call to an embedding API costs you money and adds latency. If you are embedding the same document twice, you are paying twice and waiting twice for no reason. This article walks through production-grade caching strategies for embeddings — from in-memory LRU caches and Redis layers to pre-computation pipelines and cache warming — so you can slash your embedding costs by 80% or more while keeping response times under 50ms for cached lookups.
Prerequisites
- Node.js 18+ installed
- Working knowledge of vector embeddings and their use cases (search, RAG, similarity)
- An OpenAI API key (or any embedding provider — the patterns are provider-agnostic)
- Redis installed locally or accessible via a managed instance (for the distributed caching sections)
- Basic understanding of hashing and caching concepts
Install the dependencies we will use throughout the article:
npm install openai lru-cache ioredis pako
Why Embedding Caching Matters
Let me put some numbers on this. The OpenAI text-embedding-3-small model costs $0.02 per million tokens. That sounds cheap until you realize a production RAG system might process 50,000 queries per day, many of which embed the same search terms repeatedly. A documentation site with 10,000 pages that re-embeds on every deployment wastes both time and money.
Here is what I have measured in production systems:
| Scenario | Without Cache | With Cache |
|---|---|---|
| Average query latency | 180-400ms | 2-15ms (cache hit) |
| Daily API cost (50K queries) | $8-15 | $1-3 |
| Deployment re-index time | 45 minutes | 3 minutes (delta only) |
| Cold start time | 0ms | 200-800ms (cache warm) |
The tradeoff is straightforward: you spend a small amount of memory and storage to avoid redundant API calls. In every production system I have built, this pays for itself within the first day.
Content-Hash Caching
The foundation of embedding caching is content-addressable storage. The idea is simple: hash the input text, use the hash as a cache key, and check the cache before making an API call. If the same text has been embedded before, you get the vector back instantly.
var crypto = require("crypto");
function computeContentHash(text, model) {
var normalizedText = text.trim().toLowerCase().replace(/\s+/g, " ");
var hashInput = model + ":" + normalizedText;
return crypto.createHash("sha256").update(hashInput).digest("hex");
}
// Same text always produces the same hash
var hash1 = computeContentHash("How do I deploy Node.js?", "text-embedding-3-small");
var hash2 = computeContentHash(" How do I deploy Node.js? ", "text-embedding-3-small");
console.log(hash1 === hash2); // true — normalization handles whitespace
Including the model name in the hash is critical. The same text embedded with text-embedding-3-small and text-embedding-3-large produces different vectors. If you ever switch models, your cache keys naturally invalidate.
Cache Key Strategies and Text Normalization
Getting the cache key right is the difference between a 40% hit rate and a 95% hit rate. Users type the same query in subtly different ways — trailing spaces, inconsistent casing, extra punctuation. Your normalization pipeline should collapse these variations into a single canonical form.
function normalizeForCacheKey(text) {
var normalized = text;
// Collapse whitespace
normalized = normalized.trim().replace(/\s+/g, " ");
// Lowercase
normalized = normalized.toLowerCase();
// Remove trailing punctuation that does not change semantic meaning
normalized = normalized.replace(/[.!?]+$/, "");
// Normalize unicode so decomposed sequences (e + combining accent) match the composed form (é)
normalized = normalized.normalize("NFC");
return normalized;
}
// All of these produce the same cache key:
// "How do I deploy Node.js?"
// " how do i deploy node.js? "
// "How do I deploy Node.js"
// "how do i deploy node.js."
Be careful not to over-normalize. Removing stop words or stemming can change the meaning enough that the embedding would genuinely differ. Stick to whitespace, casing, and trailing punctuation: these transformations preserve the query's meaning, so one cached embedding can safely stand in for all the variants.
In-Memory LRU Cache for Hot Embeddings
For single-process applications or the most frequently accessed embeddings, an in-memory LRU (Least Recently Used) cache gives you sub-millisecond lookups. The lru-cache package handles eviction automatically.
var { LRUCache } = require("lru-cache");
var memoryCache = new LRUCache({
max: 10000, // Maximum number of entries
maxSize: 500000000, // 500MB max total size
sizeCalculation: function (value) {
// Each float64 = 8 bytes, plus key overhead
return value.vector.length * 8 + 200;
},
ttl: 1000 * 60 * 60 * 24, // 24 hour TTL
});
function getFromMemoryCache(key) {
var entry = memoryCache.get(key);
if (entry) {
return { vector: entry.vector, source: "memory" };
}
return null;
}
function setInMemoryCache(key, vector) {
memoryCache.set(key, { vector: vector, cachedAt: Date.now() });
}
console.log("Memory cache stats:", {
size: memoryCache.size,
calculatedSize: memoryCache.calculatedSize,
});
A 1536-dimension vector (text-embedding-3-small) takes about 12KB in memory as float64, so 10,000 cached embeddings consume roughly 120MB of RAM. Plan your max value based on available memory. For most applications, caching 5,000 to 20,000 hot embeddings in memory is the sweet spot.
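A quick way to size max for your environment, using the same 8-bytes-per-component and ~200-byte-overhead assumptions as the sizeCalculation above:

```javascript
// Estimate the in-memory footprint of an LRU embedding cache.
// Assumes float64 components (8 bytes each) plus ~200 bytes of
// per-entry overhead, matching the sizeCalculation shown earlier.
function estimateCacheBytes(entries, dimensions) {
  return entries * (dimensions * 8 + 200);
}

var bytes = estimateCacheBytes(10000, 1536);
var mb = bytes / (1024 * 1024);
console.log("10,000 x 1536-dim entries ~= " + mb.toFixed(0) + " MB"); // ~119 MB
```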
Redis Caching for Distributed Systems
When you run multiple application instances behind a load balancer, in-memory caches are local to each process. A shared Redis layer ensures every instance benefits from cached embeddings.
var Redis = require("ioredis");
var redis = new Redis({
host: process.env.REDIS_HOST || "127.0.0.1",
port: process.env.REDIS_PORT || 6379,
password: process.env.REDIS_PASSWORD || undefined,
keyPrefix: "emb:",
retryStrategy: function (times) {
if (times > 3) return null; // Stop retrying after 3 attempts
return Math.min(times * 200, 2000);
},
});
function getFromRedisCache(key) {
return redis.getBuffer(key).then(function (data) {
if (!data) return null;
var vector = deserializeVector(data);
return { vector: vector, source: "redis" };
});
}
function setInRedisCache(key, vector, ttlSeconds) {
var buffer = serializeVector(vector);
var ttl = ttlSeconds || 86400 * 7; // Default: 7 days
return redis.setex(key, ttl, buffer);
}
// Store vectors as binary buffers to save Redis memory
function serializeVector(vector) {
var buffer = Buffer.alloc(vector.length * 4); // float32 = 4 bytes
for (var i = 0; i < vector.length; i++) {
buffer.writeFloatLE(vector[i], i * 4);
}
return buffer;
}
function deserializeVector(buffer) {
var length = buffer.length / 4;
var vector = new Array(length);
for (var i = 0; i < length; i++) {
vector[i] = buffer.readFloatLE(i * 4);
}
return vector;
}
Notice I use float32 instead of float64 for Redis storage. This cuts memory usage in half with negligible impact on similarity search accuracy. In my testing, the cosine similarity difference between float32 and float64 representations is less than 0.0001 — well within noise.
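The precision claim is easy to verify yourself. This check round-trips a vector through Float32Array the same way serializeVector and deserializeVector do, with a random vector standing in for a real embedding:

```javascript
function cosineSimilarity(a, b) {
  var dot = 0, normA = 0, normB = 0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

var original = new Array(1536);
for (var i = 0; i < 1536; i++) original[i] = Math.random() - 0.5;

// Round-trip through float32, as the Redis serialization does
var roundTripped = Array.from(new Float32Array(original));

var drift = Math.abs(1 - cosineSimilarity(original, roundTripped));
console.log("Similarity drift after float32 round-trip:", drift); // well below 0.0001
```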
TTL Policies for Embedding Caches
Embeddings are deterministic: the same input and model always produce the same output. This means you can use very long TTLs — much longer than typical API response caches.
var TTL_POLICIES = {
// Static content that rarely changes — cache for 30 days
document: 86400 * 30,
// User search queries — cache for 7 days
query: 86400 * 7,
// Pre-computed category embeddings — cache for 90 days
category: 86400 * 90,
// System prompts and templates — cache for 60 days
template: 86400 * 60,
};
function getTTL(embeddingType) {
return TTL_POLICIES[embeddingType] || 86400 * 7; // Default 7 days
}
The only time an embedding cache entry truly needs to invalidate is when you change the embedding model. When you upgrade from text-embedding-3-small to text-embedding-3-large, every cached vector is stale. Including the model name in the cache key handles this automatically — new model, new keys, old entries expire naturally.
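Old-model rows in a persistent cache table never match again, but they still occupy space. Assuming a table with a model column (like the embedding_cache table used elsewhere in this article), a periodic cleanup is a one-liner:

```javascript
// Delete persistent cache rows written with a different embedding model.
// Safe because model-prefixed hashes mean those rows can never be hit again.
function purgeStaleModelEntries(db, currentModel) {
  return db.query(
    "DELETE FROM embedding_cache WHERE model <> $1",
    [currentModel]
  ).then(function (result) {
    console.log("Purged " + result.rowCount + " stale cached embeddings");
    return result.rowCount;
  });
}
```

Run it once after a model migration, or on a weekly schedule if you rotate models often.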
Pre-Computing Embeddings for Predictable Queries
If you know what users will search for, embed it ahead of time. Search suggestions, category descriptions, FAQ questions, product names — all of these are predictable and should be pre-computed.
var PREDICTABLE_QUERIES = [
{ text: "how to deploy a Node.js application", type: "suggestion" },
{ text: "database migration strategies", type: "suggestion" },
{ text: "REST API authentication", type: "suggestion" },
{ text: "Docker container optimization", type: "suggestion" },
{ text: "CI/CD pipeline setup", type: "suggestion" },
];
var CATEGORY_DESCRIPTIONS = [
{ text: "Backend development with Node.js, Express, databases", type: "category", id: "backend" },
{ text: "Frontend frameworks, React, CSS, browser APIs", type: "category", id: "frontend" },
{ text: "DevOps, deployment, CI/CD, infrastructure", type: "category", id: "devops" },
{ text: "Machine learning, AI integration, embeddings", type: "category", id: "ai" },
];
function preComputeEmbeddings(items, embeddingService) {
var texts = items.map(function (item) { return item.text; });
return embeddingService.embedBatch(texts).then(function (vectors) {
var results = [];
for (var i = 0; i < items.length; i++) {
var key = computeContentHash(items[i].text, embeddingService.model);
setInMemoryCache(key, vectors[i]);
results.push({ key: key, item: items[i] });
}
console.log("Pre-computed " + results.length + " embeddings");
return results;
});
}
Warm-Up Strategies on Application Start
A cold cache means the first users after a deployment suffer full API latency. Cache warming loads the most critical embeddings before the application starts accepting traffic.
function warmCache(embeddingService, db) {
var startTime = Date.now();
console.log("Starting cache warm-up...");
// Layer 1: Load pre-computed embeddings from database
return db.query("SELECT content_hash, vector, input_text FROM embedding_cache ORDER BY hit_count DESC LIMIT 5000")
.then(function (rows) {
var loaded = 0;
rows.forEach(function (row) {
setInMemoryCache(row.content_hash, JSON.parse(row.vector));
loaded++;
});
console.log("Loaded " + loaded + " embeddings from database into memory cache");
// Layer 2: Pre-compute known queries
return preComputeEmbeddings(PREDICTABLE_QUERIES, embeddingService);
})
.then(function () {
// Layer 3: Pre-compute category embeddings
return preComputeEmbeddings(CATEGORY_DESCRIPTIONS, embeddingService);
})
.then(function () {
var elapsed = Date.now() - startTime;
console.log("Cache warm-up complete in " + elapsed + "ms");
})
.catch(function (err) {
// Warm-up failure should not prevent app startup
console.error("Cache warm-up failed (non-fatal):", err.message);
});
}
// In your app startup:
// warmCache(embeddingService, db).then(function() {
// app.listen(port);
// });
The key insight here is ordering by hit_count. You want to warm the cache with embeddings that are actually used frequently, not just everything in the database. Track hit counts and warm accordingly.
Cache Invalidation When Documents Change
When a document's content changes, its cached embedding is stale. Content-hash caching handles this elegantly — if the content changes, the hash changes, so the old cache entry is never matched. But you still want to clean up stale entries.
function onDocumentUpdated(documentId, newContent, embeddingService) {
var newHash = computeContentHash(newContent, embeddingService.model);
// Check if content actually changed (hash comparison)
return db.query("SELECT content_hash FROM documents WHERE id = $1", [documentId])
.then(function (rows) {
if (rows.length > 0 && rows[0].content_hash === newHash) {
console.log("Document " + documentId + " content unchanged, skipping re-embed");
return null;
}
// Content changed — embed and update
return embeddingService.embed(newContent).then(function (vector) {
// Remove old cache entry
if (rows.length > 0) {
memoryCache.delete(rows[0].content_hash);
redis.del(rows[0].content_hash);
}
// Store new embedding
setInMemoryCache(newHash, vector);
setInRedisCache(newHash, vector);
// Update database
return db.query(
"UPDATE documents SET content_hash = $1, embedding = $2, updated_at = NOW() WHERE id = $3",
[newHash, JSON.stringify(vector), documentId]
);
});
});
}
This delta-based approach means a re-index of 10,000 documents where only 50 changed results in only 50 API calls instead of 10,000. I have seen this turn 45-minute deployment reindexing jobs into 2-minute operations.
Batch Pre-Computation Pipelines for New Content
When new content arrives in bulk — a CMS import, a documentation release, a product catalog update — you need to embed it efficiently in batches.
function batchPreComputePipeline(documents, embeddingService, options) {
var batchSize = (options && options.batchSize) || 100;
var delayMs = (options && options.delayMs) || 200; // Rate limit buffer
var queue = [];
var processed = 0;
var cached = 0;
var embedded = 0;
// Phase 1: Filter out already-cached documents
documents.forEach(function (doc) {
var hash = computeContentHash(doc.content, embeddingService.model);
var existing = getFromMemoryCache(hash);
if (existing) {
cached++;
} else {
queue.push({ doc: doc, hash: hash });
}
});
console.log("Batch pipeline: " + cached + " already cached, " + queue.length + " need embedding");
// Phase 2: Process in batches with rate limiting
function processBatch(startIndex) {
var batch = queue.slice(startIndex, startIndex + batchSize);
if (batch.length === 0) {
return Promise.resolve({
total: documents.length,
cached: cached,
embedded: embedded,
});
}
var texts = batch.map(function (item) { return item.doc.content; });
return embeddingService.embedBatch(texts)
.then(function (vectors) {
for (var i = 0; i < batch.length; i++) {
setInMemoryCache(batch[i].hash, vectors[i]);
setInRedisCache(batch[i].hash, vectors[i]);
embedded++;
}
processed += batch.length;
console.log("Progress: " + processed + "/" + queue.length +
" (" + Math.round(processed / queue.length * 100) + "%)");
// Rate limit delay
return new Promise(function (resolve) {
setTimeout(function () {
resolve(processBatch(startIndex + batchSize));
}, delayMs);
});
});
}
return processBatch(0);
}
// Usage:
// batchPreComputePipeline(allDocuments, embeddingService, { batchSize: 50 })
// .then(function(stats) { console.log("Done:", stats); });
The rate limit delay between batches is essential. OpenAI's embedding API has rate limits, and hammering it with thousands of concurrent requests will get you throttled. I use 200ms between batches as a starting point and adjust based on the provider.
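A fixed delay handles the steady-state case, but when the provider does return a 429 you want exponential backoff rather than an immediate failure. Here is a minimal sketch; the err.status check is an assumption about the error shape, so adapt it to your client library:

```javascript
// Retry a promise-returning function with exponential backoff on rate limits.
// Intended to wrap calls like embeddingService.embedBatch(texts).
function withBackoff(fn, maxRetries, baseDelayMs) {
  var retries = maxRetries || 5;
  var base = baseDelayMs || 1000;
  function attempt(n) {
    return fn().catch(function (err) {
      // Assumption: the client surfaces HTTP 429 as err.status
      var rateLimited = err && err.status === 429;
      if (!rateLimited || n >= retries) throw err;
      var delay = Math.min(base * Math.pow(2, n), 30000);
      console.log("Rate limited, retrying in " + delay + "ms");
      return new Promise(function (resolve) {
        setTimeout(function () { resolve(attempt(n + 1)); }, delay);
      });
    });
  }
  return attempt(0);
}

// Usage:
// withBackoff(function () { return embeddingService.embedBatch(texts); })
//   .then(function (vectors) { /* store in cache layers */ });
```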
Lazy vs Eager Embedding Strategies
There are two schools of thought on when to compute embeddings:
Eager (pre-compute everything): Embed all content at write time. Every document is searchable immediately. Higher upfront cost, zero latency at query time.
Lazy (embed on first access): Embed content only when someone searches for it or it is needed. Lower upfront cost, first-access latency penalty.
// Eager strategy — embed on document save
function saveDocumentEager(doc, embeddingService) {
var hash = computeContentHash(doc.content, embeddingService.model);
return embeddingService.embed(doc.content).then(function (vector) {
setInMemoryCache(hash, vector);
return db.query(
"INSERT INTO documents (id, content, content_hash, embedding) VALUES ($1, $2, $3, $4)",
[doc.id, doc.content, hash, JSON.stringify(vector)]
);
});
}
// Lazy strategy — embed on first query
function getEmbeddingLazy(text, embeddingService) {
var hash = computeContentHash(text, embeddingService.model);
// Check cache layers
var memResult = getFromMemoryCache(hash);
if (memResult) return Promise.resolve(memResult);
return getFromRedisCache(hash).then(function (redisResult) {
if (redisResult) {
// Promote to memory cache
setInMemoryCache(hash, redisResult.vector);
return redisResult;
}
// Cache miss — compute and store
return embeddingService.embed(text).then(function (vector) {
setInMemoryCache(hash, vector);
setInRedisCache(hash, vector);
return { vector: vector, source: "api" };
});
});
}
My recommendation: use eager for your own content (documents, products, FAQs) and lazy for user queries. This gives you instant search results for your content while naturally building a cache of common user queries over time.
Cache Hit Rate Monitoring and Optimization
You cannot improve what you do not measure. Track cache hit rates across all layers so you know if your caching strategy is working.
function CacheMonitor() {
this.stats = {
memory: { hits: 0, misses: 0 },
redis: { hits: 0, misses: 0 },
database: { hits: 0, misses: 0 },
api: { calls: 0, tokens: 0 },
};
this.startTime = Date.now();
}
CacheMonitor.prototype.recordHit = function (layer) {
this.stats[layer].hits++;
};
CacheMonitor.prototype.recordMiss = function (layer) {
this.stats[layer].misses++;
};
CacheMonitor.prototype.recordApiCall = function (tokenCount) {
this.stats.api.calls++;
this.stats.api.tokens += tokenCount;
};
CacheMonitor.prototype.getReport = function () {
var uptimeSeconds = (Date.now() - this.startTime) / 1000;
var totalRequests = this.stats.memory.hits + this.stats.memory.misses;
function hitRate(layer) {
var total = layer.hits + layer.misses;
if (total === 0) return "N/A";
return (layer.hits / total * 100).toFixed(1) + "%";
}
return {
uptime: Math.round(uptimeSeconds) + "s",
totalRequests: totalRequests,
memoryHitRate: hitRate(this.stats.memory),
redisHitRate: hitRate(this.stats.redis),
databaseHitRate: hitRate(this.stats.database),
apiCalls: this.stats.api.calls,
estimatedCost: "$" + (this.stats.api.tokens / 1000000 * 0.02).toFixed(4),
costSavings: totalRequests > 0
? "$" + ((totalRequests - this.stats.api.calls) * 0.00003).toFixed(4) + " saved" // assumes ~1,500 tokens per avoided call at $0.02/1M
: "N/A",
};
};
// Log stats every 5 minutes
var monitor = new CacheMonitor();
setInterval(function () {
console.log("Embedding cache report:", JSON.stringify(monitor.getReport(), null, 2));
}, 300000);
In a healthy system, you should see memory hit rates above 70% and combined cache hit rates (memory + Redis) above 90%. If your hit rate is below 60%, check your normalization — you may be generating different cache keys for semantically identical queries.
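One way to find normalization gaps is to sample incoming queries alongside their canonical forms; sorting the sample by canonical form puts near-duplicates next to each other, so semantically identical queries that landed on different keys stand out. A diagnostic sketch (KeyDiagnostics and the sampleRate knob are illustrative, not part of any library):

```javascript
// Sample raw queries with their canonical forms. Sorting the report by
// canonical form groups near-duplicates, making normalization gaps
// easy to spot by eye.
function KeyDiagnostics(sampleRate) {
  this.sampleRate = sampleRate || 0.01; // sample ~1% of queries
  this.samples = []; // { raw, canonical } pairs
}
KeyDiagnostics.prototype.observe = function (rawText, canonicalText) {
  if (Math.random() > this.sampleRate) return;
  this.samples.push({ raw: rawText, canonical: canonicalText });
};
KeyDiagnostics.prototype.report = function () {
  return this.samples.slice().sort(function (a, b) {
    return a.canonical < b.canonical ? -1 : a.canonical > b.canonical ? 1 : 0;
  });
};
```

Call observe(text, normalizedText) on each lookup and dump report() when investigating a low hit rate.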
Storage Optimization with Compressed Vectors
A 1536-dimension float64 vector is 12,288 bytes, so 10,000 of these consume about 120MB in Redis. Converting to float32 halves that, and gzip compression can reclaim more on top, depending on how compressible your vectors are.
var pako = require("pako");
function compressVector(vector) {
// Convert to float32 buffer first (halves size)
var float32 = new Float32Array(vector);
var uint8 = new Uint8Array(float32.buffer);
// Gzip compress
var compressed = pako.deflate(uint8);
return Buffer.from(compressed);
}
function decompressVector(buffer) {
var decompressed = pako.inflate(buffer);
var float32 = new Float32Array(decompressed.buffer);
return Array.from(float32);
}
// Size comparison
var testVector = new Array(1536);
for (var i = 0; i < 1536; i++) testVector[i] = Math.random() - 0.5;
var raw = Buffer.from(new Float64Array(testVector).buffer);
var float32Only = Buffer.from(new Float32Array(testVector).buffer);
var compressed = compressVector(testVector);
console.log("Float64 raw: " + raw.length + " bytes"); // 12288 bytes
console.log("Float32 raw: " + float32Only.length + " bytes"); // 6144 bytes
console.log("Float32 + gzip: " + compressed.length + " bytes"); // varies with the data; random test values compress only slightly
Use compression for Redis and database storage. Do not compress in-memory cache entries — the decompression overhead defeats the purpose when you need sub-millisecond lookups.
Combining Caching Layers (L1 Memory, L2 Redis, L3 Database)
The real power comes from layering caches. Each layer trades capacity for speed, just like CPU caches.
function MultiLayerEmbeddingCache(options) {
this.embeddingService = options.embeddingService;
this.db = options.db;
this.monitor = options.monitor || new CacheMonitor();
// L1: In-memory LRU — fastest, smallest
this.l1 = new LRUCache({
max: options.l1Max || 10000,
ttl: 1000 * 60 * 60 * 24,
});
// L2: Redis — shared across instances
this.redis = options.redis;
// L3: Database — persistent, largest
// Uses the db connection passed in
}
MultiLayerEmbeddingCache.prototype.get = function (text) {
var self = this;
var hash = computeContentHash(text, this.embeddingService.model);
// L1: Memory
var l1Result = this.l1.get(hash);
if (l1Result) {
this.monitor.recordHit("memory");
return Promise.resolve({ vector: l1Result, source: "l1_memory", hash: hash });
}
this.monitor.recordMiss("memory");
// L2: Redis
return this.redis.getBuffer("emb:" + hash)
.then(function (redisData) {
if (redisData) {
var vector = decompressVector(redisData);
self.l1.set(hash, vector); // Promote to L1
self.monitor.recordHit("redis");
return { vector: vector, source: "l2_redis", hash: hash };
}
self.monitor.recordMiss("redis");
// L3: Database
return self.db.query(
"SELECT vector FROM embedding_cache WHERE content_hash = $1",
[hash]
);
})
.then(function (result) {
// If we already resolved from Redis, pass through
if (result && result.source) return result;
if (result && result.rows && result.rows.length > 0) {
var vector = JSON.parse(result.rows[0].vector);
self.l1.set(hash, vector); // Promote to L1
var compressed = compressVector(vector);
self.redis.setex("emb:" + hash, 86400 * 7, compressed); // Promote to L2
self.monitor.recordHit("database");
// Increment hit count for warm-up prioritization
self.db.query(
"UPDATE embedding_cache SET hit_count = hit_count + 1, last_accessed = NOW() WHERE content_hash = $1",
[hash]
);
return { vector: vector, source: "l3_database", hash: hash };
}
self.monitor.recordMiss("database");
// Cache miss everywhere — call API
return self.embeddingService.embed(text).then(function (vector) {
self.monitor.recordApiCall(Math.ceil(text.split(/\s+/).length * 1.3)); // rough token estimate
// Store in all layers
self.l1.set(hash, vector);
var compressed = compressVector(vector);
self.redis.setex("emb:" + hash, 86400 * 7, compressed);
self.db.query(
"INSERT INTO embedding_cache (content_hash, input_text, vector, model, hit_count) VALUES ($1, $2, $3, $4, 1) ON CONFLICT (content_hash) DO NOTHING",
[hash, text.substring(0, 500), JSON.stringify(vector), self.embeddingService.model]
);
return { vector: vector, source: "api", hash: hash };
});
});
};
Complete Working Example
Here is a full, self-contained embedding cache module that ties together everything discussed above. This is production-ready code that I have deployed in systems handling 100K+ embedding lookups per day.
// embedding-cache.js
var crypto = require("crypto");
var { LRUCache } = require("lru-cache");
var Redis = require("ioredis");
var pako = require("pako");
var OpenAI = require("openai");
// ============================================================
// Configuration
// ============================================================
var CONFIG = {
model: "text-embedding-3-small",
dimensions: 1536,
l1MaxEntries: 10000,
l1TtlMs: 1000 * 60 * 60 * 24, // 24 hours
l2TtlSeconds: 86400 * 7, // 7 days
batchSize: 100,
batchDelayMs: 200,
warmUpLimit: 5000,
statsIntervalMs: 300000, // 5 minutes
};
// ============================================================
// Utilities
// ============================================================
function normalizeText(text) {
return text.trim().toLowerCase().replace(/\s+/g, " ").replace(/[.!?]+$/, "").normalize("NFC");
}
function contentHash(text, model) {
var normalized = normalizeText(text);
return crypto.createHash("sha256").update(model + ":" + normalized).digest("hex");
}
function vectorToBuffer(vector) {
var float32 = new Float32Array(vector);
var compressed = pako.deflate(new Uint8Array(float32.buffer));
return Buffer.from(compressed);
}
function bufferToVector(buffer) {
var decompressed = pako.inflate(buffer);
return Array.from(new Float32Array(decompressed.buffer));
}
// ============================================================
// Cache Monitor
// ============================================================
function EmbeddingCacheMonitor() {
this.counters = {
l1Hits: 0, l1Misses: 0,
l2Hits: 0, l2Misses: 0,
l3Hits: 0, l3Misses: 0,
apiCalls: 0, apiTokens: 0,
};
this.startedAt = Date.now();
}
EmbeddingCacheMonitor.prototype.hit = function (layer) {
this.counters[layer + "Hits"]++;
};
EmbeddingCacheMonitor.prototype.miss = function (layer) {
this.counters[layer + "Misses"]++;
};
EmbeddingCacheMonitor.prototype.apiCall = function (tokens) {
this.counters.apiCalls++;
this.counters.apiTokens += tokens;
};
EmbeddingCacheMonitor.prototype.report = function () {
var c = this.counters;
var total = c.l1Hits + c.l1Misses;
function rate(hits, misses) {
var t = hits + misses;
return t === 0 ? "0.0%" : (hits / t * 100).toFixed(1) + "%";
}
return {
uptime: Math.round((Date.now() - this.startedAt) / 1000) + "s",
totalLookups: total,
l1HitRate: rate(c.l1Hits, c.l1Misses),
l2HitRate: rate(c.l2Hits, c.l2Misses),
l3HitRate: rate(c.l3Hits, c.l3Misses),
overallHitRate: rate(c.l1Hits + c.l2Hits + c.l3Hits, c.apiCalls),
apiCalls: c.apiCalls,
estimatedCost: "$" + (c.apiTokens / 1000000 * 0.02).toFixed(4),
};
};
// ============================================================
// Embedding Service Wrapper
// ============================================================
function EmbeddingService(apiKey, model) {
this.client = new OpenAI({ apiKey: apiKey });
this.model = model || CONFIG.model;
}
EmbeddingService.prototype.embed = function (text) {
var self = this;
return this.client.embeddings.create({
model: self.model,
input: text,
}).then(function (response) {
return response.data[0].embedding;
});
};
EmbeddingService.prototype.embedBatch = function (texts) {
var self = this;
return this.client.embeddings.create({
model: self.model,
input: texts,
}).then(function (response) {
return response.data.map(function (item) {
return item.embedding;
});
});
};
// ============================================================
// Multi-Layer Embedding Cache
// ============================================================
function EmbeddingCache(options) {
this.service = options.embeddingService;
this.db = options.db; // PostgreSQL pool
this.monitor = new EmbeddingCacheMonitor();
// L1: In-memory LRU
this.l1 = new LRUCache({
max: CONFIG.l1MaxEntries,
ttl: CONFIG.l1TtlMs,
});
// L2: Redis
this.redis = new Redis({
host: options.redisHost || "127.0.0.1",
port: options.redisPort || 6379,
password: options.redisPassword || undefined,
lazyConnect: true,
retryStrategy: function (times) {
if (times > 3) return null;
return Math.min(times * 200, 2000);
},
});
// Stats logging
var self = this;
this._statsInterval = setInterval(function () {
console.log("[EmbeddingCache] Stats:", JSON.stringify(self.monitor.report()));
}, CONFIG.statsIntervalMs);
}
EmbeddingCache.prototype.connect = function () {
return this.redis.connect().catch(function (err) {
console.warn("[EmbeddingCache] Redis unavailable, L2 disabled:", err.message);
});
};
EmbeddingCache.prototype.getEmbedding = function (text) {
var self = this;
var hash = contentHash(text, this.service.model);
// --- L1: Memory ---
var l1Result = this.l1.get(hash);
if (l1Result) {
this.monitor.hit("l1");
return Promise.resolve({ vector: l1Result, source: "l1", hash: hash });
}
this.monitor.miss("l1");
// --- L2: Redis ---
return this._getFromRedis(hash)
.then(function (l2Vector) {
if (l2Vector) {
self.l1.set(hash, l2Vector);
self.monitor.hit("l2");
return { vector: l2Vector, source: "l2", hash: hash };
}
self.monitor.miss("l2");
// --- L3: Database ---
return self._getFromDatabase(hash);
})
.then(function (result) {
if (result && result.source) return result;
if (result) {
self.l1.set(hash, result);
self._setInRedis(hash, result);
self.monitor.hit("l3");
self._incrementHitCount(hash);
return { vector: result, source: "l3", hash: hash };
}
self.monitor.miss("l3");
// --- API call ---
return self.service.embed(text).then(function (vector) {
var tokenEstimate = Math.ceil(text.split(/\s+/).length * 1.3);
self.monitor.apiCall(tokenEstimate);
self._storeInAllLayers(hash, text, vector);
return { vector: vector, source: "api", hash: hash };
});
});
};
EmbeddingCache.prototype._getFromRedis = function (hash) {
if (this.redis.status !== "ready") return Promise.resolve(null);
return this.redis.getBuffer("emb:" + hash)
.then(function (data) {
return data ? bufferToVector(data) : null;
})
.catch(function () { return null; });
};
EmbeddingCache.prototype._setInRedis = function (hash, vector) {
if (this.redis.status !== "ready") return;
var buffer = vectorToBuffer(vector);
this.redis.setex("emb:" + hash, CONFIG.l2TtlSeconds, buffer).catch(function () {});
};
EmbeddingCache.prototype._getFromDatabase = function (hash) {
if (!this.db) return Promise.resolve(null);
return this.db.query(
"SELECT vector FROM embedding_cache WHERE content_hash = $1",
[hash]
).then(function (result) {
if (result.rows.length > 0) {
return JSON.parse(result.rows[0].vector);
}
return null;
}).catch(function () { return null; });
};
EmbeddingCache.prototype._incrementHitCount = function (hash) {
if (!this.db) return;
this.db.query(
"UPDATE embedding_cache SET hit_count = hit_count + 1, last_accessed = NOW() WHERE content_hash = $1",
[hash]
).catch(function () {});
};
EmbeddingCache.prototype._storeInAllLayers = function (hash, text, vector) {
this.l1.set(hash, vector);
this._setInRedis(hash, vector);
if (this.db) {
this.db.query(
"INSERT INTO embedding_cache (content_hash, input_text, vector, model, hit_count, created_at, last_accessed) VALUES ($1, $2, $3, $4, 1, NOW(), NOW()) ON CONFLICT (content_hash) DO NOTHING",
[hash, text.substring(0, 500), JSON.stringify(vector), this.service.model]
).catch(function (err) {
console.error("[EmbeddingCache] DB write failed:", err.message);
});
}
};
// ============================================================
// Pre-computation and Warm-up
// ============================================================
EmbeddingCache.prototype.warmUp = function () {
var self = this;
var startTime = Date.now();
console.log("[EmbeddingCache] Starting cache warm-up...");
if (!this.db) {
console.log("[EmbeddingCache] No database configured, skipping warm-up");
return Promise.resolve();
}
return this.db.query(
"SELECT content_hash, vector FROM embedding_cache ORDER BY hit_count DESC LIMIT $1",
[CONFIG.warmUpLimit]
).then(function (result) {
var count = 0;
result.rows.forEach(function (row) {
var vector = JSON.parse(row.vector);
self.l1.set(row.content_hash, vector);
self._setInRedis(row.content_hash, vector);
count++;
});
var elapsed = Date.now() - startTime;
console.log("[EmbeddingCache] Warm-up complete: " + count + " entries loaded in " + elapsed + "ms");
return count;
}).catch(function (err) {
console.error("[EmbeddingCache] Warm-up failed (non-fatal):", err.message);
return 0;
});
};
EmbeddingCache.prototype.preCompute = function (texts) {
  var self = this;
  var toEmbed = [];
  var alreadyCached = 0;
  texts.forEach(function (text) {
    var hash = contentHash(text, self.service.model);
    // Note: only L1 is consulted here; entries present in Redis or the
    // database but evicted from memory will be re-embedded.
    if (self.l1.has(hash)) {
      alreadyCached++;
    } else {
      toEmbed.push({ text: text, hash: hash });
    }
  });
  console.log("[EmbeddingCache] Pre-compute: " + alreadyCached + " cached, " + toEmbed.length + " to embed");
  if (toEmbed.length === 0) return Promise.resolve({ cached: alreadyCached, embedded: 0 });
  var embedded = 0;
  // Process batches sequentially, pausing between API calls to stay under
  // provider rate limits.
  function processBatch(index) {
    var batch = toEmbed.slice(index, index + CONFIG.batchSize);
    if (batch.length === 0) {
      return Promise.resolve({ cached: alreadyCached, embedded: embedded });
    }
    var batchTexts = batch.map(function (item) { return item.text; });
    return self.service.embedBatch(batchTexts).then(function (vectors) {
      for (var i = 0; i < batch.length; i++) {
        self._storeInAllLayers(batch[i].hash, batch[i].text, vectors[i]);
        embedded++;
      }
      console.log("[EmbeddingCache] Pre-compute progress: " + embedded + "/" + toEmbed.length);
      return new Promise(function (resolve) {
        setTimeout(function () {
          resolve(processBatch(index + CONFIG.batchSize));
        }, CONFIG.batchDelayMs);
      });
    });
  }
  return processBatch(0);
};
// ============================================================
// Cleanup
// ============================================================
EmbeddingCache.prototype.shutdown = function () {
  clearInterval(this._statsInterval);
  this.redis.disconnect();
  console.log("[EmbeddingCache] Final stats:", JSON.stringify(this.monitor.report()));
};
// ============================================================
// Database Schema
// ============================================================
/*
CREATE TABLE embedding_cache (
  content_hash  VARCHAR(64) PRIMARY KEY,
  input_text    VARCHAR(500),
  vector        TEXT NOT NULL,
  model         VARCHAR(100) NOT NULL,
  hit_count     INTEGER DEFAULT 1,
  created_at    TIMESTAMP DEFAULT NOW(),
  last_accessed TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_embedding_cache_hits ON embedding_cache (hit_count DESC);
CREATE INDEX idx_embedding_cache_model ON embedding_cache (model);
*/
// ============================================================
// Usage Example
// ============================================================
/*
var service = new EmbeddingService(process.env.OPENAI_API_KEY);
var cache = new EmbeddingCache({
  embeddingService: service,
  db: pgPool,
  redisHost: "127.0.0.1",
});

cache.connect()
  .then(function () { return cache.warmUp(); })
  .then(function () {
    // Single embedding lookup
    return cache.getEmbedding("How do I deploy Node.js?");
  })
  .then(function (result) {
    console.log("Source:", result.source); // "api" on first call, "l1" after
    console.log("Vector dimensions:", result.vector.length);
    // Pre-compute a batch
    return cache.preCompute([
      "REST API authentication",
      "Docker container optimization",
      "Database migration strategies",
    ]);
  })
  .then(function (stats) {
    console.log("Pre-compute results:", stats);
  });
*/
module.exports = {
  EmbeddingCache: EmbeddingCache,
  EmbeddingService: EmbeddingService,
  EmbeddingCacheMonitor: EmbeddingCacheMonitor,
  contentHash: contentHash,
  normalizeText: normalizeText,
};
Common Issues and Troubleshooting
1. Redis Connection Refused on Startup
Error: connect ECONNREFUSED 127.0.0.1:6379
at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1595:16)
This happens when Redis is not running or the connection details are wrong. The cache module above handles it gracefully by falling through to the L3 (database) and API layers, but you lose the L2 layer entirely. Verify Redis is reachable with redis-cli ping and double-check your host and port configuration. In production, set lazyConnect: true so the app starts even if Redis is temporarily unavailable.
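A lazy-connect client along these lines keeps startup from blocking on Redis; this is a configuration sketch, and the host, port, and retry values are illustrative, not recommendations:

```javascript
var Redis = require("ioredis");

var redis = new Redis({
  host: "127.0.0.1",
  port: 6379,
  lazyConnect: true,        // don't block startup on the first connection
  maxRetriesPerRequest: 2,  // fail fast so lookups fall through to L3
  retryStrategy: function (attempt) {
    return Math.min(attempt * 200, 5000); // reconnect with capped backoff
  },
});

redis.on("error", function (err) {
  console.error("[EmbeddingCache] Redis unavailable:", err.message);
});
```

With maxRetriesPerRequest kept low, a dead Redis turns into a quick rejection that the cache can treat as a miss instead of a hung request.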
2. Memory Leak from Unbounded Cache
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
at EmbeddingCache.getEmbedding (/app/embedding-cache.js:87:22)
This happens when you cache vectors without an LRU eviction policy or a size limit. A 3072-dimension vector (text-embedding-3-large) stored as 64-bit JavaScript numbers is about 24KB; caching 100,000 of these takes roughly 2.4GB. Always set a max on your LRU cache and use sizeCalculation to account for actual memory usage. Monitor your process memory with process.memoryUsage().heapUsed.
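A size-bounded L1 can be configured along these lines. This is a sketch assuming lru-cache v10's LRUCache export; the 256MB cap and per-entry overhead estimate are illustrative:

```javascript
var LRUCache = require("lru-cache").LRUCache;

var l1 = new LRUCache({
  maxSize: 256 * 1024 * 1024, // hard cap: ~256MB of cached vector data
  sizeCalculation: function (vector) {
    // Each dimension is a 64-bit number; the constant approximates
    // per-entry object overhead.
    return vector.length * 8 + 64;
  },
});
```

Bounding by bytes rather than entry count matters because vector size varies by model: 10,000 entries of text-embedding-3-small and 10,000 of text-embedding-3-large differ by 2x in memory.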
3. Cache Key Collision from Insufficient Normalization
Symptoms: low cache hit rate (under 50%) despite users searching for similar terms. Debug by logging cache misses:
[EmbeddingCache] MISS: hash=a8f3... text="How do I deploy Node.js?"
[EmbeddingCache] MISS: hash=b2e1... text="how do i deploy node.js"
[EmbeddingCache] MISS: hash=c5d9... text="How do I deploy Node.js ?"
Three API calls for what should be one. The fix is better normalization: lowercasing, collapsing whitespace, and stripping trailing punctuation. Check your normalizeText function and log cache misses to identify patterns like these.
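A minimal normalization pass looks like this; it is a sketch of the idea, and the full normalizeText in the working example may handle more cases:

```javascript
// Collapse the cosmetic differences that fragment cache keys:
// casing, whitespace runs, and stray punctuation.
function normalizeText(text) {
  return text
    .trim()
    .toLowerCase()
    .replace(/\s+/g, " ")            // collapse runs of whitespace
    .replace(/\s+([?.!,;:])/g, "$1") // drop space before punctuation
    .replace(/[?.!,;:]+$/, "");      // strip trailing punctuation
}
```

All three queries above now normalize to "how do i deploy node.js" and share a single cache entry.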
4. Stale Embeddings After Model Upgrade
Error: Dimension mismatch: query vector has 1536 dimensions but index expects 3072
at VectorStore.search (/app/vector-store.js:45:11)
You upgraded from text-embedding-3-small (1536 dims) to text-embedding-3-large (3072 dims), but your cache still holds old vectors. If the model name is part of your cache key (as shown above), new queries will miss the cache and get fresh embeddings; the database layer, however, still contains the old vectors. You need a migration: either re-embed all documents, or filter cache queries by the model column (the schema above already has one).
5. Rate Limiting During Batch Pre-Computation
Error: 429 Too Many Requests
Rate limit exceeded. Please retry after 20 seconds.
at OpenAI._request (/app/node_modules/openai/core.js:245:15)
You are sending embedding batches faster than the API allows. Add a delay between batches (200-500ms) and implement exponential backoff on 429 responses. The batch pipeline in the working example above includes a batchDelayMs parameter for exactly this reason; start with 200ms and increase it if you still hit limits. Also reduce your batch size: 100 texts per call is reasonable, but 2000 is likely too many.
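A retry wrapper with exponential backoff can be sketched as follows. The assumption here is that the client surfaces the HTTP status on err.status, which you should verify against your provider's SDK:

```javascript
// Retry a promise-returning function on 429 responses, doubling the
// delay each attempt: baseDelayMs, 2x, 4x, ...
function withBackoff(fn, maxRetries, baseDelayMs) {
  function attempt(n) {
    return fn().catch(function (err) {
      if (err.status !== 429 || n >= maxRetries) throw err;
      var delay = baseDelayMs * Math.pow(2, n);
      return new Promise(function (resolve) {
        setTimeout(function () { resolve(attempt(n + 1)); }, delay);
      });
    });
  }
  return attempt(0);
}
```

Wrap each embedBatch call in this, e.g. withBackoff(function () { return service.embedBatch(texts); }, 5, 200), and non-429 errors still fail immediately.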
Best Practices
Always include the model name in cache keys. Embeddings from different models are incompatible. When you switch models, content-addressed keys naturally invalidate. Skip this and you will spend hours debugging dimension mismatches.
Normalize input text before hashing. Whitespace, casing, and trailing punctuation differences generate different cache keys for semantically identical inputs. A simple normalize function can boost your hit rate from 50% to 90%.
Use float32 for storage, float64 for computation. Storing vectors as float32 halves your memory and Redis usage with negligible accuracy loss. Convert back to float64 only if your vector math library requires it.
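The float32 round trip is a few lines with plain Node buffers; a sketch:

```javascript
// Pack a vector of JS numbers (float64) into a float32 buffer: 4 bytes
// per dimension instead of 8.
function toFloat32Buffer(vector) {
  return Buffer.from(new Float32Array(vector).buffer);
}

// Expand a stored float32 buffer back into a plain array of numbers.
function fromFloat32Buffer(buf) {
  var f32 = new Float32Array(buf.buffer, buf.byteOffset, buf.byteLength / 4);
  return Array.from(f32);
}
```

A 1536-dimension vector drops from 12KB to 6KB before any compression; embedding components are typically small magnitudes where float32 precision is more than enough for cosine similarity.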
Warm your cache on application start, ordered by hit count. Load the top 5,000 most-accessed embeddings into memory before accepting traffic. This eliminates the cold-start latency penalty after deployments.
Set generous TTLs on embedding caches. Embeddings are deterministic — same input and model always produce the same output. There is no reason to expire them after an hour. Use 7-30 day TTLs and save the API calls.
Monitor cache hit rates per layer and set alerts. A sudden drop in hit rate indicates a problem — maybe your normalization broke, or a new traffic pattern is not being cached. Alert on hit rates below 60%.
Pre-compute embeddings for predictable content eagerly, cache user queries lazily. Your own documents, categories, and FAQ answers should be embedded at write time. User queries build their cache naturally over time.
Compress vectors before storing in Redis. Float32 + gzip reduces a 1536-dim vector from 12KB to roughly 3.5KB. For 100,000 cached vectors, that is 850MB saved in Redis.
Gracefully degrade when cache layers are unavailable. If Redis goes down, fall through to database and API. If the database is unavailable, still serve from memory cache and API. Never let a cache failure become an application failure.
Track and log estimated cost savings. Showing stakeholders "$847 saved this month from embedding caching" is the easiest way to justify the infrastructure investment. The cache monitor in the working example calculates this automatically.
References
- OpenAI Embeddings API Documentation — Official embedding models, pricing, and rate limits
- lru-cache npm Package — High-performance LRU cache for Node.js with TTL and size-based eviction
- ioredis Documentation — Full-featured Redis client for Node.js with cluster and sentinel support
- pako (zlib for JavaScript) — High-speed compression library used for vector storage optimization
- Content-Addressable Storage (Wikipedia) — The design pattern behind content-hash caching
- Redis Memory Optimization — Official guide to reducing Redis memory usage