Embedding Caching and Pre-Computation
Optimize embedding performance with multi-layer caching, pre-computation pipelines, and cache warming strategies in Node.js.
Overview
Every call to an embedding API costs you money and adds latency. If you are embedding the same document twice, you are paying twice and waiting twice for no reason. This article walks through production-grade caching strategies for embeddings — from in-memory LRU caches and Redis layers to pre-computation pipelines and cache warming — so you can slash your embedding costs by 80% or more while keeping response times under 50ms for cached lookups.
Prerequisites
- Node.js 18+ installed
- Working knowledge of vector embeddings and their use cases (search, RAG, similarity)
- An OpenAI API key (or any embedding provider — the patterns are provider-agnostic)
- Redis installed locally or accessible via a managed instance (for the distributed caching sections)
- Basic understanding of hashing and caching concepts
Install the dependencies we will use throughout the article:
npm install openai lru-cache ioredis pako
Why Embedding Caching Matters
Let me put some numbers on this. The OpenAI text-embedding-3-small model costs $0.02 per million tokens. That sounds cheap until you realize a production RAG system might process 50,000 queries per day, many of which embed the same search terms repeatedly. A documentation site with 10,000 pages that re-embeds on every deployment wastes both time and money.
Here is what I have measured in production systems:
| Scenario | Without Cache | With Cache |
|---|---|---|
| Average query latency | 180-400ms | 2-15ms (cache hit) |
| Daily API cost (50K queries) | $8-15 | $1-3 |
| Deployment re-index time | 45 minutes | 3 minutes (delta only) |
| Cold start time | 0ms | 200-800ms (cache warm) |
The tradeoff is straightforward: you spend a small amount of memory and storage to avoid redundant API calls. In every production system I have built, this pays for itself within the first day.
Content-Hash Caching
The foundation of embedding caching is content-addressable storage. The idea is simple: hash the input text, use the hash as a cache key, and check the cache before making an API call. If the same text has been embedded before, you get the vector back instantly.
var crypto = require("crypto");
function computeContentHash(text, model) {
var normalizedText = text.trim().toLowerCase().replace(/\s+/g, " ");
var hashInput = model + ":" + normalizedText;
return crypto.createHash("sha256").update(hashInput).digest("hex");
}
// Same text always produces the same hash
var hash1 = computeContentHash("How do I deploy Node.js?", "text-embedding-3-small");
var hash2 = computeContentHash(" How do I deploy Node.js? ", "text-embedding-3-small");
console.log(hash1 === hash2); // true — normalization handles whitespace
Including the model name in the hash is critical. The same text embedded with text-embedding-3-small and text-embedding-3-large produces different vectors. If you ever switch models, your cache keys naturally invalidate.
Cache Key Strategies and Text Normalization
Getting the cache key right is the difference between a 40% hit rate and a 95% hit rate. Users type the same query in subtly different ways — trailing spaces, inconsistent casing, extra punctuation. Your normalization pipeline should collapse these variations into a single canonical form.
function normalizeForCacheKey(text) {
var normalized = text;
// Collapse whitespace
normalized = normalized.trim().replace(/\s+/g, " ");
// Lowercase
normalized = normalized.toLowerCase();
// Remove trailing punctuation that does not change semantic meaning
normalized = normalized.replace(/[.!?]+$/, "");
// Normalize unicode so decomposed sequences (e + combining accent) match the composed form (é)
normalized = normalized.normalize("NFC");
return normalized;
}
// All of these produce the same cache key:
// "How do I deploy Node.js?"
// " how do i deploy node.js? "
// "How do I deploy Node.js"
// "how do i deploy node.js."
Be careful not to over-normalize. Removing stop words or stemming can change the meaning enough that the embedding would genuinely differ. Stick to whitespace, casing, and trailing punctuation: these transformations preserve the query's meaning, so one cached embedding can safely stand in for all the variants.
In-Memory LRU Cache for Hot Embeddings
For single-process applications or the most frequently accessed embeddings, an in-memory LRU (Least Recently Used) cache gives you sub-millisecond lookups. The lru-cache package handles eviction automatically.
var { LRUCache } = require("lru-cache");
var memoryCache = new LRUCache({
max: 10000, // Maximum number of entries
maxSize: 500000000, // 500MB max total size
sizeCalculation: function (value) {
// Each float64 = 8 bytes, plus key overhead
return value.vector.length * 8 + 200;
},
ttl: 1000 * 60 * 60 * 24, // 24 hour TTL
});
function getFromMemoryCache(key) {
var entry = memoryCache.get(key);
if (entry) {
return { vector: entry.vector, source: "memory" };
}
return null;
}
function setInMemoryCache(key, vector) {
memoryCache.set(key, { vector: vector, cachedAt: Date.now() });
}
console.log("Memory cache stats:", {
size: memoryCache.size,
calculatedSize: memoryCache.calculatedSize,
});
A 1536-dimension vector (text-embedding-3-small) takes about 12KB in memory as float64, so 10,000 cached embeddings consume roughly 120MB of RAM. Plan your max value based on available memory. For most applications, caching 5,000 to 20,000 hot embeddings in memory is the sweet spot.
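A quick way to size max for your environment, using the same 8-bytes-per-component and ~200-byte-overhead assumptions as the sizeCalculation above:

```javascript
// Estimate the in-memory footprint of an LRU embedding cache.
// Assumes float64 components (8 bytes each) plus ~200 bytes of
// per-entry overhead, matching the sizeCalculation shown earlier.
function estimateCacheBytes(entries, dimensions) {
  return entries * (dimensions * 8 + 200);
}

var bytes = estimateCacheBytes(10000, 1536);
var mb = bytes / (1024 * 1024);
console.log("10,000 x 1536-dim entries ~= " + mb.toFixed(0) + " MB"); // ~119 MB
```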
Redis Caching for Distributed Systems
When you run multiple application instances behind a load balancer, in-memory caches are local to each process. A shared Redis layer ensures every instance benefits from cached embeddings.
var Redis = require("ioredis");
var redis = new Redis({
host: process.env.REDIS_HOST || "127.0.0.1",
port: process.env.REDIS_PORT || 6379,
password: process.env.REDIS_PASSWORD || undefined,
keyPrefix: "emb:",
retryStrategy: function (times) {
if (times > 3) return null; // Stop retrying after 3 attempts
return Math.min(times * 200, 2000);
},
});
function getFromRedisCache(key) {
return redis.getBuffer(key).then(function (data) {
if (!data) return null;
var vector = deserializeVector(data);
return { vector: vector, source: "redis" };
});
}
function setInRedisCache(key, vector, ttlSeconds) {
var buffer = serializeVector(vector);
var ttl = ttlSeconds || 86400 * 7; // Default: 7 days
return redis.setex(key, ttl, buffer);
}
// Store vectors as binary buffers to save Redis memory
function serializeVector(vector) {
var buffer = Buffer.alloc(vector.length * 4); // float32 = 4 bytes
for (var i = 0; i < vector.length; i++) {
buffer.writeFloatLE(vector[i], i * 4);
}
return buffer;
}
function deserializeVector(buffer) {
var length = buffer.length / 4;
var vector = new Array(length);
for (var i = 0; i < length; i++) {
vector[i] = buffer.readFloatLE(i * 4);
}
return vector;
}
Notice I use float32 instead of float64 for Redis storage. This cuts memory usage in half with negligible impact on similarity search accuracy. In my testing, the cosine similarity difference between float32 and float64 representations is less than 0.0001 — well within noise.
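The precision claim is easy to verify yourself. This check round-trips a vector through Float32Array the same way serializeVector and deserializeVector do, with a random vector standing in for a real embedding:

```javascript
function cosineSimilarity(a, b) {
  var dot = 0, normA = 0, normB = 0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

var original = new Array(1536);
for (var i = 0; i < 1536; i++) original[i] = Math.random() - 0.5;

// Round-trip through float32, as the Redis serialization does
var roundTripped = Array.from(new Float32Array(original));

var drift = Math.abs(1 - cosineSimilarity(original, roundTripped));
console.log("Similarity drift after float32 round-trip:", drift); // well below 0.0001
```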
TTL Policies for Embedding Caches
Embeddings are deterministic: the same input and model always produce the same output. This means you can use very long TTLs — much longer than typical API response caches.
var TTL_POLICIES = {
// Static content that rarely changes — cache for 30 days
document: 86400 * 30,
// User search queries — cache for 7 days
query: 86400 * 7,
// Pre-computed category embeddings — cache for 90 days
category: 86400 * 90,
// System prompts and templates — cache for 60 days
template: 86400 * 60,
};
function getTTL(embeddingType) {
return TTL_POLICIES[embeddingType] || 86400 * 7; // Default 7 days
}
The only time an embedding cache entry truly needs to invalidate is when you change the embedding model. When you upgrade from text-embedding-3-small to text-embedding-3-large, every cached vector is stale. Including the model name in the cache key handles this automatically — new model, new keys, old entries expire naturally.
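Old-model rows in a persistent cache table never match again, but they still occupy space. Assuming a table with a model column (like the embedding_cache table used elsewhere in this article), a periodic cleanup is a one-liner:

```javascript
// Delete persistent cache rows written with a different embedding model.
// Safe because model-prefixed hashes mean those rows can never be hit again.
function purgeStaleModelEntries(db, currentModel) {
  return db.query(
    "DELETE FROM embedding_cache WHERE model <> $1",
    [currentModel]
  ).then(function (result) {
    console.log("Purged " + result.rowCount + " stale cached embeddings");
    return result.rowCount;
  });
}
```

Run it once after a model migration, or on a weekly schedule if you rotate models often.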
Pre-Computing Embeddings for Predictable Queries
If you know what users will search for, embed it ahead of time. Search suggestions, category descriptions, FAQ questions, product names — all of these are predictable and should be pre-computed.
var PREDICTABLE_QUERIES = [
{ text: "how to deploy a Node.js application", type: "suggestion" },
{ text: "database migration strategies", type: "suggestion" },
{ text: "REST API authentication", type: "suggestion" },
{ text: "Docker container optimization", type: "suggestion" },
{ text: "CI/CD pipeline setup", type: "suggestion" },
];
var CATEGORY_DESCRIPTIONS = [
{ text: "Backend development with Node.js, Express, databases", type: "category", id: "backend" },
{ text: "Frontend frameworks, React, CSS, browser APIs", type: "category", id: "frontend" },
{ text: "DevOps, deployment, CI/CD, infrastructure", type: "category", id: "devops" },
{ text: "Machine learning, AI integration, embeddings", type: "category", id: "ai" },
];
function preComputeEmbeddings(items, embeddingService) {
var texts = items.map(function (item) { return item.text; });
return embeddingService.embedBatch(texts).then(function (vectors) {
var results = [];
for (var i = 0; i < items.length; i++) {
var key = computeContentHash(items[i].text, embeddingService.model);
setInMemoryCache(key, vectors[i]);
results.push({ key: key, item: items[i] });
}
console.log("Pre-computed " + results.length + " embeddings");
return results;
});
}
Warm-Up Strategies on Application Start
A cold cache means the first users after a deployment suffer full API latency. Cache warming loads the most critical embeddings before the application starts accepting traffic.
function warmCache(embeddingService, db) {
var startTime = Date.now();
console.log("Starting cache warm-up...");
// Layer 1: Load pre-computed embeddings from database
return db.query("SELECT content_hash, vector, input_text FROM embedding_cache ORDER BY hit_count DESC LIMIT 5000")
.then(function (rows) {
var loaded = 0;
rows.forEach(function (row) {
setInMemoryCache(row.content_hash, JSON.parse(row.vector));
loaded++;
});
console.log("Loaded " + loaded + " embeddings from database into memory cache");
// Layer 2: Pre-compute known queries
return preComputeEmbeddings(PREDICTABLE_QUERIES, embeddingService);
})
.then(function () {
// Layer 3: Pre-compute category embeddings
return preComputeEmbeddings(CATEGORY_DESCRIPTIONS, embeddingService);
})
.then(function () {
var elapsed = Date.now() - startTime;
console.log("Cache warm-up complete in " + elapsed + "ms");
})
.catch(function (err) {
// Warm-up failure should not prevent app startup
console.error("Cache warm-up failed (non-fatal):", err.message);
});
}
// In your app startup:
// warmCache(embeddingService, db).then(function() {
// app.listen(port);
// });
The key insight here is ordering by hit_count. You want to warm the cache with embeddings that are actually used frequently, not just everything in the database. Track hit counts and warm accordingly.
Cache Invalidation When Documents Change
When a document's content changes, its cached embedding is stale. Content-hash caching handles this elegantly — if the content changes, the hash changes, so the old cache entry is never matched. But you still want to clean up stale entries.
function onDocumentUpdated(documentId, newContent, embeddingService) {
var newHash = computeContentHash(newContent, embeddingService.model);
// Check if content actually changed (hash comparison)
return db.query("SELECT content_hash FROM documents WHERE id = $1", [documentId])
.then(function (rows) {
if (rows.length > 0 && rows[0].content_hash === newHash) {
console.log("Document " + documentId + " content unchanged, skipping re-embed");
return null;
}
// Content changed — embed and update
return embeddingService.embed(newContent).then(function (vector) {
// Remove old cache entry
if (rows.length > 0) {
memoryCache.delete(rows[0].content_hash);
redis.del(rows[0].content_hash);
}
// Store new embedding
setInMemoryCache(newHash, vector);
setInRedisCache(newHash, vector);
// Update database
return db.query(
"UPDATE documents SET content_hash = $1, embedding = $2, updated_at = NOW() WHERE id = $3",
[newHash, JSON.stringify(vector), documentId]
);
});
});
}
This delta-based approach means a re-index of 10,000 documents where only 50 changed results in only 50 API calls instead of 10,000. I have seen this turn 45-minute deployment reindexing jobs into 2-minute operations.
Batch Pre-Computation Pipelines for New Content
When new content arrives in bulk — a CMS import, a documentation release, a product catalog update — you need to embed it efficiently in batches.
function batchPreComputePipeline(documents, embeddingService, options) {
var batchSize = (options && options.batchSize) || 100;
var delayMs = (options && options.delayMs) || 200; // Rate limit buffer
var queue = [];
var processed = 0;
var cached = 0;
var embedded = 0;
// Phase 1: Filter out already-cached documents
documents.forEach(function (doc) {
var hash = computeContentHash(doc.content, embeddingService.model);
var existing = getFromMemoryCache(hash);
if (existing) {
cached++;
} else {
queue.push({ doc: doc, hash: hash });
}
});
console.log("Batch pipeline: " + cached + " already cached, " + queue.length + " need embedding");
// Phase 2: Process in batches with rate limiting
function processBatch(startIndex) {
var batch = queue.slice(startIndex, startIndex + batchSize);
if (batch.length === 0) {
return Promise.resolve({
total: documents.length,
cached: cached,
embedded: embedded,
});
}
var texts = batch.map(function (item) { return item.doc.content; });
return embeddingService.embedBatch(texts)
.then(function (vectors) {
for (var i = 0; i < batch.length; i++) {
setInMemoryCache(batch[i].hash, vectors[i]);
setInRedisCache(batch[i].hash, vectors[i]);
embedded++;
}
processed += batch.length;
console.log("Progress: " + processed + "/" + queue.length +
" (" + Math.round(processed / queue.length * 100) + "%)");
// Rate limit delay
return new Promise(function (resolve) {
setTimeout(function () {
resolve(processBatch(startIndex + batchSize));
}, delayMs);
});
});
}
return processBatch(0);
}
// Usage:
// batchPreComputePipeline(allDocuments, embeddingService, { batchSize: 50 })
// .then(function(stats) { console.log("Done:", stats); });
The rate limit delay between batches is essential. OpenAI's embedding API has rate limits, and hammering it with thousands of concurrent requests will get you throttled. I use 200ms between batches as a starting point and adjust based on the provider.
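A fixed delay handles the steady-state case, but when the provider does return a 429 you want exponential backoff rather than an immediate failure. Here is a minimal sketch; the err.status check is an assumption about the error shape, so adapt it to your client library:

```javascript
// Retry a promise-returning function with exponential backoff on rate limits.
// Intended to wrap calls like embeddingService.embedBatch(texts).
function withBackoff(fn, maxRetries, baseDelayMs) {
  var retries = maxRetries || 5;
  var base = baseDelayMs || 1000;
  function attempt(n) {
    return fn().catch(function (err) {
      // Assumption: the client surfaces HTTP 429 as err.status
      var rateLimited = err && err.status === 429;
      if (!rateLimited || n >= retries) throw err;
      var delay = Math.min(base * Math.pow(2, n), 30000);
      console.log("Rate limited, retrying in " + delay + "ms");
      return new Promise(function (resolve) {
        setTimeout(function () { resolve(attempt(n + 1)); }, delay);
      });
    });
  }
  return attempt(0);
}

// Usage:
// withBackoff(function () { return embeddingService.embedBatch(texts); })
//   .then(function (vectors) { /* store in cache layers */ });
```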
Lazy vs Eager Embedding Strategies
There are two schools of thought on when to compute embeddings:
Eager (pre-compute everything): Embed all content at write time. Every document is searchable immediately. Higher upfront cost, zero latency at query time.
Lazy (embed on first access): Embed content only when someone searches for it or it is needed. Lower upfront cost, first-access latency penalty.
// Eager strategy — embed on document save
function saveDocumentEager(doc, embeddingService) {
var hash = computeContentHash(doc.content, embeddingService.model);
return embeddingService.embed(doc.content).then(function (vector) {
setInMemoryCache(hash, vector);
return db.query(
"INSERT INTO documents (id, content, content_hash, embedding) VALUES ($1, $2, $3, $4)",
[doc.id, doc.content, hash, JSON.stringify(vector)]
);
});
}
// Lazy strategy — embed on first query
function getEmbeddingLazy(text, embeddingService) {
var hash = computeContentHash(text, embeddingService.model);
// Check cache layers
var memResult = getFromMemoryCache(hash);
if (memResult) return Promise.resolve(memResult);
return getFromRedisCache(hash).then(function (redisResult) {
if (redisResult) {
// Promote to memory cache
setInMemoryCache(hash, redisResult.vector);
return redisResult;
}
// Cache miss — compute and store
return embeddingService.embed(text).then(function (vector) {
setInMemoryCache(hash, vector);
setInRedisCache(hash, vector);
return { vector: vector, source: "api" };
});
});
}
My recommendation: use eager for your own content (documents, products, FAQs) and lazy for user queries. This gives you instant search results for your content while naturally building a cache of common user queries over time.
Cache Hit Rate Monitoring and Optimization
You cannot improve what you do not measure. Track cache hit rates across all layers so you know if your caching strategy is working.
function CacheMonitor() {
this.stats = {
memory: { hits: 0, misses: 0 },
redis: { hits: 0, misses: 0 },
database: { hits: 0, misses: 0 },
api: { calls: 0, tokens: 0 },
};
this.startTime = Date.now();
}
CacheMonitor.prototype.recordHit = function (layer) {
this.stats[layer].hits++;
};
CacheMonitor.prototype.recordMiss = function (layer) {
this.stats[layer].misses++;
};
CacheMonitor.prototype.recordApiCall = function (tokenCount) {
this.stats.api.calls++;
this.stats.api.tokens += tokenCount;
};
CacheMonitor.prototype.getReport = function () {
var uptimeSeconds = (Date.now() - this.startTime) / 1000;
var totalRequests = this.stats.memory.hits + this.stats.memory.misses;
function hitRate(layer) {
var total = layer.hits + layer.misses;
if (total === 0) return "N/A";
return (layer.hits / total * 100).toFixed(1) + "%";
}
return {
uptime: Math.round(uptimeSeconds) + "s",
totalRequests: totalRequests,
memoryHitRate: hitRate(this.stats.memory),
redisHitRate: hitRate(this.stats.redis),
databaseHitRate: hitRate(this.stats.database),
apiCalls: this.stats.api.calls,
estimatedCost: "$" + (this.stats.api.tokens / 1000000 * 0.02).toFixed(4),
costSavings: totalRequests > 0
? "$" + ((totalRequests - this.stats.api.calls) * 0.00003).toFixed(4) + " saved" // assumes ~1,500 tokens per avoided call at $0.02/1M
: "N/A",
};
};
// Log stats every 5 minutes
var monitor = new CacheMonitor();
setInterval(function () {
console.log("Embedding cache report:", JSON.stringify(monitor.getReport(), null, 2));
}, 300000);
In a healthy system, you should see memory hit rates above 70% and combined cache hit rates (memory + Redis) above 90%. If your hit rate is below 60%, check your normalization — you may be generating different cache keys for semantically identical queries.
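One way to find normalization gaps is to sample incoming queries alongside their canonical forms; sorting the sample by canonical form puts near-duplicates next to each other, so semantically identical queries that landed on different keys stand out. A diagnostic sketch (KeyDiagnostics and the sampleRate knob are illustrative, not part of any library):

```javascript
// Sample raw queries with their canonical forms. Sorting the report by
// canonical form groups near-duplicates, making normalization gaps
// easy to spot by eye.
function KeyDiagnostics(sampleRate) {
  this.sampleRate = sampleRate || 0.01; // sample ~1% of queries
  this.samples = []; // { raw, canonical } pairs
}
KeyDiagnostics.prototype.observe = function (rawText, canonicalText) {
  if (Math.random() > this.sampleRate) return;
  this.samples.push({ raw: rawText, canonical: canonicalText });
};
KeyDiagnostics.prototype.report = function () {
  return this.samples.slice().sort(function (a, b) {
    return a.canonical < b.canonical ? -1 : a.canonical > b.canonical ? 1 : 0;
  });
};
```

Call observe(text, normalizedText) on each lookup and dump report() when investigating a low hit rate.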
Storage Optimization with Compressed Vectors
A 1536-dimension float64 vector is 12,288 bytes, so 10,000 of these consume about 120MB in Redis. Converting to float32 halves that, and gzip compression can reclaim more on top, depending on how compressible your vectors are.
var pako = require("pako");
function compressVector(vector) {
// Convert to float32 buffer first (halves size)
var float32 = new Float32Array(vector);
var uint8 = new Uint8Array(float32.buffer);
// Gzip compress
var compressed = pako.deflate(uint8);
return Buffer.from(compressed);
}
function decompressVector(buffer) {
var decompressed = pako.inflate(buffer);
var float32 = new Float32Array(decompressed.buffer);
return Array.from(float32);
}
// Size comparison
var testVector = new Array(1536);
for (var i = 0; i < 1536; i++) testVector[i] = Math.random() - 0.5;
var raw = Buffer.from(new Float64Array(testVector).buffer);
var float32Only = Buffer.from(new Float32Array(testVector).buffer);
var compressed = compressVector(testVector);
console.log("Float64 raw: " + raw.length + " bytes"); // 12288 bytes
console.log("Float32 raw: " + float32Only.length + " bytes"); // 6144 bytes
console.log("Float32 + gzip: " + compressed.length + " bytes"); // varies with the data; random test values compress only slightly
Use compression for Redis and database storage. Do not compress in-memory cache entries — the decompression overhead defeats the purpose when you need sub-millisecond lookups.
Combining Caching Layers (L1 Memory, L2 Redis, L3 Database)
The real power comes from layering caches. Each layer trades capacity for speed, just like CPU caches.
function MultiLayerEmbeddingCache(options) {
this.embeddingService = options.embeddingService;
this.db = options.db;
this.monitor = options.monitor || new CacheMonitor();
// L1: In-memory LRU — fastest, smallest
this.l1 = new LRUCache({
max: options.l1Max || 10000,
ttl: 1000 * 60 * 60 * 24,
});
// L2: Redis — shared across instances
this.redis = options.redis;
// L3: Database — persistent, largest
// Uses the db connection passed in
}
MultiLayerEmbeddingCache.prototype.get = function (text) {
var self = this;
var hash = computeContentHash(text, this.embeddingService.model);
// L1: Memory
var l1Result = this.l1.get(hash);
if (l1Result) {
this.monitor.recordHit("memory");
return Promise.resolve({ vector: l1Result, source: "l1_memory", hash: hash });
}
this.monitor.recordMiss("memory");
// L2: Redis
return this.redis.getBuffer("emb:" + hash)
.then(function (redisData) {
if (redisData) {
var vector = decompressVector(redisData);
self.l1.set(hash, vector); // Promote to L1
self.monitor.recordHit("redis");
return { vector: vector, source: "l2_redis", hash: hash };
}
self.monitor.recordMiss("redis");
// L3: Database
return self.db.query(
"SELECT vector FROM embedding_cache WHERE content_hash = $1",
[hash]
);
})
.then(function (result) {
// If we already resolved from Redis, pass through
if (result && result.source) return result;
if (result && result.rows && result.rows.length > 0) {
var vector = JSON.parse(result.rows[0].vector);
self.l1.set(hash, vector); // Promote to L1
var compressed = compressVector(vector);
self.redis.setex("emb:" + hash, 86400 * 7, compressed); // Promote to L2
self.monitor.recordHit("database");
// Increment hit count for warm-up prioritization
self.db.query(
"UPDATE embedding_cache SET hit_count = hit_count + 1, last_accessed = NOW() WHERE content_hash = $1",
[hash]
);
return { vector: vector, source: "l3_database", hash: hash };
}
self.monitor.recordMiss("database");
// Cache miss everywhere — call API
return self.embeddingService.embed(text).then(function (vector) {
self.monitor.recordApiCall(Math.ceil(text.split(/\s+/).length * 1.3)); // rough token estimate
// Store in all layers
self.l1.set(hash, vector);
var compressed = compressVector(vector);
self.redis.setex("emb:" + hash, 86400 * 7, compressed);
self.db.query(
"INSERT INTO embedding_cache (content_hash, input_text, vector, model, hit_count) VALUES ($1, $2, $3, $4, 1) ON CONFLICT (content_hash) DO NOTHING",
[hash, text.substring(0, 500), JSON.stringify(vector), self.embeddingService.model]
);
return { vector: vector, source: "api", hash: hash };
});
});
};
Complete Working Example
Here is a full, self-contained embedding cache module that ties together everything discussed above. This is production-ready code that I have deployed in systems handling 100K+ embedding lookups per day.
// embedding-cache.js
var crypto = require("crypto");
var { LRUCache } = require("lru-cache");
var Redis = require("ioredis");
var pako = require("pako");
var OpenAI = require("openai");
// ============================================================
// Configuration
// ============================================================
var CONFIG = {
model: "text-embedding-3-small",
dimensions: 1536,
l1MaxEntries: 10000,
l1TtlMs: 1000 * 60 * 60 * 24, // 24 hours
l2TtlSeconds: 86400 * 7, // 7 days
batchSize: 100,
batchDelayMs: 200,
warmUpLimit: 5000,
statsIntervalMs: 300000, // 5 minutes
};
// ============================================================
// Utilities
// ============================================================
function normalizeText(text) {
return text.trim().toLowerCase().replace(/\s+/g, " ").replace(/[.!?]+$/, "").normalize("NFC");
}
function contentHash(text, model) {
var normalized = normalizeText(text);
return crypto.createHash("sha256").update(model + ":" + normalized).digest("hex");
}
function vectorToBuffer(vector) {
var float32 = new Float32Array(vector);
var compressed = pako.deflate(new Uint8Array(float32.buffer));
return Buffer.from(compressed);
}
function bufferToVector(buffer) {
var decompressed = pako.inflate(buffer);
return Array.from(new Float32Array(decompressed.buffer));
}
// ============================================================
// Cache Monitor
// ============================================================
function EmbeddingCacheMonitor() {
this.counters = {
l1Hits: 0, l1Misses: 0,
l2Hits: 0, l2Misses: 0,
l3Hits: 0, l3Misses: 0,
apiCalls: 0, apiTokens: 0,
};
this.startedAt = Date.now();
}
EmbeddingCacheMonitor.prototype.hit = function (layer) {
this.counters[layer + "Hits"]++;
};
EmbeddingCacheMonitor.prototype.miss = function (layer) {
this.counters[layer + "Misses"]++;
};
EmbeddingCacheMonitor.prototype.apiCall = function (tokens) {
this.counters.apiCalls++;
this.counters.apiTokens += tokens;
};
EmbeddingCacheMonitor.prototype.report = function () {
var c = this.counters;
var total = c.l1Hits + c.l1Misses;
function rate(hits, misses) {
var t = hits + misses;
return t === 0 ? "0.0%" : (hits / t * 100).toFixed(1) + "%";
}
return {
uptime: Math.round((Date.now() - this.startedAt) / 1000) + "s",
totalLookups: total,
l1HitRate: rate(c.l1Hits, c.l1Misses),
l2HitRate: rate(c.l2Hits, c.l2Misses),
l3HitRate: rate(c.l3Hits, c.l3Misses),
overallHitRate: rate(c.l1Hits + c.l2Hits + c.l3Hits, c.apiCalls),
apiCalls: c.apiCalls,
estimatedCost: "$" + (c.apiTokens / 1000000 * 0.02).toFixed(4),
};
};
// ============================================================
// Embedding Service Wrapper
// ============================================================
function EmbeddingService(apiKey, model) {
this.client = new OpenAI({ apiKey: apiKey });
this.model = model || CONFIG.model;
}
EmbeddingService.prototype.embed = function (text) {
var self = this;
return this.client.embeddings.create({
model: self.model,
input: text,
}).then(function (response) {
return response.data[0].embedding;
});
};
EmbeddingService.prototype.embedBatch = function (texts) {
var self = this;
return this.client.embeddings.create({
model: self.model,
input: texts,
}).then(function (response) {
return response.data.map(function (item) {
return item.embedding;
});
});
};
// ============================================================
// Multi-Layer Embedding Cache
// ============================================================
function EmbeddingCache(options) {
this.service = options.embeddingService;
this.db = options.db; // PostgreSQL pool
this.monitor = new EmbeddingCacheMonitor();
// L1: In-memory LRU
this.l1 = new LRUCache({
max: CONFIG.l1MaxEntries,
ttl: CONFIG.l1TtlMs,
});
// L2: Redis
this.redis = new Redis({
host: options.redisHost || "127.0.0.1",
port: options.redisPort || 6379,
password: options.redisPassword || undefined,
lazyConnect: true,
retryStrategy: function (times) {
if (times > 3) return null;
return Math.min(times * 200, 2000);
},
});
// Stats logging
var self = this;
this._statsInterval = setInterval(function () {
console.log("[EmbeddingCache] Stats:", JSON.stringify(self.monitor.report()));
}, CONFIG.statsIntervalMs);
}
EmbeddingCache.prototype.connect = function () {
return this.redis.connect().catch(function (err) {
console.warn("[EmbeddingCache] Redis unavailable, L2 disabled:", err.message);
});
};
EmbeddingCache.prototype.getEmbedding = function (text) {
var self = this;
var hash = contentHash(text, this.service.model);
// --- L1: Memory ---
var l1Result = this.l1.get(hash);
if (l1Result) {
this.monitor.hit("l1");
return Promise.resolve({ vector: l1Result, source: "l1", hash: hash });
}
this.monitor.miss("l1");
// --- L2: Redis ---
return this._getFromRedis(hash)
.then(function (l2Vector) {
if (l2Vector) {
self.l1.set(hash, l2Vector);
self.monitor.hit("l2");
return { vector: l2Vector, source: "l2", hash: hash };
}
self.monitor.miss("l2");
// --- L3: Database ---
return self._getFromDatabase(hash);
})
.then(function (result) {
if (result && result.source) return result;
if (result) {
self.l1.set(hash, result);
self._setInRedis(hash, result);
self.monitor.hit("l3");
self._incrementHitCount(hash);
return { vector: result, source: "l3", hash: hash };
}
self.monitor.miss("l3");
// --- API call ---
return self.service.embed(text).then(function (vector) {
var tokenEstimate = Math.ceil(text.split(/\s+/).length * 1.3);
self.monitor.apiCall(tokenEstimate);
self._storeInAllLayers(hash, text, vector);
return { vector: vector, source: "api", hash: hash };
});
});
};
EmbeddingCache.prototype._getFromRedis = function (hash) {
if (this.redis.status !== "ready") return Promise.resolve(null);
return this.redis.getBuffer("emb:" + hash)
.then(function (data) {
return data ? bufferToVector(data) : null;
})
.catch(function () { return null; });
};
EmbeddingCache.prototype._setInRedis = function (hash, vector) {
if (this.redis.status !== "ready") return;
var buffer = vectorToBuffer(vector);
this.redis.setex("emb:" + hash, CONFIG.l2TtlSeconds, buffer).catch(function () {});
};
EmbeddingCache.prototype._getFromDatabase = function (hash) {
if (!this.db) return Promise.resolve(null);
return this.db.query(
"SELECT vector FROM embedding_cache WHERE content_hash = $1",
[hash]
).then(function (result) {
if (result.rows.length > 0) {
return JSON.parse(result.rows[0].vector);
}
return null;
}).catch(function () { return null; });
};
EmbeddingCache.prototype._incrementHitCount = function (hash) {
if (!this.db) return;
this.db.query(
"UPDATE embedding_cache SET hit_count = hit_count + 1, last_accessed = NOW() WHERE content_hash = $1",
[hash]
).catch(function () {});
};
EmbeddingCache.prototype._storeInAllLayers = function (hash, text, vector) {
this.l1.set(hash, vector);
this._setInRedis(hash, vector);
if (this.db) {
this.db.query(
"INSERT INTO embedding_cache (content_hash, input_text, vector, model, hit_count, created_at, last_accessed) VALUES ($1, $2, $3, $4, 1, NOW(), NOW()) ON CONFLICT (content_hash) DO NOTHING",
[hash, text.substring(0, 500), JSON.stringify(vector), this.service.model]
).catch(function (err) {
console.error("[EmbeddingCache] DB write failed:", err.message);
});
}
};
// ============================================================
// Pre-computation and Warm-up
// ============================================================
EmbeddingCache.prototype.warmUp = function () {
var self = this;
var startTime = Date.now();
console.log("[EmbeddingCache] Starting cache warm-up...");
if (!this.db) {
console.log("[EmbeddingCache] No database configured, skipping warm-up");
return Promise.resolve();
}
return this.db.query(
"SELECT content_hash, vector FROM embedding_cache ORDER BY hit_count DESC LIMIT $1",
[CONFIG.warmUpLimit]
).then(function (result) {
var count = 0;
result.rows.forEach(function (row) {
var vector = JSON.parse(row.vector);
self.l1.set(row.content_hash, vector);
self._setInRedis(row.content_hash, vector);
count++;
});
var elapsed = Date.now() - startTime;
console.log("[EmbeddingCache] Warm-up complete: " + count + " entries loaded in " + elapsed + "ms");
return count;
}).catch(function (err) {
console.error("[EmbeddingCache] Warm-up failed (non-fatal):", err.message);
return 0;
});
};
EmbeddingCache.prototype.preCompute = function (texts) {
  var self = this;
  var toEmbed = [];
  var alreadyCached = 0;
  texts.forEach(function (text) {
    var hash = contentHash(text, self.service.model);
    // Note: only L1 is consulted here; entries present in Redis or the
    // database but evicted from memory will be re-embedded.
    if (self.l1.has(hash)) {
      alreadyCached++;
    } else {
      toEmbed.push({ text: text, hash: hash });
    }
  });
  console.log("[EmbeddingCache] Pre-compute: " + alreadyCached + " cached, " + toEmbed.length + " to embed");
  if (toEmbed.length === 0) return Promise.resolve({ cached: alreadyCached, embedded: 0 });
  var embedded = 0;
  // Process batches sequentially, pausing between API calls to stay under
  // provider rate limits.
  function processBatch(index) {
    var batch = toEmbed.slice(index, index + CONFIG.batchSize);
    if (batch.length === 0) {
      return Promise.resolve({ cached: alreadyCached, embedded: embedded });
    }
    var batchTexts = batch.map(function (item) { return item.text; });
    return self.service.embedBatch(batchTexts).then(function (vectors) {
      for (var i = 0; i < batch.length; i++) {
        self._storeInAllLayers(batch[i].hash, batch[i].text, vectors[i]);
        embedded++;
      }
      console.log("[EmbeddingCache] Pre-compute progress: " + embedded + "/" + toEmbed.length);
      return new Promise(function (resolve) {
        setTimeout(function () {
          resolve(processBatch(index + CONFIG.batchSize));
        }, CONFIG.batchDelayMs);
      });
    });
  }
  return processBatch(0);
};
// ============================================================
// Cleanup
// ============================================================
EmbeddingCache.prototype.shutdown = function () {
  clearInterval(this._statsInterval);
  this.redis.disconnect();
  console.log("[EmbeddingCache] Final stats:", JSON.stringify(this.monitor.report()));
};
// ============================================================
// Database Schema
// ============================================================
/*
CREATE TABLE embedding_cache (
  content_hash  VARCHAR(64) PRIMARY KEY,
  input_text    VARCHAR(500),
  vector        TEXT NOT NULL,
  model         VARCHAR(100) NOT NULL,
  hit_count     INTEGER DEFAULT 1,
  created_at    TIMESTAMP DEFAULT NOW(),
  last_accessed TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_embedding_cache_hits ON embedding_cache (hit_count DESC);
CREATE INDEX idx_embedding_cache_model ON embedding_cache (model);
*/
// ============================================================
// Usage Example
// ============================================================
/*
var service = new EmbeddingService(process.env.OPENAI_API_KEY);
var cache = new EmbeddingCache({
  embeddingService: service,
  db: pgPool,
  redisHost: "127.0.0.1",
});

cache.connect()
  .then(function () { return cache.warmUp(); })
  .then(function () {
    // Single embedding lookup
    return cache.getEmbedding("How do I deploy Node.js?");
  })
  .then(function (result) {
    console.log("Source:", result.source); // "api" on first call, "l1" after
    console.log("Vector dimensions:", result.vector.length);
    // Pre-compute a batch
    return cache.preCompute([
      "REST API authentication",
      "Docker container optimization",
      "Database migration strategies",
    ]);
  })
  .then(function (stats) {
    console.log("Pre-compute results:", stats);
  });
*/
module.exports = {
  EmbeddingCache: EmbeddingCache,
  EmbeddingService: EmbeddingService,
  EmbeddingCacheMonitor: EmbeddingCacheMonitor,
  contentHash: contentHash,
  normalizeText: normalizeText,
};
Common Issues and Troubleshooting
1. Redis Connection Refused on Startup
Error: connect ECONNREFUSED 127.0.0.1:6379
at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1595:16)
This happens when Redis is not running or the connection details are wrong. The cache module above handles it gracefully by falling through to the L3 (database) and API layers, but you lose the L2 layer entirely. Verify Redis is reachable with redis-cli ping and double-check your host and port configuration. In production, set lazyConnect: true so the app starts even if Redis is temporarily unavailable.
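A lazy-connect client along these lines keeps startup from blocking on Redis; this is a configuration sketch, and the host, port, and retry values are illustrative, not recommendations:

```javascript
var Redis = require("ioredis");

var redis = new Redis({
  host: "127.0.0.1",
  port: 6379,
  lazyConnect: true,        // don't block startup on the first connection
  maxRetriesPerRequest: 2,  // fail fast so lookups fall through to L3
  retryStrategy: function (attempt) {
    return Math.min(attempt * 200, 5000); // reconnect with capped backoff
  },
});

redis.on("error", function (err) {
  console.error("[EmbeddingCache] Redis unavailable:", err.message);
});
```

With maxRetriesPerRequest kept low, a dead Redis turns into a quick rejection that the cache can treat as a miss instead of a hung request.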
2. Memory Leak from Unbounded Cache
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
at EmbeddingCache.getEmbedding (/app/embedding-cache.js:87:22)
This happens when you cache vectors without an LRU eviction policy or a size limit. A 3072-dimension vector (text-embedding-3-large) stored as 64-bit JavaScript numbers is about 24KB; caching 100,000 of these takes roughly 2.4GB. Always set a max on your LRU cache and use sizeCalculation to account for actual memory usage. Monitor your process memory with process.memoryUsage().heapUsed.
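A size-bounded L1 can be configured along these lines. This is a sketch assuming lru-cache v10's LRUCache export; the 256MB cap and per-entry overhead estimate are illustrative:

```javascript
var LRUCache = require("lru-cache").LRUCache;

var l1 = new LRUCache({
  maxSize: 256 * 1024 * 1024, // hard cap: ~256MB of cached vector data
  sizeCalculation: function (vector) {
    // Each dimension is a 64-bit number; the constant approximates
    // per-entry object overhead.
    return vector.length * 8 + 64;
  },
});
```

Bounding by bytes rather than entry count matters because vector size varies by model: 10,000 entries of text-embedding-3-small and 10,000 of text-embedding-3-large differ by 2x in memory.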
3. Cache Key Collision from Insufficient Normalization
Symptoms: low cache hit rate (under 50%) despite users searching for similar terms. Debug by logging cache misses:
[EmbeddingCache] MISS: hash=a8f3... text="How do I deploy Node.js?"
[EmbeddingCache] MISS: hash=b2e1... text="how do i deploy node.js"
[EmbeddingCache] MISS: hash=c5d9... text="How do I deploy Node.js ?"
Three API calls for what should be one. The fix is better normalization: lowercasing, collapsing whitespace, and stripping trailing punctuation. Check your normalizeText function and log cache misses to identify patterns like these.
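A minimal normalization pass looks like this; it is a sketch of the idea, and the full normalizeText in the working example may handle more cases:

```javascript
// Collapse the cosmetic differences that fragment cache keys:
// casing, whitespace runs, and stray punctuation.
function normalizeText(text) {
  return text
    .trim()
    .toLowerCase()
    .replace(/\s+/g, " ")            // collapse runs of whitespace
    .replace(/\s+([?.!,;:])/g, "$1") // drop space before punctuation
    .replace(/[?.!,;:]+$/, "");      // strip trailing punctuation
}
```

All three queries above now normalize to "how do i deploy node.js" and share a single cache entry.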
4. Stale Embeddings After Model Upgrade
Error: Dimension mismatch: query vector has 1536 dimensions but index expects 3072
at VectorStore.search (/app/vector-store.js:45:11)
You upgraded from text-embedding-3-small (1536 dims) to text-embedding-3-large (3072 dims), but your cache still holds old vectors. If the model name is part of your cache key (as shown above), new queries will miss the cache and get fresh embeddings; the database layer, however, still contains the old vectors. You need a migration: either re-embed all documents, or filter cache queries by the model column (the schema above already has one).
5. Rate Limiting During Batch Pre-Computation
Error: 429 Too Many Requests
Rate limit exceeded. Please retry after 20 seconds.
at OpenAI._request (/app/node_modules/openai/core.js:245:15)
You are sending embedding batches faster than the API allows. Add a delay between batches (200-500ms) and implement exponential backoff on 429 responses. The batch pipeline in the working example above includes a batchDelayMs parameter for exactly this reason; start with 200ms and increase it if you still hit limits. Also reduce your batch size: 100 texts per call is reasonable, but 2000 is likely too many.
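A retry wrapper with exponential backoff can be sketched as follows. The assumption here is that the client surfaces the HTTP status on err.status, which you should verify against your provider's SDK:

```javascript
// Retry a promise-returning function on 429 responses, doubling the
// delay each attempt: baseDelayMs, 2x, 4x, ...
function withBackoff(fn, maxRetries, baseDelayMs) {
  function attempt(n) {
    return fn().catch(function (err) {
      if (err.status !== 429 || n >= maxRetries) throw err;
      var delay = baseDelayMs * Math.pow(2, n);
      return new Promise(function (resolve) {
        setTimeout(function () { resolve(attempt(n + 1)); }, delay);
      });
    });
  }
  return attempt(0);
}
```

Wrap each embedBatch call in this, e.g. withBackoff(function () { return service.embedBatch(texts); }, 5, 200), and non-429 errors still fail immediately.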
Best Practices
Always include the model name in cache keys. Embeddings from different models are incompatible. When you switch models, content-addressed keys naturally invalidate. Skip this and you will spend hours debugging dimension mismatches.
Normalize input text before hashing. Whitespace, casing, and trailing punctuation differences generate different cache keys for semantically identical inputs. A simple normalize function can boost your hit rate from 50% to 90%.
Use float32 for storage, float64 for computation. Storing vectors as float32 halves your memory and Redis usage with negligible accuracy loss. Convert back to float64 only if your vector math library requires it.
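The float32 round trip is a few lines with plain Node buffers; a sketch:

```javascript
// Pack a vector of JS numbers (float64) into a float32 buffer: 4 bytes
// per dimension instead of 8.
function toFloat32Buffer(vector) {
  return Buffer.from(new Float32Array(vector).buffer);
}

// Expand a stored float32 buffer back into a plain array of numbers.
function fromFloat32Buffer(buf) {
  var f32 = new Float32Array(buf.buffer, buf.byteOffset, buf.byteLength / 4);
  return Array.from(f32);
}
```

A 1536-dimension vector drops from 12KB to 6KB before any compression; embedding components are typically small magnitudes where float32 precision is more than enough for cosine similarity.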
Warm your cache on application start, ordered by hit count. Load the top 5,000 most-accessed embeddings into memory before accepting traffic. This eliminates the cold-start latency penalty after deployments.
Set generous TTLs on embedding caches. Embeddings are deterministic — same input and model always produce the same output. There is no reason to expire them after an hour. Use 7-30 day TTLs and save the API calls.
Monitor cache hit rates per layer and set alerts. A sudden drop in hit rate indicates a problem — maybe your normalization broke, or a new traffic pattern is not being cached. Alert on hit rates below 60%.
Pre-compute embeddings for predictable content eagerly, cache user queries lazily. Your own documents, categories, and FAQ answers should be embedded at write time. User queries build their cache naturally over time.
Compress vectors before storing in Redis. Float32 + gzip reduces a 1536-dim vector from 12KB to roughly 3.5KB. For 100,000 cached vectors, that is 850MB saved in Redis.
Gracefully degrade when cache layers are unavailable. If Redis goes down, fall through to database and API. If the database is unavailable, still serve from memory cache and API. Never let a cache failure become an application failure.
Track and log estimated cost savings. Showing stakeholders "$847 saved this month from embedding caching" is the easiest way to justify the infrastructure investment. The cache monitor in the working example calculates this automatically.
References
- OpenAI Embeddings API Documentation — Official embedding models, pricing, and rate limits
- lru-cache npm Package — High-performance LRU cache for Node.js with TTL and size-based eviction
- ioredis Documentation — Full-featured Redis client for Node.js with cluster and sentinel support
- pako (zlib for JavaScript) — High-speed compression library used for vector storage optimization
- Content-Addressable Storage (Wikipedia) — The design pattern behind content-hash caching
- Redis Memory Optimization — Official guide to reducing Redis memory usage