Cost-Effective Embedding Strategies at Scale
Reduce embedding costs at scale with batching, caching, incremental updates, and budget tracking in Node.js.
Overview
Embedding APIs charge per token, and those costs compound fast when you are processing millions of documents. I have watched teams burn through thousands of dollars in embedding spend during a single data migration because nobody stopped to think about caching, batching, or deduplication. This article covers the practical strategies I use to keep embedding costs under control in production Node.js systems — from content-hash caching and incremental updates to budget alerting and storage cost estimation.
Prerequisites
- Node.js v18 or later installed
- Basic understanding of vector embeddings and their use in semantic search or RAG
- An OpenAI API key (or equivalent provider key)
- PostgreSQL with pgvector extension for storage examples
- Familiarity with Express.js and async patterns in Node.js
Understanding Embedding Costs
Every embedding API charges per token. The math is straightforward, but it catches people off guard at scale. Here is a rough comparison of popular providers as of early 2026:
| Provider | Model | Dimensions | Price per 1M Tokens |
|---|---|---|---|
| OpenAI | text-embedding-3-small | 1536 | $0.02 |
| OpenAI | text-embedding-3-large | 3072 | $0.13 |
| OpenAI | text-embedding-ada-002 | 1536 | $0.10 |
| Cohere | embed-english-v3.0 | 1024 | $0.10 |
| Google | text-embedding-004 | 768 | $0.00625 |
| Voyage AI | voyage-large-2 | 1536 | $0.12 |
A single document of 500 tokens costs fractions of a cent. But when you are embedding 10 million documents at 500 tokens each, that is 5 billion tokens. At $0.02 per million tokens with text-embedding-3-small, that comes to $100. With text-embedding-3-large, it is $650. Every re-embedding of unchanged content is money thrown away.
// Quick cost estimator
var COST_PER_MILLION_TOKENS = {
"text-embedding-3-small": 0.02,
"text-embedding-3-large": 0.13,
"text-embedding-ada-002": 0.10
};
function estimateCost(totalTokens, model) {
var rate = COST_PER_MILLION_TOKENS[model] || 0.02;
return (totalTokens / 1000000) * rate;
}
// 10 million docs × 500 tokens each
var totalTokens = 10000000 * 500;
console.log("Small model cost: $" + estimateCost(totalTokens, "text-embedding-3-small"));
// Small model cost: $100
console.log("Large model cost: $" + estimateCost(totalTokens, "text-embedding-3-large"));
// Large model cost: $650
Batch Embedding for Bulk Operations
The OpenAI embeddings endpoint accepts up to 2048 inputs in a single request, with a limit of roughly 8,191 tokens per individual input. Sending one document per HTTP request is the single most wasteful pattern I see. Every individual request has network overhead, and you burn time on round trips that could be eliminated by batching.
var axios = require("axios");
function batchEmbed(texts, apiKey, model, batchSize) {
model = model || "text-embedding-3-small";
batchSize = batchSize || 100;
var results = [];
return new Promise(function (resolve, reject) {
var batches = [];
for (var i = 0; i < texts.length; i += batchSize) {
batches.push(texts.slice(i, i + batchSize));
}
var processedBatches = 0;
function processBatch(index) {
if (index >= batches.length) {
return resolve(results);
}
axios.post("https://api.openai.com/v1/embeddings", {
input: batches[index],
model: model
}, {
headers: {
"Authorization": "Bearer " + apiKey,
"Content-Type": "application/json"
}
}).then(function (response) {
var embeddings = response.data.data.map(function (item) {
return item.embedding;
});
results = results.concat(embeddings);
processedBatches++;
console.log("Batch " + processedBatches + "/" + batches.length + " complete (" + embeddings.length + " embeddings)");
processBatch(index + 1);
}).catch(function (err) {
reject(err);
});
}
processBatch(0);
});
}
With 10,000 documents and a batch size of 100, that is 100 HTTP requests instead of 10,000. The latency improvement alone is dramatic, and you avoid getting rate-limited as quickly.
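A minimal usage sketch, assuming the key sits in OPENAI_API_KEY and the input is a plain array of strings:
var texts = ["First product description...", "Second product description..."];
batchEmbed(texts, process.env.OPENAI_API_KEY, "text-embedding-3-small", 100)
.then(function (embeddings) {
console.log("Received " + embeddings.length + " embeddings of " + embeddings[0].length + " dimensions");
})
.catch(function (err) {
console.error("Embedding failed:", err.message);
});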
Caching Embeddings to Avoid Re-Computation
The most effective cost optimization is simply not embedding things twice. I use a content hash as the cache key. If the document content has not changed, the embedding has not changed. This is a deterministic relationship for any given model.
var crypto = require("crypto");
function contentHash(text, model) {
return crypto.createHash("sha256").update(model + ":" + text).digest("hex");
}
// Cache layer using a plain in-memory object (swap for Redis in production)
var embeddingCache = {};
function getCachedEmbedding(text, model) {
var hash = contentHash(text, model);
var entry = embeddingCache[hash];
return entry ? entry.embedding : null;
}
function setCachedEmbedding(text, model, embedding) {
var hash = contentHash(text, model);
embeddingCache[hash] = {
embedding: embedding,
cachedAt: new Date().toISOString(),
model: model
};
}
In production, I store these in Redis with an expiry of 30 to 90 days. The hash includes the model name because switching from text-embedding-3-small to text-embedding-3-large produces entirely different vectors. Missing this detail is a bug I have seen in the wild.
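Here is a minimal sketch of that Redis-backed variant, assuming the ioredis package and a REDIS_URL environment variable; the 60-day TTL is just a midpoint of the 30-to-90-day range:
var Redis = require("ioredis");
var redis = new Redis(process.env.REDIS_URL);
var CACHE_TTL_SECONDS = 60 * 60 * 24 * 60; // 60 days
function getCachedEmbeddingRedis(text, model) {
var key = "emb:" + contentHash(text, model);
return redis.get(key).then(function (value) {
return value ? JSON.parse(value) : null;
});
}
function setCachedEmbeddingRedis(text, model, embedding) {
var key = "emb:" + contentHash(text, model);
return redis.set(key, JSON.stringify(embedding), "EX", CACHE_TTL_SECONDS);
}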
For a PostgreSQL-backed cache:
CREATE TABLE embedding_cache (
content_hash VARCHAR(64) PRIMARY KEY,
model VARCHAR(64) NOT NULL,
embedding VECTOR(1536),
token_count INTEGER,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_cache_model ON embedding_cache(model);
Incremental Embedding
Never re-embed your entire corpus when only a handful of documents changed. Track document versions and only process what is new or modified.
var crypto = require("crypto");
function IncrementalEmbedder(db, embedFn) {
this.db = db;
this.embedFn = embedFn;
}
IncrementalEmbedder.prototype.processDocuments = function (documents) {
var self = this;
var toEmbed = [];
var skipped = 0;
return self.db.query("SELECT doc_id, content_hash FROM documents").then(function (rows) {
var existingHashes = {};
rows.forEach(function (row) {
existingHashes[row.doc_id] = row.content_hash;
});
documents.forEach(function (doc) {
var hash = crypto.createHash("sha256").update(doc.content).digest("hex");
if (existingHashes[doc.id] === hash) {
skipped++;
} else {
doc._contentHash = hash;
toEmbed.push(doc);
}
});
console.log("Skipping " + skipped + " unchanged documents");
console.log("Embedding " + toEmbed.length + " new/modified documents");
if (toEmbed.length === 0) {
return { embedded: 0, skipped: skipped };
}
var texts = toEmbed.map(function (doc) { return doc.content; });
return self.embedFn(texts).then(function (embeddings) {
var updates = toEmbed.map(function (doc, i) {
return {
doc_id: doc.id,
content_hash: doc._contentHash,
embedding: embeddings[i]
};
});
return self.db.upsertEmbeddings(updates).then(function () {
return { embedded: toEmbed.length, skipped: skipped };
});
});
});
};
On a corpus of 500,000 documents where 2% change daily, this saves you from re-embedding 490,000 documents every run. That is a 98% cost reduction on daily operations.
Choosing the Right Model Size
Not every task needs the largest model. I use text-embedding-3-small for the majority of workloads and only reach for text-embedding-3-large when retrieval quality is measurably better on my evaluation set.
| Use Case | Recommended Model | Why |
|---|---|---|
| FAQ matching | small (1536d) | Short queries, limited vocabulary |
| Code search | large (3072d) | Semantic nuance matters |
| Product catalog | small (1536d) | Keyword-heavy, structured data |
| Legal document retrieval | large (3072d) | Subtle distinctions in language |
| Log anomaly detection | small (1536d) | Pattern matching, not nuance |
Run your own evaluation before choosing. I have seen teams default to the expensive model "just in case" and spend 6.5x more for a 2% improvement in recall that did not matter for their application.
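As a sketch of what that evaluation can look like, assuming you have already embedded a small labeled set of query/relevant-document pairs with each candidate model, recall@1 with plain cosine similarity is often enough to make the call:
function cosineSimilarity(a, b) {
var dot = 0, normA = 0, normB = 0;
for (var i = 0; i < a.length; i++) {
dot += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
// evalSet: [{ queryEmbedding, relevantDocId }], corpus: [{ docId, embedding }]
function recallAtOne(evalSet, corpus) {
var hits = 0;
evalSet.forEach(function (item) {
var best = null;
corpus.forEach(function (doc) {
var score = cosineSimilarity(item.queryEmbedding, doc.embedding);
if (!best || score > best.score) {
best = { docId: doc.docId, score: score };
}
});
if (best && best.docId === item.relevantDocId) {
hits++;
}
});
return hits / evalSet.length;
}
// Run once with small-model vectors and once with large-model vectors, then compare the two numbers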
Dimension Reduction to Save Storage Costs
OpenAI's text-embedding-3 models support native dimension reduction via the dimensions parameter. You can request 256, 512, or 1024 dimensions instead of the full 1536 or 3072. This shrinks storage costs and speeds up similarity search with minimal quality loss for many tasks.
function embedWithReducedDimensions(texts, apiKey, dimensions) {
dimensions = dimensions || 512;
return axios.post("https://api.openai.com/v1/embeddings", {
input: texts,
model: "text-embedding-3-small",
dimensions: dimensions
}, {
headers: {
"Authorization": "Bearer " + apiKey,
"Content-Type": "application/json"
}
}).then(function (response) {
return response.data.data.map(function (item) {
return item.embedding;
});
});
}
At 512 dimensions instead of 1536, you use one-third the storage while retaining roughly 95% of retrieval quality on most benchmarks. I have tested this on internal RAG systems and the difference is negligible for document retrieval. Only precision-sensitive tasks like code search or legal review justify the full dimensions.
Estimating Storage Costs for pgvector at Scale
Every vector you store costs disk space, and at millions of rows the math matters. pgvector stores each dimension as a 4-byte float.
Storage per vector = (dimensions × 4 bytes) + 8 bytes overhead
1536 dimensions: (1536 × 4) + 8 = 6,152 bytes ≈ 6 KB
3072 dimensions: (3072 × 4) + 8 = 12,296 bytes ≈ 12 KB
512 dimensions: (512 × 4) + 8 = 2,056 bytes ≈ 2 KB
At scale:
| Rows | 512d Storage | 1536d Storage | 3072d Storage |
|---|---|---|---|
| 100K | ~196 MB | ~586 MB | ~1.2 GB |
| 1M | ~1.96 GB | ~5.86 GB | ~11.7 GB |
| 10M | ~19.6 GB | ~58.6 GB | ~117 GB |
| 100M | ~196 GB | ~586 GB | ~1.17 TB |
Add HNSW indexes on top and storage roughly doubles. A 10-million-row table at 1536 dimensions with an HNSW index needs approximately 120 GB of disk. On managed PostgreSQL (DigitalOcean, AWS RDS), that is real money — often $100 to $300 per month just for disk. Reducing to 512 dimensions cuts that to around 40 GB.
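A back-of-the-envelope estimator for the same arithmetic; the 2x multiplier for an HNSW index is a rough rule of thumb, not a measured constant:
function estimatePgvectorStorage(rowCount, dimensions, withHnswIndex) {
var bytesPerVector = (dimensions * 4) + 8;
var tableBytes = rowCount * bytesPerVector;
var totalBytes = withHnswIndex ? tableBytes * 2 : tableBytes; // rough HNSW doubling
return {
perVectorBytes: bytesPerVector,
tableGB: (tableBytes / 1e9).toFixed(1),
totalWithIndexGB: (totalBytes / 1e9).toFixed(1)
};
}
console.log(estimatePgvectorStorage(10000000, 1536, true));
// { perVectorBytes: 6152, tableGB: '61.5', totalWithIndexGB: '123.0' }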
Connection Pooling for High-Throughput Embedding Pipelines
When you are writing thousands of embeddings to PostgreSQL, connection management matters. A naive approach opens a new connection per insert and chokes the database.
var { Pool } = require("pg");
var pool = new Pool({
connectionString: process.env.POSTGRES_CONNECTION_STRING,
max: 20,
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 5000
});
function bulkInsertEmbeddings(records) {
var batchSize = 500;
var batches = [];
for (var i = 0; i < records.length; i += batchSize) {
batches.push(records.slice(i, i + batchSize));
}
return batches.reduce(function (promise, batch, index) {
return promise.then(function () {
var values = [];
var params = [];
var paramIndex = 1;
batch.forEach(function (record) {
values.push(
"($" + paramIndex + ", $" + (paramIndex + 1) + ", $" + (paramIndex + 2) + ")"
);
params.push(record.doc_id, record.content_hash, JSON.stringify(record.embedding));
paramIndex += 3;
});
var sql = "INSERT INTO documents (doc_id, content_hash, embedding) VALUES " +
values.join(", ") +
" ON CONFLICT (doc_id) DO UPDATE SET content_hash = EXCLUDED.content_hash, embedding = EXCLUDED.embedding";
return pool.query(sql, params).then(function () {
console.log("Inserted batch " + (index + 1) + "/" + batches.length);
});
});
}, Promise.resolve());
}
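Calling it is a single promise chain; the record shape matches the column order in the INSERT, and changedDocs and embeddings here stand in for the output of the incremental step shown earlier:
var records = changedDocs.map(function (doc, i) {
return { doc_id: doc.id, content_hash: doc._contentHash, embedding: embeddings[i] };
});
bulkInsertEmbeddings(records).then(function () {
console.log("Wrote " + records.length + " embeddings");
});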
With a pool size of 20 and batch inserts of 500 rows, I can sustain around 10,000 inserts per second on a modestly sized PostgreSQL instance. Without pooling, the same workload maxes out at a few hundred.
Async Embedding Queues for Non-Blocking Ingestion
For web applications where users upload documents, you do not want the HTTP request to block while embeddings are generated. Push the work onto a queue and process it asynchronously.
var EventEmitter = require("events");
function EmbeddingQueue(embedFn, concurrency) {
this.embedFn = embedFn;
this.concurrency = concurrency || 3;
this.queue = [];
this.active = 0;
this.events = new EventEmitter();
}
EmbeddingQueue.prototype.enqueue = function (docId, text) {
this.queue.push({ docId: docId, text: text });
this._processNext();
return docId;
};
EmbeddingQueue.prototype._processNext = function () {
var self = this;
while (self.active < self.concurrency && self.queue.length > 0) {
var item = self.queue.shift();
self.active++;
(function (current) {
self.embedFn([current.text]).then(function (embeddings) {
self.active--;
self.events.emit("complete", {
docId: current.docId,
embedding: embeddings[0],
queueSize: self.queue.length
});
self._processNext();
}).catch(function (err) {
self.active--;
self.events.emit("error", { docId: current.docId, error: err });
self._processNext();
});
})(item);
}
};
// Usage in an Express route (assumes an Express app with express.json() middleware)
// Wrap batchEmbed so the queue's embedFn only needs the texts argument
var queue = new EmbeddingQueue(function (texts) {
return batchEmbed(texts, process.env.OPENAI_API_KEY, "text-embedding-3-small", 100);
}, 3);
queue.events.on("complete", function (result) {
console.log("Embedded doc " + result.docId + " — " + result.queueSize + " remaining");
// Store the embedding in the database
});
app.post("/api/documents", function (req, res) {
var docId = req.body.id;
var text = req.body.content;
queue.enqueue(docId, text);
res.status(202).json({ status: "queued", docId: docId });
});
The user gets a 202 response immediately. The embedding happens in the background. For production systems with persistence guarantees, swap this in-memory queue for BullMQ backed by Redis.
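A rough sketch of that swap, assuming the bullmq package and a local Redis instance; the queue name and job name here are arbitrary:
var { Queue, Worker } = require("bullmq");
var connection = { host: "127.0.0.1", port: 6379 };
var embeddingQueue = new Queue("embeddings", { connection: connection });
// In the Express route, enqueue instead of embedding inline
app.post("/api/documents", function (req, res) {
embeddingQueue.add("embed-document", { docId: req.body.id, text: req.body.content });
res.status(202).json({ status: "queued", docId: req.body.id });
});
// In a separate worker process; jobs survive restarts because they live in Redis
var worker = new Worker("embeddings", function (job) {
return batchEmbed([job.data.text], process.env.OPENAI_API_KEY).then(function (embeddings) {
// Persist embeddings[0] for job.data.docId here
});
}, { connection: connection, concurrency: 3 });
worker.on("failed", function (job, err) {
console.error("Embedding job " + job.id + " failed: " + err.message);
});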
Pre-Computing Embeddings for Predictable Queries
If you know the queries your users will make — categories, filter labels, FAQ questions — embed them once at deploy time and store them. This eliminates query-time embedding costs entirely for those patterns.
var fs = require("fs");
var PREDICTABLE_QUERIES = [
"How do I reset my password?",
"What are your pricing plans?",
"How do I cancel my subscription?",
"Where can I find my invoice?",
"How do I contact support?"
];
function preComputeQueryEmbeddings(apiKey) {
return batchEmbed(PREDICTABLE_QUERIES, apiKey, "text-embedding-3-small", 100)
.then(function (embeddings) {
var cache = {};
PREDICTABLE_QUERIES.forEach(function (query, i) {
cache[query.toLowerCase()] = embeddings[i];
});
fs.writeFileSync("./query-embeddings-cache.json", JSON.stringify(cache));
console.log("Pre-computed " + Object.keys(cache).length + " query embeddings");
return cache;
});
}
// At query time, load the pre-computed cache once and check it first
var queryEmbeddingsCache = fs.existsSync("./query-embeddings-cache.json")
? JSON.parse(fs.readFileSync("./query-embeddings-cache.json", "utf8"))
: {};
function getQueryEmbedding(queryText, apiKey) {
var cached = queryEmbeddingsCache[queryText.toLowerCase()];
if (cached) {
console.log("Cache hit for query: " + queryText);
return Promise.resolve(cached);
}
// Fall back to API call
return batchEmbed([queryText], apiKey).then(function (embeddings) {
return embeddings[0];
});
}
On a support chatbot I built, 60% of incoming queries matched one of 200 pre-computed embeddings. That cut query-side embedding costs by more than half.
Deduplication Before Embedding
Duplicate content is surprisingly common in real datasets — copied pages, syndicated articles, near-identical product descriptions. Embedding duplicates is pure waste.
var crypto = require("crypto");
function deduplicateDocuments(documents) {
var seen = {};
var unique = [];
var duplicates = 0;
documents.forEach(function (doc) {
var normalized = doc.content.trim().toLowerCase().replace(/\s+/g, " ");
var hash = crypto.createHash("sha256").update(normalized).digest("hex");
if (!seen[hash]) {
seen[hash] = doc.id;
unique.push(doc);
} else {
duplicates++;
// Store the mapping so we can reference the original embedding
doc._duplicateOf = seen[hash];
}
});
console.log("Found " + duplicates + " duplicates in " + documents.length + " documents");
console.log("Only embedding " + unique.length + " unique documents");
return { unique: unique, totalDuplicates: duplicates };
}
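After embedding only the unique documents, the _duplicateOf mapping lets every duplicate reuse the original's vector instead of paying for it again; embeddingsByDocId here is an assumed lookup built from the unique results:
function resolveDuplicateEmbeddings(documents, embeddingsByDocId) {
documents.forEach(function (doc) {
if (doc._duplicateOf && !embeddingsByDocId[doc.id]) {
// Reuse the vector computed for the original document
embeddingsByDocId[doc.id] = embeddingsByDocId[doc._duplicateOf];
}
});
return embeddingsByDocId;
}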
On a recent e-commerce project, deduplication removed 18% of the catalog before embedding. At 500,000 products, that saved approximately $1.80 per run at $0.02/M tokens — which is $54 per month with daily re-indexing.
Monitoring Embedding Spend with Cost Tracking Middleware
You cannot optimize what you do not measure. I wrap every embedding call with cost tracking middleware that logs tokens consumed, cost per request, and cumulative spend.
function CostTracker(budgetLimit) {
this.totalTokens = 0;
this.totalCost = 0;
this.requestCount = 0;
this.budgetLimit = budgetLimit || Infinity;
this.history = [];
}
CostTracker.prototype.track = function (response, model) {
var usage = response.data.usage;
var tokens = usage.total_tokens;
var cost = estimateCost(tokens, model);
this.totalTokens += tokens;
this.totalCost += cost;
this.requestCount++;
this.history.push({
timestamp: new Date().toISOString(),
tokens: tokens,
cost: cost,
cumulativeCost: this.totalCost,
model: model
});
if (this.totalCost >= this.budgetLimit * 0.8) {
console.warn("[COST ALERT] Approaching budget limit: $" +
this.totalCost.toFixed(4) + " / $" + this.budgetLimit);
}
if (this.totalCost >= this.budgetLimit) {
throw new Error("EMBEDDING_BUDGET_EXCEEDED: Spent $" +
this.totalCost.toFixed(4) + " (limit: $" + this.budgetLimit + ")");
}
return { tokens: tokens, cost: cost, cumulative: this.totalCost };
};
CostTracker.prototype.report = function () {
return {
totalRequests: this.requestCount,
totalTokens: this.totalTokens,
totalCost: "$" + this.totalCost.toFixed(4),
avgCostPerRequest: "$" + (this.totalCost / this.requestCount).toFixed(6),
avgTokensPerRequest: Math.round(this.totalTokens / this.requestCount),
budgetRemaining: "$" + (this.budgetLimit - this.totalCost).toFixed(4)
};
};
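Wiring the tracker into a call is a one-liner on the response; this sketch assumes the estimateCost helper from earlier and a $50 hard limit:
var tracker = new CostTracker(50.00);
axios.post("https://api.openai.com/v1/embeddings", {
input: ["Some document text to embed"],
model: "text-embedding-3-small"
}, {
headers: { "Authorization": "Bearer " + process.env.OPENAI_API_KEY }
}).then(function (response) {
var usage = tracker.track(response, "text-embedding-3-small");
console.log("This request: $" + usage.cost.toFixed(6) + ", cumulative: $" + usage.cumulative.toFixed(4));
console.log(tracker.report());
});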
Comparing Self-Hosted vs API Embedding Costs
At some scale, running your own embedding model becomes cheaper than API calls. The break-even depends on your volume and hardware costs.
| Monthly Volume | API Cost (3-small) | Self-Hosted (GPU Instance) | Winner |
|---|---|---|---|
| 10M tokens | $0.20 | ~$200 (T4 instance) | API |
| 100M tokens | $2.00 | ~$200 | API |
| 1B tokens | $20 | ~$200 | API |
| 10B tokens | $200 | ~$200 | Break-even |
| 100B tokens | $2,000 | ~$400 (amortized) | Self-hosted |
| 1T tokens | $20,000 | ~$800 (multiple GPUs) | Self-hosted |
Below 10 billion tokens per month, the API is almost always cheaper when you factor in operational burden — model updates, GPU driver issues, monitoring, scaling. Above that threshold, self-hosted models like BAAI/bge-large-en-v1.5 or nomic-embed-text-v1.5 running on a dedicated T4 or L4 GPU save serious money.
// Rough self-hosted cost estimator
function selfHostedBreakEven(monthlyTokens, apiCostPerMillion, gpuMonthlyCost) {
var apiCost = (monthlyTokens / 1000000) * apiCostPerMillion;
var savings = apiCost - gpuMonthlyCost;
var worthIt = savings > 0;
return {
apiMonthlyCost: "$" + apiCost.toFixed(2),
gpuMonthlyCost: "$" + gpuMonthlyCost.toFixed(2),
monthlySavings: "$" + savings.toFixed(2),
recommendation: worthIt ? "Self-host" : "Use API"
};
}
console.log(selfHostedBreakEven(50000000000, 0.02, 200));
// { apiMonthlyCost: '$1000.00', gpuMonthlyCost: '$200.00',
// monthlySavings: '$800.00', recommendation: 'Self-host' }
Budgeting and Forecasting Embedding Costs
Track your embedding spend daily and project it forward. Surprises in your cloud bill are never fun.
function EmbeddingBudget(monthlyBudget) {
this.monthlyBudget = monthlyBudget;
this.dailySpend = [];
}
EmbeddingBudget.prototype.recordDay = function (cost) {
this.dailySpend.push({ date: new Date().toISOString().split("T")[0], cost: cost });
};
EmbeddingBudget.prototype.forecast = function () {
if (this.dailySpend.length < 3) {
return { error: "Need at least 3 days of data to forecast" };
}
var recentDays = this.dailySpend.slice(-7);
var avgDaily = recentDays.reduce(function (sum, d) { return sum + d.cost; }, 0) / recentDays.length;
var today = new Date();
var daysInMonth = new Date(today.getFullYear(), today.getMonth() + 1, 0).getDate();
var dayOfMonth = today.getDate();
var daysRemaining = daysInMonth - dayOfMonth;
var spentSoFar = this.dailySpend.reduce(function (sum, d) { return sum + d.cost; }, 0);
var projected = spentSoFar + (avgDaily * daysRemaining);
var onTrack = projected <= this.monthlyBudget;
return {
spentSoFar: "$" + spentSoFar.toFixed(2),
avgDailySpend: "$" + avgDaily.toFixed(2),
projectedMonthly: "$" + projected.toFixed(2),
monthlyBudget: "$" + this.monthlyBudget.toFixed(2),
daysRemaining: daysRemaining,
status: onTrack ? "ON_TRACK" : "OVER_BUDGET",
overage: onTrack ? "$0.00" : "$" + (projected - this.monthlyBudget).toFixed(2)
};
};
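A quick usage sketch with made-up daily figures; the projection depends on the current date, so the output below is only indicative:
var budget = new EmbeddingBudget(50.00);
budget.recordDay(1.20);
budget.recordDay(1.45);
budget.recordDay(1.10);
budget.recordDay(2.30);
console.log(budget.forecast());
// e.g. { spentSoFar: '$6.05', avgDailySpend: '$1.51', projectedMonthly: '$...',
//        monthlyBudget: '$50.00', daysRemaining: ..., status: 'ON_TRACK', overage: '$0.00' }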
Complete Working Example
Here is a full cost-optimized embedding pipeline that ties together batching, content-hash caching, incremental updates, cost tracking, and budget alerting.
var axios = require("axios");
var crypto = require("crypto");
var { Pool } = require("pg");
var EventEmitter = require("events");
// ─── Configuration ───────────────────────────────────────────
var CONFIG = {
apiKey: process.env.OPENAI_API_KEY,
model: "text-embedding-3-small",
dimensions: 1536,
batchSize: 100,
monthlyBudget: 50.00,
dbConnectionString: process.env.POSTGRES_CONNECTION_STRING
};
// ─── Database Pool ───────────────────────────────────────────
var pool = new Pool({
connectionString: CONFIG.dbConnectionString,
max: 20,
idleTimeoutMillis: 30000
});
// ─── Cost Estimator ──────────────────────────────────────────
var COST_RATES = {
"text-embedding-3-small": 0.02,
"text-embedding-3-large": 0.13
};
function estimateCost(tokens, model) {
var rate = COST_RATES[model] || 0.02;
return (tokens / 1000000) * rate;
}
// ─── Cost Tracker ────────────────────────────────────────────
function CostTracker(budgetLimit) {
this.totalTokens = 0;
this.totalCost = 0;
this.requestCount = 0;
this.budgetLimit = budgetLimit;
this.events = new EventEmitter();
}
CostTracker.prototype.track = function (tokensUsed, model) {
var cost = estimateCost(tokensUsed, model);
this.totalTokens += tokensUsed;
this.totalCost += cost;
this.requestCount++;
if (this.totalCost >= this.budgetLimit * 0.8) {
this.events.emit("warning", {
message: "80% of budget consumed",
spent: this.totalCost,
limit: this.budgetLimit
});
}
if (this.totalCost >= this.budgetLimit) {
this.events.emit("exceeded", {
spent: this.totalCost,
limit: this.budgetLimit
});
throw new Error("EMBEDDING_BUDGET_EXCEEDED: $" +
this.totalCost.toFixed(4) + " / $" + this.budgetLimit);
}
return cost;
};
CostTracker.prototype.report = function () {
return {
requests: this.requestCount,
tokens: this.totalTokens,
cost: "$" + this.totalCost.toFixed(4),
budgetUsed: ((this.totalCost / this.budgetLimit) * 100).toFixed(1) + "%"
};
};
// ─── Content Hash Cache ──────────────────────────────────────
function contentHash(text, model) {
return crypto.createHash("sha256").update(model + ":" + text).digest("hex");
}
function getCachedEmbeddings(texts, model) {
var hashes = texts.map(function (t) { return contentHash(t, model); });
var placeholders = hashes.map(function (_, i) { return "$" + (i + 1); });
var sql = "SELECT content_hash, embedding FROM embedding_cache WHERE content_hash IN (" +
placeholders.join(", ") + ")";
return pool.query(sql, hashes).then(function (result) {
var cached = {};
result.rows.forEach(function (row) {
cached[row.content_hash] = row.embedding;
});
return cached;
});
}
function storeCachedEmbeddings(entries) {
if (entries.length === 0) return Promise.resolve();
var values = [];
var params = [];
var idx = 1;
entries.forEach(function (entry) {
values.push("($" + idx + ", $" + (idx + 1) + ", $" + (idx + 2) + ", $" + (idx + 3) + ")");
params.push(entry.hash, CONFIG.model, JSON.stringify(entry.embedding), entry.tokenCount);
idx += 4;
});
var sql = "INSERT INTO embedding_cache (content_hash, model, embedding, token_count) VALUES " +
values.join(", ") + " ON CONFLICT (content_hash) DO NOTHING";
return pool.query(sql, params);
}
// ─── Batch Embedder with Cache ───────────────────────────────
function EmbeddingPipeline(config) {
this.config = config;
this.costTracker = new CostTracker(config.monthlyBudget);
var self = this;
this.costTracker.events.on("warning", function (data) {
console.warn("[BUDGET WARNING] " + data.message +
" — $" + data.spent.toFixed(4) + " / $" + data.limit);
});
this.costTracker.events.on("exceeded", function (data) {
console.error("[BUDGET EXCEEDED] $" + data.spent.toFixed(4) + " / $" + data.limit);
});
}
EmbeddingPipeline.prototype.embed = function (documents) {
var self = this;
var model = self.config.model;
var texts = documents.map(function (d) { return d.content; });
// Step 1: Check cache
return getCachedEmbeddings(texts, model).then(function (cached) {
var uncachedDocs = [];
var results = new Array(documents.length);
documents.forEach(function (doc, i) {
var hash = contentHash(doc.content, model);
if (cached[hash]) {
results[i] = { docId: doc.id, embedding: cached[hash], source: "cache" };
} else {
uncachedDocs.push({ index: i, doc: doc, hash: hash });
}
});
var cacheHits = documents.length - uncachedDocs.length;
console.log("Cache: " + cacheHits + " hits, " + uncachedDocs.length + " misses");
if (uncachedDocs.length === 0) {
return results;
}
// Step 2: Deduplicate uncached docs
var uniqueTexts = {};
var dedupMap = {};
uncachedDocs.forEach(function (item) {
var normalized = item.doc.content.trim();
var hash = contentHash(normalized, model);
if (!uniqueTexts[hash]) {
uniqueTexts[hash] = { text: normalized, indices: [item.index], hash: item.hash };
} else {
uniqueTexts[hash].indices.push(item.index);
}
});
var uniqueTextList = Object.keys(uniqueTexts).map(function (key) {
return uniqueTexts[key];
});
console.log("Deduplication: " + uncachedDocs.length + " docs -> " +
uniqueTextList.length + " unique texts");
// Step 3: Batch embed unique texts
var batchTexts = uniqueTextList.map(function (u) { return u.text; });
return self._batchEmbed(batchTexts).then(function (embeddingResults) {
var cacheEntries = [];
uniqueTextList.forEach(function (unique, i) {
var embedding = embeddingResults.embeddings[i];
unique.indices.forEach(function (docIndex) {
results[docIndex] = {
docId: documents[docIndex].id,
embedding: embedding,
source: "api"
};
});
cacheEntries.push({
hash: unique.hash,
embedding: embedding,
tokenCount: Math.ceil(unique.text.length / 4)
});
});
// Step 4: Store in cache
return storeCachedEmbeddings(cacheEntries).then(function () {
console.log("Cached " + cacheEntries.length + " new embeddings");
console.log("Cost report: " + JSON.stringify(self.costTracker.report()));
return results;
});
});
});
};
EmbeddingPipeline.prototype._batchEmbed = function (texts) {
var self = this;
var batchSize = self.config.batchSize;
var allEmbeddings = [];
var totalTokens = 0;
var batches = [];
for (var i = 0; i < texts.length; i += batchSize) {
batches.push(texts.slice(i, i + batchSize));
}
return batches.reduce(function (promise, batch, batchIndex) {
return promise.then(function () {
return axios.post("https://api.openai.com/v1/embeddings", {
input: batch,
model: self.config.model
}, {
headers: {
"Authorization": "Bearer " + self.config.apiKey,
"Content-Type": "application/json"
},
timeout: 60000
}).then(function (response) {
var embeddings = response.data.data.map(function (item) {
return item.embedding;
});
allEmbeddings = allEmbeddings.concat(embeddings);
var tokens = response.data.usage.total_tokens;
totalTokens += tokens;
self.costTracker.track(tokens, self.config.model);
console.log("Batch " + (batchIndex + 1) + "/" + batches.length +
" — " + tokens + " tokens");
});
});
}, Promise.resolve()).then(function () {
return { embeddings: allEmbeddings, totalTokens: totalTokens };
});
};
// ─── Incremental Update Runner ───────────────────────────────
function runIncrementalUpdate(documents) {
var pipeline = new EmbeddingPipeline(CONFIG);
// Step 1: Check which documents have changed
var docIds = documents.map(function (d) { return d.id; });
var placeholders = docIds.map(function (_, i) { return "$" + (i + 1); });
var sql = "SELECT doc_id, content_hash FROM documents WHERE doc_id IN (" +
placeholders.join(", ") + ")";
return pool.query(sql, docIds).then(function (result) {
var existing = {};
result.rows.forEach(function (row) {
existing[row.doc_id] = row.content_hash;
});
var changed = documents.filter(function (doc) {
var hash = crypto.createHash("sha256").update(doc.content).digest("hex");
return existing[doc.id] !== hash;
});
console.log("Incremental update: " + changed.length + " of " +
documents.length + " documents need embedding");
if (changed.length === 0) {
console.log("Nothing to embed. All documents are up to date.");
return { embedded: 0, skipped: documents.length };
}
return pipeline.embed(changed).then(function (results) {
// Write results to database
var apiResults = results.filter(function (r) { return r && r.embedding; });
var values = [];
var params = [];
var idx = 1;
apiResults.forEach(function (r) {
var doc = changed.find(function (d) { return d.id === r.docId; });
var hash = crypto.createHash("sha256").update(doc.content).digest("hex");
values.push("($" + idx + ", $" + (idx + 1) + ", $" + (idx + 2) + ", $" + (idx + 3) + ")");
params.push(r.docId, hash, JSON.stringify(r.embedding), doc.content);
idx += 4;
});
if (values.length === 0) return { embedded: 0, skipped: documents.length };
var upsertSql = "INSERT INTO documents (doc_id, content_hash, embedding, content) VALUES " +
values.join(", ") +
" ON CONFLICT (doc_id) DO UPDATE SET content_hash = EXCLUDED.content_hash, " +
"embedding = EXCLUDED.embedding, content = EXCLUDED.content, updated_at = NOW()";
return pool.query(upsertSql, params).then(function () {
var report = pipeline.costTracker.report();
console.log("\n=== Embedding Run Complete ===");
console.log("Embedded: " + changed.length);
console.log("Skipped: " + (documents.length - changed.length));
console.log("Cost: " + report.cost);
console.log("Budget used: " + report.budgetUsed);
return {
embedded: changed.length,
skipped: documents.length - changed.length,
costReport: report
};
});
});
});
}
// ─── Usage ───────────────────────────────────────────────────
var sampleDocuments = [
{ id: "doc-001", content: "PostgreSQL is an advanced open-source relational database..." },
{ id: "doc-002", content: "Redis provides in-memory data structure storage..." },
{ id: "doc-003", content: "Docker containers package applications with their dependencies..." }
];
runIncrementalUpdate(sampleDocuments)
.then(function (result) {
console.log("Final result:", result);
process.exit(0);
})
.catch(function (err) {
console.error("Pipeline error:", err.message);
process.exit(1);
});
To run this, you need the database schema set up first:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
doc_id VARCHAR(128) PRIMARY KEY,
content TEXT NOT NULL,
content_hash VARCHAR(64) NOT NULL,
embedding VECTOR(1536),
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE embedding_cache (
content_hash VARCHAR(64) PRIMARY KEY,
model VARCHAR(64) NOT NULL,
embedding VECTOR(1536),
token_count INTEGER,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_documents_hash ON documents(content_hash);
CREATE INDEX idx_cache_model ON embedding_cache(model);
CREATE INDEX idx_documents_embedding ON documents USING hnsw (embedding vector_cosine_ops);
npm install axios pg
export OPENAI_API_KEY="sk-..."
export POSTGRES_CONNECTION_STRING="postgresql://user:pass@localhost:5432/embeddings"
node pipeline.js
Expected output:
Incremental update: 3 of 3 documents need embedding
Cache: 0 hits, 3 misses
Deduplication: 3 docs -> 3 unique texts
Batch 1/1 — 847 tokens
Cached 3 new embeddings
Cost report: {"requests":1,"tokens":847,"cost":"$0.0000","budgetUsed":"0.0%"}
=== Embedding Run Complete ===
Embedded: 3
Skipped: 0
Cost: $0.0000
Budget used: 0.0%
Final result: { embedded: 3, skipped: 0, costReport: { ... } }
Run it again with the same documents and you will see:
Incremental update: 0 of 3 documents need embedding
Nothing to embed. All documents are up to date.
Common Issues and Troubleshooting
1. Rate Limiting on Batch Requests
Error: Request failed with status code 429
data: { error: { message: "Rate limit reached for text-embedding-3-small in organization org-xxx on tokens per min (TPM): Limit 1000000, Used 987342, Requested 156000." } }
This happens when you fire batch requests faster than your tokens-per-minute limit allows. Retry rate-limited requests with exponential backoff instead of failing the whole run:
function delay(ms) {
return new Promise(function (resolve) { setTimeout(resolve, ms); });
}
// Retry a single embeddings request with exponential backoff on 429 responses
function postWithBackoff(payload, headers, attempt) {
attempt = attempt || 0;
return axios.post("https://api.openai.com/v1/embeddings", payload, { headers: headers })
.catch(function (err) {
if (err.response && err.response.status === 429 && attempt < 5) {
var waitMs = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s, 8s, 16s
return delay(waitMs).then(function () {
return postWithBackoff(payload, headers, attempt + 1);
});
}
throw err;
});
}
2. Dimension Mismatch After Model Change
ERROR: expected 1536 dimensions, not 3072
DETAIL: column "embedding" has type vector(1536)
If you switch from text-embedding-3-small (1536d) to text-embedding-3-large (3072d), your pgvector column will reject the new vectors. Either alter the column or create a new table. This is why the content hash includes the model name — cached embeddings from the old model will not be served for the new model.
-- Option 1: Alter the column (requires re-embedding everything)
ALTER TABLE documents ALTER COLUMN embedding TYPE VECTOR(3072);
-- Option 2: Truncate and re-embed
TRUNCATE TABLE embedding_cache;
3. Memory Overflow on Large Batch Jobs
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
When embedding millions of documents, holding all vectors in memory at once will crash Node.js. Process in chunks and write to the database as you go:
// Process 10,000 documents at a time, not all at once
var CHUNK_SIZE = 10000;
var chunks = [];
for (var i = 0; i < allDocs.length; i += CHUNK_SIZE) {
chunks.push(allDocs.slice(i, i + CHUNK_SIZE));
}
// Run the chunks sequentially so only one chunk is in memory at a time
chunks.reduce(function (promise, chunk) {
return promise.then(function () {
return runIncrementalUpdate(chunk);
});
}, Promise.resolve());
You can also increase Node.js memory with --max-old-space-size=4096, but fixing the batching is the right solution.
4. Budget Exceeded Mid-Pipeline
Error: EMBEDDING_BUDGET_EXCEEDED: $50.0034 / $50
This happens when a long-running job crosses your budget threshold. The pipeline throws hard and stops. You have two options: increase the budget, or implement a "soft stop" that finishes the current batch and saves progress:
CostTracker.prototype.track = function (tokensUsed, model) {
var cost = estimateCost(tokensUsed, model);
this.totalTokens += tokensUsed;
this.totalCost += cost;
this.requestCount++;
if (this.totalCost >= this.budgetLimit) {
// Soft stop: set a flag instead of throwing
this.budgetExceeded = true;
console.warn("[BUDGET] Limit reached. Completing current batch then stopping.");
}
return cost;
};
// The batch loop checks tracker.budgetExceeded after each batch and stops scheduling new ones
5. Stale Cache Serving Wrong Embeddings
If you update an embedding model's API version and the provider changes the underlying model (OpenAI has done this), your cache will serve stale vectors that are no longer compatible with fresh ones. Include a model version in your cache key:
function contentHash(text, model, version) {
var version = version || "v1";
return crypto.createHash("sha256")
.update(model + ":" + version + ":" + text)
.digest("hex");
}
Best Practices
Always batch your embedding requests. Single-document API calls waste network overhead and invite rate limiting. Pack as many texts as the API allows per request — up to 2048 inputs for OpenAI.
Cache by content hash, not document ID. Documents get renamed, moved, and re-uploaded. The content hash guarantees you only re-embed when the actual text changes, regardless of metadata changes.
Include the model name in your cache key. Switching models invalidates all cached embeddings. A cache miss is cheap; serving a mismatched embedding is a subtle, hard-to-debug accuracy problem.
Start with the smallest model that meets your quality bar. Run a retrieval evaluation on your actual data before choosing text-embedding-3-large. Most applications perform well enough with the small model at 6.5x lower cost.
Set hard budget limits with automatic cutoffs. It is better to stop a pipeline at 80% completion than to blow your monthly budget on a single runaway job. Budget alerts at 50%, 80%, and 100% thresholds are the minimum.
Deduplicate before embedding, not after. Running deduplication before API calls is free. Running it after means you already paid for the duplicate embeddings.
Use dimension reduction when storage costs matter more than marginal recall. Going from 1536 to 512 dimensions cuts storage by 66% with minimal quality degradation for most workloads.
Monitor and log every embedding request. Track tokens consumed, cost per batch, cache hit rates, and daily spend. Without metrics, cost creep is invisible until the invoice arrives.
Process incrementally, not in full rebuilds. Re-embedding your entire corpus nightly because "it is simpler" is the single most expensive mistake in embedding pipelines. Track content hashes and only embed what changed.
Estimate your storage costs before choosing dimensions and row counts. A 10-million-row pgvector table at 1536 dimensions with HNSW indexes needs over 100 GB. Plan your database provisioning accordingly.