Embedding APIs: Beyond Text Generation

Complete guide to embedding APIs for semantic search, classification, and vector storage with pgvector in Node.js applications.

Overview

Most developers interact with LLM APIs exclusively through chat completions, but the embedding endpoints are arguably more useful for production systems. Embeddings convert text into dense numerical vectors that capture semantic meaning, enabling similarity search, classification, clustering, and recommendation systems without ever generating a single token of text. This article covers everything you need to build embedding-powered features in Node.js — from API calls and vector storage to hybrid search and drift monitoring.

Prerequisites

  • Node.js v18+ installed
  • PostgreSQL 15+ with the pgvector extension
  • An OpenAI API key (for text-embedding-3-small or text-embedding-3-large)
  • Basic familiarity with Express.js and SQL
  • psql CLI or a PostgreSQL client

What Embeddings Are and Why They Matter

An embedding is a fixed-length array of floating-point numbers that represents the semantic meaning of a piece of text. The word "king" and the word "monarch" will produce vectors that are close together in high-dimensional space, while "king" and "toaster" will be far apart.

This is not keyword matching. The sentence "How do I terminate a process?" and "How do I kill a running program?" share almost no words in common, but their embeddings will be nearly identical. That is the power of vector representations — they encode meaning, not surface tokens.

Embeddings enable a class of features that are impossible or painful with traditional text search:

  • Semantic search — find documents by meaning, not keywords
  • Classification — assign categories by comparing an input vector to labeled reference vectors
  • Clustering — group similar documents without predefined labels
  • Recommendations — suggest content similar to what a user has already consumed
  • Deduplication — detect near-duplicate content even when wording differs
  • Anomaly detection — flag content that does not belong in a corpus

The key insight is that once you have embeddings, all of these features reduce to the same operation: measuring the distance between vectors.

OpenAI Embeddings API

OpenAI offers two current-generation embedding models as of early 2026:

Model                    Dimensions   Max Tokens   Cost per 1M Tokens
text-embedding-3-small   1536         8191         $0.02
text-embedding-3-large   3072         8191         $0.13

text-embedding-3-small is the workhorse. It is cheap, fast, and good enough for the vast majority of use cases. I have shipped production semantic search systems on it and the quality is excellent. Use text-embedding-3-large only when you have empirically measured a quality gap on your specific dataset — which is rare.

Both models support a dimensions parameter that lets you reduce the output size. You can request 256-dimensional vectors from text-embedding-3-small instead of the full 1536, trading some accuracy for significant storage and computation savings. I will cover this in the dimensionality reduction section.

Here is a basic embedding call:

var axios = require("axios");

function getEmbedding(text, apiKey) {
  return axios.post(
    "https://api.openai.com/v1/embeddings",
    {
      model: "text-embedding-3-small",
      input: text
    },
    {
      headers: {
        "Authorization": "Bearer " + apiKey,
        "Content-Type": "application/json"
      }
    }
  ).then(function (response) {
    return response.data.data[0].embedding;
  });
}

The API accepts a single string or an array of strings. When you pass an array, you get back one embedding per input — and it is significantly faster than making individual calls.
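
Here is a minimal sketch of the array form. The endpoint and the response shape are the same as above; only the helper name getEmbeddings is mine.

var axios = require("axios");

function getEmbeddings(texts, apiKey) {
  return axios.post(
    "https://api.openai.com/v1/embeddings",
    {
      model: "text-embedding-3-small",
      input: texts // an array of strings; one embedding comes back per element
    },
    {
      headers: {
        "Authorization": "Bearer " + apiKey,
        "Content-Type": "application/json"
      }
    }
  ).then(function (response) {
    // response.data.data is an array of { index, embedding } objects.
    return response.data.data.map(function (item) {
      return item.embedding;
    });
  });
}

// One request, three embeddings:
getEmbeddings(["first text", "second text", "third text"], process.env.OPENAI_API_KEY);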

Cohere Embed API as an Alternative

OpenAI is not the only game in town. Cohere's embed-english-v3.0 model is a strong alternative with one killer feature: it supports distinct input_type parameters for documents versus queries. This asymmetric embedding can improve search quality because the model knows whether it is encoding a document to be searched or a query doing the searching.

var axios = require("axios");

function getCohereEmbedding(texts, inputType, apiKey) {
  return axios.post(
    "https://api.cohere.ai/v1/embed",
    {
      model: "embed-english-v3.0",
      texts: texts,
      input_type: inputType,
      truncate: "END"
    },
    {
      headers: {
        "Authorization": "Bearer " + apiKey,
        "Content-Type": "application/json"
      }
    }
  ).then(function (response) {
    return response.data.embeddings;
  });
}

// When embedding documents for storage:
getCohereEmbedding(["Your document text here"], "search_document", COHERE_KEY);

// When embedding a search query:
getCohereEmbedding(["user search query"], "search_query", COHERE_KEY);

Cohere's model produces 1024-dimensional vectors by default, which means lower storage costs than OpenAI's 1536. The pricing is competitive. If you are building a search-heavy system and want to benchmark alternatives, Cohere is worth testing.

Generating Embeddings in Node.js

Here is a complete, production-ready embedding client that handles retries, rate limits, and batch processing:

var axios = require("axios");

var OPENAI_API_KEY = process.env.OPENAI_API_KEY;
var EMBEDDING_MODEL = "text-embedding-3-small";
var MAX_BATCH_SIZE = 2048;

function sleep(ms) {
  return new Promise(function (resolve) {
    setTimeout(resolve, ms);
  });
}

function embedTexts(texts, retries) {
  if (typeof retries === "undefined") retries = 3;

  return axios.post(
    "https://api.openai.com/v1/embeddings",
    {
      model: EMBEDDING_MODEL,
      input: texts
    },
    {
      headers: {
        "Authorization": "Bearer " + OPENAI_API_KEY,
        "Content-Type": "application/json"
      },
      timeout: 30000
    }
  ).then(function (response) {
    var sorted = response.data.data.sort(function (a, b) {
      return a.index - b.index;
    });
    return sorted.map(function (item) {
      return item.embedding;
    });
  }).catch(function (err) {
    if (retries > 0 && err.response && err.response.status === 429) {
      var retryAfter = parseInt(err.response.headers["retry-after"] || "5", 10);
      console.log("Rate limited. Retrying in " + retryAfter + "s...");
      return sleep(retryAfter * 1000).then(function () {
        return embedTexts(texts, retries - 1);
      });
    }
    throw err;
  });
}

function embedBatch(texts) {
  var batches = [];
  for (var i = 0; i < texts.length; i += MAX_BATCH_SIZE) {
    batches.push(texts.slice(i, i + MAX_BATCH_SIZE));
  }

  var allEmbeddings = [];
  var chain = Promise.resolve();

  batches.forEach(function (batch, idx) {
    chain = chain.then(function () {
      console.log("Processing batch " + (idx + 1) + "/" + batches.length +
        " (" + batch.length + " texts)");
      return embedTexts(batch);
    }).then(function (embeddings) {
      allEmbeddings = allEmbeddings.concat(embeddings);
      return sleep(200);
    });
  });

  return chain.then(function () {
    return allEmbeddings;
  });
}

module.exports = {
  embedTexts: embedTexts,
  embedBatch: embedBatch
};

A few things to note. The API returns embeddings with an index field that corresponds to the position in the input array, but the response order is not guaranteed to match the input order. Always sort by index. The MAX_BATCH_SIZE of 2048 is the OpenAI limit per request. The 200ms sleep between batches prevents you from slamming the rate limit on large ingestion jobs.

Storing Embeddings in PostgreSQL with pgvector

pgvector turns PostgreSQL into a vector database. You get the full power of SQL — joins, transactions, indexes, backups — with native vector similarity search. I have deployed this in production and it handles millions of vectors without breaking a sweat.

Install pgvector and create the schema:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id SERIAL PRIMARY KEY,
  title TEXT NOT NULL,
  content TEXT NOT NULL,
  chunk_index INTEGER NOT NULL DEFAULT 0,
  embedding vector(1536),
  metadata JSONB DEFAULT '{}',
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

CREATE INDEX idx_documents_metadata ON documents USING gin (metadata);

The vector(1536) type stores a 1536-dimensional vector. The ivfflat index partitions vectors into 100 lists for approximate nearest neighbor search; note that ivfflat picks its list centroids from the rows present at index creation, so build it after loading a representative amount of data. For datasets under 100,000 rows, you can also use exact search without the index — it will just scan every row.

For larger datasets (over 1 million rows), switch to HNSW indexing:

CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

HNSW is slower to build but faster to query at scale. Choose based on your dataset size and write patterns.
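
Whichever index you pick, recall is also tunable at query time. pgvector reads the ivfflat.probes and hnsw.ef_search settings to decide how much of the index to scan, and higher values trade speed for better recall. Here is a minimal sketch of setting one of them on the same connection that runs the search; the values shown are arbitrary starting points.

var { Pool } = require("pg");

var pool = new Pool({
  connectionString: process.env.POSTGRES_CONNECTION_STRING
});

function searchWithRecallTuning(vectorStr, limit) {
  // Check out a single client so the SET applies to the session that runs the query.
  return pool.connect().then(function (client) {
    return client.query("SET ivfflat.probes = 10") // for HNSW: SET hnsw.ef_search = 100
      .then(function () {
        return client.query(
          "SELECT id, title FROM documents ORDER BY embedding <=> $1::vector LIMIT $2",
          [vectorStr, limit]
        );
      })
      .then(function (result) {
        client.release();
        return result.rows;
      })
      .catch(function (err) {
        client.release();
        throw err;
      });
  });
}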

Cosine Similarity Search Implementation

Cosine similarity measures the angle between two vectors, ignoring magnitude. Two vectors pointing in the same direction have a cosine similarity of 1.0, orthogonal vectors score 0.0, and opposite vectors score -1.0. In pgvector, the <=> operator computes cosine distance (1 - cosine similarity), so lower values mean more similar.

var { Pool } = require("pg");

var pool = new Pool({
  connectionString: process.env.POSTGRES_CONNECTION_STRING
});

function searchSimilar(queryEmbedding, limit, threshold) {
  if (!limit) limit = 10;
  if (!threshold) threshold = 0.3;

  var vectorStr = "[" + queryEmbedding.join(",") + "]";

  return pool.query(
    "SELECT id, title, content, metadata, " +
    "1 - (embedding <=> $1::vector) AS similarity " +
    "FROM documents " +
    "WHERE 1 - (embedding <=> $1::vector) > $2 " +
    "ORDER BY embedding <=> $1::vector " +
    "LIMIT $3",
    [vectorStr, threshold, limit]
  ).then(function (result) {
    return result.rows;
  });
}

The threshold parameter is critical. Without it, you always get results — even if nothing in your database is remotely related to the query. A threshold of 0.3 is a reasonable starting point for text-embedding-3-small. Tune it based on your data. I have seen production systems use thresholds anywhere from 0.2 to 0.5 depending on how semantically diverse the corpus is.

Document Chunking Strategies

Embedding an entire 10,000-word document into a single vector is a bad idea. The model has a token limit (8191 tokens for OpenAI's models), and even within that limit, long texts produce embeddings that are too general — they try to represent everything and end up representing nothing well.

Chunking is the process of splitting documents into smaller pieces before embedding. There are three main strategies:

Sentence-Level Chunking

Split on sentence boundaries. Good for FAQ-style content where individual sentences carry complete meaning.

function chunkBySentence(text, maxSentences) {
  if (!maxSentences) maxSentences = 3;

  var sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  var chunks = [];

  for (var i = 0; i < sentences.length; i += maxSentences) {
    var chunk = sentences.slice(i, i + maxSentences).join(" ").trim();
    if (chunk.length > 20) {
      chunks.push(chunk);
    }
  }

  return chunks;
}

Paragraph-Level Chunking

Split on double newlines. Best for articles, documentation, and blog posts where paragraphs are natural semantic units.

function chunkByParagraph(text, maxChars) {
  if (!maxChars) maxChars = 1500;

  var paragraphs = text.split(/\n\n+/).filter(function (p) {
    return p.trim().length > 0;
  });

  var chunks = [];
  var currentChunk = "";

  paragraphs.forEach(function (para) {
    if ((currentChunk + "\n\n" + para).length > maxChars && currentChunk.length > 0) {
      chunks.push(currentChunk.trim());
      currentChunk = para;
    } else {
      currentChunk = currentChunk ? currentChunk + "\n\n" + para : para;
    }
  });

  if (currentChunk.trim().length > 0) {
    chunks.push(currentChunk.trim());
  }

  return chunks;
}

Sliding Window Chunking

Overlapping windows ensure no context is lost at chunk boundaries. This is the most robust approach for general-purpose search.

function chunkSlidingWindow(text, windowSize, overlap) {
  if (!windowSize) windowSize = 500;
  if (!overlap) overlap = 100;
  // The overlap must be smaller than the window or the loop below never advances.
  if (overlap >= windowSize) overlap = Math.floor(windowSize / 2);

  var words = text.split(/\s+/);
  var chunks = [];
  var step = windowSize - overlap;

  for (var i = 0; i < words.length; i += step) {
    var chunk = words.slice(i, i + windowSize).join(" ");
    if (chunk.split(/\s+/).length > 20) {
      chunks.push(chunk);
    }
    if (i + windowSize >= words.length) break;
  }

  return chunks;
}

My recommendation: start with paragraph-level chunking with a max of 1000-1500 characters per chunk, and add sliding window overlap if you see context getting lost at boundaries. Most applications do not need sentence-level granularity.
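
If you do want overlap on top of paragraph chunks, a simple way to get it is to carry a short tail of the previous chunk into the next one. Here is a rough sketch that builds on the chunkByParagraph function above; the 200-character overlap is an arbitrary starting point, not a tuned value.

function chunkByParagraphWithOverlap(text, maxChars, overlapChars) {
  if (!maxChars) maxChars = 1500;
  if (!overlapChars) overlapChars = 200;

  // Chunk by paragraph first, then prepend the tail of the previous chunk
  // so context at the boundary appears in both chunks.
  var base = chunkByParagraph(text, maxChars);
  var chunks = [];

  base.forEach(function (chunk, idx) {
    if (idx === 0) {
      chunks.push(chunk);
    } else {
      var tail = base[idx - 1].slice(-overlapChars);
      chunks.push(tail + "\n\n" + chunk);
    }
  });

  return chunks;
}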

Embedding Caching for Repeated Content

Embedding the same text twice wastes money. If your documents do not change often, cache the embedding alongside the content hash:

var crypto = require("crypto");

function contentHash(text) {
  return crypto.createHash("sha256").update(text).digest("hex");
}

function getOrCreateEmbedding(text, pool, embedFn) {
  var hash = contentHash(text);

  return pool.query(
    "SELECT embedding FROM embedding_cache WHERE content_hash = $1",
    [hash]
  ).then(function (result) {
    if (result.rows.length > 0) {
      return JSON.parse(result.rows[0].embedding);
    }

    return embedFn([text]).then(function (embeddings) {
      var embedding = embeddings[0];
      var vectorJson = JSON.stringify(embedding);

      return pool.query(
        "INSERT INTO embedding_cache (content_hash, embedding, created_at) " +
        "VALUES ($1, $2, NOW()) ON CONFLICT (content_hash) DO NOTHING",
        [hash, vectorJson]
      ).then(function () {
        return embedding;
      });
    });
  });
}

The cache table:

CREATE TABLE embedding_cache (
  content_hash VARCHAR(64) PRIMARY KEY,
  embedding TEXT NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
);

This is especially valuable during development when you are re-ingesting the same corpus repeatedly while tuning chunking parameters.

Batch Embedding for Large Document Sets

When you need to embed thousands of documents, the batch embedding client from earlier handles the mechanics, but you also need an ingestion pipeline that tracks progress and handles failures:

var embedClient = require("./embed-client");
// chunkByParagraph is the paragraph chunker from the chunking section above;
// require it from wherever you keep your chunking helpers.

function ingestDocuments(documents, pool) {
  var totalChunks = 0;
  var processed = 0;

  var allChunks = [];
  documents.forEach(function (doc) {
    var chunks = chunkByParagraph(doc.content);
    chunks.forEach(function (chunk, idx) {
      allChunks.push({
        title: doc.title,
        content: chunk,
        chunkIndex: idx,
        metadata: doc.metadata || {}
      });
    });
  });

  totalChunks = allChunks.length;
  console.log("Ingesting " + totalChunks + " chunks from " +
    documents.length + " documents");

  var texts = allChunks.map(function (c) { return c.content; });

  return embedClient.embedBatch(texts).then(function (embeddings) {
    var chain = Promise.resolve();

    allChunks.forEach(function (chunk, idx) {
      chain = chain.then(function () {
        var vectorStr = "[" + embeddings[idx].join(",") + "]";
        processed++;

        if (processed % 100 === 0) {
          console.log("Stored " + processed + "/" + totalChunks);
        }

        return pool.query(
          "INSERT INTO documents (title, content, chunk_index, embedding, metadata) " +
          "VALUES ($1, $2, $3, $4::vector, $5)",
          [chunk.title, chunk.content, chunk.chunkIndex, vectorStr,
            JSON.stringify(chunk.metadata)]
        );
      });
    });

    return chain;
  }).then(function () {
    console.log("Ingestion complete. " + totalChunks + " chunks stored.");
  });
}

For very large ingestion jobs (100,000+ documents), consider writing embeddings to a CSV file and using PostgreSQL's COPY command for bulk loading. It is an order of magnitude faster than individual inserts.
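
Here is a sketch of that approach, assuming the documents table from earlier. pgvector accepts the text form of a vector ('[0.1,0.2,...]') inside a quoted CSV field, and the bulk load itself happens with psql's \copy.

var fs = require("fs");

function csvField(value) {
  // Quote every field and double embedded quotes so commas, quotes, and
  // newlines inside the content are safe.
  return '"' + String(value).replace(/"/g, '""') + '"';
}

function writeEmbeddingsCsv(chunks, embeddings, filePath) {
  var lines = chunks.map(function (chunk, idx) {
    var vectorStr = "[" + embeddings[idx].join(",") + "]";
    return [
      csvField(chunk.title),
      csvField(chunk.content),
      csvField(chunk.chunkIndex),
      csvField(vectorStr),
      csvField(JSON.stringify(chunk.metadata || {}))
    ].join(",");
  });

  fs.writeFileSync(filePath, lines.join("\n") + "\n");
}

// Then load it in one shot from psql:
// \copy documents (title, content, chunk_index, embedding, metadata) FROM 'embeddings.csv' WITH (FORMAT csv)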

Dimensionality Reduction for Cost and Speed

OpenAI's text-embedding-3-small produces 1536-dimensional vectors, but you can request fewer dimensions via the dimensions parameter:

var axios = require("axios");

function getReducedEmbedding(text, dimensions) {
  return axios.post(
    "https://api.openai.com/v1/embeddings",
    {
      model: "text-embedding-3-small",
      input: text,
      dimensions: dimensions
    },
    {
      headers: {
        "Authorization": "Bearer " + process.env.OPENAI_API_KEY,
        "Content-Type": "application/json"
      }
    }
  ).then(function (response) {
    return response.data.data[0].embedding;
  });
}

// 256 dimensions instead of 1536
getReducedEmbedding("Your text here", 256);

The impact on storage is dramatic:

Dimensions   Bytes per Vector   Storage for 1M Vectors
1536         6,144              ~5.7 GB
768          3,072              ~2.9 GB
256          1,024              ~0.95 GB

In my testing, 256 dimensions still delivers 90-95% of the search quality of full 1536 dimensions for most use cases. The only time I have seen meaningful degradation is with highly technical, domain-specific content where subtle distinctions matter. Start with 256 or 512, measure your recall, and only go higher if the numbers demand it.

Remember to update your schema to match:

-- For 256-dimensional embeddings
ALTER TABLE documents ALTER COLUMN embedding TYPE vector(256);

Combining Embeddings with Full-Text Search (Hybrid Search)

Pure vector search misses exact keyword matches. If a user searches for "error code ERR_CONNECTION_REFUSED", semantic search might find conceptually similar content about connection errors, but full-text search will find the exact error code. Hybrid search combines both.

-- Add a tsvector column for full-text search
ALTER TABLE documents ADD COLUMN tsv tsvector;
UPDATE documents SET tsv = to_tsvector('english', title || ' ' || content);
CREATE INDEX idx_documents_tsv ON documents USING gin (tsv);

The hybrid search query uses Reciprocal Rank Fusion (RRF) to combine scores from both methods:

function hybridSearch(queryText, queryEmbedding, pool, limit) {
  if (!limit) limit = 10;
  var vectorStr = "[" + queryEmbedding.join(",") + "]";

  var sql =
    "WITH semantic AS ( " +
    "  SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS rank_s " +
    "  FROM documents " +
    "  ORDER BY embedding <=> $1::vector " +
    "  LIMIT 50 " +
    "), " +
    "fulltext AS ( " +
    "  SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank(tsv, query) DESC) AS rank_f " +
    "  FROM documents, plainto_tsquery('english', $2) query " +
    "  WHERE tsv @@ query " +
    "  LIMIT 50 " +
    "), " +
    "combined AS ( " +
    "  SELECT COALESCE(s.id, f.id) AS id, " +
    "    COALESCE(1.0 / (60 + s.rank_s), 0) + " +
    "    COALESCE(1.0 / (60 + f.rank_f), 0) AS rrf_score " +
    "  FROM semantic s " +
    "  FULL OUTER JOIN fulltext f ON s.id = f.id " +
    ") " +
    "SELECT d.id, d.title, d.content, d.metadata, c.rrf_score " +
    "FROM combined c " +
    "JOIN documents d ON d.id = c.id " +
    "ORDER BY c.rrf_score DESC " +
    "LIMIT $3";

  return pool.query(sql, [vectorStr, queryText, limit]).then(function (result) {
    return result.rows;
  });
}

The constant 60 in the RRF formula is standard. It prevents any single high-ranking result from dominating the combined score. Hybrid search consistently outperforms either method alone in my benchmarks — typically 10-15% better recall on mixed workloads that include both natural language queries and exact-match searches.

Embedding-Based Classification and Clustering

Once you have embeddings, classification is just nearest-neighbor lookup. Create reference embeddings for each category, then classify new content by finding the closest reference:

function classifyText(textEmbedding, referenceEmbeddings) {
  var bestCategory = null;
  var bestSimilarity = -1;

  Object.keys(referenceEmbeddings).forEach(function (category) {
    var refEmbedding = referenceEmbeddings[category];
    var similarity = cosineSimilarity(textEmbedding, refEmbedding);

    if (similarity > bestSimilarity) {
      bestSimilarity = similarity;
      bestCategory = category;
    }
  });

  return {
    category: bestCategory,
    confidence: bestSimilarity
  };
}

function cosineSimilarity(a, b) {
  var dotProduct = 0;
  var normA = 0;
  var normB = 0;

  for (var i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }

  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

For clustering, a simple k-means implementation works well for grouping documents into topics:

function kMeansCluster(embeddings, k, maxIterations) {
  if (!maxIterations) maxIterations = 50;

  // Initialize centroids randomly
  var centroids = [];
  var used = {};
  for (var c = 0; c < k; c++) {
    var idx;
    do {
      idx = Math.floor(Math.random() * embeddings.length);
    } while (used[idx]);
    used[idx] = true;
    centroids.push(embeddings[idx].slice());
  }

  var assignments = new Array(embeddings.length);

  for (var iter = 0; iter < maxIterations; iter++) {
    var changed = false;

    // Assign each embedding to nearest centroid
    for (var i = 0; i < embeddings.length; i++) {
      var bestCluster = 0;
      var bestDist = Infinity;
      for (var j = 0; j < k; j++) {
        var dist = 1 - cosineSimilarity(embeddings[i], centroids[j]);
        if (dist < bestDist) {
          bestDist = dist;
          bestCluster = j;
        }
      }
      if (assignments[i] !== bestCluster) {
        assignments[i] = bestCluster;
        changed = true;
      }
    }

    if (!changed) break;

    // Recompute centroids
    for (var ci = 0; ci < k; ci++) {
      var members = [];
      for (var mi = 0; mi < embeddings.length; mi++) {
        if (assignments[mi] === ci) members.push(embeddings[mi]);
      }
      if (members.length > 0) {
        centroids[ci] = averageVector(members);
      }
    }
  }

  return assignments;
}

function averageVector(vectors) {
  var dim = vectors[0].length;
  var avg = new Array(dim).fill(0);
  vectors.forEach(function (v) {
    for (var i = 0; i < dim; i++) {
      avg[i] += v[i];
    }
  });
  for (var i = 0; i < dim; i++) {
    avg[i] /= vectors.length;
  }
  return avg;
}

I have used embedding-based classification to auto-tag support tickets, route customer inquiries, and categorize incoming content. It works surprisingly well with just 5-10 reference examples per category.
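
Building those reference embeddings is mostly just averaging. The sketch below turns a handful of labeled examples per category into the referenceEmbeddings map that classifyText expects; it assumes the embedTexts client and the averageVector helper defined earlier.

function buildReferenceEmbeddings(labeledExamples) {
  // labeledExamples: { billing: ["example text", ...], bug_report: [...], ... }
  var categories = Object.keys(labeledExamples);
  var allTexts = [];

  categories.forEach(function (category) {
    allTexts = allTexts.concat(labeledExamples[category]);
  });

  return embedTexts(allTexts).then(function (embeddings) {
    var references = {};
    var offset = 0;

    categories.forEach(function (category) {
      var count = labeledExamples[category].length;
      // Average the example embeddings into one reference vector per category.
      references[category] = averageVector(embeddings.slice(offset, offset + count));
      offset += count;
    });

    return references;
  });
}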

Monitoring Embedding Quality and Drift

Embedding models get updated. Your data distribution changes over time. If you do not monitor, your search quality will silently degrade. Here are the metrics that matter:

Intra-cluster distance: For each category or cluster, measure the average distance between members. If this increases over time, your embeddings are becoming less coherent.
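
Here is a minimal sketch of that check, reusing the cosineSimilarity helper from the classification section. It averages the pairwise cosine distances within one category's member embeddings; track the value per category over time.

function averageIntraClusterDistance(memberEmbeddings) {
  var totalDistance = 0;
  var pairs = 0;

  for (var i = 0; i < memberEmbeddings.length; i++) {
    for (var j = i + 1; j < memberEmbeddings.length; j++) {
      totalDistance += 1 - cosineSimilarity(memberEmbeddings[i], memberEmbeddings[j]);
      pairs++;
    }
  }

  // A rising trend means the category is becoming less coherent.
  return pairs > 0 ? totalDistance / pairs : 0;
}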

Query-result similarity distribution: Track the similarity scores of returned results. A sudden drop in average similarity means something changed — either the queries, the corpus, or the model.

function monitorSearchQuality(pool) {
  return pool.query(
    "SELECT " +
    "  DATE(created_at) AS day, " +
    "  AVG(similarity_score) AS avg_similarity, " +
    "  MIN(similarity_score) AS min_similarity, " +
    "  COUNT(*) AS query_count " +
    "FROM search_logs " +
    "WHERE created_at > NOW() - INTERVAL '30 days' " +
    "GROUP BY DATE(created_at) " +
    "ORDER BY day"
  ).then(function (result) {
    result.rows.forEach(function (row) {
      if (row.avg_similarity < 0.4) {
        console.warn("WARNING: Low average similarity on " + row.day +
          ": " + row.avg_similarity);
      }
    });
    return result.rows;
  });
}

Log every search query, the returned results, and their similarity scores. Build a search_logs table:

CREATE TABLE search_logs (
  id SERIAL PRIMARY KEY,
  query_text TEXT,
  similarity_score FLOAT,
  result_count INTEGER,
  created_at TIMESTAMP DEFAULT NOW()
);

When you detect drift, re-embed your corpus with the current model version and compare search quality before and after. Keep a test set of known query-document pairs and measure recall regularly.
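
Here is a rough sketch of that recall check. It assumes the embedTexts client and the searchSimilar function from earlier, plus a hand-built test set of query and expected-document-id pairs.

function measureRecall(testSet, k) {
  // testSet: [{ query: "how do I kill a process", expectedId: 42 }, ...]
  if (!k) k = 10;

  var hits = 0;
  var chain = Promise.resolve();

  testSet.forEach(function (testCase) {
    chain = chain.then(function () {
      return embedTexts([testCase.query]);
    }).then(function (embeddings) {
      // Near-zero threshold so the default cutoff does not hide lower-scoring matches.
      return searchSimilar(embeddings[0], k, 0.01);
    }).then(function (rows) {
      var found = rows.some(function (row) {
        return row.id === testCase.expectedId;
      });
      if (found) hits++;
    });
  });

  return chain.then(function () {
    // Recall@k: fraction of test queries whose expected document appeared in the top k.
    return hits / testSet.length;
  });
}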

Complete Working Example: Semantic Search Service

Here is a complete Express.js API that embeds documents, stores them in PostgreSQL with pgvector, and serves similarity search queries:

var express = require("express");
var bodyParser = require("body-parser");
var { Pool } = require("pg");
var axios = require("axios");
var crypto = require("crypto");

var app = express();
app.use(bodyParser.json({ limit: "5mb" }));

var pool = new Pool({
  connectionString: process.env.POSTGRES_CONNECTION_STRING
});

var OPENAI_API_KEY = process.env.OPENAI_API_KEY;
var EMBEDDING_MODEL = "text-embedding-3-small";
var EMBEDDING_DIMENSIONS = 1536;
var PORT = process.env.PORT || 3000;

// --- Embedding Helpers ---

function sleep(ms) {
  return new Promise(function (resolve) {
    setTimeout(resolve, ms);
  });
}

function embedTexts(texts, retries) {
  if (typeof retries === "undefined") retries = 3;

  return axios.post(
    "https://api.openai.com/v1/embeddings",
    {
      model: EMBEDDING_MODEL,
      input: texts,
      dimensions: EMBEDDING_DIMENSIONS
    },
    {
      headers: {
        "Authorization": "Bearer " + OPENAI_API_KEY,
        "Content-Type": "application/json"
      },
      timeout: 30000
    }
  ).then(function (response) {
    var sorted = response.data.data.sort(function (a, b) {
      return a.index - b.index;
    });
    return sorted.map(function (item) {
      return item.embedding;
    });
  }).catch(function (err) {
    if (retries > 0 && err.response && err.response.status === 429) {
      var retryAfter = parseInt(err.response.headers["retry-after"] || "5", 10);
      console.log("Rate limited. Retrying in " + retryAfter + "s...");
      return sleep(retryAfter * 1000).then(function () {
        return embedTexts(texts, retries - 1);
      });
    }
    throw err;
  });
}

// --- Chunking ---

function chunkByParagraph(text, maxChars) {
  if (!maxChars) maxChars = 1500;

  var paragraphs = text.split(/\n\n+/).filter(function (p) {
    return p.trim().length > 0;
  });

  var chunks = [];
  var current = "";

  paragraphs.forEach(function (para) {
    if ((current + "\n\n" + para).length > maxChars && current.length > 0) {
      chunks.push(current.trim());
      current = para;
    } else {
      current = current ? current + "\n\n" + para : para;
    }
  });

  if (current.trim().length > 0) {
    chunks.push(current.trim());
  }

  return chunks;
}

// --- Routes ---

// Health check
app.get("/health", function (req, res) {
  pool.query("SELECT 1").then(function () {
    res.json({ status: "ok", database: "connected" });
  }).catch(function () {
    res.status(500).json({ status: "error", database: "disconnected" });
  });
});

// Ingest a document
app.post("/documents", function (req, res) {
  var title = req.body.title;
  var content = req.body.content;
  var metadata = req.body.metadata || {};

  if (!title || !content) {
    return res.status(400).json({ error: "title and content are required" });
  }

  var chunks = chunkByParagraph(content);
  var texts = chunks;

  console.log("Embedding " + chunks.length + " chunks for: " + title);

  embedTexts(texts).then(function (embeddings) {
    var insertions = chunks.map(function (chunk, idx) {
      var vectorStr = "[" + embeddings[idx].join(",") + "]";
      return pool.query(
        "INSERT INTO documents (title, content, chunk_index, embedding, metadata) " +
        "VALUES ($1, $2, $3, $4::vector, $5) RETURNING id",
        [title, chunk, idx, vectorStr, JSON.stringify(metadata)]
      );
    });

    return Promise.all(insertions);
  }).then(function (results) {
    var ids = results.map(function (r) { return r.rows[0].id; });
    res.status(201).json({
      message: "Document ingested",
      chunks: chunks.length,
      ids: ids
    });
  }).catch(function (err) {
    console.error("Ingest error:", err.message);
    res.status(500).json({ error: "Failed to ingest document" });
  });
});

// Search documents
app.get("/search", function (req, res) {
  var query = req.query.q;
  var limit = parseInt(req.query.limit, 10) || 10;
  var threshold = parseFloat(req.query.threshold) || 0.3;
  var mode = req.query.mode || "semantic";

  if (!query) {
    return res.status(400).json({ error: "q parameter is required" });
  }

  embedTexts([query]).then(function (embeddings) {
    var queryEmbedding = embeddings[0];
    var vectorStr = "[" + queryEmbedding.join(",") + "]";

    var sql;
    var params;

    if (mode === "hybrid") {
      sql =
        "WITH semantic AS ( " +
        "  SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS rank_s " +
        "  FROM documents ORDER BY embedding <=> $1::vector LIMIT 50 " +
        "), " +
        "fulltext AS ( " +
        "  SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank(tsv, query) DESC) AS rank_f " +
        "  FROM documents, plainto_tsquery('english', $2) query " +
        "  WHERE tsv @@ query LIMIT 50 " +
        "), " +
        "combined AS ( " +
        "  SELECT COALESCE(s.id, f.id) AS id, " +
        "    COALESCE(1.0 / (60 + s.rank_s), 0) + " +
        "    COALESCE(1.0 / (60 + f.rank_f), 0) AS rrf_score " +
        "  FROM semantic s FULL OUTER JOIN fulltext f ON s.id = f.id " +
        ") " +
        "SELECT d.id, d.title, d.content, d.metadata, c.rrf_score AS score " +
        "FROM combined c JOIN documents d ON d.id = c.id " +
        "ORDER BY c.rrf_score DESC LIMIT $3";
      params = [vectorStr, query, limit];
    } else {
      sql =
        "SELECT id, title, content, metadata, " +
        "1 - (embedding <=> $1::vector) AS score " +
        "FROM documents " +
        "WHERE 1 - (embedding <=> $1::vector) > $2 " +
        "ORDER BY embedding <=> $1::vector LIMIT $3";
      params = [vectorStr, threshold, limit];
    }

    return pool.query(sql, params);
  }).then(function (result) {
    // Log search for monitoring
    pool.query(
      "INSERT INTO search_logs (query_text, similarity_score, result_count) " +
      "VALUES ($1, $2, $3)",
      [query, result.rows.length > 0 ? result.rows[0].score : 0, result.rows.length]
    ).catch(function () {});

    res.json({
      query: query,
      mode: mode,
      results: result.rows.map(function (row) {
        return {
          id: row.id,
          title: row.title,
          content: row.content,
          score: parseFloat(row.score).toFixed(4),
          metadata: row.metadata
        };
      })
    });
  }).catch(function (err) {
    console.error("Search error:", err.message);
    res.status(500).json({ error: "Search failed" });
  });
});

// Get search quality metrics
app.get("/metrics", function (req, res) {
  pool.query(
    "SELECT DATE(created_at) AS day, " +
    "  ROUND(AVG(similarity_score)::numeric, 4) AS avg_score, " +
    "  COUNT(*) AS queries " +
    "FROM search_logs " +
    "WHERE created_at > NOW() - INTERVAL '30 days' " +
    "GROUP BY DATE(created_at) ORDER BY day"
  ).then(function (result) {
    res.json({ metrics: result.rows });
  }).catch(function (err) {
    res.status(500).json({ error: "Failed to fetch metrics" });
  });
});

// --- Start Server ---

app.listen(PORT, function () {
  console.log("Semantic search service running on port " + PORT);
});

Test it with curl:

# Ingest a document
curl -X POST http://localhost:3000/documents \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Node.js Stream Processing",
    "content": "Streams are one of the fundamental concepts in Node.js. They allow you to read or write data piece by piece without loading the entire dataset into memory.\n\nReadable streams emit data events as chunks become available. Writable streams accept chunks and write them to a destination. Transform streams modify data as it passes through.\n\nBackpressure is the mechanism that prevents a fast producer from overwhelming a slow consumer. When a writable stream cannot keep up, it signals the readable stream to pause.",
    "metadata": {"category": "nodejs", "author": "Shane"}
  }'

# Semantic search
curl "http://localhost:3000/search?q=how+to+handle+memory+when+processing+large+files&limit=5"

# Hybrid search
curl "http://localhost:3000/search?q=backpressure+streams&mode=hybrid&limit=5"

Expected output:

{
  "query": "how to handle memory when processing large files",
  "mode": "semantic",
  "results": [
    {
      "id": 1,
      "title": "Node.js Stream Processing",
      "content": "Streams are one of the fundamental concepts in Node.js. They allow you to read or write data piece by piece without loading the entire dataset into memory.",
      "score": "0.7823",
      "metadata": {"category": "nodejs", "author": "Shane"}
    }
  ]
}

Notice how the query "how to handle memory when processing large files" matched content about streams — even though the words do not overlap much. That is the power of semantic search.

Common Issues and Troubleshooting

1. "Error: vector must have X dimensions, not Y"

ERROR: expected 1536 dimensions, not 3072

This happens when you switch embedding models (e.g., from text-embedding-3-small to text-embedding-3-large) without updating your schema. The column type must match the model output. Fix by either changing the column dimension or re-embedding with the correct model:

ALTER TABLE documents ALTER COLUMN embedding TYPE vector(3072);

If you used the dimensions parameter, make sure it matches your schema. Nothing validates this on the client side; the mismatch only surfaces when pgvector rejects the insert at write time.

2. "Error: Could not choose a best candidate operator"

ERROR: could not choose a best candidate operator
HINT: No operator matches the given name and argument types.

You forgot to create the pgvector extension. Run CREATE EXTENSION IF NOT EXISTS vector; in your database. Also ensure your PostgreSQL version supports pgvector — you need PostgreSQL 12 or later with the extension installed.

3. Rate Limiting on Batch Ingestion

Request failed with status code 429
Error: Rate limit reached for text-embedding-3-small

OpenAI enforces token-per-minute and request-per-minute limits. For Tier 1 accounts, the default is 1,000,000 tokens per minute. When ingesting large document sets, add delays between batches. The retry logic in the embedding client handles this, but if you are consistently hitting limits, reduce batch size or add a longer sleep:

// Increase delay between batches
return sleep(1000); // 1 second instead of 200ms

4. Search Returns Irrelevant Results with High Similarity

This usually means your chunks are too large. A 5,000-character chunk that covers three different topics will have an embedding that is vaguely similar to many queries but precisely similar to none. Fix by reducing your chunk size to 500-1000 characters and re-embedding.

Also check that you are not accidentally embedding metadata, HTML tags, or boilerplate alongside the actual content. Strip noise before embedding:

function cleanText(text) {
  return text
    .replace(/<[^>]+>/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

5. HNSW Index Build Kills Memory

ERROR: out of memory
DETAIL: Failed on request of size 134217728 in memory context "HNSW build"

HNSW indexes are memory-intensive during construction. For large tables, increase maintenance_work_mem:

SET maintenance_work_mem = '2GB';
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

Alternatively, use ivfflat which requires far less memory to build. HNSW is only necessary when you need sub-millisecond queries on millions of vectors.

6. Embeddings Differ Between Runs for the Same Text

OpenAI's API is deterministic for the same model version, but model versions can change without notice. If you need reproducible embeddings, pin to a specific model version (when available) and always store the model name alongside the embedding so you know when re-embedding is needed.
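
One way to do that with the schema from this article is to record the model name and dimensions in the metadata JSONB at insert time, then filter on that field when deciding what to re-embed. Here is a sketch.

function insertChunkWithModel(pool, chunk, embedding) {
  var vectorStr = "[" + embedding.join(",") + "]";

  // Record which model (and how many dimensions) produced this embedding so
  // stale rows can be found and re-embedded later.
  var metadata = Object.assign({}, chunk.metadata, {
    embedding_model: "text-embedding-3-small",
    embedding_dimensions: embedding.length
  });

  return pool.query(
    "INSERT INTO documents (title, content, chunk_index, embedding, metadata) " +
    "VALUES ($1, $2, $3, $4::vector, $5)",
    [chunk.title, chunk.content, chunk.chunkIndex, vectorStr, JSON.stringify(metadata)]
  );
}

// Later, find rows produced by an older model:
// SELECT id FROM documents WHERE metadata->>'embedding_model' <> 'text-embedding-3-small';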

Best Practices

  • Choose the smallest embedding model that meets your quality bar. Start with text-embedding-3-small at 256 or 512 dimensions. Only scale up when you have measured a quality gap on your actual data, not on vibes.

  • Always set a similarity threshold. Never return results just because they are the "most similar" — a cosine similarity of 0.15 is garbage, even if it is the best match in the database. A threshold of 0.3-0.5 for text-embedding-3-small is a good starting range.

  • Chunk before you embed. Documents longer than 500-1500 characters should be split. The optimal chunk size depends on your content, but paragraph-level with a 1000-character cap is a reliable default.

  • Use hybrid search in production. Pure semantic search misses exact matches. Pure full-text search misses semantic similarity. Combine them with RRF for the best results. It is not significantly more complex and the quality improvement is consistent.

  • Cache embeddings aggressively. Hash your content and store embeddings alongside the hash. Re-embed only when content changes. This saves money and keeps your ingestion pipeline idempotent.

  • Log everything. Track query text, result scores, result count, and latency for every search. This data is essential for monitoring quality drift, debugging relevance issues, and proving to your team that semantic search actually works.

  • Batch your embedding calls. Never make one API call per text when you have multiple texts to embed. A single call with 100 inputs is faster, cheaper, and more reliable than 100 individual calls.

  • Store the model name with your embeddings. When OpenAI updates a model, existing embeddings become incompatible. If you know which model generated each embedding, you can re-embed incrementally instead of guessing.

  • Test with real queries from real users. Synthetic test queries from developers are not representative. Collect actual search queries from your users and use them to evaluate chunking strategies, similarity thresholds, and model choices.

  • Plan for re-embedding. Your corpus will need to be re-embedded when you change models, adjust dimensions, or improve chunking. Build your ingestion pipeline to be re-runnable from day one.
