Production RAG Architectures

Build production RAG systems with advanced retrieval, reranking, citation tracking, and quality monitoring in Node.js.

Retrieval-Augmented Generation (RAG) has become the standard pattern for building LLM applications that need to answer questions over private data. The concept is simple — retrieve relevant documents, stuff them into a prompt, and let the model generate an answer — but the gap between a demo RAG system and a production RAG system is enormous. This article covers the architectural patterns, retrieval strategies, quality monitoring, and operational concerns that separate toy implementations from systems that actually work at scale.

Prerequisites

  • Intermediate to advanced Node.js experience
  • Familiarity with PostgreSQL and pgvector extension
  • Basic understanding of embeddings and vector similarity
  • Working knowledge of Express.js
  • An OpenAI API key (or equivalent embedding/completion provider)
  • PostgreSQL 15+ with pgvector installed

RAG Architecture Overview

Every RAG system has two pipelines: an ingestion pipeline that prepares documents for retrieval, and a query pipeline that answers user questions. The fundamental flow is retrieve, augment, generate.

Retrieve: Given a user query, find the most relevant chunks of text from your document store. This can use vector similarity, keyword search, or both.

Augment: Pack the retrieved chunks into the LLM's context window along with the original query and any system instructions. This is where you format citations, apply relevance filtering, and manage token budgets.

Generate: Send the augmented prompt to the LLM and parse the response, extracting citations and validating that the answer is grounded in the provided context.

The architecture looks straightforward on a whiteboard, but each stage introduces failure modes that compound. Bad chunking produces bad retrieval. Bad retrieval produces hallucinated answers. No monitoring means you discover quality problems from angry users instead of dashboards.
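The three-stage flow can be sketched as a tiny pipeline. This is a deliberately minimal, synchronous sketch with hypothetical stage functions injected as arguments; in a real system each stage is asynchronous and backed by a vector store and an LLM.

```javascript
// Minimal sketch of the retrieve → augment → generate flow.
// The stage functions are hypothetical placeholders injected by the caller.
function runRagPipeline(stages, query) {
  var chunks = stages.retrieve(query);        // find relevant chunks
  var prompt = stages.augment(query, chunks); // pack them into a prompt
  return stages.generate(prompt);             // produce the final answer
}
```

Structuring the pipeline this way keeps each stage independently testable, which matters once you start swapping retrievers and rerankers.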

Naive RAG vs Advanced RAG

Naive RAG is what you build in a weekend hackathon: split documents into fixed-size chunks, embed them, store them in a vector database, and retrieve the top-k nearest neighbors for every query. It works surprisingly well for demos and breaks in predictable ways in production.

The problems with naive RAG:

  • Fixed-size chunking splits sentences mid-thought and destroys context
  • Single retrieval pass returns results that are semantically similar but not actually relevant
  • No reranking means the ordering from vector similarity may not reflect true usefulness
  • No query understanding means ambiguous or multi-part questions get poor results
  • No citation tracking means users cannot verify answers

Advanced RAG addresses these with a layered approach: semantic chunking, hybrid retrieval, reranking, query rewriting, citation extraction, and continuous quality monitoring. The rest of this article covers each layer.

Document Ingestion Pipeline

The ingestion pipeline transforms raw documents into retrievable chunks. The stages are parse, chunk, embed, and store.

Parsing

Before you can chunk a document, you need clean text. PDF parsing is notoriously unreliable — tables, headers, footers, and multi-column layouts all cause problems. For production systems, invest in a parsing layer that handles your specific document types.

var pdf = require("pdf-parse");
var mammoth = require("mammoth");
var cheerio = require("cheerio");

function parseDocument(buffer, mimeType) {
  if (mimeType === "application/pdf") {
    return pdf(buffer).then(function (data) {
      return {
        text: data.text,
        metadata: { pages: data.numpages }
      };
    });
  }

  if (mimeType === "application/vnd.openxmlformats-officedocument.wordprocessingml.document") {
    return mammoth.extractRawText({ buffer: buffer }).then(function (result) {
      return {
        text: result.value,
        metadata: {}
      };
    });
  }

  if (mimeType === "text/html") {
    var $ = cheerio.load(buffer.toString("utf-8"));
    $("script, style, nav, footer, header").remove();
    return Promise.resolve({
      text: $("body").text().replace(/\s+/g, " ").trim(),
      metadata: { title: $("title").text() }
    });
  }

  return Promise.resolve({
    text: buffer.toString("utf-8"),
    metadata: {}
  });
}

module.exports = { parseDocument: parseDocument };

Chunking

Fixed-size chunking (split every N tokens) is the naive approach. Semantic chunking produces better results by splitting on natural boundaries — paragraph breaks, section headers, sentence endings — while respecting a maximum size.

function semanticChunk(text, options) {
  var maxTokens = (options && options.maxTokens) || 512;
  var overlap = (options && options.overlap) || 64;
  var paragraphs = text.split(/\n\s*\n/);
  var chunks = [];
  var currentChunk = "";
  var currentTokenCount = 0;

  for (var i = 0; i < paragraphs.length; i++) {
    var para = paragraphs[i].trim();
    if (!para) continue;

    var paraTokens = estimateTokens(para);

    if (currentTokenCount + paraTokens > maxTokens && currentChunk.length > 0) {
      chunks.push(currentChunk.trim());

      // Create overlap by keeping the last portion
      var sentences = currentChunk.split(/(?<=[.!?])\s+/);
      var overlapText = "";
      var overlapTokens = 0;
      for (var j = sentences.length - 1; j >= 0; j--) {
        var sentenceTokens = estimateTokens(sentences[j]);
        if (overlapTokens + sentenceTokens > overlap) break;
        overlapText = sentences[j] + " " + overlapText;
        overlapTokens += sentenceTokens;
      }
      currentChunk = overlapText + para;
      currentTokenCount = overlapTokens + paraTokens;
    } else {
      currentChunk += (currentChunk ? "\n\n" : "") + para;
      currentTokenCount += paraTokens;
    }
  }

  if (currentChunk.trim()) {
    chunks.push(currentChunk.trim());
  }

  return chunks;
}

function estimateTokens(text) {
  // Rough estimate: 1 token per 4 characters for English
  return Math.ceil(text.length / 4);
}

The overlap parameter is critical. Without overlap, you lose context at chunk boundaries. With too much overlap, you waste storage and retrieval bandwidth. I have found 10-15% overlap works well for most document types.

Embedding and Storage

Once you have clean chunks, embed them and store them with their metadata. Using pgvector keeps everything in PostgreSQL, which simplifies your infrastructure compared to running a separate vector database.

var { Pool } = require("pg");
var OpenAI = require("openai");

var pool = new Pool({
  connectionString: process.env.POSTGRES_CONNECTION_STRING
});

var openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

function embedTexts(texts) {
  return openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts
  }).then(function (response) {
    return response.data.map(function (item) {
      return item.embedding;
    });
  });
}

function storeChunks(documentId, chunks, embeddings, metadata) {
  var client;
  return pool.connect().then(function (c) {
    client = c;
    return client.query("BEGIN");
  }).then(function () {
    var promises = chunks.map(function (chunk, i) {
      return client.query(
        "INSERT INTO document_chunks (document_id, chunk_index, content, embedding, metadata) " +
        "VALUES ($1, $2, $3, $4::vector, $5)",
        [
          documentId,
          i,
          chunk,
          "[" + embeddings[i].join(",") + "]",
          JSON.stringify(metadata)
        ]
      );
    });
    return Promise.all(promises);
  }).then(function () {
    return client.query("COMMIT");
  }).then(function () {
    client.release();
  }).catch(function (err) {
    if (client) {
      // Roll back before releasing the client, then re-throw the original
      // error. Swallow rollback failures so they do not mask the real cause.
      return client.query("ROLLBACK")
        .catch(function () {})
        .then(function () {
          client.release();
          throw err;
        });
    }
    throw err;
  });
}

The database schema for this:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  title TEXT NOT NULL,
  source_url TEXT,
  mime_type TEXT,
  ingested_at TIMESTAMP DEFAULT NOW(),
  metadata JSONB DEFAULT '{}'
);

CREATE TABLE document_chunks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
  chunk_index INTEGER NOT NULL,
  content TEXT NOT NULL,
  embedding vector(1536),
  metadata JSONB DEFAULT '{}',
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_chunks_embedding ON document_chunks
  USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

CREATE INDEX idx_chunks_document ON document_chunks(document_id);

-- Full-text search index for hybrid retrieval
ALTER TABLE document_chunks ADD COLUMN tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;

CREATE INDEX idx_chunks_tsv ON document_chunks USING gin(tsv);
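One operational note on the ivfflat index above: it is approximate, and recall depends on how many lists are probed at query time. pgvector exposes this as the ivfflat.probes setting, which you can raise per session when recall matters more than latency. A sketch of setting it before a retrieval query (the helper name is mine; it assumes a node-postgres client and the document_chunks table above):

```javascript
// Raise ivfflat.probes for this session before running the ANN query.
// More probes means better recall at the cost of latency. Sketch only.
function retrieveWithProbes(client, queryEmbedding, probes) {
  return client.query("SET ivfflat.probes = " + parseInt(probes, 10))
    .then(function () {
      return client.query(
        "SELECT id, content FROM document_chunks " +
        "ORDER BY embedding <=> $1::vector LIMIT 10",
        ["[" + queryEmbedding.join(",") + "]"]
      );
    })
    .then(function (result) { return result.rows; });
}
```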

Query Pipeline

The query pipeline is where most RAG quality is won or lost. The stages are: embed the query, retrieve candidates, rerank, pack context, and generate.

Implementing Retrieval with pgvector

The retrieval layer performs a nearest-neighbor search against stored embeddings. For production, you want to combine vector similarity with keyword search.

function retrieveChunks(queryEmbedding, queryText, options) {
  var limit = (options && options.limit) || 20;
  var similarityThreshold = (options && options.similarityThreshold) || 0.3;

  var sql = [
    "WITH semantic AS (",
    "  SELECT id, document_id, chunk_index, content, metadata,",
    "    1 - (embedding <=> $1::vector) AS similarity_score",
    "  FROM document_chunks",
    "  WHERE 1 - (embedding <=> $1::vector) > $3",
    "  ORDER BY embedding <=> $1::vector",
    "  LIMIT $2",
    "),",
    "keyword AS (",
    "  SELECT id, document_id, chunk_index, content, metadata,",
    "    ts_rank(tsv, plainto_tsquery('english', $4)) AS keyword_score",
    "  FROM document_chunks",
    "  WHERE tsv @@ plainto_tsquery('english', $4)",
    "  ORDER BY keyword_score DESC",
    "  LIMIT $2",
    ")",
    "SELECT COALESCE(s.id, k.id) AS id,",
    "  COALESCE(s.document_id, k.document_id) AS document_id,",
    "  COALESCE(s.chunk_index, k.chunk_index) AS chunk_index,",
    "  COALESCE(s.content, k.content) AS content,",
    "  COALESCE(s.metadata, k.metadata) AS metadata,",
    "  COALESCE(s.similarity_score, 0) AS similarity_score,",
    "  COALESCE(k.keyword_score, 0) AS keyword_score,",
    "  (COALESCE(s.similarity_score, 0) * 0.7 + ",
    "   COALESCE(k.keyword_score, 0) * 0.3) AS combined_score",
    "FROM semantic s",
    "FULL OUTER JOIN keyword k ON s.id = k.id",
    "ORDER BY combined_score DESC",
    "LIMIT $2"
  ].join("\n");

  return pool.query(sql, [
    "[" + queryEmbedding.join(",") + "]",
    limit,
    similarityThreshold,
    queryText
  ]).then(function (result) {
    return result.rows;
  });
}

Hybrid Retrieval

The query above demonstrates hybrid retrieval — combining semantic search (vector similarity) with keyword search (PostgreSQL full-text search). The weighting between semantic and keyword scores is tunable, but note that ts_rank is not bounded to the same 0-1 range as cosine similarity, so the raw scores are not directly comparable — normalize them or validate the weights against your evaluation set. I start with 70/30 semantic/keyword and adjust based on evaluation results.

Keyword search catches exact matches that embedding models sometimes miss. If a user asks about "error code ERR_CONN_REFUSED", the embedding might not place that close to the chunk containing that exact error code, but keyword search will find it.

Reranking Retrieved Results

Vector similarity is a rough proxy for relevance. A dedicated reranker model — a cross-encoder that scores query-document pairs — dramatically improves result ordering. Cohere's rerank API and open-source cross-encoders both work well.

var axios = require("axios");

function rerankResults(query, chunks, options) {
  var topN = (options && options.topN) || 5;

  return axios.post("https://api.cohere.ai/v1/rerank", {
    model: "rerank-english-v3.0",
    query: query,
    documents: chunks.map(function (chunk) {
      return chunk.content;
    }),
    top_n: topN,
    return_documents: false
  }, {
    headers: {
      "Authorization": "Bearer " + process.env.COHERE_API_KEY,
      "Content-Type": "application/json"
    }
  }).then(function (response) {
    return response.data.results.map(function (result) {
      var originalChunk = chunks[result.index];
      return {
        id: originalChunk.id,
        document_id: originalChunk.document_id,
        chunk_index: originalChunk.chunk_index,
        content: originalChunk.content,
        metadata: originalChunk.metadata,
        relevance_score: result.relevance_score,
        original_similarity: originalChunk.similarity_score
      };
    });
  });
}

Over-retrieve then rerank. Fetch 20-30 candidates from vector search, then rerank to the top 5. This consistently outperforms just taking the top 5 from vector search directly.

Context Window Packing

You have a fixed token budget for context. Packing it well is critical. The strategy: sort chunks by relevance, accumulate until you approach the budget, and format with clear boundaries and source attribution.

function packContext(rankedChunks, options) {
  var maxTokens = (options && options.maxTokens) || 3000;
  var packed = [];
  var totalTokens = 0;

  for (var i = 0; i < rankedChunks.length; i++) {
    var chunk = rankedChunks[i];
    var formatted = "[Source " + (i + 1) + " | Document: " +
      (chunk.metadata.title || chunk.document_id) +
      " | Section " + chunk.chunk_index + "]\n" +
      chunk.content + "\n";

    var chunkTokens = estimateTokens(formatted);

    if (totalTokens + chunkTokens > maxTokens) {
      // Try to fit a truncated version
      var remaining = maxTokens - totalTokens;
      if (remaining > 100) {
        var truncated = formatted.substring(0, remaining * 4); // rough token-to-char
        packed.push(truncated + "...[truncated]");
      }
      break;
    }

    packed.push(formatted);
    totalTokens += chunkTokens;
  }

  return {
    context: packed.join("\n---\n"),
    sources: packed.length,
    tokensUsed: totalTokens
  };
}

Generating with Citations

The generation prompt should explicitly instruct the model to cite sources. Format the prompt so the model can reference the numbered source tags you embedded in the context.

function generateAnswer(query, packedContext, conversationHistory) {
  var systemPrompt = [
    "You are a helpful assistant that answers questions based on provided context.",
    "RULES:",
    "1. Only answer based on the provided context. If the context does not contain enough information, say so.",
    "2. Cite your sources using [Source N] notation inline.",
    "3. If multiple sources support a claim, cite all of them.",
    "4. Do not make up information not present in the sources.",
    "5. If you are unsure, express uncertainty."
  ].join("\n");

  var messages = [
    { role: "system", content: systemPrompt }
  ];

  // Add conversation history for multi-turn
  if (conversationHistory && conversationHistory.length > 0) {
    for (var i = 0; i < conversationHistory.length; i++) {
      messages.push(conversationHistory[i]);
    }
  }

  messages.push({
    role: "user",
    content: "Context:\n" + packedContext.context +
      "\n\nQuestion: " + query
  });

  return openai.chat.completions.create({
    model: "gpt-4o",
    messages: messages,
    temperature: 0.1,
    max_tokens: 1000
  }).then(function (response) {
    var answer = response.choices[0].message.content;
    return {
      answer: answer,
      citations: extractCitations(answer),
      usage: response.usage
    };
  });
}

function extractCitations(answer) {
  var citations = [];
  var regex = /\[Source (\d+)\]/g;
  var match;
  while ((match = regex.exec(answer)) !== null) {
    var sourceNum = parseInt(match[1], 10);
    if (citations.indexOf(sourceNum) === -1) {
      citations.push(sourceNum);
    }
  }
  return citations.sort(function (a, b) { return a - b; });
}

Multi-Turn RAG Conversations

In a multi-turn conversation, the user's follow-up question often lacks context. "What about the second one?" makes no sense without the conversation history. Query rewriting transforms the follow-up into a standalone query suitable for retrieval.

function rewriteQuery(currentQuery, conversationHistory) {
  if (!conversationHistory || conversationHistory.length === 0) {
    return Promise.resolve(currentQuery);
  }

  var rewriteMessages = [
    {
      role: "system",
      content: "Rewrite the user's latest message as a standalone search query. " +
        "Incorporate relevant context from the conversation history. " +
        "Return ONLY the rewritten query, nothing else."
    }
  ];

  // Include recent history
  var recent = conversationHistory.slice(-4);
  for (var i = 0; i < recent.length; i++) {
    rewriteMessages.push(recent[i]);
  }

  rewriteMessages.push({
    role: "user",
    content: currentQuery
  });

  return openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: rewriteMessages,
    temperature: 0,
    max_tokens: 200
  }).then(function (response) {
    return response.choices[0].message.content.trim();
  });
}

Use a fast, cheap model for query rewriting. It does not need to be the same model you use for generation. GPT-4o-mini or Claude Haiku work well here since the task is simple and latency matters.

Implementing Guardrails

Production RAG systems need guardrails at multiple levels: relevance filtering before generation, groundedness checking after generation, and input validation before retrieval.
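Input validation is the cheapest of the three: reject obviously bad queries before spending tokens on embedding and retrieval. A minimal sketch (the limits are illustrative, not prescriptive):

```javascript
// Validate user input before retrieval. Thresholds here are illustrative;
// tune them to your own traffic patterns.
function validateQuery(rawQuery) {
  if (typeof rawQuery !== "string") {
    return { ok: false, reason: "not_a_string" };
  }
  var query = rawQuery.trim();
  if (query.length === 0) {
    return { ok: false, reason: "empty" };
  }
  if (query.length > 2000) {
    return { ok: false, reason: "too_long" };
  }
  return { ok: true, query: query };
}
```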

Relevance Filtering

After reranking, filter out chunks below a relevance threshold. If no chunks pass the filter, return an "I don't have enough information" response instead of letting the model hallucinate.

function applyRelevanceGuardrails(rerankedChunks, options) {
  var minRelevance = (options && options.minRelevance) || 0.25;
  var minChunks = (options && options.minChunks) || 1;

  var relevant = rerankedChunks.filter(function (chunk) {
    return chunk.relevance_score >= minRelevance;
  });

  if (relevant.length < minChunks) {
    return {
      passed: false,
      reason: "insufficient_relevant_context",
      chunks: [],
      message: "I don't have enough relevant information in my knowledge base to answer this question accurately."
    };
  }

  return {
    passed: true,
    chunks: relevant,
    avgRelevance: relevant.reduce(function (sum, c) {
      return sum + c.relevance_score;
    }, 0) / relevant.length
  };
}

Hallucination Detection

After generation, check that the answer is grounded in the provided sources. A simple but effective approach is to ask a second model to verify.

function checkGroundedness(answer, context) {
  var verificationPrompt = [
    "You are a fact-checker. Given the ANSWER and the SOURCE CONTEXT below, ",
    "evaluate whether every claim in the answer is supported by the context.",
    "",
    "Respond with a JSON object:",
    '{"grounded": true/false, "unsupported_claims": ["list of claims not in context"], "confidence": 0.0-1.0}',
    "",
    "SOURCE CONTEXT:",
    context,
    "",
    "ANSWER:",
    answer
  ].join("\n");

  return openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: verificationPrompt }],
    temperature: 0,
    response_format: { type: "json_object" }
  }).then(function (response) {
    return JSON.parse(response.choices[0].message.content);
  });
}

This adds latency and cost, so in production I run it asynchronously. Log the result for monitoring but do not block the response unless groundedness is below a critical threshold.

RAG Evaluation Metrics

You cannot improve what you cannot measure. The key metrics for RAG systems:

Retrieval Quality:

  • Hit Rate: Percentage of queries where the correct answer chunk appears in the top-k retrieved results
  • Mean Reciprocal Rank (MRR): Average of 1/rank for the first relevant result
  • Normalized Discounted Cumulative Gain (nDCG): Measures ranking quality accounting for position

Generation Quality:

  • Faithfulness: Does the answer only contain claims supported by the context?
  • Answer Relevance: Does the answer address the actual question?
  • Answer Correctness: Is the answer factually correct compared to a gold standard?
A simple harness that measures hit rate and MRR against a labeled test set:

function evaluateRetrieval(testCases) {
  var results = {
    hitRate: 0,
    mrr: 0,
    totalCases: testCases.length
  };

  return Promise.all(testCases.map(function (testCase) {
    return embedTexts([testCase.query]).then(function (embeddings) {
      return retrieveChunks(embeddings[0], testCase.query, { limit: 10 });
    }).then(function (chunks) {
      var hit = false;
      var reciprocalRank = 0;

      for (var i = 0; i < chunks.length; i++) {
        if (chunks[i].document_id === testCase.expectedDocumentId) {
          hit = true;
          if (reciprocalRank === 0) {
            reciprocalRank = 1 / (i + 1);
          }
          break;
        }
      }

      return { hit: hit, reciprocalRank: reciprocalRank };
    });
  })).then(function (caseResults) {
    var hits = 0;
    var totalMRR = 0;

    for (var i = 0; i < caseResults.length; i++) {
      if (caseResults[i].hit) hits++;
      totalMRR += caseResults[i].reciprocalRank;
    }

    results.hitRate = hits / results.totalCases;
    results.mrr = totalMRR / results.totalCases;
    return results;
  });
}
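The harness above covers hit rate and MRR; nDCG additionally needs graded relevance labels (e.g. 0-3 per result). A sketch using the standard exponential gain formulation over a ranked list of grades:

```javascript
// nDCG@k over a ranked list of graded relevance scores (e.g. 0-3).
// DCG uses the 2^rel - 1 gain with a log2 position discount, normalized
// by the DCG of the ideal (descending) ordering.
function ndcgAtK(relevances, k) {
  function dcg(scores) {
    var sum = 0;
    for (var i = 0; i < scores.length; i++) {
      sum += (Math.pow(2, scores[i]) - 1) / Math.log2(i + 2);
    }
    return sum;
  }
  var ideal = relevances.slice().sort(function (a, b) { return b - a; });
  var idealDcg = dcg(ideal.slice(0, k));
  if (idealDcg === 0) return 0;
  return dcg(relevances.slice(0, k)) / idealDcg;
}
```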

Build an evaluation set early. Fifty to one hundred query-answer pairs with known correct source documents is enough to catch regressions. Run evaluations on every change to your chunking strategy, embedding model, or retrieval logic.

Scaling RAG Systems

Caching

Cache at two levels: embedding cache (avoid re-embedding identical queries) and response cache (avoid re-generating answers for identical queries against the same document state).

var NodeCache = require("node-cache");

var embeddingCache = new NodeCache({ stdTTL: 3600, maxKeys: 10000 });
var responseCache = new NodeCache({ stdTTL: 300, maxKeys: 5000 });

function getCachedEmbedding(text) {
  var key = require("crypto").createHash("sha256").update(text).digest("hex");
  var cached = embeddingCache.get(key);
  if (cached) return Promise.resolve(cached);

  return embedTexts([text]).then(function (embeddings) {
    embeddingCache.set(key, embeddings[0]);
    return embeddings[0];
  });
}

function getCachedResponse(query, documentHash) {
  var key = require("crypto").createHash("sha256")
    .update(query + "|" + documentHash)
    .digest("hex");
  return responseCache.get(key);
}

function setCachedResponse(query, documentHash, response) {
  var key = require("crypto").createHash("sha256")
    .update(query + "|" + documentHash)
    .digest("hex");
  responseCache.set(key, response);
}

Read Replicas

For high-traffic RAG systems, use PostgreSQL read replicas for retrieval queries. The write primary handles ingestion, and replicas serve vector search queries. Replication lag is typically under a second, which is acceptable for document retrieval.
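A simple way to wire this up is a router that inspects each statement and sends reads to the replica. This is a sketch; primaryPool and replicaPool would be two node-postgres Pool instances built from separate connection strings.

```javascript
// Route read queries to a replica pool and everything else to the primary.
// Sketch only: assumes primaryPool and replicaPool expose pg's query(sql, params).
function createPoolRouter(primaryPool, replicaPool) {
  return {
    query: function (sql, params) {
      // SELECT and WITH (our retrieval CTEs) are safe to serve from a replica
      var isRead = /^\s*(select|with)\b/i.test(sql);
      var pool = isRead ? replicaPool : primaryPool;
      return pool.query(sql, params);
    }
  };
}
```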

Async Ingestion

Ingestion should never block your query pipeline. Use a job queue (Bull, BullMQ, or pg-boss) to process documents asynchronously.

var Queue = require("bull");
var ingestionQueue = new Queue("document-ingestion", process.env.REDIS_URL);

ingestionQueue.process(function (job) {
  var data = job.data;
  // Buffers are base64-encoded when enqueued (job data is serialized to
  // JSON in Redis), so decode back to a Buffer before parsing
  var buffer = Buffer.from(data.buffer, "base64");
  return parseDocument(buffer, data.mimeType)
    .then(function (parsed) {
      var chunks = semanticChunk(parsed.text, { maxTokens: 512 });
      return embedTexts(chunks).then(function (embeddings) {
        return storeChunks(data.documentId, chunks, embeddings, parsed.metadata);
      });
    });
});

// Enqueue a document for processing
function ingestDocument(documentId, buffer, mimeType) {
  return ingestionQueue.add({
    documentId: documentId,
    buffer: buffer.toString("base64"),
    mimeType: mimeType
  }, {
    attempts: 3,
    backoff: { type: "exponential", delay: 5000 }
  });
}

Monitoring RAG Quality in Production

Production monitoring for RAG goes beyond uptime and latency. You need to track retrieval quality, generation quality, and user satisfaction continuously.

function logRAGEvent(event) {
  var record = {
    timestamp: new Date().toISOString(),
    query: event.query,
    rewritten_query: event.rewrittenQuery || null,
    chunks_retrieved: event.chunksRetrieved,
    chunks_after_rerank: event.chunksAfterRerank,
    avg_relevance_score: event.avgRelevance,
    context_tokens_used: event.contextTokensUsed,
    generation_tokens: event.generationTokens,
    citations_count: event.citationsCount,
    groundedness_score: event.groundednessScore,
    latency_ms: event.latencyMs,
    user_feedback: event.userFeedback || null,
    guardrail_triggered: event.guardrailTriggered || false
  };

  // Store in your analytics system
  return pool.query(
    "INSERT INTO rag_events (data) VALUES ($1)",
    [JSON.stringify(record)]
  );
}

Key dashboards to build:

  • Retrieval score distribution: Are relevance scores trending down? You may need to re-embed with a better model or adjust chunking.
  • Guardrail trigger rate: If the "insufficient context" guardrail fires too often, your knowledge base has gaps.
  • Groundedness over time: Declining groundedness indicates the model is hallucinating more, possibly due to ambiguous context.
  • Latency breakdown: Separate embedding time, retrieval time, reranking time, and generation time. Know which stage is your bottleneck.
  • User feedback correlation: Correlate thumbs-up/thumbs-down with retrieval scores and groundedness to calibrate thresholds.
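For the latency dashboard in particular, averages hide tail behavior; percentiles over the logged latency_ms values are far more informative. A small nearest-rank percentile sketch:

```javascript
// Nearest-rank percentile over an array of logged latency_ms values.
function percentile(values, p) {
  if (values.length === 0) return null;
  var sorted = values.slice().sort(function (a, b) { return a - b; });
  var idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
}
```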

Complete Working Example

Here is a production-ready Express.js API that ties all of the above together into a single RAG service.

var express = require("express");
var { Pool } = require("pg");
var OpenAI = require("openai");
var axios = require("axios");
var crypto = require("crypto");
var NodeCache = require("node-cache");

var app = express();
app.use(express.json({ limit: "1mb" }));

var pool = new Pool({ connectionString: process.env.POSTGRES_CONNECTION_STRING });
var openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
var embeddingCache = new NodeCache({ stdTTL: 3600, maxKeys: 10000 });

// --- Embedding ---
function embedText(text) {
  var key = crypto.createHash("sha256").update(text).digest("hex");
  var cached = embeddingCache.get(key);
  if (cached) return Promise.resolve(cached);

  return openai.embeddings.create({
    model: "text-embedding-3-small",
    input: [text]
  }).then(function (res) {
    var embedding = res.data[0].embedding;
    embeddingCache.set(key, embedding);
    return embedding;
  });
}

// --- Hybrid Retrieval ---
function retrieve(queryEmbedding, queryText, limit) {
  var sql = [
    "WITH semantic AS (",
    "  SELECT id, document_id, chunk_index, content, metadata,",
    "    1 - (embedding <=> $1::vector) AS sim_score",
    "  FROM document_chunks",
    "  WHERE 1 - (embedding <=> $1::vector) > 0.3",
    "  ORDER BY embedding <=> $1::vector LIMIT $2",
    "),",
    "keyword AS (",
    "  SELECT id, document_id, chunk_index, content, metadata,",
    "    ts_rank(tsv, plainto_tsquery('english', $3)) AS kw_score",
    "  FROM document_chunks",
    "  WHERE tsv @@ plainto_tsquery('english', $3)",
    "  ORDER BY kw_score DESC LIMIT $2",
    ")",
    "SELECT COALESCE(s.id, k.id) AS id,",
    "  COALESCE(s.document_id, k.document_id) AS document_id,",
    "  COALESCE(s.chunk_index, k.chunk_index) AS chunk_index,",
    "  COALESCE(s.content, k.content) AS content,",
    "  COALESCE(s.metadata, k.metadata) AS metadata,",
    "  COALESCE(s.sim_score, 0) * 0.7 + COALESCE(k.kw_score, 0) * 0.3 AS score",
    "FROM semantic s FULL OUTER JOIN keyword k ON s.id = k.id",
    "ORDER BY score DESC LIMIT $2"
  ].join("\n");

  return pool.query(sql, [
    "[" + queryEmbedding.join(",") + "]",
    limit || 20,
    queryText
  ]).then(function (r) { return r.rows; });
}

// --- Reranking ---
function rerank(query, chunks, topN) {
  if (!process.env.COHERE_API_KEY) {
    return Promise.resolve(chunks.slice(0, topN || 5));
  }

  return axios.post("https://api.cohere.ai/v1/rerank", {
    model: "rerank-english-v3.0",
    query: query,
    documents: chunks.map(function (c) { return c.content; }),
    top_n: topN || 5,
    return_documents: false
  }, {
    headers: {
      "Authorization": "Bearer " + process.env.COHERE_API_KEY,
      "Content-Type": "application/json"
    }
  }).then(function (res) {
    return res.data.results.map(function (r) {
      var original = chunks[r.index];
      original.relevance_score = r.relevance_score;
      return original;
    });
  });
}

// --- Query Rewriting ---
function rewriteQuery(query, history) {
  if (!history || history.length === 0) return Promise.resolve(query);

  var msgs = [{
    role: "system",
    content: "Rewrite the user message as a standalone search query using conversation context. Return ONLY the rewritten query."
  }];
  var recent = history.slice(-4);
  for (var i = 0; i < recent.length; i++) msgs.push(recent[i]);
  msgs.push({ role: "user", content: query });

  return openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: msgs,
    temperature: 0,
    max_tokens: 200
  }).then(function (r) { return r.choices[0].message.content.trim(); });
}

// --- Context Packing ---
function packContext(chunks, maxTokens) {
  var packed = [];
  var tokens = 0;
  for (var i = 0; i < chunks.length; i++) {
    var header = "[Source " + (i + 1) + " | " +
      (chunks[i].metadata.title || chunks[i].document_id) +
      " | Chunk " + chunks[i].chunk_index + "]";
    var block = header + "\n" + chunks[i].content;
    var blockTokens = Math.ceil(block.length / 4);
    if (tokens + blockTokens > (maxTokens || 3000)) break;
    packed.push(block);
    tokens += blockTokens;
  }
  return { text: packed.join("\n---\n"), count: packed.length, tokens: tokens };
}

// --- Generation ---
function generate(query, context, history) {
  var system = [
    "Answer based ONLY on the provided context.",
    "Cite sources inline as [Source N].",
    "If context is insufficient, say so. Do not fabricate information."
  ].join(" ");

  var messages = [{ role: "system", content: system }];
  if (history) {
    for (var i = 0; i < history.length; i++) messages.push(history[i]);
  }
  messages.push({ role: "user", content: "Context:\n" + context.text + "\n\nQuestion: " + query });

  return openai.chat.completions.create({
    model: "gpt-4o",
    messages: messages,
    temperature: 0.1,
    max_tokens: 1500
  }).then(function (r) {
    var answer = r.choices[0].message.content;
    var citationRegex = /\[Source (\d+)\]/g;
    var citations = [];
    var match;
    while ((match = citationRegex.exec(answer)) !== null) {
      var num = parseInt(match[1], 10);
      if (citations.indexOf(num) === -1) citations.push(num);
    }
    return { answer: answer, citations: citations, usage: r.usage };
  });
}

// --- Logging ---
function logEvent(data) {
  return pool.query(
    "INSERT INTO rag_events (data) VALUES ($1)",
    [JSON.stringify(data)]
  ).catch(function (err) {
    console.error("Failed to log RAG event:", err.message);
  });
}

// --- API Routes ---

app.post("/api/rag/query", function (req, res) {
  var startTime = Date.now();
  var query = req.body.query;
  var history = req.body.history || [];

  if (!query || typeof query !== "string" || query.trim().length === 0) {
    return res.status(400).json({ error: "Query is required" });
  }

  var rewrittenQuery;
  var retrievedChunks;
  var rerankedChunks;
  var relevantChunks;
  var context;

  rewriteQuery(query, history)
    .then(function (rq) {
      rewrittenQuery = rq;
      return embedText(rewrittenQuery);
    })
    .then(function (embedding) {
      return retrieve(embedding, rewrittenQuery, 20);
    })
    .then(function (chunks) {
      retrievedChunks = chunks;
      if (chunks.length === 0) {
        throw { status: 200, body: {
          answer: "I don't have enough information in my knowledge base to answer this question.",
          sources: [], citations: [], guardrail: "no_results"
        }};
      }
      return rerank(rewrittenQuery, chunks, 5);
    })
    .then(function (ranked) {
      rerankedChunks = ranked;

      // Relevance guardrail: tune this threshold on your evaluation set
      relevantChunks = ranked.filter(function (c) {
        return (c.relevance_score || c.score || 0) >= 0.25;
      });
      if (relevantChunks.length === 0) {
        throw { status: 200, body: {
          answer: "I found some results but none were relevant enough to provide a confident answer.",
          sources: [], citations: [], guardrail: "low_relevance"
        }};
      }

      context = packContext(relevantChunks, 3000);
      return generate(query, context, history);
    })
    .then(function (result) {
      // Build the source list from the chunks that actually made it into
      // the context, not from the full reranked list (some of which may
      // have been filtered out by the relevance guardrail)
      var sources = relevantChunks.slice(0, context.count).map(function (c, i) {
        return {
          index: i + 1,
          document_id: c.document_id,
          title: (c.metadata && c.metadata.title) || null,
          chunk_index: c.chunk_index,
          relevance_score: c.relevance_score || c.score
        };
      });

      var latencyMs = Date.now() - startTime;

      // Async logging
      logEvent({
        timestamp: new Date().toISOString(),
        query: query,
        rewritten_query: rewrittenQuery,
        chunks_retrieved: retrievedChunks.length,
        chunks_after_rerank: rerankedChunks.length,
        context_chunks_used: context.count,
        context_tokens: context.tokens,
        citations: result.citations,
        generation_tokens: result.usage.total_tokens,
        latency_ms: latencyMs
      });

      res.json({
        answer: result.answer,
        citations: result.citations,
        sources: sources,
        metadata: {
          rewritten_query: rewrittenQuery !== query ? rewrittenQuery : null,
          chunks_retrieved: retrievedChunks.length,
          chunks_used: context.count,
          latency_ms: latencyMs
        }
      });
    })
    .catch(function (err) {
      if (err.status && err.body) {
        return res.status(err.status).json(err.body);
      }
      console.error("RAG query error:", err);
      res.status(500).json({ error: "Internal server error" });
    });
});

app.post("/api/rag/ingest", function (req, res) {
  var title = req.body.title;
  var content = req.body.content;
  var metadata = req.body.metadata || {};

  if (!title || !content) {
    return res.status(400).json({ error: "Title and content are required" });
  }

  var documentId;

  pool.query(
    "INSERT INTO documents (title, metadata) VALUES ($1, $2) RETURNING id",
    [title, JSON.stringify(metadata)]
  ).then(function (result) {
    documentId = result.rows[0].id;
    var chunks = semanticChunk(content, { maxTokens: 512 });

    return embedTexts(chunks).then(function (embeddings) {
      return storeChunks(documentId, chunks, embeddings, { title: title });
    }).then(function () {
      return chunks.length;
    });
  }).then(function (chunkCount) {
    res.json({
      document_id: documentId,
      chunks_created: chunkCount,
      message: "Document ingested successfully"
    });
  }).catch(function (err) {
    console.error("Ingestion error:", err);
    res.status(500).json({ error: "Ingestion failed" });
  });
});

// Health check with retrieval test
app.get("/api/rag/health", function (req, res) {
  var checks = {};

  Promise.all([
    pool.query("SELECT COUNT(*) FROM document_chunks")
      .then(function (r) { checks.chunks = parseInt(r.rows[0].count, 10); })
      .catch(function (e) { checks.db_error = e.message; }),
    embedText("health check test")
      .then(function () { checks.embeddings = "ok"; })
      .catch(function (e) { checks.embeddings_error = e.message; })
  ]).then(function () {
    var healthy = checks.chunks !== undefined && checks.embeddings === "ok";
    res.status(healthy ? 200 : 503).json({
      status: healthy ? "healthy" : "degraded",
      checks: checks
    });
  });
});

var PORT = process.env.PORT || 3000;
app.listen(PORT, function () {
  console.log("RAG service running on port " + PORT);
});

Common Issues and Troubleshooting

1. pgvector Index Not Being Used

ERROR: operator class "vector_cosine_ops" does not exist for access method "ivfflat"

This means the pgvector extension is not installed or not up to date. Run CREATE EXTENSION IF NOT EXISTS vector; and ensure you are on pgvector 0.5.0 or later. Also, IVFFlat indexes require you to have data in the table before creating the index — creating an IVFFlat index on an empty table will produce poor results. Insert at least a few hundred rows first, then create the index.
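Under those constraints, a typical setup looks like the following. The `embedding` column name is an assumption (match it to your schema), and `lists = 100` is only a starting point — a common rule of thumb is roughly rows / 1000:

```sql
-- Enable the extension first (requires pgvector 0.5.0+ for best results)
CREATE EXTENSION IF NOT EXISTS vector;

-- Create the IVFFlat index only AFTER the table has data, so the
-- clustering step has real rows to work with
CREATE INDEX ON document_chunks
  USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);
```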

If queries are slow despite having an index, check with EXPLAIN ANALYZE. PostgreSQL may choose a sequential scan if the table is small. Set ivfflat.probes higher for better recall at the cost of speed:

SET ivfflat.probes = 10;  -- default is 1

2. Embedding Dimension Mismatch

ERROR: expected 1536 dimensions, not 3072

This happens when you switch embedding models without re-embedding your existing chunks. text-embedding-3-small produces 1536-dimensional vectors, while text-embedding-3-large produces 3072. Your column definition must match the model. If you change models, you must re-embed all existing documents. There is no shortcut.
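One migration approach, sketched here with a hypothetical `embedding_v2` column, is to add a second column sized for the new model, backfill it from a re-embedding job, and swap once the backfill completes:

```sql
-- Vector dimension is fixed per column and must match the model's
-- output (3072 for text-embedding-3-large)
ALTER TABLE document_chunks
  ADD COLUMN embedding_v2 vector(3072);
```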

3. Reranker API Timeout Under Load

Error: timeout of 30000ms exceeded
POST https://api.cohere.ai/v1/rerank

The rerank API can be slow when you send many documents. Reduce the number of candidates sent for reranking (20 instead of 50), or add a timeout with fallback:

function rerankWithFallback(query, chunks, topN) {
  var timer;
  // Reject if the reranker takes longer than 5 seconds
  var timeout = new Promise(function (_, reject) {
    timer = setTimeout(function () { reject(new Error("rerank timed out")); }, 5000);
  });
  return Promise.race([rerank(query, chunks, topN), timeout])
    .then(function (ranked) { clearTimeout(timer); return ranked; })
    .catch(function (err) {
      clearTimeout(timer);
      console.warn("Reranker failed, falling back to similarity ordering:", err.message);
      return chunks.slice(0, topN || 5);
    });
}

4. Context Window Overflow

Error: This model's maximum context length is 128000 tokens. However, your messages resulted in 134521 tokens.

This happens when your context packing does not account for the system prompt, conversation history, and generation headroom. Always calculate your token budget as: model_max - system_prompt_tokens - history_tokens - generation_max_tokens - safety_margin. A common mistake is packing context to the model's full limit and leaving no room for the response.
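That budget calculation can be captured in a small helper. A sketch — the parameter names are illustrative, and the token counts should come from a real tokenizer, not estimates:

```javascript
// Compute how many tokens are left for retrieved context.
// All inputs are token counts measured with the model's tokenizer.
function contextBudget(opts) {
  var modelMax = opts.modelMax;           // e.g. 128000 for gpt-4o
  var systemTokens = opts.systemTokens;   // system prompt size
  var historyTokens = opts.historyTokens; // conversation history size
  var maxTokensOut = opts.maxTokensOut;   // the max_tokens you pass to the API
  var safetyMargin = opts.safetyMargin || 500;
  return modelMax - systemTokens - historyTokens - maxTokensOut - safetyMargin;
}
```

Pack retrieved chunks only up to this number, never to the model's full limit.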

5. Low Retrieval Quality for Acronyms and Codes

Embedding models struggle with domain-specific acronyms, error codes, and short identifiers. "ECONNREFUSED" and "connection refused" may have low cosine similarity despite meaning the same thing. Hybrid retrieval addresses this — the keyword search component catches exact string matches that semantic search misses. Additionally, consider adding an acronym expansion step during both ingestion and query time.
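An acronym expansion step can be as simple as appending the plain-language form next to each known acronym, so both the exact string and its expansion are present for keyword and semantic matching. A sketch — the glossary entries here are illustrative; build the map from your own domain docs:

```javascript
// Hypothetical acronym glossary; populate from your domain glossary
var ACRONYMS = {
  ECONNREFUSED: "connection refused",
  ETIMEDOUT: "connection timed out"
};

// Append the expansion after each known acronym so both forms end up
// in the chunk (at ingestion) or the query (at retrieval)
function expandAcronyms(text) {
  return text.replace(/\b[A-Z][A-Z0-9_]{2,}\b/g, function (m) {
    return ACRONYMS[m] ? m + " (" + ACRONYMS[m] + ")" : m;
  });
}
```

Run this on chunks during ingestion and on the query before embedding, so both sides of the similarity comparison see the same expansions.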

Best Practices

  • Chunk with overlap: Always use 10-15% overlap between chunks to preserve context across boundaries. Chunks without overlap lose critical information at the edges.

  • Over-retrieve, then rerank: Retrieve 3-4x more candidates than you need, then use a cross-encoder reranker to select the final set. This consistently outperforms top-k retrieval alone by 15-25% on relevance metrics.

  • Build an evaluation set from day one: Collect query-answer pairs with known correct source documents. Even 50 pairs give you a regression test. Without evaluation data, you are optimizing blind.

  • Separate your ingestion and query infrastructure: Ingestion is write-heavy, bursty, and can tolerate latency. Queries need consistent low latency. Use async job queues for ingestion and read replicas for query serving.

  • Set relevance thresholds and fail gracefully: It is better to say "I don't know" than to generate a hallucinated answer. Tune your relevance threshold on your evaluation set and monitor the guardrail trigger rate.

  • Cache embeddings aggressively: The same query embedded twice produces the same vector. Cache embeddings for at least an hour. For popular queries, cache the entire response with a shorter TTL.

  • Monitor retrieval scores, not just uptime: A RAG system can be "up" while producing terrible answers. Track average relevance scores, groundedness, citation rates, and user feedback over time.

  • Version your chunking and embedding strategies: When you change how documents are chunked or which embedding model you use, re-embed everything. Mixing chunks from different strategies in the same table produces unpredictable retrieval quality.

  • Use query rewriting for multi-turn conversations: Follow-up questions lack context. Rewrite them into standalone queries before retrieval. This is cheap (use a small model) and dramatically improves multi-turn accuracy.

  • Design for observability from the start: Log every RAG pipeline stage — query, rewritten query, retrieval scores, reranker scores, context tokens used, generation tokens, citations, latency breakdown. You will need this data to debug quality issues.
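The embedding-caching advice above can be sketched as a minimal in-memory layer. This is illustrative only — in production use Redis or similar so the cache is shared across instances and survives restarts. `embedFn` stands for any embedding call that returns a promise of a vector, such as the embedText helper used elsewhere in this article:

```javascript
// Minimal in-memory embedding cache with a TTL (sketch, not production)
var embeddingCache = new Map();

function cachedEmbed(text, embedFn, ttlMs) {
  // Normalize the key so trivially different queries share an entry
  var key = text.trim().toLowerCase();
  var hit = embeddingCache.get(key);
  if (hit && Date.now() - hit.at < ttlMs) {
    return Promise.resolve(hit.vector);
  }
  return embedFn(text).then(function (vector) {
    embeddingCache.set(key, { at: Date.now(), vector: vector });
    return vector;
  });
}
```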
