Embeddings

Fine-Tuning Embedding Models for Domain-Specific Search

Fine-tune embedding models for domain-specific search with training data preparation, evaluation metrics, and re-indexing pipelines in Node.js.

Overview

Off-the-shelf embedding models are trained on general-purpose text and perform remarkably well on broad tasks, but they fall apart when your domain uses specialized jargon, abbreviations, or concepts where the same word carries completely different meaning than it does in everyday English. Fine-tuning an embedding model teaches it to understand your domain's language and produce vectors that cluster semantically related documents the way your users actually think about them. This article walks through the full pipeline: building training data, fine-tuning via API and open-source tools, evaluating quality improvements with proper metrics, and re-indexing a pgvector corpus from Node.js.

Prerequisites

  • Node.js 18+ installed
  • PostgreSQL with pgvector extension enabled
  • An OpenAI API key (for embedding generation and fine-tuning)
  • A Cohere API key (optional, for Cohere's custom model training)
  • Basic understanding of vector embeddings and similarity search
  • A running search system with at least a few hundred queries in your logs
  • Python 3.9+ (for sentence-transformers fine-tuning, called from Node.js)

When Off-the-Shelf Embeddings Are Not Enough

I have watched teams burn weeks debugging search relevance issues that ultimately came down to one problem: the embedding model did not understand their domain. General-purpose models like text-embedding-3-small are trained on internet-scale text. They know that "python" relates to "programming" and "snake." But they do not know that in your medical records system, "discharge" means "patient leaving the hospital," not "firing an employee" or "electrical discharge."

Here are the symptoms that tell you a base model is failing:

Symptom 1: Domain jargon returns irrelevant results
  Query: "CRD reconciliation loop"
  Expected: Kubernetes custom resource definition docs
  Got: Financial reconciliation articles

Symptom 2: Abbreviations map to wrong concepts
  Query: "PE ratio analysis"
  Expected: Price-to-earnings financial analysis
  Got: Pulmonary embolism medical content

Symptom 3: Similar queries return wildly different results
  Query: "k8s pod eviction" vs "kubernetes pod eviction"
  Expected: Same results
  Got: 40% overlap

Symptom 4: Cosine similarity scores are compressed
  Relevant pairs: 0.72-0.78
  Irrelevant pairs: 0.65-0.72
  Margin too thin for reliable ranking

When you see these patterns, prompt engineering and query expansion can only take you so far. Fine-tuning teaches the model to spread your domain's concepts further apart in vector space, giving you wider margins and more reliable retrieval.
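Symptom 4 is easy to quantify before you commit to anything. Given a handful of hand-labeled pairs and their embeddings, measure the gap between the worst relevant score and the best irrelevant score. This is a minimal sketch with plain cosine similarity and no external dependencies; the input shape is my own convention:

```javascript
// Compute cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  var dot = 0, normA = 0, normB = 0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// pairs: [{ queryEmb: [...], docEmb: [...], relevant: true|false }]
// Returns (lowest relevant score) - (highest irrelevant score).
// Near zero or negative means ranking is unreliable.
function similarityMargin(pairs) {
  var relevant = [];
  var irrelevant = [];
  pairs.forEach(function(p) {
    var score = cosineSimilarity(p.queryEmb, p.docEmb);
    (p.relevant ? relevant : irrelevant).push(score);
  });
  return Math.min.apply(null, relevant) - Math.max.apply(null, irrelevant);
}
```

On the compressed scores from Symptom 4 above, this would report a margin of 0.00 or less. A margin under roughly 0.05 on a labeled sample is my personal rule of thumb for "worth fine-tuning", not an established threshold.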

Preparing Training Data

Training data for embedding fine-tuning comes in pairs. You need examples of text that should be close together in vector space (positive pairs) and text that should be far apart (hard negatives). The quality of your training data matters far more than the quantity.

Positive Pairs

A positive pair is an anchor text and a text that is semantically equivalent or highly relevant. Sources include:

  • Search query + clicked result title/content: The highest-signal data you have
  • Question + answer pairs from your documentation: If a question maps to an answer, they should embed closely
  • Synonymous phrases in your domain: "k8s" and "kubernetes", "LB" and "load balancer"
  • Title + body pairs: An article title and its first paragraph

// Structure of a positive training pair
var positivePair = {
  anchor: "How do I configure horizontal pod autoscaler",
  positive: "HPA configuration allows Kubernetes to automatically scale the number of pod replicas based on CPU utilization or custom metrics"
};
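Before a pair enters the training file, it is worth running a cheap sanity filter to drop empty, stub, or degenerate examples. The thresholds below are arbitrary starting points, not established values; tune them against your own corpus:

```javascript
// Drop training pairs that are missing fields, too short, or identical.
// Length thresholds are starting points; adjust for your corpus.
function filterTrainingPairs(pairs) {
  return pairs.filter(function(p) {
    if (!p.anchor || !p.positive) return false;
    if (p.anchor.trim().length < 3) return false;    // empty-ish queries
    if (p.positive.trim().length < 20) return false; // stub documents
    if (p.anchor.trim().toLowerCase() === p.positive.trim().toLowerCase()) {
      return false;                                  // degenerate pair
    }
    return true;
  });
}
```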

Hard Negatives

Hard negatives are the secret weapon. These are documents that look superficially similar to the anchor but are actually about something different. A random negative (like pairing a Kubernetes query with a recipe) teaches the model nothing useful. A hard negative forces the model to learn subtle distinctions.

var hardNegativeExample = {
  anchor: "pod memory limits",
  positive: "Setting resource limits on Kubernetes pods to prevent OOM kills",
  hard_negative: "Podman memory configuration for rootless containers"
  // "pod" appears in both, but they are about different systems
};

To find hard negatives programmatically, use your current (base) embedding model to retrieve the top 20 results for each query, then select results that were retrieved but never clicked:

var axios = require("axios");
var { Pool } = require("pg");

var pool = new Pool({ connectionString: process.env.POSTGRES_CONNECTION_STRING });

function findHardNegatives(queryEmbedding, clickedDocIds, limit) {
  return pool.query(
    "SELECT id, title, content FROM documents " +
    "WHERE id != ALL($1::int[]) " +
    "ORDER BY embedding <=> $2 " +
    "LIMIT $3",
    [clickedDocIds, JSON.stringify(queryEmbedding), limit]
  ).then(function(result) {
    return result.rows;
  });
}

Contrastive Learning Explained Simply

Fine-tuning an embedding model uses contrastive learning. The core idea is simple: given a batch of training examples, push positive pairs closer together in vector space and push negative pairs further apart.

The most common loss function is Multiple Negatives Ranking Loss (also called InfoNCE). For each anchor in a batch, the model treats all other positives in the batch as negatives. If your batch size is 32, each anchor gets 1 positive and 31 negatives automatically. This is efficient because you do not need to explicitly provide negatives for every pair.

The training loop looks conceptually like this:

For each batch:
  1. Encode all anchors -> anchor_embeddings
  2. Encode all positives -> positive_embeddings
  3. Compute similarity matrix (anchors x positives)
  4. The diagonal should have the highest scores (correct pairs)
  5. Compute cross-entropy loss against the diagonal
  6. Backpropagate and update model weights

When you add hard negatives explicitly, the model gets even sharper signal about what to push apart. This is called Triplet Loss or Triplet Margin Loss when you have explicit (anchor, positive, negative) triples.
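The batch loop above can be sketched numerically. This toy version assumes the embeddings are already normalized, computes each anchor's similarity against every positive in the batch, and takes the cross-entropy loss against the diagonal. A real trainer would backpropagate this loss through the encoder; here we only show the forward computation:

```javascript
// In-batch Multiple Negatives Ranking Loss for normalized embeddings.
// anchors[i] pairs with positives[i]; every other positive in the batch
// acts as a negative for anchor i.
function inBatchLoss(anchors, positives, scale) {
  scale = scale || 20; // similarity temperature; 20 mirrors common defaults
  var loss = 0;
  for (var i = 0; i < anchors.length; i++) {
    var logits = positives.map(function(p) {
      var dot = 0;
      for (var d = 0; d < p.length; d++) dot += anchors[i][d] * p[d];
      return dot * scale;
    });
    // Softmax cross-entropy with index i (the diagonal) as the true class,
    // using the max-logit trick for numerical stability.
    var maxLogit = Math.max.apply(null, logits);
    var sumExp = logits.reduce(function(s, l) {
      return s + Math.exp(l - maxLogit);
    }, 0);
    loss += -(logits[i] - maxLogit - Math.log(sumExp));
  }
  return loss / anchors.length;
}
```

When every anchor matches only its own positive, the loss approaches zero; when pairs are crossed, it blows up, which is exactly the gradient signal that pulls correct pairs together.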

Using OpenAI Fine-Tuning for Custom Embeddings

As of early 2026, OpenAI supports fine-tuning their embedding models; verify this against the current API documentation before building on it, since embedding fine-tuning is newer than chat fine-tuning and the accepted file schema may change. The process requires preparing a JSONL file with training examples and uploading it through their API.

var fs = require("fs");
var axios = require("axios");
var FormData = require("form-data");

var OPENAI_API_KEY = process.env.OPENAI_API_KEY;

function prepareOpenAITrainingFile(pairs, outputPath) {
  var lines = pairs.map(function(pair) {
    return JSON.stringify({
      anchor: pair.anchor,
      positive: pair.positive,
      negative: pair.hard_negative || undefined
    });
  });

  fs.writeFileSync(outputPath, lines.join("\n"), "utf8");
  console.log("Wrote " + lines.length + " training pairs to " + outputPath);
  return outputPath;
}

function uploadTrainingFile(filePath) {
  var form = new FormData();
  form.append("file", fs.createReadStream(filePath));
  form.append("purpose", "fine-tune");

  return axios.post("https://api.openai.com/v1/files", form, {
    headers: Object.assign({
      "Authorization": "Bearer " + OPENAI_API_KEY
    }, form.getHeaders())
  }).then(function(response) {
    console.log("File uploaded: " + response.data.id);
    return response.data.id;
  });
}

function startFineTune(fileId) {
  return axios.post("https://api.openai.com/v1/fine_tuning/jobs", {
    training_file: fileId,
    model: "text-embedding-3-small",
    hyperparameters: {
      n_epochs: 3,
      batch_size: 32,
      learning_rate_multiplier: 0.1
    }
  }, {
    headers: {
      "Authorization": "Bearer " + OPENAI_API_KEY,
      "Content-Type": "application/json"
    }
  }).then(function(response) {
    console.log("Fine-tune job started: " + response.data.id);
    return response.data.id;
  });
}

Monitor the job until completion:

function pollFineTuneStatus(jobId) {
  return new Promise(function(resolve, reject) {
    var interval = setInterval(function() {
      axios.get("https://api.openai.com/v1/fine_tuning/jobs/" + jobId, {
        headers: { "Authorization": "Bearer " + OPENAI_API_KEY }
      }).then(function(response) {
        var status = response.data.status;
        console.log("Status: " + status);

        if (status === "succeeded") {
          clearInterval(interval);
          resolve(response.data.fine_tuned_model);
        } else if (status === "failed" || status === "cancelled") {
          clearInterval(interval);
          reject(new Error("Fine-tune " + status + ": " + JSON.stringify(response.data.error)));
        }
      }).catch(function(err) {
        clearInterval(interval);
        reject(err);
      });
    }, 60000); // Check every minute
  });
}

Using Cohere's Custom Model Training

Cohere offers a managed fine-tuning service through their dashboard and API. The advantage is that Cohere handles the infrastructure; you just provide data in their expected format. The snippet below writes tab-separated values and strips tabs and newlines from fields as a crude form of escaping; check Cohere's current docs for the exact schema their dataset upload accepts.

var axios = require("axios");
var fs = require("fs");

var COHERE_API_KEY = process.env.COHERE_API_KEY;

function prepareCohereTrainingData(pairs, outputPath) {
  var csv = "query\trelevant_passages\thard_negatives\n";

  pairs.forEach(function(pair) {
    var relevant = pair.positive.replace(/\t/g, " ").replace(/\n/g, " ");
    var negative = (pair.hard_negative || "").replace(/\t/g, " ").replace(/\n/g, " ");
    var anchor = pair.anchor.replace(/\t/g, " ").replace(/\n/g, " ");
    csv += anchor + "\t" + relevant + "\t" + negative + "\n";
  });

  fs.writeFileSync(outputPath, csv, "utf8");
  console.log("Wrote Cohere training file to " + outputPath);
}

function startCohereFineTune(datasetId) {
  return axios.post("https://api.cohere.ai/v1/finetuning/finetuned-models", {
    name: "domain-search-embed-v1",
    settings: {
      base_model: { base_type: "BASE_TYPE_EMBED" },
      dataset_id: datasetId,
      hyperparameters: {
        train_epochs: 4,
        learning_rate: 0.0001
      }
    }
  }, {
    headers: {
      "Authorization": "Bearer " + COHERE_API_KEY,
      "Content-Type": "application/json"
    }
  }).then(function(response) {
    console.log("Cohere fine-tune started: " + response.data.finetuned_model.id);
    return response.data.finetuned_model.id;
  });
}

Open-Source Fine-Tuning with Sentence-Transformers

For maximum control and zero per-query costs, fine-tune an open-source model with the sentence-transformers library. You run the training in Python but call the resulting model from Node.js through a local HTTP server.

First, prepare the training script:

# train_embedding.py
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
import json
import sys

def train(data_path, output_path, epochs=3, batch_size=32):
    model = SentenceTransformer("BAAI/bge-base-en-v1.5")

    with open(data_path, "r") as f:
        raw_pairs = [json.loads(line) for line in f]

    # MultipleNegativesRankingLoss ignores labels and treats every texts pair
    # as (anchor, positive), so adding the hard negative as a label-0 pair
    # would actually train it as a positive. MNRL accepts a triple
    # (anchor, positive, negative) instead; use that form.
    train_examples = []
    for pair in raw_pairs:
        if "hard_negative" in pair:
            train_examples.append(
                InputExample(texts=[pair["anchor"], pair["positive"], pair["hard_negative"]])
            )
        else:
            train_examples.append(
                InputExample(texts=[pair["anchor"], pair["positive"]])
            )

    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)
    train_loss = losses.MultipleNegativesRankingLoss(model)

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=epochs,
        warmup_steps=100,
        show_progress_bar=True,
        output_path=output_path
    )

    print(f"Model saved to {output_path}")

if __name__ == "__main__":
    train(sys.argv[1], sys.argv[2], int(sys.argv[3]), int(sys.argv[4]))

Launch training from Node.js:

var childProcess = require("child_process");
var path = require("path");

function trainSentenceTransformer(dataPath, outputDir, epochs, batchSize) {
  return new Promise(function(resolve, reject) {
    var proc = childProcess.spawn("python", [
      path.join(__dirname, "train_embedding.py"),
      dataPath,
      outputDir,
      String(epochs || 3),
      String(batchSize || 32)
    ]);

    var stdout = "";
    var stderr = "";

    proc.stdout.on("data", function(data) {
      stdout += data.toString();
      process.stdout.write(data);
    });

    proc.stderr.on("data", function(data) {
      stderr += data.toString();
      process.stderr.write(data);
    });

    proc.on("close", function(code) {
      if (code === 0) {
        resolve({ stdout: stdout, modelPath: outputDir });
      } else {
        reject(new Error("Training failed with code " + code + ": " + stderr));
      }
    });
  });
}

Then serve the model with a simple FastAPI endpoint and call it from Node.js:

# serve_model.py
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer
from pydantic import BaseModel
from typing import List
import sys
import uvicorn

app = FastAPI()
model = SentenceTransformer(sys.argv[1])

class EmbedRequest(BaseModel):
    texts: List[str]

@app.post("/embed")
def embed(request: EmbedRequest):
    embeddings = model.encode(request.texts, normalize_embeddings=True)
    return {"embeddings": embeddings.tolist()}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8787)

The Node.js side then posts to that endpoint:

var axios = require("axios");

function embedWithLocalModel(texts) {
  return axios.post("http://localhost:8787/embed", {
    texts: texts
  }).then(function(response) {
    return response.data.embeddings;
  });
}

Building a Training Dataset from Search Logs and Click Data

The most valuable training data comes from your own users. Every search query paired with a clicked result is an implicit relevance judgment. Here is how to extract training pairs from a typical search log table:

var { Pool } = require("pg");

var pool = new Pool({ connectionString: process.env.POSTGRES_CONNECTION_STRING });

function extractTrainingPairs(minClicks, daysSince) {
  // Interpolating numbers into SQL strings invites injection;
  // pass them as query parameters instead.
  var query = [
    "SELECT",
    "  sl.query_text AS anchor,",
    "  d.title || '. ' || LEFT(d.content, 500) AS positive,",
    "  COUNT(*) AS click_count",
    "FROM search_logs sl",
    "JOIN click_events ce ON sl.search_id = ce.search_id",
    "JOIN documents d ON ce.document_id = d.id",
    "WHERE sl.created_at > NOW() - make_interval(days => $1)",
    "GROUP BY sl.query_text, d.title, d.content",
    "HAVING COUNT(*) >= $2",
    "ORDER BY click_count DESC"
  ].join("\n");

  return pool.query(query, [daysSince, minClicks]).then(function(result) {
    console.log("Extracted " + result.rows.length + " positive pairs from search logs");
    return result.rows.map(function(row) {
      return {
        anchor: row.anchor,
        positive: row.positive,
        click_count: parseInt(row.click_count)
      };
    });
  });
}

// Queries where users searched but did not click anything are also valuable
// They tell you what is currently failing
function extractFailedQueries(daysSince) {
  var query = [
    "SELECT sl.query_text, COUNT(*) AS search_count",
    "FROM search_logs sl",
    "LEFT JOIN click_events ce ON sl.search_id = ce.search_id",
    "WHERE ce.id IS NULL",
    "AND sl.created_at > NOW() - make_interval(days => $1)",
    "GROUP BY sl.query_text",
    "HAVING COUNT(*) >= 3",
    "ORDER BY search_count DESC"
  ].join("\n");

  return pool.query(query, [daysSince]).then(function(result) {
    console.log("Found " + result.rows.length + " queries with zero clicks");
    return result.rows;
  });
}

Synthetic Training Data Generation with LLMs

When you do not have enough real search logs, you can generate synthetic training data using an LLM. This is especially useful for bootstrapping a new domain. The key is to generate diverse query phrasings that real users would actually type.

var axios = require("axios");

function generateSyntheticPairs(documents, pairsPerDoc) {
  var allPairs = [];

  function processDocument(doc, index) {
    var prompt = [
      "You are helping generate search training data for a domain-specific search engine.",
      "Given the following document, generate " + pairsPerDoc + " realistic search queries",
      "that a user might type to find this document. Include:",
      "- Short keyword queries (2-3 words)",
      "- Natural language questions",
      "- Queries using domain abbreviations or jargon",
      "- Queries with typos or alternate spellings",
      "",
      "Document title: " + doc.title,
      "Document content: " + doc.content.substring(0, 1500),
      "",
      "Return a JSON array of strings, one per query. Only the JSON array, no other text."
    ].join("\n");

    return axios.post("https://api.openai.com/v1/chat/completions", {
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: prompt }],
      temperature: 0.8
    }, {
      headers: {
        "Authorization": "Bearer " + process.env.OPENAI_API_KEY,
        "Content-Type": "application/json"
      }
    }).then(function(response) {
      var content = response.data.choices[0].message.content;
      // Models sometimes wrap JSON in markdown fences despite instructions
      var queries = JSON.parse(content.replace(/```(json)?/gi, "").trim());
      return queries.map(function(q) {
        return {
          anchor: q,
          positive: doc.title + ". " + doc.content.substring(0, 500)
        };
      });
    }).catch(function(err) {
      console.error("Failed on doc " + index + ": " + err.message);
      return [];
    });
  }

  // Process documents sequentially to respect rate limits
  var chain = Promise.resolve([]);
  documents.forEach(function(doc, i) {
    chain = chain.then(function(accumulated) {
      return processDocument(doc, i).then(function(pairs) {
        return accumulated.concat(pairs);
      });
    });
  });

  return chain;
}

I have found that synthetic data gets you about 70-80% of the improvement that real click data provides. The combination of both is where you see the best results. Generate synthetic pairs for coverage, then layer in real user behavior as it accumulates.
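Whatever the mix of synthetic and real pairs, dedupe by anchor and hold out a validation split before training, so you never evaluate on examples the model memorized. A minimal sketch; the default 90/10 split is a convention, not a requirement:

```javascript
// Deduplicate pairs by normalized anchor text, then split train/validation.
function dedupeAndSplit(pairs, validationFraction) {
  validationFraction = validationFraction || 0.1;
  var seen = {};
  var unique = pairs.filter(function(p) {
    var key = p.anchor.trim().toLowerCase();
    if (seen[key]) return false;
    seen[key] = true;
    return true;
  });

  // A deterministic shuffle before the cut would improve reproducibility;
  // kept simple here.
  var cut = Math.floor(unique.length * (1 - validationFraction));
  return {
    train: unique.slice(0, cut),
    validation: unique.slice(cut)
  };
}
```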

Evaluation Metrics for Fine-Tuned Models

You cannot improve what you do not measure. For embedding-based retrieval, these are the metrics that matter:

Recall@k

What percentage of relevant documents appear in the top k results? This is the most important metric for retrieval quality.

function recallAtK(queryResults, relevantDocIds, k) {
  var topK = queryResults.slice(0, k).map(function(r) { return r.id; });
  var hits = relevantDocIds.filter(function(id) {
    return topK.indexOf(id) !== -1;
  });
  return hits.length / relevantDocIds.length;
}

Mean Reciprocal Rank (MRR)

Where does the first relevant result appear? MRR rewards models that put the best result at position 1.

function meanReciprocalRank(queries) {
  var reciprocalRanks = queries.map(function(q) {
    for (var i = 0; i < q.results.length; i++) {
      if (q.relevantDocIds.indexOf(q.results[i].id) !== -1) {
        return 1 / (i + 1);
      }
    }
    return 0;
  });

  var sum = reciprocalRanks.reduce(function(a, b) { return a + b; }, 0);
  return sum / reciprocalRanks.length;
}

Normalized Discounted Cumulative Gain (nDCG)

nDCG accounts for graded relevance. A highly relevant result at position 3 is penalized less than an irrelevant result at position 1.

function ndcg(results, relevanceScores, k) {
  // relevanceScores is a map of docId -> relevance (0, 1, 2, 3)
  var dcg = 0;
  for (var i = 0; i < Math.min(k, results.length); i++) {
    var rel = relevanceScores[results[i].id] || 0;
    dcg += (Math.pow(2, rel) - 1) / Math.log2(i + 2);
  }

  // Ideal DCG: sort by relevance descending
  var idealRels = Object.values(relevanceScores).sort(function(a, b) {
    return b - a;
  });
  var idcg = 0;
  for (var j = 0; j < Math.min(k, idealRels.length); j++) {
    idcg += (Math.pow(2, idealRels[j]) - 1) / Math.log2(j + 2);
  }

  return idcg === 0 ? 0 : dcg / idcg;
}

Running a Full Evaluation

function evaluateModel(embedFunction, testQueries, k) {
  var results = {
    recall_at_5: [],
    recall_at_10: [],
    mrr: [],
    ndcg_at_10: []
  };

  var chain = Promise.resolve();

  testQueries.forEach(function(tq) {
    chain = chain.then(function() {
      return embedFunction(tq.query).then(function(queryEmb) {
        return pool.query(
          "SELECT id, title FROM documents ORDER BY embedding <=> $1 LIMIT $2",
          [JSON.stringify(queryEmb), k]
        );
      }).then(function(searchResults) {
        var rows = searchResults.rows;
        results.recall_at_5.push(recallAtK(rows, tq.relevantDocIds, 5));
        results.recall_at_10.push(recallAtK(rows, tq.relevantDocIds, 10));
        results.ndcg_at_10.push(ndcg(rows, tq.relevanceScores, 10));

        // Reciprocal rank of the first relevant result for this query
        var rr = 0;
        for (var i = 0; i < rows.length; i++) {
          if (tq.relevantDocIds.indexOf(rows[i].id) !== -1) {
            rr = 1 / (i + 1);
            break;
          }
        }
        results.mrr.push(rr);
      });
    });
  });

  return chain.then(function() {
    var avg = function(arr) {
      return arr.reduce(function(a, b) { return a + b; }, 0) / arr.length;
    };

    return {
      recall_at_5: avg(results.recall_at_5).toFixed(4),
      recall_at_10: avg(results.recall_at_10).toFixed(4),
      mrr: avg(results.mrr).toFixed(4),
      ndcg_at_10: avg(results.ndcg_at_10).toFixed(4),
      num_queries: testQueries.length
    };
  });
}

Typical improvements I have seen from fine-tuning on domain data:

Metric          Base Model    Fine-Tuned    Improvement
Recall@5        0.62          0.78          +25.8%
Recall@10       0.74          0.89          +20.3%
MRR             0.51          0.68          +33.3%
nDCG@10         0.58          0.76          +31.0%
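The improvement column in a table like the one above is just the relative delta; computing it in code avoids copy-paste arithmetic mistakes when you rerun evaluations. The function names here are my own:

```javascript
// Relative improvement of a fine-tuned metric over the base model, in percent.
function relativeImprovement(baseValue, fineTunedValue) {
  return ((fineTunedValue - baseValue) / baseValue) * 100;
}

// Compare two evaluation runs metric by metric.
function compareRuns(baseMetrics, fineTunedMetrics) {
  var report = {};
  Object.keys(baseMetrics).forEach(function(metric) {
    report[metric] =
      relativeImprovement(baseMetrics[metric], fineTunedMetrics[metric]).toFixed(1) + "%";
  });
  return report;
}

// Using the Recall@5 and MRR rows from the table:
compareRuns(
  { recall_at_5: 0.62, mrr: 0.51 },
  { recall_at_5: 0.78, mrr: 0.68 }
);
// → { recall_at_5: "25.8%", mrr: "33.3%" }
```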

A/B Testing Fine-Tuned vs Base Models

Never deploy a fine-tuned model without A/B testing. Route a percentage of traffic to the fine-tuned model and measure click-through rate, time-to-click, and search refinement rate (users searching again after the first query).

var crypto = require("crypto");

function getModelForUser(userId, fineTunedPercentage) {
  var hash = crypto.createHash("md5").update(userId).digest("hex");
  var bucket = parseInt(hash.substring(0, 8), 16) % 100;

  if (bucket < fineTunedPercentage) {
    return {
      model: "ft:text-embedding-3-small:my-org:domain-v1:abc123",
      variant: "fine_tuned"
    };
  }
  return {
    model: "text-embedding-3-small",
    variant: "base"
  };
}

function logSearchEvent(userId, query, variant, results, clickedDocId) {
  return pool.query(
    "INSERT INTO ab_test_events (user_id, query, variant, result_ids, clicked_doc_id, created_at) " +
    "VALUES ($1, $2, $3, $4, $5, NOW())",
    [userId, query, variant, JSON.stringify(results), clickedDocId]
  );
}

Run the test for at least two weeks and at least 1,000 queries per variant before drawing conclusions. Check for statistical significance using a chi-squared test on click-through rates.
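The chi-squared check can stay in plain Node. This sketch builds the 2x2 contingency table (clicks vs. no-clicks per variant) and compares the statistic against 3.841, the critical value for one degree of freedom at p = 0.05:

```javascript
// Chi-squared test of independence for two click-through rates.
// Each variant supplies { clicks, searches }.
function chiSquaredCTR(base, fineTuned) {
  var table = [
    [base.clicks, base.searches - base.clicks],
    [fineTuned.clicks, fineTuned.searches - fineTuned.clicks]
  ];
  var total = base.searches + fineTuned.searches;
  var rowTotals = [base.searches, fineTuned.searches];
  var colTotals = [table[0][0] + table[1][0], table[0][1] + table[1][1]];

  var chi2 = 0;
  for (var r = 0; r < 2; r++) {
    for (var c = 0; c < 2; c++) {
      var expected = rowTotals[r] * colTotals[c] / total;
      chi2 += Math.pow(table[r][c] - expected, 2) / expected;
    }
  }
  return { chi2: chi2, significant: chi2 > 3.841 }; // df = 1, p = 0.05
}
```

For example, 300/1000 clicks on base vs. 360/1000 on the fine-tuned variant clears the threshold, while 300 vs. 310 does not.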

Re-Indexing Your Corpus After Fine-Tuning

When you switch to a fine-tuned model, every document in your corpus needs to be re-embedded. The old vectors are in a different space than the new model produces. You cannot mix vectors from different models.

var axios = require("axios");
var { Pool } = require("pg");

var pool = new Pool({ connectionString: process.env.POSTGRES_CONNECTION_STRING });
var BATCH_SIZE = 100;

function reindexCorpus(modelId) {
  return pool.query("SELECT COUNT(*) FROM documents").then(function(countResult) {
    var total = parseInt(countResult.rows[0].count);
    var processed = 0;
    var offset = 0;

    function processBatch() {
      if (offset >= total) {
        console.log("Re-indexing complete. " + processed + " documents updated.");
        return Promise.resolve(processed);
      }

      var batchRows;

      return pool.query(
        "SELECT id, title, content FROM documents ORDER BY id LIMIT $1 OFFSET $2",
        [BATCH_SIZE, offset]
      ).then(function(result) {
        batchRows = result.rows;
        var texts = batchRows.map(function(row) {
          return row.title + ". " + row.content.substring(0, 8000);
        });

        return axios.post("https://api.openai.com/v1/embeddings", {
          input: texts,
          model: modelId
        }, {
          headers: {
            "Authorization": "Bearer " + process.env.OPENAI_API_KEY,
            "Content-Type": "application/json"
          }
        });
      }).then(function(embResponse) {
        // Embeddings come back in input order, so batchRows[i]
        // corresponds to embResponse.data.data[i]. No need to re-query ids.
        var embeddings = embResponse.data.data;
        var updateChain = Promise.resolve();

        batchRows.forEach(function(row, i) {
          updateChain = updateChain.then(function() {
            return pool.query(
              "UPDATE documents SET embedding = $1, model_version = $2, embedded_at = NOW() WHERE id = $3",
              [JSON.stringify(embeddings[i].embedding), modelId, row.id]
            );
          });
        });
        return updateChain;
      }).then(function() {
        processed += batchRows.length;
        offset += BATCH_SIZE;
        console.log("Progress: " + Math.min(processed, total) + "/" + total);
        return processBatch();
      });
    }

    return processBatch();
  });
}

Critical: During re-indexing, your search quality will be inconsistent because some documents have old vectors and some have new ones. There are two strategies to handle this:

  1. Shadow indexing: Write new embeddings to a separate column (embedding_v2), then swap columns atomically once complete.
  2. Blue-green tables: Write to a new table, then swap which table the search query reads from.

I strongly prefer shadow indexing:

-- Add new column
ALTER TABLE documents ADD COLUMN embedding_v2 vector(1536);

-- After re-indexing completes:
ALTER TABLE documents RENAME COLUMN embedding TO embedding_v1_backup;
ALTER TABLE documents RENAME COLUMN embedding_v2 TO embedding;

-- Update the index
DROP INDEX IF EXISTS documents_embedding_idx;
CREATE INDEX documents_embedding_idx ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

Incremental Fine-Tuning as Your Domain Evolves

Your domain does not stand still. New products launch, terminology changes, and user behavior shifts. Plan for periodic re-training:

  1. Collect new training pairs continuously: Every search + click is a new training example.
  2. Retrain quarterly: Accumulate 3 months of data, then fine-tune a new model version from the base model (not from your previous fine-tune, to avoid drift).
  3. Version your models: Use a naming convention like domain-embed-v3-2026q1.
  4. Keep evaluation sets stable: Use a fixed test set for 6-12 months so you can track improvement over time.

A lightweight in-code registry keeps versions auditable:

var MODEL_REGISTRY = {
  "domain-embed-v1-2025q4": {
    model_id: "ft:text-embedding-3-small:org:v1:abc123",
    training_pairs: 4200,
    recall_at_10: 0.82,
    deployed_at: "2025-12-15",
    status: "retired"
  },
  "domain-embed-v2-2026q1": {
    model_id: "ft:text-embedding-3-small:org:v2:def456",
    training_pairs: 8900,
    recall_at_10: 0.89,
    deployed_at: "2026-03-01",
    status: "active"
  }
};

function getActiveModel() {
  var entries = Object.keys(MODEL_REGISTRY);
  for (var i = 0; i < entries.length; i++) {
    if (MODEL_REGISTRY[entries[i]].status === "active") {
      return MODEL_REGISTRY[entries[i]];
    }
  }
  throw new Error("No active model in registry");
}

Cost-Benefit Analysis: Fine-Tuning vs Prompt Engineering

Before committing to fine-tuning, understand the tradeoffs:

Approach                               Setup Cost           Per-Query Cost            Quality Gain   Maintenance
Base embeddings                        None                 $0.02/1M tokens           Baseline       None
Query expansion (prompt engineering)   2-5 days             +$0.01/query (LLM call)   10-20%         Low
OpenAI fine-tuning                     $50-500 training     Same as base              20-35%         Quarterly retrain
Cohere custom model                    $200-1000 training   $0.10/1M tokens           25-40%         Quarterly retrain
Self-hosted fine-tuned                 GPU hours + infra    $0 (amortized)            25-40%         High (ops burden)

My recommendation: start with query expansion. If recall@10 is still below 0.85, fine-tune via OpenAI or Cohere. Only self-host if you have a dedicated ML ops team or extreme cost sensitivity at scale (>10M queries/month).

Managing Multiple Embedding Model Versions

When you have documents embedded with different model versions (during migration or A/B testing), you need to track which model produced each vector.

function searchWithVersionAwareness(queryText, modelId) {
  return generateEmbedding(queryText, modelId).then(function(queryEmb) {
    // Only search documents embedded with the same model version
    return pool.query(
      "SELECT id, title, content, 1 - (embedding <=> $1) AS similarity " +
      "FROM documents " +
      "WHERE model_version = $2 " +
      "ORDER BY embedding <=> $1 " +
      "LIMIT 20",
      [JSON.stringify(queryEmb), modelId]
    );
  }).then(function(result) {
    return result.rows;
  });
}

This is critical. Comparing vectors from different models produces meaningless similarity scores. I have seen teams waste weeks debugging "degraded search quality" that turned out to be mixed-model vectors in the same index.
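A cheap runtime guard catches mixed-model vectors before they produce confusing scores. This assumes each row carries the model_version column written during re-indexing; the function name is my own:

```javascript
// Throw if a result set mixes rows embedded by different model versions.
// Returns the single version when the set is consistent.
function assertSingleModelVersion(rows) {
  var versions = {};
  rows.forEach(function(row) {
    versions[row.model_version] = true;
  });
  var distinct = Object.keys(versions);
  if (distinct.length > 1) {
    throw new Error(
      "Mixed embedding model versions in result set: " + distinct.join(", ")
    );
  }
  return distinct[0];
}
```

Calling this in the search path during a migration turns silent relevance degradation into a loud, debuggable error.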

Complete Working Example

This end-to-end pipeline extracts training pairs from search logs, fine-tunes an embedding model, evaluates the improvement, and re-indexes a pgvector corpus.

// fine-tune-pipeline.js
var fs = require("fs");
var path = require("path");
var axios = require("axios");
var { Pool } = require("pg");

var pool = new Pool({ connectionString: process.env.POSTGRES_CONNECTION_STRING });
var OPENAI_API_KEY = process.env.OPENAI_API_KEY;
var BASE_MODEL = "text-embedding-3-small";

// Step 1: Extract training pairs from search logs
function extractPairsFromLogs(minClicks, days) {
  console.log("Step 1: Extracting training pairs from search logs...");

  return pool.query(
    "SELECT sl.query_text, d.title, LEFT(d.content, 500) AS excerpt, COUNT(*) AS clicks " +
    "FROM search_logs sl " +
    "JOIN click_events ce ON sl.search_id = ce.search_id " +
    "JOIN documents d ON ce.document_id = d.id " +
    "WHERE sl.created_at > NOW() - make_interval(days => $1) " +
    "GROUP BY sl.query_text, d.title, d.content " +
    "HAVING COUNT(*) >= $2 " +
    "ORDER BY clicks DESC",
    [days, minClicks]
  ).then(function(result) {
    console.log("  Found " + result.rows.length + " positive pairs");
    return result.rows.map(function(row) {
      return {
        anchor: row.query_text,
        positive: row.title + ". " + row.excerpt
      };
    });
  });
}

// Step 2: Mine hard negatives using the base model
function addHardNegatives(pairs) {
  console.log("Step 2: Mining hard negatives with base model...");

  var enriched = [];
  var chain = Promise.resolve();

  pairs.forEach(function(pair, index) {
    chain = chain.then(function() {
      return axios.post("https://api.openai.com/v1/embeddings", {
        input: pair.anchor,
        model: BASE_MODEL
      }, {
        headers: {
          "Authorization": "Bearer " + OPENAI_API_KEY,
          "Content-Type": "application/json"
        }
      }).then(function(embResponse) {
        var queryEmb = embResponse.data.data[0].embedding;
        return pool.query(
          "SELECT id, title, LEFT(content, 500) AS excerpt " +
          "FROM documents " +
          "WHERE title != $1 " +
          "ORDER BY embedding <=> $2 " +
          "LIMIT 10",
          [pair.positive.split(". ")[0], JSON.stringify(queryEmb)]
        );
      }).then(function(searchResult) {
        // Pick a random result from positions 5-10 as hard negative
        var candidates = searchResult.rows.slice(4, 10);
        if (candidates.length > 0) {
          var neg = candidates[Math.floor(Math.random() * candidates.length)];
          pair.hard_negative = neg.title + ". " + neg.excerpt;
        }
        enriched.push(pair);
        if ((index + 1) % 50 === 0) {
          console.log("  Processed " + (index + 1) + "/" + pairs.length);
        }
      });
    });
  });

  return chain.then(function() {
    var withNeg = enriched.filter(function(p) { return p.hard_negative; });
    console.log("  " + withNeg.length + "/" + enriched.length + " pairs have hard negatives");
    return enriched;
  });
}

// Step 3: Prepare training file and start fine-tuning
function startFineTuning(pairs) {
  console.log("Step 3: Starting fine-tune job...");

  var trainingPath = path.join(__dirname, "training_data.jsonl");
  var lines = pairs.map(function(p) {
    var obj = { anchor: p.anchor, positive: p.positive };
    if (p.hard_negative) { obj.negative = p.hard_negative; }
    return JSON.stringify(obj);
  });
  fs.writeFileSync(trainingPath, lines.join("\n"), "utf8");
  console.log("  Wrote " + lines.length + " pairs to " + trainingPath);

  // Upload file
  var FormData = require("form-data");
  var form = new FormData();
  form.append("file", fs.createReadStream(trainingPath));
  form.append("purpose", "fine-tune");

  return axios.post("https://api.openai.com/v1/files", form, {
    headers: Object.assign(
      { "Authorization": "Bearer " + OPENAI_API_KEY },
      form.getHeaders()
    )
  }).then(function(fileResponse) {
    var fileId = fileResponse.data.id;
    console.log("  Uploaded training file: " + fileId);

    return axios.post("https://api.openai.com/v1/fine_tuning/jobs", {
      training_file: fileId,
      model: BASE_MODEL,
      hyperparameters: {
        n_epochs: 3,
        batch_size: 32,
        learning_rate_multiplier: 0.1
      }
    }, {
      headers: {
        "Authorization": "Bearer " + OPENAI_API_KEY,
        "Content-Type": "application/json"
      }
    });
  }).then(function(ftResponse) {
    var jobId = ftResponse.data.id;
    console.log("  Fine-tune job started: " + jobId);
    return waitForCompletion(jobId);
  });
}

function waitForCompletion(jobId) {
  return new Promise(function(resolve, reject) {
    var checks = 0;
    var maxChecks = 240; // give up after ~4 hours at one poll per minute
    var interval = setInterval(function() {
      checks++;
      if (checks > maxChecks) {
        clearInterval(interval);
        return reject(new Error("Timed out waiting for fine-tune job " + jobId));
      }
      axios.get("https://api.openai.com/v1/fine_tuning/jobs/" + jobId, {
        headers: { "Authorization": "Bearer " + OPENAI_API_KEY }
      }).then(function(response) {
        var status = response.data.status;
        console.log("  [" + checks + "] Status: " + status);
        if (status === "succeeded") {
          clearInterval(interval);
          resolve(response.data.fine_tuned_model);
        } else if (status === "failed" || status === "cancelled") {
          clearInterval(interval);
          reject(new Error("Fine-tune " + status));
        }
      }).catch(function(err) {
        clearInterval(interval);
        reject(err);
      });
    }, 60000);
  });
}

// Step 4: Evaluate the fine-tuned model
function evaluateModel(modelId, testQueries) {
  console.log("Step 4: Evaluating fine-tuned model: " + modelId);

  var recallAt10 = [];
  var chain = Promise.resolve();

  testQueries.forEach(function(tq) {
    chain = chain.then(function() {
      return axios.post("https://api.openai.com/v1/embeddings", {
        input: tq.query,
        model: modelId
      }, {
        headers: {
          "Authorization": "Bearer " + OPENAI_API_KEY,
          "Content-Type": "application/json"
        }
      }).then(function(embResponse) {
        var queryEmb = embResponse.data.data[0].embedding;
        // Note: the stored document embeddings must come from the same model
        // as modelId, or the distances are meaningless (see issue 5 below)
        return pool.query(
          "SELECT id FROM documents ORDER BY embedding <=> $1 LIMIT 10",
          [JSON.stringify(queryEmb)]
        );
      }).then(function(searchResult) {
        var topIds = searchResult.rows.map(function(r) { return r.id; });
        // Skip queries with no labeled relevant documents to avoid 0/0 = NaN
        if (tq.relevant_ids.length === 0) { return; }
        var hits = tq.relevant_ids.filter(function(id) {
          return topIds.indexOf(id) !== -1;
        });
        recallAt10.push(hits.length / tq.relevant_ids.length);
      });
    });
  });

  return chain.then(function() {
    var avgRecall = recallAt10.length === 0 ? 0 :
      recallAt10.reduce(function(a, b) { return a + b; }, 0) / recallAt10.length;
    console.log("  Recall@10: " + avgRecall.toFixed(4));
    console.log("  Evaluated on " + testQueries.length + " queries");
    return { recall_at_10: avgRecall, num_queries: testQueries.length };
  });
}

// Step 5: Re-index corpus with new model
function reindexCorpus(modelId) {
  console.log("Step 5: Re-indexing corpus with model: " + modelId);

  var BATCH = 50;

  return pool.query("SELECT COUNT(*) FROM documents").then(function(countResult) {
    var total = parseInt(countResult.rows[0].count, 10);
    var offset = 0;
    var processed = 0;

    function batch() {
      if (offset >= total) {
        console.log("  Re-indexing complete: " + processed + " documents");
        return Promise.resolve(processed);
      }

      return pool.query(
        "SELECT id, title, LEFT(content, 8000) AS content FROM documents ORDER BY id LIMIT $1 OFFSET $2",
        [BATCH, offset]
      ).then(function(result) {
        var texts = result.rows.map(function(row) {
          return row.title + ". " + row.content;
        });

        return axios.post("https://api.openai.com/v1/embeddings", {
          input: texts,
          model: modelId
        }, {
          headers: {
            "Authorization": "Bearer " + OPENAI_API_KEY,
            "Content-Type": "application/json"
          }
        }).then(function(embResponse) {
          var updates = result.rows.map(function(row, i) {
            return pool.query(
              "UPDATE documents SET embedding = $1, model_version = $2, embedded_at = NOW() WHERE id = $3",
              [JSON.stringify(embResponse.data.data[i].embedding), modelId, row.id]
            );
          });
          // Return the real batch size so a partial final batch is counted correctly
          return Promise.all(updates).then(function() {
            return result.rows.length;
          });
        });
      }).then(function(batchSize) {
        if (batchSize === 0) {
          // No rows left even though offset < total (e.g. rows deleted mid-run)
          console.log("  Re-indexing complete: " + processed + " documents");
          return processed;
        }
        processed += batchSize;
        offset += batchSize;
        if (processed % 500 === 0 || processed >= total) {
          console.log("  Progress: " + Math.min(processed, total) + "/" + total);
        }
        return batch();
      });
    }

    return batch();
  });
}

// Main pipeline
function runPipeline() {
  var startTime = Date.now();
  var fineTunedModel;

  return extractPairsFromLogs(2, 90)
    .then(function(pairs) {
      if (pairs.length < 200) {
        console.log("Warning: only " + pairs.length + " pairs found. Recommend at least 500.");
      }
      return addHardNegatives(pairs);
    })
    .then(function(enrichedPairs) {
      // Hold out 10% for evaluation
      var splitIndex = Math.floor(enrichedPairs.length * 0.9);
      var trainPairs = enrichedPairs.slice(0, splitIndex);
      var testPairs = enrichedPairs.slice(splitIndex);

      // Save test set
      fs.writeFileSync(
        path.join(__dirname, "test_queries.json"),
        JSON.stringify(testPairs.map(function(p) {
          return { query: p.anchor, relevant_text: p.positive };
        }), null, 2)
      );

      return startFineTuning(trainPairs).then(function(modelId) {
        fineTunedModel = modelId;
        return { modelId: modelId, testPairs: testPairs };
      });
    })
    .then(function(ctx) {
      // Evaluate base model on test set for comparison. In a real pipeline,
      // resolve each held-out positive text to its document id so that
      // relevant_ids is populated; it is left empty here for brevity.
      return evaluateModel(BASE_MODEL, ctx.testPairs.map(function(p) {
        return { query: p.anchor, relevant_ids: [] }; // Simplified for example
      })).then(function(baseMetrics) {
        return evaluateModel(ctx.modelId, ctx.testPairs.map(function(p) {
          return { query: p.anchor, relevant_ids: [] };
        })).then(function(ftMetrics) {
          console.log("\n=== Comparison ===");
          console.log("Base Recall@10:       " + baseMetrics.recall_at_10.toFixed(4));
          console.log("Fine-tuned Recall@10: " + ftMetrics.recall_at_10.toFixed(4));
          var improvement = ((ftMetrics.recall_at_10 - baseMetrics.recall_at_10) / baseMetrics.recall_at_10 * 100).toFixed(1);
          console.log("Improvement:          " + improvement + "%");
          return ctx.modelId;
        });
      });
    })
    .then(function(modelId) {
      return reindexCorpus(modelId);
    })
    .then(function(docsReindexed) {
      var elapsed = ((Date.now() - startTime) / 1000 / 60).toFixed(1);
      console.log("\nPipeline complete in " + elapsed + " minutes");
      console.log("Model: " + fineTunedModel);
      console.log("Documents re-indexed: " + docsReindexed);
    })
    .catch(function(err) {
      console.error("Pipeline failed: " + err.message);
      console.error(err.stack);
      process.exit(1);
    });
}

runPipeline();

Run the pipeline:

export OPENAI_API_KEY=sk-...
export POSTGRES_CONNECTION_STRING=postgresql://user:pass@localhost:5432/mydb
node fine-tune-pipeline.js

Expected output:

Step 1: Extracting training pairs from search logs...
  Found 2847 positive pairs
Step 2: Mining hard negatives with base model...
  Processed 50/2847
  Processed 100/2847
  ...
  2614/2847 pairs have hard negatives
Step 3: Starting fine-tune job...
  Wrote 2562 pairs to /app/training_data.jsonl
  Uploaded training file: file-abc123
  Fine-tune job started: ftjob-xyz789
  [1] Status: validating_files
  [2] Status: running
  ...
  [47] Status: succeeded
Step 4: Evaluating fine-tuned model: ft:text-embedding-3-small:org:v2:abc456
  Recall@10: 0.8734
  Evaluated on 285 queries

=== Comparison ===
Base Recall@10:       0.7142
Fine-tuned Recall@10: 0.8734
Improvement:          22.3%

Step 5: Re-indexing corpus with model: ft:text-embedding-3-small:org:v2:abc456
  Progress: 500/12450
  Progress: 1000/12450
  ...
  Re-indexing complete: 12450 documents

Pipeline complete in 68.3 minutes
Model: ft:text-embedding-3-small:org:v2:abc456
Documents re-indexed: 12450

Common Issues and Troubleshooting

1. Training File Validation Errors

Error: Invalid training file: Each example must have an 'anchor' and 'positive' field.
Line 847 contains unexpected field 'query'.

Cause: Your field names do not match the provider's expected schema. OpenAI expects anchor/positive/negative, not query/document/hard_negative.

Fix: Map your internal field names to the provider's schema before writing the JSONL file. Always validate a sample of lines before uploading.

2. Embedding Dimension Mismatch After Fine-Tuning

Error: cannot cast type vector(768) to vector(1536)
  at /app/node_modules/pg/lib/client.js:526:17

Cause: Your fine-tuned model produces vectors with a different dimension than the base model, or you are trying to insert new vectors into a column that was created with a different dimension.

Fix: Check the output dimension of your fine-tuned model before re-indexing. If it differs, you need to recreate the vector column. Note that dropping the column also drops any HNSW or IVFFlat index built on it, so recreate the index afterwards:

ALTER TABLE documents DROP COLUMN embedding;
ALTER TABLE documents ADD COLUMN embedding vector(768);
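Two quick catalog queries reveal the dimensions on the database side (vector_dims and the atttypmod detail are pgvector specifics; verify against your pgvector version):

```sql
-- Dimension of the vectors actually stored in the column
SELECT vector_dims(embedding) FROM documents
WHERE embedding IS NOT NULL LIMIT 1;

-- Declared dimension of the column (pgvector keeps it in atttypmod)
SELECT atttypmod FROM pg_attribute
WHERE attrelid = 'documents'::regclass AND attname = 'embedding';
```

Compare both against the length of a probe embedding from the new model before starting a full re-index.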

3. Rate Limiting During Re-Indexing

Error: Request failed with status 429: Rate limit exceeded.
  You are sending requests too quickly. Please slow down.
  Limit: 3000 requests per minute.

Cause: Re-indexing thousands of documents sends embedding requests faster than the API allows.

Fix: Add exponential backoff and reduce batch concurrency:

function delay(ms) {
  return new Promise(function(resolve) { setTimeout(resolve, ms); });
}

function embedWithRetry(texts, modelId, maxRetries) {
  var attempt = 0;
  function tryEmbed() {
    return axios.post("https://api.openai.com/v1/embeddings", {
      input: texts,
      model: modelId
    }, {
      headers: {
        "Authorization": "Bearer " + OPENAI_API_KEY,
        "Content-Type": "application/json"
      }
    }).catch(function(err) {
      if (err.response && err.response.status === 429 && attempt < maxRetries) {
        attempt++;
        var waitMs = Math.pow(2, attempt) * 1000;
        console.log("Rate limited. Retrying in " + waitMs + "ms...");
        return delay(waitMs).then(tryEmbed);
      }
      throw err;
    });
  }
  return tryEmbed();
}

4. Degraded Quality After Fine-Tuning

Evaluation results:
  Base model Recall@10:       0.74
  Fine-tuned model Recall@10: 0.69
  WARNING: Fine-tuned model performs worse than base!

Cause: Overfitting on small or noisy training data. This happens when you have fewer than 500 training pairs, when your positive pairs contain many false positives (e.g., accidental clicks), or when you trained for too many epochs.

Fix: Reduce epochs to 1-2, increase training data volume, clean your training data by filtering out queries with session duration under 5 seconds (likely misclicks), and add more hard negatives. If the base model was already performing well on your domain, fine-tuning may not be worth it.
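If your click log records dwell time, the misclick filter is a one-line addition to the extraction query (dwell_time_seconds is an assumed column name; adapt it to your schema):

```sql
-- Keep only clicks where the user actually read the document
SELECT sl.query_text, d.title
FROM search_logs sl
JOIN click_events ce ON sl.search_id = ce.search_id
JOIN documents d ON ce.document_id = d.id
WHERE ce.dwell_time_seconds >= 5   -- drop likely misclicks
GROUP BY sl.query_text, d.title
HAVING COUNT(*) >= 2;
```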

5. Mixed Model Vectors Producing Nonsensical Results

Search for "kubernetes pod scheduling" returns:
  1. "Chocolate cake recipe" (similarity: 0.89)
  2. "Cat vaccination schedule" (similarity: 0.87)
  3. "History of Roman architecture" (similarity: 0.85)

Cause: Your documents table contains a mix of vectors from the base model and the fine-tuned model. Comparing vectors from different models produces random similarity scores.

Fix: Add a model_version column to your documents table and filter by model version in your search queries. Alternatively, re-index the entire corpus before switching the search endpoint to the new model.
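With a model_version column in place, the search query can simply refuse to compare across models (a sketch using this article's schema; $1 is the query embedding, $2 the active model id):

```sql
-- Only rank against vectors produced by the currently deployed model
SELECT id, title
FROM documents
WHERE model_version = $2
ORDER BY embedding <=> $1
LIMIT 10;
```

During a rolling re-index, this returns partial results from the new model's subset; switch the filter value only once re-indexing completes.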

Best Practices

  • Start with at least 1,000 training pairs before fine-tuning. Below that threshold, the model does not see enough diversity to generalize. Supplement with synthetic data if you do not have enough real queries.

  • Always hold out 10-20% of your data for evaluation. Never evaluate on the same data you trained on. Split by time if possible (train on older data, evaluate on newer data) to catch temporal drift.

  • Use hard negatives, not random negatives. A model that can distinguish "Kubernetes pod scheduling" from "Kubernetes pod networking" has learned something far more useful than one that can distinguish it from "banana smoothie recipes."

  • Track your embedding model version alongside every vector in your database. Add a model_version column and an embedded_at timestamp. This makes re-indexing idempotent and auditable.

  • Do not fine-tune from a previous fine-tune. Always start from the base model and include all your training data. Chaining fine-tunes leads to catastrophic forgetting and unpredictable behavior.

  • Set a quality gate before deploying. Define a minimum improvement threshold (e.g., 5% recall@10 improvement) and only deploy if the fine-tuned model exceeds it. Otherwise, you are adding complexity without benefit.

  • Budget for re-indexing time and cost. At text-embedding-3-small's pricing of roughly $0.02 per million tokens, re-embedding 100K documents averaging 500 tokens each costs about $1, but the wall-clock time of the batched API calls is the real constraint. Plan for maintenance windows.

  • Log everything during the pipeline. Training pair counts, hard negative hit rates, fine-tuning duration, evaluation metrics per query, re-indexing throughput. You will need this data when debugging the next iteration.

  • Consider the latency impact. Self-hosted models add a network hop if served separately. Fine-tuned API models have the same latency as the base model. Factor this into your architecture decisions.
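The quality-gate bullet above can be made concrete with a small helper wired between evaluation and re-indexing (a sketch; passesQualityGate and the 5% figure are illustrative, not part of the pipeline above):

```javascript
// Deploy gate: only re-index and switch models if the fine-tuned model
// beats the base model by a minimum relative margin on held-out queries.
function passesQualityGate(baseRecall, fineTunedRecall, minRelativeGain) {
  if (baseRecall <= 0) {
    // Base scored zero; any positive fine-tuned score counts as a win
    return fineTunedRecall > 0;
  }
  var relativeGain = (fineTunedRecall - baseRecall) / baseRecall;
  return relativeGain >= minRelativeGain;
}
```

For the run shown earlier, passesQualityGate(0.7142, 0.8734, 0.05) passes, while a regression like 0.74 → 0.69 fails the gate and should skip re-indexing entirely.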
