Embedding Model Comparison: OpenAI vs Cohere vs Open Source

Compare embedding models from OpenAI, Cohere, and open-source options with benchmarks, cost analysis, and a Node.js testing tool.

Choosing an embedding model is one of the most consequential decisions you will make when building a retrieval-augmented generation (RAG) system, semantic search engine, or recommendation pipeline. The model you pick determines retrieval quality, latency budgets, infrastructure costs, and how tightly you are coupled to a vendor. This article breaks down the major options — OpenAI, Cohere, and open-source alternatives — with real benchmarks, cost math, and a complete Node.js tool you can run against your own data.

Prerequisites

  • Node.js 18 or later installed
  • An OpenAI API key (for OpenAI embedding tests)
  • A Cohere API key (for Cohere embedding tests)
  • Ollama installed locally (for open-source model tests)
  • Basic familiarity with vector embeddings and cosine similarity
  • A PostgreSQL or vector database instance (optional, for retrieval tests)

What Makes a Good Embedding Model

Before comparing specific providers, you need a framework for evaluation. There are five dimensions that matter in production.

Retrieval quality. The whole point of embeddings is to map semantically similar text close together in vector space. A model that cannot distinguish "how to reset a password" from "password security best practices" is useless for search. Quality is measured by benchmarks like MTEB (Massive Text Embedding Benchmark) and by your own domain-specific retrieval tests.

Dimensionality. Embedding vectors range from 384 dimensions to 3072 and beyond. Higher dimensions capture more semantic nuance but cost more to store, index, and compare. A 3072-dimension vector takes 8x the storage of a 384-dimension vector and makes cosine similarity calculations proportionally slower.

Latency. API round-trip time matters when you are embedding user queries at request time. A model that takes 500ms per embedding call adds half a second to every search. Local models eliminate network latency but introduce compute requirements.

Cost. At scale, embedding costs compound. If you are processing 10 million documents at roughly 1,000 tokens each (10 billion tokens), the difference between $0.02 and $0.10 per million tokens works out to $800. Monthly re-embedding or incremental updates multiply that figure.

Vendor lock-in. Switching embedding providers means re-embedding your entire corpus. If you have 50 million vectors in a database, that is not a trivial migration. The abstraction layer you build around your embedding calls directly affects how painful a future switch will be.

OpenAI Embeddings

OpenAI offers two embedding models in the text-embedding-3 family.

text-embedding-3-small

This is the workhorse model for most production use cases. It produces 1536-dimensional vectors by default, scores competitively on MTEB benchmarks, and costs $0.02 per million tokens. For context, a million tokens is roughly 700,000 words — enough to embed a substantial document corpus.

The killer feature of the text-embedding-3 family is native dimension reduction. You can request vectors in any dimension from 256 to 1536, and the model returns truncated vectors that still perform well. A 512-dimension vector from text-embedding-3-small retains most of its retrieval quality while cutting storage and compute costs by 66%.

text-embedding-3-large

The large model produces 3072-dimensional vectors by default, scores higher on MTEB, and costs $0.13 per million tokens — roughly 6.5x more than the small model. It supports dimension reduction down to 256 as well.

In practice, I have found that text-embedding-3-large at 1024 dimensions often outperforms text-embedding-3-small at 1536 dimensions while using 33% less storage. This is the sweet spot for applications where retrieval quality justifies the higher per-token cost.

var OpenAI = require("openai");

var client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

function getOpenAIEmbedding(text, model, dimensions) {
  var params = {
    model: model || "text-embedding-3-small",
    input: text
  };

  if (dimensions) {
    params.dimensions = dimensions;
  }

  return client.embeddings.create(params).then(function (response) {
    return {
      embedding: response.data[0].embedding,
      model: response.model,
      dimensions: response.data[0].embedding.length,
      usage: response.usage
    };
  });
}

// Default 1536 dimensions
getOpenAIEmbedding("How do I reset my password?").then(function (result) {
  console.log("Dimensions:", result.dimensions);
  console.log("Tokens used:", result.usage.total_tokens);
});

// Reduced to 512 dimensions for cheaper storage
getOpenAIEmbedding("How do I reset my password?", "text-embedding-3-small", 512)
  .then(function (result) {
    console.log("Reduced dimensions:", result.dimensions);
  });

OpenAI Batch Embedding

OpenAI accepts up to 2048 inputs in a single API call. Batching is critical for throughput — sending one text at a time is the number one performance mistake I see in production embedding pipelines.

function batchEmbedOpenAI(texts, model, dimensions) {
  var batchSize = 2048;
  var batches = [];

  for (var i = 0; i < texts.length; i += batchSize) {
    batches.push(texts.slice(i, i + batchSize));
  }

  var results = [];

  return batches.reduce(function (chain, batch, index) {
    return chain.then(function () {
      var params = {
        model: model || "text-embedding-3-small",
        input: batch
      };
      if (dimensions) {
        params.dimensions = dimensions;
      }

      return client.embeddings.create(params).then(function (response) {
        response.data.forEach(function (item) {
          results.push(item.embedding);
        });
        console.log("Batch", index + 1, "of", batches.length, "complete");
      });
    });
  }, Promise.resolve()).then(function () {
    return results;
  });
}

Cohere Embeddings

Cohere's embed-v3 model brings two unique features to the table: input type classification and native multilingual support.

Input Types

Cohere requires you to specify an input_type parameter. This is not just metadata — the model actually produces different embeddings depending on whether you are embedding a search_document, a search_query, a classification input, or a clustering input. The asymmetric embedding approach means that query embeddings and document embeddings are optimized to find each other, which can improve retrieval quality by 5-10% over symmetric models.

Multilingual Support

Cohere embed-v3 natively supports 100+ languages in a single model. If your application handles multilingual content — international knowledge bases, global customer support, or cross-language search — Cohere is the strongest option out of the box. OpenAI's models handle multiple languages but were not explicitly designed for cross-lingual retrieval.

Pricing and Dimensions

Cohere embed-v3 produces 1024-dimensional vectors and costs approximately $0.10 per million tokens. It sits between OpenAI's small and large models on price.

var CohereClient = require("cohere-ai").CohereClient;

var cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

function getCohereEmbedding(texts, inputType) {
  var textsArray = Array.isArray(texts) ? texts : [texts];

  return cohere.embed({
    texts: textsArray,
    model: "embed-english-v3.0",
    inputType: inputType || "search_document",
    embeddingTypes: ["float"]
  }).then(function (response) {
    return {
      embeddings: response.embeddings.float,
      dimensions: response.embeddings.float[0].length,
      model: "embed-english-v3.0"
    };
  });
}

// Embedding documents for storage
getCohereEmbedding(
  ["How to reset your password", "Password security best practices"],
  "search_document"
).then(function (result) {
  console.log("Document embeddings:", result.embeddings.length);
  console.log("Dimensions:", result.dimensions);
});

// Embedding a query for retrieval — note the different input_type
getCohereEmbedding("forgot my password", "search_query")
  .then(function (result) {
    console.log("Query embedding dimensions:", result.dimensions);
  });

For multilingual use cases, swap the model to embed-multilingual-v3.0:

function getCohereMultilingualEmbedding(texts, inputType) {
  var textsArray = Array.isArray(texts) ? texts : [texts];

  return cohere.embed({
    texts: textsArray,
    model: "embed-multilingual-v3.0",
    inputType: inputType || "search_document",
    embeddingTypes: ["float"]
  }).then(function (response) {
    return response.embeddings.float;
  });
}

// Cross-lingual search: embed docs in multiple languages
getCohereMultilingualEmbedding([
  "How to reset your password",
  "Comment reinitialiser votre mot de passe",
  "Passwort zuruecksetzen"
], "search_document").then(function (embeddings) {
  console.log("Embedded", embeddings.length, "multilingual documents");
});

Open-Source Options

Running your own embedding model eliminates API costs, removes vendor dependency, and keeps your data on your own infrastructure. The tradeoff is compute resources, maintenance burden, and typically lower benchmark scores than the top commercial models.

Ollama for Local Embeddings

Ollama is the simplest way to run embedding models locally. It wraps open-source models behind a simple local HTTP API (and also exposes an OpenAI-compatible endpoint), making integration trivial.

The best open-source embedding models available through Ollama include nomic-embed-text (768 dimensions, strong MTEB scores), mxbai-embed-large (1024 dimensions), and all-minilm (384 dimensions, extremely fast).

var http = require("http");

function getOllamaEmbedding(text, model) {
  return new Promise(function (resolve, reject) {
    var data = JSON.stringify({
      model: model || "nomic-embed-text",
      prompt: text
    });

    var options = {
      hostname: "localhost",
      port: 11434,
      path: "/api/embeddings",
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(data)
      }
    };

    var req = http.request(options, function (res) {
      var body = "";
      res.on("data", function (chunk) { body += chunk; });
      res.on("end", function () {
        try {
          var parsed = JSON.parse(body);
          resolve({
            embedding: parsed.embedding,
            dimensions: parsed.embedding.length,
            model: model || "nomic-embed-text"
          });
        } catch (e) {
          reject(new Error("Failed to parse Ollama response: " + e.message));
        }
      });
    });

    req.on("error", reject);
    req.write(data);
    req.end();
  });
}

// Pull the model first: ollama pull nomic-embed-text
getOllamaEmbedding("How to reset your password").then(function (result) {
  console.log("Local embedding dimensions:", result.dimensions);
});

Sentence-Transformers via Custom API

For more control, you can run sentence-transformers models behind a simple Python API server and call them from Node.js. This gives you access to the full Hugging Face model zoo. Models like BAAI/bge-large-en-v1.5 (1024 dimensions) and intfloat/e5-large-v2 (1024 dimensions) are competitive with commercial offerings on MTEB benchmarks.

I will not cover the Python server setup here, but the Node.js client is the same pattern as Ollama — an HTTP POST to a local endpoint.
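
As a sketch of that pattern: the client below assumes a hypothetical local server on port 8000 that accepts POST /embed with a JSON body of {"texts": [...]} and responds with {"embeddings": [[...]]}. Adjust the route and payload to match however you actually expose your sentence-transformers model.

var http = require("http");

// Minimal client for a self-hosted sentence-transformers server.
// The port, path, and JSON shape below are assumptions, not a standard API.
function getLocalEmbeddings(texts) {
  return new Promise(function (resolve, reject) {
    var data = JSON.stringify({ texts: texts });

    var options = {
      hostname: "localhost",
      port: 8000,
      path: "/embed",
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(data)
      }
    };

    var req = http.request(options, function (res) {
      var body = "";
      res.on("data", function (chunk) { body += chunk; });
      res.on("end", function () {
        try {
          resolve(JSON.parse(body).embeddings);
        } catch (e) {
          reject(new Error("Failed to parse embedding server response: " + e.message));
        }
      });
    });

    req.on("error", reject);
    req.write(data);
    req.end();
  });
}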

Benchmarking Methodology

MTEB Scores

The Massive Text Embedding Benchmark (MTEB) evaluates models across seven task categories: classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), and summarization. The overall score is an average across these categories.

As of early 2026, the MTEB leaderboard looks roughly like this for the models we are comparing:

Model | MTEB Average | Retrieval | Dimensions | Type
--- | --- | --- | --- | ---
text-embedding-3-large | 64.6 | 55.4 | 3072 | API
Cohere embed-v3 | 64.5 | 55.0 | 1024 | API
text-embedding-3-small | 62.3 | 51.7 | 1536 | API
nomic-embed-text | 62.4 | 52.8 | 768 | Local
bge-large-en-v1.5 | 63.6 | 54.3 | 1024 | Local
all-MiniLM-L6-v2 | 56.3 | 41.9 | 384 | Local

Important caveat: MTEB scores measure general-purpose performance. Your domain-specific data may produce very different rankings. I have seen nomic-embed-text outperform text-embedding-3-large on specialized technical corpora where the open-source model happened to have better training data coverage.

Running Your Own Benchmarks

Generic benchmarks are useful for shortlisting, but you must test on your own data. The methodology is straightforward:

  1. Assemble a test set of 50-200 query-document pairs where you know the correct answer
  2. Embed all documents with each model
  3. For each query, embed it and find the top-k nearest documents by cosine similarity
  4. Measure recall@k (what percentage of known-relevant documents appear in the top k results)
  5. Compare across models

This is exactly what the complete working example at the end of this article does.

Cost Analysis

Here is the real math per million tokens across providers, as of early 2026:

Provider | Model | Cost / 1M Tokens | Dimensions | Cost for 50M Tokens
--- | --- | --- | --- | ---
OpenAI | text-embedding-3-small | $0.02 | 1536 | $1.00
OpenAI | text-embedding-3-large | $0.13 | 3072 | $6.50
Cohere | embed-english-v3.0 | $0.10 | 1024 | $5.00
Cohere | embed-multilingual-v3.0 | $0.10 | 1024 | $5.00
Local | nomic-embed-text | $0.00* | 768 | $0.00*
Local | all-MiniLM-L6-v2 | $0.00* | 384 | $0.00*

*Local models have zero per-token cost but require compute infrastructure. A decent GPU (RTX 3060 or better) can process ~1000 embeddings per second with nomic-embed-text. On CPU only, expect 50-100 embeddings per second with Ollama.

The storage cost difference is also significant. Storing 10 million vectors:

  • 3072 dimensions (float32): ~114 GB
  • 1536 dimensions (float32): ~57 GB
  • 1024 dimensions (float32): ~38 GB
  • 768 dimensions (float32): ~29 GB
  • 384 dimensions (float32): ~14 GB

If you are using a managed vector database like Pinecone or Weaviate, those storage numbers translate directly to monthly hosting costs.
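
Those figures are easy to reproduce with a quick back-of-envelope helper (raw float32 storage only; index structures like HNSW add overhead on top):

// Raw vector storage for N float32 vectors of D dimensions, in GB (GiB).
// Ignores index structures, metadata, and replication overhead.
function vectorStorageGB(numVectors, dimensions) {
  var bytes = numVectors * dimensions * 4; // 4 bytes per float32
  return bytes / Math.pow(1024, 3);
}

console.log(vectorStorageGB(10000000, 3072).toFixed(0) + " GB"); // ~114 GB
console.log(vectorStorageGB(10000000, 768).toFixed(0) + " GB");  // ~29 GB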

Latency Comparison

API round-trip times vary by region, load, and batch size. Here are typical p50 latencies I have measured from US East:

Provider | Model | Single Text (p50) | Batch of 100 (p50) | Per-Text in Batch
--- | --- | --- | --- | ---
OpenAI | text-embedding-3-small | 120ms | 350ms | 3.5ms
OpenAI | text-embedding-3-large | 180ms | 500ms | 5.0ms
Cohere | embed-english-v3.0 | 150ms | 400ms | 4.0ms
Local (GPU) | nomic-embed-text | 15ms | 80ms | 0.8ms
Local (CPU) | nomic-embed-text | 45ms | 2500ms | 25ms

The takeaway: local models on GPU are 10x faster than API calls. On CPU, they are competitive for single texts but fall behind on large batches because they cannot parallelize as effectively.

For query-time embedding where latency matters most, local models are hard to beat if you have the infrastructure. For batch ingestion, API models with their massive parallelism often win on wall-clock time even if individual calls are slower.

Dimension Tradeoffs

Dimension count is not just a storage decision — it affects retrieval quality, indexing speed, and memory usage.

384 dimensions (all-MiniLM-L6-v2): Good enough for simple semantic search, FAQ matching, and deduplication. Not recommended for nuanced retrieval where subtle semantic differences matter. Storage-efficient for very large corpora (100M+ documents).

768 dimensions (nomic-embed-text): The sweet spot for most applications. Captures enough semantic nuance for production search and RAG without excessive resource requirements.

1024 dimensions (Cohere, bge-large): Strong performance across all benchmarks. A good default if you do not have a reason to go smaller or larger.

1536 dimensions (OpenAI small default): Slightly better than 1024 on most benchmarks, but the marginal improvement over 1024 is often not worth the 50% storage increase.

3072 dimensions (OpenAI large default): Only justified for applications where retrieval quality is the absolute top priority and cost is secondary. Research applications, legal discovery, medical literature search.

My rule of thumb: start with 768 or 1024 dimensions. Only increase if you can measure a meaningful improvement on your specific retrieval tasks.

Multilingual Support Comparison

Feature | OpenAI | Cohere | Open Source
--- | --- | --- | ---
Languages supported | 50+ | 100+ | Varies by model
Cross-lingual retrieval | Good | Excellent | Model-dependent
Dedicated multilingual model | No (single model) | Yes (embed-multilingual-v3.0) | Yes (multilingual-e5-large)
Non-Latin script quality | Good | Excellent | Variable
CJK performance | Good | Very good | Requires specific model

If multilingual is a primary requirement, Cohere is the clear winner. Their multilingual model was specifically trained for cross-lingual retrieval — you can embed a query in English and retrieve documents in Japanese, and it works remarkably well.

For open-source multilingual embeddings, intfloat/multilingual-e5-large is the strongest option, but it requires significantly more compute than English-only models.

Unified Embedding Interface

The most important architectural decision is abstracting your embedding provider behind a unified interface. This lets you switch providers without touching application code.

var OpenAI = require("openai");
var CohereClient = require("cohere-ai").CohereClient;
var http = require("http");

function EmbeddingService(config) {
  this.provider = config.provider || "openai";
  this.model = config.model;
  this.dimensions = config.dimensions;

  if (this.provider === "openai") {
    this.client = new OpenAI({ apiKey: config.apiKey });
    this.model = this.model || "text-embedding-3-small";
  } else if (this.provider === "cohere") {
    this.client = new CohereClient({ token: config.apiKey });
    this.model = this.model || "embed-english-v3.0";
  } else if (this.provider === "ollama") {
    this.ollamaHost = config.ollamaHost || "localhost";
    this.ollamaPort = config.ollamaPort || 11434;
    this.model = this.model || "nomic-embed-text";
  }
}

EmbeddingService.prototype.embed = function (texts, options) {
  var textsArray = Array.isArray(texts) ? texts : [texts];
  options = options || {};

  if (this.provider === "openai") {
    return this._embedOpenAI(textsArray, options);
  } else if (this.provider === "cohere") {
    return this._embedCohere(textsArray, options);
  } else if (this.provider === "ollama") {
    return this._embedOllama(textsArray, options);
  }

  return Promise.reject(new Error("Unknown provider: " + this.provider));
};

EmbeddingService.prototype._embedOpenAI = function (texts, options) {
  var params = {
    model: this.model,
    input: texts
  };
  if (this.dimensions) {
    params.dimensions = this.dimensions;
  }

  return this.client.embeddings.create(params).then(function (response) {
    return {
      embeddings: response.data.map(function (d) { return d.embedding; }),
      usage: response.usage,
      provider: "openai"
    };
  });
};

EmbeddingService.prototype._embedCohere = function (texts, options) {
  var inputType = options.inputType || "search_document";

  return this.client.embed({
    texts: texts,
    model: this.model,
    inputType: inputType,
    embeddingTypes: ["float"]
  }).then(function (response) {
    return {
      embeddings: response.embeddings.float,
      usage: null,
      provider: "cohere"
    };
  });
};

EmbeddingService.prototype._embedOllama = function (texts) {
  var self = this;

  return texts.reduce(function (chain, text) {
    return chain.then(function (results) {
      return self._ollamaRequest(text).then(function (embedding) {
        results.push(embedding);
        return results;
      });
    });
  }, Promise.resolve([])).then(function (embeddings) {
    return {
      embeddings: embeddings,
      usage: null,
      provider: "ollama"
    };
  });
};

EmbeddingService.prototype._ollamaRequest = function (text) {
  var self = this;

  return new Promise(function (resolve, reject) {
    var data = JSON.stringify({ model: self.model, prompt: text });
    var options = {
      hostname: self.ollamaHost,
      port: self.ollamaPort,
      path: "/api/embeddings",
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Content-Length": Buffer.byteLength(data)
      }
    };

    var req = http.request(options, function (res) {
      var body = "";
      res.on("data", function (chunk) { body += chunk; });
      res.on("end", function () {
        try {
          var parsed = JSON.parse(body);
          resolve(parsed.embedding);
        } catch (e) {
          reject(new Error("Ollama parse error: " + e.message));
        }
      });
    });

    req.on("error", reject);
    req.write(data);
    req.end();
  });
};

// Usage — switch providers by changing config only
var service = new EmbeddingService({
  provider: "openai",
  apiKey: process.env.OPENAI_API_KEY,
  dimensions: 768
});

service.embed(["How to reset a password", "Password security guide"])
  .then(function (result) {
    console.log("Provider:", result.provider);
    console.log("Vectors:", result.embeddings.length);
    console.log("Dimensions:", result.embeddings[0].length);
  });

When to Fine-Tune vs Use Off-the-Shelf

Fine-tuning an embedding model means training it on your domain-specific data to improve retrieval quality for your particular use case. It is powerful but comes with significant costs.

Use off-the-shelf when:

  • Your domain vocabulary is well-represented in general training data
  • You have fewer than 10,000 documents
  • Your retrieval accuracy with a general model is already above 85%
  • You need to ship quickly and iterate later

Consider fine-tuning when:

  • Your domain has specialized terminology (medical, legal, scientific)
  • General models consistently fail on your retrieval test set
  • You have at least 1,000 labeled query-document pairs for training
  • You have committed to a specific model family long-term

Currently, OpenAI does not offer fine-tuning for embedding models. Cohere offers custom model training for enterprise customers. For open-source models, fine-tuning with sentence-transformers is well-documented and practical with as little as a single GPU.

In my experience, prompt engineering your chunks (adding descriptive titles, metadata prefixes, and structured formatting to your documents before embedding) often gets you 80% of the benefit of fine-tuning with zero training cost.
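
As a rough sketch of that chunk-enrichment idea, here is what prepending a title and a couple of metadata fields might look like. The field names are illustrative, not a prescribed schema:

// Prepend descriptive context to a chunk before embedding it.
// The doc shape ({ title, product, section, text }) is hypothetical;
// use whatever fields your documents actually carry.
function buildEmbeddingText(doc) {
  var parts = [];
  if (doc.title) parts.push("Title: " + doc.title);
  if (doc.product) parts.push("Product: " + doc.product);
  if (doc.section) parts.push("Section: " + doc.section);
  parts.push(doc.text);
  return parts.join("\n");
}

var enriched = buildEmbeddingText({
  title: "Account Security",
  product: "Dashboard",
  section: "Passwords",
  text: "To reset your password, go to Settings > Security > Change Password."
});
// Pass `enriched` (not the raw chunk) to whichever embedding function you use.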

Practical Recommendation Matrix

Use Case | Recommended Model | Why
--- | --- | ---
Startup MVP / prototyping | OpenAI text-embedding-3-small | Cheapest API, good quality, fast integration
Production semantic search | OpenAI text-embedding-3-large (1024d) | Best quality-to-cost at reduced dimensions
Multilingual application | Cohere embed-multilingual-v3.0 | Purpose-built for cross-lingual retrieval
Privacy-sensitive data | nomic-embed-text via Ollama | Data never leaves your infrastructure
High-throughput pipeline | Local GPU + bge-large-en-v1.5 | Zero per-token cost, sub-millisecond latency
Budget-constrained at scale | OpenAI text-embedding-3-small (512d) | $0.02/M tokens with reduced dimensions
Maximum retrieval quality | OpenAI text-embedding-3-large (3072d) | Highest MTEB scores, full dimensionality
Classification / clustering | Cohere embed-v3 | Dedicated input types optimize for these tasks

Complete Working Example: Embedding Benchmark Tool

This Node.js tool tests multiple embedding providers on the same dataset, measuring retrieval accuracy, latency, and estimated cost.

var OpenAI = require("openai");
var CohereClient = require("cohere-ai").CohereClient;
var http = require("http");

// ----- Configuration -----

var OPENAI_API_KEY = process.env.OPENAI_API_KEY;
var COHERE_API_KEY = process.env.COHERE_API_KEY;
var OLLAMA_MODEL = "nomic-embed-text";

// ----- Test Dataset -----
// Each entry has a query and a list of relevant document indices

var documents = [
  "To reset your password, go to Settings > Security > Change Password.",
  "Our password policy requires at least 12 characters with mixed case.",
  "Two-factor authentication adds an extra layer of security to your account.",
  "To delete your account, contact support with your account ID.",
  "Billing invoices are generated on the first of each month.",
  "You can upgrade your plan from the billing settings page.",
  "API rate limits are set to 1000 requests per minute per key.",
  "Use OAuth 2.0 for third-party application authentication.",
  "Database backups run automatically every 6 hours.",
  "To export your data, use the bulk export endpoint in the API."
];

var testQueries = [
  { query: "how do I change my password", relevant: [0, 1] },
  { query: "set up 2FA on my account", relevant: [2] },
  { query: "how much does it cost", relevant: [4, 5] },
  { query: "how to connect external apps", relevant: [7] },
  { query: "download all my information", relevant: [9] }
];

// ----- Utility Functions -----

function cosineSimilarity(a, b) {
  var dotProduct = 0;
  var normA = 0;
  var normB = 0;

  for (var i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }

  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

function measureRecall(queryEmbedding, docEmbeddings, relevantIndices, k) {
  var similarities = docEmbeddings.map(function (docEmb, index) {
    return { index: index, score: cosineSimilarity(queryEmbedding, docEmb) };
  });

  similarities.sort(function (a, b) { return b.score - a.score; });

  var topK = similarities.slice(0, k).map(function (s) { return s.index; });
  var hits = relevantIndices.filter(function (idx) {
    return topK.indexOf(idx) !== -1;
  });

  return hits.length / relevantIndices.length;
}

// ----- Provider Embedding Functions -----

function embedWithOpenAI(texts, model, dimensions) {
  var client = new OpenAI({ apiKey: OPENAI_API_KEY });
  var params = { model: model, input: texts };
  if (dimensions) params.dimensions = dimensions;

  return client.embeddings.create(params).then(function (response) {
    return response.data.map(function (d) { return d.embedding; });
  });
}

function embedWithCohere(texts, inputType) {
  var client = new CohereClient({ token: COHERE_API_KEY });

  return client.embed({
    texts: texts,
    model: "embed-english-v3.0",
    inputType: inputType,
    embeddingTypes: ["float"]
  }).then(function (response) {
    return response.embeddings.float;
  });
}

function embedWithOllama(texts) {
  return texts.reduce(function (chain, text) {
    return chain.then(function (results) {
      return new Promise(function (resolve, reject) {
        var data = JSON.stringify({ model: OLLAMA_MODEL, prompt: text });
        var options = {
          hostname: "localhost",
          port: 11434,
          path: "/api/embeddings",
          method: "POST",
          headers: {
            "Content-Type": "application/json",
            "Content-Length": Buffer.byteLength(data)
          }
        };

        var req = http.request(options, function (res) {
          var body = "";
          res.on("data", function (chunk) { body += chunk; });
          res.on("end", function () {
            try {
              var parsed = JSON.parse(body);
              results.push(parsed.embedding);
              resolve(results);
            } catch (e) {
              reject(new Error("Ollama error: " + e.message));
            }
          });
        });

        req.on("error", reject);
        req.write(data);
        req.end();
      });
    });
  }, Promise.resolve([]));
}

// ----- Benchmark Runner -----

function runBenchmark(providerName, embedDocsFn, embedQueryFn) {
  var startTime = Date.now();
  var docEmbeddings;

  console.log("\n===== Benchmarking:", providerName, "=====");

  return embedDocsFn(documents).then(function (docEmbs) {
    docEmbeddings = docEmbs;
    var docTime = Date.now() - startTime;
    console.log("Document embedding time:", docTime, "ms");
    console.log("Dimensions:", docEmbs[0].length);

    var queryStart = Date.now();
    var queryPromises = testQueries.map(function (tq) {
      return embedQueryFn([tq.query]).then(function (qEmbs) {
        return qEmbs[0];
      });
    });

    return Promise.all(queryPromises).then(function (queryEmbeddings) {
      var queryTime = Date.now() - queryStart;
      var totalTime = Date.now() - startTime;

      // Calculate recall@3 for each query
      var recalls = testQueries.map(function (tq, i) {
        return measureRecall(queryEmbeddings[i], docEmbeddings, tq.relevant, 3);
      });

      var avgRecall = recalls.reduce(function (sum, r) { return sum + r; }, 0) / recalls.length;

      console.log("Query embedding time:", queryTime, "ms");
      console.log("Total time:", totalTime, "ms");
      console.log("Average Recall@3:", (avgRecall * 100).toFixed(1) + "%");
      console.log("Individual recalls:", recalls.map(function (r) {
        return (r * 100).toFixed(0) + "%";
      }).join(", "));

      return {
        provider: providerName,
        dimensions: docEmbs[0].length,
        totalTimeMs: totalTime,
        avgRecall: avgRecall,
        recalls: recalls
      };
    });
  });
}

// ----- Main -----

function main() {
  var results = [];

  // Test OpenAI text-embedding-3-small
  return runBenchmark(
    "OpenAI text-embedding-3-small",
    function (texts) { return embedWithOpenAI(texts, "text-embedding-3-small"); },
    function (texts) { return embedWithOpenAI(texts, "text-embedding-3-small"); }
  ).then(function (result) {
    results.push(result);

    // Test OpenAI text-embedding-3-small at 512 dimensions
    return runBenchmark(
      "OpenAI text-embedding-3-small (512d)",
      function (texts) { return embedWithOpenAI(texts, "text-embedding-3-small", 512); },
      function (texts) { return embedWithOpenAI(texts, "text-embedding-3-small", 512); }
    );
  }).then(function (result) {
    results.push(result);

    // Test Cohere if API key is available
    if (COHERE_API_KEY) {
      return runBenchmark(
        "Cohere embed-english-v3.0",
        function (texts) { return embedWithCohere(texts, "search_document"); },
        function (texts) { return embedWithCohere(texts, "search_query"); }
      ).then(function (result) {
        results.push(result);
      });
    } else {
      console.log("\nSkipping Cohere (no COHERE_API_KEY set)");
      return Promise.resolve();
    }
  }).then(function () {
    // Test Ollama if available
    return runBenchmark(
      "Ollama " + OLLAMA_MODEL,
      function (texts) { return embedWithOllama(texts); },
      function (texts) { return embedWithOllama(texts); }
    ).catch(function (err) {
      console.log("\nSkipping Ollama (not running):", err.message);
    });
  }).then(function (result) {
    if (result) results.push(result);

    // Print summary
    console.log("\n\n===== SUMMARY =====");
    console.log("Provider".padEnd(40), "Dims".padEnd(8), "Time(ms)".padEnd(10), "Recall@3");
    console.log("-".repeat(75));
    results.forEach(function (r) {
      console.log(
        r.provider.padEnd(40),
        String(r.dimensions).padEnd(8),
        String(r.totalTimeMs).padEnd(10),
        (r.avgRecall * 100).toFixed(1) + "%"
      );
    });
  }).catch(function (err) {
    console.error("Benchmark failed:", err);
    process.exit(1);
  });
}

main();

Save this as embedding-benchmark.js and run it:

npm install openai cohere-ai
OPENAI_API_KEY=sk-... COHERE_API_KEY=... node embedding-benchmark.js

The tool will output a comparison table showing dimensions, total execution time, and recall@3 for each provider.

Common Issues and Troubleshooting

1. Rate limit errors from OpenAI

Error: 429 Rate limit reached for text-embedding-3-small

OpenAI enforces per-minute token limits on embedding endpoints. The default tier allows 1,000,000 tokens per minute. If you are batch-embedding a large corpus, add a delay between batches or use the Batch API for non-urgent workloads. Implement exponential backoff:

function embedWithRetry(texts, model, retries) {
  retries = retries || 3;

  return embedWithOpenAI(texts, model).catch(function (err) {
    if (err.status === 429 && retries > 0) {
      // Exponential backoff: 1s, 2s, then 4s as retries are consumed
      var delay = 1000 * Math.pow(2, 3 - retries);
      console.log("Rate limited, retrying in", delay, "ms...");
      return new Promise(function (resolve) {
        setTimeout(function () {
          resolve(embedWithRetry(texts, model, retries - 1));
        }, delay);
      });
    }
    throw err;
  });
}

2. Dimension mismatch when querying vectors

Error: Vector dimension mismatch: expected 1536, got 768

This happens when you embed queries with a different model or dimension setting than your stored documents. Every vector in your database must come from the same model with the same dimension configuration. If you switch models, you must re-embed your entire corpus. Store the model name and dimension count as metadata alongside your vectors so you can detect mismatches before they corrupt search results.
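
A small guard that checks incoming vectors against the metadata stored with your collection catches this before anything is written or queried. The metadata shape here is just an example:

// Example collection metadata stored alongside your vectors
var collectionMeta = { model: "text-embedding-3-small", dimensions: 1536 };

// Throw early if a vector was produced by the wrong model or dimension setting
function assertVectorCompatible(vector, meta) {
  if (vector.length !== meta.dimensions) {
    throw new Error(
      "Vector dimension mismatch: expected " + meta.dimensions +
      " (" + meta.model + "), got " + vector.length
    );
  }
}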

3. Ollama connection refused

Error: connect ECONNREFUSED 127.0.0.1:11434

Ollama is not running, or it is running on a different port. Start it with ollama serve, and ensure no firewall rules are blocking port 11434. On Windows, check that the Ollama service is running in the system tray. If you have changed the default port, set the OLLAMA_HOST environment variable accordingly.

4. Cohere invalid input_type error

Error: invalid request: input_type must be one of search_document, search_query, classification, clustering

Cohere requires the inputType parameter on every embed call, and it must be one of the four valid values. A common mistake is passing "document" instead of "search_document", or omitting the parameter entirely. Make sure your unified interface maps the correct input type for document embedding versus query embedding.

5. Out of memory with large batch embeddings

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

Embedding millions of documents generates massive arrays. If you are embedding 1 million documents with 1536 dimensions, that is 1M x 1536 x 4 bytes = ~5.7 GB of float32 data. Stream embeddings to disk or database instead of accumulating them in memory. Process in chunks of 10,000-50,000 documents, writing results before loading the next chunk.
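
Here is a sketch of that chunked approach: embed a slice, append each vector to a JSON Lines file, and let the slice go out of scope before loading the next. It assumes an embedBatch(texts) helper that returns a promise of an array of embeddings, like the batch functions shown earlier:

var fs = require("fs");

// Embed a large corpus in fixed-size chunks, streaming results to disk
// instead of accumulating every vector in memory.
function embedCorpusToFile(documents, outputPath, chunkSize) {
  chunkSize = chunkSize || 10000;
  var out = fs.createWriteStream(outputPath, { flags: "a" });

  function processChunk(start) {
    if (start >= documents.length) {
      out.end();
      return Promise.resolve();
    }

    var chunk = documents.slice(start, start + chunkSize);
    return embedBatch(chunk).then(function (embeddings) {
      embeddings.forEach(function (embedding, i) {
        out.write(JSON.stringify({ index: start + i, embedding: embedding }) + "\n");
      });
      console.log("Wrote", start + chunk.length, "of", documents.length);
      return processChunk(start + chunkSize);
    });
  }

  return processChunk(0);
}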

6. Inconsistent similarity scores across providers

Different models produce embeddings in different value ranges. While all normalized embeddings should produce cosine similarity scores between -1 and 1, the distribution of scores varies. An OpenAI cosine similarity of 0.85 does not mean the same thing as a Cohere similarity of 0.85. Never use hardcoded similarity thresholds — always calibrate thresholds per model using your validation set.
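
One practical way to calibrate: score a set of known-relevant and known-irrelevant pairs with each model and pick the cutoff that separates them best. A minimal sketch:

// Given cosine similarity scores for known-relevant and known-irrelevant
// pairs, find the threshold that maximizes classification accuracy.
function calibrateThreshold(relevantScores, irrelevantScores) {
  var candidates = relevantScores.concat(irrelevantScores);
  var best = { threshold: 0, accuracy: 0 };

  candidates.forEach(function (t) {
    var correct =
      relevantScores.filter(function (s) { return s >= t; }).length +
      irrelevantScores.filter(function (s) { return s < t; }).length;
    var accuracy = correct / (relevantScores.length + irrelevantScores.length);
    if (accuracy > best.accuracy) {
      best = { threshold: t, accuracy: accuracy };
    }
  });

  return best;
}

// Calibrate separately for every model: the same raw score means
// different things across providers.
console.log(calibrateThreshold([0.82, 0.79, 0.91], [0.55, 0.62, 0.70]));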

Best Practices

  • Always batch your embedding calls. Sending texts one at a time wastes round trips and can be 50-100x slower than batching. OpenAI supports up to 2048 inputs per call, Cohere up to 96.

  • Store your model identifier alongside vectors. When you inevitably switch models or upgrade to a new version, you need to know which vectors need re-embedding. A simple model_version column saves enormous headaches.

  • Use asymmetric embedding where available. Cohere's input type system exists for a reason. Embedding queries and documents differently improves retrieval quality by 5-10% on most benchmarks. If your provider does not support this natively, consider prepending "search query: " or "passage: " to your texts as a manual asymmetry signal.

  • Test dimension reduction before committing. OpenAI's native dimension reduction is nearly free to test. Embed your validation set at 256, 512, 768, 1024, and 1536 dimensions, measure recall at each, and pick the smallest dimension that meets your quality bar. You may be surprised how much you can reduce.

  • Normalize your vectors before storage. Most vector databases expect unit-normalized vectors for cosine similarity. Normalizing once at embedding time is cheaper than normalizing at every query. Verify your model outputs normalized vectors — most modern models do, but some older open-source models do not.

  • Implement a provider abstraction layer from day one. The embedding model landscape changes rapidly. A model that is state-of-the-art today will be surpassed within 12 months. If your embedding calls are scattered across your codebase with provider-specific logic, switching will be a multi-week project instead of a config change.

  • Monitor embedding quality in production. Set up automated retrieval tests that run against your live system. Track recall@k over time. Degradation can happen gradually as your corpus grows and the embedding model struggles with distribution shift.

  • Cache embeddings aggressively. If the same text is embedded multiple times (common with frequently searched queries), cache the result. A simple Redis cache with the text hash as key can eliminate 30-50% of embedding API calls in a typical search application.
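
A minimal sketch of that caching pattern with an in-memory Map; in production you would point the lookups at Redis, but the hashing approach is identical:

var crypto = require("crypto");

var embeddingCache = new Map(); // swap for a Redis client in production

// Hash the input text and only call the provider on a cache miss.
function cachedEmbed(text, embedFn) {
  var key = crypto.createHash("sha256").update(text).digest("hex");

  if (embeddingCache.has(key)) {
    return Promise.resolve(embeddingCache.get(key));
  }

  return embedFn(text).then(function (embedding) {
    embeddingCache.set(key, embedding);
    return embedding;
  });
}

// Usage: cachedEmbed("forgot my password", function (t) {
//   return getOpenAIEmbedding(t).then(function (r) { return r.embedding; });
// });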
