
Multi-Modal Embeddings: Text, Images, and Code

Build multi-modal search with text, image, and code embeddings in a unified vector store using pgvector and Node.js.

Multi-modal embeddings project different types of data — text, images, and code — into a shared vector space where semantic similarity transcends data type boundaries. This means you can search for images using natural language, find relevant code snippets by describing what they do in English, or match documentation to the screenshots it references. In production systems, this capability transforms search from a keyword-matching exercise into genuine semantic understanding across your entire content library.

Prerequisites

  • Node.js v18 or later
  • PostgreSQL with the pgvector extension installed
  • An OpenAI API key (for text embeddings and image descriptions via GPT-4o)
  • A Voyage AI API key (for code embeddings)
  • Working familiarity with vector embeddings and cosine similarity
  • Basic understanding of how embedding models work (covered in earlier articles in this series)

What Multi-Modal Embeddings Actually Are

Traditional embedding models operate in isolation. You embed text with a text model, images with an image model, and code with a code model. Each model produces vectors in its own vector space, and those spaces have no relationship to each other. Comparing a text vector to an image vector is meaningless because they live in entirely different mathematical universes.

Multi-modal embeddings solve this by training models that project multiple data types into the same vector space. The breakthrough that made this practical was OpenAI's CLIP (Contrastive Language-Image Pre-training), which trained a model on 400 million image-text pairs so that the text "a golden retriever playing fetch" and a photograph of a golden retriever playing fetch would land near each other in vector space. The key insight is contrastive learning: during training, matching pairs are pulled together while non-matching pairs are pushed apart.

This shared vector space is not a compromise — it is a genuine semantic representation where proximity means relatedness regardless of the original data type. When you embed the text "recursive binary search implementation" and a Python function that implements recursive binary search, they should be neighbors in this space.
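
Proximity in that shared space is typically measured with cosine similarity. A minimal helper, shown here only to make the comparison concrete, looks like this:

function cosineSimilarity(a, b) {
  // a and b are equal-length embedding arrays from the same shared space
  var dot = 0;
  var normA = 0;
  var normB = 0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Because both vectors live in the same space, this works across modalities:
// cosineSimilarity(textVector, imageVector) is high when the text describes the image.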

Text-to-Image Search with CLIP-Style Models

The most mature multi-modal embedding use case is text-to-image search. CLIP-style models, including OpenAI's original CLIP and open-source successors like SigLIP and OpenCLIP, embed both text and images into compatible vector spaces. Note that OpenAI's hosted embeddings API (the text-embedding-3 family) accepts text only, so embedding images directly requires serving a CLIP-style model yourself or using a provider that offers one.

Here is what embedding an image might look like against a self-hosted CLIP-style service that mimics the OpenAI embeddings request shape (the endpoint URL, API key, and model name are placeholders for your own deployment):

var axios = require("axios");
var fs = require("fs");

function embedImage(imagePath, callback) {
  var imageBuffer = fs.readFileSync(imagePath);
  var base64Image = imageBuffer.toString("base64");
  var mimeType = imagePath.endsWith(".png") ? "image/png" : "image/jpeg";

  // CLIP_EMBEDDING_URL and CLIP_API_KEY are placeholders for whatever
  // CLIP/SigLIP serving endpoint you deploy; the payload shape below is
  // illustrative, not a documented provider API.
  var requestBody = {
    model: "clip-vit-large-patch14",
    input: {
      type: "image",
      image: "data:" + mimeType + ";base64," + base64Image
    }
  };

  axios.post(process.env.CLIP_EMBEDDING_URL + "/v1/embeddings", requestBody, {
    headers: {
      "Authorization": "Bearer " + process.env.CLIP_API_KEY,
      "Content-Type": "application/json"
    }
  })
  .then(function(response) {
    callback(null, response.data.data[0].embedding);
  })
  .catch(function(err) {
    callback(err);
  });
}

In practice, many teams use a hybrid approach: they generate text descriptions of images (via a vision model like GPT-4o) and then embed those descriptions with a standard text embedding model. This is less elegant but often more practical because it lets you use a single embedding model for everything and avoids the dimension alignment challenges we will discuss later.

var OpenAI = require("openai");
var fs = require("fs");

var openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

function describeAndEmbedImage(imagePath, callback) {
  var imageBuffer = fs.readFileSync(imagePath);
  var base64Image = imageBuffer.toString("base64");
  var mimeType = imagePath.endsWith(".png") ? "image/png" : "image/jpeg";

  openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: "Describe this image in detail for search indexing. Include objects, actions, colors, text, and technical content visible."
          },
          {
            type: "image_url",
            image_url: {
              url: "data:" + mimeType + ";base64," + base64Image
            }
          }
        ]
      }
    ],
    max_tokens: 500
  })
  .then(function(descriptionResponse) {
    var description = descriptionResponse.choices[0].message.content;
    return openai.embeddings.create({
      model: "text-embedding-3-small",
      input: description
    }).then(function(embeddingResponse) {
      callback(null, {
        description: description,
        embedding: embeddingResponse.data[0].embedding
      });
    });
  })
  .catch(function(err) {
    callback(err);
  });
}

This approach gives you text-to-image search that actually works well in production, because the description captures the semantic content of the image in a form that text embedding models handle natively.

Code Embeddings for Semantic Code Search

Code embeddings are where things get genuinely interesting for software teams. Voyage AI's voyage-code-3 model is one of the strongest options currently available for code embeddings, and it was specifically trained to embed code and natural language into the same vector space. This means you can search a codebase with queries like "function that validates email addresses" and get back the actual validation functions, regardless of variable names or comments.

var axios = require("axios");

function embedCode(codeSnippet, callback) {
  axios.post("https://api.voyageai.com/v1/embeddings", {
    model: "voyage-code-3",
    input: [codeSnippet],
    input_type: "document"
  }, {
    headers: {
      "Authorization": "Bearer " + process.env.VOYAGE_API_KEY,
      "Content-Type": "application/json"
    }
  })
  .then(function(response) {
    callback(null, response.data.data[0].embedding);
  })
  .catch(function(err) {
    callback(err);
  });
}

function searchCode(naturalLanguageQuery, callback) {
  axios.post("https://api.voyageai.com/v1/embeddings", {
    model: "voyage-code-3",
    input: [naturalLanguageQuery],
    input_type: "query"
  }, {
    headers: {
      "Authorization": "Bearer " + process.env.VOYAGE_API_KEY,
      "Content-Type": "application/json"
    }
  })
  .then(function(response) {
    callback(null, response.data.data[0].embedding);
  })
  .catch(function(err) {
    callback(err);
  });
}

Notice the input_type distinction: "document" for code being indexed and "query" for natural language searches. Voyage AI's model was trained with this asymmetry, and using the correct input type significantly improves search quality.
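
Putting the two functions together, here is a minimal in-memory sketch that ranks already-embedded snippets against a natural-language query. It assumes the cosineSimilarity helper from earlier; a production system would run this comparison inside pgvector instead:

function rankCodeSnippets(naturalLanguageQuery, snippets, callback) {
  // snippets: [{ title, embedding }], where each embedding came from embedCode()
  searchCode(naturalLanguageQuery, function(err, queryVector) {
    if (err) return callback(err);

    var ranked = snippets.map(function(snippet) {
      return {
        title: snippet.title,
        score: cosineSimilarity(queryVector, snippet.embedding)
      };
    }).sort(function(a, b) {
      return b.score - a.score;
    });

    callback(null, ranked);
  });
}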

Combining Embeddings in the Same Vector Store

Here is where architecture decisions matter. You have three options for storing multi-modal embeddings in pgvector:

Option 1: Unified column (same model, same dimensions)

If all your embeddings come from the same model and have the same dimensions, use a single vector column. This is the simplest approach and works when you use the description-then-embed pattern for images.

CREATE TABLE content_items (
  id SERIAL PRIMARY KEY,
  content_type VARCHAR(20) NOT NULL,
  title TEXT NOT NULL,
  content TEXT,
  source_path TEXT,
  metadata JSONB DEFAULT '{}',
  embedding vector(1536),
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_content_embedding ON content_items
  USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

Option 2: Separate columns (different models, different dimensions)

When you use specialized models per modality — say text-embedding-3-small (1536 dimensions) for text and voyage-code-3 (1024 dimensions) for code — you need separate columns.

CREATE TABLE content_items (
  id SERIAL PRIMARY KEY,
  content_type VARCHAR(20) NOT NULL,
  title TEXT NOT NULL,
  content TEXT,
  source_path TEXT,
  metadata JSONB DEFAULT '{}',
  text_embedding vector(1536),
  code_embedding vector(1024),
  image_embedding vector(1536),
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_text_emb ON content_items
  USING ivfflat (text_embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX idx_code_emb ON content_items
  USING ivfflat (code_embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX idx_image_emb ON content_items
  USING ivfflat (image_embedding vector_cosine_ops) WITH (lists = 100);

Option 3: Separate tables per modality

For large-scale systems, separate tables give you independent scaling and indexing control. This is what I recommend for production systems with more than 100,000 items per modality.

I strongly recommend Option 1 for most teams. Use a single text embedding model, convert images to descriptions before embedding, and you avoid the dimension alignment problem entirely. Option 2 is worth the complexity only when you need specialized code search quality that a general text model cannot provide.

Cross-Modal Search

Cross-modal search — searching images with text or finding code with natural language — is the core value proposition of multi-modal embeddings. The implementation depends on which storage option you chose.

With a unified column (Option 1), cross-modal search is free. Every query vector is compared against every document vector regardless of content type:

function crossModalSearch(queryText, options, callback) {
  var limit = options.limit || 10;
  var contentTypes = options.contentTypes || null;

  openai.embeddings.create({
    model: "text-embedding-3-small",
    input: queryText
  })
  .then(function(response) {
    var queryVector = response.data[0].embedding;
    var vectorStr = "[" + queryVector.join(",") + "]";

    var sql = "SELECT id, content_type, title, content, source_path, " +
      "1 - (embedding <=> $1::vector) AS similarity " +
      "FROM content_items";
    var params = [vectorStr];

    if (contentTypes) {
      sql += " WHERE content_type = ANY($2)";
      params.push(contentTypes);
    }

    sql += " ORDER BY embedding <=> $1::vector LIMIT $" + (params.length + 1);
    params.push(limit);

    return pool.query(sql, params);
  })
  .then(function(result) {
    callback(null, result.rows);
  })
  .catch(function(err) {
    callback(err);
  });
}

With separate columns (Option 2), you need to query each column and merge results. This is more complex but lets you weight modalities differently:

function multiColumnSearch(queryText, weights, callback) {
  var textWeight = weights.text || 0.4;
  var codeWeight = weights.code || 0.3;
  var imageWeight = weights.image || 0.3;

  embedTextQuery(queryText, function(err, textVector) {
    if (err) return callback(err);

    embedCodeQuery(queryText, function(err, codeVector) {
      if (err) return callback(err);

      var textVecStr = "[" + textVector.join(",") + "]";
      var codeVecStr = "[" + codeVector.join(",") + "]";

      var sql = "SELECT id, content_type, title, " +
        "COALESCE(1 - (text_embedding <=> $1::vector), 0) * $3 + " +
        "COALESCE(1 - (code_embedding <=> $2::vector), 0) * $4 + " +
        "COALESCE(1 - (image_embedding <=> $1::vector), 0) * $5 " +
        "AS weighted_similarity " +
        "FROM content_items " +
        "ORDER BY weighted_similarity DESC LIMIT 20";

      pool.query(sql, [textVecStr, codeVecStr, textWeight, codeWeight, imageWeight])
        .then(function(result) {
          callback(null, result.rows);
        })
        .catch(function(err) {
          callback(err);
        });
    });
  });
}

Dimension Alignment Across Modalities

When you use different embedding models per modality, you hit a fundamental problem: the vectors have different dimensions and live in different spaces. text-embedding-3-small produces 1536-dimensional vectors while voyage-code-3 produces 1024-dimensional vectors. You cannot directly compare them.

There are three practical approaches:

1. Use one model for everything. OpenAI's text-embedding-3-small handles text, text descriptions of images, and even code reasonably well. You lose some code search quality compared to Voyage, but you eliminate the alignment problem.

2. Dimensionality reduction. Project all embeddings down to a shared lower dimension using PCA or a learned projection matrix. OpenAI's text-embedding-3-small supports native dimension reduction via the dimensions parameter:

function embedWithReducedDimensions(text, targetDimensions, callback) {
  openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
    dimensions: targetDimensions
  })
  .then(function(response) {
    callback(null, response.data[0].embedding);
  })
  .catch(function(err) {
    callback(err);
  });
}

3. Query routing. Instead of aligning dimensions, route queries to the appropriate column based on detected intent. If the query looks like code, search the code column. If it looks like natural language, search text and image columns. This avoids cross-modal comparison entirely but still gives you a unified search interface.

function detectQueryModality(queryText) {
  var codePatterns = /function\s|var\s|class\s|import\s|require\(|def\s|public\s/;
  if (codePatterns.test(queryText)) {
    return "code";
  }
  var imagePatterns = /image|photo|screenshot|diagram|picture|logo|icon/i;
  if (imagePatterns.test(queryText)) {
    return "image";
  }
  return "text";
}
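
A thin dispatch layer on top of detectQueryModality might look like the sketch below; searchTextColumn, searchCodeColumn, and searchImageColumn are hypothetical helpers that embed the query with the matching model and search only their own column from the Option 2 schema:

function routedSearch(queryText, callback) {
  var modality = detectQueryModality(queryText);

  // Each helper embeds the query with the model that matches its column,
  // so no cross-model dimension alignment is ever needed.
  if (modality === "code") {
    return searchCodeColumn(queryText, callback);
  }
  if (modality === "image") {
    return searchImageColumn(queryText, callback);
  }
  return searchTextColumn(queryText, callback);
}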

My recommendation: start with Option 1 (single model) and only move to specialized models when you have measured search quality and found it insufficient for a specific modality.

Handling Modality-Specific Preprocessing

Each data type needs different preprocessing before embedding. Text needs chunking and cleaning. Code needs syntax-aware splitting. Images need captioning or feature extraction. Building a robust preprocessing pipeline is essential.

function preprocessText(rawText) {
  var cleaned = rawText.replace(/<[^>]+>/g, "");
  cleaned = cleaned.replace(/\s+/g, " ").trim();

  if (cleaned.length > 8000) {
    cleaned = cleaned.substring(0, 8000);
  }
  return cleaned;
}

function preprocessCode(sourceCode, language) {
  var lines = sourceCode.split("\n");

  var significantLines = lines.filter(function(line) {
    var trimmed = line.trim();
    if (trimmed === "") return false;
    // Drop near-empty comment lines (e.g. a bare "//") that carry no signal
    if (trimmed.startsWith("//") && trimmed.length < 5) return false;
    return true;
  });

  var processed = significantLines.join("\n");

  if (language) {
    processed = "Language: " + language + "\n" + processed;
  }

  if (processed.length > 8000) {
    processed = processed.substring(0, 8000);
  }

  return processed;
}

function preprocessImage(imagePath, callback) {
  describeAndEmbedImage(imagePath, function(err, result) {
    if (err) return callback(err);
    callback(null, {
      text: result.description,
      metadata: {
        originalPath: imagePath,
        type: "image",
        descriptionModel: "gpt-4o"
      }
    });
  });
}
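
Note that preprocessText above simply truncates long documents at 8,000 characters. When you cannot afford to drop content, chunk instead and embed each chunk as its own row; a minimal fixed-size chunker with overlap (the sizes are illustrative) might look like this:

function chunkText(text, chunkSize, overlap) {
  // e.g. chunkSize = 2000 characters with overlap = 200 to preserve context
  var chunks = [];
  var step = Math.max(1, chunkSize - overlap);
  for (var start = 0; start < text.length; start += step) {
    chunks.push(text.substring(start, start + chunkSize));
  }
  return chunks;
}

// Each chunk gets its own embedding and row, sharing the parent document id in metadata.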

For code specifically, prepending the language name as a tag improves embedding quality. Voyage AI's documentation confirms that including language context helps the model disambiguate syntax that appears in multiple languages.

Evaluating Cross-Modal Search Quality

Measuring cross-modal search quality is harder than measuring single-modality search because you need test sets that span modalities. Here is a practical evaluation framework:

function evaluateSearch(testCases, searchFunction, callback) {
  var results = {
    totalCases: testCases.length,
    hits: 0,
    reciprocalRankSum: 0,
    detailedResults: []
  };

  var completed = 0;

  testCases.forEach(function(testCase, index) {
    searchFunction(testCase.query, { limit: 10 }, function(err, searchResults) {
      if (err) {
        results.detailedResults.push({
          query: testCase.query,
          error: err.message
        });
        completed++;
        if (completed === testCases.length) finalize();
        return;
      }

      var expectedId = testCase.expectedId;
      var rank = -1;

      for (var i = 0; i < searchResults.length; i++) {
        if (searchResults[i].id === expectedId) {
          rank = i + 1;
          break;
        }
      }

      if (rank > 0) {
        results.hits++;
        results.reciprocalRankSum += 1 / rank;
      }

      results.detailedResults.push({
        query: testCase.query,
        expectedType: testCase.expectedType,
        found: rank > 0,
        rank: rank,
        topResult: searchResults[0] ? searchResults[0].title : null
      });

      completed++;
      if (completed === testCases.length) finalize();
    });
  });

  function finalize() {
    results.recall = results.hits / results.totalCases;
    results.mrr = results.reciprocalRankSum / results.totalCases;
    callback(null, results);
  }
}

// Example test cases for cross-modal evaluation
var testCases = [
  {
    query: "function that sorts an array",
    expectedId: 42,
    expectedType: "code"
  },
  {
    query: "architecture diagram showing microservices",
    expectedId: 87,
    expectedType: "image"
  },
  {
    query: "how to configure database connection pooling",
    expectedId: 15,
    expectedType: "text"
  }
];

Target metrics for production systems: recall@10 above 0.7 and MRR above 0.5. If you are below these thresholds, the first thing to check is your preprocessing pipeline — bad preprocessing is almost always the bottleneck, not the embedding model.
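
Wiring the harness up takes one call; this sketch assumes the search function from the complete example later in this article:

evaluateSearch(testCases, search, function(err, results) {
  if (err) return console.error("Evaluation failed:", err.message);
  console.log("recall@10: " + results.recall.toFixed(2));
  console.log("MRR: " + results.mrr.toFixed(2));
});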

Use Cases

Documentation search. Index technical docs (text), API screenshots and diagrams (images), and code examples (code) into a single store. Engineers search with natural language and get results across all three types. This is dramatically better than keyword search for internal documentation.

Product catalogs. E-commerce systems embed product descriptions and product images together. Customers search with text like "comfortable running shoes for flat feet" and get results ranked by semantic similarity across both product copy and visual features.

Code repositories. Index an entire codebase with code embeddings. Developers search with natural language queries like "retry logic with exponential backoff" and find implementations regardless of function names, variable conventions, or programming language. Pair this with documentation embeddings to also surface relevant READMEs and design docs.

Technical support. Support agents search with customer problem descriptions and find matching knowledge base articles (text), troubleshooting screenshots (images), and relevant code fixes (code) in a single query.

Cost Considerations for Multi-Modal Embeddings

Multi-modal embedding costs add up fast. Here is a realistic breakdown:

Operation | Model | Approximate cost
Text embedding | text-embedding-3-small | $0.02 per 1M tokens
Text embedding | text-embedding-3-large | $0.13 per 1M tokens
Code embedding | voyage-code-3 | $0.18 per 1M tokens
Image description | GPT-4o | ~$2.50 per 1K images (input)
Image embedding (via description) | text-embedding-3-small | $0.02 per 1M tokens

The image pipeline is by far the most expensive because you pay for the vision model to generate descriptions. For a catalog of 100,000 product images, the description step alone costs approximately $250. Budget for this and cache aggressively — never re-describe an image you have already processed.

function embedWithCache(itemId, content, contentType, callback) {
  pool.query(
    "SELECT embedding FROM content_items WHERE id = $1 AND embedding IS NOT NULL",
    [itemId]
  )
  .then(function(result) {
    if (result.rows.length > 0) {
      return callback(null, result.rows[0].embedding);
    }

    generateEmbedding(content, contentType, function(err, embedding) {
      if (err) return callback(err);

      var vectorStr = "[" + embedding.join(",") + "]";
      pool.query(
        "UPDATE content_items SET embedding = $1::vector WHERE id = $2",
        [vectorStr, itemId]
      )
      .then(function() {
        callback(null, embedding);
      })
      .catch(function(err) {
        callback(err);
      });
    });
  })
  .catch(function(err) {
    callback(err);
  });
}

Complete Working Example

Here is a full multi-modal search engine that indexes text documents, code snippets, and image descriptions in a unified pgvector store with cross-modal query support.

// multimodal-search.js
var pg = require("pg");
var OpenAI = require("openai");
var fs = require("fs");
var path = require("path");

var openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

var pool = new pg.Pool({
  connectionString: process.env.POSTGRES_CONNECTION_STRING
});

var EMBEDDING_MODEL = "text-embedding-3-small";
var EMBEDDING_DIMENSIONS = 1536;

// ===== Database Setup =====

function initDatabase(callback) {
  var schema = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    "CREATE TABLE IF NOT EXISTS multimodal_items (" +
    "  id SERIAL PRIMARY KEY," +
    "  content_type VARCHAR(20) NOT NULL," +
    "  title TEXT NOT NULL," +
    "  content TEXT NOT NULL," +
    "  original_source TEXT," +
    "  metadata JSONB DEFAULT '{}'," +
    "  embedding vector(" + EMBEDDING_DIMENSIONS + ")," +
    "  created_at TIMESTAMP DEFAULT NOW()" +
    ")",
    // Creating the ivfflat index on an empty table is fine for this small demo;
    // for real datasets, create it after the bulk load (see troubleshooting below).
    "CREATE INDEX IF NOT EXISTS idx_mm_embedding ON multimodal_items " +
    "  USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)",
    "CREATE INDEX IF NOT EXISTS idx_mm_content_type ON multimodal_items (content_type)"
  ];

  // Run the statements sequentially: the indexes depend on the table and the
  // table depends on the vector extension, so they cannot run concurrently.
  function runNext(i) {
    if (i >= schema.length) {
      console.log("Database initialized successfully");
      return callback(null);
    }
    pool.query(schema[i])
      .then(function() {
        runNext(i + 1);
      })
      .catch(function(err) {
        callback(err);
      });
  }

  runNext(0);
}

// ===== Embedding Generation =====

function generateTextEmbedding(text, callback) {
  openai.embeddings.create({
    model: EMBEDDING_MODEL,
    input: text.substring(0, 8000)
  })
  .then(function(response) {
    callback(null, response.data[0].embedding);
  })
  .catch(function(err) {
    callback(err);
  });
}

function generateImageDescription(imagePath, callback) {
  var imageBuffer = fs.readFileSync(imagePath);
  var base64Image = imageBuffer.toString("base64");
  var ext = path.extname(imagePath).toLowerCase();
  var mimeType = ext === ".png" ? "image/png" : "image/jpeg";

  openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: "Describe this image in detail for search indexing. " +
              "Include all visible objects, text, colors, layout, " +
              "and any technical content."
          },
          {
            type: "image_url",
            image_url: {
              url: "data:" + mimeType + ";base64," + base64Image
            }
          }
        ]
      }
    ],
    max_tokens: 500
  })
  .then(function(response) {
    callback(null, response.choices[0].message.content);
  })
  .catch(function(err) {
    callback(err);
  });
}

// ===== Indexing Functions =====

function indexTextDocument(title, content, metadata, callback) {
  var processedContent = content.replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim();

  generateTextEmbedding(processedContent, function(err, embedding) {
    if (err) return callback(err);

    var vectorStr = "[" + embedding.join(",") + "]";
    pool.query(
      "INSERT INTO multimodal_items (content_type, title, content, metadata, embedding) " +
      "VALUES ($1, $2, $3, $4, $5::vector) RETURNING id",
      ["text", title, processedContent, JSON.stringify(metadata || {}), vectorStr]
    )
    .then(function(result) {
      console.log("Indexed text document: " + title + " (id: " + result.rows[0].id + ")");
      callback(null, result.rows[0].id);
    })
    .catch(function(err) {
      callback(err);
    });
  });
}

function indexCodeSnippet(title, code, language, metadata, callback) {
  var processedCode = "Language: " + language + "\n" + code;

  generateTextEmbedding(processedCode, function(err, embedding) {
    if (err) return callback(err);

    var itemMetadata = metadata || {};
    itemMetadata.language = language;

    var vectorStr = "[" + embedding.join(",") + "]";
    pool.query(
      "INSERT INTO multimodal_items (content_type, title, content, metadata, embedding) " +
      "VALUES ($1, $2, $3, $4, $5::vector) RETURNING id",
      ["code", title, code, JSON.stringify(itemMetadata), vectorStr]
    )
    .then(function(result) {
      console.log("Indexed code snippet: " + title + " (id: " + result.rows[0].id + ")");
      callback(null, result.rows[0].id);
    })
    .catch(function(err) {
      callback(err);
    });
  });
}

function indexImage(title, imagePath, metadata, callback) {
  generateImageDescription(imagePath, function(err, description) {
    if (err) return callback(err);

    generateTextEmbedding(description, function(err, embedding) {
      if (err) return callback(err);

      var itemMetadata = metadata || {};
      itemMetadata.originalPath = imagePath;
      itemMetadata.description = description;

      var vectorStr = "[" + embedding.join(",") + "]";
      pool.query(
        "INSERT INTO multimodal_items " +
        "(content_type, title, content, original_source, metadata, embedding) " +
        "VALUES ($1, $2, $3, $4, $5, $6::vector) RETURNING id",
        [
          "image", title, description, imagePath,
          JSON.stringify(itemMetadata), vectorStr
        ]
      )
      .then(function(result) {
        console.log("Indexed image: " + title + " (id: " + result.rows[0].id + ")");
        callback(null, result.rows[0].id);
      })
      .catch(function(err) {
        callback(err);
      });
    });
  });
}

// ===== Search Functions =====

function search(query, options, callback) {
  var limit = (options && options.limit) || 10;
  var contentTypes = (options && options.contentTypes) || null;
  var minSimilarity = (options && options.minSimilarity) || 0.0;

  generateTextEmbedding(query, function(err, queryEmbedding) {
    if (err) return callback(err);

    var vectorStr = "[" + queryEmbedding.join(",") + "]";
    var params = [vectorStr];
    var conditions = [];

    if (contentTypes && contentTypes.length > 0) {
      params.push(contentTypes);
      conditions.push("content_type = ANY($" + params.length + ")");
    }

    if (minSimilarity > 0) {
      params.push(minSimilarity);
      conditions.push("1 - (embedding <=> $1::vector) >= $" + params.length);
    }

    var whereClause = conditions.length > 0
      ? " WHERE " + conditions.join(" AND ")
      : "";

    params.push(limit);

    var sql = "SELECT id, content_type, title, content, original_source, metadata, " +
      "1 - (embedding <=> $1::vector) AS similarity " +
      "FROM multimodal_items" + whereClause +
      " ORDER BY embedding <=> $1::vector " +
      "LIMIT $" + params.length;

    pool.query(sql, params)
      .then(function(result) {
        var items = result.rows.map(function(row) {
          return {
            id: row.id,
            type: row.content_type,
            title: row.title,
            content: row.content.substring(0, 200) +
              (row.content.length > 200 ? "..." : ""),
            source: row.original_source,
            similarity: parseFloat(row.similarity).toFixed(4),
            metadata: typeof row.metadata === "string"
              ? JSON.parse(row.metadata) : row.metadata
          };
        });
        callback(null, items);
      })
      .catch(function(err) {
        callback(err);
      });
  });
}

function searchByType(query, contentType, limit, callback) {
  search(query, { contentTypes: [contentType], limit: limit }, callback);
}

// ===== Batch Indexing Pipeline =====

function indexBatch(items, callback) {
  var completed = 0;
  var errors = [];
  var results = [];

  if (items.length === 0) return callback(null, []);

  function processNext() {
    if (completed >= items.length) {
      if (errors.length > 0) {
        console.error("Batch indexing completed with " + errors.length + " errors");
      }
      callback(errors.length > 0 ? errors : null, results);
      return;
    }

    var item = items[completed];
    var indexFn;

    if (item.type === "text") {
      indexFn = function(cb) {
        indexTextDocument(item.title, item.content, item.metadata, cb);
      };
    } else if (item.type === "code") {
      indexFn = function(cb) {
        indexCodeSnippet(item.title, item.content, item.language, item.metadata, cb);
      };
    } else if (item.type === "image") {
      indexFn = function(cb) {
        indexImage(item.title, item.path, item.metadata, cb);
      };
    } else {
      completed++;
      errors.push(new Error("Unknown content type: " + item.type));
      processNext();
      return;
    }

    indexFn(function(err, id) {
      if (err) {
        errors.push({ item: item.title, error: err.message });
      } else {
        results.push({ item: item.title, id: id });
      }
      completed++;

      // Rate limiting: 200ms between API calls
      setTimeout(processNext, 200);
    });
  }

  processNext();
}

// ===== Demo Usage =====

function runDemo() {
  initDatabase(function(err) {
    if (err) {
      console.error("Failed to initialize database:", err.message);
      process.exit(1);
    }

    var sampleItems = [
      {
        type: "text",
        title: "Database Connection Pooling Guide",
        content: "Connection pooling is a technique for managing database " +
          "connections efficiently. Instead of opening a new connection for " +
          "each query, a pool maintains a set of reusable connections. This " +
          "reduces latency and prevents connection exhaustion under load. " +
          "Configure your pool size based on your database max_connections " +
          "setting and expected concurrent query volume.",
        metadata: { category: "database", difficulty: "intermediate" }
      },
      {
        type: "code",
        title: "Express Rate Limiter Middleware",
        language: "javascript",
        content: 'var rateLimit = {};\n\n' +
          'function rateLimiter(maxRequests, windowMs) {\n' +
          '  return function(req, res, next) {\n' +
          '    var ip = req.ip;\n' +
          '    var now = Date.now();\n' +
          '    if (!rateLimit[ip]) {\n' +
          '      rateLimit[ip] = { count: 1, resetAt: now + windowMs };\n' +
          '      return next();\n' +
          '    }\n' +
          '    if (now > rateLimit[ip].resetAt) {\n' +
          '      rateLimit[ip] = { count: 1, resetAt: now + windowMs };\n' +
          '      return next();\n' +
          '    }\n' +
          '    rateLimit[ip].count++;\n' +
          '    if (rateLimit[ip].count > maxRequests) {\n' +
          '      return res.status(429).json({ error: "Too many requests" });\n' +
          '    }\n' +
          '    next();\n' +
          '  };\n' +
          '}',
        metadata: { category: "middleware", framework: "express" }
      },
      {
        type: "text",
        title: "Microservices Communication Patterns",
        content: "Microservices communicate through synchronous protocols " +
          "like HTTP/REST and gRPC, or asynchronous patterns using message " +
          "brokers like RabbitMQ and Kafka. Synchronous communication is " +
          "simpler but creates tight coupling. Asynchronous messaging " +
          "provides better resilience and scalability at the cost of " +
          "increased complexity in error handling and eventual consistency.",
        metadata: { category: "architecture", difficulty: "advanced" }
      },
      {
        type: "code",
        title: "Binary Search Implementation",
        language: "javascript",
        content: 'function binarySearch(arr, target) {\n' +
          '  var low = 0;\n' +
          '  var high = arr.length - 1;\n' +
          '  while (low <= high) {\n' +
          '    var mid = Math.floor((low + high) / 2);\n' +
          '    if (arr[mid] === target) return mid;\n' +
          '    if (arr[mid] < target) low = mid + 1;\n' +
          '    else high = mid - 1;\n' +
          '  }\n' +
          '  return -1;\n' +
          '}',
        metadata: { category: "algorithms", difficulty: "beginner" }
      }
    ];

    console.log("Indexing " + sampleItems.length + " items...\n");

    indexBatch(sampleItems, function(err) {
      if (err) {
        console.error("Indexing errors:", err);
      }

      console.log("\n--- Cross-Modal Search Demo ---\n");

      var queries = [
        { text: "how to limit API request frequency", label: "NL -> Code" },
        { text: "finding an element in a sorted array", label: "NL -> Code" },
        { text: "managing database connections efficiently", label: "NL -> Text" },
        { text: "async vs sync service communication", label: "NL -> Text" }
      ];

      var qi = 0;
      function runNextQuery() {
        if (qi >= queries.length) {
          console.log("\nDemo complete.");
          pool.end();
          return;
        }

        var q = queries[qi];
        console.log("[" + q.label + '] Query: "' + q.text + '"');

        search(q.text, { limit: 3 }, function(err, results) {
          if (err) {
            console.error("Search error:", err.message);
          } else {
            results.forEach(function(r, i) {
              console.log(
                "  " + (i + 1) + ". [" + r.type + "] " +
                r.title + " (similarity: " + r.similarity + ")"
              );
            });
          }
          console.log("");
          qi++;
          runNextQuery();
        });
      }

      runNextQuery();
    });
  });
}

// ===== Exports =====

module.exports = {
  initDatabase: initDatabase,
  indexTextDocument: indexTextDocument,
  indexCodeSnippet: indexCodeSnippet,
  indexImage: indexImage,
  indexBatch: indexBatch,
  search: search,
  searchByType: searchByType
};

if (require.main === module) {
  runDemo();
}

Run the demo with:

OPENAI_API_KEY=sk-your-key POSTGRES_CONNECTION_STRING=postgresql://user:pass@localhost:5432/multimodal node multimodal-search.js

Common Issues and Troubleshooting

1. Dimension mismatch when inserting vectors

error: expected 1536 dimensions, not 1024

This happens when you switch embedding models or accidentally use a model that produces different dimensions than your table schema expects. Verify your model output matches your vector(N) column definition. If you need to change dimensions, you cannot cast the existing vectors in place: clear them first (for example, UPDATE multimodal_items SET embedding = NULL), then run ALTER TABLE multimodal_items ALTER COLUMN embedding TYPE vector(1024), and re-embed everything with the new model.
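
A cheap guard at indexing time catches the mismatch before it ever reaches Postgres. This sketch assumes the pool and EMBEDDING_DIMENSIONS constant from the complete example; insertEmbedding is a hypothetical helper:

function insertEmbedding(itemId, embedding, callback) {
  // Fail fast if the model output does not match the vector(N) column.
  if (embedding.length !== EMBEDDING_DIMENSIONS) {
    return callback(new Error(
      "Embedding has " + embedding.length +
      " dimensions, expected " + EMBEDDING_DIMENSIONS
    ));
  }

  var vectorStr = "[" + embedding.join(",") + "]";
  pool.query(
    "UPDATE multimodal_items SET embedding = $1::vector WHERE id = $2",
    [vectorStr, itemId]
  )
  .then(function() {
    callback(null);
  })
  .catch(function(err) {
    callback(err);
  });
}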

2. IVFFlat index built before the data is loaded

NOTICE: ivfflat index created with little data

IVFFlat derives its cluster centroids from the rows that exist when the index is created. Building it on an empty or nearly empty table does not fail, but pgvector warns you and recall will be poor until the index is rebuilt. Load your data first and then create the index (or REINDEX after a bulk load). Alternatively, use HNSW, which does not depend on pre-existing data:

CREATE INDEX idx_mm_embedding ON multimodal_items
  USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);

3. OpenAI embedding API rate limits

Error: 429 Rate limit reached for text-embedding-3-small in organization org-xxx on tokens per min (TPM): Limit 1000000

When batch-indexing thousands of items, you will hit rate limits. Implement exponential backoff and batch your embedding requests. OpenAI supports up to 2048 inputs per embedding request:

function embedBatchWithRetry(texts, retries, callback) {
  openai.embeddings.create({
    model: EMBEDDING_MODEL,
    input: texts
  })
  .then(function(response) {
    var embeddings = response.data.map(function(item) {
      return item.embedding;
    });
    callback(null, embeddings);
  })
  .catch(function(err) {
    if (err.status === 429 && retries > 0) {
      var delay = Math.pow(2, 4 - retries) * 1000;
      console.log("Rate limited. Retrying in " + delay + "ms...");
      setTimeout(function() {
        embedBatchWithRetry(texts, retries - 1, callback);
      }, delay);
    } else {
      callback(err);
    }
  });
}

4. Image description generation returns generic or unhelpful descriptions

Response: "This is an image showing some content on a screen."

This happens when the image is low resolution, heavily compressed, or when the prompt is too vague. Use a specific, structured prompt that tells the vision model exactly what to extract. Also ensure your images are at least 512x512 pixels and in a lossless or high-quality format:

var detailedPrompt = "Analyze this image and provide a structured description:\n" +
  "1. Main subject and objects visible\n" +
  "2. Any text, labels, or annotations\n" +
  "3. Colors and visual style\n" +
  "4. Technical content (diagrams, code, charts)\n" +
  "5. Context and likely purpose of this image\n" +
  "Be specific and detailed. Avoid generic descriptions.";

5. Cross-modal search returns poor results across modalities

When text queries consistently fail to find relevant code, the problem is usually that your text embedding model was not trained on code. The text-embedding-3-small model handles code reasonably well, but for best results, prepend language context and include function signatures. If quality is still unacceptable, consider switching to voyage-code-3 for the code modality and using the separate-column architecture.

6. Memory issues with large image batches

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

Base64-encoding large images consumes significant memory. Process images sequentially rather than in parallel, and increase the Node.js heap size if needed:

node --max-old-space-size=4096 multimodal-search.js

Best Practices

  • Start with a single embedding model. Use text-embedding-3-small for everything initially. Convert images to text descriptions before embedding. This gives you a unified vector space with zero alignment complexity. Only introduce specialized models when you have evidence that search quality is insufficient for a specific modality.

  • Cache embeddings aggressively. Never re-compute an embedding for content that has not changed. Store the embedding model name and version alongside the vector so you know when re-embedding is needed after a model upgrade. A minimal sketch of this pattern appears after this list.

  • Preprocess each modality appropriately. Strip HTML from text. Prepend language tags to code. Generate detailed, structured descriptions for images. The quality of your preprocessing directly determines search quality — garbage in, garbage out.

  • Use HNSW indexes for tables under 100,000 rows. HNSW provides better recall than IVFFlat for smaller datasets and does not require pre-existing data to build the index. Switch to IVFFlat only when HNSW memory consumption becomes a concern at scale.

  • Include modality metadata in search results. Always return the content_type field so your application can render results appropriately. A code result should display with syntax highlighting while a text result should render as prose.

  • Implement query-time modality filtering. Let users optionally restrict searches to specific content types. A developer searching for code does not want documentation results cluttering the top positions.

  • Batch embedding requests. OpenAI's API accepts up to 2048 inputs per request. Batching dramatically reduces indexing time and helps you stay under rate limits. Process arrays of 100-500 items per batch for optimal throughput.

  • Monitor embedding costs by modality. Image descriptions via GPT-4o are orders of magnitude more expensive than text embeddings. Track costs per content type so you can make informed decisions about caching, re-indexing frequency, and model selection.

  • Version your vector store schema. When you change embedding models, all existing vectors become incompatible. Maintain a schema version and plan for full re-indexing when upgrading models. A good pattern is to write new embeddings to a shadow column, validate search quality, then swap columns.

  • Test cross-modal retrieval explicitly. Build a test suite with queries in one modality and expected results in another. If your text-to-code recall drops below 0.6, investigate your code preprocessing pipeline before blaming the model.
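
As referenced in the caching bullet above, here is a minimal sketch of storing the embedding model name alongside each vector. It reuses the metadata JSONB column and helpers from the complete example rather than adding a new column:

function indexTextWithModelVersion(title, content, callback) {
  generateTextEmbedding(content, function(err, embedding) {
    if (err) return callback(err);

    var vectorStr = "[" + embedding.join(",") + "]";
    // Record which model produced this vector so a later upgrade can
    // find rows that still need re-embedding.
    var metadata = { embedding_model: EMBEDDING_MODEL };

    pool.query(
      "INSERT INTO multimodal_items (content_type, title, content, metadata, embedding) " +
      "VALUES ($1, $2, $3, $4, $5::vector) RETURNING id",
      ["text", title, content, JSON.stringify(metadata), vectorStr]
    )
    .then(function(result) {
      callback(null, result.rows[0].id);
    })
    .catch(function(err) {
      callback(err);
    });
  });
}

// Rows where metadata->>'embedding_model' differs from the current
// EMBEDDING_MODEL are the ones to re-embed after a model upgrade.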
