Similarity Search Optimization Techniques
Optimize similarity search with ANN indexes, parameter tuning, dimensionality reduction, and scaling strategies for pgvector in Node.js.
Overview
Similarity search is the backbone of any embedding-powered application, but it becomes a serious bottleneck the moment your dataset grows beyond a few thousand vectors. This article dives deep into the optimization techniques that take pgvector from a toy demo to a production-grade vector search engine: approximate nearest neighbor indexes, parameter tuning, dimensionality reduction, quantization, caching strategies, and horizontal scaling. If you are running similarity search in Node.js with PostgreSQL, these are the techniques that will determine whether your system responds in 5 milliseconds or 5 seconds.
Prerequisites
- Node.js v18+ installed
- PostgreSQL 15+ with pgvector 0.5+ extension
- Working knowledge of embeddings and cosine similarity
- Familiarity with PostgreSQL indexing concepts
- The pg npm package installed
- An embedding model (OpenAI, Cohere, or local) producing vectors
Why Brute-Force Similarity Search Does Not Scale
When you run a basic similarity query against pgvector, PostgreSQL performs a sequential scan. It computes the distance between your query vector and every single row in the table:
SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 10;
This is an exact nearest neighbor search. For 10,000 rows with 1536-dimensional vectors, it finishes in about 50ms. That seems fine until you realize the relationship is linear. At 100,000 rows, you are at 500ms. At 1 million rows, you are looking at 5 seconds per query. At 10 million rows, the query is essentially unusable for real-time applications.
The math is unforgiving. Each distance computation for a 1536-dimensional vector involves 1536 multiplications and 1535 additions. Multiply that by the number of rows, and you are performing billions of floating-point operations per query. No amount of CPU optimization will fix this at scale.
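To make that concrete, here is the back-of-envelope estimate as code, assuming roughly two floating-point operations (one multiply, one add) per dimension per row:

```javascript
// Rough FLOP count for one brute-force similarity query:
// each distance over `dims` dimensions costs about 2 * dims operations,
// and a sequential scan pays that cost for every row.
function bruteForceFlops(rows, dims) {
  return rows * 2 * dims;
}
```

At 1 million rows and 1536 dimensions this works out to about 3 billion floating-point operations per query, which is why the latency curve in the paragraph above is linear in table size.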
The answer is approximate nearest neighbor (ANN) algorithms. These trade a small amount of accuracy for dramatic improvements in speed. Instead of checking every vector, they use data structures that narrow the search space to a fraction of the total dataset.
Understanding Approximate Nearest Neighbor Algorithms
ANN algorithms work on a simple principle: organize vectors into a structure that allows you to skip most of the dataset while still finding results that are close enough to the true nearest neighbors.
The two main approaches available in pgvector are:
- IVFFlat (Inverted File with Flat Compression) - Partitions vectors into clusters using k-means, then only searches the clusters closest to the query vector.
- HNSW (Hierarchical Navigable Small World) - Builds a multi-layered graph where each node connects to its nearest neighbors, allowing greedy traversal from coarse to fine levels.
Both approaches sacrifice some recall (the fraction of the true nearest neighbors that the search actually returns) in exchange for sub-linear search time. The key engineering challenge is tuning these algorithms so recall stays above 95% while search time drops by orders of magnitude.
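Recall is straightforward to measure: run the same query with and without the index (or with the index parameters set high enough to be effectively exact) and compare the returned ids. A small helper for that comparison:

```javascript
// recall@k: the fraction of the top-k exact (ground-truth) ids that also
// appear in the top-k ANN results.
function recallAtK(exactIds, annIds, k) {
  var truth = {};
  exactIds.slice(0, k).forEach(function (id) {
    truth[id] = true;
  });
  var hits = 0;
  annIds.slice(0, k).forEach(function (id) {
    if (truth[id]) {
      hits++;
    }
  });
  return hits / k;
}
```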
IVFFlat Indexes in pgvector
How IVFFlat Works
IVFFlat divides your vector space into a predetermined number of regions (called "lists" or "cells") using k-means clustering. During index creation, pgvector runs k-means on your data to find cluster centroids. Each vector is then assigned to the nearest centroid.
At query time, pgvector computes the distance from the query vector to all centroids, identifies the closest centroids, and then performs an exact search only within those clusters. If you have 1000 lists and probe 10 of them, you only scan 1% of the data.
Creating an IVFFlat Index
-- Create an IVFFlat index for cosine distance
CREATE INDEX idx_documents_embedding_ivfflat
ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
The lists parameter controls the number of clusters. The general guideline is:
- For up to 1 million rows: lists = rows / 1000
- For over 1 million rows: lists = sqrt(rows)
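As a sketch, the guideline above can be wrapped in a helper. The floor of 10 lists for very small tables is an assumption of this sketch, not a pgvector rule:

```javascript
// Pick a starting "lists" value from the row count:
// rows / 1000 up to 1 million rows, sqrt(rows) beyond that.
function recommendedLists(rowCount) {
  if (rowCount <= 1000000) {
    return Math.max(10, Math.round(rowCount / 1000));
  }
  return Math.round(Math.sqrt(rowCount));
}
```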
Setting the Probe Parameter
The probes parameter controls how many clusters are searched at query time. Higher probes means better recall but slower queries:
var pg = require("pg");
function createIVFFlatSearchClient(connectionString) {
  var pool = new pg.Pool({ connectionString: connectionString });
  function search(queryVector, probes, limit) {
    // ivfflat.probes is session-scoped, so the SET and the SELECT must run
    // on the same connection. pool.query() may hand each statement to a
    // different pooled connection, so check out a dedicated client.
    return pool.connect().then(function (client) {
      return client
        .query("SET ivfflat.probes = " + probes)
        .then(function () {
          return client.query(
            "SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity " +
              "FROM documents " +
              "ORDER BY embedding <=> $1::vector " +
              "LIMIT $2",
            ["[" + queryVector.join(",") + "]", limit]
          );
        })
        .then(function (res) {
          client.release();
          return res;
        })
        .catch(function (err) {
          client.release();
          throw err;
        });
    });
  }
  return { search: search, pool: pool };
}
When to Use IVFFlat
IVFFlat is the right choice when:
- Your dataset changes infrequently (index creation requires a full k-means pass)
- You need fast index build times compared to HNSW
- Memory is constrained (IVFFlat indexes are smaller than HNSW)
- You can tolerate periodic reindexing as data distribution changes
The major downside of IVFFlat is that the clusters become stale as you insert new data. If your data distribution shifts significantly, you need to rebuild the index with REINDEX INDEX idx_documents_embedding_ivfflat;.
HNSW Indexes in pgvector
How HNSW Works
HNSW constructs a multi-layered graph. The bottom layer contains all vectors. Each higher layer contains a progressively sparser subset of vectors. Edges connect vectors that are close to each other in the embedding space.
During a search, the algorithm starts at the top layer, finds the closest node, then drops to the next layer and repeats. Each layer refines the search, navigating through the graph by always moving toward nodes that are closer to the query vector. By the time it reaches the bottom layer, it has narrowed down to a small neighborhood of candidates.
Creating an HNSW Index
-- Create an HNSW index for cosine distance
CREATE INDEX idx_documents_embedding_hnsw
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
The two key parameters are:
- m: The maximum number of connections per node in each layer. Higher values create a denser graph with better recall but consume more memory.
- ef_construction: The size of the dynamic candidate list during index construction. Higher values produce a better-quality graph but take longer to build.
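For capacity planning, a rough size estimate helps decide whether the index will fit in memory. This sketch assumes float32 components plus roughly 2 * m four-byte neighbor links per vector, and ignores upper graph layers and PostgreSQL page overhead, so treat it as a loose lower bound rather than what pgvector actually reports:

```javascript
// Back-of-envelope HNSW index size estimate (bytes).
function estimateHnswBytes(rowCount, dims, m) {
  var vectorBytes = dims * 4;  // float32 components
  var linkBytes = 2 * m * 4;   // ~2m neighbor ids per vector on the base layer
  return rowCount * (vectorBytes + linkBytes);
}
```

Compare the result against shared_buffers; once the index no longer fits, query latency degrades sharply as pages spill to disk.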
Setting the ef_search Parameter
At query time, hnsw.ef_search controls the size of the candidate list during search. It must be at least as large as the LIMIT in your query:
function createHNSWSearchClient(connectionString) {
  var pool = new pg.Pool({ connectionString: connectionString });
  function search(queryVector, efSearch, limit) {
    // hnsw.ef_search is session-scoped; check out one connection so the
    // SET and the SELECT are guaranteed to run on the same session.
    return pool.connect().then(function (client) {
      return client
        .query("SET hnsw.ef_search = " + efSearch)
        .then(function () {
          return client.query(
            "SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity " +
              "FROM documents " +
              "ORDER BY embedding <=> $1::vector " +
              "LIMIT $2",
            ["[" + queryVector.join(",") + "]", limit]
          );
        })
        .then(function (res) {
          client.release();
          return res;
        })
        .catch(function (err) {
          client.release();
          throw err;
        });
    });
  }
  return { search: search, pool: pool };
}
When to Use HNSW
HNSW is the right choice when:
- Your dataset is updated frequently (no reindexing needed for new inserts)
- You need consistently high recall (HNSW typically achieves >95% recall with default settings)
- Query latency is critical and you can afford the memory overhead
- Your dataset is large enough that IVFFlat cluster quality would degrade between reindexes
The downside is that HNSW indexes are significantly larger in memory and take much longer to build than IVFFlat.
IVFFlat vs HNSW Performance Comparison
Here is a realistic performance comparison based on a dataset of 500,000 vectors at 1536 dimensions:
| Metric | Sequential Scan | IVFFlat (100 lists, 10 probes) | HNSW (m=16, ef_search=40) |
|---|---|---|---|
| Query Latency | ~2500ms | ~25ms | ~8ms |
| Recall@10 | 100% | ~92% | ~98% |
| Index Build Time | N/A | ~2 min | ~15 min |
| Index Size on Disk | N/A | ~1.2 GB | ~2.8 GB |
| Insert Performance | Instant | Moderate (cluster assignment) | Slower (graph updates) |
The takeaway is clear: HNSW wins on query performance and recall, but IVFFlat wins on build time and memory footprint. For most production workloads where query latency matters, HNSW is the better default choice.
Tuning Index Parameters
IVFFlat: Tuning Lists and Probes
The relationship between lists and probes is the core tuning lever for IVFFlat. More lists means finer-grained clusters, which requires more probes to maintain recall:
function benchmarkIVFFlat(pool, queryVector, listsValues, probesValues) {
var results = [];
function runBenchmark(listsIdx, probesIdx) {
if (listsIdx >= listsValues.length) {
return Promise.resolve(results);
}
if (probesIdx >= probesValues.length) {
return runBenchmark(listsIdx + 1, 0);
}
var lists = listsValues[listsIdx];
var probes = probesValues[probesIdx];
return pool
.query("DROP INDEX IF EXISTS idx_bench_ivfflat")
.then(function () {
return pool.query(
"CREATE INDEX idx_bench_ivfflat ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = " +
lists +
")"
);
})
.then(function () {
return pool.query("SET ivfflat.probes = " + probes);
})
.then(function () {
var start = Date.now();
return pool
.query(
"SELECT id FROM documents ORDER BY embedding <=> $1::vector LIMIT 10",
["[" + queryVector.join(",") + "]"]
)
.then(function (res) {
var elapsed = Date.now() - start;
results.push({
lists: lists,
probes: probes,
latencyMs: elapsed,
resultCount: res.rows.length,
});
return runBenchmark(listsIdx, probesIdx + 1);
});
});
}
return runBenchmark(0, 0);
}
A practical rule of thumb: set probes to roughly sqrt(lists) as a starting point, then increase until recall reaches your target threshold.
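That rule of thumb can be turned into a sweep plan: start near sqrt(lists), double until you reach the list count, and measure recall at each step. A sketch of the plan generator:

```javascript
// Generate probes values to benchmark for a given lists setting:
// sqrt(lists) as the starting point, doubling up to lists itself
// (probes = lists degenerates to an exact scan of every cluster).
function probesSweep(lists) {
  var probes = Math.max(1, Math.round(Math.sqrt(lists)));
  var sweep = [];
  while (probes < lists) {
    sweep.push(probes);
    probes = probes * 2;
  }
  sweep.push(lists);
  return sweep;
}
```

Feed each value into a benchmark like the one above and stop at the smallest probes that clears your recall target.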
HNSW: Tuning m, ef_construction, and ef_search
The three HNSW parameters interact in important ways:
- m (default 16): Controls graph connectivity. Values of 12-48 are typical. Higher m improves recall but increases memory and build time. For 1536-dimensional vectors, m=16 to m=32 works well.
- ef_construction (default 64): Higher values produce a better-quality graph. Set this to at least 2x your target ef_search. Values of 100-200 work well for production.
- ef_search (default 40): The runtime search parameter. Start at 40 and increase until recall meets your requirements. Values of 100-200 give excellent recall with sub-10ms latency on moderate datasets.
function tuneHNSWEfSearch(pool, queryVector, efValues) {
  var results = [];
  function runTest(idx) {
    if (idx >= efValues.length) {
      return Promise.resolve(results);
    }
    var ef = efValues[idx];
    // Use a dedicated client so the session-scoped SET and the SELECT
    // run on the same connection.
    return pool.connect().then(function (client) {
      return client
        .query("SET hnsw.ef_search = " + ef)
        .then(function () {
          var start = Date.now();
          return client
            .query(
              "SELECT id, 1 - (embedding <=> $1::vector) AS similarity " +
                "FROM documents ORDER BY embedding <=> $1::vector LIMIT 10",
              ["[" + queryVector.join(",") + "]"]
            )
            .then(function (res) {
              results.push({
                efSearch: ef,
                latencyMs: Date.now() - start,
                topSimilarity: res.rows[0] ? res.rows[0].similarity : null,
              });
            });
        })
        .then(function () {
          client.release();
          return runTest(idx + 1);
        })
        .catch(function (err) {
          client.release();
          throw err;
        });
    });
  }
  return runTest(0);
}
Pre-Filtering vs Post-Filtering with Indexes
A common mistake is applying filters after the ANN search. If you search for the 10 nearest neighbors and then filter by category, you might end up with fewer than 10 results or miss relevant documents entirely.
Post-Filtering (Problematic)
-- This finds 10 nearest neighbors, then filters. May return fewer than 10 results.
SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE category = 'machine-learning'
ORDER BY embedding <=> $1::vector
LIMIT 10;
The problem is that pgvector retrieves the 10 nearest neighbors from the index first, and PostgreSQL applies the WHERE clause afterward. If only 3 of those 10 are in the "machine-learning" category, you get 3 results.
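A common application-side mitigation is to over-fetch from the ANN index (for example, 5x the desired limit) and filter in the application. The filtering step is a pure function; the over-fetch factor is an assumption you tune against how selective your filters are:

```javascript
// Given over-fetched candidate rows (already sorted by similarity),
// keep the first `limit` rows that pass the filter predicate.
function filterCandidates(rows, predicate, limit) {
  var kept = [];
  for (var i = 0; i < rows.length && kept.length < limit; i++) {
    if (predicate(rows[i])) {
      kept.push(rows[i]);
    }
  }
  return kept;
}
```

This only helps when the filter keeps a reasonable fraction of the candidates; for highly selective filters, the partial indexes and partitions described in this section are the better tools.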
Pre-Filtering with Partial Indexes
The solution is to create partial indexes for your most common filter values:
-- Create a partial HNSW index for a specific category
CREATE INDEX idx_docs_ml_hnsw
ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64)
WHERE category = 'machine-learning';
This works well when you have a small number of known filter values. For dynamic filters, you need a different approach.
Pre-Filtering with Partitioned Tables
For many filter values, partition the table:
CREATE TABLE documents (
id SERIAL,
content TEXT,
category TEXT,
embedding vector(1536)
) PARTITION BY LIST (category);
CREATE TABLE documents_ml PARTITION OF documents FOR VALUES IN ('machine-learning');
CREATE TABLE documents_nlp PARTITION OF documents FOR VALUES IN ('nlp');
-- Create HNSW indexes on each partition
CREATE INDEX ON documents_ml USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON documents_nlp USING hnsw (embedding vector_cosine_ops);
PostgreSQL will automatically use the correct partition index when a category filter is specified.
Dimensionality Reduction for Faster Search
High-dimensional vectors are expensive to compare. If your embedding model outputs 1536-dimensional vectors, but most of the useful information is captured in the first 512 dimensions, you can reduce dimensionality and speed up search significantly.
Truncation (Matryoshka Embeddings)
Some modern embedding models (like OpenAI's text-embedding-3-small) are trained with Matryoshka representation learning. The first N dimensions contain the most important information:
function truncateVector(vector, targetDimensions) {
return vector.slice(0, targetDimensions);
}
function normalizeVector(vector) {
var magnitude = 0;
for (var i = 0; i < vector.length; i++) {
magnitude += vector[i] * vector[i];
}
magnitude = Math.sqrt(magnitude);
var normalized = new Array(vector.length);
for (var i = 0; i < vector.length; i++) {
normalized[i] = vector[i] / magnitude;
}
return normalized;
}
// Reduce from 1536 to 512 dimensions
function prepareReducedVector(fullVector) {
var truncated = truncateVector(fullVector, 512);
return normalizeVector(truncated);
}
Store both the full and reduced vectors. Use the reduced vector for the initial ANN search, then re-rank the top candidates using the full vector:
function twoStageSearch(pool, queryVectorFull, limit) {
var queryReduced = prepareReducedVector(queryVectorFull);
var candidateMultiplier = 5;
// Stage 1: Fast search on reduced vectors
return pool
.query(
"SELECT id, embedding_full FROM documents " +
"ORDER BY embedding_reduced <=> $1::vector " +
"LIMIT $2",
[
"[" + queryReduced.join(",") + "]",
limit * candidateMultiplier,
]
)
.then(function (res) {
// Stage 2: Re-rank candidates using full vectors
var candidates = res.rows.map(function (row) {
var fullEmb = JSON.parse(row.embedding_full);
var similarity = cosineSimilarity(queryVectorFull, fullEmb);
return { id: row.id, similarity: similarity };
});
candidates.sort(function (a, b) {
return b.similarity - a.similarity;
});
return candidates.slice(0, limit);
});
}
function cosineSimilarity(a, b) {
var dot = 0;
var magA = 0;
var magB = 0;
for (var i = 0; i < a.length; i++) {
dot += a[i] * b[i];
magA += a[i] * a[i];
magB += b[i] * b[i];
}
return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}
This two-stage approach can reduce search latency by 60-70% while maintaining nearly identical recall.
Quantization Techniques
pgvector 0.7+ supports half-precision (halfvec) vectors, which use 16-bit floats instead of 32-bit. This halves memory usage and improves cache efficiency:
-- Store vectors as half-precision
ALTER TABLE documents
ADD COLUMN embedding_half halfvec(1536);
UPDATE documents
SET embedding_half = embedding::halfvec(1536);
-- Create an HNSW index on half-precision vectors
CREATE INDEX idx_docs_halfvec_hnsw
ON documents
USING hnsw (embedding_half halfvec_cosine_ops)
WITH (m = 16, ef_construction = 100);
Binary quantization is even more aggressive. It converts each dimension to a single bit (positive or negative), reducing a 1536-dimensional float32 vector from 6KB to 192 bytes:
-- Binary quantization using bit vectors
ALTER TABLE documents
ADD COLUMN embedding_binary bit(1536);
UPDATE documents
SET embedding_binary = (
  SELECT string_agg(CASE WHEN v > 0 THEN '1' ELSE '0' END, '' ORDER BY ord)::bit(1536)
  FROM unnest(embedding::real[]) WITH ORDINALITY AS t(v, ord)
);
Binary quantization dramatically reduces storage and speeds up distance computations, but it sacrifices meaningful accuracy. Use it only as a first-pass filter with a re-ranking step. On pgvector 0.7+, the built-in binary_quantize() function and the Hamming distance operator (<~>) for bit columns replace the manual string aggregation shown above.
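The same first-pass idea can be sketched application-side. This illustrative helper packs signs into bytes, matching the convention in the SQL above (bit set when the component is positive), and a Hamming distance between two packed vectors gives the cheap coarse score used before a full-precision re-rank:

```javascript
// Quantize a float vector to one bit per dimension, packed MSB-first
// into a Uint8Array (1 when the component is positive).
function binaryQuantize(vector) {
  var bytes = new Uint8Array(Math.ceil(vector.length / 8));
  for (var i = 0; i < vector.length; i++) {
    if (vector[i] > 0) {
      bytes[i >> 3] |= 1 << (7 - (i & 7));
    }
  }
  return bytes;
}

// Hamming distance between two packed bit vectors: the number of
// differing bits, a crude proxy for angular distance.
function hammingDistance(a, b) {
  var dist = 0;
  for (var i = 0; i < a.length; i++) {
    var x = a[i] ^ b[i];
    while (x) {
      dist += x & 1;
      x >>= 1;
    }
  }
  return dist;
}
```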
Batch Query Optimization
When you need to search for multiple query vectors simultaneously, batching is far more efficient than individual queries:
function batchSearch(pool, queryVectors, limit) {
  var queries = queryVectors.map(function (vec, idx) {
    var literal = "[" + vec.join(",") + "]";
    // Each UNION ALL branch must be parenthesized: an ORDER BY / LIMIT
    // inside a bare UNION member is a syntax error in PostgreSQL.
    return (
      "(SELECT " + idx + " AS query_idx, id, content, " +
      "1 - (embedding <=> '" + literal + "'::vector) AS similarity " +
      "FROM documents " +
      "ORDER BY embedding <=> '" + literal + "'::vector " +
      "LIMIT " + limit + ")"
    );
  });
var unionQuery = queries.join(" UNION ALL ");
return pool.query(unionQuery).then(function (res) {
var grouped = {};
res.rows.forEach(function (row) {
if (!grouped[row.query_idx]) {
grouped[row.query_idx] = [];
}
grouped[row.query_idx].push(row);
});
return grouped;
});
}
For very high throughput, issue the queries concurrently so they fan out across the pool's connections:
function parallelSearch(pool, queryVectors, limit) {
var promises = queryVectors.map(function (vec) {
return pool.query(
"SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity " +
"FROM documents ORDER BY embedding <=> $1::vector LIMIT $2",
["[" + vec.join(",") + "]", limit]
);
});
return Promise.all(promises).then(function (results) {
return results.map(function (res) {
return res.rows;
});
});
}
Connection Pooling for High-Throughput Search
Vector search queries are CPU-intensive on the database side. Proper connection pooling prevents connection exhaustion and ensures consistent latency:
var pg = require("pg");
function createOptimizedPool(connectionString, options) {
var poolConfig = {
connectionString: connectionString,
max: options.maxConnections || 20,
min: options.minConnections || 5,
idleTimeoutMillis: options.idleTimeout || 30000,
connectionTimeoutMillis: options.connectionTimeout || 5000,
statement_timeout: options.queryTimeout || 10000,
};
var pool = new pg.Pool(poolConfig);
pool.on("error", function (err) {
console.error("Unexpected pool error:", err.message);
});
pool.on("connect", function (client) {
// Set session-level parameters for vector search
client.query("SET hnsw.ef_search = " + (options.efSearch || 100));
client.query("SET work_mem = '256MB'");
client.query("SET effective_cache_size = '4GB'");
});
return pool;
}
For applications with extreme throughput requirements, consider PgBouncer in transaction mode. However, be aware that session-level SET commands (like hnsw.ef_search) do not persist across transactions in PgBouncer. You must set these parameters in each transaction:
function searchWithPgBouncer(pool, queryVector, efSearch, limit) {
return pool.connect().then(function (client) {
return client
.query("BEGIN")
.then(function () {
return client.query("SET LOCAL hnsw.ef_search = " + efSearch);
})
.then(function () {
return client.query(
"SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity " +
"FROM documents ORDER BY embedding <=> $1::vector LIMIT $2",
["[" + queryVector.join(",") + "]", limit]
);
})
.then(function (res) {
return client.query("COMMIT").then(function () {
client.release();
return res.rows;
});
})
.catch(function (err) {
return client.query("ROLLBACK").then(function () {
client.release();
throw err;
});
});
});
}
Caching Frequently Accessed Vectors
Many applications have query patterns where the same or similar searches repeat. A simple in-memory LRU cache can eliminate redundant database hits:
function createVectorSearchCache(maxSize) {
var cache = {};
var accessOrder = [];
function getCacheKey(vector, limit) {
// Quantize vector components to reduce cache key space
var quantized = vector.map(function (v) {
return Math.round(v * 1000) / 1000;
});
return JSON.stringify({ v: quantized, l: limit });
}
function get(vector, limit) {
var key = getCacheKey(vector, limit);
if (cache[key]) {
// Move to end of access order
var idx = accessOrder.indexOf(key);
if (idx > -1) {
accessOrder.splice(idx, 1);
}
accessOrder.push(key);
return cache[key].results;
}
return null;
}
function set(vector, limit, results) {
var key = getCacheKey(vector, limit);
// Evict oldest entries if at capacity
while (accessOrder.length >= maxSize) {
var oldest = accessOrder.shift();
delete cache[oldest];
}
cache[key] = {
results: results,
timestamp: Date.now(),
};
accessOrder.push(key);
}
function invalidate() {
cache = {};
accessOrder = [];
}
return {
get: get,
set: set,
invalidate: invalidate,
size: function () {
return accessOrder.length;
},
};
}
function cachedSearch(pool, searchCache, queryVector, limit) {
var cached = searchCache.get(queryVector, limit);
if (cached) {
return Promise.resolve(cached);
}
return pool
.query(
"SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity " +
"FROM documents ORDER BY embedding <=> $1::vector LIMIT $2",
["[" + queryVector.join(",") + "]", limit]
)
.then(function (res) {
searchCache.set(queryVector, limit, res.rows);
return res.rows;
});
}
For distributed caching, serialize result sets to Redis. Do not cache the vectors themselves in Redis as the serialization overhead negates the benefits. Cache only the search results (IDs and scores).
Partitioning Vector Tables by Category or Date
As your vector table grows into the millions of rows, partitioning becomes essential. Partitioning allows PostgreSQL to create smaller, independent indexes for each partition, which dramatically improves both index build time and query performance.
Partitioning by Category
CREATE TABLE documents (
id SERIAL,
content TEXT,
category TEXT NOT NULL,
created_at TIMESTAMP DEFAULT NOW(),
embedding vector(1536)
) PARTITION BY LIST (category);
CREATE TABLE documents_tutorials PARTITION OF documents
FOR VALUES IN ('tutorials');
CREATE TABLE documents_reference PARTITION OF documents
FOR VALUES IN ('reference');
CREATE TABLE documents_blog PARTITION OF documents
FOR VALUES IN ('blog');
-- Each partition gets its own HNSW index
CREATE INDEX ON documents_tutorials USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 100);
CREATE INDEX ON documents_reference USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 100);
CREATE INDEX ON documents_blog USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 100);
Partitioning by Date
For time-series data or applications where recent content is more relevant:
CREATE TABLE documents (
id SERIAL,
content TEXT,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
embedding vector(1536)
) PARTITION BY RANGE (created_at);
CREATE TABLE documents_2025 PARTITION OF documents
FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
CREATE TABLE documents_2026 PARTITION OF documents
FOR VALUES FROM ('2026-01-01') TO ('2027-01-01');
CREATE INDEX ON documents_2025 USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON documents_2026 USING hnsw (embedding vector_cosine_ops);
When your query includes a partition key in the WHERE clause, PostgreSQL prunes irrelevant partitions and only searches the matching partition's index. This gives you both the filter accuracy of pre-filtering and the speed of a smaller index.
Monitoring Search Performance
You cannot optimize what you do not measure. Here is how to instrument your vector search pipeline:
Using EXPLAIN ANALYZE
function explainSearch(pool, queryVector, limit) {
return pool
.query(
"EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) " +
"SELECT id, content FROM documents " +
"ORDER BY embedding <=> $1::vector LIMIT $2",
["[" + queryVector.join(",") + "]", limit]
)
.then(function (res) {
var plan = res.rows[0]["QUERY PLAN"][0];
return {
executionTimeMs: plan["Execution Time"],
planningTimeMs: plan["Planning Time"],
indexUsed: extractIndexInfo(plan.Plan),
sharedBuffersHit: plan.Plan["Shared Hit Blocks"],
sharedBuffersRead: plan.Plan["Shared Read Blocks"],
};
});
}
function extractIndexInfo(plan) {
if (plan["Index Name"]) {
return {
name: plan["Index Name"],
scanType: plan["Node Type"],
};
}
if (plan.Plans) {
for (var i = 0; i < plan.Plans.length; i++) {
var result = extractIndexInfo(plan.Plans[i]);
if (result) return result;
}
}
return null;
}
Tracking Query Performance Over Time
function createPerformanceTracker() {
var metrics = {
totalQueries: 0,
totalLatencyMs: 0,
p50Latencies: [],
p99Latencies: [],
indexMisses: 0,
};
function recordQuery(latencyMs, usedIndex) {
metrics.totalQueries++;
metrics.totalLatencyMs += latencyMs;
metrics.p50Latencies.push(latencyMs);
if (!usedIndex) {
metrics.indexMisses++;
}
// Keep only last 1000 measurements for percentile calculation
if (metrics.p50Latencies.length > 1000) {
metrics.p50Latencies.shift();
}
}
function getStats() {
var sorted = metrics.p50Latencies.slice().sort(function (a, b) {
return a - b;
});
var p50Idx = Math.floor(sorted.length * 0.5);
var p99Idx = Math.floor(sorted.length * 0.99);
return {
totalQueries: metrics.totalQueries,
avgLatencyMs: metrics.totalQueries > 0
? Math.round(metrics.totalLatencyMs / metrics.totalQueries)
: 0,
p50LatencyMs: sorted[p50Idx] || 0,
p99LatencyMs: sorted[p99Idx] || 0,
indexMissRate: metrics.totalQueries > 0
? (metrics.indexMisses / metrics.totalQueries * 100).toFixed(2) + "%"
: "0%",
};
}
return { recordQuery: recordQuery, getStats: getStats };
}
Watch for these warning signs in your monitoring:
- Sequential scans appearing in EXPLAIN output: Your index is not being used, usually because the table is too small or the query planner estimates a sequential scan is faster.
- Shared Read Blocks significantly higher than Shared Hit Blocks: Your index does not fit in memory and PostgreSQL is reading from disk.
- P99 latency spiking: Usually caused by index maintenance operations (VACUUM, autovacuum) or concurrent write pressure.
Scaling Strategies
Read Replicas
For read-heavy workloads (which most search workloads are), streaming replication gives you horizontal read scaling:
function createReadReplicaPool(primaryUrl, replicaUrls) {
var pg = require("pg");
var primaryPool = new pg.Pool({
connectionString: primaryUrl,
max: 5,
});
var replicaPools = replicaUrls.map(function (url) {
return new pg.Pool({
connectionString: url,
max: 10,
});
});
var roundRobinIndex = 0;
function getReplicaPool() {
var pool = replicaPools[roundRobinIndex % replicaPools.length];
roundRobinIndex++;
return pool;
}
function search(queryVector, limit) {
var pool = getReplicaPool();
return pool.query(
"SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity " +
"FROM documents ORDER BY embedding <=> $1::vector LIMIT $2",
["[" + queryVector.join(",") + "]", limit]
);
}
function write(query, params) {
return primaryPool.query(query, params);
}
return { search: search, write: write };
}
Application-Level Sharding
When a single PostgreSQL instance cannot handle your dataset size (typically beyond 50-100 million vectors), shard across multiple databases:
function createShardedSearch(shardConfigs) {
var pg = require("pg");
var shards = shardConfigs.map(function (config) {
return {
pool: new pg.Pool({ connectionString: config.connectionString, max: 10 }),
range: config.range, // e.g., { min: 0, max: 999999 }
};
});
function searchAllShards(queryVector, limit) {
var promises = shards.map(function (shard) {
return shard.pool.query(
"SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity " +
"FROM documents ORDER BY embedding <=> $1::vector LIMIT $2",
["[" + queryVector.join(",") + "]", limit]
);
});
return Promise.all(promises).then(function (results) {
// Merge results from all shards and re-sort
var allRows = [];
results.forEach(function (res) {
allRows = allRows.concat(res.rows);
});
allRows.sort(function (a, b) {
return b.similarity - a.similarity;
});
return allRows.slice(0, limit);
});
}
return { search: searchAllShards };
}
Sharding adds complexity. Before going down this path, exhaust these options first: HNSW index tuning, dimensionality reduction, half-precision vectors, partitioning, and read replicas. These will take you surprisingly far.
Complete Working Example
Here is a comprehensive Node.js optimization suite that benchmarks different index types and demonstrates the performance impact of each optimization:
var pg = require("pg");
// ============================================================
// Configuration
// ============================================================
var DATABASE_URL = process.env.DATABASE_URL || "postgresql://localhost:5432/vectordb";
var VECTOR_DIMENSIONS = 1536;
var TEST_ROWS = 100000;
var pool = new pg.Pool({
connectionString: DATABASE_URL,
max: 20,
idleTimeoutMillis: 30000,
});
// ============================================================
// Setup: Create table and generate test data
// ============================================================
function setupTestTable() {
console.log("Setting up test table with " + TEST_ROWS + " rows...");
return pool
.query("CREATE EXTENSION IF NOT EXISTS vector")
.then(function () {
return pool.query("DROP TABLE IF EXISTS bench_documents");
})
.then(function () {
return pool.query(
"CREATE TABLE bench_documents (" +
"id SERIAL PRIMARY KEY, " +
"content TEXT, " +
"category TEXT, " +
"embedding vector(" + VECTOR_DIMENSIONS + "), " +
"embedding_reduced vector(512)" +
")"
);
})
.then(function () {
console.log("Table created. Inserting test data...");
return insertTestData();
});
}
function generateRandomVector(dims) {
var vec = new Array(dims);
for (var i = 0; i < dims; i++) {
vec[i] = (Math.random() * 2 - 1);
}
// Normalize
var mag = 0;
for (var i = 0; i < dims; i++) {
mag += vec[i] * vec[i];
}
mag = Math.sqrt(mag);
for (var i = 0; i < dims; i++) {
vec[i] = vec[i] / mag;
}
return vec;
}
function insertTestData() {
var batchSize = 500;
var categories = ["tutorials", "reference", "blog", "api-docs", "guides"];
var inserted = 0;
function insertBatch() {
if (inserted >= TEST_ROWS) {
console.log("Inserted " + inserted + " rows.");
return Promise.resolve();
}
var values = [];
var params = [];
var paramIdx = 1;
for (var i = 0; i < batchSize && inserted + i < TEST_ROWS; i++) {
var fullVec = generateRandomVector(VECTOR_DIMENSIONS);
var reducedVec = fullVec.slice(0, 512);
// Normalize reduced vector
var mag = 0;
for (var j = 0; j < reducedVec.length; j++) {
mag += reducedVec[j] * reducedVec[j];
}
mag = Math.sqrt(mag);
for (var j = 0; j < reducedVec.length; j++) {
reducedVec[j] = reducedVec[j] / mag;
}
var cat = categories[Math.floor(Math.random() * categories.length)];
values.push(
"($" + paramIdx + ", $" + (paramIdx + 1) + ", $" +
(paramIdx + 2) + "::vector, $" + (paramIdx + 3) + "::vector)"
);
params.push("Document " + (inserted + i));
params.push(cat);
params.push("[" + fullVec.join(",") + "]");
params.push("[" + reducedVec.join(",") + "]");
paramIdx += 4;
}
var sql =
"INSERT INTO bench_documents (content, category, embedding, embedding_reduced) VALUES " +
values.join(", ");
return pool.query(sql, params).then(function () {
inserted += batchSize;
if (inserted % 10000 === 0) {
console.log(" Inserted " + inserted + " / " + TEST_ROWS);
}
return insertBatch();
});
}
return insertBatch();
}
// ============================================================
// Benchmark: Sequential Scan (no index)
// ============================================================
function benchmarkSequentialScan(queryVector) {
console.log("\n--- Benchmark: Sequential Scan (No Index) ---");
return pool
.query("DROP INDEX IF EXISTS idx_bench_ivfflat")
.then(function () {
return pool.query("DROP INDEX IF EXISTS idx_bench_hnsw");
})
.then(function () {
var start = Date.now();
return pool
.query(
"SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity " +
"FROM bench_documents ORDER BY embedding <=> $1::vector LIMIT 10",
["[" + queryVector.join(",") + "]"]
)
.then(function (res) {
var elapsed = Date.now() - start;
console.log(" Latency: " + elapsed + "ms");
console.log(" Results: " + res.rows.length);
console.log(" Top similarity: " + (res.rows[0] ? res.rows[0].similarity : "N/A"));
return { method: "sequential", latencyMs: elapsed, results: res.rows };
});
});
}
// ============================================================
// Benchmark: IVFFlat
// ============================================================
function benchmarkIVFFlat(queryVector) {
console.log("\n--- Benchmark: IVFFlat Index ---");
var lists = Math.max(10, Math.floor(TEST_ROWS / 1000));
var probes = Math.max(1, Math.floor(Math.sqrt(lists)));
console.log(" Creating IVFFlat index with " + lists + " lists...");
return pool
.query("DROP INDEX IF EXISTS idx_bench_ivfflat")
.then(function () {
return pool.query("DROP INDEX IF EXISTS idx_bench_hnsw");
})
.then(function () {
var buildStart = Date.now();
return pool
.query(
"CREATE INDEX idx_bench_ivfflat ON bench_documents " +
"USING ivfflat (embedding vector_cosine_ops) WITH (lists = " + lists + ")"
)
.then(function () {
var buildTime = Date.now() - buildStart;
console.log(" Index build time: " + buildTime + "ms");
// NOTE: SET is session-scoped. With a pg Pool this works here only because
// the benchmark runs queries sequentially, so the same idle connection is
// reused; under concurrency, SET and the search could land on different
// connections. Use SET LOCAL inside a transaction for production code.
return pool.query("SET ivfflat.probes = " + probes);
});
})
.then(function () {
console.log(" Searching with probes = " + probes + "...");
var start = Date.now();
return pool
.query(
"SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity " +
"FROM bench_documents ORDER BY embedding <=> $1::vector LIMIT 10",
["[" + queryVector.join(",") + "]"]
)
.then(function (res) {
var elapsed = Date.now() - start;
console.log(" Latency: " + elapsed + "ms");
console.log(" Results: " + res.rows.length);
console.log(" Top similarity: " + (res.rows[0] ? res.rows[0].similarity : "N/A"));
return { method: "ivfflat", latencyMs: elapsed, results: res.rows };
});
});
}
// ============================================================
// Benchmark: HNSW
// ============================================================
function benchmarkHNSW(queryVector) {
console.log("\n--- Benchmark: HNSW Index ---");
return pool
.query("DROP INDEX IF EXISTS idx_bench_ivfflat")
.then(function () {
return pool.query("DROP INDEX IF EXISTS idx_bench_hnsw");
})
.then(function () {
console.log(" Creating HNSW index (m=16, ef_construction=100)...");
var buildStart = Date.now();
return pool
.query(
"CREATE INDEX idx_bench_hnsw ON bench_documents " +
"USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 100)"
)
.then(function () {
var buildTime = Date.now() - buildStart;
console.log(" Index build time: " + buildTime + "ms");
return pool.query("SET hnsw.ef_search = 100");
});
})
.then(function () {
var start = Date.now();
return pool
.query(
"SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity " +
"FROM bench_documents ORDER BY embedding <=> $1::vector LIMIT 10",
["[" + queryVector.join(",") + "]"]
)
.then(function (res) {
var elapsed = Date.now() - start;
console.log(" Latency: " + elapsed + "ms");
console.log(" Results: " + res.rows.length);
console.log(" Top similarity: " + (res.rows[0] ? res.rows[0].similarity : "N/A"));
return { method: "hnsw", latencyMs: elapsed, results: res.rows };
});
});
}
// ============================================================
// Benchmark: Reduced dimensionality search
// ============================================================
function benchmarkReducedDimensionality(queryVectorFull) {
console.log("\n--- Benchmark: Reduced Dimensionality (512d) ---");
return pool
.query("DROP INDEX IF EXISTS idx_bench_reduced_hnsw")
.then(function () {
console.log(" Creating HNSW index on 512d reduced vectors...");
var buildStart = Date.now();
return pool
.query(
"CREATE INDEX idx_bench_reduced_hnsw ON bench_documents " +
"USING hnsw (embedding_reduced vector_cosine_ops) WITH (m = 16, ef_construction = 100)"
)
.then(function () {
var buildTime = Date.now() - buildStart;
console.log(" Index build time: " + buildTime + "ms");
return pool.query("SET hnsw.ef_search = 100");
});
})
.then(function () {
var reducedQuery = queryVectorFull.slice(0, 512);
var mag = 0;
for (var i = 0; i < reducedQuery.length; i++) {
mag += reducedQuery[i] * reducedQuery[i];
}
mag = Math.sqrt(mag);
for (var i = 0; i < reducedQuery.length; i++) {
reducedQuery[i] = reducedQuery[i] / mag;
}
var start = Date.now();
return pool
.query(
"SELECT id, content, 1 - (embedding_reduced <=> $1::vector) AS similarity " +
"FROM bench_documents ORDER BY embedding_reduced <=> $1::vector LIMIT 10",
["[" + reducedQuery.join(",") + "]"]
)
.then(function (res) {
var elapsed = Date.now() - start;
console.log(" Latency: " + elapsed + "ms");
console.log(" Results: " + res.rows.length);
console.log(" Top similarity: " + (res.rows[0] ? res.rows[0].similarity : "N/A"));
return { method: "reduced_512d", latencyMs: elapsed, results: res.rows };
});
});
}
// ============================================================
// Run all benchmarks
// ============================================================
function runAllBenchmarks() {
var queryVector = generateRandomVector(VECTOR_DIMENSIONS);
var allResults = [];
return setupTestTable()
.then(function () {
return benchmarkSequentialScan(queryVector);
})
.then(function (result) {
allResults.push(result);
return benchmarkIVFFlat(queryVector);
})
.then(function (result) {
allResults.push(result);
return benchmarkHNSW(queryVector);
})
.then(function (result) {
allResults.push(result);
return benchmarkReducedDimensionality(queryVector);
})
.then(function (result) {
allResults.push(result);
console.log("\n============================================");
console.log(" BENCHMARK RESULTS SUMMARY ");
console.log("============================================");
allResults.forEach(function (r) {
console.log(" " + r.method + ": " + r.latencyMs + "ms");
});
console.log("============================================");
return pool.end();
})
.catch(function (err) {
console.error("Benchmark failed:", err.message);
return pool.end();
});
}
runAllBenchmarks();
Run this suite against your pgvector instance to get concrete numbers for your hardware and dataset. The results will guide which optimizations deliver the most impact for your specific workload.
Common Issues and Troubleshooting
1. Index Not Being Used (Sequential Scan Instead)
Error/Symptom in EXPLAIN output:
Seq Scan on documents (cost=0.00..15234.00 rows=100000 width=48)
Filter: ...
Cause: PostgreSQL's query planner decides a sequential scan is cheaper, usually because the table is too small, statistics are outdated, or enable_seqscan is on.
Fix:
ANALYZE documents;
-- If still not using index, check with:
SET enable_seqscan = off;
-- Then re-run your query to confirm the index works
-- Reset afterward:
SET enable_seqscan = on;
If the table has fewer than a few thousand rows, PostgreSQL is probably right that a sequential scan is faster. Do not force index usage on small tables.
2. Out of Memory During HNSW Index Build
Error:
ERROR: out of memory
DETAIL: Failed on request of size 1073741824 in memory context "HNSW index build"
Cause: HNSW index construction requires significant memory, especially with high m and ef_construction values on large datasets.
Fix: Increase maintenance_work_mem for the index build session:
SET maintenance_work_mem = '2GB';
CREATE INDEX idx_docs_hnsw ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
RESET maintenance_work_mem;
If your server does not have enough RAM, lower m to 12 and ef_construction to 48. Alternatively, build the index on a more powerful machine and restore the dump.
3. IVFFlat Recall Degradation After Bulk Inserts
Symptom: Search quality noticeably drops after inserting a large batch of new vectors. Results that should match no longer appear in the top 10.
Cause: IVFFlat cluster centroids were computed on the original data distribution. New vectors are assigned to the nearest existing centroid, but if the new data has a different distribution, the clustering is suboptimal.
Fix: Rebuild the index after significant data changes:
-- Reindex to recompute centroids with all current data
REINDEX INDEX idx_documents_embedding_ivfflat;
-- Or drop and recreate with potentially different parameters
DROP INDEX idx_documents_embedding_ivfflat;
CREATE INDEX idx_documents_embedding_ivfflat ON documents
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 150);
Schedule this as a maintenance task during low-traffic hours. For HNSW, this is not an issue since the graph structure adapts to new inserts.
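One way to automate this maintenance is to track how many rows have been inserted since the last rebuild and trigger a rebuild once they exceed a fraction of the table. The sketch below assumes a 20% staleness threshold and the index name `idx_documents_embedding_ivfflat`; both are illustrative, not prescriptive:

```javascript
// Decide whether enough new rows have accumulated to justify a rebuild.
// `threshold` is the fraction of total rows inserted since the last
// REINDEX at which centroids are likely stale (0.2 is an assumed default).
function shouldReindex(insertedSinceReindex, totalRows, threshold) {
  if (totalRows === 0) return false;
  return insertedSinceReindex / totalRows >= threshold;
}

// Example maintenance task: run during low-traffic hours.
// `pool` is an existing pg.Pool; the index name is hypothetical.
function maybeReindex(pool, insertedSinceReindex, totalRows) {
  if (!shouldReindex(insertedSinceReindex, totalRows, 0.2)) {
    return Promise.resolve(false);
  }
  // REINDEX CONCURRENTLY (PostgreSQL 12+) avoids blocking reads and writes
  return pool
    .query("REINDEX INDEX CONCURRENTLY idx_documents_embedding_ivfflat")
    .then(function () {
      return true;
    });
}
```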
4. Slow Queries Due to Disk I/O on Large Indexes
Symptom: Query latency spikes intermittently, especially after server restart or when running many concurrent queries. EXPLAIN shows high Shared Read Blocks.
Error in logs:
LOG: duration: 2847.432 ms statement: SELECT id FROM documents ORDER BY embedding <=> '[...]'::vector LIMIT 10
Cause: The HNSW index does not fit in PostgreSQL's shared buffer cache, forcing disk reads.
Fix: Increase shared_buffers to accommodate the index. A good rule of thumb is that shared_buffers should be at least 2x the size of your largest vector index:
-- Check index sizes
SELECT indexname, pg_size_pretty(pg_relation_size(indexname::regclass)) AS size
FROM pg_indexes
WHERE tablename = 'documents';
Then update postgresql.conf:
shared_buffers = 4GB
effective_cache_size = 12GB
If the index is larger than available RAM, consider dimensionality reduction or partitioning to keep individual index sizes manageable.
5. Connection Pool Exhaustion Under Load
Error:
Error: timeout expired before a connection could be obtained from the pool
Cause: Vector search queries are CPU-heavy and hold connections longer than typical CRUD operations. Under high concurrency, the pool runs dry.
Fix: Increase pool size, add query timeouts, and implement request queuing:
var pool = new pg.Pool({
connectionString: DATABASE_URL,
max: 30,
connectionTimeoutMillis: 3000,
statement_timeout: 5000,
});
Also consider whether you genuinely need that many concurrent vector searches or if caching and request deduplication can reduce the load.
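Request deduplication can be sketched as an in-flight map keyed by the query: concurrent callers with the same key share a single database round trip. This helper is an illustrative pattern, not part of the pg library:

```javascript
// Share one in-flight promise among concurrent identical searches.
var inFlight = {};

function dedupedSearch(key, runQuery) {
  if (inFlight[key]) {
    return inFlight[key]; // reuse the pending result
  }
  var p = runQuery().then(
    function (res) {
      delete inFlight[key];
      return res;
    },
    function (err) {
      delete inFlight[key]; // do not cache failures
      throw err;
    }
  );
  inFlight[key] = p;
  return p;
}
```

With this in place, a burst of identical queries consumes one pool connection instead of one each.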
Best Practices
Start with HNSW unless you have a specific reason to use IVFFlat. HNSW provides better recall, does not degrade with inserts, and the memory/build-time trade-off is worth it for most workloads. IVFFlat is preferable only when index build time or memory footprint are hard constraints.
Always run EXPLAIN ANALYZE on your vector queries in production. Verify that the index is actually being used, check buffer hit ratios, and track execution time. An unused index is worse than no index because it consumes storage and slows down writes for zero benefit.
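A lightweight guard for this can be sketched in Node. The plan check below is a simple heuristic and assumes your ANN indexes contain "hnsw" or "ivfflat" in their names; adjust the pattern to your naming convention:

```javascript
// Heuristic: does the EXPLAIN output show an ANN index scan?
function planUsesVectorIndex(planText) {
  return /Index Scan using .*(hnsw|ivfflat)/i.test(planText);
}

// Run EXPLAIN on a vector search and warn if it fell back to a
// sequential scan. `pool` is an existing pg.Pool.
function checkVectorPlan(pool, queryVector) {
  var sql =
    "EXPLAIN SELECT id FROM documents " +
    "ORDER BY embedding <=> $1::vector LIMIT 10";
  return pool
    .query(sql, ["[" + queryVector.join(",") + "]"])
    .then(function (res) {
      var plan = res.rows
        .map(function (r) { return r["QUERY PLAN"]; })
        .join("\n");
      if (!planUsesVectorIndex(plan)) {
        console.warn("Vector query is not using an ANN index:\n" + plan);
      }
      return plan;
    });
}
```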
Set ef_search (HNSW) or probes (IVFFlat) per query, not globally. Different queries may have different accuracy requirements. A fuzzy suggestion feature can tolerate lower recall than a critical document retrieval pipeline. Use SET LOCAL within transactions when using PgBouncer.
Implement two-stage retrieval for high-dimensional vectors. Use a reduced-dimensionality or quantized index for the first pass to retrieve 5-10x your target result count, then re-rank using full-precision vectors. This delivers near-exact results at a fraction of the cost.
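Two-stage retrieval can be sketched as a cheap ANN pass on the reduced index followed by an exact re-rank in Node. The table and column names (`documents`, `embedding`, `embedding_reduced`) and the 10x over-fetch factor are assumptions to adapt:

```javascript
// Exact cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  var dot = 0, magA = 0, magB = 0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Stage 2: re-rank candidates (each with a parsed full-precision vector)
// against the query and keep the top `limit`.
function rerank(candidates, queryVector, limit) {
  return candidates
    .map(function (c) {
      return { id: c.id, similarity: cosineSimilarity(c.vector, queryVector) };
    })
    .sort(function (x, y) { return y.similarity - x.similarity; })
    .slice(0, limit);
}

// Stage 1: ANN pass on the 512d index, over-fetching 10x the target count.
function twoStageSearch(pool, fullQuery, reducedQuery, limit) {
  return pool
    .query(
      "SELECT id, embedding::text AS vec FROM documents " +
        "ORDER BY embedding_reduced <=> $1::vector LIMIT $2",
      ["[" + reducedQuery.join(",") + "]", limit * 10]
    )
    .then(function (res) {
      var candidates = res.rows.map(function (r) {
        // pgvector's text format "[1,2,3]" parses as a JSON array
        return { id: r.id, vector: JSON.parse(r.vec) };
      });
      return rerank(candidates, fullQuery, limit);
    });
}
```

The ANN pass touches only the small index; the exact distance work is limited to a few dozen candidates instead of the whole table.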
Partition your vector tables when they exceed 5 million rows. Smaller partitions mean smaller indexes, faster builds, and the ability to leverage partition pruning for filtered queries. Time-based partitioning also makes it easy to archive or drop old data without expensive DELETE operations.
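A time-based layout can be sketched with declarative range partitioning driven from Node. The parent table, `created_at` column, and monthly granularity below are assumptions to adapt to your schema:

```javascript
// Helper: zero-pad a month number to two digits.
function pad2(n) {
  return n < 10 ? "0" + n : String(n);
}

// Build DDL for one monthly partition. Assumes a parent table created with:
//   CREATE TABLE documents (...) PARTITION BY RANGE (created_at);
function monthlyPartitionDDL(table, year, month) {
  var from = year + "-" + pad2(month) + "-01";
  var nextYear = month === 12 ? year + 1 : year;
  var nextMonth = month === 12 ? 1 : month + 1;
  var to = nextYear + "-" + pad2(nextMonth) + "-01";
  var name = table + "_" + year + "_" + pad2(month);
  return (
    "CREATE TABLE IF NOT EXISTS " + name +
    " PARTITION OF " + table +
    " FOR VALUES FROM ('" + from + "') TO ('" + to + "')"
  );
}
```

Each partition then carries its own, much smaller HNSW index, and dropping an old month is a metadata operation instead of a bulk DELETE.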
Cache search results aggressively at the application layer. Most search workloads have significant query repetition. Even a 5-minute TTL cache can eliminate 50%+ of database hits. Use Redis for distributed caching, but cache result IDs and scores rather than raw vectors.
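The pattern can be sketched with an in-memory TTL cache; swap in Redis for multi-process deployments. The key scheme and TTL are assumptions, and only ids and scores are stored, per the advice above:

```javascript
// Cache search results (ids + scores, not raw vectors) under a TTL.
var cache = {};

function cacheKey(queryText, limit) {
  // For embedded queries, key on the source text, not the vector itself.
  return "search:" + limit + ":" + queryText;
}

function cachedSearch(key, ttlMs, runQuery, now) {
  now = now || Date.now();
  var hit = cache[key];
  if (hit && now - hit.at < ttlMs) {
    return Promise.resolve(hit.value); // serve from cache
  }
  return runQuery().then(function (rows) {
    // Store only id + similarity to keep cache entries small
    var slim = rows.map(function (r) {
      return { id: r.id, similarity: r.similarity };
    });
    cache[key] = { at: now, value: slim };
    return slim;
  });
}
```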
Monitor index size relative to shared_buffers. If your vector indexes exceed available buffer cache, query latency will be dominated by disk I/O rather than computation. Either increase shared_buffers, reduce index size through dimensionality reduction, or partition the table to keep individual indexes cacheable.
Never build indexes on empty or near-empty tables. Both IVFFlat and HNSW produce degenerate structures when trained on insufficient data. Load at least a representative sample of your dataset before creating the index. For IVFFlat, you need at least lists * 10 rows for meaningful clustering.
Use connection pooling with session-level parameter awareness. Remember that SET commands for hnsw.ef_search and ivfflat.probes are session-scoped. When using PgBouncer in transaction mode, use SET LOCAL within explicit transactions to ensure parameters apply correctly.
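The SET LOCAL pattern can be sketched as a short explicit transaction on a checked-out client, so the parameter reverts at COMMIT and never leaks to other pooled connections. The tier-to-ef_search mapping below is an assumed starting point, not a recommendation:

```javascript
// Map an accuracy tier to an HNSW ef_search value (assumed starting points).
function efSearchFor(tier) {
  if (tier === "fast") return 40;
  if (tier === "accurate") return 200;
  return 100; // balanced default
}

// Run one search with a transaction-scoped ef_search. SET LOCAL reverts
// at COMMIT, so it is safe behind PgBouncer in transaction mode.
function searchWithTier(pool, queryVector, tier) {
  return pool.connect().then(function (client) {
    return client
      .query("BEGIN")
      .then(function () {
        return client.query("SET LOCAL hnsw.ef_search = " + efSearchFor(tier));
      })
      .then(function () {
        return client.query(
          "SELECT id, content FROM documents " +
            "ORDER BY embedding <=> $1::vector LIMIT 10",
          ["[" + queryVector.join(",") + "]"]
        );
      })
      .then(function (res) {
        return client.query("COMMIT").then(function () {
          client.release();
          return res.rows;
        });
      })
      .catch(function (err) {
        return client.query("ROLLBACK").then(function () {
          client.release();
          throw err;
        });
      });
  });
}
```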
References
- pgvector GitHub Repository - Official documentation and performance benchmarks
- HNSW Algorithm Paper - "Efficient and Robust Approximate Nearest Neighbor Using Hierarchical Navigable Small World Graphs" by Malkov and Yashunin
- PostgreSQL Partitioning Documentation - Official PostgreSQL table partitioning guide
- Matryoshka Representation Learning - Paper on adaptive dimensionality embeddings
- ANN Benchmarks - Cross-library approximate nearest neighbor benchmark comparisons
- pg Node.js Driver Documentation - Connection pooling and query configuration
- pgvector Performance Tuning - Official pgvector tuning recommendations