Query Optimization: EXPLAIN ANALYZE Deep Dive
An advanced guide to PostgreSQL query optimization with EXPLAIN ANALYZE, covering node types, join algorithms, cost estimation, plan reading, slow query diagnosis, and practical optimization techniques.
Overview
EXPLAIN ANALYZE is the single most important tool in your PostgreSQL performance toolkit. It shows you exactly what the database engine does when it executes your query — which indexes it uses, how it joins tables, where it spends time, and where your assumptions about performance are wrong. If you are running PostgreSQL in production and you have never read an EXPLAIN ANALYZE output top to bottom, you are flying blind.
Prerequisites
- PostgreSQL 13+ (some features discussed require 14+)
- Solid understanding of SQL (JOINs, subqueries, CTEs, aggregations)
- Familiarity with indexing concepts (B-tree, hash, GIN)
- Node.js and the `pg` module for the application-layer examples
- A database with enough data to make query plans interesting (at least tens of thousands of rows)
EXPLAIN vs EXPLAIN ANALYZE
There are two forms of EXPLAIN, and they are fundamentally different.
EXPLAIN (without ANALYZE) asks the planner: "What would you do if I ran this query?" It produces an estimated plan based on table statistics. It does not execute the query.
EXPLAIN SELECT * FROM orders WHERE customer_id = 4821;
QUERY PLAN
----------------------------------------------------------------------------------
Index Scan using idx_orders_customer_id on orders (cost=0.43..12.58 rows=7 width=94)
Index Cond: (customer_id = 4821)
EXPLAIN ANALYZE actually runs the query and shows you both the estimated plan and the actual execution results side by side.
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 4821;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_orders_customer_id on orders (cost=0.43..12.58 rows=7 width=94) (actual time=0.028..0.035 rows=6 loops=1)
Index Cond: (customer_id = 4821)
Planning Time: 0.089 ms
Execution Time: 0.058 ms
The critical difference: EXPLAIN guesses. EXPLAIN ANALYZE measures. When the estimates diverge from actuals, that is where your performance problems live.
Warning: EXPLAIN ANALYZE executes the query. For INSERT, UPDATE, or DELETE statements, wrap it in a transaction and roll back:
BEGIN;
EXPLAIN ANALYZE DELETE FROM orders WHERE status = 'cancelled';
ROLLBACK;
Reading the Output Tree
Every EXPLAIN output is a tree of plan nodes. The query executor processes this tree bottom-up — leaf nodes execute first, and their results flow upward.
EXPLAIN ANALYZE
SELECT c.name, COUNT(o.id) AS order_count
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE c.created_at > '2024-01-01'
GROUP BY c.name
ORDER BY order_count DESC
LIMIT 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=1842.56..1842.59 rows=10 width=44) (actual time=24.381..24.384 rows=10 loops=1)
-> Sort (cost=1842.56..1848.23 rows=2268 width=44) (actual time=24.379..24.381 rows=10 loops=1)
Sort Key: (count(o.id)) DESC
Sort Method: top-N heapsort Memory: 26kB
-> HashAggregate (cost=1780.12..1802.80 rows=2268 width=44) (actual time=23.985..24.198 rows=2268 loops=1)
Group Key: c.name
Batches: 1 Memory Usage: 369kB
-> Hash Join (cost=112.45..1735.78 rows=8868 width=40) (actual time=1.204..19.877 rows=8734 loops=1)
Hash Cond: (o.customer_id = c.id)
-> Seq Scan on orders o (cost=0.00..1443.00 rows=50000 width=12) (actual time=0.008..6.742 rows=50000 loops=1)
-> Hash (cost=98.12..98.12 rows=1146 width=36) (actual time=1.178..1.179 rows=2268 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 168kB
-> Seq Scan on customers c (cost=0.00..98.12 rows=1146 width=36) (actual time=0.011..0.712 rows=2268 loops=1)
Filter: (created_at > '2024-01-01 00:00:00'::timestamp without time zone)
Rows Removed by Filter: 2732
Each node has these fields:
| Field | Meaning |
|---|---|
| `cost=0.43..12.58` | Estimated startup cost .. total cost (in arbitrary planner units) |
| `rows=7` | Estimated number of rows returned |
| `width=94` | Estimated average row width in bytes |
| `actual time=0.028..0.035` | Actual startup time .. total time in milliseconds |
| `rows=6` | Actual number of rows returned |
| `loops=1` | Number of times this node was executed |
Important: When loops is greater than 1, multiply the actual time by loops to get the true wall-clock cost. The reported times are per-loop averages.
Startup Cost vs Total Cost
The two cost numbers are not a range — they represent different things:
- Startup cost — the cost before the node can emit its first row. For a Sort node, this includes sorting the entire input. For a Seq Scan, this is essentially zero.
- Total cost — the cost to emit all rows.
This matters for queries with LIMIT. If a node has a high startup cost (like a Sort), adding LIMIT 10 does not make the sort cheaper — it still must sort everything before returning the first row. But an Index Scan has near-zero startup cost and can return rows immediately, making it efficient with LIMIT.
-- Sort has high startup cost — LIMIT does not help the sort itself
EXPLAIN ANALYZE SELECT * FROM orders ORDER BY total_amount DESC LIMIT 10;
Limit (cost=4892.33..4892.36 rows=10 width=94) (actual time=42.517..42.521 rows=10 loops=1)
-> Sort (cost=4892.33..5017.33 rows=50000 width=94) (actual time=42.515..42.517 rows=10 loops=1)
Sort Key: total_amount DESC
Sort Method: top-N heapsort Memory: 27kB
-> Seq Scan on orders (cost=0.00..1443.00 rows=50000 width=94) (actual time=0.009..8.131 rows=50000 loops=1)
Now with an index on total_amount:
CREATE INDEX idx_orders_total_amount ON orders (total_amount DESC);
EXPLAIN ANALYZE SELECT * FROM orders ORDER BY total_amount DESC LIMIT 10;
Limit (cost=0.29..1.04 rows=10 width=94) (actual time=0.031..0.047 rows=10 loops=1)
-> Index Scan using idx_orders_total_amount on orders (cost=0.29..3768.29 rows=50000 width=94) (actual time=0.029..0.043 rows=10 loops=1)
The Index Scan's full total cost (3768) is higher than the plain Seq Scan's (1443), but its startup cost is near zero, so LIMIT stops execution after 10 rows and the query is fast. The planner knows this.
Scan Node Types
Seq Scan (Sequential Scan)
Reads every row in the table from disk, in physical order. This is the baseline — no index needed, no random I/O. On small tables, this is often faster than an index scan.
Seq Scan on orders (cost=0.00..1443.00 rows=50000 width=94)
Filter: (status = 'pending')
Rows Removed by Filter: 47234
When Rows Removed by Filter is much larger than the rows actually returned, that is a sign an index might help.
Index Scan
Traverses a B-tree index to find matching rows, then fetches the actual row data from the heap (table).
Index Scan using idx_orders_status on orders (cost=0.29..85.42 rows=2766 width=94)
Index Cond: (status = 'pending')
Good for selective queries (returning a small percentage of rows). Involves random I/O to the heap, which is why Postgres might prefer a Seq Scan if the query returns a large percentage of the table.
Index Only Scan
Like an Index Scan, but all requested columns are in the index — no heap fetch needed. This is as fast as it gets.
Index Only Scan using idx_orders_customer_status on orders (cost=0.29..42.18 rows=12 width=8)
Index Cond: (customer_id = 4821)
Heap Fetches: 0
Heap Fetches: 0 is the ideal. If this number is high, run VACUUM on the table to update the visibility map.
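A quick sketch of the diagnosis and fix, using the orders table from the running examples and the cumulative statistics view:

```sql
-- How many dead tuples are pending, and when the table was last vacuumed
SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'orders';

-- VACUUM updates the visibility map; Heap Fetches should drop on the next run
VACUUM orders;
```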
Bitmap Index Scan + Bitmap Heap Scan
A two-phase approach. First, the Bitmap Index Scan reads the index and builds a bitmap of which table pages contain matching rows. Then the Bitmap Heap Scan reads those pages sequentially.
Bitmap Heap Scan on orders (cost=52.18..1298.42 rows=2766 width=94)
Recheck Cond: (status = 'pending')
Heap Blocks: exact=842
-> Bitmap Index Scan on idx_orders_status (cost=0.00..51.49 rows=2766 width=0)
Index Cond: (status = 'pending')
This is Postgres's sweet spot between Seq Scan and Index Scan — good for medium-selectivity queries. It avoids the random I/O of Index Scan by reading pages in physical order.
Multiple bitmap scans can be combined with BitmapAnd or BitmapOr for queries with multiple conditions:
EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'pending' AND region = 'us-west';
Bitmap Heap Scan on orders (cost=72.38..498.12 rows=312 width=94)
Recheck Cond: ((status = 'pending') AND (region = 'us-west'))
-> BitmapAnd (cost=72.38..72.38 rows=312 width=0)
-> Bitmap Index Scan on idx_orders_status (cost=0.00..51.49 rows=2766 width=0)
Index Cond: (status = 'pending')
-> Bitmap Index Scan on idx_orders_region (cost=0.00..20.64 rows=5641 width=0)
Index Cond: (region = 'us-west')
Join Algorithms
PostgreSQL has three join strategies, and the planner picks the cheapest one based on table sizes, available indexes, and sort order.
Nested Loop
For each row in the outer table, scan the inner table for matches. Fast when the outer table is small and the inner table has an index on the join column.
Nested Loop (cost=0.29..845.72 rows=12 width=130) (actual time=0.038..0.142 rows=12 loops=1)
-> Index Scan using idx_customers_email on customers c (cost=0.29..8.30 rows=1 width=36) (actual time=0.021..0.023 rows=1 loops=1)
Index Cond: (email = '[email protected]')
-> Index Scan using idx_orders_customer_id on orders o (cost=0.29..837.30 rows=12 width=94) (actual time=0.014..0.108 rows=12 loops=1)
Index Cond: (customer_id = c.id)
This is the right choice here: one customer, few orders. But Nested Loop with a large outer table and no inner index is catastrophic.
Hash Join
Builds a hash table from the smaller (inner) table, then probes it for each row from the larger (outer) table. No index needed. Good for large equi-joins.
Hash Join (cost=112.45..1735.78 rows=8868 width=40) (actual time=1.204..19.877 rows=8734 loops=1)
Hash Cond: (o.customer_id = c.id)
-> Seq Scan on orders o (cost=0.00..1443.00 rows=50000 width=12) (actual time=0.008..6.742 rows=50000 loops=1)
-> Hash (cost=98.12..98.12 rows=1146 width=36) (actual time=1.178..1.179 rows=2268 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 168kB
Watch the Batches count. If it exceeds 1, the hash table spilled to disk because work_mem is too small. This kills performance.
Merge Join
Both inputs must be sorted on the join key. Postgres walks through both sorted sets simultaneously. Efficient when both inputs are already sorted (e.g., via an index).
Merge Join (cost=0.57..3842.18 rows=50000 width=130) (actual time=0.034..28.412 rows=50000 loops=1)
Merge Cond: (c.id = o.customer_id)
-> Index Scan using customers_pkey on customers c (cost=0.29..248.29 rows=5000 width=36) (actual time=0.012..1.842 rows=5000 loops=1)
-> Index Scan using idx_orders_customer_id on orders o (cost=0.29..2843.29 rows=50000 width=94) (actual time=0.009..14.218 rows=50000 loops=1)
Merge Join is common in large analytical queries that join pre-sorted data. If both sides need sorting first, the planner usually prefers Hash Join instead.
Sort Operations
Sort
Standard sort operation. Pay attention to the Sort Method:
Sort (cost=4892.33..5017.33 rows=50000 width=94) (actual time=42.515..48.218 rows=50000 loops=1)
Sort Key: created_at DESC
Sort Method: external merge Disk: 5624kB
external merge Disk means the sort spilled to disk. This is slow. Increase work_mem:
SET work_mem = '16MB';
After increasing work_mem:
Sort (cost=4127.33..4252.33 rows=50000 width=94) (actual time=31.207..35.442 rows=50000 loops=1)
Sort Key: created_at DESC
Sort Method: quicksort Memory: 7428kB
Now it fits in memory. Much faster.
Incremental Sort (PostgreSQL 13+)
When data is already partially sorted, Incremental Sort only sorts within groups of the presorted prefix.
EXPLAIN ANALYZE SELECT * FROM orders ORDER BY customer_id, created_at DESC;
If there is an index on customer_id, Postgres can do an Incremental Sort:
Incremental Sort (cost=0.48..6842.48 rows=50000 width=94) (actual time=0.078..28.914 rows=50000 loops=1)
Sort Key: customer_id, created_at DESC
Presorted Key: customer_id
Full-sort Groups: 1563 Sort Method: quicksort Average Memory: 28kB Peak Memory: 32kB
This uses much less memory because it only sorts small groups at a time.
Aggregate Operations
HashAggregate
Builds a hash table of groups. Fast, but requires enough memory to hold all groups.
HashAggregate (cost=1780.12..1802.80 rows=2268 width=44) (actual time=23.985..24.198 rows=2268 loops=1)
Group Key: c.name
Batches: 1 Memory Usage: 369kB
If Batches > 1, the aggregate spilled to disk. Increase work_mem.
GroupAggregate
Requires sorted input. Streams through sorted data and aggregates group by group. Uses minimal memory but needs the sort cost up front.
GroupAggregate (cost=0.29..4218.29 rows=5000 width=44) (actual time=0.042..32.187 rows=5000 loops=1)
Group Key: customer_id
-> Index Scan using idx_orders_customer_id on orders (cost=0.29..3768.29 rows=50000 width=94) (actual time=0.028..18.412 rows=50000 loops=1)
The planner chooses GroupAggregate when the data is already sorted (e.g., from an index scan) and HashAggregate when it would need to sort first.
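To see what the road not taken would have cost, you can disable one strategy for the session and re-run EXPLAIN. A testing-only sketch (do not leave these off globally):

```sql
-- Temporarily forbid HashAggregate to force the GroupAggregate plan
SET enable_hashagg = off;
EXPLAIN ANALYZE SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;
RESET enable_hashagg;
```

Comparing the two plans' actual times tells you whether the planner's default choice was right for your data.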
CTEs and Materialization
Before PostgreSQL 12, CTEs were always materialized — the CTE result was computed once and stored in a temporary buffer, creating an optimization fence. The planner could not push predicates into the CTE.
Starting in PostgreSQL 12, the planner can inline CTEs that are referenced once. You can force the old behavior with MATERIALIZED:
-- Inlined (PostgreSQL 12+, referenced once)
WITH recent_orders AS (
SELECT * FROM orders WHERE created_at > '2025-01-01'
)
SELECT * FROM recent_orders WHERE status = 'pending';
Index Scan using idx_orders_status_created on orders (cost=0.29..85.42 rows=124 width=94)
Index Cond: (status = 'pending')
Filter: (created_at > '2025-01-01'::date)
The CTE was eliminated. The planner combined both predicates. Now force materialization:
WITH recent_orders AS MATERIALIZED (
SELECT * FROM orders WHERE created_at > '2025-01-01'
)
SELECT * FROM recent_orders WHERE status = 'pending';
CTE Scan on recent_orders (cost=1842.00..2217.00 rows=5000 width=94)
Filter: (status = 'pending')
-> CTE recent_orders
-> Seq Scan on orders (cost=0.00..1843.00 rows=15000 width=94)
Filter: (created_at > '2025-01-01'::date)
Now the status = 'pending' filter cannot be pushed down. It scans 15,000 rows from the CTE and filters afterward. Use MATERIALIZED only when you know the CTE result is expensive to compute and is referenced multiple times.
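The opposite hint also exists. A CTE referenced more than once is materialized by default; NOT MATERIALIZED forces inlining when you know predicate pushdown is worth repeating the work. A sketch against the same orders table:

```sql
-- Inline the CTE into both branches so each gets its own pushed-down predicates
WITH recent_orders AS NOT MATERIALIZED (
    SELECT * FROM orders WHERE created_at > '2025-01-01'
)
SELECT * FROM recent_orders WHERE status = 'pending'
UNION ALL
SELECT * FROM recent_orders WHERE status = 'completed';
```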
Subquery Execution: SubPlan and InitPlan
InitPlan
A subquery that runs once and its result is reused. Shown at the top of the plan:
EXPLAIN ANALYZE SELECT * FROM orders WHERE total_amount > (SELECT AVG(total_amount) FROM orders);
Seq Scan on orders (cost=1568.00..3011.00 rows=16667 width=94) (actual time=8.412..18.724 rows=17234 loops=1)
Filter: (total_amount > $0)
Rows Removed by Filter: 32766
InitPlan 1 (returns $0)
-> Aggregate (cost=1443.00..1443.01 rows=1 width=32) (actual time=8.389..8.390 rows=1 loops=1)
-> Seq Scan on orders orders_1 (cost=0.00..1443.00 rows=50000 width=8) (actual time=0.005..4.812 rows=50000 loops=1)
The subquery runs once (loops=1), computes the average, and the outer query uses it as a constant.
SubPlan
A correlated subquery that runs once per outer row. This is usually a performance disaster:
EXPLAIN ANALYZE
SELECT c.name,
(SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) AS order_count
FROM customers c;
Seq Scan on customers c (cost=0.00..91443.00 rows=5000 width=44) (actual time=0.048..412.187 rows=5000 loops=1)
SubPlan 1
-> Aggregate (cost=18.15..18.16 rows=1 width=8) (actual time=0.079..0.079 rows=1 loops=5000)
-> Index Only Scan using idx_orders_customer_id on orders o (cost=0.29..18.12 rows=10 width=0) (actual time=0.028..0.068 rows=10 loops=5000)
Index Cond: (customer_id = c.id)
The subquery runs 5,000 times (loops=5000). Total time: 0.079ms × 5000 = 395ms. Rewrite this as a JOIN:
SELECT c.name, COUNT(o.id) AS order_count
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.name;
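When the per-row logic does not collapse into a plain aggregate JOIN, a LATERAL join expresses the same correlation explicitly. It still runs once per outer row, but it can do things a JOIN cannot, such as top-1-per-group. A sketch against the same schema, assuming an index on orders (customer_id, created_at DESC) to make each inner lookup cheap:

```sql
-- Latest order per customer; each inner lookup is a tiny index scan
SELECT c.name, latest.id, latest.created_at
FROM customers c
LEFT JOIN LATERAL (
    SELECT o.id, o.created_at
    FROM orders o
    WHERE o.customer_id = c.id
    ORDER BY o.created_at DESC
    LIMIT 1
) latest ON true;
```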
EXPLAIN Options
BUFFERS
Shows how many pages were read from cache (shared hit) vs disk (read):
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 4821;
Index Scan using idx_orders_customer_id on orders (cost=0.29..12.58 rows=7 width=94) (actual time=0.028..0.035 rows=6 loops=1)
Index Cond: (customer_id = 4821)
Buffers: shared hit=4
Planning:
Buffers: shared hit=12
shared hit=4 means 4 pages were read from PostgreSQL's shared buffer cache. If you see shared read=4, those pages came from disk (or OS cache). High read counts indicate a cold cache or insufficient shared_buffers.
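To see how much time those reads cost, not just how many pages, enable track_io_timing; plans then gain I/O timing lines alongside the buffer counts:

```sql
SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 4821;
-- When pages were read, the output includes an "I/O Timings" line
```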
FORMAT JSON
Machine-readable output, great for programmatic analysis:
EXPLAIN (ANALYZE, FORMAT JSON) SELECT * FROM orders WHERE customer_id = 4821;
SETTINGS (PostgreSQL 12+)
Shows non-default planner settings that affected the plan:
SET work_mem = '256MB';
EXPLAIN (ANALYZE, SETTINGS) SELECT * FROM orders ORDER BY created_at;
Sort (cost=4127.33..4252.33 rows=50000 width=94) (actual time=31.207..35.442 rows=50000 loops=1)
Sort Key: created_at
Sort Method: quicksort Memory: 7428kB
Settings: work_mem = '256MB'
WAL (PostgreSQL 13+)
Shows WAL (Write-Ahead Log) generation for DML statements:
EXPLAIN (ANALYZE, WAL, BUFFERS) INSERT INTO orders (customer_id, total_amount) VALUES (1, 99.99);
Useful for understanding the write amplification of your queries.
Identifying Slow Nodes
The key skill is comparing estimated rows to actual rows. When they diverge, the planner made a bad decision.
Hash Join (cost=112.45..235.78 rows=10 width=40) (actual time=1.204..847.212 rows=48721 loops=1)
The planner estimated 10 rows but got 48,721. This means:
- The hash table was undersized
- Downstream nodes were not prepared for this volume
- The total time (847ms) is far beyond what the cost model predicted
Root cause: Stale statistics. Fix it:
ANALYZE orders;
Another pattern — watch for high actual time on a single node relative to its children:
Sort (cost=5892.33..6017.33 rows=50000 width=94) (actual time=312.515..358.218 rows=50000 loops=1)
Sort Key: created_at DESC
Sort Method: external merge Disk: 12824kB
-> Seq Scan on orders (cost=0.00..1443.00 rows=50000 width=94) (actual time=0.009..8.131 rows=50000 loops=1)
The Seq Scan took 8ms but the Sort took 312ms and spilled to disk. The fix is either increasing work_mem or adding an index on created_at DESC.
Common Red Flags
- Seq Scan on large tables with a selective filter — you almost certainly need an index
- Rows estimate off by 10x or more — run `ANALYZE`, check for correlated columns, or use extended statistics
- Nested Loop with a high outer row count and no inner index — produces O(n*m) behavior
- Hash `Batches > 1` — the hash table spilled to disk; increase `work_mem`
- `external merge Disk` on Sort nodes — same fix; increase `work_mem`
- Bitmap Heap Scan with `Lossy` blocks — `work_mem` is too low for the bitmap, resulting in page-level (lossy) rather than row-level filtering
- SubPlan with a high `loops` count — rewrite as a JOIN
Fixing Bad Plans
Adding Indexes
The most common fix. Target columns that appear in WHERE, JOIN ON, and ORDER BY:
-- Composite index for a common query pattern
CREATE INDEX idx_orders_status_created ON orders (status, created_at DESC);
-- Covering index to enable Index Only Scan
CREATE INDEX idx_orders_covering ON orders (customer_id) INCLUDE (total_amount, status);
-- Partial index for a common filter
CREATE INDEX idx_orders_pending ON orders (created_at DESC) WHERE status = 'pending';
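Indexes are not free — they slow writes and consume disk — so verify that new ones earn their keep. A sketch using the cumulative statistics views:

```sql
-- Scan counts per index; idx_scan stuck at 0 after a while means the index is unused
SELECT indexrelname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE relname = 'orders'
ORDER BY idx_scan;
```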
Rewriting Queries
Sometimes the query structure itself is the problem:
-- Bad: correlated subquery
SELECT * FROM customers c
WHERE (SELECT COUNT(*) FROM orders WHERE customer_id = c.id) > 5;
-- Good: JOIN with HAVING
SELECT c.*
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.id
HAVING COUNT(o.id) > 5;
-- Also possible: EXISTS with a HAVING clause (still correlated; the JOIN form usually wins)
SELECT * FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.id
HAVING COUNT(*) > 5
);
Updating Statistics
When row estimates are wrong, statistics are usually stale or insufficient:
-- Update statistics for a specific table
ANALYZE orders;
-- Increase statistics target for a specific column (default is 100)
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;
-- Create extended statistics for correlated columns (PostgreSQL 10+)
CREATE STATISTICS stat_orders_region_status (dependencies, ndistinct, mcv)
ON region, status FROM orders;
ANALYZE orders;
Planner Configuration
These settings influence plan selection. Change them per-session for testing, not globally unless you know what you are doing.
-- Discourage Seq Scans (useful for testing if an index plan is faster)
SET enable_seqscan = off;
-- Adjust the cost of random I/O relative to sequential I/O
-- Default is 4.0. On SSDs, set closer to 1.1-1.5
SET random_page_cost = 1.1;
-- Increase memory available for sorts and hash operations
SET work_mem = '64MB';
-- Increase memory for maintenance operations (CREATE INDEX, VACUUM)
SET maintenance_work_mem = '512MB';
-- Allow more parallel workers
SET max_parallel_workers_per_gather = 4;
For SSD-based servers, random_page_cost = 1.1 is one of the most impactful changes you can make. The default of 4.0 assumes spinning disks and makes the planner irrationally afraid of index scans.
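Session-level SET is for experiments. Once a value proves out, scope it durably — per database or per role — rather than setting it in every session. A sketch; the database and role names here are placeholders:

```sql
-- Persist for one database
ALTER DATABASE app_db SET random_page_cost = 1.1;

-- Or for a specific role (e.g., a reporting user that runs big sorts)
ALTER ROLE reporting_user SET work_mem = '64MB';
```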
Using pg_stat_statements to Find Slow Queries
Before you can optimize, you need to know what to optimize. pg_stat_statements tracks query performance across all sessions.
-- Enable the extension (requires superuser, add to postgresql.conf or run once)
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- Find the top 10 slowest queries by total time
SELECT
substring(query, 1, 80) AS short_query,
calls,
round(total_exec_time::numeric, 2) AS total_ms,
round(mean_exec_time::numeric, 2) AS mean_ms,
round(stddev_exec_time::numeric, 2) AS stddev_ms,
rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
short_query | calls | total_ms | mean_ms | stddev_ms | rows
--------------------------------+--------+-----------+---------+-----------+---------
SELECT * FROM orders WHERE sta | 284710 | 142847.12 | 0.50 | 1.24 | 1847210
SELECT c.name, COUNT(o.id) AS | 12847 | 89412.84 | 6.96 | 3.18 | 128470
UPDATE orders SET status = $1 | 48721 | 51842.18 | 1.06 | 0.84 | 48721
The first query ran 284,710 times. Even though each call averages 0.50ms, the total is 142 seconds. Optimizing this saves the most cumulative time.
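Total time is not the only lens. Sorting by mean surfaces queries that are individually slow but infrequent, and resetting the stats gives a clean measurement window after deploying a fix:

```sql
-- Individually slow queries (high mean), filtering out one-off calls
SELECT substring(query, 1, 80) AS short_query,
       calls,
       round(mean_exec_time::numeric, 2) AS mean_ms
FROM pg_stat_statements
WHERE calls > 50
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Start a fresh measurement window
SELECT pg_stat_statements_reset();
```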
EXPLAIN in Node.js
In production Node.js applications, you want to automatically capture slow query plans. Here is a practical approach using the pg module:
var pg = require('pg');
var pool = new pg.Pool({
connectionString: process.env.DATABASE_URL,
max: 20
});
var SLOW_QUERY_THRESHOLD_MS = 200;
function queryWithExplain(text, params, callback) {
var start = Date.now();
pool.query(text, params, function(err, result) {
var duration = Date.now() - start;
if (duration > SLOW_QUERY_THRESHOLD_MS) {
console.warn('[SLOW QUERY] ' + duration + 'ms: ' + text);
console.warn('[SLOW QUERY] Params: ' + JSON.stringify(params));
// Capture the EXPLAIN ANALYZE for diagnosis
var explainQuery = 'EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) ' + text;
pool.query(explainQuery, params, function(explainErr, explainResult) {
if (!explainErr) {
var plan = JSON.stringify(explainResult.rows[0]['QUERY PLAN'], null, 2);
console.warn('[SLOW QUERY] Plan:\n' + plan);
// In production, send this to your logging service
logSlowQuery({
query: text,
params: params,
duration: duration,
plan: explainResult.rows[0]['QUERY PLAN']
});
}
});
}
callback(err, result);
});
}
function logSlowQuery(data) {
// Send to your logging/monitoring service
// e.g., Datadog, Sentry, CloudWatch, or your own table
console.log('[SLOW QUERY LOG]', JSON.stringify({
timestamp: new Date().toISOString(),
query: data.query,
duration_ms: data.duration,
plan_nodes: extractNodeTypes(data.plan)
}));
}
function extractNodeTypes(plan) {
var nodes = [];
function walk(node) {
if (node['Node Type']) {
nodes.push({
type: node['Node Type'],
actual_rows: node['Actual Rows'],
actual_time: node['Actual Total Time']
});
}
if (node.Plans) {
node.Plans.forEach(function(child) {
walk(child);
});
}
}
if (Array.isArray(plan)) {
plan.forEach(function(p) { walk(p.Plan); });
}
return nodes;
}
// Usage
queryWithExplain(
'SELECT * FROM orders WHERE customer_id = $1 AND status = $2',
[4821, 'pending'],
function(err, result) {
if (err) {
console.error('Query failed:', err);
return;
}
console.log('Got ' + result.rows.length + ' orders');
}
);
Important caveat: Running EXPLAIN ANALYZE on every slow query doubles the execution cost. In high-traffic production systems, sample slow queries (e.g., capture the plan for 1 in 100 slow queries) or use EXPLAIN without ANALYZE to get the estimated plan without re-executing:
var sampleRate = 0.01; // 1% of slow queries
if (duration > SLOW_QUERY_THRESHOLD_MS && Math.random() < sampleRate) {
var explainQuery = 'EXPLAIN (FORMAT JSON) ' + text;
pool.query(explainQuery, params, function(explainErr, explainResult) {
// Log estimated plan without re-executing
});
}
Complete Working Example: Diagnosing and Fixing a Slow Query
Let us walk through a real scenario. We have an e-commerce database and a query that is running too slow.
The Schema
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(255),
region VARCHAR(50),
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INTEGER REFERENCES customers(id),
status VARCHAR(20),
total_amount DECIMAL(10,2),
region VARCHAR(50),
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE order_items (
id SERIAL PRIMARY KEY,
order_id INTEGER REFERENCES orders(id),
product_id INTEGER,
quantity INTEGER,
unit_price DECIMAL(10,2)
);
-- 50,000 customers, 500,000 orders, 2,000,000 order items
The Slow Query
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.name, c.region,
SUM(oi.quantity * oi.unit_price) AS total_revenue,
COUNT(DISTINCT o.id) AS order_count
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_items oi ON oi.order_id = o.id
WHERE o.status = 'completed'
AND o.created_at BETWEEN '2025-01-01' AND '2025-12-31'
AND c.region = 'us-west'
GROUP BY c.name, c.region
ORDER BY total_revenue DESC
LIMIT 20;
The Bad Plan (Before Optimization)
Limit (cost=48921.45..48921.50 rows=20 width=76) (actual time=2847.218..2847.234 rows=20 loops=1)
Buffers: shared hit=12847 read=34218
-> Sort (cost=48921.45..48932.12 rows=4268 width=76) (actual time=2847.215..2847.228 rows=20 loops=1)
Sort Key: (sum((oi.quantity * oi.unit_price))) DESC
Sort Method: top-N heapsort Memory: 28kB
Buffers: shared hit=12847 read=34218
-> HashAggregate (cost=48712.18..48754.86 rows=4268 width=76) (actual time=2842.187..2845.412 rows=4268 loops=1)
Group Key: c.name, c.region
Batches: 1 Memory Usage: 1249kB
Buffers: shared hit=12847 read=34218
-> Hash Join (cost=18412.45..47842.18 rows=174000 width=44) (actual time=892.412..2587.218 rows=168421 loops=1)
Hash Cond: (oi.order_id = o.id)
Buffers: shared hit=12847 read=34218
-> Seq Scan on order_items oi (cost=0.00..24718.00 rows=2000000 width=16) (actual time=0.012..412.187 rows=2000000 loops=1)
Buffers: shared hit=4218 read=20500
-> Hash (cost=17842.45..17842.45 rows=45600 width=36) (actual time=891.842..891.843 rows=42187 loops=1)
Buckets: 65536 Batches: 1 Memory Usage: 3412kB
Buffers: shared hit=8629 read=13718
-> Hash Join (cost=248.12..17842.45 rows=45600 width=36) (actual time=12.187..842.412 rows=42187 loops=1)
Hash Cond: (o.customer_id = c.id)
Buffers: shared hit=8629 read=13718
-> Seq Scan on orders o (cost=0.00..16843.00 rows=91200 width=12) (actual time=0.024..718.412 rows=87412 loops=1)
Filter: ((status = 'completed') AND (created_at >= '2025-01-01') AND (created_at <= '2025-12-31'))
Rows Removed by Filter: 412588
Buffers: shared hit=8412 read=13500
-> Hash (cost=198.12..198.12 rows=4000 width=32) (actual time=12.087..12.088 rows=8421 loops=1)
Buckets: 16384 (originally 4096) Batches: 1 Memory Usage: 542kB
Buffers: shared hit=217 read=218
-> Seq Scan on customers c (cost=0.00..198.12 rows=4000 width=32) (actual time=0.018..8.412 rows=8421 loops=1)
Filter: (region = 'us-west')
Rows Removed by Filter: 41579
Buffers: shared hit=217 read=218
Planning Time: 1.842 ms
Execution Time: 2847.518 ms
Total time: 2,847ms. Let us diagnose this.
Reading the Plan — Where Is the Time?
- Seq Scan on `order_items` — reads all 2,000,000 rows (412ms). This is the biggest table and it has no filter pushed down. It reads 20,500 pages from disk.
- Seq Scan on `orders` with Filter — reads 500,000 rows, removes 412,588 (718ms). The filter on `status` and `created_at` is applied after reading every row.
- Row estimate errors — customers estimated 4,000 but got 8,421. The planner sized the hash table wrong (resized from 4,096 to 16,384 buckets).
- `shared read=34218` total — significant disk I/O. Not enough of the data is cached.
The Fix
-- 1. Index for the orders filter (most impactful)
CREATE INDEX idx_orders_status_created ON orders (status, created_at)
WHERE status = 'completed';
-- 2. Index for order_items join
CREATE INDEX idx_order_items_order_id ON order_items (order_id)
INCLUDE (quantity, unit_price);
-- 3. Index for customers region filter
CREATE INDEX idx_customers_region ON customers (region);
-- 4. Update statistics
ANALYZE customers;
ANALYZE orders;
ANALYZE order_items;
The Fixed Plan (After Optimization)
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.name, c.region,
SUM(oi.quantity * oi.unit_price) AS total_revenue,
COUNT(DISTINCT o.id) AS order_count
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN order_items oi ON oi.order_id = o.id
WHERE o.status = 'completed'
AND o.created_at BETWEEN '2025-01-01' AND '2025-12-31'
AND c.region = 'us-west'
GROUP BY c.name, c.region
ORDER BY total_revenue DESC
LIMIT 20;
Limit (cost=8421.45..8421.50 rows=20 width=76) (actual time=187.218..187.234 rows=20 loops=1)
Buffers: shared hit=14842
-> Sort (cost=8421.45..8432.12 rows=4268 width=76) (actual time=187.215..187.228 rows=20 loops=1)
Sort Key: (sum((oi.quantity * oi.unit_price))) DESC
Sort Method: top-N heapsort Memory: 28kB
Buffers: shared hit=14842
-> HashAggregate (cost=8212.18..8254.86 rows=4268 width=76) (actual time=182.187..185.412 rows=4268 loops=1)
Group Key: c.name, c.region
Batches: 1 Memory Usage: 1249kB
Buffers: shared hit=14842
-> Nested Loop (cost=1248.45..7342.18 rows=174000 width=44) (actual time=8.412..142.218 rows=168421 loops=1)
Buffers: shared hit=14842
-> Hash Join (cost=1248.16..4842.45 rows=42187 width=36) (actual time=8.387..42.187 rows=42187 loops=1)
Hash Cond: (o.customer_id = c.id)
Buffers: shared hit=2847
-> Index Scan using idx_orders_status_created on orders o (cost=0.42..3218.42 rows=87412 width=12) (actual time=0.024..18.412 rows=87412 loops=1)
Index Cond: ((status = 'completed') AND (created_at >= '2025-01-01') AND (created_at <= '2025-12-31'))
Buffers: shared hit=2412
-> Hash (cost=1198.12..1198.12 rows=8421 width=32) (actual time=8.087..8.088 rows=8421 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 542kB
Buffers: shared hit=435
-> Index Scan using idx_customers_region on customers c (cost=0.29..1198.12 rows=8421 width=32) (actual time=0.018..4.412 rows=8421 loops=1)
Index Cond: (region = 'us-west')
Buffers: shared hit=435
-> Index Only Scan using idx_order_items_order_id on order_items oi (cost=0.43..0.58 rows=4 width=16) (actual time=0.001..0.002 rows=4 loops=42187)
Index Cond: (order_id = o.id)
Heap Fetches: 0
Buffers: shared hit=11995
Planning Time: 0.847 ms
Execution Time: 187.518 ms
Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| Execution Time | 2,847ms | 187ms | 15.2x faster |
| Disk Reads | 34,218 pages | 0 pages | 100% cache hit |
| Order Items Scan | Full Seq Scan (2M rows) | Index Only Scan (4 rows/loop) | Eliminated full scan |
| Orders Scan | Seq Scan + Filter | Index Scan (direct) | No rows removed by filter |
The key changes:

- `order_items` went from a full Seq Scan of 2M rows to an Index Only Scan that reads only the matching rows per order, with zero heap fetches
- `orders` went from Seq Scan + Filter (reading 500K, keeping 87K) to an Index Scan that only reads matching rows
- All reads come from shared buffers — zero disk I/O on the second run
- The join strategy changed from Hash Join to Nested Loop for `order_items`, which is correct because the loop count is bounded by the filtered orders
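The zero `Heap Fetches` on the Index Only Scan depends on an up-to-date visibility map, which only VACUUM maintains. If the same plan shows nonzero heap fetches after heavy writes, a manual pass helps; a minimal sketch, using the table from the example above:

```sql
-- Index Only Scans fall back to heap fetches when pages are not yet marked
-- all-visible; VACUUM updates the visibility map, ANALYZE refreshes stats.
VACUUM (ANALYZE) order_items;
```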
Common Issues & Troubleshooting
Issue 1: Planner Chooses Seq Scan Despite Index
EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'completed';
Seq Scan on orders (cost=0.00..16843.00 rows=425000 width=94) (actual time=0.024..142.187 rows=425842 loops=1)
Filter: (status = 'completed')
Rows Removed by Filter: 74158
Why: The query returns 85% of the table. An index scan would be slower because of random I/O. The planner is correct. If you only need a subset of columns, create a covering index. If you need to filter further, add more conditions.
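One covering-index sketch for this case, assuming the query only needs a handful of columns (the `INCLUDE` list here is illustrative, not taken from the example schema):

```sql
-- A covering index lets the planner satisfy the query from the index alone
-- (Index Only Scan) instead of reading the full heap.
CREATE INDEX idx_orders_status_covering
    ON orders (status)
    INCLUDE (id, customer_id, created_at);
```

Even then, at 85% selectivity the planner may still prefer the Seq Scan; whether the covering index wins depends on row width and visibility-map coverage.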
Issue 2: Wrong Row Estimates Due to Correlated Columns
EXPLAIN ANALYZE SELECT * FROM orders WHERE region = 'us-west' AND status = 'pending';
Bitmap Heap Scan on orders (cost=82.18..4842.12 rows=180 width=94) (actual time=4.218..28.412 rows=8421 loops=1)
Recheck Cond: (region = 'us-west')
Filter: (status = 'pending')
Rows Removed by Filter: 12
Estimated 180 rows, actual 8,421 — the planner assumed region and status are independent, but they are correlated (most us-west orders are pending).
Fix: Extended statistics:
CREATE STATISTICS stat_orders_region_status (dependencies) ON region, status FROM orders;
ANALYZE orders;
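After `ANALYZE`, you can confirm the dependency statistics were collected by querying the catalog view; a sketch, with the statistics name matching the `CREATE STATISTICS` above:

```sql
-- dependencies reports the measured functional dependency between columns;
-- a degree near 1.0 means the columns are strongly correlated.
SELECT statistics_name, dependencies
FROM pg_stats_ext
WHERE statistics_name = 'stat_orders_region_status';
```

Re-run the original EXPLAIN ANALYZE afterward; the estimated row count should now be close to the actual 8,421.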
Issue 3: Hash Join Spilling to Disk
Hash Join (cost=48218.45..98421.18 rows=500000 width=130) (actual time=2847.218..8421.412 rows=500000 loops=1)
Hash Cond: (o.customer_id = c.id)
-> Seq Scan on orders o ...
-> Hash (cost=24718.00..24718.00 rows=50000 width=36) (actual time=2842.187..2842.188 rows=50000 loops=1)
Buckets: 65536 Batches: 4 Memory Usage: 4097kB
`Batches: 4` means the hash table did not fit in `work_mem` and spilled to disk in 4 batches.
Fix:
SET work_mem = '32MB'; -- Increase from default 4MB
Or better yet, set it per-query in your application:
function runLargeJoinQuery(params, callback) {
  // SET LOCAL needs a transaction and a single connection, not pool.query
  pool.connect(function (err, client, release) {
    if (err) return callback(err);
    client.query("BEGIN; SET LOCAL work_mem = '32MB'", function (err) {
      if (err) { release(); return callback(err); }
      client.query('SELECT ...', params, function (err, result) {
        client.query(err ? 'ROLLBACK' : 'COMMIT', function () { release(); });
        callback(err, result);
      });
    });
  });
}
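To catch spills you did not anticipate, PostgreSQL can log every temporary file a query writes. A sketch, assuming superuser access and a config reload:

```sql
-- log_temp_files is a threshold in kB; 0 logs every temp file, -1 disables
ALTER SYSTEM SET log_temp_files = 0;
SELECT pg_reload_conf();
```

Spilling queries then show up in the server log with the temp file size, which tells you how much `work_mem` the query actually needed.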
Issue 4: Index Not Used Due to Type Mismatch
-- Column is INTEGER, but parameter is TEXT
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = '4821';
Seq Scan on orders (cost=0.00..16843.00 rows=1 width=94) (actual time=0.024..142.187 rows=6 loops=1)
Filter: ((customer_id)::text = '4821'::text)
PostgreSQL casts `customer_id` to text, which prevents index use.
Fix: Use the correct type:
SELECT * FROM orders WHERE customer_id = 4821;
In Node.js, be explicit about parameter types:
// Bad — pg might send this as text
pool.query('SELECT * FROM orders WHERE customer_id = $1', ['4821']);
// Good — ensure it is a number
pool.query('SELECT * FROM orders WHERE customer_id = $1', [parseInt(customerId, 10)]);
Best Practices
- **Always use EXPLAIN ANALYZE, not just EXPLAIN.** Estimated plans lie. Actual plans tell the truth. The gap between estimated and actual rows is where performance problems hide.
- **Run ANALYZE after bulk data changes.** After large INSERTs, DELETEs, or schema changes, the planner's statistics are stale. Autovacuum handles this eventually, but for immediate accuracy, run `ANALYZE tablename` explicitly.
- **Set `random_page_cost` appropriately for your storage.** If you are on SSDs (and you should be), set `random_page_cost = 1.1`. The default of 4.0 assumes spinning disks and causes the planner to avoid index scans even when they would be faster.
- **Use BUFFERS to understand I/O patterns.** `EXPLAIN (ANALYZE, BUFFERS)` shows whether data comes from cache or disk. High `shared read` counts indicate either a cold cache, insufficient `shared_buffers`, or a query that touches too many pages.
- **Create composite indexes that match your query patterns.** A single-column index on `status` is rarely enough. If your queries filter on `status` and sort by `created_at`, create `CREATE INDEX ON orders (status, created_at)`. Column order matters — put equality conditions first, range conditions and sort columns after.
- **Monitor with pg_stat_statements, not just application logs.** Application-level query timing includes network latency, connection pool wait time, and serialization. `pg_stat_statements` measures pure database execution time and tracks every query, including those from ORMs and background jobs you might not be monitoring.
- **Prefer EXISTS over COUNT for existence checks.** `SELECT COUNT(*) FROM orders WHERE customer_id = $1` scans all matching rows. `SELECT EXISTS (SELECT 1 FROM orders WHERE customer_id = $1)` stops at the first match.
- **Do not over-index.** Every index slows down writes and consumes disk space. If an index is not used by any query in `pg_stat_user_indexes` (check `idx_scan = 0`), drop it.
- **Test query plans with production-like data volumes.** A query that uses an index on 1,000 rows might switch to a Seq Scan on 1,000,000 rows. Always test EXPLAIN ANALYZE on a database with realistic data distribution and volume.
References
- PostgreSQL Documentation: Using EXPLAIN
- PostgreSQL Documentation: Planner Cost Constants
- PostgreSQL Documentation: pg_stat_statements
- PostgreSQL Documentation: Extended Statistics
- PostgreSQL Documentation: Index Types
- Depesz EXPLAIN Visualizer
- PEV2: PostgreSQL EXPLAIN Visualizer
- The Art of PostgreSQL by Dimitri Fontaine
