PostgreSQL Monitoring and Diagnostics

A comprehensive guide to monitoring PostgreSQL databases covering key metrics, built-in views, slow query analysis, alerting, and building monitoring dashboards with Node.js.

Overview

PostgreSQL ships with a powerful suite of built-in monitoring views and statistics collectors that most teams barely scratch the surface of. Understanding how to read these metrics, set up meaningful alerts, and build automated monitoring systems is the difference between catching a problem at 2 AM and explaining an outage at 9 AM. This guide covers the essential monitoring tools PostgreSQL provides out of the box, how to query them effectively, and how to build a production-grade monitoring service in Node.js that keeps you ahead of trouble.

Prerequisites

  • PostgreSQL 13 or later installed and running
  • Node.js 18+ with the pg driver installed
  • Basic knowledge of SQL and PostgreSQL administration
  • Access to a PostgreSQL instance with superuser or pg_monitor role privileges
  • Familiarity with Express.js for the dashboard example

Install the required Node.js packages:

npm install pg express

Key PostgreSQL Metrics to Monitor

Before diving into tooling, you need to understand which metrics actually matter. Not everything PostgreSQL exposes is worth alerting on. Here are the ones that will save you in production.

Active Connections

PostgreSQL has a hard connection limit defined by max_connections (default 100). When you hit it, new connections are refused outright. You want to track both the current count and its ratio against the maximum.

SELECT
    count(*) AS total_connections,
    count(*) FILTER (WHERE state = 'active') AS active_connections,
    count(*) FILTER (WHERE state = 'idle') AS idle_connections,
    count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_transaction,
    (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_connections
FROM pg_stat_activity
WHERE backend_type = 'client backend';

A healthy system typically runs at 20-40% of max_connections. If you are consistently above 70%, it is time to either raise the limit, introduce connection pooling with PgBouncer, or investigate connection leaks.
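
To tell pooling problems from connection leaks, it helps to see where connections actually come from. The following is a small diagnostic sketch that groups pg_stat_activity by user, application, and client address:

-- Connection count broken down by source
SELECT
    usename,
    application_name,
    client_addr,
    state,
    count(*) AS connection_count
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY usename, application_name, client_addr, state
ORDER BY connection_count DESC;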

Cache Hit Ratio

PostgreSQL uses shared buffers to cache frequently accessed data. The cache hit ratio tells you how often queries find data in memory versus going to disk. A ratio below 99% on a production OLTP workload is a red flag.

SELECT
    sum(heap_blks_read) AS heap_read,
    sum(heap_blks_hit) AS heap_hit,
    CASE
        WHEN sum(heap_blks_hit) + sum(heap_blks_read) = 0 THEN 0
        ELSE round(sum(heap_blks_hit)::numeric / (sum(heap_blks_hit) + sum(heap_blks_read)) * 100, 2)
    END AS cache_hit_ratio
FROM pg_statio_user_tables;

If your ratio is below 99%, increase shared_buffers or add more RAM. If it is below 95%, either your memory allocation is fundamentally wrong or your queries are scanning enormous amounts of data.
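
A sketch of the first remedy, assuming you have the RAM to spare: shared_buffers can be raised with ALTER SYSTEM, but unlike most settings it only takes effect after a server restart, and the 4GB figure below is purely illustrative.

-- Raise shared_buffers (takes effect only after a restart)
ALTER SYSTEM SET shared_buffers = '4GB';

-- Confirm the current value and whether a restart is pending
SELECT name, setting, unit, pending_restart
FROM pg_settings
WHERE name = 'shared_buffers';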

Transaction Throughput

Tracking commits and rollbacks over time gives you a clear picture of database throughput and error rates.

SELECT
    xact_commit AS commits,
    xact_rollback AS rollbacks,
    CASE
        WHEN xact_commit + xact_rollback = 0 THEN 0
        ELSE round(xact_rollback::numeric / (xact_commit + xact_rollback) * 100, 2)
    END AS rollback_ratio,
    blks_read,
    blks_hit,
    tup_returned,
    tup_fetched,
    tup_inserted,
    tup_updated,
    tup_deleted
FROM pg_stat_database
WHERE datname = current_database();

A rollback ratio above 1-2% deserves investigation. High rollback rates often indicate application-level retry storms or deadlock contention.
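
To separate deadlock contention from application-level aborts, pg_stat_database also exposes a per-database deadlock counter; a quick check might look like this:

SELECT
    datname,
    xact_rollback,
    deadlocks,
    conflicts
FROM pg_stat_database
WHERE datname = current_database();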

Replication Lag

If you are running streaming replication, lag is arguably the most critical metric. A replica falling behind means your failover target is stale.

-- On the primary
SELECT
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS replay_lag_pretty
FROM pg_stat_replication;

Anything above a few megabytes of lag under normal load warrants investigation. Sustained lag often points to the replica being under-provisioned or long-running queries on the replica blocking WAL replay.
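
On the replica side, lag is often easier to reason about in seconds than in bytes. A common pattern, sketched below, returns zero when the replica has fully caught up and otherwise the time since the last replayed transaction:

-- On the replica
SELECT
    CASE
        WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
        ELSE EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
    END AS replication_lag_seconds;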

Lock Contention

Locks are normal. Lock waits that last more than a few seconds are not. This query shows you exactly who is blocking whom.

SELECT
    blocked_locks.pid AS blocked_pid,
    blocked_activity.usename AS blocked_user,
    blocking_locks.pid AS blocking_pid,
    blocking_activity.usename AS blocking_user,
    blocked_activity.query AS blocked_query,
    blocking_activity.query AS blocking_query,
    blocked_activity.wait_event_type,
    now() - blocked_activity.query_start AS blocked_duration
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
    ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
    AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
    AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
    AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
    AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
    AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
    AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
    ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
ORDER BY blocked_duration DESC;

Built-in Monitoring Views

PostgreSQL provides a rich set of pg_stat_* views. Here are the ones you should know inside and out.

pg_stat_activity

This is the first place you look when something feels wrong. It shows every active backend connection, what query it is running, and how long it has been running.

SELECT
    pid,
    usename,
    datname,
    client_addr,
    state,
    wait_event_type,
    wait_event,
    query,
    now() - query_start AS query_duration,
    now() - xact_start AS transaction_duration
FROM pg_stat_activity
WHERE state != 'idle'
    AND pid != pg_backend_pid()
ORDER BY query_start ASC;

The state field is your primary indicator. Watch for idle in transaction connections -- these hold locks and prevent vacuum from reclaiming dead tuples. Any connection sitting in idle in transaction for more than a few minutes is almost certainly a bug in the application code, usually a missing COMMIT or ROLLBACK in an error path.
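
A quick way to surface those offenders is to filter pg_stat_activity on the state directly; the sketch below lists each idle-in-transaction session along with how long it has been idle:

SELECT
    pid,
    usename,
    application_name,
    now() - state_change AS idle_duration,
    left(query, 200) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY state_change ASC;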

pg_stat_user_tables

This view tracks sequential scans, index scans, live tuples, dead tuples, and when the last vacuum and analyze ran. It is essential for understanding table health.

SELECT
    schemaname,
    relname AS table_name,
    seq_scan,
    seq_tup_read,
    idx_scan,
    idx_tup_fetch,
    n_tup_ins AS inserts,
    n_tup_upd AS updates,
    n_tup_del AS deletes,
    n_live_tup AS live_tuples,
    n_dead_tup AS dead_tuples,
    round(n_dead_tup::numeric / GREATEST(n_live_tup, 1) * 100, 2) AS dead_tuple_ratio,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

Tables with a dead tuple ratio above 10% are candidates for a manual VACUUM ANALYZE. Tables with high seq_scan counts and low idx_scan counts are candidates for new indexes.
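
To turn the second observation into a shortlist, one option is to rank tables by how much sequential scanning they absorb relative to index scans. The thresholds below are arbitrary starting points, not hard rules:

SELECT
    schemaname,
    relname AS table_name,
    seq_scan,
    seq_tup_read,
    idx_scan,
    n_live_tup
FROM pg_stat_user_tables
WHERE n_live_tup > 10000
    AND seq_scan > COALESCE(idx_scan, 0) * 10
ORDER BY seq_tup_read DESC
LIMIT 20;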

pg_stat_bgwriter

The background writer and checkpointer statistics reveal I/O pressure. Frequent requested checkpoints, or buffers written directly by backends instead of by the background writer, indicate I/O bottlenecks.

SELECT
    checkpoints_timed,
    checkpoints_req,
    checkpoint_write_time,
    checkpoint_sync_time,
    buffers_checkpoint,
    buffers_clean,
    maxwritten_clean,
    buffers_backend,
    buffers_backend_fsync,
    buffers_alloc
FROM pg_stat_bgwriter;

If buffers_backend is significantly higher than buffers_clean and buffers_checkpoint combined, your background writer cannot keep up. Increase bgwriter_lru_maxpages and bgwriter_lru_multiplier. If checkpoints_req is much higher than checkpoints_timed, increase checkpoint_timeout or max_wal_size.
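
These parameters can all be changed with ALTER SYSTEM followed by a configuration reload; the values below are illustrative starting points rather than recommendations for any particular workload:

ALTER SYSTEM SET checkpoint_timeout = '15min';
ALTER SYSTEM SET max_wal_size = '4GB';
ALTER SYSTEM SET bgwriter_lru_maxpages = 400;
ALTER SYSTEM SET bgwriter_lru_multiplier = 4.0;
SELECT pg_reload_conf();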

Slow Query Identification with pg_stat_statements

The pg_stat_statements extension is the single most valuable monitoring tool in PostgreSQL. It aggregates query execution statistics across all calls, normalized by query pattern.

First, enable it (changing shared_preload_libraries requires a server restart):

-- In postgresql.conf
-- shared_preload_libraries = 'pg_stat_statements'
-- pg_stat_statements.track = all

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

Then query the top offenders:

-- Top 20 queries by total execution time
SELECT
    queryid,
    calls,
    round(total_exec_time::numeric, 2) AS total_time_ms,
    round(mean_exec_time::numeric, 2) AS mean_time_ms,
    round(min_exec_time::numeric, 2) AS min_time_ms,
    round(max_exec_time::numeric, 2) AS max_time_ms,
    round(stddev_exec_time::numeric, 2) AS stddev_ms,
    rows,
    round((shared_blks_hit::numeric / GREATEST(shared_blks_hit + shared_blks_read, 1)) * 100, 2) AS hit_ratio,
    left(query, 200) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

Reset the statistics periodically (weekly is reasonable for most workloads) so the data stays relevant:

SELECT pg_stat_statements_reset();

The queries that consume the most total time are your optimization targets. A query that runs in 5ms but gets called 10 million times per day deserves more attention than a query that runs once a day for 30 seconds.
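
A complementary view sorts by call count so that the cheap-but-constant queries stand out; a minimal variant of the query above:

-- Top 20 queries by call frequency
SELECT
    queryid,
    calls,
    round(mean_exec_time::numeric, 2) AS mean_time_ms,
    round(total_exec_time::numeric, 2) AS total_time_ms,
    left(query, 200) AS query_preview
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 20;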

Long-Running Query Detection and Termination

Detecting long-running queries is straightforward. Terminating them safely requires knowing the difference between pg_cancel_backend and pg_terminate_backend.

-- Find queries running longer than 5 minutes
SELECT
    pid,
    usename,
    datname,
    state,
    now() - query_start AS duration,
    left(query, 200) AS query
FROM pg_stat_activity
WHERE state = 'active'
    AND now() - query_start > interval '5 minutes'
    AND pid != pg_backend_pid();

pg_cancel_backend(pid) sends a cancel signal, which aborts the current query but leaves the connection open. Always try this first. pg_terminate_backend(pid) kills the entire connection. Use it when cancel does not work.

-- Cancel a long-running query (graceful)
SELECT pg_cancel_backend(12345);

-- Terminate a connection (forceful)
SELECT pg_terminate_backend(12345);

You can also set a statement timeout globally or per-session to automatically kill runaway queries:

-- Set a 30-second query timeout for the current session
SET statement_timeout = '30s';

-- Set a default timeout for a specific role
ALTER ROLE app_user SET statement_timeout = '60s';

Vacuum and Autovacuum Monitoring

Autovacuum is critical for PostgreSQL performance. Dead tuples accumulate from updates and deletes, causing table bloat and degraded query performance. If autovacuum is not keeping up, your database will slowly grind to a halt.

-- Check autovacuum activity
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    last_autovacuum,
    last_autoanalyze,
    autovacuum_count,
    autoanalyze_count
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;

-- Check if autovacuum is currently running
SELECT
    pid,
    datname,
    relid::regclass AS table_name,
    phase,
    heap_blks_total,
    heap_blks_scanned,
    heap_blks_vacuumed,
    index_vacuum_count,
    max_dead_tuples,
    num_dead_tuples
FROM pg_stat_progress_vacuum;

If you have high-churn tables that autovacuum cannot keep up with, tune the settings per-table:

ALTER TABLE high_churn_table SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_analyze_scale_factor = 0.005,
    autovacuum_vacuum_cost_delay = 2
);

Disk Usage and Table Bloat Analysis

Understanding disk usage at the table and index level helps you plan capacity and identify bloat.

-- Table and index sizes
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS total_size,
    pg_size_pretty(pg_relation_size(schemaname || '.' || tablename)) AS table_size,
    pg_size_pretty(
        pg_total_relation_size(schemaname || '.' || tablename) -
        pg_relation_size(schemaname || '.' || tablename)
    ) AS index_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC;

For a rough bloat estimate, this query reports each table's size alongside its dead-to-live tuple ratio:

SELECT
    current_database() AS database,
    schemaname,
    relname AS table_name,
    pg_size_pretty(pg_relation_size(relid)) AS table_size,
    n_live_tup AS live_tuples,
    n_dead_tup AS dead_tuples,
    round(
        CASE WHEN n_live_tup > 0
            THEN n_dead_tup::numeric / n_live_tup * 100
            ELSE 0
        END, 2
    ) AS bloat_pct
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_dead_tup DESC;

PostgreSQL Logging Configuration

The right logging settings are essential for post-incident analysis. Add these to postgresql.conf:

# Log slow queries (anything over 500ms)
log_min_duration_statement = 500

# Log all DDL statements
log_statement = 'ddl'

# Log connection and disconnection events
log_connections = on
log_disconnections = on

# Log lock waits
log_lock_waits = on
deadlock_timeout = 1s

# Log checkpoints
log_checkpoints = on

# Log temporary file usage (queries spilling to disk)
log_temp_files = 0

# Structured output for log aggregation
log_line_prefix = '%m [%p] %q%u@%d '
logging_collector = on
log_directory = 'pg_log'
log_filename = 'postgresql-%Y-%m-%d.log'
log_rotation_age = 1d

Setting log_min_duration_statement to 500ms will capture slow queries without generating excessive log volume. Adjust based on your workload. For high-throughput OLTP systems, 1000ms might be more appropriate. For analytical workloads, 5000ms or higher.
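
The threshold can also be changed at runtime without touching the config file, either globally or for a single database; the statements below are a sketch, and mydb is a placeholder name:

-- Globally, applied on configuration reload
ALTER SYSTEM SET log_min_duration_statement = '1s';
SELECT pg_reload_conf();

-- For one database only
ALTER DATABASE mydb SET log_min_duration_statement = '5s';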

Connection Pool Monitoring

If you are using PgBouncer (and you should be for any production Node.js application), monitor its internal statistics:

-- Connect to PgBouncer admin console
-- psql -p 6432 -U pgbouncer pgbouncer

SHOW POOLS;
SHOW STATS;
SHOW CLIENTS;
SHOW SERVERS;

From Node.js, you can query PgBouncer stats programmatically:

var pg = require('pg');

var bouncerPool = new pg.Pool({
    host: 'localhost',
    port: 6432,
    database: 'pgbouncer',
    user: 'pgbouncer'
});

function getPgBouncerStats(callback) {
    bouncerPool.query('SHOW POOLS', function(err, result) {
        if (err) {
            return callback(err);
        }
        var stats = result.rows.map(function(row) {
            return {
                database: row.database,
                user: row.user,
                activeConnections: parseInt(row.cl_active, 10),
                waitingConnections: parseInt(row.cl_waiting, 10),
                serverActive: parseInt(row.sv_active, 10),
                serverIdle: parseInt(row.sv_idle, 10),
                poolMode: row.pool_mode
            };
        });
        callback(null, stats);
    });
}

Integrating with Prometheus and Grafana

For production systems, you want metrics flowing into a time-series database. The postgres_exporter is the standard Prometheus exporter for PostgreSQL.

# Install postgres_exporter
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.15.0/postgres_exporter-0.15.0.linux-amd64.tar.gz
tar xvf postgres_exporter-0.15.0.linux-amd64.tar.gz
cd postgres_exporter-0.15.0.linux-amd64

# Set the connection string and run
export DATA_SOURCE_NAME="postgresql://monitor_user:password@localhost:5432/mydb?sslmode=disable"
./postgres_exporter

Add the scrape target to prometheus.yml:

scrape_configs:
  - job_name: 'postgresql'
    static_configs:
      - targets: ['localhost:9187']
    scrape_interval: 15s

You can also push custom metrics from your Node.js monitoring service using the prom-client library:

var client = require('prom-client');

var cacheHitRatio = new client.Gauge({
    name: 'pg_cache_hit_ratio',
    help: 'PostgreSQL buffer cache hit ratio'
});

var activeConnections = new client.Gauge({
    name: 'pg_active_connections',
    help: 'Number of active PostgreSQL connections'
});

var deadTuples = new client.Gauge({
    name: 'pg_dead_tuples_total',
    help: 'Total dead tuples across all tables',
    labelNames: ['table']
});

Complete Working Example: Node.js Monitoring Service

Here is a complete monitoring service that collects PostgreSQL metrics, keeps a rolling in-memory history of snapshots, exposes health check and metrics endpoints, and fires alerts when thresholds are breached.

var pg = require('pg');
var express = require('express');
var http = require('http');

var app = express();
var port = process.env.MONITOR_PORT || 3100;

// PostgreSQL connection pool for monitoring queries
var pool = new pg.Pool({
    connectionString: process.env.DATABASE_URL || 'postgresql://localhost:5432/mydb',
    max: 3, // Keep the monitoring pool small
    idleTimeoutMillis: 30000
});

// In-memory metric history (use Redis or a time-series DB in production)
var metricHistory = [];
var MAX_HISTORY = 1440; // 24 hours at 1-minute intervals

// Alert thresholds
var THRESHOLDS = {
    connectionUsagePercent: 80,
    cacheHitRatio: 99.0,
    longQuerySeconds: 300,
    deadTupleRatio: 20,
    rollbackRatio: 5,
    idleInTransactionSeconds: 60
};

// Collect active connection metrics
function collectConnectionMetrics(callback) {
    var query = [
        'SELECT',
        '    count(*) AS total,',
        '    count(*) FILTER (WHERE state = \'active\') AS active,',
        '    count(*) FILTER (WHERE state = \'idle\') AS idle,',
        '    count(*) FILTER (WHERE state = \'idle in transaction\') AS idle_in_tx,',
        '    (SELECT setting::int FROM pg_settings WHERE name = \'max_connections\') AS max_conn',
        'FROM pg_stat_activity',
        'WHERE backend_type = \'client backend\''
    ].join('\n');

    pool.query(query, function(err, result) {
        if (err) return callback(err);
        var row = result.rows[0];
        var metrics = {
            total: parseInt(row.total, 10),
            active: parseInt(row.active, 10),
            idle: parseInt(row.idle, 10),
            idleInTransaction: parseInt(row.idle_in_tx, 10),
            maxConnections: parseInt(row.max_conn, 10),
            usagePercent: parseFloat((row.total / row.max_conn * 100).toFixed(2))
        };
        callback(null, metrics);
    });
}

// Collect cache hit ratio
function collectCacheMetrics(callback) {
    var query = [
        'SELECT',
        '    sum(heap_blks_read) AS heap_read,',
        '    sum(heap_blks_hit) AS heap_hit,',
        '    CASE',
        '        WHEN sum(heap_blks_hit) + sum(heap_blks_read) = 0 THEN 100',
        '        ELSE round(sum(heap_blks_hit)::numeric / (sum(heap_blks_hit) + sum(heap_blks_read)) * 100, 2)',
        '    END AS ratio',
        'FROM pg_statio_user_tables'
    ].join('\n');

    pool.query(query, function(err, result) {
        if (err) return callback(err);
        var row = result.rows[0];
        callback(null, {
            heapRead: parseInt(row.heap_read, 10) || 0,
            heapHit: parseInt(row.heap_hit, 10) || 0,
            hitRatio: parseFloat(row.ratio)
        });
    });
}

// Collect slow / long-running queries
function collectSlowQueries(callback) {
    var query = [
        'SELECT',
        '    pid,',
        '    usename,',
        '    datname,',
        '    state,',
        '    EXTRACT(EPOCH FROM (now() - query_start))::int AS duration_seconds,',
        '    left(query, 500) AS query',
        'FROM pg_stat_activity',
        'WHERE state = \'active\'',
        '    AND now() - query_start > interval \'5 seconds\'',
        '    AND pid != pg_backend_pid()',
        'ORDER BY query_start ASC'
    ].join('\n');

    pool.query(query, function(err, result) {
        if (err) return callback(err);
        callback(null, result.rows);
    });
}

// Collect table sizes and dead tuple counts
function collectTableMetrics(callback) {
    var query = [
        'SELECT',
        '    schemaname,',
        '    relname AS table_name,',
        '    pg_size_pretty(pg_total_relation_size(schemaname || \'.\' || relname)) AS total_size,',
        '    pg_total_relation_size(schemaname || \'.\' || relname) AS total_size_bytes,',
        '    n_live_tup AS live_tuples,',
        '    n_dead_tup AS dead_tuples,',
        '    CASE WHEN n_live_tup > 0',
        '        THEN round(n_dead_tup::numeric / n_live_tup * 100, 2)',
        '        ELSE 0',
        '    END AS dead_tuple_ratio,',
        '    last_autovacuum,',
        '    seq_scan,',
        '    idx_scan',
        'FROM pg_stat_user_tables',
        'ORDER BY pg_total_relation_size(schemaname || \'.\' || relname) DESC',
        'LIMIT 50'
    ].join('\n');

    pool.query(query, function(err, result) {
        if (err) return callback(err);
        callback(null, result.rows);
    });
}

// Collect lock waits
function collectLockMetrics(callback) {
    var query = [
        'SELECT',
        '    blocked.pid AS blocked_pid,',
        '    blocked_activity.usename AS blocked_user,',
        '    blocking.pid AS blocking_pid,',
        '    blocking_activity.usename AS blocking_user,',
        '    EXTRACT(EPOCH FROM (now() - blocked_activity.query_start))::int AS wait_seconds,',
        '    left(blocked_activity.query, 200) AS blocked_query',
        'FROM pg_locks blocked',
        'JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked.pid',
        'JOIN pg_locks blocking',
        '    ON blocking.locktype = blocked.locktype',
        '    AND blocking.relation IS NOT DISTINCT FROM blocked.relation',
        '    AND blocking.pid != blocked.pid',
        'JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking.pid',
        'WHERE NOT blocked.granted',
        'ORDER BY blocked_activity.query_start ASC'
    ].join('\n');

    pool.query(query, function(err, result) {
        if (err) return callback(err);
        callback(null, result.rows);
    });
}

// Evaluate alerts based on current metrics
function evaluateAlerts(snapshot) {
    var alerts = [];

    if (snapshot.connections.usagePercent > THRESHOLDS.connectionUsagePercent) {
        alerts.push({
            level: 'critical',
            metric: 'connections',
            message: 'Connection usage at ' + snapshot.connections.usagePercent +
                '% (' + snapshot.connections.total + '/' + snapshot.connections.maxConnections + ')'
        });
    }

    if (snapshot.cache.hitRatio < THRESHOLDS.cacheHitRatio) {
        alerts.push({
            level: 'warning',
            metric: 'cache_hit_ratio',
            message: 'Cache hit ratio at ' + snapshot.cache.hitRatio + '% (threshold: ' +
                THRESHOLDS.cacheHitRatio + '%)'
        });
    }

    snapshot.slowQueries.forEach(function(q) {
        if (q.duration_seconds > THRESHOLDS.longQuerySeconds) {
            alerts.push({
                level: 'critical',
                metric: 'long_query',
                message: 'Query on PID ' + q.pid + ' running for ' +
                    q.duration_seconds + 's: ' + q.query.substring(0, 100)
            });
        }
    });

    snapshot.tables.forEach(function(t) {
        if (parseFloat(t.dead_tuple_ratio) > THRESHOLDS.deadTupleRatio) {
            alerts.push({
                level: 'warning',
                metric: 'dead_tuples',
                message: 'Table ' + t.table_name + ' has ' + t.dead_tuple_ratio +
                    '% dead tuples (' + t.dead_tuples + ' dead / ' + t.live_tuples + ' live)'
            });
        }
    });

    if (snapshot.connections.idleInTransaction > 0) {
        alerts.push({
            level: 'warning',
            metric: 'idle_in_transaction',
            message: snapshot.connections.idleInTransaction +
                ' connections stuck in idle-in-transaction state'
        });
    }

    if (snapshot.locks.length > 0) {
        alerts.push({
            level: 'warning',
            metric: 'lock_waits',
            message: snapshot.locks.length + ' queries currently waiting on locks'
        });
    }

    return alerts;
}

// Collect all metrics in a single snapshot
function collectSnapshot(callback) {
    var snapshot = { timestamp: new Date().toISOString() };
    var pending = 5;
    var errors = [];

    function done() {
        pending--;
        if (pending === 0) {
            if (errors.length > 0) {
                return callback(errors[0]);
            }
            snapshot.alerts = evaluateAlerts(snapshot);
            snapshot.status = snapshot.alerts.some(function(a) {
                return a.level === 'critical';
            }) ? 'critical' : (snapshot.alerts.length > 0 ? 'warning' : 'healthy');

            // Store in history
            metricHistory.push(snapshot);
            if (metricHistory.length > MAX_HISTORY) {
                metricHistory.shift();
            }

            callback(null, snapshot);
        }
    }

    collectConnectionMetrics(function(err, data) {
        if (err) errors.push(err);
        else snapshot.connections = data;
        done();
    });

    collectCacheMetrics(function(err, data) {
        if (err) errors.push(err);
        else snapshot.cache = data;
        done();
    });

    collectSlowQueries(function(err, data) {
        if (err) errors.push(err);
        else snapshot.slowQueries = data;
        done();
    });

    collectTableMetrics(function(err, data) {
        if (err) errors.push(err);
        else snapshot.tables = data;
        done();
    });

    collectLockMetrics(function(err, data) {
        if (err) errors.push(err);
        else snapshot.locks = data;
        done();
    });
}

// HTTP endpoints
app.get('/health', function(req, res) {
    collectSnapshot(function(err, snapshot) {
        if (err) {
            return res.status(500).json({
                status: 'error',
                message: err.message
            });
        }
        var statusCode = snapshot.status === 'critical' ? 503 : 200;
        res.status(statusCode).json(snapshot);
    });
});

app.get('/metrics', function(req, res) {
    collectSnapshot(function(err, snapshot) {
        if (err) {
            return res.status(500).json({ error: err.message });
        }
        res.json(snapshot);
    });
});

app.get('/metrics/history', function(req, res) {
    var limit = parseInt(req.query.limit, 10) || 60;
    var recent = metricHistory.slice(-limit);
    res.json({
        count: recent.length,
        metrics: recent
    });
});

app.get('/metrics/tables', function(req, res) {
    collectTableMetrics(function(err, data) {
        if (err) {
            return res.status(500).json({ error: err.message });
        }
        res.json({ tables: data });
    });
});

app.post('/queries/cancel/:pid', function(req, res) {
    var pid = parseInt(req.params.pid, 10);
    if (isNaN(pid)) {
        return res.status(400).json({ error: 'Invalid PID' });
    }
    pool.query('SELECT pg_cancel_backend($1)', [pid], function(err, result) {
        if (err) {
            return res.status(500).json({ error: err.message });
        }
        res.json({ cancelled: result.rows[0].pg_cancel_backend, pid: pid });
    });
});

// Periodic collection (every 60 seconds)
var collectionInterval = setInterval(function() {
    collectSnapshot(function(err, snapshot) {
        if (err) {
            console.error('[pg-monitor] Collection error:', err.message);
            return;
        }
        console.log(
            '[pg-monitor] %s | Status: %s | Connections: %d/%d | Cache: %s% | Alerts: %d',
            snapshot.timestamp,
            snapshot.status,
            snapshot.connections.total,
            snapshot.connections.maxConnections,
            snapshot.cache.hitRatio,
            snapshot.alerts.length
        );

        // Log critical alerts
        snapshot.alerts.forEach(function(alert) {
            if (alert.level === 'critical') {
                console.error('[pg-monitor] CRITICAL:', alert.message);
            }
        });
    });
}, 60000);

// Start the server
app.listen(port, function() {
    console.log('[pg-monitor] Monitoring service running on port ' + port);
    // Collect initial snapshot
    collectSnapshot(function(err, snapshot) {
        if (err) {
            console.error('[pg-monitor] Initial collection failed:', err.message);
        } else {
            console.log('[pg-monitor] Initial collection complete. Status:', snapshot.status);
        }
    });
});

// Graceful shutdown
process.on('SIGTERM', function() {
    console.log('[pg-monitor] Shutting down...');
    clearInterval(collectionInterval);
    pool.end(function() {
        process.exit(0);
    });
});

Example health check output:

{
  "timestamp": "2026-02-13T14:30:00.000Z",
  "connections": {
    "total": 45,
    "active": 12,
    "idle": 30,
    "idleInTransaction": 3,
    "maxConnections": 200,
    "usagePercent": 22.5
  },
  "cache": {
    "heapRead": 158234,
    "heapHit": 48293847,
    "hitRatio": 99.67
  },
  "slowQueries": [
    {
      "pid": 29384,
      "usename": "app_user",
      "datname": "production",
      "state": "active",
      "duration_seconds": 45,
      "query": "SELECT * FROM orders JOIN order_items ON ..."
    }
  ],
  "tables": [
    {
      "table_name": "events",
      "total_size": "2458 MB",
      "live_tuples": 15000000,
      "dead_tuples": 450000,
      "dead_tuple_ratio": "3.00",
      "last_autovacuum": "2026-02-13T13:15:00Z"
    }
  ],
  "locks": [],
  "alerts": [
    {
      "level": "warning",
      "metric": "idle_in_transaction",
      "message": "3 connections stuck in idle-in-transaction state"
    }
  ],
  "status": "warning"
}

Common Issues and Troubleshooting

1. Connection Exhaustion

FATAL: sorry, too many clients already

This means all connection slots are taken. Immediate actions: check for connection leaks in your application (clients checked out with pool.connect() but never returned with client.release()), look for idle-in-transaction connections, and consider adding PgBouncer. Long-term, set idle_in_transaction_session_timeout to automatically kill stale transactions:

ALTER SYSTEM SET idle_in_transaction_session_timeout = '300s';
SELECT pg_reload_conf();

2. Autovacuum Not Keeping Up

WARNING: oldest xmin is far in the past
WARNING: database "mydb" must be vacuumed within 10000000 transactions

This is the transaction ID wraparound warning and it is serious. If PostgreSQL reaches the wraparound limit, it will shut down to prevent data corruption. Run an aggressive manual vacuum immediately:

vacuumdb --analyze --verbose --jobs=4 mydb

Then tune your autovacuum settings. The defaults are conservative. For high-write workloads, reduce autovacuum_vacuum_scale_factor from 0.2 to 0.01 and increase autovacuum_max_workers from 3 to 5 or 6.
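
Applied globally, that tuning might look like the sketch below; note that autovacuum_max_workers is only read at server start, so that change needs a restart, while the scale factor change applies on reload:

ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.01;
ALTER SYSTEM SET autovacuum_max_workers = 6;  -- picked up only at the next restart
SELECT pg_reload_conf();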

3. Slow Queries Due to Missing Statistics

LOG: duration: 15234.567 ms  statement: SELECT * FROM users WHERE email = '[email protected]'

If a query on an indexed column suddenly becomes slow, check whether the table statistics are up to date:

-- Check when the last analyze ran
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'users';

-- Force a statistics update
ANALYZE users;

-- Check if the planner is using the index
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE email = '[email protected]';

Stale statistics cause the query planner to choose sequential scans over index scans. This frequently happens after bulk data loads.

4. Lock Waits Causing Application Timeouts

LOG: process 29384 still waiting for ShareLock on transaction 12345 after 30000.123 ms

Lock contention usually comes from long-running transactions holding row-level locks while other transactions try to modify the same rows. Common causes include schema migrations running ALTER TABLE with ACCESS EXCLUSIVE locks, or application code that opens a transaction, does external API calls, and then tries to commit.

Find and resolve the blocking query:

-- Find the blocking PID
SELECT pid, usename, state, query
FROM pg_stat_activity
WHERE pid IN (
    SELECT unnest(pg_blocking_pids(29384))
);

-- Cancel the blocking query
SELECT pg_cancel_backend(<blocking_pid>);

For DDL operations, set lock_timeout so the migration fails fast instead of waiting indefinitely:

SET lock_timeout = '5s';
ALTER TABLE users ADD COLUMN last_login TIMESTAMP;

5. Disk Space Exhaustion from WAL Accumulation

PANIC: could not write to file "pg_wal/xlogtemp.29384": No space left on device

WAL files accumulate when replication slots retain data for a disconnected replica or when wal_keep_size is set too high. Check for inactive replication slots:

SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active;

-- Drop an inactive slot to release WAL files
SELECT pg_drop_replication_slot('old_replica_slot');

Best Practices

  • Enable pg_stat_statements on every PostgreSQL instance. It has minimal overhead (typically under 1% CPU) and provides query-level visibility that is impossible to get any other way. There is no good reason to skip it.

  • Set log_min_duration_statement to capture slow queries. Start with 1000ms and lower it as you gain confidence. This single setting generates the most actionable data in your PostgreSQL logs.

  • Monitor connection counts as a percentage of max_connections, not as absolute numbers. An alert at 80% gives you time to react. An alert at 95% means you are already dropping connections.

  • Use PgBouncer for connection pooling in Node.js applications. Node.js applications tend to create many short-lived connections. The pg driver's built-in pool helps, but PgBouncer adds transaction-level pooling that dramatically reduces server-side connection overhead.

  • Set idle_in_transaction_session_timeout to a reasonable value like 5 minutes. Connections left in the idle-in-transaction state hold locks and block autovacuum. Automatically terminating them prevents cascading problems.

  • Track dead tuple ratios per table, not just globally. A single table with runaway bloat can drag down your entire database. Set per-table autovacuum thresholds for high-churn tables.

  • Run ANALYZE after bulk data loads. The query planner relies on table statistics. After inserting or updating more than 10% of a table's rows, the existing statistics are stale and queries may suddenly choose poor execution plans.

  • Keep your monitoring queries lightweight. The monitoring system should not become the performance problem. Use small connection pools (2-3 connections), avoid expensive joins in monitoring queries, and run collection at reasonable intervals (60 seconds is fine for most metrics).

  • Store metric history in a time-series database, not in PostgreSQL itself. Monitoring data that lives in the database you are monitoring creates a feedback loop. Use Prometheus, InfluxDB, or even a separate PostgreSQL instance dedicated to metrics.

  • Test your alerting before you need it. Deliberately set a threshold low enough to trigger, verify the alert fires, then set it back. An alerting system that has never fired is an alerting system you cannot trust.
