
PostgreSQL Replication and High Availability

An advanced guide to PostgreSQL replication and high availability, covering streaming replication, read/write splitting in Node.js, automated failover, WAL archiving, and managed service options.


Overview

PostgreSQL replication is the mechanism that copies data from one database server (the primary) to one or more other servers (standbys or replicas), enabling read scaling, disaster recovery, and zero-downtime maintenance. If your application cannot tolerate minutes of downtime during a database failure or during schema migrations, replication is not optional -- it is foundational infrastructure. This guide covers streaming replication setup, read/write splitting in Node.js, automated failover with Patroni, WAL-based recovery, and practical monitoring for production systems.

Prerequisites

  • PostgreSQL 15+ installed (two instances for primary/replica testing)
  • Node.js 18+ with the pg driver
  • Docker and Docker Compose (for the local testing environment)
  • Basic familiarity with PostgreSQL configuration files (postgresql.conf, pg_hba.conf)
  • Understanding of TCP networking and SSH

Install the Node.js dependencies used throughout this guide:

npm install pg express

Why Replication Matters

A single PostgreSQL instance is a single point of failure. I have watched production databases go down at 2 AM because of a failed disk controller, and without replication, that means your application is offline until someone manually restores from a backup. Backups alone are not enough -- they give you recovery, but recovery takes time. Replication gives you a live copy of your data that is ready to serve traffic within seconds.

There are three primary reasons to replicate:

Read Scaling. Your primary handles all writes. Replicas handle read queries. For read-heavy workloads (which most web applications are), this can multiply your effective throughput by the number of replicas you maintain. I have seen applications go from saturating a single instance at 80% CPU to running comfortably at 30% per node after adding two read replicas.

Disaster Recovery. When your primary dies -- and eventually it will -- a standby server already has a near-complete copy of your data. Failover to a warm standby takes seconds, not hours. The difference between "we were down for 3 seconds" and "we were down for 4 hours while we restored from last night's backup" is often the difference between keeping and losing customers.

Zero-Downtime Maintenance. Major version upgrades, OS patches, hardware replacements -- all of these traditionally require downtime. With replication, you can promote a standby, patch the old primary, re-add it as a standby, and switch back. Your application never goes offline.


Replication Types

PostgreSQL supports several replication mechanisms, each suited to different use cases.

Physical (Streaming) Replication

This is the workhorse. The primary streams its Write-Ahead Log (WAL) to standbys in real time. The standby replays these WAL records to maintain an exact byte-for-byte copy of the primary. Every database, every table, every index -- the entire cluster is replicated.

Streaming replication is what you want for high availability. It is simple, reliable, and fast. The standby can serve read-only queries while replaying WAL.

Logical Replication

Introduced in PostgreSQL 10, logical replication replicates logical row changes (decoded from the WAL) rather than shipping physical WAL blocks. You define publications (sets of tables) on the primary and subscriptions on the target. This allows selective replication -- you can replicate specific tables to specific targets, and the target can have its own tables, indexes, and even write operations.

Use logical replication when you need to replicate a subset of tables, replicate between different PostgreSQL major versions, or replicate to a database that also needs to accept writes.

Physical vs. Logical: When to Use Which

 Feature           | Physical (Streaming) | Logical
-------------------+----------------------+---------------------------
 Granularity       | Entire cluster       | Per-table
 Read-only standby | Yes                  | No (target is read-write)
 Cross-version     | No                   | Yes
 DDL replication   | Yes                  | No
 Failover capable  | Yes                  | Not directly
 Setup complexity  | Low                  | Medium

For high availability, use physical streaming replication. For data distribution, ETL pipelines, or cross-version migrations, use logical replication.


WAL (Write-Ahead Log) Fundamentals

Everything in PostgreSQL replication starts with the WAL. Before any data change is written to the actual data files (tables, indexes), it is first written to the WAL. This guarantees durability -- if the server crashes mid-write, PostgreSQL can replay the WAL on startup to recover to a consistent state.

WAL files are 16MB segments stored in pg_wal/ (formerly pg_xlog/). Each segment has a name like 000000010000000000000001. The primary generates WAL continuously as data changes occur.

Streaming replication works by shipping these WAL records from the primary to standbys in real time, over a TCP connection. The standby has a WAL receiver process that connects to the primary's WAL sender process. WAL records flow continuously, typically with sub-second latency on a local network.

-- See current WAL position on the primary
SELECT pg_current_wal_lsn();
--  pg_current_wal_lsn
-- --------------------
--  0/3A0028D0

-- See WAL position on a standby
SELECT pg_last_wal_replay_lsn();
--  pg_last_wal_replay_lsn
-- ------------------------
--  0/3A0028D0

The difference between the primary's current WAL LSN (Log Sequence Number) and the standby's replayed LSN is your replication lag.
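
To turn those two values into a number you can alert on, query each server and let the primary compute the byte difference with pg_wal_lsn_diff(). A minimal sketch using the Node.js pg client (hosts and credentials are placeholders):

var pg = require('pg');

var primary = new pg.Client({ host: '10.0.1.10', database: 'myapp', user: 'appuser', password: 'password' });
var standby = new pg.Client({ host: '10.0.1.11', database: 'myapp', user: 'appuser', password: 'password' });

Promise.all([primary.connect(), standby.connect()])
  .then(function() {
    return Promise.all([
      primary.query('SELECT pg_current_wal_lsn() AS lsn'),
      standby.query('SELECT pg_last_wal_replay_lsn() AS lsn')
    ]);
  })
  .then(function(results) {
    // Let the primary compute the byte gap between the two positions
    return primary.query(
      'SELECT pg_wal_lsn_diff($1, $2) AS lag_bytes',
      [results[0].rows[0].lsn, results[1].rows[0].lsn]
    );
  })
  .then(function(result) {
    console.log('Replication lag: ' + result.rows[0].lag_bytes + ' bytes');
  })
  .then(function() {
    return Promise.all([primary.end(), standby.end()]);
  })
  .catch(function(err) {
    console.error('Lag check failed:', err.message);
  });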


Setting Up Streaming Replication

This is the step-by-step process I use for setting up a primary and one standby. I will use PostgreSQL 16 paths and syntax.

Primary Configuration

First, create a replication user on the primary:

CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'strong_password_here';

Edit postgresql.conf on the primary:

# Enable WAL shipping
wal_level = replica
max_wal_senders = 5
wal_keep_size = 1GB

# Optional: enable WAL archiving for PITR
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/wal_archive/%f'

# Listen on all interfaces (or specific IP)
listen_addresses = '*'

Edit pg_hba.conf to allow replication connections from the standby:

# TYPE  DATABASE        USER          ADDRESS           METHOD
host    replication     replicator    10.0.1.0/24       scram-sha-256

Restart the primary:

sudo systemctl restart postgresql

Standby Configuration

Stop PostgreSQL on the standby server, then wipe its data directory and create a base backup from the primary:

sudo systemctl stop postgresql
sudo rm -rf /var/lib/postgresql/16/main/*

sudo -u postgres pg_basebackup \
  -h 10.0.1.10 \
  -U replicator \
  -D /var/lib/postgresql/16/main \
  -Fp -Xs -P -R

The -R flag is critical -- it creates the standby.signal file and writes the primary connection info to postgresql.auto.conf. Without it, the standby will start as a standalone primary.

Verify the standby configuration was created:

cat /var/lib/postgresql/16/main/postgresql.auto.conf
# primary_conninfo = 'user=replicator password=strong_password_here host=10.0.1.10 port=5432'

Optionally tune standby-specific settings in postgresql.conf:

# Allow read queries on the standby
hot_standby = on

# Performance tuning for replay
max_standby_streaming_delay = 30s
hot_standby_feedback = on

Start the standby:

sudo systemctl start postgresql

Verify Replication

On the primary, check connected standbys:

SELECT client_addr, state, sent_lsn, replay_lsn,
       (sent_lsn - replay_lsn) AS byte_lag
FROM pg_stat_replication;

Expected output:

 client_addr  |   state   |  sent_lsn   | replay_lsn  | byte_lag
--------------+-----------+-------------+-------------+----------
 10.0.1.11    | streaming | 0/3A003F80  | 0/3A003F80  |        0

A byte_lag of 0 means the standby is fully caught up.


Synchronous vs. Asynchronous Replication

By default, streaming replication is asynchronous: the primary commits a transaction and immediately acknowledges it to the client, without waiting for the standby to confirm receipt. This is fast but introduces a window where data exists only on the primary.

Asynchronous (default):

  • Primary commits instantly
  • Standby may lag behind by milliseconds to seconds
  • If the primary fails, the most recent transactions may be lost
  • Negligible performance impact

Synchronous:

  • Primary waits for at least one standby to confirm WAL receipt before acknowledging the commit
  • Zero data loss guarantee
  • Adds network round-trip latency to every write operation
  • If the synchronous standby goes down, the primary stalls

To enable synchronous replication on the primary:

# postgresql.conf on primary
synchronous_commit = on
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'

Each standby identifies itself via its application_name in its primary_conninfo:

# postgresql.auto.conf on standby
primary_conninfo = 'user=replicator password=... host=10.0.1.10 application_name=standby1'

My recommendation: use asynchronous replication for most applications. The sub-second lag is acceptable for read replicas, and the performance cost of synchronous writes is significant. Reserve synchronous replication for financial systems or any workload where losing even one committed transaction is unacceptable.
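
Note also that synchronous_commit can be set per transaction, so a cluster configured for synchronous replication can still fast-path low-value writes while critical writes keep the full guarantee. A sketch using the pg client (the audit_events table is a hypothetical example):

var pg = require('pg');

var pool = new pg.Pool({ host: '10.0.1.10', database: 'myapp', user: 'appuser', password: 'password' });

// Record a low-value event without waiting for the synchronous standby.
// SET LOCAL only lasts for this transaction; every other transaction keeps
// the cluster-wide synchronous guarantee.
function insertAuditEvent(payload) {
  return pool.connect().then(function(client) {
    return client.query('BEGIN')
      .then(function() {
        return client.query("SET LOCAL synchronous_commit = 'local'");
      })
      .then(function() {
        // audit_events is a made-up table used only for this example
        return client.query('INSERT INTO audit_events (payload) VALUES ($1)', [payload]);
      })
      .then(function() {
        return client.query('COMMIT');
      })
      .catch(function(err) {
        return client.query('ROLLBACK').catch(function() {}).then(function() {
          throw err;
        });
      })
      .then(function(result) {
        client.release();
        return result;
      }, function(err) {
        client.release();
        throw err;
      });
  });
}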


Read Replicas for Load Distribution

Once you have streaming replication running, your standbys can serve read-only queries. This is the simplest and most impactful scaling strategy for read-heavy applications.

The architecture is straightforward:

  • All INSERT, UPDATE, DELETE, and DDL statements go to the primary
  • All SELECT statements go to one or more replicas
  • Your application layer routes queries accordingly

This is not automatic. PostgreSQL does not route queries for you. Your application must maintain separate connection pools and decide where to send each query.


Connection Routing in Node.js

Here is how I implement read/write splitting in Node.js using the pg module. The key is maintaining two connection pools -- one for the primary, one for the replica -- and exposing a data access layer that routes queries based on whether they mutate data.

var pg = require('pg');

var primaryPool = new pg.Pool({
  host: process.env.PG_PRIMARY_HOST || 'localhost',
  port: parseInt(process.env.PG_PRIMARY_PORT || '5432'),
  database: process.env.PG_DATABASE || 'myapp',
  user: process.env.PG_USER || 'appuser',
  password: process.env.PG_PASSWORD || 'password',
  max: 20,
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 5000
});

var replicaPool = new pg.Pool({
  host: process.env.PG_REPLICA_HOST || 'localhost',
  port: parseInt(process.env.PG_REPLICA_PORT || '5433'),
  database: process.env.PG_DATABASE || 'myapp',
  user: process.env.PG_USER || 'appuser',
  password: process.env.PG_PASSWORD || 'password',
  max: 30,  // More connections for reads
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 5000
});

// Error handling for both pools
primaryPool.on('error', function(err) {
  console.error('Primary pool error:', err.message);
});

replicaPool.on('error', function(err) {
  console.error('Replica pool error:', err.message);
});

/**
 * Execute a read query against the replica.
 * Falls back to primary if the replica is unavailable.
 */
function queryRead(text, params) {
  return replicaPool.query(text, params).catch(function(err) {
    console.warn('Replica query failed, falling back to primary:', err.message);
    return primaryPool.query(text, params);
  });
}

/**
 * Execute a write query against the primary.
 */
function queryWrite(text, params) {
  return primaryPool.query(text, params);
}

/**
 * Execute a transaction (always on the primary).
 */
function transaction(callback) {
  return primaryPool.connect().then(function(client) {
    return client.query('BEGIN')
      .then(function() {
        return callback(client);
      })
      .then(function(result) {
        return client.query('COMMIT').then(function() {
          client.release();
          return result;
        });
      })
      .catch(function(err) {
        // Release the client even if ROLLBACK itself fails (e.g. the
        // connection died), and surface the original error.
        return client.query('ROLLBACK').catch(function() {}).then(function() {
          client.release();
          throw err;
        });
      });
  });
}

module.exports = {
  queryRead: queryRead,
  queryWrite: queryWrite,
  transaction: transaction,
  primaryPool: primaryPool,
  replicaPool: replicaPool
};

Note the fallback in queryRead. If the replica is down, reads automatically route to the primary. This keeps your application running during replica maintenance.


Connection Pooling with PgBouncer

Direct connections from Node.js to PostgreSQL are fine for small applications. At scale, you want a connection pooler between your application and PostgreSQL. PgBouncer is the standard choice -- it is lightweight, battle-tested, and used by virtually every large PostgreSQL deployment.

A typical production architecture:

App Server 1 ─┐
App Server 2 ──┼──> PgBouncer (primary) ──> PostgreSQL Primary
App Server 3 ─┘

App Server 1 ─┐
App Server 2 ──┼──> PgBouncer (replica) ──> PostgreSQL Replica
App Server 3 ─┘

PgBouncer configuration (pgbouncer.ini):

[databases]
myapp_primary = host=10.0.1.10 port=5432 dbname=myapp
myapp_replica = host=10.0.1.11 port=5432 dbname=myapp

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
min_pool_size = 5
reserve_pool_size = 5
server_lifetime = 3600
server_idle_timeout = 600
log_connections = 0
log_disconnections = 0

Use pool_mode = transaction -- this means a server connection is assigned to a client only for the duration of a transaction, then returned to the pool. This is dramatically more efficient than session mode for web applications where most connections are idle between requests.
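
From Node.js, routing through PgBouncer is only a host and port change, but transaction mode has a caveat worth remembering: session state (SET, LISTEN/NOTIFY, session-level prepared statements) does not survive across transactions, because each one may run on a different server connection. A sketch, assuming PgBouncer is reachable at 10.0.1.20:6432 (a placeholder address) with the database aliases from pgbouncer.ini above:

var pg = require('pg');

// Connect to PgBouncer, not PostgreSQL directly. 'myapp_primary' and
// 'myapp_replica' are the aliases defined in pgbouncer.ini.
var primaryPool = new pg.Pool({
  host: '10.0.1.20',          // hypothetical PgBouncer host
  port: 6432,
  database: 'myapp_primary',
  user: 'appuser',
  password: 'password',
  max: 50                     // PgBouncer multiplexes these onto default_pool_size server connections
});

var replicaPool = new pg.Pool({
  host: '10.0.1.20',
  port: 6432,
  database: 'myapp_replica',
  user: 'appuser',
  password: 'password',
  max: 100
});

// In transaction mode, keep any session state inside a transaction
// (SET LOCAL), because a plain SET can leak onto a server connection that
// another client picks up next.
primaryPool.query('SELECT 1').then(function() {
  console.log('Connected to the primary through PgBouncer');
});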

pgpool-II is an alternative that offers query-based routing (automatically sending SELECT to replicas and writes to the primary). However, it is more complex to configure, heavier at runtime, and introduces its own failure modes. I prefer handling routing in the application layer where I have full control.


Logical Replication for Selective Table Replication

When you need to replicate only specific tables -- for example, shipping product catalog data to an analytics database, or replicating user accounts across regions -- logical replication is the right tool.

On the primary (publisher):

-- Create a publication for specific tables
CREATE PUBLICATION product_pub FOR TABLE products, categories, inventory;

-- Or replicate all tables
CREATE PUBLICATION all_tables_pub FOR ALL TABLES;

On the target (subscriber):

-- Create the target tables first (schema is NOT replicated)
-- Then create a subscription
CREATE SUBSCRIPTION product_sub
  CONNECTION 'host=10.0.1.10 dbname=myapp user=replicator password=...'
  PUBLICATION product_pub;

Key differences from streaming replication:

  • The subscriber is a fully read-write database
  • You must create the table schemas on the subscriber manually
  • DDL changes (ALTER TABLE, CREATE INDEX) are not replicated
  • Each table must have a primary key or REPLICA IDENTITY set
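
If a table you want to publish has no primary key, set a replica identity first -- otherwise UPDATE and DELETE on it will be rejected on the publisher. A sketch run against the publisher with the pg client (legacy_events is a hypothetical table; product_pub is the publication created above):

var pg = require('pg');

var publisher = new pg.Client({ host: '10.0.1.10', database: 'myapp', user: 'appuser', password: 'password' });

publisher.connect()
  .then(function() {
    // FULL logs the whole old row for every UPDATE/DELETE. If the table has a
    // unique NOT NULL index, REPLICA IDENTITY USING INDEX <index> is cheaper.
    return publisher.query('ALTER TABLE legacy_events REPLICA IDENTITY FULL');
  })
  .then(function() {
    return publisher.query('ALTER PUBLICATION product_pub ADD TABLE legacy_events');
  })
  .then(function() {
    // The subscriber then needs: ALTER SUBSCRIPTION product_sub REFRESH PUBLICATION;
    console.log('legacy_events is now published');
    return publisher.end();
  })
  .catch(function(err) {
    console.error('Publication change failed:', err.message);
    return publisher.end();
  });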

Check subscription status:

SELECT subname, received_lsn, latest_end_lsn,
       received_lsn - latest_end_lsn AS lag_bytes
FROM pg_stat_subscription;

Monitoring Replication Lag

Replication lag is the single most important metric for a replicated PostgreSQL deployment. If your replica is 30 seconds behind and you fail over to it, you lose 30 seconds of committed data.

From the Primary

SELECT
  application_name,
  client_addr,
  state,
  sync_state,
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
  write_lag,
  flush_lag,
  replay_lag
FROM pg_stat_replication;

Realistic output from a healthy system:

 application_name | client_addr |   state   | sync_state | sent_lsn    | write_lsn   | flush_lsn   | replay_lsn  | replay_lag_bytes | write_lag | flush_lag | replay_lag
------------------+-------------+-----------+------------+-------------+-------------+-------------+-------------+------------------+-----------+-----------+------------
 standby1         | 10.0.1.11   | streaming | async      | 0/4E002A38  | 0/4E002A38  | 0/4E002A38  | 0/4E002A38  |                0 | 00:00:00  | 00:00:00  | 00:00:00
 standby2         | 10.0.1.12   | streaming | async      | 0/4E002A38  | 0/4E002A38  | 0/4E002A38  | 0/4E001B60  |             3800 | 00:00:00  | 00:00:00  | 00:00:00.003

The replay_lag column shows the time lag directly. Anything under 1 second is healthy for most applications.

From the Replica

SELECT
  now() - pg_last_xact_replay_timestamp() AS replication_delay,
  pg_last_wal_receive_lsn(),
  pg_last_wal_replay_lsn(),
  pg_wal_lsn_diff(
    pg_last_wal_receive_lsn(),
    pg_last_wal_replay_lsn()
  ) AS replay_lag_bytes;

Alerting on Lag

Set up monitoring that alerts when replication lag exceeds your threshold. Here is a simple Node.js health check:

var db = require('./db');

function checkReplicationLag(thresholdMs) {
  var query = "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) * 1000 AS lag_ms";
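  // Caveat: on a quiet primary this value grows even when there is no real
  // lag, because pg_last_xact_replay_timestamp() only advances when new
  // transactions are replayed. Compare WAL receive/replay LSNs as well if
  // your workload has idle periods.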

  return db.replicaPool.query(query).then(function(result) {
    var lagMs = parseFloat(result.rows[0].lag_ms);

    if (lagMs > thresholdMs) {
      console.error('ALERT: Replication lag is ' + lagMs + 'ms (threshold: ' + thresholdMs + 'ms)');
      return { healthy: false, lagMs: lagMs };
    }

    return { healthy: true, lagMs: lagMs };
  }).catch(function(err) {
    console.error('ALERT: Cannot reach replica:', err.message);
    return { healthy: false, lagMs: -1, error: err.message };
  });
}

// Check every 10 seconds, alert if lag exceeds 5 seconds
setInterval(function() {
  checkReplicationLag(5000);
}, 10000);

Automated Failover with Patroni

Manual failover is fine for planned maintenance. For unplanned failures at 2 AM, you need automated failover. Patroni is the best tool for this job. It is a Python-based HA manager that uses a distributed consensus store (etcd, Consul, or ZooKeeper) to manage cluster state and perform automatic failover.

How Patroni Works

  1. Patroni runs as a daemon on each PostgreSQL node
  2. Each node registers itself with the consensus store (etcd)
  3. One node holds the leader lock -- that node is the primary
  4. Patroni continuously monitors the primary's health
  5. If the primary becomes unreachable, the remaining nodes perform a leader election
  6. The winning standby is promoted to primary
  7. Other standbys reconfigure to follow the new primary

Patroni Configuration

scope: myapp-cluster
name: node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.10:8008

etcd3:
  hosts: 10.0.1.100:2379,10.0.1.101:2379,10.0.1.102:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        max_wal_senders: 5
        max_replication_slots: 5
        hot_standby: on
        wal_keep_size: 1GB
  initdb:
    - encoding: UTF8
    - data-checksums

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.10:5432
  data_dir: /var/lib/postgresql/16/main
  authentication:
    superuser:
      username: postgres
      password: postgres_password
    replication:
      username: replicator
      password: replicator_password
  parameters:
    shared_buffers: 2GB
    effective_cache_size: 6GB
    work_mem: 64MB

Connecting Through Patroni

With Patroni, you do not connect directly to a specific PostgreSQL node. Instead, you put a load balancer (HAProxy) or a DNS-based service discovery layer in front of the cluster. HAProxy is the most common choice:

frontend postgresql_primary
    bind *:5000
    mode tcp
    default_backend primary_backend

backend primary_backend
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    server node1 10.0.1.10:5432 check port 8008
    server node2 10.0.1.11:5432 check port 8008
    server node3 10.0.1.12:5432 check port 8008

frontend postgresql_replicas
    bind *:5001
    mode tcp
    default_backend replica_backend

backend replica_backend
    mode tcp
    option httpchk GET /replica
    http-check expect status 200
    balance roundrobin
    server node1 10.0.1.10:5432 check port 8008
    server node2 10.0.1.11:5432 check port 8008
    server node3 10.0.1.12:5432 check port 8008

HAProxy health-checks each node's Patroni REST API. Only the current primary returns 200 for /primary. All standbys return 200 for /replica. Failover is transparent to your application.
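
The same health endpoints are easy to query from tooling or smoke tests. A sketch using Node 18's built-in fetch against the Patroni REST API (node addresses as in the example config above):

var nodes = ['10.0.1.10', '10.0.1.11', '10.0.1.12'];

// Patroni answers GET /primary with 200 only on the current leader and 503
// elsewhere; GET /replica behaves the same way for healthy standbys.
function findPrimary() {
  return Promise.all(nodes.map(function(host) {
    return fetch('http://' + host + ':8008/primary')
      .then(function(res) { return res.status === 200 ? host : null; })
      .catch(function() { return null; });
  })).then(function(results) {
    return results.find(function(host) { return host !== null; }) || null;
  });
}

findPrimary().then(function(host) {
  console.log('Current primary: ' + (host || 'none reachable'));
});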

repmgr as an Alternative

repmgr is an older, simpler tool for PostgreSQL replication management and failover. It works without a consensus store (no etcd needed) and uses a witness node pattern instead. For smaller deployments where running etcd feels like overkill, repmgr is a solid choice. However, Patroni has become the de facto standard for production HA, and its integration with Kubernetes (via the Zalando PostgreSQL operator) makes it the better long-term investment.


Point-in-Time Recovery (PITR) with WAL Archiving

Replication protects you from server failure, but it does not protect you from bad SQL. If someone runs DELETE FROM orders WHERE 1=1; at 3:15 PM, that statement replicates to all standbys instantly. All your replicas now have the same missing data.

PITR lets you restore your database to any specific moment in time by combining a base backup with archived WAL files.

Enable WAL Archiving

On the primary:

# postgresql.conf
archive_mode = on
archive_command = 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f'
archive_timeout = 60  # Force archive every 60 seconds even if WAL not full
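
Archiving failures are easy to miss: a broken archive_command just logs and retries while WAL piles up in pg_wal/. Check pg_stat_archiver alongside replication lag; a minimal check from Node.js (pool settings are placeholders):

var pg = require('pg');

var primaryPool = new pg.Pool({ host: '10.0.1.10', database: 'myapp', user: 'appuser', password: 'password' });

function checkArchiver() {
  var query =
    'SELECT archived_count, failed_count, ' +
    '       last_archived_wal, last_archived_time, last_failed_wal ' +
    'FROM pg_stat_archiver';

  return primaryPool.query(query).then(function(result) {
    var stats = result.rows[0];
    // A growing failed_count, or a last_archived_time that stops advancing
    // while writes continue, means your PITR coverage has a gap.
    if (parseInt(stats.failed_count, 10) > 0) {
      console.error('WAL archiving failures; last failed segment: ' + stats.last_failed_wal);
    }
    return stats;
  });
}

checkArchiver();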

For production, archive to object storage (S3, Spaces) instead of local disk. Tools like pgBackRest or WAL-G handle this with compression and encryption:

# WAL-G configuration for DigitalOcean Spaces
export WALG_S3_PREFIX=s3://my-bucket/wal-archive
export AWS_ENDPOINT=https://nyc3.digitaloceanspaces.com
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret

# Take a base backup
wal-g backup-push /var/lib/postgresql/16/main

# List backups
wal-g backup-list

Performing PITR

# Stop PostgreSQL
sudo systemctl stop postgresql

# Restore base backup
wal-g backup-fetch /var/lib/postgresql/16/main LATEST

# Create recovery configuration
cat > /var/lib/postgresql/16/main/postgresql.auto.conf << 'EOF'
restore_command = 'wal-g wal-fetch %f %p'
recovery_target_time = '2026-02-07 15:14:00 UTC'
recovery_target_action = 'promote'
EOF

# Create recovery signal
touch /var/lib/postgresql/16/main/recovery.signal

# Start PostgreSQL -- it will replay WAL up to the target time
sudo systemctl start postgresql

The database replays all WAL records up to 15:14:00, one minute before the accidental DELETE. Your data is restored.


High Availability on Managed Services

If you are running on a managed platform, replication and failover are handled for you -- to a degree.

DigitalOcean Managed Databases

DigitalOcean provides managed PostgreSQL with built-in HA:

  • Standby nodes: Add standby nodes that automatically fail over if the primary goes down. Failover takes 15-30 seconds.
  • Read replicas: Add read-only replicas with separate connection strings. You still implement read/write splitting in your application.
  • Automatic backups: Daily backups with 7-day retention. PITR is available on production plans.
  • Connection pooling: Built-in PgBouncer with configurable pool modes.

Your .do/app.yaml for an App Platform deployment with separate connection strings:

envs:
  - key: PG_PRIMARY_HOST
    value: ${db-primary.HOSTNAME}
  - key: PG_PRIMARY_PORT
    value: ${db-primary.PORT}
  - key: PG_REPLICA_HOST
    value: ${db-replica.HOSTNAME}
  - key: PG_REPLICA_PORT
    value: ${db-replica.PORT}

AWS RDS

RDS Multi-AZ provides synchronous replication to a standby in a different availability zone. Failover is automatic and takes 60-120 seconds. Read replicas are asynchronous and can be in different regions.

Azure Database for PostgreSQL

Azure's Flexible Server offers zone-redundant HA with automatic failover. Read replicas support up to 5 replicas with configurable regions.

The critical point with all managed services: you still own the application-level routing. The managed service gives you separate endpoints for primary and replica. Your application code must direct writes to the primary endpoint and reads to the replica endpoint. The db.js module I showed earlier works identically regardless of whether you are running self-managed or managed PostgreSQL.


Testing Your Failover Strategy

An untested failover plan is not a plan. Here is what I test quarterly:

  1. Simulate primary failure: Stop the primary PostgreSQL process (not a graceful shutdown -- kill -9 it). Verify that Patroni or your managed service promotes the standby within your SLA window.

  2. Verify data integrity: After failover, run checksums against a known dataset. Ensure no transactions were lost (or that the loss is within your acceptable window for async replication).

  3. Test application behavior: Does your application reconnect automatically? Does it correctly detect the new primary? Connection pool libraries usually handle reconnection, but verify it.

  4. Measure failover duration: Time the entire process from primary failure to the application successfully writing to the new primary.

  5. Test failback: After promoting the standby, can you reintroduce the old primary as a new standby? With pg_rewind, this should be possible without a full base backup.

# Test: kill the primary PostgreSQL process
sudo kill -9 $(head -1 /var/lib/postgresql/16/main/postmaster.pid)

# Watch Patroni logs on the standby
journalctl -u patroni -f
# Expected: "promoted self to leader by acquiring session lock"

# Verify from the application
curl http://localhost:3000/health
# Expected: {"status":"ok","primary":"connected","replica":"connected"}
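
For point 3 in the checklist above, the promotion window shows up in the application as a short burst of connection errors on writes. A small retry wrapper with backoff is usually enough to bridge it; a sketch you could wrap around queryWrite from the routing module earlier (retry counts, delays, and error codes are illustrative):

// Retry a write across a failover window. Only connection-level failures are
// retried; genuine SQL errors surface immediately.
function withWriteRetry(writeFn, maxAttempts) {
  maxAttempts = maxAttempts || 5;
  var transientCodes = ['ECONNREFUSED', 'ECONNRESET', 'ETIMEDOUT', '57P01', '57P03'];

  function attempt(n) {
    return writeFn().catch(function(err) {
      var transient = transientCodes.indexOf(err.code) !== -1;
      if (!transient || n + 1 >= maxAttempts) {
        throw err;
      }
      var delayMs = Math.min(500 * Math.pow(2, n), 5000);
      return new Promise(function(resolve) { setTimeout(resolve, delayMs); })
        .then(function() { return attempt(n + 1); });
    });
  }

  return attempt(0);
}

// Usage with the routing module from earlier:
// withWriteRetry(function() {
//   return db.queryWrite('INSERT INTO users (email, name) VALUES ($1, $2)', [userEmail, userName]);
// });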

Backup Strategies That Complement Replication

Replication is not a backup. Say it again. Replication is not a backup. If you drop a table on the primary, that DROP replicates to every standby. Replication protects against hardware failure. Backups protect against human error.

A complete strategy includes:

  • Continuous WAL archiving to object storage (S3, Spaces)
  • Daily base backups via pgBackRest or WAL-G
  • Retention policy: Keep 7 daily, 4 weekly, and 12 monthly backups
  • Regular restore testing: Restore a backup to a test server at least monthly
  • Cross-region copies: Store backups in a different region than your primary

# pgBackRest daily backup cron
0 2 * * * pgbackrest --stanza=myapp --type=diff backup

# Weekly full backup
0 3 * * 0 pgbackrest --stanza=myapp --type=full backup

# Verify backup integrity
pgbackrest --stanza=myapp check

Complete Working Example

Here is a complete Node.js Express application with a read/write splitting data access layer, plus a Docker Compose file for local testing with a primary and replica.

Docker Compose: Primary + Replica

version: '3.8'

services:
  pg-primary:
    image: postgres:16
    container_name: pg-primary
    environment:
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD: apppassword
      POSTGRES_DB: myapp
    ports:
      - "5432:5432"
    volumes:
      - ./docker/primary/init.sh:/docker-entrypoint-initdb.d/init.sh
      - primary_data:/var/lib/postgresql/data
    command: >
      postgres
        -c wal_level=replica
        -c max_wal_senders=5
        -c max_replication_slots=5
        -c hot_standby=on
        -c wal_keep_size=256MB
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d myapp"]
      interval: 5s
      timeout: 3s
      retries: 5

  pg-replica:
    image: postgres:16
    container_name: pg-replica
    # Run our custom entrypoint (and postgres itself) as the postgres user;
    # the stock image entrypoint, which normally drops root privileges, is
    # overridden below.
    user: postgres
    environment:
      PGUSER: replicator
      PGPASSWORD: replicator_pass
    ports:
      - "5433:5432"
    volumes:
      - replica_data:/var/lib/postgresql/data
    depends_on:
      pg-primary:
        condition: service_healthy
    entrypoint: |
      bash -c "
      # Wait for primary
      until pg_isready -h pg-primary -U appuser; do sleep 1; done

      # If data directory is empty, take a base backup
      if [ ! -s /var/lib/postgresql/data/PG_VERSION ]; then
        rm -rf /var/lib/postgresql/data/*
        pg_basebackup -h pg-primary -U replicator -D /var/lib/postgresql/data -Fp -Xs -P -R
        echo 'hot_standby = on' >> /var/lib/postgresql/data/postgresql.auto.conf
      fi

      # postgres refuses to start if the data directory is not 0700/0750
      chmod 0700 /var/lib/postgresql/data

      exec postgres
      "

volumes:
  primary_data:
  replica_data:

Primary Init Script

Save as docker/primary/init.sh:

#!/bin/bash
set -e

psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
    CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'replicator_pass';

    CREATE TABLE users (
        id SERIAL PRIMARY KEY,
        email VARCHAR(255) UNIQUE NOT NULL,
        name VARCHAR(255) NOT NULL,
        created_at TIMESTAMP DEFAULT NOW()
    );

    CREATE TABLE posts (
        id SERIAL PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        title VARCHAR(500) NOT NULL,
        body TEXT,
        published BOOLEAN DEFAULT false,
        created_at TIMESTAMP DEFAULT NOW()
    );

    INSERT INTO users (email, name) VALUES
        ('alice@example.com', 'Alice Johnson'),
        ('bob@example.com', 'Bob Smith');

    INSERT INTO posts (user_id, title, body, published) VALUES
        (1, 'Getting Started with PostgreSQL', 'A beginner guide...', true),
        (2, 'Advanced Indexing Strategies', 'Deep dive into indexes...', true);
EOSQL

# Allow replication connections
echo "host replication replicator all scram-sha-256" >> "$PGDATA/pg_hba.conf"

Node.js Application

db.js -- Data access layer with read/write splitting:

var pg = require('pg');

var config = {
  primary: {
    host: process.env.PG_PRIMARY_HOST || 'localhost',
    port: parseInt(process.env.PG_PRIMARY_PORT || '5432'),
    database: process.env.PG_DATABASE || 'myapp',
    user: process.env.PG_USER || 'appuser',
    password: process.env.PG_PASSWORD || 'apppassword',
    max: 20,
    idleTimeoutMillis: 30000,
    connectionTimeoutMillis: 5000
  },
  replica: {
    host: process.env.PG_REPLICA_HOST || 'localhost',
    port: parseInt(process.env.PG_REPLICA_PORT || '5433'),
    database: process.env.PG_DATABASE || 'myapp',
    user: process.env.PG_USER || 'appuser',
    password: process.env.PG_PASSWORD || 'apppassword',
    max: 30,
    idleTimeoutMillis: 30000,
    connectionTimeoutMillis: 5000
  }
};

var primaryPool = new pg.Pool(config.primary);
var replicaPool = new pg.Pool(config.replica);

var replicaHealthy = true;

primaryPool.on('error', function(err) {
  console.error('[DB] Primary pool error:', err.message);
});

replicaPool.on('error', function(err) {
  console.error('[DB] Replica pool error:', err.message);
  replicaHealthy = false;
});

function read(text, params) {
  if (!replicaHealthy) {
    return primaryPool.query(text, params);
  }

  return replicaPool.query(text, params).catch(function(err) {
    console.warn('[DB] Replica read failed, falling back to primary:', err.message);
    replicaHealthy = false;

    // Re-check replica health every 30 seconds
    setTimeout(function() {
      replicaPool.query('SELECT 1').then(function() {
        console.log('[DB] Replica is back online');
        replicaHealthy = true;
      }).catch(function() {
        replicaHealthy = false;
      });
    }, 30000);

    return primaryPool.query(text, params);
  });
}

function write(text, params) {
  return primaryPool.query(text, params);
}

function transaction(callback) {
  return primaryPool.connect().then(function(client) {
    return client.query('BEGIN')
      .then(function() {
        return callback(client);
      })
      .then(function(result) {
        return client.query('COMMIT').then(function() {
          client.release();
          return result;
        });
      })
      .catch(function(err) {
        // Release the client even if ROLLBACK itself fails (e.g. the
        // connection died), and surface the original error.
        return client.query('ROLLBACK').catch(function() {}).then(function() {
          client.release();
          throw err;
        });
      });
  });
}

function getReplicationStatus() {
  return primaryPool.query(
    "SELECT application_name, client_addr, state, " +
    "pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes, " +
    "replay_lag " +
    "FROM pg_stat_replication"
  ).then(function(result) {
    return result.rows;
  });
}

function healthCheck() {
  var primaryOk = false;
  var replicaOk = false;
  var lagBytes = null;

  return primaryPool.query('SELECT 1')
    .then(function() {
      primaryOk = true;
      return replicaPool.query('SELECT 1');
    })
    .then(function() {
      replicaOk = true;
      return getReplicationStatus();
    })
    .then(function(status) {
      if (status.length > 0) {
        lagBytes = parseInt(status[0].lag_bytes || '0');
      }
      return {
        status: primaryOk ? 'ok' : 'degraded',
        primary: primaryOk ? 'connected' : 'disconnected',
        replica: replicaOk ? 'connected' : 'disconnected',
        replicationLagBytes: lagBytes
      };
    })
    .catch(function(err) {
      return {
        status: 'degraded',
        primary: primaryOk ? 'connected' : 'disconnected',
        replica: replicaOk ? 'connected' : 'disconnected',
        error: err.message
      };
    });
}

module.exports = {
  read: read,
  write: write,
  transaction: transaction,
  healthCheck: healthCheck,
  getReplicationStatus: getReplicationStatus,
  primaryPool: primaryPool,
  replicaPool: replicaPool
};

app.js -- Express application:

var express = require('express');
var db = require('./db');

var app = express();
app.use(express.json());

// Health check endpoint (includes replication status)
app.get('/health', function(req, res) {
  db.healthCheck().then(function(status) {
    var statusCode = status.status === 'ok' ? 200 : 503;
    res.status(statusCode).json(status);
  });
});

// READ: Get all users (routed to replica)
app.get('/users', function(req, res) {
  db.read('SELECT id, email, name, created_at FROM users ORDER BY id')
    .then(function(result) {
      res.json(result.rows);
    })
    .catch(function(err) {
      res.status(500).json({ error: err.message });
    });
});

// READ: Get user by ID (routed to replica)
app.get('/users/:id', function(req, res) {
  db.read('SELECT id, email, name, created_at FROM users WHERE id = $1', [req.params.id])
    .then(function(result) {
      if (result.rows.length === 0) {
        return res.status(404).json({ error: 'User not found' });
      }
      res.json(result.rows[0]);
    })
    .catch(function(err) {
      res.status(500).json({ error: err.message });
    });
});

// WRITE: Create a user (routed to primary)
app.post('/users', function(req, res) {
  var email = req.body.email;
  var name = req.body.name;

  db.write(
    'INSERT INTO users (email, name) VALUES ($1, $2) RETURNING id, email, name, created_at',
    [email, name]
  ).then(function(result) {
    res.status(201).json(result.rows[0]);
  }).catch(function(err) {
    res.status(500).json({ error: err.message });
  });
});

// WRITE: Create a post with transaction (routed to primary)
app.post('/posts', function(req, res) {
  var userId = req.body.user_id;
  var title = req.body.title;
  var body = req.body.body;

  db.transaction(function(client) {
    return client.query(
      'SELECT id FROM users WHERE id = $1', [userId]
    ).then(function(userResult) {
      if (userResult.rows.length === 0) {
        throw new Error('User not found');
      }
      return client.query(
        'INSERT INTO posts (user_id, title, body) VALUES ($1, $2, $3) RETURNING *',
        [userId, title, body]
      );
    });
  }).then(function(result) {
    res.status(201).json(result.rows[0]);
  }).catch(function(err) {
    var statusCode = err.message === 'User not found' ? 404 : 500;
    res.status(statusCode).json({ error: err.message });
  });
});

// READ: Get posts with user info (routed to replica)
app.get('/posts', function(req, res) {
  var query = [
    'SELECT p.id, p.title, p.body, p.published, p.created_at,',
    '  u.name AS author_name, u.email AS author_email',
    'FROM posts p',
    'JOIN users u ON u.id = p.user_id',
    'WHERE p.published = true',
    'ORDER BY p.created_at DESC'
  ].join('\n');

  db.read(query).then(function(result) {
    res.json(result.rows);
  }).catch(function(err) {
    res.status(500).json({ error: err.message });
  });
});

// Replication status endpoint (admin)
app.get('/admin/replication', function(req, res) {
  db.getReplicationStatus().then(function(status) {
    res.json(status);
  }).catch(function(err) {
    res.status(500).json({ error: err.message });
  });
});

var port = process.env.PORT || 3000;
app.listen(port, function() {
  console.log('Server running on port ' + port);
  console.log('Primary: ' + (process.env.PG_PRIMARY_HOST || 'localhost') + ':' + (process.env.PG_PRIMARY_PORT || '5432'));
  console.log('Replica: ' + (process.env.PG_REPLICA_HOST || 'localhost') + ':' + (process.env.PG_REPLICA_PORT || '5433'));
});

Running the Complete Example

# Start PostgreSQL primary and replica
docker compose up -d

# Wait for replication to establish
sleep 10

# Verify replication is working
docker exec pg-primary psql -U appuser -d myapp \
  -c "SELECT client_addr, state, replay_lag FROM pg_stat_replication;"

# Install and start the Node.js app
npm install pg express
node app.js

# Test write (goes to primary)
curl -X POST http://localhost:3000/users \
  -H "Content-Type: application/json" \
  -d '{"email":"carol@example.com","name":"Carol Williams"}'

# Test read (goes to replica)
curl http://localhost:3000/users

# Check health (includes replication status)
curl http://localhost:3000/health

Common Issues and Troubleshooting

1. Replica Fails to Start: "database system identifier differs"

FATAL: database system identifier differs between the primary and standby
DETAIL: The primary's identifier is 7321456789012345678, the standby's identifier is 7321456789012345999.

This means the standby's data directory was not created from a base backup of the current primary. The standby has data from a different PostgreSQL cluster entirely. Solution: wipe the standby data directory and run pg_basebackup again.

sudo systemctl stop postgresql
sudo rm -rf /var/lib/postgresql/16/main/*
sudo -u postgres pg_basebackup -h primary_host -U replicator -D /var/lib/postgresql/16/main -Fp -Xs -P -R
sudo systemctl start postgresql

2. Replication Lag Keeps Growing

Check pg_stat_replication on the primary:

 application_name |   state   | replay_lag_bytes
------------------+-----------+-----------------
 standby1         | streaming |        524288000

A lag of 500MB+ that keeps growing means the replica cannot keep up with the write workload. Common causes:

  • Slow disk I/O on the replica: The standby replays WAL sequentially, which is I/O-bound. Use SSDs.
  • Long-running queries on the replica: Queries block WAL replay due to conflict resolution. Check max_standby_streaming_delay -- if set too high, long queries delay replay. If set too low, queries get cancelled.
  • Network bandwidth: Ensure the network between primary and standby can sustain your WAL generation rate.

-- Check for queries blocking replay on the standby
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;

3. "FATAL: no pg_hba.conf entry for replication connection"

FATAL: no pg_hba.conf entry for replication connection from host "10.0.1.11", user "replicator"

The primary's pg_hba.conf does not allow replication connections from the standby's IP. Add the correct entry:

host    replication     replicator    10.0.1.11/32    scram-sha-256

Then reload (not restart):

sudo systemctl reload postgresql

4. Standby Queries Cancelled Due to Recovery Conflict

ERROR: canceling statement due to conflict with recovery
DETAIL: User query might not see results because of a recently started recovery action.

This happens when the primary vacuums or removes row versions that an active query on the standby still needs. Increase max_standby_streaming_delay to give queries more time, or enable hot_standby_feedback so the primary knows which row versions the standby still needs:

# On the standby
max_standby_streaming_delay = 60s
hot_standby_feedback = on

Warning: hot_standby_feedback = on can cause table bloat on the primary, because it prevents vacuum from cleaning up row versions that the standby is still using. Monitor pg_stat_user_tables.n_dead_tup on the primary.
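
A quick way to watch for that bloat, reusing the primary pool exported by the db module from the complete example (the 100,000-row threshold is arbitrary):

var db = require('./db');

function checkDeadTuples(threshold) {
  var query =
    'SELECT relname, n_dead_tup, n_live_tup ' +
    'FROM pg_stat_user_tables ' +
    'WHERE n_dead_tup > $1 ' +
    'ORDER BY n_dead_tup DESC ' +
    'LIMIT 10';

  return db.primaryPool.query(query, [threshold]).then(function(result) {
    result.rows.forEach(function(row) {
      // Dead tuples that keep climbing while hot_standby_feedback is on
      // usually mean a long-running query on the standby is pinning them.
      console.warn('Table ' + row.relname + ' has ' + row.n_dead_tup + ' dead tuples');
    });
    return result.rows;
  });
}

checkDeadTuples(100000);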

5. WAL Segments Removed Before Standby Could Fetch Them

LOG: started streaming WAL from primary at 0/12000000 on timeline 1
FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000012 has already been removed

The primary has deleted WAL segments that the standby still needs. This happens when the standby falls too far behind and wal_keep_size is too small. Solutions:

  • Increase wal_keep_size on the primary (e.g., 2GB or more)
  • Use replication slots (guarantees WAL retention, but can fill your disk if the standby is down for too long):

-- On the primary
SELECT * FROM pg_create_physical_replication_slot('standby1_slot');

Then set primary_slot_name on the standby (in postgresql.conf or postgresql.auto.conf, alongside primary_conninfo):

primary_slot_name = 'standby1_slot'
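
Because a slot forces the primary to retain WAL until the standby has consumed it, keep an eye on how much each slot is holding back. A minimal check against the primary, reusing the db module's pool (the 5 GB threshold is arbitrary):

var db = require('./db');

function checkSlotRetention(maxBytes) {
  var query =
    'SELECT slot_name, active, ' +
    '       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes ' +
    'FROM pg_replication_slots';

  return db.primaryPool.query(query).then(function(result) {
    result.rows.forEach(function(slot) {
      var retained = parseInt(slot.retained_bytes, 10);
      // An inactive slot that keeps retaining WAL will eventually fill the
      // primary's disk; drop it with pg_drop_replication_slot() if the
      // standby is gone for good.
      if (!slot.active || retained > maxBytes) {
        console.warn('Slot ' + slot.slot_name + ' retaining ' + retained + ' bytes (active=' + slot.active + ')');
      }
    });
    return result.rows;
  });
}

checkSlotRetention(5 * 1024 * 1024 * 1024);  // alert above 5GB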

Best Practices

  • Always use replication slots for physical replication. They prevent WAL segments from being deleted before the standby has consumed them. Monitor slot lag to prevent disk exhaustion.

  • Set wal_keep_size as a safety net, not a primary mechanism. Replication slots are the reliable way to retain WAL. wal_keep_size is your backstop if slots are not configured.

  • Monitor replication lag continuously. Set alerts at two thresholds: a warning at 5 seconds and a critical alert at 30 seconds. Lag that grows monotonically indicates a systemic problem.

  • Test failover regularly. Schedule quarterly failover drills. Document the process, measure the duration, and verify application behavior after promotion. An untested failover plan will fail when you need it most.

  • Use pg_rewind for fast failback. After a failover, pg_rewind can resynchronize the old primary from the new primary without a full base backup, saving hours on large databases.

  • Keep backups independent from replication. Replication replicas are not backups. Run pgBackRest or WAL-G on a schedule, archive WAL to object storage, and test restores monthly.

  • Use connection pooling in production. PgBouncer in transaction mode between your application and PostgreSQL reduces connection overhead dramatically. A database that supports 200 direct connections can serve 2000+ application connections through PgBouncer.

  • Enable hot_standby_feedback cautiously. It prevents query cancellations on the replica but can cause vacuum delays on the primary. Monitor dead tuple counts on the primary when this is enabled.

  • Size your replicas identically to the primary. Under-provisioned replicas fall behind under load. When they get promoted during failover, they cannot handle the full write workload. Your standby must be able to run as the primary at any moment.

  • Implement read-after-write consistency where needed. After a write, the application may immediately read the data. If the read goes to a replica that has not yet replayed the write, the data appears missing. For critical flows (e.g., user registration then immediate login), read from the primary for a short window after the write.
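
One simple way to implement that with the read/write module above: record each user's last write and pin their reads to the primary for a short window afterwards (the 5-second window and in-memory map are illustrative; use a shared store such as Redis if you run multiple app servers):

var db = require('./db');

var lastWriteAt = {};           // userId -> timestamp of most recent write
var STICKY_WINDOW_MS = 5000;    // pin reads to the primary for 5s after a write

function writeForUser(userId, text, params) {
  return db.write(text, params).then(function(result) {
    lastWriteAt[userId] = Date.now();
    return result;
  });
}

function readForUser(userId, text, params) {
  var recentWrite = lastWriteAt[userId] &&
    (Date.now() - lastWriteAt[userId]) < STICKY_WINDOW_MS;

  // Recent writers read from the primary so they always see their own data;
  // everyone else gets the replica.
  if (recentWrite) {
    return db.primaryPool.query(text, params);
  }
  return db.read(text, params);
}

module.exports = { writeForUser: writeForUser, readForUser: readForUser };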

