
Node.js Clustering for Multi-Core Systems

A comprehensive guide to the Node.js cluster module for utilizing multi-core CPUs, covering worker management, zero-downtime restarts, load balancing, and IPC communication.


Overview

Node.js runs on a single thread by default, which means a 16-core production server is effectively using 6% of its available CPU capacity. The built-in cluster module solves this by forking multiple worker processes that share the same server port, allowing your application to handle significantly more concurrent requests. If you are running Node.js in production on anything larger than a single-core VM and you are not using clustering (or a process manager that does it for you), you are leaving performance on the table.

Prerequisites

  • Node.js v14 or later (v18+ recommended for production)
  • Basic understanding of how processes work in an operating system
  • Familiarity with Express.js or any HTTP framework
  • A machine with multiple CPU cores (use os.cpus().length to check)

How the Cluster Module Works

The cluster module uses child_process.fork() under the hood. When you call cluster.fork(), Node.js spawns a new process that runs the same script file. The key distinction is that one process acts as the master (also called the primary in newer docs) and the rest are workers. The master process does not serve requests directly — it manages the lifecycle of workers and distributes incoming connections to them.

var cluster = require("cluster");
var os = require("os");

if (cluster.isMaster) {
  console.log("Master process " + process.pid + " is running");

  var numCPUs = os.cpus().length;
  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }
} else {
  console.log("Worker " + process.pid + " started");
}

Running this on a 4-core machine produces:

$ node server.js
Master process 12345 is running
Worker 12346 started
Worker 12347 started
Worker 12348 started
Worker 12349 started

Each worker is a full, independent Node.js process with its own V8 instance, event loop, and memory space. Workers do not share memory. This is fundamentally different from threads in languages like Java or Go, where threads share the same heap. The isolation is both a strength (one worker crashing does not take down the others) and a constraint (sharing state requires explicit IPC).

Master/Worker Architecture

The master process has three jobs: fork workers, monitor their health, and replace them when they die. It should never handle HTTP requests directly. Keeping the master lean means that even if every worker crashes simultaneously, the master stays alive and can restart them.

var cluster = require("cluster");
var os = require("os");

if (cluster.isMaster) {
  var numCPUs = os.cpus().length;
  console.log("Forking " + numCPUs + " workers");

  for (var i = 0; i < numCPUs; i++) {
    var worker = cluster.fork();
    console.log("Forked worker " + worker.process.pid);
  }

  cluster.on("exit", function (worker, code, signal) {
    console.log(
      "Worker " + worker.process.pid + " died (code: " + code +
      ", signal: " + signal + ")"
    );
  });
} else {
  // Worker process: start your server here
  require("./app");
}

A common mistake is putting heavy initialization logic in the master process. Do not do this. The master should fork workers and manage events. All application logic, database connections, and route handling belong in the worker code path.

Sharing Server Ports

One of the most useful features of the cluster module is that multiple workers can listen on the same port. The master process opens the port and distributes incoming connections to workers. You do not need a reverse proxy or port mapping to make this work.

var cluster = require("cluster");
var http = require("http");
var os = require("os");

if (cluster.isMaster) {
  var numCPUs = os.cpus().length;
  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }
} else {
  http.createServer(function (req, res) {
    res.writeHead(200);
    res.end("Handled by worker " + process.pid + "\n");
  }).listen(8080);

  console.log("Worker " + process.pid + " listening on port 8080");
}

When you hit http://localhost:8080 repeatedly, you will see different PIDs in the response. The operating system or Node.js (depending on the balancing strategy) decides which worker handles each connection.

Load Balancing Strategies

Node.js supports two load balancing approaches for distributing connections across workers.

Round-Robin (Default on Linux and macOS)

The master accepts connections and distributes them to workers in rotation. This is the default on all platforms except Windows. It provides the most even distribution.

// Force round-robin scheduling
cluster.schedulingPolicy = cluster.SCHED_RR;

Or set it via environment variable before starting:

$ NODE_CLUSTER_SCHED_POLICY=rr node server.js

OS-Level Scheduling (Default on Windows)

The operating system decides which worker handles each connection. In practice, this can lead to uneven distribution where one or two workers handle most of the traffic while others sit idle.

// Use OS scheduling
cluster.schedulingPolicy = cluster.SCHED_NONE;

Or set it via environment variable before starting:

$ NODE_CLUSTER_SCHED_POLICY=none node server.js

My recommendation: always use round-robin. The OS scheduler tends to be "sticky" in the worst way, funneling connections to the same worker repeatedly. I have seen production scenarios where one worker was handling 70% of traffic while three others were nearly idle. Round-robin eliminates this problem.

Worker Lifecycle Management

Workers go through several states: online, listening, disconnect, and exit. The master can listen for all of these events.

var cluster = require("cluster");

if (cluster.isMaster) {
  var worker = cluster.fork();

  worker.on("online", function () {
    console.log("Worker " + worker.process.pid + " is online");
  });

  worker.on("listening", function (address) {
    console.log(
      "Worker " + worker.process.pid + " is listening on port " +
      address.port
    );
  });

  worker.on("disconnect", function () {
    console.log("Worker " + worker.process.pid + " disconnected");
  });

  worker.on("exit", function (code, signal) {
    if (signal) {
      console.log("Worker killed by signal: " + signal);
    } else if (code !== 0) {
      console.log("Worker exited with error code: " + code);
    } else {
      console.log("Worker exited cleanly");
    }
  });
}

Handling Worker Crashes and Automatic Restart

In production, workers will crash. Memory leaks, unhandled exceptions, segfaults in native modules — it happens. The master must restart dead workers automatically.

var cluster = require("cluster");
var os = require("os");

if (cluster.isMaster) {
  var numCPUs = os.cpus().length;

  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  cluster.on("exit", function (worker, code, signal) {
    console.log("Worker " + worker.process.pid + " died. Restarting...");
    cluster.fork();
  });
} else {
  require("./app");
}

This is the minimum viable restart strategy, but it has a dangerous flaw: if your application has a bug that crashes immediately on startup, this creates a fork bomb — the master endlessly spawns workers that immediately die. You need restart throttling.

var cluster = require("cluster");
var os = require("os");

var RESTART_DELAY = 1000; // 1 second
var MAX_RESTARTS = 30;
var RESTART_WINDOW = 60000; // 1 minute
var restartLog = [];

function shouldRestart() {
  var now = Date.now();
  restartLog = restartLog.filter(function (timestamp) {
    return now - timestamp < RESTART_WINDOW;
  });

  if (restartLog.length >= MAX_RESTARTS) {
    console.error(
      "Too many restarts (" + MAX_RESTARTS + " in " +
      (RESTART_WINDOW / 1000) + "s). Stopping."
    );
    return false;
  }

  restartLog.push(now);
  return true;
}

if (cluster.isMaster) {
  var numCPUs = os.cpus().length;

  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  cluster.on("exit", function (worker, code, signal) {
    console.log("Worker " + worker.process.pid + " exited (code: " + code + ")");

    if (shouldRestart()) {
      setTimeout(function () {
        console.log("Restarting worker...");
        cluster.fork();
      }, RESTART_DELAY);
    }
  });
} else {
  require("./app");
}
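The throttle logic is easy to get subtly wrong, so it helps to exercise it in isolation. Below is the same sliding-window idea factored into a standalone function with the clock passed in as a parameter — `makeThrottle` is a name introduced here for illustration, not part of the cluster API:

```javascript
// Sliding-window throttle: allows at most `maxEvents` events per
// `windowMs` milliseconds. Returns a function that records an event
// at timestamp `now` and reports whether it was allowed.
function makeThrottle(maxEvents, windowMs) {
  var log = [];
  return function (now) {
    // Drop timestamps that have aged out of the window
    log = log.filter(function (ts) { return now - ts < windowMs; });
    if (log.length >= maxEvents) {
      return false; // over the limit — caller should stop restarting
    }
    log.push(now);
    return true;
  };
}

// At most 3 restarts per second
var allow = makeThrottle(3, 1000);
console.log(allow(0));    // true
console.log(allow(100));  // true
console.log(allow(200));  // true
console.log(allow(300));  // false — fourth restart inside the window
console.log(allow(1250)); // true — the earlier restarts have aged out
```

In the master you would call the returned function with Date.now() inside the exit handler and only fork a replacement when it returns true.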

Zero-Downtime Restarts

Deploying new code without dropping a single request is a hard requirement for many production systems. The cluster module makes this achievable with a rolling restart strategy: replace workers one at a time, waiting for each new worker to be ready before killing the next old one.

var cluster = require("cluster");
var os = require("os");

if (cluster.isMaster) {
  var numCPUs = os.cpus().length;

  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  process.on("SIGUSR2", function () {
    console.log("Received SIGUSR2. Starting rolling restart...");
    var workerIds = Object.keys(cluster.workers);
    var index = 0;

    function restartNext() {
      if (index >= workerIds.length) {
        console.log("Rolling restart complete");
        return;
      }

      var worker = cluster.workers[workerIds[index]];
      if (!worker) {
        index++;
        restartNext();
        return;
      }

      console.log("Restarting worker " + worker.process.pid);

      var replacement = cluster.fork();

      replacement.on("listening", function () {
        console.log("Replacement worker " + replacement.process.pid + " is ready");
        worker.disconnect();

        // Force kill if worker does not exit within 5 seconds
        var killTimer = setTimeout(function () {
          console.log("Force killing worker " + worker.process.pid);
          worker.kill();
        }, 5000);

        worker.on("exit", function () {
          clearTimeout(killTimer);
          index++;
          restartNext();
        });
      });
    }

    restartNext();
  });

  cluster.on("exit", function (worker, code, signal) {
    if (!worker.exitedAfterDisconnect) {
      console.log("Worker " + worker.process.pid + " crashed. Restarting...");
      cluster.fork();
    }
  });
} else {
  require("./app");
}

Trigger the rolling restart by sending SIGUSR2 to the master process:

$ kill -SIGUSR2 <master_pid>

The key detail here is worker.exitedAfterDisconnect. This flag is true when the worker was intentionally disconnected (part of our rolling restart) and false when it crashed unexpectedly. We only auto-restart on unexpected exits to avoid interfering with the rolling restart sequence.

Inter-Process Communication (IPC)

Workers cannot share memory, but they can exchange messages with the master through IPC channels. This is useful for sharing lightweight state, broadcasting configuration changes, or collecting metrics.

var cluster = require("cluster");

if (cluster.isMaster) {
  var requestCounts = {};

  var worker = cluster.fork();

  worker.on("message", function (msg) {
    if (msg.type === "request_count") {
      requestCounts[worker.id] = msg.count;
      console.log("Worker " + worker.id + " has handled " + msg.count + " requests");
    }
  });

  // Send configuration to worker
  worker.send({ type: "config", maxRequestsPerMinute: 1000 });
} else {
  var count = 0;

  process.on("message", function (msg) {
    if (msg.type === "config") {
      console.log("Received config: max " + msg.maxRequestsPerMinute + " req/min");
    }
  });

  // Report request count every 10 seconds
  setInterval(function () {
    process.send({ type: "request_count", count: count });
  }, 10000);

  var http = require("http");
  http.createServer(function (req, res) {
    count++;
    res.writeHead(200);
    res.end("OK");
  }).listen(8080);
}

Broadcasting Messages to All Workers

A common pattern is having the master relay messages from one worker to all others:

if (cluster.isMaster) {
  cluster.on("message", function (worker, msg) {
    if (msg.type === "broadcast") {
      Object.keys(cluster.workers).forEach(function (id) {
        cluster.workers[id].send(msg.data);
      });
    }
  });
}

Keep IPC messages small. Sending large objects through IPC involves JSON serialization and deserialization on both sides, which can be surprisingly expensive. If you need to share large datasets between workers, use Redis or a shared database instead.
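You can get a rough feel for that cost by timing the JSON round-trip locally. This sketch only approximates what IPC does per message, and the numbers are machine-dependent:

```javascript
// Approximate the per-message serialization work IPC performs by
// round-tripping a payload through JSON.
function roundTripMs(payload) {
  var start = process.hrtime.bigint();
  JSON.parse(JSON.stringify(payload));
  var end = process.hrtime.bigint();
  return Number(end - start) / 1e6;
}

var small = { type: "request_count", count: 42 };
var large = { rows: new Array(100000).fill({ id: 1, name: "x" }) };

console.log("small payload: " + roundTripMs(small).toFixed(3) + " ms");
console.log("large payload: " + roundTripMs(large).toFixed(3) + " ms");
```

The large payload is typically orders of magnitude slower per message, and the cost is paid on both the sending and receiving side, which compounds quickly at high message rates.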

Sticky Sessions for Stateful Connections

Round-robin load balancing breaks any session that relies on in-memory state, including the multi-request handshakes some WebSocket libraries use. If a client opens a WebSocket to worker A, subsequent HTTP requests for that session might land on worker B, which has no knowledge of the connection.

The solution is sticky sessions — routing all requests from the same client to the same worker. The sticky-session package handles this, but understanding the mechanism is important.

var cluster = require("cluster");
var http = require("http");
var net = require("net");
var os = require("os");

var numCPUs = os.cpus().length;

if (cluster.isMaster) {
  var workers = [];

  for (var i = 0; i < numCPUs; i++) {
    workers.push(cluster.fork());
  }

  // Create a raw TCP server that distributes connections by IP hash
  var server = net.createServer({ pauseOnConnect: true }, function (connection) {
    var ip = connection.remoteAddress || "";
    var hash = ipHash(ip, workers.length);
    var worker = workers[hash];
    worker.send("sticky-session:connection", connection);
  });

  server.listen(8080);

  function ipHash(ip, numBuckets) {
    var hash = 0;
    for (var i = 0; i < ip.length; i++) {
      hash = ((hash << 5) - hash) + ip.charCodeAt(i);
      hash = hash & hash; // Convert to 32bit integer
    }
    return Math.abs(hash) % numBuckets;
  }
} else {
  var express = require("express");
  var app = express();

  var server = http.createServer(app);

  app.get("/", function (req, res) {
    res.send("Worker " + process.pid);
  });

  // Listen on a random port — the master handles the real port
  server.listen(0, "localhost");

  process.on("message", function (msg, connection) {
    if (msg === "sticky-session:connection") {
      server.emit("connection", connection);
      connection.resume();
    }
  });
}

The trade-off with sticky sessions is that you lose the even distribution of round-robin. If many clients share the same IP (corporate NATs, for example), one worker may get disproportionate traffic. In practice, this is usually acceptable for WebSocket applications.
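Because the hash is deterministic, the same address always maps to the same worker — that is the whole stickiness guarantee. A standalone check of the ipHash function from the example above:

```javascript
// Same ipHash as the sticky-session example: deterministic bucket per IP.
function ipHash(ip, numBuckets) {
  var hash = 0;
  for (var i = 0; i < ip.length; i++) {
    hash = ((hash << 5) - hash) + ip.charCodeAt(i);
    hash = hash & hash; // truncate to a 32-bit integer
  }
  return Math.abs(hash) % numBuckets;
}

// The same client always lands on the same worker
console.log(ipHash("203.0.113.7", 4) === ipHash("203.0.113.7", 4)); // true

// Different clients spread across buckets (best-effort, not guaranteed even)
var buckets = [0, 0, 0, 0];
for (var i = 0; i < 256; i++) {
  buckets[ipHash("10.0.0." + i, 4)]++;
}
console.log(buckets); // per-worker counts summing to 256
```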

Memory Considerations Per Worker

Each worker is a separate process with its own V8 heap. On a server with 4 GB of RAM running 4 workers, each worker gets roughly 1 GB (minus OS overhead). You need to plan memory budgets carefully.

var cluster = require("cluster");
var os = require("os");

var TOTAL_MEMORY_MB = os.totalmem() / 1024 / 1024;
var OS_RESERVED_MB = 512;
var AVAILABLE_MB = TOTAL_MEMORY_MB - OS_RESERVED_MB;
var numCPUs = os.cpus().length;
var PER_WORKER_MB = Math.floor(AVAILABLE_MB / numCPUs);

if (cluster.isMaster) {
  // Log from the master only — workers re-run this file, and logging at
  // the top level would repeat the output once per worker
  console.log("Total memory: " + Math.floor(TOTAL_MEMORY_MB) + " MB");
  console.log("Workers: " + numCPUs);
  console.log("Memory per worker: " + PER_WORKER_MB + " MB");

  for (var i = 0; i < numCPUs; i++) {
    cluster.fork({
      NODE_OPTIONS: "--max-old-space-size=" + PER_WORKER_MB
    });
  }
}

Typical output on an 8 GB, 4-core machine:

Total memory: 8192 MB
Workers: 4
Memory per worker: 1920 MB

Monitor worker memory usage from the master:

setInterval(function () {
  Object.keys(cluster.workers).forEach(function (id) {
    cluster.workers[id].send({ type: "memory_report" });
  });
}, 30000);

// In worker:
process.on("message", function (msg) {
  if (msg.type === "memory_report") {
    var usage = process.memoryUsage();
    process.send({
      type: "memory_stats",
      rss: Math.floor(usage.rss / 1024 / 1024),
      heapUsed: Math.floor(usage.heapUsed / 1024 / 1024),
      heapTotal: Math.floor(usage.heapTotal / 1024 / 1024)
    });
  }
});

Benchmarking Single vs Clustered Performance

The whole point of clustering is performance. Here is how to measure the difference. Start with a simple CPU-bound endpoint:

// app.js
var express = require("express");
var app = express();

app.get("/", function (req, res) {
  // Simulate CPU work
  var sum = 0;
  for (var i = 0; i < 1e6; i++) {
    sum += Math.sqrt(i);
  }
  res.json({ pid: process.pid, result: sum });
});

module.exports = app;

Run without clustering:

// single.js
var app = require("./app");
app.listen(8080);

Run with clustering:

// clustered.js
var cluster = require("cluster");
var os = require("os");

if (cluster.isMaster) {
  var numCPUs = os.cpus().length;
  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }
} else {
  var app = require("./app");
  app.listen(8080);
}

Benchmark with autocannon (install via npm install -g autocannon):

# Single process
$ node single.js &
$ autocannon -c 100 -d 10 http://localhost:8080

# Typical result on 4-core machine:
# Avg Latency: 42 ms
# Req/Sec:     2,340
# Total:       23,400 requests in 10s

# Clustered (4 workers)
$ node clustered.js &
$ autocannon -c 100 -d 10 http://localhost:8080

# Typical result on 4-core machine:
# Avg Latency: 12 ms
# Req/Sec:     8,120
# Total:       81,200 requests in 10s

For CPU-bound workloads, you can expect nearly linear scaling — 4 cores gives close to 4x throughput. For I/O-bound workloads (database queries, API calls), the improvement is less dramatic because the single-threaded event loop already handles I/O concurrency well. You will still see gains from clustering I/O-bound applications, typically 1.5-2.5x, because the event loop itself has overhead that gets distributed across workers.
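A back-of-the-envelope way to reason about those numbers is Amdahl-style arithmetic: only the fraction of request handling that actually contends for CPU benefits from more workers. The fractions below are illustrative assumptions, not measurements from the benchmark above:

```javascript
// Amdahl's law: expected speedup when a fraction `p` of the work
// parallelizes perfectly across `n` workers.
function speedup(p, n) {
  return 1 / ((1 - p) + p / n);
}

// CPU-bound endpoint: nearly all the work parallelizes
console.log(speedup(0.95, 4).toFixed(2)); // "3.48"

// I/O-bound endpoint: the event loop already overlaps I/O, so
// assume only ~40% of wall time benefits from extra workers
console.log(speedup(0.40, 4).toFixed(2)); // "1.43"
```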

Complete Working Example

Here is a production-ready clustered Express server with all the pieces assembled: automatic restart with throttling, graceful shutdown, health monitoring, and IPC-based metrics aggregation.

// server.js
var cluster = require("cluster");
var os = require("os");
var http = require("http");

var NUM_WORKERS = os.cpus().length;
var RESTART_DELAY_MS = 2000;
var MAX_RESTARTS_PER_MINUTE = 20;
var GRACEFUL_SHUTDOWN_TIMEOUT = 10000;
var HEALTH_CHECK_INTERVAL = 15000;

if (cluster.isMaster) {
  masterProcess();
} else {
  workerProcess();
}

// ─── MASTER ──────────────────────────────────────────────────────────

function masterProcess() {
  console.log("[Master " + process.pid + "] Starting " + NUM_WORKERS + " workers");

  var restartTimestamps = [];
  var workerMetrics = {};
  var isShuttingDown = false;

  // Fork initial workers
  for (var i = 0; i < NUM_WORKERS; i++) {
    forkWorker();
  }

  function forkWorker() {
    var worker = cluster.fork();
    workerMetrics[worker.id] = { requests: 0, errors: 0, uptime: Date.now() };

    worker.on("message", function (msg) {
      if (msg.type === "metrics") {
        workerMetrics[worker.id].requests = msg.requests;
        workerMetrics[worker.id].errors = msg.errors;
      }
    });

    return worker;
  }

  // Handle worker exits
  cluster.on("exit", function (worker, code, signal) {
    delete workerMetrics[worker.id];

    if (isShuttingDown) {
      var remaining = Object.keys(cluster.workers).length;
      console.log("[Master] Worker " + worker.process.pid + " stopped. " + remaining + " remaining.");
      if (remaining === 0) {
        console.log("[Master] All workers stopped. Exiting.");
        process.exit(0);
      }
      return;
    }

    // Skip auto-restart for intentional exits — the rolling restart below
    // already forks the replacement, so refork only on unexpected crashes
    if (worker.exitedAfterDisconnect) {
      return;
    }

    console.log(
      "[Master] Worker " + worker.process.pid + " died " +
      "(code: " + code + ", signal: " + signal + ")"
    );

    // Restart throttling
    var now = Date.now();
    restartTimestamps = restartTimestamps.filter(function (ts) {
      return now - ts < 60000;
    });

    if (restartTimestamps.length >= MAX_RESTARTS_PER_MINUTE) {
      console.error("[Master] Too many restarts. Halting auto-restart.");
      return;
    }

    restartTimestamps.push(now);
    setTimeout(function () {
      console.log("[Master] Restarting worker...");
      forkWorker();
    }, RESTART_DELAY_MS);
  });

  // Health monitoring
  setInterval(function () {
    var totalRequests = 0;
    var totalErrors = 0;
    var workerCount = 0;

    Object.keys(workerMetrics).forEach(function (id) {
      totalRequests += workerMetrics[id].requests;
      totalErrors += workerMetrics[id].errors;
      workerCount++;
    });

    console.log(
      "[Master] Health: " + workerCount + " workers, " +
      totalRequests + " total requests, " +
      totalErrors + " errors"
    );
  }, HEALTH_CHECK_INTERVAL);

  // Graceful shutdown
  function gracefulShutdown(signal) {
    if (isShuttingDown) return;
    isShuttingDown = true;

    console.log("[Master] Received " + signal + ". Graceful shutdown starting...");

    var workers = Object.keys(cluster.workers);
    workers.forEach(function (id) {
      cluster.workers[id].send({ type: "shutdown" });
      cluster.workers[id].disconnect();
    });

    // Force kill after timeout
    setTimeout(function () {
      console.error("[Master] Forcing shutdown after timeout");
      workers.forEach(function (id) {
        if (cluster.workers[id]) {
          cluster.workers[id].kill("SIGKILL");
        }
      });
      process.exit(1);
    }, GRACEFUL_SHUTDOWN_TIMEOUT);
  }

  process.on("SIGTERM", function () { gracefulShutdown("SIGTERM"); });
  process.on("SIGINT", function () { gracefulShutdown("SIGINT"); });

  // Rolling restart via SIGUSR2
  process.on("SIGUSR2", function () {
    console.log("[Master] Rolling restart initiated");
    var ids = Object.keys(cluster.workers);
    var idx = 0;

    function nextWorker() {
      if (idx >= ids.length) {
        console.log("[Master] Rolling restart complete");
        return;
      }
      var oldWorker = cluster.workers[ids[idx]];
      if (!oldWorker) { idx++; nextWorker(); return; }

      var replacement = forkWorker();
      replacement.on("listening", function () {
        oldWorker.disconnect();
        var killTimer = setTimeout(function () {
          oldWorker.kill();
        }, 5000);
        oldWorker.on("exit", function () {
          clearTimeout(killTimer);
          idx++;
          nextWorker();
        });
      });
    }

    nextWorker();
  });
}

// ─── WORKER ──────────────────────────────────────────────────────────

function workerProcess() {
  var express = require("express");
  var app = express();

  var requestCount = 0;
  var errorCount = 0;
  var isShuttingDown = false;

  // Middleware: track requests and reject during shutdown
  app.use(function (req, res, next) {
    if (isShuttingDown) {
      res.set("Connection", "close");
      res.status(503).json({ error: "Server is shutting down" });
      return;
    }
    requestCount++;
    next();
  });

  // Routes
  app.get("/", function (req, res) {
    res.json({
      message: "Hello from worker " + process.pid,
      uptime: process.uptime()
    });
  });

  app.get("/health", function (req, res) {
    var mem = process.memoryUsage();
    res.json({
      status: "healthy",
      pid: process.pid,
      uptime: Math.floor(process.uptime()),
      memory: {
        rss: Math.floor(mem.rss / 1024 / 1024) + " MB",
        heapUsed: Math.floor(mem.heapUsed / 1024 / 1024) + " MB"
      },
      requests: requestCount
    });
  });

  app.get("/heavy", function (req, res) {
    var sum = 0;
    for (var i = 0; i < 1e7; i++) {
      sum += Math.sqrt(i);
    }
    res.json({ pid: process.pid, result: sum });
  });

  // Error handling
  app.use(function (err, req, res, next) {
    errorCount++;
    console.error("[Worker " + process.pid + "] Error: " + err.message);
    res.status(500).json({ error: "Internal server error" });
  });

  // Start server
  var server = http.createServer(app);
  server.listen(8080, function () {
    console.log("[Worker " + process.pid + "] Listening on port 8080");
  });

  // Report metrics to master every 10 seconds
  setInterval(function () {
    if (process.connected) {
      process.send({
        type: "metrics",
        requests: requestCount,
        errors: errorCount
      });
    }
  }, 10000);

  // Handle shutdown message from master
  process.on("message", function (msg) {
    if (msg.type === "shutdown") {
      console.log("[Worker " + process.pid + "] Shutdown signal received");
      isShuttingDown = true;

      server.close(function () {
        console.log("[Worker " + process.pid + "] Closed all connections");
        process.exit(0);
      });

      // Force exit if connections do not drain in time
      setTimeout(function () {
        console.error("[Worker " + process.pid + "] Forced exit after timeout");
        process.exit(1);
      }, 8000);
    }
  });

  // Handle uncaught exceptions gracefully
  process.on("uncaughtException", function (err) {
    console.error("[Worker " + process.pid + "] Uncaught exception: " + err.stack);
    errorCount++;

    // Stop accepting new connections, then exit
    isShuttingDown = true;
    server.close(function () {
      process.exit(1);
    });

    setTimeout(function () {
      process.exit(1);
    }, 5000);
  });
}

Start this server and test it:

$ node server.js
[Master 10200] Starting 4 workers
[Worker 10201] Listening on port 8080
[Worker 10202] Listening on port 8080
[Worker 10203] Listening on port 8080
[Worker 10204] Listening on port 8080
[Master] Health: 4 workers, 0 total requests, 0 errors

$ curl http://localhost:8080/health
{
  "status": "healthy",
  "pid": 10201,
  "uptime": 45,
  "memory": { "rss": "52 MB", "heapUsed": "18 MB" },
  "requests": 1
}

Trigger a rolling restart to deploy new code:

$ kill -SIGUSR2 10200
[Master] Rolling restart initiated
[Worker 10205] Listening on port 8080
[Worker 10206] Listening on port 8080
...
[Master] Rolling restart complete

Graceful shutdown on SIGTERM (what DigitalOcean, Docker, and Kubernetes send):

$ kill -SIGTERM 10200
[Master] Received SIGTERM. Graceful shutdown starting...
[Worker 10205] Shutdown signal received
[Worker 10205] Closed all connections
[Worker 10206] Shutdown signal received
[Worker 10206] Closed all connections
[Master] All workers stopped. Exiting.

Common Issues and Troubleshooting

1. EADDRINUSE: Port Already in Use

Error: listen EADDRINUSE: address already in use :::8080

This happens when you try to call server.listen() in the master process instead of only in workers. The master opens the port internally for connection distribution. If you also try to bind in the master, it conflicts.

Fix: Only call listen() inside the else (worker) branch. The master should never create an HTTP server.

2. Workers Not Receiving Connections on Windows

[Worker 10201] Listening on port 8080
[Worker 10202] Listening on port 8080
# All requests go to worker 10201, 10202 is idle

Windows defaults to SCHED_NONE (OS scheduling), which often results in severely uneven distribution.

Fix: Force round-robin scheduling:

cluster.schedulingPolicy = cluster.SCHED_RR;

Or set the environment variable NODE_CLUSTER_SCHED_POLICY=rr before starting the process.

3. IPC Channel Closed Errors

Error [ERR_IPC_CHANNEL_CLOSED]: Channel closed
    at ChildProcess.target.send (internal/child_process.js:754:16)

This occurs when a worker tries to call process.send() after it has been disconnected from the master. Common during shutdown sequences or when a worker outlives its IPC channel.

Fix: Check process.connected before sending:

if (process.connected) {
  process.send({ type: "metrics", data: metrics });
}

4. Memory Leak Across Workers Crashes the Server

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory

<--- Last few GCs --->
[10201:0x5629a0]   120542 ms: Mark-sweep 1495.2 (1502.9) -> 1494.8 (1502.9) MB

Each worker is an independent process. If your application has a memory leak, every worker leaks independently, and they will all hit the heap limit eventually. With 4 workers on an 8 GB server, you can exhaust system memory four times faster than a single process.

Fix: Set --max-old-space-size per worker (as shown in the memory section), and implement a memory watchdog that restarts workers proactively:

// In worker process
setInterval(function () {
  var usage = process.memoryUsage();
  var heapUsedMB = usage.heapUsed / 1024 / 1024;

  if (heapUsedMB > 500) {
    console.warn("[Worker " + process.pid + "] Heap at " + Math.floor(heapUsedMB) + " MB. Recycling.");
    process.exit(0); // Master will restart this worker
  }
}, 30000);

5. Port Sharing Fails with Non-HTTP Servers

Error: Could not bind to port 3000

The cluster module's port sharing only works with net.Server, http.Server, and https.Server. If you are trying to share a raw UDP socket or a custom protocol server, you need a different approach — typically having the master accept and forward connections manually.

Fix: Use the sticky session pattern (master accepts on a raw TCP server, forwards to workers) or have each worker listen on a different port with a load balancer in front.

Best Practices

  • Do not run more workers than CPU cores. Spawning 16 workers on a 4-core machine causes context switching overhead that actually degrades performance. Match workers to physical (not logical) cores for CPU-bound workloads.

  • Use a process manager in production. PM2, systemd, or Docker's restart policies handle worker management more robustly than hand-rolled cluster code. PM2's cluster mode (pm2 start app.js -i max) gives you clustering, log management, and monitoring out of the box.

  • Keep the master process minimal. No database connections, no route handlers, no heavy libraries. The master's only job is worker lifecycle management. A lean master recovers faster from worker crashes.

  • Store shared state externally. Workers do not share memory. If you need shared sessions, counters, or caches, use Redis. Do not try to synchronize state through IPC — it does not scale and adds fragile complexity.

  • Implement graceful shutdown in every worker. When a worker receives a disconnect signal, stop accepting new connections, finish processing in-flight requests, then exit. Set a hard timeout (10-30 seconds) to force exit if connections do not drain. This is essential for zero-downtime deploys.

  • Set --max-old-space-size per worker. Without explicit limits, each worker falls back to V8's default heap limit (historically around 1.5 GB on 64-bit systems; newer Node.js versions derive the default from available memory). On a 4-core, 4 GB server, four workers at a 1.5 GB limit could try to allocate 6 GB of heap total, triggering OS-level OOM kills.

  • Monitor each worker independently. Aggregate metrics in the master process but track per-worker health. One worker with high latency or error rates indicates a problem that averaged metrics would hide.

  • Use round-robin scheduling explicitly. Do not rely on platform defaults. Set cluster.schedulingPolicy = cluster.SCHED_RR at the top of your entry file so behavior is consistent across Linux, macOS, and Windows deployments.

  • Test your crash recovery. Intentionally kill workers during load testing. Verify that the master restarts them, that in-flight requests on other workers complete normally, and that the restart throttle prevents fork bombs.
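To make the process-manager recommendation concrete, here is a sketch of a PM2 ecosystem file applying several of these practices. The app name, script path, and limit values are illustrative, not tuned defaults:

```javascript
// ecosystem.config.js — illustrative PM2 cluster-mode configuration
module.exports = {
  apps: [{
    name: "api",
    script: "./app.js",
    instances: "max",            // one worker per core
    exec_mode: "cluster",        // use Node's cluster module under the hood
    max_memory_restart: "900M",  // recycle a worker before it hits the heap limit
    kill_timeout: 10000,         // grace period for in-flight requests on shutdown
    node_args: "--max-old-space-size=896"
  }]
};
```

Start it with pm2 start ecosystem.config.js; PM2 then handles restart throttling, rolling reloads (pm2 reload api), and log aggregation for you.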
