Health Checks in Docker and Kubernetes

Implement robust health checks for Node.js applications in Docker and Kubernetes, covering liveness probes, readiness probes, graceful shutdown, and comprehensive health check endpoint design.

A container running does not mean your application is working. Your Node.js process might be alive but stuck in an infinite loop, unable to connect to the database, or out of memory. Health checks are how orchestrators detect these conditions and take corrective action — restarting unhealthy containers, removing them from load balancer rotation, or delaying traffic until startup completes. This guide covers health check implementation from simple Docker HEALTHCHECK instructions to comprehensive Kubernetes probe configurations.

Prerequisites

  • Docker Desktop v4.0+ or Docker Engine
  • Docker Compose v2
  • kubectl and a Kubernetes cluster (minikube, kind, or cloud-managed)
  • Node.js 18+ and Express.js
  • Basic familiarity with Docker and Kubernetes concepts

Docker HEALTHCHECK Instruction

The HEALTHCHECK instruction in a Dockerfile tells Docker how to test whether a container is still working.

FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 3000

HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

CMD ["node", "app.js"]

The parameters:

  • --interval=30s: Check every 30 seconds
  • --timeout=5s: Fail if the check takes longer than 5 seconds
  • --start-period=10s: Grace period for container startup (checks during this period do not count toward retries)
  • --retries=3: Mark unhealthy after 3 consecutive failures

Check container health status:

docker ps
# CONTAINER ID  IMAGE    STATUS                    PORTS
# abc123        myapp    Up 2 min (healthy)        0.0.0.0:3000->3000/tcp
# def456        myapp    Up 30s (health: starting) 0.0.0.0:3001->3000/tcp

Using curl vs wget for health checks:

# wget (available in Alpine by default)
HEALTHCHECK CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

# curl (requires installing curl in Alpine)
HEALTHCHECK CMD curl -f http://localhost:3000/health || exit 1

# Node.js script (no external dependencies)
HEALTHCHECK CMD node -e "var http = require('http'); http.get('http://localhost:3000/health', function(res) { process.exit(res.statusCode === 200 ? 0 : 1); }).on('error', function() { process.exit(1); });"

I prefer wget on Alpine images because it is built in. The Node.js option works everywhere but spawns a full Node.js process for each check, which briefly costs roughly 30-50MB of memory every interval.

Implementing Health Check Endpoints in Express.js

Shallow Health Check

A shallow check verifies the application process is running and can handle HTTP requests. It does not test dependencies.

var express = require('express');
var app = express();

app.get('/health', function(req, res) {
  res.status(200).json({
    status: 'healthy',
    uptime: process.uptime(),
    timestamp: Date.now()
  });
});

This is fast (sub-millisecond) and always succeeds unless the process is completely stuck. Use it for Kubernetes liveness probes.

Deep Health Check

A deep check verifies all critical dependencies — database, cache, external services.

var pg = require('pg');
var redis = require('redis');

var pool = new pg.Pool({
  connectionString: process.env.DATABASE_URL
});

var redisClient = redis.createClient({
  url: process.env.REDIS_URL
});
redisClient.connect().catch(function(err) {
  console.error('Redis connect failed:', err.message);
});

app.get('/health/ready', function(req, res) {
  var checks = {
    database: 'unknown',
    redis: 'unknown',
    memory: 'unknown'
  };
  var healthy = true;

  // Check database
  pool.query('SELECT 1', function(dbErr) {
    checks.database = dbErr ? 'unhealthy' : 'healthy';
    if (dbErr) healthy = false;

    // Check Redis
    redisClient.ping().then(function() {
      checks.redis = 'healthy';
    }).catch(function(redisErr) {
      checks.redis = 'unhealthy';
      healthy = false;
    }).finally(function() {
      // Check memory
      var memUsage = process.memoryUsage();
      var heapUsedMB = Math.round(memUsage.heapUsed / 1024 / 1024);
      var heapTotalMB = Math.round(memUsage.heapTotal / 1024 / 1024);
      var heapPercent = Math.round((memUsage.heapUsed / memUsage.heapTotal) * 100);

      checks.memory = heapPercent > 90 ? 'warning' : 'healthy';
      if (heapPercent > 95) {
        checks.memory = 'unhealthy';
        healthy = false;
      }

      var statusCode = healthy ? 200 : 503;
      res.status(statusCode).json({
        status: healthy ? 'healthy' : 'unhealthy',
        checks: checks,
        memory: {
          heapUsed: heapUsedMB + 'MB',
          heapTotal: heapTotalMB + 'MB',
          heapPercent: heapPercent + '%'
        },
        uptime: Math.round(process.uptime()) + 's',
        timestamp: new Date().toISOString()
      });
    });
  });
});

Response when healthy:

{
  "status": "healthy",
  "checks": {
    "database": "healthy",
    "redis": "healthy",
    "memory": "healthy"
  },
  "memory": {
    "heapUsed": "45MB",
    "heapTotal": "78MB",
    "heapPercent": "57%"
  },
  "uptime": "3842s",
  "timestamp": "2026-02-13T14:30:00.000Z"
}

Response when unhealthy (HTTP 503):

{
  "status": "unhealthy",
  "checks": {
    "database": "unhealthy",
    "redis": "healthy",
    "memory": "healthy"
  },
  "memory": {
    "heapUsed": "52MB",
    "heapTotal": "78MB",
    "heapPercent": "66%"
  },
  "uptime": "3842s",
  "timestamp": "2026-02-13T14:30:00.000Z"
}
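The memory thresholds buried in the handler above are easier to reason about (and to unit test) as a pure helper. A sketch — the classifyHeap name is my own:

```javascript
// classifyHeap — mirrors the thresholds used in /health/ready:
// healthy up to 90% heap usage, warning from 91-95%, unhealthy above 95%.
function classifyHeap(heapUsed, heapTotal) {
  var heapPercent = Math.round((heapUsed / heapTotal) * 100);
  if (heapPercent > 95) return 'unhealthy';
  if (heapPercent > 90) return 'warning';
  return 'healthy';
}

module.exports = classifyHeap;
```

Feeding it the values from process.memoryUsage() reduces the route handler to a single call, and the thresholds live in exactly one place.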

Health Check Module

Extract health checks into a reusable module.

// health/index.js
var os = require('os');

function HealthChecker(options) {
  this.checks = {};
  this.timeout = (options && options.timeout) || 5000;
}

HealthChecker.prototype.addCheck = function(name, checkFn) {
  this.checks[name] = checkFn;
};

HealthChecker.prototype.run = function(callback) {
  var results = {};
  var healthy = true;
  var checkNames = Object.keys(this.checks);
  var completed = 0;
  var finished = false; // prevents calling back twice if a check completes after the timeout
  var timeout = this.timeout;

  if (checkNames.length === 0) {
    return callback(null, { status: 'healthy', checks: {} });
  }

  var timer = setTimeout(function() {
    if (finished) return;
    finished = true;
    callback(null, {
      status: 'unhealthy',
      checks: results,
      error: 'Health check timed out after ' + timeout + 'ms'
    });
  }, timeout);

  checkNames.forEach(function(name) {
    var checkFn = this.checks[name];
    var startTime = Date.now();

    checkFn(function(err) {
      var duration = Date.now() - startTime;
      results[name] = {
        status: err ? 'unhealthy' : 'healthy',
        duration: duration + 'ms'
      };
      if (err) {
        results[name].error = err.message;
        healthy = false;
      }

      completed++;
      if (completed === checkNames.length && !finished) {
        finished = true;
        clearTimeout(timer);
        callback(null, {
          status: healthy ? 'healthy' : 'unhealthy',
          checks: results,
          system: {
            uptime: Math.round(process.uptime()) + 's',
            memory: {
              used: Math.round(process.memoryUsage().heapUsed / 1024 / 1024) + 'MB',
              total: Math.round(os.totalmem() / 1024 / 1024) + 'MB'
            },
            cpu: os.loadavg()
          },
          timestamp: new Date().toISOString()
        });
      }
    });
  }, this);
};

module.exports = HealthChecker;

// app.js
var express = require('express');
var pg = require('pg');
var redis = require('redis');
var HealthChecker = require('./health');

var app = express();

var pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });
var redisClient = redis.createClient({ url: process.env.REDIS_URL });
redisClient.connect().catch(function(err) {
  console.error('Redis connect failed:', err.message);
});

var healthChecker = new HealthChecker({ timeout: 5000 });

healthChecker.addCheck('database', function(callback) {
  pool.query('SELECT 1', function(err) {
    callback(err);
  });
});

healthChecker.addCheck('redis', function(callback) {
  redisClient.ping().then(function() {
    callback(null);
  }).catch(function(err) {
    callback(err);
  });
});

// Shallow check for liveness
app.get('/health', function(req, res) {
  res.status(200).json({ status: 'healthy' });
});

// Deep check for readiness
app.get('/health/ready', function(req, res) {
  healthChecker.run(function(err, result) {
    var statusCode = result.status === 'healthy' ? 200 : 503;
    res.status(statusCode).json(result);
  });
});

Docker Compose Health Checks

Use health checks with depends_on to orchestrate startup order.

services:
  api:
    build: .
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://appuser:secret@postgres:5432/myapp
      - REDIS_URL=redis://redis:6379
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/health"]
      interval: 15s
      timeout: 5s
      start_period: 30s
      retries: 3

  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: appuser
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: myapp
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U appuser -d myapp"]
      interval: 5s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

The condition: service_healthy ensures the API container does not start until PostgreSQL and Redis are both healthy. This eliminates the "connection refused on startup" race condition.
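Note that service_healthy only helps when Compose controls startup order; under Kubernetes or a plain docker run, the same race can still occur. A small retry loop around the first connection attempt covers it. A generic sketch — the retry helper is my own, not part of pg or redis:

```javascript
// retry — run an async operation up to `attempts` times, waiting `delayMs`
// between failures, so the app tolerates a dependency that is up but not
// yet ready to accept connections.
function retry(operation, attempts, delayMs, callback) {
  operation(function (err, result) {
    if (!err) return callback(null, result);
    if (attempts <= 1) return callback(err); // out of attempts: surface the error
    setTimeout(function () {
      retry(operation, attempts - 1, delayMs, callback);
    }, delayMs);
  });
}

module.exports = retry;

// Example: keep trying the first query for up to ~10 seconds before giving up:
// retry(function (cb) { pool.query('SELECT 1', cb); }, 20, 500, function (err) {
//   if (err) { console.error('Database never became ready'); process.exit(1); }
// });
```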

Kubernetes Liveness Probes

Liveness probes determine if a container should be restarted. If the probe fails, Kubernetes kills the container and creates a new one.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: myapp:latest
          ports:
            - containerPort: 3000
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 20
            timeoutSeconds: 5
            failureThreshold: 3
            successThreshold: 1

The liveness probe should be a shallow check. Do not include database connectivity in liveness probes — if the database is down, restarting your API container will not fix the database. It will just create a restart loop.

Probe Types

# HTTP GET probe
livenessProbe:
  httpGet:
    path: /health
    port: 3000
    httpHeaders:
      - name: X-Health-Check
        value: kubernetes

# TCP socket probe
livenessProbe:
  tcpSocket:
    port: 3000

# Command execution probe
livenessProbe:
  exec:
    command:
      - node
      - -e
      - "var http = require('http'); http.get('http://localhost:3000/health', function(r) { process.exit(r.statusCode === 200 ? 0 : 1); }).on('error', function() { process.exit(1); });"

HTTP GET is the most common for web applications. TCP socket is lighter but only checks if the port is open. Exec runs a command inside the container.

Kubernetes Readiness Probes

Readiness probes determine if a container should receive traffic. A container that fails readiness is removed from the Service's endpoint list but is NOT restarted.

readinessProbe:
  httpGet:
    path: /health/ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 1

The readiness probe should be a deep check that verifies database connectivity and critical dependencies. When the database goes down:

  1. Readiness probe fails → pod removed from Service endpoints
  2. No new traffic routes to this pod
  3. Existing connections drain
  4. When the database recovers, readiness probe passes → pod re-added to endpoints

This prevents users from hitting pods that cannot serve requests.

Startup Probes

Startup probes handle slow-starting containers. While the startup probe is running, liveness and readiness probes are disabled.

startupProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 30  # 30 * 5s = 150s maximum startup time

Use startup probes when your application needs time to:

  • Load large models or datasets
  • Run database migrations
  • Warm caches
  • Establish connection pools

Without a startup probe, slow-starting apps get killed by the liveness probe before they finish initializing.
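The arithmetic behind the failureThreshold comment is worth making explicit when tuning probe budgets. A throwaway helper — the name is mine:

```javascript
// maxStartupSeconds — how long Kubernetes keeps probing before giving up:
// failureThreshold consecutive failures, spaced periodSeconds apart.
// Any initialDelaySeconds is added on top of this window.
function maxStartupSeconds(periodSeconds, failureThreshold) {
  return periodSeconds * failureThreshold;
}

module.exports = maxStartupSeconds;
// maxStartupSeconds(5, 30) → 150, matching the probe above
```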

// Track startup completion
var isReady = false;

function initialize(callback) {
  console.log('Starting initialization...');

  // Run migrations
  console.log('Running migrations...');
  // ...

  // Warm cache
  console.log('Warming cache...');
  // ...

  // Establish connection pools
  console.log('Connecting to database...');
  pool.query('SELECT 1', function(err) {
    if (err) return callback(err);

    console.log('Initialization complete');
    isReady = true;
    callback(null);
  });
}

app.get('/health', function(req, res) {
  // Startup and liveness - just check if process is responding
  res.status(200).json({ status: 'alive' });
});

app.get('/health/ready', function(req, res) {
  if (!isReady) {
    return res.status(503).json({ status: 'not ready', reason: 'initializing' });
  }

  healthChecker.run(function(err, result) {
    var statusCode = result.status === 'healthy' ? 200 : 503;
    res.status(statusCode).json(result);
  });
});

initialize(function(err) {
  if (err) {
    console.error('Initialization failed:', err.message);
    process.exit(1);
  }
});

Graceful Shutdown

Health checks work hand-in-hand with graceful shutdown. When Kubernetes terminates a pod, it sends SIGTERM and waits for terminationGracePeriodSeconds (default: 30 seconds) before sending SIGKILL.

var http = require('http');
var app = require('./app');

var server = http.createServer(app);
var isShuttingDown = false;

server.listen(3000, function() {
  console.log('Server listening on port 3000');
});

function shutdown(signal) {
  console.log(signal + ' received. Starting graceful shutdown...');
  isShuttingDown = true;

  // Stop accepting new connections
  server.close(function() {
    console.log('HTTP server closed');

    // Close database connections
    pool.end(function() {
      console.log('Database pool closed');

      // Close Redis connection
      redisClient.quit().then(function() {
        console.log('Redis connection closed');
        console.log('Graceful shutdown complete');
        process.exit(0);
      });
    });
  });

  // Force exit if graceful shutdown takes too long
  setTimeout(function() {
    console.error('Forced shutdown after timeout');
    process.exit(1);
  }, 25000); // Leave 5s buffer before SIGKILL
}

process.on('SIGTERM', function() { shutdown('SIGTERM'); });
process.on('SIGINT', function() { shutdown('SIGINT'); });

// Middleware to reject requests during shutdown. Register this before your
// routes — middleware added after a route never runs for that route.
app.use(function(req, res, next) {
  if (isShuttingDown) {
    res.set('Connection', 'close');
    return res.status(503).json({ error: 'Server is shutting down' });
  }
  next();
});

The shutdown sequence:

  1. SIGTERM received
  2. Set isShuttingDown = true → health checks fail → no new traffic routed
  3. server.close() waits for in-flight requests to complete
  4. Close database and cache connections
  5. Exit process
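Step 2 assumes the readiness check starts failing as soon as the flag flips. The decision logic is small enough to isolate — a sketch combining the isReady and isShuttingDown flags used above (the helper name is mine):

```javascript
// readinessStatusCode — what /health/ready should return, given the two
// lifecycle flags and the result of the dependency checks.
function readinessStatusCode(isReady, isShuttingDown, checksHealthy) {
  if (isShuttingDown) return 503; // draining: pull the pod from endpoints
  if (!isReady) return 503;       // still initializing
  return checksHealthy ? 200 : 503;
}

module.exports = readinessStatusCode;
```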

# Kubernetes deployment with graceful shutdown
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: api
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]

The preStop hook with sleep 5 gives the Kubernetes networking layer time to update iptables rules and remove the pod from the Service before the application starts rejecting requests. Without this, some requests may route to a terminating pod.

Complete Working Example

# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  labels:
    app: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: api
          image: myapp:latest
          ports:
            - containerPort: 3000
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: api-secrets
                  key: database-url
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: api-secrets
                  key: redis-url
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "500m"
          startupProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 30
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            periodSeconds: 20
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]
---
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 3000
  type: ClusterIP

# Dockerfile
FROM node:20-alpine AS production
WORKDIR /app

COPY package*.json ./
RUN npm ci --omit=dev

COPY . .

EXPOSE 3000

HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

USER node
CMD ["node", "app.js"]

// app.js - Complete application with health checks and graceful shutdown
var express = require('express');
var http = require('http');
var pg = require('pg');
var redis = require('redis');
var HealthChecker = require('./health');

var app = express();
var isShuttingDown = false;
var isReady = false;

app.use(express.json());

// Reject requests during shutdown
app.use(function(req, res, next) {
  if (isShuttingDown) {
    res.set('Connection', 'close');
    return res.status(503).json({ error: 'Server is shutting down' });
  }
  next();
});

// Database
var pool = new pg.Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20,
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 5000
});

// Redis
var redisClient = redis.createClient({ url: process.env.REDIS_URL });
redisClient.on('error', function(err) {
  console.error('Redis error:', err.message);
});
redisClient.connect().catch(function(err) {
  console.error('Redis connect failed:', err.message);
});

// Health checker
var healthChecker = new HealthChecker({ timeout: 5000 });

healthChecker.addCheck('database', function(callback) {
  pool.query('SELECT 1', function(err) { callback(err); });
});

healthChecker.addCheck('redis', function(callback) {
  redisClient.ping().then(function() { callback(null); })
    .catch(function(err) { callback(err); });
});

// Health endpoints
app.get('/health', function(req, res) {
  res.status(200).json({ status: 'alive', uptime: process.uptime() });
});

app.get('/health/ready', function(req, res) {
  if (!isReady) {
    return res.status(503).json({ status: 'initializing' });
  }
  healthChecker.run(function(err, result) {
    res.status(result.status === 'healthy' ? 200 : 503).json(result);
  });
});

// Application routes
app.get('/api/status', function(req, res) {
  res.json({ version: '1.0.0', ready: isReady });
});

// Initialize
function initialize(callback) {
  pool.query('SELECT NOW()', function(err) {
    if (err) return callback(err);
    console.log('Database connected');
    isReady = true;
    callback(null);
  });
}

// Start server
var server = http.createServer(app);
server.listen(3000, function() {
  console.log('Server listening on port 3000');
  initialize(function(err) {
    if (err) {
      console.error('Initialization failed:', err.message);
      process.exit(1);
    }
    console.log('Application ready');
  });
});

// Graceful shutdown
function shutdown(signal) {
  console.log(signal + ' received');
  isShuttingDown = true;

  server.close(function() {
    console.log('HTTP server closed');
    pool.end(function() {
      redisClient.quit().then(function() {
        console.log('Shutdown complete');
        process.exit(0);
      });
    });
  });

  setTimeout(function() {
    console.error('Forced shutdown');
    process.exit(1);
  }, 25000);
}

process.on('SIGTERM', function() { shutdown('SIGTERM'); });
process.on('SIGINT', function() { shutdown('SIGINT'); });

Common Issues and Troubleshooting

1. Container Stuck in CrashLoopBackOff

NAME    READY   STATUS             RESTARTS   AGE
api-1   0/1     CrashLoopBackOff   5          3m

The liveness probe is killing the container before it finishes starting. Add a startup probe or increase initialDelaySeconds:

startupProbe:
  httpGet:
    path: /health
    port: 3000
  failureThreshold: 30
  periodSeconds: 5

2. Pod Not Receiving Traffic

kubectl get endpoints api
# NAME   ENDPOINTS
# api    <none>

No endpoints means all pods are failing readiness probes. Check the probe:

kubectl describe pod api-abc123 | grep -A10 Readiness
# Readiness probe failed: HTTP probe failed with statuscode: 503

kubectl logs api-abc123
# Error: connect ECONNREFUSED 10.96.0.5:5432

The database is unreachable. Fix the database connection, not the health check.

3. Health Check Timeouts

Warning  Unhealthy  Pod/api-1  Liveness probe failed: Get "http://10.244.0.5:3000/health": context deadline exceeded

The health check endpoint is too slow. Common causes:

  • Health check queries a slow database view
  • Network latency between nodes
  • Container is CPU-starved

Fix: increase timeoutSeconds, simplify the liveness check, or increase CPU limits.

4. Docker Health Check Shows "unhealthy" but App Works

docker ps
# STATUS: Up 5 min (unhealthy)

curl http://localhost:3000/health
# {"status":"healthy"}

The health check command inside the container cannot reach the endpoint. Common causes:

  • wget or curl not installed in the image
  • Health check URL is wrong
  • The app binds to 127.0.0.1 but the health check uses a different interface

Fix: exec into the container and run the health check manually to debug:

docker exec -it myapp sh
wget --spider http://localhost:3000/health

Best Practices

  • Separate liveness from readiness. Liveness checks the process, readiness checks dependencies. Mixing them causes unnecessary restarts.
  • Keep liveness probes fast and simple. A 200 OK from /health is sufficient. Never include database queries in liveness probes.
  • Use startup probes for slow-starting apps. They prevent premature kills during initialization without requiring inflated initialDelaySeconds.
  • Always implement graceful shutdown. Handle SIGTERM, drain connections, and close database pools. Set a forced exit timeout shorter than terminationGracePeriodSeconds.
  • Add a preStop sleep in Kubernetes. A 5-second delay gives the network time to update before your app starts rejecting requests.
  • Return structured JSON from health endpoints. Include check names, durations, and error details. This makes debugging production issues dramatically easier.
  • Set appropriate timeouts. timeoutSeconds should be shorter than periodSeconds so checks never overlap. Remember that detection time is roughly failureThreshold × periodSeconds: three failures at a 10-second interval means about 30 seconds before the orchestrator reacts.
  • Do not expose health endpoints publicly. They reveal internal architecture details. Restrict them to internal networks or use authentication.
