DigitalOcean Monitoring and Alerting

A practical guide to monitoring and alerting on DigitalOcean, covering built-in metrics, alert policies, uptime checks, custom Node.js metrics, and integration with external monitoring tools.

Overview

DigitalOcean provides a built-in monitoring stack that covers infrastructure metrics, uptime checks, and alert policies with notification channels like Slack and email. For teams running production workloads on DigitalOcean, understanding how to configure these tools properly is the difference between catching a disk-full condition at 2 PM and getting paged at 3 AM when your database crashes. This guide walks through every layer of monitoring available on the platform, from the Droplet metrics agent to custom application-level metrics in Node.js, and shows you how to wire it all together into a coherent observability strategy.

Prerequisites

  • A DigitalOcean account with at least one Droplet or App Platform deployment
  • Node.js v18+ installed locally
  • A DigitalOcean Personal Access Token with read/write scope
  • Basic familiarity with Express.js and REST APIs
  • A Slack workspace (optional, for notification integration)
  • doctl CLI installed and authenticated (optional but recommended)

Built-in DigitalOcean Monitoring

The Droplet Metrics Agent

Every Droplet created after January 2018 comes with the DigitalOcean metrics agent pre-installed. If you are running an older Droplet, you need to install it manually. The agent collects system-level metrics and sends them to DigitalOcean's monitoring backend, where they become available in the control panel and through the API.

To install the agent on an existing Droplet:

curl -sSL https://repos.insights.digitalocean.com/install.sh | sudo bash

Verify the agent is running:

systemctl status do-agent

You should see output indicating the service is active. If the agent is not running, start it:

sudo systemctl enable do-agent
sudo systemctl start do-agent

The metrics agent collects the following data points at one-minute intervals:

  • CPU utilization - percentage of CPU time spent in user, system, and idle states
  • Memory usage - total, used, free, cached, and buffered memory
  • Disk I/O - read/write operations per second, bytes read/written
  • Disk usage - percentage of disk space used per mount point
  • Network bandwidth - inbound and outbound bytes per second per interface
  • Load average - 1-minute, 5-minute, and 15-minute load averages

These metrics are retained for 30 days and are accessible through the Graphs tab on any Droplet's detail page in the control panel.

Understanding the Metrics API

You can query Droplet metrics programmatically. This is useful for building custom dashboards or triggering actions based on metric thresholds outside of DigitalOcean's built-in alert system.

var https = require("https");

var token = process.env.DIGITALOCEAN_TOKEN;
var dropletId = process.env.DROPLET_ID;

var now = Math.floor(Date.now() / 1000);
var oneHourAgo = now - 3600;

var options = {
    hostname: "api.digitalocean.com",
    path: "/v2/monitoring/metrics/droplet/cpu?host_id=" + dropletId +
          "&start=" + oneHourAgo + "&end=" + now,
    method: "GET",
    headers: {
        "Authorization": "Bearer " + token,
        "Content-Type": "application/json"
    }
};

var req = https.request(options, function(res) {
    var body = "";
    res.on("data", function(chunk) {
        body += chunk;
    });
    res.on("end", function() {
        if (res.statusCode !== 200) {
            console.error("Metrics request failed: HTTP " + res.statusCode + ": " + body);
            return;
        }
        var data = JSON.parse(body);
        var results = data.data.result;
        results.forEach(function(series) {
            var mode = series.metric.mode;
            var lastValue = series.values[series.values.length - 1];
            // The API returns per-mode CPU time counters rather than percentages;
            // utilization is derived by comparing samples (see the sketch below).
            console.log("CPU mode: " + mode + ", latest sample: " + lastValue[1]);
        });
    });
});

req.on("error", function(err) {
    console.error("Metrics request failed:", err.message);
});

req.end();
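
The values in each series are cumulative CPU-time counters per mode rather than ready-made percentages, so utilization has to be derived by comparing samples. The helper below is a sketch under that assumption (verify it against the response shape you actually receive); it computes the busy share of the queried window as one minus the idle share.

// Hypothetical helper: derive CPU utilization over the queried window from the
// per-mode series returned by /v2/monitoring/metrics/droplet/cpu.
// Assumes each entry in series.values is [timestamp, cumulativeCpuSeconds].
function cpuUtilizationPercent(result) {
    var deltas = {};
    var totalDelta = 0;

    result.forEach(function(series) {
        var first = parseFloat(series.values[0][1]);
        var last = parseFloat(series.values[series.values.length - 1][1]);
        var delta = Math.max(last - first, 0);
        deltas[series.metric.mode] = delta;
        totalDelta += delta;
    });

    if (totalDelta === 0) {
        return 0;
    }

    var idleDelta = deltas.idle || 0;
    return (1 - idleDelta / totalDelta) * 100;
}

// Inside the "end" handler above:
// console.log("CPU utilization: " + cpuUtilizationPercent(data.data.result).toFixed(1) + "%");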

Available metric endpoints include:

Endpoint                                            Description
/v2/monitoring/metrics/droplet/cpu                  CPU utilization by mode
/v2/monitoring/metrics/droplet/memory_free          Free memory in bytes
/v2/monitoring/metrics/droplet/memory_available     Available memory in bytes
/v2/monitoring/metrics/droplet/memory_total         Total memory in bytes
/v2/monitoring/metrics/droplet/disk_read            Disk read bytes/sec
/v2/monitoring/metrics/droplet/disk_write           Disk write bytes/sec
/v2/monitoring/metrics/droplet/bandwidth            Network bandwidth
/v2/monitoring/metrics/droplet/load_1               1-minute load average
/v2/monitoring/metrics/droplet/load_5               5-minute load average
/v2/monitoring/metrics/droplet/load_15              15-minute load average
/v2/monitoring/metrics/droplet/filesystem_free      Free disk space
/v2/monitoring/metrics/droplet/filesystem_size      Total disk size
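
All of these endpoints take the same host_id, start, and end query parameters, so a small helper keeps the boilerplate in one place. This is a sketch that reuses the token and Droplet ID environment variables from the example above; the metric argument is the last path segment from the table.

var https = require("https");

var token = process.env.DIGITALOCEAN_TOKEN;
var dropletId = process.env.DROPLET_ID;

// Fetch any Droplet metric from the table above, e.g. "memory_available" or "load_5".
function getDropletMetric(metric, secondsBack, callback) {
    var now = Math.floor(Date.now() / 1000);

    var options = {
        hostname: "api.digitalocean.com",
        path: "/v2/monitoring/metrics/droplet/" + metric +
              "?host_id=" + dropletId + "&start=" + (now - secondsBack) + "&end=" + now,
        method: "GET",
        headers: { "Authorization": "Bearer " + token }
    };

    var req = https.request(options, function(res) {
        var body = "";
        res.on("data", function(chunk) { body += chunk; });
        res.on("end", function() {
            if (res.statusCode !== 200) {
                return callback(new Error("HTTP " + res.statusCode + ": " + body));
            }
            callback(null, JSON.parse(body).data.result);
        });
    });

    req.on("error", callback);
    req.end();
}

// Usage: the last hour of available memory
getDropletMetric("memory_available", 3600, function(err, result) {
    if (err) return console.error(err.message);
    console.log(JSON.stringify(result, null, 2));
});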

Setting Up Alert Policies

Alert policies are the backbone of proactive monitoring. Without them, you are simply collecting data nobody looks at. DigitalOcean supports alerts on CPU, memory, disk, bandwidth, and custom uptime checks.

Creating Alert Policies via the API

Here is how to create a comprehensive set of alert policies programmatically:

var https = require("https");

var token = process.env.DIGITALOCEAN_TOKEN;

function createAlertPolicy(policy, callback) {
    var postData = JSON.stringify(policy);

    var options = {
        hostname: "api.digitalocean.com",
        path: "/v2/monitoring/alerts",
        method: "POST",
        headers: {
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
            "Content-Length": Buffer.byteLength(postData)
        }
    };

    var req = https.request(options, function(res) {
        var body = "";
        res.on("data", function(chunk) {
            body += chunk;
        });
        res.on("end", function() {
            if (res.statusCode === 200 || res.statusCode === 201) {
                callback(null, JSON.parse(body));
            } else {
                callback(new Error("HTTP " + res.statusCode + ": " + body));
            }
        });
    });

    req.on("error", function(err) {
        callback(err);
    });

    req.write(postData);
    req.end();
}

var dropletIds = ["your-droplet-id"];

var policies = [
    {
        alerts: {
            email: ["[email protected]"],
            slack: [{
                channel: "#alerts",
                url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
            }]
        },
        compare: "GreaterThan",
        description: "CPU usage above 80% for 5 minutes",
        enabled: true,
        entities: dropletIds,
        tags: ["production"],
        type: "v1/insights/droplet/cpu",
        value: 80,
        window: "5m"
    },
    {
        alerts: {
            email: ["[email protected]"],
            slack: [{
                channel: "#alerts",
                url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
            }]
        },
        compare: "GreaterThan",
        description: "Memory usage above 90% for 5 minutes",
        enabled: true,
        entities: dropletIds,
        tags: ["production"],
        type: "v1/insights/droplet/memory_utilization_percent",
        value: 90,
        window: "5m"
    },
    {
        alerts: {
            email: ["[email protected]"],
            slack: [{
                channel: "#alerts",
                url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
            }]
        },
        compare: "GreaterThan",
        description: "Disk usage above 85%",
        enabled: true,
        entities: dropletIds,
        tags: ["production"],
        type: "v1/insights/droplet/disk_utilization_percent",
        value: 85,
        window: "5m"
    },
    {
        alerts: {
            email: ["[email protected]"],
            slack: [{
                channel: "#alerts",
                url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
            }]
        },
        compare: "GreaterThan",
        description: "Inbound bandwidth above 100 Mbps for 5 minutes",
        enabled: true,
        entities: dropletIds,
        tags: ["production"],
        type: "v1/insights/droplet/public_inbound_bandwidth",
        value: 100000000,
        window: "5m"
    }
];

var created = 0;
policies.forEach(function(policy) {
    createAlertPolicy(policy, function(err, result) {
        created++;
        if (err) {
            console.error("Failed to create policy:", policy.description, err.message);
        } else {
            console.log("Created policy:", result.policy.description);
        }
        if (created === policies.length) {
            console.log("All policies processed.");
        }
    });
});

Recommended Alert Thresholds

Through years of operating Node.js services on DigitalOcean, I have landed on these thresholds as a starting point. You should tune them based on your workload patterns.

Metric               Warning          Critical         Window
CPU                  70%              90%              5 min
Memory               80%              95%              5 min
Disk Usage           75%              90%              5 min
Load Average (1m)    2x vCPU count    4x vCPU count    5 min
Inbound Bandwidth    80% of limit     95% of limit     5 min

A common mistake is setting thresholds too aggressively. A Node.js process doing a garbage collection sweep can spike to 100% CPU for a few seconds. That is normal. Use the 5-minute window to avoid false positives.
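
The load-average rows in the table scale with Droplet size, so the alert value is best computed rather than hard-coded. The sketch below derives it from the local vCPU count; it assumes the script runs on the Droplet being monitored, that ALERT_EMAIL and DROPLET_ID are set, and that v1/insights/droplet/load_1 is the load-average alert type (confirm the type string against the alert policy reference). The resulting policy object can be handed to the createAlertPolicy helper from the previous example.

var os = require("os");

// Derive the critical load-average threshold from the vCPU count, matching the
// "4x vCPU count" row in the table above.
var vcpus = os.cpus().length;          // assumes this runs on the monitored Droplet
var loadCritical = vcpus * 4;

var loadPolicy = {
    alerts: {
        email: [process.env.ALERT_EMAIL]
    },
    compare: "GreaterThan",
    description: "1-minute load average above " + loadCritical + " (4x vCPU) for 5 minutes",
    enabled: true,
    entities: [process.env.DROPLET_ID],
    tags: ["production"],
    type: "v1/insights/droplet/load_1", // assumed alert type; verify in the API reference
    value: loadCritical,
    window: "5m"
};

// createAlertPolicy(loadPolicy, function(err, result) { ... });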

Uptime Checks

Uptime checks verify that your HTTP/HTTPS endpoints are reachable and responding correctly. DigitalOcean runs these checks from multiple regions every 30 seconds and triggers alerts when a check fails.

Creating Uptime Checks via the API

var https = require("https");

var token = process.env.DIGITALOCEAN_TOKEN;

function createUptimeCheck(check, callback) {
    var postData = JSON.stringify(check);

    var options = {
        hostname: "api.digitalocean.com",
        path: "/v2/uptime/checks",
        method: "POST",
        headers: {
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
            "Content-Length": Buffer.byteLength(postData)
        }
    };

    var req = https.request(options, function(res) {
        var body = "";
        res.on("data", function(chunk) {
            body += chunk;
        });
        res.on("end", function() {
            if (res.statusCode === 201) {
                callback(null, JSON.parse(body));
            } else {
                callback(new Error("HTTP " + res.statusCode + ": " + body));
            }
        });
    });

    req.on("error", callback);
    req.write(postData);
    req.end();
}

var checks = [
    {
        name: "Production API Health",
        type: "https",
        target: "https://api.yourapp.com/health",
        regions: ["us_east", "us_west", "eu_west"],
        enabled: true
    },
    {
        name: "Production Website",
        type: "https",
        target: "https://www.yourapp.com",
        regions: ["us_east", "us_west", "eu_west", "se_asia"],
        enabled: true
    }
];

checks.forEach(function(check) {
    createUptimeCheck(check, function(err, result) {
        if (err) {
            console.error("Failed:", check.name, err.message);
            return;
        }
        console.log("Created uptime check:", result.check.id, result.check.name);

        // Now create an alert for this check
        var alertData = JSON.stringify({
            name: check.name + " - Down Alert",
            type: "down",
            threshold: 2,
            comparison: "greater_than",
            period: "2m",
            notifications: {
                email: ["[email protected]"],
                slack: [{
                    channel: "#alerts",
                    url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
                }]
            }
        });

        var alertOptions = {
            hostname: "api.digitalocean.com",
            path: "/v2/uptime/checks/" + result.check.id + "/alerts",
            method: "POST",
            headers: {
                "Authorization": "Bearer " + token,
                "Content-Type": "application/json",
                "Content-Length": Buffer.byteLength(alertData)
            }
        };

        var alertReq = https.request(alertOptions, function(alertRes) {
            var alertBody = "";
            alertRes.on("data", function(chunk) { alertBody += chunk; });
            alertRes.on("end", function() {
                if (alertRes.statusCode === 200 || alertRes.statusCode === 201) {
                    console.log("Alert created for:", check.name);
                } else {
                    console.error("Alert creation failed:", check.name, "HTTP " + alertRes.statusCode);
                }
            });
        });

        alertReq.on("error", function(err) {
            console.error("Alert request failed:", check.name, err.message);
        });

        alertReq.write(alertData);
        alertReq.end();
    });
});

Slack and Email Notification Channels

DigitalOcean supports two notification channels out of the box: email and Slack webhooks. For serious production deployments, use both. Email serves as the durable record. Slack serves as the real-time channel your team actually watches.

Setting Up Slack Notifications

  1. Create a Slack app at https://api.slack.com/apps
  2. Enable Incoming Webhooks for the app
  3. Add a webhook URL to the channel you want alerts in (e.g., #alerts-production)
  4. Use the webhook URL in your alert policies

Here is a utility module for sending custom Slack notifications from your Node.js application alongside the DigitalOcean-native alerts:

var https = require("https");
var url = require("url");

function SlackNotifier(webhookUrl) {
    this.webhookUrl = webhookUrl;
    this.parsed = url.parse(webhookUrl);
}

SlackNotifier.prototype.send = function(message, callback) {
    var payload = JSON.stringify(message);

    var options = {
        hostname: this.parsed.hostname,
        path: this.parsed.path,
        method: "POST",
        headers: {
            "Content-Type": "application/json",
            "Content-Length": Buffer.byteLength(payload)
        }
    };

    var req = https.request(options, function(res) {
        var body = "";
        res.on("data", function(chunk) { body += chunk; });
        res.on("end", function() {
            callback(null, res.statusCode);
        });
    });

    req.on("error", callback);
    req.write(payload);
    req.end();
};

SlackNotifier.prototype.alert = function(title, text, severity, callback) {
    var colors = {
        critical: "#dc3545",
        warning: "#ffc107",
        info: "#17a2b8",
        ok: "#28a745"
    };

    var message = {
        attachments: [{
            color: colors[severity] || colors.info,
            title: title,
            text: text,
            ts: Math.floor(Date.now() / 1000),
            fields: [
                { title: "Severity", value: severity.toUpperCase(), short: true },
                { title: "Environment", value: process.env.NODE_ENV || "unknown", short: true }
            ]
        }]
    };

    this.send(message, callback);
};

module.exports = SlackNotifier;

Usage:

var SlackNotifier = require("./slack-notifier");
var notifier = new SlackNotifier(process.env.SLACK_WEBHOOK_URL);

notifier.alert(
    "High Memory Usage",
    "Memory usage on web-1 has exceeded 90% for the last 5 minutes.",
    "critical",
    function(err) {
        if (err) console.error("Slack notification failed:", err.message);
    }
);

Application-Level Monitoring from Node.js

Infrastructure metrics tell you the machine is healthy. Application metrics tell you the software is healthy. You need both. DigitalOcean's built-in monitoring handles the infrastructure side. For application-level observability, you need to instrument your code.

Custom Health Endpoints

Every production Node.js service should expose a health check endpoint. This endpoint serves double duty: DigitalOcean uptime checks hit it, and your load balancer uses it for health-based routing.

var os = require("os");

function healthCheckHandler(req, res) {
    var memUsage = process.memoryUsage();
    var uptime = process.uptime();
    var cpus = os.cpus();

    var cpuLoad = os.loadavg();
    var totalMem = os.totalmem();
    var freeMem = os.freemem();
    var memPercent = ((totalMem - freeMem) / totalMem * 100).toFixed(2);

    var health = {
        status: "healthy",
        timestamp: new Date().toISOString(),
        uptime: Math.floor(uptime) + "s",
        process: {
            pid: process.pid,
            version: process.version,
            heapUsed: Math.round(memUsage.heapUsed / 1024 / 1024) + "MB",
            heapTotal: Math.round(memUsage.heapTotal / 1024 / 1024) + "MB",
            rss: Math.round(memUsage.rss / 1024 / 1024) + "MB",
            external: Math.round(memUsage.external / 1024 / 1024) + "MB"
        },
        system: {
            loadAvg: cpuLoad.map(function(v) { return v.toFixed(2); }),
            cpuCount: cpus.length,
            memoryPercent: memPercent + "%",
            freeMem: Math.round(freeMem / 1024 / 1024) + "MB",
            platform: os.platform(),
            hostname: os.hostname()
        }
    };

    // Mark as unhealthy if heap is over 90% of total
    if (memUsage.heapUsed / memUsage.heapTotal > 0.9) {
        health.status = "degraded";
        health.warnings = health.warnings || [];
        health.warnings.push("Heap usage above 90%");
    }

    // Mark as unhealthy if load average is high
    if (cpuLoad[0] > cpus.length * 2) {
        health.status = "degraded";
        health.warnings = health.warnings || [];
        health.warnings.push("Load average exceeds 2x CPU count");
    }

    var statusCode = health.status === "healthy" ? 200 : 503;
    res.status(statusCode).json(health);
}

module.exports = healthCheckHandler;
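
To wire the handler into an Express app, mount it on the health route before any authentication middleware (the require path assumes the module above is saved as health-check.js):

var express = require("express");
var healthCheckHandler = require("./health-check");

var app = express();

app.get("/health", healthCheckHandler);

app.listen(8080, function() {
    console.log("Health check available at http://localhost:8080/health");
});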

Prometheus-Compatible Metrics with prom-client

If you want to go beyond basic health checks, prom-client gives you Prometheus-compatible metrics that work with Grafana, Datadog, and any other tool that speaks the Prometheus exposition format.

var promClient = require("prom-client");

// Create a registry
var register = new promClient.Registry();

// Add default Node.js metrics (GC, event loop, memory)
promClient.collectDefaultMetrics({ register: register });

// Custom metrics
var httpRequestDuration = new promClient.Histogram({
    name: "http_request_duration_seconds",
    help: "Duration of HTTP requests in seconds",
    labelNames: ["method", "route", "status_code"],
    buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
    registers: [register]
});

var httpRequestsTotal = new promClient.Counter({
    name: "http_requests_total",
    help: "Total number of HTTP requests",
    labelNames: ["method", "route", "status_code"],
    registers: [register]
});

var activeConnections = new promClient.Gauge({
    name: "http_active_connections",
    help: "Number of active HTTP connections",
    registers: [register]
});

var dbQueryDuration = new promClient.Histogram({
    name: "db_query_duration_seconds",
    help: "Duration of database queries in seconds",
    labelNames: ["operation", "collection"],
    buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5],
    registers: [register]
});

var errorCounter = new promClient.Counter({
    name: "app_errors_total",
    help: "Total number of application errors",
    labelNames: ["type", "code"],
    registers: [register]
});

// Middleware for tracking request metrics
function metricsMiddleware(req, res, next) {
    var start = process.hrtime();

    activeConnections.inc();

    res.on("finish", function() {
        activeConnections.dec();

        var duration = process.hrtime(start);
        var durationSeconds = duration[0] + duration[1] / 1e9;

        var route = req.route ? req.route.path : req.path;
        var labels = {
            method: req.method,
            route: route,
            status_code: res.statusCode
        };

        httpRequestDuration.observe(labels, durationSeconds);
        httpRequestsTotal.inc(labels);
    });

    next();
}

// Endpoint to expose metrics
function metricsHandler(req, res) {
    res.set("Content-Type", register.contentType);
    register.metrics().then(function(metrics) {
        res.end(metrics);
    }).catch(function(err) {
        res.status(500).end(err.message);
    });
}

module.exports = {
    register: register,
    middleware: metricsMiddleware,
    handler: metricsHandler,
    httpRequestDuration: httpRequestDuration,
    httpRequestsTotal: httpRequestsTotal,
    activeConnections: activeConnections,
    dbQueryDuration: dbQueryDuration,
    errorCounter: errorCounter
};
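
To expose these metrics from an Express app, mount the middleware before your routes and the handler on a /metrics path (the require path assumes the module above is saved as metrics.js):

var express = require("express");
var metrics = require("./metrics");

var app = express();

app.use(metrics.middleware);           // records duration, counts, and active connections
app.get("/metrics", metrics.handler);  // Prometheus-compatible scrape endpoint

app.get("/api/example", function(req, res) {
    res.json({ ok: true });
});

app.listen(8080);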

Log Management Strategies

DigitalOcean does not provide a centralized logging service comparable to AWS CloudWatch Logs. You have a few options.

Structured Logging

The first step is producing structured logs that are easy to parse and search:

var os = require("os");

function Logger(service) {
    this.service = service;
    this.hostname = os.hostname();
}

Logger.prototype.log = function(level, message, meta) {
    var entry = {
        timestamp: new Date().toISOString(),
        level: level,
        service: this.service,
        hostname: this.hostname,
        message: message,
        pid: process.pid
    };

    if (meta) {
        Object.keys(meta).forEach(function(key) {
            entry[key] = meta[key];
        });
    }

    // Errors get stack traces
    if (meta && meta.error instanceof Error) {
        entry.error = {
            name: meta.error.name,
            message: meta.error.message,
            stack: meta.error.stack
        };
    }

    process.stdout.write(JSON.stringify(entry) + "\n");
};

Logger.prototype.info = function(message, meta) {
    this.log("info", message, meta);
};

Logger.prototype.warn = function(message, meta) {
    this.log("warn", message, meta);
};

Logger.prototype.error = function(message, meta) {
    this.log("error", message, meta);
};

module.exports = Logger;
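
Usage (assuming the module above is saved as logger.js):

var Logger = require("./logger");
var logger = new Logger("web-api");

logger.info("Server started", { port: 8080 });
logger.warn("Cache miss rate high", { rate: 0.42 });
logger.error("Upstream request failed", { error: new Error("ETIMEDOUT") });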

Log Forwarding Options

For Droplet-based deployments, install a log forwarder. Three solid options:

  1. Vector (by Datadog, open source) - lightweight, fast, works well with DigitalOcean
  2. Fluent Bit - low memory footprint, extensive plugin ecosystem
  3. rsyslog - already installed on most Linux distributions

Example Vector configuration for forwarding to a Grafana Cloud Loki instance:

[sources.app_logs]
type = "file"
include = ["/var/log/yourapp/*.log"]
read_from = "beginning"

[transforms.parse_json]
type = "remap"
inputs = ["app_logs"]
source = '''
. = parse_json!(.message)
'''

[sinks.loki]
type = "loki"
inputs = ["parse_json"]
endpoint = "https://logs-prod-us-central1.grafana.net"
encoding.codec = "json"
labels.service = "{{ service }}"
labels.hostname = "{{ hostname }}"
labels.level = "{{ level }}"

[sinks.loki.auth]
strategy = "basic"
user = "${GRAFANA_LOKI_USER}"
password = "${GRAFANA_LOKI_TOKEN}"

For App Platform deployments, your application logs are available through doctl and the DigitalOcean control panel, but they are retained for a limited time. For production, always forward logs to an external destination.

Database Monitoring for Managed Databases

DigitalOcean Managed Databases (PostgreSQL, MySQL, Redis, MongoDB) include built-in monitoring dashboards. You can also query database metrics through the API.

var https = require("https");

var token = process.env.DIGITALOCEAN_TOKEN;
var clusterId = process.env.DB_CLUSTER_ID;

// Managed database metrics are exposed on Prometheus-compatible endpoints on each
// cluster node; this call retrieves the basic-auth credentials needed to scrape them.
function getDatabaseMetricsCredentials(callback) {
    var options = {
        hostname: "api.digitalocean.com",
        path: "/v2/databases/" + clusterId + "/metrics/credentials",
        method: "GET",
        headers: {
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json"
        }
    };

    var req = https.request(options, function(res) {
        var body = "";
        res.on("data", function(chunk) { body += chunk; });
        res.on("end", function() {
            if (res.statusCode !== 200) {
                return callback(new Error("HTTP " + res.statusCode + ": " + body));
            }
            callback(null, JSON.parse(body));
        });
    });

    req.on("error", callback);
    req.end();
}

// Key database metrics to monitor:
// - Connection count vs max connections
// - Query latency (p50, p95, p99)
// - Replication lag (for read replicas)
// - Cache hit ratio (PostgreSQL shared buffers, Redis hit rate)
// - Disk usage growth rate

Key metrics to watch for managed databases:

  • Connection count - If you approach max_connections, new connections will be refused. Set alerts at 80% (see the sketch after this list).
  • Replication lag - For read replicas, lag above 1 second indicates the replica is falling behind. This matters for read-after-write consistency.
  • Disk usage - Managed databases have disk limits. Running out of disk is catastrophic and not always recoverable.
  • Slow queries - Enable pg_stat_statements for PostgreSQL or the slow query log for MySQL.
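
To watch connection-count headroom from the application side, you can sample the server's own counters on an interval. The sketch below uses the node-postgres driver (npm install pg) against a managed PostgreSQL cluster; the DATABASE_URL variable, the 80% threshold, and the 60-second interval are assumptions, while the two queries are standard PostgreSQL.

var Pool = require("pg").Pool;

// Assumes DATABASE_URL is the managed cluster's connection string.
var pool = new Pool({ connectionString: process.env.DATABASE_URL });

function checkConnectionHeadroom() {
    pool.query("SELECT count(*)::int AS connection_count FROM pg_stat_activity", function(err, activity) {
        if (err) return console.error("pg_stat_activity query failed:", err.message);

        pool.query("SHOW max_connections", function(err2, settings) {
            if (err2) return console.error("max_connections query failed:", err2.message);

            var current = activity.rows[0].connection_count;
            var max = parseInt(settings.rows[0].max_connections, 10);
            var percent = (current / max) * 100;

            if (percent >= 80) {
                console.warn("Connection usage at " + percent.toFixed(1) + "% (" + current + "/" + max + ")");
                // Plug in the SlackNotifier from earlier if you want a notification here.
            }
        });
    });
}

setInterval(checkConnectionHeadroom, 60000);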

Load Balancer Monitoring

DigitalOcean Load Balancers provide metrics on request rates, response codes, and connection counts. Monitor these through the control panel or API.

The critical metrics for load balancers:

  • 4xx and 5xx response rates - A spike in 5xx responses means your backend is failing. A spike in 4xx might indicate a misconfigured client or an attack.
  • Active connections - If this approaches your load balancer's limit, you need to scale.
  • Backend health check results - If backends are being marked unhealthy, investigate immediately.
  • TLS handshake errors - Certificate problems show up here before users report them.

Set up health check monitoring on the load balancer itself:

// Load balancer health check configuration
var lbConfig = {
    health_check: {
        protocol: "http",
        port: 8080,
        path: "/health",
        check_interval_seconds: 10,
        response_timeout_seconds: 5,
        unhealthy_threshold: 3,
        healthy_threshold: 5
    }
};

Use an unhealthy threshold of 3. A backend must then fail three consecutive health checks before it is removed from rotation, which prevents a single slow response from taking a healthy server out of the pool.

Combining DigitalOcean with External Monitoring Tools

DigitalOcean's built-in monitoring is good for infrastructure basics. For a complete observability stack, combine it with external tools.

Grafana Cloud Integration

Grafana Cloud provides a free tier that includes Prometheus metrics storage, Loki for logs, and Grafana dashboards. It is the most cost-effective option for small to medium deployments.

Set up the Grafana Agent on your Droplet to forward Prometheus metrics:

# /etc/grafana-agent/agent.yaml
server:
  log_level: info

metrics:
  global:
    scrape_interval: 15s
  configs:
    - name: default
      scrape_configs:
        - job_name: 'node-app'
          static_configs:
            - targets: ['localhost:8080']
              labels:
                environment: 'production'
                service: 'web-api'
          metrics_path: '/metrics'
      remote_write:
        - url: https://prometheus-prod-us-central1.grafana.net/api/prom/push
          basic_auth:
            username: '${GRAFANA_PROM_USER}'
            password: '${GRAFANA_PROM_TOKEN}'

Datadog Integration

For teams with budget, Datadog provides the most comprehensive monitoring experience. Install the Datadog agent on your Droplet:

DD_API_KEY=your-api-key DD_SITE="datadoghq.com" bash -c \
  "$(curl -L https://install.datadoghq.com/scripts/install_script_agent7.sh)"

Configure the Node.js integration:

# /etc/datadog-agent/conf.d/openmetrics.d/conf.yaml
instances:
  - openmetrics_endpoint: http://localhost:8080/metrics
    namespace: "yourapp"
    metrics:
      - http_request_duration_seconds
      - http_requests_total
      - http_active_connections
      - db_query_duration_seconds
      - app_errors_total

Incident Response Workflows

Monitoring without a response plan is theater. Here is a practical incident response workflow for a small team running on DigitalOcean.

Severity Levels

Level    Definition                               Response Time        Example
SEV1     Service down, all users affected         15 minutes           Website unreachable
SEV2     Service degraded, some users affected    30 minutes           API latency > 5 seconds
SEV3     Non-critical issue, no user impact       Next business day    Disk usage at 75%

Automated Escalation

var SlackNotifier = require("./slack-notifier");

var channels = {
    alerts: new SlackNotifier(process.env.SLACK_ALERTS_WEBHOOK),
    oncall: new SlackNotifier(process.env.SLACK_ONCALL_WEBHOOK),
    management: new SlackNotifier(process.env.SLACK_MGMT_WEBHOOK)
};

function escalate(severity, title, details) {
    switch (severity) {
        case "SEV1":
            // Notify all channels immediately
            channels.alerts.alert(title, details, "critical", function() {});
            channels.oncall.alert("[SEV1] " + title, details, "critical", function() {});
            channels.management.alert("[SEV1] " + title, details, "critical", function() {});
            break;
        case "SEV2":
            channels.alerts.alert(title, details, "warning", function() {});
            channels.oncall.alert("[SEV2] " + title, details, "warning", function() {});
            break;
        case "SEV3":
            channels.alerts.alert(title, details, "info", function() {});
            break;
    }
}

module.exports = escalate;
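
Usage (assuming the module above is saved as escalate.js):

var escalate = require("./escalate");

escalate(
    "SEV1",
    "Production API down",
    "Uptime checks failing from all regions since 14:32 UTC."
);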

Cost of Monitoring vs Cost of Downtime

Let me put this bluntly: monitoring is cheap; downtime is expensive.

DigitalOcean's built-in monitoring is free. Alert policies, uptime checks, Droplet metrics - all included at no additional cost. There is zero excuse for not having basic alerts configured.

For external tools, a reasonable budget looks like:

Tool                     Monthly Cost    What You Get
DigitalOcean Built-in    $0              Infrastructure metrics, alerts, uptime checks
Grafana Cloud Free       $0              10k metrics series, 50GB logs, 50GB traces
Grafana Cloud Pro        $29+            Higher limits, alerting, SLOs
Datadog                  $15/host        Full-stack observability
UptimeRobot Free         $0              50 monitors, 5-min intervals
Compare this to the cost of one hour of downtime. For a SaaS application doing $50K/month in revenue, one hour of downtime costs roughly $69 in lost revenue alone ($50,000 spread over about 730 hours in a month). For e-commerce, the per-hour cost can be ten times that. A $29/month monitoring subscription pays for itself the first time it catches an issue before it becomes an outage.

Complete Working Example

Here is a complete Express.js application with comprehensive monitoring built in. This ties together everything discussed above.

var express = require("express");
var os = require("os");
var promClient = require("prom-client");
var https = require("https");
var url = require("url");

// ============================================================
// Prometheus Metrics Setup
// ============================================================
var register = new promClient.Registry();
promClient.collectDefaultMetrics({ register: register });

var httpDuration = new promClient.Histogram({
    name: "http_request_duration_seconds",
    help: "HTTP request duration in seconds",
    labelNames: ["method", "route", "status"],
    buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
    registers: [register]
});

var httpTotal = new promClient.Counter({
    name: "http_requests_total",
    help: "Total HTTP requests",
    labelNames: ["method", "route", "status"],
    registers: [register]
});

var activeConns = new promClient.Gauge({
    name: "http_active_connections",
    help: "Active HTTP connections",
    registers: [register]
});

var errors = new promClient.Counter({
    name: "app_errors_total",
    help: "Application errors",
    labelNames: ["type"],
    registers: [register]
});

// ============================================================
// Slack Notifier
// ============================================================
function SlackNotifier(webhookUrl) {
    if (!webhookUrl) {
        this.disabled = true;
        return;
    }
    this.parsed = url.parse(webhookUrl);
    this.disabled = false;
}

SlackNotifier.prototype.send = function(text, color, callback) {
    if (this.disabled) {
        if (callback) callback(null);
        return;
    }

    var payload = JSON.stringify({
        attachments: [{
            color: color,
            text: text,
            ts: Math.floor(Date.now() / 1000)
        }]
    });

    var options = {
        hostname: this.parsed.hostname,
        path: this.parsed.path,
        method: "POST",
        headers: {
            "Content-Type": "application/json",
            "Content-Length": Buffer.byteLength(payload)
        }
    };

    var req = https.request(options, function(res) {
        var body = "";
        res.on("data", function(chunk) { body += chunk; });
        res.on("end", function() {
            if (callback) callback(null, res.statusCode);
        });
    });

    req.on("error", function(err) {
        if (callback) callback(err);
    });

    req.write(payload);
    req.end();
};

var slack = new SlackNotifier(process.env.SLACK_WEBHOOK_URL);

// ============================================================
// Logger
// ============================================================
function log(level, message, meta) {
    var entry = {
        timestamp: new Date().toISOString(),
        level: level,
        message: message,
        pid: process.pid,
        hostname: os.hostname()
    };
    if (meta) {
        Object.keys(meta).forEach(function(key) {
            entry[key] = meta[key];
        });
    }
    process.stdout.write(JSON.stringify(entry) + "\n");
}

// ============================================================
// Express App
// ============================================================
var app = express();
app.use(express.json());

// Metrics middleware
app.use(function(req, res, next) {
    var start = process.hrtime();
    activeConns.inc();

    res.on("finish", function() {
        activeConns.dec();
        var diff = process.hrtime(start);
        var duration = diff[0] + diff[1] / 1e9;
        var route = req.route ? req.route.path : req.path;
        var labels = { method: req.method, route: route, status: res.statusCode };
        httpDuration.observe(labels, duration);
        httpTotal.inc(labels);

        log("info", "request", {
            method: req.method,
            path: req.path,
            status: res.statusCode,
            duration: duration.toFixed(4) + "s",
            ip: req.ip
        });
    });

    next();
});

// Attach the request start time for any handler that needs it
app.use(function(req, res, next) {
    req.startTime = Date.now();
    next();
});

// ============================================================
// Health Check Endpoint
// ============================================================
app.get("/health", function(req, res) {
    var memUsage = process.memoryUsage();
    var cpuLoad = os.loadavg();
    var uptime = process.uptime();
    var totalMem = os.totalmem();
    var freeMem = os.freemem();

    var status = "healthy";
    var warnings = [];

    if (memUsage.heapUsed / memUsage.heapTotal > 0.9) {
        status = "degraded";
        warnings.push("Heap usage above 90%");
    }

    if (cpuLoad[0] > os.cpus().length * 2) {
        status = "degraded";
        warnings.push("High load average: " + cpuLoad[0].toFixed(2));
    }

    var health = {
        status: status,
        timestamp: new Date().toISOString(),
        uptime: Math.floor(uptime) + "s",
        version: process.env.APP_VERSION || "unknown",
        node: process.version,
        process: {
            heapUsed: Math.round(memUsage.heapUsed / 1024 / 1024) + "MB",
            heapTotal: Math.round(memUsage.heapTotal / 1024 / 1024) + "MB",
            rss: Math.round(memUsage.rss / 1024 / 1024) + "MB"
        },
        system: {
            loadAvg: cpuLoad.map(function(v) { return v.toFixed(2); }),
            cpuCount: os.cpus().length,
            memoryUsed: ((1 - freeMem / totalMem) * 100).toFixed(1) + "%"
        }
    };

    if (warnings.length > 0) {
        health.warnings = warnings;
    }

    res.status(status === "healthy" ? 200 : 503).json(health);
});

// ============================================================
// Prometheus Metrics Endpoint
// ============================================================
app.get("/metrics", function(req, res) {
    res.set("Content-Type", register.contentType);
    register.metrics().then(function(data) {
        res.end(data);
    }).catch(function(err) {
        res.status(500).end(err.message);
    });
});

// ============================================================
// Readiness and Liveness Probes
// ============================================================
var ready = false;

app.get("/ready", function(req, res) {
    if (ready) {
        res.status(200).json({ ready: true });
    } else {
        res.status(503).json({ ready: false });
    }
});

app.get("/live", function(req, res) {
    res.status(200).json({ alive: true, pid: process.pid });
});

// ============================================================
// Sample API Routes
// ============================================================
app.get("/api/data", function(req, res) {
    // Simulate some work
    var result = { items: [], count: 0 };
    for (var i = 0; i < 100; i++) {
        result.items.push({ id: i, value: Math.random() });
    }
    result.count = result.items.length;
    res.json(result);
});

// ============================================================
// Error Handling
// ============================================================
app.use(function(err, req, res, next) {
    errors.inc({ type: err.name || "UnknownError" });

    log("error", "Unhandled error", {
        error: err.message,
        stack: err.stack,
        path: req.path,
        method: req.method
    });

    slack.send(
        "[ERROR] " + err.message + " on " + req.method + " " + req.path,
        "#dc3545",
        function() {}
    );

    res.status(500).json({ error: "Internal server error" });
});

// ============================================================
// Process-Level Monitoring
// ============================================================
process.on("uncaughtException", function(err) {
    errors.inc({ type: "UncaughtException" });
    log("error", "Uncaught exception", { error: err.message, stack: err.stack });
    slack.send("[FATAL] Uncaught exception: " + err.message, "#dc3545", function() {
        process.exit(1);
    });
});

process.on("unhandledRejection", function(reason) {
    errors.inc({ type: "UnhandledRejection" });
    var message = reason instanceof Error ? reason.message : String(reason);
    log("error", "Unhandled rejection", { reason: message });
    slack.send("[ERROR] Unhandled rejection: " + message, "#ffc107", function() {});
});

// Memory usage check every 60 seconds
setInterval(function() {
    var mem = process.memoryUsage();
    var heapPercent = (mem.heapUsed / mem.heapTotal * 100).toFixed(1);

    if (mem.heapUsed / mem.heapTotal > 0.85) {
        log("warn", "High heap usage", {
            heapUsed: Math.round(mem.heapUsed / 1024 / 1024) + "MB",
            heapTotal: Math.round(mem.heapTotal / 1024 / 1024) + "MB",
            percent: heapPercent + "%"
        });

        slack.send(
            "[WARN] Heap usage at " + heapPercent + "% on " + os.hostname(),
            "#ffc107",
            function() {}
        );
    }
}, 60000);

// ============================================================
// Setup DigitalOcean Alert Policies on Startup
// ============================================================
// Note: this runs on every startup and does not check for existing policies, so
// repeated restarts will create duplicate alerts. In practice, create policies once
// from a provisioning script; it is inlined here to keep the example self-contained.
function setupAlertPolicies() {
    var doToken = process.env.DIGITALOCEAN_TOKEN;
    var dropletId = process.env.DROPLET_ID;

    if (!doToken || !dropletId) {
        log("info", "Skipping DO alert setup - missing token or droplet ID");
        return;
    }

    var policies = [
        {
            type: "v1/insights/droplet/cpu",
            description: "CPU > 80% for 5 minutes",
            compare: "GreaterThan",
            value: 80,
            window: "5m"
        },
        {
            type: "v1/insights/droplet/memory_utilization_percent",
            description: "Memory > 90% for 5 minutes",
            compare: "GreaterThan",
            value: 90,
            window: "5m"
        },
        {
            type: "v1/insights/droplet/disk_utilization_percent",
            description: "Disk > 85% for 5 minutes",
            compare: "GreaterThan",
            value: 85,
            window: "5m"
        }
    ];

    policies.forEach(function(policy) {
        policy.enabled = true;
        policy.entities = [dropletId];
        policy.tags = [process.env.NODE_ENV || "development"];
        policy.alerts = {
            email: [process.env.ALERT_EMAIL || "[email protected]"]
        };

        if (process.env.SLACK_WEBHOOK_URL) {
            policy.alerts.slack = [{
                channel: "#alerts",
                url: process.env.SLACK_WEBHOOK_URL
            }];
        }

        var postData = JSON.stringify(policy);
        var options = {
            hostname: "api.digitalocean.com",
            path: "/v2/monitoring/alerts",
            method: "POST",
            headers: {
                "Authorization": "Bearer " + doToken,
                "Content-Type": "application/json",
                "Content-Length": Buffer.byteLength(postData)
            }
        };

        var req = https.request(options, function(res) {
            var body = "";
            res.on("data", function(chunk) { body += chunk; });
            res.on("end", function() {
                if (res.statusCode === 200 || res.statusCode === 201) {
                    log("info", "Alert policy created: " + policy.description);
                } else {
                    log("warn", "Alert policy failed: " + policy.description, {
                        status: res.statusCode,
                        response: body
                    });
                }
            });
        });

        req.on("error", function(err) {
            log("error", "Alert policy request failed", { error: err.message });
        });

        req.write(postData);
        req.end();
    });
}

// ============================================================
// Start Server
// ============================================================
var PORT = process.env.PORT || 8080;

var server = app.listen(PORT, function() {
    log("info", "Server started", { port: PORT, env: process.env.NODE_ENV });
    ready = true;
    setupAlertPolicies();
});

// Graceful shutdown
process.on("SIGTERM", function() {
    log("info", "SIGTERM received, shutting down gracefully");
    ready = false;

    server.close(function() {
        log("info", "HTTP server closed");
        process.exit(0);
    });

    // Force shutdown after 30 seconds; unref() lets the process exit sooner on a clean close
    setTimeout(function() {
        log("error", "Forced shutdown after timeout");
        process.exit(1);
    }, 30000).unref();
});

module.exports = app;

Install the dependencies:

npm install express prom-client

Run the application:

PORT=8080 NODE_ENV=production SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/HOOK node app.js

Test the endpoints:

# Health check
curl http://localhost:8080/health

# Prometheus metrics
curl http://localhost:8080/metrics

# Readiness probe
curl http://localhost:8080/ready

# Liveness probe
curl http://localhost:8080/live

Common Issues and Troubleshooting

1. Metrics Agent Not Reporting Data

Error: No graphs visible on the Droplet monitoring tab. The control panel shows "Install the metrics agent to see monitoring data."

Cause: The do-agent service is either not installed or not running. This commonly happens on Droplets created from custom images or snapshots.

Fix:

# Check if the agent is installed
which do-agent

# If not installed
curl -sSL https://repos.insights.digitalocean.com/install.sh | sudo bash

# If installed but not running
sudo systemctl start do-agent
sudo systemctl enable do-agent

# Verify it is reporting
journalctl -u do-agent -f

If the agent is running but metrics still do not appear, check the Droplet metadata service:

curl -s http://169.254.169.254/metadata/v1/id

If this returns nothing, the metadata service is not accessible, which the agent requires. This happens on Droplets with misconfigured network settings.

2. Alert Policies Not Firing

Error: You have configured alert policies but never receive notifications, even when thresholds are clearly exceeded.

Cause: The most common reason is that the entities field in the alert policy contains the wrong Droplet ID, or the alert window has not been exceeded. Another frequent cause is that the Slack webhook URL has expired or been revoked.

Fix:

# Verify your Droplet ID
doctl compute droplet list --format ID,Name

# List existing alert policies
doctl monitoring alert list

# Test your Slack webhook directly
curl -X POST -H 'Content-type: application/json' \
  --data '{"text":"Test alert from DigitalOcean monitoring setup"}' \
  https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK

If the Slack test works but DigitalOcean alerts do not arrive, check that the alert policy's window is appropriate. A 5-minute window means the condition must persist for a full 5 minutes before triggering.

3. prom-client Memory Leak with High-Cardinality Labels

Error: JavaScript heap out of memory after running for several hours, with the /metrics endpoint taking increasingly long to respond.

Cause: Using request paths as metric labels without normalization creates a new time series for every unique URL. If your API has user IDs or UUIDs in paths, the label cardinality explodes.

Fix:

// BAD - unbounded cardinality
var route = req.path; // "/users/abc123", "/users/def456", etc.

// GOOD - normalized route
var route = req.route ? req.route.path : "unknown";
// Results in "/users/:id" regardless of the actual user ID

If you also expose Summary metrics, give them a sliding time window so old observations are rotated out instead of accumulating forever. In prom-client, maxAgeSeconds and ageBuckets are Summary options; Histograms use fixed buckets, so their memory footprint is bounded by label cardinality alone.

var httpDurationSummary = new promClient.Summary({
    name: "http_request_duration_quantiles_seconds",
    help: "HTTP request duration quantiles",
    labelNames: ["method", "route", "status"],
    percentiles: [0.5, 0.9, 0.95, 0.99],
    maxAgeSeconds: 600,
    ageBuckets: 5,
    registers: [register]
});

4. Uptime Check Failing Behind Load Balancer

Error: DigitalOcean uptime check reports the service as down, but the service is accessible from your browser. The uptime check shows HTTP 502 Bad Gateway or Connection Timeout.

Cause: DigitalOcean uptime checks come from specific IP ranges. If your firewall or DigitalOcean Cloud Firewall is blocking these IPs, the checks will fail. Another common cause is that the health check endpoint is behind authentication middleware.

Fix:

// Ensure health endpoint is BEFORE auth middleware
app.get("/health", function(req, res) {
    res.status(200).json({ status: "ok" });
});

// Auth middleware applied after health route
app.use(authMiddleware);

// Protected routes below
app.get("/api/data", function(req, res) {
    // ...
});

For firewall issues, allow DigitalOcean's monitoring IP ranges. You can find the current list in their documentation, but the simplest fix is to ensure your Cloud Firewall allows inbound HTTP/HTTPS from all sources to your health check port.

5. Database Connection Pool Exhaustion Not Detected

Error: Application returns Error: Cannot acquire connection from pool. Pool is probably full. but no DigitalOcean alert fires because database CPU and memory are fine.

Cause: DigitalOcean's database monitoring tracks server-level metrics, not application-level connection pool state. Your pool can be exhausted while the database server sits idle.

Fix: Monitor connection pool metrics from your application:

var pool = require("./db").pool; // Your database pool

var poolActive = new promClient.Gauge({
    name: "db_pool_active_connections",
    help: "Active database connections",
    registers: [register]
});

var poolIdle = new promClient.Gauge({
    name: "db_pool_idle_connections",
    help: "Idle database connections",
    registers: [register]
});

var poolWaiting = new promClient.Gauge({
    name: "db_pool_waiting_requests",
    help: "Requests waiting for a connection",
    registers: [register]
});

setInterval(function() {
    poolActive.set(pool.totalCount - pool.idleCount);
    poolIdle.set(pool.idleCount);
    poolWaiting.set(pool.waitingCount);

    if (pool.waitingCount > 10) {
        log("warn", "Connection pool congestion", {
            active: pool.totalCount - pool.idleCount,
            idle: pool.idleCount,
            waiting: pool.waitingCount,
            total: pool.totalCount
        });
    }
}, 5000);

Best Practices

  • Alert on symptoms, not causes. Monitor error rates and latency from the user's perspective first. CPU and memory alerts are useful, but a user-facing 5xx rate spike tells you something is actually broken, not just busy.

  • Set up alerts on day one, not after the first outage. DigitalOcean's built-in alerts are free. There is no reason to skip this step. At minimum, configure CPU > 80%, memory > 90%, and disk > 85% alerts for every production Droplet.

  • Use multiple notification channels. Email and Slack together. Email is durable and searchable. Slack is immediate. If one channel is down, you still get notified through the other. For critical alerts, add a PagerDuty or Opsgenie integration for phone call escalation.

  • Normalize metric labels to prevent cardinality explosions. Never use raw user IDs, session IDs, or UUIDs as Prometheus labels. Use route patterns (/users/:id) instead of resolved paths (/users/abc123). High cardinality is the number one cause of monitoring system performance problems.

  • Implement health checks with depth levels. A basic /health endpoint that returns 200 is a starting point. Add a /health?deep=true mode that checks database connectivity, external API reachability, and cache availability. Use the shallow check for load balancer probes (fast, frequent) and the deep check for uptime monitoring (less frequent, more informative); a sketch follows this list.

  • Monitor your monitoring. If your Slack webhook is revoked, your alerts silently vanish. Periodically send test alerts to verify the notification pipeline is working. Set a monthly calendar reminder to check that your alert policies are still active and your notification channels still function.

  • Separate operational alerts from informational dashboards. Not every metric needs an alert. Dashboard metrics like request rates and cache hit ratios are useful for capacity planning and debugging, but they should not wake someone up at 3 AM. Reserve alerts for conditions that require human intervention.

  • Log structured JSON, not plaintext. Structured logs are parseable, searchable, and filterable. Plaintext logs require regex patterns that break when the format changes. Every log entry should include a timestamp, severity level, service name, and enough context to understand the event without looking at surrounding lines.

  • Budget for log retention. DigitalOcean does not provide long-term log storage. Forward logs to Grafana Cloud Loki, Datadog, or even a dedicated Droplet running Loki. Thirty days of retention is the minimum for production. Ninety days gives you enough runway to spot trends.

  • Practice incident response before you need it. Run a tabletop exercise where you simulate a database disk-full scenario. Walk through how the alert fires, who gets paged, what the runbook says, and how the issue gets resolved. The first time you practice this should not be during an actual incident.
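
As a sketch of the depth-levels idea from the list above, the handler below adds a ?deep=true mode to a basic health endpoint. The checkDatabase and checkCache functions are placeholders for whatever dependencies your service actually has (a SELECT 1 against PostgreSQL, a Redis PING, and so on).

var express = require("express");
var app = express();

// Placeholder dependency checks; replace with real probes for your stack.
function checkDatabase(callback) { callback(null, true); }
function checkCache(callback) { callback(null, true); }

app.get("/health", function(req, res) {
    if (req.query.deep !== "true") {
        // Shallow mode: fast and cheap, suitable for load balancer probes
        return res.status(200).json({ status: "healthy", mode: "shallow" });
    }

    // Deep mode: verify dependencies, suitable for uptime monitoring
    checkDatabase(function(dbErr, dbOk) {
        checkCache(function(cacheErr, cacheOk) {
            var healthy = !dbErr && dbOk && !cacheErr && cacheOk;
            res.status(healthy ? 200 : 503).json({
                status: healthy ? "healthy" : "degraded",
                mode: "deep",
                checks: {
                    database: dbErr ? "error: " + dbErr.message : (dbOk ? "ok" : "failed"),
                    cache: cacheErr ? "error: " + cacheErr.message : (cacheOk ? "ok" : "failed")
                }
            });
        });
    });
});

app.listen(8080);

Keep the deep check off the load balancer probe path so a slow dependency does not pull otherwise healthy instances out of rotation.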
