
DigitalOcean Monitoring and Alerting

A practical guide to DigitalOcean monitoring covering built-in metrics, the monitoring agent, alert policies, uptime checks, and integration with Node.js applications.


Monitoring tells you what is happening inside your infrastructure. Alerting tells you when something goes wrong before your users notice. Without monitoring, you discover problems when customers report them. With monitoring and proper alerts, you discover problems when they start — often before they affect anyone.

DigitalOcean provides built-in monitoring for Droplets, Kubernetes clusters, managed databases, and load balancers. The monitoring agent collects CPU, memory, disk, and bandwidth metrics automatically. Alert policies trigger notifications when metrics cross thresholds. Uptime checks verify your application is reachable from the outside. This guide covers the full observability stack for Node.js applications on DigitalOcean.

Prerequisites

  • A DigitalOcean account
  • One or more Droplets or a Kubernetes cluster
  • A Node.js application deployed on DigitalOcean
  • doctl CLI installed (optional)

Built-in Droplet Monitoring

Enabling the Monitoring Agent

New Droplets created through the dashboard have monitoring enabled by default. For existing Droplets or those created via CLI:

# Create a Droplet with monitoring enabled
doctl compute droplet create my-app \
  --image ubuntu-22-04-x64 \
  --size s-2vcpu-4gb \
  --region nyc3 \
  --enable-monitoring

For existing Droplets, install the agent manually:

# SSH into your Droplet
ssh deploy@YOUR_DROPLET_IP

# Install the DigitalOcean monitoring agent
curl -sSL https://repos.insights.digitalocean.com/install.sh | sudo bash

The agent collects metrics every minute and sends them to DigitalOcean's monitoring backend. No configuration files to manage.

Available Metrics

The monitoring agent reports these metrics:

CPU

  • CPU utilization (%) — percentage of CPU time used. Sustained >80% indicates a need to scale or optimize.

Memory

  • Memory utilization (%) — RAM in use. High memory usage can cause the OOM killer to terminate your Node.js process.

Disk

  • Disk utilization (%) — storage space used. At 100%, the application cannot write logs, temp files, or database records.
  • Disk I/O (read/write) — bytes read from and written to disk per second.

Bandwidth

  • Public inbound/outbound (Mbps) — network traffic on the public interface.
  • Private inbound/outbound (Mbps) — network traffic on the VPC private interface.

Viewing Metrics

Navigate to your Droplet in the DigitalOcean dashboard and click Graphs. You can view metrics for the last 1 hour, 6 hours, 24 hours, 7 days, or 30 days.
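
The same graphs are backed by the DigitalOcean API, so metrics can also be pulled into your own tooling. A minimal sketch, assuming Node 18+ for the global fetch, a DIGITALOCEAN_TOKEN environment variable with read access, and the Droplet CPU endpoint of the Monitoring API:

// fetch-cpu.js: a sketch of pulling Droplet CPU metrics over the API.
// Assumes Node 18+ (global fetch), a DIGITALOCEAN_TOKEN environment
// variable, and the /v2/monitoring/metrics/droplet/cpu endpoint.
var DROPLET_ID = process.env.DROPLET_ID;
var TOKEN = process.env.DIGITALOCEAN_TOKEN;

function fetchCpuMetrics(hours) {
  var end = Math.floor(Date.now() / 1000);
  var start = end - hours * 3600;
  var url = "https://api.digitalocean.com/v2/monitoring/metrics/droplet/cpu" +
    "?host_id=" + DROPLET_ID + "&start=" + start + "&end=" + end;

  return fetch(url, {
    headers: { Authorization: "Bearer " + TOKEN }
  }).then(function(res) {
    if (!res.ok) {
      throw new Error("Metrics request failed: " + res.status);
    }
    return res.json();
  });
}

// Print the raw time series for the last 6 hours
fetchCpuMetrics(6).then(function(body) {
  console.log(JSON.stringify(body.data, null, 2));
}).catch(function(err) {
  console.error(err.message);
});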

Alert Policies

Creating Alerts via Dashboard

  1. Navigate to Monitoring > Alerts
  2. Click Create Alert Policy
  3. Configure the alert:
    • Metric: CPU utilization, memory utilization, disk utilization, bandwidth
    • Threshold: the value that triggers the alert
    • Duration: how long the metric must exceed the threshold
    • Resources: which Droplets to monitor (by name or tag)
    • Notifications: email addresses and/or Slack webhooks

Creating Alerts via CLI

# Alert when CPU exceeds 80% for 5 minutes
doctl monitoring alert create \
  --type "v1/insights/droplet/cpu" \
  --compare "GreaterThan" \
  --value 80 \
  --window "5m" \
  --entities YOUR_DROPLET_ID \
  --emails "[email protected]" \
  --description "High CPU on web server"

# Alert when disk exceeds 90%
doctl monitoring alert create \
  --type "v1/insights/droplet/disk_utilization_percent" \
  --compare "GreaterThan" \
  --value 90 \
  --window "5m" \
  --entities YOUR_DROPLET_ID \
  --emails "[email protected]" \
  --description "Disk nearly full"

# Alert when memory exceeds 85%
doctl monitoring alert create \
  --type "v1/insights/droplet/memory_utilization_percent" \
  --compare "GreaterThan" \
  --value 85 \
  --window "5m" \
  --entities YOUR_DROPLET_ID \
  --emails "[email protected]" \
  --description "High memory usage"

Recommended Alert Thresholds

Metric           Warning             Critical            Action
CPU              >70% for 10 min     >90% for 5 min      Scale or optimize
Memory           >80% for 10 min     >90% for 5 min      Restart, add RAM, fix leak
Disk             >80%                >90%                Clean logs, expand storage
Bandwidth out    >80% of limit       >90% of limit       Check for DDoS, optimize

Set warning thresholds lower than critical thresholds. Warnings give you time to investigate. Critical alerts mean you need to act immediately.

Slack Notifications

# Add a Slack webhook to an alert
doctl monitoring alert create \
  --type "v1/insights/droplet/cpu" \
  --compare "GreaterThan" \
  --value 80 \
  --window "5m" \
  --entities YOUR_DROPLET_ID \
  --slack-urls "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK" \
  --description "High CPU alert"

Uptime Checks

Uptime checks monitor your application from external locations. DigitalOcean sends HTTP requests to your endpoint and verifies the response. This detects outages that internal monitoring might miss — network issues, DNS failures, SSL problems.

Creating Uptime Checks

# HTTP uptime check
doctl monitoring uptime check create \
  --name "My App Health" \
  --type "https" \
  --target "https://myapp.example.com/health" \
  --regions "us_east,us_west,eu_west" \
  --enabled true

Check Types

  • HTTP/HTTPS — sends a GET request and checks for a 2xx response
  • TCP — connects to a port and checks for a successful connection
  • Ping — sends an ICMP ping and checks for a response

Uptime Check Alerts

# Alert when the uptime check fails
doctl monitoring uptime alert create YOUR_CHECK_ID \
  --name "App Down Alert" \
  --type "down" \
  --period "2m" \
  --comparison "greater_than" \
  --threshold 1 \
  --emails "[email protected]" \
  --slack-urls "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

The alert triggers when the uptime check fails from more than one region for 2 minutes. Requiring failures from multiple regions prevents false alarms from localized network issues.

Health Check Endpoint

Build a health check that reports real application status:

// health.js
var db = require("./db");
var redis = require("./redis");
var os = require("os");

function checkHealth(req, res) {
  var checks = {
    uptime: process.uptime(),
    timestamp: Date.now(),
    memory: {
      rss: Math.round(process.memoryUsage().rss / 1024 / 1024),
      heapUsed: Math.round(process.memoryUsage().heapUsed / 1024 / 1024),
      heapTotal: Math.round(process.memoryUsage().heapTotal / 1024 / 1024)
    },
    system: {
      loadAvg: os.loadavg(),
      freeMemory: Math.round(os.freemem() / 1024 / 1024),
      totalMemory: Math.round(os.totalmem() / 1024 / 1024)
    }
  };

  var promises = [];

  // Check database
  promises.push(
    db.query("SELECT 1")
      .then(function() { checks.database = "connected"; })
      .catch(function(err) { checks.database = "error: " + err.message; })
  );

  // Check Redis
  promises.push(
    redis.ping()
      .then(function() { checks.redis = "connected"; })
      .catch(function(err) { checks.redis = "error: " + err.message; })
  );

  Promise.all(promises).then(function() {
    var healthy = checks.database === "connected" && checks.redis === "connected";
    var status = healthy ? 200 : 503;
    res.status(status).json(checks);
  });
}

module.exports = checkHealth;

// app.js
var checkHealth = require("./health");
app.get("/health", checkHealth);

The uptime check verifies the HTTP response code. A 200 means healthy. A 503 triggers the alert.
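
Uptime checks enforce their own request timeout, so the health endpoint should respond quickly even when a dependency hangs. One option is to cap each dependency probe with a timeout; a minimal sketch, reusing the db module from health.js above:

// health-timeout.js: a minimal sketch of capping a dependency probe so the
// health endpoint always answers before the uptime check's own timeout.
// Assumes the db module used in health.js above.
function withTimeout(promise, ms, label) {
  var timer;
  var timeout = new Promise(function(resolve, reject) {
    timer = setTimeout(function() {
      reject(new Error(label + " check timed out after " + ms + "ms"));
    }, ms);
  });

  return Promise.race([promise, timeout]).then(
    function(result) { clearTimeout(timer); return result; },
    function(err) { clearTimeout(timer); throw err; }
  );
}

// Usage inside checkHealth, replacing the bare db.query call:
// withTimeout(db.query("SELECT 1"), 2000, "database")
//   .then(function() { checks.database = "connected"; })
//   .catch(function(err) { checks.database = "error: " + err.message; });

module.exports = withTimeout;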

Application-Level Monitoring

Request Metrics Middleware

Track response times and error rates inside your Node.js application:

// middleware/metrics.js
var metrics = {
  requests: 0,
  errors: 0,
  totalDuration: 0,
  statusCodes: {},
  paths: {}
};

function metricsMiddleware(req, res, next) {
  var start = Date.now();

  res.on("finish", function() {
    var duration = Date.now() - start;
    metrics.requests++;
    metrics.totalDuration += duration;

    // Track status codes
    var code = res.statusCode;
    metrics.statusCodes[code] = (metrics.statusCodes[code] || 0) + 1;

    if (code >= 500) {
      metrics.errors++;
    }

    // Track slow endpoints
    var path = req.route ? req.route.path : req.path;
    if (!metrics.paths[path]) {
      metrics.paths[path] = { count: 0, totalDuration: 0, maxDuration: 0 };
    }
    metrics.paths[path].count++;
    metrics.paths[path].totalDuration += duration;
    if (duration > metrics.paths[path].maxDuration) {
      metrics.paths[path].maxDuration = duration;
    }
  });

  next();
}

function getMetrics() {
  var avgDuration = metrics.requests > 0
    ? Math.round(metrics.totalDuration / metrics.requests)
    : 0;

  var pathStats = {};
  Object.keys(metrics.paths).forEach(function(path) {
    var p = metrics.paths[path];
    pathStats[path] = {
      count: p.count,
      avgDuration: Math.round(p.totalDuration / p.count),
      maxDuration: p.maxDuration
    };
  });

  return {
    requests: metrics.requests,
    errors: metrics.errors,
    errorRate: metrics.requests > 0
      ? (metrics.errors / metrics.requests * 100).toFixed(2) + "%"
      : "0%",
    avgResponseTime: avgDuration + "ms",
    statusCodes: metrics.statusCodes,
    paths: pathStats,
    uptime: process.uptime()
  };
}

function resetMetrics() {
  metrics.requests = 0;
  metrics.errors = 0;
  metrics.totalDuration = 0;
  metrics.statusCodes = {};
  metrics.paths = {};
}

module.exports = {
  middleware: metricsMiddleware,
  getMetrics: getMetrics,
  resetMetrics: resetMetrics
};

// app.js
var metrics = require("./middleware/metrics");

app.use(metrics.middleware);

// Metrics endpoint (protect in production)
app.get("/metrics", function(req, res) {
  res.json(metrics.getMetrics());
});
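
The resetMetrics helper pairs naturally with a periodic flush: log a snapshot, then reset, so each interval reports fresh numbers instead of totals since the process started. A minimal sketch, assuming the metrics module above:

// monitor/flush.js: a minimal sketch, assuming the metrics module above.
// Logs a snapshot every 60 seconds, then resets the counters so each
// interval reports fresh numbers instead of totals since boot.
var metrics = require("../middleware/metrics");

setInterval(function() {
  var snapshot = metrics.getMetrics();
  console.log(JSON.stringify({
    level: "info",
    type: "metrics_snapshot",
    requests: snapshot.requests,
    errors: snapshot.errors,
    errorRate: snapshot.errorRate,
    avgResponseTime: snapshot.avgResponseTime
  }));
  metrics.resetMetrics();
}, 60000);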

Memory Monitoring

Node.js applications can leak memory. Monitor heap usage and trigger alerts before the process crashes:

// monitor/memory.js
var WARNING_THRESHOLD = 0.8;  // 80% of heap limit
var CRITICAL_THRESHOLD = 0.9; // 90% of heap limit

function checkMemory() {
  var usage = process.memoryUsage();
  var heapUsed = usage.heapUsed;
  var heapTotal = usage.heapTotal;
  var ratio = heapUsed / heapTotal;

  if (ratio > CRITICAL_THRESHOLD) {
    console.error(JSON.stringify({
      level: "critical",
      type: "memory",
      heapUsedMB: Math.round(heapUsed / 1024 / 1024),
      heapTotalMB: Math.round(heapTotal / 1024 / 1024),
      percentage: Math.round(ratio * 100),
      message: "Memory usage critical"
    }));
  } else if (ratio > WARNING_THRESHOLD) {
    console.warn(JSON.stringify({
      level: "warning",
      type: "memory",
      heapUsedMB: Math.round(heapUsed / 1024 / 1024),
      heapTotalMB: Math.round(heapTotal / 1024 / 1024),
      percentage: Math.round(ratio * 100),
      message: "Memory usage high"
    }));
  }
}

// Check every 30 seconds
setInterval(checkMemory, 30000);

module.exports = { checkMemory: checkMemory };

Event Loop Monitoring

A blocked event loop causes slow responses for all requests:

// monitor/eventloop.js
var THRESHOLD_MS = 100; // Alert if event loop is blocked > 100ms

function monitorEventLoop() {
  var lastCheck = Date.now();

  setInterval(function() {
    var now = Date.now();
    var delay = now - lastCheck - 1000; // Expected interval is 1000ms
    lastCheck = now;

    if (delay > THRESHOLD_MS) {
      console.warn(JSON.stringify({
        level: "warning",
        type: "event_loop",
        delay: delay + "ms",
        message: "Event loop blocked for " + delay + "ms"
      }));
    }
  }, 1000);
}

module.exports = { start: monitorEventLoop };

// server.js
var eventLoop = require("./monitor/eventloop");
eventLoop.start();

Log-Based Monitoring

Structured Logging

Structured JSON logs are easier to search, filter, and alert on:

// logger.js
function createLogger(service) {
  function log(level, message, data) {
    var entry = {
      timestamp: new Date().toISOString(),
      level: level,
      service: service,
      message: message
    };

    if (data) {
      Object.keys(data).forEach(function(key) {
        entry[key] = data[key];
      });
    }

    if (level === "error" || level === "critical") {
      console.error(JSON.stringify(entry));
    } else {
      console.log(JSON.stringify(entry));
    }
  }

  return {
    info: function(message, data) { log("info", message, data); },
    warn: function(message, data) { log("warn", message, data); },
    error: function(message, data) { log("error", message, data); },
    critical: function(message, data) { log("critical", message, data); }
  };
}

module.exports = createLogger;

// Usage
var logger = require("./logger")("api");

logger.info("Server started", { port: 3000 });
logger.error("Database connection failed", { host: "db.example.com", error: err.message });

Request Logging

// middleware/requestLogger.js
var logger = require("../logger")("http");

function requestLogger(req, res, next) {
  var start = Date.now();
  var requestId = req.headers["x-request-id"] || generateId();

  req.requestId = requestId;
  res.setHeader("X-Request-ID", requestId);

  res.on("finish", function() {
    var duration = Date.now() - start;

    logger.info("request", {
      requestId: requestId,
      method: req.method,
      path: req.path,
      status: res.statusCode,
      duration: duration,
      ip: req.ip,
      userAgent: req.get("user-agent"),
      contentLength: res.get("content-length") || 0
    });

    if (duration > 5000) {
      logger.warn("slow_request", {
        requestId: requestId,
        method: req.method,
        path: req.path,
        duration: duration
      });
    }
  });

  next();
}

function generateId() {
  return Date.now().toString(36) + Math.random().toString(36).slice(2, 11);
}

module.exports = requestLogger;

Error Tracking

// middleware/errorTracker.js
var logger = require("../logger")("error");

function errorTracker(err, req, res, next) {
  logger.error("unhandled_error", {
    requestId: req.requestId,
    method: req.method,
    path: req.path,
    error: err.message,
    stack: err.stack,
    ip: req.ip
  });

  res.status(err.status || 500).json({
    error: process.env.NODE_ENV === "production"
      ? "Internal server error"
      : err.message
  });
}

// Catch unhandled rejections
process.on("unhandledRejection", function(reason) {
  logger.critical("unhandled_rejection", {
    error: reason instanceof Error ? reason.message : String(reason),
    stack: reason instanceof Error ? reason.stack : undefined
  });
});

// Catch uncaught exceptions
process.on("uncaughtException", function(err) {
  logger.critical("uncaught_exception", {
    error: err.message,
    stack: err.stack
  });
  process.exit(1);
});

module.exports = errorTracker;

Database Monitoring

Managed Database Metrics

DigitalOcean Managed Databases provide built-in metrics in the dashboard:

  • CPU utilization — query processing load
  • Memory usage — buffer cache and active connections
  • Disk I/O — read/write throughput
  • Connection count — active vs maximum connections (see the pool sketch after this list)
  • Replication lag — for read replicas
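
The connection count is also worth watching from inside the application. A minimal sketch, assuming node-postgres (pg) and that the db module exposes its Pool instance as db.pool:

// monitor/pool.js: a minimal sketch, assuming node-postgres (pg) and that
// the db module exports its Pool instance as db.pool (an assumption, not
// shown in the earlier snippets).
var db = require("../db");

// Log pool saturation every 30 seconds. waitingCount > 0 means requests
// are queuing for a free connection and the pool may be too small.
setInterval(function() {
  var pool = db.pool;
  console.log(JSON.stringify({
    level: "info",
    type: "db_pool",
    total: pool.totalCount,
    idle: pool.idleCount,
    waiting: pool.waitingCount
  }));
}, 30000);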

Query Performance Monitoring

// monitor/database.js
var db = require("../db");

var queryStats = {};

function monitoredQuery(text, params) {
  var start = Date.now();

  return db.query(text, params).then(function(result) {
    var duration = Date.now() - start;

    // Track by query template
    var key = text.substring(0, 100);
    if (!queryStats[key]) {
      queryStats[key] = { count: 0, totalDuration: 0, maxDuration: 0, slowCount: 0 };
    }

    var stat = queryStats[key];
    stat.count++;
    stat.totalDuration += duration;
    if (duration > stat.maxDuration) stat.maxDuration = duration;
    if (duration > 1000) stat.slowCount++;

    if (duration > 2000) {
      console.warn(JSON.stringify({
        level: "warning",
        type: "slow_query",
        query: key,
        duration: duration,
        rows: result.rowCount
      }));
    }

    return result;
  });
}

function getQueryStats() {
  var stats = {};
  Object.keys(queryStats).forEach(function(key) {
    var s = queryStats[key];
    stats[key] = {
      count: s.count,
      avgDuration: Math.round(s.totalDuration / s.count),
      maxDuration: s.maxDuration,
      slowCount: s.slowCount
    };
  });
  return stats;
}

module.exports = {
  query: monitoredQuery,
  getQueryStats: getQueryStats
};
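
One way to surface these query statistics is to merge them into the /metrics response shown earlier; a short usage sketch, assuming both modules are loaded in app.js:

// app.js: one way to merge query stats into the /metrics response
var metrics = require("./middleware/metrics");
var dbMonitor = require("./monitor/database");

app.get("/metrics", function(req, res) {
  var body = metrics.getMetrics();
  body.queries = dbMonitor.getQueryStats();
  res.json(body);
});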

Kubernetes Monitoring

DOKS Built-in Metrics

DigitalOcean Kubernetes shows cluster-level metrics:

  • CPU and memory per node and per pod
  • Pod status — running, pending, failed
  • Node health — ready, not ready

Installing Metrics Server

The metrics server enables kubectl top commands:

# Usually pre-installed on DOKS, but verify
kubectl top nodes
kubectl top pods

Prometheus and Grafana

For comprehensive Kubernetes monitoring:

# Install via DigitalOcean 1-Click
doctl kubernetes 1-click install my-cluster --1-clicks kube-prometheus-stack

Access Grafana dashboards:

kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n prometheus
# Open http://localhost:3000 (default admin/prom-operator)

Pre-built dashboards show:

  • Node resource utilization
  • Pod CPU and memory over time
  • Network traffic per pod
  • Persistent volume usage
  • API server latency

Complete Monitoring Setup

Putting it all together for a production Node.js application:

// server.js
var app = require("./app");
var logger = require("./logger")("server");
var eventLoop = require("./monitor/eventloop");
require("./monitor/memory");

var port = process.env.PORT || 3000;

var server = app.listen(port, function() {
  logger.info("Server started", { port: port, env: process.env.NODE_ENV });
});

eventLoop.start();

process.on("SIGTERM", function() {
  logger.info("SIGTERM received, shutting down");
  server.close(function() {
    logger.info("Server closed");
    process.exit(0);
  });
});

// app.js
var express = require("express");
var app = express();
var metrics = require("./middleware/metrics");
var requestLogger = require("./middleware/requestLogger");
var errorTracker = require("./middleware/errorTracker");
var checkHealth = require("./health");

// Monitoring middleware
app.use(metrics.middleware);
app.use(requestLogger);

// Health and metrics endpoints
app.get("/health", checkHealth);
app.get("/metrics", function(req, res) {
  res.json(metrics.getMetrics());
});

// Application routes
app.use("/api", require("./routes/api"));

// Error tracking (must be last)
app.use(errorTracker);

module.exports = app;

Common Issues and Troubleshooting

Alert fatigue — too many notifications

Alerts trigger too frequently on temporary spikes:

Fix: Increase the alert duration window. A 5-minute window filters out brief spikes. Set warning thresholds for investigation and critical thresholds for immediate action. Disable alerts for non-production environments.

Monitoring agent not reporting

Metrics stop appearing in the dashboard:

Fix: Check if the agent is running with systemctl status do-agent. Restart with sudo systemctl restart do-agent. Check firewall rules allow outbound HTTPS to DigitalOcean monitoring endpoints.

Uptime checks fail intermittently

False positives from a single region:

Fix: Configure checks from multiple regions. Require failures from 2+ regions before alerting. Increase the timeout for endpoints that are occasionally slow. Ensure the health check endpoint does not have authentication.

High memory usage alerts but application works fine

Linux uses available memory for disk cache:

Fix: Monitor application-level memory (heap usage) instead of system memory. The OS using memory for disk cache is normal and efficient. Focus on heapUsed from process.memoryUsage() for Node.js applications.

Best Practices

  • Monitor the four golden signals: latency, traffic, errors, and saturation. These four metrics cover most production issues.
  • Alert on symptoms, not causes. Alert on "error rate > 5%" rather than "CPU > 80%". High CPU might be normal during a traffic spike. A high error rate always means something is wrong (see the sketch after this list).
  • Use multiple notification channels. Email for warnings, Slack for critical, PagerDuty or phone calls for outages. Ensure the right people are notified at the right urgency.
  • Set up uptime checks from multiple regions. A single check point can give false negatives. Multiple regions confirm real outages and reduce false alarms.
  • Log in structured JSON format. JSON logs are searchable and parseable. Include request IDs to trace a single request across services.
  • Monitor database query performance. Slow queries are the most common cause of application latency. Track query duration and alert on slow queries.
  • Review and tune alert thresholds monthly. As your application grows, thresholds need adjustment. An alert that never fires is useless. An alert that fires constantly gets ignored.
  • Keep monitoring overhead low. Metrics collection and logging should not significantly impact application performance. Sample high-frequency events instead of logging every one.
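
A symptom-based check can also live inside the application itself. The sketch below, referenced in the list above, reads the request error rate from the metrics module every minute and posts to a Slack incoming webhook when it crosses 5 percent. It assumes Node 18+ for the global fetch and a SLACK_WEBHOOK_URL environment variable:

// monitor/errorRate.js: a minimal sketch of symptom-based alerting.
// Assumes the metrics module above, Node 18+ for the global fetch,
// and a SLACK_WEBHOOK_URL environment variable (an assumption).
var metrics = require("../middleware/metrics");

var THRESHOLD_PERCENT = 5;

setInterval(function() {
  var snapshot = metrics.getMetrics();
  if (snapshot.requests === 0) return;

  var errorRate = (snapshot.errors / snapshot.requests) * 100;
  if (errorRate > THRESHOLD_PERCENT) {
    fetch(process.env.SLACK_WEBHOOK_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        text: "Request error rate is " + errorRate.toFixed(2) +
          "% (threshold " + THRESHOLD_PERCENT + "%)"
      })
    }).catch(function(err) {
      console.error("Failed to send Slack alert: " + err.message);
    });
  }
}, 60000);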
