
DigitalOcean Monitoring and Alerting

A practical guide to DigitalOcean monitoring covering built-in metrics, the monitoring agent, alert policies, uptime checks, and integration with Node.js applications.


Monitoring tells you what is happening inside your infrastructure. Alerting tells you when something goes wrong before your users notice. Without monitoring, you discover problems when customers report them. With monitoring and proper alerts, you discover problems when they start — often before they affect anyone.

DigitalOcean provides built-in monitoring for Droplets, Kubernetes clusters, managed databases, and load balancers. The monitoring agent collects CPU, memory, disk, and bandwidth metrics automatically. Alert policies trigger notifications when metrics cross thresholds. Uptime checks verify your application is reachable from the outside. This guide covers the full observability stack for Node.js applications on DigitalOcean.

Prerequisites

  • A DigitalOcean account
  • One or more Droplets or a Kubernetes cluster
  • A Node.js application deployed on DigitalOcean
  • doctl CLI installed (optional)

Built-in Droplet Monitoring

Enabling the Monitoring Agent

New Droplets created through the dashboard have monitoring enabled by default. For existing Droplets or those created via CLI:

# Create a Droplet with monitoring enabled
doctl compute droplet create my-app \
  --image ubuntu-22-04-x64 \
  --size s-2vcpu-4gb \
  --region nyc3 \
  --enable-monitoring

For existing Droplets, install the agent manually:

# SSH into your Droplet
ssh deploy@YOUR_DROPLET_IP

# Install the DigitalOcean monitoring agent
curl -sSL https://repos.insights.digitalocean.com/install.sh | sudo bash

The agent collects metrics every minute and sends them to DigitalOcean's monitoring backend. No configuration files to manage.

Available Metrics

The monitoring agent reports these metrics:

CPU

  • CPU utilization (%) — percentage of CPU time used. Sustained >80% indicates a need to scale or optimize.

Memory

  • Memory utilization (%) — RAM in use. High memory usage can cause the OOM killer to terminate your Node.js process.

Disk

  • Disk utilization (%) — storage space used. At 100%, the application cannot write logs, temp files, or database records.
  • Disk I/O (read/write) — bytes read from and written to disk per second.

Bandwidth

  • Public inbound/outbound (Mbps) — network traffic on the public interface.
  • Private inbound/outbound (Mbps) — network traffic on the VPC private interface.

Viewing Metrics

Navigate to your Droplet in the DigitalOcean dashboard and click Graphs. You can view metrics for the last 1 hour, 6 hours, 24 hours, 7 days, or 30 days.
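
The same graphs are backed by the DigitalOcean API, so metrics can also be pulled into your own tooling. A minimal sketch, assuming Node 18+ for the global fetch, a DIGITALOCEAN_TOKEN environment variable with read access, and the Droplet CPU endpoint of the Monitoring API:

// fetch-cpu.js: a sketch of pulling Droplet CPU metrics over the API.
// Assumes Node 18+ (global fetch), a DIGITALOCEAN_TOKEN environment
// variable, and the /v2/monitoring/metrics/droplet/cpu endpoint.
var DROPLET_ID = process.env.DROPLET_ID;
var TOKEN = process.env.DIGITALOCEAN_TOKEN;

function fetchCpuMetrics(hours) {
  var end = Math.floor(Date.now() / 1000);
  var start = end - hours * 3600;
  var url = "https://api.digitalocean.com/v2/monitoring/metrics/droplet/cpu" +
    "?host_id=" + DROPLET_ID + "&start=" + start + "&end=" + end;

  return fetch(url, {
    headers: { Authorization: "Bearer " + TOKEN }
  }).then(function(res) {
    if (!res.ok) {
      throw new Error("Metrics request failed: " + res.status);
    }
    return res.json();
  });
}

// Print the raw time series for the last 6 hours
fetchCpuMetrics(6).then(function(body) {
  console.log(JSON.stringify(body.data, null, 2));
}).catch(function(err) {
  console.error(err.message);
});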

Alert Policies

Creating Alerts via Dashboard

  1. Navigate to Monitoring > Alerts
  2. Click Create Alert Policy
  3. Configure the alert:
    • Metric: CPU utilization, memory utilization, disk utilization, bandwidth
    • Threshold: the value that triggers the alert
    • Duration: how long the metric must exceed the threshold
    • Resources: which Droplets to monitor (by name or tag)
    • Notifications: email addresses and/or Slack webhooks

Creating Alerts via CLI

# Alert when CPU exceeds 80% for 5 minutes
doctl monitoring alert create \
  --type "v1/insights/droplet/cpu" \
  --compare "GreaterThan" \
  --value 80 \
  --window "5m" \
  --entities YOUR_DROPLET_ID \
  --emails "[email protected]" \
  --description "High CPU on web server"

# Alert when disk exceeds 90%
doctl monitoring alert create \
  --type "v1/insights/droplet/disk_utilization_percent" \
  --compare "GreaterThan" \
  --value 90 \
  --window "5m" \
  --entities YOUR_DROPLET_ID \
  --emails "[email protected]" \
  --description "Disk nearly full"

# Alert when memory exceeds 85%
doctl monitoring alert create \
  --type "v1/insights/droplet/memory_utilization_percent" \
  --compare "GreaterThan" \
  --value 85 \
  --window "5m" \
  --entities YOUR_DROPLET_ID \
  --emails "[email protected]" \
  --description "High memory usage"

Recommended Alert Thresholds

Metric           Warning             Critical            Action
CPU              >70% for 10 min     >90% for 5 min      Scale or optimize
Memory           >80% for 10 min     >90% for 5 min      Restart, add RAM, fix leak
Disk             >80%                >90%                Clean logs, expand storage
Bandwidth out    >80% of limit       >90% of limit       Check for DDoS, optimize

Set warning thresholds lower than critical thresholds. Warnings give you time to investigate. Critical alerts mean you need to act immediately.

Slack Notifications

# Add a Slack webhook to an alert
doctl monitoring alert create \
  --type "v1/insights/droplet/cpu" \
  --compare "GreaterThan" \
  --value 80 \
  --window "5m" \
  --entities YOUR_DROPLET_ID \
  --slack-urls "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK" \
  --description "High CPU alert"

Uptime Checks

Uptime checks monitor your application from external locations. DigitalOcean sends HTTP requests to your endpoint and verifies the response. This detects outages that internal monitoring might miss — network issues, DNS failures, SSL problems.

Creating Uptime Checks

# HTTP uptime check
doctl monitoring uptime check create \
  --name "My App Health" \
  --type "https" \
  --target "https://myapp.example.com/health" \
  --regions "us_east,us_west,eu_west" \
  --enabled true

Check Types

  • HTTP/HTTPS — sends a GET request and checks for a 2xx response
  • TCP — connects to a port and checks for a successful connection
  • Ping — sends an ICMP ping and checks for a response

Uptime Check Alerts

# Alert when the uptime check fails
doctl monitoring uptime alert create YOUR_CHECK_ID \
  --name "App Down Alert" \
  --type "down" \
  --period "2m" \
  --comparison "greater_than" \
  --threshold 1 \
  --emails "[email protected]" \
  --slack-urls "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

The alert triggers when the uptime check fails from more than one region for 2 minutes. Requiring failures from multiple regions prevents false alarms from localized network issues.

Health Check Endpoint

Build a health check that reports real application status:

// health.js
var db = require("./db");
var redis = require("./redis");
var os = require("os");

function checkHealth(req, res) {
  var checks = {
    uptime: process.uptime(),
    timestamp: Date.now(),
    memory: {
      rss: Math.round(process.memoryUsage().rss / 1024 / 1024),
      heapUsed: Math.round(process.memoryUsage().heapUsed / 1024 / 1024),
      heapTotal: Math.round(process.memoryUsage().heapTotal / 1024 / 1024)
    },
    system: {
      loadAvg: os.loadavg(),
      freeMemory: Math.round(os.freemem() / 1024 / 1024),
      totalMemory: Math.round(os.totalmem() / 1024 / 1024)
    }
  };

  var promises = [];

  // Check database
  promises.push(
    db.query("SELECT 1")
      .then(function() { checks.database = "connected"; })
      .catch(function(err) { checks.database = "error: " + err.message; })
  );

  // Check Redis
  promises.push(
    redis.ping()
      .then(function() { checks.redis = "connected"; })
      .catch(function(err) { checks.redis = "error: " + err.message; })
  );

  Promise.all(promises).then(function() {
    var healthy = checks.database === "connected" && checks.redis === "connected";
    var status = healthy ? 200 : 503;
    res.status(status).json(checks);
  });
}

module.exports = checkHealth;

// app.js
var checkHealth = require("./health");
app.get("/health", checkHealth);

The uptime check verifies the HTTP response code. A 200 means healthy. A 503 triggers the alert.
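
Uptime checks enforce their own request timeout, so the health endpoint should respond quickly even when a dependency hangs. One option is to cap each dependency probe with a timeout; a minimal sketch, reusing the db module from health.js above:

// health-timeout.js: a minimal sketch of capping a dependency probe so the
// health endpoint always answers before the uptime check's own timeout.
// Assumes the db module used in health.js above.
function withTimeout(promise, ms, label) {
  var timer;
  var timeout = new Promise(function(resolve, reject) {
    timer = setTimeout(function() {
      reject(new Error(label + " check timed out after " + ms + "ms"));
    }, ms);
  });

  return Promise.race([promise, timeout]).then(
    function(result) { clearTimeout(timer); return result; },
    function(err) { clearTimeout(timer); throw err; }
  );
}

// Usage inside checkHealth, replacing the bare db.query call:
// withTimeout(db.query("SELECT 1"), 2000, "database")
//   .then(function() { checks.database = "connected"; })
//   .catch(function(err) { checks.database = "error: " + err.message; });

module.exports = withTimeout;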

Application-Level Monitoring

Request Metrics Middleware

Track response times and error rates inside your Node.js application:

// middleware/metrics.js
var metrics = {
  requests: 0,
  errors: 0,
  totalDuration: 0,
  statusCodes: {},
  paths: {}
};

function metricsMiddleware(req, res, next) {
  var start = Date.now();

  res.on("finish", function() {
    var duration = Date.now() - start;
    metrics.requests++;
    metrics.totalDuration += duration;

    // Track status codes
    var code = res.statusCode;
    metrics.statusCodes[code] = (metrics.statusCodes[code] || 0) + 1;

    if (code >= 500) {
      metrics.errors++;
    }

    // Track slow endpoints
    var path = req.route ? req.route.path : req.path;
    if (!metrics.paths[path]) {
      metrics.paths[path] = { count: 0, totalDuration: 0, maxDuration: 0 };
    }
    metrics.paths[path].count++;
    metrics.paths[path].totalDuration += duration;
    if (duration > metrics.paths[path].maxDuration) {
      metrics.paths[path].maxDuration = duration;
    }
  });

  next();
}

function getMetrics() {
  var avgDuration = metrics.requests > 0
    ? Math.round(metrics.totalDuration / metrics.requests)
    : 0;

  var pathStats = {};
  Object.keys(metrics.paths).forEach(function(path) {
    var p = metrics.paths[path];
    pathStats[path] = {
      count: p.count,
      avgDuration: Math.round(p.totalDuration / p.count),
      maxDuration: p.maxDuration
    };
  });

  return {
    requests: metrics.requests,
    errors: metrics.errors,
    errorRate: metrics.requests > 0
      ? (metrics.errors / metrics.requests * 100).toFixed(2) + "%"
      : "0%",
    avgResponseTime: avgDuration + "ms",
    statusCodes: metrics.statusCodes,
    paths: pathStats,
    uptime: process.uptime()
  };
}

function resetMetrics() {
  metrics.requests = 0;
  metrics.errors = 0;
  metrics.totalDuration = 0;
  metrics.statusCodes = {};
  metrics.paths = {};
}

module.exports = {
  middleware: metricsMiddleware,
  getMetrics: getMetrics,
  resetMetrics: resetMetrics
};

// app.js
var metrics = require("./middleware/metrics");

app.use(metrics.middleware);

// Metrics endpoint (protect in production)
app.get("/metrics", function(req, res) {
  res.json(metrics.getMetrics());
});
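
The resetMetrics helper pairs naturally with a periodic flush: log a snapshot, then reset, so each interval reports fresh numbers instead of totals since the process started. A minimal sketch, assuming the metrics module above:

// monitor/flush.js: a minimal sketch, assuming the metrics module above.
// Logs a snapshot every 60 seconds, then resets the counters so each
// interval reports fresh numbers instead of totals since boot.
var metrics = require("../middleware/metrics");

setInterval(function() {
  var snapshot = metrics.getMetrics();
  console.log(JSON.stringify({
    level: "info",
    type: "metrics_snapshot",
    requests: snapshot.requests,
    errors: snapshot.errors,
    errorRate: snapshot.errorRate,
    avgResponseTime: snapshot.avgResponseTime
  }));
  metrics.resetMetrics();
}, 60000);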

Memory Monitoring

Node.js applications can leak memory. Monitor heap usage and trigger alerts before the process crashes:

// monitor/memory.js
var WARNING_THRESHOLD = 0.8;  // 80% of heap limit
var CRITICAL_THRESHOLD = 0.9; // 90% of heap limit

function checkMemory() {
  var usage = process.memoryUsage();
  var heapUsed = usage.heapUsed;
  var heapTotal = usage.heapTotal;
  var ratio = heapUsed / heapTotal;

  if (ratio > CRITICAL_THRESHOLD) {
    console.error(JSON.stringify({
      level: "critical",
      type: "memory",
      heapUsedMB: Math.round(heapUsed / 1024 / 1024),
      heapTotalMB: Math.round(heapTotal / 1024 / 1024),
      percentage: Math.round(ratio * 100),
      message: "Memory usage critical"
    }));
  } else if (ratio > WARNING_THRESHOLD) {
    console.warn(JSON.stringify({
      level: "warning",
      type: "memory",
      heapUsedMB: Math.round(heapUsed / 1024 / 1024),
      heapTotalMB: Math.round(heapTotal / 1024 / 1024),
      percentage: Math.round(ratio * 100),
      message: "Memory usage high"
    }));
  }
}

// Check every 30 seconds
setInterval(checkMemory, 30000);

module.exports = { checkMemory: checkMemory };

Event Loop Monitoring

A blocked event loop causes slow responses for all requests:

// monitor/eventloop.js
var THRESHOLD_MS = 100; // Alert if event loop is blocked > 100ms

function monitorEventLoop() {
  var lastCheck = Date.now();

  setInterval(function() {
    var now = Date.now();
    var delay = now - lastCheck - 1000; // Expected interval is 1000ms
    lastCheck = now;

    if (delay > THRESHOLD_MS) {
      console.warn(JSON.stringify({
        level: "warning",
        type: "event_loop",
        delay: delay + "ms",
        message: "Event loop blocked for " + delay + "ms"
      }));
    }
  }, 1000);
}

module.exports = { start: monitorEventLoop };

// server.js
var eventLoop = require("./monitor/eventloop");
eventLoop.start();

Log-Based Monitoring

Structured Logging

Structured JSON logs are easier to search, filter, and alert on:

// logger.js
function createLogger(service) {
  function log(level, message, data) {
    var entry = {
      timestamp: new Date().toISOString(),
      level: level,
      service: service,
      message: message
    };

    if (data) {
      Object.keys(data).forEach(function(key) {
        entry[key] = data[key];
      });
    }

    if (level === "error" || level === "critical") {
      console.error(JSON.stringify(entry));
    } else {
      console.log(JSON.stringify(entry));
    }
  }

  return {
    info: function(message, data) { log("info", message, data); },
    warn: function(message, data) { log("warn", message, data); },
    error: function(message, data) { log("error", message, data); },
    critical: function(message, data) { log("critical", message, data); }
  };
}

module.exports = createLogger;

// Usage
var logger = require("./logger")("api");

logger.info("Server started", { port: 3000 });
logger.error("Database connection failed", { host: "db.example.com", error: err.message });

Request Logging

// middleware/requestLogger.js
var logger = require("../logger")("http");

function requestLogger(req, res, next) {
  var start = Date.now();
  var requestId = req.headers["x-request-id"] || generateId();

  req.requestId = requestId;
  res.setHeader("X-Request-ID", requestId);

  res.on("finish", function() {
    var duration = Date.now() - start;

    logger.info("request", {
      requestId: requestId,
      method: req.method,
      path: req.path,
      status: res.statusCode,
      duration: duration,
      ip: req.ip,
      userAgent: req.get("user-agent"),
      contentLength: res.get("content-length") || 0
    });

    if (duration > 5000) {
      logger.warn("slow_request", {
        requestId: requestId,
        method: req.method,
        path: req.path,
        duration: duration
      });
    }
  });

  next();
}

function generateId() {
  return Date.now().toString(36) + Math.random().toString(36).slice(2, 11);
}

module.exports = requestLogger;

Error Tracking

// middleware/errorTracker.js
var logger = require("../logger")("error");

function errorTracker(err, req, res, next) {
  logger.error("unhandled_error", {
    requestId: req.requestId,
    method: req.method,
    path: req.path,
    error: err.message,
    stack: err.stack,
    ip: req.ip
  });

  res.status(err.status || 500).json({
    error: process.env.NODE_ENV === "production"
      ? "Internal server error"
      : err.message
  });
}

// Catch unhandled rejections
process.on("unhandledRejection", function(reason) {
  logger.critical("unhandled_rejection", {
    error: reason instanceof Error ? reason.message : String(reason),
    stack: reason instanceof Error ? reason.stack : undefined
  });
});

// Catch uncaught exceptions
process.on("uncaughtException", function(err) {
  logger.critical("uncaught_exception", {
    error: err.message,
    stack: err.stack
  });
  process.exit(1);
});

module.exports = errorTracker;

Database Monitoring

Managed Database Metrics

DigitalOcean Managed Databases provide built-in metrics in the dashboard:

  • CPU utilization — query processing load
  • Memory usage — buffer cache and active connections
  • Disk I/O — read/write throughput
  • Connection count — active vs maximum connections (see the pool sketch after this list)
  • Replication lag — for read replicas
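
The connection count is also worth watching from inside the application. A minimal sketch, assuming node-postgres (pg) and that the db module exposes its Pool instance as db.pool:

// monitor/pool.js: a minimal sketch, assuming node-postgres (pg) and that
// the db module exports its Pool instance as db.pool (an assumption, not
// shown in the earlier snippets).
var db = require("../db");

// Log pool saturation every 30 seconds. waitingCount > 0 means requests
// are queuing for a free connection and the pool may be too small.
setInterval(function() {
  var pool = db.pool;
  console.log(JSON.stringify({
    level: "info",
    type: "db_pool",
    total: pool.totalCount,
    idle: pool.idleCount,
    waiting: pool.waitingCount
  }));
}, 30000);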

Query Performance Monitoring

// monitor/database.js
var db = require("../db");

var queryStats = {};

function monitoredQuery(text, params) {
  var start = Date.now();

  return db.query(text, params).then(function(result) {
    var duration = Date.now() - start;

    // Track by query template
    var key = text.substring(0, 100);
    if (!queryStats[key]) {
      queryStats[key] = { count: 0, totalDuration: 0, maxDuration: 0, slowCount: 0 };
    }

    var stat = queryStats[key];
    stat.count++;
    stat.totalDuration += duration;
    if (duration > stat.maxDuration) stat.maxDuration = duration;
    if (duration > 1000) stat.slowCount++;

    if (duration > 2000) {
      console.warn(JSON.stringify({
        level: "warning",
        type: "slow_query",
        query: key,
        duration: duration,
        rows: result.rowCount
      }));
    }

    return result;
  });
}

function getQueryStats() {
  var stats = {};
  Object.keys(queryStats).forEach(function(key) {
    var s = queryStats[key];
    stats[key] = {
      count: s.count,
      avgDuration: Math.round(s.totalDuration / s.count),
      maxDuration: s.maxDuration,
      slowCount: s.slowCount
    };
  });
  return stats;
}

module.exports = {
  query: monitoredQuery,
  getQueryStats: getQueryStats
};
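
One way to surface these query statistics is to merge them into the /metrics response shown earlier; a short usage sketch, assuming both modules are loaded in app.js:

// app.js: one way to merge query stats into the /metrics response
var metrics = require("./middleware/metrics");
var dbMonitor = require("./monitor/database");

app.get("/metrics", function(req, res) {
  var body = metrics.getMetrics();
  body.queries = dbMonitor.getQueryStats();
  res.json(body);
});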

Kubernetes Monitoring

DOKS Built-in Metrics

DigitalOcean Kubernetes shows cluster-level metrics:

  • CPU and memory per node and per pod
  • Pod status — running, pending, failed
  • Node health — ready, not ready

Installing Metrics Server

The metrics server enables kubectl top commands:

# Usually pre-installed on DOKS, but verify
kubectl top nodes
kubectl top pods

Prometheus and Grafana

For comprehensive Kubernetes monitoring:

# Install via DigitalOcean 1-Click
doctl kubernetes 1-click install my-cluster --1-clicks kube-prometheus-stack

Access Grafana dashboards:

kubectl port-forward svc/kube-prometheus-stack-grafana 3000:80 -n prometheus
# Open http://localhost:3000 (default admin/prom-operator)

Pre-built dashboards show:

  • Node resource utilization
  • Pod CPU and memory over time
  • Network traffic per pod
  • Persistent volume usage
  • API server latency

Complete Monitoring Setup

Putting it all together for a production Node.js application:

// server.js
var app = require("./app");
var logger = require("./logger")("server");
var eventLoop = require("./monitor/eventloop");
require("./monitor/memory");

var port = process.env.PORT || 3000;

var server = app.listen(port, function() {
  logger.info("Server started", { port: port, env: process.env.NODE_ENV });
});

eventLoop.start();

process.on("SIGTERM", function() {
  logger.info("SIGTERM received, shutting down");
  server.close(function() {
    logger.info("Server closed");
    process.exit(0);
  });
});

// app.js
var express = require("express");
var app = express();
var metrics = require("./middleware/metrics");
var requestLogger = require("./middleware/requestLogger");
var errorTracker = require("./middleware/errorTracker");
var checkHealth = require("./health");

// Monitoring middleware
app.use(metrics.middleware);
app.use(requestLogger);

// Health and metrics endpoints
app.get("/health", checkHealth);
app.get("/metrics", function(req, res) {
  res.json(metrics.getMetrics());
});

// Application routes
app.use("/api", require("./routes/api"));

// Error tracking (must be last)
app.use(errorTracker);

module.exports = app;

Common Issues and Troubleshooting

Alert fatigue — too many notifications

Alerts trigger too frequently on temporary spikes:

Fix: Increase the alert duration window. A 5-minute window filters out brief spikes. Set warning thresholds for investigation and critical thresholds for immediate action. Disable alerts for non-production environments.

Monitoring agent not reporting

Metrics stop appearing in the dashboard:

Fix: Check if the agent is running with systemctl status do-agent. Restart with sudo systemctl restart do-agent. Check firewall rules allow outbound HTTPS to DigitalOcean monitoring endpoints.

Uptime checks fail intermittently

False positives from a single region:

Fix: Configure checks from multiple regions. Require failures from 2+ regions before alerting. Increase the timeout for endpoints that are occasionally slow. Ensure the health check endpoint does not have authentication.

High memory usage alerts but application works fine

Linux uses available memory for disk cache:

Fix: Monitor application-level memory (heap usage) instead of system memory. The OS using memory for disk cache is normal and efficient. Focus on heapUsed from process.memoryUsage() for Node.js applications.

Best Practices

  • Monitor the four golden signals: latency, traffic, errors, and saturation. These four metrics cover most production issues.
  • Alert on symptoms, not causes. Alert on "error rate > 5%" rather than "CPU > 80%". High CPU might be normal during a traffic spike. A high error rate always means something is wrong (see the sketch after this list).
  • Use multiple notification channels. Email for warnings, Slack for critical, PagerDuty or phone calls for outages. Ensure the right people are notified at the right urgency.
  • Set up uptime checks from multiple regions. A single check point can give false negatives. Multiple regions confirm real outages and reduce false alarms.
  • Log in structured JSON format. JSON logs are searchable and parseable. Include request IDs to trace a single request across services.
  • Monitor database query performance. Slow queries are the most common cause of application latency. Track query duration and alert on slow queries.
  • Review and tune alert thresholds monthly. As your application grows, thresholds need adjustment. An alert that never fires is useless. An alert that fires constantly gets ignored.
  • Keep monitoring overhead low. Metrics collection and logging should not significantly impact application performance. Sample high-frequency events instead of logging every one.
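
A symptom-based check can also live inside the application itself. The sketch below, referenced in the list above, reads the request error rate from the metrics module every minute and posts to a Slack incoming webhook when it crosses 5 percent. It assumes Node 18+ for the global fetch and a SLACK_WEBHOOK_URL environment variable:

// monitor/errorRate.js: a minimal sketch of symptom-based alerting.
// Assumes the metrics module above, Node 18+ for the global fetch,
// and a SLACK_WEBHOOK_URL environment variable (an assumption).
var metrics = require("../middleware/metrics");

var THRESHOLD_PERCENT = 5;

setInterval(function() {
  var snapshot = metrics.getMetrics();
  if (snapshot.requests === 0) return;

  var errorRate = (snapshot.errors / snapshot.requests) * 100;
  if (errorRate > THRESHOLD_PERCENT) {
    fetch(process.env.SLACK_WEBHOOK_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        text: "Request error rate is " + errorRate.toFixed(2) +
          "% (threshold " + THRESHOLD_PERCENT + "%)"
      })
    }).catch(function(err) {
      console.error("Failed to send Slack alert: " + err.message);
    });
  }
}, 60000);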
