DigitalOcean Monitoring and Alerting
A practical guide to monitoring and alerting on DigitalOcean, covering built-in metrics, alert policies, uptime checks, custom Node.js metrics, and integration with external monitoring tools.
Overview
DigitalOcean provides a built-in monitoring stack that covers infrastructure metrics, uptime checks, and alert policies with notification channels like Slack and email. For teams running production workloads on DigitalOcean, understanding how to configure these tools properly is the difference between catching a disk-full condition at 2 PM and getting paged at 3 AM when your database crashes. This guide walks through every layer of monitoring available on the platform, from the Droplet metrics agent to custom application-level metrics in Node.js, and shows you how to wire it all together into a coherent observability strategy.
Prerequisites
- A DigitalOcean account with at least one Droplet or App Platform deployment
- Node.js v18+ installed locally
- A DigitalOcean Personal Access Token with read/write scope
- Basic familiarity with Express.js and REST APIs
- A Slack workspace (optional, for notification integration)
- doctl CLI installed and authenticated (optional but recommended)
Built-in DigitalOcean Monitoring
The Droplet Metrics Agent
Droplets created with the Monitoring option enabled come with the DigitalOcean metrics agent pre-installed. If you are running an older Droplet, or one created without that option, you need to install the agent manually. The agent collects system-level metrics and sends them to DigitalOcean's monitoring backend, where they become available in the control panel and through the API.
To install the agent on an existing Droplet:
curl -sSL https://repos.insights.digitalocean.com/install.sh | sudo bash
Verify the agent is running:
systemctl status do-agent
You should see output indicating the service is active. If the agent is not running, start it:
sudo systemctl enable do-agent
sudo systemctl start do-agent
The metrics agent collects the following data points at one-minute intervals:
- CPU utilization - percentage of CPU time spent in user, system, and idle states
- Memory usage - total, used, free, cached, and buffered memory
- Disk I/O - read/write operations per second, bytes read/written
- Disk usage - percentage of disk space used per mount point
- Network bandwidth - inbound and outbound bytes per second per interface
- Load average - 1-minute, 5-minute, and 15-minute load averages
These metrics are retained for 30 days and are accessible through the Graphs tab on any Droplet's detail page in the control panel.
Understanding the Metrics API
You can query Droplet metrics programmatically. This is useful for building custom dashboards or triggering actions based on metric thresholds outside of DigitalOcean's built-in alert system.
var https = require("https");
var token = process.env.DIGITALOCEAN_TOKEN;
var dropletId = process.env.DROPLET_ID;
var now = Math.floor(Date.now() / 1000);
var oneHourAgo = now - 3600;
var options = {
hostname: "api.digitalocean.com",
path: "/v2/monitoring/metrics/droplet/cpu?host_id=" + dropletId +
"&start=" + oneHourAgo + "&end=" + now,
method: "GET",
headers: {
"Authorization": "Bearer " + token,
"Content-Type": "application/json"
}
};
var req = https.request(options, function(res) {
var body = "";
res.on("data", function(chunk) {
body += chunk;
});
res.on("end", function() {
var data = JSON.parse(body);
var results = data.data.result;
results.forEach(function(series) {
var mode = series.metric.mode;
var lastValue = series.values[series.values.length - 1];
console.log("CPU mode: " + mode + ", value: " + lastValue[1] + "%");
});
});
});
req.on("error", function(err) {
console.error("Metrics request failed:", err.message);
});
req.end();
Available metric endpoints include:
| Endpoint | Description |
|---|---|
| /v2/monitoring/metrics/droplet/cpu | CPU utilization by mode |
| /v2/monitoring/metrics/droplet/memory_free | Free memory in bytes |
| /v2/monitoring/metrics/droplet/memory_available | Available memory in bytes |
| /v2/monitoring/metrics/droplet/memory_total | Total memory in bytes |
| /v2/monitoring/metrics/droplet/disk_read | Disk read bytes/sec |
| /v2/monitoring/metrics/droplet/disk_write | Disk write bytes/sec |
| /v2/monitoring/metrics/droplet/bandwidth | Network bandwidth |
| /v2/monitoring/metrics/droplet/load_1 | 1-minute load average |
| /v2/monitoring/metrics/droplet/load_5 | 5-minute load average |
| /v2/monitoring/metrics/droplet/load_15 | 15-minute load average |
| /v2/monitoring/metrics/droplet/filesystem_free | Free disk space |
| /v2/monitoring/metrics/droplet/filesystem_size | Total disk size |
Setting Up Alert Policies
Alert policies are the backbone of proactive monitoring. Without them, you are simply collecting data nobody looks at. DigitalOcean supports alerts on CPU, memory, disk, bandwidth, and custom uptime checks.
Creating Alert Policies via the API
Here is how to create a comprehensive set of alert policies programmatically:
var https = require("https");
var token = process.env.DIGITALOCEAN_TOKEN;
function createAlertPolicy(policy, callback) {
var postData = JSON.stringify(policy);
var options = {
hostname: "api.digitalocean.com",
path: "/v2/monitoring/alerts",
method: "POST",
headers: {
"Authorization": "Bearer " + token,
"Content-Type": "application/json",
"Content-Length": Buffer.byteLength(postData)
}
};
var req = https.request(options, function(res) {
var body = "";
res.on("data", function(chunk) {
body += chunk;
});
res.on("end", function() {
if (res.statusCode === 200 || res.statusCode === 201) {
callback(null, JSON.parse(body));
} else {
callback(new Error("HTTP " + res.statusCode + ": " + body));
}
});
});
req.on("error", function(err) {
callback(err);
});
req.write(postData);
req.end();
}
var dropletIds = ["your-droplet-id"];
var policies = [
{
alerts: {
email: ["[email protected]"],
slack: [{
channel: "#alerts",
url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
}]
},
compare: "GreaterThan",
description: "CPU usage above 80% for 5 minutes",
enabled: true,
entities: dropletIds,
tags: ["production"],
type: "v1/insights/droplet/cpu",
value: 80,
window: "5m"
},
{
alerts: {
email: ["[email protected]"],
slack: [{
channel: "#alerts",
url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
}]
},
compare: "GreaterThan",
description: "Memory usage above 90% for 5 minutes",
enabled: true,
entities: dropletIds,
tags: ["production"],
type: "v1/insights/droplet/memory_utilization_percent",
value: 90,
window: "5m"
},
{
alerts: {
email: ["[email protected]"],
slack: [{
channel: "#alerts",
url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
}]
},
compare: "GreaterThan",
description: "Disk usage above 85%",
enabled: true,
entities: dropletIds,
tags: ["production"],
type: "v1/insights/droplet/disk_utilization_percent",
value: 85,
window: "5m"
},
{
alerts: {
email: ["[email protected]"],
slack: [{
channel: "#alerts",
url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
}]
},
compare: "GreaterThan",
description: "Inbound bandwidth above 100 Mbps for 5 minutes",
enabled: true,
entities: dropletIds,
tags: ["production"],
type: "v1/insights/droplet/public_inbound_bandwidth",
value: 100, // threshold in Mbps (assumed unit; confirm in the API reference)
window: "5m"
}
];
var created = 0;
policies.forEach(function(policy) {
createAlertPolicy(policy, function(err, result) {
created++;
if (err) {
console.error("Failed to create policy:", policy.description, err.message);
} else {
console.log("Created policy:", result.policy.description);
}
if (created === policies.length) {
console.log("All policies processed.");
}
});
});
Recommended Alert Thresholds
Through years of operating Node.js services on DigitalOcean, I have landed on these thresholds as a starting point. You should tune them based on your workload patterns.
| Metric | Warning | Critical | Window |
|---|---|---|---|
| CPU | 70% | 90% | 5 min |
| Memory | 80% | 95% | 5 min |
| Disk Usage | 75% | 90% | 5 min |
| Load Average (1m) | 2x vCPU count | 4x vCPU count | 5 min |
| Inbound Bandwidth | 80% of limit | 95% of limit | 5 min |
A common mistake is setting thresholds too aggressively. A Node.js process doing a garbage collection sweep can spike to 100% CPU for a few seconds. That is normal. Use the 5-minute window to avoid false positives.
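The load-average rows in the table are relative to vCPU count rather than absolute numbers. Here is a minimal sketch of turning that rule into a concrete policy object that can be passed to the createAlertPolicy() helper from earlier; the load_1 alert type string and the placeholder Droplet ID are assumptions to verify against the API reference.
var os = require("os");
// Derive load-average thresholds from the table above: warn at 2x vCPUs,
// go critical at 4x vCPUs, for whatever machine this runs on.
var vcpus = os.cpus().length;
var warnLoad = vcpus * 2;
var criticalLoad = vcpus * 4;
// Pass this to the createAlertPolicy() helper shown earlier.
// The load_1 alert type is an assumption - confirm the exact string in the API reference.
var loadAlertPolicy = {
  alerts: { email: ["[email protected]"] },
  compare: "GreaterThan",
  description: "1-minute load above " + warnLoad + " (2x vCPU) for 5 minutes",
  enabled: true,
  entities: ["your-droplet-id"],
  tags: ["production"],
  type: "v1/insights/droplet/load_1",
  value: warnLoad,
  window: "5m"
};
console.log("vCPUs:", vcpus, "warning:", warnLoad, "critical:", criticalLoad);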
Uptime Checks
Uptime checks verify that your HTTP/HTTPS endpoints are reachable and responding correctly. DigitalOcean runs these checks from multiple regions every 30 seconds and triggers alerts when a check fails.
Creating Uptime Checks via the API
var https = require("https");
var token = process.env.DIGITALOCEAN_TOKEN;
function createUptimeCheck(check, callback) {
var postData = JSON.stringify(check);
var options = {
hostname: "api.digitalocean.com",
path: "/v2/uptime/checks",
method: "POST",
headers: {
"Authorization": "Bearer " + token,
"Content-Type": "application/json",
"Content-Length": Buffer.byteLength(postData)
}
};
var req = https.request(options, function(res) {
var body = "";
res.on("data", function(chunk) {
body += chunk;
});
res.on("end", function() {
if (res.statusCode === 201) {
callback(null, JSON.parse(body));
} else {
callback(new Error("HTTP " + res.statusCode + ": " + body));
}
});
});
req.on("error", callback);
req.write(postData);
req.end();
}
var checks = [
{
name: "Production API Health",
type: "https",
target: "https://api.yourapp.com/health",
regions: ["us_east", "us_west", "eu_west"],
enabled: true
},
{
name: "Production Website",
type: "https",
target: "https://www.yourapp.com",
regions: ["us_east", "us_west", "eu_west", "se_asia"],
enabled: true
}
];
checks.forEach(function(check) {
createUptimeCheck(check, function(err, result) {
if (err) {
console.error("Failed:", check.name, err.message);
return;
}
console.log("Created uptime check:", result.check.id, result.check.name);
// Now create an alert for this check
var alertData = JSON.stringify({
name: check.name + " - Down Alert",
type: "down",
threshold: 2,
comparison: "greater_than",
period: "2m",
notifications: {
email: ["[email protected]"],
slack: [{
channel: "#alerts",
url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
}]
}
});
var alertOptions = {
hostname: "api.digitalocean.com",
path: "/v2/uptime/checks/" + result.check.id + "/alerts",
method: "POST",
headers: {
"Authorization": "Bearer " + token,
"Content-Type": "application/json",
"Content-Length": Buffer.byteLength(alertData)
}
};
var alertReq = https.request(alertOptions, function(alertRes) {
var alertBody = "";
alertRes.on("data", function(chunk) { alertBody += chunk; });
alertRes.on("end", function() {
console.log("Alert created for:", check.name);
});
});
alertReq.write(alertData);
alertReq.end();
});
});
Slack and Email Notification Channels
DigitalOcean supports two notification channels out of the box: email and Slack webhooks. For serious production deployments, use both. Email serves as the durable record. Slack serves as the real-time channel your team actually watches.
Setting Up Slack Notifications
- Create a Slack app at https://api.slack.com/apps
- Enable Incoming Webhooks for the app
- Add a webhook URL to the channel you want alerts in (e.g., #alerts-production)
- Use the webhook URL in your alert policies
Here is a utility module for sending custom Slack notifications from your Node.js application alongside the DigitalOcean-native alerts:
var https = require("https");
var url = require("url");
function SlackNotifier(webhookUrl) {
this.webhookUrl = webhookUrl;
this.parsed = url.parse(webhookUrl);
}
SlackNotifier.prototype.send = function(message, callback) {
var payload = JSON.stringify(message);
var options = {
hostname: this.parsed.hostname,
path: this.parsed.path,
method: "POST",
headers: {
"Content-Type": "application/json",
"Content-Length": Buffer.byteLength(payload)
}
};
var req = https.request(options, function(res) {
var body = "";
res.on("data", function(chunk) { body += chunk; });
res.on("end", function() {
callback(null, res.statusCode);
});
});
req.on("error", callback);
req.write(payload);
req.end();
};
SlackNotifier.prototype.alert = function(title, text, severity, callback) {
var colors = {
critical: "#dc3545",
warning: "#ffc107",
info: "#17a2b8",
ok: "#28a745"
};
var message = {
attachments: [{
color: colors[severity] || colors.info,
title: title,
text: text,
ts: Math.floor(Date.now() / 1000),
fields: [
{ title: "Severity", value: severity.toUpperCase(), short: true },
{ title: "Environment", value: process.env.NODE_ENV || "unknown", short: true }
]
}]
};
this.send(message, callback);
};
module.exports = SlackNotifier;
Usage:
var SlackNotifier = require("./slack-notifier");
var notifier = new SlackNotifier(process.env.SLACK_WEBHOOK_URL);
notifier.alert(
"High Memory Usage",
"Memory usage on web-1 has exceeded 90% for the last 5 minutes.",
"critical",
function(err) {
if (err) console.error("Slack notification failed:", err.message);
}
);
Application-Level Monitoring from Node.js
Infrastructure metrics tell you the machine is healthy. Application metrics tell you the software is healthy. You need both. DigitalOcean's built-in monitoring handles the infrastructure side. For application-level observability, you need to instrument your code.
Custom Health Endpoints
Every production Node.js service should expose a health check endpoint. This endpoint serves double duty: DigitalOcean uptime checks hit it, and your load balancer uses it for health-based routing.
var os = require("os");
function healthCheckHandler(req, res) {
var memUsage = process.memoryUsage();
var uptime = process.uptime();
var cpus = os.cpus();
var cpuLoad = os.loadavg();
var totalMem = os.totalmem();
var freeMem = os.freemem();
var memPercent = ((totalMem - freeMem) / totalMem * 100).toFixed(2);
var health = {
status: "healthy",
timestamp: new Date().toISOString(),
uptime: Math.floor(uptime) + "s",
process: {
pid: process.pid,
version: process.version,
heapUsed: Math.round(memUsage.heapUsed / 1024 / 1024) + "MB",
heapTotal: Math.round(memUsage.heapTotal / 1024 / 1024) + "MB",
rss: Math.round(memUsage.rss / 1024 / 1024) + "MB",
external: Math.round(memUsage.external / 1024 / 1024) + "MB"
},
system: {
loadAvg: cpuLoad.map(function(v) { return v.toFixed(2); }),
cpuCount: cpus.length,
memoryPercent: memPercent + "%",
freeMem: Math.round(freeMem / 1024 / 1024) + "MB",
platform: os.platform(),
hostname: os.hostname()
}
};
// Mark as unhealthy if heap is over 90% of total
if (memUsage.heapUsed / memUsage.heapTotal > 0.9) {
health.status = "degraded";
health.warnings = health.warnings || [];
health.warnings.push("Heap usage above 90%");
}
// Mark as unhealthy if load average is high
if (cpuLoad[0] > cpus.length * 2) {
health.status = "degraded";
health.warnings = health.warnings || [];
health.warnings.push("Load average exceeds 2x CPU count");
}
var statusCode = health.status === "healthy" ? 200 : 503;
res.status(statusCode).json(health);
}
module.exports = healthCheckHandler;
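To use the handler, mount it in your Express app before any authentication middleware so probes can reach it unauthenticated. A minimal wiring sketch, assuming the module above is saved as ./health-check (a hypothetical path):
var express = require("express");
var healthCheckHandler = require("./health-check"); // hypothetical path to the module above
var app = express();
// Health endpoint goes first so uptime checks and load balancer probes
// are never blocked by auth middleware registered later.
app.get("/health", healthCheckHandler);
app.listen(process.env.PORT || 8080);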
Prometheus-Compatible Metrics with prom-client
If you want to go beyond basic health checks, prom-client gives you Prometheus-compatible metrics that work with Grafana, Datadog, and any other tool that speaks the Prometheus exposition format.
var promClient = require("prom-client");
// Create a registry
var register = new promClient.Registry();
// Add default Node.js metrics (GC, event loop, memory)
promClient.collectDefaultMetrics({ register: register });
// Custom metrics
var httpRequestDuration = new promClient.Histogram({
name: "http_request_duration_seconds",
help: "Duration of HTTP requests in seconds",
labelNames: ["method", "route", "status_code"],
buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
registers: [register]
});
var httpRequestsTotal = new promClient.Counter({
name: "http_requests_total",
help: "Total number of HTTP requests",
labelNames: ["method", "route", "status_code"],
registers: [register]
});
var activeConnections = new promClient.Gauge({
name: "http_active_connections",
help: "Number of active HTTP connections",
registers: [register]
});
var dbQueryDuration = new promClient.Histogram({
name: "db_query_duration_seconds",
help: "Duration of database queries in seconds",
labelNames: ["operation", "collection"],
buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5],
registers: [register]
});
var errorCounter = new promClient.Counter({
name: "app_errors_total",
help: "Total number of application errors",
labelNames: ["type", "code"],
registers: [register]
});
// Middleware for tracking request metrics
function metricsMiddleware(req, res, next) {
var start = process.hrtime();
activeConnections.inc();
res.on("finish", function() {
activeConnections.dec();
var duration = process.hrtime(start);
var durationSeconds = duration[0] + duration[1] / 1e9;
var route = req.route ? req.route.path : req.path;
var labels = {
method: req.method,
route: route,
status_code: res.statusCode
};
httpRequestDuration.observe(labels, durationSeconds);
httpRequestsTotal.inc(labels);
});
next();
}
// Endpoint to expose metrics
function metricsHandler(req, res) {
res.set("Content-Type", register.contentType);
register.metrics().then(function(metrics) {
res.end(metrics);
}).catch(function(err) {
res.status(500).end(err.message);
});
}
module.exports = {
register: register,
middleware: metricsMiddleware,
handler: metricsHandler,
httpRequestDuration: httpRequestDuration,
httpRequestsTotal: httpRequestsTotal,
activeConnections: activeConnections,
dbQueryDuration: dbQueryDuration,
errorCounter: errorCounter
};
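Wiring this module into an Express app takes three pieces: the middleware, the /metrics endpoint, and the individual instruments inside your route handlers. A short sketch, assuming the module above is saved as ./metrics (hypothetical path) and using a stand-in fakeDbFind() in place of a real data access layer:
var express = require("express");
var metrics = require("./metrics"); // hypothetical path to the module above
var app = express();
// Record duration, count, and active-connection metrics for every request.
app.use(metrics.middleware);
// Expose the Prometheus scrape endpoint.
app.get("/metrics", metrics.handler);
// Instrument a database call with the query-duration histogram.
app.get("/api/users", function(req, res) {
  var end = metrics.dbQueryDuration.startTimer({ operation: "find", collection: "users" });
  fakeDbFind(function(err, users) {
    end(); // records the elapsed time with the labels above
    if (err) {
      metrics.errorCounter.inc({ type: "DbError", code: "find_failed" });
      return res.status(500).json({ error: "lookup failed" });
    }
    res.json(users);
  });
});
// Stand-in for a real data access layer.
function fakeDbFind(cb) { setTimeout(function() { cb(null, []); }, 25); }
app.listen(8080);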
Log Management Strategies
DigitalOcean does not provide a centralized logging service comparable to AWS CloudWatch Logs. You have a few options.
Structured Logging
The first step is producing structured logs that are easy to parse and search:
var os = require("os");
function Logger(service) {
this.service = service;
this.hostname = os.hostname();
}
Logger.prototype.log = function(level, message, meta) {
var entry = {
timestamp: new Date().toISOString(),
level: level,
service: this.service,
hostname: this.hostname,
message: message,
pid: process.pid
};
if (meta) {
Object.keys(meta).forEach(function(key) {
entry[key] = meta[key];
});
}
// Errors get stack traces
if (meta && meta.error instanceof Error) {
entry.error = {
name: meta.error.name,
message: meta.error.message,
stack: meta.error.stack
};
}
process.stdout.write(JSON.stringify(entry) + "\n");
};
Logger.prototype.info = function(message, meta) {
this.log("info", message, meta);
};
Logger.prototype.warn = function(message, meta) {
this.log("warn", message, meta);
};
Logger.prototype.error = function(message, meta) {
this.log("error", message, meta);
};
module.exports = Logger;
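Usage, assuming the module above is saved as ./logger (hypothetical path):
var Logger = require("./logger"); // hypothetical path to the module above
var log = new Logger("billing-api");
log.info("invoice generated", { invoiceId: "inv_1234", amountCents: 4999 });
log.warn("retrying payment provider call", { attempt: 2 });
try {
  JSON.parse("{not json");
} catch (err) {
  // Passing the error under the `error` key attaches name/message/stack.
  log.error("failed to parse webhook payload", { error: err });
}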
Log Forwarding Options
For Droplet-based deployments, install a log forwarder. Three solid options:
- Vector (by Datadog, open source) - lightweight, fast, works well with DigitalOcean
- Fluent Bit - low memory footprint, extensive plugin ecosystem
- rsyslog - already installed on most Linux distributions
Example Vector configuration for forwarding to a Grafana Cloud Loki instance:
[sources.app_logs]
type = "file"
include = ["/var/log/yourapp/*.log"]
read_from = "beginning"
[transforms.parse_json]
type = "remap"
inputs = ["app_logs"]
source = '''
. = parse_json!(.message)
'''
[sinks.loki]
type = "loki"
inputs = ["parse_json"]
endpoint = "https://logs-prod-us-central1.grafana.net"
encoding.codec = "json"
labels.service = "{{ service }}"
labels.hostname = "{{ hostname }}"
labels.level = "{{ level }}"
[sinks.loki.auth]
strategy = "basic"
user = "${GRAFANA_LOKI_USER}"
password = "${GRAFANA_LOKI_TOKEN}"
For App Platform deployments, your application logs are available through doctl and the DigitalOcean control panel, but they are retained for a limited time. For production, always forward logs to an external destination.
Database Monitoring for Managed Databases
DigitalOcean Managed Databases (PostgreSQL, MySQL, Redis, MongoDB) include built-in monitoring dashboards. Cluster metrics are also exposed on a Prometheus-compatible endpoint; the API call below retrieves the credentials needed to scrape it.
var https = require("https");
var token = process.env.DIGITALOCEAN_TOKEN;
// Managed database metrics are scraped from a Prometheus-compatible endpoint on the
// cluster itself. This call (GET /v2/databases/metrics/credentials) retrieves the
// basic auth credentials for that endpoint.
function getMetricsCredentials(callback) {
  var options = {
    hostname: "api.digitalocean.com",
    path: "/v2/databases/metrics/credentials",
    method: "GET",
    headers: {
      "Authorization": "Bearer " + token,
      "Content-Type": "application/json"
    }
  };
  var req = https.request(options, function(res) {
    var body = "";
    res.on("data", function(chunk) { body += chunk; });
    res.on("end", function() {
      callback(null, JSON.parse(body));
    });
  });
  req.on("error", callback);
  req.end();
}
// Key database metrics to monitor:
// - Connection count vs max connections
// - Query latency (p50, p95, p99)
// - Replication lag (for read replicas)
// - Cache hit ratio (PostgreSQL shared buffers, Redis hit rate)
// - Disk usage growth rate
Key metrics to watch for managed databases:
- Connection count - If you approach max_connections, new connections will be refused. Set alerts at 80% (a polling sketch follows this list).
- Replication lag - For read replicas, lag above 1 second indicates the replica is falling behind. This matters for read-after-write consistency.
- Disk usage - Managed databases have disk limits. Running out of disk is catastrophic and not always recoverable.
- Slow queries - Enable pg_stat_statements for PostgreSQL or the slow query log for MySQL.
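DigitalOcean's built-in alerts do not cover connection counts, so if connection exhaustion is a risk, poll it from your application. A minimal sketch for PostgreSQL using node-postgres (pg); the driver choice, DATABASE_URL variable, polling interval, and 80% threshold are assumptions, and the same idea applies to MySQL via SHOW STATUS and SHOW VARIABLES.
var pg = require("pg"); // assumes node-postgres; any driver that can run these queries works
var pool = new pg.Pool({ connectionString: process.env.DATABASE_URL });
// Compare current connections against the server's max_connections setting.
function checkConnectionUsage(callback) {
  pool.query(
    "SELECT (SELECT count(*) FROM pg_stat_activity)::float AS used_conn, " +
    "current_setting('max_connections')::float AS max_conn",
    function(err, result) {
      if (err) return callback(err);
      var row = result.rows[0];
      callback(null, {
        used: row.used_conn,
        max: row.max_conn,
        percent: (row.used_conn / row.max_conn) * 100
      });
    }
  );
}
// Poll once a minute and warn when usage crosses 80% of max_connections.
setInterval(function() {
  checkConnectionUsage(function(err, usage) {
    if (err) return console.error("Connection check failed:", err.message);
    if (usage.percent > 80) {
      console.warn("Connection usage at " + usage.percent.toFixed(1) + "% (" +
        usage.used + "/" + usage.max + ")");
    }
  });
}, 60000);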
Load Balancer Monitoring
DigitalOcean Load Balancers provide metrics on request rates, response codes, and connection counts. Monitor these through the control panel or API.
The critical metrics for load balancers:
- 4xx and 5xx response rates - A spike in 5xx responses means your backend is failing. A spike in 4xx might indicate a misconfigured client or an attack.
- Active connections - If this approaches your load balancer's limit, you need to scale.
- Backend health check results - If backends are being marked unhealthy, investigate immediately.
- TLS handshake errors - Certificate problems show up here before users report them.
Set up health check monitoring on the load balancer itself:
// Load balancer health check configuration
var lbConfig = {
health_check: {
protocol: "http",
port: 8080,
path: "/health",
check_interval_seconds: 10,
response_timeout_seconds: 5,
unhealthy_threshold: 3,
healthy_threshold: 5
}
};
Use an unhealthy threshold of 3: a backend must fail three consecutive health checks before it is removed from rotation. With the 10-second check interval above, that is roughly 30 seconds of sustained failure, which prevents a single slow response from taking a healthy server out of the pool.
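To apply the health_check block above when creating a load balancer through the API, include it in the POST /v2/load_balancers payload. The sketch below reuses the lbConfig object from the previous snippet; the name, region, forwarding rule, and Droplet IDs are placeholders, and the exact set of required fields should be confirmed against the API reference.
var https = require("https");
var token = process.env.DIGITALOCEAN_TOKEN;
// Create a load balancer that uses the health_check settings from lbConfig above.
var payload = JSON.stringify({
  name: "web-lb",   // placeholder name
  region: "nyc3",   // placeholder region
  forwarding_rules: [
    { entry_protocol: "http", entry_port: 80, target_protocol: "http", target_port: 8080 }
  ],
  health_check: lbConfig.health_check,
  droplet_ids: [12345678] // replace with your backend Droplet IDs
});
var req = https.request({
  hostname: "api.digitalocean.com",
  path: "/v2/load_balancers",
  method: "POST",
  headers: {
    "Authorization": "Bearer " + token,
    "Content-Type": "application/json",
    "Content-Length": Buffer.byteLength(payload)
  }
}, function(res) {
  var body = "";
  res.on("data", function(chunk) { body += chunk; });
  res.on("end", function() {
    console.log("Load balancer create returned HTTP " + res.statusCode);
  });
});
req.on("error", function(err) { console.error("Request failed:", err.message); });
req.write(payload);
req.end();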
Combining DigitalOcean with External Monitoring Tools
DigitalOcean's built-in monitoring is good for infrastructure basics. For a complete observability stack, combine it with external tools.
Grafana Cloud Integration
Grafana Cloud provides a free tier that includes Prometheus metrics storage, Loki for logs, and Grafana dashboards. It is the most cost-effective option for small to medium deployments.
Set up the Grafana Agent on your Droplet to forward Prometheus metrics:
# /etc/grafana-agent/agent.yaml
server:
log_level: info
metrics:
global:
scrape_interval: 15s
configs:
- name: default
scrape_configs:
- job_name: 'node-app'
static_configs:
- targets: ['localhost:8080']
labels:
environment: 'production'
service: 'web-api'
metrics_path: '/metrics'
remote_write:
- url: https://prometheus-prod-us-central1.grafana.net/api/prom/push
basic_auth:
username: '${GRAFANA_PROM_USER}'
password: '${GRAFANA_PROM_TOKEN}'
Datadog Integration
For teams with budget, Datadog provides the most comprehensive monitoring experience. Install the Datadog agent on your Droplet:
DD_API_KEY=your-api-key DD_SITE="datadoghq.com" bash -c \
"$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh)"
Configure the Node.js integration:
# /etc/datadog-agent/conf.d/openmetrics.d/conf.yaml
instances:
- openmetrics_endpoint: http://localhost:8080/metrics
namespace: "yourapp"
metrics:
- http_request_duration_seconds
- http_requests_total
- http_active_connections
- db_query_duration_seconds
- app_errors_total
Incident Response Workflows
Monitoring without a response plan is theater. Here is a practical incident response workflow for a small team running on DigitalOcean.
Severity Levels
| Level | Definition | Response Time | Example |
|---|---|---|---|
| SEV1 | Service down, all users affected | 15 minutes | Website unreachable |
| SEV2 | Service degraded, some users affected | 30 minutes | API latency > 5 seconds |
| SEV3 | Non-critical issue, no user impact | Next business day | Disk usage at 75% |
Automated Escalation
var SlackNotifier = require("./slack-notifier");
var channels = {
alerts: new SlackNotifier(process.env.SLACK_ALERTS_WEBHOOK),
oncall: new SlackNotifier(process.env.SLACK_ONCALL_WEBHOOK),
management: new SlackNotifier(process.env.SLACK_MGMT_WEBHOOK)
};
function escalate(severity, title, details) {
switch (severity) {
case "SEV1":
// Notify all channels immediately
channels.alerts.alert(title, details, "critical", function() {});
channels.oncall.alert("[SEV1] " + title, details, "critical", function() {});
channels.management.alert("[SEV1] " + title, details, "critical", function() {});
break;
case "SEV2":
channels.alerts.alert(title, details, "warning", function() {});
channels.oncall.alert("[SEV2] " + title, details, "warning", function() {});
break;
case "SEV3":
channels.alerts.alert(title, details, "info", function() {});
break;
}
}
module.exports = escalate;
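Usage, assuming the module above is saved as ./escalate (hypothetical path) and the three webhook environment variables are set:
var escalate = require("./escalate"); // hypothetical path to the module above
// SEV1: service down - page every channel at once.
escalate("SEV1", "API unreachable", "Uptime checks failing from all regions for 4 minutes.");
// SEV3: informational - alerts channel only.
escalate("SEV3", "Disk usage at 76% on web-1", "Growth trend suggests roughly two weeks until the disk alert threshold.");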
Cost of Monitoring vs Cost of Downtime
Let me put this bluntly: monitoring is cheap; downtime is expensive.
DigitalOcean's built-in monitoring is free. Alert policies, uptime checks, Droplet metrics - all included at no additional cost. There is zero excuse for not having basic alerts configured.
For external tools, a reasonable budget looks like:
| Tool | Monthly Cost | What You Get |
|---|---|---|
| DigitalOcean Built-in | $0 | Infrastructure metrics, alerts, uptime checks |
| Grafana Cloud Free | $0 | 10k metrics series, 50GB logs, 50GB traces |
| Grafana Cloud Pro | $29+ | Higher limits, alerting, SLOs |
| Datadog | $15/host | Full-stack observability |
| UptimeRobot Free | $0 | 50 monitors, 5-min intervals |
Compare this to the cost of one hour of downtime. For a SaaS application doing $50K/month in revenue, one hour of downtime costs roughly $69 in directly lost revenue ($50,000 spread over about 730 hours), before you count churn and reputation damage. For e-commerce, the per-hour cost can be ten times that. A $29/month monitoring subscription pays for itself the first time it catches an issue before it becomes an outage.
Complete Working Example
Here is a complete Express.js application with comprehensive monitoring built in. This ties together everything discussed above.
var express = require("express");
var os = require("os");
var promClient = require("prom-client");
var https = require("https");
var url = require("url");
// ============================================================
// Prometheus Metrics Setup
// ============================================================
var register = new promClient.Registry();
promClient.collectDefaultMetrics({ register: register });
var httpDuration = new promClient.Histogram({
name: "http_request_duration_seconds",
help: "HTTP request duration in seconds",
labelNames: ["method", "route", "status"],
buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
registers: [register]
});
var httpTotal = new promClient.Counter({
name: "http_requests_total",
help: "Total HTTP requests",
labelNames: ["method", "route", "status"],
registers: [register]
});
var activeConns = new promClient.Gauge({
name: "http_active_connections",
help: "Active HTTP connections",
registers: [register]
});
var errors = new promClient.Counter({
name: "app_errors_total",
help: "Application errors",
labelNames: ["type"],
registers: [register]
});
// ============================================================
// Slack Notifier
// ============================================================
function SlackNotifier(webhookUrl) {
if (!webhookUrl) {
this.disabled = true;
return;
}
this.parsed = url.parse(webhookUrl);
this.disabled = false;
}
SlackNotifier.prototype.send = function(text, color, callback) {
if (this.disabled) {
if (callback) callback(null);
return;
}
var payload = JSON.stringify({
attachments: [{
color: color,
text: text,
ts: Math.floor(Date.now() / 1000)
}]
});
var options = {
hostname: this.parsed.hostname,
path: this.parsed.path,
method: "POST",
headers: {
"Content-Type": "application/json",
"Content-Length": Buffer.byteLength(payload)
}
};
var req = https.request(options, function(res) {
var body = "";
res.on("data", function(chunk) { body += chunk; });
res.on("end", function() {
if (callback) callback(null, res.statusCode);
});
});
req.on("error", function(err) {
if (callback) callback(err);
});
req.write(payload);
req.end();
};
var slack = new SlackNotifier(process.env.SLACK_WEBHOOK_URL);
// ============================================================
// Logger
// ============================================================
function log(level, message, meta) {
var entry = {
timestamp: new Date().toISOString(),
level: level,
message: message,
pid: process.pid,
hostname: os.hostname()
};
if (meta) {
Object.keys(meta).forEach(function(key) {
entry[key] = meta[key];
});
}
process.stdout.write(JSON.stringify(entry) + "\n");
}
// ============================================================
// Express App
// ============================================================
var app = express();
app.use(express.json());
// Metrics middleware
app.use(function(req, res, next) {
var start = process.hrtime();
activeConns.inc();
res.on("finish", function() {
activeConns.dec();
var diff = process.hrtime(start);
var duration = diff[0] + diff[1] / 1e9;
var route = req.route ? req.route.path : req.path;
var labels = { method: req.method, route: route, status: res.statusCode };
httpDuration.observe(labels, duration);
httpTotal.inc(labels);
log("info", "request", {
method: req.method,
path: req.path,
status: res.statusCode,
duration: duration.toFixed(4) + "s",
ip: req.ip
});
});
next();
});
// Request logging
app.use(function(req, res, next) {
req.startTime = Date.now();
next();
});
// ============================================================
// Health Check Endpoint
// ============================================================
app.get("/health", function(req, res) {
var memUsage = process.memoryUsage();
var cpuLoad = os.loadavg();
var uptime = process.uptime();
var totalMem = os.totalmem();
var freeMem = os.freemem();
var status = "healthy";
var warnings = [];
if (memUsage.heapUsed / memUsage.heapTotal > 0.9) {
status = "degraded";
warnings.push("Heap usage above 90%");
}
if (cpuLoad[0] > os.cpus().length * 2) {
status = "degraded";
warnings.push("High load average: " + cpuLoad[0].toFixed(2));
}
var health = {
status: status,
timestamp: new Date().toISOString(),
uptime: Math.floor(uptime) + "s",
version: process.env.APP_VERSION || "unknown",
node: process.version,
process: {
heapUsed: Math.round(memUsage.heapUsed / 1024 / 1024) + "MB",
heapTotal: Math.round(memUsage.heapTotal / 1024 / 1024) + "MB",
rss: Math.round(memUsage.rss / 1024 / 1024) + "MB"
},
system: {
loadAvg: cpuLoad.map(function(v) { return v.toFixed(2); }),
cpuCount: os.cpus().length,
memoryUsed: ((1 - freeMem / totalMem) * 100).toFixed(1) + "%"
}
};
if (warnings.length > 0) {
health.warnings = warnings;
}
res.status(status === "healthy" ? 200 : 503).json(health);
});
// ============================================================
// Prometheus Metrics Endpoint
// ============================================================
app.get("/metrics", function(req, res) {
res.set("Content-Type", register.contentType);
register.metrics().then(function(data) {
res.end(data);
}).catch(function(err) {
res.status(500).end(err.message);
});
});
// ============================================================
// Readiness and Liveness Probes
// ============================================================
var ready = false;
app.get("/ready", function(req, res) {
if (ready) {
res.status(200).json({ ready: true });
} else {
res.status(503).json({ ready: false });
}
});
app.get("/live", function(req, res) {
res.status(200).json({ alive: true, pid: process.pid });
});
// ============================================================
// Sample API Routes
// ============================================================
app.get("/api/data", function(req, res) {
// Simulate some work
var result = { items: [], count: 0 };
for (var i = 0; i < 100; i++) {
result.items.push({ id: i, value: Math.random() });
}
result.count = result.items.length;
res.json(result);
});
// ============================================================
// Error Handling
// ============================================================
app.use(function(err, req, res, next) {
errors.inc({ type: err.name || "UnknownError" });
log("error", "Unhandled error", {
error: err.message,
stack: err.stack,
path: req.path,
method: req.method
});
slack.send(
"[ERROR] " + err.message + " on " + req.method + " " + req.path,
"#dc3545",
function() {}
);
res.status(500).json({ error: "Internal server error" });
});
// ============================================================
// Process-Level Monitoring
// ============================================================
process.on("uncaughtException", function(err) {
errors.inc({ type: "UncaughtException" });
log("error", "Uncaught exception", { error: err.message, stack: err.stack });
slack.send("[FATAL] Uncaught exception: " + err.message, "#dc3545", function() {
process.exit(1);
});
});
process.on("unhandledRejection", function(reason) {
errors.inc({ type: "UnhandledRejection" });
var message = reason instanceof Error ? reason.message : String(reason);
log("error", "Unhandled rejection", { reason: message });
slack.send("[ERROR] Unhandled rejection: " + message, "#ffc107", function() {});
});
// Memory usage check every 60 seconds
setInterval(function() {
var mem = process.memoryUsage();
var heapPercent = (mem.heapUsed / mem.heapTotal * 100).toFixed(1);
if (mem.heapUsed / mem.heapTotal > 0.85) {
log("warn", "High heap usage", {
heapUsed: Math.round(mem.heapUsed / 1024 / 1024) + "MB",
heapTotal: Math.round(mem.heapTotal / 1024 / 1024) + "MB",
percent: heapPercent + "%"
});
slack.send(
"[WARN] Heap usage at " + heapPercent + "% on " + os.hostname(),
"#ffc107",
function() {}
);
}
}, 60000);
// ============================================================
// Setup DigitalOcean Alert Policies on Startup
// ============================================================
function setupAlertPolicies() {
var doToken = process.env.DIGITALOCEAN_TOKEN;
var dropletId = process.env.DROPLET_ID;
if (!doToken || !dropletId) {
log("info", "Skipping DO alert setup - missing token or droplet ID");
return;
}
var policies = [
{
type: "v1/insights/droplet/cpu",
description: "CPU > 80% for 5 minutes",
compare: "GreaterThan",
value: 80,
window: "5m"
},
{
type: "v1/insights/droplet/memory_utilization_percent",
description: "Memory > 90% for 5 minutes",
compare: "GreaterThan",
value: 90,
window: "5m"
},
{
type: "v1/insights/droplet/disk_utilization_percent",
description: "Disk > 85% for 5 minutes",
compare: "GreaterThan",
value: 85,
window: "5m"
}
];
policies.forEach(function(policy) {
policy.enabled = true;
policy.entities = [dropletId];
policy.tags = [process.env.NODE_ENV || "development"];
policy.alerts = {
email: [process.env.ALERT_EMAIL || "[email protected]"]
};
if (process.env.SLACK_WEBHOOK_URL) {
policy.alerts.slack = [{
channel: "#alerts",
url: process.env.SLACK_WEBHOOK_URL
}];
}
var postData = JSON.stringify(policy);
var options = {
hostname: "api.digitalocean.com",
path: "/v2/monitoring/alerts",
method: "POST",
headers: {
"Authorization": "Bearer " + doToken,
"Content-Type": "application/json",
"Content-Length": Buffer.byteLength(postData)
}
};
var req = https.request(options, function(res) {
var body = "";
res.on("data", function(chunk) { body += chunk; });
res.on("end", function() {
if (res.statusCode === 200 || res.statusCode === 201) {
log("info", "Alert policy created: " + policy.description);
} else {
log("warn", "Alert policy failed: " + policy.description, {
status: res.statusCode,
response: body
});
}
});
});
req.on("error", function(err) {
log("error", "Alert policy request failed", { error: err.message });
});
req.write(postData);
req.end();
});
}
// ============================================================
// Start Server
// ============================================================
var PORT = process.env.PORT || 8080;
var server = app.listen(PORT, function() {
log("info", "Server started", { port: PORT, env: process.env.NODE_ENV });
ready = true;
setupAlertPolicies();
});
// Graceful shutdown
process.on("SIGTERM", function() {
log("info", "SIGTERM received, shutting down gracefully");
ready = false;
server.close(function() {
log("info", "HTTP server closed");
process.exit(0);
});
// Force shutdown after 30 seconds
setTimeout(function() {
log("error", "Forced shutdown after timeout");
process.exit(1);
}, 30000);
});
module.exports = app;
Install the dependencies:
npm install express prom-client
Run the application:
PORT=8080 NODE_ENV=production SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/HOOK node app.js
Test the endpoints:
# Health check
curl http://localhost:8080/health
# Prometheus metrics
curl http://localhost:8080/metrics
# Readiness probe
curl http://localhost:8080/ready
# Liveness probe
curl http://localhost:8080/live
Common Issues and Troubleshooting
1. Metrics Agent Not Reporting Data
Error: No graphs visible on the Droplet monitoring tab. The control panel shows "Install the metrics agent to see monitoring data."
Cause: The do-agent service is either not installed or not running. This commonly happens on Droplets created from custom images or snapshots.
Fix:
# Check if the agent is installed
which do-agent
# If not installed
curl -sSL https://repos.insights.digitalocean.com/install.sh | sudo bash
# If installed but not running
sudo systemctl start do-agent
sudo systemctl enable do-agent
# Verify it is reporting
journalctl -u do-agent -f
If the agent is running but metrics still do not appear, check the Droplet metadata service:
curl -s http://169.254.169.254/metadata/v1/id
If this returns nothing, the metadata service is not accessible, which the agent requires. This happens on Droplets with misconfigured network settings.
2. Alert Policies Not Firing
Error: You have configured alert policies but never receive notifications, even when thresholds are clearly exceeded.
Cause: The most common reason is that the entities field in the alert policy contains the wrong Droplet ID, or the alert window has not been exceeded. Another frequent cause is that the Slack webhook URL has expired or been revoked.
Fix:
# Verify your Droplet ID
doctl compute droplet list --format ID,Name
# List existing alert policies
doctl monitoring alert list
# Test your Slack webhook directly
curl -X POST -H 'Content-type: application/json' \
--data '{"text":"Test alert from DigitalOcean monitoring setup"}' \
https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
If the Slack test works but DigitalOcean alerts do not arrive, check that the alert policy's window is appropriate. A 5-minute window means the condition must persist for a full 5 minutes before triggering.
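If you would rather verify from code than doctl, the sketch below lists alert policies through the API and prints which entities each one targets, so you can confirm your Droplet ID actually appears; it assumes the list response is shaped like { policies: [...] }, matching the alert policy endpoints used earlier.
var https = require("https");
var token = process.env.DIGITALOCEAN_TOKEN;
// Print each alert policy's description and the entities it targets.
https.request({
  hostname: "api.digitalocean.com",
  path: "/v2/monitoring/alerts",
  method: "GET",
  headers: { "Authorization": "Bearer " + token }
}, function(res) {
  var body = "";
  res.on("data", function(chunk) { body += chunk; });
  res.on("end", function() {
    var policies = JSON.parse(body).policies || [];
    policies.forEach(function(p) {
      console.log(p.description, "-> entities:", (p.entities || []).join(", ") || "(none)");
    });
  });
}).on("error", function(err) {
  console.error("Request failed:", err.message);
}).end();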
3. prom-client Memory Leak with High-Cardinality Labels
Error: JavaScript heap out of memory after running for several hours, with the /metrics endpoint taking increasingly long to respond.
Cause: Using request paths as metric labels without normalization creates a new time series for every unique URL. If your API has user IDs or UUIDs in paths, the label cardinality explodes.
Fix:
// BAD - unbounded cardinality
var route = req.path; // "/users/abc123", "/users/def456", etc.
// GOOD - normalized route
var route = req.route ? req.route.path : "unknown";
// Results in "/users/:id" regardless of the actual user ID
If you also track latency with a Summary, give it a sliding time window so old observations do not accumulate indefinitely. The maxAgeSeconds and ageBuckets options apply to Summaries (not Histograms) in prom-client:
var httpDurationSummary = new promClient.Summary({
  name: "http_request_duration_summary_seconds",
  help: "HTTP request duration",
  labelNames: ["method", "route", "status"],
  percentiles: [0.5, 0.9, 0.99],
  maxAgeSeconds: 600,
  ageBuckets: 5,
  registers: [register]
});
4. Uptime Check Failing Behind Load Balancer
Error: DigitalOcean uptime check reports the service as down, but the service is accessible from your browser. The uptime check shows HTTP 502 Bad Gateway or Connection Timeout.
Cause: DigitalOcean uptime checks come from specific IP ranges. If your firewall or DigitalOcean Cloud Firewall is blocking these IPs, the checks will fail. Another common cause is that the health check endpoint is behind authentication middleware.
Fix:
// Ensure health endpoint is BEFORE auth middleware
app.get("/health", function(req, res) {
res.status(200).json({ status: "ok" });
});
// Auth middleware applied after health route
app.use(authMiddleware);
// Protected routes below
app.get("/api/data", function(req, res) {
// ...
});
For firewall issues, allow DigitalOcean's monitoring IP ranges. You can find the current list in their documentation, but the simplest fix is to ensure your Cloud Firewall allows inbound HTTP/HTTPS from all sources to your health check port.
5. Database Connection Pool Exhaustion Not Detected
Error: Application returns Error: Cannot acquire connection from pool. Pool is probably full. but no DigitalOcean alert fires because database CPU and memory are fine.
Cause: DigitalOcean's database monitoring tracks server-level metrics, not application-level connection pool state. Your pool can be exhausted while the database server sits idle.
Fix: Monitor connection pool metrics from your application:
var pool = require("./db").pool; // Your database pool
var poolActive = new promClient.Gauge({
name: "db_pool_active_connections",
help: "Active database connections",
registers: [register]
});
var poolIdle = new promClient.Gauge({
name: "db_pool_idle_connections",
help: "Idle database connections",
registers: [register]
});
var poolWaiting = new promClient.Gauge({
name: "db_pool_waiting_requests",
help: "Requests waiting for a connection",
registers: [register]
});
setInterval(function() {
poolActive.set(pool.totalCount - pool.idleCount);
poolIdle.set(pool.idleCount);
poolWaiting.set(pool.waitingCount);
if (pool.waitingCount > 10) {
log("warn", "Connection pool congestion", {
active: pool.totalCount - pool.idleCount,
idle: pool.idleCount,
waiting: pool.waitingCount,
total: pool.totalCount
});
}
}, 5000);
Best Practices
Alert on symptoms, not causes. Monitor error rates and latency from the user's perspective first. CPU and memory alerts are useful, but a user-facing 5xx rate spike tells you something is actually broken, not just busy.
Set up alerts on day one, not after the first outage. DigitalOcean's built-in alerts are free. There is no reason to skip this step. At minimum, configure CPU > 80%, memory > 90%, and disk > 85% alerts for every production Droplet.
Use multiple notification channels. Email and Slack together. Email is durable and searchable. Slack is immediate. If one channel is down, you still get notified through the other. For critical alerts, add a PagerDuty or Opsgenie integration for phone call escalation.
Normalize metric labels to prevent cardinality explosions. Never use raw user IDs, session IDs, or UUIDs as Prometheus labels. Use route patterns (/users/:id) instead of resolved paths (/users/abc123). High cardinality is the number one cause of monitoring system performance problems.
Implement health checks with depth levels. A basic /health endpoint that returns 200 is a starting point. Add a /health?deep=true mode that checks database connectivity, external API reachability, and cache availability. Use the shallow check for load balancer probes (fast, frequent) and the deep check for uptime monitoring (less frequent, more informative).
Monitor your monitoring. If your Slack webhook is revoked, your alerts silently vanish. Periodically send test alerts to verify the notification pipeline is working. Set a monthly calendar reminder to check that your alert policies are still active and your notification channels still function.
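One way to make that heartbeat habit automatic is a scheduled test alert that reuses the SlackNotifier module from earlier; the weekly interval and message wording here are arbitrary choices.
var SlackNotifier = require("./slack-notifier"); // module shown earlier in this guide
var notifier = new SlackNotifier(process.env.SLACK_WEBHOOK_URL);
// A weekly heartbeat: if the webhook has been revoked, this surfaces as a
// logged error instead of alerts silently going nowhere.
setInterval(function() {
  notifier.alert(
    "Monitoring heartbeat",
    "If you can read this, the Slack alert pipeline is working.",
    "info",
    function(err) {
      if (err) console.error("Alert pipeline broken: Slack heartbeat failed:", err.message);
    }
  );
}, 7 * 24 * 60 * 60 * 1000);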
Separate operational alerts from informational dashboards. Not every metric needs an alert. Dashboard metrics like request rates and cache hit ratios are useful for capacity planning and debugging, but they should not wake someone up at 3 AM. Reserve alerts for conditions that require human intervention.
Log structured JSON, not plaintext. Structured logs are parseable, searchable, and filterable. Plaintext logs require regex patterns that break when the format changes. Every log entry should include a timestamp, severity level, service name, and enough context to understand the event without looking at surrounding lines.
Budget for log retention. DigitalOcean does not provide long-term log storage. Forward logs to Grafana Cloud Loki, Datadog, or even a dedicated Droplet running Loki. Thirty days of retention is the minimum for production. Ninety days gives you enough runway to spot trends.
Practice incident response before you need it. Run a tabletop exercise where you simulate a database disk-full scenario. Walk through how the alert fires, who gets paged, what the runbook says, and how the issue gets resolved. The first time you practice this should not be during an actual incident.
