CloudWatch Monitoring and Alarms
Build comprehensive AWS monitoring with CloudWatch custom metrics, structured logging, alarms, and dashboards using Node.js
CloudWatch is the backbone of observability on AWS. It collects metrics, aggregates logs, triggers alarms, and gives you a single pane of glass into every service you run. If you are deploying Node.js applications on Lambda, ECS, EC2, or anything else on AWS, CloudWatch is not optional — it is the first thing you configure and the last thing you check when something goes wrong.
This article covers CloudWatch end-to-end: default metrics, custom metrics from Node.js, structured logging, metric filters, alarms, dashboards, Logs Insights, the embedded metric format, anomaly detection, and cross-account monitoring. We will build a complete monitoring library you can drop into any Node.js project.
Prerequisites
- An AWS account with CloudWatch access
- Node.js 16+ installed locally
- AWS SDK v3 for JavaScript (@aws-sdk/client-cloudwatch, @aws-sdk/client-cloudwatch-logs)
- Basic familiarity with AWS IAM permissions
- An SNS topic for alarm notifications (we will create one programmatically)
Install the required packages:
npm install @aws-sdk/client-cloudwatch @aws-sdk/client-cloudwatch-logs @aws-sdk/client-sns
Understanding CloudWatch Metrics
Default Metrics
Every AWS service publishes default metrics to CloudWatch automatically. EC2 gives you CPU utilization, network I/O, and disk operations. Lambda gives you invocation count, duration, errors, and throttles. ECS gives you CPU and memory reservation. RDS gives you read/write IOPS, connections, and replica lag.
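You can read any of these default metrics with the GetMetricData API, the same call used later in this article for cross-account queries. Here is a minimal sketch that pulls the last hour of Lambda error counts (the function name is a placeholder):
var { CloudWatchClient, GetMetricDataCommand } = require("@aws-sdk/client-cloudwatch");
var client = new CloudWatchClient({ region: "us-east-1" });
var now = new Date();
client.send(new GetMetricDataCommand({
  StartTime: new Date(now.getTime() - 3600000),
  EndTime: now,
  MetricDataQueries: [
    {
      Id: "errors",
      MetricStat: {
        Metric: {
          Namespace: "AWS/Lambda",
          MetricName: "Errors",
          // "my-function" is an example; use your own function name
          Dimensions: [{ Name: "FunctionName", Value: "my-function" }]
        },
        Period: 300,
        Stat: "Sum"
      },
      ReturnData: true
    }
  ]
})).then(function(data) {
  console.log(JSON.stringify(data.MetricDataResults, null, 2));
});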
These default metrics are free (for basic monitoring at 5-minute intervals) and require zero configuration. The catch is they only tell you about infrastructure. They tell you nothing about your business logic — how many orders were placed, how many payments failed, how long your third-party API calls take. That is where custom metrics come in.
Custom Metrics
Custom metrics let you publish any numeric value to CloudWatch with dimensions that you define. A dimension is a key-value pair that categorizes the metric — think of it as a tag. You might publish an OrderCount metric with dimensions for Environment=production and Region=us-east-1.
Here is how to publish a custom metric from Node.js:
var { CloudWatchClient, PutMetricDataCommand } = require("@aws-sdk/client-cloudwatch");
var client = new CloudWatchClient({ region: "us-east-1" });
function publishMetric(namespace, metricName, value, dimensions, unit) {
var params = {
Namespace: namespace,
MetricData: [
{
MetricName: metricName,
Value: value,
Unit: unit || "Count",
Timestamp: new Date(),
Dimensions: dimensions || []
}
]
};
return client.send(new PutMetricDataCommand(params));
}
// Publish an order count metric
publishMetric(
"MyApp/Business",
"OrderCount",
1,
[
{ Name: "Environment", Value: "production" },
{ Name: "OrderType", Value: "subscription" }
],
"Count"
).then(function(result) {
console.log("Metric published:", result.$metadata.httpStatusCode);
}).catch(function(err) {
console.error("Failed to publish metric:", err.message);
});
A few things to know about custom metrics. First, the Namespace is just a grouping mechanism — use something like MyApp/Business or MyCompany/ServiceName. Do not use the AWS/ prefix; that is reserved for AWS services. Second, you can publish up to 1000 metric data points per PutMetricData call, and you can batch them to reduce API calls. Third, custom metrics cost money — $0.30 per metric per month at the time of writing — and high-resolution metrics (1-second granularity) mean more API requests, and alarms on them are billed at a higher rate than standard alarms.
Batching Metrics
In production, you do not want to make an API call for every single metric. Buffer them and flush periodically:
var { CloudWatchClient, PutMetricDataCommand } = require("@aws-sdk/client-cloudwatch");
function MetricBuffer(namespace, region, flushIntervalMs) {
this.namespace = namespace;
this.client = new CloudWatchClient({ region: region || "us-east-1" });
this.buffer = [];
this.flushInterval = flushIntervalMs || 60000;
this.maxBatchSize = 1000;
var self = this;
this.timer = setInterval(function() {
self.flush();
}, this.flushInterval);
}
MetricBuffer.prototype.add = function(metricName, value, dimensions, unit) {
this.buffer.push({
MetricName: metricName,
Value: value,
Unit: unit || "Count",
Timestamp: new Date(),
Dimensions: dimensions || []
});
if (this.buffer.length >= this.maxBatchSize) {
this.flush();
}
};
MetricBuffer.prototype.flush = function() {
if (this.buffer.length === 0) return Promise.resolve();
var batches = [];
while (this.buffer.length > 0) {
batches.push(this.buffer.splice(0, this.maxBatchSize));
}
var self = this;
var promises = batches.map(function(batch) {
var params = {
Namespace: self.namespace,
MetricData: batch
};
return self.client.send(new PutMetricDataCommand(params));
});
return Promise.all(promises).catch(function(err) {
console.error("Failed to flush metrics:", err.message);
});
};
MetricBuffer.prototype.stop = function() {
clearInterval(this.timer);
return this.flush();
};
module.exports = MetricBuffer;
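Usage, assuming the buffer above is saved as metric-buffer.js:
var MetricBuffer = require("./metric-buffer");
var metrics = new MetricBuffer("MyApp/Business", "us-east-1", 30000);

// Adds are cheap: they only push onto the in-memory buffer
metrics.add("OrderCount", 1, [{ Name: "Environment", Value: "production" }]);
metrics.add("PaymentLatency", 184, [{ Name: "Environment", Value: "production" }], "Milliseconds");

// Flush whatever is left before the process exits
process.on("SIGTERM", function() {
  metrics.stop().then(function() {
    process.exit(0);
  });
});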
CloudWatch Logs
Log Groups and Log Streams
CloudWatch Logs organizes data into log groups and log streams. A log group is a collection of log streams that share the same retention, monitoring, and access control settings. A log stream is a sequence of log events from the same source — typically one per container, Lambda execution environment, or EC2 instance.
For Lambda, AWS creates the log group automatically at /aws/lambda/your-function-name. For ECS, you configure the awslogs log driver in your task definition and specify the log group. For EC2, you install the CloudWatch agent.
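Lambda creates its log group on first invocation with retention set to never expire, so it is worth creating log groups ahead of time (or right after deployment) with an explicit retention policy. A minimal sketch; the group name and 90-day retention are examples:
var {
  CloudWatchLogsClient,
  CreateLogGroupCommand,
  PutRetentionPolicyCommand
} = require("@aws-sdk/client-cloudwatch-logs");

var logsClient = new CloudWatchLogsClient({ region: "us-east-1" });

function ensureLogGroup(logGroupName, retentionInDays) {
  return logsClient.send(new CreateLogGroupCommand({ logGroupName: logGroupName }))
    .catch(function(err) {
      // Ignore "already exists" so the call is idempotent
      if (err.name !== "ResourceAlreadyExistsException") throw err;
    })
    .then(function() {
      return logsClient.send(new PutRetentionPolicyCommand({
        logGroupName: logGroupName,
        retentionInDays: retentionInDays
      }));
    });
}

ensureLogGroup("/aws/lambda/order-service", 90).then(function() {
  console.log("Log group ready with 90-day retention");
});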
Structured Logging
Stop writing unstructured log lines like console.log("Order processed: " + orderId). Structured logging means writing JSON to stdout so CloudWatch can parse individual fields. This is critical for metric filters and Logs Insights queries.
Here is a structured logger for Node.js:
var os = require("os");
function StructuredLogger(serviceName, environment) {
this.serviceName = serviceName;
this.environment = environment || process.env.NODE_ENV || "development";
this.hostname = os.hostname();
}
StructuredLogger.prototype.log = function(level, message, data) {
var entry = {
timestamp: new Date().toISOString(),
level: level,
message: message,
service: this.serviceName,
environment: this.environment,
hostname: this.hostname
};
if (data) {
Object.keys(data).forEach(function(key) {
entry[key] = data[key];
});
}
// Lambda and ECS capture stdout automatically
process.stdout.write(JSON.stringify(entry) + "\n");
};
StructuredLogger.prototype.info = function(message, data) {
this.log("INFO", message, data);
};
StructuredLogger.prototype.warn = function(message, data) {
this.log("WARN", message, data);
};
StructuredLogger.prototype.error = function(message, data) {
this.log("ERROR", message, data);
};
StructuredLogger.prototype.metric = function(metricName, value, data) {
var metricData = data || {};
metricData._metric = metricName;
metricData._value = value;
this.log("METRIC", metricName, metricData);
};
module.exports = StructuredLogger;
Usage in a Lambda function:
var StructuredLogger = require("./structured-logger");
var logger = new StructuredLogger("order-service", "production");
exports.handler = function(event, context) {
var startTime = Date.now();
logger.info("Processing order", {
orderId: event.orderId,
customerId: event.customerId,
requestId: context.awsRequestId
});
// Process the order...
var duration = Date.now() - startTime;
logger.metric("OrderProcessingTime", duration, {
orderId: event.orderId,
unit: "Milliseconds"
});
return { statusCode: 200, body: "Order processed" };
};
This produces log entries like:
{
"timestamp": "2026-02-13T14:30:00.000Z",
"level": "METRIC",
"message": "OrderProcessingTime",
"service": "order-service",
"environment": "production",
"hostname": "169.254.123.45",
"orderId": "ord-12345",
"unit": "Milliseconds",
"_metric": "OrderProcessingTime",
"_value": 342
}
Metric Filters
Metric filters let you extract metric data from log events. You define a pattern that matches log entries and CloudWatch creates a metric from the matches. This is how you turn logs into alarms without changing your application code.
var {
CloudWatchLogsClient,
PutMetricFilterCommand
} = require("@aws-sdk/client-cloudwatch-logs");
var logsClient = new CloudWatchLogsClient({ region: "us-east-1" });
function createMetricFilter(logGroupName, filterName, filterPattern, metricNamespace, metricName, metricValue) {
var params = {
logGroupName: logGroupName,
filterName: filterName,
filterPattern: filterPattern,
metricTransformations: [
{
metricNamespace: metricNamespace,
metricName: metricName,
metricValue: metricValue || "1",
defaultValue: 0
}
]
};
return logsClient.send(new PutMetricFilterCommand(params));
}
// Count ERROR-level log entries
createMetricFilter(
"/aws/lambda/order-service",
"ErrorCount",
'{ $.level = "ERROR" }',
"MyApp/Logs",
"ErrorCount",
"1"
).then(function() {
console.log("Metric filter created");
});
// Extract processing time from structured logs
createMetricFilter(
"/aws/lambda/order-service",
"OrderProcessingTime",
'{ $.level = "METRIC" && $._metric = "OrderProcessingTime" }',
"MyApp/Business",
"OrderProcessingTime",
"$._value"
).then(function() {
console.log("Processing time filter created");
});
The filter pattern syntax is specific to CloudWatch. For JSON logs, you use { $.fieldName = "value" }. You can use comparison operators (>, <, >=, <=, !=) and logical operators (&&, ||). The $._value reference extracts the actual numeric value from the log event.
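A few more patterns in the same JSON syntax, shown here as strings you could pass to createMetricFilter. The statusCode field and the thresholds are examples of what your own structured logs might contain:
// Count 5xx responses (assumes your logs include a numeric statusCode field)
var serverErrors = '{ $.level = "ERROR" && $.statusCode >= 500 }';

// Count orders that took longer than one second
var slowOrders = '{ $._metric = "OrderProcessingTime" && $._value > 1000 }';

// Count warnings from a specific environment
var stagingWarnings = '{ $.level = "WARN" && $.environment = "staging" }';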
Creating Alarms
Simple Alarms
A CloudWatch alarm watches a single metric and triggers when the metric crosses a threshold for a specified number of evaluation periods.
var { CloudWatchClient, PutMetricAlarmCommand } = require("@aws-sdk/client-cloudwatch");
var cwClient = new CloudWatchClient({ region: "us-east-1" });
function createAlarm(options) {
var statistic = options.statistic || "Sum";
var params = {
AlarmName: options.name,
AlarmDescription: options.description || "",
Namespace: options.namespace,
MetricName: options.metricName,
Dimensions: options.dimensions || [],
Period: options.period || 300,
EvaluationPeriods: options.evaluationPeriods || 1,
DatapointsToAlarm: options.datapointsToAlarm || 1,
Threshold: options.threshold,
ComparisonOperator: options.comparisonOperator || "GreaterThanOrEqualToThreshold",
TreatMissingData: options.treatMissingData || "notBreaching",
ActionsEnabled: true,
AlarmActions: options.alarmActions || [],
OKActions: options.okActions || [],
InsufficientDataActions: options.insufficientDataActions || []
};
// Percentile statistics like p99 must be passed as ExtendedStatistic, not Statistic
if (/^p\d/.test(statistic)) {
params.ExtendedStatistic = statistic;
} else {
params.Statistic = statistic;
}
return cwClient.send(new PutMetricAlarmCommand(params));
}
// Alert when error count exceeds 10 in 5 minutes
createAlarm({
name: "HighErrorRate-OrderService",
description: "Error rate exceeded threshold in order service",
namespace: "MyApp/Logs",
metricName: "ErrorCount",
statistic: "Sum",
period: 300,
evaluationPeriods: 1,
threshold: 10,
comparisonOperator: "GreaterThanOrEqualToThreshold",
alarmActions: ["arn:aws:sns:us-east-1:123456789012:ops-alerts"]
}).then(function() {
console.log("Error rate alarm created");
});
// Alert when p99 latency exceeds 2 seconds
createAlarm({
name: "HighLatency-OrderService",
description: "P99 latency exceeded 2 seconds",
namespace: "MyApp/Business",
metricName: "OrderProcessingTime",
statistic: "p99",
period: 300,
evaluationPeriods: 3,
datapointsToAlarm: 2,
threshold: 2000,
comparisonOperator: "GreaterThanOrEqualToThreshold",
alarmActions: ["arn:aws:sns:us-east-1:123456789012:ops-alerts"]
});
The TreatMissingData parameter is important. If your metric has gaps (e.g., no orders come in at 3 AM), you usually want notBreaching so the alarm does not fire on missing data. Other options are breaching, ignore, and missing.
Composite Alarms
Composite alarms combine multiple alarms with boolean logic. This is how you avoid alert fatigue. Instead of getting five separate alarm notifications, you create one composite alarm that only fires when multiple conditions are true.
var { CloudWatchClient, PutCompositeAlarmCommand } = require("@aws-sdk/client-cloudwatch");
var cwClient = new CloudWatchClient({ region: "us-east-1" });
function createCompositeAlarm(name, description, alarmRule, actions) {
var params = {
AlarmName: name,
AlarmDescription: description,
AlarmRule: alarmRule,
ActionsEnabled: true,
AlarmActions: actions || []
};
return cwClient.send(new PutCompositeAlarmCommand(params));
}
// Only alert when BOTH high errors AND high latency occur
createCompositeAlarm(
"Critical-OrderService",
"Order service is experiencing both high errors and high latency",
'ALARM("HighErrorRate-OrderService") AND ALARM("HighLatency-OrderService")',
["arn:aws:sns:us-east-1:123456789012:pager-duty"]
).then(function() {
console.log("Composite alarm created");
});
The alarm rule supports AND, OR, NOT, and TRUE/FALSE. You can nest conditions: ALARM("A") AND (ALARM("B") OR ALARM("C")).
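For example, reusing the helper above with a nested rule (the alarm names beyond the ones created earlier are hypothetical):
createCompositeAlarm(
  "Critical-CheckoutPath",
  "Orders are failing and at least one downstream dependency is unhealthy",
  'ALARM("HighErrorRate-OrderService") AND (ALARM("PaymentGatewayDown") OR ALARM("InventoryServiceLatency"))',
  ["arn:aws:sns:us-east-1:123456789012:pager-duty"]
).then(function() {
  console.log("Nested composite alarm created");
});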
SNS Notification Integration
Alarms are useless without notifications. SNS is the standard integration point. Create a topic, subscribe your email or PagerDuty endpoint, and reference the topic ARN in your alarm actions.
var { SNSClient, CreateTopicCommand, SubscribeCommand } = require("@aws-sdk/client-sns");
var snsClient = new SNSClient({ region: "us-east-1" });
function setupAlertingTopic(topicName, emailAddress) {
return snsClient.send(new CreateTopicCommand({ Name: topicName }))
.then(function(topicResult) {
var topicArn = topicResult.TopicArn;
console.log("Topic created:", topicArn);
return snsClient.send(new SubscribeCommand({
TopicArn: topicArn,
Protocol: "email",
Endpoint: emailAddress
})).then(function() {
console.log("Subscription created. Check email for confirmation.");
return topicArn;
});
});
}
setupAlertingTopic("ops-alerts", "[email protected]")
.then(function(topicArn) {
console.log("Use this ARN for alarm actions:", topicArn);
});
For production, you would also add HTTPS subscriptions for PagerDuty, Opsgenie, or a custom webhook handler. You can also subscribe Lambda functions to SNS topics for custom alert processing — enriching alerts with additional context, routing to Slack channels, or auto-remediating known issues.
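As a sketch of that pattern, here is a Lambda handler subscribed to the alarm topic that reformats the notification and posts it to a Slack incoming webhook. The SLACK_WEBHOOK_URL environment variable is an assumption; point it at your own webhook:
var https = require("https");

// SNS delivers the CloudWatch alarm as a JSON string in Sns.Message
exports.handler = function(event) {
  var message = JSON.parse(event.Records[0].Sns.Message);
  var text = "*" + message.AlarmName + "* is " + message.NewStateValue + "\n" + message.NewStateReason;

  var webhook = new URL(process.env.SLACK_WEBHOOK_URL);
  var body = JSON.stringify({ text: text });

  return new Promise(function(resolve, reject) {
    var req = https.request({
      hostname: webhook.hostname,
      path: webhook.pathname,
      method: "POST",
      headers: { "Content-Type": "application/json" }
    }, function(res) {
      res.on("data", function() {});
      res.on("end", resolve);
    });
    req.on("error", reject);
    req.write(body);
    req.end();
  });
};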
CloudWatch Dashboards Programmatically
Dashboards created by clicking around the console are impossible to reproduce across environments. Define them in code instead:
var { CloudWatchClient, PutDashboardCommand } = require("@aws-sdk/client-cloudwatch");
var cwClient = new CloudWatchClient({ region: "us-east-1" });
function createDashboard(dashboardName, widgets) {
var body = {
widgets: widgets
};
var params = {
DashboardName: dashboardName,
DashboardBody: JSON.stringify(body)
};
return cwClient.send(new PutDashboardCommand(params));
}
var widgets = [
{
type: "metric",
x: 0,
y: 0,
width: 12,
height: 6,
properties: {
title: "Order Count",
metrics: [
["MyApp/Business", "OrderCount", "Environment", "production"]
],
period: 300,
stat: "Sum",
region: "us-east-1",
view: "timeSeries"
}
},
{
type: "metric",
x: 12,
y: 0,
width: 12,
height: 6,
properties: {
title: "Error Rate vs Latency",
metrics: [
["MyApp/Logs", "ErrorCount", { stat: "Sum", color: "#d62728" }],
["MyApp/Business", "OrderProcessingTime", { stat: "p99", yAxis: "right", color: "#1f77b4" }]
],
period: 300,
region: "us-east-1",
view: "timeSeries",
yAxis: {
left: { label: "Errors", min: 0 },
right: { label: "Latency (ms)", min: 0 }
}
}
},
{
type: "alarm",
x: 0,
y: 6,
width: 24,
height: 3,
properties: {
title: "Alarm Status",
alarms: [
"arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighErrorRate-OrderService",
"arn:aws:cloudwatch:us-east-1:123456789012:alarm:HighLatency-OrderService",
"arn:aws:cloudwatch:us-east-1:123456789012:alarm:Critical-OrderService"
]
}
},
{
type: "log",
x: 0,
y: 9,
width: 24,
height: 6,
properties: {
title: "Recent Errors",
query: "SOURCE '/aws/lambda/order-service' | fields @timestamp, @message\n| filter level = 'ERROR'\n| sort @timestamp desc\n| limit 20",
region: "us-east-1",
view: "table"
}
}
];
createDashboard("OrderService-Production", widgets)
.then(function() {
console.log("Dashboard created");
});
CloudWatch Logs Insights
Logs Insights is a query language for searching and analyzing log data. It is significantly faster than scrolling through log streams, and it supports aggregation, filtering, and visualization.
var {
CloudWatchLogsClient,
StartQueryCommand,
GetQueryResultsCommand
} = require("@aws-sdk/client-cloudwatch-logs");
var logsClient = new CloudWatchLogsClient({ region: "us-east-1" });
function queryLogs(logGroupName, queryString, startTime, endTime) {
var params = {
logGroupName: logGroupName,
startTime: Math.floor(startTime.getTime() / 1000),
endTime: Math.floor(endTime.getTime() / 1000),
queryString: queryString,
limit: 100
};
return logsClient.send(new StartQueryCommand(params))
.then(function(startResult) {
var queryId = startResult.queryId;
return waitForQueryResults(queryId);
});
}
function waitForQueryResults(queryId, attempt) {
attempt = attempt || 0;
if (attempt > 30) {
return Promise.reject(new Error("Query timed out after 30 attempts"));
}
return new Promise(function(resolve) {
setTimeout(resolve, 1000);
}).then(function() {
return logsClient.send(new GetQueryResultsCommand({ queryId: queryId }));
}).then(function(result) {
if (result.status === "Complete") {
return result.results;
}
if (result.status === "Failed" || result.status === "Cancelled") {
return Promise.reject(new Error("Query " + result.status.toLowerCase()));
}
return waitForQueryResults(queryId, attempt + 1);
});
}
// Find the slowest orders in the last hour
var now = new Date();
var oneHourAgo = new Date(now.getTime() - 3600000);
queryLogs(
"/aws/lambda/order-service",
[
'fields @timestamp, orderId, _value as duration',
'filter level = "METRIC" and _metric = "OrderProcessingTime"',
'sort duration desc',
'limit 10'
].join(" | "),
oneHourAgo,
now
).then(function(results) {
console.log("Slowest orders:");
results.forEach(function(row) {
var fields = {};
row.forEach(function(field) {
fields[field.field] = field.value;
});
console.log(fields);
});
});
// Aggregate error counts by error type over the last 24 hours
var oneDayAgo = new Date(now.getTime() - 86400000);
queryLogs(
"/aws/lambda/order-service",
[
'filter level = "ERROR"',
'stats count(*) as errorCount by message',
'sort errorCount desc',
'limit 20'
].join(" | "),
oneDayAgo,
now
).then(function(results) {
console.log("Error breakdown:");
results.forEach(function(row) {
var fields = {};
row.forEach(function(field) {
fields[field.field] = field.value;
});
console.log(fields.message, ":", fields.errorCount);
});
});
Embedded Metric Format (EMF)
The Embedded Metric Format lets you embed metric data directly in your structured log output. CloudWatch automatically extracts the metrics without requiring metric filters. This is the most efficient way to publish custom metrics from Lambda because there is no additional API call — the metrics ride along with your logs.
function emitEMFMetric(namespace, metrics, dimensions, properties) {
var metricDefinitions = Object.keys(metrics).map(function(name) {
return {
Name: name,
Unit: metrics[name].unit || "None"
};
});
var dimensionSets = [Object.keys(dimensions)];
var emfLog = {
_aws: {
Timestamp: Date.now(),
CloudWatchMetrics: [
{
Namespace: namespace,
Dimensions: dimensionSets,
Metrics: metricDefinitions
}
]
}
};
// Add dimension values
Object.keys(dimensions).forEach(function(key) {
emfLog[key] = dimensions[key];
});
// Add metric values
Object.keys(metrics).forEach(function(key) {
emfLog[key] = metrics[key].value;
});
// Add extra properties (searchable in logs but not published as metrics)
if (properties) {
Object.keys(properties).forEach(function(key) {
emfLog[key] = properties[key];
});
}
process.stdout.write(JSON.stringify(emfLog) + "\n");
}
// Usage in a Lambda function
exports.handler = function(event, context) {
var startTime = Date.now();
// Process order...
var duration = Date.now() - startTime;
var success = true;
emitEMFMetric(
"MyApp/Business",
{
OrderProcessingTime: { value: duration, unit: "Milliseconds" },
OrderCount: { value: 1, unit: "Count" },
OrderSuccess: { value: success ? 1 : 0, unit: "Count" }
},
{
Environment: "production",
Service: "order-service"
},
{
orderId: event.orderId,
customerId: event.customerId,
requestId: context.awsRequestId
}
);
return { statusCode: 200 };
};
EMF is the preferred approach for Lambda functions. No SDK calls, no batching, no buffering — just write JSON to stdout and CloudWatch handles the rest.
Anomaly Detection Alarms
Static thresholds are a pain to maintain. Anomaly detection alarms use machine learning to establish a baseline for your metric and alert when the metric deviates from that baseline. This is particularly useful for metrics with natural variance — request counts that follow daily patterns, for example.
var { CloudWatchClient, PutAnomalyDetectorCommand, PutMetricAlarmCommand } = require("@aws-sdk/client-cloudwatch");
var cwClient = new CloudWatchClient({ region: "us-east-1" });
function createAnomalyDetectionAlarm(options) {
// First, create the anomaly detector
var detectorParams = {
Namespace: options.namespace,
MetricName: options.metricName,
Stat: options.statistic || "Average",
Dimensions: options.dimensions || []
};
return cwClient.send(new PutAnomalyDetectorCommand(detectorParams))
.then(function() {
// Then create an alarm that uses the anomaly detector band
var alarmParams = {
AlarmName: options.alarmName,
AlarmDescription: options.description,
ActionsEnabled: true,
AlarmActions: options.alarmActions || [],
EvaluationPeriods: options.evaluationPeriods || 3,
DatapointsToAlarm: options.datapointsToAlarm || 2,
ComparisonOperator: "LessThanLowerOrGreaterThanUpperThreshold",
TreatMissingData: "notBreaching",
Metrics: [
{
Id: "m1",
MetricStat: {
Metric: {
Namespace: options.namespace,
MetricName: options.metricName,
Dimensions: options.dimensions || []
},
Period: options.period || 300,
Stat: options.statistic || "Average"
},
ReturnData: true
},
{
Id: "ad1",
Expression: "ANOMALY_DETECTION_BAND(m1, " + (options.bandWidth || 2) + ")",
Label: options.metricName + " (expected)",
ReturnData: true
}
],
ThresholdMetricId: "ad1"
};
return cwClient.send(new PutMetricAlarmCommand(alarmParams));
});
}
createAnomalyDetectionAlarm({
alarmName: "AnomalousOrderVolume",
description: "Order volume is outside expected range",
namespace: "MyApp/Business",
metricName: "OrderCount",
statistic: "Sum",
dimensions: [{ Name: "Environment", Value: "production" }],
period: 3600,
evaluationPeriods: 3,
datapointsToAlarm: 2,
bandWidth: 2,
alarmActions: ["arn:aws:sns:us-east-1:123456789012:ops-alerts"]
}).then(function() {
console.log("Anomaly detection alarm created");
});
The bandWidth parameter controls the sensitivity. A value of 2 means the alarm triggers when the metric is more than 2 standard deviations from the predicted baseline. Lower values mean more sensitive (more alerts); higher values mean less sensitive.
Cross-Account Monitoring
In organizations with multiple AWS accounts, you want a central monitoring account that aggregates metrics from all workload accounts. CloudWatch supports cross-account observability through a sharing configuration.
You can wire that up through the console, but a simpler approach that works from any account is role assumption: the monitoring account assumes a read-only IAM role in the workload account and reads metrics through the normal APIs:
var { CloudWatchClient, PutMetricDataCommand } = require("@aws-sdk/client-cloudwatch");
// Assume a role in the workload account to read metrics
var { STSClient, AssumeRoleCommand } = require("@aws-sdk/client-sts");
var stsClient = new STSClient({ region: "us-east-1" });
function getClientForAccount(accountId, roleName) {
var roleArn = "arn:aws:iam::" + accountId + ":role/" + roleName;
return stsClient.send(new AssumeRoleCommand({
RoleArn: roleArn,
RoleSessionName: "cross-account-monitoring"
})).then(function(result) {
var credentials = {
accessKeyId: result.Credentials.AccessKeyId,
secretAccessKey: result.Credentials.SecretAccessKey,
sessionToken: result.Credentials.SessionToken
};
return new CloudWatchClient({
region: "us-east-1",
credentials: credentials
});
});
}
// Query metrics from another account
function getMetricsFromAccount(accountId, roleName, namespace, metricName) {
var { GetMetricDataCommand } = require("@aws-sdk/client-cloudwatch");
return getClientForAccount(accountId, roleName)
.then(function(client) {
var now = new Date();
var oneHourAgo = new Date(now.getTime() - 3600000);
return client.send(new GetMetricDataCommand({
StartTime: oneHourAgo,
EndTime: now,
MetricDataQueries: [
{
Id: "m1",
MetricStat: {
Metric: {
Namespace: namespace,
MetricName: metricName
},
Period: 300,
Stat: "Sum"
}
}
]
}));
});
}
getMetricsFromAccount("987654321098", "CloudWatchReadRole", "MyApp/Business", "OrderCount")
.then(function(data) {
console.log("Cross-account metrics:", JSON.stringify(data.MetricDataResults, null, 2));
});
The workload account needs an IAM role that trusts the monitoring account, with permissions for cloudwatch:GetMetricData, cloudwatch:ListMetrics, logs:StartQuery, and logs:GetQueryResults.
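Creating that role from the workload account side might look like the following; the account ID and role name are placeholders:
var { IAMClient, CreateRoleCommand, PutRolePolicyCommand } = require("@aws-sdk/client-iam");

var iamClient = new IAMClient({ region: "us-east-1" });
var MONITORING_ACCOUNT_ID = "123456789012"; // placeholder

// Trust policy: only the monitoring account may assume this role
var trustPolicy = {
  Version: "2012-10-17",
  Statement: [{
    Effect: "Allow",
    Principal: { AWS: "arn:aws:iam::" + MONITORING_ACCOUNT_ID + ":root" },
    Action: "sts:AssumeRole"
  }]
};

// Read-only permissions for metrics and Logs Insights queries
var readPolicy = {
  Version: "2012-10-17",
  Statement: [{
    Effect: "Allow",
    Action: [
      "cloudwatch:GetMetricData",
      "cloudwatch:ListMetrics",
      "logs:StartQuery",
      "logs:GetQueryResults"
    ],
    Resource: "*"
  }]
};

iamClient.send(new CreateRoleCommand({
  RoleName: "CloudWatchReadRole",
  AssumeRolePolicyDocument: JSON.stringify(trustPolicy)
})).then(function() {
  return iamClient.send(new PutRolePolicyCommand({
    RoleName: "CloudWatchReadRole",
    PolicyName: "CloudWatchReadAccess",
    PolicyDocument: JSON.stringify(readPolicy)
  }));
}).then(function() {
  console.log("Cross-account read role created");
});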
Complete Working Example: Monitoring Library
Here is a complete monitoring library that ties everything together. Drop this into any Node.js project and call the methods to publish metrics, set up alarms, and query logs.
var { CloudWatchClient, PutMetricDataCommand, PutMetricAlarmCommand } = require("@aws-sdk/client-cloudwatch");
var { CloudWatchLogsClient, StartQueryCommand, GetQueryResultsCommand, PutMetricFilterCommand } = require("@aws-sdk/client-cloudwatch-logs");
var { SNSClient, PublishCommand } = require("@aws-sdk/client-sns");
function MonitoringService(config) {
this.namespace = config.namespace;
this.environment = config.environment || "production";
this.region = config.region || "us-east-1";
this.snsTopicArn = config.snsTopicArn;
this.cwClient = new CloudWatchClient({ region: this.region });
this.logsClient = new CloudWatchLogsClient({ region: this.region });
this.snsClient = new SNSClient({ region: this.region });
this.metricBuffer = [];
this.flushIntervalMs = config.flushIntervalMs || 30000;
var self = this;
this.flushTimer = setInterval(function() {
self.flushMetrics();
}, this.flushIntervalMs);
}
// -- Metrics --
MonitoringService.prototype.recordMetric = function(name, value, unit, dimensions) {
var allDimensions = [{ Name: "Environment", Value: this.environment }];
if (dimensions) {
allDimensions = allDimensions.concat(dimensions);
}
this.metricBuffer.push({
MetricName: name,
Value: value,
Unit: unit || "Count",
Timestamp: new Date(),
Dimensions: allDimensions
});
if (this.metricBuffer.length >= 500) {
return this.flushMetrics();
}
return Promise.resolve();
};
MonitoringService.prototype.recordDuration = function(name, startTime, dimensions) {
var duration = Date.now() - startTime;
return this.recordMetric(name, duration, "Milliseconds", dimensions);
};
MonitoringService.prototype.flushMetrics = function() {
if (this.metricBuffer.length === 0) return Promise.resolve();
var batches = [];
while (this.metricBuffer.length > 0) {
batches.push(this.metricBuffer.splice(0, 1000));
}
var self = this;
return Promise.all(batches.map(function(batch) {
return self.cwClient.send(new PutMetricDataCommand({
Namespace: self.namespace,
MetricData: batch
}));
})).catch(function(err) {
console.error("[MonitoringService] Failed to flush metrics:", err.message);
});
};
// -- EMF Logging --
MonitoringService.prototype.emitEMF = function(metrics, dimensions, properties) {
var allDimensions = { Environment: this.environment };
if (dimensions) {
Object.keys(dimensions).forEach(function(k) {
allDimensions[k] = dimensions[k];
});
}
var metricDefs = Object.keys(metrics).map(function(name) {
return { Name: name, Unit: metrics[name].unit || "None" };
});
var entry = {
_aws: {
Timestamp: Date.now(),
CloudWatchMetrics: [{
Namespace: this.namespace,
Dimensions: [Object.keys(allDimensions)],
Metrics: metricDefs
}]
}
};
Object.keys(allDimensions).forEach(function(k) { entry[k] = allDimensions[k]; });
Object.keys(metrics).forEach(function(k) { entry[k] = metrics[k].value; });
if (properties) {
Object.keys(properties).forEach(function(k) { entry[k] = properties[k]; });
}
process.stdout.write(JSON.stringify(entry) + "\n");
};
// -- Alarms --
MonitoringService.prototype.createAlarm = function(options) {
var params = {
AlarmName: this.environment + "-" + options.name,
AlarmDescription: options.description || "",
Namespace: this.namespace,
MetricName: options.metricName,
Dimensions: [{ Name: "Environment", Value: this.environment }],
Period: options.period || 300,
EvaluationPeriods: options.evaluationPeriods || 1,
Threshold: options.threshold,
ComparisonOperator: options.comparisonOperator || "GreaterThanOrEqualToThreshold",
TreatMissingData: options.treatMissingData || "notBreaching",
ActionsEnabled: true,
AlarmActions: this.snsTopicArn ? [this.snsTopicArn] : []
};
var statistic = options.statistic || "Sum";
// Percentile statistics like p99 must be passed as ExtendedStatistic, not Statistic
if (/^p\d/.test(statistic)) {
params.ExtendedStatistic = statistic;
} else {
params.Statistic = statistic;
}
return this.cwClient.send(new PutMetricAlarmCommand(params));
};
// -- Log Queries --
MonitoringService.prototype.queryLogs = function(logGroupName, query, hours) {
var now = new Date();
var startTime = new Date(now.getTime() - (hours || 1) * 3600000);
var self = this;
return this.logsClient.send(new StartQueryCommand({
logGroupName: logGroupName,
startTime: Math.floor(startTime.getTime() / 1000),
endTime: Math.floor(now.getTime() / 1000),
queryString: query,
limit: 100
})).then(function(result) {
return self._pollQuery(result.queryId, 0);
});
};
MonitoringService.prototype._pollQuery = function(queryId, attempt) {
if (attempt > 30) return Promise.reject(new Error("Query timed out"));
var self = this;
return new Promise(function(resolve) {
setTimeout(resolve, 1000);
}).then(function() {
return self.logsClient.send(new GetQueryResultsCommand({ queryId: queryId }));
}).then(function(result) {
if (result.status === "Complete") return result.results;
if (result.status === "Failed" || result.status === "Cancelled") return Promise.reject(new Error("Query " + result.status.toLowerCase()));
return self._pollQuery(queryId, attempt + 1);
});
};
// -- Metric Filters --
MonitoringService.prototype.createMetricFilter = function(logGroupName, filterName, pattern, metricName, metricValue) {
return this.logsClient.send(new PutMetricFilterCommand({
logGroupName: logGroupName,
filterName: filterName,
filterPattern: pattern,
metricTransformations: [{
metricNamespace: this.namespace,
metricName: metricName,
metricValue: metricValue || "1",
defaultValue: 0
}]
}));
};
// -- Shutdown --
MonitoringService.prototype.shutdown = function() {
clearInterval(this.flushTimer);
return this.flushMetrics();
};
module.exports = MonitoringService;
Usage:
var MonitoringService = require("./monitoring-service");
var monitor = new MonitoringService({
namespace: "MyApp/OrderService",
environment: "production",
region: "us-east-1",
snsTopicArn: "arn:aws:sns:us-east-1:123456789012:ops-alerts",
flushIntervalMs: 30000
});
// Record business metrics
monitor.recordMetric("OrderPlaced", 1, "Count", [
{ Name: "OrderType", Value: "subscription" }
]);
// Record durations
var start = Date.now();
// ... do work ...
monitor.recordDuration("APICallLatency", start, [
{ Name: "Endpoint", Value: "/api/orders" }
]);
// Create alarms
monitor.createAlarm({
name: "HighErrorRate",
metricName: "ErrorCount",
threshold: 50,
period: 300,
evaluationPeriods: 2
});
// Query recent errors
monitor.queryLogs(
"/aws/lambda/order-service",
'filter level = "ERROR" | stats count(*) by message | sort count desc',
24
).then(function(results) {
console.log("Error summary:", results);
});
// Graceful shutdown
process.on("SIGTERM", function() {
monitor.shutdown().then(function() {
process.exit(0);
});
});
Common Issues and Troubleshooting
1. "Metric not appearing in CloudWatch console"
Error: No metrics found for namespace "MyApp/Business"
Custom metrics take up to 2 minutes to appear after the first PutMetricData call. Check the namespace spelling — it is case-sensitive. Also verify your IAM role has cloudwatch:PutMetricData permission. A common mistake is publishing to us-east-1 but looking in us-west-2 in the console.
2. "AccessDeniedException when creating alarms"
AccessDeniedException: User: arn:aws:sts::123456789012:assumed-role/my-role/session is not
authorized to perform: cloudwatch:PutMetricAlarm on resource: arn:aws:cloudwatch:us-east-1:...
Your IAM policy needs cloudwatch:PutMetricAlarm explicitly. If you are using alarm actions with SNS, you also need sns:Publish permission on the topic ARN, and the SNS topic must have a resource policy that allows CloudWatch to publish to it.
3. "Metric filter not generating metrics"
Filter pattern: { $.level = "ERROR" }
Expected metric: ErrorCount
Actual: No data points
Test your filter pattern in the CloudWatch console's "Test pattern" feature before deploying. Common issues: your logs are not JSON (the { $.field } syntax only works with JSON), the field name has a typo, or the log group name is wrong. Also remember metric filters only process log events received after the filter is created — they do not retroactively process existing logs.
4. "Logs Insights query returns empty results"
Query: fields @timestamp, @message | filter @message like /error/
Status: Complete
Results: []
Check the time range — Logs Insights defaults to the last 20 minutes in the console, and your programmatic query might have the wrong start/end times. Verify the log group name and that your Lambda function (or service) actually produced logs in that time range. Also note that filter @message like /error/ is case-sensitive; use filter @message like /(?i)error/ for case-insensitive matching.
5. "PutMetricData throttling"
ThrottlingException: Rate exceeded for PutMetricData
The PutMetricData API has a default limit of 500 TPS per account per region. If you are publishing metrics from many services, you will hit this. The fix is to batch metrics (up to 1000 data points per call) and buffer with periodic flushes rather than publishing every individual data point immediately. If you still hit the limit, use the Embedded Metric Format instead — EMF does not count against the PutMetricData quota because the metrics are extracted from logs.
Best Practices
- Use the Embedded Metric Format in Lambda. It is cheaper, faster, and more reliable than calling PutMetricData. You get both logs and metrics from a single console.log call with zero additional latency.
- Set retention policies on every log group. The default is to keep logs forever, which gets expensive fast. Set 30 days for development, 90 days for staging, and 1 year for production. Anything older should be archived to S3 via a subscription filter.
- Always set TreatMissingData to notBreaching unless you have a reason not to. Missing data usually means no traffic, not an outage. Setting it to breaching guarantees false positives at 3 AM.
- Use composite alarms to reduce noise. A single service having high latency for one data point is not an incident. High latency AND high error rate sustained for 10 minutes — that is an incident. Composite alarms let you encode this logic.
- Standardize your metric namespace and dimensions across services. Every service should publish metrics to the same namespace pattern (Company/ServiceName) with the same base dimensions (Environment, Region). This makes cross-service dashboards and queries possible.
- Buffer and batch metric publishing. Never call PutMetricData synchronously in a request path. Buffer metrics in memory and flush asynchronously on an interval or when the buffer reaches a size threshold. This reduces both latency and cost.
- Use Logs Insights instead of log stream browsing. Scrolling through log streams is a waste of time. Learn the Logs Insights query syntax — it takes 30 minutes and saves hundreds of hours.
- Tag your alarms with team ownership. When an alarm fires at 2 AM, the on-call engineer needs to know who owns the service. Use the alarm description and tags to encode ownership, runbook links, and escalation paths.
- Monitor your monitoring. Set up a canary metric that your application publishes every minute. Create an alarm for when that metric is missing. If your monitoring pipeline itself is broken, this is how you find out (see the sketch after this list).
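A minimal canary, assuming the MonitoringService from the complete example above:
var MonitoringService = require("./monitoring-service");

var monitor = new MonitoringService({
  namespace: "MyApp/OrderService",
  environment: "production",
  snsTopicArn: "arn:aws:sns:us-east-1:123456789012:ops-alerts"
});

// Publish a heartbeat every minute
setInterval(function() {
  monitor.recordMetric("Heartbeat", 1, "Count");
}, 60000);

// Alarm when the heartbeat disappears: missing data is treated as breaching
monitor.createAlarm({
  name: "HeartbeatMissing",
  metricName: "Heartbeat",
  statistic: "Sum",
  period: 300,
  evaluationPeriods: 2,
  threshold: 1,
  comparisonOperator: "LessThanThreshold",
  treatMissingData: "breaching"
});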