Serverless Monitoring and Debugging
Monitor and debug serverless applications with CloudWatch, X-Ray distributed tracing, custom metrics, and automated alerting
Serverless architectures trade infrastructure management for observability challenges. When you no longer own the runtime, traditional debugging tools like SSH, process managers, and APM agents fall apart. Monitoring and debugging serverless applications requires a fundamentally different approach built around structured logging, distributed tracing, and metric-driven alerting.
Prerequisites
- AWS account with Lambda, CloudWatch, and X-Ray access
- Node.js 18+ runtime
- AWS CLI configured locally
- Basic familiarity with AWS Lambda and API Gateway
- SAM CLI or Serverless Framework installed
Structured Logging for Lambda
The single most impactful thing you can do for serverless observability is adopt structured logging from day one. Unstructured console.log("something happened") calls are nearly useless when you have hundreds of Lambda invocations per minute flowing into CloudWatch.
Structured logs are JSON objects with consistent fields that CloudWatch Logs Insights can query, filter, and aggregate.
// lib/logger.js
var os = require("os");
var LOG_LEVELS = {
DEBUG: 0,
INFO: 1,
WARN: 2,
ERROR: 3
};
var currentLevel = LOG_LEVELS[process.env.LOG_LEVEL || "INFO"];
function createLogger(context) {
var baseFields = {
service: process.env.SERVICE_NAME || "unknown",
functionName: context.functionName,
functionVersion: context.functionVersion,
requestId: context.awsRequestId,
memoryLimit: context.memoryLimitInMB,
region: process.env.AWS_REGION
};
function log(level, message, data) {
if (LOG_LEVELS[level] < currentLevel) return;
var entry = Object.assign({}, baseFields, {
level: level,
message: message,
timestamp: new Date().toISOString(),
data: data || {}
});
// Lambda sends stdout to CloudWatch
console.log(JSON.stringify(entry));
}
return {
debug: function(msg, data) { log("DEBUG", msg, data); },
info: function(msg, data) { log("INFO", msg, data); },
warn: function(msg, data) { log("WARN", msg, data); },
error: function(msg, data) { log("ERROR", msg, data); }
};
}
module.exports = { createLogger: createLogger };
Use the logger in every handler:
// handlers/processOrder.js
var logger = require("../lib/logger");
exports.handler = function(event, context) {
var log = logger.createLogger(context);
log.info("Order processing started", {
orderId: event.orderId,
customerId: event.customerId,
itemCount: event.items.length
});
var startTime = Date.now();
return processOrder(event)
.then(function(result) {
log.info("Order processed successfully", {
orderId: event.orderId,
duration: Date.now() - startTime,
totalAmount: result.total
});
return result;
})
.catch(function(err) {
log.error("Order processing failed", {
orderId: event.orderId,
duration: Date.now() - startTime,
errorMessage: err.message,
errorStack: err.stack,
errorCode: err.code || "UNKNOWN"
});
throw err;
});
};
One rule I enforce on every team: never log sensitive data. No credit card numbers, no passwords, no PII. Mask or omit those fields before they hit CloudWatch.
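A small redaction helper makes that rule easy to follow consistently. This is a minimal sketch, not part of the logger above -- the module path and field list are assumptions you should adapt to your own domain -- and it deep-copies the payload so the original object is untouched:

// lib/redact.js -- sketch; extend SENSITIVE_FIELDS for your own data model
var SENSITIVE_FIELDS = ["password", "creditcard", "cardnumber", "ssn", "authorization", "token"];
function redact(value) {
if (!value || typeof value !== "object") return value;
var copy = Array.isArray(value) ? [] : {};
Object.keys(value).forEach(function(key) {
if (SENSITIVE_FIELDS.indexOf(key.toLowerCase()) !== -1) {
copy[key] = "***REDACTED***";
} else {
copy[key] = redact(value[key]);
}
});
return copy;
}
module.exports = { redact: redact };

Then pass redact(payload) anywhere you would otherwise have logged the raw object, for example log.info("Order received", redact(orderPayload)).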
CloudWatch Logs Insights Queries
Structured logging pays off when you start querying. CloudWatch Logs Insights uses a purpose-built query language that can scan gigabytes of logs in seconds.
Find the slowest Lambda invocations over the last hour:
fields @timestamp, @duration, @requestId, @billedDuration
| filter @type = "REPORT"
| sort @duration desc
| limit 25
Search structured logs for failed orders:
fields @timestamp, data.orderId, data.errorMessage, data.duration
| filter level = "ERROR" and message = "Order processing failed"
| sort @timestamp desc
| limit 50
Aggregate error rates by function version:
fields functionVersion, level
| filter level = "ERROR"
| stats count(*) as errorCount by functionVersion
| sort errorCount desc
Calculate P95 and P99 latencies:
fields @duration
| filter @type = "REPORT"
| stats avg(@duration) as avgDuration,
pct(@duration, 95) as p95,
pct(@duration, 99) as p99,
max(@duration) as maxDuration
by bin(1h)
Find cold starts:
fields @timestamp, @duration, @initDuration, @requestId
| filter ispresent(@initDuration)
| sort @initDuration desc
| limit 20
I run these queries constantly during incident response. Save your most useful queries as named queries in the CloudWatch console so the entire team can access them.
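You can also register them from the CLI. The command below saves the cold start query as a query definition; the query name and function name are illustrative:

# Save a reusable Logs Insights query definition
aws logs put-query-definition \
  --name "Lambda/ColdStarts" \
  --log-group-names "/aws/lambda/ProcessOrder" \
  --query-string "fields @timestamp, @initDuration | filter ispresent(@initDuration) | sort @initDuration desc | limit 20"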
Custom CloudWatch Metrics
Lambda's built-in metrics (Invocations, Errors, Duration, Throttles) cover the basics. Custom metrics let you track business-level signals that actually matter.
// lib/metrics.js
var AWS = require("aws-sdk");
var cloudwatch = new AWS.CloudWatch();
var metricBuffer = [];
function putMetric(namespace, metricName, value, unit, dimensions) {
metricBuffer.push({
MetricName: metricName,
Value: value,
Unit: unit || "Count",
Timestamp: new Date(),
Dimensions: dimensions || []
});
}
function flushMetrics(namespace) {
if (metricBuffer.length === 0) return Promise.resolve();
var params = {
Namespace: namespace,
MetricData: metricBuffer.splice(0, 20) // conservative batch size, well under the PutMetricData per-call limit
};
return cloudwatch.putMetricData(params).promise()
.then(function() {
if (metricBuffer.length > 0) {
return flushMetrics(namespace);
}
});
}
module.exports = {
putMetric: putMetric,
flushMetrics: flushMetrics
};
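A typical usage pattern, assuming the module above, is to buffer metrics throughout the invocation and flush once at the end so each invocation makes at most a handful of API calls:

// handlers/example.js -- sketch using lib/metrics.js above
var metrics = require("../lib/metrics");
exports.handler = function(event, context) {
metrics.putMetric("OrderService", "OrdersReceived", 1, "Count", [
{ Name: "FunctionName", Value: context.functionName }
]);
// ... business logic that may record more metrics ...
// Flush once, at the end, so buffered metrics go out in batches
return metrics.flushMetrics("OrderService");
};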
However, putMetricData API calls cost money and add latency. For high-throughput functions, use the Embedded Metric Format (EMF) instead. EMF lets you embed metrics directly in log output, and CloudWatch extracts them automatically with zero API calls:
// lib/emfMetrics.js
function emitMetric(namespace, metricName, value, unit, dimensions) {
var metricPayload = {
_aws: {
Timestamp: Date.now(),
CloudWatchMetrics: [
{
Namespace: namespace,
Dimensions: [Object.keys(dimensions || {})],
Metrics: [
{
Name: metricName,
Unit: unit || "Count"
}
]
}
]
}
};
// Add dimension values as top-level fields
var dims = dimensions || {};
Object.keys(dims).forEach(function(key) {
metricPayload[key] = dims[key];
});
// Add metric value as top-level field
metricPayload[metricName] = value;
console.log(JSON.stringify(metricPayload));
}
module.exports = { emitMetric: emitMetric };
Usage in a handler:
var emf = require("../lib/emfMetrics");
exports.handler = function(event, context) {
var startTime = Date.now();
return processPayment(event)
.then(function(result) {
emf.emitMetric("OrderService", "PaymentProcessed", 1, "Count", {
PaymentMethod: event.paymentMethod,
Currency: event.currency
});
emf.emitMetric("OrderService", "PaymentAmount", result.amount, "None", {
PaymentMethod: event.paymentMethod
});
emf.emitMetric("OrderService", "ProcessingLatency", Date.now() - startTime, "Milliseconds", {
PaymentMethod: event.paymentMethod
});
return result;
});
};
EMF is the right choice for most workloads. Reserve direct putMetricData calls for code that runs outside the Lambda logging pipeline, or for the rare cases where you cannot route metrics through log output at all.
AWS X-Ray Distributed Tracing
X-Ray is essential for serverless architectures because a single user request often touches API Gateway, Lambda, DynamoDB, SQS, SNS, and other services. Without distributed tracing, debugging latency issues across these boundaries is guesswork.
Enable X-Ray tracing in your SAM template:
# template.yaml
Globals:
Function:
Tracing: Active
Environment:
Variables:
AWS_XRAY_CONTEXT_MISSING: LOG_ERROR
Resources:
ProcessOrderFunction:
Type: AWS::Serverless::Function
Properties:
Handler: handlers/processOrder.handler
Runtime: nodejs18.x
Tracing: Active
Policies:
- AWSXRayDaemonWriteAccess
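If the function already exists outside of SAM, active tracing can also be switched on from the CLI; the function name, API ID, and stage below are illustrative:

# Enable active tracing on an existing Lambda function
aws lambda update-function-configuration \
  --function-name ProcessOrder \
  --tracing-config Mode=Active

# Enable X-Ray tracing on an API Gateway REST API stage
aws apigateway update-stage \
  --rest-api-id abc123 \
  --stage-name prod \
  --patch-operations op=replace,path=/tracingEnabled,value=true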
X-Ray SDK Instrumentation in Node.js
The X-Ray SDK wraps AWS SDK clients and HTTP calls to capture trace segments automatically. Install it:
npm install aws-xray-sdk-core aws-xray-sdk-express
Instrument the AWS SDK:
// lib/aws.js
var AWSXRay = require("aws-xray-sdk-core");
var AWS = AWSXRay.captureAWS(require("aws-sdk"));
// All AWS SDK calls now generate X-Ray subsegments
var dynamodb = new AWS.DynamoDB.DocumentClient();
var sqs = new AWS.SQS();
var s3 = new AWS.S3();
module.exports = {
dynamodb: dynamodb,
sqs: sqs,
s3: s3
};
Capture HTTP calls to external APIs:
var AWSXRay = require("aws-xray-sdk-core");
var http = AWSXRay.captureHTTPs(require("http"));
var https = AWSXRay.captureHTTPs(require("https"));
function callPaymentGateway(paymentData) {
return new Promise(function(resolve, reject) {
var options = {
hostname: "api.stripe.com",
path: "/v1/charges",
method: "POST",
headers: {
"Content-Type": "application/json",
"Authorization": "Bearer " + process.env.STRIPE_KEY
}
};
var req = https.request(options, function(res) {
var body = "";
res.on("data", function(chunk) { body += chunk; });
res.on("end", function() {
resolve(JSON.parse(body));
});
});
req.on("error", reject);
req.write(JSON.stringify(paymentData));
req.end();
});
}
Add custom subsegments around critical business logic:
var AWSXRay = require("aws-xray-sdk-core");
function validateOrder(order) {
var segment = AWSXRay.getSegment();
var subsegment = segment.addNewSubsegment("validateOrder");
try {
subsegment.addAnnotation("orderId", order.id);
subsegment.addAnnotation("itemCount", order.items.length);
subsegment.addMetadata("orderDetails", order);
// Validation logic
if (!order.items || order.items.length === 0) {
throw new Error("Order must contain at least one item");
}
var total = order.items.reduce(function(sum, item) {
return sum + (item.price * item.quantity);
}, 0);
subsegment.addAnnotation("orderTotal", total);
subsegment.close();
return { valid: true, total: total };
} catch (err) {
subsegment.addError(err);
subsegment.close();
throw err;
}
}
The distinction between annotations and metadata matters. Annotations are indexed and searchable in the X-Ray console. Metadata is stored but not indexed. Use annotations for fields you will filter on (orderId, customerId, status) and metadata for larger payloads you want to inspect during debugging.
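Because annotations are indexed, you can use them directly in the X-Ray console's filter expression bar. The expressions below assume the orderId and orderTotal annotations added in the snippet above; the values are made up:

annotation.orderId = "ORD-1042"
annotation.orderTotal > 500 AND responsetime > 5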
Lambda Insights for Enhanced Monitoring
Lambda Insights is a CloudWatch extension that collects system-level metrics from your functions: CPU usage, memory utilization, network throughput, and disk I/O. These are metrics Lambda does not expose natively.
Enable it by adding the Lambda Insights layer:
# template.yaml
Resources:
ProcessOrderFunction:
Type: AWS::Serverless::Function
Properties:
Handler: handlers/processOrder.handler
Runtime: nodejs18.x
Layers:
- !Sub "arn:aws:lambda:${AWS::Region}:580247275435:layer:LambdaInsightsExtension:38"
Policies:
- CloudWatchLambdaInsightsExecutionRolePolicy
Lambda Insights is particularly valuable for diagnosing memory leaks. If your function's memory usage climbs invocation over invocation within the same execution environment, you have a leak. The memory utilization metric from Lambda Insights shows this clearly, while the standard Lambda metrics only show the configured memory limit.
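Lambda Insights writes its performance events to the /aws/lambda-insights log group, so you can trend memory across invocations with a Logs Insights query. The field names below are the ones the extension emits in my experience -- verify them against a sample event in your own account:

fields @timestamp, function_name, memory_utilization, used_memory_max
| filter function_name = "ProcessOrder"
| sort @timestamp asc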
Error Tracking and Alerting
Errors in serverless functions should trigger alerts immediately, not sit in log files waiting for someone to notice. Set up CloudWatch Alarms on the key failure signals:
# template.yaml
Resources:
ErrorAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: ProcessOrder-ErrorRate
AlarmDescription: "More than 5 errors in each of two consecutive 5-minute periods"
Namespace: AWS/Lambda
MetricName: Errors
Dimensions:
- Name: FunctionName
Value: !Ref ProcessOrderFunction
Statistic: Sum
Period: 300
EvaluationPeriods: 2
Threshold: 5
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions:
- !Ref AlertSNSTopic
ThrottleAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: ProcessOrder-Throttles
AlarmDescription: "Function is being throttled"
Namespace: AWS/Lambda
MetricName: Throttles
Dimensions:
- Name: FunctionName
Value: !Ref ProcessOrderFunction
Statistic: Sum
Period: 60
EvaluationPeriods: 1
Threshold: 0
ComparisonOperator: GreaterThanThreshold
AlarmActions:
- !Ref AlertSNSTopic
DurationAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: ProcessOrder-HighLatency
AlarmDescription: "P95 latency exceeds 5 seconds"
Namespace: AWS/Lambda
MetricName: Duration
Dimensions:
- Name: FunctionName
Value: !Ref ProcessOrderFunction
ExtendedStatistic: p95
Period: 300
EvaluationPeriods: 3
Threshold: 5000
ComparisonOperator: GreaterThanThreshold
AlarmActions:
- !Ref AlertSNSTopic
AlertSNSTopic:
Type: AWS::SNS::Topic
Properties:
TopicName: serverless-alerts
Subscription:
- Protocol: email
Endpoint: [email protected]
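Before trusting this setup, force each alarm into the ALARM state and confirm that the SNS notification actually arrives. The CLI makes that a one-liner, and the alarm reverts on its next evaluation:

aws cloudwatch set-alarm-state \
  --alarm-name ProcessOrder-ErrorRate \
  --state-value ALARM \
  --state-reason "Testing the alerting path"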
For more sophisticated error tracking, capture unhandled rejections and exceptions at the handler level:
// lib/errorHandler.js
var logger = require("./logger");
var emf = require("./emfMetrics");
function wrapHandler(handler) {
return function(event, context) {
var log = logger.createLogger(context);
return Promise.resolve()
.then(function() {
return handler(event, context, log);
})
.catch(function(err) {
log.error("Unhandled exception in Lambda handler", {
errorName: err.name,
errorMessage: err.message,
errorStack: err.stack,
event: sanitizeEvent(event)
});
emf.emitMetric("OrderService", "UnhandledException", 1, "Count", {
FunctionName: context.functionName,
ErrorType: err.name
});
throw err;
});
};
}
function sanitizeEvent(event) {
var safe = JSON.parse(JSON.stringify(event));
// Strip sensitive fields
if (safe.body) {
try {
var body = JSON.parse(safe.body);
delete body.password;
delete body.creditCard;
delete body.ssn;
safe.body = JSON.stringify(body);
} catch (e) {
// body is not JSON, leave it
}
}
return safe;
}
module.exports = { wrapHandler: wrapHandler };
Performance Profiling
Lambda execution time directly impacts cost. Profiling helps you find the bottlenecks.
Build a simple timing utility:
// lib/timer.js
function Timer(log) {
this.log = log;
this.marks = {};
this.start = Date.now();
}
Timer.prototype.mark = function(label) {
this.marks[label] = Date.now();
};
Timer.prototype.measure = function(label, startMark) {
var startTime = startMark ? this.marks[startMark] : this.start;
var duration = Date.now() - startTime;
this.log.info("Performance measurement", {
label: label,
durationMs: duration,
fromMark: startMark || "start"
});
return duration;
};
Timer.prototype.summary = function() {
var totalDuration = Date.now() - this.start;
this.log.info("Performance summary", {
totalDurationMs: totalDuration,
marks: this.marks
});
};
module.exports = Timer;
Use it to profile handler execution:
var Timer = require("../lib/timer");
var errorHandler = require("../lib/errorHandler");
exports.handler = errorHandler.wrapHandler(function(event, context, log) {
var timer = new Timer(log);
timer.mark("dbQueryStart");
return fetchOrderFromDB(event.orderId)
.then(function(order) {
timer.measure("dbQuery", "dbQueryStart");
timer.mark("validationStart");
var validated = validateOrder(order);
timer.measure("validation", "validationStart");
timer.mark("paymentStart");
return processPayment(validated);
})
.then(function(result) {
timer.measure("payment", "paymentStart");
timer.mark("notificationStart");
return sendConfirmation(result);
})
.then(function(finalResult) {
timer.measure("notification", "notificationStart");
timer.summary();
return finalResult;
});
});
Debugging Cold Starts
Cold starts are the most common performance complaint in serverless. A cold start occurs when Lambda creates a new execution environment: downloading your code, starting the runtime, running your initialization code, and then executing the handler.
Measure cold starts explicitly:
// Record module load time (runs during cold start init)
var initStart = Date.now();
var AWS = require("aws-sdk");
var AWSXRay = require("aws-xray-sdk-core");
var logger = require("./lib/logger");
var dynamodb = new AWS.DynamoDB.DocumentClient();
var initDuration = Date.now() - initStart;
exports.handler = function(event, context) {
var log = logger.createLogger(context);
var isColdStart = !global._lambdaWarmed;
if (isColdStart) {
global._lambdaWarmed = true;
log.info("Cold start detected", {
initDurationMs: initDuration,
runtimeVersion: process.version
});
}
log.info("Handler invoked", {
coldStart: isColdStart
});
// ... handler logic
};
Cold start mitigation strategies that actually work:
Reduce bundle size. Fewer dependencies mean faster code download and parsing. Audit your node_modules ruthlessly and use aws-sdk v3 modular imports instead of the entire v2 SDK (a SAM esbuild sketch follows the provisioned concurrency example below).
Move initialization outside the handler. SDK client creation, database connection pooling, and configuration loading should all happen at module scope, not inside the handler function.
Provisioned Concurrency. For latency-sensitive functions, provisioned concurrency keeps warm execution environments ready. It costs more but eliminates cold starts entirely.
Avoid VPC unless necessary. VPC-attached Lambdas historically had much longer cold starts. AWS improved this with Hyperplane ENIs, but there is still overhead.
# Provisioned concurrency in SAM
Resources:
ProcessOrderFunction:
Type: AWS::Serverless::Function
Properties:
Handler: handlers/processOrder.handler
Runtime: nodejs18.x
AutoPublishAlias: live
ProvisionedConcurrencyConfig:
ProvisionedConcurrentExecutions: 5
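For the bundle-size strategy, SAM's esbuild build method is a low-effort way to tree-shake and minify. A sketch, assuming you run sam build with esbuild available; note that the handler points at the bundled output file:

# template.yaml -- esbuild bundling (sketch)
Resources:
  ProcessOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: processOrder.handler   # matches the bundled entry point at the artifact root
      Runtime: nodejs18.x
    Metadata:
      BuildMethod: esbuild
      BuildProperties:
        Minify: true
        Target: es2020
        EntryPoints:
          - handlers/processOrder.js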
Correlation IDs Across Services
In a microservices architecture, a single user action can trigger a chain of Lambda invocations through API Gateway, SQS, SNS, EventBridge, and Step Functions. Without correlation IDs, tracing a request across these boundaries is nearly impossible.
// lib/correlation.js
var crypto = require("crypto");
function extractCorrelationId(event) {
// Check API Gateway headers
if (event.headers) {
var headers = normalizeHeaders(event.headers);
if (headers["x-correlation-id"]) return headers["x-correlation-id"];
if (headers["x-request-id"]) return headers["x-request-id"];
if (headers["x-amzn-trace-id"]) return headers["x-amzn-trace-id"];
}
// Check SQS message attributes
if (event.Records && event.Records[0] && event.Records[0].messageAttributes) {
var attrs = event.Records[0].messageAttributes;
if (attrs.correlationId) return attrs.correlationId.stringValue;
}
// Check SNS message attributes
if (event.Records && event.Records[0] && event.Records[0].Sns) {
var snsAttrs = event.Records[0].Sns.MessageAttributes;
if (snsAttrs && snsAttrs.correlationId) {
return snsAttrs.correlationId.Value;
}
}
// Generate a new correlation ID
return crypto.randomUUID();
}
function normalizeHeaders(headers) {
var normalized = {};
Object.keys(headers).forEach(function(key) {
normalized[key.toLowerCase()] = headers[key];
});
return normalized;
}
function propagateToSQS(correlationId, messageParams) {
if (!messageParams.MessageAttributes) {
messageParams.MessageAttributes = {};
}
messageParams.MessageAttributes.correlationId = {
DataType: "String",
StringValue: correlationId
};
return messageParams;
}
function propagateToHTTP(correlationId, requestOptions) {
if (!requestOptions.headers) {
requestOptions.headers = {};
}
requestOptions.headers["x-correlation-id"] = correlationId;
return requestOptions;
}
module.exports = {
extractCorrelationId: extractCorrelationId,
propagateToSQS: propagateToSQS,
propagateToHTTP: propagateToHTTP
};
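To avoid threading the correlation ID through every log call by hand, you can wrap the logger once per invocation. This is a sketch that assumes the logger module from earlier; withCorrelation is not part of it:

// lib/correlationLogger.js -- sketch: returns a logger that stamps every entry
function withCorrelation(log, correlationId) {
function wrap(fn) {
return function(msg, data) {
fn(msg, Object.assign({}, data || {}, { correlationId: correlationId }));
};
}
return {
debug: wrap(log.debug),
info: wrap(log.info),
warn: wrap(log.warn),
error: wrap(log.error)
};
}
module.exports = { withCorrelation: withCorrelation };

In a handler: var log = withCorrelation(logger.createLogger(context), correlation.extractCorrelationId(event));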
Every log entry should include the correlation ID. This lets you run a single CloudWatch Logs Insights query across multiple log groups to reconstruct the full request path:
fields @timestamp, @logGroup, message, data.correlationId, data.orderId
| filter data.correlationId = "abc-123-def-456"
| sort @timestamp asc
Third-Party Monitoring Tools
AWS-native tooling gets you far, but third-party platforms add value in specific areas.
Datadog provides a unified view across Lambda functions, containers, and traditional infrastructure. Their Lambda layer auto-instruments functions and correlates traces with logs and metrics without code changes. The Datadog Forwarder Lambda ships CloudWatch logs to their platform for indexing:
# Datadog Lambda layer in SAM
Resources:
MyFunction:
Type: AWS::Serverless::Function
Properties:
Handler: /opt/nodejs/node_modules/datadog-lambda-js/handler.handler
Runtime: nodejs18.x
Layers:
- !Sub "arn:aws:lambda:${AWS::Region}:464622532012:layer:Datadog-Node18-x:100"
Environment:
Variables:
DD_LAMBDA_HANDLER: handlers/processOrder.handler
DD_API_KEY: !Ref DatadogApiKey
DD_TRACE_ENABLED: true
DD_MERGE_XRAY_TRACES: true
Lumigo is purpose-built for serverless and excels at transaction tracing. It auto-discovers your architecture, visualizes request flows, and surfaces payload data at each step. For debugging complex event-driven architectures, Lumigo often finds issues faster than X-Ray because it captures the actual payloads passing between services.
My recommendation: start with AWS-native tools. Add a third-party tool when you hit specific pain points -- typically when you have more than 20 Lambda functions and debugging cross-service issues becomes a regular occurrence.
Creating Operational Dashboards
A good operational dashboard answers three questions at a glance: Is the system healthy? What changed recently? Where should I look first?
// scripts/createDashboard.js
var AWS = require("aws-sdk");
var cloudwatch = new AWS.CloudWatch();
var dashboardBody = {
widgets: [
{
type: "metric",
x: 0, y: 0, width: 12, height: 6,
properties: {
title: "Invocations & Errors",
metrics: [
["AWS/Lambda", "Invocations", "FunctionName", "ProcessOrder", { stat: "Sum" }],
["AWS/Lambda", "Errors", "FunctionName", "ProcessOrder", { stat: "Sum", color: "#d62728" }],
["AWS/Lambda", "Throttles", "FunctionName", "ProcessOrder", { stat: "Sum", color: "#ff7f0e" }]
],
period: 300,
view: "timeSeries",
region: "us-east-1"
}
},
{
type: "metric",
x: 12, y: 0, width: 12, height: 6,
properties: {
title: "Latency (P50, P95, P99)",
metrics: [
["AWS/Lambda", "Duration", "FunctionName", "ProcessOrder", { stat: "p50", label: "P50" }],
["AWS/Lambda", "Duration", "FunctionName", "ProcessOrder", { stat: "p95", label: "P95" }],
["AWS/Lambda", "Duration", "FunctionName", "ProcessOrder", { stat: "p99", label: "P99" }]
],
period: 300,
view: "timeSeries"
}
},
{
type: "metric",
x: 0, y: 6, width: 12, height: 6,
properties: {
title: "Concurrent Executions",
metrics: [
["AWS/Lambda", "ConcurrentExecutions", "FunctionName", "ProcessOrder", { stat: "Maximum" }]
],
period: 60,
view: "timeSeries"
}
},
{
type: "log",
x: 12, y: 6, width: 12, height: 6,
properties: {
title: "Recent Errors",
query: "fields @timestamp, data.errorMessage, data.orderId\n| filter level = 'ERROR'\n| sort @timestamp desc\n| limit 10",
region: "us-east-1",
stacked: false,
view: "table"
}
},
{
type: "metric",
x: 0, y: 12, width: 24, height: 6,
properties: {
title: "Custom Business Metrics",
metrics: [
["OrderService", "PaymentProcessed", { stat: "Sum", label: "Payments" }],
["OrderService", "UnhandledException", { stat: "Sum", label: "Unhandled Errors", color: "#d62728" }]
],
period: 300,
view: "timeSeries"
}
}
]
};
var params = {
DashboardName: "OrderService-Operations",
DashboardBody: JSON.stringify(dashboardBody)
};
cloudwatch.putDashboard(params).promise()
.then(function() {
console.log("Dashboard created successfully");
})
.catch(function(err) {
console.error("Failed to create dashboard:", err);
});
Runbook Automation with SSM
When alerts fire at 3 AM, you want automated runbooks, not groggy engineers manually running commands. AWS Systems Manager (SSM) Automation documents define step-by-step procedures that execute automatically:
# ssm-runbook.yaml
description: "Investigate and remediate high Lambda error rate"
schemaVersion: "0.3"
assumeRole: "{{ AutomationAssumeRole }}"
parameters:
FunctionName:
type: String
description: "Lambda function name"
AutomationAssumeRole:
type: String
mainSteps:
- name: GetRecentErrors
action: aws:executeAwsApi
inputs:
Service: logs
Api: StartQuery
logGroupName: "/aws/lambda/{{ FunctionName }}"
startTime: "{{ global:DATE_TIME_MINUS_1H }}"
endTime: "{{ global:DATE_TIME }}"
queryString: "fields @timestamp, data.errorMessage | filter level = 'ERROR' | sort @timestamp desc | limit 20"
- name: CheckThrottling
action: aws:executeAwsApi
inputs:
Service: cloudwatch
Api: GetMetricStatistics
Namespace: AWS/Lambda
MetricName: Throttles
Dimensions:
- Name: FunctionName
Value: "{{ FunctionName }}"
StartTime: "{{ global:DATE_TIME_MINUS_1H }}"
EndTime: "{{ global:DATE_TIME }}"
Period: 300
Statistics:
- Sum
- name: IncreaseConcurrency
action: aws:executeAwsApi
inputs:
Service: lambda
Api: PutFunctionConcurrency
FunctionName: "{{ FunctionName }}"
ReservedConcurrentExecutions: 500
onFailure: Continue
- name: NotifyTeam
action: aws:executeAwsApi
inputs:
Service: sns
Api: Publish
TopicArn: "arn:aws:sns:us-east-1:123456789:oncall-notifications"
Subject: "Runbook executed for {{ FunctionName }}"
Message: "Automated investigation complete. Concurrency increased. Review CloudWatch dashboard for details."
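Register the runbook as an Automation document and it can be started manually during an incident or wired to an alarm through EventBridge. The document name, account ID, and role ARN below are illustrative:

# Register the runbook
aws ssm create-document \
  --name "InvestigateLambdaErrors" \
  --document-type "Automation" \
  --document-format YAML \
  --content file://ssm-runbook.yaml

# Run it against a function
aws ssm start-automation-execution \
  --document-name "InvestigateLambdaErrors" \
  --parameters '{"FunctionName":["ProcessOrder"],"AutomationAssumeRole":["arn:aws:iam::123456789012:role/AutomationRole"]}'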
Complete Working Example
Here is a fully instrumented serverless order processing function that combines structured logging, X-Ray tracing, custom metrics, correlation IDs, and error handling:
// handlers/processOrder.js
var initStart = Date.now();
var AWSXRay = require("aws-xray-sdk-core");
var AWS = AWSXRay.captureAWS(require("aws-sdk"));
var logger = require("../lib/logger");
var emf = require("../lib/emfMetrics");
var correlation = require("../lib/correlation");
var Timer = require("../lib/timer");
var dynamodb = new AWS.DynamoDB.DocumentClient();
var sqs = new AWS.SQS();
var initDuration = Date.now() - initStart;
exports.handler = function(event, context) {
var log = logger.createLogger(context);
var timer = new Timer(log);
var correlationId = correlation.extractCorrelationId(event);
var isColdStart = !global._orderHandlerWarmed;
if (isColdStart) {
global._orderHandlerWarmed = true;
log.info("Cold start", { initDurationMs: initDuration });
emf.emitMetric("OrderService", "ColdStart", 1, "Count", {
FunctionName: context.functionName
});
}
// Parse request body
var body;
try {
body = JSON.parse(event.body);
} catch (err) {
log.warn("Invalid request body", { correlationId: correlationId });
return Promise.resolve({
statusCode: 400,
body: JSON.stringify({ error: "Invalid JSON" })
});
}
log.info("Processing order", {
correlationId: correlationId,
orderId: body.orderId,
customerId: body.customerId,
itemCount: body.items ? body.items.length : 0
});
var segment = AWSXRay.getSegment();
var subsegment = segment.addNewSubsegment("processOrder");
subsegment.addAnnotation("correlationId", correlationId);
subsegment.addAnnotation("orderId", body.orderId);
timer.mark("dbWrite");
// Store order in DynamoDB
var orderItem = {
TableName: process.env.ORDERS_TABLE,
Item: {
orderId: body.orderId,
customerId: body.customerId,
items: body.items,
status: "PROCESSING",
correlationId: correlationId,
createdAt: new Date().toISOString()
}
};
return dynamodb.put(orderItem).promise()
.then(function() {
timer.measure("dbWrite");
timer.mark("sqsSend");
// Send to fulfillment queue
var sqsParams = {
QueueUrl: process.env.FULFILLMENT_QUEUE_URL,
MessageBody: JSON.stringify({
orderId: body.orderId,
customerId: body.customerId,
items: body.items
})
};
sqsParams = correlation.propagateToSQS(correlationId, sqsParams);
return sqs.sendMessage(sqsParams).promise();
})
.then(function(sqsResult) {
timer.measure("sqsSend");
log.info("Order submitted to fulfillment", {
correlationId: correlationId,
orderId: body.orderId,
messageId: sqsResult.MessageId
});
emf.emitMetric("OrderService", "OrderCreated", 1, "Count", {
FunctionName: context.functionName
});
emf.emitMetric("OrderService", "OrderLatency",
Date.now() - timer.start, "Milliseconds", {
FunctionName: context.functionName
});
subsegment.close();
timer.summary();
return {
statusCode: 201,
headers: {
"Content-Type": "application/json",
"x-correlation-id": correlationId
},
body: JSON.stringify({
orderId: body.orderId,
status: "PROCESSING",
correlationId: correlationId
})
};
})
.catch(function(err) {
log.error("Failed to process order", {
correlationId: correlationId,
orderId: body.orderId,
errorMessage: err.message,
errorStack: err.stack,
errorCode: err.code
});
emf.emitMetric("OrderService", "OrderFailed", 1, "Count", {
FunctionName: context.functionName,
ErrorType: err.code || "Unknown"
});
subsegment.addError(err);
subsegment.close();
return {
statusCode: 500,
headers: {
"Content-Type": "application/json",
"x-correlation-id": correlationId
},
body: JSON.stringify({
error: "Failed to process order",
correlationId: correlationId
})
};
});
};
And the SAM template to deploy it:
AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: Instrumented Order Service
Globals:
Function:
Timeout: 30
Runtime: nodejs18.x
Tracing: Active
Environment:
Variables:
SERVICE_NAME: OrderService
LOG_LEVEL: INFO
Resources:
ProcessOrderFunction:
Type: AWS::Serverless::Function
Properties:
Handler: handlers/processOrder.handler
MemorySize: 256
Layers:
- !Sub "arn:aws:lambda:${AWS::Region}:580247275435:layer:LambdaInsightsExtension:38"
Environment:
Variables:
ORDERS_TABLE: !Ref OrdersTable
FULFILLMENT_QUEUE_URL: !Ref FulfillmentQueue
Policies:
- DynamoDBCrudPolicy:
TableName: !Ref OrdersTable
- SQSSendMessagePolicy:
QueueName: !GetAtt FulfillmentQueue.QueueName
- CloudWatchLambdaInsightsExecutionRolePolicy
Events:
ApiEvent:
Type: Api
Properties:
Path: /orders
Method: POST
OrdersTable:
Type: AWS::DynamoDB::Table
Properties:
TableName: Orders
BillingMode: PAY_PER_REQUEST
AttributeDefinitions:
- AttributeName: orderId
AttributeType: S
KeySchema:
- AttributeName: orderId
KeyType: HASH
FulfillmentQueue:
Type: AWS::SQS::Queue
Properties:
QueueName: fulfillment-queue
VisibilityTimeout: 60
RedrivePolicy:
deadLetterTargetArn: !GetAtt FulfillmentDLQ.Arn
maxReceiveCount: 3
FulfillmentDLQ:
Type: AWS::SQS::Queue
Properties:
QueueName: fulfillment-dlq
ErrorAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: OrderService-Errors
Namespace: AWS/Lambda
MetricName: Errors
Dimensions:
- Name: FunctionName
Value: !Ref ProcessOrderFunction
Statistic: Sum
Period: 300
EvaluationPeriods: 2
Threshold: 5
ComparisonOperator: GreaterThanThreshold
AlarmActions:
- !Ref AlertTopic
AlertTopic:
Type: AWS::SNS::Topic
Properties:
TopicName: order-service-alerts
Common Issues and Troubleshooting
1. X-Ray segment not found
Error: Failed to get the current sub/segment from the context.
This happens when you call X-Ray SDK methods outside a traced context. Set the environment variable AWS_XRAY_CONTEXT_MISSING=LOG_ERROR to downgrade the exception to a logged error. This is essential for local development and unit testing where the X-Ray daemon is not running.
2. CloudWatch Logs Insights query returns no results
No log events found for the selected time range
Check three things: the log group name matches exactly (it is case-sensitive and follows the pattern /aws/lambda/FunctionName), the time range covers the period when your function ran, and your structured log entries are valid JSON. A malformed line is not skipped outright, but its fields are not auto-discovered, so it silently drops out of any field-based filter. Validate your JSON with JSON.parse before logging in development.
3. EMF metrics not appearing in CloudWatch
Embedded metric format validation failed: No metrics found in the document
The EMF payload must include the _aws key with a CloudWatchMetrics array, and the metric name referenced in the Metrics array must exist as a top-level key in the JSON object. Also verify that the metric Unit value is a valid CloudWatch unit string (Count, Milliseconds, Bytes, etc.), not a custom string.
4. Lambda timeout with no useful logs
Task timed out after 30.03 seconds
When Lambda times out, it kills the process immediately. Any buffered log entries that have not been flushed to stdout are lost. This is why you should log at the beginning of each major operation, not just at the end. If your function is timing out during downstream calls, add logging before each SDK call so you know which one is hanging. Also check for DynamoDB hot partition throttling or SQS visibility timeout conflicts that can cause silent delays.
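One defensive pattern is to arm a timer from context.getRemainingTimeInMillis() that logs shortly before the deadline, so even a timed-out invocation leaves a final breadcrumb. A sketch -- the module name and 500 ms buffer are assumptions:

// lib/timeoutGuard.js -- sketch: warn ~500ms before the invocation deadline
function armTimeoutWarning(context, log, bufferMs) {
var fireIn = Math.max(context.getRemainingTimeInMillis() - (bufferMs || 500), 0);
var timer = setTimeout(function() {
log.warn("Invocation is about to time out", {
remainingMs: context.getRemainingTimeInMillis()
});
}, fireIn);
if (timer.unref) timer.unref(); // don't keep the event loop alive just for the warning
return timer; // call clearTimeout(timer) when the handler finishes normally
}
module.exports = { armTimeoutWarning: armTimeoutWarning };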
5. X-Ray traces missing subsegments
Subsegment "validateOrder" missing from trace
X-Ray drops subsegments that are not properly closed. Always use try/finally or catch blocks to call subsegment.close() even when exceptions occur. Unclosed subsegments are silently discarded and will not appear in the X-Ray console. The X-Ray SDK also samples traces -- by default it captures the first request each second and then 5% of additional requests. In development, set the sampling rate to 100% in your xray-sampling-rules.json.
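A local rules file that samples everything looks like the JSON below. In the SDK versions I have used, it can be loaded at module scope with AWSXRay.middleware.setSamplingRules -- verify that against your aws-xray-sdk-core version, and keep in mind that with active tracing the upstream service also participates in the sampling decision:

// xray-sampling-rules.json -- sample 100% of requests in development
{
  "version": 2,
  "default": {
    "fixed_target": 1,
    "rate": 1.0
  },
  "rules": []
}

var AWSXRay = require("aws-xray-sdk-core");
AWSXRay.middleware.setSamplingRules("./xray-sampling-rules.json");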
Best Practices
Log at function boundaries, not just at the end. Log when the handler starts, before each external call, and after each external call. When a timeout kills your function, the last log entry tells you exactly where it was stuck.
Use Embedded Metric Format over putMetricData for high-throughput functions. EMF publishes metrics through log output with zero additional API calls. Direct putMetricData calls add latency and cost money at scale.
Always propagate correlation IDs across service boundaries. Whether it flows through HTTP headers, SQS message attributes, or SNS message attributes, the correlation ID is the only thing that lets you reconstruct a request path across dozens of Lambda functions.
Set up dead letter queues on every async Lambda trigger. When a function fails repeatedly, the event must go somewhere you can inspect it. DLQs with CloudWatch alarms on their message count give you a safety net.
Monitor the right percentile, not the average. P99 latency is what your worst-experience users see. Average latency hides cold starts, retry storms, and downstream timeouts behind a comfortable number.
Keep Lambda packages small. Every megabyte of deployment package increases cold start time. Use tree-shaking, exclude dev dependencies, and consider Lambda Layers for shared dependencies that change infrequently.
Test your monitoring before you need it. Deliberately trigger errors, timeouts, and throttles in staging. Verify that alarms fire, dashboards update, and X-Ray traces appear. Discovering that your alerting is broken during a production incident is the worst possible time.
Use log levels and make them configurable via environment variables. DEBUG-level logging in production generates enormous CloudWatch costs. Set INFO as the default and flip to DEBUG only when actively investigating an issue.
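Flipping the level does not require a redeploy; a configuration update is enough. Note that this call replaces the entire environment map, so include every existing variable (values here are illustrative):

aws lambda update-function-configuration \
  --function-name ProcessOrder \
  --environment "Variables={SERVICE_NAME=OrderService,LOG_LEVEL=DEBUG}"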
References
- AWS Lambda Monitoring and Observability -- Official AWS documentation on Lambda metrics and logging.
- AWS X-Ray SDK for Node.js -- X-Ray instrumentation guide for Node.js applications.
- CloudWatch Embedded Metric Format -- Specification for publishing metrics through structured log output.
- CloudWatch Logs Insights Query Syntax -- Full query language reference for log analysis.
- Lambda Insights -- Enhanced monitoring extension for system-level Lambda metrics.
- AWS Well-Architected Serverless Lens -- Operational excellence patterns for serverless architectures.