
Serverless Monitoring and Debugging

Monitor and debug serverless applications with CloudWatch, X-Ray distributed tracing, custom metrics, and automated alerting


Serverless architectures trade infrastructure management for observability challenges. When you no longer own the runtime, traditional debugging tools like SSH, process managers, and APM agents fall apart. Monitoring and debugging serverless applications requires a fundamentally different approach built around structured logging, distributed tracing, and metric-driven alerting.

Prerequisites

  • AWS account with Lambda, CloudWatch, and X-Ray access
  • Node.js 18+ runtime
  • AWS CLI configured locally
  • Basic familiarity with AWS Lambda and API Gateway
  • SAM CLI or Serverless Framework installed

Structured Logging for Lambda

The single most impactful thing you can do for serverless observability is adopt structured logging from day one. Unstructured console.log("something happened") calls are nearly useless when you have hundreds of Lambda invocations per minute flowing into CloudWatch.

Structured logs are JSON objects with consistent fields that CloudWatch Logs Insights can query, filter, and aggregate.

// lib/logger.js

var LOG_LEVELS = {
  DEBUG: 0,
  INFO: 1,
  WARN: 2,
  ERROR: 3
};

var currentLevel = LOG_LEVELS[process.env.LOG_LEVEL] !== undefined
  ? LOG_LEVELS[process.env.LOG_LEVEL]
  : LOG_LEVELS.INFO; // fall back to INFO for missing or unknown levels

function createLogger(context) {
  var baseFields = {
    service: process.env.SERVICE_NAME || "unknown",
    functionName: context.functionName,
    functionVersion: context.functionVersion,
    requestId: context.awsRequestId,
    memoryLimit: context.memoryLimitInMB,
    region: process.env.AWS_REGION
  };

  function log(level, message, data) {
    if (LOG_LEVELS[level] < currentLevel) return;

    var entry = Object.assign({}, baseFields, {
      level: level,
      message: message,
      timestamp: new Date().toISOString(),
      data: data || {}
    });

    // Lambda sends stdout to CloudWatch
    console.log(JSON.stringify(entry));
  }

  return {
    debug: function(msg, data) { log("DEBUG", msg, data); },
    info: function(msg, data) { log("INFO", msg, data); },
    warn: function(msg, data) { log("WARN", msg, data); },
    error: function(msg, data) { log("ERROR", msg, data); }
  };
}

module.exports = { createLogger: createLogger };

Use the logger in every handler:

// handlers/processOrder.js
var logger = require("../lib/logger");

exports.handler = function(event, context) {
  var log = logger.createLogger(context);

  log.info("Order processing started", {
    orderId: event.orderId,
    customerId: event.customerId,
    itemCount: event.items.length
  });

  var startTime = Date.now();

  return processOrder(event)
    .then(function(result) {
      log.info("Order processed successfully", {
        orderId: event.orderId,
        duration: Date.now() - startTime,
        totalAmount: result.total
      });
      return result;
    })
    .catch(function(err) {
      log.error("Order processing failed", {
        orderId: event.orderId,
        duration: Date.now() - startTime,
        errorMessage: err.message,
        errorStack: err.stack,
        errorCode: err.code || "UNKNOWN"
      });
      throw err;
    });
};

One rule I enforce on every team: never log sensitive data. No credit card numbers, no passwords, no PII. Mask or omit those fields before they hit CloudWatch.
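
One way to make that rule enforceable is a small redaction helper that scrubs known-sensitive keys before anything reaches the logger. A minimal sketch; the field list is illustrative and should be extended for your own payloads:

// lib/redact.js -- illustrative helper; extend SENSITIVE_FIELDS for your payloads
var SENSITIVE_FIELDS = ["password", "creditCard", "cardNumber", "ssn", "authorization"];

function redact(data) {
  if (!data || typeof data !== "object") return data;

  var copy = Array.isArray(data) ? [] : {};
  Object.keys(data).forEach(function(key) {
    if (SENSITIVE_FIELDS.indexOf(key) !== -1) {
      copy[key] = "[REDACTED]";
    } else if (typeof data[key] === "object") {
      copy[key] = redact(data[key]); // recurse into nested objects and arrays
    } else {
      copy[key] = data[key];
    }
  });
  return copy;
}

module.exports = { redact: redact };

Call log.info("Order received", redact(order)) rather than trusting every call site to remember the rule.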

CloudWatch Logs Insights Queries

Structured logging pays off when you start querying. CloudWatch Logs Insights uses a purpose-built query language that can scan gigabytes of logs in seconds.

Find the slowest Lambda invocations over the last hour:

fields @timestamp, @duration, @requestId, @billedDuration
| filter @type = "REPORT"
| sort @duration desc
| limit 25

Search structured logs for failed orders:

fields @timestamp, data.orderId, data.errorMessage, data.duration
| filter level = "ERROR" and message = "Order processing failed"
| sort @timestamp desc
| limit 50

Aggregate error rates by function version:

fields functionVersion, level
| filter level = "ERROR"
| stats count(*) as errorCount by functionVersion
| sort errorCount desc

Calculate P95 and P99 latencies:

fields @duration
| filter @type = "REPORT"
| stats avg(@duration) as avgDuration,
        pct(@duration, 95) as p95,
        pct(@duration, 99) as p99,
        max(@duration) as maxDuration
  by bin(1h)

Find cold starts:

fields @timestamp, @duration, @initDuration, @requestId
| filter ispresent(@initDuration)
| sort @initDuration desc
| limit 20

I run these queries constantly during incident response. Save your most useful queries as named queries in the CloudWatch console so the entire team can access them.
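
Named queries can also be created from the CLI, which makes it easy to keep them in version control. A sketch, with an example query name and log group:

# Save the cold-start query as a shared named query (name and log group are examples)
aws logs put-query-definition \
  --name "OrderService/ColdStarts" \
  --log-group-names "/aws/lambda/ProcessOrder" \
  --query-string 'fields @timestamp, @initDuration | filter ispresent(@initDuration) | sort @initDuration desc | limit 20'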

Custom CloudWatch Metrics

Lambda's built-in metrics (Invocations, Errors, Duration, Throttles) cover the basics. Custom metrics let you track business-level signals that actually matter.

// lib/metrics.js
var AWS = require("aws-sdk");
var cloudwatch = new AWS.CloudWatch();

var metricBuffer = [];

function putMetric(metricName, value, unit, dimensions) {
  metricBuffer.push({
    MetricName: metricName,
    Value: value,
    Unit: unit || "Count",
    Timestamp: new Date(),
    Dimensions: dimensions || []
  });
}

function flushMetrics(namespace) {
  if (metricBuffer.length === 0) return Promise.resolve();

  var params = {
    Namespace: namespace,
    MetricData: metricBuffer.splice(0, 20) // CloudWatch limit: 20 per call
  };

  return cloudwatch.putMetricData(params).promise()
    .then(function() {
      if (metricBuffer.length > 0) {
        return flushMetrics(namespace);
      }
    });
}

module.exports = {
  putMetric: putMetric,
  flushMetrics: flushMetrics
};
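
A handler would buffer metrics during execution and flush them once before returning, so the number of putMetricData calls stays bounded. A minimal usage sketch of the module above (processOrder and the metric names are illustrative):

// Example usage; processOrder and the metric names are placeholders
var metrics = require("../lib/metrics");

exports.handler = function(event, context) {
  metrics.putMetric("OrderReceived", 1, "Count", [
    { Name: "FunctionName", Value: context.functionName }
  ]);

  return processOrder(event)
    .then(function(result) {
      metrics.putMetric("OrderValue", result.total, "None");
      // Flush once per invocation to keep API calls and latency bounded
      return metrics.flushMetrics("OrderService").then(function() { return result; });
    });
};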

However, putMetricData API calls cost money and add latency. For high-throughput functions, use the Embedded Metric Format (EMF) instead. EMF lets you embed metrics directly in log output, and CloudWatch extracts them automatically with zero API calls:

// lib/emfMetrics.js
function emitMetric(namespace, metricName, value, unit, dimensions) {
  var metricPayload = {
    _aws: {
      Timestamp: Date.now(),
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          Dimensions: [Object.keys(dimensions || {})],
          Metrics: [
            {
              Name: metricName,
              Unit: unit || "Count"
            }
          ]
        }
      ]
    }
  };

  // Add dimension values as top-level fields
  var dims = dimensions || {};
  Object.keys(dims).forEach(function(key) {
    metricPayload[key] = dims[key];
  });

  // Add metric value as top-level field
  metricPayload[metricName] = value;

  console.log(JSON.stringify(metricPayload));
}

module.exports = { emitMetric: emitMetric };

Usage in a handler:

var emf = require("../lib/emfMetrics");

exports.handler = function(event, context) {
  var startTime = Date.now();

  return processPayment(event)
    .then(function(result) {
      emf.emitMetric("OrderService", "PaymentProcessed", 1, "Count", {
        PaymentMethod: event.paymentMethod,
        Currency: event.currency
      });

      emf.emitMetric("OrderService", "PaymentAmount", result.amount, "None", {
        PaymentMethod: event.paymentMethod
      });

      emf.emitMetric("OrderService", "ProcessingLatency", Date.now() - startTime, "Milliseconds", {
        PaymentMethod: event.paymentMethod
      });

      return result;
    });
};

EMF is the right choice for most Lambda workloads. Reserve direct putMetricData calls for code that runs outside a CloudWatch Logs pipeline, where there is no log stream for EMF extraction.

AWS X-Ray Distributed Tracing

X-Ray is essential for serverless architectures because a single user request often touches API Gateway, Lambda, DynamoDB, SQS, SNS, and other services. Without distributed tracing, debugging latency issues across these boundaries is guesswork.

Enable X-Ray tracing in your SAM template:

# template.yaml
Globals:
  Function:
    Tracing: Active
    Environment:
      Variables:
        AWS_XRAY_CONTEXT_MISSING: LOG_ERROR

Resources:
  ProcessOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: handlers/processOrder.handler
      Runtime: nodejs18.x
      Tracing: Active
      Policies:
        - AWSXRayDaemonWriteAccess

X-Ray SDK Instrumentation in Node.js

The X-Ray SDK wraps AWS SDK clients and HTTP calls to capture trace segments automatically. Install it:

npm install aws-xray-sdk-core

Instrument the AWS SDK:

// lib/aws.js
var AWSXRay = require("aws-xray-sdk-core");
var AWS = AWSXRay.captureAWS(require("aws-sdk"));

// All AWS SDK calls now generate X-Ray subsegments
var dynamodb = new AWS.DynamoDB.DocumentClient();
var sqs = new AWS.SQS();
var s3 = new AWS.S3();

module.exports = {
  dynamodb: dynamodb,
  sqs: sqs,
  s3: s3
};

Capture HTTP calls to external APIs:

var AWSXRay = require("aws-xray-sdk-core");
var http = AWSXRay.captureHTTPs(require("http"));
var https = AWSXRay.captureHTTPs(require("https"));

function callPaymentGateway(paymentData) {
  return new Promise(function(resolve, reject) {
    var options = {
      hostname: "api.stripe.com",
      path: "/v1/charges",
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + process.env.STRIPE_KEY
      }
    };

    var req = https.request(options, function(res) {
      var body = "";
      res.on("data", function(chunk) { body += chunk; });
      res.on("end", function() {
        resolve(JSON.parse(body));
      });
    });

    req.on("error", reject);
    req.write(JSON.stringify(paymentData));
    req.end();
  });
}

Add custom subsegments around critical business logic:

var AWSXRay = require("aws-xray-sdk-core");

function validateOrder(order) {
  var segment = AWSXRay.getSegment();
  var subsegment = segment.addNewSubsegment("validateOrder");

  try {
    subsegment.addAnnotation("orderId", order.id);
    subsegment.addAnnotation("itemCount", order.items.length);
    subsegment.addMetadata("orderDetails", order);

    // Validation logic
    if (!order.items || order.items.length === 0) {
      throw new Error("Order must contain at least one item");
    }

    var total = order.items.reduce(function(sum, item) {
      return sum + (item.price * item.quantity);
    }, 0);

    subsegment.addAnnotation("orderTotal", total);
    subsegment.close();
    return { valid: true, total: total };
  } catch (err) {
    subsegment.addError(err);
    subsegment.close();
    throw err;
  }
}

The distinction between annotations and metadata matters. Annotations are indexed and searchable in the X-Ray console. Metadata is stored but not indexed. Use annotations for fields you will filter on (orderId, customerId, status) and metadata for larger payloads you want to inspect during debugging.
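
Annotations pay off in the X-Ray console and the GetTraceSummaries API, where filter expressions select traces by indexed fields. For example (the order ID is illustrative):

annotation.orderId = "ORD-1234" AND error

Here error matches traces that recorded a 4xx client error; use fault for 5xx server errors and throttle for 429s.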

Lambda Insights for Enhanced Monitoring

Lambda Insights is a CloudWatch extension that collects system-level metrics from your functions: CPU usage, memory utilization, network throughput, and disk I/O. These are metrics Lambda does not expose natively.

Enable it by adding the Lambda Insights layer:

# template.yaml
Resources:
  ProcessOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: handlers/processOrder.handler
      Runtime: nodejs18.x
      Layers:
        - !Sub "arn:aws:lambda:${AWS::Region}:580247275435:layer:LambdaInsightsExtension:38"
      Policies:
        - CloudWatchLambdaInsightsExecutionRolePolicy

Lambda Insights is particularly valuable for diagnosing memory leaks. If your function's memory usage climbs invocation over invocation within the same execution environment, you have a leak. The memory utilization metric from Lambda Insights shows this clearly, while the standard Lambda metrics only show the configured memory limit.
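
A quick way to confirm a leak is to chart memory use per invocation from the Lambda Insights performance events. A sketch of a Logs Insights query against the /aws/lambda-insights log group (field names follow the Insights event schema; verify them against your own events):

fields @timestamp, function_name, memory_utilization, used_memory_max
| filter function_name = "ProcessOrder"
| sort @timestamp asc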

Error Tracking and Alerting

Errors in serverless functions should trigger alerts immediately, not sit in log files waiting for someone to notice. Set up CloudWatch Alarms on the key failure signals:

# template.yaml
Resources:
  ErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ProcessOrder-ErrorRate
      AlarmDescription: "Order processing error rate exceeds 5%"
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref ProcessOrderFunction
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        - !Ref AlertSNSTopic

  ThrottleAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ProcessOrder-Throttles
      AlarmDescription: "Function is being throttled"
      Namespace: AWS/Lambda
      MetricName: Throttles
      Dimensions:
        - Name: FunctionName
          Value: !Ref ProcessOrderFunction
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlertSNSTopic

  DurationAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ProcessOrder-HighLatency
      AlarmDescription: "P95 latency exceeds 5 seconds"
      Namespace: AWS/Lambda
      MetricName: Duration
      Dimensions:
        - Name: FunctionName
          Value: !Ref ProcessOrderFunction
      ExtendedStatistic: p95
      Period: 300
      EvaluationPeriods: 3
      Threshold: 5000
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlertSNSTopic

  AlertSNSTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: serverless-alerts
      Subscription:
        - Protocol: email
          Endpoint: alerts@example.com # replace with your on-call address

For more sophisticated error tracking, capture unhandled rejections and exceptions at the handler level:

// lib/errorHandler.js
var logger = require("./logger");
var emf = require("./emfMetrics");

function wrapHandler(handler) {
  return function(event, context) {
    var log = logger.createLogger(context);

    return Promise.resolve()
      .then(function() {
        return handler(event, context, log);
      })
      .catch(function(err) {
        log.error("Unhandled exception in Lambda handler", {
          errorName: err.name,
          errorMessage: err.message,
          errorStack: err.stack,
          event: sanitizeEvent(event)
        });

        emf.emitMetric("OrderService", "UnhandledException", 1, "Count", {
          FunctionName: context.functionName,
          ErrorType: err.name
        });

        throw err;
      });
  };
}

function sanitizeEvent(event) {
  var safe = JSON.parse(JSON.stringify(event));
  // Strip sensitive fields
  if (safe.body) {
    try {
      var body = JSON.parse(safe.body);
      delete body.password;
      delete body.creditCard;
      delete body.ssn;
      safe.body = JSON.stringify(body);
    } catch (e) {
      // body is not JSON, leave it
    }
  }
  return safe;
}

module.exports = { wrapHandler: wrapHandler };

Performance Profiling

Lambda execution time directly impacts cost. Profiling helps you find the bottlenecks.

Build a simple timing utility:

// lib/timer.js
function Timer(log) {
  this.log = log;
  this.marks = {};
  this.start = Date.now();
}

Timer.prototype.mark = function(label) {
  this.marks[label] = Date.now();
};

Timer.prototype.measure = function(label, startMark) {
  var startTime = startMark ? this.marks[startMark] : this.start;
  var duration = Date.now() - startTime;

  this.log.info("Performance measurement", {
    label: label,
    durationMs: duration,
    fromMark: startMark || "start"
  });

  return duration;
};

Timer.prototype.summary = function() {
  var totalDuration = Date.now() - this.start;
  this.log.info("Performance summary", {
    totalDurationMs: totalDuration,
    marks: this.marks
  });
};

module.exports = Timer;

Use it to profile handler execution:

var Timer = require("../lib/timer");
var errorHandler = require("../lib/errorHandler");

exports.handler = errorHandler.wrapHandler(function(event, context, log) {
  var timer = new Timer(log);

  timer.mark("dbQueryStart");
  return fetchOrderFromDB(event.orderId)
    .then(function(order) {
      timer.measure("dbQuery", "dbQueryStart");

      timer.mark("validationStart");
      var validated = validateOrder(order);
      timer.measure("validation", "validationStart");

      timer.mark("paymentStart");
      return processPayment(validated);
    })
    .then(function(result) {
      timer.measure("payment", "paymentStart");

      timer.mark("notificationStart");
      return sendConfirmation(result);
    })
    .then(function(finalResult) {
      timer.measure("notification", "notificationStart");
      timer.summary();
      return finalResult;
    });
});

Debugging Cold Starts

Cold starts are the most common performance complaint in serverless. A cold start occurs when Lambda creates a new execution environment: downloading your code, starting the runtime, running your initialization code, and then executing the handler.

Measure cold starts explicitly:

// Record module load time (runs during cold start init)
var initStart = Date.now();

var AWS = require("aws-sdk");
var AWSXRay = require("aws-xray-sdk-core");
var logger = require("./lib/logger");

var dynamodb = new AWS.DynamoDB.DocumentClient();

var initDuration = Date.now() - initStart;

exports.handler = function(event, context) {
  var log = logger.createLogger(context);
  var isColdStart = !global._lambdaWarmed;

  if (isColdStart) {
    global._lambdaWarmed = true;
    log.info("Cold start detected", {
      initDurationMs: initDuration,
      runtimeVersion: process.version
    });
  }

  log.info("Handler invoked", {
    coldStart: isColdStart
  });

  // ... handler logic
};

Cold start mitigation strategies that actually work:

  1. Reduce bundle size. Fewer dependencies mean faster code download and parsing. Audit your node_modules ruthlessly. Use aws-sdk v3 modular imports instead of the entire v2 SDK, as sketched after this list.

  2. Move initialization outside the handler. SDK client creation, database connection pooling, and configuration loading should all happen at module scope, not inside the handler function.

  3. Provisioned Concurrency. For latency-sensitive functions, provisioned concurrency keeps warm execution environments ready. It costs more but eliminates cold starts entirely.

  4. Avoid VPC unless necessary. VPC-attached Lambdas historically had much longer cold starts. AWS improved this with Hyperplane ENIs, but there is still overhead.
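
The bundle-size point in practice: with the v3 SDK you import only the client you use, which trims both package size and init time. A minimal sketch (client and table names are illustrative):

// v2: pulls in the entire SDK
// var AWS = require("aws-sdk");
// var dynamodb = new AWS.DynamoDB.DocumentClient();

// v3: modular client, created once at module scope
var { DynamoDBClient } = require("@aws-sdk/client-dynamodb");
var { DynamoDBDocumentClient, PutCommand } = require("@aws-sdk/lib-dynamodb");

var client = DynamoDBDocumentClient.from(new DynamoDBClient({}));

function saveOrder(order) {
  return client.send(new PutCommand({
    TableName: process.env.ORDERS_TABLE,
    Item: order
  }));
}

And the provisioned concurrency option from item 3, configured in SAM: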

# Provisioned concurrency in SAM
Resources:
  ProcessOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: handlers/processOrder.handler
      Runtime: nodejs18.x
      AutoPublishAlias: live
      ProvisionedConcurrencyConfig:
        ProvisionedConcurrentExecutions: 5

Correlation IDs Across Services

In a microservices architecture, a single user action can trigger a chain of Lambda invocations through API Gateway, SQS, SNS, EventBridge, and Step Functions. Without correlation IDs, tracing a request across these boundaries is nearly impossible.

// lib/correlation.js
var crypto = require("crypto");

function extractCorrelationId(event) {
  // Check API Gateway headers
  if (event.headers) {
    var headers = normalizeHeaders(event.headers);
    if (headers["x-correlation-id"]) return headers["x-correlation-id"];
    if (headers["x-request-id"]) return headers["x-request-id"];
    if (headers["x-amzn-trace-id"]) return headers["x-amzn-trace-id"];
  }

  // Check SQS message attributes
  if (event.Records && event.Records[0] && event.Records[0].messageAttributes) {
    var attrs = event.Records[0].messageAttributes;
    if (attrs.correlationId) return attrs.correlationId.stringValue;
  }

  // Check SNS message attributes
  if (event.Records && event.Records[0] && event.Records[0].Sns) {
    var snsAttrs = event.Records[0].Sns.MessageAttributes;
    if (snsAttrs && snsAttrs.correlationId) {
      return snsAttrs.correlationId.Value;
    }
  }

  // Generate a new correlation ID
  return crypto.randomUUID();
}

function normalizeHeaders(headers) {
  var normalized = {};
  Object.keys(headers).forEach(function(key) {
    normalized[key.toLowerCase()] = headers[key];
  });
  return normalized;
}

function propagateToSQS(correlationId, messageParams) {
  if (!messageParams.MessageAttributes) {
    messageParams.MessageAttributes = {};
  }
  messageParams.MessageAttributes.correlationId = {
    DataType: "String",
    StringValue: correlationId
  };
  return messageParams;
}

function propagateToHTTP(correlationId, requestOptions) {
  if (!requestOptions.headers) {
    requestOptions.headers = {};
  }
  requestOptions.headers["x-correlation-id"] = correlationId;
  return requestOptions;
}

module.exports = {
  extractCorrelationId: extractCorrelationId,
  propagateToSQS: propagateToSQS,
  propagateToHTTP: propagateToHTTP
};

Every log entry should include the correlation ID. This lets you run a single CloudWatch Logs Insights query across multiple log groups to reconstruct the full request path:

fields @timestamp, @logGroup, message, data.correlationId, data.orderId
| filter data.correlationId = "abc-123-def-456"
| sort @timestamp asc

Third-Party Monitoring Tools

AWS-native tooling gets you far, but third-party platforms add value in specific areas.

Datadog provides a unified view across Lambda functions, containers, and traditional infrastructure. Their Lambda layer auto-instruments functions and correlates traces with logs and metrics without code changes. The Datadog Forwarder Lambda ships CloudWatch logs to their platform for indexing:

# Datadog Lambda layer in SAM
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: /opt/nodejs/node_modules/datadog-lambda-js/handler.handler
      Runtime: nodejs18.x
      Layers:
        - !Sub "arn:aws:lambda:${AWS::Region}:464622532012:layer:Datadog-Node18-x:100"
      Environment:
        Variables:
          DD_LAMBDA_HANDLER: handlers/processOrder.handler
          DD_API_KEY: !Ref DatadogApiKey
          DD_TRACE_ENABLED: "true"
          DD_MERGE_XRAY_TRACES: "true"

Lumigo is purpose-built for serverless and excels at transaction tracing. It auto-discovers your architecture, visualizes request flows, and surfaces payload data at each step. For debugging complex event-driven architectures, Lumigo often finds issues faster than X-Ray because it captures the actual payloads passing between services.

My recommendation: start with AWS-native tools. Add a third-party tool when you hit specific pain points -- typically when you have more than 20 Lambda functions and debugging cross-service issues becomes a regular occurrence.

Creating Operational Dashboards

A good operational dashboard answers three questions at a glance: Is the system healthy? What changed recently? Where should I look first?

// scripts/createDashboard.js
var AWS = require("aws-sdk");
var cloudwatch = new AWS.CloudWatch();

var dashboardBody = {
  widgets: [
    {
      type: "metric",
      x: 0, y: 0, width: 12, height: 6,
      properties: {
        title: "Invocations & Errors",
        metrics: [
          ["AWS/Lambda", "Invocations", "FunctionName", "ProcessOrder", { stat: "Sum" }],
          ["AWS/Lambda", "Errors", "FunctionName", "ProcessOrder", { stat: "Sum", color: "#d62728" }],
          ["AWS/Lambda", "Throttles", "FunctionName", "ProcessOrder", { stat: "Sum", color: "#ff7f0e" }]
        ],
        period: 300,
        view: "timeSeries",
        region: "us-east-1"
      }
    },
    {
      type: "metric",
      x: 12, y: 0, width: 12, height: 6,
      properties: {
        title: "Latency (P50, P95, P99)",
        metrics: [
          ["AWS/Lambda", "Duration", "FunctionName", "ProcessOrder", { stat: "p50", label: "P50" }],
          ["AWS/Lambda", "Duration", "FunctionName", "ProcessOrder", { stat: "p95", label: "P95" }],
          ["AWS/Lambda", "Duration", "FunctionName", "ProcessOrder", { stat: "p99", label: "P99" }]
        ],
        period: 300,
        view: "timeSeries"
      }
    },
    {
      type: "metric",
      x: 0, y: 6, width: 12, height: 6,
      properties: {
        title: "Concurrent Executions",
        metrics: [
          ["AWS/Lambda", "ConcurrentExecutions", "FunctionName", "ProcessOrder", { stat: "Maximum" }]
        ],
        period: 60,
        view: "timeSeries"
      }
    },
    {
      type: "log",
      x: 12, y: 6, width: 12, height: 6,
      properties: {
        title: "Recent Errors",
        query: "fields @timestamp, data.errorMessage, data.orderId\n| filter level = 'ERROR'\n| sort @timestamp desc\n| limit 10",
        region: "us-east-1",
        stacked: false,
        view: "table"
      }
    },
    {
      type: "metric",
      x: 0, y: 12, width: 24, height: 6,
      properties: {
        title: "Custom Business Metrics",
        metrics: [
          ["OrderService", "PaymentProcessed", { stat: "Sum", label: "Payments" }],
          ["OrderService", "UnhandledException", { stat: "Sum", label: "Unhandled Errors", color: "#d62728" }]
        ],
        period: 300,
        view: "timeSeries"
      }
    }
  ]
};

var params = {
  DashboardName: "OrderService-Operations",
  DashboardBody: JSON.stringify(dashboardBody)
};

cloudwatch.putDashboard(params).promise()
  .then(function() {
    console.log("Dashboard created successfully");
  })
  .catch(function(err) {
    console.error("Failed to create dashboard:", err);
  });

Runbook Automation with SSM

When alerts fire at 3 AM, you want automated runbooks, not groggy engineers manually running commands. AWS Systems Manager (SSM) Automation documents define step-by-step procedures that execute automatically:

# ssm-runbook.yaml
description: "Investigate and remediate high Lambda error rate"
schemaVersion: "0.3"
assumeRole: "{{ AutomationAssumeRole }}"
parameters:
  FunctionName:
    type: String
    description: "Lambda function name"
  AutomationAssumeRole:
    type: String

mainSteps:
  - name: GetRecentErrors
    action: aws:executeAwsApi
    inputs:
      Service: logs
      Api: StartQuery
      logGroupName: !Sub "/aws/lambda/{{ FunctionName }}"
      startTime: "{{ global:DATE_TIME_MINUS_1H }}"
      endTime: "{{ global:DATE_TIME }}"
      queryString: "fields @timestamp, data.errorMessage | filter level = 'ERROR' | sort @timestamp desc | limit 20"

  - name: CheckThrottling
    action: aws:executeAwsApi
    inputs:
      Service: cloudwatch
      Api: GetMetricStatistics
      Namespace: AWS/Lambda
      MetricName: Throttles
      Dimensions:
        - Name: FunctionName
          Value: "{{ FunctionName }}"
      StartTime: "{{ global:DATE_TIME_MINUS_1H }}"
      EndTime: "{{ global:DATE_TIME }}"
      Period: 300
      Statistics:
        - Sum

  - name: IncreaseConcurrency
    action: aws:executeAwsApi
    inputs:
      Service: lambda
      Api: PutFunctionConcurrency
      FunctionName: "{{ FunctionName }}"
      ReservedConcurrentExecutions: 500
    onFailure: Continue

  - name: NotifyTeam
    action: aws:executeAwsApi
    inputs:
      Service: sns
      Api: Publish
      TopicArn: "arn:aws:sns:us-east-1:123456789:oncall-notifications"
      Subject: "Runbook executed for {{ FunctionName }}"
      Message: "Automated investigation complete. Concurrency increased. Review CloudWatch dashboard for details."

Complete Working Example

Here is a fully instrumented serverless order processing function that combines structured logging, X-Ray tracing, custom metrics, correlation IDs, and error handling:

// handlers/processOrder.js
var initStart = Date.now();

var AWSXRay = require("aws-xray-sdk-core");
var AWS = AWSXRay.captureAWS(require("aws-sdk"));
var logger = require("../lib/logger");
var emf = require("../lib/emfMetrics");
var correlation = require("../lib/correlation");
var Timer = require("../lib/timer");

var dynamodb = new AWS.DynamoDB.DocumentClient();
var sqs = new AWS.SQS();
var initDuration = Date.now() - initStart;

exports.handler = function(event, context) {
  var log = logger.createLogger(context);
  var timer = new Timer(log);
  var correlationId = correlation.extractCorrelationId(event);
  var isColdStart = !global._orderHandlerWarmed;

  if (isColdStart) {
    global._orderHandlerWarmed = true;
    log.info("Cold start", { initDurationMs: initDuration });
    emf.emitMetric("OrderService", "ColdStart", 1, "Count", {
      FunctionName: context.functionName
    });
  }

  // Parse request body
  var body;
  try {
    body = JSON.parse(event.body);
  } catch (err) {
    log.warn("Invalid request body", { correlationId: correlationId });
    return Promise.resolve({
      statusCode: 400,
      body: JSON.stringify({ error: "Invalid JSON" })
    });
  }

  log.info("Processing order", {
    correlationId: correlationId,
    orderId: body.orderId,
    customerId: body.customerId,
    itemCount: body.items ? body.items.length : 0
  });

  var segment = AWSXRay.getSegment();
  var subsegment = segment.addNewSubsegment("processOrder");
  subsegment.addAnnotation("correlationId", correlationId);
  subsegment.addAnnotation("orderId", body.orderId);

  timer.mark("dbWrite");

  // Store order in DynamoDB
  var orderItem = {
    TableName: process.env.ORDERS_TABLE,
    Item: {
      orderId: body.orderId,
      customerId: body.customerId,
      items: body.items,
      status: "PROCESSING",
      correlationId: correlationId,
      createdAt: new Date().toISOString()
    }
  };

  return dynamodb.put(orderItem).promise()
    .then(function() {
      timer.measure("dbWrite");
      timer.mark("sqsSend");

      // Send to fulfillment queue
      var sqsParams = {
        QueueUrl: process.env.FULFILLMENT_QUEUE_URL,
        MessageBody: JSON.stringify({
          orderId: body.orderId,
          customerId: body.customerId,
          items: body.items
        })
      };

      sqsParams = correlation.propagateToSQS(correlationId, sqsParams);
      return sqs.sendMessage(sqsParams).promise();
    })
    .then(function(sqsResult) {
      timer.measure("sqsSend");

      log.info("Order submitted to fulfillment", {
        correlationId: correlationId,
        orderId: body.orderId,
        messageId: sqsResult.MessageId
      });

      emf.emitMetric("OrderService", "OrderCreated", 1, "Count", {
        FunctionName: context.functionName
      });

      emf.emitMetric("OrderService", "OrderLatency",
        Date.now() - timer.start, "Milliseconds", {
          FunctionName: context.functionName
        });

      subsegment.close();
      timer.summary();

      return {
        statusCode: 201,
        headers: {
          "Content-Type": "application/json",
          "x-correlation-id": correlationId
        },
        body: JSON.stringify({
          orderId: body.orderId,
          status: "PROCESSING",
          correlationId: correlationId
        })
      };
    })
    .catch(function(err) {
      log.error("Failed to process order", {
        correlationId: correlationId,
        orderId: body.orderId,
        errorMessage: err.message,
        errorStack: err.stack,
        errorCode: err.code
      });

      emf.emitMetric("OrderService", "OrderFailed", 1, "Count", {
        FunctionName: context.functionName,
        ErrorType: err.code || "Unknown"
      });

      subsegment.addError(err);
      subsegment.close();

      return {
        statusCode: 500,
        headers: {
          "Content-Type": "application/json",
          "x-correlation-id": correlationId
        },
        body: JSON.stringify({
          error: "Failed to process order",
          correlationId: correlationId
        })
      };
    });
};

And the SAM template to deploy it:

AWSTemplateFormatVersion: "2010-09-09"
Transform: AWS::Serverless-2016-10-31
Description: Instrumented Order Service

Globals:
  Function:
    Timeout: 30
    Runtime: nodejs18.x
    Tracing: Active
    Environment:
      Variables:
        SERVICE_NAME: OrderService
        LOG_LEVEL: INFO

Resources:
  ProcessOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: handlers/processOrder.handler
      MemorySize: 256
      Layers:
        - !Sub "arn:aws:lambda:${AWS::Region}:580247275435:layer:LambdaInsightsExtension:38"
      Environment:
        Variables:
          ORDERS_TABLE: !Ref OrdersTable
          FULFILLMENT_QUEUE_URL: !Ref FulfillmentQueue
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref OrdersTable
        - SQSSendMessagePolicy:
            QueueName: !GetAtt FulfillmentQueue.QueueName
        - CloudWatchLambdaInsightsExecutionRolePolicy
      Events:
        ApiEvent:
          Type: Api
          Properties:
            Path: /orders
            Method: POST

  OrdersTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: Orders
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: orderId
          AttributeType: S
      KeySchema:
        - AttributeName: orderId
          KeyType: HASH

  FulfillmentQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: fulfillment-queue
      VisibilityTimeout: 60
      RedrivePolicy:
        deadLetterTargetArn: !GetAtt FulfillmentDLQ.Arn
        maxReceiveCount: 3

  FulfillmentDLQ:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: fulfillment-dlq

  ErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: OrderService-Errors
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: !Ref ProcessOrderFunction
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlertTopic

  AlertTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: order-service-alerts

Common Issues and Troubleshooting

1. X-Ray segment not found

Error: Failed to get the current sub/segment from the context.

This happens when you call X-Ray SDK methods outside a traced context. Set the environment variable AWS_XRAY_CONTEXT_MISSING=LOG_ERROR to downgrade the exception to a log warning. This is essential for local development and unit testing where the X-Ray daemon is not running.
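
The same behavior can be set in code, which helps in unit tests where the environment variable is not set:

var AWSXRay = require("aws-xray-sdk-core");

// Downgrade missing-context errors to log messages (useful for local runs and tests)
AWSXRay.setContextMissingStrategy("LOG_ERROR");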

2. CloudWatch Logs Insights query returns no results

No log events found for the selected time range

Check three things: the log group name matches exactly (it is case-sensitive and follows the pattern /aws/lambda/FunctionName), the time range covers the period when your function ran, and your structured log entries are valid JSON. Logs Insights only extracts fields from lines that parse as JSON, so a filter on level or data.orderId silently skips any malformed entries. Validate your JSON with JSON.parse before logging in development.

3. EMF metrics not appearing in CloudWatch

Embedded metric format validation failed: No metrics found in the document

The EMF payload must include the _aws key with a CloudWatchMetrics array, and the metric name referenced in the Metrics array must exist as a top-level key in the JSON object. Also verify that the metric Unit value is a valid CloudWatch unit string (Count, Milliseconds, Bytes, etc.), not a custom string.

4. Lambda timeout with no useful logs

Task timed out after 30.03 seconds

When Lambda times out, it kills the process immediately. Any buffered log entries that have not been flushed to stdout are lost. This is why you should log at the beginning of each major operation, not just at the end. If your function is timing out during downstream calls, add logging before each SDK call so you know which one is hanging. Also check for DynamoDB hot partition throttling or SQS visibility timeout conflicts that can cause silent delays.

5. X-Ray traces missing subsegments

Subsegment "validateOrder" missing from trace

X-Ray drops subsegments that are not properly closed. Always use try/finally or catch blocks to call subsegment.close() even when exceptions occur. Unclosed subsegments are silently discarded and will not appear in the X-Ray console. The X-Ray SDK also samples traces -- by default it captures the first request each second and then 5% of additional requests. In development, set the sampling rate to 100% in your xray-sampling-rules.json.
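
A local rules file that traces every request during development might look like this (a sketch of the SDK's local sampling document format):

{
  "version": 2,
  "rules": [],
  "default": {
    "fixed_target": 1,
    "rate": 1.0
  }
}

Load it during init with AWSXRay.middleware.setSamplingRules("xray-sampling-rules.json").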

Best Practices

  • Log at function boundaries, not just at the end. Log when the handler starts, before each external call, and after each external call. When a timeout kills your function, the last log entry tells you exactly where it was stuck.

  • Use Embedded Metric Format over putMetricData for high-throughput functions. EMF publishes metrics through log output with zero additional API calls. Direct putMetricData calls add latency and cost money at scale.

  • Always propagate correlation IDs across service boundaries. Whether it flows through HTTP headers, SQS message attributes, or SNS message attributes, the correlation ID is the only thing that lets you reconstruct a request path across dozens of Lambda functions.

  • Set up dead letter queues on every async Lambda trigger. When a function fails repeatedly, the event must go somewhere you can inspect it. DLQs with CloudWatch alarms on their message count give you a safety net; a sample alarm is sketched after this list.

  • Monitor the right percentile, not the average. P99 latency is what your worst-experience users see. Average latency hides cold starts, retry storms, and downstream timeouts behind a comfortable number.

  • Keep Lambda packages small. Every megabyte of deployment package increases cold start time. Use tree-shaking, exclude dev dependencies, and consider Lambda Layers for shared dependencies that change infrequently.

  • Test your monitoring before you need it. Deliberately trigger errors, timeouts, and throttles in staging. Verify that alarms fire, dashboards update, and X-Ray traces appear. Discovering that your alerting is broken during a production incident is the worst possible time.

  • Use log levels and make them configurable via environment variables. DEBUG-level logging in production generates enormous CloudWatch costs. Set INFO as the default and flip to DEBUG only when actively investigating an issue.
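
The DLQ alarm from the bullet above, as a SAM fragment that would slot into the earlier template's Resources section (the alarm name and threshold are examples):

  # Alarm when anything lands in the fulfillment dead letter queue
  FulfillmentDLQAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: FulfillmentDLQ-MessagesVisible
      Namespace: AWS/SQS
      MetricName: ApproximateNumberOfMessagesVisible
      Dimensions:
        - Name: QueueName
          Value: !GetAtt FulfillmentDLQ.QueueName
      Statistic: Maximum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching
      AlarmActions:
        - !Ref AlertTopic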
