Drift Detection and Remediation

Detect and remediate infrastructure drift with Terraform, CloudFormation, AWS Config, and custom Node.js monitoring tools

Infrastructure drift is the silent killer of reliable deployments. It happens when the actual state of your cloud resources diverges from what your Infrastructure as Code definitions say they should be, and if you are not actively watching for it, you will not know until something breaks in production. This article covers practical detection strategies using Terraform, CloudFormation, AWS Config, and custom Node.js tooling that runs on a schedule and alerts your team before drift becomes a disaster.

Prerequisites

  • Working knowledge of Terraform or CloudFormation
  • AWS account with CLI access configured
  • Node.js 18+ installed locally
  • Basic familiarity with AWS Lambda and EventBridge
  • A Slack workspace with an incoming webhook URL

What Infrastructure Drift Is

Infrastructure drift occurs when the real state of a deployed resource no longer matches what is declared in your IaC configuration files. Your Terraform says the security group allows port 443 only, but someone opened port 22 through the AWS console last Tuesday. Your CloudFormation template says the RDS instance is db.t3.medium, but an engineer scaled it up to db.r5.large during an incident and never updated the template.

Drift is not a bug in your tooling. It is a process failure. The tools deployed the infrastructure correctly at apply time. Something changed after that, and your code was not updated to reflect it. The gap between declared state and actual state is drift, and it compounds over time.

There are two categories worth distinguishing. Unintentional drift is the dangerous kind: someone made a manual change without realizing it would cause problems downstream. Intentional drift happens when an engineer deliberately changes something in an emergency but has not yet backported the change to code. Both need detection. Only one needs immediate remediation.

Common Causes of Drift

Drift does not appear randomly. It comes from predictable sources:

Console cowboys. Engineers making changes through the AWS Management Console, Azure Portal, or GCP Console instead of through IaC pipelines. This is the number one cause of drift across every organization I have worked with.

Emergency incident response. During a production outage, nobody is going to wait for a Terraform PR to get reviewed. Engineers scale up instances, open security group rules, and modify configurations directly. The problem is not the emergency change. The problem is that nobody updates the code afterward.

Auto-scaling and dynamic resources. Some resources are designed to change. Auto-scaling groups modify instance counts. Kubernetes controllers adjust pod replicas. These are expected deviations, and your drift detection needs to account for them.

Third-party integrations. AWS services like GuardDuty, SecurityHub, and Config can modify resources on your behalf. Tag policies might add tags. AWS Backup might modify retention settings. These changes are legitimate but still represent drift from your declared state.

Terraform state corruption. If your remote state file gets out of sync, perhaps from a failed apply or a state manipulation gone wrong, Terraform will report drift that does not actually exist in the infrastructure.

Module version upgrades. Updating a Terraform module version can change default values for parameters you were not explicitly setting. The next plan shows "drift" that is really a change in defaults.

Terraform Plan as Drift Detector

The simplest drift detection tool you already have is terraform plan. When run against an unchanged configuration, any differences Terraform reports are drift.

#!/bin/bash
# drift-check.sh - Run Terraform plan and capture drift

set -euo pipefail

WORKSPACE=${1:-"production"}
OUTPUT_FILE="/tmp/drift-report-${WORKSPACE}-$(date +%Y%m%d).txt"

cd /opt/terraform/infrastructure

terraform init -input=false -no-color
terraform workspace select "$WORKSPACE"

# Plan with no changes - any output is drift
PLAN_OUTPUT=$(terraform plan -detailed-exitcode -no-color 2>&1) || EXIT_CODE=$?

if [ "${EXIT_CODE:-0}" -eq 2 ]; then
    echo "DRIFT DETECTED in workspace: $WORKSPACE" > "$OUTPUT_FILE"
    echo "$PLAN_OUTPUT" >> "$OUTPUT_FILE"
    echo "drift_detected"
elif [ "${EXIT_CODE:-0}" -eq 1 ]; then
    echo "ERROR running plan in workspace: $WORKSPACE" > "$OUTPUT_FILE"
    echo "$PLAN_OUTPUT" >> "$OUTPUT_FILE"
    echo "plan_error"
else
    echo "No drift detected in workspace: $WORKSPACE"
    echo "no_drift"
fi

The -detailed-exitcode flag is the key here. Exit code 0 means no changes, exit code 2 means changes detected, and exit code 1 means an error occurred. This lets you script around drift detection cleanly.
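
If you drive Terraform from Node.js tooling instead of shell scripts, the same exit-code convention carries over. Here is a minimal sketch using child_process; the configuration directory is whatever path holds your root module:

var execFile = require("child_process").execFile;

// Run terraform plan and interpret -detailed-exitcode:
// exit 0 = no drift, exit 2 = drift, anything else = error.
function checkDrift(configDir) {
  return new Promise(function (resolve, reject) {
    var args = ["plan", "-detailed-exitcode", "-input=false", "-no-color"];
    execFile("terraform", args, { cwd: configDir, maxBuffer: 10 * 1024 * 1024 },
      function (err, stdout, stderr) {
        if (!err) return resolve({ drift: false, output: stdout });
        if (err.code === 2) return resolve({ drift: true, output: stdout });
        reject(new Error("terraform plan failed: " + stderr));
      });
  });
}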

For organizations running Terraform Cloud or Terraform Enterprise, you can use the API to trigger speculative plans:

# Trigger a speculative plan via Terraform Cloud API
curl -s \
  --header "Authorization: Bearer $TFC_TOKEN" \
  --header "Content-Type: application/vnd.api+json" \
  --request POST \
  --data '{
    "data": {
      "attributes": {
        "is-destroy": false,
        "message": "Scheduled drift detection",
        "speculative": true
      },
      "type": "runs",
      "relationships": {
        "workspace": {
          "data": {
            "type": "workspaces",
            "id": "'"$WORKSPACE_ID"'"
          }
        }
      }
    }
  }' \
  "https://app.terraform.io/api/v2/runs"

CloudFormation Drift Detection

AWS CloudFormation has built-in drift detection that you can trigger through the CLI or API:

# Start drift detection on a stack
DETECTION_ID=$(aws cloudformation detect-stack-drift \
  --stack-name production-vpc \
  --query 'StackDriftDetectionId' \
  --output text)

# Check detection status (repeat until DetectionStatus is DETECTION_COMPLETE)
aws cloudformation describe-stack-drift-detection-status \
  --stack-drift-detection-id "$DETECTION_ID"

# Get the detailed drift results
aws cloudformation describe-stack-resource-drifts \
  --stack-name production-vpc \
  --stack-resource-drift-status-filters MODIFIED DELETED
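
Detection runs asynchronously, so automation has to poll the status call until it reports DETECTION_COMPLETE. If you are driving this from Node.js rather than the CLI, a minimal polling sketch in the same aws-sdk v2 style as the tooling later in this article might look like this:

var AWS = require("aws-sdk");
var cloudformation = new AWS.CloudFormation({ region: "us-east-1" });

// Start drift detection, poll until it finishes, then return per-resource
// details for anything MODIFIED or DELETED.
function checkStackDrift(stackName) {
  return cloudformation.detectStackDrift({ StackName: stackName }).promise()
    .then(function (data) {
      function poll() {
        return cloudformation.describeStackDriftDetectionStatus({
          StackDriftDetectionId: data.StackDriftDetectionId
        }).promise().then(function (status) {
          if (status.DetectionStatus === "DETECTION_IN_PROGRESS") {
            return new Promise(function (resolve) {
              setTimeout(function () { resolve(poll()); }, 10000);
            });
          }
          return status;
        });
      }
      return poll();
    })
    .then(function (status) {
      if (status.StackDriftStatus !== "DRIFTED") {
        return { StackResourceDrifts: [] };
      }
      return cloudformation.describeStackResourceDrifts({
        StackName: stackName,
        StackResourceDriftStatusFilters: ["MODIFIED", "DELETED"]
      }).promise();
    });
}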

CloudFormation drift detection has a significant limitation: it only checks resources that support drift detection, and not all resource types do. As of this writing, about 70% of CloudFormation resource types support drift detection. Always check the AWS documentation for your specific resource types.

AWS Config Rules for Compliance Drift

AWS Config takes a different approach. Instead of comparing against IaC definitions, it evaluates resources against compliance rules. This catches drift that your IaC might not know about.

{
  "ConfigRuleName": "security-group-no-unrestricted-ssh",
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "INCOMING_SSH_DISABLED"
  },
  "Scope": {
    "ComplianceResourceTypes": [
      "AWS::EC2::SecurityGroup"
    ]
  }
}

Config rules fire evaluations when resources change, giving you near-real-time drift detection for compliance-critical properties. You can combine managed rules (pre-built by AWS) with custom rules backed by Lambda functions that evaluate whatever logic you need.
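
A custom rule is just a Lambda that receives the changed configuration item, decides compliance, and reports back through PutEvaluations. Here is a minimal sketch of that contract; the policy itself (requiring a ManagedBy tag on security groups) is only an illustration:

var AWS = require("aws-sdk");
var config = new AWS.ConfigService();

exports.handler = function (event, context, callback) {
  // Config passes the changed resource as a JSON string in invokingEvent
  var invokingEvent = JSON.parse(event.invokingEvent);
  var item = invokingEvent.configurationItem;
  if (!item) return callback(null); // scheduled notifications carry no item

  var compliance = "NOT_APPLICABLE";
  if (item.resourceType === "AWS::EC2::SecurityGroup") {
    var tags = item.tags || {};
    compliance = tags.ManagedBy === "terraform" ? "COMPLIANT" : "NON_COMPLIANT";
  }

  config.putEvaluations({
    ResultToken: event.resultToken,
    Evaluations: [{
      ComplianceResourceType: item.resourceType,
      ComplianceResourceId: item.resourceId,
      ComplianceType: compliance,
      OrderingTimestamp: item.configurationItemCaptureTime
    }]
  }, callback);
};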

The real power comes from Config's aggregation. In multi-account setups with AWS Organizations, you can aggregate Config compliance data across all accounts into a single dashboard. This gives you a centralized view of drift across your entire organization.
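
You can also pull that aggregated view into your own tooling by querying the aggregator directly. A short sketch, assuming an aggregator named org-aggregator already exists:

var AWS = require("aws-sdk");
var configService = new AWS.ConfigService({ region: "us-east-1" });

// List non-compliant rules across every account and region in the aggregator
function listNonCompliantRules(aggregatorName) {
  return configService.describeAggregateComplianceByConfigRules({
    ConfigurationAggregatorName: aggregatorName,
    Filters: { ComplianceType: "NON_COMPLIANT" }
  }).promise().then(function (data) {
    return data.AggregateComplianceByConfigRules.map(function (r) {
      return { account: r.AccountId, region: r.AwsRegion, rule: r.ConfigRuleName };
    });
  });
}

listNonCompliantRules("org-aggregator").then(console.log);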

Building a Custom Drift Detection Tool with Node.js

Sometimes you need detection logic that goes beyond what Terraform or CloudFormation provides. Here is a Node.js module that checks for common drift patterns using the AWS SDK for JavaScript v2:

var AWS = require("aws-sdk");

var EXPECTED_STATE = {
  securityGroups: {
    "sg-prod-web": {
      allowedPorts: [443, 80],
      allowedCidrs: ["10.0.0.0/8", "0.0.0.0/0"]
    }
  },
  rdsInstances: {
    "prod-database": {
      instanceClass: "db.t3.large",
      multiAz: true,
      encrypted: true
    }
  },
  s3Buckets: {
    "prod-assets": {
      publicAccessBlocked: true,
      versioningEnabled: true,
      encryptionEnabled: true
    }
  }
};

function checkSecurityGroupDrift(ec2, groupId, expected) {
  return new Promise(function (resolve, reject) {
    var params = { GroupIds: [groupId] };
    ec2.describeSecurityGroups(params, function (err, data) {
      if (err) return reject(err);

      var sg = data.SecurityGroups[0];
      var drifts = [];

      sg.IpPermissions.forEach(function (rule) {
        var port = rule.FromPort;
        if (expected.allowedPorts.indexOf(port) === -1) {
          drifts.push({
            resource: groupId,
            type: "UNEXPECTED_PORT",
            detail: "Port " + port + " is open but not in expected state",
            severity: "HIGH"
          });
        }

        rule.IpRanges.forEach(function (range) {
          if (expected.allowedCidrs.indexOf(range.CidrIp) === -1) {
            drifts.push({
              resource: groupId,
              type: "UNEXPECTED_CIDR",
              detail: "CIDR " + range.CidrIp + " on port " + port + " not expected",
              severity: "CRITICAL"
            });
          }
        });
      });

      resolve(drifts);
    });
  });
}

function checkRdsDrift(rds, instanceId, expected) {
  return new Promise(function (resolve, reject) {
    var params = { DBInstanceIdentifier: instanceId };
    rds.describeDBInstances(params, function (err, data) {
      if (err) return reject(err);

      var instance = data.DBInstances[0];
      var drifts = [];

      if (instance.DBInstanceClass !== expected.instanceClass) {
        drifts.push({
          resource: instanceId,
          type: "INSTANCE_CLASS_MISMATCH",
          detail: "Expected " + expected.instanceClass + ", found " + instance.DBInstanceClass,
          severity: "MEDIUM"
        });
      }

      if (instance.MultiAZ !== expected.multiAz) {
        drifts.push({
          resource: instanceId,
          type: "MULTI_AZ_MISMATCH",
          detail: "Expected MultiAZ=" + expected.multiAz + ", found " + instance.MultiAZ,
          severity: "HIGH"
        });
      }

      if (instance.StorageEncrypted !== expected.encrypted) {
        drifts.push({
          resource: instanceId,
          type: "ENCRYPTION_MISMATCH",
          detail: "Expected encrypted=" + expected.encrypted + ", found " + instance.StorageEncrypted,
          severity: "CRITICAL"
        });
      }

      resolve(drifts);
    });
  });
}

function checkS3Drift(s3, bucketName, expected) {
  return new Promise(function (resolve, reject) {
    var drifts = [];
    var checks = [];

    checks.push(
      new Promise(function (res, rej) {
        s3.getPublicAccessBlock({ Bucket: bucketName }, function (err, data) {
          if (err && err.code === "NoSuchPublicAccessBlockConfiguration") {
            if (expected.publicAccessBlocked) {
              drifts.push({
                resource: bucketName,
                type: "PUBLIC_ACCESS_NOT_BLOCKED",
                detail: "Public access block is not configured",
                severity: "CRITICAL"
              });
            }
            return res();
          }
          if (err) return rej(err);

          var config = data.PublicAccessBlockConfiguration;
          var allBlocked = config.BlockPublicAcls &&
            config.IgnorePublicAcls &&
            config.BlockPublicPolicy &&
            config.RestrictPublicBuckets;

          if (expected.publicAccessBlocked && !allBlocked) {
            drifts.push({
              resource: bucketName,
              type: "PUBLIC_ACCESS_PARTIALLY_BLOCKED",
              detail: "Not all public access block settings are enabled",
              severity: "HIGH"
            });
          }
          res();
        });
      })
    );

    checks.push(
      new Promise(function (res, rej) {
        s3.getBucketVersioning({ Bucket: bucketName }, function (err, data) {
          if (err) return rej(err);
          var isEnabled = data.Status === "Enabled";
          if (expected.versioningEnabled && !isEnabled) {
            drifts.push({
              resource: bucketName,
              type: "VERSIONING_DISABLED",
              detail: "Bucket versioning is not enabled",
              severity: "MEDIUM"
            });
          }
          res();
        });
      })
    );
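
    // Default encryption check, covering expected.encryptionEnabled. This is
    // a sketch: buckets created since January 2023 get SSE-S3 by default, so
    // the "not found" error mostly applies to older buckets.
    checks.push(
      new Promise(function (res, rej) {
        s3.getBucketEncryption({ Bucket: bucketName }, function (err) {
          if (err && err.code === "ServerSideEncryptionConfigurationNotFoundError") {
            if (expected.encryptionEnabled) {
              drifts.push({
                resource: bucketName,
                type: "ENCRYPTION_DISABLED",
                detail: "Bucket has no default encryption configuration",
                severity: "HIGH"
              });
            }
            return res();
          }
          if (err) return rej(err);
          res();
        });
      })
    );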

    Promise.all(checks)
      .then(function () { resolve(drifts); })
      .catch(reject);
  });
}

function runAllChecks(region) {
  var ec2 = new AWS.EC2({ region: region });
  var rds = new AWS.RDS({ region: region });
  var s3 = new AWS.S3({ region: region });

  var allChecks = [];

  Object.keys(EXPECTED_STATE.securityGroups).forEach(function (sgId) {
    allChecks.push(checkSecurityGroupDrift(ec2, sgId, EXPECTED_STATE.securityGroups[sgId]));
  });

  Object.keys(EXPECTED_STATE.rdsInstances).forEach(function (instanceId) {
    allChecks.push(checkRdsDrift(rds, instanceId, EXPECTED_STATE.rdsInstances[instanceId]));
  });

  Object.keys(EXPECTED_STATE.s3Buckets).forEach(function (bucket) {
    allChecks.push(checkS3Drift(s3, bucket, EXPECTED_STATE.s3Buckets[bucket]));
  });

  return Promise.all(allChecks).then(function (results) {
    var flat = [];
    results.forEach(function (driftArray) {
      driftArray.forEach(function (d) { flat.push(d); });
    });
    return flat;
  });
}

module.exports = {
  runAllChecks: runAllChecks,
  checkSecurityGroupDrift: checkSecurityGroupDrift,
  checkRdsDrift: checkRdsDrift,
  checkS3Drift: checkS3Drift
};

This approach is useful when you want to check specific properties that matter to your organization without running a full Terraform plan. It is faster, more targeted, and you can add custom severity levels that map to your incident response workflows.
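
To run the module ad hoc or from cron, a small runner script is enough. A sketch, assuming the module above is saved as drift-detector.js (the same name the Lambda handler below requires):

// run-checks.js
var detector = require("./drift-detector");

detector.runAllChecks("us-east-1")
  .then(function (drifts) {
    if (drifts.length === 0) {
      console.log("No drift detected");
      return;
    }
    drifts.forEach(function (d) {
      console.log("[" + d.severity + "] " + d.resource + ": " + d.detail);
    });
    process.exitCode = 1; // non-zero exit lets cron or CI flag the run
  })
  .catch(function (err) {
    console.error("Drift check failed: " + err.message);
    process.exitCode = 2;
  });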

Scheduled Drift Checks with Lambda

Wrap the drift detection in a Lambda function triggered by EventBridge on a schedule. Note that the nodejs18.x Lambda runtime ships only AWS SDK for JavaScript v3, so the aws-sdk v2 package used here must be bundled into the deployment package:

var AWS = require("aws-sdk");
var https = require("https");
var url = require("url");

var SLACK_WEBHOOK_URL = process.env.SLACK_WEBHOOK_URL;
var REGIONS = (process.env.CHECK_REGIONS || "us-east-1").split(",");

function sendSlackNotification(drifts) {
  return new Promise(function (resolve, reject) {
    var criticalCount = drifts.filter(function (d) { return d.severity === "CRITICAL"; }).length;
    var highCount = drifts.filter(function (d) { return d.severity === "HIGH"; }).length;
    var mediumCount = drifts.filter(function (d) { return d.severity === "MEDIUM"; }).length;

    var color = criticalCount > 0 ? "#FF0000" : highCount > 0 ? "#FFA500" : "#FFFF00";
    var emoji = criticalCount > 0 ? ":rotating_light:" : ":warning:";

    var fields = drifts.map(function (d) {
      return {
        title: d.resource + " - " + d.type,
        value: d.detail + " (Severity: " + d.severity + ")",
        short: false
      };
    });

    var payload = {
      text: emoji + " Infrastructure Drift Detected",
      attachments: [{
        color: color,
        title: "Drift Report - " + new Date().toISOString().split("T")[0],
        text: "Found " + drifts.length + " drift issues: " +
          criticalCount + " critical, " + highCount + " high, " + mediumCount + " medium",
        fields: fields.slice(0, 20),
        footer: "Drift Detection Lambda"
      }]
    };

    var parsed = url.parse(SLACK_WEBHOOK_URL);
    var options = {
      hostname: parsed.hostname,
      path: parsed.path,
      method: "POST",
      headers: { "Content-Type": "application/json" }
    };

    var req = https.request(options, function (res) {
      if (res.statusCode === 200) {
        resolve();
      } else {
        reject(new Error("Slack returned status " + res.statusCode));
      }
    });

    req.on("error", reject);
    req.write(JSON.stringify(payload));
    req.end();
  });
}

exports.handler = function (event, context) {
  var driftDetector = require("./drift-detector");
  var allDrifts = [];

  var regionChecks = REGIONS.map(function (region) {
    return driftDetector.runAllChecks(region.trim()).then(function (drifts) {
      drifts.forEach(function (d) {
        d.region = region.trim();
        allDrifts.push(d);
      });
    });
  });

  return Promise.all(regionChecks).then(function () {
    console.log("Total drifts found: " + allDrifts.length);

    if (allDrifts.length > 0) {
      return sendSlackNotification(allDrifts).then(function () {
        return {
          statusCode: 200,
          body: JSON.stringify({
            driftsFound: allDrifts.length,
            drifts: allDrifts
          })
        };
      });
    }

    return {
      statusCode: 200,
      body: JSON.stringify({ driftsFound: 0, message: "No drift detected" })
    };
  });
};

Deploy this with an EventBridge rule that runs every 6 hours. The SAM template later in this article creates the schedule and the invoke permission for you; with the CLI, the rule and target look like this (you also need aws lambda add-permission so EventBridge can invoke the function):

# Create the schedule and point it at the Lambda function
aws events put-rule \
  --name drift-detector-schedule \
  --schedule-expression "rate(6 hours)"

aws events put-targets \
  --rule drift-detector-schedule \
  --targets "Id"="drift-detector","Arn"="arn:aws:lambda:us-east-1:123456789012:function:drift-detector"

Drift Remediation Strategies

When drift is detected, you have two paths: reconcile the infrastructure back to match the code, or update the code to match what the infrastructure has become.

Reconcile: Push Code State to Infrastructure

This is the default approach. Run terraform apply or update the CloudFormation stack. The infrastructure returns to its declared state.

# Reconcile drift by re-applying Terraform
terraform apply -auto-approve -no-color

This works when the drift was unintentional. Someone opened a port they should not have. An instance type was changed without approval. The code is correct, and the infrastructure is wrong.

But be careful. Reconciling can cause downtime. If someone scaled up an RDS instance during an incident because the database was running out of memory, forcing it back to the smaller size will recreate the original problem.

Update Code: Accept the Drift

Sometimes the infrastructure is right and the code is wrong. Emergency changes that fixed real problems should be captured in code, not reverted.

# Import the current state into Terraform
terraform import aws_security_group_rule.emergency_ssh \
  sg-abc123_ingress_tcp_22_22_10.0.0.0/8

# Or refresh state and update config to match
terraform plan -refresh-only

The decision framework is simple: if the change fixed a real problem, update the code. If the change was unauthorized or accidental, reconcile the infrastructure. Document the decision either way.

Hybrid Approach

For complex drift, you might need both. Accept some changes (the instance scaling) while reverting others (the overly permissive security group rule). Terraform lets you target specific resources:

# Only reconcile the security group, leave the RDS instance alone
terraform apply -target=aws_security_group.web -auto-approve

Preventing Drift

The best drift detection is not needing it. Prevention is cheaper than remediation.

Service Control Policies (SCPs)

Use AWS Organizations SCPs to prevent console modifications to critical resources:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyManualSecurityGroupChanges",
      "Effect": "Deny",
      "Action": [
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:AuthorizeSecurityGroupEgress",
        "ec2:RevokeSecurityGroupIngress",
        "ec2:RevokeSecurityGroupEgress"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotLike": {
          "aws:PrincipalArn": "arn:aws:iam::*:role/TerraformExecutionRole"
        }
      }
    }
  ]
}

This SCP allows security group modifications only through the Terraform execution role. Engineers cannot make manual changes even if they have full admin access in their account.

IAM Restrictions

For less critical resources where SCPs are too heavy-handed, use IAM policies with explicit deny statements on the modify/delete actions, except for the CI/CD pipeline role.

Break-Glass Procedures

Prevention needs an escape hatch. Define a documented break-glass procedure that allows engineers to bypass restrictions during genuine emergencies. This typically involves assuming a special emergency role that logs all actions to a dedicated CloudTrail trail. The key is that using the break-glass role triggers an alert, creating accountability and ensuring someone follows up to update the code.

Drift Reporting and Alerting

Beyond Slack notifications, you need a persistent record of drift over time. Here is a simple DynamoDB-backed reporting layer:

var AWS = require("aws-sdk");
var dynamodb = new AWS.DynamoDB.DocumentClient();

var TABLE_NAME = process.env.DRIFT_TABLE || "drift-reports";

function saveDriftReport(drifts, region) {
  var timestamp = new Date().toISOString();
  var reportId = region + "-" + Date.now();

  var params = {
    TableName: TABLE_NAME,
    Item: {
      reportId: reportId,
      timestamp: timestamp,
      region: region,
      driftCount: drifts.length,
      criticalCount: drifts.filter(function (d) { return d.severity === "CRITICAL"; }).length,
      highCount: drifts.filter(function (d) { return d.severity === "HIGH"; }).length,
      drifts: drifts,
      resolved: false
    }
  };

  return dynamodb.put(params).promise();
}

function getUnresolvedDrifts() {
  var params = {
    TableName: TABLE_NAME,
    FilterExpression: "resolved = :r",
    ExpressionAttributeValues: { ":r": false }
  };

  return dynamodb.scan(params).promise().then(function (result) {
    return result.Items;
  });
}

function markResolved(reportId) {
  var params = {
    TableName: TABLE_NAME,
    Key: { reportId: reportId },
    UpdateExpression: "SET resolved = :r, resolvedAt = :t",
    ExpressionAttributeValues: {
      ":r": true,
      ":t": new Date().toISOString()
    }
  };

  return dynamodb.update(params).promise();
}

module.exports = {
  saveDriftReport: saveDriftReport,
  getUnresolvedDrifts: getUnresolvedDrifts,
  markResolved: markResolved
};

This gives you a historical record for auditing, lets you track mean time to remediation, and prevents alert fatigue by only notifying on new drift, not re-alerting on known unresolved issues.
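
One way to implement the "new drift only" behavior is to compare the current run against unresolved reports before alerting. A small sketch built on the reporting module above; the resource-plus-type key is a simplification, so include account and region if you check many of each:

function findNewDrifts(currentDrifts, unresolvedReports) {
  var known = {};
  unresolvedReports.forEach(function (report) {
    (report.drifts || []).forEach(function (d) {
      known[d.resource + "|" + d.type] = true;
    });
  });

  // Keep only drifts that no unresolved report has already captured
  return currentDrifts.filter(function (d) {
    return !known[d.resource + "|" + d.type];
  });
}

// In the Lambda handler, alert only on what is genuinely new:
// getUnresolvedDrifts().then(function (reports) {
//   var fresh = findNewDrifts(allDrifts, reports);
//   return fresh.length > 0 ? sendSlackNotification(fresh) : null;
// });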

Handling Intentional Drift

Not all drift is bad. Some resources are expected to diverge from their declared state. Auto-scaling groups change instance counts. ECS services adjust task counts. Kubernetes creates dynamic resources.

The solution is to explicitly exclude expected drift from your detection:

var IGNORE_RULES = [
  { resource: /^asg-/, attribute: "desired_capacity" },
  { resource: /^ecs-service-/, attribute: "desired_count" },
  { resource: /.*/, attribute: "tags.LastModifiedBy" },
  { resource: /.*/, attribute: "tags.UpdatedAt" }
];

function filterIntentionalDrift(drifts) {
  return drifts.filter(function (drift) {
    var ignored = IGNORE_RULES.some(function (rule) {
      var resourceMatch = rule.resource.test(drift.resource);
      var attrMatch = !rule.attribute || drift.attribute === rule.attribute;
      return resourceMatch && attrMatch;
    });
    return !ignored;
  });
}

In Terraform, you can use lifecycle blocks to ignore specific attributes:

resource "aws_autoscaling_group" "web" {
  # ...
  lifecycle {
    ignore_changes = [desired_capacity]
  }
}

This tells Terraform that changes to desired_capacity are expected and should not be reported as drift.

Drift in Multi-Account Environments

Enterprise environments with dozens or hundreds of AWS accounts need a centralized drift detection strategy. The architecture looks like this:

  1. Hub account runs the drift detection Lambda
  2. Spoke accounts have a cross-account IAM role the Lambda can assume
  3. Results aggregate to a central DynamoDB table and S3 bucket
  4. Dashboards pull from the central store

The cross-account check assumes a role in each spoke account, then runs the same checks with the temporary credentials:

var AWS = require("aws-sdk");
var sts = new AWS.STS();

function assumeRole(accountId, roleName) {
  var roleArn = "arn:aws:iam::" + accountId + ":role/" + roleName;

  var params = {
    RoleArn: roleArn,
    RoleSessionName: "drift-detection-" + accountId,
    DurationSeconds: 900
  };

  return sts.assumeRole(params).promise().then(function (data) {
    return {
      accessKeyId: data.Credentials.AccessKeyId,
      secretAccessKey: data.Credentials.SecretAccessKey,
      sessionToken: data.Credentials.SessionToken
    };
  });
}

function checkAccountDrift(accountId, region) {
  return assumeRole(accountId, "DriftDetectionRole").then(function (credentials) {
    var ec2 = new AWS.EC2({ region: region, credentials: credentials });
    var rds = new AWS.RDS({ region: region, credentials: credentials });
    var s3 = new AWS.S3({ region: region, credentials: credentials });

    // Run the same checks as before, but with the assumed-role clients.
    // runChecksWithClients would be a thin wrapper around the earlier check
    // functions that accepts pre-built clients; it is not shown here.
    return runChecksWithClients(ec2, rds, s3);
  });
}

function checkAllAccounts(accounts, regions) {
  var checks = [];

  accounts.forEach(function (accountId) {
    regions.forEach(function (region) {
      checks.push(
        checkAccountDrift(accountId, region)
          .then(function (drifts) {
            return drifts.map(function (d) {
              d.accountId = accountId;
              d.region = region;
              return d;
            });
          })
          .catch(function (err) {
            console.error("Failed checking account " + accountId + " in " + region + ": " + err.message);
            return [{
              resource: "account-" + accountId,
              type: "CHECK_FAILED",
              detail: err.message,
              severity: "HIGH",
              accountId: accountId,
              region: region
            }];
          })
      );
    });
  });

  return Promise.all(checks).then(function (results) {
    var flat = [];
    results.forEach(function (arr) {
      arr.forEach(function (d) { flat.push(d); });
    });
    return flat;
  });
}

module.exports = {
  assumeRole: assumeRole,
  checkAccountDrift: checkAccountDrift,
  checkAllAccounts: checkAllAccounts
};

Complete Working Example

Here is the full Lambda function that ties everything together. It runs scheduled drift detection across multiple Terraform workspaces using the Terraform Cloud API, checks AWS resources directly, and sends detailed Slack notifications with remediation suggestions.

// index.js - Drift Detection Lambda
var https = require("https");
var url = require("url");

var SLACK_WEBHOOK_URL = process.env.SLACK_WEBHOOK_URL;
var TFC_TOKEN = process.env.TFC_TOKEN;
var TFC_ORG = process.env.TFC_ORG || "my-org";
var WORKSPACES = (process.env.WORKSPACES || "prod-networking,prod-compute,prod-data").split(",");

function makeHttpsRequest(options, postData) {
  return new Promise(function (resolve, reject) {
    var req = https.request(options, function (res) {
      var body = "";
      res.on("data", function (chunk) { body += chunk; });
      res.on("end", function () {
        resolve({ statusCode: res.statusCode, body: body });
      });
    });
    req.on("error", reject);
    if (postData) {
      req.write(typeof postData === "string" ? postData : JSON.stringify(postData));
    }
    req.end();
  });
}

function getWorkspaceId(workspaceName) {
  var options = {
    hostname: "app.terraform.io",
    path: "/api/v2/organizations/" + TFC_ORG + "/workspaces/" + workspaceName,
    method: "GET",
    headers: {
      "Authorization": "Bearer " + TFC_TOKEN,
      "Content-Type": "application/vnd.api+json"
    }
  };

  return makeHttpsRequest(options).then(function (response) {
    var data = JSON.parse(response.body);
    return data.data.id;
  });
}

function triggerSpeculativePlan(workspaceId, workspaceName) {
  var payload = {
    data: {
      attributes: {
        "is-destroy": false,
        message: "Scheduled drift detection - " + new Date().toISOString(),
        "plan-only": true
      },
      type: "runs",
      relationships: {
        workspace: {
          data: { type: "workspaces", id: workspaceId }
        }
      }
    }
  };

  var options = {
    hostname: "app.terraform.io",
    path: "/api/v2/runs",
    method: "POST",
    headers: {
      "Authorization": "Bearer " + TFC_TOKEN,
      "Content-Type": "application/vnd.api+json"
    }
  };

  return makeHttpsRequest(options, payload).then(function (response) {
    var data = JSON.parse(response.body);
    return {
      runId: data.data.id,
      workspace: workspaceName
    };
  });
}

function waitForPlan(runId, maxAttempts) {
  var attempts = 0;
  maxAttempts = maxAttempts || 30;

  function poll() {
    attempts++;
    var options = {
      hostname: "app.terraform.io",
      path: "/api/v2/runs/" + runId,
      method: "GET",
      headers: {
        "Authorization": "Bearer " + TFC_TOKEN,
        "Content-Type": "application/vnd.api+json"
      }
    };

    return makeHttpsRequest(options).then(function (response) {
      var data = JSON.parse(response.body);
      var status = data.data.attributes.status;

      if (status === "planned" || status === "planned_and_finished") {
        return {
          status: status,
          hasChanges: data.data.attributes["has-changes"],
          resourceAdditions: data.data.attributes["resource-additions"] || 0,
          resourceChanges: data.data.attributes["resource-changes"] || 0,
          resourceDestructions: data.data.attributes["resource-destructions"] || 0
        };
      }

      if (status === "errored" || status === "canceled" || status === "force_canceled") {
        return { status: status, hasChanges: false, error: true };
      }

      if (attempts >= maxAttempts) {
        return { status: "timeout", hasChanges: false, error: true };
      }

      return new Promise(function (resolve) {
        setTimeout(function () { resolve(poll()); }, 10000);
      });
    });
  }

  return poll();
}

function generateRemediationSuggestion(workspace, planResult) {
  var suggestions = [];

  if (planResult.resourceAdditions > 0) {
    suggestions.push("Resources need to be created. Someone may have deleted them manually. Run `terraform apply` in " + workspace + " to recreate.");
  }

  if (planResult.resourceChanges > 0) {
    suggestions.push("Resources have been modified outside Terraform. Review the plan output in TFC, then either `terraform apply` to revert or `terraform import` to accept changes.");
  }

  if (planResult.resourceDestructions > 0) {
    suggestions.push("Resources need to be destroyed. New resources may have been created manually. Import them with `terraform import` or remove them via console.");
  }

  if (suggestions.length === 0) {
    suggestions.push("Review the workspace " + workspace + " in Terraform Cloud for details.");
  }

  return suggestions.join("\n");
}

function sendSlackReport(results) {
  var driftedWorkspaces = results.filter(function (r) { return r.planResult && r.planResult.hasChanges; });
  var erroredWorkspaces = results.filter(function (r) { return r.planResult && r.planResult.error; });

  if (driftedWorkspaces.length === 0 && erroredWorkspaces.length === 0) {
    console.log("No drift detected across all workspaces. Skipping Slack notification.");
    return Promise.resolve();
  }

  var blocks = [
    {
      type: "header",
      text: { type: "plain_text", text: ":mag: Drift Detection Report" }
    },
    {
      type: "section",
      text: {
        type: "mrkdwn",
        text: "*Date:* " + new Date().toISOString().split("T")[0] + "\n" +
          "*Workspaces checked:* " + results.length + "\n" +
          "*Drift found:* " + driftedWorkspaces.length + "\n" +
          "*Errors:* " + erroredWorkspaces.length
      }
    },
    { type: "divider" }
  ];

  driftedWorkspaces.forEach(function (r) {
    var pr = r.planResult;
    blocks.push({
      type: "section",
      text: {
        type: "mrkdwn",
        text: ":warning: *" + r.workspace + "*\n" +
          "+" + pr.resourceAdditions + " additions, " +
          "~" + pr.resourceChanges + " changes, " +
          "-" + pr.resourceDestructions + " destructions\n\n" +
          "*Remediation:*\n" + generateRemediationSuggestion(r.workspace, pr)
      }
    });
  });

  erroredWorkspaces.forEach(function (r) {
    blocks.push({
      type: "section",
      text: {
        type: "mrkdwn",
        text: ":x: *" + r.workspace + "* - Plan " + r.planResult.status +
          ". Check workspace logs in Terraform Cloud."
      }
    });
  });

  var payload = { blocks: blocks };
  var parsed = url.parse(SLACK_WEBHOOK_URL);
  var options = {
    hostname: parsed.hostname,
    path: parsed.path,
    method: "POST",
    headers: { "Content-Type": "application/json" }
  };

  return makeHttpsRequest(options, payload);
}

exports.handler = function (event, context) {
  console.log("Starting drift detection for workspaces: " + WORKSPACES.join(", "));

  var workspaceChecks = WORKSPACES.map(function (workspace) {
    workspace = workspace.trim();

    return getWorkspaceId(workspace)
      .then(function (workspaceId) {
        console.log("Triggering plan for " + workspace + " (ID: " + workspaceId + ")");
        return triggerSpeculativePlan(workspaceId, workspace);
      })
      .then(function (run) {
        console.log("Waiting for plan " + run.runId + " in " + workspace);
        return waitForPlan(run.runId).then(function (planResult) {
          return { workspace: workspace, planResult: planResult };
        });
      })
      .catch(function (err) {
        console.error("Error checking " + workspace + ": " + err.message);
        return {
          workspace: workspace,
          planResult: { status: "error", error: true, hasChanges: false }
        };
      });
  });

  return Promise.all(workspaceChecks)
    .then(function (results) {
      console.log("All workspace checks complete. Generating report.");

      var summary = results.map(function (r) {
        return r.workspace + ": " + (r.planResult.hasChanges ? "DRIFT DETECTED" : "OK");
      });
      console.log("Summary:\n" + summary.join("\n"));

      return sendSlackReport(results).then(function () {
        return {
          statusCode: 200,
          body: JSON.stringify({
            workspacesChecked: results.length,
            driftDetected: results.filter(function (r) { return r.planResult.hasChanges; }).length,
            results: results
          })
        };
      });
    });
};

Deploy with the SAM template:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  DriftDetectorFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      Runtime: nodejs18.x
      Timeout: 600
      MemorySize: 256
      Environment:
        Variables:
          SLACK_WEBHOOK_URL: !Ref SlackWebhookUrl
          TFC_TOKEN: !Ref TfcToken
          TFC_ORG: !Ref TfcOrg
          WORKSPACES: "prod-networking,prod-compute,prod-data,staging-networking,staging-compute"
      Events:
        ScheduledCheck:
          Type: Schedule
          Properties:
            Schedule: rate(6 hours)
            Description: Run drift detection every 6 hours

Parameters:
  SlackWebhookUrl:
    Type: String
    NoEcho: true
  TfcToken:
    Type: String
    NoEcho: true
  TfcOrg:
    Type: String
    Default: my-org

Common Issues and Troubleshooting

Terraform state lock prevents plan. If another process holds the state lock, your drift detection plan will fail. Use -lock=false for read-only drift checks in Terraform OSS, or use plan-only runs in Terraform Cloud which do not acquire a state lock. Never use -lock=false in automation that runs apply.

CloudFormation drift detection times out on large stacks. Stacks with hundreds of resources can take 10+ minutes for drift detection to complete. Split large stacks into smaller, focused stacks. This is good practice regardless of drift detection. If splitting is not possible, increase your polling timeout and run detection less frequently.

False positives from AWS-managed changes. AWS sometimes modifies resources on your behalf. For example, ECS may update task definition revisions, or RDS may apply minor engine patches. These show up as drift but are expected. Build an ignore list for known AWS-managed attributes and update it as you encounter new false positives.

Cross-account AssumeRole failures. In multi-account setups, drift detection fails silently when the cross-account role trust policy is misconfigured or the role does not exist in a new account. Always check for and log AssumeRole errors explicitly. Set up an alert for check failures separately from drift alerts so you know when your detection is broken, not just when it finds drift.

Rate limiting on AWS API calls. Checking many resources across many accounts will hit API rate limits. Implement exponential backoff in your SDK configuration and stagger checks across accounts rather than running them all simultaneously. The AWS SDK maxRetries configuration helps, but you also need to pace your requests.
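
With the v2 SDK used in this article, retry behavior is configured globally or per client. The values below are illustrative starting points:

var AWS = require("aws-sdk");

// More retries and a slower exponential backoff for every client created
// after this call; base is the starting delay in milliseconds.
AWS.config.update({
  maxRetries: 8,
  retryDelayOptions: { base: 300 }
});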

Terraform Cloud API rate limits. Terraform Cloud limits API requests to 30 per second. When checking many workspaces, add a delay between triggering plans. The complete example above runs plans in parallel, which works for a handful of workspaces but needs sequential execution with delays for larger organizations.
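
A sketch of that sequential pattern; checkWorkspace here is a hypothetical wrapper around the getWorkspaceId, triggerSpeculativePlan, and waitForPlan chain from the handler above:

function checkWorkspacesSequentially(workspaces, delayMs) {
  var results = [];

  return workspaces.reduce(function (chain, workspace) {
    return chain
      .then(function () { return checkWorkspace(workspace.trim()); })
      .then(function (result) { results.push(result); })
      .then(function () {
        // Pause between workspaces to stay under the API rate limit
        return new Promise(function (resolve) { setTimeout(resolve, delayMs); });
      });
  }, Promise.resolve()).then(function () {
    return results;
  });
}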

Best Practices

  • Run drift detection on a schedule, not just in CI pipelines. Drift happens between deployments. A weekly or daily check catches problems that pipeline-only detection misses entirely.

  • Alert on new drift only, not recurring drift. If you alert every 6 hours on the same unresolved drift, your team will start ignoring the alerts. Track known drift and only notify on newly discovered deviations.

  • Separate detection from remediation. Automatic remediation sounds appealing but is dangerous. A detection system that automatically reverts changes could undo emergency fixes during an active incident. Detect automatically, remediate manually with human approval.

  • Use severity levels to prioritize response. Not all drift is equal. An open SSH port to 0.0.0.0/0 is critical. A missing tag is low priority. Assign severity at detection time and route alerts accordingly, with critical drift going to PagerDuty and low-severity drift going to a weekly report.

  • Version your expected state definitions. The expected state configuration used by your custom drift detector should be in version control, reviewed in PRs, and deployed alongside your IaC. Drift in your drift detector is a real problem.

  • Document intentional drift with expiration dates. When you add an ignore rule for intentional drift, add a comment with the date and the person who approved it. Review ignore rules quarterly and remove stale ones. Intentional drift has a way of becoming permanent drift.

  • Test your drift detection in staging first. Intentionally create drift in staging by making manual changes, then verify your detection catches it. This validates your tool works before you rely on it in production.

  • Keep detection fast enough to run frequently. If drift detection takes 2 hours, you will run it once a day at most. Optimize for speed by checking only the resources that matter most and parallelizing where possible.
