CI/CD for Infrastructure Deployments

Automate infrastructure deployments with CI/CD pipelines for Terraform using GitHub Actions, approval workflows, and drift detection

Overview

Manually running terraform apply from a developer's laptop is how infrastructure breaks at 2 AM on a Friday. CI/CD for infrastructure deployments brings the same discipline we apply to application code — version control, peer review, automated testing, and controlled rollouts — to the resources that run our applications. This article walks through building production-grade pipelines for Terraform using GitHub Actions and Azure Pipelines, with plan previews on pull requests, manual approval gates, environment promotion, drift detection, and rollback strategies.

Prerequisites

  • Terraform 1.5+ installed and basic familiarity with HCL syntax
  • A GitHub or Azure DevOps account with repository access
  • Cloud provider credentials (AWS, Azure, or GCP)
  • Understanding of CI/CD concepts (pipelines, stages, triggers)
  • Node.js 18+ for any scripting examples
  • Basic understanding of remote state backends (S3, Azure Blob, GCS)

Why Automate Infrastructure Deployments

There is a common progression every infrastructure team goes through. First, someone runs Terraform from their laptop. It works. Then a second person joins, and now you have two people running applies against the same state file. One overwrites the other's changes. Someone forgets to pull the latest code before applying. A junior engineer runs terraform destroy in production because their environment variable pointed at the wrong workspace.

Automating infrastructure deployments solves these problems systematically. A CI/CD pipeline becomes the single point of execution. No one runs applies locally. Every change goes through version control, gets reviewed, shows a plan diff, and requires approval before touching production. You get audit trails, repeatable processes, and the confidence that what is in your main branch matches what is actually deployed.

The ROI is not hypothetical. Teams that automate infrastructure deployments report fewer outages caused by configuration drift, faster incident recovery because rollbacks are scripted, and better compliance posture because every change is tracked.

GitOps Workflow for IaC

GitOps treats your Git repository as the single source of truth for infrastructure state. The workflow is straightforward:

  1. A developer creates a feature branch and modifies Terraform configurations
  2. They open a pull request against the main branch
  3. The CI pipeline runs terraform plan and posts the output as a PR comment
  4. Reviewers examine the plan diff alongside the code changes
  5. After approval, the PR merges to main
  6. The CD pipeline runs terraform apply automatically (for non-production) or waits for manual approval (for production)

This model works because infrastructure changes are inherently reviewable. A plan output shows exactly what will be created, modified, or destroyed. Reviewers can catch dangerous changes — like a security group opening port 22 to the world — before they happen.

feature-branch → PR (plan preview) → review → merge → apply (dev)
                                                      → approve → apply (staging)
                                                      → approve → apply (prod)

GitHub Actions Pipeline for Terraform

GitHub Actions is a natural fit for Terraform pipelines because of its tight integration with pull requests. Here is a reusable workflow structure:

name: Terraform CI/CD

on:
  pull_request:
    branches: [main]
    paths:
      - 'infrastructure/**'
  push:
    branches: [main]
    paths:
      - 'infrastructure/**'

permissions:
  contents: read
  pull-requests: write
  id-token: write

env:
  TF_VERSION: '1.6.0'
  WORKING_DIR: './infrastructure'

jobs:
  plan:
    name: Terraform Plan
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Terraform Init
        working-directory: ${{ env.WORKING_DIR }}
        run: terraform init -input=false

      - name: Terraform Validate
        working-directory: ${{ env.WORKING_DIR }}
        run: terraform validate

      - name: Terraform Plan
        id: plan
        working-directory: ${{ env.WORKING_DIR }}
        run: |
          terraform plan -input=false -no-color -out=tfplan 2>&1 | tee plan_output.txt
          echo "plan_exit_code=${PIPESTATUS[0]}" >> $GITHUB_OUTPUT

      - name: Post Plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            var fs = require('fs');
            var planOutput = fs.readFileSync(
              '${{ env.WORKING_DIR }}/plan_output.txt',
              'utf8'
            );

            // Truncate if too long for GitHub comment
            var maxLength = 60000;
            if (planOutput.length > maxLength) {
              planOutput = planOutput.substring(0, maxLength) +
                '\n\n... (truncated, see full output in Actions log)';
            }

            var body = '## Terraform Plan Output\n\n' +
              '<details><summary>Click to expand</summary>\n\n' +
              '```hcl\n' + planOutput + '\n```\n\n' +
              '</details>';

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

The key here is the plan output posted as a PR comment. Reviewers see exactly what Terraform intends to do without needing to run anything locally.

Azure Pipelines for Terraform

Azure Pipelines uses a stage-based model that maps well to infrastructure promotion. The YAML structure differs from GitHub Actions but the concepts are identical:

trigger:
  branches:
    include:
      - main
  paths:
    include:
      - infrastructure/*

pool:
  vmImage: 'ubuntu-latest'

variables:
  terraformVersion: '1.6.0'
  workingDirectory: '$(System.DefaultWorkingDirectory)/infrastructure'

stages:
  - stage: Validate
    displayName: 'Validate & Plan'
    jobs:
      - job: TerraformPlan
        steps:
          - task: TerraformInstaller@1
            inputs:
              terraformVersion: $(terraformVersion)

          - task: TerraformTaskV4@4
            displayName: 'Terraform Init'
            inputs:
              provider: 'azurerm'
              command: 'init'
              workingDirectory: $(workingDirectory)
              backendServiceArm: 'azure-service-connection'
              backendAzureRmResourceGroupName: 'tfstate-rg'
              backendAzureRmStorageAccountName: 'tfstatestorage'
              backendAzureRmContainerName: 'tfstate'
              backendAzureRmKey: 'terraform.tfstate'

          - task: TerraformTaskV4@4
            displayName: 'Terraform Plan'
            inputs:
              provider: 'azurerm'
              command: 'plan'
              workingDirectory: $(workingDirectory)
              environmentServiceNameAzureRM: 'azure-service-connection'
              commandOptions: '-out=tfplan -input=false'

          - task: PublishPipelineArtifact@1
            displayName: 'Publish Plan Artifact'
            inputs:
              targetPath: '$(workingDirectory)/tfplan'
              artifact: 'terraform-plan'

  - stage: DeployDev
    displayName: 'Deploy to Dev'
    dependsOn: Validate
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: ApplyDev
        environment: 'terraform-dev'
        strategy:
          runOnce:
            deploy:
              steps:
                - download: current
                  artifact: terraform-plan
                - script: |
                    cd $(workingDirectory)
                    terraform init -input=false
                    terraform apply -input=false "$(Pipeline.Workspace)/terraform-plan/tfplan"

  - stage: DeployProd
    displayName: 'Deploy to Production'
    dependsOn: DeployDev
    jobs:
      - deployment: ApplyProd
        environment: 'terraform-prod'
        strategy:
          runOnce:
            deploy:
              steps:
                - script: |
                    cd $(workingDirectory)
                    terraform init -input=false
                    terraform workspace select prod
                    terraform plan -input=false -out=prodplan
                    terraform apply -auto-approve -input=false prodplan

In Azure Pipelines, the environment resource handles approval gates. You configure required approvers on the terraform-prod environment in the Azure DevOps project settings.

Plan Approval Workflows

The plan-then-apply workflow is the backbone of safe infrastructure deployment. There are two approval patterns worth implementing:

PR-level approval: Reviewers approve the pull request after examining the plan output. Merging triggers the apply. This works well for development and staging environments where the blast radius is limited.

Deployment-level approval: After the plan runs on the main branch, a designated approver must explicitly approve the apply step. GitHub Actions supports this through environments with required reviewers. This is mandatory for production.

# GitHub Actions environment-based approval
jobs:
  apply-prod:
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://console.aws.amazon.com
    needs: [plan-prod]
    steps:
      - name: Terraform Apply
        working-directory: ${{ env.WORKING_DIR }}
        run: terraform apply -auto-approve tfplan

Configure the production environment in your repository settings with required reviewers. The pipeline will pause and wait for approval before executing the apply.
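
If you prefer to script that configuration, the environments REST API supports it. A hedged sketch using the gh CLI, with OWNER, REPO, and the reviewer team ID as placeholders:

# Create or update the 'production' environment with a required reviewer team
gh api --method PUT "repos/OWNER/REPO/environments/production" \
  --input - <<'EOF'
{
  "reviewers": [
    { "type": "Team", "id": 1234567 }
  ]
}
EOF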

Environment Promotion (Dev to Staging to Prod)

Environment promotion for infrastructure should mirror your application deployment strategy. Use Terraform workspaces or separate state files per environment, with variables that differ between environments:

# environments/dev.tfvars
instance_type  = "t3.small"
min_capacity   = 1
max_capacity   = 2
enable_waf     = false
alert_email    = "dev-alerts@example.com"

# environments/prod.tfvars
instance_type  = "t3.xlarge"
min_capacity   = 3
max_capacity   = 10
enable_waf     = true
alert_email    = "prod-alerts@example.com"

The pipeline selects the correct vars file based on the target environment:

- name: Terraform Plan
  run: |
    terraform plan \
      -var-file="environments/${{ matrix.environment }}.tfvars" \
      -out=tfplan \
      -input=false

Use a matrix strategy to plan across all environments simultaneously but apply sequentially with gates between them. This catches environment-specific issues early.
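
A minimal sketch of that matrix, wrapping the plan step above:

jobs:
  plan:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [dev, staging, prod]
    steps:
      # ... checkout, credentials, and terraform init as shown earlier
      - name: Terraform Plan
        run: |
          terraform plan \
            -var-file="environments/${{ matrix.environment }}.tfvars" \
            -out=tfplan \
            -input=false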

State Locking in CI

State locking prevents concurrent pipeline runs from corrupting your Terraform state. All the major remote backends support it, and it is non-negotiable in CI environments.

For S3 backends, DynamoDB provides the lock table:

terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "infrastructure/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

For Azure, the blob lease mechanism handles locking automatically. For GCS, the backend writes a lock file alongside the state object to do the same.
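
For reference, a matching Azure backend block, reusing the storage account names from the Azure Pipelines example above:

terraform {
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"
    storage_account_name = "tfstatestorage"
    container_name       = "tfstate"
    key                  = "terraform.tfstate"
  }
}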

In CI, you also need pipeline-level concurrency control. Two merged PRs should not trigger simultaneous applies:

# GitHub Actions concurrency control
concurrency:
  group: terraform-${{ github.ref }}
  cancel-in-progress: false

Setting cancel-in-progress: false is critical. You never want to cancel a running Terraform apply — that leaves resources in a partially created state.

Secrets Management in Pipelines

Infrastructure pipelines need access to cloud credentials, API keys, database passwords, and other sensitive values. The rules are simple:

  1. Never store secrets in Terraform code or state files
  2. Use your CI platform's secrets store (GitHub Secrets, Azure DevOps Variable Groups)
  3. Prefer OIDC/workload identity federation over long-lived credentials
  4. Use a secrets manager (HashiCorp Vault, AWS Secrets Manager) for values that Terraform provisions

OIDC eliminates static credentials entirely. Here is the GitHub Actions configuration for AWS:

- name: Configure AWS Credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789012:role/github-actions-terraform
    aws-region: us-east-1
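
The trust side of that role lives in IAM. A sketch of the trust policy in Terraform, assuming the GitHub OIDC provider is already registered in the account and reusing the repository name from earlier examples:

data "aws_iam_policy_document" "github_actions_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only this repository may assume the role
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:company/infrastructure:*"]
    }
  }
}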

For sensitive Terraform variables, pass them through environment variables:

- name: Terraform Plan
  env:
    TF_VAR_database_password: ${{ secrets.DB_PASSWORD }}
    TF_VAR_api_key: ${{ secrets.EXTERNAL_API_KEY }}
  run: terraform plan -input=false -out=tfplan

Terraform recognizes TF_VAR_ prefixed environment variables automatically. This keeps secrets out of command-line arguments where they would appear in process listings and logs.
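
On the Terraform side, declare the matching variables as sensitive so their values are redacted from plan and apply output:

variable "database_password" {
  type      = string
  sensitive = true
}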

Rollback Strategies

Rolling back infrastructure is harder than rolling back application code. There is no terraform rollback command. You have several options depending on the situation:

Git revert: The simplest approach. Revert the commit that introduced the bad change and let the pipeline apply the previous configuration. This works for additive changes and modifications but can fail for destructive changes where resources have already been deleted.
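
In practice that is a short sequence; doing the revert through a PR gives the rollback its own plan preview (the SHA below is a placeholder):

git checkout -b revert/bad-change
git revert abc1234                       # the commit that broke things
git push -u origin revert/bad-change     # open a PR; merging triggers the apply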

State-based rollback: If you version your state files (and you should), you can restore a previous state and run apply against the current code. This is a last resort because it can create drift between code and state.
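
For an S3 backend with bucket versioning enabled, a restore might look like this sketch (bucket and key match the backend example earlier; the version ID placeholder stays yours to fill in):

# Find and fetch the state version you want
aws s3api list-object-versions \
  --bucket company-terraform-state \
  --prefix infrastructure/terraform.tfstate

aws s3api get-object \
  --bucket company-terraform-state \
  --key infrastructure/terraform.tfstate \
  --version-id <VERSION_ID> previous.tfstate

# Push it back; -force is needed because the restored serial is older
terraform state push -force previous.tfstate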

Blue-green infrastructure: For critical resources, maintain two parallel stacks and switch traffic between them. This is more expensive but provides instant rollback:

variable "active_stack" {
  description = "Which stack receives traffic: blue or green"
  type        = string
  default     = "blue"
}

resource "aws_lb_target_group_attachment" "active" {
  target_group_arn = aws_lb_target_group.main.arn
  target_id        = var.active_stack == "blue" ? (
    aws_instance.blue.id
  ) : (
    aws_instance.green.id
  )
}

Feature flags in Terraform: Use count or for_each with feature toggle variables to enable or disable resources without destroying them:

resource "aws_waf_web_acl" "main" {
  count = var.enable_waf ? 1 : 0
  # ...
}

Drift Detection in CI

Configuration drift happens when someone makes changes through the console, another tool modifies resources, or an auto-scaling event creates new instances. Scheduled drift detection catches these discrepancies before they cause problems.

name: Drift Detection

on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 6 AM UTC
  workflow_dispatch: {}

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        working-directory: ./infrastructure
        run: terraform init -input=false

      - name: Detect Drift
        id: drift
        working-directory: ./infrastructure
        run: |
          terraform plan -detailed-exitcode -input=false 2>&1 | tee drift_output.txt
          EXIT_CODE=${PIPESTATUS[0]}
          echo "exit_code=$EXIT_CODE" >> $GITHUB_OUTPUT
          # Exit code 2 means drift detected
          if [ $EXIT_CODE -eq 2 ]; then
            echo "drift_detected=true" >> $GITHUB_OUTPUT
          else
            echo "drift_detected=false" >> $GITHUB_OUTPUT
          fi

      - name: Notify on Drift
        if: steps.drift.outputs.drift_detected == 'true'
        uses: actions/github-script@v7
        with:
          script: |
            var fs = require('fs');
            var driftOutput = fs.readFileSync('./infrastructure/drift_output.txt', 'utf8');

            github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: 'Infrastructure Drift Detected - ' + new Date().toISOString().split('T')[0],
              body: '## Drift Report\n\n```\n' + driftOutput + '\n```\n\n' +
                'Review and either update Terraform config or reconcile manually.',
              labels: ['infrastructure', 'drift']
            });

The -detailed-exitcode flag is the key. Exit code 0 means no changes, exit code 1 means an error, and exit code 2 means drift detected. Your pipeline should treat exit code 2 as a warning, not a failure.

Pull Request Previews with Terraform Plan

Posting plan output to pull requests is the single most impactful improvement you can make to your infrastructure workflow. Reviewers should not have to run Terraform locally to understand the impact of a change.

A good PR comment includes:

  • A summary of resources to be added, changed, or destroyed
  • The full plan output in a collapsible section
  • A warning if any resources will be destroyed
  • A link to the full pipeline log

The helper below assembles that comment body:

// scripts/format-plan-comment.js
var fs = require('fs');

function formatPlanComment(planFile, prNumber) {
  var planOutput = fs.readFileSync(planFile, 'utf8');

  var addCount = (planOutput.match(/will be created/g) || []).length;
  var changeCount = (planOutput.match(/will be updated/g) || []).length;
  var destroyCount = (planOutput.match(/will be destroyed/g) || []).length;

  var summary = '| Action | Count |\n|--------|-------|\n' +
    '| Create | ' + addCount + ' |\n' +
    '| Update | ' + changeCount + ' |\n' +
    '| Destroy | ' + destroyCount + ' |';

  var warning = '';
  if (destroyCount > 0) {
    warning = '\n\n> **Warning**: This plan includes resource destruction. ' +
      'Review carefully before approving.\n';
  }

  var body = '## Terraform Plan — PR #' + prNumber + '\n\n' +
    summary + warning +
    '\n\n<details><summary>Full Plan Output</summary>\n\n' +
    '```hcl\n' + planOutput + '\n```\n\n</details>';

  return body;
}

module.exports = { formatPlanComment: formatPlanComment };

CDK Pipeline Self-Mutation

AWS CDK Pipelines take a different approach — the pipeline updates itself. When you change the pipeline definition in your CDK code, the pipeline detects the change and mutates its own structure before deploying your infrastructure:

// lib/pipeline-stack.js
const cdk = require('aws-cdk-lib');
const pipelines = require('aws-cdk-lib/pipelines');

class PipelineStack extends cdk.Stack {
  constructor(scope, id, props) {
    super(scope, id, props);

    const pipeline = new pipelines.CodePipeline(this, 'InfraPipeline', {
      pipelineName: 'InfrastructurePipeline',
      synth: new pipelines.ShellStep('Synth', {
        input: pipelines.CodePipelineSource.gitHub(
          'company/infrastructure',
          'main'
        ),
        commands: [
          'npm ci',
          'npx cdk synth'
        ]
      }),
      selfMutation: true
    });

    // Add deployment stages. DeployStage is a cdk.Stage subclass
    // (defined elsewhere) that instantiates the infrastructure
    // stacks for one environment.
    pipeline.addStage(new DeployStage(this, 'Dev', {
      env: { account: '111111111111', region: 'us-east-1' }
    }));

    pipeline.addStage(new DeployStage(this, 'Prod', {
      env: { account: '222222222222', region: 'us-east-1' }
    }), {
      pre: [
        new pipelines.ManualApprovalStep('PromoteToProd', {
          comment: 'Review the Dev deployment before promoting to Production'
        })
      ]
    });
  }
}

module.exports = { PipelineStack };

Self-mutation means you never manually update your pipeline. Change the CDK code, push it, and the pipeline updates itself then deploys your infrastructure. It is elegant but requires you to be comfortable with a pipeline that modifies its own execution.

Compliance Gates

Regulated industries need proof that infrastructure changes meet compliance requirements. Add policy-as-code checks as pipeline stages:

- name: OPA Policy Check
  run: |
    terraform show -json tfplan > plan.json
    opa eval \
      --data policies/ \
      --input plan.json \
      'data.terraform.deny[msg]' \
      --format pretty

- name: Checkov Security Scan
  run: |
    pip install checkov
    checkov -d . --framework terraform --soft-fail-on LOW

- name: Cost Estimation
  run: |
    infracost breakdown \
      --path=. \
      --format=json \
      --out-file=/tmp/infracost.json
    infracost comment github \
      --path=/tmp/infracost.json \
      --repo=${{ github.repository }} \
      --pull-request=${{ github.event.pull_request.number }} \
      --github-token=${{ secrets.GITHUB_TOKEN }}

OPA (Open Policy Agent) evaluates custom rules against the plan JSON. Checkov scans for security misconfigurations. Infracost estimates the cost impact. Together, they form a compliance gate that blocks non-compliant changes before they reach production.
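
For a concrete sense of the OPA piece, here is a sketch of a custom rule matching the deny query above, using the rego.v1 syntax of recent OPA releases and the resource_changes shape of Terraform's plan JSON (the file name is arbitrary):

# policies/security_groups.rego
package terraform

import rego.v1

# Flag any security group change that opens SSH to the world
deny contains msg if {
  rc := input.resource_changes[_]
  rc.type == "aws_security_group"
  ingress := rc.change.after.ingress[_]
  ingress.from_port <= 22
  ingress.to_port >= 22
  ingress.cidr_blocks[_] == "0.0.0.0/0"
  msg := sprintf("%s opens port 22 to 0.0.0.0/0", [rc.address])
}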

Notification and Audit Trails

Every infrastructure change should generate notifications and maintain an audit trail. Slack notifications on apply, audit logs in a central location, and change records that compliance teams can query:

- name: Notify Slack
  if: always()
  uses: actions/github-script@v7
  with:
    script: |
      var https = require('https');
      var status = '${{ job.status }}';
      var emoji = status === 'success' ? 'white_check_mark' : 'x';
      var color = status === 'success' ? '#36a64f' : '#dc3545';

      var payload = JSON.stringify({
        attachments: [{
          color: color,
          blocks: [{
            type: 'section',
            text: {
              type: 'mrkdwn',
              text: ':' + emoji + ': *Terraform Apply ' + status.toUpperCase() + '*\n' +
                '*Environment:* production\n' +
                '*Triggered by:* ${{ github.actor }}\n' +
                '*Commit:* <${{ github.event.head_commit.url }}|' +
                '${{ github.sha }}'.substring(0, 7) + '>'
            }
          }]
        }]
      });

      var options = {
        hostname: 'hooks.slack.com',
        path: '/services/${{ secrets.SLACK_WEBHOOK_PATH }}',
        method: 'POST',
        headers: { 'Content-Type': 'application/json' }
      };

      var req = https.request(options);
      req.write(payload);
      req.end();

For audit trails, write apply results to a dedicated log store. CloudWatch Logs, Azure Monitor, or even a simple append to an S3 object work. The goal is an immutable record of who changed what, when, and why.
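
A minimal sketch of the S3 approach as a step after the apply; the bucket name and key layout are assumptions:

- name: Record Audit Entry
  if: always()
  run: |
    printf '{"actor":"%s","commit":"%s","status":"%s","timestamp":"%s"}\n' \
      "$GITHUB_ACTOR" "$GITHUB_SHA" "${{ job.status }}" "$(date -u +%FT%TZ)" > record.json
    aws s3 cp record.json \
      "s3://company-terraform-audit/applies/$(date -u +%Y/%m/%d)/$GITHUB_RUN_ID.json"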

Complete Working Example

Here is a full GitHub Actions pipeline that ties everything together. It provides plan comments on PRs, manual approval for production, environment promotion, and scheduled drift detection:

name: Infrastructure Pipeline

on:
  pull_request:
    branches: [main]
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']
  schedule:
    - cron: '0 6 * * 1-5'  # Weekday drift detection
  workflow_dispatch:
    inputs:
      environment:
        description: 'Target environment'
        required: true
        default: 'dev'
        type: choice
        options: [dev, staging, prod]

permissions:
  contents: read
  pull-requests: write
  issues: write
  id-token: write

env:
  TF_VERSION: '1.6.0'
  TF_DIR: './infrastructure'

jobs:
  # ---------- Plan on Pull Requests ----------
  plan:
    name: 'Plan (${{ matrix.env }})'
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    strategy:
      matrix:
        env: [dev, staging, prod]
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Init
        working-directory: ${{ env.TF_DIR }}
        run: |
          terraform init -input=false \
            -backend-config="key=${{ matrix.env }}/terraform.tfstate"

      - name: Plan
        id: plan
        working-directory: ${{ env.TF_DIR }}
        run: |
          terraform plan \
            -var-file="environments/${{ matrix.env }}.tfvars" \
            -input=false \
            -no-color \
            -out=tfplan 2>&1 | tee plan.txt
          exit ${PIPESTATUS[0]}

      - name: Comment on PR
        uses: actions/github-script@v7
        with:
          script: |
            var fs = require('fs');
            var plan = fs.readFileSync('${{ env.TF_DIR }}/plan.txt', 'utf8');
            var env = '${{ matrix.env }}';

            if (plan.length > 60000) {
              plan = plan.substring(0, 60000) + '\n... truncated';
            }

            var body = '## Terraform Plan — `' + env + '`\n\n' +
              '<details><summary>Show plan</summary>\n\n' +
              '```hcl\n' + plan + '\n```\n</details>';

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

  # ---------- Apply to Dev (auto on merge) ----------
  apply-dev:
    name: 'Apply Dev'
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    environment: dev
    concurrency:
      group: terraform-dev
      cancel-in-progress: false
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Init & Apply
        working-directory: ${{ env.TF_DIR }}
        run: |
          terraform init -input=false \
            -backend-config="key=dev/terraform.tfstate"
          terraform apply \
            -var-file="environments/dev.tfvars" \
            -input=false \
            -auto-approve

  # ---------- Apply to Staging (manual gate) ----------
  apply-staging:
    name: 'Apply Staging'
    runs-on: ubuntu-latest
    needs: [apply-dev]
    environment: staging
    concurrency:
      group: terraform-staging
      cancel-in-progress: false
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Init & Apply
        working-directory: ${{ env.TF_DIR }}
        run: |
          terraform init -input=false \
            -backend-config="key=staging/terraform.tfstate"
          terraform apply \
            -var-file="environments/staging.tfvars" \
            -input=false \
            -auto-approve

  # ---------- Apply to Production (manual gate) ----------
  apply-prod:
    name: 'Apply Production'
    runs-on: ubuntu-latest
    needs: [apply-staging]
    environment:
      name: production
      url: https://console.aws.amazon.com
    concurrency:
      group: terraform-prod
      cancel-in-progress: false
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Init & Apply
        working-directory: ${{ env.TF_DIR }}
        run: |
          terraform init -input=false \
            -backend-config="key=prod/terraform.tfstate"
          terraform apply \
            -var-file="environments/prod.tfvars" \
            -input=false \
            -auto-approve

      - name: Notify Success
        if: success()
        uses: actions/github-script@v7
        with:
          script: |
            var https = require('https');
            var payload = JSON.stringify({
              text: 'Production infrastructure deployed successfully by ' +
                '${{ github.actor }} — commit ' + '${{ github.sha }}'.substring(0, 7)
            });
            var options = {
              hostname: 'hooks.slack.com',
              path: '/services/${{ secrets.SLACK_WEBHOOK_PATH }}',
              method: 'POST',
              headers: { 'Content-Type': 'application/json' }
            };
            var req = https.request(options);
            req.write(payload);
            req.end();

  # ---------- Drift Detection (scheduled) ----------
  drift-detection:
    name: 'Drift Detection (${{ matrix.env }})'
    runs-on: ubuntu-latest
    if: github.event_name == 'schedule'
    strategy:
      matrix:
        env: [dev, staging, prod]
    steps:
      - uses: actions/checkout@v4

      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Init
        working-directory: ${{ env.TF_DIR }}
        run: |
          terraform init -input=false \
            -backend-config="key=${{ matrix.env }}/terraform.tfstate"

      - name: Check for Drift
        id: drift
        working-directory: ${{ env.TF_DIR }}
        continue-on-error: true
        run: |
          set +e
          terraform plan \
            -var-file="environments/${{ matrix.env }}.tfvars" \
            -detailed-exitcode \
            -input=false \
            -no-color 2>&1 | tee drift.txt
          EXIT_CODE=${PIPESTATUS[0]}  # the plan's exit code, not tee's
          echo "exit_code=$EXIT_CODE" >> $GITHUB_OUTPUT
          exit 0

      - name: Create Issue on Drift
        if: steps.drift.outputs.exit_code == '2'
        uses: actions/github-script@v7
        with:
          script: |
            var fs = require('fs');
            var drift = fs.readFileSync('${{ env.TF_DIR }}/drift.txt', 'utf8');
            var env = '${{ matrix.env }}';
            var today = new Date().toISOString().split('T')[0];

            github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: 'Drift detected in ' + env + ' — ' + today,
              body: '## Infrastructure Drift Report\n\n' +
                '**Environment:** `' + env + '`\n' +
                '**Detected:** ' + today + '\n\n' +
                '<details><summary>Drift Details</summary>\n\n' +
                '```\n' + drift + '\n```\n</details>\n\n' +
                'Investigate and reconcile this drift.',
              labels: ['infrastructure', 'drift', env]
            });

This pipeline gives you the full lifecycle: plan on every PR, auto-deploy to dev on merge, manual approval for staging and production, Slack notifications on production deploys, and weekday drift detection that creates GitHub issues when discrepancies are found.

Common Issues & Troubleshooting

State lock timeout in CI: Pipeline runs can timeout waiting for a state lock if a previous run crashed mid-apply. The fix is terraform force-unlock <LOCK_ID>, but never automate this. A stuck lock means something went wrong, and you need to investigate. Add a timeout to your apply step and alert on failure so a human can intervene.
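
In GitHub Actions, that timeout is a single key on the step:

- name: Terraform Apply
  timeout-minutes: 60   # surface a stuck lock instead of hanging
  working-directory: ${{ env.TF_DIR }}
  run: terraform apply -input=false tfplan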

Plan output differs between PR and apply: If time passes between the plan on a PR and the apply after merge, the plan can become stale. Other merges, external changes, or even time-based resources (certificates, expiring tokens) can cause divergence. The solution is to re-run terraform plan immediately before apply and fail if the plan does not match expectations. Some teams save the plan file as a pipeline artifact and apply that exact plan.
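
A sketch of that artifact handoff in GitHub Actions (the artifact name is arbitrary; the apply job must still run terraform init before applying):

# In the plan job
- uses: actions/upload-artifact@v4
  with:
    name: tfplan
    path: infrastructure/tfplan

# In the apply job
- uses: actions/download-artifact@v4
  with:
    name: tfplan
    path: infrastructure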

Credentials expiring mid-apply: Large infrastructure changes can take 30+ minutes. If your cloud credentials have a short session duration, the apply will fail partway through. For AWS OIDC roles, increase role-duration-seconds on the credentials action to 3600 (1 hour), provided the role's maximum session duration allows it. For Azure service principals, ensure the token lifetime accommodates your longest apply.
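
In GitHub Actions that is one input on the credentials step:

- uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
    aws-region: us-east-1
    role-duration-seconds: 3600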

Terraform init fails with backend configuration changes: When you modify backend configuration, terraform init requires the -reconfigure or -migrate-state flag. CI pipelines usually run init with -input=false, which causes an immediate failure when backend changes are detected. Handle this by checking for backend changes in your pipeline and conditionally adding -reconfigure.
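
One hedged approach is a simple fallback that retries init with -reconfigure when the plain init fails; note that -reconfigure discards the old backend configuration rather than migrating state, so use -migrate-state when the state itself must move:

- name: Terraform Init
  working-directory: ${{ env.TF_DIR }}
  run: |
    terraform init -input=false || terraform init -input=false -reconfigure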

Provider version conflicts: Pinning provider versions in your required_providers block is essential. Without it, terraform init in CI may download a newer provider version that introduces breaking changes. Always use version constraints:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.30"
    }
  }
}

Parallel applies across environments corrupting state: If your dev and prod environments share a state file (do not do this), parallel matrix runs will corrupt it. Always use separate state files per environment, either through workspaces or distinct backend keys.

Best Practices

  • Never run terraform apply locally against shared environments. The CI pipeline is the only path to deployment. Remove cloud credentials from developer machines if necessary. Local applies are for personal dev sandboxes only.

  • Pin every version: Terraform, providers, and modules. Use .terraform.lock.hcl and commit it to version control. This ensures deterministic plans across all environments and team members.

  • Use separate state files per environment, not workspaces for environment separation. Workspaces are better suited for feature branches or ephemeral infrastructure. For long-lived environments, separate state files with separate backend configurations provide cleaner isolation.

  • Implement plan file passing between stages. Generate the plan in one stage, save it as an artifact, and apply that exact plan in the next stage. This eliminates drift between plan and apply and ensures reviewers approved exactly what gets deployed.

  • Add cost estimation to your pipeline. Tools like Infracost show the monthly cost impact of every PR. Engineers make better decisions when they see that their change adds $500/month to the cloud bill.

  • Run terraform fmt -check and terraform validate on every PR. Formatting inconsistencies and syntax errors should never make it to the plan stage. Catch them early; a step sketch follows this list.

  • Set concurrency controls on apply jobs. Two applies running simultaneously against the same environment will corrupt state at best and create conflicting resources at worst. Use concurrency groups with cancel-in-progress: false.

  • Treat drift detection alerts as incidents. Drift means someone bypassed the pipeline or an external process modified your infrastructure. Investigate the root cause, do not just re-apply to make the alert go away.

  • Keep Terraform runs fast by splitting large configurations into smaller root modules. A single root module managing 500 resources takes 10 minutes to plan. Split it into logical components (networking, compute, database) with separate state files. Each runs faster, and changes to networking do not require planning against every EC2 instance.

  • Rotate and audit pipeline credentials quarterly. Even with OIDC, the trust relationship between your CI provider and cloud account needs periodic review. Audit which repositories can assume which roles and tighten permissions using least-privilege policies.
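
A minimal sketch of the fmt and validate checks from the list above, as GitHub Actions steps:

- name: Format Check
  working-directory: ${{ env.TF_DIR }}
  run: terraform fmt -check -recursive

- name: Validate
  working-directory: ${{ env.TF_DIR }}
  run: terraform validate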
