CI/CD for Infrastructure Deployments
Automate infrastructure deployments with CI/CD pipelines for Terraform using GitHub Actions, approval workflows, and drift detection
Overview
Manually running terraform apply from a developer's laptop is how infrastructure breaks at 2 AM on a Friday. CI/CD for infrastructure deployments brings the same discipline we apply to application code — version control, peer review, automated testing, and controlled rollouts — to the resources that run our applications. This article walks through building production-grade pipelines for Terraform using GitHub Actions and Azure Pipelines, with plan previews on pull requests, manual approval gates, environment promotion, drift detection, and rollback strategies.
Prerequisites
- Terraform 1.5+ installed and basic familiarity with HCL syntax
- A GitHub or Azure DevOps account with repository access
- Cloud provider credentials (AWS, Azure, or GCP)
- Understanding of CI/CD concepts (pipelines, stages, triggers)
- Node.js 18+ for any scripting examples
- Basic understanding of remote state backends (S3, Azure Blob, GCS)
Why Automate Infrastructure Deployments
There is a common progression every infrastructure team goes through. First, someone runs Terraform from their laptop. It works. Then a second person joins, and now you have two people running applies against the same state file. One overwrites the other's changes. Someone forgets to pull the latest code before applying. A junior engineer runs terraform destroy in production because their environment variable pointed at the wrong workspace.
Automating infrastructure deployments solves these problems systematically. A CI/CD pipeline becomes the single point of execution. No one runs applies locally. Every change goes through version control, gets reviewed, shows a plan diff, and requires approval before touching production. You get audit trails, repeatable processes, and the confidence that what is in your main branch matches what is actually deployed.
The ROI is not hypothetical. Teams that automate infrastructure deployments report fewer outages caused by configuration drift, faster incident recovery because rollbacks are scripted, and better compliance posture because every change is tracked.
GitOps Workflow for IaC
GitOps treats your Git repository as the single source of truth for infrastructure state. The workflow is straightforward:
- A developer creates a feature branch and modifies Terraform configurations
- They open a pull request against the main branch
- The CI pipeline runs `terraform plan` and posts the output as a PR comment
- Reviewers examine the plan diff alongside the code changes
- After approval, the PR merges to main
- The CD pipeline runs `terraform apply` automatically (for non-production) or waits for manual approval (for production)
This model works because infrastructure changes are inherently reviewable. A plan output shows exactly what will be created, modified, or destroyed. Reviewers can catch dangerous changes — like a security group opening port 22 to the world — before they happen.
feature-branch → PR (plan preview) → review → merge → apply (dev)
→ approve → apply (staging)
→ approve → apply (prod)
GitHub Actions Pipeline for Terraform
GitHub Actions is a natural fit for Terraform pipelines because of its tight integration with pull requests. Here is a reusable workflow structure:
name: Terraform CI/CD
on:
pull_request:
branches: [main]
paths:
- 'infrastructure/**'
push:
branches: [main]
paths:
- 'infrastructure/**'
permissions:
contents: read
pull-requests: write
id-token: write
env:
TF_VERSION: '1.6.0'
WORKING_DIR: './infrastructure'
jobs:
plan:
name: Terraform Plan
runs-on: ubuntu-latest
if: github.event_name == 'pull_request'
steps:
- uses: actions/checkout@v4
- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: us-east-1
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ env.TF_VERSION }}
- name: Terraform Init
working-directory: ${{ env.WORKING_DIR }}
run: terraform init -input=false
- name: Terraform Validate
working-directory: ${{ env.WORKING_DIR }}
run: terraform validate
- name: Terraform Plan
id: plan
working-directory: ${{ env.WORKING_DIR }}
run: |
terraform plan -input=false -no-color -out=tfplan 2>&1 | tee plan_output.txt
echo "plan_exit_code=$?" >> $GITHUB_OUTPUT
- name: Post Plan to PR
uses: actions/github-script@v7
with:
script: |
var fs = require('fs');
var planOutput = fs.readFileSync(
'${{ env.WORKING_DIR }}/plan_output.txt',
'utf8'
);
// Truncate if too long for GitHub comment
var maxLength = 60000;
if (planOutput.length > maxLength) {
planOutput = planOutput.substring(0, maxLength) +
'\n\n... (truncated, see full output in Actions log)';
}
var body = '## Terraform Plan Output\n\n' +
'<details><summary>Click to expand</summary>\n\n' +
'```hcl\n' + planOutput + '\n```\n\n' +
'</details>';
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: body
});
The key here is the plan output posted as a PR comment. Reviewers see exactly what Terraform intends to do without needing to run anything locally.
Azure Pipelines for Terraform
Azure Pipelines uses a stage-based model that maps well to infrastructure promotion. The YAML structure differs from GitHub Actions but the concepts are identical:
trigger:
branches:
include:
- main
paths:
include:
- infrastructure/*
pool:
vmImage: 'ubuntu-latest'
variables:
terraformVersion: '1.6.0'
workingDirectory: '$(System.DefaultWorkingDirectory)/infrastructure'
stages:
- stage: Validate
displayName: 'Validate & Plan'
jobs:
- job: TerraformPlan
steps:
- task: TerraformInstaller@1
inputs:
terraformVersion: $(terraformVersion)
- task: TerraformTaskV4@4
displayName: 'Terraform Init'
inputs:
provider: 'azurerm'
command: 'init'
workingDirectory: $(workingDirectory)
backendServiceArm: 'azure-service-connection'
backendAzureRmResourceGroupName: 'tfstate-rg'
backendAzureRmStorageAccountName: 'tfstatestorage'
backendAzureRmContainerName: 'tfstate'
backendAzureRmKey: 'terraform.tfstate'
- task: TerraformTaskV4@4
displayName: 'Terraform Plan'
inputs:
provider: 'azurerm'
command: 'plan'
workingDirectory: $(workingDirectory)
environmentServiceNameAzureRM: 'azure-service-connection'
commandOptions: '-out=tfplan -input=false'
- task: PublishPipelineArtifact@1
displayName: 'Publish Plan Artifact'
inputs:
targetPath: '$(workingDirectory)/tfplan'
artifact: 'terraform-plan'
- stage: DeployDev
displayName: 'Deploy to Dev'
dependsOn: Validate
condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
jobs:
- deployment: ApplyDev
environment: 'terraform-dev'
strategy:
runOnce:
deploy:
steps:
- checkout: self  # deployment jobs skip source checkout by default
- download: current
  artifact: terraform-plan
- script: |
    cd $(workingDirectory)
    terraform init -input=false
    # The downloaded artifact lands under $(Pipeline.Workspace), not the source tree
    terraform apply -auto-approve -input=false $(Pipeline.Workspace)/terraform-plan/tfplan
- stage: DeployProd
displayName: 'Deploy to Production'
dependsOn: DeployDev
jobs:
- deployment: ApplyProd
environment: 'terraform-prod'
strategy:
runOnce:
deploy:
steps:
- checkout: self  # deployment jobs skip source checkout by default
- script: |
    cd $(workingDirectory)
    terraform init -input=false
    terraform workspace select prod
    terraform plan -input=false -out=prodplan
    terraform apply -auto-approve -input=false prodplan
In Azure Pipelines, the environment resource handles approval gates. You configure required approvers on the terraform-prod environment in the Azure DevOps project settings.
Plan Approval Workflows
The plan-then-apply workflow is the backbone of safe infrastructure deployment. There are two approval patterns worth implementing:
PR-level approval: Reviewers approve the pull request after examining the plan output. Merging triggers the apply. This works well for development and staging environments where the blast radius is limited.
Deployment-level approval: After the plan runs on the main branch, a designated approver must explicitly approve the apply step. GitHub Actions supports this through environments with required reviewers. This is mandatory for production.
# GitHub Actions environment-based approval
jobs:
apply-prod:
runs-on: ubuntu-latest
environment:
name: production
url: https://console.aws.amazon.com
needs: [plan-prod]
steps:
- name: Terraform Apply
working-directory: ${{ env.WORKING_DIR }}
run: terraform apply -auto-approve tfplan
Configure the production environment in your repository settings with required reviewers. The pipeline will pause and wait for approval before executing the apply.
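If you prefer to manage that setting as code rather than clicking through the UI, GitHub's REST API exposes environments. A minimal sketch using the `gh` CLI — the reviewer user ID is a placeholder you would replace with a real one:

```bash
# Sketch: configure a required reviewer on the "production" environment.
# OWNER/REPO and the user ID (12345678) are placeholders.
gh api --method PUT "repos/${OWNER}/${REPO}/environments/production" \
  --input - <<'EOF'
{
  "reviewers": [
    { "type": "User", "id": 12345678 }
  ]
}
EOF
```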
Environment Promotion (Dev to Staging to Prod)
Environment promotion for infrastructure should mirror your application deployment strategy. Use Terraform workspaces or separate state files per environment, with variables that differ between environments:
# environments/dev.tfvars
instance_type = "t3.small"
min_capacity = 1
max_capacity = 2
enable_waf = false
alert_email = "[email protected]"
# environments/prod.tfvars
instance_type = "t3.xlarge"
min_capacity = 3
max_capacity = 10
enable_waf = true
alert_email = "[email protected]"
The pipeline selects the correct vars file based on the target environment:
- name: Terraform Plan
run: |
terraform plan \
-var-file="environments/${{ matrix.environment }}.tfvars" \
-out=tfplan \
-input=false
Use a matrix strategy to plan across all environments simultaneously but apply sequentially with gates between them. This catches environment-specific issues early.
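A minimal sketch of that matrix block, assuming per-environment tfvars files named after each entry:

```yaml
strategy:
  # Plan all three environments in parallel; applies happen in separate,
  # gated jobs rather than inside this matrix.
  fail-fast: false
  matrix:
    environment: [dev, staging, prod]
```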
State Locking in CI
State locking prevents concurrent pipeline runs from corrupting your Terraform state. Every remote backend supports it, and it is non-negotiable in CI environments.
For S3 backends, DynamoDB provides the lock table:
terraform {
backend "s3" {
bucket = "company-terraform-state"
key = "infrastructure/terraform.tfstate"
region = "us-east-1"
dynamodb_table = "terraform-state-lock"
encrypt = true
}
}
For Azure, the blob lease mechanism handles locking automatically. For GCS, the backend writes a lock file alongside the state object to the same effect.
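For reference, here is an azurerm backend block. The resource group, storage account, and container names are assumptions carried over from the Azure Pipelines example above:

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"
    storage_account_name = "tfstatestorage"
    container_name       = "tfstate"
    key                  = "terraform.tfstate"
    # Locking via blob lease happens automatically; no extra configuration needed.
  }
}
```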
In CI, you also need pipeline-level concurrency control. Two merged PRs should not trigger simultaneous applies:
# GitHub Actions concurrency control
concurrency:
group: terraform-${{ github.ref }}
cancel-in-progress: false
Setting cancel-in-progress: false is critical. You never want to cancel a running Terraform apply — that leaves resources in a partially created state.
Secrets Management in Pipelines
Infrastructure pipelines need access to cloud credentials, API keys, database passwords, and other sensitive values. The rules are simple:
- Never store secrets in Terraform code or state files
- Use your CI platform's secrets store (GitHub Secrets, Azure DevOps Variable Groups)
- Prefer OIDC/workload identity federation over long-lived credentials
- Use a secrets manager (HashiCorp Vault, AWS Secrets Manager) for values that Terraform provisions
OIDC eliminates static credentials entirely. Here is the GitHub Actions configuration for AWS:
- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789012:role/github-actions-terraform
aws-region: us-east-1
For sensitive Terraform variables, pass them through environment variables:
- name: Terraform Plan
env:
TF_VAR_database_password: ${{ secrets.DB_PASSWORD }}
TF_VAR_api_key: ${{ secrets.EXTERNAL_API_KEY }}
run: terraform plan -input=false -out=tfplan
Terraform recognizes TF_VAR_ prefixed environment variables automatically. This keeps secrets out of command-line arguments where they would appear in process listings and logs.
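Marking the corresponding variables as sensitive also keeps their values out of plan and apply output. A short sketch:

```hcl
variable "database_password" {
  description = "Injected via TF_VAR_database_password in CI"
  type        = string
  sensitive   = true # Terraform redacts this value in plan and apply output
}
```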
Rollback Strategies
Rolling back infrastructure is harder than rolling back application code. There is no terraform rollback command. You have several options depending on the situation:
Git revert: The simplest approach. Revert the commit that introduced the bad change and let the pipeline apply the previous configuration. This works for additive changes and modifications but can fail for destructive changes where resources have already been deleted.
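In practice the revert is just two Git commands; the commit hash here is a placeholder:

```bash
# Revert the offending commit and push; the pipeline then plans
# and applies the previous configuration through the normal gates.
git revert 1a2b3c4
git push origin main
```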
State-based rollback: If you version your state files (and you should), you can restore a previous state and run apply against the current code. This is a last resort because it can create drift between code and state.
Blue-green infrastructure: For critical resources, maintain two parallel stacks and switch traffic between them. This is more expensive but provides instant rollback:
variable "active_stack" {
description = "Which stack receives traffic: blue or green"
type = string
default = "blue"
}
resource "aws_lb_target_group_attachment" "active" {
target_group_arn = aws_lb_target_group.main.arn
target_id = var.active_stack == "blue" ? (
aws_instance.blue.id
) : (
aws_instance.green.id
)
}
Feature flags in Terraform: Use count or for_each with feature toggle variables to enable or disable resources without destroying them:
resource "aws_waf_web_acl" "main" {
count = var.enable_waf ? 1 : 0
# ...
}
Drift Detection in CI
Configuration drift happens when someone makes changes through the console, another tool modifies resources, or an auto-scaling event creates new instances. Scheduled drift detection catches these discrepancies before they cause problems.
name: Drift Detection
on:
schedule:
- cron: '0 6 * * *' # Daily at 6 AM UTC
workflow_dispatch: {}
jobs:
detect-drift:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
- name: Terraform Init
working-directory: ./infrastructure
run: terraform init -input=false
- name: Detect Drift
id: drift
working-directory: ./infrastructure
run: |
  terraform plan -detailed-exitcode -input=false -no-color 2>&1 | tee drift_output.txt
  EXIT_CODE=${PIPESTATUS[0]}
  echo "exit_code=$EXIT_CODE" >> $GITHUB_OUTPUT
  # Exit code 0 = no changes, 1 = error, 2 = drift detected
  if [ $EXIT_CODE -eq 2 ]; then
    echo "drift_detected=true" >> $GITHUB_OUTPUT
  elif [ $EXIT_CODE -eq 1 ]; then
    echo "drift_detected=false" >> $GITHUB_OUTPUT
    exit 1
  else
    echo "drift_detected=false" >> $GITHUB_OUTPUT
  fi
- name: Notify on Drift
if: steps.drift.outputs.drift_detected == 'true'
uses: actions/github-script@v7
with:
script: |
var fs = require('fs');
var driftOutput = fs.readFileSync('./infrastructure/drift_output.txt', 'utf8');
github.rest.issues.create({
owner: context.repo.owner,
repo: context.repo.repo,
title: 'Infrastructure Drift Detected - ' + new Date().toISOString().split('T')[0],
body: '## Drift Report\n\n```\n' + driftOutput + '\n```\n\n' +
'Review and either update Terraform config or reconcile manually.',
labels: ['infrastructure', 'drift']
});
The -detailed-exitcode flag is the key. Exit code 0 means no changes, exit code 1 means an error, and exit code 2 means drift detected. Your pipeline should treat exit code 2 as a warning, not a failure.
Pull Request Previews with Terraform Plan
Posting plan output to pull requests is the single most impactful improvement you can make to your infrastructure workflow. Reviewers should not have to run Terraform locally to understand the impact of a change.
A good PR comment includes:
- A summary of resources to be added, changed, or destroyed
- The full plan output in a collapsible section
- A warning if any resources will be destroyed
- A link to the full pipeline log
// scripts/format-plan-comment.js
var fs = require('fs');
function formatPlanComment(planFile, prNumber) {
var planOutput = fs.readFileSync(planFile, 'utf8');
var addCount = (planOutput.match(/will be created/g) || []).length;
var changeCount = (planOutput.match(/will be updated/g) || []).length;
var destroyCount = (planOutput.match(/will be destroyed/g) || []).length;
var summary = '| Action | Count |\n|--------|-------|\n' +
'| Create | ' + addCount + ' |\n' +
'| Update | ' + changeCount + ' |\n' +
'| Destroy | ' + destroyCount + ' |';
var warning = '';
if (destroyCount > 0) {
warning = '\n\n> **Warning**: This plan includes resource destruction. ' +
'Review carefully before approving.\n';
}
var body = '## Terraform Plan — PR #' + prNumber + '\n\n' +
summary + warning +
'\n\n<details><summary>Full Plan Output</summary>\n\n' +
'```hcl\n' + planOutput + '\n```\n\n</details>';
return body;
}
module.exports = { formatPlanComment: formatPlanComment };
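Assuming the script above lives at scripts/format-plan-comment.js, an actions/github-script step could use it like this (the plan file path is illustrative):

```js
// Hypothetical usage inside an actions/github-script step
var formatPlanComment = require('./scripts/format-plan-comment.js').formatPlanComment;
var body = formatPlanComment('./infrastructure/plan.txt', context.issue.number);
await github.rest.issues.createComment({
  issue_number: context.issue.number,
  owner: context.repo.owner,
  repo: context.repo.repo,
  body: body
});
```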
CDK Pipeline Self-Mutation
AWS CDK Pipelines take a different approach — the pipeline updates itself. When you change the pipeline definition in your CDK code, the pipeline detects the change and mutates its own structure before deploying your infrastructure:
// lib/pipeline-stack.js
var cdk = require('aws-cdk-lib');
var pipelines = require('aws-cdk-lib/pipelines');
// Stack is an ES2015 class, so it must be extended rather than invoked
// with .call() — calling a class constructor that way throws a TypeError.
class PipelineStack extends cdk.Stack {
  constructor(scope, id, props) {
    super(scope, id, props);
    var pipeline = new pipelines.CodePipeline(this, 'InfraPipeline', {
      pipelineName: 'InfrastructurePipeline',
      synth: new pipelines.ShellStep('Synth', {
        input: pipelines.CodePipelineSource.gitHub(
          'company/infrastructure',
          'main'
        ),
        commands: [
          'npm ci',
          'npx cdk synth'
        ]
      }),
      selfMutation: true
    });
    // Add deployment stages (DeployStage is a cdk.Stage defined elsewhere)
    pipeline.addStage(new DeployStage(this, 'Dev', {
      env: { account: '111111111111', region: 'us-east-1' }
    }));
    pipeline.addStage(new DeployStage(this, 'Prod', {
      env: { account: '222222222222', region: 'us-east-1' }
    }), {
      pre: [
        new pipelines.ManualApprovalStep('PromoteToProd', {
          comment: 'Review the Dev deployment before promoting to Production'
        })
      ]
    });
  }
}
module.exports = { PipelineStack: PipelineStack };
Self-mutation means you never manually update your pipeline. Change the CDK code, push it, and the pipeline updates itself then deploys your infrastructure. It is elegant but requires you to be comfortable with a pipeline that modifies its own execution.
Compliance Gates
Regulated industries need proof that infrastructure changes meet compliance requirements. Add policy-as-code checks as pipeline stages:
- name: OPA Policy Check
run: |
terraform show -json tfplan > plan.json
opa eval \
--data policies/ \
--input plan.json \
'data.terraform.deny[msg]' \
--format pretty
- name: Checkov Security Scan
run: |
pip install checkov
checkov -d . --framework terraform --soft-fail-on LOW
- name: Cost Estimation
run: |
infracost breakdown \
--path=. \
--format=json \
--out-file=/tmp/infracost.json
infracost comment github \
--path=/tmp/infracost.json \
--repo=${{ github.repository }} \
--pull-request=${{ github.event.pull_request.number }} \
--github-token=${{ secrets.GITHUB_TOKEN }}
OPA (Open Policy Agent) evaluates custom rules against the plan JSON. Checkov scans for security misconfigurations. Infracost estimates the cost impact. Together, they form a compliance gate that blocks non-compliant changes before they reach production.
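For illustration, here is a minimal Rego policy that would match the `data.terraform.deny[msg]` query above. The `resource_changes` paths come from the standard `terraform show -json` plan format, but treat this as a sketch rather than a drop-in policy:

```rego
# policies/security_groups.rego — deny security groups open to the world
package terraform

deny[msg] {
  rc := input.resource_changes[_]
  rc.type == "aws_security_group"
  ingress := rc.change.after.ingress[_]
  ingress.cidr_blocks[_] == "0.0.0.0/0"
  msg := sprintf("%s opens ingress to 0.0.0.0/0", [rc.address])
}
```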
Notification and Audit Trails
Every infrastructure change should generate notifications and maintain an audit trail. Slack notifications on apply, audit logs in a central location, and change records that compliance teams can query:
- name: Notify Slack
if: always()
uses: actions/github-script@v7
with:
script: |
var https = require('https');
var status = '${{ job.status }}';
var emoji = status === 'success' ? 'white_check_mark' : 'x';
var color = status === 'success' ? '#36a64f' : '#dc3545';
var payload = JSON.stringify({
attachments: [{
color: color,
blocks: [{
type: 'section',
text: {
type: 'mrkdwn',
text: ':' + emoji + ': *Terraform Apply ' + status.toUpperCase() + '*\n' +
'*Environment:* production\n' +
'*Triggered by:* ${{ github.actor }}\n' +
'*Commit:* <${{ github.event.head_commit.url }}|' +
'${{ github.sha }}'.substring(0, 7) + '>'
}
}]
}]
});
var options = {
hostname: 'hooks.slack.com',
path: '/services/${{ secrets.SLACK_WEBHOOK_PATH }}',
method: 'POST',
headers: { 'Content-Type': 'application/json' }
};
var req = https.request(options);
req.write(payload);
req.end();
For audit trails, write apply results to a dedicated log store. CloudWatch Logs, Azure Monitor, or even a simple append to an S3 object work. The goal is an immutable record of who changed what, when, and why.
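A minimal sketch of the S3 approach, run as a final pipeline step. The bucket name and record fields are assumptions; the GITHUB_* variables are provided by Actions:

```bash
# Append an audit record for this run to a dedicated S3 bucket.
aws s3 cp - "s3://company-terraform-audit/$(date +%Y/%m/%d)/${GITHUB_RUN_ID}.json" <<EOF
{
  "actor": "${GITHUB_ACTOR}",
  "commit": "${GITHUB_SHA}",
  "workflow": "${GITHUB_WORKFLOW}",
  "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
EOF
```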
Complete Working Example
Here is a full GitHub Actions pipeline that ties everything together. It provides plan comments on PRs, manual approval for production, environment promotion, and scheduled drift detection:
name: Infrastructure Pipeline
on:
pull_request:
branches: [main]
paths: ['infrastructure/**']
push:
branches: [main]
paths: ['infrastructure/**']
schedule:
- cron: '0 6 * * 1-5' # Weekday drift detection
workflow_dispatch:
inputs:
environment:
description: 'Target environment'
required: true
default: 'dev'
type: choice
options: [dev, staging, prod]
permissions:
contents: read
pull-requests: write
issues: write
id-token: write
env:
TF_VERSION: '1.6.0'
TF_DIR: './infrastructure'
jobs:
# ---------- Plan on Pull Requests ----------
plan:
name: 'Plan (${{ matrix.env }})'
runs-on: ubuntu-latest
if: github.event_name == 'pull_request'
strategy:
matrix:
env: [dev, staging, prod]
steps:
- uses: actions/checkout@v4
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: us-east-1
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ env.TF_VERSION }}
- name: Init
working-directory: ${{ env.TF_DIR }}
run: |
terraform init -input=false \
-backend-config="key=${{ matrix.env }}/terraform.tfstate"
- name: Plan
id: plan
working-directory: ${{ env.TF_DIR }}
run: |
terraform plan \
-var-file="environments/${{ matrix.env }}.tfvars" \
-input=false \
-no-color \
-out=tfplan 2>&1 | tee plan.txt
- name: Comment on PR
uses: actions/github-script@v7
with:
script: |
var fs = require('fs');
var plan = fs.readFileSync('${{ env.TF_DIR }}/plan.txt', 'utf8');
var env = '${{ matrix.env }}';
if (plan.length > 60000) {
plan = plan.substring(0, 60000) + '\n... truncated';
}
var body = '## Terraform Plan — `' + env + '`\n\n' +
'<details><summary>Show plan</summary>\n\n' +
'```hcl\n' + plan + '\n```\n</details>';
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: body
});
# ---------- Apply to Dev (auto on merge) ----------
apply-dev:
name: 'Apply Dev'
runs-on: ubuntu-latest
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
environment: dev
concurrency:
group: terraform-dev
cancel-in-progress: false
steps:
- uses: actions/checkout@v4
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: us-east-1
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ env.TF_VERSION }}
- name: Init & Apply
working-directory: ${{ env.TF_DIR }}
run: |
terraform init -input=false \
-backend-config="key=dev/terraform.tfstate"
terraform apply \
-var-file="environments/dev.tfvars" \
-input=false \
-auto-approve
# ---------- Apply to Staging (manual gate) ----------
apply-staging:
name: 'Apply Staging'
runs-on: ubuntu-latest
needs: [apply-dev]
environment: staging
concurrency:
group: terraform-staging
cancel-in-progress: false
steps:
- uses: actions/checkout@v4
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: us-east-1
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ env.TF_VERSION }}
- name: Init & Apply
working-directory: ${{ env.TF_DIR }}
run: |
terraform init -input=false \
-backend-config="key=staging/terraform.tfstate"
terraform apply \
-var-file="environments/staging.tfvars" \
-input=false \
-auto-approve
# ---------- Apply to Production (manual gate) ----------
apply-prod:
name: 'Apply Production'
runs-on: ubuntu-latest
needs: [apply-staging]
environment:
name: production
url: https://console.aws.amazon.com
concurrency:
group: terraform-prod
cancel-in-progress: false
steps:
- uses: actions/checkout@v4
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: us-east-1
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ env.TF_VERSION }}
- name: Init & Apply
working-directory: ${{ env.TF_DIR }}
run: |
terraform init -input=false \
-backend-config="key=prod/terraform.tfstate"
terraform apply \
-var-file="environments/prod.tfvars" \
-input=false \
-auto-approve
- name: Notify Success
if: success()
uses: actions/github-script@v7
with:
script: |
var https = require('https');
var payload = JSON.stringify({
text: 'Production infrastructure deployed successfully by ' +
  '${{ github.actor }} — commit ' + '${{ github.sha }}'.substring(0, 7)
});
var options = {
hostname: 'hooks.slack.com',
path: '/services/${{ secrets.SLACK_WEBHOOK_PATH }}',
method: 'POST',
headers: { 'Content-Type': 'application/json' }
};
var req = https.request(options);
req.write(payload);
req.end();
# ---------- Drift Detection (scheduled) ----------
drift-detection:
name: 'Drift Detection (${{ matrix.env }})'
runs-on: ubuntu-latest
if: github.event_name == 'schedule'
strategy:
matrix:
env: [dev, staging, prod]
steps:
- uses: actions/checkout@v4
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: us-east-1
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ env.TF_VERSION }}
- name: Init
working-directory: ${{ env.TF_DIR }}
run: |
terraform init -input=false \
-backend-config="key=${{ matrix.env }}/terraform.tfstate"
- name: Check for Drift
id: drift
working-directory: ${{ env.TF_DIR }}
continue-on-error: true
run: |
set +e
terraform plan \
-var-file="environments/${{ matrix.env }}.tfvars" \
-detailed-exitcode \
-input=false \
-no-color 2>&1 | tee drift.txt
EXIT_CODE=${PIPESTATUS[0]}  # $? would capture tee's exit code, not terraform's
echo "exit_code=$EXIT_CODE" >> $GITHUB_OUTPUT
exit 0
- name: Create Issue on Drift
if: steps.drift.outputs.exit_code == '2'
uses: actions/github-script@v7
with:
script: |
var fs = require('fs');
var drift = fs.readFileSync('${{ env.TF_DIR }}/drift.txt', 'utf8');
var env = '${{ matrix.env }}';
var today = new Date().toISOString().split('T')[0];
github.rest.issues.create({
owner: context.repo.owner,
repo: context.repo.repo,
title: 'Drift detected in ' + env + ' — ' + today,
body: '## Infrastructure Drift Report\n\n' +
'**Environment:** `' + env + '`\n' +
'**Detected:** ' + today + '\n\n' +
'<details><summary>Drift Details</summary>\n\n' +
'```\n' + drift + '\n```\n</details>\n\n' +
'Investigate and reconcile this drift.',
labels: ['infrastructure', 'drift', env]
});
This pipeline gives you the full lifecycle: plan on every PR, auto-deploy to dev on merge, manual approval for staging and production, Slack notifications on production deploys, and weekday drift detection that creates GitHub issues when discrepancies are found.
Common Issues & Troubleshooting
State lock timeout in CI: Pipeline runs can time out waiting for a state lock if a previous run crashed mid-apply. The fix is terraform force-unlock <LOCK_ID>, but never automate this. A stuck lock means something went wrong, and you need to investigate. Add a timeout to your apply step and alert on failure so a human can intervene.
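In GitHub Actions the timeout is one line on the step; a sketch:

```yaml
- name: Terraform Apply
  timeout-minutes: 45 # Fail (and alert) rather than hold the state lock forever
  working-directory: ./infrastructure
  run: terraform apply -auto-approve -input=false tfplan
```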
Plan output differs between PR and apply: If time passes between the plan on a PR and the apply after merge, the plan can become stale. Other merges, external changes, or even time-based resources (certificates, expiring tokens) can cause divergence. The solution is to re-run terraform plan immediately before apply and fail if the plan does not match expectations. Some teams save the plan file as a pipeline artifact and apply that exact plan.
Credentials expiring mid-apply: Large infrastructure changes can take 30+ minutes. If your cloud credentials have a short session duration, the apply will fail partway through. For AWS OIDC roles, raise role-duration-seconds on the configure-aws-credentials step to 3600 (1 hour) and make sure the IAM role's maximum session duration allows it. For Azure service principals, ensure the token lifetime accommodates your longest apply.
Terraform init fails with backend configuration changes: When you modify backend configuration, terraform init requires the -reconfigure or -migrate-state flag. CI pipelines usually run init with -input=false, which causes an immediate failure when backend changes are detected. Handle this by checking for backend changes in your pipeline and conditionally adding -reconfigure.
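One pragmatic sketch: retry init with -reconfigure only when the plain init fails, so routine runs stay strict:

```bash
# Fall back to -reconfigure only if a plain init fails, e.g. after a
# backend configuration change. Review the failure before relying on this.
terraform init -input=false || terraform init -input=false -reconfigure
```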
Provider version conflicts: Pinning provider versions in your required_providers block is essential. Without it, terraform init in CI may download a newer provider version that introduces breaking changes. Always use version constraints:
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.30"
}
}
}
Parallel applies across environments corrupting state: If your dev and prod environments share a state file (do not do this), parallel matrix runs will corrupt it. Always use separate state files per environment, either through workspaces or distinct backend keys.
Best Practices
Never run terraform apply locally against shared environments. The CI pipeline is the only path to deployment. Remove cloud credentials from developer machines if necessary. Local applies are for personal dev sandboxes only.
Pin every version: Terraform, providers, and modules. Use `.terraform.lock.hcl` and commit it to version control. This ensures deterministic plans across all environments and team members.
Use separate state files per environment, not workspaces for environment separation. Workspaces are better suited for feature branches or ephemeral infrastructure. For long-lived environments, separate state files with separate backend configurations provide cleaner isolation.
Implement plan file passing between stages. Generate the plan in one stage, save it as an artifact, and apply that exact plan in the next stage. This eliminates drift between plan and apply and ensures reviewers approved exactly what gets deployed.
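With GitHub Actions this is a pair of artifact steps around the plan file; a sketch assuming the plan job wrote infrastructure/tfplan:

```yaml
# In the plan job:
- uses: actions/upload-artifact@v4
  with:
    name: tfplan-prod
    path: infrastructure/tfplan

# In the gated apply job:
- uses: actions/download-artifact@v4
  with:
    name: tfplan-prod
    path: infrastructure
- run: terraform apply -input=false tfplan
  working-directory: ./infrastructure
```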
Add cost estimation to your pipeline. Tools like Infracost show the monthly cost impact of every PR. Engineers make better decisions when they see that their change adds $500/month to the cloud bill.
Run
terraform fmt -checkandterraform validateon every PR. Formatting inconsistencies and syntax errors should never make it to the plan stage. Catch them early.Set concurrency controls on apply jobs. Two applies running simultaneously against the same environment will corrupt state at best and create conflicting resources at worst. Use concurrency groups with
cancel-in-progress: false.Treat drift detection alerts as incidents. Drift means someone bypassed the pipeline or an external process modified your infrastructure. Investigate the root cause, do not just re-apply to make the alert go away.
Keep Terraform runs fast by splitting large configurations into smaller root modules. A single root module managing 500 resources takes 10 minutes to plan. Split it into logical components (networking, compute, database) with separate state files. Each runs faster, and changes to networking do not require planning against every EC2 instance.
Rotate and audit pipeline credentials quarterly. Even with OIDC, the trust relationship between your CI provider and cloud account needs periodic review. Audit which repositories can assume which roles and tighten permissions using least-privilege policies.
References
- Terraform CI/CD Documentation — Official guide for GitHub Actions integration
- GitHub Actions Environments — Configuring approval gates and environment secrets
- Azure Pipelines for Terraform — Microsoft's documentation for Terraform in Azure DevOps
- Open Policy Agent Terraform — Policy-as-code for Terraform plans
- Infracost — Cloud cost estimation for Terraform pull requests
- AWS CDK Pipelines — Self-mutating pipelines with CDK
- Terraform State Locking — Backend locking mechanisms
- Checkov — Static analysis for infrastructure-as-code security