Terraform State Management Strategies

Master Terraform state management with remote backends, state locking, splitting strategies, and disaster recovery patterns

Overview

Terraform state is the single most critical artifact in your infrastructure-as-code pipeline, and mismanaging it will cost you production outages, data loss, and hours of painful recovery work. This article covers production-grade state management strategies including remote backends, state locking, splitting large projects, cross-stack references, and disaster recovery. If you are working on a team or managing anything beyond a toy project, you need to get state management right from day one.

Prerequisites

  • Terraform 1.5 or later installed
  • Basic understanding of Terraform resources and providers
  • AWS account with IAM permissions for S3, DynamoDB, and the resources you plan to manage
  • AWS CLI configured with credentials
  • Node.js 18+ (for automation scripts)
  • Familiarity with HCL syntax

What Terraform State Is and Why It Matters

Every time you run terraform apply, Terraform writes a JSON file that maps your declared resources to real infrastructure objects in the cloud. This file is your state. Without it, Terraform has no idea what it has already created, what needs updating, and what should be destroyed.

The state file contains:

  • Resource IDs and ARNs that map HCL declarations to actual cloud resources
  • Attribute values for every managed resource, including computed outputs
  • Dependency metadata that determines the order of operations
  • Provider configuration details

Here is a simplified look at what lives inside a state file:

{
  "version": 4,
  "terraform_version": "1.7.0",
  "serial": 42,
  "lineage": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "outputs": {
    "vpc_id": {
      "value": "vpc-0abc123def456789",
      "type": "string"
    }
  },
  "resources": [
    {
      "mode": "managed",
      "type": "aws_vpc",
      "name": "main",
      "provider": "provider[\"registry.terraform.io/hashicorp/aws\"]",
      "instances": [
        {
          "schema_version": 1,
          "attributes": {
            "id": "vpc-0abc123def456789",
            "cidr_block": "10.0.0.0/16",
            "tags": {
              "Name": "production-vpc"
            }
          }
        }
      ]
    }
  ]
}

The serial number increments with every write. The lineage is a UUID assigned at creation that prevents accidentally applying one project's state to another. These two fields are your first line of defense against state corruption.
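
You can inspect both fields at any time by pulling the state. A quick check, as a sketch assuming jq is installed:

# Print the serial and lineage of the current state
terraform state pull | jq '{serial: .serial, lineage: .lineage}'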

If you lose your state file, Terraform treats every resource as new. Run terraform plan against production without state and you will see a plan to create duplicates of everything that already exists. That is a nightmare scenario, and it is entirely preventable.

Local vs Remote State

By default, Terraform stores state in a local file called terraform.tfstate in your working directory. This works fine when you are learning Terraform alone on your laptop. It falls apart the moment a second engineer touches the same infrastructure.

Problems with local state:

  • No locking. Two engineers can run terraform apply simultaneously and corrupt the state file.
  • No shared access. State lives on one person's machine.
  • No versioning. Accidental deletion means full manual reconstruction.
  • Sensitive data in plaintext sitting on a local filesystem.

Remote state solves all of these problems by storing the state file in a shared, versioned, encrypted backend with locking support.

# Local state (default, do not use in production)
# State lives at ./terraform.tfstate

# Remote state with S3
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks"
  }
}

The rule is simple: if more than one person will ever touch this infrastructure, use remote state. Period.

S3 Backend with DynamoDB Locking

The S3 backend is the most common production setup for AWS-based teams. S3 provides durable, versioned, encrypted storage. DynamoDB provides a locking mechanism that prevents concurrent writes.

First, create the backend infrastructure. I recommend doing this with a separate, minimal Terraform configuration or even manually through the console, because this is the one piece of infrastructure that cannot be managed by the state it stores.

# backend-bootstrap/main.tf
# Run this ONCE to create the state backend infrastructure

provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "terraform_state" {
  bucket = "mycompany-terraform-state"

  tags = {
    Name        = "Terraform State"
    Environment = "management"
    ManagedBy   = "manual"
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Name        = "Terraform State Locks"
    Environment = "management"
  }
}

output "state_bucket_name" {
  value = aws_s3_bucket.terraform_state.id
}

output "lock_table_name" {
  value = aws_dynamodb_table.terraform_locks.name
}

KMS encryption is non-negotiable. State files contain database passwords, private keys, and other secrets in plaintext. Versioning is equally critical because it gives you the ability to roll back to a previous state if something goes wrong.
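
Restricting who can touch the bucket matters just as much as encrypting it. Here is a minimal sketch of a least-privilege policy for principals that run Terraform against this backend; the bucket, key prefix, and table names are this article's examples, and the actions follow the documented requirements of the S3 backend:

data "aws_iam_policy_document" "state_access" {
  # Required to enumerate state objects in the bucket
  statement {
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::mycompany-terraform-state"]
  }

  # Read and write state objects under this environment's prefix
  statement {
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::mycompany-terraform-state/production/*"]
  }

  # Acquire and release state locks
  statement {
    actions = [
      "dynamodb:DescribeTable",
      "dynamodb:GetItem",
      "dynamodb:PutItem",
      "dynamodb:DeleteItem"
    ]
    resources = ["arn:aws:dynamodb:*:*:table/terraform-state-locks"]
  }

  # Note: with KMS encryption, principals also need kms:Decrypt and
  # kms:GenerateDataKey on the key used by the bucket.
}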

Terraform Cloud as a Backend

If you do not want to manage your own state infrastructure, Terraform Cloud (now part of HCP Terraform) provides a managed backend with built-in locking, versioning, a web UI, and policy enforcement.

terraform {
  cloud {
    organization = "mycompany"

    workspaces {
      name = "production-networking"
    }
  }
}

Terraform Cloud is a solid choice for teams that want to centralize not just state but also the entire plan/apply workflow. The free tier includes up to 500 managed resources, which is enough for many small to medium projects.

The tradeoff is vendor lock-in and an external dependency for your infrastructure pipeline. If Terraform Cloud has an outage, you cannot run applies. With S3, you own the infrastructure and control the availability.

State File Structure and Sensitive Data

Terraform state stores every attribute of every resource, including sensitive values. If you create an aws_db_instance with a password, that password is in your state file in plaintext.

resource "aws_db_instance" "main" {
  identifier     = "production-db"
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = "db.t3.medium"
  username       = "admin"
  password       = var.db_password  # This ends up in state as plaintext
}

Mitigation strategies:

  1. Encrypt the backend. S3 with KMS encryption, Terraform Cloud with its built-in encryption.
  2. Restrict access. IAM policies on the S3 bucket should limit who can read state.
  3. Use sensitive markers. Mark outputs as sensitive to prevent them from appearing in CLI output, though they still exist in state (see the short example at the end of this section).
  4. External secret management. Use AWS Secrets Manager or HashiCorp Vault and reference secrets at apply time rather than storing them in Terraform variables, as shown below:

data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "production/db/password"
}

resource "aws_db_instance" "main" {
  identifier     = "production-db"
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = "db.t3.medium"
  username       = "admin"
  password       = data.aws_secretsmanager_secret_version.db_password.secret_string
}

The password still ends up in state, but it is fetched from Secrets Manager at apply time rather than being hardcoded in your variables. Combined with an encrypted, access-controlled backend, this is the pragmatic approach most teams use.
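
For mitigation 3, marking an output as sensitive is a one-line change. A small sketch (the output name and connection string format are illustrative):

output "db_connection_string" {
  value     = "postgres://${aws_db_instance.main.username}:${var.db_password}@${aws_db_instance.main.endpoint}/app"
  sensitive = true  # redacted in CLI output, but still present in state
}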

State Locking and Concurrency

State locking prevents two people from modifying state at the same time. When you run terraform plan or terraform apply, Terraform acquires a lock in DynamoDB before reading or writing state. If another process already holds the lock, Terraform waits or fails.

$ terraform apply
Acquiring state lock. This may take a few moments...

Error: Error acquiring the state lock

Error message: ConditionalCheckFailedException: The conditional request failed
Lock Info:
  ID:        a1b2c3d4-e5f6-7890-abcd-ef1234567890
  Path:      mycompany-terraform-state/production/networking/terraform.tfstate
  Operation: OperationTypeApply
  Who:       jane@laptop
  Version:   1.7.0
  Created:   2026-01-15 14:23:01.123456 +0000 UTC
  Info:

This is the system working correctly. Do not force-unlock unless you are certain the other process has crashed. If a process crashes mid-apply, you may need to manually release the lock:

terraform force-unlock a1b2c3d4-e5f6-7890-abcd-ef1234567890

Use force-unlock with extreme caution. If the other process is still running and you release the lock, you will get concurrent writes and state corruption.
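
If you expect short-lived contention, such as overlapping CI jobs, you can tell Terraform to wait for the lock instead of failing immediately:

# Wait up to 5 minutes to acquire the state lock before giving up
terraform apply -lock-timeout=5m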

Terraform State Commands

Terraform provides a set of terraform state subcommands for inspecting and manipulating state. These are essential tools for day-to-day operations.

Listing Resources

# List all resources in state
terraform state list

# Filter by resource type
terraform state list aws_instance

# Filter by module
terraform state list module.networking

Showing Resource Details

# Show full details of a specific resource
terraform state show aws_vpc.main

# Output includes all attributes
# id          = "vpc-0abc123def456789"
# cidr_block  = "10.0.0.0/16"
# tags        = { "Name" = "production-vpc" }

Moving Resources

When you refactor your Terraform code and rename a resource or move it into a module, Terraform sees the old name as a destroy and the new name as a create. Use terraform state mv to update the state without destroying infrastructure.

# Rename a resource
terraform state mv aws_instance.web aws_instance.app_server

# Move a resource into a module
terraform state mv aws_vpc.main module.networking.aws_vpc.main

# Move between modules
terraform state mv module.old.aws_s3_bucket.data module.new.aws_s3_bucket.data

Starting with Terraform 1.1, you can also use the moved block in HCL, which is the preferred approach for team workflows because it is declarative and version-controlled:

moved {
  from = aws_instance.web
  to   = aws_instance.app_server
}

Removing Resources from State

Sometimes you need to remove a resource from Terraform management without destroying the actual infrastructure. This is common when migrating resources between state files.

# Remove from state (does NOT destroy the real resource)
terraform state rm aws_instance.legacy_server

# Remove an entire module
terraform state rm module.deprecated_service

Pull and Push

# Download remote state to a local file
terraform state pull > state_backup.json

# Push a local state file to the remote backend (dangerous)
terraform state push state_backup.json

state push is a last-resort operation. It overwrites the remote state entirely. Use it only for disaster recovery after careful verification.

Importing Existing Resources

When you have infrastructure that was created manually or by another tool, you can bring it under Terraform management with terraform import.

# Import an existing VPC
terraform import aws_vpc.main vpc-0abc123def456789

# Import an RDS instance
terraform import aws_db_instance.main production-db

# Import into a module
terraform import module.networking.aws_subnet.public subnet-0abc123def456789

You must write the matching resource block in your HCL first. Terraform import only updates the state; it does not generate configuration.

Starting with Terraform 1.5, you can use import blocks for a more declarative approach:

import {
  to = aws_vpc.main
  id = "vpc-0abc123def456789"
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name = "production-vpc"
  }
}

Run terraform plan and Terraform will show you what it intends to import. This is much safer than the CLI import command because you can review the plan before applying.
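
Terraform 1.5+ can also generate starter configuration for import blocks that do not yet have a matching resource block, which saves hand-writing HCL for large imports:

# Write generated HCL for any import targets that lack a resource block
terraform plan -generate-config-out=generated.tf

# Review and clean up generated.tf before applying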

State Splitting Strategies for Large Projects

A single state file for your entire infrastructure is a ticking time bomb. As it grows, terraform plan gets slower, the blast radius of any mistake covers everything, and you cannot give different teams different access levels.

Split your state by concern:

infrastructure/
├── networking/          # VPC, subnets, route tables, NAT gateways
│   ├── main.tf
│   ├── outputs.tf
│   └── backend.tf       # s3 key: "production/networking/terraform.tfstate"
├── database/            # RDS, ElastiCache, DynamoDB
│   ├── main.tf
│   ├── outputs.tf
│   └── backend.tf       # s3 key: "production/database/terraform.tfstate"
├── compute/             # ECS, EC2, Lambda, ALB
│   ├── main.tf
│   ├── outputs.tf
│   └── backend.tf       # s3 key: "production/compute/terraform.tfstate"
├── dns/                 # Route53 zones and records
│   ├── main.tf
│   └── backend.tf       # s3 key: "production/dns/terraform.tfstate"
└── monitoring/          # CloudWatch, SNS, dashboards
    ├── main.tf
    └── backend.tf       # s3 key: "production/monitoring/terraform.tfstate"

The guiding principle is blast radius. A mistake in your monitoring configuration should not be able to destroy your VPC. Splitting state ensures that terraform destroy in one stack cannot touch resources in another.

A good heuristic: if two sets of resources change at different frequencies or are owned by different teams, they belong in different state files.

Cross-Stack State References with terraform_remote_state

When you split state, stacks need to reference each other. The networking stack creates the VPC and subnets; the compute stack needs those IDs to launch instances.

First, expose the values as outputs in the source stack:

# networking/outputs.tf
output "vpc_id" {
  value       = aws_vpc.main.id
  description = "The ID of the production VPC"
}

output "public_subnet_ids" {
  value       = aws_subnet.public[*].id
  description = "List of public subnet IDs"
}

output "private_subnet_ids" {
  value       = aws_subnet.private[*].id
  description = "List of private subnet IDs"
}

Then, read those outputs in the consuming stack:

# compute/data.tf
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "mycompany-terraform-state"
    key    = "production/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

# compute/main.tf
resource "aws_instance" "app" {
  ami           = "ami-0abcdef1234567890"
  instance_type = "t3.medium"
  subnet_id     = data.terraform_remote_state.networking.outputs.private_subnet_ids[0]

  vpc_security_group_ids = [aws_security_group.app.id]

  tags = {
    Name = "app-server"
  }
}

resource "aws_security_group" "app" {
  name   = "app-server-sg"
  vpc_id = data.terraform_remote_state.networking.outputs.vpc_id

  ingress {
    from_port   = 8080
    to_port     = 8080
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/16"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

An important caveat: terraform_remote_state reads the entire state file of the source stack. The consuming stack's IAM role needs read access to the other stack's state. For stricter isolation, consider using SSM Parameter Store or Terraform data sources to share specific values instead.

# Alternative: Share values through SSM Parameter Store
# In the networking stack
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/infrastructure/production/vpc_id"
  type  = "String"
  value = aws_vpc.main.id
}

# In the compute stack
data "aws_ssm_parameter" "vpc_id" {
  name = "/infrastructure/production/vpc_id"
}

resource "aws_security_group" "app" {
  vpc_id = data.aws_ssm_parameter.vpc_id.value
  # ...
}

The SSM approach is more work but provides better access control and decouples the stacks from each other's backend configuration.

State Migration Between Backends

When you need to move from local state to S3, or from S3 to Terraform Cloud, Terraform handles the migration during terraform init.

# Step 1: Add the new backend configuration
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/app/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks"
  }
}

# Step 2: Run init, Terraform detects the backend change
$ terraform init

Initializing the backend...
Backend configuration changed!

Terraform has detected that the configuration specified for the backend
has changed. Terraform will now check for existing state in the backends.

Do you want to copy existing state to the new backend?
  Enter a value: yes

Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.

For migrating between remote backends (e.g., S3 to Terraform Cloud), the process is the same. Update the backend block and run terraform init. Terraform will offer to copy the state.

If the automatic migration fails, you can do it manually:

# Pull state from old backend
terraform state pull > migration_backup.json

# Update backend configuration in .tf files
# Run init with the new backend
terraform init

# If state did not copy automatically, push it
terraform state push migration_backup.json

Always back up your state before any migration. Always verify with terraform plan after migration to confirm zero changes.
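
A cautious migration, end to end, looks something like this; the -detailed-exitcode flag makes the zero-change check scriptable:

# Back up the current state before touching anything
terraform state pull > pre_migration_backup.json

# Migrate to the new backend (answer yes when prompted, or add -force-copy in CI)
terraform init -migrate-state

# Verify: exit code 0 means no changes, 2 means the plan contains changes
terraform plan -detailed-exitcode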

Disaster Recovery for State Files

State loss is recoverable, but it is painful. Here is a layered disaster recovery strategy.

Layer 1: S3 Versioning. Every state write creates a new version in S3. To recover, list versions and restore:

# List state file versions
aws s3api list-object-versions \
  --bucket mycompany-terraform-state \
  --prefix production/networking/terraform.tfstate \
  --max-keys 10

# Restore a specific version
aws s3api get-object \
  --bucket mycompany-terraform-state \
  --key production/networking/terraform.tfstate \
  --version-id "abc123def456" \
  restored_state.json

# Verify the restored state
cat restored_state.json | python -m json.tool | head -20

# Push it back
terraform state push restored_state.json

Layer 2: Automated Backups. Run a scheduled backup script that copies state files to a separate account or region.

// scripts/backup-state.js
// Copies every state file to a backup bucket under a timestamped prefix.
const AWS = require("aws-sdk");

const s3 = new AWS.S3({ region: "us-east-1" });

const SOURCE_BUCKET = "mycompany-terraform-state";
const BACKUP_BUCKET = "mycompany-terraform-state-backup";

const stateKeys = [
  "production/networking/terraform.tfstate",
  "production/database/terraform.tfstate",
  "production/compute/terraform.tfstate",
  "production/dns/terraform.tfstate",
  "production/monitoring/terraform.tfstate"
];

async function backupState(key, timestamp) {
  const backupKey = "backups/" + timestamp + "/" + key;

  await s3.copyObject({
    CopySource: SOURCE_BUCKET + "/" + key,
    Bucket: BACKUP_BUCKET,
    Key: backupKey,
    ServerSideEncryption: "aws:kms"
  }).promise();

  console.log("Backed up " + key + " to " + backupKey);
}

async function runBackups() {
  // One timestamp per run so all copies land under the same prefix
  const timestamp = new Date().toISOString().replace(/[:.]/g, "-");

  const results = await Promise.allSettled(
    stateKeys.map((key) => backupState(key, timestamp))
  );

  const failed = stateKeys.filter(
    (key, i) => results[i].status === "rejected"
  );

  if (failed.length > 0) {
    console.error("Backup completed with errors: " + failed.join(", "));
    process.exit(1);
  }
  console.log("All state files backed up successfully");
}

runBackups();

Layer 3: Reconstruction by Import. If state is truly lost and no backups exist, you can rebuild it by importing every resource. This is tedious but possible:

# For each resource in your configuration (quote indexed addresses for the shell)
terraform import aws_vpc.main vpc-0abc123def456789
terraform import 'aws_subnet.public[0]' subnet-0aaa111222333444
terraform import 'aws_subnet.public[1]' subnet-0bbb555666777888
terraform import 'aws_subnet.private[0]' subnet-0ccc999000111222
# ... continue for every resource

# Verify
terraform plan
# Should show no changes if imports match configuration

Complete Working Example

Here is a production-ready setup with state splitting across networking and application stacks, complete with cross-stack references.

Backend Bootstrap

# 00-backend/main.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

variable "project_name" {
  default = "mycompany"
}

resource "aws_s3_bucket" "state" {
  bucket = "${var.project_name}-terraform-state"
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "locks" {
  name         = "${var.project_name}-terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
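
Because this stack creates the backend itself, it has to start with local state, the classic chicken-and-egg. Roughly:

cd 00-backend
terraform init    # no backend block here, so state stays local
terraform apply

# Keep the resulting terraform.tfstate somewhere safe; this stack
# rarely changes after bootstrap.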

Networking Stack

# 01-networking/backend.tf
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "mycompany-terraform-locks"
  }
}

# 01-networking/main.tf
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

variable "aws_region" {
  default = "us-east-1"
}

variable "environment" {
  default = "production"
}

variable "vpc_cidr" {
  default = "10.0.0.0/16"
}

data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name        = "${var.environment}-vpc"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(var.vpc_cidr, 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = {
    Name        = "${var.environment}-public-${count.index}"
    Environment = var.environment
    Tier        = "public"
  }
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index + 10)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name        = "${var.environment}-private-${count.index}"
    Environment = var.environment
    Tier        = "private"
  }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name        = "${var.environment}-igw"
    Environment = var.environment
  }
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }

  tags = {
    Name = "${var.environment}-public-rt"
  }
}

resource "aws_route_table_association" "public" {
  count          = 2
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

# 01-networking/outputs.tf
output "vpc_id" {
  value       = aws_vpc.main.id
  description = "VPC ID for cross-stack references"
}

output "vpc_cidr" {
  value       = aws_vpc.main.cidr_block
  description = "VPC CIDR block"
}

output "public_subnet_ids" {
  value       = aws_subnet.public[*].id
  description = "Public subnet IDs"
}

output "private_subnet_ids" {
  value       = aws_subnet.private[*].id
  description = "Private subnet IDs"
}

Application Stack with Cross-Stack References

# 02-application/backend.tf
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/application/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "mycompany-terraform-locks"
  }
}

# 02-application/data.tf
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "mycompany-terraform-state"
    key    = "production/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

# 02-application/main.tf
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

variable "environment" {
  default = "production"
}

locals {
  vpc_id             = data.terraform_remote_state.networking.outputs.vpc_id
  vpc_cidr           = data.terraform_remote_state.networking.outputs.vpc_cidr
  public_subnet_ids  = data.terraform_remote_state.networking.outputs.public_subnet_ids
  private_subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}

resource "aws_security_group" "alb" {
  name   = "${var.environment}-alb-sg"
  vpc_id = local.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "${var.environment}-alb-sg"
  }
}

resource "aws_security_group" "app" {
  name   = "${var.environment}-app-sg"
  vpc_id = local.vpc_id

  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "${var.environment}-app-sg"
  }
}

resource "aws_lb" "main" {
  name               = "${var.environment}-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = local.public_subnet_ids

  tags = {
    Name        = "${var.environment}-alb"
    Environment = var.environment
  }
}

resource "aws_lb_target_group" "app" {
  name     = "${var.environment}-app-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = local.vpc_id

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 5
    timeout             = 10
    interval            = 30
  }
}

resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.main.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

output "alb_dns_name" {
  value = aws_lb.main.dns_name
}

output "alb_zone_id" {
  value = aws_lb.main.zone_id
}
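
With both stacks in place, apply them in dependency order and confirm the cross-stack reference resolved:

cd 01-networking
terraform init && terraform apply

cd ../02-application
terraform init && terraform apply

# The ALB DNS name proves the application stack read the networking outputs
terraform output alb_dns_name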

Automation Script for Multi-Stack Applies

// scripts/apply-all.js
// Runs terraform init and the given command across all stacks in order.
const { execSync } = require("child_process");
const path = require("path");

const stacks = [
  "00-backend",
  "01-networking",
  "02-application"
];

const baseDir = path.resolve(__dirname, "..");

function runTerraform(stack, command) {
  const stackDir = path.join(baseDir, stack);
  console.log("\n=== " + stack + ": terraform " + command + " ===\n");

  try {
    execSync("terraform " + command, {
      cwd: stackDir,
      stdio: "inherit",  // stream terraform's output directly
      env: Object.assign({}, process.env, { TF_INPUT: "0" })
    });
    return true;
  } catch (err) {
    console.error("Failed in " + stack + ": " + err.message);
    return false;
  }
}

function main() {
  const command = process.argv[2] || "plan";

  if (["plan", "apply", "destroy"].indexOf(command) === -1) {
    console.error("Usage: node apply-all.js [plan|apply|destroy]");
    process.exit(1);
  }

  const autoApprove = command === "plan" ? "" : " -auto-approve";

  // Destroy must run in reverse dependency order
  const ordered = command === "destroy" ? stacks.slice().reverse() : stacks;

  for (const stack of ordered) {
    // Always init first so the backend is configured
    if (!runTerraform(stack, "init -reconfigure")) {
      console.error("Init failed for " + stack + ", aborting");
      process.exit(1);
    }

    if (!runTerraform(stack, command + autoApprove)) {
      console.error("Terraform " + command + " failed for " + stack + ", aborting");
      process.exit(1);
    }
  }

  console.log("\nAll stacks processed successfully");
}

main();

Common Issues and Troubleshooting

1. State Lock Stuck After Crash

Error: Error acquiring the state lock

Error message: ConditionalCheckFailedException: The conditional request failed
Lock Info:
  ID:        d4f89b2a-1234-5678-abcd-ef0123456789
  Path:      mycompany-terraform-state/production/networking/terraform.tfstate
  Operation: OperationTypeApply
  Who:       deploy@ci-runner-7
  Created:   2026-01-10 08:15:22.456789 +0000 UTC

Cause: A CI/CD pipeline or engineer's process crashed during an apply, leaving a stale lock in DynamoDB.

Fix: Confirm the process is truly dead, then force-unlock:

terraform force-unlock d4f89b2a-1234-5678-abcd-ef0123456789

2. State Serial Mismatch on Push

Error: Failed to persist state to backend.

The error shown above has prevented Terraform from writing the updated state
to the configured backend. To prevent data loss, Terraform will not continue
with any actions that might affect the state.

Serial: 42; Expected: 41

Cause: Someone else wrote to the state between your read and your write. Terraform's serial number check caught the conflict.

Fix: Re-run your plan/apply; Terraform reads the latest remote state, and its new serial, at the start of each run. Do not use state push with -force unless you fully understand what changed.

3. Backend Initialization Failure

Error: Failed to get existing workspaces: S3 bucket does not exist.

The referenced S3 bucket must have been previously created. If the S3 bucket
was recently created, please retry after a few seconds.

Cause: The S3 bucket for state storage does not exist yet, or you have the wrong bucket name or region in your backend configuration.

Fix: Verify the bucket exists, the name is correct, and your AWS credentials have access. Create the bucket if it does not exist yet (see the backend bootstrap section above).
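
Two quick checks, assuming the AWS CLI is configured:

# Does the bucket exist, and can your credentials reach it?
aws s3api head-bucket --bucket mycompany-terraform-state

# Is it in the region your backend block says it is?
aws s3api get-bucket-location --bucket mycompany-terraform-state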

4. Terraform Import Drift

Error: Cannot import non-existent remote object

While attempting to import an existing object to
"aws_instance.app", the provider detected that no object exists
with the given id.

The given "i-0abc123def456789" id does not match any existing
EC2 instances.

Cause: The resource ID is wrong, the resource was deleted, or you are authenticated against the wrong AWS account or region.

Fix: Double-check the resource ID in the AWS console. Verify your AWS_DEFAULT_REGION and AWS_PROFILE environment variables. Make sure you are importing into the correct provider configuration if you have multiple provider aliases.
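
These checks usually surface the mismatch quickly:

# Which account and region is the CLI actually resolving to?
aws sts get-caller-identity
aws configure get region

# Does the instance exist in that account and region?
aws ec2 describe-instances --instance-ids i-0abc123def456789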

5. Remote State Data Source Returns Empty Outputs

Error: Unsupported attribute

  on data.tf line 15, in resource "aws_instance" "app":
  15:   subnet_id = data.terraform_remote_state.networking.outputs.private_subnet_ids[0]

This object has no argument, nested block, or exported attribute named
"private_subnet_ids".

Cause: The source stack either has not been applied yet, or the output name does not match. Outputs are only available in state after a successful terraform apply.

Fix: Apply the source stack first. Verify the output name matches exactly. Check that the backend configuration in the terraform_remote_state data source points to the correct state file.
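
To see exactly which outputs the source stack has recorded, run terraform output in its directory:

cd 01-networking

# Lists every output currently stored in this stack's state
terraform output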

Best Practices

  • Never store state in version control. Add *.tfstate and *.tfstate.backup to your .gitignore (snippet after this list). State files contain secrets and should only live in encrypted backends.

  • Use one state file per environment per component. The key path should encode both: {environment}/{component}/terraform.tfstate. This limits blast radius and enables per-component access control.

  • Enable versioning on your state bucket. This is your primary recovery mechanism. Without versioning, a corrupted state write is permanent.

  • Always run terraform plan after state manipulation. After any state mv, state rm, state push, or import operation, run plan and verify zero changes. If the plan shows unexpected changes, something is wrong.

  • Use moved blocks instead of state mv for refactoring. Moved blocks are declarative, version-controlled, and work correctly in team workflows where multiple people might run terraform apply.

  • Lock down state bucket access with IAM policies. Only CI/CD pipelines and designated infrastructure engineers should have read/write access. Everyone else gets read-only at most.

  • Automate state backups to a separate account. Cross-account backups protect against accidental deletion of the state bucket itself, compromised credentials, and account-level failures.

  • Set lifecycle rules on backup buckets. Keep daily backups for 30 days and weekly backups for 90 days. State files are small, but versions accumulate quickly because every apply writes a new one.

  • Use workspaces sparingly. Terraform workspaces share the same backend configuration and code, just with different state files. For most teams, separate directories per environment are clearer and safer than workspaces.

  • Document your state layout. Maintain a README or wiki page that maps each state file to the infrastructure it manages, who owns it, and the apply order. When something goes wrong at 2 AM, you do not want to be guessing which state file holds the VPC.
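
For the first practice above, a typical Terraform .gitignore looks like this (the .terraform/ entry also keeps the local provider cache out of the repo):

# .gitignore
*.tfstate
*.tfstate.backup
.terraform/
crash.log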
