ECS and Fargate: Container Orchestration on AWS

Deploy and scale Node.js containers on AWS ECS with Fargate, ALB integration, auto-scaling, and production logging

Overview

Amazon Elastic Container Service (ECS) is AWS's native container orchestration platform that lets you run Docker containers at scale without managing the underlying complexity of distributed systems. When paired with Fargate, the serverless compute engine, you eliminate the need to provision, configure, and manage EC2 instances entirely — you define your containers and AWS handles the rest. If you are running Node.js microservices in production and need a container orchestration solution that integrates deeply with the AWS ecosystem, ECS with Fargate is the most practical choice short of Kubernetes.

Prerequisites

Before diving in, you should have the following in place:

  • An AWS account with appropriate IAM permissions
  • AWS CLI v2 installed and configured with aws configure
  • Docker installed locally for building container images
  • Basic familiarity with Docker (Dockerfiles, images, containers)
  • Node.js v18+ installed for the example application
  • A registered domain name (optional, for production ALB setup)

ECS Core Concepts

ECS has a layered architecture. Understanding these layers is essential before you deploy anything.

Clusters

A cluster is the top-level grouping for your ECS resources. It is a logical boundary, not a physical one. A cluster can run tasks using Fargate, EC2 instances, or both. Think of it as a namespace that ties your services, tasks, and compute capacity together.

# Create an ECS cluster
aws ecs create-cluster --cluster-name my-node-cluster

# List existing clusters
aws ecs list-clusters

In practice, most teams create one cluster per environment — dev, staging, production — though you can segment further by domain or team.

Task Definitions

A task definition is a blueprint for your application. It is the ECS equivalent of a docker-compose.yml file. It specifies which container images to run, how much CPU and memory to allocate, which ports to expose, environment variables, logging configuration, and IAM roles.

Task definitions are versioned. Every time you update one, ECS creates a new revision. You reference a specific revision as family:revision, or just the family name, which resolves to the most recent active revision.
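
You can see this in practice with the CLI (the node-api-task family here is the one defined later in this guide):

# List all revisions registered under a task definition family
aws ecs list-task-definitions --family-prefix node-api-task

# Inspect a specific revision (here, revision 3)
aws ecs describe-task-definition --task-definition node-api-task:3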

Tasks

A task is a running instance of a task definition. It is ephemeral — a single execution of one or more containers defined in the task definition. Tasks can run standalone (for batch jobs, one-off scripts) or as part of a service.
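
For example, a one-off task can be launched directly with run-task. A sketch, with placeholder subnet and security group IDs:

# Run a standalone task on Fargate (for batch jobs or one-off scripts)
aws ecs run-task \
  --cluster my-node-cluster \
  --task-definition node-api-task \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-private1],securityGroups=[sg-ecs123],assignPublicIp=DISABLED}"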

Services

A service maintains a desired count of running tasks. If a task dies, the service scheduler launches a replacement. Services integrate with load balancers, auto-scaling policies, and deployment strategies. For long-running applications like a Node.js API server, you always want a service managing your tasks.

The relationship flows like this: Cluster → Service → Task → Container(s).

EC2 vs Fargate Launch Types

ECS supports two launch types, and choosing between them shapes your entire operational model.

EC2 Launch Type

With EC2, you manage a pool of EC2 instances that form your cluster's compute capacity. You are responsible for:

  • Instance type selection and provisioning
  • AMI updates and patching
  • Capacity planning and scaling the instance pool
  • Monitoring instance health and disk usage

The upside is cost efficiency for predictable, sustained workloads. You can use Reserved Instances or Savings Plans to reduce costs significantly. You also get access to GPU instances and more granular control over networking and storage.

Fargate Launch Type

With Fargate, AWS manages the compute. You specify CPU and memory at the task level and Fargate provisions the right amount of infrastructure. There are no instances to manage, no AMIs to patch, no capacity planning headaches.

The tradeoff is cost. Fargate is more expensive per unit of compute than EC2. For spiky workloads, development environments, or teams that do not want to manage infrastructure, the operational savings justify the price.

My recommendation: Start with Fargate. Move to EC2 only when cost optimization becomes a priority and your team has the operational maturity to manage instances. For most Node.js applications, Fargate is the right choice.

Task Definition Configuration for Node.js

Here is a production-ready task definition for a Node.js Express application:

{
  "family": "node-api-task",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/ecsTaskRole",
  "containerDefinitions": [
    {
      "name": "node-api",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/node-api:latest",
      "essential": true,
      "portMappings": [
        {
          "containerPort": 3000,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {
          "name": "NODE_ENV",
          "value": "production"
        },
        {
          "name": "PORT",
          "value": "3000"
        }
      ],
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/prod/database-url"
        },
        {
          "name": "API_KEY",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/api-key"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/node-api",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      }
    }
  ]
}

A few critical notes on this configuration:

  • CPU and memory are specified in Fargate units. "512" means 0.5 vCPU and "1024" means 1 GB RAM. Fargate has specific valid combinations — you cannot pick arbitrary values.
  • networkMode must be awsvpc for Fargate. Each task gets its own elastic network interface (ENI) and private IP.
  • executionRoleArn is the role ECS uses to pull images from ECR and fetch secrets. It is not the role your application code uses.
  • taskRoleArn is the role your application code assumes at runtime. If your Node.js app reads from S3 or writes to DynamoDB, those permissions go here.
  • startPeriod in the health check gives your Node.js app time to boot before ECS starts checking health. Set this high enough for your startup time.
  • The health check uses wget rather than curl because the node:18-alpine image used later in this guide ships BusyBox wget but not curl.

Fargate CPU/Memory Valid Combinations

CPU (vCPU)   Memory Options (GB)
0.25         0.5, 1, 2
0.5          1, 2, 3, 4
1            2, 3, 4, 5, 6, 7, 8
2            4 through 16 (1 GB increments)
4            8 through 30 (1 GB increments)

For a typical Node.js API, 0.5 vCPU and 1 GB RAM is a solid starting point. Node.js is single-threaded, so throwing more CPU at a single container has diminishing returns. Scale horizontally with more tasks instead.
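
Scaling out is then a one-line service update (cluster and service names match the examples in this guide):

# Add capacity by raising the desired task count
aws ecs update-service \
  --cluster my-node-cluster \
  --service node-api-service \
  --desired-count 4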

ECR: Container Registry

Amazon Elastic Container Registry (ECR) is the native Docker registry for AWS. It integrates seamlessly with ECS — no credential management needed if your execution role has the right permissions.

# Create a repository
aws ecr create-repository --repository-name node-api

# Authenticate Docker with ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# Build and push an image
docker build -t node-api .
docker tag node-api:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/node-api:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/node-api:latest

Set up a lifecycle policy to automatically clean up old images. Without one, your ECR costs will grow indefinitely:

{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Keep last 10 images",
      "selection": {
        "tagStatus": "any",
        "countType": "imageCountMoreThan",
        "countNumber": 10
      },
      "action": {
        "type": "expire"
      }
    }
  ]
}
Save the policy as lifecycle-policy.json, then apply it:

aws ecr put-lifecycle-policy \
  --repository-name node-api \
  --lifecycle-policy-text file://lifecycle-policy.json

Load Balancer Integration with ALB

For any production service, you need an Application Load Balancer (ALB) in front of your ECS tasks. The ALB distributes traffic across tasks, terminates TLS, and performs health checks.

Target Group Configuration

ECS services register tasks as targets in an ALB target group. With Fargate, the target type must be ip (not instance):

# Create the target group
aws elbv2 create-target-group \
  --name node-api-tg \
  --protocol HTTP \
  --port 3000 \
  --vpc-id vpc-0abc123def456 \
  --target-type ip \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3

Health Check Endpoint

Your Node.js app needs a health check endpoint that the ALB can hit. Keep it simple:

var express = require("express");
var app = express();

app.get("/health", function(req, res) {
  res.status(200).json({ status: "healthy", timestamp: Date.now() });
});

For more sophisticated health checks, verify downstream dependencies:

var express = require("express");
var mongoose = require("mongoose");
var app = express();

app.get("/health", function(req, res) {
  var dbState = mongoose.connection.readyState;

  if (dbState === 1) {
    res.status(200).json({
      status: "healthy",
      database: "connected",
      uptime: process.uptime()
    });
  } else {
    res.status(503).json({
      status: "unhealthy",
      database: "disconnected",
      readyState: dbState
    });
  }
});

Be careful with deep health checks. If your database has a transient hiccup, you do not want the ALB to deregister all your tasks simultaneously. Consider having a shallow /health for the ALB and a deep /health/detailed for monitoring.
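
Continuing from the snippet above, that split might look like this (the /health/detailed path is a naming convention, not anything AWS requires):

// Shallow check for the ALB: never touches downstream dependencies
app.get("/health", function(req, res) {
  res.status(200).json({ status: "healthy" });
});

// Deep check reserved for dashboards and alerting
app.get("/health/detailed", function(req, res) {
  var dbConnected = mongoose.connection.readyState === 1;
  res.status(dbConnected ? 200 : 503).json({
    status: dbConnected ? "healthy" : "degraded",
    database: dbConnected ? "connected" : "disconnected",
    uptime: process.uptime()
  });
});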

Service Auto-Scaling

ECS integrates with Application Auto Scaling to adjust your desired task count based on metrics. There are three scaling policy types:

Target Tracking Scaling

This is the simplest and most common approach. You set a target value for a metric and ECS adjusts capacity to maintain it:

# Register the scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/my-node-cluster/node-api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 10

# Create a target tracking policy on CPU utilization
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/my-node-cluster/node-api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 60.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60
  }'

For Node.js APIs, I recommend scaling on ALBRequestCountPerTarget rather than CPU. Node.js CPU usage can be misleading because of the event loop — a single-threaded process can handle many concurrent requests at low CPU while still being saturated:

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/my-node-cluster/node-api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name request-count-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 1000.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ALBRequestCountPerTarget",
      "ResourceLabel": "app/my-alb/abc123/targetgroup/node-api-tg/def456"
    },
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60
  }'

Set ScaleOutCooldown lower than ScaleInCooldown. You want to scale up quickly and scale down slowly.

Environment Variables and Secrets

Never hardcode secrets in your task definition or Docker image. ECS supports two approaches:

Plain Environment Variables

Use the environment array for non-sensitive configuration:

"environment": [
  { "name": "NODE_ENV", "value": "production" },
  { "name": "LOG_LEVEL", "value": "info" },
  { "name": "PORT", "value": "3000" }
]

Secrets from SSM Parameter Store or Secrets Manager

Use the secrets array for sensitive values. ECS injects them at container startup:

"secrets": [
  {
    "name": "DATABASE_URL",
    "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/prod/database-url"
  },
  {
    "name": "JWT_SECRET",
    "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/jwt-secret"
  }
]

SSM Parameter Store is free for standard parameters and works well for most use cases. Secrets Manager costs $0.40 per secret per month but supports automatic rotation — use it for database credentials.
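
Creating the entries referenced above is straightforward (the values here are placeholders):

# Store a plain secret in SSM Parameter Store (free for standard parameters)
aws ssm put-parameter \
  --name /prod/database-url \
  --type SecureString \
  --value "postgres://user:password@db-host:5432/app"

# Store a rotatable credential in Secrets Manager
aws secretsmanager create-secret \
  --name prod/jwt-secret \
  --secret-string "replace-with-a-real-secret"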

Your execution role needs permission to read these secrets:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameters",
        "secretsmanager:GetSecretValue"
      ],
      "Resource": [
        "arn:aws:ssm:us-east-1:123456789012:parameter/prod/*",
        "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/*"
      ]
    }
  ]
}

Logging with CloudWatch

The awslogs log driver sends container stdout and stderr to CloudWatch Logs. This is the standard approach for Fargate since you have no access to the underlying host.

Create the log group before deploying:

aws logs create-log-group --log-group-name /ecs/node-api
aws logs put-retention-policy \
  --log-group-name /ecs/node-api \
  --retention-in-days 30

Structure your Node.js logs as JSON for easier querying in CloudWatch Logs Insights:

// Minimal structured logger: emits one JSON object per line so
// CloudWatch Logs Insights can filter on fields like level and service
var logger = {
  info: function(message, meta) {
    var entry = {
      level: "info",
      message: message,
      timestamp: new Date().toISOString(),
      service: "node-api"
    };
    if (meta) {
      Object.assign(entry, meta);
    }
    console.log(JSON.stringify(entry));
  },
  error: function(message, error, meta) {
    var entry = {
      level: "error",
      message: message,
      timestamp: new Date().toISOString(),
      service: "node-api",
      stack: error ? error.stack : undefined
    };
    if (meta) {
      Object.assign(entry, meta);
    }
    console.error(JSON.stringify(entry));
  }
};

module.exports = logger;

Then query structured logs in CloudWatch Logs Insights:

fields @timestamp, message, level
| filter level = "error"
| sort @timestamp desc
| limit 50

Networking: VPC, Security Groups, and Service Discovery

VPC Configuration

Fargate tasks run inside your VPC. You need:

  • Private subnets for your tasks (no direct internet access)
  • Public subnets for your ALB
  • NAT Gateway so tasks in private subnets can pull images from ECR and access external APIs

Lock traffic down with two security groups, one for the ALB and one for the tasks:

# Create security group for the ALB
aws ec2 create-security-group \
  --group-name alb-sg \
  --description "ALB security group" \
  --vpc-id vpc-0abc123def456

# Allow inbound HTTP/HTTPS
aws ec2 authorize-security-group-ingress \
  --group-id sg-alb123 \
  --protocol tcp --port 443 --cidr 0.0.0.0/0

aws ec2 authorize-security-group-ingress \
  --group-id sg-alb123 \
  --protocol tcp --port 80 --cidr 0.0.0.0/0

# Create security group for ECS tasks
aws ec2 create-security-group \
  --group-name ecs-tasks-sg \
  --description "ECS tasks security group" \
  --vpc-id vpc-0abc123def456

# Allow inbound only from the ALB security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-ecs123 \
  --protocol tcp --port 3000 --source-group sg-alb123

This is a critical security practice: ECS tasks should only accept traffic from the ALB, never directly from the internet.

Service Discovery with Cloud Map

For service-to-service communication, AWS Cloud Map provides DNS-based service discovery:

# Create a private DNS namespace
aws servicediscovery create-private-dns-namespace \
  --name internal.myapp \
  --vpc vpc-0abc123def456

# Create a service in the namespace
aws servicediscovery create-service \
  --name node-api \
  --dns-config "NamespaceId=ns-abc123,DnsRecords=[{Type=A,TTL=10}]" \
  --health-check-custom-config FailureThreshold=1

With service discovery enabled, other services can reach your Node.js API at node-api.internal.myapp without going through the ALB.
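
A sketch of what a caller looks like from another task in the same VPC, using Node's built-in http module:

// Call a sibling service through Cloud Map DNS instead of the public ALB
var http = require("http");

http.get("http://node-api.internal.myapp:3000/api/info", function(res) {
  var body = "";
  res.on("data", function(chunk) { body += chunk; });
  res.on("end", function() {
    console.log(JSON.parse(body));
  });
}).on("error", function(err) {
  console.error("service call failed:", err.message);
});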

Deployment Strategies

Rolling Update (Default)

ECS launches new tasks, waits for them to pass health checks, then drains and stops old tasks. Configure it in your service definition:

{
  "deploymentConfiguration": {
    "maximumPercent": 200,
    "minimumHealthyPercent": 100
  }
}

With minimumHealthyPercent at 100 and maximumPercent at 200, ECS launches a full set of new tasks before removing old ones. This ensures zero downtime but uses double the resources during deployment.

Blue-Green Deployment with CodeDeploy

For production workloads, blue-green deployments give you the ability to validate the new version before shifting traffic, and instantly roll back if something goes wrong.

This requires an ECS service configured with the CODE_DEPLOY deployment controller:

aws ecs create-service \
  --cluster my-node-cluster \
  --service-name node-api-service \
  --task-definition node-api-task:1 \
  --desired-count 2 \
  --launch-type FARGATE \
  --deployment-controller type=CODE_DEPLOY \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-private1", "subnet-private2"],
      "securityGroups": ["sg-ecs123"],
      "assignPublicIp": "DISABLED"
    }
  }' \
  --load-balancers '[{
    "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/node-api-tg/abc123",
    "containerName": "node-api",
    "containerPort": 3000
  }]'

CodeDeploy manages the traffic shift. You can configure it to shift all at once, linearly (e.g., 10% every minute), or in a canary pattern (10% first, then 90% after validation).
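
CodeDeploy learns which task definition and container to shift traffic to from an AppSpec file. A minimal sketch for this service (the task definition ARN is a placeholder):

{
  "version": 0.0,
  "Resources": [
    {
      "TargetService": {
        "Type": "AWS::ECS::Service",
        "Properties": {
          "TaskDefinition": "arn:aws:ecs:us-east-1:123456789012:task-definition/node-api-task:2",
          "LoadBalancerInfo": {
            "ContainerName": "node-api",
            "ContainerPort": 3000
          }
        }
      }
    }
  ]
}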

Complete Working Example

Let us deploy a Node.js Express application on Fargate with ALB, auto-scaling, and CloudWatch logging.

Step 1: The Node.js Application

// app.js
var express = require("express");
var os = require("os");

var app = express();
var port = process.env.PORT || 3000;

app.use(express.json());

// Health check endpoint
app.get("/health", function(req, res) {
  res.status(200).json({
    status: "healthy",
    hostname: os.hostname(),
    uptime: process.uptime(),
    timestamp: Date.now()
  });
});

// Main API endpoint
app.get("/api/info", function(req, res) {
  res.json({
    service: "node-api",
    version: process.env.APP_VERSION || "1.0.0",
    environment: process.env.NODE_ENV || "development",
    hostname: os.hostname()
  });
});

app.get("/api/items", function(req, res) {
  // Simulated data — in production this would query a database
  var items = [
    { id: 1, name: "Widget A", price: 29.99 },
    { id: 2, name: "Widget B", price: 49.99 },
    { id: 3, name: "Widget C", price: 19.99 }
  ];
  res.json({ items: items, count: items.length });
});

// Graceful shutdown handler. `server` is assigned at startup below, before
// any SIGTERM can arrive, so referencing it here is safe.
process.on("SIGTERM", function() {
  console.log(JSON.stringify({
    level: "info",
    message: "SIGTERM received, shutting down gracefully",
    timestamp: new Date().toISOString()
  }));
  server.close(function() {
    console.log(JSON.stringify({
      level: "info",
      message: "Server closed",
      timestamp: new Date().toISOString()
    }));
    process.exit(0);
  });
});

var server = app.listen(port, function() {
  console.log(JSON.stringify({
    level: "info",
    message: "Server started on port " + port,
    timestamp: new Date().toISOString()
  }));
});

module.exports = app;

The SIGTERM handler is essential. When ECS stops a task, it sends SIGTERM first and waits (default 30 seconds) before sending SIGKILL. Your app should finish in-flight requests and close connections cleanly.

Step 2: Dockerfile

FROM node:18-alpine

WORKDIR /app

# Copy package files first for better layer caching
COPY package*.json ./
RUN npm ci --only=production

# Copy application code
COPY . .

# Create non-root user
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
USER appuser

EXPOSE 3000

HEALTHCHECK --interval=30s --timeout=5s --start-period=30s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/health || exit 1

CMD ["node", "app.js"]

Key points: use npm ci instead of npm install for deterministic builds. Run as a non-root user. Use node:18-alpine for a smaller image. Copying package*.json before the rest of the source means the npm ci layer is cached and only rebuilt when dependencies change.
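
Pair the Dockerfile with a .dockerignore so local artifacts never bloat the build context or leak into the image. A minimal example:

# .dockerignore: keep the build context lean
node_modules
npm-debug.log
.git
.env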

Step 3: Build and Push to ECR

# Create ECR repository
aws ecr create-repository --repository-name node-api --region us-east-1

# Get the registry URI
REGISTRY="123456789012.dkr.ecr.us-east-1.amazonaws.com"

# Authenticate
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin $REGISTRY

# Build and push
docker build -t node-api:v1.0.0 .
docker tag node-api:v1.0.0 $REGISTRY/node-api:v1.0.0
docker tag node-api:v1.0.0 $REGISTRY/node-api:latest
docker push $REGISTRY/node-api:v1.0.0
docker push $REGISTRY/node-api:latest

Step 4: Create the ECS Cluster and Service

# Create cluster
aws ecs create-cluster --cluster-name production

# Create CloudWatch log group
aws logs create-log-group --log-group-name /ecs/node-api
aws logs put-retention-policy \
  --log-group-name /ecs/node-api \
  --retention-in-days 30

# Register task definition
aws ecs register-task-definition --cli-input-json file://task-definition.json

# Create the service with ALB
aws ecs create-service \
  --cluster production \
  --service-name node-api-service \
  --task-definition node-api-task \
  --desired-count 2 \
  --launch-type FARGATE \
  --platform-version LATEST \
  --network-configuration '{
    "awsvpcConfiguration": {
      "subnets": ["subnet-private1", "subnet-private2"],
      "securityGroups": ["sg-ecs123"],
      "assignPublicIp": "DISABLED"
    }
  }' \
  --load-balancers '[{
    "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/node-api-tg/abc123",
    "containerName": "node-api",
    "containerPort": 3000
  }]'

Step 5: Configure Auto-Scaling

# Register scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/production/node-api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 10

# Scale on request count per target
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production/node-api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name request-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 500.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ALBRequestCountPerTarget",
      "ResourceLabel": "app/prod-alb/abc123/targetgroup/node-api-tg/def456"
    },
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60
  }'

Step 6: Verify the Deployment

# Check service status
aws ecs describe-services \
  --cluster production \
  --services node-api-service \
  --query 'services[0].{desired: desiredCount, running: runningCount, status: status}'

# List running tasks
aws ecs list-tasks --cluster production --service-name node-api-service

# View logs
aws logs tail /ecs/node-api --follow

# Test the endpoint through the ALB
curl https://api.myapp.com/health
curl https://api.myapp.com/api/info

Deployment Automation Script

Here is a deployment script that builds, pushes, and updates your ECS service:

#!/bin/bash
set -euo pipefail

SERVICE_NAME="node-api-service"
CLUSTER="production"
TASK_FAMILY="node-api-task"
REGISTRY="123456789012.dkr.ecr.us-east-1.amazonaws.com"
ECR_REPO="$REGISTRY/node-api"
REGION="us-east-1"

# Get version from package.json
VERSION=$(node -e "console.log(require('./package.json').version)")
echo "Deploying version $VERSION"

# Authenticate with ECR (docker login expects the registry host, not the repository path)
aws ecr get-login-password --region $REGION | \
  docker login --username AWS --password-stdin $REGISTRY

# Build and push
docker build -t $ECR_REPO:$VERSION -t $ECR_REPO:latest .
docker push $ECR_REPO:$VERSION
docker push $ECR_REPO:latest

# Get current task definition and create new revision with updated image
TASK_DEF=$(aws ecs describe-task-definition \
  --task-definition $TASK_FAMILY \
  --query 'taskDefinition' \
  --output json)

NEW_TASK_DEF=$(echo "$TASK_DEF" | \
  jq --arg IMAGE "$ECR_REPO:$VERSION" \
  '.containerDefinitions[0].image = $IMAGE |
   del(.taskDefinitionArn, .revision, .status, .requiresAttributes, .compatibilities, .registeredAt, .registeredBy)')

# Register new task definition
NEW_REVISION=$(aws ecs register-task-definition \
  --cli-input-json "$NEW_TASK_DEF" \
  --query 'taskDefinition.taskDefinitionArn' \
  --output text)

echo "Registered new task definition: $NEW_REVISION"

# Update service
aws ecs update-service \
  --cluster $CLUSTER \
  --service $SERVICE_NAME \
  --task-definition $NEW_REVISION \
  --force-new-deployment

# Wait for deployment to stabilize
echo "Waiting for service to stabilize..."
aws ecs wait services-stable \
  --cluster $CLUSTER \
  --services $SERVICE_NAME

echo "Deployment complete: version $VERSION"

Common Issues and Troubleshooting

1. Task Fails to Start: CannotPullContainerError

CannotPullContainerError: Error response from daemon: pull access denied for 123456789012.dkr.ecr.us-east-1.amazonaws.com/node-api, repository does not exist or may require 'docker login'

This means your ECS task execution role lacks ECR permissions. Attach the AmazonECSTaskExecutionRolePolicy managed policy, or ensure the role has ecr:GetDownloadUrlForLayer, ecr:BatchGetImage, and ecr:GetAuthorizationToken permissions. Also verify the image URI matches your actual ECR repository name exactly.
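
The quickest fix is attaching the AWS-managed policy to the execution role from the task definition:

# Grant the execution role the standard ECR pull and CloudWatch Logs permissions
aws iam attach-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy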

2. Task Keeps Restarting: Health Check Failures

service node-api-service (instance i-xxx) (port 3000) is unhealthy in target-group node-api-tg due to (reason Health checks failed)

Common causes: your application takes too long to start and the health check times out before it is ready. Increase the startPeriod in the container health check and the HealthCheckGracePeriodSeconds on the service. Also verify your health endpoint returns a 200 status code and your security group allows traffic from the ALB on the correct port.

# Update service with longer health check grace period
aws ecs update-service \
  --cluster production \
  --service node-api-service \
  --health-check-grace-period-seconds 120

3. Tasks Stuck in PROVISIONING State

service node-api-service was unable to place a task because no container instance met all of its requirements.

For Fargate, this usually means you have exhausted the available IP addresses in your subnets. Each Fargate task requires one ENI and one private IP address. If your subnets are small (e.g., /28 with only 11 usable IPs), you will hit this limit. Use larger subnets (/24 at minimum for production) or reduce your max task count.
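
You can check how close a subnet is to exhaustion (the subnet ID is a placeholder):

# Count the remaining free IPs in a task subnet
aws ec2 describe-subnets \
  --subnet-ids subnet-private1 \
  --query 'Subnets[0].AvailableIpAddressCount'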

4. Secrets Not Available in Container

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve secret from asm: service call has been retried 5 time(s)

Your execution role cannot access Secrets Manager or SSM Parameter Store. Check three things: the role has the correct IAM permissions, the secret ARN in the task definition matches the actual resource, and the VPC has a route to the AWS APIs (either through a NAT Gateway or VPC endpoints for SSM and Secrets Manager).

5. Container Exits with Signal 9 (OOM Kill)

Essential container in task exited (exit code: 137, reason: OutOfMemoryError)

Exit code 137 means the container was killed with SIGKILL, typically due to exceeding the memory limit. Node.js defaults to a heap limit that may be higher than your Fargate memory allocation. Set the Node.js heap limit explicitly:

CMD ["node", "--max-old-space-size=768", "app.js"]

For a task with 1024 MB memory, set the heap to around 768 MB to leave room for the Node.js runtime overhead, OS buffers, and non-heap allocations.
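
To verify the limit a container actually ends up with, ask V8 directly:

# Print the effective V8 heap limit in MB (run inside the container, e.g. via ECS Exec)
node -e "console.log(require('v8').getHeapStatistics().heap_size_limit / 1024 / 1024)"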

Best Practices

  • Always run at least 2 tasks across multiple Availability Zones. A single task is a single point of failure, and AZ outages do happen.

  • Use immutable image tags (e.g., v1.2.3) instead of latest for production. When you deploy latest, you have no way to know which version is actually running or to roll back to a specific version.

  • Set the stopTimeout in your container definition to match your application's graceful shutdown time. The default is 30 seconds. If your app needs longer to drain connections, increase it (max 120 seconds for Fargate).

  • Enable ECS Exec for debugging production issues. It gives you an interactive shell into a running Fargate container without SSH:

    aws ecs execute-command \
      --cluster production \
      --task abc123def456 \
      --container node-api \
      --interactive \
      --command "/bin/sh"
    
  • Use separate execution and task roles. The execution role is for ECS infrastructure operations (pulling images, fetching secrets). The task role is for your application's runtime permissions. Never combine them into a single overly permissive role.

  • Implement connection draining properly. When a task is being stopped, the ALB deregisters the target and waits for in-flight requests. Set deregistration_delay.timeout_seconds on your target group to match your expected longest request duration (see the command after this list).

  • Pin your Fargate platform version in production. Using LATEST means new deployments might use a different platform version. Use a specific version like 1.4.0 for consistency.

  • Set resource limits carefully. Monitor your actual CPU and memory usage with CloudWatch Container Insights before right-sizing. Over-provisioning wastes money. Under-provisioning causes OOM kills and throttling.

  • Use VPC endpoints for ECR and CloudWatch if your tasks run in private subnets. This avoids NAT Gateway data transfer charges and reduces latency for image pulls and log delivery.
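
For the connection-draining item above, the target group attribute is set like this (60 seconds is an example value; the ARN matches the target group created earlier):

# Shorten the deregistration delay from the 300-second default
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/node-api-tg/abc123 \
  --attributes Key=deregistration_delay.timeout_seconds,Value=60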
