Rolling deployments are fine until they aren't. The problem is that a rolling update gradually replaces old tasks with new ones, and if your new version is broken, some percentage of real traffic hits broken code while the rollout is in progress. Blue/green sidesteps this by running both versions in parallel, shifting traffic only when the new version is verified healthy, and leaving the old version running as an instant rollback target. AWS CodeDeploy supports blue/green natively for ECS, and it's worth setting up for anything you care about.
## How It Works
The setup involves three pieces:
- Two ECS task sets. CodeDeploy creates a replacement ("green") task set alongside your current ("blue") one. Both run simultaneously during the deployment.
- An ALB with two target groups. One target group points at blue, one at green. CodeDeploy shifts listener rules between them.
- A CodeDeploy deployment group. Configured with your ECS service, cluster, and both target groups.
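For reference, the green target group can be created like this (blue is identical with a different name). This is a sketch: the VPC ID is a placeholder, and the `/health` path and port 5000 are assumptions matching the examples below.

```shell
# Green target group for the blue/green pair. The blue one is created the
# same way with --name myapp-blue-tg. vpc-0abc123 is a placeholder;
# --target-type ip is what Fargate tasks require.
aws elbv2 create-target-group \
  --name myapp-green-tg \
  --protocol HTTP \
  --port 5000 \
  --vpc-id vpc-0abc123 \
  --target-type ip \
  --health-check-path /health
```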
Traffic shifting happens according to the deployment configuration you choose.
## Traffic Shifting Configurations
CodeDeploy ships with three built-in strategies for ECS:
- CodeDeployDefault.ECSAllAtOnce — Shift 100% of traffic to green immediately. Fast, but no gradual validation. Good for non-production environments.
- CodeDeployDefault.ECSLinear10PercentEvery1Minute — Shift 10% every minute until you reach 100%. Ten minutes total, with CloudWatch alarm checks at each step.
- CodeDeployDefault.ECSCanary10Percent5Minutes — Shift 10% immediately, wait 5 minutes, then shift the remaining 90%. This is my default for production services. You get early signal on the new version with limited blast radius, and the total deployment time is predictable.
You can also define custom configurations with arbitrary percentages and intervals.
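A custom configuration is created with `create-deployment-config`. As a sketch, here's a hypothetical 20% canary with a 10-minute bake (the config name is illustrative):

```shell
# Custom canary config: shift 20% of traffic first, hold for 10 minutes,
# then shift the remaining 80%.
aws deploy create-deployment-config \
  --deployment-config-name Custom.ECSCanary20Percent10Minutes \
  --compute-platform ECS \
  --traffic-routing-config "type=TimeBasedCanary,timeBasedCanary={canaryPercentage=20,canaryInterval=10}"
```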
## The appspec.yml
CodeDeploy reads an appspec.yml to know what to deploy. For ECS, it's simple:
```yaml
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: <TASK_DEFINITION>
        LoadBalancerInfo:
          ContainerName: "web"
          ContainerPort: 5000
        PlatformVersion: "LATEST"
```
The <TASK_DEFINITION> placeholder gets substituted at deploy time — your CI pipeline registers a new task definition revision and injects the ARN here. If you're using CodePipeline, the imageDetail.json artifact from your build stage handles this automatically.
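Outside CodePipeline, the substitution can be a plain sed step in CI. This sketch assumes a hypothetical `appspec.template.yml` containing the placeholder (generated inline here just to keep the example self-contained; in a real repo it would be checked in):

```shell
# Hypothetical ARN from the register-task-definition step.
NEW_TASK_DEF_ARN=arn:aws:ecs:us-east-1:123456789012:task-definition/myapp:42

# In a real pipeline this template is checked into the repo, not generated here.
printf 'TaskDefinition: <TASK_DEFINITION>\n' > appspec.template.yml

# Substitute the ARN into the appspec; | as the sed delimiter avoids
# clashing with the slashes in the ARN.
sed "s|<TASK_DEFINITION>|$NEW_TASK_DEF_ARN|" appspec.template.yml > appspec.yml
cat appspec.yml
```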
A minimal imageDetail.json from your build step:
```json
{
  "ImageURI": "123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:a1b2c3d"
}
```
CodePipeline stitches the image URI into the task definition registration step before passing it to CodeDeploy.
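One common way to emit that artifact is a one-liner in the build step. A sketch, with example values standing in for variables that a real buildspec would export earlier:

```shell
# Example values; in CodeBuild these would come from earlier build steps.
REPOSITORY_URI=123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp
IMAGE_TAG=a1b2c3d

# Write imageDetail.json in the shape the ECS (Blue/Green) deploy action expects.
printf '{"ImageURI":"%s:%s"}\n' "$REPOSITORY_URI" "$IMAGE_TAG" > imageDetail.json
cat imageDetail.json
```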
## Registering the Task Definition
Before CodeDeploy can deploy, you need a new task definition revision. Here's how to do it from the CLI in a build script:
```shell
# Get the current task definition and swap in the new image.
TASK_DEF=$(aws ecs describe-task-definition \
  --task-definition myapp \
  --query 'taskDefinition' \
  --output json)

NEW_TASK_DEF=$(echo "$TASK_DEF" | python3 -c "
import json, sys

td = json.load(sys.stdin)
for cd in td['containerDefinitions']:
    if cd['name'] == 'web':
        cd['image'] = '${NEW_IMAGE_URI}'

# Remove read-only fields not accepted by register-task-definition.
for key in ['taskDefinitionArn', 'revision', 'status', 'requiresAttributes',
            'compatibilities', 'registeredAt', 'registeredBy']:
    td.pop(key, None)

print(json.dumps(td))
")

NEW_TASK_DEF_ARN=$(echo "$NEW_TASK_DEF" | aws ecs register-task-definition \
  --cli-input-json file:///dev/stdin \
  --query 'taskDefinition.taskDefinitionArn' \
  --output text)

echo "Registered: $NEW_TASK_DEF_ARN"
```
This pattern — describe, mutate, re-register — is the standard approach. The Python snippet strips the read-only fields that the API rejects on re-registration.
## Setting Up the Deployment Group
Via the CLI:
```shell
aws deploy create-deployment-group \
  --application-name myapp \
  --deployment-group-name myapp-production \
  --deployment-config-name CodeDeployDefault.ECSCanary10Percent5Minutes \
  --service-role-arn arn:aws:iam::123456789012:role/CodeDeployECSRole \
  --deployment-style deploymentType=BLUE_GREEN,deploymentOption=WITH_TRAFFIC_CONTROL \
  --ecs-services clusterName=production,serviceName=myapp \
  --load-balancer-info "targetGroupPairInfoList=[{
      targetGroups=[
        {name=myapp-blue-tg},
        {name=myapp-green-tg}
      ],
      prodTrafficRoute={listenerArns=[arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/myapp-alb/abc123/def456]},
      testTrafficRoute={listenerArns=[arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/myapp-alb/abc123/xyz789]}
    }]" \
  --blue-green-deployment-configuration "terminateBlueInstancesOnDeploymentSuccess={
      action=TERMINATE,
      terminationWaitTimeInMinutes=60
    },
    deploymentReadyOption={
      actionOnTimeout=CONTINUE_DEPLOYMENT
    }"
```
Note the `--deployment-style` flag: ECS blue/green deployment groups require `deploymentType=BLUE_GREEN` with `deploymentOption=WITH_TRAFFIC_CONTROL`.
The terminationWaitTimeInMinutes is how long CodeDeploy keeps the old (blue) task set running after a successful shift. I set it to 60 minutes for production — that's your rollback window without needing a full re-deploy.
The testTrafficRoute listener is optional but useful: it lets you send test traffic to the green task set before shifting production traffic. Point your smoke tests at port 8080 (or whatever you use for the test listener) as a pre-shift validation step.
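With the task definition registered and the deployment group in place, a deployment can be triggered from the CLI by passing the appspec content as the revision. A sketch, assuming the `appspec.yml` next to the script already has the task definition ARN substituted in:

```shell
# Wrap the appspec in the revision JSON that create-deployment expects.
# appspec.yml is assumed to already contain the new task definition ARN.
python3 - <<'EOF' > revision.json
import json

with open("appspec.yml") as f:
    content = f.read()

print(json.dumps({
    "revisionType": "AppSpecContent",
    "appSpecContent": {"content": content},
}))
EOF

DEPLOYMENT_ID=$(aws deploy create-deployment \
  --application-name myapp \
  --deployment-group-name myapp-production \
  --revision file://revision.json \
  --query 'deploymentId' --output text)

# Block until CodeDeploy reports success; exits non-zero on failure/rollback,
# which fails the CI job.
aws deploy wait deployment-successful --deployment-id "$DEPLOYMENT_ID"
```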
## Rollback Triggers with CloudWatch Alarms
CodeDeploy can watch CloudWatch alarms and roll back automatically if they fire during a deployment. Add an alarm trigger to the deployment group:
```shell
aws deploy update-deployment-group \
  --application-name myapp \
  --current-deployment-group-name myapp-production \
  --alarm-configuration "enabled=true,alarms=[
      {name=myapp-5xx-rate},
      {name=myapp-p99-latency}
    ]" \
  --auto-rollback-configuration "enabled=true,events=[DEPLOYMENT_FAILURE,DEPLOYMENT_STOP_ON_ALARM]"
```
DEPLOYMENT_STOP_ON_ALARM is the key event. If myapp-5xx-rate fires while CodeDeploy is in the canary phase, the deployment stops and traffic shifts back to blue automatically. No human required.
Set up your CloudWatch alarms to cover the metrics you actually care about: error rate, latency percentiles, custom application metrics from CloudWatch Embedded Metric Format. The alarms need to be in the same region as the deployment group.
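As an example, the 5xx alarm on the ALB might look like the following. The threshold, period, and load balancer dimension value are assumptions — tune them to your traffic:

```shell
# Alarm on target 5xx responses behind the ALB. The dimension value
# (app/myapp-alb/abc123) and the threshold of 10 errors over two
# 60-second periods are illustrative, not recommendations.
aws cloudwatch put-metric-alarm \
  --alarm-name myapp-5xx-rate \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_Target_5XX_Count \
  --dimensions Name=LoadBalancer,Value=app/myapp-alb/abc123 \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data notBreaching
```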
## Health Check Grace Periods
One gotcha worth calling out: ECS services have a healthCheckGracePeriodSeconds setting. This tells ECS to ignore health check failures for the first N seconds after a task starts. If your app takes 30 seconds to initialize and your grace period is 10, ECS will kill the task before it ever finishes starting.
For blue/green with CodeDeploy, this is doubly important: the ALB target group health check needs to pass before CodeDeploy considers the green task set healthy enough to receive production traffic. If your health check grace period is too short, you'll see deployments fail immediately with tasks being killed during startup.
My rule of thumb: set the grace period to at least 2x your observed cold-start time, and make sure your /health endpoint responds quickly (not doing DB migrations or cache warming).
```shell
aws ecs update-service \
  --cluster production \
  --service myapp \
  --health-check-grace-period-seconds 60
```
## Wrapping Up
Blue/green on ECS with CodeDeploy is more setup than a basic rolling update, but the operational benefits are real: instant rollback, gradual traffic shifting, and automatic alarm-triggered rollbacks. The canary configuration is the sweet spot for production — 10% early validation catches most problems before you've committed to the full shift. The main thing to get right upfront is the health check grace period; that's where I've seen most failed deployments in practice.