AWS/Terraform Workshop #7: AWS ECS insights and troubleshooting

Artem Nosulchik
Published in Universal Language · Jan 9, 2018


This post is part of our AWS/Terraform Workshops series that explores our vision for Service Oriented Architecture (SOA) and closely examines AWS Simple Storage Service, Terraform Remote State, and Identity and Access Management. To learn more, check out our introductory workshop and new posts at Smartling Engineering Blog.

Prerequisites

This workshop covers two ECS insights: troubleshooting applications running in ECS, and deployment into ECS with zero downtime. Unlike previous workshops, there is no preface section here; this is intentional, so please move on to the next section.

Hands On

Troubleshooting EC2 instance not registering in ECS cluster

The ECS agent allows container instances to connect to your cluster. It runs as a container that is included in the Amazon ECS-optimized AMI and starts on instance boot by default.

The ECS agent container stores its configuration in /etc/ecs/ecs.config on the host machine and writes logs to the /var/log/ecs directory. You can also view its logs with:

docker logs <container id>

In most cases the ECS agent's configuration in /etc/ecs/ecs.config includes only the name of the ECS cluster to join.
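For illustration only, here is a minimal Terraform sketch of how the cluster name can end up in that file via user data. The resource and variable names (workshop, ecs_optimized_ami_id, ecs_instance_profile) are assumptions, not the workshop's actual configuration.

resource "aws_ecs_cluster" "workshop" {
  name = "w7-workshop"
}

resource "aws_instance" "ecs_host" {
  ami                  = "${var.ecs_optimized_ami_id}" # Amazon ECS-optimized AMI
  instance_type        = "t2.micro"
  iam_instance_profile = "${var.ecs_instance_profile}"

  # Write the cluster name into /etc/ecs/ecs.config so the agent joins
  # this cluster instead of the one named 'default'.
  user_data = <<EOF
#!/bin/bash
echo "ECS_CLUSTER=${aws_ecs_cluster.workshop.name}" >> /etc/ecs/ecs.config
EOF
}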

Here are the top three reasons why the ECS agent may fail to register an instance in an ECS cluster:

1. Missing IAM permissions for the ECS agent.
Troubleshooting: check the ECS agent logs in the /var/log/ecs directory.
Reason: the ECS agent communicates with the ECS service and sends API calls to it, e.g. to register the EC2 instance in the cluster. If the IAM role of the EC2 instance that runs the ECS agent doesn't have the required permissions, those API calls fail with 403.
Solution: configure the IAM role of the EC2 instance with the minimum required permissions for ECS (a hedged Terraform sketch of such a role follows this list):

"Action": [
    "ecs:RegisterContainerInstance",
    "ecs:DeregisterContainerInstance",
    "ecs:SubmitContainerStateChange",
    "ecs:SubmitTaskStateChange",
    "ecs:List*",
    "ecs:*Poll*"
]

2. ECS cluster name is missing in the agent's configuration.
Troubleshooting: check the ECS agent logs and its config file.
Reason: if the ECS agent cannot read its config file, or if the config file is empty, it registers the EC2 instance in the ECS cluster named 'default'.
Solution: update the config file with the proper ECS cluster name before the agent starts. An instance cannot be re-registered (moved) to another ECS cluster; it has to be replaced with a new instance with a properly configured ECS agent.

3. Docker or the ECS agent container is missing in the AMI.
Troubleshooting: SSH into the instance, check for docker and the list of running containers.
Reason: if you're using a custom AMI for your EC2 instances, it may not come with docker or the ECS agent, or the wrong AMI is used.
Solution: add the missing components (docker and the ECS agent container) to the AMI, or use the ECS-optimized AMI.
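For reference, a minimal Terraform sketch of an IAM role carrying those permissions. The role, policy and profile names are made up for illustration and are not the workshop's files.

resource "aws_iam_role" "ecs_instance" {
  name = "w7-ecs-instance-role"

  # Let EC2 instances assume this role.
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

# Grants the ECS agent the minimum permissions listed above.
resource "aws_iam_role_policy" "ecs_agent" {
  name = "ecs-agent"
  role = "${aws_iam_role.ecs_instance.id}"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:RegisterContainerInstance",
        "ecs:DeregisterContainerInstance",
        "ecs:SubmitContainerStateChange",
        "ecs:SubmitTaskStateChange",
        "ecs:List*",
        "ecs:*Poll*"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}

# Instance profile that attaches the role to the EC2 instance.
resource "aws_iam_instance_profile" "ecs_instance" {
  name = "w7-ecs-instance-profile"
  role = "${aws_iam_role.ecs_instance.name}"
}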

Now let’s troubleshoot the problem:

Use the suggested Terraform configuration to create an ECS cluster and register an EC2 instance in this cluster. Find out why the EC2 instance doesn't register in the ECS cluster and fix it.
1. Go to the w7 directory in the cloned Smartling/aws-terraform-workshops git repository.
2. Complete the Terraform configuration to create the ECS cluster and EC2 instance. Keep the remaining resources in the Terraform configuration files commented out.
3. Add your SSH key to the user-data.txt file.
4. Apply the Terraform configuration.
5. Go to the ECS web console and check whether the container instance is registered there.
6. Troubleshoot why the EC2 instance isn't registered in the ECS cluster and fix the problem.

Troubleshoot ECS task failing in ECS service not attached to ELB

An ECS task definition is a JSON document that contains the configuration of all containers required to run your application, including image names, required resources, links between containers, etc.
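As a hedged example, a task definition for the workshop's sample application might look roughly like this in Terraform; the container name, ports and resource values are assumptions, not the contents of the workshop's files.

resource "aws_ecs_task_definition" "sample_app" {
  family = "workshop-sample-application"

  # One container, described as a JSON array of container definitions.
  container_definitions = <<EOF
[
  {
    "name": "sample-app",
    "image": "anosulchik/workshop-sample-application:v1.0",
    "memory": 100,
    "cpu": 128,
    "essential": true,
    "portMappings": [
      { "containerPort": 8080, "hostPort": 80 }
    ]
  }
]
EOF
}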

Top reasons why ECS tasks may fail in an ECS service not attached to an ELB:

1. Mistake in the ECS task definition, e.g. a missing docker image or more resources requested than are available in the ECS cluster.
Troubleshooting: in the ECS web console, look at the stopped task status for the ECS service; it includes the reason why the task was stopped, e.g. cannot pull container.
Reason: we're humans and may make mistakes in container definitions, e.g. a typo in the docker image name.
Solution: find the source of the problem and fix it in the task definition.

2. ECS cluster infrastructure problems.
Troubleshooting: in the ECS web console, check the stopped task reasons; check the application's logs for errors, and change the logging severity if necessary to identify why the application or its container fails to start.
Reason: there might be no access to the smartling docker registry or to any other resource the service relies on, e.g. a database.
Solution: adjust security groups and/or make other changes to the infrastructure to eliminate the source of the problem.

3. Crashing application running in a container.
Troubleshooting: ECS console and application container logs.
Reason: the application exits unexpectedly, e.g. due to a bug in the code or a corrupt docker image.
Solution: identify and fix the problem, e.g. create a new docker image with the bugs fixed.

4. Missing IAM permissions for the ECS agent to start tasks.
Troubleshooting: ECS agent logs.
Reason: the ECS agent must have enough permissions to start and stop ECS tasks.
Solution: fix the IAM permissions for the container instance's IAM role.

Let’s troubleshoot and fix failing ECS service:

Create an ECS service to run the sample application. Do not attach an ELB to the ECS service at this step (troubleshoot a container failing to start after deployment).
1. Uncomment the resources in the file ecs_service.tf and complete the Terraform configuration (a hedged sketch of such a service follows these steps). Keep the resources in the ecs_service_elb.tf file commented out.
2. Make sure you specified the docker image anosulchik/workshop-sample-application:v1.0 in the task definition.
3. Apply the Terraform configuration.
4. Go to the ECS web console, choose the ECS cluster and ECS service, and check the pending, running and stopped ECS tasks.
5. Troubleshoot why the tasks fail to start.
6. Fix the problem by making changes to the Terraform config files.
7. Apply the Terraform configuration changes.
8. Go to the ECS console to see the ECS tasks running.
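A minimal sketch of what such a service resource could look like; the resource names reuse the hypothetical examples above and are not necessarily what ecs_service.tf expects.

resource "aws_ecs_service" "sample_app" {
  name            = "workshop-sample-application"
  cluster         = "${aws_ecs_cluster.workshop.id}"
  task_definition = "${aws_ecs_task_definition.sample_app.arn}"
  desired_count   = 1
}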

Troubleshoot ECS task failing in ECS service attached to ELB

An ELB attached to an ECS service automatically checks the health of the tasks in your service. If it finds an unhealthy task, it stops sending traffic to it and reroutes traffic to healthy instances. ECS then stops the unhealthy task and starts another instance of that task.

Top reasons for failing ECS tasks in an ECS service attached to an ELB:

The application running in the container fails to respond to ELB health checks.
Troubleshooting: the ECS service's Events tab in the web console.
Reason: a bug in the application code, misconfigured health checks or port mapping, or no access from the ELB to the EC2 instance.
Solution: eliminate the cause of the problem, e.g. fix the EC2 security group, change the ELB health check configuration, etc.
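To make the moving parts concrete, here is a hedged sketch of an ELB health check and the security group rule it depends on; the ports, paths, names and variables (subnet_id, elb_security_group_id, instance_security_group_id) are assumptions rather than the workshop's values.

resource "aws_elb" "sample_app" {
  name            = "w7-sample-app"
  subnets         = ["${var.subnet_id}"]
  security_groups = ["${var.elb_security_group_id}"]

  listener {
    lb_port           = 80
    lb_protocol       = "http"
    instance_port     = 80
    instance_protocol = "http"
  }

  # The application must answer this check on the instance port,
  # otherwise ECS keeps stopping and restarting "unhealthy" tasks.
  health_check {
    target              = "HTTP:80/"
    interval            = 10
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
}

# The instance security group must allow traffic from the ELB,
# otherwise every health check fails.
resource "aws_security_group_rule" "elb_to_instance" {
  type                     = "ingress"
  from_port                = 80
  to_port                  = 80
  protocol                 = "tcp"
  security_group_id        = "${var.instance_security_group_id}"
  source_security_group_id = "${var.elb_security_group_id}"
}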

Go ahead and troubleshoot why ECS service doesn’t stabilize:

Create an ECS service with an ELB attached (troubleshoot a flapping container: it is started and then stopped by the ECS scheduler). You should end up with the EC2 instance registered in the ELB with status InService.
1. Uncomment the resources in the ecs_service_elb.tf file.
2. Finish the configuration to create the ECS service attached to the ELB.
3. Run "terraform plan" and "terraform apply".
4. Go to the EC2 web console, find your ELB and check the instance's state.
5. Go to the ECS web console to see what happens in the newly created ECS service: tasks are constantly restarted. Find out why this is happening.
6. Once the problem is fixed you should see the EC2 instance in the ELB in the InService state. The sample application should be accessible via HTTP: open the ELB's address in your browser.

How to track deployments into ECS?

There are three deployment statuses for an ECS service:

  • PRIMARY — the most recent deployment
  • ACTIVE — previous deployments that still have tasks running, which are being replaced with the PRIMARY deployment
  • INACTIVE — deployments that have been completely replaced
1. Go to the ECS web console, choose the ECS service and open the Deployments tab: you should see 1 deployment in state PRIMARY.
2. Change the container definition, e.g. change the allocated memory from 100 to 128.
3. Run terraform plan and terraform apply.
4. Return to the ECS console and open the Deployments tab of the ECS service: you should see 2 deployments, one of them ACTIVE and another PRIMARY.

Here at Smartling, we use a Lambda function to detect anomalies in ECS services, such as failed and/or stuck deployments. You can look into it in the optional hands-on step of this workshop (see the Bonus Track section below).

Deployments into ECS with zero downtime

The ECS service deployment configuration controls how many ECS tasks may run during a deployment and the order of stopping and starting tasks:

  1. minimumHealthyPercent (the lower limit on the number of tasks that must remain running in a service during a deployment).
  2. maximumPercent (the upper limit on the number of tasks that are allowed to run in a service during a deployment).

The default ECS deployment configuration has minimumHealthyPercent = 100%, which means that your ECS cluster must have enough capacity to start the new version of the application while the old version is still running.
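In Terraform these limits map to the deployment_minimum_healthy_percent and deployment_maximum_percent arguments of aws_ecs_service. The sketch below reuses the hypothetical names from the earlier examples; the service role variable (ecs_service_role_arn) and the values are assumptions, not the workshop's configuration.

resource "aws_ecs_service" "sample_app_elb" {
  name            = "workshop-sample-application-elb"
  cluster         = "${aws_ecs_cluster.workshop.id}"
  task_definition = "${aws_ecs_task_definition.sample_app.arn}"
  desired_count   = 2
  iam_role        = "${var.ecs_service_role_arn}" # role ECS uses to register tasks with the ELB

  load_balancer {
    elb_name       = "${aws_elb.sample_app.name}"
    container_name = "sample-app"
    container_port = 8080
  }

  # Keep all desired tasks running (100%) and allow up to twice the
  # desired count during a deployment; this requires spare cluster capacity.
  deployment_minimum_healthy_percent = 100
  deployment_maximum_percent         = 200
}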

1. Configure the ECS service to have deployments with zero downtime:
a. Change the setup of the ASG: set min = max = 2 to add one more instance to the ECS cluster.
b. Change desired_count in the ECS service (the one attached to the ELB) from 1 to 2.
c. Run terraform plan and terraform apply.
d. Wait until the ELB has two instances InService. It may take 3-4 minutes for the second instance to boot up and run tasks.
e. SSH into one of the instances and start a simple uptime checker for our sample service:

while true; do curl -m 5 -sko /dev/null "http://<elb_dns_name>:80/" -w "%{http_code} %{time_appconnect} %{time_connect} %{time_total}\n" 2>&1; sleep 1; done

where elb_dns_name is the DNS name of the ELB attached to the ECS service. You can find it in terraform.tfstate or in the AWS web console. If the service is up and running you should see something like this:

200 0.000 0.013 0.017
200 0.000 0.005 0.008
200 0.000 0.005 0.007
200 0.000 0.005 0.007

f. Change the docker image from anosulchik/workshop-sample-application:v1.0 to anosulchik/workshop-sample-application:v2.0 in the container definitions.
g. Run terraform plan and terraform apply.
h. Review the output of the uptime checker: you should see 503 returned by the ELB, which means there is downtime during the deployment.
i. Make changes to the Terraform configuration to deploy version v1.0 again with zero downtime. Changes are required to the ELB and ECS service resources.
j. Run terraform plan and terraform apply.
k. Check the uptime checker's output to make sure there was no downtime for the service during the deployment.

Bonus Track

Automate detection of failed ECS deployments.
1. Uncomment the lambda.tf file and finish the configuration to create the Lambda function that detects failed deploys into the ECS service attached to the ELB (deployed in the previous step of this workshop). A hedged sketch of the surrounding notification resources follows these steps.
2. Make an intentional mistake in the task definition: specify a non-existent docker image.
3. Subscribe your email to the SNS topic that is used by the Lambda function to send notifications.
4. Run terraform plan and terraform apply.
5. You should get an email notification about the failed deployment as soon as the Lambda function is executed by the CloudWatch event rule.
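For orientation, here is a hedged sketch of the notification plumbing around such a function; the topic name, schedule, email address and the deploy_checker_lambda_arn variable are assumptions, and lambda.tf may wire things differently. Email subscriptions must be confirmed from your inbox, and older Terraform AWS provider versions do not support the email protocol, in which case you can subscribe via the SNS console instead.

resource "aws_sns_topic" "ecs_deploy_alerts" {
  name = "w7-ecs-deploy-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = "${aws_sns_topic.ecs_deploy_alerts.arn}"
  protocol  = "email"
  endpoint  = "you@example.com" # confirm the subscription from your inbox
}

# Run the deployment checker Lambda on a schedule.
resource "aws_cloudwatch_event_rule" "ecs_deploy_check" {
  name                = "w7-ecs-deploy-check"
  schedule_expression = "rate(5 minutes)"
}

resource "aws_cloudwatch_event_target" "ecs_deploy_check" {
  rule = "${aws_cloudwatch_event_rule.ecs_deploy_check.name}"
  arn  = "${var.deploy_checker_lambda_arn}" # the Lambda defined in lambda.tf
}

# Allow CloudWatch Events to invoke the function.
resource "aws_lambda_permission" "allow_events" {
  statement_id  = "AllowExecutionFromCloudWatch"
  action        = "lambda:InvokeFunction"
  function_name = "${var.deploy_checker_lambda_arn}"
  principal     = "events.amazonaws.com"
  source_arn    = "${aws_cloudwatch_event_rule.ecs_deploy_check.arn}"
}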
