Amazon ECS: a retrospective

After a few years working with Kubernetes in EKS, plain VMs, Lambdas and similar compute solutions, I finally got my hands on ECS. I won’t go into much detail on the why but, essentially, I felt EKS/k8s was too much for the team that was going to work with it (not due to lack of skill, but rather time/manpower), and ECS seemed to abstract away a lot of things, so I decided to go for it.

As expected, documentation is vast and finding examples is quite easy. The problem is that some details are unclear, harder to find, or even just omitted from the documentation.

Connecting services

If you’re deploying multiple services in your cluster, there’s a chance some of them need to talk to each other.

ECS gives you two options for this: Service Discovery and Service Connect.

In short, Service Discovery employs a DNS-based service lookup approach, while Service Connect uses a service mesh by deploying (and managing) sidecar proxy containers along with your services. Both use AWS Cloud Map, the difference being that Service Discovery relies on DNS lookups, while Service Connect queries Cloud Map directly via its API.

If you’re interested in more details, check this article and this Stack Overflow answer.

Service Connect: things to know

Before going into specifics, a basic concept to get out of the way is that services that connect to other services are “clients”. Services that accept connections from other services are “servers”.

For example, say you have a REST API server and a service that sends emails to users on request. The API server asks the emailer service to send messages. The API server is a “client”; the emailer service is a “server”.

Deployment order matters!

Maybe this is an obvious thing and it’s just me that hadn’t had previous service mesh/discovery experience. Maybe not.

Either way, after properly configuring the setup I had and double-checking it many times, I was still not able to make service A talk to service B.

Eventually, I stumbled upon this demo video, in which an AWS developer says that service deployment order matters.

Servers need to be deployed before clients in order to be discoverable.
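With Terraform/OpenTofu, one way to encode this ordering is an explicit dependency from the client service to the server service. The sketch below is my own assumption about how that could look, not something taken from the demo video. It reuses the resource names from the client and server examples in this post and shows only the ordering-related arguments. It leans on wait_for_steady_state so that the server resource only counts as created once its deployment has settled; without it, depends_on only orders the API calls.

```hcl
# Server: created first so it is registered in the namespace
# before any client starts.
resource "aws_ecs_service" "demo_api_service" {
  name            = "demo-api-${var.env}-service"
  cluster         = aws_ecs_cluster.ecs_cluster.id
  task_definition = aws_ecs_task_definition.demo_api_task.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  # Assumption: wait until the deployment is stable, so dependants are
  # only created once the server tasks are actually running.
  wait_for_steady_state = true

  # network_configuration and service_connect_configuration are omitted
  # here; they are the same as in the full server example.
}

# Client: explicitly ordered after the server.
resource "aws_ecs_service" "demo_app_service" {
  name            = "demo-${var.env}-service"
  cluster         = aws_ecs_cluster.ecs_cluster.id
  task_definition = aws_ecs_task_definition.demo_app_task.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  # Terraform/OpenTofu will not create this service until the server
  # resource above has finished creating.
  depends_on = [aws_ecs_service.demo_api_service]

  # Remaining blocks omitted; same as in the full client example.
}
```

This only helps on the first apply. If a client was already deployed before its server existed, a fresh deployment of the client is likely needed so that its proxy picks up the new endpoint.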

Fields to configure

Check this documentation page before starting to create your services.

This is useful because you will need to know, for example, that each entry in portMappings of a server service needs a name, because that name is how the serviceConnectConfiguration.service config identifies where incoming traffic to the service should be directed.

Here’s a redacted version of what the aws_ecs_service Terraform/OpenTofu objects look like for client and server:
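As a rough illustration of the named port mapping, here is a minimal sketch of the task definition that the server service points at. The family, container name, CPU/memory sizes and var.demo_api_image are hypothetical placeholders; local.demo_api_port_name and local.demo_api_container_port are the same locals the server service uses.

```hcl
resource "aws_ecs_task_definition" "demo_api_task" {
  family                   = "demo-api-${var.env}" # hypothetical family name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "256" # hypothetical sizing
  memory                   = "512"

  container_definitions = jsonencode([
    {
      name      = "demo-api-${var.env}-container" # hypothetical container name
      image     = var.demo_api_image              # hypothetical variable
      essential = true

      portMappings = [
        {
          # This name is what port_name in the service's
          # service_connect_configuration refers to.
          name          = local.demo_api_port_name
          containerPort = local.demo_api_container_port
          protocol      = "tcp"
        }
      ]
    }
  ])
}
```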

client

resource "aws_ecs_service" "demo_app_service" {
  name            = "demo-${var.env}-service"
  cluster         = aws_ecs_cluster.ecs_cluster.id
  task_definition = aws_ecs_task_definition.demo_app_task.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = local.ecs_security_groups
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.admin_website_app_tg.arn
    container_name   = "hbh-admin-website-${var.env}-container"
    container_port   = local.admin_website_container_port
  }

  service_connect_configuration {
    enabled   = true
    namespace = aws_service_discovery_http_namespace.private_dns_namespace.arn
  }
}

server

resource "aws_ecs_service" "demo_api_service" {
  name            = "demo-api-${var.env}-service"
  cluster         = aws_ecs_cluster.ecs_cluster.id
  task_definition = aws_ecs_task_definition.demo_api_task.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = local.ecs_security_groups
    assign_public_ip = false
  }

  service_connect_configuration {
    enabled   = true
    namespace = aws_service_discovery_http_namespace.private_dns_namespace.arn
    service {
      port_name = local.demo_api_port_name
      client_alias {
        port = local.demo_api_container_port
      }
    }
  }
}

Discovery names

By default, you can refer to a service by portName.namespace, where portName is the name given to the port in the portMappings of the server service, and namespace is the name of the Cloud Map namespace in use (it can be created explicitly, or one will be created for you otherwise).

This can be overridden via discoveryName in the serviceConnectService object.
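In Terraform/OpenTofu, the override maps to the discovery_name argument of the service block. A minimal sketch, assuming a hypothetical discovery name of "demo-api" and showing only the Service Connect part of the server service:

```hcl
resource "aws_ecs_service" "demo_api_service" {
  name            = "demo-api-${var.env}-service"
  cluster         = aws_ecs_cluster.ecs_cluster.id
  task_definition = aws_ecs_task_definition.demo_api_task.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  # network_configuration omitted; same as in the full server example.

  service_connect_configuration {
    enabled   = true
    namespace = aws_service_discovery_http_namespace.private_dns_namespace.arn

    service {
      port_name = local.demo_api_port_name

      # Hypothetical override: the service is registered as "demo-api"
      # instead of under the port mapping name.
      discovery_name = "demo-api"

      client_alias {
        port = local.demo_api_container_port
      }
    }
  }
}
```

With this in place, clients in the same namespace would be expected to reach the service at demo-api.namespace on the client alias port, rather than at portName.namespace.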

Health checks

Your typical health check will be a curl request. ECS executes this in the container itself, just like Docker’s HEALTHCHECK (if the image defines one, it is replaced with the task definition health check configuration). Make sure curl is installed in the image; otherwise the check will fail and the logs are not going to be clear about it.
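For reference, a curl-based check in the container definition could look roughly like the sketch below. The /health path, the timing values, the container name and var.demo_api_image are assumptions for illustration; adjust them to whatever your application actually exposes.

```hcl
resource "aws_ecs_task_definition" "demo_api_task" {
  family                   = "demo-api-${var.env}" # hypothetical family name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "256"
  memory                   = "512"

  container_definitions = jsonencode([
    {
      name      = "demo-api-${var.env}-container" # hypothetical container name
      image     = var.demo_api_image              # hypothetical variable
      essential = true

      healthCheck = {
        # Runs inside the container, so curl must exist in the image.
        # The /health endpoint is an assumption about the application.
        command = [
          "CMD-SHELL",
          "curl -f http://localhost:${local.demo_api_container_port}/health || exit 1"
        ]
        interval    = 30 # seconds between checks
        timeout     = 5  # seconds before a check counts as failed
        retries     = 3  # consecutive failures before "unhealthy"
        startPeriod = 10 # grace period after container start
      }
    }
  ])
}
```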

Task Role vs Task Execution Role

I can’t say this is well documented, but I felt that, when jumping on to create a cluster and my first services/tasks with OpenTofu, it was immediately clear that these two existed.

The Task Role is an IAM role attached to a specific task definition that grants the task’s containers permission to access AWS resources. The role is assumed by the containers running in the task.

The Task Execution Role is also an IAM role attached to the task definition, but it is used by ECS itself to run the task. This is typically what allows ECS to pull a private image from ECR, send logs to CloudWatch, or read Secrets Manager secrets for the environment variables.
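To make the distinction concrete, here is a minimal sketch of both roles and how they are wired into a task definition. The role names, container name and var.demo_api_image are hypothetical. The execution role uses the AWS managed AmazonECSTaskExecutionRolePolicy, which covers ECR pulls and CloudWatch logs; reading secrets would need additional permissions that are not shown here.

```hcl
# Trust policy shared by both roles: ECS tasks are allowed to assume them.
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# Execution role: used by ECS to start the task (pull image, ship logs).
resource "aws_iam_role" "task_execution_role" {
  name               = "demo-${var.env}-task-execution" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

resource "aws_iam_role_policy_attachment" "task_execution_managed" {
  role       = aws_iam_role.task_execution_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Task role: assumed by the application code inside the containers.
# Attach whatever your application needs (S3, SQS, ...) to this one.
resource "aws_iam_role" "task_role" {
  name               = "demo-${var.env}-task" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

resource "aws_ecs_task_definition" "demo_api_task" {
  family                   = "demo-api-${var.env}" # hypothetical family name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "256"
  memory                   = "512"

  execution_role_arn = aws_iam_role.task_execution_role.arn
  task_role_arn      = aws_iam_role.task_role.arn

  container_definitions = jsonencode([
    {
      name      = "demo-api-${var.env}-container" # hypothetical container name
      image     = var.demo_api_image              # hypothetical variable
      essential = true
    }
  ])
}
```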

Debugging with ECS Exec

If you’re used to using kubectl to run commands from within Kubernetes pods/containers, you can achieve the same with ECS Exec.

Looks like this:

aws ecs execute-command --cluster cluster-name \
    --task task-id \
    --container container-name \
    --interactive \
    --command "/bin/sh"

Not all clusters and tasks support this, so make sure your configuration is a match. There’s this handy tool that helps with that.
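On the Terraform/OpenTofu side, the two pieces I would check first are the enable_execute_command flag on the service and the SSM messages permissions on the Task Role. The sketch below assumes a task role resource named aws_iam_role.task_role, which is hypothetical, and shows only the relevant arguments of the service.

```hcl
resource "aws_ecs_service" "demo_api_service" {
  name            = "demo-api-${var.env}-service"
  cluster         = aws_ecs_cluster.ecs_cluster.id
  task_definition = aws_ecs_task_definition.demo_api_task.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  # Required for `aws ecs execute-command`. Only tasks started after this
  # is enabled will accept exec sessions.
  enable_execute_command = true

  # network_configuration and service_connect_configuration omitted;
  # same as in the full server example.
}

# ECS Exec runs over SSM Session Manager channels, so the Task Role
# (not the Task Execution Role) needs these permissions.
data "aws_iam_policy_document" "ecs_exec" {
  statement {
    actions = [
      "ssmmessages:CreateControlChannel",
      "ssmmessages:CreateDataChannel",
      "ssmmessages:OpenControlChannel",
      "ssmmessages:OpenDataChannel",
    ]
    resources = ["*"]
  }
}

resource "aws_iam_role_policy" "task_role_ecs_exec" {
  name   = "ecs-exec"
  role   = aws_iam_role.task_role.id # hypothetical task role resource
  policy = data.aws_iam_policy_document.ecs_exec.json
}
```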