I've noticed a recurring error when running many workflows/jobs at the same time. AWS ECS seems to have trouble pulling large numbers of containers from AWS ECR. When this happens, the AWS job fails (and doesn't retry), killing the workflow and reporting this error:

    Caused by: Task failed to start - CannotPullContainerError: context canceled

I discussed this with AWS support, and they suggested that the job definition should be updated to include something along the lines of:

> Sometimes this error could also be caused due to intermittent network connectivity issues, registry related issue when the ECS Agent/Docker Daemon is not able to connect to the registry and pull the image onto the host instance successfully. You can work around this error by using the "Retry Mechanism" in your job definition. Example (you can change the "onReason" statement as per error):

**********

    "retryStrategy": {
        "attempts": 5,
        "evaluateOnExit": [
            {
                "onStatusReason": "Host EC2*",
                "action": "RETRY"
            },
            {
                "action": "RETRY",
                "onReason": "Task failed to start"
            },
            {
                "onReason": "*",
                "action": "EXIT"
            }
        ]
    }

**********

Is this something that can be updated in a future release, so that when the platform creates job definitions, workflows don't fail because a container couldn't be pulled, and the pull is instead retried at least once?
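To sanity-check the suggested patterns, here is a rough local sketch of how I understand the `evaluateOnExit` rules to be applied. It is only an approximation under my own assumptions, not AWS Batch's actual implementation: rules are tried in order, the first rule whose conditions all match decides the action, a pattern may end with `*` for a prefix match and is otherwise an exact, case-sensitive match, and the job is retried if nothing matches. The helper names `glob_match` and `decide` are hypothetical. I also don't know which field AWS populates with which part of the error message.

```python
# Local approximation only; the real evaluation happens inside AWS Batch.
RETRY_STRATEGY = {
    "attempts": 5,
    "evaluateOnExit": [
        {"onStatusReason": "Host EC2*", "action": "RETRY"},
        {"onReason": "Task failed to start", "action": "RETRY"},
        {"onReason": "*", "action": "EXIT"},
    ],
}


def glob_match(pattern, value):
    """Assumed semantics: a trailing '*' means prefix match, otherwise exact match."""
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return value == pattern


def decide(strategy, status_reason="", reason=""):
    """Return the action of the first rule whose listed conditions all match."""
    for rule in strategy["evaluateOnExit"]:
        if "onStatusReason" in rule and not glob_match(rule["onStatusReason"], status_reason):
            continue
        if "onReason" in rule and not glob_match(rule["onReason"], reason):
            continue
        return rule["action"]
    return "RETRY"  # assumption: retry when no rule matches


if __name__ == "__main__":
    # Exact reason text matches the second rule.
    print(decide(RETRY_STRATEGY, reason="Task failed to start"))
    # The full message from the error above falls through to the catch-all rule.
    print(decide(
        RETRY_STRATEGY,
        reason="Task failed to start - CannotPullContainerError: context canceled",
    ))
```

Under these assumptions, the first call yields `RETRY` and the second yields `EXIT`. If the reason that AWS reports is the full message rather than just "Task failed to start", the pattern would probably need a trailing `*` (or a different pattern entirely) for the retry to kick in, which may be what AWS support meant by changing the "onReason" statement as per error.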