Update Job Definition (AWS Retry Strategy)

acknowledged

Aquamarine Gibbon

I've noticed a recurring error when running many workflows/jobs at the same time. AWS ECS seems to have trouble pulling large numbers of containers from AWS ECR. When this happens, the AWS job fails without retrying, which kills the workflow and reports this error:

Caused by:
  Task failed to start - CannotPullContainerError: context canceled

I discussed this with AWS support, and they suggested updating the job definition to include something along the lines of:

> Sometimes this error could also be caused by intermittent network connectivity issues or registry-related issues, when the ECS Agent/Docker Daemon is not able to connect to the registry and pull the image onto the host instance successfully. You can work around this error by using the "Retry Mechanism" in your job definition.

Example (you can change the "onReason" statement to match the error):

**********
"retryStrategy": {
    "attempts": 5,
    "evaluateOnExit": [
        {
            "onStatusReason": "Host EC2*",
            "action": "RETRY"
        },
        {
            "onReason": "Task failed to start",
            "action": "RETRY"
        },
        {
            "onReason": "*",
            "action": "EXIT"
        }
    ]
}
**********
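
For anyone who wants to try this outside the platform, here is a minimal Python sketch that builds the same retry strategy as a dict and prints it as JSON. The function name `build_retry_strategy` and its parameters are hypothetical, made up for illustration. The only check it does is that `attempts` falls in the 1 to 10 range that AWS Batch accepts. It does not call AWS at all.

```python
import json


def build_retry_strategy(attempts=5, retry_reason="Task failed to start"):
    """Build the retryStrategy fragment from the example above as a dict.

    Hypothetical helper for illustration; it only assembles the structure.
    """
    # AWS Batch accepts between 1 and 10 attempts for a job.
    if not 1 <= attempts <= 10:
        raise ValueError("attempts must be between 1 and 10")
    return {
        "attempts": attempts,
        "evaluateOnExit": [
            # Retry when the host instance goes away (e.g. reclaimed).
            {"onStatusReason": "Host EC2*", "action": "RETRY"},
            # Retry on the reason suggested by AWS support; adjust as needed.
            {"onReason": retry_reason, "action": "RETRY"},
            # Anything else: stop retrying and fail the job.
            {"onReason": "*", "action": "EXIT"},
        ],
    }


if __name__ == "__main__":
    print(json.dumps({"retryStrategy": build_retry_strategy()}, indent=2))
```

If you register job definitions yourself with boto3, the returned dict should be usable as the `retryStrategy` argument of `register_job_definition`, though I have not tested that against this platform's job definitions.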

Is this something that can be updated in a future release, so that when the platform creates job definitions, workflows don't fail because a container couldn't be pulled and instead retry the container pull at least once?

Created by Shahzeb Mahmood

April 16, 2024

Charcoal Mandrill

We have this issue a lot as well. In fact, I would say it's one of our primary causes of task failures that are not due to some "user error". It would be really fantastic to have better handling for this. We already have things like Nextflow's process.errorStrategy and process.maxRetries, but these were always intended for handling task errors (like not enough memory); they do not do a good job of handling these types of errors, where the AWS Batch job fails to start due to issues with ECS, the network, etc. At the very least, being able to handle some of these things from the launch template seems like a good idea.

Aquamarine Gibbon

To be clear, I don't believe the recommended workaround of using the retry strategy fixes anything, because the retry strategy is currently only invoked for errors that have a status reason of "Host EC2*".

So this feature request is to expand the list of accepted "onReason" values to also include "Task failed to start*".
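
To illustrate the difference, here is a rough Python model of how I understand the rules to be evaluated. It is a simplified sketch, not AWS's actual implementation. It assumes rules are checked in order and the first match wins, a pattern may end with `*` to match a prefix, matching is case-sensitive, and an unmatched job is retried by default. `CURRENT_RULES` is my guess at what the platform sets today, and the "Host EC2" status reason string is only illustrative. It also assumes the full error message from the original post shows up in the field that `onReason` is matched against. If AWS Batch reports "Task failed to start" as the status reason instead, the new rule would need `onStatusReason`.

```python
def glob_match(pattern, value):
    """Prefix match if the pattern ends with '*', otherwise exact match."""
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return value == pattern


def decide(rules, status_reason="", reason=""):
    """Return the action of the first rule whose conditions all match."""
    for rule in rules:
        checks = []
        if "onStatusReason" in rule:
            checks.append(glob_match(rule["onStatusReason"], status_reason))
        if "onReason" in rule:
            checks.append(glob_match(rule["onReason"], reason))
        if checks and all(checks):
            return rule["action"]
    # Assumed default when no rule matches: the job is retried.
    return "RETRY"


# Assumed current behaviour: only "Host EC2*" status reasons are retried.
CURRENT_RULES = [
    {"onStatusReason": "Host EC2*", "action": "RETRY"},
    {"onReason": "*", "action": "EXIT"},
]

# Requested behaviour: also retry when the task fails to start.
PROPOSED_RULES = [
    {"onStatusReason": "Host EC2*", "action": "RETRY"},
    {"onReason": "Task failed to start*", "action": "RETRY"},
    {"onReason": "*", "action": "EXIT"},
]

ERROR = "Task failed to start - CannotPullContainerError: context canceled"

if __name__ == "__main__":
    print("current: ", decide(CURRENT_RULES, reason=ERROR))
    print("proposed:", decide(PROPOSED_RULES, reason=ERROR))
```

Under these assumptions the current rules fall through to the final catch-all and exit, while the proposed rules retry the job.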

Drew DiPalma

updated the status to acknowledged