We have issues where user-submitted pipelines running in AWS Batch end up with misconfigured resource requests. This is especially common when Nextflow pipelines use dynamic directives to automatically increase a task's CPU or memory requirements on each retry. Users' Nextflow tasks eventually end up requesting more CPU or memory than any of the EC2 instance types in the AWS Batch Compute Environment can provide.
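As an illustration, here is a minimal sketch of the retry-escalation pattern described above; the process name and the starting values are made up, but the directives are standard Nextflow:

```nextflow
process EXAMPLE_TASK {
    // On failure (e.g. an out-of-memory kill), retry the task up to 3 times
    errorStrategy 'retry'
    maxRetries    3

    // Dynamic directives: requests scale with the attempt number, so a task
    // that starts at 16 CPUs / 120 GB is asking for 64 CPUs / 480 GB by its
    // final attempt -- potentially more than any instance type in the
    // Compute Environment can offer
    cpus   { 16 * task.attempt }
    memory { 120.GB * task.attempt }

    script:
    """
    echo "running attempt ${task.attempt}"
    """
}
```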
Jobs whose requests cannot be satisfied get stuck in a RUNNABLE state. Worse, a single stuck job causes the entire AWS Batch Job Queue to grind to a halt and stop running *any* jobs at all (by default, Batch schedules a queue in order, so a job that can never be placed blocks everything behind it). This is not a mistake on AWS's part; it is the accepted and expected behavior for AWS Batch: a single misconfigured Batch Job halts all job execution on the entire Job Queue. See the docs here:
The "solution" from AWS is to update your AWS Batch Job Queue with `jobStateTimeLimitActions` (with `action` set to `CANCEL`), so that jobs stuck in the RUNNABLE state due to `MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE` or `MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT` are automatically canceled after a configurable time limit. I am not aware of any alternative way to handle this; it is the procedure recommended by AWS Support themselves.
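For reference, a sketch of what that setting looks like when applied directly with the AWS CLI; the queue name and the one-hour limit are placeholders, and the option syntax should be checked against your CLI version (it mirrors the `jobStateTimeLimitActions` API parameter):

```bash
# Sketch only: queue name and timeout are placeholders.
# Cancel jobs that have sat in RUNNABLE for an hour because no compute
# environment in the queue can satisfy their resource requirements.
aws batch update-job-queue \
  --job-queue "my-forge-job-queue" \
  --job-state-time-limit-actions '[
    {"reason": "MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE",
     "state": "RUNNABLE", "maxTimeSeconds": 3600, "action": "CANCEL"},
    {"reason": "MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT",
     "state": "RUNNABLE", "maxTimeSeconds": 3600, "action": "CANCEL"}
  ]'
```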
This is a problem for us because Seqera Platform's TowerForge interface for creating AWS Batch Compute Environments does not appear to have any way to add these `jobStateTimeLimitActions`.
We cannot apply the `jobStateTimeLimitActions` ourselves in a post-hoc manner, because the TowerForge CEs are actively managed by Platform via the Platform UI, `tw`, and `seqerakit`. We can't make custom changes to a CE outside the scope of what TowerForge has set up, since that breaks any notion of consistency in our "Infrastructure as Code" deployment. Nor can we make these Job Queue changes ourselves on the AWS side, because the AWS environment is managed and such changes require escalation and support tickets to IT, which is infeasible when we need to re-deploy our CEs repeatedly via `tw`, `seqerakit`, etc. Ultimately this capability needs to be baked into TowerForge so that we can apply it in our scripted Platform configs.