TowerForge needs to be able to set jobStateLimitActions on AWS Batch Job Queues
acknowledged
Burgundy Crawdad
We have issues where user-submitted pipelines running in AWS Batch end up with misconfigured resource requests. This is especially the case when Nextflow pipelines use dynamic directives to automatically increase their CPU or memory requirements on subsequent task retries: users' Nextflow tasks eventually request more CPU or memory than any of the EC2 instance types in the AWS Batch Compute Environment can provide.

This causes the jobs to get stuck in the RUNNABLE state. Worse, it causes the entire AWS Batch Job Queue to grind to a halt and stop running any jobs at all. This is not a mistake on AWS's part; it is the accepted and expected behavior for AWS Batch: a single misconfigured Batch job halts all job execution on the entire Job Queue. See the docs here.

The "solution" from AWS is to update your AWS Batch Job Queue with jobStateLimitActions.action so that jobs stuck in the RUNNABLE state due to MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE or MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT are automatically canceled after some time limit. I am not aware of any alternative method for handling this, as this is the procedure recommended by AWS Support themselves. The problem is that Seqera Platform's TowerForge interface for creating AWS Batch Compute Environments via Platform does not appear to have any way to add these jobStateLimitActions.
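For context, this is roughly what that Job Queue change looks like when applied directly: a minimal sketch using boto3, with a placeholder queue name. The parameter shape follows the AWS Batch UpdateJobQueue API as we understand it (the API spells the field jobStateTimeLimitActions), so verify against the current AWS documentation before relying on it:

```python
import boto3

batch = boto3.client("batch")

# Cancel jobs left in RUNNABLE for an hour when they are stuck for either
# of the two misconfiguration reasons described above.
response = batch.update_job_queue(
    jobQueue="my-towerforge-queue",  # placeholder: the TowerForge-created queue name
    jobStateTimeLimitActions=[
        {
            # No instance type in the compute environment can satisfy the request
            "reason": "MISCONFIGURATION:COMPUTE_ENVIRONMENT_MAX_RESOURCE",
            "state": "RUNNABLE",
            "maxTimeSeconds": 3600,
            "action": "CANCEL",
        },
        {
            # The job's own resource requirements are invalid for this queue
            "reason": "MISCONFIGURATION:JOB_RESOURCE_REQUIREMENT",
            "state": "RUNNABLE",
            "maxTimeSeconds": 3600,
            "action": "CANCEL",
        },
    ],
)
print(response["jobQueueName"], response["jobQueueArn"])
```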
We cannot apply the jobStateLimitActions in a post-hoc manner ourselves, because the TowerForge CEs are actively managed by Platform via the Platform UI, tw, and seqerakit. We can't make custom changes to the CE outside the scope of what TowerForge has set up, since that breaks any notion of consistency in our "Infrastructure as Code" deployment. We also cannot make these Job Queue changes ourselves, because the AWS environment is managed and such changes require escalation and support tickets to IT, which is infeasible if we need to re-deploy our CEs repeatedly via tw / seqerakit, etc. Ultimately these features need to be baked into TowerForge so that we can apply them in our scripted Platform configs.

Rob Newman
Burgundy Crawdad Thank you for your feature request. We believe we have a better solution to this issue, which is to add support for the Nextflow resourceLimit configuration in the compute environment CLI command. This would apply a ceiling on the resources any one task can request. This is being tracked here. Do let us know if this resolves your needs.

As an aside, it appears you submitted this feature request with your old email address, so you have created another new user account. Would you like me to merge both accounts again?
Burgundy Crawdad
Rob Newman Thanks for the reply. We are implementing resourceLimit, but this is not an adequate solution because users can freely remove or modify the Nextflow pipeline configs when they launch their pipelines. It also does not protect against users who run Nextflow against the AWS Batch compute environment from the command line with their own configs, which is where these misconfigured jobs often come from. It seems like the only actual solution here is for the appropriate setting to be applied to the AWS Batch Compute Environment, which requires that this setting be exposed via the TowerForge system during CE setup.

Rob Newman
acknowledged