"Dead man's" switch for in-flight compute jobs when the head node dies | Voters

complete

Shamrock green Gopher

Currently, with the AWS Batch executor, if the head job dies (e.g. from an OOM error), the remaining in-flight compute jobs keep running and are not cancelled. Ideally, Tower would kill the child compute jobs when the parent has died (I believe it already tracks the native task IDs from Batch).

February 1, 2024

Rob Newman

marked this post as

complete

This is released (and enabled by default) in Seqera Platform and will be released (but not enabled by default) as part of Enterprise v25.1.0.

Rob Newman

marked this post as

in progress

Currently in progress; an early version is already undergoing internal testing. Early access is forthcoming.

Rob Newman

marked this post as

planned

Charcoal Mandrill

Rob Newman: hey, I am just curious what the planned mechanism for implementing this might be. I think that with SLURM and other HPC schedulers, Nextflow can catch SIGTERM or other process signals and perform a graceful shutdown of the child compute jobs. It's not clear whether that is possible in AWS Batch.
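
For readers unfamiliar with the signal-based approach mentioned above, here is a minimal Python sketch of the general pattern. It is not Nextflow's actual code: `ChildJobTracker` and the `cancel` callable are hypothetical names, and `cancel` stands in for whatever the scheduler offers (for example, a wrapper around SLURM's `scancel`). Note that a handler like this can only react to catchable signals; a hard kill such as SIGKILL (which the Linux OOM killer uses) never reaches the handler, which is one reason an external watcher may be needed.

```python
import signal
from typing import Callable, List, Set


class ChildJobTracker:
    """Hypothetical helper: remembers submitted jobs and cancels them on shutdown."""

    def __init__(self, cancel: Callable[[str], None]) -> None:
        self._cancel = cancel          # assumed scheduler-specific cancel function
        self._active: Set[str] = set()
        self.cancelled: List[str] = []

    def submitted(self, job_id: str) -> None:
        self._active.add(job_id)

    def finished(self, job_id: str) -> None:
        self._active.discard(job_id)

    def cancel_all(self) -> None:
        # Cancel every job that is still tracked as in flight.
        for job_id in sorted(self._active):
            self._cancel(job_id)
            self.cancelled.append(job_id)
        self._active.clear()

    def _on_signal(self, signum, frame) -> None:
        # Clean up children first, then exit with the conventional 128+signal code.
        self.cancel_all()
        raise SystemExit(128 + signum)

    def install(self) -> None:
        # Must be called from the main thread; SIGKILL cannot be handled here.
        signal.signal(signal.SIGTERM, self._on_signal)
        signal.signal(signal.SIGINT, self._on_signal)
```

In use, the head process would call `install()` once at startup and then `submitted()` / `finished()` as tasks come and go.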

Rob Newman

Charcoal Mandrill: We're looking at a similar approach: enhancing the information available to Seqera Platform via Nextflow (sharing the specific task information with the Platform). The Platform itself can then manage the child jobs/tasks directly. This also allows us to retrieve debug information from cloud-based tools (e.g. CloudTrail or others), helping all parties to troubleshoot more effectively.
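
As a rough illustration of the idea described above (an external service that knows the task IDs and cleans up after a dead head job), here is a minimal Python sketch. It is not the Platform's implementation: `reap_orphans` and `FakeBatch` are hypothetical names, and the rule "head status is FAILED means the head died" is an assumption. The `batch` argument is expected to expose `describe_jobs(jobs=[...])` and `terminate_job(jobId=..., reason=...)`, which match the boto3 AWS Batch client; the in-memory fake only exists so the sketch runs without AWS credentials.

```python
from typing import Dict, Iterable, List

TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def reap_orphans(batch, head_job_id: str, child_job_ids: Iterable[str],
                 reason: str = "Head job died") -> List[str]:
    """Terminate still-active child jobs if the head job has failed."""
    head = batch.describe_jobs(jobs=[head_job_id])["jobs"]
    # Assumption: only an explicit FAILED status counts as a dead head job.
    if not head or head[0]["status"] != "FAILED":
        return []

    terminated: List[str] = []
    ids = list(child_job_ids)
    # describe_jobs accepts at most 100 job IDs per call, so query in chunks.
    for i in range(0, len(ids), 100):
        for job in batch.describe_jobs(jobs=ids[i:i + 100])["jobs"]:
            if job["status"] not in TERMINAL_STATES:
                batch.terminate_job(jobId=job["jobId"], reason=reason)
                terminated.append(job["jobId"])
    return terminated


class FakeBatch:
    """Hypothetical in-memory stand-in for a Batch client, for local runs only."""

    def __init__(self, statuses: Dict[str, str]) -> None:
        self.statuses = dict(statuses)

    def describe_jobs(self, jobs: List[str]) -> dict:
        return {"jobs": [{"jobId": j, "status": self.statuses[j]}
                         for j in jobs if j in self.statuses]}

    def terminate_job(self, jobId: str, reason: str) -> dict:
        self.statuses[jobId] = "FAILED"
        return {}


if __name__ == "__main__":
    fake = FakeBatch({"head": "FAILED", "t1": "RUNNING", "t2": "SUCCEEDED"})
    print(reap_orphans(fake, "head", ["t1", "t2"]))
```

Against real AWS Batch one would presumably pass `boto3.client("batch")` instead of the fake and run the check on a schedule, since the head job cannot report its own death.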

Rob Newman

marked this post as

acknowledged