"Dead man's" switch for in-flight compute jobs when the head node dies
complete
Rob Newman
complete
This is released (and enabled by default) in Seqera Platform and will be released (but not enabled by default) as part of Enterprise v25.1.0.
Rob Newman
in progress
Currently in flight: an early version is already undergoing internal testing, and early access is forthcoming.
Rob Newman
planned
Charcoal Mandrill
Rob Newman hey, I am just curious what the planned mechanism for implementing this might be. I think that with SLURM and other HPC schedulers, Nextflow can catch SIGTERM or other process signals and perform a graceful shutdown of the child compute jobs. Not clear if that is a thing in AWS Batch?
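For readers unfamiliar with the signal-handling pattern mentioned here, the following is a minimal sketch of the general idea, not Nextflow's actual implementation. It assumes a head process that tracks the scheduler job IDs it has submitted and cancels them with SLURM's `scancel` when it receives SIGTERM. The names `child_job_ids`, `cancel_children`, `handle_sigterm`, and `install_handler` are hypothetical and used only for illustration.

```python
import signal
import subprocess

# Hypothetical registry of SLURM job IDs (as strings) submitted by this head process.
child_job_ids = set()


def cancel_children(job_ids, run=subprocess.run):
    """Ask SLURM to cancel each tracked job; return the IDs that were attempted."""
    attempted = []
    for job_id in sorted(job_ids):
        # `scancel <job_id>` is the SLURM command for cancelling a job.
        # check=False: a failed cancel should not stop the remaining cancels.
        run(["scancel", job_id], check=False)
        attempted.append(job_id)
    return attempted


def handle_sigterm(signum, frame):
    """On SIGTERM, cancel the child jobs, then exit with the conventional code."""
    cancel_children(child_job_ids)
    raise SystemExit(128 + signum)


def install_handler():
    """Register the handler; call once at head-process start-up."""
    signal.signal(signal.SIGTERM, handle_sigterm)
```

A handler like this only runs if the head process receives a catchable signal. SIGKILL, or abrupt loss of the node, bypasses it entirely, so the child jobs would keep running in that case.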
Rob Newman
Charcoal Mandrill: We're looking at a similar approach: enhancing the information available to Seqera Platform via Nextflow by sharing the specific task information with the Platform. The Platform itself can then manage the child jobs/tasks directly. This also allows us to retrieve debug information from cloud-based tools (e.g. CloudTrail or others), helping all parties troubleshoot more effectively.
Rob Newman
acknowledged