Reduce workflow run time and the cost of jobs with intelligent scheduling and automation
planned
Michael Tansini
Seqera supports running workflows across many different types of compute infrastructure. In some cases, pipelines leave compute underutilised, which increases costs, and start-up time, especially on some types of batch compute, can be significant.
Scheduler is a multi-phased project designed to simplify compute configuration options and reduce the time and cost of running a pipeline by intelligently choosing the right configuration for a workflow and scaling the infrastructure for the right tasks within the pipeline, removing overprovisioning and resource contention issues.
Michael Tansini
Merged in a post:
Workflow resource optimization (M3): task-based ML
Brass Wildcat
Resources for Nextflow tasks are assigned through process directives. These include resources such as CPUs, memory and time. Workflow developers usually define resources manually for individual processes, groups of processes or across the entire workflow.
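As a sketch, manual per-process resource assignment in a Nextflow config typically looks like the following (the process name, label, and values are illustrative, not from a real pipeline):

```groovy
// nextflow.config — illustrative manual resource assignment
process {
    // workflow-wide defaults
    cpus   = 2
    memory = '4 GB'
    time   = '1 h'

    // override for an individual process (hypothetical name)
    withName: 'ALIGN_READS' {
        cpus   = 16
        memory = '32 GB'
        time   = '8 h'
    }

    // override for a group of processes sharing a label
    withLabel: 'low_memory' {
        memory = '2 GB'
    }
}
```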
- Problem 1: To avoid task failures and resource-related issues, tasks in a workflow are typically configured and launched with more resources than needed.
- Problem 2: To optimize the tasks in a workflow, developers and users currently must determine these resources by collecting data on successful tasks and manually assigning values for each process in the Nextflow config.
- Problem 3: Changes in input data, parameters, or platforms can dramatically change the actual resources used by each task.
There is an opportunity to improve resource allocation for most Nextflow workflows being run (saving cost), save users the trouble of manually defining values (saving time), and enable the same workflow to be run with differing resource requirements (flexibility). This can be addressed by training a machine learning model on a pool of previous pipeline executions. The goal is the optimized deployment of tasks in real time to ensure the best utilization of resources and significantly reduce the associated costs.
The main advantage of this approach over the heuristic model is that resource estimation and assignment happen at the task level rather than the process level (more granularity), and that it can leverage a diverse set of inputs to achieve greater precision (richer input data).
Intended Outcome
A user launches a pipeline and every task is allocated the resources it will actually consume.
Previous runs of a pipeline are used as training datasets to infer the relevant variables to estimate task-level requirements. A machine learning model is created and will be applied to subsequent runs.
Once the model is ready, it will be applied to optimize the resource allocation for tasks in real time (i.e. not a static, preconfigured config, but one adapted to each task's needs).
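For context, the closest mechanism Nextflow offers today is dynamic directives, closures evaluated once per task, which can scale resources with the retry attempt; the model described here would go further by predicting the values themselves. A minimal sketch (process name and values are illustrative):

```groovy
// nextflow.config — per-task adaptation via dynamic directives
process {
    withName: 'ALIGN_READS' {   // hypothetical process name
        // closures are evaluated for each task, so the request
        // grows on every retry instead of being fixed up front
        memory        = { 2.GB * task.attempt }
        time          = { 1.h  * task.attempt }
        errorStrategy = 'retry'
        maxRetries    = 3
    }
}
```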
Michael Tansini
Merged in a post:
Workflow resource optimization (M4) - intelligent scheduling
Brass Wildcat
Resources for Nextflow tasks are assigned through process directives. These include resources such as CPUs, memory and time.
These resource requests translate to task definitions on different platforms (AWS Batch, Slurm, Kubernetes, etc.). Currently, the executor and queue directives are set manually, which defines where the workflow tasks are submitted. Several use cases exist where setting the executor and queue automatically makes a lot of sense. For example:
- The cost of a particular VM type in a region on AWS is 200% higher than usual (cost)
- Limited availability of instances in a region results in delays (time)
- The Slurm cluster is full (occupancy)
- The data for a particular run is on a specific cloud provider in specific regions (data locality)
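For reference, this is the manual selection that intelligent scheduling would automate; a sketch, with an illustrative queue name:

```groovy
// nextflow.config — today the target platform is chosen by hand
process {
    executor = 'awsbatch'              // or 'slurm', 'k8s', 'local', ...
    queue    = 'spot-queue-eu-west-1'  // illustrative queue name
}
```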
Issues
- Problem 1: The status of any potential computing environment with respect to cost, time, availability, and occupancy is not known either before or during workflow execution.
- Problem 2: A user must know in advance the location of the data used in the workflow execution.
Intended outcome: A user launches a pipeline and all tasks are placed on the optimal compute environment.
A prototype using both heuristic and machine learning models has been developed, code-named Groundswell. This service can connect to a Tower database instance, infer heuristic values, predict values using any feature, and return a Nextflow configuration with the optimized values.
Michael Tansini
planned
Work for this is planned for the roadmap subject to a successful Proof of Concept