Tower Forge improvements
acknowledged
Drew DiPalma
acknowledged
Rob Newman
Merged in a post:
Forge multiqueue: Configure queues and resources upon creation
Brass Wildcat
Forge serves as a valuable tool for streamlining user workflows by automating the creation of necessary resources like compute environments, queues, and shared file system resources.
However, the current implementation may not always meet the diverse needs of multiple customers, especially for complex, high-volume systems.
One limitation is the inability to have dedicated GPU or CPU queues within the same Compute Environment (CE), potentially leading to resource misuse. Additionally, the current implementation restricts users to a single unit of an external file system (e.g., FSx), hindering performance optimization scenarios where multiple drives are needed for resource sharing and work directory organization.
There is also an opportunity to enhance the transparency and configurability of the existing Forge mechanism, empowering users through a more intuitive UI that leverages existing capabilities.
The intended outcome of these proposed improvements is for users to gain the ability to create and define multiple queues for a single CE, including GPU, ARM, CPU, and spot/on-demand queues, each with precise activation requirements (e.g., a "dragen" label in the Nextflow pipeline). This would enhance CE performance, optimize resource usage, and simplify the experience for administrators and users alike.
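As an illustration of what this could look like from the pipeline side, a Nextflow configuration along these lines (all queue names here are hypothetical placeholders) could route labelled processes to the dedicated queues Forge would create for a single CE:

```groovy
// Sketch only: queue names are hypothetical placeholders for queues that
// Forge would create alongside a single Compute Environment.
process {
    executor = 'awsbatch'
    queue    = 'my-ce-cpu-ondemand'       // default CPU / on-demand queue

    withLabel: gpu {
        queue       = 'my-ce-gpu'         // GPU-only queue
        accelerator = 1                   // request one GPU per task
    }
    withLabel: arm {
        queue = 'my-ce-arm'               // Graviton (ARM) queue
    }
    withLabel: dragen {
        queue = 'my-ce-dragen'            // dedicated DRAGEN queue
    }
}
```

The point of the request is that Forge would create and wire up these queues automatically at CE creation time, rather than each one being built by hand.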
Rob Newman
Merged in a post:
Additional control over EC2 types and/or compute environments for AWS Batch submissions
Short Chicken
The pipelines we run have a lot of variability in their resource requirements across different processes. This hasn't been an issue since we've been using AWS Batch with optimal EC2 assignment and EBS-autoscale / S3 work directories. However, as we migrate over to Fusion 2.0, it would be great if we could have more granular control over the EC2s being utilized by the pipeline.
Some of our processes can generate up to a TB of output, and it will be important to run those on an EC2 type that has sufficient NVMe storage and network throughput to handle those tasks effectively. We would also prefer to avoid using those expensive EC2s for smaller processes that could be handled on, for example, an m5ad.large, reserving an m5ad.12xlarge for the more intense tasks. Ideally, this would be something we could configure in our Nextflow scripts using one of the following approaches:
1. A configuration option that can assign processes with certain tags to a particular family of EC2s
2. A Compute Environment option that defines particular families of EC2s for defined processes or resource requirement ranges
3. Using multiple Compute Environments optimized for different process requirements and specifying the environment for specific processes in our configuration (see the sketch after this list)
1 or 2 would be ideal, but I know there are a lot of technical hurdles to navigate with AWS.
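For reference, option 3 can already be approximated with plain Nextflow configuration; a minimal sketch, assuming two pre-built AWS Batch queues (queue names and the label are hypothetical), might look like this:

```groovy
// Sketch of option 3: two existing AWS Batch queues, each attached to a
// Compute Environment sized for a different class of work.
process {
    executor = 'awsbatch'
    queue    = 'general-purpose-queue'    // e.g. m5ad.large-class instances

    // storage-heavy processes carry this label in the pipeline
    withLabel: large_scratch {
        queue = 'nvme-heavy-queue'        // e.g. m5ad.12xlarge-class instances
    }
}
```

Options 1 and 2 would move that mapping out of per-pipeline configuration and let a single Compute Environment make the decision based on tags or resource requirements.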
Rob Newman
Merged in a post:
Batch forge with GPU option
Shamrock green Gopher
We’ve set up a Compute Environment (CE) that works fine for our non-GPU processes. When we clone that environment, toggle the GPU setting in Seqera Platform, set the AMI to an AWS-recommended GPU Linux image, and then run a workflow requiring GPUs, we don’t see any GPU instances being created, even though there are tasks in AWS Batch requiring a GPU. Tasks not requiring a GPU still run fine.
The AWS Batch page contains the following somewhat ambiguous help:
> "All instance types in a compute environment that run GPU jobs must be from the
p2
, p3
, p4
, g3
, g3s
, g4
, or g5
instance families. If this isn't done a GPU job might get stuck in the RUNNABLE
status."which read one way suggests all instances in a queue must be GPU instances for any GPU job to run. Therefore we limited the compute environment setup in Seqera Platform to
p3
instances. This results in nothing running at all.We are trying to run in the
eu-west-1b
AZ where there is availability of p3
instances.We worked around this in several ways:
- We delegated choosing an AMI to Batch; it seems to choose a sensible one.
- We created a second AWS Batch queue for GPU jobs and used a process selector to push GPU jobs to that second queue (see the sketch below).
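For anyone facing the same issue, the second workaround can be expressed with a process selector roughly as follows (queue names and the label are hypothetical):

```groovy
// Sketch of the second workaround: GPU tasks are pushed to a separate
// GPU-only Batch queue; everything else stays on the CPU queue.
process {
    executor = 'awsbatch'
    queue    = 'cpu-batch-queue'          // default queue with no GPU instances

    withLabel: uses_gpu {
        queue       = 'gpu-batch-queue'   // queue whose CE allows only GPU (e.g. p3) instances
        accelerator = 1                   // request one GPU per task
    }
}
```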
Elsewhere it's considered good practice to have separate GPU and CPU queues. Can we ask that Seqera Platform create two queues when GPU instances are required (similar to the setup with Dragen)?
This could work like so:
- Toggling "Enable GPU" in the CE configuration to Trueshould create an additional AWS Batch queue and CE (similar to how the Dragen toggle presumably works), it will be necessary to ask the user which GPU instance types they wish this queue to have (as "optimal" does not include GPU instance types, and there are varied options)
- Jobs with an accelerator directive should have their process.queue overridden to the created GPU Batch queue (unless it has been explicitly set to something other than the default queue), forwarding them to the GPU-only Batch queue created above (see the sketch after this list).
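On the pipeline side, the author would only need to declare the accelerator directive; a minimal sketch (process name and command are illustrative only):

```groovy
// Pipeline-side sketch: the process declares a GPU requirement and no queue.
// Under this proposal, Seqera Platform would rewrite process.queue for such
// tasks to the auto-created GPU-only Batch queue.
process GPU_TASK {
    accelerator 1            // the GPU requirement that would trigger the queue override

    output:
    path 'smi.txt'

    script:
    """
    nvidia-smi > smi.txt
    """
}

workflow {
    GPU_TASK()
}
```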
It's unclear why a mixed queue is not working properly in our case (the evidence suggests that CPU and GPU jobs should all be able to run on a GPU instance type if the AMI has been set appropriately). However, it is unreasonable to expect users to submit CPU-heavy work to more expensive GPU nodes with longer wait times. Distinct queues for accelerated GPU and CPU jobs are a necessity, and Seqera Platform should support users with GPU elements in their workflows.