Datasets improvements | Voters | Seqera Feedback Forum

Datasets improvements

evaluating

Rob Newman

Datasets within Seqera Platform facilitate structured handling of input sample sheets required for genomics pipelines such as RNA-seq. Currently, researchers often face friction in assembling these datasets, needing to prepare CSV files externally. This could be improved by the following:
1. Remove/Increase Dataset limit:
 Users are currently limited to 1000 datasets per workspace. Action:
 Raise or remove limit
2. Dataset creation from Pipeline Launch UI:
 Currently, creating datasets requires users to navigate away from the pipeline launch page, disrupting workflow and causing friction. Action:
 Integrate dataset creation directly within the pipeline launch interface so that users can upload or enter sample data without leaving their primary task.
3. Improved Dataset listing:
 Currently, the dataset listing is relatively basic and low density. Around 6 datasets can be viewed in a typical desktop browser window size. This density plus lack of tooling hinders users from effectively navigating or identifying datasets quickly. Action:
 A metadata-rich table view, such as those used on the Runs page and elsewhere, would be preferable. Such a table could include the dataset name, number of rows, author, creation date, last used date, and potentially the start of the description. The table should be sortable.
4. Keep a record of Dataset usage in Runs:
 Currently, it is difficult or impossible to know whether a dataset has ever been used. This makes datasets of very limited use after a run. Action:
 Keep a record of Dataset usage within Run history. Datasets then become a powerful tool for the user: they act as a rich history of run inputs, agnostic to the specifics of pipeline design and file usage.
5. Enhanced Dataset details user interface:
 Currently, viewing dataset details emphasizes metadata over actual dataset content, causing users unnecessary scrolling and inefficiency. Action:
 Show the dataset's actual content prominently at the top so that users can quickly verify its accuracy and completeness, reducing errors and improving productivity.
6. Manual inline Dataset editing:
 Users currently must create or edit sample sheets externally, and upload these files to create or edit Datasets, which is inefficient. Action:
 An embedded spreadsheet editor within the platform would allow users to quickly and intuitively enter data and make edits directly.
7. "Archive" or "deactivate" a Dataset:
 Datasets are often used once or twice, and then no longer actively needed. For GxP/clinical environments, the dataset should not be deleted/removed, but made inactive/disabled/"archived"/"deactivated". This allows inspection of the dataset, but it would be excluded from any pipeline launches. The table of datasets can be filtered to remove inactive/disabled/"archived"/"deactivated" entries. Action:
 Allow datasets to be tagged as “inactive/disabled/archived/deactivated” and allow filtering of datasets to show/hide archived entries.
8. Support for YAML and JSON formatted data:
 Support for non-tabular datasets (e.g. arbitrary YAML and JSON) would be useful for users. Action:
 Allow upload and editing of YAML and JSON datasets.
9. Fix How Dataset Versioning Works:
 Managing multiple versions of a single dataset is currently suboptimal. A new file uploaded as a new version does not preview correctly.
Future potential milestones:
Integration with new Nextflow data lineage
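
To make the listing and archive requests (items 3 and 7) more concrete, here is a minimal Python sketch of a sortable dataset table that hides archived entries by default. The `DatasetEntry` class, its fields, and `list_datasets` are hypothetical names for illustration only; they are not part of the Seqera Platform API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


# Hypothetical record for one row of the dataset table; the field names
# are illustrative and not taken from the Seqera Platform API.
@dataclass
class DatasetEntry:
    name: str
    rows: int
    author: str
    created: date
    last_used: Optional[date] = None
    archived: bool = False


def list_datasets(entries, sort_by="name", descending=False, show_archived=False):
    """Return entries for display: hide archived ones unless asked, then sort."""
    visible = [e for e in entries if show_archived or not e.archived]

    def key(entry):
        value = getattr(entry, sort_by)
        # Missing values (e.g. a dataset that was never used) sort first
        # in ascending order instead of raising a comparison error.
        return (value is not None, value)

    return sorted(visible, key=key, reverse=descending)


entries = [
    DatasetEntry("rnaseq_batch2", 48, "alice", date(2025, 3, 1)),
    DatasetEntry("rnaseq_batch1", 24, "bob", date(2025, 1, 15), archived=True),
    DatasetEntry("fetchngs_ids", 10, "alice", date(2025, 2, 1)),
]

# Default view: archived datasets are excluded, sorted by creation date.
active_by_date = [e.name for e in list_datasets(entries, sort_by="created")]

# Audit view: archived datasets remain inspectable when explicitly requested.
all_by_name = [e.name for e in list_datasets(entries, show_archived=True)]
```

In this sketch, archiving is a flag rather than a deletion, so an archived dataset stays available for inspection but would not be offered at pipeline launch.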

May 29, 2025

Rob Newman

Merged in a post:

Improvements to Dataset Metadata, Naming Limits, and Search Functionality

Anuj Garlapati

Customer is requesting a few minor features when working with Datasets:

Additional metadata: Date/time created, Author
Increase name length limit (previously requested): Remove hard limit on the name length
Improve search functionality: Search does not currently index ID or Description fields

3 days ago

Rob Newman

marked this post as

evaluating

Inquisitive Reindeer

Our Datasets needs are more directed towards automation or semi-automation. For us, the following things are missing:

If possible, an unlimited number of Datasets would be great for traceability of workflows.
Navigation of the Datasets could be improved. It would be nice to be able to sort by date (created/modified) or in alphabetical order. If more Datasets are allowed, this will be critical.
Pagination within a Dataset is also missing; it would be useful for quickly inspecting a samplesheet.
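
As a rough illustration of the pagination request, the following Python sketch pages through the rows of a samplesheet. The `paginate` helper and its default page size are assumptions made for this example, not existing platform behavior.

```python
def paginate(rows, page, page_size=25):
    """Return (rows_on_page, total_pages) for a 1-indexed page number."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    # Ceiling division; an empty samplesheet still has one (empty) page.
    total_pages = max(1, -(-len(rows) // page_size))
    # Clamp out-of-range requests to the nearest valid page.
    page = min(max(page, 1), total_pages)
    start = (page - 1) * page_size
    return rows[start:start + page_size], total_pages


# Illustrative samplesheet with 60 made-up sample IDs.
samplesheet = [f"sample_{i:03d}" for i in range(60)]
last_page, total = paginate(samplesheet, page=3)
```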

Rob Newman

Merged in a post:

Support for YAML in Datasets

Maxime Garcia

Currently only CSV and TSV are supported in Datasets, but I have some data in YAML format that I would love to upload and use.

May 29, 2025

Rob Newman

Merged in a post:

Datasets editing enhancements

Esha Joshi

Current Limitation: The Datasets platform currently lacks basic editing capabilities for samplesheets (TSV/CSV), requiring users to modify files externally when working with SRA Explorer imports or other data sources.

Requested Features:

Column Removal:
Allow users to drop unwanted columns from datasets/samplesheets. For example, this is essential for cleaning up SRA Explorer imports and making them compatible with pipelines (e.g. nf-core/rnaseq).
Header Name Editing:
Enable editing of column header names. This ties into the previous point and will make it much easier to adapt samplesheets to be compatible with pipelines.
Manual Sample Entry:
Add the ability to manually enter data via free text. This is useful for adding individual SRA samples or IDs without having to generate a text file externally and then import it (e.g. "SAMN1000000" as a sample ID used in nf-core/fetchngs).
Provenance:
All edits should automatically create new versions in the dataset's history, maintaining a clear record of modifications and enabling users to revert changes if needed.

These features will significantly reduce the need for external file editing and streamline the workflow preparation process, especially when working with SRA Explorer data.
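
The requested edits can be sketched with the Python standard library as follows. This is only an illustration of the idea, assuming each edit produces a new version rather than overwriting the previous one; the helper names, the column names, and the `SRR0000001` accession are made up for the example.

```python
import csv
import io


def drop_columns(rows, unwanted):
    """Return new rows without the columns named in `unwanted`."""
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]


def rename_columns(rows, mapping):
    """Return new rows with headers renamed according to `mapping`."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]


def add_row(rows, new_row):
    """Return new rows with one manually entered row appended."""
    return rows + [new_row]


def to_csv(rows):
    """Serialize rows (a non-empty list of dicts with the same keys) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative import; the column names are invented for this example.
raw = "run_accession,sample_title,library_layout\nSRR0000001,liver_rep1,PAIRED\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Every edit appends a new version, so earlier states stay available
# for inspection or revert.
versions = [rows]
versions.append(drop_columns(versions[-1], {"library_layout"}))
versions.append(rename_columns(versions[-1], {"run_accession": "sample"}))
versions.append(add_row(versions[-1], {"sample": "SAMN1000000", "sample_title": "manual_entry"}))

latest_csv = to_csv(versions[-1])
```

Because the helpers return new lists instead of mutating their input, reverting a change amounts to selecting an earlier entry in `versions`.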

May 29, 2025

Rob Newman

Merged in a post:

Improve searchability for Datasets

Rob Syme

There is a character length limit for Dataset names. This limit can frustrate naming conventions in some groups, which might include a lot of metadata in the dataset name. This metadata would be useful as search targets when searching for a dataset.

Alternative (and perhaps preferable) solutions:

Adding arbitrary metadata tags or key-value pairs to datasets:
This metadata would need to be searchable
Adding the "Description" to the elements to consider during search:
This is less structured than the suggestion above, but perhaps more easily implemented.
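
Both alternatives can be illustrated with a small Python sketch that searches the name, the description, and any key-value tags. The dictionary layout, the `tags` field, and the function names are hypothetical and only show the intended behavior.

```python
def matches(dataset, query):
    """Case-insensitive substring match over name, description, and tags."""
    needle = query.lower()
    haystack = [dataset.get("name", ""), dataset.get("description", "")]
    # Both the keys and the values of the key-value tags are search targets.
    for key, value in dataset.get("tags", {}).items():
        haystack.extend([key, value])
    return any(needle in text.lower() for text in haystack)


def search(datasets, query):
    """Return the datasets that match the query in any indexed field."""
    return [d for d in datasets if matches(d, query)]


# Made-up datasets: names stay short, metadata lives in description and tags.
datasets = [
    {"name": "batch_07", "description": "Mouse liver RNA-seq, paired-end",
     "tags": {"study": "STUDY-001"}},
    {"name": "batch_08", "description": "", "tags": {"study": "STUDY-002"}},
]

by_description = [d["name"] for d in search(datasets, "liver")]
by_tag = [d["name"] for d in search(datasets, "study-002")]
```

With metadata searchable outside the name, the name length limit matters less for groups that currently encode metadata in dataset names.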

May 29, 2025

Rob Newman

Merged in a post:

Datasets UX enhancements

Brass Wildcat

Interaction with the Datasets feature can be improved to enhance the user experience. Here are some potential improvements:

Enhanced Search Box: The search box could enable the filtering of datasets based on file types, such as CSV, TSV, and more.
Pagination for Datasets: Pagination could improve the user experience when dealing with a large number of datasets.
Dataset-Pipeline Association: Users could associate a dataset with a pipeline in both Launchpad and Actions, enhancing the integration and utility of datasets in the platform.
No Dataset Matching Message: When there are no datasets matching the pipeline specification (schema file), display a message in the dropdown to inform users.
Dataset Version Dropdown: The dropdown for dataset versions could include the "Date edited" for added context.
Labeling Datasets: To streamline the collation of resources associated with a specific study, it could be beneficial to allow datasets to be labeled with study identifiers.

May 29, 2025

Rob Newman

marked this post as

acknowledged