Datasets improvements
acknowledged
Rob Newman
Datasets within Seqera Platform facilitate structured handling of the input sample sheets required by genomics pipelines such as RNA-seq. Currently, researchers often face friction in assembling these datasets, needing to prepare CSV files externally; a minimal sketch of such a sample sheet is shown below.
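For context, the following sketch (not part of the original proposal) shows the kind of CSV sample sheet researchers currently assemble outside the platform. The column layout follows the nf-core/rnaseq sample sheet convention; the sample names and bucket paths are placeholders.

```python
import csv

# Placeholder sample names and file locations; the sample/fastq_1/fastq_2/strandedness
# columns follow the nf-core/rnaseq sample sheet convention.
rows = [
    {"sample": "CONTROL_REP1",
     "fastq_1": "s3://my-bucket/CONTROL_REP1_R1.fastq.gz",
     "fastq_2": "s3://my-bucket/CONTROL_REP1_R2.fastq.gz",
     "strandedness": "auto"},
    {"sample": "TREATED_REP1",
     "fastq_1": "s3://my-bucket/TREATED_REP1_R1.fastq.gz",
     "fastq_2": "s3://my-bucket/TREATED_REP1_R2.fastq.gz",
     "strandedness": "auto"},
]

with open("samplesheet.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["sample", "fastq_1", "fastq_2", "strandedness"])
    writer.writeheader()
    writer.writerows(rows)
```

The resulting samplesheet.csv is then uploaded as a Dataset and selected at pipeline launch.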
This could be improved by the following:

1. Remove Dataset limit:
Users are currently limited to 1,000 datasets per workspace.
Action: Raise or remove the limit.

2. Dataset creation from Pipeline Launch UI:
Currently, creating datasets requires users to navigate away from the pipeline launch page, disrupting workflow and causing friction.
Action: Integrate dataset creation directly within the pipeline launch interface, so users can seamlessly upload or enter sample data without leaving their primary task.

3. Improved Dataset listing:
Currently, the dataset listing is relatively basic and low density: around six datasets fit in a typical desktop browser window. This low density, plus the lack of tooling, hinders users from navigating or identifying datasets quickly.
Action: Provide a metadata-rich table view, such as those used on the Runs page and elsewhere. Such a table could include the dataset name, number of rows, creation date, last-used date, and potentially the start of the description.

4. Keep a record of Dataset usage in Runs:
Currently, it is difficult or impossible to know whether a dataset has ever been used, which makes its utility after launch very limited.
Action: With a record of Dataset usage within Run history, Datasets become a powerful tool: they act as a rich history of run inputs, agnostic to the specifics of pipeline design and file usage.

5. Enhanced Dataset details user interface:
Currently, the dataset details view emphasizes metadata over the actual dataset content, causing unnecessary scrolling and inefficiency.
Action: Display the dataset's actual content prominently at the top so users can quickly verify its accuracy and completeness, reducing errors and improving productivity.

6. Manual inline Dataset editing:
Users currently must create or edit sample sheets externally and upload the files to create or edit Datasets, which is inefficient.
Action: An embedded spreadsheet editor within the platform would allow users to quickly and intuitively enter data and make edits directly.

7. Support for YAML and JSON formatted data:
Support for non-tabular datasets (e.g. arbitrary YAML and JSON) would be useful to users.
Action: Allow upload and editing of YAML and JSON datasets.

Rob Newman
Merged in a post:
Support for YAML in Datasets
Maxime Garcia
Currently only CSV and TSV are supported in Datasets, but I have some data in YAML format that I would love to upload and use.
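As a purely illustrative sketch (the original post does not describe the data), below is the kind of hierarchical YAML document that cannot currently be uploaded as a Dataset; the keys, sample names, and paths are hypothetical.

```python
import yaml  # requires PyYAML

# Hypothetical, non-tabular sample description that does not flatten cleanly into CSV/TSV.
document = """
samples:
  - id: sample_1
    condition: control
    fastq:
      - s3://bucket/sample_1_R1.fastq.gz
      - s3://bucket/sample_1_R2.fastq.gz
  - id: sample_2
    condition: treated
    fastq:
      - s3://bucket/sample_2_R1.fastq.gz
      - s3://bucket/sample_2_R2.fastq.gz
"""

data = yaml.safe_load(document)
print(data["samples"][0]["fastq"])  # nested lists are what make this awkward to express as CSV/TSV
```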
Rob Newman
Merged in a post:
Datasets editing enhancements
Esha Joshi
Current Limitation: Datasets currently lack basic editing capabilities for samplesheets (TSV/CSV), requiring users to modify files externally when working with SRA Explorer imports or other data sources.
Requested Features:
- Column Removal: Allow users to drop unwanted columns from datasets/samplesheets. This is essential, for example, for cleaning up SRA Explorer imports and making them compatible with pipelines (e.g. nf-core/rnaseq).
- Header Name Editing: Enable editing of column header names. This ties into the previous point and would make it much easier to adapt samplesheets to be compatible with pipelines.
- Manual Sample Entry: Add the ability to manually enter data via free text. This is useful for adding individual SRA samples or IDs without having to generate a text file externally and then import it (e.g. "SAMN1000000" as a sample ID used in nf-core/fetchngs).
- Provenance: All edits should automatically create new versions in the dataset's history, maintaining a clear record of modifications and enabling users to revert changes if needed.
These features would significantly reduce the need for external file editing and streamline workflow preparation, especially when working with SRA Explorer data; a sketch of the external steps they would replace is shown below.
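The sketch below is illustrative only: the SRA Explorer column names (run_accession, fastq_1_url, fastq_2_url, study_accession, experiment_title) are assumptions and should be adjusted to the actual export. It shows the column removal, header renaming, and manual entry steps that currently have to happen outside the platform.

```python
import pandas as pd

# Load the exported samplesheet (column names here are assumptions for illustration).
df = pd.read_csv("sra_explorer_export.tsv", sep="\t")

# Column removal: drop columns the target pipeline does not need.
df = df.drop(columns=["study_accession", "experiment_title"], errors="ignore")

# Header name editing: rename headers to match the target pipeline's samplesheet convention.
df = df.rename(columns={"run_accession": "sample",
                        "fastq_1_url": "fastq_1",
                        "fastq_2_url": "fastq_2"})

# Manual sample entry: append a single SRA sample ID, as one might for nf-core/fetchngs.
df = pd.concat([df, pd.DataFrame([{"sample": "SAMN1000000"}])], ignore_index=True)

df.to_csv("samplesheet.csv", index=False)
```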
Rob Newman
Merged in a post:
Improve searchability for Datasets
Rob Syme
There is a character length limit on legal Dataset names. This limit can frustrate naming conventions in some groups, which might include a lot of metadata in the dataset name. This metadata would be useful as a search target when searching for a dataset.
Alternative (and perhaps preferable) solutions:
- Adding arbitrary metadata tags or key-value pairs to datasets: This metadata would need to be searchable.
- Adding the "Description" to the elements to consider during search: This is less structured than the suggestion above, but perhaps more easily implemented.
Rob Newman
Merged in a post:
Datasets UX enhancements
Brass Wildcat
The interaction with the dataset feature can be improved to enhance the user experience. Here are some potential improvements:
- Enhanced Search Box: The search box could enable the filtering of datasets based on file types, such as CSV, TSV, and more.
- Pagination for Datasets: Pagination could improve the user experience when dealing with a large number of datasets.
- Dataset-Pipeline Association: Users could associate a dataset with a pipeline in both Launchpad and Actions, enhancing the integration and utility of datasets in the platform.
- No Dataset Matching Message: When no datasets match the pipeline specification (schema file), display a message in the dropdown to inform users.
- Dataset Version Dropdown: The dropdown for dataset versions could include the "Date edited" for added context.
- Labeling Datasets: To streamline the collation of resources associated with a specific study, it could be beneficial to allow datasets to be labeled with study identifiers.