Dataset Collections - Initial Models, API, Tool and Workflow Enhancements

#386 Merged
  John Chilton


What IS NOT in this pull request

This pull request contains essentially no user-facing UI enhancements.

  • No UI for digging into, modifying, or flexibly creating collections.
  • No UI for creating collections directly from uploads.
  • No UI for anything related to collections in libraries.

Quick options for creating pairs and lists of datasets are available via Carl's multiple-selection framework, along with subtle UI touches to support the advanced tool and workflow options.

Even within the realms of tools and workflows there is much I would still need to do - but I think there is a useful core here that allows Galaxy to express all sorts of workflows that were not previously possible. Here are some of the most conspicuously absent features though:

  • No ability for tools to explicitly produce dataset collections.
  • No ability to filter collections (for instance, the ability to filter all the 'ok' datasets out of a collection into a new collection).
  • No ability to reduce collections over repeat parameters (multiple input data parameters or data_collection tool parameters must be used currently).

What IS in this pull request:

This pull request covers two Trello cards - Dataset Collections - Iteration 1 and Tool Multi-Execution.

These cover collection models, the start of a collections API, data collection tool parameters, new tool execution options, and workflow adaptations to cover all of this. These are summarized below.

Models

Changeset 5c7c365 adds models, mapping, and a database migration to enable the entire rest of the pull request. At a high level, this adds DatasetCollections, which can be associated with any number of DatasetCollectionElements - each of these can in turn be associated with exactly one HDA, LDDA, or DatasetCollection. The recursive nature of these models was added after the SAB and should make it possible to construct really interesting statistical workflows - for instance, control and disease cohorts, with multiple patients, each with multiple technical replicates, and each of these a paired dataset. These DatasetCollections are essentially immutable given the current code (this could change), and each element has an element identifier that Galaxy will attempt to carry through long analyses and that it makes available to tools - enabling more sophisticated sample tracking than was previously available in Galaxy. While these identifiers are available via the API and to tools, they are not currently available via the GUI.
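The recursive structure described above can be sketched roughly as follows - a minimal, purely illustrative Python sketch, not the actual Galaxy SQLAlchemy models (all class and field names here are simplified stand-ins):

```python
from dataclasses import dataclass, field
from typing import List, Union

# Hypothetical, simplified sketch of the described models -- not the real
# Galaxy mappings. An element wraps either a dataset or a nested collection.

@dataclass
class HDA:  # stand-in for a HistoryDatasetAssociation
    name: str

@dataclass
class DatasetCollection:
    collection_type: str  # e.g. "list", "paired", "list:paired"
    elements: List["DatasetCollectionElement"] = field(default_factory=list)

@dataclass
class DatasetCollectionElement:
    element_identifier: str  # carried through analyses for sample tracking
    # exactly one of an HDA/LDDA or a nested DatasetCollection
    element: Union[HDA, DatasetCollection]

# A "list:paired" collection: a list whose elements are paired subcollections.
pair = DatasetCollection(
    collection_type="paired",
    elements=[
        DatasetCollectionElement("forward", HDA("sample1_R1.fastq")),
        DatasetCollectionElement("reverse", HDA("sample1_R2.fastq")),
    ],
)
samples = DatasetCollection(
    collection_type="list:paired",
    elements=[DatasetCollectionElement("sample1", pair)],
)
```

The nesting is what allows structures like cohorts of patients, each with replicates, each replicate itself a pair.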

DatasetCollections may be associated with histories via a HistoryDatasetCollectionAssociation (begrudgingly referred to throughout the code as HDCAs) or with libraries via a LibraryDatasetCollectionAssociation (largely untested). These are jointly referred to as DatasetCollectionInstances, following the existing DatasetInstance nomenclature in Galaxy. These instances carry the mutable attributes of collections (most notably deleted, visible, and name).

I wanted to delay merging this work into central until I felt confident about the models - I still don't feel confident about the models - but I have been able to get this setup to work. I have outlined some of the questions I still have - the two big ones are how one defines the state of a dataset collection (or whether it even needs one) and whether subcollections should be collection instances or not. This pull request punts on the first question and answers the second with a hesitant "I guess not."

API and Infrastructure

66ed534 provides much of the Python infrastructure, from above the model layer up to the web API, to support dataset collections (this needs to be reworked to look more like Carl's managers). It introduces the concept of type plugins to provide some more intelligence to dataset collections. These concepts are underutilized in this initial prototype, but I am hopeful they will prove useful ways to extend and add interesting features to collections.
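The type plugin idea might look roughly like this - a minimal, purely hypothetical Python registry (the class names, methods, and validation rules here are illustrative only, not Galaxy's actual implementation):

```python
# Hypothetical sketch: each registered collection type knows something about
# the structure of its elements. Names are illustrative, not Galaxy's.

class CollectionTypePlugin:
    collection_type = None

    def validate(self, element_identifiers):
        raise NotImplementedError

class ListType(CollectionTypePlugin):
    collection_type = "list"

    def validate(self, element_identifiers):
        # a list just requires unique element identifiers
        return len(set(element_identifiers)) == len(element_identifiers)

class PairedType(CollectionTypePlugin):
    collection_type = "paired"

    def validate(self, element_identifiers):
        # a pair is exactly a forward and a reverse element
        return sorted(element_identifiers) == ["forward", "reverse"]

PLUGINS = {p.collection_type: p() for p in (ListType, PairedType)}

def plugin_for(collection_type):
    # nested types like "list:paired" are handled by the outermost type;
    # inner levels would recurse on the remainder
    return PLUGINS[collection_type.split(":")[0]]
```

The appeal of this design is that new collection types can be added without touching the core collection machinery.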

History Panel UI

This pull request contains some small UI modifications that abstract out the history panel to allow a minimal display of collections and the ability to create pairs and lists from datasets for quick testing purposes. I promised Carl I wouldn't do too much in this area so that he has full freedom to pursue the larger project - a UI for these concepts - however he sees fit.

Tool Parameters

9146cb4 defines a new tool parameter type, data_collection. Like data parameters, these can define a format (collections themselves don't have a format, so incompatible collections are filtered out at runtime by iterating over their contents) but can also define a collection_type (e.g. list, paired, list:paired, etc.). 3fd5d1b adds the ability to create tool tests that use dataset collection input parameters and includes some tools demonstrating how to pick apart and utilize collections.
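As a rough illustration, a tool might declare such a parameter in its XML like this (the parameter name, format, and label are hypothetical examples, not from any tool in this pull request):

```xml
<!-- Illustrative sketch of a data_collection parameter in a tool's XML. -->
<param name="paired_input" type="data_collection" collection_type="paired"
       format="fastq" label="Paired reads" />
```

A tool declaring a paired input like this can then address the two halves of the pair individually in its command template.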

Tool Execution Enhancements - Map/Reduce

While these tool parameters are perfectly good for new tools - an important feature of this work is that existing tools can benefit from these concepts without modification via new (to Galaxy anyway) map and reduce concepts.

For instance, 5c15644 (API) and 91b65da (UI) add the ability to map an existing tool across a collection - one can take a list of paired samples and use the FASTQ groomer tool to groom all of them, and dca2a89 ensures a new list of paired samples is created from this operation with all of the element identifiers preserved from the original dataset collection. Dataset collection parameters can likewise be mapped over (see 76856c3). For instance, if the bowtie tool were modified to take in a paired dataset, one could map a list of paired samples over the bowtie tool and get a "flat" list of aligned BAM files. In this way, mapping over collection parameters is also a form of reduction (bowtie would reduce a list of pairs to just a list).
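The mapping behavior described above can be illustrated with a small, purely schematic Python sketch (plain Python, not Galaxy framework code) - the key point being that the output collection inherits the input's element identifiers:

```python
# Schematic sketch of "mapping" a one-dataset tool over a list collection.
# A collection is modeled here as a list of (element_identifier, dataset)
# pairs; the tool below is a stand-in for e.g. the FASTQ groomer.

def groom(dataset):
    # stand-in for running a real tool on one dataset
    return dataset + " (groomed)"

def map_over_collection(tool, collection):
    # run the tool once per element; carry each element identifier
    # through to the corresponding output element
    return [(identifier, tool(dataset)) for identifier, dataset in collection]

reads = [("sample1", "s1.fastq"), ("sample2", "s2.fastq")]
groomed = map_over_collection(groom, reads)
```

This identifier preservation is what lets sample names survive long multi-step analyses.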

Reductions using existing tools are also possible (8e9dc1d) - input data parameters marked as "multiple" can be substituted with collections. The tool framework will provide the tool an interface that makes the collection appear as a disconnected set of datasets.

These reductions using multiple data inputs are only available for flat collections - a "list" or a "paired" (referred to as rank 1 collections in some places in this work). This restriction is in place because I am pretty confident that, in most biomedical workflows with more deeply nested collections, you would want to reduce them one level at a time - say, analyze/aggregate/summarize all the technical replicates of each patient in one step, and then aggregate/summarize these statistics across patients in a subsequent step. This sort of combined map/reduce over nested collections is not implemented yet but needs to be added. And while more sophisticated tools certainly exist that can summarize and aggregate across multiple levels of collections, in my experience these need some concept of the nesting structure of the data - so this structure would need to be made explicit in the tool, and a data_collection parameter should be used so this information can be structured and passed into the underlying application/analysis.
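The reduce side can be sketched the same way - again plain, schematic Python rather than Galaxy's actual framework code: a flat (rank 1) collection substituted for a multiple-input data parameter is simply unwrapped into a plain list of datasets before the tool sees it:

```python
# Schematic sketch of "reducing" a flat collection through an existing tool
# whose data input is marked multiple="true". The tool itself is collection-
# unaware; the framework hands it a plain set of datasets.

def concatenate(datasets):
    # stand-in for an existing multi-input tool
    return "\n".join(datasets)

# a rank 1 "list" collection of technical replicates
collection = [("rep1", "AAAA"), ("rep2", "CCCC"), ("rep3", "GGGG")]

# the framework unwraps the collection into a flat dataset list
reduced = concatenate([dataset for _identifier, dataset in collection])
```

Note that the element identifiers are discarded here - a reduction collapses the collection into a single dataset, so there is nothing left to attach them to.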

The final extension to the tool execution framework worth mentioning is that input data parameters to tools can be supplied multiple inputs to batch-execute the tool without the use of dataset collections. No dataset collection will be created from the results and no sample tracking is enabled - it is just a quick and dirty way to crank through more data. Implementation-wise, this was a stepping stone to the more complicated dataset collection mapping, and that is why it is included here.

Workflows

Dataset collection parameters, mapping, and reduction all had a large impact on workflow extraction, running, and editing. These modifications are included in this pull request.

The workflow editor in particular probably needs a comment, but I am not sure what to say. If I had to predict, I would suspect 80% of people will dislike the changes - half of those thinking it is too "clever" and the other half thinking it is not nearly "clever" enough. I am not saying I did things right - only that I did my best and this is just a first crack at it. Once it is merged, if you get a chance to play with it and you have comments or questions, please don't hesitate to let me know. One thing I am confident of is that the user needs more feedback - "this arrow cannot connect to that arrow" with no explanation probably was never enough, but it certainly isn't enough after this. This initial commit is about verifying that features are possible more than about polish - I will continue to polish going forward, and hopefully people will help and provide feedback. Finally, existing workflows - things that don't use collections at all - should be completely unaffected.

Final Notes

In the couple of weeks between when this gets merged and when the next-stable branch is opened, I will be working on testing this and some smaller priority fixes, which I will track on a newer, smaller card.

I am sure there are many issues even with the features that I claim are already implemented here - please be patient as I work through them. Everything has been working on its own in isolation, but there have been several substantial reworkings of this stuff just to get to this point - there are lots of tests here, I will write lots more, and this will iterate to stability.

That said - there is one open and very critical issue: the history panel UI changes - minimal though they may be - will not work if an HDA and an HDCA share the same id within a history. This would rarely happen on an established server but happens immediately on a fresh Galaxy clone against a new database. I am leaving this unsolved because the change would probably be somewhat involved and I promised Carl I would stop working on the GUI. I will merge this as is and leave it for Carl to tell me how to fix it or to fix it himself - it will be easier to collaborate on this stuff once it is in central.
