Allow tools to explicitly produce dataset collections.

#634 Merged
  1. John Chilton

This pull request allows tool authors to explicitly create collections as part of jobs.

Whenever possible, simpler operations that produce datasets should be implicitly "mapped over" to produce collections - but there are a variety of situations for which this idiom is insufficient. This work attempts to address many of those situations.

This pull request introduces new tool syntax options to handle various scenarios - as well as extensions to the tool running, tool testing, workflow running, and workflow extraction backends and API to support these (workflow GUI components - mainly the editor - do not yet support such tools, however).

Progressively more complex syntax elements exist for the increasingly complex scenarios. Broadly speaking - the three scenarios covered are the tool produces...

  • ... a collection with a static number of elements (mostly for paired, but if a tool does, say, fixed binning it might make sense to create a list this way as well).
  • ... a list with the same number of elements as an input (common pattern for normalization applications for instance).
  • ... a list where the number of elements is not knowable until the job is complete.

For the first case - the tool can simply declare standard data elements below an output collection element in the outputs tag of the tool definition.

  <collection name="paired_output" type="paired" label="Split Pair">
    <data name="forward" format="txt" />
    <data name="reverse" format_source="input1" from_work_dir="reverse.txt" />
  </collection>

Templates (e.g. the command tag) can then reference $forward and $reverse or whatever name the corresponding data elements are given - as demonstrated in test/functional/tools/collection_creates_pair.xml.
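To make the forward / from_work_dir distinction concrete, here is a minimal sketch of a script that a tool's command might invoke. It is not the contents of the referenced test tool: the function name split_pair and the odd/even line splitting rule are assumptions made purely for illustration. The forward element is written to a path handed in by the caller (as $forward would be), while the reverse element is simply left as reverse.txt in the working directory to be picked up afterwards.

```python
import os


def split_pair(input_path, forward_path, work_dir="."):
    """Hypothetical splitter: odd-numbered lines go to forward_path,
    even-numbered lines go to reverse.txt inside work_dir."""
    reverse_path = os.path.join(work_dir, "reverse.txt")
    with open(input_path) as src, \
            open(forward_path, "w") as fwd, \
            open(reverse_path, "w") as rev:
        for index, line in enumerate(src):
            # index 0, 2, 4, ... are the 1st, 3rd, 5th, ... lines
            target = fwd if index % 2 == 0 else rev
            target.write(line)
    return reverse_path
```

The design point is only that one output is addressed explicitly by path and the other by a fixed file name in the working directory.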

The tool should describe the collection type via the type attribute on the collection element. data elements can define format, format_source, metadata_source, from_work_dir, and name.

The above syntax would also work for the corner case of static lists. For paired collections specifically however, the type plugin system now knows how to prototype a pair so the following even easier (though less configurable) syntax works.

<collection name="paired_output" type="paired" label="Split Pair" format_source="input1">
</collection>

In this case the command template could then just reference ${paired_output.forward} and ${paired_output.reverse} as demonstrated in test/functional/tools/collection_creates_pair_from_type.xml.

For the second case - where the structure of the output is based on the structure of an input - a structured_like attribute can be defined on the collection tag.

<collection name="list_output" type="list" label="Duplicate List"
     structured_like="input1" inherit_format="true">
</collection>

Templates can then loop over input1 or list_output when building up command-line expressions. See test/functional/tools/collection_creates_list.xml for an example.

format, format_source, and metadata_source can be defined for such collections if the format and metadata are fixed or based on a single input dataset. If instead the format or metadata depends on the formats of the collection it is structured like - inherit_format="true" and/or inherit_metadata="true" should be used instead - which will handle corner cases where there are for instance subtle format or metadata differences between the elements of the incoming list.
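As a rough mental model of the difference between a fixed format and inherit_format, the sketch below plans the output elements of a list that mirrors an input list. This is not Galaxy's implementation: the function plan_output_elements, the (identifier, format) tuples, and the rule that exactly one of the two options must be given are all assumptions made for illustration.

```python
def plan_output_elements(input_elements, fmt=None, inherit_format=False):
    """Return (identifier, format) pairs for an output list that mirrors
    input_elements, itself a list of (identifier, format) pairs.

    Hypothetical model: either every output gets the fixed fmt, or each
    output takes the format of the input element it corresponds to.
    """
    if inherit_format and fmt is not None:
        raise ValueError("give a fixed fmt or inherit_format=True, not both")
    if not inherit_format and fmt is None:
        raise ValueError("give a fixed fmt or inherit_format=True")
    planned = []
    for identifier, input_format in input_elements:
        output_format = input_format if inherit_format else fmt
        planned.append((identifier, output_format))
    return planned


inputs = [("sample1", "tabular"), ("sample2", "txt")]
fixed = plan_output_elements(inputs, fmt="txt")
inherited = plan_output_elements(inputs, inherit_format=True)
```

Here fixed gives both elements the txt format, while inherited keeps tabular for sample1 and txt for sample2. Note that nothing in this plan depends on the job having run: the identifiers come entirely from the input.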

The third and most general case is when the number of elements in a list cannot be determined until runtime - for instance, when splitting up files by various dynamic criteria.

In this case a collection may define one or more discover_datasets elements (introduced in pull request #356 and in rare form documented on the wiki).

For an example of such a tool - one that splits a tabular file into multiple tabular files based on the first column - see test/functional/tools/collection_split_on_column.xml, which includes the following output definition:

<collection name="split_output" type="list" label="Table split on first column">
  <discover_datasets pattern="__name_and_ext__" directory="outputs" />
</collection>
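The two halves of this pattern can be sketched in plain Python: a job that writes one file per first-column value into an outputs directory, followed by a discovery step that turns file names into (name, extension) pairs. Neither function is taken from Galaxy or from the referenced test tool. The names split_on_first_column and discover are hypothetical, and the regular expression is only an assumed stand-in for what a "name and extension" pattern might look like.

```python
import os
import re
from collections import defaultdict

# Assumed stand-in for a "name and extension" pattern; not copied from Galaxy.
NAME_AND_EXT = re.compile(r"^(?P<name>.+)\.(?P<ext>[^.]+)$")


def split_on_first_column(input_path, output_dir, ext="tabular"):
    """Write one file per distinct first-column value into output_dir.

    Assumes first-column values are safe to use as file names.
    """
    os.makedirs(output_dir, exist_ok=True)
    rows_by_key = defaultdict(list)
    with open(input_path) as src:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            key = line.split("\t", 1)[0].rstrip("\n")
            rows_by_key[key].append(line)
    for key, rows in rows_by_key.items():
        with open(os.path.join(output_dir, f"{key}.{ext}"), "w") as out:
            out.writelines(rows)


def discover(output_dir):
    """Return sorted (name, ext) pairs for files matching NAME_AND_EXT."""
    found = []
    for filename in sorted(os.listdir(output_dir)):
        match = NAME_AND_EXT.match(filename)
        if match:
            found.append((match.group("name"), match.group("ext")))
    return found
```

The number of elements returned by discover is only known once split_on_first_column has finished, which is exactly what distinguishes this case from the first two.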

I suspect the first mechanism above is fairly intuitive, and it is clear why it should exist. I do feel, however, that I should address why structured_like should exist - despite the fact that the discover_datasets mechanism can represent any tool that the structured_like mechanism could. There are two primary reasons. The first is that I really do think it is simpler to configure in most scenarios. The second is that such collections will prove to be more robust because they can be pre-created. For instance, dynamically discovering datasets in workflows requires the extensions implemented as part of the workflow scheduling rewrite (pull request #561), and as such should be considered an even more experimental feature; it will also provide a rockier UI (requiring an explicit refresh of the history as the workflow extends past such tools).


Galaxy workflow data flow before collections

* - * - * - * - * - *

Galaxy workflow data flow after collections (iteration 1)

* - * - * \
           * - * - *
* - * - * /         \
                     * - * - *
* - * - * \         /
           * - * - *
* - * - * /

Galaxy workflow data flow after this pull request

              / * - * \
         * - *         * - *
        /     \ * - * /     \
       /                     \
      /                       \
     /        / * - * \        \
* - * -- * - *         * - * -- * - *
     \        \ * - * /        /
      \                       /
       \                     /
        \     / * - * \     /
         * - *         * - *
              \ * - * /
