Enhancements for Runtime Discovered (Collected Primary) Datasets

#356 Merged at 8e6cda4
John Chilton


This pull request comprises two main improvements aimed at tool developers using these kinds of datasets: the ability to test these outputs and the ability to configure how they are discovered.


These extensions make it easier to write a tool that outputs a directory of BAM files, and they make it possible to actually test this dynamic behavior.


This pull request extends the tool test output tag so that it may contain any number of nested discovered_dataset tags. Each nested discovered_dataset tag must specify a designation attribute (designation is the terminology used in the code - not sure it is the right thing to expose to developers). Additionally, it must specify one or more tests for the output - mirroring the options available to output elements, it can contain a file attribute, an assert_contents child, or a metadata child.

The following code example is taken from test/functional/tools/multi_output.xml and demonstrates adding a test for the world discovered dataset.

      <param name="input" value="7" />
      <output name="report">
          <assert_contents>
              <has_line line="Hello" />
          </assert_contents>
          <discovered_dataset designation="world">
              <assert_contents>
                  <has_line line="World Contents" />
              </assert_contents>
          </discovered_dataset>
      </output>

This example test can be executed with the command:

% sh run_functional_tests.sh -framework -id multi_output

Note these enhancements are only available for API driven tool tests.

Customizing Discovery

Galaxy checks configured directories for files of the form primary_<dataset_id>_<designation>_<visible>_<ext>(_<dbkey>). The implementation of this has been transformed to use a regular expression with named groups for designation, visible, ext, and dbkey. The advantage of doing this is that it allows the tool author to easily swap in new patterns.
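As a rough sketch of what named-group matching buys here (the exact default regular expression lives in lib/galaxy/tools/parameters/output_collect.py; the pattern and filename below are illustrative, not copied from the code):

```python
import re

# Illustrative approximation of the default primary-dataset pattern:
# named groups let the collection code pull designation, visible, ext,
# and an optional dbkey straight out of a matched filename.
DEFAULT_PATTERN = (
    r"primary_(?P<dataset_id>\d+)_"
    r"(?P<designation>[^_]+)_"
    r"(?P<visible>[^_]+)_"
    r"(?P<ext>[^_]+)"
    r"(_(?P<dbkey>[^_]+))?"
)

match = re.match(DEFAULT_PATTERN, "primary_1_world_visible_txt_hg19")
print(match.group("designation"))  # world
print(match.group("ext"))          # txt
print(match.group("dbkey"))        # hg19
```

Because discovery only depends on the group names, any pattern defining the same groups can be dropped in without touching the collection code.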

Tool output elements may now contain any number of discover_datasets elements. Each should define at least a pattern attribute (or the default pattern described above will be used). These elements may also contain dbkey, ext, and visible attributes to provide defaults if the supplied pattern does not contain them. Finally, a directory attribute can be supplied on the discover_datasets tag to search specific sub-directories of the job_working_directory.
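Put together, a hypothetical output definition using these attributes might look like the following (the output name, directory, and pattern here are made up for illustration; note the &amp;lt;/&amp;gt; escaping required for named groups inside XML attributes):

```xml
<outputs>
    <data name="report" format="txt">
        <discover_datasets pattern="(?P&lt;designation&gt;.+)\.fasta"
                           directory="fasta_outputs"
                           ext="fasta"
                           visible="true" />
    </data>
</outputs>
```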

To allow complete configurability of discovered dataset names, name can be used in place of designation in patterns and named patterns (described below) to also name the resulting dataset (the previous behavior of naming these outputs based on the original output name and the designation remains the default).
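For instance, a pattern along these lines (purely illustrative - the directory and extension are made up) would set each discovered dataset's name directly from the filename rather than deriving it from the output name and designation:

```xml
<discover_datasets pattern="(?P&lt;name&gt;.+)\.tabular" directory="tables" ext="tabular" />
```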

A complete example of custom pattern matching with matching test cases can be found in test/functional/tools/multi_output_configured.xml. To run this functional test, simply execute:

% sh run_functional_tests.sh -framework -id multi_output_configured

Below is an example from a unit test included with this pull request. It demonstrates using a custom pattern to grab the dbkey and designation from the filename while providing a default ext.

    def test_custom_pattern( self ):
        # Hypothetical oral metagenomic classifier that populates a directory
        # of files based on name and genome. Use custom regex pattern to grab
        # and classify these files.
        self._replace_output_collectors( '''<output><discover_datasets pattern="(?P&lt;designation&gt;.*)__(?P&lt;dbkey&gt;.*).fasta" directory="genome_breakdown" ext="fasta" /></output>''' )
        self._setup_extra_file( subdir="genome_breakdown", filename="samp1__hg19.fasta" )
        self._setup_extra_file( subdir="genome_breakdown", filename="samp2__lactLact.fasta" )
        self._setup_extra_file( subdir="genome_breakdown", filename="samp3__hg19.fasta" )
        self._setup_extra_file( subdir="genome_breakdown", filename="samp4__lactPlan.fasta" )
        self._setup_extra_file( subdir="genome_breakdown", filename="samp5__fusoNucl.fasta" )

        # Put a file in directory we don't care about, just to make sure
        # it doesn't get picked up by pattern.
        self._setup_extra_file( subdir="genome_breakdown", filename="overview.txt" )

        primary_outputs = self._collect( )[ DEFAULT_TOOL_OUTPUT ]
        assert len( primary_outputs ) == 5
        genomes = dict( samp1="hg19", samp2="lactLact", samp3="hg19", samp4="lactPlan", samp5="fusoNucl" )
        for key, hda in primary_outputs.iteritems():
            assert hda.dbkey == genomes[ key ]

The regular expression group syntax requiring angle brackets makes it awkward to express in XML - so several predefined "named" patterns have been supplied (see lib/galaxy/tools/parameters/output_collect.py). These patterns include:

  • __default__: This is just the default, existing pattern (starting with primary_).
  • __designation_and_ext__: Will read all files of the form <basename>.<ext>, assign the designation to the basename, and use the file extension as the dataset ext.
  • __name_and_ext__: Just like the above pattern, but setting the dataset name, in addition to the designation, to the basename.
  • __designation__, __name__: These patterns will use the complete filename as the dataset designation or name respectively, and so should be used in conjunction with an explicitly defined ext attribute on the discover_datasets tag to explicitly provide a dataset type.

WARNING: Besides __default__, all of these patterns are very liberal and so should only be used with the directory attribute, so as not to pick up random metadata files, etc...
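For example, a discover_datasets tag along these lines (the sub-directory name here is hypothetical) scopes the liberal __designation_and_ext__ pattern to a single sub-directory of the job working directory:

```xml
<discover_datasets pattern="__designation_and_ext__" directory="split_output" />
```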

In order to enable the extensions to the testing framework described above, two minor modifications have been made to the API. The dataset provenance controller now returns an encoded job id, allowing job information to be grabbed easily from dataset provenance output. Additionally, the Jobs API itself has been augmented to return information on input and output job dataset associations - this is required to determine which datasets, if any, match a job's discovered datasets.


Beyond this being a frequent request (e.g. this), at some point dataset collections will hopefully make the outputs from tools such as these more useful (e.g. usable within workflows) and manageable - so I want to make it easier to write such a tool and to establish some reusable components (both at the XML level and in the implementation) for describing these collections.
