
From Bioconductor package to Galaxy tool

This guide is for Galaxy programmers wanting to create new Galaxy tools. It was drafted during the process of developing a new Galaxy tool to prepare QC reports on short read sequencing data files using the Bioconductor ShortRead package. It follows the winding course Ross Lazarus followed. It may contain useful tips if you are starting out.

Writing a new Galaxy tool can be a very simple process, but in this case we chose to add a layer of complexity, a Python wrapper, because the Bioconductor package outputs are not immediately suitable for display in a user's history.

Note that although Galaxy makes it relatively easy, writing, installing and testing a new Galaxy tool requires a reasonably comprehensive understanding of the care and feeding of the package to be installed, as well as substantial programming expertise. Even though no changes are made to the Galaxy framework code, new tool code must be supplied to make the package run properly inside Galaxy.

10,000 ft. view

Out of the box and with all dependencies satisfied, a new installation of Galaxy has a wide range of tools. If you've ever wondered why some of your favorite analysis tools are not available, these notes might help.

Making a new analysis tool available in Galaxy means adding a new tool to extend Galaxy. You can only do this as an authorized user and only on a local copy of Galaxy. Tools running on any Galaxy instance (including the main one at http://usegalaxy.org) are all installed locally as Galaxy does not (yet) have the capacity to execute remotely hosted tools.

Programmer View:

existing           wrapper,                                                             
(eg BioC)     ---> newtool.xml,       -->  Galaxy  
Package            add to tool_conf                     

User View

             Params,
History <--- Job control,    <---  NewTool   <--- User
             Output xform

A Galaxy tool 'wraps' some existing, useful piece of third party (ie not developed as part of Galaxy) analysis software. The specific software package must already be installed correctly and able to execute correctly on the computational cluster your Galaxy uses. Then, a programmer must describe to Galaxy what items to present to the user on the tool form, and how the new executable is called at run time. All this is done by writing a tool descriptor file marked up with XML. This descriptor is used to create an interactive tool form where parameters can be set when a user selects the new tool from the Galaxy tool menu.

Components to be created

A new Galaxy tool requires:

  1. a new tool descriptor XML file in a subdirectory of the Galaxy installation tools directory
  2. a new entry naming that new XML file in tool_conf.xml
  3. optionally, a wrapper to call the third party application and adjust inputs and/or outputs

The properties of each tool, including its documentation and the user form where parameters are chosen, are defined in XML. This markup tells Galaxy how the package should be handled and which parameters the user can set.

Telling Galaxy to load your new tool

A Galaxy restart is needed so that Galaxy reads the edits you've made to tool_conf.xml; until then the new tool will not be visible or usable in your Galaxy tool list. Check the output in paster.log to make sure your new tool xml loaded without error. The tool won't appear in the tool menu if it failed to load for any reason. Failure to parse broken XML is the usual reason, but a failure to parse any included Python code can also be the cause if you include code using the code tag.
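Since broken XML is the usual culprit, it can save a restart cycle to check the descriptor before touching tool_conf.xml. The following is a minimal sketch, not part of Galaxy: check_tool_xml is a hypothetical helper name, and it only confirms that the file is well-formed XML with a tool root element. It does not validate the descriptor against everything Galaxy expects.

```python
import xml.etree.ElementTree as ET


def check_tool_xml(path):
    """Return None if the descriptor parses cleanly, else a short error message.

    Hypothetical helper: it checks well-formedness and the root element only.
    """
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError as err:
        return "%s: not well-formed XML (%s)" % (path, err)
    if root.tag != "tool":
        return "%s: root element is <%s>, expected <tool>" % (path, root.tag)
    return None
```

A return value of None means the file at least parses; anything else is a message worth reading before restarting Galaxy.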

Wrapper scripts

Many third party analysis tools have input and output formats compatible with Galaxy datatypes and can be called directly in the "command" line generated by the new tool. Some do not. For example, some useful R and Bioconductor reporting functions such as ShortRead QC create HTML reports that rely on specific subdirectories and are not immediately suitable for display in Galaxy. In these cases, a Python wrapper is one convenient way to rearrange the outputs a little so Galaxy can present them appropriately.

For the ShortRead package used in this example, we add one layer of additional complexity and create two new Galaxy source files. The extra layer will be a python script serving as a wrapper for running R and cleaning up the outputs.

Preparatory work

Preparatory steps briefly described below are always needed, because the new package must be tested and run at least once before we know enough to make decisions about the architecture of a new tool. Tool functional tests for the new tool can be created based on the outputs from running a small set of data - these tests will be run routinely as part of the functional tests if the tool is present in tool_conf.xml.sample.

For wrapping the ShortRead BioC package:

  • We need a package (ShortRead) installed for Galaxy

Clearly, the R executable that runs on the target Galaxy job runner (cluster) node(s) must have the appropriate Bioconductor packages and dependencies installed and ready to run or the tool cannot possibly function. This is generally done as root for R.

NOTE: If you have upgraded your local R installation, you may want to follow the advice at http://www.bioconductor.org/install/index.html#update-bioconductor-packages before updating.

The recommended incantation depends on local setup but is probably something like:

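sudo R
# the remaining lines are typed inside the R session
source("http://bioconductor.org/biocLite.R")
pkgs <- rownames(installed.packages())  # names of every package already installed
biocLite(pkgs)  # reinstall them all for the upgraded R
biocLite(c('lumi','ArrayExpress','GEOquery','GGtools','arrayQualityMetrics','affyQCReport','snpMatrix','ArrayTools','ShortRead'))
q()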

Otherwise, we recommend updating any existing BioC packages before installing the new one.

The first time is probably something like:

source("http://www.bioconductor.org/biocLite.R")
biocLite(c('lumi','ArrayExpress','GEOquery','GGtools','arrayQualityMetrics','affyQCReport','snpMatrix','ArrayTools','ShortRead'))
q()

Once biocLite.R has been sourced (it should be in .Rprofile for the main system), updates are done using:

old.packages(repos=biocinstallRepos())  # tells you which packages are out of date
update.packages(repos=biocinstallRepos(), ask=FALSE) #will retrieve and install newer versions when available
q()

Everything should download and compile. If it does not, the tool cannot work and you may need to seek help from the BioC list (don't forget to post the output from sessionInfo() in your question to show the versions of packages you have).

If you reach your Galaxy machine over ssh, as any sane setup does, it is ugly but necessary to use ssh -Y rather than the safer ssh -X to finesse X11 strict authentication when you play with sudo. If you forget, our experience is that X authentication errors will more often than not cause a BioC installation to go awry whenever one of the packages accesses the X11 device and triggers an X11 authentication exception. Recovery eventually means removing the lock directory by hand and repeating the installation.

If all else fails, try using an SSH terminal session without any X connection (ie no -X or -Y).

  • We need to be able to run the Bioconductor package and get predictable outputs

Writing a tool XML file and wrapper script requires a clear understanding of the care and feeding of the package. Our technical lead user for this project tells us that for this tool, a typical use is:

require(ShortRead)
qaSummary<-qa(path,"*_sequence.txt.gz", "fastq")
report(qaSummary,dest="/path/to/report_directory")

So, we test this. It turns out that a successful run generates:

  1. an HTML file, index.html,
  2. pdf/jpg files in an images subdirectory, linked from index.html, and
  3. a CSS file

We will take care of preparing and running the R code in the wrapper, but we now know a lot including what outputs the package creates, so we can design the new tool architecture to cope.

  • Parameters

From running the package, it seems the parameters include a directory, a glob or filename to match in that directory, an incoming file format string (eg 'fastq'), a destination directory and a destination html report file name.

Galaxy will provide a file name for the new html page when the tool is run, but we will need to rewrite the package report html from 'index.html' with some file path edits so it works in Galaxy.

  • Run an R script as part of a Galaxy job

There are lots of ways of doing this and they each have benefits and costs. One simple example can be found in tools/stats/r_wrapper.sh but the most general method is to create a template for the required R script, instantiate the template and write it, then run it calling R using the subprocess module.
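The template approach can be sketched as follows. This is an illustration under stated assumptions rather than the actual wrapper: R_TEMPLATE, write_r_script and run_r_script are hypothetical names, the R calls are the qa and report calls from the typical use shown earlier, and Rscript is assumed to be on the PATH of the job runner node. Galaxy passes a single dataset path, so the sketch splits it into the directory and file name that qa expects.

```python
import os
import subprocess
from string import Template

# R code to run, with $placeholders filled in per job (hypothetical template).
R_TEMPLATE = Template("""\
library(ShortRead)
qaSummary <- qa("$dirpath", "$pattern", "$filetype")
report(qaSummary, dest="$dest")
""")


def write_r_script(input_file, filetype, dest, script_path):
    """Instantiate the template, write it to script_path and return the text."""
    # qa wants a directory plus a file pattern, not one full path.
    dirpath, pattern = os.path.split(input_file)
    script = R_TEMPLATE.substitute(
        dirpath=dirpath, pattern=pattern, filetype=filetype, dest=dest)
    with open(script_path, "w") as handle:
        handle.write(script)
    return script


def run_r_script(script_path):
    """Run the script with Rscript (assumed to be on PATH); return its exit code."""
    return subprocess.call(["Rscript", "--vanilla", script_path])
```

Keeping the generated script on disk means it is an artifact that can be inspected or rerun by hand if a job misbehaves.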

TODO: You can also use rpy but you'll be writing 1.03 code since we haven't bitten the bullet and upgraded all the code (there's a lot...) to 2.0 - so I'm holding off for the moment and usually writing actual R scripts as artifacts because I think that's most consistent with users really being able to reproduce what was executed - here be dragons.

  • Clean up and package outputs.

After R is done, it leaves files where we don't want them, linked with what would be broken links in Galaxy. Fixing this sort of stuff in R may be your preference, but I like Python, and that's why I chose to go to all the trouble of creating a wrapper. The html output from the package appears in index.html but we need it written to the output file Galaxy supplies, and along the way we need to make some small adjustments to the links.

In summary, the wrapper will:

  1. rewrite the index.html written by the BioC package,
  2. strip all href targets to bare filenames, and
  3. move all the href targets to the $out_html.files_path directory.
For the wrapper, and for the package, the output directory where all the images are to be moved is passed on the command line as:

-o $out_html.files_path

  • create a small sample for the tool tests

Galaxy tools must have at least some rudimentary functional test before they can go out for user acceptance testing. Tests should take a small input and generate predictable outputs. To write a test, we need to have those inputs and expected outputs. eg

head -1000000 /mnt/memefs/Hi-Seq-Pilot_2010/s_3_1_sequence.txt > test-data/ShortRead1Mrows.fastq

gives us a tiny sample (about 0.3% of the original file) but it's still huge for a functional test.
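If an even smaller sample is wanted, the same idea can be written in Python so that the cut always falls on a record boundary. This is a sketch that assumes plain four-line FASTQ records (no wrapped sequence lines); head_fastq is a hypothetical helper name.

```python
from itertools import islice


def head_fastq(src, dest, n_records):
    """Copy the first n_records FASTQ records (4 lines each) from src to dest.

    Returns the number of complete records written.
    """
    written = 0
    with open(src) as infile, open(dest, "w") as outfile:
        while written < n_records:
            record = list(islice(infile, 4))
            if len(record) < 4:
                break  # end of file, or a truncated final record
            outfile.writelines(record)
            written += 1
    return written
```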

  • back into R to generate expected test outputs

> library('ShortRead')
Loading required package: IRanges

Attaching package: 'IRanges'

The following object(s) are masked from 'package:base':

   cbind, Map, mapply, order, paste, pmax, pmax.int, pmin, pmin.int,
   rbind, rep.int, table

Loading required package: GenomicRanges
Loading required package: Biostrings
Loading required package: lattice
Loading required package: Rsamtools
> qas = qa('/tmp','s_3_1_1Mrows.txt','fastq')
> report(qas,dest='/tmp/fooqa')
[1] "/tmp/fooqa/index.html"
> q()

  • Move expected test outputs and write tests

Tool functional tests are an important part of any new tool. After the R session above, /tmp/fooqa holds an html page, a css file and a subdirectory of jpg/pdf images.

These expected outputs might go to your Galaxy test-data/ directory, or in our case perhaps to the rgtestouts/toolname directory.

Components needed in the xml tool file

A tool file contains XML elements and attributes that define all aspects of a tool's behavior in Galaxy through the exposed tool API.

command : The command template shows Galaxy how to create the correct command line to send to the cluster. The user's choices are substituted for the parameter placeholders.

If the Python wrapper is called shortreadqc.py then we can use something like:

<command>shortreadqc.py -i $input_file -o $out_html.files_path -h $out_html -t $input_file.ext</command>

Items in the template starting with the $ character are placeholders for user supplied parameters and for the file paths Galaxy provides.
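On the wrapper side, the flags in the command template have to be read back in. A minimal sketch using argparse follows; parse_args and the destination attribute names are hypothetical, and only the four flags (-i, -o, -h, -t) come from the command template. Because -h is used for the html output path, the sketch turns off argparse's built-in help option, which would otherwise claim -h.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags used in the command template (names are illustrative)."""
    # add_help=False frees -h, which the template uses for the html output.
    parser = argparse.ArgumentParser(
        add_help=False,
        description="Run ShortRead QC and package the report for Galaxy")
    parser.add_argument("-i", dest="input_file", required=True,
                        help="path of the input dataset")
    parser.add_argument("-o", dest="files_path", required=True,
                        help="directory for images and other linked files")
    parser.add_argument("-h", dest="out_html", required=True,
                        help="path of the html output dataset")
    parser.add_argument("-t", dest="ext", required=True,
                        help="Galaxy datatype extension of the input")
    return parser.parse_args(argv)
```

Calling parse_args() with no argument reads sys.argv, which is what the wrapper would do when Galaxy runs it.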

input dataset : The BioC package needs an input file or directory of input files of the appropriate type

The ShortRead BioC package can read several formats ("SolexaExport", "SolexaRealign", "Bowtie", "MAQMap", "MAQMapShort", "fastq"), so in the tool xml, something like

 <param name="input_file"  type="data" label="Short read sequence data from your current history" format="fastq,fastqsanger" />

will present a select list on the tool form so the user can choose a fastq or groomed fastqsanger dataset from their history.

The type of input file (-t in the command above) is a required parameter for the package and must be passed to R. This can be found through $input_file.ext - which will map to 'fastq','bwa' or 'fastqsanger' which in turn must be mapped to the strings the package requires.

sidenote : It's possible to do this in the Cheetah template language available when constructing the command line. Eg something like:

<command>shortreadqc.py -i $input_file -o $out_html.files_path -h $out_html -t
#if $input_file.ext == 'bwa'
'Bowtie'
#else
'fastq'
#end if
</command>

This kind of string substitution can be efficiently pushed to the wrapper...
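Pushed into the wrapper, the same substitution becomes a dictionary lookup. This sketch mirrors the mapping in the sidenote above ('bwa' to 'Bowtie', the fastq variants to 'fastq'); SHORTREAD_TYPES and shortread_type are hypothetical names, and the table would need extending for any other datatype the tool accepts.

```python
# Galaxy datatype extension -> type string expected by the ShortRead package.
# Only the mappings mentioned in this guide are listed (illustrative table).
SHORTREAD_TYPES = {
    "fastq": "fastq",
    "fastqsanger": "fastq",
    "bwa": "Bowtie",
}


def shortread_type(galaxy_ext):
    """Translate a Galaxy extension, failing loudly on anything unknown."""
    try:
        return SHORTREAD_TYPES[galaxy_ext]
    except KeyError:
        raise ValueError("no ShortRead type known for Galaxy datatype %r"
                         % galaxy_ext)
```

Raising an error for an unknown extension is a design choice: it stops the job early instead of passing R a type string the package will reject later.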

Output html in the user's history : Since the package can write an html document, we should take advantage of the Galaxy html datatype where we can present the package output. Unfortunately we need to move the links around so Galaxy can present them using a secure display mechanism - but this is a very general approach to wrapping tools - after all, we can create html easily enough if the package doesn't...

Creating a new html dataset in the user's history requires an outputs section, something like:

  <outputs>
      <data format="html" name="out_html" />
  </outputs>

Link targets where Galaxy can find them to display : The output directory where all the images are to be moved is passed on the command line as:

-o $out_html.files_path

Note that this is really the extra_files_path when you want to access it - but for writing, it's files_path. Ask Nate. All the files linked in the package html output index.html must be moved there before the tool finishes, and as has been described above, the actual html must be written as the contents of out_html, with all links altered by removing any paths prepended by the R code. For example, a link like

<a href="images/foo.pdf">Foo pdf</a>

must be changed to

<a href="foo.pdf">Foo pdf</a>
