Parameter Based Bam file parallelization

#175 Declined
  1. Kyle Ellrott

This patch adds BAM file parallelization options for tools. Because it would be pretty inefficient to split up the BAM file, this code points all the tasks at the original file and tells each one which reference sequence to work on via a variable passed to the command line template. This is a new way of doing parallelization; previously, all parallel tasks ran the exact same command line. This change in how task command lines are created (by overlaying custom parameters on top of the parameters inherited from the parent job) requires the addition of a new database table called 'task_parameters'. In this example, the variable 'seq_name' is used by the parallelization code to tell each task which reference sequence from the BAM file it should work on (i.e., each task will work on a different chromosome).
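
As a rough illustration of what "same file, different parameter" means for the resulting task command lines, here is a minimal bash sketch. It is not the patch's splitter code (that lives inside Galaxy); the function names `list_sequences` and `task_commands` and the script name `bam_counter.sh` are hypothetical and only serve the example. It reads SAM header text, pulls out each @SQ sequence name, and prints one command line per name.

```shell
#!/bin/bash
# Sketch only: the function names and "bam_counter.sh" are hypothetical.

# Print the SN: value of every @SQ line read from stdin (SAM header text).
list_sequences() {
    awk -F'\t' '/^@SQ/ { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4) }'
}

# Read sequence names from stdin and print one command line per name.
# Every task points at the same BAM file and index; only the final
# argument (the seq_name value) differs between tasks.
task_commands() {
    local bam="$1" index="$2" seq
    while IFS= read -r seq; do
        printf 'bash bam_counter.sh %s %s %s\n' "$bam" "$index" "$seq"
    done
}
```

With samtools installed, something like `samtools view -H input.bam | list_sequences | task_commands input.bam input.bam.bai` would print one line per reference sequence in the header.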

Example program:

<tool id="bam_counter" name="BamCounter" version="1.0.0">
  <description>count bam data</description>
  <parallelism method="multi" split_inputs="input" split_mode="by_segment" segment_variable="seq_name" merge_outputs="outfile"></parallelism>
  <command interpreter="bash"> $input $input.metadata.bam_index $seq_name > $outfile</command>
  <inputs>
    <param format="bam" name="input" type="data" label="Select dataset to count"/>
    <param name="seq_name" type="hidden" value="-"/>
  </inputs>
  <outputs>
    <data format="txt" name="outfile" />
  </outputs>
</tool>

The bam counter program:


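# samtools looks for the index next to the BAM file as <bam>.bai
if [ ! -e "$1.bai" ]; then
    ln -s "$2" "$1.bai"
fi

if [ "$3" = "-" ]; then
    # Not split: count reads for every reference sequence in the header
    for a in `samtools view -H "$1" | grep "@SQ" | sed 's/.*SN:\([^\n\t]*\).*/\1/'`; do
        echo -n "$a" " "
        samtools view "$1" "$a" | wc -l
    done
else
    # Split task: count reads for the one reference sequence assigned to it
    echo -n "$3" " "
    samtools view "$1" "$3" | wc -l
fi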

Comments (4)

  1. John Chilton

    @kellrott Are you still using this? Is it working out well for you?

    I think I would like to see an alternative implementation of this. Basically, I think the splitter code should write the "part" parameters in some form as files and place them in each task working directory. The wrapper could then check whether its working directory has, say, a seq_name file and process just that part of the input if it does. This would simplify the XML: you wouldn't need that hidden parameter, and the command line would be largely the same whether or not the job is split. I would see this as a more flexible and less disruptive solution.

    Does this sound possible to you? Do you agree it would be a better implementation of this functionality?

  2. Kyle Ellrott author

    I haven't been using this patch. I've started to feel that any kind of 'custom' parallelization solution that we try to shoehorn into Galaxy will only end up serving a limited number of needs, and just creates some other random standard people have to work towards. Internally, I've been working with a branch that allows Galaxy to hand off distributed resource control to Mesos. This way the tool could be based on Hadoop, MPI, Spark, or anything that has a control wrapper.

  3. John Chilton

    Fair enough critique :). I am very eager to see your Mesos work integrated in Galaxy. On the other hand, I think there is value in having "tasks" as first-class entities, the way Galaxy currently works, that Galaxy can reason about and that allow jobs to be parallelized without the tool/wrapper needing to know it is being parallelized. Happy to see both paths pursued.

    It sounds like you won't have any hard feelings if I decline this pull request, then?

  4. Kyle Ellrott author

    Of the original 3 goals mentioned, I've completed the first one. The second actually has a bit of overlap with LWR (i.e., it needs a client-side tool runner), so I wanted to see how some of your refactoring went (and how much code I can share with it). And point 3 is dependent on #2.

    If you decline this merge, you'll save me from having to do it ;-)