jgi-workflows / Home

Lessons Learned

These pages are for workflow developers; feel free to edit. Questions and support requests should be posted to the Slack channel.

Setting up the Environment

  1. Shared workflows should use Docker containers for all tasks. Workflows should reference specific container versions (not "latest" or the default tag).

Trouble installing a package using conda install

I get an error when trying to run conda install -y -c bioconda bbmap:

CondaMultiError: CondaVerificationError: The package for openjdk located at <mypath>/miniconda2/pkgs/openjdk-11.0.1-h01d97ff_1016
appears to be corrupted. The path 'lib/modules'
specified in the package manifest cannot be found.

The solution is to clean out the old cached packages:

# first try this which will remove index cache, lock files, unused cache packages, and tarballs
conda clean --all

# and if that fails, do this but be warned:
# This will remove *all* writable package caches. This option is not included with the --all flag. 
# WARNING: This will break environments with packages installed using symlinks back to the package cache.
conda clean --force-pkgs-dirs


WDL Files

Using ${} in bash commands (when you don't want it interpreted as a variable by Cromwell)

  1. WDL doesn't have an escape for variable interpolation, so ${} is always interpolated by Cromwell. This means that bash parameter expansion such as ${file/suffix-old/suffix-new}, which you might use to change a file suffix, won't work, because Cromwell will try to replace ${} with a value of its own. This example won't work:
    command {
        fname="scaffolds.fasta"
        fname1="${fname/fasta/trim.fasta}" # doesn't work because cromwell tries to 
                                           # interpret this instead of letting bash shell do it.
    }
    

You can use a WDL String variable to work around this behavior:

String dollar = "$"
command <<<
    fname='scaffolds.fasta'
    fname1="${dollar}{fname/fasta/trim.fasta}"
>>>

Or use basename, perl, sed, or awk:

command <<<
  fname="scaffolds.fasta"
  # bash parameter expansion does not work here (see above), so use an external tool instead
  fname2="`basename $fname .fasta`.trim.fasta"
  fname3=`echo "$fname" | perl -pe 's/fasta$/trim.fasta/'`
  fname4=`echo "$fname" | sed 's|fasta$|trim.fasta|'`
>>>
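
The alternatives above can be checked in a plain shell, outside Cromwell, before they go into a command block. The following is a minimal sketch: the awk variant is an addition that does not appear in the original example, and the expected value is simply what bash's own ${fname/fasta/trim.fasta} expansion would produce for this file name.

```shell
#!/usr/bin/env bash
# Run outside Cromwell: confirm each workaround produces the same file name.
fname="scaffolds.fasta"
expected="scaffolds.trim.fasta"

via_basename="$(basename "$fname" .fasta).trim.fasta"
via_sed="$(echo "$fname" | sed 's|fasta$|trim.fasta|')"
# awk variant (assumption: not part of the original example)
via_awk="$(echo "$fname" | awk '{ sub(/fasta$/, "trim.fasta"); print }')"

for candidate in "$via_basename" "$via_sed" "$via_awk"; do
    if [ "$candidate" = "$expected" ]; then
        echo "ok: $candidate"
    else
        echo "MISMATCH: $candidate (expected $expected)"
    fi
done
```

None of these variants contains the ${ sequence, so Cromwell should have nothing to interpolate in them.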

Using languages other than bash in the command stanza

Example of running Python code. If you need to pass information between bash and Python, you can use a temporary file ("tmpfile", shown here). There may be other ways, but I couldn't get them to work.

command {
        # you can run bash and python
        echo We can run bash too

        python <<CODE > tmpfile
        import os, sys, glob

        if os.path.isfile('${dbpath}') and glob.glob('${dbpath}' + '*.nin'):
            print('${dbpath}')
        else:
            fname = os.path.basename('${dbpath}')
            cmd = "${cmd}" + " -in %s 1>makeblastdb.log"%fname
            os.symlink('${dbpath}', fname)
            os.system(cmd)
            print(os.path.join(os.getcwd(), fname))
        CODE

        cat tmpfile
    }
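
One possible alternative to the temporary file, sketched here in plain shell and not tested under Cromwell, is to pass values into Python through an environment variable and to capture what Python prints with command substitution. The variable name DBPATH and the path are hypothetical, and python3 is assumed to be on the PATH.

```shell
#!/usr/bin/env bash
# bash -> Python: export an environment variable.
# Python -> bash: capture Python's stdout with $( ... ).
export DBPATH="/tmp/example_db"   # hypothetical path, for illustration only

db_name=$(python3 <<'CODE'
import os
# Read the value exported by bash and print only the final path component.
print(os.path.basename(os.environ["DBPATH"]))
CODE
)

echo "python returned: $db_name"
```

Because the shell variables are written as $DBPATH and $db_name, without braces, Cromwell should not try to interpolate them if this pattern is moved into a command block.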

Create a hash & then access its contents

Create the hash (Map):

Map[String, String] outputName = {
        "refseq.mito": "refseq.mito",
        "refseq.bacteria": "refseq.bacteria",
        "assembly": "assembly.fasta"
    }

Access the contents:

scatter (pair in outputName) {
  String prefix = pair.left
  String value = pair.right
}

Testing if a file exists

You need to run a task that returns a boolean, then use the boolean in an "if" block. Note that the bash test -s used below is true only when the file exists and is not empty.

### in the workflow section: ###
# test if file exists. This task returns a boolean.
call if_file_exists {
     input: myfile=reads
}

# now you can make a decision
if(if_file_exists.answer) {
    call sometask {}
}

### in the task section ###
task if_file_exists {
    File myfile
    command {
        if [[ -s ${myfile} ]]; then
            echo true
        else
            echo false
        fi
    }
    output {
        Boolean answer = read_boolean(stdout())
    }
}
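
The behavior of the -s test can be confirmed in a plain shell before relying on it in the task. This is a small sketch using throwaway files in a temporary directory; the function name file_has_content is hypothetical and only wraps the same check used in the task's command block.

```shell
#!/usr/bin/env bash
# [ -s FILE ] is true only if FILE exists and has a size greater than zero.
file_has_content() {
    if [ -s "$1" ]; then
        echo true
    else
        echo false
    fi
}

workdir=$(mktemp -d)
: > "$workdir/empty.txt"                 # exists, but zero bytes
echo ">seq1" > "$workdir/reads.fasta"    # exists and has content

missing=$(file_has_content "$workdir/missing.txt")
empty=$(file_has_content "$workdir/empty.txt")
nonempty=$(file_has_content "$workdir/reads.fasta")

echo "missing file:   $missing"    # false
echo "empty file:     $empty"      # false
echo "non-empty file: $nonempty"   # true

rm -rf "$workdir"
```

An empty file is reported as false, so a task written this way treats empty inputs the same as missing ones.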

Remove File Extensions

In a task, do something like this:

File reference = "DOE_UTEX.polished.t635masked.fasta"
String reference_bname = basename(reference)
String a = sub(reference_bname, "\\.\\w+$", "")

This will give you a=DOE_UTEX.polished.t635masked
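
To try the regular expression outside WDL, roughly the same substitution can be run in a shell. This is a sketch that assumes sed -E is available; [A-Za-z0-9_] is written out in place of \w, and the backslash is not doubled because the pattern is not inside a WDL string.

```shell
#!/usr/bin/env bash
reference="DOE_UTEX.polished.t635masked.fasta"
reference_bname=$(basename "$reference")

# Strip only the last extension: a dot followed by word characters at the end of the name.
a=$(echo "$reference_bname" | sed -E 's/\.[A-Za-z0-9_]+$//')

echo "$a"   # DOE_UTEX.polished.t635masked
```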

Scatter within Scatter

Nested scatters are not supported (yet), but you might try a subworkflow to achieve the same effect.

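Setting a WDL variable depending on a conditional

The problem is that WDL doesn't let you assign the same variable name in more than one place, so you can't set it to different things in different branches. The solution is to use the select_first function, which returns the first defined value from an array of optional values.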

if (caller == 'VARSCAN') {
    Array[File]? bams_varscan=gatkqc.gatk_bams
}
Array[File]? gatkBamList =select_first([bams_varscan, merge_qc_bam_files.merged_bam])

* If caller is 'VARSCAN', gatkBamList will be set to gatkqc.gatk_bams; otherwise it will be set to merge_qc_bam_files.merged_bam.
* Note that both gatkqc.gatk_bams and merge_qc_bam_files.merged_bam need to be arrays.
* Also note that the question marks are required.


Subworkflows

  1. For shared/production workflows, all subworkflows should be added to this repository and referenced by URL (use jaws --list to list available workflows and their URLs). You can also supply your own WDL via the jaws -f option; in that case, any referenced subworkflows must be in the same folder as the main workflow (symlinks OK).

Inputs JSON file

  1. JSON format (i.e. inputs file) uses double-quotes, not single-quotes!
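
A quick way to catch quoting mistakes before submitting is to run the inputs file through a JSON parser. The sketch below uses Python's built-in json.tool module and assumes python3 is on the PATH; the workflow and input names are hypothetical.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)

# Hypothetical workflow and input names, for illustration only.
printf '%s\n' '{"my_workflow.reads": "reads.fastq"}' > "$workdir/good.json"
printf '%s\n' "{'my_workflow.reads': 'reads.fastq'}" > "$workdir/bad.json"

# json.tool exits with a non-zero status when the file is not valid JSON.
check_json() {
    if python3 -m json.tool "$1" > /dev/null 2>&1; then
        echo valid
    else
        echo invalid
    fi
}

good=$(check_json "$workdir/good.json")
bad=$(check_json "$workdir/bad.json")

echo "double-quoted file: $good"
echo "single-quoted file: $bad"

rm -rf "$workdir"
```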

Cromwell Specific Issues


Cromwell caching behavior

  1. Outputs are reused if the inputs and task command are identical. During development, if you change a script that a task calls, Cromwell will not recognize it as a different version (unless you also changed the filename or command-line parameters). To avoid reusing cached results, use the "jaws --rm" command to purge old results.
