1. Ross Lazarus
  2. rossdev

Wiki

Clone wiki

rossdev / picardWrapper_dev

Two Picard tools for Galaxy

Suggested by Mike Cho.

Picard executables for Galaxy tools

Picard tools are on q:/share/shared/galaxy/tool-date/shared/jars

To start, let's run them to see what they create. We need a small sam test file - I'll make one.

Test Runs

java -jar /share/shared/galaxy/tool-data/shared/jars/CalculateHsMetrics.jar BAIT_INTERVALS=test.pic TARGET_INTERVALS=test.pic INPUT=test.bam OUTPUT=picardHsMetrics.txt

produces:

## net.sf.picard.metrics.StringHeader
# net.sf.picard.analysis.directed.CalculateHsMetrics BAIT_INTERVALS=test.pic TARGET_INTERVALS=test.pic INPUT=test.bam OUTPUT=picardHsMetrics.txt    TMP_DIR=/tmp/rerla VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
## net.sf.picard.metrics.StringHeader
# Started on: Thu Oct 07 10:47:28 EDT 2010

## METRICS CLASS        net.sf.picard.analysis.directed.HsMetrics
BAIT_SET        GENOME_SIZE     BAIT_TERRITORY  TARGET_TERRITORY        BAIT_DESIGN_EFFICIENCY  TOTAL_READS     PF_READS        PF_UNIQUE_READS PCT_PF_READS    PCT_PF_UQ_READS PF_UQ_READS_ALIGNED     PCT_PF_UQ_READS_ALIGNED PF_UQ_BASES_ALIGNED     ON_BAIT_BASES   NEAR_BAIT_BASES OFF_BAIT_BASES  ON_TARGET_BASES PCT_SELECTED_BASES      PCT_OFF_BAIT    ON_BAIT_VS_SELECTED     MEAN_BAIT_COVERAGE      MEAN_TARGET_COVERAGE    PCT_USABLE_BASES_ON_BAIT        PCT_USABLE_BASES_ON_TARGET      FOLD_ENRICHMENT ZERO_CVG_TARGETS_PCT    FOLD_80_BASE_PENALTY    PCT_TARGET_BASES_2X     PCT_TARGET_BASES_10X    PCT_TARGET_BASES_20X    PCT_TARGET_BASES_30X    HS_LIBRARY_SIZE HS_PENALTY_10X  HS_PENALTY_20X  HS_PENALTY_30X
test    32581119        27986079        27986079        1       1047456 1047456 825929  1       0.788509        754784  0.913861        237122533       0       0       237122533       0       0       1       ?       0       ?       0       0       0       0.833575        ?       0       0       0       0               0       0       0

Note on 'bait and target' files for Picard

Mike gives /udd/remhc/Work/Sequencing/ucscCcds.interval_list as an example - it has header rows like:

@SQ     SN:chrM LN:16571
@SQ     SN:chr1 LN:247249719
@SQ     SN:chr2 LN:242951149
@SQ     SN:chr3 LN:199501827
@SQ     SN:chr4 LN:191273063

followed by BED like data rows like this

chr1    67052400        67052451        -       CCDS635.1_cds_0_0_chr1_67052401_r
chr1    67060631        67060788        -       CCDS635.1_cds_1_0_chr1_67060632_r
chr1    67065090        67065317        -       CCDS635.1_cds_2_0_chr1_67065091_r
chr1    67066082        67066181        -       CCDS635.1_cds_3_0_chr1_67066083_r

So, the sam style header first lists all the features (eg chromosome) referred to in the first column of all the targets - the L:n number is the sum of all the individual lengths for that feature in the data.

We will need to count those up for a fake header and rewrite as 1 based instead of ucsc zero based. This way the user can (eg) grab ucsc exons from the table browser for exomes and the Picard tool will convert those into it's own dumb format before running...

Here's some code

"""
Picard tools requiring targets want 
a sam style header:

@SQ     SN:chrM LN:16571
@SQ     SN:chr1 LN:247249719
@SQ     SN:chr2 LN:242951149
@SQ     SN:chr3 LN:199501827
@SQ     SN:chr4 LN:191273063
added to the start of what looks like a bed style file
chr1    67052400        67052451        -       CCDS635.1_cds_0_0_chr1_67052401_r
chr1    67060631        67060788        -       CCDS635.1_cds_1_0_chr1_67060632_r
chr1    67065090        67065317        -       CCDS635.1_cds_2_0_chr1_67065091_r
chr1    67066082        67066181        -       CCDS635.1_cds_3_0_chr1_67066083_r

see http://genome.ucsc.edu/FAQ/FAQtracks.html#tracks1
we need to add 1 to start coordinates on the way through - but length calculations are easier
"""
# bedToPicard.py
# ross lazarus October 2010
# LGPL 
# for Rgenetics

import sys

def getFlen(bedfname=None):
    """
    find all features in a BED file and sum their lengths
    """
    features = {}
    try:
        infile = open(bedfname,'r')
    except:
        print '###ERROR: getFlen unable to open bedfile %s' % bedfname
        sys.exit(1)
    for i,row in enumerate(infile):
        if row[0] == '@': # shouldn't happen given a bed file!
              print 'row %d=%s - should NOT start with @!' % (i,row)
              sys.exit(1)
        row = row.strip()
        if len(row) > 0:
            srow = row.split('\t')
            f = srow[0]
            spos = srow[1] # zero based from UCSC so no need to add 1 - eg 0-100 is 100 bases numbered 0-99 (!)
            epos = srow[2] # see http://genome.ucsc.edu/FAQ/FAQtracks.html#tracks1
            flen = int(epos) - int(spos)
            features.setdefault(f,0)
            features[f] += flen
    infile.close()
    return features

def keynat(string):
    '''
    borrowed from http://code.activestate.com/recipes/285264-natural-string-sorting/
    A natural sort helper function for sort() and sorted()
    without using regular expressions or exceptions.

    >>> items = ('Z', 'a', '10th', '1st', '9')
    >>> sorted(items)
    ['10th', '1st', '9', 'Z', 'a']
    >>> sorted(items, key=keynat)
    ['1st', '9', '10th', 'a', 'Z']    
    '''
    it = type(1)
    r = []
    for c in string:
        if c.isdigit():
            d = int(c)
            if r and type( r[-1] ) == it: 
                r[-1] = r[-1] * 10 + d
            else: 
                r.append(d)
        else:
            r.append(c.lower())
    return r

def writePic(outfname=None,bedfname=None):
    """
    collect header info and rewrite bed with header for picard
    """
    featlen = getFlen(bedfname=bedfname)
    try:
        outf = open(outfname,'w')
    except:
        print '###ERROR: writePic unable to open output picard file %s' % outfname
        sys.exit(1)
    infile = open(bedfname,'r') # already tested in getFlen
    fk = sorted(featlen.keys(), key=keynat)
    header = ['@SQ\tSN:%s\tLN:%d' % (x,featlen[x]) for x in fk]
    outf.write('\n'.join(header))
    outf.write('\n')
    for row in infile:
        row = row.strip()
        if len(row) > 0: # convert zero based start coordinate to 1 based
            srow = row.split('\t')
            srow[1] = '%d' % (int(srow[1])+1) # see http://genome.ucsc.edu/FAQ/FAQtracks.html#tracks1
            outf.write('\t'.join(srow))
            outf.write('\n')
    outf.close()
    infile.close()



if __name__ == "__main__":
    if len(sys.argv) < 3:
        print 'bedToPicard.py needs an existing bed format file path and the file path for the new Picard format file supplied on the command line'
        sys.exit(1)
    writePic(outfname=sys.argv[2],bedfname=sys.argv[1])



http://picard.sourceforge.net/command-line-overview.shtml#CalculateHsMetrics

CalculateHsMetrics

Calculates a set of Hybrid Selection specific metrics from an aligned SAMor BAM file.

OptionDescription
BAIT_INTERVALS=FileAn interval list file that contains the locations of the baits used. Required.
TARGET_INTERVALS=FileAn interval list file that contains the locations of the targets. Required.
INPUT=FileAn aligned SAM or BAM file. Required.
OUTPUT=FileThe output file to write the metrics to. Required. Cannot be used in conjuction with option(s) METRICS_FILE (M)
METRICS_FILE=FileLegacy synonym for OUTPUT, should not be used. Required. Cannot be used in conjuction with option(s) OUTPUT (O)
CREATE_MD5_FILE=BooleanWhether to create an MD5 digest for any BAM files created. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}

CollectAlignmentSummaryMetrics

http://picard.sourceforge.net/command-line-overview.shtml#CollectAlignmentSummaryMetrics Reads a SAM or BAM file and writes a file containing summary alignment metrics.

OptionDescription
INPUT=FileSAM or BAM file Required
OUTPUT=FileFile to write insert size metrics to Required.
REFERENCE_SEQUENCE=FileReference sequence file Required.
ASSUME_SORTED=BooleanIf true (default), "unsorted" SAM/BAM files will be considerd coordinate sorted Default value: true. This option can be set to 'null' to clear the default value. Possible values: {true, false}
MAX_INSERT_SIZE=IntegerPaired end reads above this insert size will be considered chimeric along with inter-chromosomal pairs. Default value: 100000. This option can be set to 'null' to clear the default value.
ADAPTER_SEQUENCE=StringThis option may be specified 0 or more times. This option can be set to 'null' to clear the default list.
IS_BISULFITE_SEQUENCED=BooleanWhether the SAM or BAM file consists of bisulfite sequenced reads. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}
CREATE_MD5_FILE=BooleanWhether to create an MD5 digest for any BAM files created. Default value: false. This option can be set to 'null' to clear the default value. Possible values: {true, false}

Updated