Commits

Anonymous committed 00c0223

Edits to the consensus caller CGES

Comments (0)

Files changed (6)

tools/consensus_caller/galaxy.consensus/README.md

-Introduction:
+Description:
 ==============================
 
-This is an initial implementation of a two stage voting scheme among variant calling algorithms. Given a set of VCF files produced by various algorithms, sites are selected if they are seen among all callers. Genotypes among these sites are then selected as those that match among all callers. Currently, a user can input any number of sorted VCF files, and a strict consensus of variant sites and genotypes will be generated.
+This is an implementation of an ensemble variant calling method. Specifically, it takes VCF files generated by various calling algorithms and merges them according to specified thresholds on variant and genotype concordance. The resulting VCF can range from a strict consensus among inputs, to a union of all possible observations.
 
-Any VCF can be used as long as it can be parsed by [James Casbon's pyVCF module](https://github.com/jamescasbon/PyVCF).
+
+Check out the github pages site: http://vtrubets.github.io/galaxy.consensus/
+
+I'll be updating that site with an overview and tutorial soon.
+
 
 Options:
-========
+==============================
 
-    usage: consensus_genotyper.py [-h] VCFS [VCFS ...]
+    usage: consensus_genotyper.py [-h] [--site-threshold SITETHRESH]
+                                  [--genotype-threshold GENOTHRESH]
+                                  [--ignore-missing]
+                                  VCFS [VCFS ...]
 
-    Find sites and genotypes which aggree among an arbitrary number of VCF files.
-    
+    Find sites and genotypes that aggree among an arbitrary number of VCF files.
+
     positional arguments:
-      VCFS        List of VCF files for input.
+      VCFS                  List of VCF files for input.
 
     optional arguments:
-      -h, --help  show this help message and exit
+      -h, --help            show this help message and exit
+      --site-threshold SITETHRESH, -s SITETHRESH
+                            Number of inputs which must agree for a site to be
+                            included in the output.
+      --genotype-threshold GENOTHRESH, -g GENOTHRESH
+                            Number of inputs which must agree for a genotype to be
+                            marked as non-missing.
+      --ignore-missing, -m  Flag specifying how to handle missing genotypes in the
+                            vote. If present, missing genotypes are excluded from
+                            the genotype concordance vote unless all genotypes are
+                            missing.   usage: consensus_genotyper.py [-h] VCFS [VCFS ...]
+
+        Find sites and genotypes which aggree among an arbitrary number of VCF files.
+        
+        positional arguments:
+          VCFS        List of VCF files for input.
+
+        optional arguments:
+          -h, --help  show this help message and exit
 
 
 
 Usage:
-========
+==============================
 
 Test data is located in the data/ directory. The following command:
 
-python ./consensus_tool/consensus_genotyper.py ./data/*vcf > test.output.vcf
+python ./consensus_tool/consensus_genotyper.py --geno-thresh 3 --site-thresh 3 ./data/*vcf > test.output.vcf
 
-Will take the three test files in the data directory and generate a strict consensus of sites and genotypes (i.e. 3/3 files containt the variant site, and 3/3 files agree on the genotype for a sampple at that site).
+Will take the three test files in the data directory and generate a strict consensus of sites and genotypes (i.e. 3/3 files contain the variant site, and 3/3 files agree on the genotype for a sampple at that site).
 
 Some things to keep in mind: 
 * Multi-sample VCF files are currently supported, and the output will contain only samples which are found in all input files.
 * Files must be sorted by physical position. This can be achieved using any VCF utility such as (vcf-sort in vcftools)[http://vcftools.sourceforge.net/perl_module.html#vcf-sort]. The caller works by iterating simultaneously across all input files until a matching variant record is found. If a VCF file is not sorted similarly, it is unlikely that any overlapping sites will be found.
-* Missing data on the genotype level is ignored if actual genotypes are available in other VCF files. Missing data is produced only if all sites are missing, or if genotypes do not agree among all call sets.
-
+* VCF files must be indexed with [tabix](http://samtools.sourceforge.net/tabix.shtml). This also requires that they be zipped with bgzip.
 
 Planned Extensions:
 ===================
-* Operating on specific regions using the tabix index.
 * Support for multi-allelic sites.
-* Outputting variant sites which are discordant between callers. This is potentially interesting variation.
-* The ability to specify concordance thresholds on the site and genotype level. This could be particularly helpful if one set of variants is markedly different from others, or if one is interested in finding the union of call sets rather than an intersection.
-* The ability to preserve information from input VCF files. I'm thinking that it would help to specify this information in a high level configuration file. This would allow you to do things like propagate QUAL scores and compute with them downstream.
+* The ability to preserve information from input VCF files. Perhaps by specifying this information in a high level configuration file. This allows operations such as propagating QUAL scores and analyzing them downstream.
 

tools/consensus_caller/galaxy.consensus/consensus_tool/consensus.xml

 <?xml version="1.0"?>
 
-<tool name="naive variant consensus caller" id="awesome_consensus_0.1"  > 
+<tool name="Consensus Genotyper for Exome Variants" id="cges"  > 
     <description>
         Naive consensus caller for ATLAS SNP2, GATK UG, freebayes output files.
     </description>
 
     <command interpreter="python">
-        consensus_genotyper.py $ATLAS_VCF  $GATK_VCF $FREEBAYES_VCF > $OUT
+        consensus_caller/galaxy.consensus/consensus_tool/consensus_genotyper.py
+          --site-threshold $site_threshold
+          --geno-threshold $geno_threshold
+          $ignore_missing
+          $ATLAS_VCF
+          $GATK_VCF
+          $FREEBAYES_VCF
+          > $OUT
     </command>
 
     <inputs>
         <param name="GATK_VCF" type="data" format="vcf"
             label="GATK VCF">
         </param>
-        <param name="OUT" format="vcf" type="data" label="Consensus VCF">
+        <param name="site_threshold" type="integer" value="0" label="Concordance threshold for variant sites.">
+        </param>
+        <param name="geno_threshold" type="integer" value="0" label="Concordance threshold for genotypes.">
+        </param>
+        <param name="ignore_missing" type="boolean" truevalue="--ignore-missing" falsevalue="" label="Ignore missing genotypes during vote.">
         </param>
     </inputs>
     
     </outputs>
 
     <help>
-        This tool is designed to accept output VCF files from freebayes, GATK
-        Unified Genotyper, and ATLAS SNP2 and merge variation called within
-        those.
+    usage: consensus_genotyper.py [-h] [--site-threshold SITETHRESH]
+                              [--genotype-threshold GENOTHRESH]
+                              [--ignore-missing]
+                              VCFS [VCFS ...]
+
+Find sites and genotypes that aggree among an arbitrary number of VCF files.
+
+positional arguments:
+  VCFS                  List of VCF files for input.
+
+optional arguments:
+  -h, --help            show this help message and exit
+  --site-threshold SITETHRESH, -s SITETHRESH
+                        Number of inputs which must agree for a site to be
+                        included in the output.
+  --genotype-threshold GENOTHRESH, -g GENOTHRESH
+                        Number of inputs which must agree for a genotype to be
+                        marked as non-missing.
+  --ignore-missing, -m  Flag specifying how to handle missing genotypes in the
+                        vote. If present, missing genotypes are excluded from
+                        the genotype concordance vote unless all genotypes are
+                        missing.   usage: consensus_genotyper.py [-h] VCFS [VCFS ...]
+
+    Find sites and genotypes which aggree among an arbitrary number of VCF files.
+
+    positional arguments:
+      VCFS        List of VCF files for input.
+
+    optional arguments:
+      -h, --help  show this help message and exit
+
     </help>
 
 </tool>

tools/consensus_caller/galaxy.consensus/consensus_tool/consensus_genotyper.py

-from ensemble_walker import concordant_walker
+#from ensemble_walker import concordant_walker
+from vcf_ensemble import vcf_ensemble
 from variant_ensemble import variant_ensemble
 from consensus_writer import consensus_vcf
 import argparse as arg
+import pysam
+import sys
 
 def __main__():
 
   ## parse command line
-  parser = arg.ArgumentParser(description='Find sites and genotypes which aggree among an arbitrary number of VCF files.')
+  parser = arg.ArgumentParser(description='Find sites and genotypes that aggree among an arbitrary number of VCF files.')
   parser.add_argument('vcfFiles', nargs='+', metavar='VCFS', help='List of VCF files for input.')
+  parser.add_argument('--site-threshold', '-s', dest='siteThresh', action='store', type=int, help='Number of inputs which must agree for a site to be included in the output.')
+  parser.add_argument('--genotype-threshold', '-g', dest='genoThresh', action='store', type=int, help='Number of inputs which must agree for a genotype to be marked as non-missing.')
+  parser.add_argument('--ignore-missing', '-m', dest='ignoreMissing', action='store_true', help='Flag specifying how to handle missing genotypes in the vote. If present, missing genotypes are excluded from the genotype concordance vote unless all genotypes are missing.')
   args = parser.parse_args()
-
-
-  walker = concordant_walker(vcfList = args.vcfFiles, contig = '1')
-  ## set up the consensus VCF you want to write out
-  ## TODO:: there should be a standard and transparent way to propagate information for individual VCF files to the consensus stage.
+ 
+  ## create the VCF ensemble
+  ensemble = vcf_ensemble(vcfList = args.vcfFiles, ignoreMissing = args.ignoreMissing)
+ 
+  ## setup output VCF file. Dummy fields are created for downstream parsing with other tools.
   outVcf = consensus_vcf()
-  ## leave these for later -- there is a better way
-  #outVcf.add_info(id="AQ", number="1", type="Float", description="Atlas QUAL score")
-  #outVcf.add_info(id="FQ", number="1", type="Float", description="Freebayes QUAL score")
-  #outVcf.add_info(id="GQ", number="1", type="Float", description="GATK QUAL score")
-  #outVcf.add_format(id="PL", number="G", type="Integer", description="GATK's normalized, Phred-scaled likelihoods for genotypes as defined in their VCF spec")
-  outVcf.add_format(id="CN", number="1", type="Character", description="Consensus status")
+  outVcf.add_format(id="CN", number="1", type="Character", description="Consensus status. \'C\' is concordant, \'D\' is discordant, and \'A\' is ambiguous (i.e. no majority at the given genotype threshold).")
   outVcf.add_format(id="GT", number="1", type="String", description="Genotype")
-  outVcf.samples = walker.samples
+  outVcf.add_info(id="X1", number="1", type="String", description="Placeholder for INFO parsing")
+  outVcf.add_info(id="X2", number="1", type="String", description="Placeholder 2 for INFO parsing")
+  outVcf.samples = ensemble.samples
   outVcf.write_header()
 
-
-  ## walk over each contig independently
-  #contigs = ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrMT']
-  contigs = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22'] 
-
-  for contig in contigs:
-    ## instantiate a walker over the input vcf files.
-    walker = concordant_walker(vcfList = args.vcfFiles, contig = contig)
-
-    ## iterate over sites which match among all VCFs in concordant_walker
-    for concordantSites in walker.walk_concordant_sites():
-      consensus = variant_ensemble(recordSet=concordantSites, samples=walker.samples)
-      concordantGenotypes = consensus.set_consensus()
-      outVcf.write_record(concordantSites, concordantGenotypes)
+  ## iterate over the concordant sites 
+  for records, genotypes in ensemble.concordant_variants(siteThresh=args.siteThresh, genoThresh=args.genoThresh):
+    outVcf.write_record( records, genotypes )
 
 
 if __name__ == '__main__':

tools/consensus_caller/galaxy.consensus/consensus_tool/consensus_writer.py

     '''
     All variants will be uniquely identified by chr:pos:ref:alt
     '''
-    return ':'.join([rec.CHROM, str(rec.POS), rec.REF, str(rec.ALT)])
+    return ':'.join([rec.CHROM, str(rec.POS), rec.REF, str(rec.ALT[0])])
 
   def write_header(self):
     '''
     pos = str(recordSet[0].POS)
     id = self.make_var_id(recordSet[0])
     ref = recordSet[0].REF
-    alt = str(recordSet[0].ALT)
-    qual = '-'
+    alt = str(recordSet[0].ALT[0])
+    qual = '.'
     filter = 'PASS'
-    info = '-'
+    info = ':'.join( [x["ID"] for x in self.info] )
     format = ':'.join([ x["ID"] for x in self.format ])
     
     ## put together sample fields according to FORMATs order
     for sample in self.samples:
       ## TODO: include where data comes from in self.format -- this is hacky
       if genotypes[sample] == '*':
-        cn = 'F'
+        ## genotype was discordant
+        cn = 'D'
+        geno = './.'
+      elif genotypes[sample] == '**':
+        ## genotype was ambiguous
+        cn = 'A'
         geno = './.'
       else:
-        cn = 'T'
+        cn = 'C'
         geno = self.convert_genotype(genotypes[sample])
       genotypeFields.append(cn+':'+geno)
       
     ## put together the line in the VCF file
-    vcfLine = [chr, pos, id, ref, alt, qual, filter, format] + genotypeFields
+    vcfLine = [chr, pos, id, ref, alt, qual, filter, info, format] + genotypeFields
     print '\t'.join(vcfLine)

tools/consensus_caller/galaxy.consensus/consensus_tool/ensemble_walker.py

 
 class concordant_walker:
   '''
-  Walks across an arbitrary number of VCF files given a chomosome or contig contained in all
+  Walks across an arbitrary number of VCF files given a common contig.
   '''
 
+  @property
+  def positions(self):
+    '''
+    Current positions in each VCF file.
+    '''
+    return [ x.POS for x in self.records ]
+
+  @property
+  def primeIndices(self):
+    '''
+    Indices of VCF readers which match the current largest position.
+    '''
+    return [ i for i, x in enumerate(self.positions) if x==self.prime ]
+
+  @property
+  def prime(self):
+    '''
+    Largest position currently observed.
+    '''
+    return max(self.positions)
+
+  @property
+  def vcfCount(self):
+    '''
+    Number of VCF files passed in to this instance of the consensus walker.
+    '''
+    return len(self.vcfs)
 
   @property
   def samples(self):
+    '''
+    Samples common to all input VCF files.
+    '''
+    sampleSets = [ set(x.samples) for x in self.readers  ]
+    self.samples = reduce( lambda x,y: x.intersection(y), sampleSets )
     return self.samples
+  
 
   def __init__(self, *args, **kwargs):
     self.vcfs = kwargs.get('vcfList')
-    self.contig = kwargs.get('contig')
-    self.readers = [ pyvcf.Reader(open(x), prepend_chr=True) for x in self.vcfs ]
-    self.readers = [ x.fetch(chrom=self.contig, start=1, end=1000000000) for x in self.readers ]
-
-    ## number of readers passed to this instance
-    self.vcfCount = len(self.vcfs)
-    ## first records in each file
-    self.records = [ x.next() for x in self.readers ] 
-    ## track current positions of each record
-    self.positions = [ int(rec.POS) for rec in self.records  ]
-    ## track current greatest position (known as the prime position, and prime readers)
-    self.prime = max(self.positions)
-    ## pull inidices of readers which have reached the largest observed position
-    self.primeIndices = [ i for i in range(self.vcfCount) if self.positions[i]==self.prime ]
-    ## identify samples which match between sets
-    sampleSets = [ set(x.samples) for x in self.readers  ]
-    self.samples = reduce( lambda x,y: x.intersection(y), sampleSets )
-
+    ## start readers across the VCF files
+    self.readers = [ pyvcf.Reader(open(x), prepend_chr = False) for x in self.vcfs ]
+    
+    self._getNewRecords()
     ## TODO: check that files are sorted
     for vcf in self.vcfs:
       pass
 
-  def __updatePrimeIndices(self):
-    '''
-    Update indices of VCF readers which match the current largest position.
-    '''
-    self.primeIndices = [ i for i, x in enumerate(self.positions) if x==self.prime ]
-
-
-  def __updateRecords(self):
+  def _updateRecords(self):
     '''
     Iterate non-prime readers.
     '''
-    ## pull indices of readers which need to iterate to reach or exceed the prime position 
-    iterIndices = [ i for i in range(self.vcfCount) if i not in self.primeIndices ] 
     ## update records in non-prime readers
     self.records = [ x if i in self.primeIndices else self.readers[i].next() for i,x in enumerate(self.records) ]
-    self.positions = [ int(x.POS) for x in self.records ]
     ## update prime if necessary
     if max(self.positions) > self.prime:
       self.prime = max(self.positions)
-    ## update which readers we are iterating over
-    self.__updatePrimeIndices()
 
+  def _getNewRecords(self):
+    '''
+    Get a new set of records after processing concordant sites or initilization.
+    '''
+    self.records = [ x.next() for x in self.readers ]
+    self._updateRecords()
 
-  def __getNewPositions(self):
-    '''
-    Try a new set of positions after processing concordant sites.
-    '''
-    self.positions = [ int(x.next().POS) for x in self.readers ]
-    self.__updatePrimeIndices()
-
-  def __iterUntilEqual(self):
-    '''
-    Update the positions vector until all positions are equal.
-    '''
-    ## check if all readers have reached the same position
-    while len(set(self.positions)) != 1:
-      try:
-        self.__updateRecords()
-      except StopIteration:
-        ## we've exhausted one of the readers
-        return False
-    ## we have not yet exhausted the readers
-    return True
 
   def walk_concordant_sites(self):
     '''
-    Return variants which match across call sets as you walk files
+    Return variants which match across call sets as you walk files.
     
     Start simple:
-      * assume vcf is sorted
       * assume only SNPs
       * assume only bi-allelic variation
-      * assume only one chromosome
     '''
+    ## iterate until all positions are equal
+    while True:
+      if len(set(self.positions)) != 1:
+        try: self._updateRecords()
+        ## stop only when a reader has been exhausted
+        except StopIteration:
+          break
+      else:
+        yield self.records
+        self._getNewRecords()
+ 
 
-    while self.__iterUntilEqual():
-      ## move on to next set of variants
-      self.__getNewPositions()
-
-      ## yield set genotypes
-      yield self.records
-
-    
-
-

tools/consensus_caller/galaxy.consensus/consensus_tool/variant_ensemble.py

+import itertools as it
+from collections import Counter
+from my_exceptions import ambiguousConcordance
+from my_exceptions import discordant
+
 class variant_ensemble:
   '''
   A set of variants from various VCF files. We want to be able to call various forms of consensus.
   def __init__(self, *args, **kwargs):
     self.recordSet = kwargs.get('recordSet')
     self.samples = kwargs.get('samples') 
+    self.ignoreMissing = kwargs.get('ignoreMissing')
+    self.threshold = kwargs.get('threshold')
 
-  def is_genotype_consensus(self, calls):
+  def _genotype_concordance(self, calls):
     '''
-    Determine whether all _Calls for a sample at a site are equal.
+    Find the call which agrees at a certain threshold. If a tie is observed, an exception is raised, and a missing value will be written to the VCF file and flagged as ambiguous.
     '''
-    ## check to see if all calls are equal
-    if calls[1:] == calls[:-1]: return True
-    else: return False
+    ## count occurences of a genotype, and those which match the threshold
+    counted = Counter(calls).most_common()
+    matched = [ x[0] for x in counted if x[1]>=self.threshold ]
+    ## check for a tie -- and raise an exception if necessary
+    if len(matched) > 1: raise ambiguousConcordance('tie!')
+    elif not matched: raise discordant('no match')
+    else: return matched[0]
+    
+  def set_concordance(self):
+    '''
+    Return a set of concordant genotypes at a specified threshold.
+    '''
+    concordantGenotypes = dict()
+    for sample in self.samples:
+      calls = [ record.genotype(sample).gt_type for record in self.recordSet ]
+      if self.ignoreMissing:
+        calls = [ x for x in calls if x != None ] 
+      if not calls:
+        concordantGenotypes[sample] = './.'
+        continue
+      ## try to find a concordant genotype
+      try:
+        concordantGenotypes[sample] = self._genotype_concordance(calls)
+      except ambiguousConcordance:
+        concordantGenotypes[sample] = '**'
+      except discordant:
+        concordantGenotypes[sample] = '*'
+    return concordantGenotypes
 
-  def set_consensus(self):
-    '''
-    Return a consensus set of genotypes.
-      Discount missing data during the vote.
-    '''
-    consensusGenotypes = dict()
-    for sample in self.samples:
-      genotypes = [ record.genotype(sample) for record in self.recordSet ]
-      genotypes = [ x for x in genotypes if x.gt_type != None ]
-      ## handle the missing genotype data here
-      if not genotypes:
-        consensusGenotypes[sample] = './.'
-        continue
-      consensusGenotype = self.is_genotype_consensus(calls=genotypes)
-      if consensusGenotype: 
-        ## return the consensus genotype
-        consensusGenotypes[sample] = genotypes[0].gt_type
-      else:
-        ## no consensus so we return a '*' in its place
-        consensusGenotypes[sample] = '*'
-    return consensusGenotypes
-
-
-