Files changed (6)
-This is an initial implementation of a two stage voting scheme among variant calling algorithms. Given a set of VCF files produced by various algorithms, sites are selected if they are seen among all callers. Genotypes among these sites are then selected as those that match among all callers. Currently, a user can input any number of sorted VCF files, and a strict consensus of variant sites and genotypes will be generated.
+This is an implementation of an ensemble variant calling method. Specifically, it takes VCF files generated by various calling algorithms and merges them according to specified thresholds on variant and genotype concordance. The resulting VCF can range from a strict consensus among inputs, to a union of all possible observations.
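As a rough illustration of how a site-level threshold moves the output between a strict consensus and a union, here is a minimal Python 3 sketch. The function `select_sites` and the `(chrom, pos, ref, alt)` site keys are hypothetical and are not part of this tool's code; the sketch only counts how many call sets contain each site.

```python
from collections import Counter


def select_sites(call_sets, site_threshold):
    """Keep sites reported by at least `site_threshold` call sets.

    `call_sets` is a list with one iterable of site keys per caller.
    This is an illustrative helper, not the tool's implementation.
    """
    # set() ensures each caller votes at most once per site.
    counts = Counter(site for sites in call_sets for site in set(sites))
    return sorted(site for site, n in counts.items() if n >= site_threshold)


caller_a = [("1", 100, "A", "G"), ("1", 200, "C", "T")]
caller_b = [("1", 100, "A", "G"), ("1", 300, "G", "A")]
caller_c = [("1", 100, "A", "G"), ("1", 200, "C", "T")]
call_sets = [caller_a, caller_b, caller_c]

strict = select_sites(call_sets, site_threshold=3)    # seen by all callers
majority = select_sites(call_sets, site_threshold=2)  # seen by at least two
union = select_sites(call_sets, site_threshold=1)     # seen by any caller
```

With three inputs, a threshold of 3 gives the intersection and a threshold of 1 gives the union; intermediate values give a majority-style vote.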
-Any VCF can be used as long as it can be parsed by [James Casbon's pyVCF module](https://github.com/jamescasbon/PyVCF).
-Will take the three test files in the data directory and generate a strict consensus of sites and genotypes (i.e. 3/3 files containt the variant site, and 3/3 files agree on the genotype for a sampple at that site).
+Will take the three test files in the data directory and generate a strict consensus of sites and genotypes (i.e. 3/3 files contain the variant site, and 3/3 files agree on the genotype for a sample at that site).
* Multi-sample VCF files are currently supported, and the output will contain only samples which are found in all input files.
* Files must be sorted by physical position. This can be achieved using any VCF utility such as [vcf-sort in vcftools](http://vcftools.sourceforge.net/perl_module.html#vcf-sort). The caller works by iterating simultaneously across all input files until a matching variant record is found. If a VCF file is not sorted similarly, it is unlikely that any overlapping sites will be found.
-* Missing data on the genotype level is ignored if actual genotypes are available in other VCF files. Missing data is produced only if all sites are missing, or if genotypes do not agree among all call sets.
+* VCF files must be indexed with [tabix](http://samtools.sourceforge.net/tabix.shtml). This also requires that they be zipped with bgzip.
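The note on sort order can be illustrated with a small Python 3 sketch of simultaneous iteration over sorted inputs. It is a simplified stand-in rather than the caller's actual loop: `shared_positions` is a hypothetical name, positions are plain integers instead of VCF records, and only the strict case (a position present in every input) is handled.

```python
def shared_positions(*sorted_inputs):
    """Yield positions present in every input, assuming each is sorted ascending."""
    iters = [iter(positions) for positions in sorted_inputs]
    if not iters:
        return
    try:
        current = [next(it) for it in iters]
        while True:
            largest = max(current)
            if all(pos == largest for pos in current):
                # Every input is at the same position: a shared site.
                yield largest
                current = [next(it) for it in iters]
            else:
                # Advance only the inputs that are behind the largest position.
                current = [
                    pos if pos == largest else next(it)
                    for pos, it in zip(current, iters)
                ]
    except StopIteration:
        # One input is exhausted, so no further shared positions exist.
        return


shared = list(shared_positions([1, 3, 5, 7], [3, 4, 7, 9], [0, 3, 7]))
# An input that is not sorted the same way causes shared positions to be missed.
missed = list(shared_positions([5, 3], [3, 5]))
```

Because lagging inputs are only ever advanced forward, a position that appears out of order in one file is skipped and never revisited, which is why differently sorted files produce few or no overlapping sites.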
-* Outputting variant sites which are discordant between callers. This is potentially interesting variation.
-* The ability to specify concordance thresholds on the site and genotype level. This could be particularly helpful if one set of variants is markedly different from others, or if one is interested in finding the union of call sets rather than an intersection.
-* The ability to preserve information from input VCF files. I'm thinking that it would help to specify this information in a high level configuration file. This would allow you to do things like propagate QUAL scores and compute with them downstream.
+* The ability to preserve information from input VCF files, perhaps by specifying this information in a high-level configuration file. This would allow operations such as propagating QUAL scores and analyzing them downstream.
+ <param name="site_threshold" type="integer" value="0" label="Concordance threshold for variant sites.">
+ <param name="geno_threshold" type="integer" value="0" label="Concordance threshold for genotypes.">
+ <param name="ignore_missing" type="boolean" truevalue="--ignore-missing" falsevalue="" label="Ignore missing genotypes during vote.">
- parser = arg.ArgumentParser(description='Find sites and genotypes which aggree among an arbitrary number of VCF files.')
+ parser = arg.ArgumentParser(description='Find sites and genotypes which agree among an arbitrary number of VCF files.')
+ parser.add_argument('--site-threshold', '-s', dest='siteThresh', action='store', type=int, help='Number of inputs which must agree for a site to be included in the output.')
+ parser.add_argument('--genotype-threshold', '-g', dest='genoThresh', action='store', type=int, help='Number of inputs which must agree for a genotype to be marked as non-missing.')
+ parser.add_argument('--ignore-missing', '-m', dest='ignoreMissing', action='store_true', help='Flag specifying how to handle missing genotypes in the vote. If present, missing genotypes are excluded from the genotype concordance vote unless all genotypes are missing.')
- ## TODO:: there should be a standard and transparent way to propagate information for individual VCF files to the consensus stage.
- #outVcf.add_format(id="PL", number="G", type="Integer", description="GATK's normalized, Phred-scaled likelihoods for genotypes as defined in their VCF spec")
+ outVcf.add_format(id="CN", number="1", type="Character", description="Consensus status. \'C\' is concordant, \'D\' is discordant, and \'A\' is ambiguous (i.e. no majority at the given genotype threshold).")
- #contigs = ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrMT']
- contigs = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22']
+ for records, genotypes in ensemble.concordant_variants(siteThresh=args.siteThresh, genoThresh=args.genoThresh):
self.records = [ x if i in self.primeIndices else self.readers[i].next() for i,x in enumerate(self.records) ]
+ Find the call which agrees at a certain threshold. If a tie is observed, an exception is raised, and a missing value will be written to the VCF file and flagged as ambiguous.
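A threshold vote of this kind might look like the following Python 3 sketch. It is an assumption-laden illustration, not the code in this change: `vote_genotype` is a hypothetical name, missing calls are represented by `None`, ambiguity is returned as a status rather than raised as an exception, and the split between 'C' (all voting calls agree) and 'D' (a winner reaches the threshold but some calls differ) is one possible reading of the CN codes.

```python
from collections import Counter

MISSING = None  # stand-in for a missing genotype call such as "./."


def vote_genotype(calls, threshold, ignore_missing=False):
    """Return (genotype, status) for one sample at one site.

    status: 'C' concordant, 'D' discordant, 'A' ambiguous.
    """
    votes = list(calls)
    if ignore_missing:
        observed = [call for call in votes if call is not MISSING]
        # Missing calls are dropped unless every call is missing.
        if observed:
            votes = observed
    if not votes:
        return MISSING, "A"

    ranked = Counter(votes).most_common()
    winner, count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == count
    if count < threshold or tied:
        # No single call reaches the threshold: write a missing value.
        return MISSING, "A"
    status = "C" if count == len(votes) else "D"
    return winner, status


unanimous = vote_genotype(["0/1", "0/1", "0/1"], threshold=3)
majority = vote_genotype(["0/1", "0/1", "1/1"], threshold=2)
no_majority = vote_genotype(["0/1", "0/1", "1/1"], threshold=3)
tie = vote_genotype(["0/1", "1/1"], threshold=1)
sparse = vote_genotype(["0/1", None, None], threshold=1, ignore_missing=True)
```

In the last call, the two missing genotypes are excluded from the vote, so the single observed genotype wins; without `ignore_missing`, the missing value would hold the majority.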