Wiki

Clone wiki

BAM-matcher / generate_VCF

Generating variants file for BAM-matcher

Human data - hg19

For BAM-matcher to function most effectively, the VCF file should:

  1. refer to the same genomic positions in the genome reference fasta used for mapping BAM files to be compared
  2. conform to VCF standard specification.
  3. contain informative SNP loci. Indels are not currently supported, nor likely to be any time soon.

By informative loci, we mean that the SNPs should be variants where both alleles (variant and reference) are observed at similar frequency, therefore, most likely to observe differences between samples.

For the VCF files provided with BAM-matcher, we extracted the variants from 1000Genomes Project database. The current version of the 1000Genomes VCF file is compatible with human genome reference hg19/GRCh37 in Ensembl/Broad format (i.e. chromosome names do not contain "chr").

We extracted only variants which are:

  1. exonic: therefore, more likely to be covered in whole exome sequencing (WES) and RNA-sequencing data.
  2. have global minor allele frequency between 0.45 and 0.55: which should have the strongest discriminating power between different samples.
  3. found only in the autosomes and X chromosome, discarding Y chromosome, mitochondrial and minor contigs. This is to allow for compatibility with the UCSC version of hg19/GRCh37, which have sequence differences in those regions.

Once the variants were extracted, we randomly selected subsets of the variants containing desired number of variants. The only difference between the three provided VCF files is the number of variants included in them.


Generating custom VCFs for other (human or non-human) genomes

If you wish to build custom VCF file for human samples, the previously mentioned 1000Genomes Project data base is useful for hg19/GRCh37 assembly. However, the SNP data for GRCh38 has not yet been released, but you can extract similar informative sites using dbSNP data (which also provides global minor allele frequency information).

Steps for generating VCF files to use for BAM-matcher

1) Download VCF file from 1000Genomes Project

2) For WES (whole-exome-sequencing) or RNA-sequencing projects, it is useful to extract exonic variants. Exome capture kit providers generally provides a BED file of the capture regions, you can use this to extract exonic variants with bedtools, e.g.:

bedtools intersect -a variants.vcf -b regions.bed > exon_variants.vcf

3) Extract SNP variants only. You can do this with Picard tools SplitVcfs, e.g.:

java -jar picard.jar SplitVcfs \
    I=input.vcf \
    SNP_OUTPUT=snp.vcf \
    INDEL_OUTPUT=indel.vcf \
    STRICT=false

4) There is no simple tools to extract variants with specific allele frequencies that I am aware of, but you can do this with 'grep'. For example, using 1000Genomes 1KG_AF (global allele frequency):

grep "1KG_AF=0.5" 1KG_variants.vcf > extracted_variants.vcf

should extract all variants with global allele frequency from 0.5 to (less than) 0.6.

To extract a range (say, 0.45 - 0.55), you will need to grep for multiple values and combine the output.

You can similarly extract variants using the sub-population allele frequencies, for example:

grep "AFR_AF=0.4" 1KG_variants.vcf > extracted_variants.vcf
should extract variants with allele frequency between 0.4 and <0.5 in the African population.

5) To extract just a subset of the variants, you can use awk to extract every nth line, e.g.:

awk 'NR % 3 == 0' input.vcf > output.vcf
should extract every third line in the VCF file.

However, in steps 4) and 5) you should treat the header line separately, e.g. extract header lines first:

grep "#" 1KG_variants.vcf > vcf_header

And then append the header and variant lines together at the end after extraction:

cat vcf_header vcf_lines > final_extracted.vcf

Non-human genomes

For non-human genomes, it is important to:

  1. have a genome reference (used for mapping reads)
  2. use SNP database that provides some form of observed allele frequencies. For example: C. elegans SNP data, drosophila SNP databases, Salk Arabidopsis 1,001 Genomes.
  3. convert the SNP data to the genome reference used for mapping reads

It is important to select sites where observed frequencies of two alleles are similar. Using sites where the allele frequency of one allele is much greater than the other, the overall discrimination power would not be very high.

Updated