Wiki

Clone wiki

BAM-matcher / multi_ref

Comparing BAM files mapped to different (but compatible) genome references

Human genome: hg19/GRCh37

We originally designed BAM-matcher to be able to deal with the two compatible versions of the human genome reference assembly hg19/GRCh37:

  • Ensembl/Broad - chromosome names do not contain "chr", e.g. 1, 2, 3, ... 22, X, Y, MT
  • UCSC - chromosome names contain "chr", e.g.: chr1, chr2, ... chr22, chrX, chrY, chrM

Aside from the chromosome names, the sequences are mostly compatible, except for the mitochondrial chromosome.

In the simple case where there is only one reference genome, and both BAM files were mapped to the same reference, and the variants VCF file also refers to this version, then BAM-matcher is quite straight-forward. Just add the path to the reference FASTA file in the configuration file:

REFERENCE: /path/to/genome/reference/genome.fasta
and make sure that the reference file has been indexed by samtools faidx:

samtools faidx genome.fasta
which should generate a .fai file.

However, if you have two BAM files, and one is mapped to the Broad hg19, and the other to UCSC, BAM-matcher is also able to compare between these two BAM files. However, it requires:

1) Path to both reference fasta files (indexed) in the configuration file:

REFERENCE:     /path/to/genome/reference/Broad.hg19.fasta
REF_ALTERNATE: /path/to/genome/reference/UCSC.hg19.fasta

The default version (REFERENCE) should be the version to which the VCF file is compatible. For example, if the above config setting is used, and the VCF file is to the UCSC version, then BAM-matcher will either fail or you will not have any locus compared.

The VCF files provided with BAM-matcher are extracted from 1000Genomes Project VCF file, which uses the Broad/Ensembl hg19, so to use the provided VCF, this should be the version specified for REFERENCE.

You should not use REF_ALTERNATE without REFERENCE already specified.

2) You will also need a chromosome map, which tells BAM-matcher what are the matching chromosome names in the two genome references. This can be specified in the configuration file:

CHROM_MAP:  /path/to/hg19.chromosome_map

An hg19 chromosome map ("hg19.chromosome_map") is provided with BAM-matcher, which contains the following:

DEFAULT ALTERNATE
1   chr1
2   chr2
3   chr3
4   chr4
5   chr5
6   chr6
7   chr7
8   chr8
9   chr9
10  chr10
11  chr11
12  chr12
13  chr13
14  chr14
15  chr15
16  chr16
17  chr17
18  chr18
19  chr19
20  chr20
21  chr21
22  chr22
X   chrX
Y   chrY

The DEFAULT column should contain the chromosome names in the default reference (REFERENCE), and the ALTERNATE contains the matching names in the alternate reference (REF_ALTERNATE).

MT/chrM is not included in the file because the sequences in the two versions are not compatible, so conversion of a variant genomic position from one to the other requires more than just a chromosome renaming.

Once the genome reference files have been specified in the configuration file, BAM-matcher will automatically check the chromosome names in the BAM files against the names in the REFERENCE/REF_ALTERNATE files and the chromosome map, and determine which is the correct reference for the BAM file.

So as long as the BAM files have been specified, you can just run a comparison as normal:

bam-matcher.py -B1 sample_to_Broad.bam -B2 sample_to_UCSC.bam

The cached genotype data is always stored in the REFERENCE format.


Overriding configuration

If you wish to use a different reference file without altering the configuration file, you can always specify them in the run time arguments (--reference/-R):

bam-matcher.py -B1 sample1.bam -B2.sample2.bam -R different_reference.fasta
And similarly for the alternate reference (--ref-alternate/-R2) and chromosome map (--chromosome-map/-M).


Working with non-human genome

For non-human genomes, it is important to:

  1. have a genome reference (used for mapping reads)
  2. use SNP database that provides some form of observed allele frequencies. For example: C. elegans SNP data, drosophila SNP databases, Salk Arabidopsis 1,001 Genomes.
  3. convert the SNP data to the genome reference used for mapping reads

The major challenge with working with non-human genomes is most likely the lack of population-wide SNPs and associated allele frequency.

It is important to select informative sites where observed frequencies of two (or more) alleles are similar. Using sites where the allele frequency of one allele is much greater than the other, the overall discrimination power would not be very high.

Non-human genome example: Mus musculus

1) First you should get a copy of the genome reference. This should be the same version of genome reference assembly used to map the sequenced read data (input BAM files).

2) Download mouse SNP data, e.g. from Sanger Mouse Genomes Project, or dbSNP.

The SNP data would require filtering. For identifying individual mouse, you will need to rely on data from studies on genomic variation of the population within the same strain/species (e.g. Keane et al. (2011), Nature 477(7364):289-294, DOI: 10.1038/nature10413). Alternatively, you can select variants which are commonly divergent between different strains (using the strain data provided in the Sanger Mouse Genomes Project.

You also need to make sure that SNP data (VCF file) matches the version of genome reference used for mapping sequence data.

3) Once you have the genome reference fasta file and the relevant variants data (VCF), just provide these in a BAM-matcher configuration file, you should then be able to use BAM-matcher on mouse samples.

Updated