Tutorial: Testing BAM-matcher installation using example data

How to get example data

If you installed via the git method, then you should find a sub-directory "test_data" under your BAM-matcher directory which contains some BAM files and VCF file.

If these files are missing, or if you installed via pip, you can download these files by:

visit: https://bitbucket.org/sacgf/bam-matcher/src
click on "test_data"
click on each file to download.

Example Data

In the directory, you should see these files:

merged.RG.bam
merged.RG.bam.bai
sample1.bam
sample1.bam.bai
sample2.bam
sample2.bam.bai
test.vcf
expected_outcomes

Genome Reference

These BAM files have all been mapped to human genome reference hg19/GRCh37 assembly. The FASTA file can be downloaded from either Broad Institute (as part of the GATK Resource Bundle), or from [Ensembl] (ftp://ftp.ensembl.org/pub/release-75//fasta/homo_sapiens/dna/). This is the version of hg19 that does not contain "chr" in the chromosome names.

To use the example BAM files, you should have a copy of the hg19 genome, and enter the full path to the fasta file in the configuration file (REFERENCE). Alternatively, you can also specify the path to the fasta file during runtime, using the --reference argument.

Variants file (VCF)

Three VCF files have been provided with BAM-matcher:

1kg.exome.highAF.1511.vcf
1kg.exome.highAF.3680.vcf
1kg.exome.highAF.7550.vcf

These are all extracted from 1000Genomes Project database. The variants are all exonic and have global minor allele frequency between 0.45 and 0.55. The number in the file names (1511, 3680, 7550) indicates the number of variants contained in each.

Testing BAM-matcher using the example data

Assuming that BAM-matcher has been installed correct, we can now test the installation using the example data and also demonstrate the various features in BAM-matcher.

Using Freebayes

The configuration file should have these parameters set as follows:

caller:    freebayes

VCF_file:  /full/path/to/1kg.exome.highAF.1511.vcf

fast_freebayes: True

REFERENCE: /full/path/to/hg19.fasta

CACHE_DIR: /full/path/to/cache_directory

If "freebayes" is not a system-recognised command, you may need to specify the full path to the freebayes executable binary.

Then to test the example data:

# navigate to directory containing the example data
cd /path/to/example/data/directory

# run bam-matcher 
bam-matcher.py -B1 sample1.bam -B2 sample2.bam

You should then see the output:

bam1:   sample1.bam
bam2:   sample2.bam
variants:   1kg.exome.highAF.1511.vcf
depth threshold: 15
________________________________________

Positions with same genotype:   162
     breakdown:    hom: 71
                   het: 91
________________________________________

Positions with diff genotype:   108
     breakdown:
                       BAM 1
               | het  | hom  | subset
        -------+------+------+-------
         het   |    1 |    0 |   31 |
        -------+------+------+-------
BAM 2    hom   |    0 |   21 |   -  |
        -------+------+------+-------
         subset|   55 |   -  |   -  |
________________________________________

Total sites compared: 270
Fraction of common: 0.600000 (162/270)
________________________________________
CONCLUSION:
BAM FILES ARE FROM DIFFERENT SOURCES

The exact number of sites reports in each category may differ from the above if different version of Freebayes was used. The results above was generated using Freebayes v1.0.1-1-g683b3cc.

Common errors

1) bam-matcher.py: command not found

The BAM-matcher executable ('bam-matcher.py') was not in the PATH variable. See installation guide on how to do this.

2)

+--------------+
| CONFIG ERROR |
+--------------+
Cannot access config file (/path/to/install/bam-matcher/bam-matcher.conf).
It either does not exist or is not readable.

The default configuration file is missing or has not been generated. If you have generated a configuration file, but it is not in the default path/name, then you can specify at run time using --config/-c

bam-matcher.py --config /path/to/config_file -B1 sample1.bam -B2 sample2.bam

3)

+-----------------------+
| VARIANT CALLING ERROR |
+-----------------------+
Caller check failed.

Command tested:
freebayes --version

Python error msg:
[Errno 2] No such file or directory

Please check the caller command or path to the binary.

The caller check failed. Check your configuration file for Freebayes command.

If you encounter more problems, please see the Troubleshooting guide.

Output

As no output file was specified at run time, BAM-matcher will write an output file:

bam_matcher_report.sample1_sample2_########

in the same directory from which the command was generated, where ######## is randomly generated string.

1) To save the output file to a specific location, use --output/-o, e.g.

bam-matcher.py -B1 sample1.bam -B2 sample2.bam  -o bam-matcher.output

2) If you do not want to write an output file at all, using the --no-report/-n option.

3) If you want to generate a short-form output, use --short-output/-so. This will generate a tab-separated values (TSV) file instead.

3) To generate an HTML output, use --html/-H.

Cached data and other parameters

If you tried running the same command again, you may have noticed that BAM-matcher takes much less time to report the results. This is because it stores the genotype data of previous called samples.

However, the cached data for each BAM file is specific to:

BAM file path
VCF file path
threshold depth
genome reference path
NUMBER_OF_SNPs parameter
BAM file timestamp
BAM file header

Therefore, if any of these are altered, BAM-matcher will recalculate the genotype data. However, the specific caller used to generate the genotype data does not matter. So even if you change the caller, it will still use the cached data rather than recalculating with a different caller.

For example, if you run the following:

bam-matcher.py -B1 sample1.bam -B2 sample2.bam --caller gatk

The --caller/-CL argument overrides the caller setting in the configuration file, and in this case it tells BAM-matcher to use GATK instead of Freebayes, the default set in the configuration.

However, the command above should give you exactly the same result, even if you don't have GATK installed/configured. This is because it's using the cached genotype data, previously calculated by Freebayes.

If you need to recalculate the results using GATK and not use the cached data, use the --recalculate/-RC option:

bam-matcher.py -B1 sample1.bam -B2 sample2.bam --caller gatk --recalculate

If you have GATK configured as well, then it should give you this output (calculated using GATK 3.5-0-g36282e4):

bam1:   sample1.bam
bam2:   sample2.bam
variants:   1kg.exome.highAF.1511.vcf
depth threshold: 15
________________________________________

Positions with same genotype:   170
     breakdown:    hom: 73
                   het: 97
________________________________________

Positions with diff genotype:   112
     breakdown:
                       BAM 1
               | het  | hom  | subset
        -------+------+------+-------
         het   |    0 |    0 |   32 |
        -------+------+------+-------
BAM 2    hom   |    0 |   22 |   -  |
        -------+------+------+-------
         subset|   58 |   -  |   -  |
________________________________________

Total sites compared: 282
Fraction of common: 0.602837 (170/282)
________________________________________
CONCLUSION:
LIKELY FROM DIFFERENT SOURCES

Changing threshold depth

If you alter the threshold depth, it should also prompt a recalculation of the genotype data:

bam-matcher.py -B1 sample1.bam -B2 sample2.bam -DP 8

The --dp-threshold/-DP option overrides the DP_threshold value in the configuration file. This command should generate this output (back to Freebayes as the default caller):

bam1:   sample1.bam
bam2:   sample2.bam
variants:   1kg.exome.highAF.1511.vcf
depth threshold: 8
________________________________________

Positions with same genotype:   128
     breakdown:    hom: 30
                   het: 98
________________________________________

Positions with diff genotype:   43
     breakdown:
                       BAM 1
               | het  | hom  | subset
        -------+------+------+-------
         het   |    0 |    0 |   18 |
        -------+------+------+-------
BAM 2    hom   |    0 |    0 |   -  |
        -------+------+------+-------
         subset|   25 |   -  |   -  |
________________________________________

Total sites compared: 171
Fraction of common: 0.748538 (128/171)
________________________________________
CONCLUSION:
LIKELY FROM DIFFERENT SOURCES

Compared to the original Freebayes result, there are more sites compared as the threshold depth is lower. However, using lower threshold depth will lead to inclusion of less reliable genotype calls (sites with fewer reads).

Changing VCF file

The run time argument --vcf/-V can be used to change the variant list used for comparison. For example, we can switch to another of the provided VCF files by:

bam-matcher.py -B1 sample1.bam -B2 sample2.bam --vcf /path/to/1KG.exome.highAF.7550.AF

which should have this result:

bam1:   sample1.bam
bam2:   sample2.bam
variants:   1kg.exome.highAF.7550.vcf
depth threshold: 15
________________________________________

Positions with same genotype:   139
     breakdown:    hom: 57
                   het: 82
________________________________________

Positions with diff genotype:   77
     breakdown:
                       BAM 1
               | het  | hom  | subset
        -------+------+------+-------
         het   |    1 |    0 |   21 |
        -------+------+------+-------
BAM 2    hom   |    0 |   12 |   -  |
        -------+------+------+-------
         subset|   43 |   -  |   -  |
________________________________________

Total sites compared: 216
Fraction of common: 0.643519 (139/216)
________________________________________
CONCLUSION:
LIKELY FROM DIFFERENT SOURCES

As the input VCF file has changed, BAM-matcher will not use the cached data and will recalculate.

In general, comparing BAM files using VCF with more variants should lead to more sites being compared and thus better judgement for whether the two files are from the same source.

However, in this case, there are actually fewer sites compared when switching to a VCF file with more variants. This is because the example data are artificially constructed around the sites in the first VCF file, and not all sites overlap with the second VCF file.

Debug and Verbose modes

The debug (--debug/-d) and verbose (--verbose/-v) modes are useful for troubleshooting in general.

Enabling the debug mode will basically keep all the temporary files in the scratch directory. For example, running:

bam-matcher.py -B1 sample1.bam -B2 sample2.bam --debug

should give you the same output, but also have this message at the end:

##DEBUG##
Temporary files were written to: /tmp/############

where ############ is a randomly generated string. Normally, this scratch directory is deleted at the end of a run, but not in the debug mode.

The location of the scratch directory can be specified run time with (--scratch-dir/-s), but if not specified, it will use a subdirectory under /tmp.

The verbose mode is enabled by --verbose/-v, it will simply provide more detailed report on the data processing.

Using GATK

If you would like to set GATK as your default genotype caller, then you should have these parameters set in the configuration file:

caller:    gatk
GATK:      /full/path/to/GenomeAnalysisTK.jar
java:      java
GATK_MEM:  4
GATK_nt:   1

VCF_file:  /full/path/to/1kg.exome.highAF.1511.vcf

REFERENCE: /full/path/to/hg19.fasta

CACHE_DIR: /full/path/to/cache_directory

The value for "java" parameter should be the command to invoke Java VM.

You should check that your java VM version is compatible with the version of GATK you are using.

If this is configured correctly, then just run (from the directory containing the test data):

bam-matcher.py -B1 sample1.bam -B2 sample2.bam

which should give you this output (calculated using GATK 3.5-0-g36282e4):

bam1:   sample1.bam
bam2:   sample2.bam
variants:   1kg.exome.highAF.1511.vcf
depth threshold: 15
________________________________________

Positions with same genotype:   170
     breakdown:    hom: 73
                   het: 97
________________________________________

Positions with diff genotype:   112
     breakdown:
                       BAM 1
               | het  | hom  | subset
        -------+------+------+-------
         het   |    0 |    0 |   32 |
        -------+------+------+-------
BAM 2    hom   |    0 |   22 |   -  |
        -------+------+------+-------
         subset|   58 |   -  |   -  |
________________________________________

Total sites compared: 282
Fraction of common: 0.602837 (170/282)
________________________________________
CONCLUSION:
LIKELY FROM DIFFERENT SOURCES

Remember if you have calculated this previously using Freebayes, it will just report the cached data, unless you specify --recalculate/-RC.

Using VarScan

Using VarScan should have these parameters set in the configuration file:

caller:    varscan
samtools:  samtools
varscan:   /full/path/to/VarScan.jar
java:      java

VARSCAN_MEM: 4

VCF_file:  /full/path/to/1kg.exome.highAF.1511.vcf

REFERENCE: /full/path/to/hg19.fasta

CACHE_DIR: /full/path/to/cache_directory

VarScan is the slowest of the three callers, as it needs samtools to generate pileup data first. However, it is likely also the most robust method.

Running BAM-matcher with VarScan and settings above should generate this output (VarScan v2.3.9):

bam1:   sample1.bam
bam2:   sample2.bam
variants:   1kg.exome.highAF.1511.vcf
depth threshold: 15
________________________________________

Positions with same genotype:   158
     breakdown:    hom: 71
                   het: 87
________________________________________

Positions with diff genotype:   107
     breakdown:
                       BAM 1
               | het  | hom  | subset
        -------+------+------+-------
         het   |    0 |    0 |   30 |
        -------+------+------+-------
BAM 2    hom   |    0 |   24 |   -  |
        -------+------+------+-------
         subset|   53 |   -  |   -  |
________________________________________

Total sites compared: 265
Fraction of common: 0.596226 (158/265)
________________________________________
CONCLUSION:
BAM FILES ARE FROM DIFFERENT SOURCES

Wiki

BAM-matcher / Usage