Wiki
Clone wikiBAM-matcher / Usage
Tutorial: Testing BAM-matcher installation using example data
How to get example data
If you installed via the git method, then you should find a sub-directory "test_data" under your BAM-matcher directory which contains some BAM files and VCF file.
If these files are missing, or if you installed via pip, you can download these files by:
- visit: https://bitbucket.org/sacgf/bam-matcher/src
- click on "test_data"
- click on each file to download.
Example Data
In the directory, you should see these files:
merged.RG.bam merged.RG.bam.bai sample1.bam sample1.bam.bai sample2.bam sample2.bam.bai test.vcf expected_outcomes
Genome Reference
These BAM files have all been mapped to human genome reference hg19/GRCh37 assembly. The FASTA file can be downloaded from either Broad Institute (as part of the GATK Resource Bundle), or from [Ensembl] (ftp://ftp.ensembl.org/pub/release-75//fasta/homo_sapiens/dna/). This is the version of hg19 that does not contain "chr" in the chromosome names.
To use the example BAM files, you should have a copy of the hg19 genome, and enter the full path to the fasta file in the configuration file (REFERENCE). Alternatively, you can also specify the path to the fasta file during runtime, using the --reference argument.
Variants file (VCF)
Three VCF files have been provided with BAM-matcher:
1kg.exome.highAF.1511.vcf 1kg.exome.highAF.3680.vcf 1kg.exome.highAF.7550.vcf
These are all extracted from 1000Genomes Project database. The variants are all exonic and have global minor allele frequency between 0.45 and 0.55. The number in the file names (1511, 3680, 7550) indicates the number of variants contained in each.
Testing BAM-matcher using the example data
Assuming that BAM-matcher has been installed correct, we can now test the installation using the example data and also demonstrate the various features in BAM-matcher.
Using Freebayes
The configuration file should have these parameters set as follows:
caller: freebayes VCF_file: /full/path/to/1kg.exome.highAF.1511.vcf fast_freebayes: True REFERENCE: /full/path/to/hg19.fasta CACHE_DIR: /full/path/to/cache_directory
If "freebayes" is not a system-recognised command, you may need to specify the full path to the freebayes executable binary.
Then to test the example data:
# navigate to directory containing the example data cd /path/to/example/data/directory # run bam-matcher bam-matcher.py -B1 sample1.bam -B2 sample2.bam
bam1: sample1.bam bam2: sample2.bam variants: 1kg.exome.highAF.1511.vcf depth threshold: 15 ________________________________________ Positions with same genotype: 162 breakdown: hom: 71 het: 91 ________________________________________ Positions with diff genotype: 108 breakdown: BAM 1 | het | hom | subset -------+------+------+------- het | 1 | 0 | 31 | -------+------+------+------- BAM 2 hom | 0 | 21 | - | -------+------+------+------- subset| 55 | - | - | ________________________________________ Total sites compared: 270 Fraction of common: 0.600000 (162/270) ________________________________________ CONCLUSION: BAM FILES ARE FROM DIFFERENT SOURCES
The exact number of sites reports in each category may differ from the above if different version of Freebayes was used. The results above was generated using Freebayes v1.0.1-1-g683b3cc.
Common errors
1) bam-matcher.py: command not found
The BAM-matcher executable ('bam-matcher.py') was not in the PATH variable. See installation guide on how to do this.
2)
+--------------+ | CONFIG ERROR | +--------------+ Cannot access config file (/path/to/install/bam-matcher/bam-matcher.conf). It either does not exist or is not readable.
The default configuration file is missing or has not been generated. If you have generated a configuration file, but it is not in the default path/name, then you can specify at run time using --config/-c
bam-matcher.py --config /path/to/config_file -B1 sample1.bam -B2 sample2.bam
3)
+-----------------------+ | VARIANT CALLING ERROR | +-----------------------+ Caller check failed. Command tested: freebayes --version Python error msg: [Errno 2] No such file or directory Please check the caller command or path to the binary.
If you encounter more problems, please see the Troubleshooting guide.
Output
As no output file was specified at run time, BAM-matcher will write an output file:
bam_matcher_report.sample1_sample2_########
in the same directory from which the command was generated, where ######## is randomly generated string.
1) To save the output file to a specific location, use --output/-o, e.g.
bam-matcher.py -B1 sample1.bam -B2 sample2.bam -o bam-matcher.output
2) If you do not want to write an output file at all, using the --no-report/-n option.
3) If you want to generate a short-form output, use --short-output/-so. This will generate a tab-separated values (TSV) file instead.
3) To generate an HTML output, use --html/-H.
Cached data and other parameters
If you tried running the same command again, you may have noticed that BAM-matcher takes much less time to report the results. This is because it stores the genotype data of previous called samples.
However, the cached data for each BAM file is specific to:
- BAM file path
- VCF file path
- threshold depth
- genome reference path
- NUMBER_OF_SNPs parameter
- BAM file timestamp
- BAM file header
Therefore, if any of these are altered, BAM-matcher will recalculate the genotype data. However, the specific caller used to generate the genotype data does not matter. So even if you change the caller, it will still use the cached data rather than recalculating with a different caller.
For example, if you run the following:
bam-matcher.py -B1 sample1.bam -B2 sample2.bam --caller gatk
The --caller/-CL
argument overrides the caller setting in the configuration file, and in this case it tells BAM-matcher to use GATK instead of Freebayes, the default set in the configuration.
However, the command above should give you exactly the same result, even if you don't have GATK installed/configured. This is because it's using the cached genotype data, previously calculated by Freebayes.
If you need to recalculate the results using GATK and not use the cached data, use the --recalculate/-RC option:
bam-matcher.py -B1 sample1.bam -B2 sample2.bam --caller gatk --recalculate
If you have GATK configured as well, then it should give you this output (calculated using GATK 3.5-0-g36282e4):
bam1: sample1.bam bam2: sample2.bam variants: 1kg.exome.highAF.1511.vcf depth threshold: 15 ________________________________________ Positions with same genotype: 170 breakdown: hom: 73 het: 97 ________________________________________ Positions with diff genotype: 112 breakdown: BAM 1 | het | hom | subset -------+------+------+------- het | 0 | 0 | 32 | -------+------+------+------- BAM 2 hom | 0 | 22 | - | -------+------+------+------- subset| 58 | - | - | ________________________________________ Total sites compared: 282 Fraction of common: 0.602837 (170/282) ________________________________________ CONCLUSION: LIKELY FROM DIFFERENT SOURCES
Changing threshold depth
If you alter the threshold depth, it should also prompt a recalculation of the genotype data:
bam-matcher.py -B1 sample1.bam -B2 sample2.bam -DP 8
The --dp-threshold/-DP option overrides the DP_threshold value in the configuration file. This command should generate this output (back to Freebayes as the default caller):
bam1: sample1.bam bam2: sample2.bam variants: 1kg.exome.highAF.1511.vcf depth threshold: 8 ________________________________________ Positions with same genotype: 128 breakdown: hom: 30 het: 98 ________________________________________ Positions with diff genotype: 43 breakdown: BAM 1 | het | hom | subset -------+------+------+------- het | 0 | 0 | 18 | -------+------+------+------- BAM 2 hom | 0 | 0 | - | -------+------+------+------- subset| 25 | - | - | ________________________________________ Total sites compared: 171 Fraction of common: 0.748538 (128/171) ________________________________________ CONCLUSION: LIKELY FROM DIFFERENT SOURCES
Compared to the original Freebayes result, there are more sites compared as the threshold depth is lower. However, using lower threshold depth will lead to inclusion of less reliable genotype calls (sites with fewer reads).
Changing VCF file
The run time argument --vcf/-V can be used to change the variant list used for comparison. For example, we can switch to another of the provided VCF files by:
bam-matcher.py -B1 sample1.bam -B2 sample2.bam --vcf /path/to/1KG.exome.highAF.7550.AF
which should have this result:
bam1: sample1.bam bam2: sample2.bam variants: 1kg.exome.highAF.7550.vcf depth threshold: 15 ________________________________________ Positions with same genotype: 139 breakdown: hom: 57 het: 82 ________________________________________ Positions with diff genotype: 77 breakdown: BAM 1 | het | hom | subset -------+------+------+------- het | 1 | 0 | 21 | -------+------+------+------- BAM 2 hom | 0 | 12 | - | -------+------+------+------- subset| 43 | - | - | ________________________________________ Total sites compared: 216 Fraction of common: 0.643519 (139/216) ________________________________________ CONCLUSION: LIKELY FROM DIFFERENT SOURCES
As the input VCF file has changed, BAM-matcher will not use the cached data and will recalculate.
In general, comparing BAM files using VCF with more variants should lead to more sites being compared and thus better judgement for whether the two files are from the same source.
However, in this case, there are actually fewer sites compared when switching to a VCF file with more variants. This is because the example data are artificially constructed around the sites in the first VCF file, and not all sites overlap with the second VCF file.
Debug and Verbose modes
The debug (--debug/-d) and verbose (--verbose/-v) modes are useful for troubleshooting in general.
Enabling the debug mode will basically keep all the temporary files in the scratch directory. For example, running:
bam-matcher.py -B1 sample1.bam -B2 sample2.bam --debug
should give you the same output, but also have this message at the end:
##DEBUG## Temporary files were written to: /tmp/############
where ############ is a randomly generated string. Normally, this scratch directory is deleted at the end of a run, but not in the debug mode.
The location of the scratch directory can be specified run time with (--scratch-dir/-s), but if not specified, it will use a subdirectory under /tmp.
The verbose mode is enabled by --verbose/-v, it will simply provide more detailed report on the data processing.
Using GATK
If you would like to set GATK as your default genotype caller, then you should have these parameters set in the configuration file:
caller: gatk GATK: /full/path/to/GenomeAnalysisTK.jar java: java GATK_MEM: 4 GATK_nt: 1 VCF_file: /full/path/to/1kg.exome.highAF.1511.vcf REFERENCE: /full/path/to/hg19.fasta CACHE_DIR: /full/path/to/cache_directory
The value for "java" parameter should be the command to invoke Java VM.
You should check that your java VM version is compatible with the version of GATK you are using.
If this is configured correctly, then just run (from the directory containing the test data):
bam-matcher.py -B1 sample1.bam -B2 sample2.bam
which should give you this output (calculated using GATK 3.5-0-g36282e4):
bam1: sample1.bam bam2: sample2.bam variants: 1kg.exome.highAF.1511.vcf depth threshold: 15 ________________________________________ Positions with same genotype: 170 breakdown: hom: 73 het: 97 ________________________________________ Positions with diff genotype: 112 breakdown: BAM 1 | het | hom | subset -------+------+------+------- het | 0 | 0 | 32 | -------+------+------+------- BAM 2 hom | 0 | 22 | - | -------+------+------+------- subset| 58 | - | - | ________________________________________ Total sites compared: 282 Fraction of common: 0.602837 (170/282) ________________________________________ CONCLUSION: LIKELY FROM DIFFERENT SOURCES
Remember if you have calculated this previously using Freebayes, it will just report the cached data, unless you specify --recalculate/-RC.
Using VarScan
Using VarScan should have these parameters set in the configuration file:
caller: varscan samtools: samtools varscan: /full/path/to/VarScan.jar java: java VARSCAN_MEM: 4 VCF_file: /full/path/to/1kg.exome.highAF.1511.vcf REFERENCE: /full/path/to/hg19.fasta CACHE_DIR: /full/path/to/cache_directory
VarScan is the slowest of the three callers, as it needs samtools to generate pileup data first. However, it is likely also the most robust method.
Running BAM-matcher with VarScan and settings above should generate this output (VarScan v2.3.9):
bam1: sample1.bam bam2: sample2.bam variants: 1kg.exome.highAF.1511.vcf depth threshold: 15 ________________________________________ Positions with same genotype: 158 breakdown: hom: 71 het: 87 ________________________________________ Positions with diff genotype: 107 breakdown: BAM 1 | het | hom | subset -------+------+------+------- het | 0 | 0 | 30 | -------+------+------+------- BAM 2 hom | 0 | 24 | - | -------+------+------+------- subset| 53 | - | - | ________________________________________ Total sites compared: 265 Fraction of common: 0.596226 (158/265) ________________________________________ CONCLUSION: BAM FILES ARE FROM DIFFERENT SOURCES
Updated