Setting up the configuration file for BAM-matcher

BAM-matcher uses a configuration file to set many of the variables it requires to run a sample comparison. Many of these variables can also be set at runtime, but providing default values in the configuration file simplifies the BAM-matcher command line.

However, many variables can only be set in the configuration file, so a configuration file is always required when running BAM-matcher to compare BAM files.


Where is the configuration file and how to get it

By default, BAM-matcher will look for a configuration file ("bam-matcher.conf") in the same directory where the BAM-matcher script ("bam-matcher.py") is located.

TRY: Assuming you have just installed BAM-matcher and have not yet set up the configuration file, if you try running a comparison of the provided test data:

cd /path/to/install/bam-matcher/
bam-matcher.py --bam1 test_data/sample1.bam --bam2 test_data/sample2.bam

you should see this error message:

+--------------+
| CONFIG ERROR |
+--------------+
Cannot access config file (/path/to/install/bam-matcher/bam-matcher.conf).
It either does not exist or is not readable.

This is because the BAM-matcher script (bam-matcher.py) is installed at /path/to/install/bam-matcher/bam-matcher.py, so by default it looks for the configuration file at /path/to/install/bam-matcher/bam-matcher.conf. As this file does not yet exist, BAM-matcher reports the above error.
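
As a rough illustration of the lookup described above, the following Python sketch derives the default configuration path from the script location and checks whether that file is readable. The function names are hypothetical and this is not BAM-matcher's actual code; it only mirrors the behaviour described on this page.

```python
import os


def default_config_path(script_path):
    """Return the path of bam-matcher.conf in the same directory as the script."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.join(script_dir, "bam-matcher.conf")


def config_is_readable(config_path):
    """True if the file exists and the current user can read it."""
    return os.path.isfile(config_path) and os.access(config_path, os.R_OK)


if __name__ == "__main__":
    path = default_config_path("/path/to/install/bam-matcher/bam-matcher.py")
    print(path, config_is_readable(path))
```

If the second value printed is False, the CONFIG ERROR shown above is the expected outcome.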

However, in the BAM-matcher directory, you should find a file named: bam-matcher.conf.template.

DO THIS: Make a copy of this file (in the same directory) and name it bam-matcher.conf:

cd /path/to/install/bam-matcher
cp bam-matcher.conf.template bam-matcher.conf

TRY: Now, if you execute the same command:

bam-matcher.py --bam1 test_data/sample1.bam --bam2 test_data/sample2.bam

you should no longer see the configuration file error, although you should see a different error:

+--------------+
| CONFIG ERROR |
+--------------+
Cannot access variants VCF file (/home/paul/localwork/bam-matcher/variants.vcf).
It either does not exist or is not readable.

Multiple configuration files

If you wish to use a different configuration file, you can specify it with the --conf/-c option:

bam-matcher.py --conf a_different_config_file

Missing configuration template file

If the configuration template file is missing, you can also generate it by running:

bam-matcher.py --generate-config [template_name]

or just:

bam-matcher.py -G [template_name]

If no template file name is supplied, the template is written to bam-matcher.conf.template. Otherwise, it is written to the specified file.


Setting up configuration file

Open up the configuration file (bam-matcher.conf) using a text editor, and follow the steps below.

Comments

# BAM-matcher configuration file
# If not setting a specific parameter, just leave it blank, rather than deleting or commenting out the line
# Missing parameter keywords will generate errors

Lines starting with "#" are comments and will be ignored by BAM-matcher.

SECTION: VariantCallers

[VariantCallers]
# file paths to variant callers and other binaries
# sometime you may need to specify full path to the binary (for freebayes, samtools and java)
# full paths is always required for *.jar files (GATK and VarScan2)
caller:    
GATK:      GenomeAnalysisTK.jar
freebayes: freebayes
samtools:  samtools
varscan:   VarScan.jar
java:      java

The square brackets ([ ]) are section headers. Do not remove any section headers, or else BAM-matcher will fail and report this error:

+--------------+
| CONFIG ERROR |
+--------------+
Missing required section in config file: VariantCallers

Similarly, do not remove any parameter keywords. If you are not specifying a value for a particular parameter, just leave it blank, but keep the parameter keyword (followed by a colon, e.g. "caller:").
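
Before running a comparison, it can be handy to check that an edited configuration file still has every section header and parameter keyword. The sketch below does this with Python's standard configparser module, using the sections and keywords shown in the template on this page. The function name and the REQUIRED table are illustrative assumptions, not part of BAM-matcher, and BAM-matcher's own parsing may differ in detail.

```python
import configparser

# Sections and parameter keywords as they appear in the template on this page.
REQUIRED = {
    "VariantCallers": ["caller", "GATK", "freebayes", "samtools", "varscan", "java"],
    "ScriptOptions": ["DP_threshold", "number_of_SNPs", "fast_freebayes", "VCF_file"],
    "VariantCallerParameters": ["GATK_MEM", "GATK_nt", "VARSCAN_MEM"],
    "GenomeReference": ["REFERENCE", "REF_ALTERNATE", "CHROM_MAP"],
    "BatchOperations": ["CACHE_DIR"],
    "Miscellaneous": [],
}


def find_config_problems(config_text):
    """Return (missing_sections, missing_keys) for the given config text.

    A keyword with a blank value (e.g. "caller:") counts as present.
    Note that configparser compares keywords case-insensitively by default.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    missing_sections = []
    missing_keys = []
    for section, keys in REQUIRED.items():
        if not parser.has_section(section):
            missing_sections.append(section)
            continue
        for key in keys:
            if not parser.has_option(section, key):
                missing_keys.append((section, key))
    return missing_sections, missing_keys
```

To check a file on disk, read its contents and pass the text in, for example find_config_problems(open("bam-matcher.conf").read()). Two empty lists mean nothing listed here is missing.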

As the header implies, this section concerns the variant callers used for genotype calling.

caller: This specifies the default caller to use. Acceptable values are: 'gatk', 'freebayes', 'varscan', and blank. Anything else will fail. If you don't specify a caller here, you can do so at runtime with the option --caller/-CL. If no caller is specified in the configuration file or at runtime, BAM-matcher defaults to 'freebayes'.

GATK: Provide the full path to the GATK .jar file. You can leave this blank if you don't intend to use GATK.

freebayes: The command to call Freebayes. Full path is not needed if the command is recognised. Otherwise, you may need to provide the full path to the Freebayes executable.

samtools: The command to call SAMtools. Full path required only if command is not recognised. Only required if using VarScan.

varscan: Provide the full path to the VarScan .jar file. You can leave this blank if you don't intend to use VarScan.

java: The command to call java. Full path is not needed if the command is recognised. Only required if using GATK or VarScan.
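
The caller selection rules and tool requirements above can be summarised in a short Python sketch. It assumes that a caller given at runtime takes precedence over the one in the configuration file, and that values are matched exactly as lowercase strings. The names resolve_caller and REQUIRED_TOOLS are hypothetical, and this is not BAM-matcher's own implementation.

```python
VALID_CALLERS = ("gatk", "freebayes", "varscan")

# [VariantCallers] entries each caller depends on, per the descriptions above.
REQUIRED_TOOLS = {
    "gatk": ["GATK", "java"],
    "freebayes": ["freebayes"],
    "varscan": ["varscan", "samtools", "java"],
}


def resolve_caller(runtime_value=None, config_value=""):
    """Pick the caller: runtime argument first (assumed), then config, then 'freebayes'."""
    for candidate in (runtime_value, config_value):
        if candidate is None or not candidate.strip():
            continue  # not set at this level, try the next one
        caller = candidate.strip()
        if caller not in VALID_CALLERS:
            raise ValueError("Unsupported caller: " + caller)
        return caller
    return "freebayes"  # default when nothing is specified anywhere


if __name__ == "__main__":
    caller = resolve_caller(runtime_value=None, config_value="")
    print(caller, REQUIRED_TOOLS[caller])
```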

SECTION: ScriptOptions

[ScriptOptions]
DP_threshold:   15
number_of_SNPs:

# fast_freebayes enables --targets option for Freebayes, faster but more prone to Freebayes errors
# set to False will use --region, each variant is called separately
fast_freebayes: True

VCF_file: variants.vcf

DP_threshold: minimum read-depth required to make a genotype call. If not sure what to use, leave at default (15). If not specified in the configuration file or at run time, the default value will be used.

number_of_SNPs: Number of SNPs from the input VCF file to compare. This is mainly for testing purposes. Leave at default (0) or blank if unsure what to use.

fast_freebayes: This only concerns genotype calling using Freebayes. In some earlier versions of Freebayes, calling with the --targets option would occasionally fail, so BAM-matcher originally called each site individually with --region, which is fault-tolerant but much slower. This problem appears to be fixed in later versions of Freebayes (v1.0+). With fast_freebayes = True, BAM-matcher uses --targets when calling with Freebayes. This should be left as the default; only set it to False if you encounter errors with Freebayes calling.

VCF_file: Provide the full path to the VCF file containing the variant loci to be used for genotype comparison. Three VCF files are provided with BAM-matcher, but these are only for human samples (hg19). The genomic positions in the VCF file should refer to the same genome reference as the specified default reference (REFERENCE in the configuration file, or --reference in runtime arguments).

1kg.exome.highAF.1511.vcf  - contains 1511 variants
1kg.exome.highAF.3680.vcf  - contains 3680 variants
1kg.exome.highAF.7550.vcf  - contains 7550 variants

These were all extracted from the 1000 Genomes database (http://www.1000genomes.org/data#download). The variants are all exonic and have a global minor allele frequency between 0.45 and 0.55.

To use these VCF files, the matching hg19 genome reference can be downloaded from the Broad Institute, as part of the GATK resource bundle (https://www.broadinstitute.org/gatk/download/). This is the version of hg19 that doesn't contain 'chr' in the chromosome names.

For more details about the VCF files and genome references, see the page on running example data and using multiple genome references.
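
To make the two Freebayes calling modes controlled by fast_freebayes more concrete, here is a minimal Python sketch that builds the corresponding command lines. It is not BAM-matcher's actual invocation. It assumes that the sites are 1-based VCF positions, that Freebayes regions are 0-based with the end position excluded, and that a BED file of targets (the hypothetical targets.bed) has already been written from the VCF file.

```python
def freebayes_commands(sites, bam, reference, fast_freebayes=True,
                       targets_bed="targets.bed", freebayes="freebayes"):
    """Return a list of Freebayes command lines (each a list of arguments).

    sites: iterable of (chromosome, 1-based position) tuples taken from the VCF.
    targets_bed: hypothetical BED file listing the same sites.
    """
    base = [freebayes, "--fasta-reference", reference]
    if fast_freebayes:
        # One call covering every site at once.
        return [base + ["--targets", targets_bed, bam]]
    commands = []
    for chrom, pos in sites:
        # One call per site; region assumed 0-based, end-exclusive.
        region = "{}:{}-{}".format(chrom, pos - 1, pos)
        commands.append(base + ["--region", region, bam])
    return commands


if __name__ == "__main__":
    for cmd in freebayes_commands([("1", 100), ("2", 200)], "sample1.bam",
                                  "hg19.fasta", fast_freebayes=False):
        print(" ".join(cmd))
```

The per-site mode launches one process per variant, which is why it is much slower on a VCF file with thousands of sites.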

SECTION: VariantCallerParameters

[VariantCallerParameters]
# GATK memory usage in GB
GATK_MEM: 4

# GATK threads (-nt)
GATK_nt:  1

# VarScan memory usage in GB
VARSCAN_MEM: 4

These parameters mainly concern Java VM settings when running GATK or VarScan.

GATK_MEM: Maximum Java VM heap size in GB for running GATK. Typically, we have encountered no problem with 4 GB.

GATK_nt: Number of threads used to run GATK UnifiedGenotyper. This value is passed on to the -nt option in GATK.

VARSCAN_MEM: Maximum Java VM heap size in GB for running VarScan. Typically, we have had no problems using 4 GB.
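
As an illustration of how these values could end up on a command line, the Python sketch below builds the start of a Java invocation. It assumes the heap limit is passed with the standard JVM -Xmx option and that GATK is run in the GATK 3 style (-T UnifiedGenotyper). The function names are hypothetical, and the remaining caller arguments (reference, BAM files, output) are deliberately left out.

```python
def java_prefix(java, jar_path, mem_gb):
    """Start of a 'java -jar' command with the maximum heap size set in GB."""
    return [java, "-Xmx{}g".format(int(mem_gb)), "-jar", jar_path]


def gatk_prefix(java, gatk_jar, gatk_mem, gatk_nt):
    """Start of a GATK UnifiedGenotyper command using GATK_MEM and GATK_nt."""
    return java_prefix(java, gatk_jar, gatk_mem) + [
        "-T", "UnifiedGenotyper", "-nt", str(int(gatk_nt))]


if __name__ == "__main__":
    # Values as they appear in the template above (config values are strings).
    print(" ".join(gatk_prefix("java", "GenomeAnalysisTK.jar", "4", "1")))
    print(" ".join(java_prefix("java", "VarScan.jar", "4")))
```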

SECTION: GenomeReference

[GenomeReference]
# default reference fasta file
REFERENCE: hg19.fasta
REF_ALTERNATE:
# CHROM_MAP is required if using two different genome references that have different (but compatible) chromosome names
# this is mainly to deal with the hg19 "chr" issue
CHROM_MAP:

REFERENCE: Provide the full path to the genome reference FASTA file to use for genotype calling. This should be the same reference file to which the input BAM file reads are mapped. The FASTA file needs to be indexed by samtools (faidx). If REFERENCE is not specified here, it must be provided at run time (--reference/-R).

REF_ALTERNATE: Provide the full path to the alternate genome reference FASTA file. REF_ALTERNATE should not be used without REFERENCE specified.

CHROM_MAP: A chromosome map is required if the two input BAM files are mapped to different (but compatible) genome references. An example chromosome map file is provided (hg19.chromsome_map).

For more details on REF_ALTERNATE and CHROM_MAP see this page on comparing BAM files mapped to different reference genomes.

These GenomeReference parameters can all be specified at run time. Arguments provided at run time will always override values provided in the configuration file.
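
The override rule and the indexing requirement for REFERENCE can be sketched in Python as follows. The function names are hypothetical and this is not BAM-matcher's own code. The index check relies on samtools faidx writing its index next to the FASTA file with an added .fai suffix.

```python
import os


def resolve_reference(runtime_reference=None, config_reference=""):
    """Runtime --reference/-R wins over REFERENCE from the configuration file."""
    if runtime_reference and runtime_reference.strip():
        return runtime_reference.strip()
    if config_reference and config_reference.strip():
        return config_reference.strip()
    raise ValueError("No genome reference set in the config file or at run time")


def has_faidx_index(fasta_path):
    """True if the samtools faidx index (<fasta>.fai) exists beside the FASTA file."""
    return os.path.isfile(fasta_path + ".fai")


if __name__ == "__main__":
    ref = resolve_reference(runtime_reference=None, config_reference="hg19.fasta")
    if not has_faidx_index(ref):
        print("Index missing, run: samtools faidx " + ref)
```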

SECTION: BatchOperations

[BatchOperations]
CACHE_DIR:  cache_dir

You must specify a valid directory for writing cache data. Ideally, it should be readable and writable by any potential BAM-matcher user.
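
A quick way to check a candidate CACHE_DIR value is sketched below in Python. The function name and the returned messages are illustrative only and are not BAM-matcher's own error text. The check only covers the user running it, not every potential BAM-matcher user.

```python
import os


def cache_dir_problem(cache_dir):
    """Return a short description of what is wrong with cache_dir, or None if usable."""
    if not cache_dir or not cache_dir.strip():
        return "CACHE_DIR is blank"
    if not os.path.isdir(cache_dir):
        return "not an existing directory: " + cache_dir
    # Reading, writing and entering the directory all need to be permitted.
    if not os.access(cache_dir, os.R_OK | os.W_OK | os.X_OK):
        return "directory is not readable and writable: " + cache_dir
    return None


if __name__ == "__main__":
    print(cache_dir_problem("cache_dir"))
```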

SECTION: Miscellaneous

[Miscellaneous]

Nothing here. But leave the header intact.
