Introduction
The CLEVER Toolkit (CTK) is a set of tools, written in C++, to manipulate next-generation sequencing data and to discover and genotype structural variations such as insertions and deletions. It has been developed and tested for Illumina paired-end reads; depending on the application, it can also be used to analyze data from other platforms.
Installation via Bioconda
This is usually the easiest way to deploy the CLEVER Toolkit: just install Bioconda and then run
conda install clever-toolkit
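If you prefer to keep the CTK in its own environment, here is a minimal sketch (assuming the bioconda and conda-forge channels are configured as described in the Bioconda documentation):

```
# Create and activate a dedicated environment for the CTK
conda create -n ctk clever-toolkit
conda activate ctk
# Calling any tool without parameters prints its usage information
clever
```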
Manual Installation
Dependencies
To compile the CTK, you need the following tools / libraries:

- a C++ compiler
- CMake
- the Boost C++ libraries
- Python (required by the clever and laser wrapper scripts)
If you are using Linux, all these dependencies are most likely available through the package manager that comes with your distribution (or even pre-installed).
Bundled libraries
To ease the installation process, the BamTools library is bundled with the CTK and is automatically compiled and installed by the CTK's installation routines. BamTools is developed by Derek Barnett and distributed under the terms of the MIT License.
Installation instructions
Make sure you have installed all needed software as listed above.
First, run cmake:

```
cmake .
```

If you want to install the CTK to a non-standard location, add -DCMAKE_INSTALL_PREFIX=<prefix-path> to your cmake call. If your Boost library is installed in a non-standard location, you can additionally add -DBOOST_ROOT=<path>. That is, your cmake call could look like this:

```
cmake -DCMAKE_INSTALL_PREFIX=<prefix-path> -DBOOST_ROOT=<path> .
```
Then build and install:

```
make
make install
```
The last command installs the CTK executables (listed under List of tools below) to <prefix-path>/bin. (Make sure this directory is in your PATH.)
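For example, assuming a bash shell and the hypothetical install prefix $HOME/ctk, the whole procedure could look like this:

```
cmake -DCMAKE_INSTALL_PREFIX=$HOME/ctk .
make
make install
# Make the installed tools available in the current shell;
# add this line to ~/.bashrc to make it permanent.
export PATH="$HOME/ctk/bin:$PATH"
```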
A list of the tools that are part of the CTK can be found in the List of tools section below. Any of the tools can be called without parameters to show usage information. The two most important tools are CLEVER and LASER; see the next section.
Main Tools
To cater to different users' needs, there are two ways of using the CTK. First, the most common use cases are accessible through the two main tools, CLEVER and LASER. These are easy-to-use wrapper scripts written in Python that execute a pipeline of CTK tools for you. Second, if you want full control over all parameters and options, you can build your own pipeline from the individual tools that make up the CTK. Some of the more advanced features are only available via this route.
CLEVER
CLEVER stands for Clique-Enumerating Variant Finder; it discovers structural variations such as insertions and deletions from paired-end sequencing data. It is a so-called internal-segment-based method (also known as a read-pair method) that searches for groups of read pairs whose insert sizes deviate from the null distribution. Assuming that you have aligned the read pairs (using a read mapper such as BWA), CLEVER can be run as follows.
clever --use_xa input.bam reference.fasta result-directory
The option --use_xa tells CLEVER to interpret alternative alignments encoded in XA tags. This form of encoding is used, for instance, by BWA and Stampy. If you use a read mapper that reports alternative alignments in separate lines, such as bowtie2, you can omit this option. The presence of alternative alignments in your BAM file is important to get the best results from CLEVER. If your BAM file is sorted by position, you have to add the option --sorted.
After CLEVER has run successfully, a VCF file with the predictions can be found in the given result-directory.
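As a quick check, you can list the result directory and peek at the prediction records; the filename predictions.vcf below is an assumption, so verify it against the actual directory contents:

```
ls result-directory/
# Show the first prediction records (skipping VCF header lines)
grep -v '^#' result-directory/predictions.vcf | head
```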
Possible Pitfalls
Sorted BAM files without XA tags
If your read mapper reports alternative alignments in separate rows, your BAM file must not be sorted by position. The reason is that, in a position-sorted file, it is very hard to find all alignments belonging to a given read.
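If all you have is a position-sorted BAM file, one generic workaround (not a CTK command) is to restore read-name order with samtools, so that all alignments of a pair are adjacent again; alternatively, the CTK tool multiline-to-xa can convert such files into BWA-style BAMs with XA tags (call it without parameters for usage):

```
# Sort by read name so that all alignments of a read pair are adjacent
samtools sort -n -o input.name-sorted.bam input.pos-sorted.bam
```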
Non-Gaussian insert size distributions
By default, CLEVER assumes the insert size distribution to be a normal distribution. If your data deviates only slightly from this assumption, this might not be a problem. One way of adjusting for too-heavy tails is to discard predictions supported by only a few read pairs during postprocessing (default cutoff: 2). To do that, additional parameters can be passed on to the postprocessing step as follows.
clever --use_xa -P '-d 2 -i 5' input.bam reference.fasta result-directory
If the deviation is more pronounced, there are two further options. First, if your BAM file contains multiple read groups, you can estimate a separate insert size distribution per read group (clever-core -A; a worked example is given in the custom pipeline section below). If the distributions for the individual read groups are normal distributions, this results in performance as good as if all reads came from the same library. Second, you can try our experimental feature for using arbitrary null distributions (clever-core -d).
LASER
LASER is a long-indel-aware read mapper. It proceeds in several phases. First, it aligns prefixes and suffixes of the input reads using a standard read aligner (BWA by default). Then it takes the obtained seed alignments and aligns the full reads with an emphasis on sensitivity and on being aware of long(er) insertions and deletions. After that, LASER takes the positions of putative SNPs, insertions, and deletions into account and recalibrates the alignment scores of all candidate alignments. This improves mapping quality estimates and makes it possible to decide on the correct alignment in the presence of alternative alignments.
To use LASER, you have to install BWA on your system and build an index. (If you build your own custom pipeline, see below, you can plug in your favorite read mapper instead of BWA.) Then, you can use LASER as follows:
laser reference.fasta(.gz) reads.1.fastq.gz reads.2.fastq.gz outprefix
Here, the reference must have been indexed with BWA (i.e. the files reference.amb, reference.ann, reference.bwt, reference.pac, and reference.sa must exist).
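A minimal end-to-end sketch (assuming BWA and samtools are installed; sample1 is a hypothetical output prefix):

```
# Build a BWA index with prefix "reference", creating the
# reference.{amb,ann,bwt,pac,sa} files listed above
bwa index -p reference reference.fasta
# Map the read pairs with LASER
laser reference.fasta reads.1.fastq.gz reads.2.fastq.gz sample1
# Sort and index the resulting BAM for downstream tools
samtools sort -o sample1.sorted.bam sample1.bam
samtools index sample1.sorted.bam
```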
LASER outputs the following files:
- outprefix.bam: The main output.
- outprefix.putative-snps: List of positions that might be SNPs, with their expected support.
- outprefix.putative-indels: List of putative insertions and deletions, with their expected support.
- outprefix.insert-size-dist: Internal segment size distribution (where internal segment size = fragment size - 2x read length).
- outprefix.insertion-size-dist: Empirical distribution of insertion lengths in uniquely mappable reads.
- outprefix.deletion-size-dist: Empirical distribution of deletion lengths in uniquely mappable reads.
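The distribution files can be inspected with standard Unix tools. As a sketch, assuming each line of outprefix.insert-size-dist holds a "size count" pair (verify this against your own output), the mean internal segment size could be computed like this:

```
# Weighted mean over the histogram (assumed line format: "size count")
awk '{ s += $1 * $2; n += $2 } END { if (n > 0) print s / n }' outprefix.insert-size-dist
```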
List of tools
The CLEVER Toolkit (CTK) aims at being a flexible collection of tools that allows you to build a pipeline that suits your needs. Below, you find a short description of the tools it consists of and, further below, a set of common use cases with examples of how to use the CTK.
| Tool | Description |
|---|---|
| laser | Wrapper script for the LASER read aligner. |
| clever | Wrapper script for CLEVER SV discovery. |
| bam-to-alignment-priors | Converts a BAM file to a list of alignment pairs with prior probabilities; this is the input to clever-core. |
| split-priors-by-chromosome | Reads the output of bam-to-alignment-priors and splits it into separate files for each chromosome. |
| clever-core | CLEVER core algorithm. |
| postprocess-predictions | Reads the output of clever-core, removes redundant predictions, and outputs a VCF. |
| evaluate-sv-predictions | Compares two sets of SV calls (given in VCF format) and prints statistics. |
| split-reads | Reads read pairs from FASTQ files and creates a FASTQ file with their prefixes/suffixes. |
| laser-core | LASER core algorithm. |
| laser-recalibrate | Recalibrates alignment scores in a BAM file written by LASER. |
| genotyper | Genotypes a given list of deletions based on a BAM file. |
| insert-length-histogram | Computes an insert size histogram from uniquely mappable reads in a BAM file. |
| add-score-tags-to-bam | Adds AS tags with alignment scores to a BAM file. |
| bam2fastq | Extracts reads from a BAM file. |
| remove-redundant-variations | From a set of insertions/deletions, removes those that are equivalent due to repeats. |
| precompute-distributions | Precomputes insert size distributions needed when using clever-core -d (experimental). |
| extract-bad-reads | Extracts reads from a BAM file that meet certain criteria and outputs FASTQ. |
| filter-variations | Reads insertion/deletion predictions made by CLEVER and LASER and retains only those made by both. |
| merge-to-vcf | Reads CLEVER and LASER predictions and creates a joint VCF file. |
| multiline-to-xa | Turns a BAM file with alternative alignments in multiple lines into a BWA-style BAM file with XA tags. |
| filter-bam | Excludes some reads from a BAM file, e.g. based on read groups. |
| read-group-stats | Prints statistics on insert size distributions per read group. |
Custom Pipelines: Example use cases
The above tools can be combined for many different tasks. Examples are collected below.
Running CLEVER without wrapper script
The main clever executable is a Python script that provides a convenient way of running all steps needed to make SV predictions from a BAM file. In order to fine-tune parameters, distribute the work on a compute cluster, or use features unavailable through the wrapper scripts, it can be beneficial to build your own pipeline. To this end, all the necessary steps are explained below, and an example bash script is available here. One important advantage of the workflow below is that an individual insert size distribution is estimated per read group, which is (currently) not possible using the wrapper.
Let this be the start of our BASH script:
```
#!/bin/bash
set -e
bam="<your-bam-filename-here>"
mkdir work
```
The next step is to estimate the insert size distribution for each read group. This can be done using the insert-length-histogram tool. By default, it uses the first million uniquely mapping read pairs for this. If you have many read groups or some read groups are rare, you might increase that number (option -C) to get a robust estimate for all read groups. The read-group-stats command computes the mean and standard deviation of the internal segment length (i.e. fragment size minus the two read lengths) and reports the maximum and minimum over all read groups. We store the maximum mean and the maximum standard deviation in separate variables for later use.
```
insert-length-histogram -R --sorted -o work/rg.{readgroup}.dist -L work/rg-list < ${bam}
read-group-stats work/rg-list > work/rg-list.summary
max_mean="$(tail -n1 work/rg-list.summary | cut -d ' ' -f 3)"
max_stddev="$(tail -n1 work/rg-list.summary | cut -d ' ' -f 5)"
```
As a preprocessing step before running the main CLEVER algorithm, we need to extract alignment pair probabilities from the BAM file and split them into one file per chromosome. The following command line creates files named work/chr.<chr-number>.aln-priors.gz.
bam-to-alignment-priors -r work/rg-list human_g1k_v37.fasta ${bam} | split-priors-by-chromosome -zg work/chr
In a compute cluster environment, clever-core can then be run on each chromosome in parallel and/or on different hosts. Here, we just use a BASH for loop. The clever-core tool expects its input to be sorted by position, which we ensure by using the standard Unix sort utility.
```
for f in work/*.aln-priors.gz ; do
    zcat ${f} | sort -k7,7 -g | clever-core -v -R ${bam} -A work/rg-list > ${f/\.aln-priors.gz/.clever.out}
done
```
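To use several cores on one host, the same loop body can be dispatched with GNU parallel instead (an optional sketch; assumes GNU parallel is installed, adjust -j to your core count):

```
# Run clever-core on several chromosomes concurrently
run_clever() {
    local f="$1"
    zcat "${f}" | sort -k7,7 -g | clever-core -v -R "${bam}" -A work/rg-list > "${f/.aln-priors.gz/.clever.out}"
}
export -f run_clever
export bam
parallel -j 4 run_clever ::: work/*.aln-priors.gz
```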
Once the loop has finished, a file work/chr.<chr-number>.clever.out has been created for each chromosome, containing one line for every significant clique of alignments found by CLEVER. At this stage, there might be multiple overlapping cliques for the same SV. To merge overlapping predictions and to generate one VCF file for all chromosomes, we use the tool postprocess-predictions as follows.
```
cat work/*.clever.out > work/all.clever.out
postprocess-predictions --vcf --stddev ${max_stddev} work/all.clever.out ${max_mean} > results.vcf
```
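As a final sanity check, you can count the number of predictions in the resulting VCF (a generic one-liner, nothing CTK-specific):

```
# Count SV predictions, excluding VCF header lines
grep -vc '^#' results.vcf
```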
References
More details on the algorithms underlying our tools can be found in the following papers. If you use our tools for your research, please cite us.
Tobias Marschall, Ivan Costa, Stefan Canzar, Markus Bauer, Gunnar Klau, Alexander Schliep and Alexander Schoenhuth. CLEVER: Clique-Enumerating Variant Finder. Bioinformatics, 28(22), pages 2875-2882, 2012. DOI: 10.1093/bioinformatics/bts566.
Tobias Marschall and Alexander Schönhuth. LASER: Sensitive Long-Indel-Aware Alignment of Sequencing Reads. arXiv: 1303.3520.