clever-toolkit / Home

Introduction

The CLEVER Toolkit (CTK) is a set of tools, written in C++, for manipulating next-generation sequencing data and for discovering and genotyping structural variations such as insertions and deletions. It has been developed and tested for Illumina paired-end reads. Depending on the application, it can also be used to analyze data from other platforms.

Installation

Dependencies

To compile the CTK, you need the following tools / libraries:

If you are using Linux, all these dependencies are most likely available through the package manager that comes with your distribution (or even pre-installed).

Bundled libraries

To ease the installation process, the BamTools library is included in the CTK and is automatically compiled and installed by the CTK's installation routines. BamTools is developed by Derek Barnett and distributed under the terms of the MIT License.

Installation instructions

Make sure you have installed all needed software as listed above.

First, run cmake:

cmake .

If you want to install the CTK to a non-standard location, you can add the argument -DCMAKE_INSTALL_PREFIX=<prefix-path> to your cmake call. If your Boost library is installed in a non-standard location, you can add -DBOOST_ROOT=<path>. With both options, your cmake call looks like this:

cmake -DCMAKE_INSTALL_PREFIX=<prefix-path> -DBOOST_ROOT=<path> .

When cmake is done, you can run:

make
make install

The last command installs the CTK executables to <prefix-path>/bin. (Make sure this directory is in your PATH.)
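
One way to make the installed tools available is to prepend the bin directory to PATH, for example from a shell startup file. The following is a minimal sketch, not part of the CTK: the prefix ${HOME}/ctk is a hypothetical placeholder for the <prefix-path> given to cmake, and the helper name prepend_to_path is our own.

```shell
#!/bin/bash
# Hypothetical install prefix; use the value passed to -DCMAKE_INSTALL_PREFIX.
prefix="${HOME}/ctk"

# Prepend a directory to PATH unless it is already listed there.
prepend_to_path() {
    case ":${PATH}:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$1:${PATH}" ;;
    esac
}

prepend_to_path "${prefix}/bin"
export PATH
```

Calling the helper a second time with the same directory leaves PATH unchanged, so the snippet is safe to source repeatedly.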

A list of the tools that are part of the CTK can be found in the section "List of tools" below. Any of the tools can be called without parameters to show usage information. The two most important tools are CLEVER and LASER; see the next section.

Main Tools

To cater to different users' needs, there are two ways of using the CTK. First, the most common use cases are accessible through the two main tools CLEVER and LASER. These are easy-to-use wrapper scripts written in Python that execute a pipeline of CTK tools for you. Second, if you want full control over all parameters and options, you can build your own pipeline using the individual tools that are part of the CTK. Some of the more advanced features are only available via this route.

CLEVER

CLEVER stands for Clique-Enumerating Variant Finder and discovers structural variations such as insertions and deletions from paired-end sequencing data. It is a so-called internal-segment-based method (also known as a read pair method) that searches for groups of read pairs whose insert sizes deviate from the null distribution. Assuming that you have aligned the read pairs (using a read mapper such as BWA), CLEVER can be run as follows.

clever --use_xa input.bam reference.fasta result-directory

The option --use_xa tells CLEVER to interpret alternative alignments encoded in XA tags. This form of encoding is used, for instance, by BWA and Stampy. If you use a read mapper that reports alternative alignments in separate lines (for instance, bowtie2), you can omit this option. The presence of alternative alignments in your BAM file is important to get the best results from CLEVER. If your BAM file is sorted by position, you have to add the option --sorted.

After CLEVER has run successfully, a VCF file with the predictions can be found in the given result-directory.

Possible Pitfalls

Sorted BAM files without XA tags

If your read mapper returns alternative alignments in separate rows, your BAM file must not be sorted by position. The reason is that in this case, it is very hard to find all alignments for a given read.
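
The option rules described above, including this pitfall, can be sketched as a small helper that assembles the flags for the clever wrapper. This is only an illustration under stated assumptions: the function names are hypothetical, and the sort-order check relies on samtools (not part of the CTK) and on the SO:coordinate field that the SAM header uses to declare position sorting.

```shell
#!/bin/bash
# Read a SAM/BAM header on stdin and succeed if the @HD line declares
# coordinate (position) sorting.
is_position_sorted() {
    grep -q '^@HD.*SO:coordinate'
}

# Print the options for the clever wrapper.
# $1: "xa" if alternative alignments are encoded in XA tags, anything else otherwise
# $2: "sorted" if the BAM file is sorted by position, anything else otherwise
clever_options() {
    local xa="$1" sorted="$2" opts=()
    if [ "${xa}" != "xa" ] && [ "${sorted}" = "sorted" ]; then
        echo "position-sorted BAM without XA tags cannot be processed" >&2
        return 1
    fi
    if [ "${xa}" = "xa" ]; then opts+=(--use_xa); fi
    if [ "${sorted}" = "sorted" ]; then opts+=(--sorted); fi
    printf '%s\n' "${opts[*]}"
}

# Usage sketch (assumes samtools is installed):
#   sorted=unsorted
#   if samtools view -H input.bam | is_position_sorted; then sorted=sorted; fi
#   clever $(clever_options xa "${sorted}") input.bam reference.fasta result-directory
```

Rejecting the combination of multi-line alternative alignments and position sorting up front avoids starting a run that cannot find all alignments of a read.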

Non-Gaussian insert size distributions

By default, CLEVER assumes the insert size distribution to be a normal distribution. If your data deviates only slightly from this assumption, this might not be a problem. One way of adjusting for overly heavy tails is to discard predictions supported by only a few read pairs during postprocessing (default cutoff: 2). To do that, additional parameters can be passed on to the postprocessing step as follows.

clever --use_xa -P '-d 2 -i 5' input.bam reference.fasta result-directory

This requires a support of 2 for deletions and a support of 5 for insertions, which corrects for a left-heavy insert size distribution.

If your distribution deviates heavily from a normal distribution, there are several options. The best one, from our point of view, is to revisit your library prep protocol, but we assume here that you still want to analyze your dataset with the non-normal distribution. When building your own custom pipeline (see the section on custom pipelines below), the CTK offers two ways of dealing with non-Gaussian insert size distributions. First, when processing a data set with read groups that come from different libraries with different insert sizes, you can estimate a separate normal distribution for every read group and let CLEVER take that into account (use clever-core -A). If the distribution for each read group is normal, this performs as well as if all reads came from the same library. Second, you can try our experimental feature for using arbitrary null distributions (clever-core -d).

LASER

LASER is a long-indel-aware read mapper. It proceeds in several phases. First, it aligns prefixes and suffixes of the input reads using a standard read aligner (BWA by default). Then it takes the obtained seed alignments and aligns the reads with an emphasis on sensitivity and on being aware of long(er) insertions and deletions. After that, LASER takes the positions of putative SNPs, insertions and deletions into account and recalibrates the alignment scores of all candidate alignments. This improves mapping quality estimates and makes it possible to decide on the correct alignment in the presence of alternative alignments.

To use LASER, you have to install BWA on your system and build an index. (If you build your own custom pipeline, you can plug in your favorite read mapper instead of BWA.) Then, you can use LASER as follows:

laser reference.fasta(.gz) reads.1.fastq.gz reads.2.fastq.gz outprefix

For this to succeed, you have to have BWA in your PATH and the BWA index must exist at the same place where the reference is stored (i.e., the files reference.amb, reference.ann, reference.bwt, reference.pac, and reference.sa must exist).
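
Before starting a long run, these prerequisites can be checked with a few lines of shell. The sketch below is not part of the CTK and the function name check_bwa_index is hypothetical. The index prefix is passed as an argument because it depends on how the index was built.

```shell
#!/bin/bash
# Check that all five BWA index files exist for the given index prefix.
# Reports missing files on stderr and returns non-zero if any is absent.
check_bwa_index() {
    local prefix="$1" ext missing=0
    for ext in amb ann bwt pac sa; do
        if [ ! -f "${prefix}.${ext}" ]; then
            echo "missing BWA index file: ${prefix}.${ext}" >&2
            missing=1
        fi
    done
    return "${missing}"
}

# Usage sketch:
#   command -v bwa >/dev/null || echo "bwa is not in PATH" >&2
#   check_bwa_index reference && laser reference.fasta reads.1.fastq.gz reads.2.fastq.gz outprefix
```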

LASER outputs the following files:

  • outprefix.bam: The main output.
  • outprefix.putative-snps: List of positions that might be SNPs with their expected support.
  • outprefix.putative-indels: List of putative insertions and deletions with their expected support.
  • outprefix.insert-size-dist: Internal segment size distribution (where internal segment size = fragment size - 2x read lengths).
  • outprefix.insertion-size-dist: Empirical distribution of insertion lengths in uniquely mappable reads.
  • outprefix.deletion-size-dist: Empirical distribution of deletion lengths in uniquely mappable reads.
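
The definition of the internal segment size used above can be written out as a short calculation. This is only an illustration of the formula, assuming both reads of a pair have the same length. The function name is our own.

```shell
#!/bin/bash
# internal segment size = fragment size - 2 x read length
# $1: fragment size in bp, $2: read length in bp
internal_segment_size() {
    local fragment_size="$1" read_length="$2"
    echo $(( fragment_size - 2 * read_length ))
}

internal_segment_size 500 100   # prints 300
internal_segment_size 150 100   # prints -50: the two reads overlap by 50 bp
```

A negative value means that the two reads of a pair overlap.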

List of tools

The CLEVER Toolkit (CTK) aims at being a flexible collection of tools that allow you to build your own pipeline that suits your needs. Below, you find a short description of the tools it consists of and, further below, a set of common use-cases and examples on how to use the CTK.

| Tool | Description |
| --- | --- |
| laser | Wrapper script for LASER read aligner. |
| clever | Wrapper script for CLEVER SV discovery. |
| bam-to-alignment-priors | Converts a BAM file to a list of alignment pairs with prior probabilities. This is the input to clever-core. |
| split-priors-by-chromosome | Reads output of bam-to-alignment-priors and splits it into separate files for each chromosome. |
| clever-core | CLEVER core algorithm. |
| postprocess-predictions | Reads output of clever-core, removes redundant predictions and outputs a VCF. |
| evaluate-sv-predictions | Compares two sets of SV calls (given in VCF format) and prints statistics. |
| split-reads | Reads read pairs in FASTQ files and creates a FASTQ file with prefixes/suffixes. |
| laser-core | LASER core algorithm. |
| laser-recalibrate | Recalibrates alignment scores in a BAM file written by LASER. |
| genotyper | Genotypes a given list of deletions based on a BAM file. |
| insert-length-histogram | Computes an insert size histogram from uniquely mappable reads in a BAM file. |
| add-score-tags-to-bam | Adds AS tags with alignment scores to a BAM file. |
| bam2fastq | Extracts reads from a BAM file. |
| remove-redundant-variations | From a set of insertions/deletions, removes those which are equivalent due to repeats. |
| precompute-distributions | Precomputes insert size distributions needed when using clever-core -d (experimental). |
| extract-bad-reads | Extracts reads from a BAM file that meet certain criteria and outputs FASTQ. |
| filter-variations | Reads insertion/deletion predictions made by CLEVER and LASER and retains only those made by both. |
| merge-to-vcf | Reads CLEVER and LASER predictions and creates a joint VCF file. |
| multiline-to-xa | Turns a BAM file with alternative alignments in multiple lines into a BWA-style BAM file with XA tags. |
| filter-bam | Excludes some reads from a BAM file, e.g. based on read groups. |
| read-group-stats | Prints statistics on insert size distributions per read group. |

Custom Pipelines: Example use cases

The above tools can be combined to accomplish many tasks. Examples are collected below.

Running CLEVER without wrapper script

The main CLEVER executable is a Python script that provides a convenient way of running all steps to make SV predictions from a BAM file. To fine-tune parameters, distribute the work on a compute cluster, or use features unavailable through the wrapper scripts, it can be beneficial to build your own pipeline. To this end, all the necessary steps are explained below and an example bash script is available here. One important advantage of the workflow below is that an individual insert size distribution is estimated per read group, which is (currently) not possible using the wrapper.

Let this be the start of our BASH script:

#!/bin/bash
# Abort on the first failing command.
set -e
bam="<your-bam-filename-here>"
# Working directory for all intermediate files.
mkdir -p work

The next step is to estimate the insert size distribution for each read group using the insert-length-histogram tool. By default, it uses the first million uniquely mapping read pairs. If you have many read groups or some read groups are rare, you might increase that number (option -C) to get a robust estimate for all read groups. The read-group-stats command computes the mean and standard deviation of the internal segment length (i.e., fragment size minus the lengths of both reads) and gives the maximum and minimum over all read groups. We store the maximum mean and the maximum standard deviation in separate variables for later use.

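# Estimate the insert size distribution for each read group.
insert-length-histogram -R --sorted -o work/rg.{readgroup}.dist -L work/rg-list < "${bam}"
read-group-stats work/rg-list > work/rg-list.summary
# Keep the maximum mean and maximum standard deviation for the postprocessing step.
max_mean="$(tail -n1 work/rg-list.summary | cut -d ' ' -f 3)"
max_stddev="$(tail -n1 work/rg-list.summary | cut -d ' ' -f 5)"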

As a preprocessing step before running the main CLEVER algorithm, we need to extract alignment pair probabilities from the BAM file and split them to create one file per chromosome. This command line creates files named work/chr.<chr-number>.aln-priors.gz.

bam-to-alignment-priors -r work/rg-list human_g1k_v37.fasta "${bam}" | split-priors-by-chromosome -zg work/chr

In a compute cluster environment, clever-core can then be run on each chromosome in parallel and/or on different hosts. Here, we just use a BASH for loop. The clever-core tool expects the input to be sorted by position, which we ensure by using the standard unix sort utility.

for f in work/*.aln-priors.gz ; do
    # Sort by position (column 7) as required by clever-core, then search for cliques.
    zcat "${f}" | sort -k7,7 -g | clever-core -v -R "${bam}" -A work/rg-list > "${f%.aln-priors.gz}.clever.out"
done
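
On a single multi-core machine, the loop above can also be run with one background job per chromosome. The following is a minimal sketch under stated assumptions, not an official CTK script: the function names are hypothetical, the clever-core options are copied from the loop above, and all jobs are started at once without limiting how many run concurrently.

```shell
#!/bin/bash
# Map work/chr.1.aln-priors.gz to work/chr.1.clever.out.
clever_output_name() {
    printf '%s\n' "${1%.aln-priors.gz}.clever.out"
}

# Run clever-core on one per-chromosome file, using the options from the loop above.
# $1: BAM file, $2: read group list, $3: alignment priors file
run_clever_core() {
    zcat "$3" | sort -k7,7 -g | clever-core -v -R "$1" -A "$2" > "$(clever_output_name "$3")"
}

# Start one background job per priors file, then wait for each of them.
# Returns non-zero if any job failed.
run_all_chromosomes() {
    local bam="$1" rg_list="$2" f pid pids=() status=0
    shift 2
    for f in "$@"; do
        run_clever_core "${bam}" "${rg_list}" "${f}" &
        pids+=("$!")
    done
    for pid in "${pids[@]}"; do
        wait "${pid}" || status=1
    done
    return "${status}"
}

# Usage sketch:
#   run_all_chromosomes "${bam}" work/rg-list work/*.aln-priors.gz
```

Waiting for each job individually lets the script notice a failed chromosome instead of silently continuing with incomplete output.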

After this step, one file named work/chr.<chr-number>.clever.out has been created per chromosome, containing one line for every significant clique of alignments found by CLEVER. At this stage, there might be multiple overlapping cliques for the same SV. To merge overlapping predictions and to generate one VCF file for all chromosomes, we use the tool postprocess-predictions as follows.

cat work/*.clever.out > work/all.clever.out
postprocess-predictions --vcf --stddev ${max_stddev} work/all.clever.out ${max_mean} > results.vcf

References

More details on the algorithms underlying our tools can be found in the following papers. If you use our tools for your research, please cite us.

Tobias Marschall, Ivan Costa, Stefan Canzar, Markus Bauer, Gunnar Klau, Alexander Schliep and Alexander Schönhuth. CLEVER: Clique-Enumerating Variant Finder. Bioinformatics, 28(22), pages 2875-2882, 2012. DOI: 10.1093/bioinformatics/bts566.

Tobias Marschall and Alexander Schönhuth. LASER: Sensitive Long-Indel-Aware Alignment of Sequencing Reads. arXiv: 1303.3520.
