#Getting started#

Download
- InFusion package
- Reference datasets
Installation from binary package
Build from source code
First Run
Simulated sample datasets
Public RNA-seq dataset: VCaP cell line

Download

InFusion package

There are prebuilt binary packages of InFusion available for download.

Download link to the latest version: InFusion v0.8

Previous and build-candidate versions can be found here.

NOTE: prebuilt binaries are currently available only for 64-bit Linux systems. However, the toolkit can be built from source code; see the Build from source code section for details.

Reference datasets

Homo_sapiens.GRCh38 : based on Ensembl v.84

Homo_sapiens.GRCh37 : based on Ensembl v.68

Installation from binary package

InFusion depends on Python (v2.6 or higher) and the GLIBC library (v2.14 or higher). Make sure that both are installed.
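
One way to check both requirements before unpacking is sketched below. It assumes GNU sort (for the -V option) and that the interpreter is reachable as python; the helper names version_ge and report are introduced here only for illustration.

```shell
#!/bin/sh
# version_ge A B: succeed when dotted version A is greater than or equal to B.
# Relies on the -V (version sort) option of GNU sort.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# report NAME FOUND REQUIRED: print whether the detected version is sufficient.
report() {
    if version_ge "${2:-0}" "$3"; then
        echo "$1 ${2}: ok (>= $3)"
    else
        echo "$1 ${2:-not found}: need >= $3"
    fi
}

# glibc version taken from the first line of `ldd --version`, e.g. "2.17".
glibc_ver=$(ldd --version 2>/dev/null | head -n 1 | grep -o '[0-9][0-9.]*$')

# Python version; Python 2 prints it to stderr, hence 2>&1.
py_ver=""
if command -v python >/dev/null 2>&1; then
    py_ver=$(python --version 2>&1 | awk '{ print $2 }')
fi

report "glibc" "$glibc_ver" 2.14
report "python" "$py_ver" 2.6
```

The comparison is purely numeric, so it only confirms the minimum versions named above.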

To install the binary package just unpack it.

For example:

#!shell

tar -xf InFusion-0.6.2-linux-x86_64.tar.gz
cd InFusion-0.6.2
./infusion -v

To make InFusion available from any directory, either add the InFusion folder to $PATH or create an alias in your bash profile:

alias infusion='/home/kokonech/tools/InFusion-0.6.2/infusion'
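
As a minimal sketch of the $PATH option, the snippet below could go into a bash profile. The install location in INFUSION_HOME and the helper name prepend_path are assumptions for illustration; adjust the path to wherever the package was unpacked.

```shell
# Hypothetical install location; change it to your own InFusion folder.
INFUSION_HOME="$HOME/tools/InFusion-0.6.2"

# prepend_path DIR: put DIR at the front of PATH unless it is already listed,
# so that sourcing the profile repeatedly does not grow PATH.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
}

prepend_path "$INFUSION_HOME"
export PATH
```

Unlike an alias, a $PATH entry is also visible to scripts and non-interactive shells.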

Build from source code

The general build process is explained in the README file.

Note that InFusion uses some features of C++11 and TR1, which are only available in gcc 4.4 or higher, so make sure your compiler is up to date.

Below is an example of how to build InFusion on a clean Ubuntu 12.04 system. The process should be easy to adapt to other Unix-based distributions.

###Install dependencies###

Install gcc, CMake, and the zlib and Boost development libraries:

$ sudo apt-get install build-essential libz-dev libboost-all-dev cmake

Install Bowtie2 (>= 2.0.2) and Samtools (>= 0.1.18).

Make sure both executables are available via $PATH.
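
A quick way to confirm that both executables are visible is sketched below. It assumes the programs are installed under their usual names bowtie2 and samtools; the helper missing_tools is a name introduced here for illustration.

```shell
#!/bin/sh
# missing_tools NAME...: print every NAME that cannot be found via $PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools bowtie2 samtools)
if [ -n "$missing" ]; then
    echo "not found on PATH:"
    echo "$missing"
else
    echo "bowtie2 and samtools are available"
fi
```

The check only tests that the commands resolve; it does not verify the minimum versions listed above.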

###Clone and build the code###

$ git clone https://kokonech@bitbucket.org/kokonech/infusion.git
$ cd infusion
$ mkdir -p build/Release
$ cd build/Release
$ cmake ../../src -DCMAKE_BUILD_TYPE=Release
$ make

After the project is built successfully, the directory with source code can be added to PATH.

First Run

InFusion is launched via a Python script called infusion, located in the home folder of the program.

Before using the toolkit, you need to obtain a reference index for your genome and create a configuration file that describes the index.

The reference index includes:

- reference genome sequence in FASTA format and its index
- transcriptome annotations in Ensembl GTF format
- cDNA sequences in FASTA format and their index
- repeat annotations in UCSC format (optional)

The reference index can be created and configured automatically with the Python script setup_reference_dataset.py, which is located in the home folder of the toolkit. The script can reuse existing data from your machine (such as a genome sequence or annotations) or download all required data directly from the web. It also creates a configuration file, which InFusion uses to locate all required resources.

$ python setup_reference_dataset.py -o [path_to_index_folder]

Alternatively, a prebuilt human reference dataset with a configuration file can be downloaded from here.

The typical input data for InFusion are the raw sequencing reads from an RNA-seq experiment.

For paired-end reads, InFusion is launched using the following command:

infusion -1 path_to_upstream_mates -2 path_to_downstream_mates [configuration_file_path]

For single-end reads, use the -r option instead:

infusion -r path_to_reads [configuration_file_path]
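
The two invocations above can be wrapped in one helper that picks the options from the number of read files. This is only a sketch: run_infusion and the INFUSION_BIN variable are hypothetical names introduced here, and only the -1, -2 and -r options shown above are used.

```shell
#!/bin/sh
# run_infusion CONFIG READS1 [READS2]
# Launch InFusion in paired-end mode when two read files are given,
# otherwise in single-end mode. INFUSION_BIN (hypothetical) overrides the
# executable, which defaults to "infusion" found via $PATH.
run_infusion() {
    if [ -n "${3:-}" ]; then
        "${INFUSION_BIN:-infusion}" -1 "$2" -2 "$3" "$1"
    else
        "${INFUSION_BIN:-infusion}" -r "$2" "$1"
    fi
}

# Example calls (file names are placeholders):
#   run_infusion reference_dataset/infusion.cfg reads_1.fastq reads_2.fastq
#   run_infusion reference_dataset/infusion.cfg reads.fastq
```

Setting INFUSION_BIN=echo prints the arguments instead of starting an analysis, which is a convenient dry run.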

List all available options with the -h option. A detailed description of all parameters is provided in the User Manual.

By default the results will be located in the subfolder infusion_output of the current working directory. Use option --out-dir to specify the output folder.

The detected fusions are written to the files fusions.txt and fusions.detailed.txt (detailed report). The format of the output files is described in the User Manual.
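
After a run, a short check like the one below can confirm that a report was produced and show how many fusions it lists. It is a sketch that assumes fusions.txt has one header line followed by one line per fusion; count_fusions and the OUT_DIR variable are names introduced here for illustration.

```shell
#!/bin/sh
# count_fusions FILE: number of non-empty lines after the header line.
count_fusions() {
    awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$1"
}

# Default output folder unless OUT_DIR points elsewhere
# (for example the folder that was passed to --out-dir).
out_dir=${OUT_DIR:-infusion_output}

if [ -f "$out_dir/fusions.txt" ]; then
    echo "fusions reported: $(count_fusions "$out_dir/fusions.txt")"
else
    echo "no fusions.txt in $out_dir"
fi
```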

Simulated sample datasets

Two simulated datasets are available for download. They can be used to check that InFusion is properly installed and to learn how to use it.

The datasets were created using the simulation pipeline that is part of the InFusion source code package.

###DataSet 1: InFusion_test_dataset_01###

This simulated dataset consists of reads from artificial fusion transcripts for C. elegans. The dataset can be downloaded here.

After downloading, unpack the dataset:

$ tar -xf InFusion_test_dataset_01.tar.gz

The example data already contains a reference dataset for C. elegans, so there is no need to build one. Launch InFusion using the following commands:

$ cd InFusion_test_dataset_01 
$ infusion -1 reads_1.fastq -2 reads_2.fastq reference_dataset/infusion.cfg

The results are written to the folder infusion_output, which is created in the same directory as the input dataset. The fusions simulated in this experiment are listed in the file simulated_fusions.txt.

The expected analysis results (v0.7.2): fusions.txt / fusions.detailed.txt / run.log

###DataSet 2: InFusion_test_dataset_02###

The dataset is based on reads simulated from human fusion transcripts. It is one of 100 datasets that were used to assess the performance of InFusion. The dataset includes evidence for 100 fusion transcripts of different types.

To run InFusion on this dataset, you first need to create or obtain a reference index for the human genome.

We will create a reference index using the script setup_reference_dataset.py. This dataset can also be downloaded from here.

The script automatically creates the reference index by either downloading the data from the internet or reusing existing resources on your computer. It also creates a configuration file that describes the index. To list all the options of the script, use the following command:

$ python setup_reference_dataset.py -h

In this example we will create the index from scratch. This step can take several hours. For example, on an 8-core 2.2 GHz Intel machine it takes approximately 4 hours to download the data and index all sequences. Once the index is created, it can be reused for future analyses.

Run this command to create the index using Ensembl database version 68 as reference:

$ python setup_reference_dataset.py --ens-ver 68 -o [path_to_reference_index_directory]

The configuration file for the resulting index is also created and saved in the index directory.
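
Because building the index is slow, a wrapper can skip the step when a configuration file is already present. The sketch below assumes the configuration file is named infusion.cfg inside the index directory, as in the commands of this guide, and that its presence means the index is complete; that second assumption may not hold after an interrupted build. The helper names have_reference and ensure_reference are hypothetical.

```shell
#!/bin/sh
# have_reference DIR: succeed when DIR already contains infusion.cfg.
have_reference() {
    [ -f "$1/infusion.cfg" ]
}

# ensure_reference DIR ENSEMBL_VERSION: reuse an existing index, or build
# one with setup_reference_dataset.py (run from the InFusion home folder).
ensure_reference() {
    if have_reference "$1"; then
        echo "reusing reference index in $1"
    else
        python setup_reference_dataset.py --ens-ver "$2" -o "$1"
    fi
}

# Example call (the directory name is a placeholder):
#   ensure_reference /data/infusion_ref_ens68 68
```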

Once the reference dataset is available, we can launch InFusion on the test data.

$ tar -xf InFusion_test_dataset_02.tar.gz
$ cd InFusion_test_dataset_02/
$ infusion -1 reads_1.fastq.gz -2 reads_2.fastq.gz [path_to_reference_index_directory]/infusion.cfg

The results can be found in the folder infusion_output, which is located in the same directory as the input data.

The expected analysis results (v0.7.2): fusions.txt / fusions.detailed.txt / run.log

Public RNA-seq dataset: VCaP cell line

In this section we will go through an example of running InFusion on a public RNA-seq dataset. We will use RNA-seq data from the VCaP cell line, which was referenced in the following publication.

The sequencing reads in FASTQ format are available for download from here.

The original SRA file: SRX061854

It is assumed that the InFusion reference dataset for the human genome is already available. If not, follow the instructions in the DataSet 2 section above.

Now we will analyze the public RNA-seq dataset. The default InFusion parameters are tuned to balance sensitivity and specificity. However, for low-coverage datasets such as this one, it makes sense to relax some thresholds.

We will run InFusion using the following parameters:

$ infusion --min-fragments 3 -1 SRR201779_1.fastq.gz -2 SRR201779_2.fastq.gz [path_to_reference_index_directory]/infusion.cfg

After the run is finished we can examine the results in the file fusions.txt:

$ cd infusion_output
$ cut -f 2,3,5,6,10- fusions.txt | head -n 7
ref1    break_pos1      ref2    break_pos2      genes_1 genes_2 fusion_class
21      42880007        21      39817543        TMPRSS2 ERG     intra-chromosomal
16      85023909        12      123444869       ZDHHC7  ABCB9   inter-chromosomal
9       125622196       9       116299073       RC3H2   RGS3    intra-chromosomal
2       234749255       2       233421124       HJURP   EIF4E2  intra-chromosomal
11      17229393        11      12883795        PIK3C2A TEAD1   intra-chromosomal
2       234746299       2       99193606        HJURP   INPP4A  intra-chromosomal

The resulting file contains fusions that are known to be present in the VCaP cell line, such as TMPRSS2-ERG, ZDHHC7-ABCB9, RC3H2-RGS3 and others. Detailed information about the fusions can be found in the file fusions.detailed.txt.
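
To look up one particular gene pair instead of scanning the table by eye, the report can be filtered by column name. The sketch below assumes fusions.txt is tab-separated with a header row containing the columns genes_1 and genes_2, as the cut output above suggests, and that each of those fields holds a single gene name; find_fusion is a helper introduced here for illustration.

```shell
#!/bin/sh
# find_fusion FILE GENE1 GENE2: print the header line and every row whose
# genes_1 and genes_2 columns match the given gene names exactly.
find_fusion() {
    awk -F '\t' -v g1="$2" -v g2="$3" '
        NR == 1 {
            # Map each column name to its position so the filter does not
            # depend on the column order.
            for (i = 1; i <= NF; i++) col[$i] = i
            print
            next
        }
        $(col["genes_1"]) == g1 && $(col["genes_2"]) == g2
    ' "$1"
}

# Example call:
#   find_fusion infusion_output/fusions.txt TMPRSS2 ERG
```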

The expected analysis results (v0.7.2): fusions.txt / fusions.detailed.txt / run.log