#Getting started#
Download
InFusion package
There are prebuilt binary packages of InFusion available for download.
Download link to the latest version: InFusion v0.8
Previous and build-candidate versions can be found here.
NOTE: prebuilt binaries are currently available only for 64-bit Linux systems. However, it is possible to build the toolkit from the source code. Refer to the section "Build from source code" for more information.
Reference datasets
Homo_sapiens.GRCh38 : based on Ensembl v.84
Homo_sapiens.GRCh37 : based on Ensembl v.68
Installation from binary package
InFusion depends on Python (v2.6 or higher) and the GLIBC library (v2.14 or higher); make sure both are installed.
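Both requirements can be checked from the command line before unpacking. The snippet below is a minimal sketch, not part of InFusion: the helper names version_ge and report are made up for illustration, and it assumes a `sort` that supports version sorting (`-V`, as in GNU coreutils) and a glibc-based system where `ldd --version` reports the library version.

```shell
# version_ge A B: succeed when dotted version A is greater than or equal to B.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# report NAME FOUND REQUIRED: print whether the detected version is sufficient.
# (Hypothetical helper, used only in this example.)
report() {
    if version_ge "${2:-0}" "$3"; then
        echo "$1 $2: OK (need >= $3)"
    else
        echo "$1 ${2:-not detected}: need >= $3"
    fi
}

# `python --version` prints "Python X.Y.Z"; Python 2 writes it to stderr.
py_ver=$(python --version 2>&1 | awk 'NR == 1 && $1 == "Python" {print $2}')
# On glibc systems the first line of `ldd --version` ends with the glibc version.
glibc_ver=$(ldd --version 2>/dev/null | awk 'NR == 1 {print $NF}')

report Python "$py_ver" 2.6
report GLIBC "$glibc_ver" 2.14
```

If either line reports a missing or older version, install or update that component before running the binary package.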
To install the binary package just unpack it.
For example:
$ tar -xf InFusion-0.6.2-linux-x86_64.tar.gz
$ cd InFusion-0.6.2
$ ./infusion -v
To make InFusion available from the environment either add the InFusion folder to $PATH or create an alias in your bash profile:
alias infusion='/home/kokonech/tools/InFusion-0.6.2/infusion'
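For the $PATH variant, a small helper in the bash profile avoids adding the same folder twice when the profile is sourced repeatedly. This is only a sketch: add_to_path and the INFUSION_HOME variable are hypothetical names, and the default folder is an example location that should be replaced with the real install path.

```shell
# add_to_path DIR: append DIR to PATH unless it is already listed there.
# (Hypothetical helper, not shipped with InFusion.)
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                         # already present, nothing to do
        *) PATH="$PATH:$1"; export PATH ;;
    esac
}

# INFUSION_HOME is an assumed variable; the default below is only an example path.
add_to_path "${INFUSION_HOME:-$HOME/tools/InFusion-0.6.2}"
```

After opening a new shell, `infusion -v` should work from any directory.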
Build from source code
The general build process is explained in the README file.
Note that InFusion uses some features of C++11 and TR1, which are only available in gcc 4.4 or higher, so make sure your compiler is up to date.
Below is an example of how to build InFusion on a clean Ubuntu 12.04 system; it should be easy to adapt the process to other Unix-based distributions.
###Install dependencies###
Install gcc, zlib and boost-dev libraries:
$ sudo apt-get install build-essential libz-dev libboost-all-dev cmake
Install Bowtie2 (>= 2.0.2) and Samtools (>= 0.18).
Make sure both executables are available via $PATH.
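A quick way to confirm that both executables are visible is to look them up with the shell's `command -v` builtin. The function below is an illustrative sketch (check_tools is a made-up name); it only tests that the executables can be found and does not verify their versions.

```shell
# check_tools NAME...: report where each executable is found and
# return non-zero if at least one of them is missing from PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if tool_path=$(command -v "$tool" 2>/dev/null); then
            echo "found $tool at $tool_path"
        else
            echo "missing $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

check_tools bowtie2 samtools ||
    echo "Install the missing tools or extend \$PATH before building." >&2
```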
###Clone and build the code###
$ git clone https://kokonech@bitbucket.org/kokonech/infusion.git
$ cd infusion
$ mkdir -p build/Release
$ cd build/Release
$ cmake ../../src -DCMAKE_BUILD_TYPE=Release
$ make
First Run
InFusion is launched via a Python script called infusion, located in the home folder of the program.
Before using the toolkit, you need to obtain a reference index for your genome and create a configuration file describing the index.
Reference index includes:
- reference genome sequence in FASTA format and its index
- transcriptome annotations in Ensembl GTF format
- cDNA sequences in FASTA format and their index
- repeat annotations in UCSC format (optional)
The reference index can be created and configured automatically with the Python script setup_reference_dataset.py, which is located in the home folder of the toolkit. The script can reuse existing data from your machine (such as a genome sequence or annotations) or download all required data directly from the web. It also creates a configuration file, which InFusion uses to locate all required resources.
$ python setup_reference_dataset.py -o [path_to_index_folder]
Alternatively, a prebuilt human reference dataset with a configuration file can be downloaded from here.
The typical input data for InFusion are the raw sequencing reads from an RNA-seq experiment.
For paired-end reads, InFusion is launched using the following command:
infusion -1 path_to_upstream_mates -2 path_to_downstream_mates [configuration_file_path]
For single-end reads the command will be different:
infusion -r path_to_reads [configuration_file_path]
By default the results will be located in the subfolder infusion_output of the current working directory. Use option --out-dir to specify the output folder.
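When the same analysis is repeated for several samples, it can help to generate the command line from one place. The helper below is a hedged sketch: build_infusion_cmd is a hypothetical function, and it assumes that --out-dir takes the folder as a separate argument. It only prints the command, so it can be reviewed before it is run, and it does not handle paths containing spaces.

```shell
# build_infusion_cmd OUT_DIR CONFIG READS1 [READS2]
# Print the infusion command for single-end input (one reads file)
# or paired-end input (two reads files).
# Hypothetical helper; assumes the "--out-dir DIR" argument form.
build_infusion_cmd() {
    out_dir=$1
    config=$2
    reads1=$3
    reads2=${4:-}
    if [ -n "$reads2" ]; then
        echo "infusion --out-dir $out_dir -1 $reads1 -2 $reads2 $config"
    else
        echo "infusion --out-dir $out_dir -r $reads1 $config"
    fi
}
```

For example, `build_infusion_cmd results/sampleA infusion.cfg reads_1.fastq reads_2.fastq` prints a paired-end command that writes to results/sampleA, while leaving out the last argument yields the single-end form.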
The detected fusions are written to the files fusions.txt and fusions.detailed.txt (detailed report). The format of the output files is described in the User Manual.
Simulated sample datasets
Two simulated datasets are available for download. They can be used to verify that InFusion is properly installed and to learn how to use it.
The datasets were created using the simulation pipeline that is part of the InFusion source code package.
###DataSet 1: InFusion_test_dataset_01###
This simulated dataset consists of reads from artificial fusion transcripts for C. elegans. The dataset can be downloaded here.
After downloading, unpack the dataset:
$ tar -xf InFusion_test_dataset_01.tar.gz
$ cd InFusion_test_dataset_01
$ infusion -1 reads_1.fastq -2 reads_2.fastq reference_dataset/infusion.cfg
The results are written to the folder infusion_output, which is created inside the dataset directory. The fusions simulated in this experiment are listed in the file simulated_fusions.txt.
The expected analysis results (v0.7.2): fusions.txt / fusions.detailed.txt / run.log
###DataSet 2: InFusion_test_dataset_02###
The dataset is based on reads simulated from human fusion transcripts. It is one of 100 datasets that were used to assess the performance of InFusion. The dataset includes evidence for 100 fusion transcripts of different types.
To run InFusion on this dataset it is first required to create or obtain a reference index for human.
We will create a reference index using the script setup_reference_dataset.py. A prebuilt reference dataset can also be downloaded from here.
The script creates the reference index automatically, either by downloading the data from the internet or by reusing existing resources from your computer. It also creates a configuration file, which describes the index. To list all options of the script, use the following command:
$ python setup_reference_dataset.py -h
In this example we will create the index from scratch. This step can take several hours. For example, on an 8-core 2.2 GHz Intel machine it takes approximately 4 hours to download the data and index all sequences. Once the index is created, it can be reused for future analyses.
Run this command to create the index using Ensembl database version 68 as reference:
$ python setup_reference_dataset.py --ens-ver 68 -o [path_to_reference_index_directory]
Once the reference dataset is available, we can launch InFusion on the test data.
$ tar -xf InFusion_test_dataset_02.tar.gz
$ cd InFusion_test_dataset_02/
$ infusion -1 reads_1.fastq.gz -2 reads_2.fastq.gz [path_to_reference_index_directory]/infusion.cfg
The results can be found in folder infusion_output which is located in the same directory as the input data.
The expected analysis results (v0.7.2): fusions.txt / fusions.detailed.txt / run.log
Public RNA-seq dataset: VCaP cell line
In this section we will go through an example of running InFusion on a public RNA-seq dataset. We will use RNA-seq data from the VCaP cell line, which was referenced in the following publication.
The sequencing reads in FASTQ format are available for download from here.
The original SRA file: SRX061854
It is assumed that the InFusion reference dataset for the human genome is already available. If not, follow the instructions in the section on simulated DataSet 2 above.
Now we will analyze the public RNA-seq dataset. The default InFusion parameters are tuned to balance sensitivity and specificity. However, for low-coverage datasets such as this one, it makes sense to relax some thresholds.
We will run InFusion using the following parameters:
$ infusion --min-fragments 3 -1 SRR201779_1.fastq.gz -2 SRR201779_2.fastq.gz [path_to_reference_index_directory]/infusion.cfg
After the run is finished we can examine the results in the file fusions.txt:
$ cd infusion_output
$ cut -f 2,3,5,6,10- fusions.txt | head -n 7
ref1  break_pos1  ref2  break_pos2  genes_1  genes_2  fusion_class
21    42880007    21    39817543    TMPRSS2  ERG      intra-chromosomal
16    85023909    12    123444869   ZDHHC7   ABCB9    inter-chromosomal
9     125622196   9     116299073   RC3H2    RGS3     intra-chromosomal
2     234749255   2     233421124   HJURP    EIF4E2   intra-chromosomal
11    17229393    11    12883795    PIK3C2A  TEAD1    intra-chromosomal
2     234746299   2     99193606    HJURP    INPP4A   intra-chromosomal
The resulting file contains fusions that are known to be present in the VCaP cell line, for example TMPRSS2-ERG, ZDHHC7-ABCB9 and RC3H2-RGS3. Detailed information about the fusions can be found in the file fusions.detailed.txt.
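To narrow the report down to one class of events, the rows can be filtered on the fusion_class column. The function below is an illustrative sketch, not an InFusion feature: filter_fusions is a made-up name, and it assumes that fusions.txt is tab-separated with a header row that names a fusion_class column, as the excerpt from the VCaP run suggests. The column is located by its header name rather than by position.

```shell
# filter_fusions FILE CLASS: print the header line plus every row whose
# fusion_class column equals CLASS (e.g. inter-chromosomal).
# Assumes a tab-separated file with a header naming a "fusion_class" column.
filter_fusions() {
    awk -F '\t' -v cls="$2" '
        NR == 1 {
            # locate the fusion_class column by its header name
            for (i = 1; i <= NF; i++) if ($i == "fusion_class") col = i
            if (!col) { print "no fusion_class column" > "/dev/stderr"; exit 1 }
            print
            next
        }
        $col == cls
    ' "$1"
}
```

For example, `filter_fusions fusions.txt inter-chromosomal` keeps only the fusions whose partners lie on different chromosomes.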
The expected analysis results (v0.7.2): fusions.txt / fusions.detailed.txt / run.log