InFusion is a computational pipeline for discovery of chimeric transcripts and gene fusions from RNA-seq data. It is capable of detecting a large spectra of chimeric RNAs including alternatively spliced fusion isoforms and fusions that involve intronic, non-coding or even intergenic regions.
InFusion works with data from Illumina platform. It supports both single- and paired-end reads and is optimized to analyze reads in size range 36 - 150 bp. Analysis of data from other sequencing platforms or with different characteristics requires additional tuning of the pipeline settings.
The pipeline is implemented as a command line tool.
Reference dataset configuration
InFusion depends on a reference dataset, which consists of genome and transcriptome sequences, their indexes and gene annotations. Reference dataset is described in a configuration file, which contains paths to required reference items and several important parameters.
The reference dataset for human genome can be created automatically using script setup_reference_dataset. By default the script downloads and prepares all the configuration. However, it is also possible to use available annotation, reference with indexes instead of downloading and generating them by adjusting the parameters of the script:.
-gtf GTFFILE Path to annotations in GTF format. Make sure the chromosome names the same as in the reference genome. -f GENOMEFILE Reference genome sequence in FASTA format. -t TRANSCRIPTSFILE Transcript sequences (cDNA) in FASTA format. -r REPEATSTRACK Repeats track from UCSC. hg19 genome is assumed. -fIdx GENOMEINDEXFILE Genome sequence index Bowtie2 -tIdx TRANSCRIPTSINDEXFILE Transcript sequences index Bowtie2 -o OUTDIR Path to resulting reference dataset directory --num-threads NUMTHREADS Number of threads to use when running InFusion. Default equals to the number of available CPUs. --ens-ver ENSVERSION Ensembl release number used to create database, required only for downloading the data. Default is "68" -a INDEXERPATH Path to indexer (Default: bowtie2-build)
Finally, the configuration can be constructed manually. Below there is an example of a typical InFusion configuration file template with explanations:
# InFusion configuration file template # Path to reference sequence in FASTA format reference = [PATH] # Path to bowtie2 index of the reference bowtie2_index = [PATH] # Path to gene annotations in Ensembl GTF format transcripts = [PATH] # Path to the bowtie2 index of all transcript sequences transcriptome_index = [PATH] # Path to repeats in UCSC format (optional) repeats_track = [PATH] # Number of computational threads to use (optional) num_threads = 4 # Maximum number of multimapped genomic alignments (optional) num_genomic_alignments_per_read = 20 # Number of seed mismatches when aligning short reads (optional) num_seed_mismatches = 1 # Comma-separated names of the chromosomes that are ignored in the analysis (optional) ignore_chromosomes = MT # Minimum distance between local alignments in a SPLIT read local_alignment_in_read_tolerance = 2 # Maximum distance between local alignments in a SPLIT read local_alignment_in_read_max_overlap = 10 # Maximum intron size, everything bigger than that considered as possible abberation max_intron_size = 20000 # Annotation type indicates pattern for chromosome names. By default it's Ensembl (no "chr" prefix") # There is also support for Gencode (with "chr" prefix") annotation_type = Ensembl # Fusions that involve genes with this biotypes are filtered out # By default ensembl68 also excludes processed_transcript marker, however this can be configured excluded_biotypes = pseudogene,polymorphic_pseudogene
Note, when setting a path either use absolute path to the file or copy it (symbolic links should also work) into the directory where the configuration file is located
The typical output of InFusion includes the following files:
- fusions.txt - the concise report listing detected fusions that pass the filters
- fusions.detailed.txt - the detailed report listing detected fusions that pass the filters
- fusions.detailed.full.txt - the detailed report that includes all detected fusions (including those that have not passed the filter)
Fusions report is a tab-separated table. Each row describes one putative fusion event. The columns of the report are the following:
- #id Fusion identifier
- ref1 Reference sequence of the 5' fusion region
- break_pos1 Genomic breakpoint position of the first fusion region
- region1 Genomic coordinates of the first fusions region
- ref2 Reference sequence of the second fusion region
- break_pos2 Genomic breakpoint position of the second fusion region
- region2 Genomic coordinates of the second fusions region
- num_split Number of SPLIT reads spanning the fusion junction
- num_paired Number of paired BRIDGE reads supporting the fusion event
- num_split_with_pair Number of split reads which have a proper mate pair
- num_split_rescued Number of SPLIT reads that were rescued
- num_uniq_starts Unique Starts Rate metrics: proportion of SPLIT read alignments with unique coordinates (based on read start and end for 5' and 3' segments) and a number of SPLIT reads forming only one possible fusion
- pap_rate Properly Aligned Pairs metrics
- mean_split_pos Normalized mean split position in the supporting reads
- split_pos_std Standard deviation of normalized split position in the supporting reads
- homogeneity Homogeneity metrics - weight of the fusion in the metalcuster. Reported for both fusion parts (5'|3')
- coverage_context Detected number of reads, supporting only the regions forming the fusion. (5'|3')
- ssp Strand-specificity metrics - proportion of reads that were not mapped according to the annotated gene strand. Reported for both fusion parts (5'|3').
- fusion_class Type of the predicted fusion: inter-chromosomal, intra-chromosomal or read-through. Read-through is reported when fusion includes adjacent or intersecting genes.
- break_on_exon States "yes" if the breakpoint is located on exon border, otherwise "No". Reported for both fusion parts (5'|3')
- feature_1 Type of genomic region in the first fusion part (could be exon,intron or intergenic_region)
- gene_1 The corresponding gene for the first fusion part
- transcript_1 The corresponding transcripts for the first fusion part
- gene_1_strand Strand of the gene according to Ensembl annotation ( + : positive, - : negative)
- biotype_1 Biotype according to Ensembl annotation of the first fusion part
- expression_1 Expression of the first fusion gene (RPKM)
- feature_2 Type of genomic region in the first fusion part (could be exon,intron or intergenic_region)
- gene_2 The corresponding gene for the second fusion part
- transcript_2 The corresponding transcripts for the second fusion part
- gene_2_strand The strand of the gene according to Ensembl annotation ( + : positive, - : negative)
- biotype_2 Biotype according to Ensembl annotation of the second fusion part
- expression_2 Expression of the second fusion gene (RPKM)
- splice_motif Splice junction type (1 : GT/AG, 0 : non-canonical)
- filters Filters that were not passed by fusion event, otherwise the value is pass
Based on the provided options InFusion can balance between sensitivity and specificity of the discovery.
Example 1: highly specific discovery
By default InFusion options, such as for example minimum number of unique SPLIT reads and unique homogeneity weight are selected for high specificity. To get the list of confident fusions, one should also discard fusions which are supported only with multimapped reads. Finally, high threshold for the number of supporting fragments can be applied. In case of paired-end reads there should be both split and span pairs.
infusion --min-split-reads 3 --min-span-pairs 2 --min-fragments 4 -1 reads_1.fastq.gz -2 reads_2.fastq.gz infusion.cfg
Example 2: very sensitive discovery
To increase the sensitivity of fusion discovery it is beneficial to allow intergenic, intronic and non-coding fragments. Additionally, the thresholds for the number of supporting fragments can be lowered. Note that these limits depend on the total number of reads.
infusion --allow-intronic --allow-intergenic --allow-non-coding --min-split-reads 2 --min-span-pairs 0 --min-fragments 3 --min-unique-split-reads 0 -1 reads_1.fastq.gz -2 reads_2.fastq.gz infusion.cfg
Command line options
infusion [options] [-1 <r1>] [-2 <r2>] [-r <sr>] [--out-dir <path>] CONFIG
CONFIG path to InFusion configuration file -1 <r1> path to paired-end reads, upstream mates -2 <r2> path to paired-end reads, downstream mates -r <sr> path to single-end reads --out-dir <path> path to InFusion output folder --library <type> library protocol; possible options: "NA" - non-strand- specific, "FR" - strand-specific-forward, "RF" - strand-specific-reverse. Default is NA. --skip-finished force skipping pipeline steps if result data exists --keep-tmp keep intermediate analysis results --bin-dir <path> path to folder with InFusion required binaries. only required for debugging purposes. -h, --help show help -v, --version show program version
--skip-genomic-aln skip alignment to the genome after transcriptome alignment --skip-repeat-filter skip filtering of local alignments mapped to repeat regions --match-score <value> match score in local alignment. default: 2 --mismatch-penalty <value> mismatch penalty in local alignment. default: 2 -k <value>, --num-genomic-mapping-per-read <value> allow up to this number of genomic mappings per read. default: 20 --allow-trans-multimapped allow reads to be multimapped to transcriptome
--max-aln-overlap <value> maximum overlap size between local alignments forming a SPLIT candidates --anchor-length <value> minimum anchor length for local alignment to consider for breakpoint candidate generation. estimated as one third of a read length by default --score-percentage <value> minimum score percentage from max score to keep the local alignment for rescue step. default: 0.5 --rescued-max-mismatches <value> max edit distance to keep the local alignment for rescue step. default: 2 --no-pairs-from-genomic-aln skip search for discordant pairs from genomic alignment --skip-reads-rescue force skipping search for supporting reads from local alignments --seq-offset-for-paired-only-clusters <value> sequence offset for clusters consisting only from BRIDGE reads --split-bpc-weight <value> weight of a SPLIT read in a putative fusion as used to calculate fusion score. default: 1.0 --bridge-bpc-weight <value> weight of a BRIDGE read pair in a putative fusion as used to calculate fusion score. default: 1.0
--min-split-reads <value> minimum number of split read alignments to support a fusion --min-span-pairs <value> minimum number of spanning pairs to support a fusion --min-fragments <value> minimum number of fragments to support a fusion. in case of paired-end data this is a sum of split reads and span pairs supporting the fusion. in case of single-end data this equals to number of split reads --min-not-rescued <value> minimum number of not rescued split reads supporting the fusion --min-unique-alignment-rate <value> minimum proportion of unique split alignments supporting the fusion (includes multimapped reads) --min-unique-split-reads <value> minimum number of candidates originating from unique alignment --min-homogeneity-weight <value> minimum weight of a each fusion partner in the "metacluster". default: 0.1 --req-homogeneity-weight <value> required minimum weight of the fusion part which has the largest weight in the "metacluster". default: 0.4 --split-pos-deviation <value> max allowed deviation of mean normalized split position from 0.5 --allow-non-coding allow to form fusions between non-coding regions (includes ncRNAs etc.) --allow-all-biotypes allow to form fusions with all possible biotypes (including pseudogenes) --allow-intergenic allow to form fusions with one of the parts originating from an intergenic region.
InFusion pipeline consists of several steps, each of which is controlled by the set of the options. The overview of the pipeline is demonstrated on the figure below:
The pipeline starts with the alignment of the reads to the transcriptome and optionally to the genome in order to filter out reads which are mapped to "normal" genes with keeping track of unmapped reads (Step 1). The next step is the local alignment of reads to the genome (Step 2). The resulting local alignment are used to detect possible reads which span the fusion transcript (Step 3). The normal alignments are being processed to analyze gene coverage and compute insert size distribution during Step 4. If the data is paired-end the pipeline includes a procedure to detect those read pairs which span the fusion junction with unsequenced insert part. Next all detected reads that might support a fusion are clustered in Step 5. Putative fusions are further analyzed and filtered (Step 6). The results are reported in Step 7.
Comparing the results of fusion discovery tools
There are currently a lot of tools to detect fusions from RNA-seq data available. Some of these tools focus on specific tasks, such as for example detailed isoform detection. Additionally, according to our research, there could be significant differences between results of various tools. A common strategy in fusion discovery is to apply several tools and select for validation fusions detected by all algorithms.
InFusion toolkit provides a specific Python script to perform a detailed comparison of detected fusions. The script allows to compare output data of several fusion discovery tools and detect fusions that were reported by several tools. Currently the script supports the following tools:
Additionally it is possible to include additional tool based on setting custom format properties.
To run the script a configuration file is required. The file should provide the path to the results of each tool. Here's a typical example of a configuration file:
<infusion> name=InFusion_v0.7.1 fusions_output=infusion_output_v0.7.1/fusions.txt </infusion> <defuse> name=deFuse fusions_output=defuse_output/results.filtered.tsv </defuse> <chimerascan> name=ChimeraScan fusions_output=chimerascan_out/chimeras.bedpe </chimerascan> <fusioncatcher> name=fusionCatcher fusions_output=fusioncatcher_output/final-list_candidate-fusion-genes.GRCh37.txt </fusioncatcher>
If the tool is not present, it is possible to use custom format by setting result column delimiter and gene column numbers (ids). Here's example how to describe InFusion:
<custom> name=InFusion_v0.7.1 fusions_output=infusion_output_v0.7.1/fusions.txt sep="\t" gene1=9 gene2=10 </custom>
The script can be launched like this:
python ~/path-to-InFusion/comparison/run_comparison.py cmp.cfg
The results of the script are saved in the folder cmp_resutls by default. This folder includes file report.txt, which provides a table of fusions detected by tools along with a subgroup of fusions that were reported by at least by two tools. Additionally, these results are available in plots results_plot.svg and common_plot.svg.
Here's an example of a plot demonstrating all the fusions detected by at least 2 tools in the dataset by InFusion, deFuse, chimeraScan and fusonCatcher:
Additional options for comparison analysis are available in the script. Use --help option to check the details.