1. Konstantin Okonechnikov
  2. InFusion

Wiki

Clone wiki

InFusion / User Manual

Introduction

InFusion is a computational pipeline for discovery of chimeric transcripts and gene fusions from RNA-seq data. It is capable of detecting a large spectra of chimeric RNAs including alternatively spliced fusion isoforms and fusions that involve intronic, non-coding or even intergenic regions.

InFusion works with data from Illumina platform. It supports both single- and paired-end reads and is optimized to analyze reads in size range 36 - 150 bp. Analysis of data from other sequencing platforms or with different characteristics requires additional tuning of the pipeline settings.

The pipeline is implemented as a command line tool.

Reference dataset configuration

InFusion depends on a reference dataset, which consists of genome and transcriptome sequences, their indexes and gene annotations. Reference dataset is described in a configuration file, which contains paths to required reference items and several important parameters.

The reference dataset for human genome can be created automatically using script setup_reference_dataset. By default the script downloads and prepares all the configuration. However, it is also possible to use available annotation, reference with indexes instead of downloading and generating them by adjusting the parameters of the script:.

-gtf GTFFILE          Path to annotations in GTF format. Make sure the chromosome names the same as in the reference genome.
-f GENOMEFILE         Reference genome sequence in FASTA format.
-t TRANSCRIPTSFILE    Transcript sequences (cDNA) in FASTA format.
-r REPEATSTRACK       Repeats track from UCSC. hg19 genome is assumed.
-fIdx GENOMEINDEXFILE   Genome sequence index Bowtie2 
-tIdx TRANSCRIPTSINDEXFILE      Transcript sequences index Bowtie2
-o OUTDIR             Path to resulting reference dataset directory
--num-threads NUMTHREADS  Number of threads to use when running InFusion. Default equals to the number of available CPUs.
--ens-ver ENSVERSION  Ensembl release number used to create database, required only for downloading the data. Default is "68"
-a INDEXERPATH        Path to indexer (Default: bowtie2-build)

Finally, the configuration can be constructed manually. Below there is an example of a typical InFusion configuration file template with explanations:

# InFusion configuration file template

# Path to reference sequence in FASTA format
reference = [PATH]

# Path to bowtie2 index of the reference
bowtie2_index = [PATH]

# Path to gene annotations in Ensembl GTF format
transcripts = [PATH]

# Path to the bowtie2 index of all transcript sequences 
transcriptome_index = [PATH]

# Path to repeats in UCSC format (optional)
repeats_track = [PATH]

# Number of computational threads to use (optional)
num_threads = 4

# Maximum number of multimapped genomic alignments (optional)
num_genomic_alignments_per_read = 20

# Number of seed mismatches when aligning short reads (optional)
num_seed_mismatches = 1

# Comma-separated names of the chromosomes that are ignored in the analysis (optional)
ignore_chromosomes = MT

# Minimum distance between local alignments in a SPLIT read
local_alignment_in_read_tolerance = 2

# Maximum distance between local alignments in a SPLIT read
local_alignment_in_read_max_overlap = 10

# Maximum intron size, everything bigger than that considered as possible abberation
max_intron_size = 20000

# Annotation type indicates pattern for chromosome names. By default it's Ensembl (no "chr" prefix")
# There is also support for Gencode (with "chr" prefix")
annotation_type = Ensembl

# Fusions that involve genes with this biotypes are filtered out
# By default ensembl68 also excludes processed_transcript marker, however this can be configured
excluded_biotypes = pseudogene,polymorphic_pseudogene

Note, when setting a path either use absolute path to the file or copy it (symbolic links should also work) into the directory where the configuration file is located

Program output

The typical output of InFusion includes the following files:

  • fusions.txt - the concise report listing detected fusions that pass the filters
  • fusions.detailed.txt - the detailed report listing detected fusions that pass the filters
  • fusions.detailed.full.txt - the detailed report that includes all detected fusions (including those that have not passed the filter)

Fusions report is a tab-separated table. Each row describes one putative fusion event. The columns of the report are the following:

  1. #id Fusion identifier
  2. ref1 Reference sequence of the 5' fusion region
  3. break_pos1 Genomic breakpoint position of the first fusion region
  4. region1 Genomic coordinates of the first fusions region
  5. ref2 Reference sequence of the second fusion region
  6. break_pos2 Genomic breakpoint position of the second fusion region
  7. region2 Genomic coordinates of the second fusions region
  8. num_split Number of SPLIT reads spanning the fusion junction
  9. num_paired Number of paired BRIDGE reads supporting the fusion event
  10. num_split_with_pair Number of split reads which have a proper mate pair
  11. num_split_rescued Number of SPLIT reads that were rescued
  12. num_uniq_starts Unique Starts Rate metrics: proportion of SPLIT read alignments with unique coordinates (based on read start and end for 5' and 3' segments) and a number of SPLIT reads forming only one possible fusion
  13. pap_rate Properly Aligned Pairs metrics
  14. mean_split_pos Normalized mean split position in the supporting reads
  15. split_pos_std Standard deviation of normalized split position in the supporting reads
  16. homogeneity Homogeneity metrics - weight of the fusion in the metalcuster. Reported for both fusion parts (5'|3')
  17. coverage_context Detected number of reads, supporting only the regions forming the fusion. (5'|3')
  18. ssp Strand-specificity metrics - proportion of reads that were not mapped according to the annotated gene strand. Reported for both fusion parts (5'|3').
  19. fusion_class Type of the predicted fusion: inter-chromosomal, intra-chromosomal or read-through. Read-through is reported when fusion includes adjacent or intersecting genes.
  20. break_on_exon States "yes" if the breakpoint is located on exon border, otherwise "No". Reported for both fusion parts (5'|3')
  21. feature_1 Type of genomic region in the first fusion part (could be exon,intron or intergenic_region)
  22. gene_1 The corresponding gene for the first fusion part
  23. transcript_1 The corresponding transcripts for the first fusion part
  24. gene_1_strand Strand of the gene according to Ensembl annotation ( + : positive, - : negative)
  25. biotype_1 Biotype according to Ensembl annotation of the first fusion part
  26. expression_1 Expression of the first fusion gene (RPKM)
  27. feature_2 Type of genomic region in the first fusion part (could be exon,intron or intergenic_region)
  28. gene_2 The corresponding gene for the second fusion part
  29. transcript_2 The corresponding transcripts for the second fusion part
  30. gene_2_strand The strand of the gene according to Ensembl annotation ( + : positive, - : negative)
  31. biotype_2 Biotype according to Ensembl annotation of the second fusion part
  32. expression_2 Expression of the second fusion gene (RPKM)
  33. splice_motif Splice junction type (1 : GT/AG, 0 : non-canonical)
  34. filters Filters that were not passed by fusion event, otherwise the value is pass

Usage examples

Based on the provided options InFusion can balance between sensitivity and specificity of the discovery.

Example 1: highly specific discovery

By default InFusion options, such as for example minimum number of unique SPLIT reads and unique homogeneity weight are selected for high specificity. To get the list of confident fusions, one should also discard fusions which are supported only with multimapped reads. Finally, high threshold for the number of supporting fragments can be applied. In case of paired-end reads there should be both split and span pairs.

Example:

infusion --min-split-reads 3 --min-span-pairs 2 --min-fragments 4 -1 reads_1.fastq.gz -2 reads_2.fastq.gz infusion.cfg

Example 2: very sensitive discovery

To increase the sensitivity of fusion discovery it is beneficial to allow intergenic, intronic and non-coding fragments. Additionally, the thresholds for the number of supporting fragments can be lowered. Note that these limits depend on the total number of reads.

Example:

infusion  --allow-intronic --allow-intergenic --allow-non-coding --min-split-reads 2 --min-span-pairs 0 --min-fragments 3 --min-unique-split-reads 0 -1 reads_1.fastq.gz -2 reads_2.fastq.gz infusion.cfg 

Command line options

Usage

infusion [options] [-1 <r1>] [-2 <r2>] [-r <sr>] [--out-dir <path>] CONFIG

General

   CONFIG                path to InFusion configuration file
   -1 <r1>               path to paired-end reads, upstream mates
   -2 <r2>               path to paired-end reads, downstream mates
   -r <sr>               path to single-end reads
   --out-dir <path>      path to InFusion output folder
   --library <type>      library protocol; possible options: "NA" - non-strand-
                         specific, "FR" - strand-specific-forward, "RF" -
                         strand-specific-reverse. Default is NA.
   --skip-finished       force skipping pipeline steps if result data exists
   --keep-tmp            keep intermediate analysis results
   --bin-dir <path>      path to folder with InFusion required binaries. only
                         required for debugging purposes.
   -h, --help            show help
   -v, --version         show program version

Alignment

  --skip-genomic-aln    skip alignment to the genome after transcriptome
                        alignment
  --skip-repeat-filter  skip filtering of local alignments mapped to repeat
                        regions
  --match-score <value>
                        match score in local alignment. default: 2
  --mismatch-penalty <value>
                        mismatch penalty in local alignment. default: 2
  -k <value>, --num-genomic-mapping-per-read <value>
                        allow up to this number of genomic mappings per read.
                        default: 20
  --allow-trans-multimapped
                        allow reads to be multimapped to transcriptome

Fusion detection

  --max-aln-overlap <value>
                        maximum overlap size between local alignments forming
                        a SPLIT candidates
  --anchor-length <value>
                        minimum anchor length for local alignment to consider
                        for breakpoint candidate generation. estimated as one
                        third of a read length by default
  --score-percentage <value>
                        minimum score percentage from max score to keep the
                        local alignment for rescue step. default: 0.5
  --rescued-max-mismatches <value>
                        max edit distance to keep the local alignment for
                        rescue step. default: 2
  --no-pairs-from-genomic-aln
                        skip search for discordant pairs from genomic
                        alignment
  --skip-reads-rescue   force skipping search for supporting reads from local
                        alignments
  --seq-offset-for-paired-only-clusters <value>
                        sequence offset for clusters consisting only from
                        BRIDGE reads
  --split-bpc-weight <value>
                        weight of a SPLIT read in a putative fusion as used to
                        calculate fusion score. default: 1.0
  --bridge-bpc-weight <value>
                        weight of a BRIDGE read pair in a putative fusion as
                        used to calculate fusion score. default: 1.0

Filtering

  --min-split-reads <value>
                        minimum number of split read alignments to support a
                        fusion
  --min-span-pairs <value>
                        minimum number of spanning pairs to support a fusion
  --min-fragments <value>
                        minimum number of fragments to support a fusion. in
                        case of paired-end data this is a sum of split reads
                        and span pairs supporting the fusion. in case of
                        single-end data this equals to number of split reads
  --min-not-rescued <value>
                        minimum number of not rescued split reads supporting
                        the fusion
  --min-unique-alignment-rate <value>
                        minimum proportion of unique split alignments
                        supporting the fusion (includes multimapped reads)
  --min-unique-split-reads <value>
                        minimum number of candidates originating from unique
                        alignment
  --min-homogeneity-weight <value>
                        minimum weight of a each fusion partner in the
                        "metacluster". default: 0.1
 --req-homogeneity-weight <value>
                        required minimum weight of the fusion part which has
                        the largest weight in the "metacluster". default: 0.4
  --split-pos-deviation <value>
                        max allowed deviation of mean normalized split
                        position from 0.5
  --allow-non-coding    allow to form fusions between non-coding regions
                        (includes ncRNAs etc.)
  --allow-all-biotypes  allow to form fusions with all possible biotypes
                        (including pseudogenes)
  --allow-intergenic    allow to form fusions with one of the parts
                        originating from an intergenic region.

Algorithm overview

InFusion pipeline consists of several steps, each of which is controlled by the set of the options. The overview of the pipeline is demonstrated on the figure below:

InFusion pipeline.png

The pipeline starts with the alignment of the reads to the transcriptome and optionally to the genome in order to filter out reads which are mapped to "normal" genes with keeping track of unmapped reads (Step 1). The next step is the local alignment of reads to the genome (Step 2). The resulting local alignment are used to detect possible reads which span the fusion transcript (Step 3). The normal alignments are being processed to analyze gene coverage and compute insert size distribution during Step 4. If the data is paired-end the pipeline includes a procedure to detect those read pairs which span the fusion junction with unsequenced insert part. Next all detected reads that might support a fusion are clustered in Step 5. Putative fusions are further analyzed and filtered (Step 6). The results are reported in Step 7.

Comparing the results of fusion discovery tools

There are currently a lot of tools to detect fusions from RNA-seq data available. Some of these tools focus on specific tasks, such as for example detailed isoform detection. Additionally, according to our research, there could be significant differences between results of various tools. A common strategy in fusion discovery is to apply several tools and select for validation fusions detected by all algorithms.

InFusion toolkit provides a specific Python script to perform a detailed comparison of detected fusions. The script allows to compare output data of several fusion discovery tools and detect fusions that were reported by several tools. Currently the script supports the following tools:

Additionally it is possible to include additional tool based on setting custom format properties.

Note: the script requires additional Python libraries matplotlib and numpy.

To run the script a configuration file is required. The file should provide the path to the results of each tool. Here's a typical example of a configuration file:

<infusion>
name=InFusion_v0.7.1
fusions_output=infusion_output_v0.7.1/fusions.txt
</infusion>

<defuse>
name=deFuse
fusions_output=defuse_output/results.filtered.tsv
</defuse>

<chimerascan>
name=ChimeraScan
fusions_output=chimerascan_out/chimeras.bedpe
</chimerascan>

<fusioncatcher>
name=fusionCatcher
fusions_output=fusioncatcher_output/final-list_candidate-fusion-genes.GRCh37.txt
</fusioncatcher>

If the tool is not present, it is possible to use custom format by setting result column delimiter and gene column numbers (ids). Here's example how to describe InFusion:

<custom>
name=InFusion_v0.7.1
fusions_output=infusion_output_v0.7.1/fusions.txt
sep="\t"
gene1=9
gene2=10
</custom>

The script can be launched like this:

python ~/path-to-InFusion/comparison/run_comparison.py cmp.cfg

The results of the script are saved in the folder cmp_resutls by default. This folder includes file report.txt, which provides a table of fusions detected by tools along with a subgroup of fusions that were reported by at least by two tools. Additionally, these results are available in plots results_plot.svg and common_plot.svg.

Here's an example of a plot demonstrating all the fusions detected by at least 2 tools in the dataset by InFusion, deFuse, chimeraScan and fusonCatcher:

validated_plot.png

Additional options for comparison analysis are available in the script. Use --help option to check the details.

A typical example showing the comparison in the analysis of VCAP RNA-seq data, described in the "Getting started block" can be downloaded from here.

Updated