km : a software for RNA-seq investigation using k-mer decomposition
This tool was developed to identify and quantify the occurence of single nucleotide variants, insertions, deletions and duplications in RNA-seq data. Contrary to most tools that try to report all variants in a complete genome, here we instead propose to focus the analysis on small regions of interest.
Given a reference sequence of interest (typically a few hundred base pairs) around a known or suspected mutation in a gene of interest, all possible sequences that can be be created between the two end k-mers according to the sequenced reads will be reported. A ratio of variant allele vs WT will be computed for each possible sequence constructed.
- Python 2.7.6 or later
- Jellyfish 2.1 or later (http://www.genome.umd.edu/jellyfish.html)
- (Optional) Matplotlib
A jellyfish database of quality and adapter trimmed sequences must first be prepared using the following script:
usage: prepare [-h] [-z] [-m MAX_RAM] outfile [infiles [infiles ...]] prepare: error: too few arguments ./prepare -z -m 16 sample.jf fastq/*.gz
usage: find_mutation [-h] [-c [COUNT]] [-p [RATIO]] [-g] [-v] [reference_fn [reference_fn ...]] jellyfish_fn ./find_mutation query_examples/NPM1.fa sample.jf
Here we are looking for a common 4-bp duplication that occurs in some leukemias, and that is especially troublesome to detect since it occurs a few base pairs from the start of the last exon. Most standard mapping techniques will miss this variant. Running the find_mutation command takes a few seconds and returns an output similar to this:
Database Query Type Variant name Ratio Expression Sequence Reference ratio Reference expression Reference sequence Info sample.jf NPM1 Insertion 93:/TCTG:93 0.528 9020.0 CCAAGAGGCTATTCAAGATCTCTGTCTGGCAGTGGAGGAAGTCTCTT 0.472 8076.8 CCAAGAGGCTATTCAAGATCTCTGGCAGTGGAGGAAGTCTCTT cluster 1 n=1 real 0m6.712s user 0m0.203s sys 0m0.312s
which shows that a TCTG insertion was found at position 93 of the refence sequence.
Each line represents a path that was constructed from the reference sequence.
- Database: name of the Jellyfish kmer table queried
- Query: name of the reference sequence examined
- Type: type of mutation found (Insertion, Deletion or Substitution). A Reference type used to identify path without mutation
- Variant name: A description of the modification in the format start_position:deleted_bases/inserted_bases:end_position
- Ratio: estimated ratio for the mutated allele represented by this path
- Expression: estimated expression level for the mutated allele (coverage)
- Sequence: sequence of the mutated path
- Reference ratio: estimated ratio of the reference allele
- Reference expression: estimated expression level for the reference
- Reference sequence: reference sequence used
- Info: supplementary information regarding the quantification method.
.. vs_ref means that each alternate path is compared in expression with the whole reference sequence.
.. cluster indicates that all alternate path in a subregion extending by k bases on each side of all overlapping mutations are considered at once to evaluate the expression of each
Using the -g argument, one can also obtain a coverage graph for the two variants, for example:
- Complete this list.