Pipeline for the identification of LncRNAs using PacBio data

Workflow for human and mouse PacBio sequencing data:

Mapping: the pipeline uses GMAP to map sequencing data (fa/fq) to the reference genome
Extract aligned fasta sequences: fasta sequences are obtained from GMAP output gtf
CPAT classification: CPAT is used to discriminate between coding and non-coding transcripts
BLASR alignment: sequences are aligned to Gencode transcripts (both coding and lncRNA) to discard known features
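
The final step discards sequences that match known Gencode features. A minimal sketch of that filtering idea follows, assuming the IDs of sequences with a Gencode hit have already been collected from the alignment output. The function name, the in-memory approach and the header layout are illustrative assumptions, not the pipeline's actual code.

```python
def filter_fasta(lines, known_ids):
    """Yield FASTA lines for records whose ID is not in known_ids.

    The record ID is taken as the first whitespace-delimited token
    after '>' (an assumption about the header layout).
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            # Keep the record only if it had no hit against known features.
            keep = record_id not in known_ids
        if keep:
            yield line


# Toy input: read1 is assumed to have matched a Gencode transcript.
records = [
    ">read1 length=12\n", "ACGTACGTACGT\n",
    ">read2 length=8\n", "GGGGCCCC\n",
]
novel = list(filter_fasta(records, known_ids={"read1"}))
```

Because the function is a generator over lines, it can also be fed an open file handle instead of a list.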

Workflow for PacBio sequencing data from species with no reference genome:

PLEK is used to discriminate between coding and non-coding transcripts in input fa/fq

SOFTWARE REQUIREMENT

R
CPAT (version 1.2.2; a pre-compiled version is released with the ncrna_pipeline). 'cpat.py' should be added to $PATH, and the 'CPAT-1.2.2/lib/python2.7/site-packages' folder should be added to $PYTHONPATH
PLEK should be installed and added to $PATH. Download and installation instructions for PLEK are available here: http://sourceforge.net/projects/plek/files/
GMAP (hg19 and mm10 indices should be available) - Tested on GMAP version 2014-03-28
BLASR
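
Before running the pipeline it can help to confirm that the required tools are reachable on $PATH. The sketch below is one way to do that with the Python standard library. Apart from 'cpat.py', which is named above, the executable names are assumptions and may differ on your system (for example, the PLEK entry point is assumed to be 'PLEK.py').

```python
import shutil

# 'cpat.py' is named in the requirements; the other executable
# names are assumptions and may need adjusting for your install.
REQUIRED_TOOLS = ["R", "cpat.py", "PLEK.py", "gmap", "blasr"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of tools that cannot be found on $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    for tool in missing_tools():
        print("not found on $PATH: " + tool)
```

This check only tests that an executable exists; it does not verify versions or that the GMAP indices are installed.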

PYTHON PREREQUISITE

The Python dependencies for the pipeline are:
* sqlite3
* numpy
* cython
* biopython

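You can install the third-party packages via the following commands (sqlite3 is part of the Python standard library and needs no separate installation). NOTE: it is recommended that you activate your virtual environment first:

pip install numpy
pip install cython
pip install biopython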
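
To confirm that the dependencies are importable in the active environment, a quick check along the following lines may be useful. It is a sketch rather than part of the pipeline. Note that the import names differ from the package names in two cases: biopython is imported as 'Bio' and cython as 'Cython'.

```python
import importlib.util

# Import names, not pip package names:
# biopython -> 'Bio', cython -> 'Cython'.
REQUIRED_MODULES = ["sqlite3", "numpy", "Cython", "Bio"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be located for import."""
    return [name for name in modules
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for name in missing_modules():
        print("missing Python module: " + name)
```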

INSTALLATION

We recommend that you set up and activate a virtual environment before installation. See here for installation details.

The 'ncrna_pipeline' folder should be added to $PATH. All files in this folder (including data_config) should be made executable
Paths to the data files (Gencode and genome fasta files) and to the CPAT 'dat' folder must be set in 'data_config'
Links to download Gencode fasta files (example for version 19, human):
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.lncRNA_transcripts.fa.gz
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.pc_transcripts.fa.gz
A genome map database for GMAP (hg19) can be found here: http://research-pub.gene.com/gmap/

USAGE

ncrna_pipeline -f/--fasta <fasta/fastq_file> -p/--processors <n processors> -c/--classifier <cpat/plek, default is cpat>

If the classifier is cpat, set the organism with -o/--organism [human/mouse] (default is 'human')

Example usage:

ncrna_pipeline -f sample.fq -p 8 -c plek
(only performs PLEK noncoding prediction on raw reads and outputs a list of putative noncoding transcripts and a filtered fasta)

ncrna_pipeline -f sample.fa -p 8 -c cpat -o human
ncrna_pipeline -f sample.fa -p 8 -c cpat -o mouse
(the fasta is mapped to the reference genome, the output gtf is converted to fasta, which is filtered against Gencode [both coding and noncoding], and the prediction is performed using CPAT. The output is the same as for the PLEK pipeline)
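
For readers who want to wrap or script the command line, the documented flags can be mirrored with argparse as sketched below. This is an illustration of the interface described in this section, not the pipeline's actual parser. The default of 1 for the processor count is an assumption, since no default is documented.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the flags documented under USAGE."""
    parser = argparse.ArgumentParser(prog="ncrna_pipeline")
    parser.add_argument("-f", "--fasta", required=True,
                        help="input fasta/fastq file")
    # The default of 1 is an assumption; the README states no default.
    parser.add_argument("-p", "--processors", type=int, default=1,
                        help="number of processors")
    parser.add_argument("-c", "--classifier", choices=["cpat", "plek"],
                        default="cpat", help="classifier (default: cpat)")
    parser.add_argument("-o", "--organism", choices=["human", "mouse"],
                        default="human",
                        help="organism, used with cpat (default: human)")
    return parser


# Equivalent to: ncrna_pipeline -f sample.fa -p 8 -c cpat -o mouse
args = build_parser().parse_args(
    ["-f", "sample.fa", "-p", "8", "-c", "cpat", "-o", "mouse"])
```

Restricting the classifier and organism to fixed choices makes a mistyped value fail immediately instead of partway through a run.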
