Pipeline for the identification of LncRNAs using PacBio data
Workflow for human and mouse PacBio sequencing data:
- Mapping: the pipeline uses GMAP to map sequencing data (fa/fq) to the reference genome
- Extract aligned fasta sequences: fasta sequences are obtained from GMAP output gtf
- CPAT classification: CPAT is used to discriminate between coding and non-coding transcripts
- BLASR alignment: sequencing data is aligned to Gencode (both coding and lncRNA transcripts) to discard known features
Workflow for PacBio sequencing data from species with no reference genome:
- PLEK is used to discriminate between coding and non-coding transcripts in input fa/fq
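Both workflows end with a list of putative non-coding transcripts and a FASTA file restricted to them. The sketch below illustrates that last filtering step in plain Python. It is not the pipeline's own code: the function names (`read_fasta`, `record_id`, `filter_fasta`) are hypothetical, and it assumes a transcript ID is the first whitespace-delimited token of the FASTA header.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def record_id(header):
    """Return the first whitespace-delimited token of a header (assumed to be the ID)."""
    tokens = header.split()
    return tokens[0] if tokens else ""


def filter_fasta(lines, keep_ids):
    """Return the records whose ID is in keep_ids, in their original order."""
    keep = set(keep_ids)
    return [(h, s) for h, s in read_fasta(lines) if record_id(h) in keep]


if __name__ == "__main__":
    example = [">t1 some description\n", "ACGT\n", "AC\n", ">t2\n", "GGGG\n"]
    for header, seq in filter_fasta(example, {"t1"}):
        print(">" + header)
        print(seq)
```

To filter a real file, pass an open file handle as `lines` and the set of IDs reported as non-coding as `keep_ids`.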
SOFTWARE REQUIREMENT
- R
- CPAT (version 1.2.2) [A pre-compiled version is released with the ncrna_pipeline]. 'cpat.py' should be added to $PATH, while 'CPAT-1.2.2/lib/python2.7/site-packages' folder should be added to $PYTHONPATH
- PLEK should be installed and added to $PATH. Download and installation instructions for PLEK are available here: http://sourceforge.net/projects/plek/files/
- GMAP (hg19 and mm10 indices should be available) - Tested on GMAP version 2014-03-28
- BLASR
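Before running the pipeline it can help to confirm that the tools listed above are reachable through $PATH. The following is a minimal sketch, assuming Python 3.3 or later for `shutil.which`. Only `cpat.py` is named explicitly in the requirements; the other executable names (`R`, `PLEK.py`, `gmap`, `blasr`) are assumptions about how each tool is installed and may need adjusting.

```python
import shutil

# 'cpat.py' comes from the requirements above. The remaining names are
# assumptions about the installed executables; edit them to match your system.
REQUIRED_TOOLS = ["R", "cpat.py", "PLEK.py", "gmap", "blasr"]


def missing_tools(names):
    """Return the names in `names` that cannot be found on $PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on $PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on $PATH.")
```

This checks only that each executable can be located. It does not verify tool versions or the presence of the GMAP hg19 and mm10 indices.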
PYTHON PREREQUISITE
The Python dependencies for the pipeline are:
- sqlite3
- numpy
- cython
- biopython
You can install them via the following commands (NOTE: it is recommended that you activate your virtual environment first):
pip install numpy
pip install biopython
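To check that the Python dependencies can be imported from the active environment, something like the following may be useful. It is a sketch that assumes a Python 3.4 or later interpreter (for `importlib.util.find_spec`). Note that two import names differ from the package names: biopython is imported as `Bio` and cython as `Cython`.

```python
import importlib.util

# Import names, not pip package names: biopython -> "Bio", cython -> "Cython".
REQUIRED_MODULES = ["sqlite3", "numpy", "Cython", "Bio"]


def missing_modules(names):
    """Return the top-level module names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing Python modules: " + ", ".join(missing))
    else:
        print("All Python dependencies were found.")
```

`find_spec` only locates a module without importing it, so the check is fast but will not catch a broken installation that fails at import time.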
INSTALLATION
We recommend that you set up and activate a virtual environment before installation. See here for installation details.
- 'ncrna_pipeline' folder should be added to $PATH. All files in this folder (including data_config) should be made executable
- Paths to data files (Gencode and genome fasta files) and CPAT 'dat' folder must be set in 'data_config'
- Links to download Gencode fasta files (example for version 19, human):
  wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.lncRNA_transcripts.fa.gz
  wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.pc_transcripts.fa.gz
- A genome map database for GMAP (hg19) can be found here: http://research-pub.gene.com/gmap/
USAGE
ncrna_pipeline -f/--fasta <fasta/fastq_file> -p/--processors <n processors> -c/--classifier <cpat/plek, default is cpat>
Example usage:
ncrna_pipeline -f sample.fq -p 8 -c plek
(Only performs PLEK noncoding prediction on raw reads and outputs a list of putative noncoding transcripts and a filtered fasta.)
ncrna_pipeline -f sample.fa -p 8 -c cpat -o human
ncrna_pipeline -f sample.fa -p 8 -c cpat -o mouse
(The fasta is mapped to the reference genome; the output gtf is converted to fasta, which is filtered against Gencode [both coding and noncoding], and the prediction is performed using CPAT. The output is the same as for the PLEK pipeline.)
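When several samples have to be processed, the command lines above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of the pipeline. It uses only the short flags that appear in the usage line and examples (`-f`, `-p`, `-c`, `-o`). The name `build_command`, the parameter name `organism`, and the restriction of `-o` to `human` or `mouse` are assumptions based on the examples.

```python
def build_command(fasta, processors, classifier="cpat", organism=None):
    """Build an ncrna_pipeline argument list from the flags shown in the usage examples."""
    if classifier not in ("cpat", "plek"):
        raise ValueError("classifier must be 'cpat' or 'plek'")
    # Assumption: the examples only show 'human' and 'mouse' as values for -o.
    if organism is not None and organism not in ("human", "mouse"):
        raise ValueError("organism must be 'human' or 'mouse'")
    cmd = ["ncrna_pipeline", "-f", fasta, "-p", str(processors), "-c", classifier]
    if organism is not None:
        cmd += ["-o", organism]
    return cmd


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # even where the pipeline is not installed.
    print(" ".join(build_command("sample.fq", 8, classifier="plek")))
    for organism in ("human", "mouse"):
        print(" ".join(build_command("sample.fa", 8, "cpat", organism)))
```

To execute a command rather than print it, the returned list can be passed to `subprocess.run(cmd, check=True)`, which raises an exception if the pipeline exits with a non-zero status.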