1. Ya Yang
  2. optimize_assembler

Overview

Citation: Yang, Y. and S.A. Smith Optimizing de novo assembly of short-read RNA-seq data for phylogenomics. BMC Genomics 2013, 14:328 doi:10.1186/1471-2164-14-328

Extract nuclear reads.

Command lines for removing reads that map to R. communis organellar genomes by bowtie2. Fasta files Rcommunis_chl.fa and Rcommunis_mt.fa are chloroplast and mitochondrial genome sequence files.

cat Rcommunis_chl.fa Rcommunis_mt.fa >Rcommunis_chl-mt.fa
bowtie2-build Rcommunis_chl-mt.fa Rcommunis_chl-mt
bowtie2 -p 10 -x Rcommunis_chl-mt -1 <filtered forward reads> -2 <filtered reverse reads> --un-conc <outfile> --phred64 --local -S <outfile>.sam

De novo assembly

Oases

Oases single k

velveth <assemble dir> <k> -shortPaired -fastq <shuffled reads>
velvetg <assemble dir> -read_trkg yes
oases <assemble dir> -ins_length 200

Or alternatively, using the Oases-M pipeline from Oases

python oases_pipeline.py -m 19 -M 71 -o <assemble dir> -d " -shortPaired -fastq <cleaned and shuffled reads>" -p " -ins_length 200 "
python oases_pipeline.py -m 21 -M 61 -s 10 -r -o <assemble dir>

Trinity

Command lines for de novo assembly using Trinity. Make sure that the reads end with "/1" for all the left reads, and "/2" for the right reads. Replace $TRINITY_DIR with the path to the trinity directory.

ulimit -s unlimited
perl $TRINITY_DIR/Trinity.pl --seqType fq --JM 20G --left <read1> --right <read2> --output <dataset id> --CPU 10 --full_cleanup

[Note added Jan 29,2014] New benchmark comparing SOAPdenovo-Trans v1.03 and Trinity v. 20131110 show that Trinity recovered higher gene coverage and is more flexible with better documentations.

[Note added Apr 29,2014] New benchmark comparing Trinity 20140413p1 vs. 20131110: the new version had a 15% reduction in chimeras with a 1% reduction in coverage when using default Trinity parameters.

Trans-ABySS

Command lines for de novo assembly using Trans-ABySS

abyss-pe name=PAZJ_abyss k=21 in='<read1.cleaned.filtered> <read1.cleaned.filtered>'
abyss-pe name=PAZJ_abyss k=31 in='<read1.cleaned.filtered> <read1.cleaned.filtered>'
abyss-pe name=PAZJ_abyss k=41 in='<read1.cleaned.filtered> <read1.cleaned.filtered>'
abyss-pe name=PAZJ_abyss k=51 in='<read1.cleaned.filtered> <read1.cleaned.filtered>'
abyss-pe name=PAZJ_abyss k=61 in='<read1.cleaned.filtered> <read1.cleaned.filtered>'
python trans-abyss.py -T -d -l PAZJ_abyss -L 72 --assembly_dir <DIR> -p myproject -P <DIR>
python trans-abyss.py -T -0 -l PAZJ_abyss -L 72 --assembly_dir <DIR> -p myproject -P <DIR> --local

SOAPdenovo-Trans

Download the newest version of SOAPdenovo-Trans, which comes with the executables. If using the default K=23:

python soapdenovo-trans_wrapper fq1 fq2 outDIR

If need to modify the value of K. First edit the config file, and then modify and run the following script. Make sure using the corresponding executable with each K value.

sh soapdenovo-trans.sh

Pick the representative isoform by read coverage

For Oases

Python script for processing Oases output and pick the highest-covered transcript for each "locus".

python process_oases_transcripts.py <assembly dir> <maximum no. of transcpripts per locus> <minimal length of transcript to keep> <outfile.fa>

For Trinity

Command line for mapping reads to transcripts using RSEM. Download RSEM, cd into the RSEM dir, type "make", and add the RSEM directory to the path. Replace $TRINITY_DIR with the path to the trinity directory in the following command line. Add "--SS_lib_type RF" if using Illumina stranded library prep.

perl $TRINITY_DIR/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts <trinity output fasta file> --seqType fq --left <read1.cleaned.filtered> --right <read2.cleaned.filtered> --thread_count 4 --output_dir <outDIR> --prefix <dataset id>

Python scripts for picking the highest-covered isoform per subcomponent from Trinity output and RSEM results

python pick_isoform_trinity_RSEM.py <RSEM.isoforms.results> <trinity output fasta file>

Python scripts for removing the lowest-covered isoform per subcomponent when there are two or more isoforms per subcomponent

python filter_isoform_trinity_RSEM.py <RSEM.isoforms.results> <trinity output fasta file> <outfile>

Cap3 to combine assembly from multiple runs

cap3 <concatenated fasta file> -o 200 -p 99
cat <.cap.contigs> <.cap.singlets> > <outfile>

Detecting chimeras and cutting assembled transcripts according to customized tabular blastx output. Local blast searches are carried out using the blast+ package from NCBI. Remove any space or special charactors in sequence names before blast! Make sure that the input fasta file name ends with NAME.fa, and blastx output file name look like NAME.fa.blastx. This step is best done before translation, and the blast database should come from proteomes from a small number of closely-related species.

blastx -db <blast database> -query <NAME.fa> -evalue 0.01 -outfmt '6 qseqid qlen sseqid slen frames pident nident length mismatch gapopen qstart qend sstart send evalue bitscore' -out <NAME.fa.blastx> -num_threads=23 -max_target_seqs 100
python detect_chimera_from_blastx.py <DIR_blastx_output>
python cut_chimera_from_blastx.py <DIR_fasta> <DIR_blastx_output> <minimal_output_size>

Summarize assembly statistics

Scripts that based on blat require very closely-related reference transcripts and are for benchmark only. When a closely-related annotated genome is available, use genome guided assembly instead.

Extract nuclear reference transcripts

blat <organeller genome sequence> <non-redundant R. communus reference transcripts> <blat output file name>
python extract_nuclear_transcripts.py <nuclear and organeller reference transcripts> <blat output file name>

Extracting nuclear reference transcripts from total reference transcripts

blat <fasta file with R. communis chloroplast and mitochondrial genomes> <total reference transcripts> <output .psl file>
python extract_nuclear_transcripts.py <total reference transcripts> <blat output .psl file>

Python script for picking the best blat hit and identify chimeric "transcripts". It takes a directory of blat output files end with .psl, and output two files for each .psl file: a .goodhits file with the best hit for each "transcript", and a .probhits file listing the putative chimeric "transcripts" with all their highly similar blat hits.

python pick_blat_hits.py <DIR_blat_output>

Calculate total gene coverage in base pair and in number of genes

python sum_blat_gene_coverage.py <DIR_blat_output>

Check the coverage and redundancy of translated peptide sequences by blastp agaist one closely-related reference proteome.

python check_peptide_coverage_redundancy.py peptide_fasta_DIR peptide_fasta_file_ending reference_proteome_fasta