1. Ya Yang
  2. phylogenomic_dataset_construction



Citation: Yang, Y. and S.A. Smith. 2014. Orthology inference in non-model organisms using transcriptomes and low-coverage genomes: improving accuracy and matrix occupancy for phylogenomics. Molecular Biology and Evolution. doi: 10.1093/molbev/msu245

Choose a taxonID for each data set. This taxonID will be used throughout the analysis. Use short taxonIDs with 4-6 letters and digits with no special characters.

Notes on taxon sampling: since any clustering algorithm is sensitive to structure within an ortholog group, it is best to avoid including two or more taxa that are significantly more closely-related to each other than to the rest of the taxa.

Step 1: Filter adapters with blastn

Choose a taxonID for each data set. This taxonID will be used throughout the analysis. Use short taxonIDs with no special characters.

Remove adapters for paired end reads:

python filter_fastq.py taxonID_1.fq taxonID_2.fq adapter_file num_cores

Alternatively, if sequences are single end reads:

python filter_fastq.py taxonID.fq adapter_file num_cores

The output files are taxonID_1.fq.filtered and taxonID_2.fq.filtered for paired end reads, and taxonID.fq.filtered for single end reads. I use filter_fastq.py only for sequences downloaded from SRA with adapter sequences and phred score offset unknown. For data sets with known adapter and phred scores I use Trimmomatic instead (see below). Trimmomatic is much faster but would not give an error message if you the adapter or phred score offset were misspedified.

Step 2: De novo assembly with Trinity

A typical Trinity call for analyzing Smith Lab sequences looks like (custom adapter file, use trimmomatic for both adapter clipping and quality trimming, stranded):

ulimit -s unlimited
Trinity --seqType fq --trimmomatic --quality_trimming_params "ILLUMINACLIP:/home/yayang/optimize_assembler/data/TruSeq_adapters:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25" --max_memory 100G --CPU 9 --full_cleanup --SS_lib_type RF --output taxonID.trinity --left <forward reads> --right <reverse reads>

A typical Trinity call for analyzing SRA data sets looks like (custom adapter file, use trimmomatic for quality trimming but not adapter clipping, non-stranded):

ulimit -s unlimited
Trinity --seqType fq --trimmomatic --quality_trimming_params "SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25" --max_memory 100G --CPU 9 --full_cleanup --output taxonID.trinity --left <forward reads> --right <reverse reads>

Step 3: Find open reading frames and translate using TransDecoder with blastp for choosing orfs

I benchmarked using blastp vs. PfamAB and found that since we have closely related, high quality proteomes available for the Caryophyllales, blastp using a custom blast database is faster and more sensitive than Pfam. For Smith Lab data sets (stranded):

TransDecoder.LongOrfs -t <transcripts> -S
blastp -query <transcripts>.transdecoder_dir/longest_orfs.pep -db <blast_database> -max_target_seqs 1 -outfmt 6 -evalue 10 -num_threads 9 > taxonID.blastp.outfmt6
TransDecoder.Predict -t <transcripts> --retain_blastp_hits taxonID.blastp.outfmt6 --cpu 9

For sra data sets (non-stranded):

TransDecoder.LongOrfs -t <transcripts>
blastp -query <transcripts>.transdecoder_dir/longest_orfs.pep -db <blast_database> -max_target_seqs 1 -outfmt 6 -evalue 10 -num_threads 9 > taxonID.blastp.outfmt6
TransDecoder.Predict -t <transcripts> --retain_blastp_hits taxonID.blastp.outfmt6 --cpu 9

Below are the two pipelines that I am currently using for processing Illumina reads from raw reads to translation. Both pipelines log program calls, check intermediate results, write a summary stats after finishing and remove intermediate files. Both also runs a final check of peptide coverage after finished. Change SRC_HOME, APPS_HOME, BLASTP_DB_PATH, TruSeq_ADAPTER,TRINITY_CMD and TRANSDECODER_PFEFIX in process_seq.py according to where the files are located. Also change the hitID in process_seq.py to the closest-related proteome taxonID in your blast database.

All data sets generated in the Smith Lab are stranded and paired-end 101 bp read using Illumina HiSeq. Read trimming is carried out using trimmomatic with custom adapter files.

python run_all_illumina_pe_stranded fasq.gzDIR <taxonID> <max_cores> <max_memory_GB>

Alternatively, for SRA data sets, the pipeline runs a check of phres scores (will stop and give an error message if phred score offset is not 33), logs all parameters used, and records number of read pairs before and after filtering adapters. Change FASTQ_DUMP_CMD and GENERAL_ADAPTER as needed. It also output the quality scores for a subset of reads for plotting in R. This is useful because sometimes paired end data sets in SRA are incorrectly formated as single end. There would be a dip in quality scores if this is the case.

python run_all_illumina_sra <sra_file> taxonID single/pair stranded/non-stranded <max_cores> <max_memory_GB>

Compress large files.

python gzip_files.py <DIR> <file ending>

Example files and test data sets can be found in examples/example_seq_processing. cd in to the dir examples/example_seq_processing and run the pipelines.

sh test_sra_pe.sh
sh test_sra_se.sh
sh test_smith_lab_seq.sh

Clean up test results

sh clean.sh

Step 4: Clustering

The TransDecoder output files are taxonID.Trinity.fasta.transdecoder.pep and taxonID.Trinity.fasta.transdecoder.cds. Shorten the namess to taxonID@seqID. The special character "@" is used to separate taxonID and seqID. Avoid using any other special characters in the sequence names except "_". Any "-" in the sequence name will be replaced by phyutility and cause problems in later stages when matching sequence names between tree and alignment.

python fix_names_from_transdecoder.py <inDIR> <outDIR>

The output are taxonID.pep.fa and taxonID.cds.fa. Check for duplicated names, special characters other than digits, letters and "_", all names follow the format taxonID@seqID, and file names are the taxonID. It's good to check especially when some of the data sets were obtained from elsewhere. Most peptides and CDS files from genome annotation contain long names, spaces and special characters.

python check_names.py <DIR> <file_ending>

Reduce redundancy. For amino acids:

cd-hit -i taxonID.fa -o taxonID.fa.cdhit -c 0.995 -n 5 -T <num_cores>

Or alternatively, for cds (use -r 0 since these should all be positive strand after translation):

cd-hit-est -i taxonID.fa.cds -o code.fa.cds.cdhitest -c 0.99 -n 10 -r 0 -T <num_cores>

All-by-all blast. Copy all the taxonID.fa.cdhit files (or .cdhitest files) into a new directory. Note that it is important to set the maximum number of hits very high (e.g. 1000) to allow the inclusion of all closely related ingroup and outgroup sequences. I usually use an evalue cutoff of 10 so that I don't need to re-run the all-by-all blast again.

Since blastp takes much longer to complete than blastn, I prefer using seperate input fasta files for all-by-all blastp to keep track of progress. I also carry out the makeblastdb step locally before doing blast on a cluster. This is because makeblastdb will check formatting of sequences, duplicate sequence names and special characters in seqIDs etc. It is easier to fix these locally before moving to a cluster.

python all-by-all_blastp.py <DIR> <file_ending> <num_cores>
cat *.rawblastp >all.rawblast

Or alternatively, if CDS works better when the taxa diverged recently:

cat *.cdhitest >all.fa
makeblastdb -in all.fa -parse_seqids -dbtype nucl -out all.fa
blastn -db all.fa -query all.fa -evalue 10 -num_threads <num_cores> -max_target_seqs 1000 -out all.rawblast -outfmt '6 qseqid qlen sseqid slen frames pident nident length mismatch gapopen qstart qend sstart send evalue bitscore'

[Optional] Remove ends of sequences that are not covered by any blast hits from other taxa. Skip this if wish not to cut ends that are fast-evolving, or using sequences from genome annotation.

python cut_seq_ends.py all.fa all.rawblast

Filter raw blast output by hit fraction and prepare input file for mcl. The input can be rawblastp or rawblastn results. I usually use 0.3 or 0.4 for hit_fraction_cutoff when using sequences assembled from RNA-seq depending on how divergent the sequences are. A low hit-fraction cutoff will output clusters with more incomplete sequences and much larger and sparser alignments, whereas a high hit-fraction cutoff gives tighter clusters but ignores incomplete or divergent sequences. For genome data I use a hit_fraction cutoff of 0.5. You can also set IGNORE_INTRASPECIFIC_HITS to be True to avoid recent gene duplications or isoforms forming tight clusters and break off.

python blast_to_mcl.py all.rawblast <hit_fraction_cutoff>

The output file is used as input for mcl. Try a few different hit fraction cutoffs and inflation values. My experience is that MCL is robust to minusLogEvalue cutoffs and using a cutoff of 0 or 5 works ok in all the cases I tested. You loose entire clusters of short genes by using a high minusLogEvalue cutoff. Use the smallest inflation value and hit-fraction cutoff value combination that gives alignable clusters. "--te" specifies number of threads, "-I" specifies the inflation value, and -tf 'gq()' specifies minimal -log transformed evalue to consider, and "-abc" specifies the input file format. Here are some example mcl command lines:

mcl all.rawblast.hit-frac0.4.minusLogEvalue --abc -te 5 -tf 'gq(5)' -I 1.4 -o hit-frac0.4_I1.4_e5
mcl all.rawblast.hit-frac0.4.minusLogEvalue --abc -te 5 -tf 'gq(5)' -I 2 -o hit-frac0.4_I2_e5
mcl all.rawblast.hit-frac0.3.minusLogEvalue --abc -te 5 -tf 'gq(5)' -I 1.4 -o hit-frac0.3_I1.4_e5
mcl all.rawblast.hit-frac0.3.minusLogEvalue --abc -te 5 -tf 'gq(5)' -I 2 -o hit-frac0.3_I2_e5

Write fasta files for each cluster from mcl output. Make a new directory to put the thousands of output fasta files.

mkdir <outDIR>
python write_fasta_files_from_mcl.py <fasta files with or without ends cut> <mcl_outfile> <minimal_taxa> <outDIR>

Now we have a new directory with fasta files that look like cluster1.fa, cluster2.fa and so on.

Step 5: Build homolog trees

Put phyutility in the path. First, get the absolute path for the phyutility.jar file. For example, mine is /home/yayang/apps/phyutility/phyutility.jar. Next, make a new text file named "phyutility" in your path, and write this line in it pointing to the absolute path of the jar file like this (use the actual path in your machine of course):

java -Xmx6g -jar /home/yayang/apps/phyutility/phyutility.jar $*

Make sure that raxml, fasttree, phyutility, mafft and pasta are properly installed and excutables are in the path, and the executables are named exactly as raxml, fasttree, phyutility, mafft and pasta.py respectively.

Align each cluster, trim alignment, and infer a tree. Make sure that you have raxml 8 or later installed so that it reads fasta file. With an ealier version of raxml you will get an error message "Problem reading number of species and sites".

For clusters that have less than 1000 sequences, it will be aligned with mafft (--genafpair --maxiterate 1000), trimmed by a minimal column occupancy of 0.1 and tree inference using raxml. For larger clusters it will be aligned with pasta, trimmed by a minimal column occupancy of 0.01 and tree inference using fasttree. The ouput tree files look like clusterID.raxml.tre or clusterID.fasttree.tre for clusters with 1000 or more sequences.

python fasta_to_tree.py <fasta dir> <number_cores> dna/aa bootstrap(y/n)

Trim tips that are longer than a relative length cutoff and more than 10 times longer than its sister. Also trim tips that are longer than an absolute value. The output tree ends with ".tt". Keep input and output trees in the same directory.

python trim_tips.py <input tree dir> <tree_file_ending> <relative_cutoff> <absolute_cutoff>

Mask both mono- and (optional) paraphyletic tips that belong to the same taxon. Keep the tip that has the most un-ambiguous charactors in the trimmed alignment. Keep input and output trees in the same directory.

python mask_tips_by_taxonID_transcripts.py <.tt dir> <aln-cln dir> mask_paraphyletic(y/n)

For phylogenomic data sets that are from annotated genomes, I would only mask monophyletic tips, and keep the sequence with the shortest terminal branch length. Keep input and output trees in the same directory.

python mask_tips_by_taxonID_genomes.py <.tt dir>

Cut deep paralogs. If interested in building phylogeny a lower (more stringent) long_internal_branch_cutoff should be used. Use a higher (more relaxed) cutoff if interested in homologs to avoid splitting homologs. This works very well with CDS and less effective amino acid squences. For CDS the branch lengths are mostly determined by synonymous distance and are more consistant than for amino acids. Make sure that the indir and outdir are different directories.

python cut_long_internal_branches.py <input tree dir> <input tree file ending> <internal_branch_length_cutoff> <minimal no. taxa> <outDIR>

Write fasta files from trees. The imput tree file ending should be .subtree

python write_fasta_files_from_trees.py all.fa <cut tree dir> <tree_file_ending> <outDIR>

Repeat the alignment tree estimation, trimming, masking and cutting deep paralogs. Can use a set of more stringent cutoffs in the second round. After the final round, write fasta files from trees using tree files that ends with both .subtree and .tt.mm, and estimate the final homolog trees.

Alternatively one can calculate the synonymous distance and use that to guide cutting. However, since we are only trying to get well-aligned clusters for tree inference, choice of length cutoffs here can be somewhat arbitary.

From here a number of further analyses can be done with the homologs, such as gene tree discordance and back translate peptide alignment to codons with pal2nal and investigate signature of natural selection.

Step 6: Paralogy pruning to infer orthologs. Use one of the following:

1to1: only look at homologs that are strictly one-to-one. No cutting is carried out.

python filter_1to1_orthologs.py <homologDIR> <tree_file_ending> <minimal_taxa> <outDIR>

MI: prune by maximum inclusion. The long_tip_cutoff here is typically the same as the value used when trimming tips. Set OUTPUT_1to1_ORTHOLOGS to False if wish only to ouput orthologs that is not 1-to-1, for example, when 1-to-1 orthologs have already been analyzed in previous steps.

python prune_paralogs_MI.py <homologDIR> <tree_file_ending> <relative_long_tip_cutoff>  <absolute_long_tip_cutoff> <minimal_taxa> <outDIR>

MO: prune by using homologs with monophyletic, non-repeating outgroups, reroot and cut paralog from root to tip. If no outgroup, only use those that do not have duplicated taxa. Change the list of ingroup and outgroup names first. Set OUTPUT_1to1_ORTHOLOGS to False if wish only to ouput orthologs that is not 1-to-1

python prune_paralogs_MO.py <homologDIR> <tree_file_ending> <minimal_taxa> <outDIR>

RT: prune by extracting ingroup clades and then cut paralogs from root to tip. If no outgroup, only use those that do not have duplicated taxa. Compile a list of ingroup and outgroup taxonID, with each line begin with either "IN" or "OUT", followed by a tab, and then the taxonID.

python prune_paralogs_RT.py <homologDIR> <tree_file_ending> <outDIR> <minimal_ingroup_taxa> <ingroup and outgroup taxonIDs>

Or alternatively, if the input homolog tree is already rooted:

python prune_paralogs_from_rooted_trees.py <homoTreeDIR> <tree_file_ending> <minimal_taxa> <outDIR>

Step 7: Visualize matrix occupancy stats and constructing the supermatrix

python ortholog_occupancy_stats.py <ortho_treDIR>

Read in and rank number of taxa per ortholog from highest to lowest. Plot the ranked number of taxa per ortholog

a <- as.numeric(read.table("ortho_stats")[,1])
a <- sort(a, decreasing=TRUE)
plot(a, type="l", lwd=3, ylab="Number of Taxa in Each Ortholog")

Check taxon_stats to see if any taxa have unusally low number of genes in the orthologs. Open the file taxon_occupancy.pdf and decide the MIN_TAXA filter. Write new fasta files from ortholog trees

python write_ortholog_fasta_files.py <fasta file with all seqs> <ortholog tree DIR> outDIR MIN_TAXA

Align final orthologs. Play with a few alignment methods here. Try prank or fsa in addition to mafft or sate for more accurate alignments. Prank tend to create lots of gaps when it is not sure about the homology so make sure to check the alignment visually.

python prank_wrapper.py <inDIR> <outDIR> <file ending> DNA/aa

Trim alignment. I usually use 0.3 for MIN_COLUMN_OCCUPANCY

python phyutility_wrapper.py <inDIR> <MIN_COLUMN_OCCUPANCY> DNA/aa

Or use Gblocks for trimming alignments if the sequences are very divergent (change FILE_ENDING first):

python pep_gblocks_wrapper.py <inDIR> <outDIR>

Choose the minimal cleaned alignment length and minimal number of taxa filters for whether to include an ortholog in the supermatrix. Concatenate selected cleaned matrices:

python concatenate_matrices.py <aln-clnDIR> <numofsites> <numoftaxa> DNA/aa <outfile>

This will output a list of cleaned orthology alignments that passed the filter, a summary of taxon matrix occupancies to check whether any taxon is under represented, and a concatenated matrix in phylip format

Step 8: Estimate species tree

Run raxml with each ortholog as a separate partition. Use GTRCAT instead of PROTCATWAG for dna

raxml -T <num_cores> -p 12345 -m PROTCATWAG -q <.model> -s <.phy> -n <output>

Run raxml with 200 rapid bootstrap replicates and search for the best tree. Use GTRCAT instead of PROTCATWAG for dna

raxml -T 9 -f a -x 12345 -# 200 -p 12345 -m PROTCATWAG -q <.models file> -s <.phy file> -n <output>

Use examl if the matrix is large:

python examl_wrapper.py <.phy file> <.model file> <outname> <number_cores> DNA/aa

Run 200 jackknife replicates. Copy both the .phy and .model file into a new working dir, cd into the working dir, and run the following script (sample fixed proportion of number of genes):

python jackknife_by_percent_genes.py <num_core> <jackknife proportion> DNA/aa

Alternatively, resampling by number of genes:

python jackknife_by_number_genes.py DIR <num_core> <no. genes to resample> DNA/aa

Mapping jackknife results to the best tree. Use GTRCAT instead of PROTCATWAG for dna

cat *result* >JK10_trees
raxml -f b -t <bestTree> -z JK10_trees -T 2 -m PROTCATWAG -n <output_name>

Making a consensus tree from jackknife output

phyutility -con -t 0.001 -in JK10_trees -out JK10_consensus

Translate taxon codes to make the tree more readable:

python taxon_name_subst.py <table> <treefile>

A test data set can be found in examples/homology_inference. cd in to the dir examples/example_seq_processing and type: