HTTPS SSH

This repo contains scripts for analyzing the subclade CDS datasets in the following publication:

Yang, Y., M.J. Moore, S.F. Brockington, J. Mikenas, J. Olivieri, J.F. Walker, and S.A. Smith. 2017. Improved transcriptome sampling pinpoints 26 paleopolyploidy events in Caryophyllales, including two paleo-allopolyploidy events. Accepted. New Phytologist.

Scripts in this repo was developed from https://bitbucket.org/yangya/phylogenomic_dataset_construction to automate analyses

Homology and orthology inference starting from all-by-all BLASTN

Create a directory "data" with all the CDS fasta files in it. Create a text file "in_out" listing ingroup and outgroup taxonIDs. See associated dryad repo for example files. cd into each subclade dir.

PHYT

python cds_to_tree_cary.py --indir=data --outdir=. --threads=4 --reltip=0.2 --abstip=0.4 --deep=0.3 --ortho=121 --inout=in_out --max_mis_taxa=2

PORT

python cds_to_tree_cary.py --indir=data --outdir=. --threads=4 --reltip=0.2 --abstip=0.4 --deep=0.3 --ortho=121 --inout=in_out --max_mis_taxa=2

AMAR

python cds_to_tree_cary.py --indir=data --outdir=. --threads=4 --reltip=0.2 --abstip=0.4 --deep=0.2 --ortho=121 --inout=in_out --max_mis_taxa=2

CARY

python cds_to_tree_cary.py --indir=data --outdir=. --threads=4 --reltip=0.1 --abstip=0.3 --deep=0.2 --ortho=121 --inout=in_out --max_mis_taxa=2

NCORE

python cds_to_tree_cary.py --indir=data --outdir=. --threads=4 --reltip=0.3 --abstip=0.6 --deep=0.4 --ortho=121 --inout=in_out --max_mis_taxa=2

This will produce homologs and orthologs for each subclade.

Species tree

ASTRAL

Multi-locus bootstrap with ASTRAL version 4.10.12.

python filter_1to1_orthologs_for_ASTRAL.py 5_homolog <minimal_taxa> <ASTRAL dir>
cat *.tre >all_ML.trees
ls *.trees >all_boot.trees
java -jar ~/apps/ASTRAL/Astral/astral.4.10.12.jar -i all_ML.trees -b all_boot.trees -g -r 100 -o astral_boot.out
tail -1 astral_boot.out >../astral_boot.tre

ICA score

Copy over the all_ML.trees file generated in the previous step, remove the extra “)” before “;” and calculate IC/ICA scores

sed -i 's/);/;/g' all_ML.trees
raxml -f i -T 2 -t RAxML_bestTree.ncore -z all_ML.trees -m GTRCAT -n ICA

The branch length is usually set to an arbitrary 1. With partial gene trees it will be something like 0.9. To visualize the tree in figtree, replace all the arbitrary branch lengths with 1, and then use regular expression in geany to reformat the output tree

:1\[(\d)\.(\d+),(\d)\.(\d+)\]
\1.\2_\3.\4:1

Mapping gene duplication to species tree

Extract rooted orthogroups from homolog groups

python extract_clades.py inDIR treefileending outDIR MIN_INGROUP_TAXA in_out output_name

Root the species tree from RAxML by reroot the tree in figtree and save as a new newick file. Map gene duplication from rooted orthogroups to the rooted species tree, filtering by average bootstrap percentage per orthogroup being at least 50

python map_dups.py orthogroupDIR rooted_spTree min_taxa outname

Alternatively, instead of filtering by average bootstrap percentage, filter by local tree topology requiring sister clade of the dup branch in orthogroup contains subset of species in sister clade of speciest tree

python map_dups_concordant.py orthogroupDIR rooted_spTree outname

Plot all duplication percentages to show that our cutoff for WGD is not arbitrary but indeed outliers

python plot_branch_labels.py <dup_perc_filter50_global file>

Go to R

a=read.table('dup_perc_filter50_global.filter43.branch_labels')
pdf=pdf(file='perc_dup_labels.pdf',width=4,height=4,)
hist(a[,1],breaks=60,col='grey',xlab='',ylab='',main='',axes=FALSE,xlim=c(0,0.6))
axis(1,pos=0)
axis(2,pos=0)
dev.off()

Ks plots

The taxon name translation table named "taxon_table" is locaded in this repo. Move the corresponding ks_plots python scripts, ks_calculator_scripts/bin, and ks_calculator_scripts/synonymous_calc.py to the dir where peptides or CDA files are located

Within-taxon based on BLASTP starting with cd-hit

python ks_plots.py num_cores taxon_table

Within-taxon based on BLASTN without clustering

python ks_plots_cds.py num_cores taxon_table

Between-taxa based on BLASTN

python ks_between_taxa_cds.py workingDIR taxonID1 taxonID2 num_cores taxon_table

An R script will be generated in the output for plotting. Plotting within and between species Ks plots in one panel

pdf=pdf(file='./PDQH_MJM2259.pdf',width=7,height=7)
a<-read.table('./PDQH.ks.ng')
b<-read.table('./MJM2259.ks.ng')
ab<-read.table('./PDQH_MJM2259.ks.ng')
hist(b[,1],breaks=50,col=rgb(0,0.5,0,0.5),xlab='',ylab='',main='PDQH_MJM2259',axes=FALSE)
hist(ab[,1],breaks=50,col=rgb(0.5,0,0,0.5),add=T)
hist(a[,1],breaks=50,col=rgb(0,0,0.5,0.5),add=T)
axis(1,pos=0)
axis(2,pos=0)
dev.off()