Overview

HTTPS SSH

This repository contains scripts for baited homology search using SWIPE, refine homolog trees, and output results with taxon name fixed. Parameters are tuned for seach against angiosperm proteomes

Citation: Lopez-Nievesa, et al. Relaxation of tyrosine pathway regulation precedes the evolution of betalain pigmentation in Caryophyllales. In prep.

Part 1 Install and check dependencies

python

Currently the code is tested in python 2.7.x and is compatible with python 2.6. However, it won't work with 3.x.

blast+ package

Download the blast+ package from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/. Put makeblastdb and blastp in the PATH.

java

Type java to check whether you have java installed. If not, for linux

sudo apt-get install default-jre

For mac, follow instructions here https://helpx.adobe.com/x-productkb/global/install-java-jre-mac-os.html

phyutility

Download phyutility from https://code.google.com/p/phyutility/downloads/list. Unzip, and move the directory to ~/apps.

Put phyutility in the PATH. First, get the absolute path for the phyutility.jar file. In mac, you can just drag the .jar file into your terminal to get the absolute path. For example, mine is /Users/yangya/apps/phyutility/phyutility.jar. Next, make a new text file named "phyutility" in your home directory. For mac users, use TextWrangler for the text editing since TextEdit will not be able to save plain text files. Write the following line in this new text file pointing to the absolute path of the jar file, such as:

java -Xmx6g -jar /Users/yangya/apps/phyutility/phyutility.jar $*

Of course use the actual absolute path on your machine. Make the phyutility file executable

chmod u+x ~/phyutility

Move the text file to the PATH

sudo mv ~/phyutility /usr/bin/phyutility

Type

phyutility

to make sure that it is correctly installed. It should print out a list of commands.

If your initial homolog trees have 2,000 sequences or more, you will need to additional dependencies (skip PASTA and FastTree if you are working with small data sets, say, less than 50 species):

PASTA

Install PASTA following instructions here: https://github.com/smirarab/pasta. Put the script run_pasta.py to the path. I wrapped PASTA with pasta_wrapper.py

FastTree

Install FastTree 2 following the instructions here: http://meta.microbesonline.org/fasttree/. Put the executable in the path and rename it "fasttree". I wrapped PASTA with fasttree_wrapper.py

Visualization tools

I use Jalview and FigTree to visualize alignments and trees.

Part 2 Prepare data files

Bait fasta file

Prepare a fasta file ends with .pep.fa that contains the bait. Multiple sequences are allowed. Full-length peptide sequences are prefered. See the file example/ADH.pep.fa for an example.

Input peptide fasta files in pepdir

By default, initial input peptide fasta files are named taxonID.pep.fa, and each sequence id looks like taxonID@seqID. Avoid any special character in taxonID or seqID except "_". [Optional] Run CD-HIT on each input peptide dataset, and name the output taxonID.pep.fa.cdhit. See example/pep for examples. The module swipe_dir.py reads all files end with either .pep.fa or .cdhit in pepdir and BLASTP against each of them

Taxon table file

Prepare a plain text file named "taxon_table" in the same directory as the scripts. In each line put the taxonID and the actual species name (or however you want to call them), separated by a tab. See the taxon_table in this dir for an example. You can substitute it with your own taxon_table. I color coded taxon names for major clades for ease of visualization in FigTree.

Part 3 Baited homology search using SWIPE

python bait_homologs.py query_pep_fa pepDIR num_to_bait num_cores outdir

The parameter num_to_bait tells the script the maximum number of SWIPE hits to record, suggested value is between 5 and 20 unless dealing with very large gene families. You can change the filteres in the scripts. For example, cd into the example directory, and type

python ../bait_homologs.py ADH.pep.fa pep/ 5 2 .

This produces a directory named ADH (or any bait file name you use) with all the results and intermediate files in it.

ADH.rawswipe: unfiltered raw SWIPE results baitid.filtered_hits: list of positive hits after filtering ADH.swipe.fa: initial fasta file ADH.swipe.fa.mafft.aln: initial alignment file ADH.swipe.fa.mafft.aln-cln: initial alignment after trimming low occupancy columns ADH.raxml.tre: initial tree ADH.raxml.tre.tt.mm: initial tree after masking and trimming tips ADH.raxml.tre.tt.mm.name: ADH.raxml.tre.tt.mm after swap taxon names ADH_1.subtree: subtree containing baits after separating long internal branches ADH_1.subtree.name: ADH_1.subtree after swap taxon name

These following trees follow the same naming schemes as the initial round.

ADH_1.fa ADH_1.fa.mafft.aln ADH_1.fa.mafft.aln-cln ADH_1.raxml.tre ADH_1.raxml.tre.tt.mm ADH_1.raxml.tre.tt.mm.name

Same naming scheme for the final round. When multiple subtrees contains the baits all will be retained.

ADH_1_1.fa ADH_1_1.subtree ADH_1_1.subtree.name ADH_1_2.fa ADH_1_2.subtree ADH_1_2.subtree.name

Part 4 Post processing

I ususally open all the tree files end with .names to check the output. There are two rounds of alignment and tree building steps, each followed by cutting long internal branches. The last step of cutting may be too stringent and you might want to use the alignments and trees before the last cut instead. In addition, below are optional scripts for fine tuning the alignments and trees (in no particular order).

Adding additional taxa to an existing homolog group

This is essentially the same as part 3 but skips taxa that are already in the existing homolog fasta file.

python add_to_existing_homolog.py query_pep_fa pepDIR num_to_bait num_cores outdir existing_homolog_fasta

Extract a rooted ingruop clade from homolog tree given ingroup and outgroup taxa

python extract_clades.py inDIR treefileending outDIR MIN_INGROUP_TAXA TAXON_CODE output_name

Pooling fasta (pep or CDS) from all tips in a subtree

python write_fasta_from_one_tree.py all_fasta in_fasta
python write_fasta_from_one_tree.py fasta tree

Alternative alignment tools and wrappers

python fsa_wrapper faDIR
python prank_wrapper.py inDIR outDIR infile_ending dna/codon/aa