This repository contains scripts for baited homology search using SWIPE, refine homolog trees, and output results with taxon name fixed. Parameters are tuned for seach against angiosperm proteomes
Citation: Lopez-Nievesa, et al. Relaxation of tyrosine pathway regulation precedes the evolution of betalain pigmentation in Caryophyllales. In prep.
Part 1 Install and check dependencies
Currently the code is tested in python 2.7.x and is compatible with python 2.6. However, it won't work with 3.x.
Download the blast+ package from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/. Put makeblastdb and blastp in the PATH.
Type java to check whether you have java installed. If not, for linux
sudo apt-get install default-jre
For mac, follow instructions here https://helpx.adobe.com/x-productkb/global/install-java-jre-mac-os.html
Download phyutility from https://code.google.com/p/phyutility/downloads/list. Unzip, and move the directory to ~/apps.
Put phyutility in the PATH. First, get the absolute path for the phyutility.jar file. In mac, you can just drag the .jar file into your terminal to get the absolute path. For example, mine is /Users/yangya/apps/phyutility/phyutility.jar. Next, make a new text file named "phyutility" in your home directory. For mac users, use TextWrangler for the text editing since TextEdit will not be able to save plain text files. Write the following line in this new text file pointing to the absolute path of the jar file, such as:
java -Xmx6g -jar /Users/yangya/apps/phyutility/phyutility.jar $*
Of course use the actual absolute path on your machine. Make the phyutility file executable
chmod u+x ~/phyutility
Move the text file to the PATH
sudo mv ~/phyutility /usr/bin/phyutility
to make sure that it is correctly installed. It should print out a list of commands.
If your initial homolog trees have 2,000 sequences or more, you will need to additional dependencies (skip PASTA and FastTree if you are working with small data sets, say, less than 50 species):
Install PASTA following instructions here: https://github.com/smirarab/pasta. Put the script run_pasta.py to the path. I wrapped PASTA with pasta_wrapper.py
Install FastTree 2 following the instructions here: http://meta.microbesonline.org/fasttree/. Put the executable in the path and rename it "fasttree". I wrapped PASTA with fasttree_wrapper.py
I use Jalview and FigTree to visualize alignments and trees.
Part 2 Prepare data files
Bait fasta file
Prepare a fasta file ends with .pep.fa that contains the bait. Multiple sequences are allowed. Full-length peptide sequences are prefered. See the file example/ADH.pep.fa for an example.
Input peptide fasta files in pepdir
By default, initial input peptide fasta files are named taxonID.pep.fa, and each sequence id looks like taxonID@seqID. Avoid any special character in taxonID or seqID except "_". [Optional] Run CD-HIT on each input peptide dataset, and name the output taxonID.pep.fa.cdhit. See example/pep for examples. The module swipe_dir.py reads all files end with either .pep.fa or .cdhit in pepdir and BLASTP against each of them
Taxon table file
Prepare a plain text file named "taxon_table" in the same directory as the scripts. In each line put the taxonID and the actual species name (or however you want to call them), separated by a tab. See the taxon_table in this dir for an example. You can substitute it with your own taxon_table. I color coded taxon names for major clades for ease of visualization in FigTree.
Part 3 Baited homology search using SWIPE
python bait_homologs.py query_pep_fa pepDIR num_to_bait num_cores outdir
The parameter num_to_bait tells the script the maximum number of SWIPE hits to record, suggested value is between 5 and 20 unless dealing with very large gene families. You can change the filteres in the scripts. For example, cd into the example directory, and type
python ../bait_homologs.py ADH.pep.fa pep/ 5 2 .
This produces a directory named ADH (or any bait file name you use) with all the results and intermediate files in it.
ADH.rawswipe: unfiltered raw SWIPE results baitid.filtered_hits: list of positive hits after filtering ADH.swipe.fa: initial fasta file ADH.swipe.fa.mafft.aln: initial alignment file ADH.swipe.fa.mafft.aln-cln: initial alignment after trimming low occupancy columns ADH.raxml.tre: initial tree ADH.raxml.tre.tt.mm: initial tree after masking and trimming tips ADH.raxml.tre.tt.mm.name: ADH.raxml.tre.tt.mm after swap taxon names ADH_1.subtree: subtree containing baits after separating long internal branches ADH_1.subtree.name: ADH_1.subtree after swap taxon name
These following trees follow the same naming schemes as the initial round.
ADH_1.fa ADH_1.fa.mafft.aln ADH_1.fa.mafft.aln-cln ADH_1.raxml.tre ADH_1.raxml.tre.tt.mm ADH_1.raxml.tre.tt.mm.name
Same naming scheme for the final round. When multiple subtrees contains the baits all will be retained.
ADH_1_1.fa ADH_1_1.subtree ADH_1_1.subtree.name ADH_1_2.fa ADH_1_2.subtree ADH_1_2.subtree.name
Part 4 Post processing
I ususally open all the tree files end with .names to check the output. There are two rounds of alignment and tree building steps, each followed by cutting long internal branches. The last step of cutting may be too stringent and you might want to use the alignments and trees before the last cut instead. In addition, below are optional scripts for fine tuning the alignments and trees (in no particular order).
Adding additional taxa to an existing homolog group
This is essentially the same as part 3 but skips taxa that are already in the existing homolog fasta file.
python add_to_existing_homolog.py query_pep_fa pepDIR num_to_bait num_cores outdir existing_homolog_fasta
Extract a rooted ingruop clade from homolog tree given ingroup and outgroup taxa
python extract_clades.py inDIR treefileending outDIR MIN_INGROUP_TAXA TAXON_CODE output_name
Pooling fasta (pep or CDS) from all tips in a subtree
python write_fasta_from_one_tree.py all_fasta in_fasta python write_fasta_from_one_tree.py fasta tree
Alternative alignment tools and wrappers
python fsa_wrapper faDIR python prank_wrapper.py inDIR outDIR infile_ending dna/codon/aa