MAPS: Multi-tAxon Paleopolyploidy Search Algorithm

Developed by the Barker lab at the University of Arizona.

Publications

  • MAPS was first described in:

    • Li, Z., A. E. Baniaga, E. B. Sessa, M. Scascitelli, S. W. Graham, L. H. Rieseberg, and M. S. Barker. 2015. Early genome duplications in conifers and other seed plants. Science Advances 1(10): e1501084. (https://doi.org/10.1126/sciadv.1501084)
  • It was updated with a new WGD simulation approach in:

    • Li, Z., G. Tiley, S. Galuska, C. Reardon, T. Kidder, R. Rundell, and M. S. Barker. 2018. Multiple large-scale gene and genome duplications during the evolution of hexapods. Proceedings of the National Academy of Sciences of the USA 115: 4713–4718. (https://doi.org/10.1073/pnas.1710791115)

Overview of MAPS

MAPS infers ancient whole genome duplication (WGD) or paleopolyploidization events by counting and summarizing subtrees from thousands of gene trees. The inputs to MAPS are a ladderized species tree and gene family phylogenies. The output of MAPS is a summary, by node, of subtree counts and the percentage of subtrees that support a shared gene duplication. Support for an ancient WGD at a particular location can then be assessed statistically by comparison to both null and positive simulations of paleopolyploidy.

Input of MAPS

The example_dataset folder provides an example of the input format. Please follow this format.

1. The input list of species labels, which also defines the species tree, should have a .list extension (ex: example.list). The .list file should start with a name for the MAPS analysis. Each species should use a 3 or 4 letter code, and the species codes are separated by tabs. Using example.list as an example, MAPS reads it as the species tree ((((((aaa,bbb),ccc),ddd),eee),fff),ggg);
2. The input phylogenetic tree file should be a concatenation of gene trees. Each gene tree should contain at least one gene copy for each of the species in the .list file. Gene trees should be in newick format, with one gene tree per line. The input phylogenetic tree file should have a .tre or .tree extension (ex: example.tre).
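The way an ordered list of species codes maps to a ladderized species tree can be sketched in a few lines of Python. This is only an illustration of the ordering rule described in item 1, not code taken from maps.pl; the function name ladderized_newick is hypothetical, and the check that codes consist of letters only is an assumption.

```python
import re


def ladderized_newick(codes):
    """Build the ladderized Newick tree implied by an ordered list of species codes.

    Hypothetical helper for illustration only; it is not part of MAPS.
    """
    if len(codes) < 2:
        raise ValueError("need at least two species codes")
    for code in codes:
        # Assumption: a "3 or 4 letter code" means 3 or 4 ASCII letters.
        if not re.fullmatch(r"[A-Za-z]{3,4}", code):
            raise ValueError(f"species code must be 3 or 4 letters: {code!r}")
    # The first two codes form the innermost pair; every later code is
    # added as the sister of everything that came before it.
    tree = f"({codes[0]},{codes[1]})"
    for code in codes[2:]:
        tree = f"({tree},{code})"
    return tree + ";"


print(ladderized_newick(["aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg"]))
# prints ((((((aaa,bbb),ccc),ddd),eee),fff),ggg);
```

Writing the tree out this way can help confirm that the species order in the .list file matches the intended species relationships before running MAPS.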

Usage of MAPS

perl maps.pl --help

--d     Set the directory to search
--l     Set the .list input file
--mb    Set the minimum bootstrap value (range 0 - 100)
--mt    Set the minimum % of the ingroup taxa to be present in all subtrees (range 0% - 100%, we recommend to use 45%)
--multi Use Multi-MAPS mode
--o     Set the output file name
--t     Set the .tre or .tree input file

Running MAPS with input data

To require that at least 45% of the ingroup taxa be present in every subtree analyzed by MAPS:

perl maps.pl --l <.list file>  --t <.tre file> --mt 45

Running MAPS with bootstrapped trees

To require a minimum bootstrap value of 80 for each gene tree and at least 45% of the ingroup taxa to be present in every subtree analyzed by MAPS:

perl maps.pl --l <.list file>  --t <.tre file> --mb 80 --mt 45

Running Multi-MAPS with input data

Multi-MAPS runs multiple MAPS analyses in one folder. In Multi-MAPS mode, MAPS searches the specified directory for paired input files (ex: example.list and example.tre), so --l and --t are not needed to specify the input files.

perl maps.pl --multi --mt 45
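Before starting a Multi-MAPS run, it can be useful to check which input pairs are present in a directory. The Python sketch below assumes that a pair is a .list file and a .tre or .tree file sharing the same base name, as in example.list and example.tre; how maps.pl actually matches files is not documented here, and find_input_pairs is a hypothetical helper.

```python
from pathlib import Path


def find_input_pairs(directory):
    """Return (list_file, tree_file) pairs that share a base name.

    Assumption: pairing is by base name (example.list + example.tre).
    This is an illustration, not the logic used inside maps.pl.
    """
    pairs = []
    for list_file in sorted(Path(directory).glob("*.list")):
        for ext in (".tre", ".tree"):
            tree_file = list_file.with_suffix(ext)
            if tree_file.is_file():
                pairs.append((list_file, tree_file))
                break  # take the first matching tree file for this list
    return pairs


# List the pairs found in the current directory.
for list_file, tree_file in find_input_pairs("."):
    print(list_file.name, tree_file.name)
```

A .list file without a matching tree file is skipped by this sketch, which makes missing inputs easy to spot.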

Null simulations:

Getting the null simulations requires: 1. An ultrametric species tree in newick format. 2. A sim.ctl file. 3. simulateGeneTrees.3.0.pl. The simulateGeneTrees.3.0.pl script currently requires jprime-0.3.4.jar; this can be adjusted as needed.

Modify the sim.ctl file

Please read through the sim.ctl file carefully and modify it as needed for your own data sets. The fixed_lambda and fixed_mu parameters are the background gene birth and death rates. They can be estimated using WGDgc or other programs. The root_distribution parameter can be estimated from gene count data in WGDgc. The wgd_retention_rate parameter should be set to 0 for null simulations.

Using WGDgc

WGDgc requires gene count information and a species tree in simmap format. To use WGDgc to estimate background gene birth and death rates for MAPS simulations, obtain the gene count information from the gene family clustering step. The species tree should have the same species relationships as the MAPS input, but in simmap format. Details on how to run WGDgc can be found at http://pages.stat.wisc.edu/~ane/wgd/

Running null simulations

The simulateGeneTrees.3.0.pl script requires jprime-0.3.4.jar to be installed. A different version of JPrIME (https://github.com/arvestad/jprime) can be used by modifying this script.

perl simulateGeneTrees.3.0.pl

Subsampling simulated trees

The input is a concatenated file of all the simulated gene trees. Each simulated gene tree should be in a single row.

perl sampleTrees.pl -in <input> -n <number of trees to sample> -r <number of replicates> -out <prefix for outputs>
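The subsampling step can be pictured with a short Python sketch: read one tree per line, then draw a fixed number of trees for each replicate. This is not the code of sampleTrees.pl. Whether that script samples with or without replacement is not stated here, so sampling without replacement is an assumption, and the names read_trees and subsample_trees are hypothetical.

```python
import random


def read_trees(path):
    """Read a concatenated tree file with one Newick tree per line."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def subsample_trees(trees, n, replicates, seed=None):
    """Draw `replicates` independent samples of `n` trees each.

    Assumption: each replicate is drawn without replacement; this may
    differ from what sampleTrees.pl does.
    """
    if n > len(trees):
        raise ValueError("cannot sample more trees than are available")
    rng = random.Random(seed)  # a fixed seed makes the draws repeatable
    return [rng.sample(trees, n) for _ in range(replicates)]


# Example with ten toy trees: 5 replicates of 3 trees each.
toy_trees = [f"(a{i},b{i});" for i in range(10)]
for replicate in subsample_trees(toy_trees, n=3, replicates=5, seed=1):
    print(replicate)
```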

Fisher’s Exact Test

A Fisher’s Exact Test is used to identify locations with significant increases of gene duplication compared with a null simulation.

perl runFisher_null.pl
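For readers who want to see what such a comparison looks like, the Python sketch below computes a one-sided Fisher's Exact Test on a 2x2 table of subtree counts (with and without a shared duplication, in observed and simulated data) at a single node. The table layout and the choice of tail are assumptions made for illustration; the inputs and internals of runFisher_null.pl are not described here, and fisher_exact_one_sided is a hypothetical name.

```python
from math import comb


def fisher_exact_one_sided(a, b, c, d, alternative="greater"):
    """One-sided Fisher's Exact Test on the 2x2 table [[a, b], [c, d]].

    Assumed layout for one node (illustration only):
      a = observed subtrees with a shared duplication
      b = observed subtrees without one
      c = simulated subtrees with a shared duplication
      d = simulated subtrees without one
    """
    row1 = a + b           # all observed subtrees
    col1 = a + c           # all subtrees with a duplication
    total = a + b + c + d
    lowest = max(0, row1 - (total - col1))
    highest = min(row1, col1)

    def prob(x):
        # Hypergeometric probability of x duplications among the observed
        # subtrees, given the fixed row and column totals.
        return comb(col1, x) * comb(total - col1, row1 - x) / comb(total, row1)

    if alternative == "greater":
        values = range(a, highest + 1)
    elif alternative == "less":
        values = range(lowest, a + 1)
    else:
        raise ValueError("alternative must be 'greater' or 'less'")
    return sum(prob(x) for x in values)


# Toy counts: 3 of 4 observed subtrees vs 1 of 4 simulated subtrees.
print(fisher_exact_one_sided(3, 1, 1, 3, alternative="greater"))
```

With alternative="greater" the test asks whether the observed data show more duplications than the simulation; alternative="less" asks the opposite question, which is the direction relevant when comparing against a positive simulation.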

Positive simulations:

Getting the positive simulations requires: 1. An ultrametric species tree in newick format. 2. A sim.ctl file. 3. simulateGeneTrees.3.0.pl.

Modify the species tree

Label the WGD(s) on the corresponding branch(es) of the ultrametric species tree.

Modify the sim.ctl file

Please read through the sim.ctl file carefully and modify it as needed for your own data sets. The default wgd_retention_rate is 0.20. The wgd_time_before_divergence is half of the length of the branch subtending the WGD.
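In an ultrametric tree, the length of a branch is the age of its parent node minus the age of its child node, so the value can be worked out from node ages. The Python sketch below shows the arithmetic only; the function name and the node ages are hypothetical, and the units should match whatever the species tree and sim.ctl use.

```python
def wgd_time_before_divergence(parent_age, child_age):
    """Half the length of the branch that carries the WGD.

    Illustration only: parent_age and child_age are the ages of the
    nodes at the top and bottom of the branch in the ultrametric tree.
    """
    branch_length = parent_age - child_age
    if branch_length <= 0:
        raise ValueError("parent node must be older than child node")
    return branch_length / 2.0


# Hypothetical branch running from a node aged 100 to a node aged 60.
print(wgd_time_before_divergence(100.0, 60.0))  # prints 20.0
```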

Running positive simulation

The simulateGeneTrees.3.0.pl script requires jprime-0.3.4.jar to be installed. A different version of JPrIME (https://github.com/arvestad/jprime) can be used by modifying this script.

perl simulateGeneTrees.3.0.pl

Subsampling simulated trees

The input is a concatenated file of all the simulated gene trees. Each simulated gene tree should be on a single line.

perl sampleTrees.pl -in <input> -n <number of trees to sample> -r <number of replicates> -out <prefix for outputs>

Fisher’s Exact Test

A second Fisher’s Exact Test is then used to characterize a burst of duplication as ancient polyploidy: the burst is consistent with a WGD if the observed percentage of duplications is not significantly lower than the percentage in a positive simulation.

perl runFisher_positive.pl