Guenomu
guenomu (ゲノム) means "genome" in Japanese, and is software written in C that estimates the species tree for a given set of gene families. In theory it will work with aligned sequences, one alignment per gene family, but at the moment it is functional only if you provide the distribution of gene trees for each family. That is, you must provide the possible gene topologies for each gene family, and the program will sample species trees compatible with these gene trees, assuming that the disagreement between them is due to a combination of duplications and losses, deep coalescences, or other stochastic processes.
References
Species Tree Estimation from Genome-wide Data with Guenomu, biorxiv, 2015 DOI: 10.1101/023861
A Bayesian Supertree Model for Genome-Wide Species Tree Reconstruction, Systematic Biology, 2016 DOI: 10.1093/sysbio/syu082
Compilation
This software uses the Autotools suite to help find the appropriate libraries etc. In theory Autotools allows the program to compile on several OSes, but I have tested it only on Linux systems and strongly advise against running it on Windows or Macs -- you're on your own if you choose to do so. To compile it, you just need to run "configure" followed by "make" and perhaps "make install". I suggest you create a separate directory for compiling (please avoid compiling on top of the distribution directory, or you'll have problems when compiling again afterwards). The "full" set of commands would therefore look like:
/home/simpson $ wget https://bitbucket.org/leomrtns/guenomu/get/823a4854650b.zip
/home/simpson $ unzip 823a4854650b.zip
/home/simpson $ mkdir build
/home/simpson $ cd build
/home/simpson/build $ ../leomrtns-guenomu-823a4854650b/configure --enable-mpi --prefix="/home/simpson/local"
/home/simpson/build $ make; make install
In the example above, I'm using a particular version (823a4854650b) of the program; please go to http://bitbucket.org/leomrtns/guenomu/downloads to download the newest one. After unzipping this file, a directory "leomrtns-guenomu-xxx" will be created, and we call the "configure" script from there. Notice that we didn't change into this directory: we created another one just for the compilation process, called "build". In our example we furthermore asked it to compile the parallel version with MPI, and to install the final executables and libraries in the directory "/home/simpson/local" (a common alternative, if you are the administrator, is to use "/usr/local/").
The configure script may complain that you don't have the argtable2 library, or that you lack the MPI environment. In these cases you may need to install the library (see below) or give up the parallel version (just remove the "--enable-mpi" option and you'll be able to compile the serial version).
Run "configure --help" to see a description of its command-line options.
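A build script could choose between the serial and the parallel configuration automatically. The sketch below is only an illustration under stated assumptions: the function name configure_flags is made up here, and it treats the presence of an MPI compiler wrapper on the PATH (mpicc by default) as a sign that an MPI environment is available, which may not be exactly what the configure script itself checks.

```shell
# Hypothetical helper (not part of guenomu): print the configure flags to use.
# $1 = installation prefix
# $2 = name of the MPI compiler wrapper to look for (assumed: mpicc)
configure_flags() {
    prefix="$1"
    mpi_compiler="${2:-mpicc}"
    flags="--prefix=$prefix"
    # Ask for the parallel version only if the wrapper is on the PATH.
    if command -v "$mpi_compiler" >/dev/null 2>&1; then
        flags="--enable-mpi $flags"
    fi
    printf '%s\n' "$flags"
}
```

From the build directory it could be used as ../leomrtns-guenomu-823a4854650b/configure $(configure_flags /home/simpson/local); the unquoted substitution is intentional so that the flags are split into separate arguments, which also means the prefix must not contain spaces.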
Installing argtable2
It is possible that you don't have the mandatory argtable2 library, in which case the program will fail to configure. There are three ways to fix that:
1) Ask your system administrator to install it system-wide, which also makes it visible to pkg-config. In Debian, for instance, this can be achieved by installing the package libargtable2-dev.
2) Install argtable2 locally (downloading and following installation instructions from http://sourceforge.net/projects/argtable/files/), and then configure guenomu bypassing pkg-config:
/home/simpson/build $ ARGTABLE2_CFLAGS="-I/home/simpson/local/include" ARGTABLE2_LIBS="-L/home/simpson/local/lib -largtable2" ../leomrtns-guenomu-823a4854650b/configure --enable-mpi --prefix="/home/simpson/local"
This assumes that you have installed argtable2 in /home/simpson/local -- that is, the argtable2 installation process should have created the files "/home/simpson/local/include/argtable2.h", "/home/simpson/local/lib/libargtable2.so" etc. I repeat, this is the installation directory (defined with "--prefix="), not the place where you downloaded the argtable2 library.
3) Install argtable2 locally as above, and use a custom pkg-config file. First create a file called "argtable2.pc" in your guenomu build directory with the following contents (again, assuming argtable2 was installed in "/home/simpson/local"):
prefix=/home/simpson/local
exec_prefix=${prefix}
includedir=${prefix}/include
libdir=${exec_prefix}/lib
Name: argtable2
Description: A library for parsing GNU style command line arguments
Version: 12
Libs: -L${libdir} -largtable2
Cflags: -I${includedir}
Therefore, if you have a file with full path "/home/simpson/build/argtable2.pc" with the contents above, then you can compile guenomu with
/home/simpson/build $ PKG_CONFIG_PATH=/home/simpson/build/ ../leomrtns-guenomu-823a4854650b/configure --enable-mpi --prefix="/home/simpson/local"
The quick explanation is that pkg-config is the tool the configure script uses to locate libraries. The file "argtable2.pc" tells pkg-config where to look for your argtable2 library, and the environment variable "PKG_CONFIG_PATH" tells it where the ".pc" files are. You can even create a "/home/simpson/local/lib/pkgconfig" directory and add ".pc" files there for your non-standard programs, using the recipe above. I haven't tried solution (2), so please let me know if you run into trouble -- and especially if you know how to fix it!
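The recipe for solution (3) can be scripted. The following is a minimal sketch, assuming argtable2 was installed under the prefix you pass in; the function name write_argtable2_pc is hypothetical, and the file contents are simply those listed above.

```shell
# Hypothetical helper: write argtable2.pc under PREFIX/lib/pkgconfig and
# print that directory so it can be used as PKG_CONFIG_PATH.
# $1 = prefix where argtable2 was installed (its own --prefix)
write_argtable2_pc() {
    prefix="$1"
    pcdir="$prefix/lib/pkgconfig"
    mkdir -p "$pcdir" || return 1
    {
        printf 'prefix=%s\n' "$prefix"
        # The quoted 'EOF' keeps ${prefix} etc. literal, as pkg-config expects.
        cat <<'EOF'
exec_prefix=${prefix}
includedir=${prefix}/include
libdir=${exec_prefix}/lib
Name: argtable2
Description: A library for parsing GNU style command line arguments
Version: 12
Libs: -L${libdir} -largtable2
Cflags: -I${includedir}
EOF
    } > "$pcdir/argtable2.pc"
    printf '%s\n' "$pcdir"
}
```

After pcdir=$(write_argtable2_pc /home/simpson/local), running PKG_CONFIG_PATH="$pcdir" pkg-config --cflags --libs argtable2 is a quick way to see whether pkg-config finds the file before you call configure.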
Programs
The main program is called "guenomu", which is responsible for the Bayesian MCMC analysis (more on that below). There are other experimental programs independent from the main one. These were developed to perform small tasks, check the correctness of some functions, show how to use the biomcmc library, or provide some ancillary version of the software.
The directory "src/test" contains several short programs, of which you may find these particularly interesting:
- bmc2_maxtree: species tree estimates based on gene family trees using algorithms described in this paper (the uncorrected versions), but that can work with or without branch lengths -- if the gene tree does not have branch lengths, the program assumes that the patristic distances are simply the number of edges between the tips. The name comes from the first of these distance matrix algorithms I studied.
- bmc2_addTreeNoise: simulator of tree inference uncertainty: for each tree in a tree file, it uses branch lengths as proxies for error in tree estimation. That is, it adds noise to the tree such that the shorter the branch, the more noise it has -- and other trees without the branch will appear. This program is a poor man's version of simulating an alignment under the tree and then using MrBayes to reconstruct the tree.
- bmc2_tree: calculates the minimum number of duplications, losses and deep coalescences between all gene trees from a file and all species trees from another file (under D+L and ILS costs, independently), as well as other (novel) distance measures like the dSPR (still experimental and based on my recombination detection model), the MulRF (from here) and what I call the Hdist (which I rediscovered when speeding up the dSPR algorithm but which is similar to one described in 2006).
Other executables include bmc2_alignment, bmc2_random and bmc2_discrete, which are simple checks to see if the functions and structures are working properly, and bmc2_likelihood, which is an experimental Maximum Likelihood tree estimator based on quick-and-dirty simulated annealing.
If you need to use some of these programs in any serious analysis, please drop me a note so that I can reassess my priorities and get back to you (at this time I'm treating those just as consistency checks of the program). I don't check these programs regularly, so it's not impossible that I change some underlying function and render them unusable...
Control File
The program "guenomu" only needs one argument (but see below), which is the name of the control file. The control file accepts comments as in the nexus format -- that is, between square brackets "[" and "]". It receives two kinds of parameters: 1) those that start with param_ and are followed by a fixed number of parameter values after the equals sign "="; and 2) those that start with begin_list_of_, which are followed by an unknown number of values, until an end_of_list is found. The order of parameters is irrelevant, and most of them have default values, except the file names of gene trees and the species names.
For more information and a detailed description of the parameters, please check the example of a control file.
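When keeping track of several analyses it can help to list which parameters a given control file sets. The sketch below is a rough illustration and not how guenomu itself parses the file: the function name list_control_keys is hypothetical, it assumes comments are not nested, and it only relies on the two prefixes described above (param_ and begin_list_of_).

```shell
# Hypothetical helper: print the parameter keys set in a control file.
# $1 = path to the control file
list_control_keys() {
    # 1. join all lines so that [ ... ] comments spanning lines can be removed
    # 2. delete the comments (assumed not nested)
    # 3. split on blanks and "=" into one token per line
    # 4. keep only tokens that start with one of the two known prefixes
    tr '\n' ' ' < "$1" |
        sed 's/\[[^]]*]//g' |
        tr -s ' \t=' '\n\n\n' |
        grep -E '^(param_|begin_list_of_)'
}
```

Anything inside a comment is ignored, so a parameter that has been commented out will not be listed.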
The program also accepts command-line arguments, which take precedence over the control file -- that is, they will override the choice given in the file. Therefore a good compromise is to have a control file with minimal information that is unlikely to change (like the names of the files with species names and gene trees), while you control other options on the command line. You can call "guenomu -h" or "guenomu --help" to see a complete description of the command-line options, and a short description is also given if you call the program without any argument. The command-line options are particularly useful since you must call the program once to do the MCMC analysis (option "-z 0") and then call it once more to analyze the produced output (run it with "-z 1").
All options have an equivalent command-line version, so the control file is not needed (although when running several analyses it helps in keeping track of the parameters used).
How to Run
As explained above, you must run the program twice: once to actually do the MCMC sampling and once more to analyze the output. The MCMC sampling will generate one binary file (or several, if you're using the parallel version), called something like "job0.checkpoint.bin". This file is much more compact than a formatted text table, but needs to be "decompressed" afterwards so that we have usable information. Even if the MCMC program halts before completion, the binary file is usable (in the future it will be used to resume a halted analysis). In fact, you can "read" this file while the MCMC program is still running, to get an idea of how the sampling is going.
"WARNING 2014.11.13: the parallel version of the program may have bugs"
If you have many gene families, you may want to use the parallel version of the program. I couldn't find a smart (and easy) way of compiling both the serial and parallel versions at once, so for this you'll need to reconfigure and recompile the program, passing the "--enable-mpi" option to the "configure" script. Then you can go to a cluster of PCs or a multi-core machine and run the program like "mpirun -np 8 --hostfile /myhosts.txt guenomu controlfile" -- in this case we are saying that we want to run on 8 cores (the program will spawn 8 jobs) and that our list of computers is in the file "/myhosts.txt". The number of jobs must be smaller than the number of gene families, preferably much smaller: all jobs communicate at every iteration, so if each iteration is very quick the program spends most of its time synchronizing the communication between jobs.
Details you may want to skip: the communication is minimal (all jobs know all random moves involving common parameters like the species tree, so they just need to broadcast the proposal values), there's a small time/memory overhead due to all jobs working with their own copies of the species trees, and the time for one iteration depends on the gene family size. Therefore the program tries to divide the gene families equally amongst jobs -- while the number of gene families is almost constant between jobs, the program avoids one job having many large families while another has many small ones by sorting them by size and spreading them over the jobs.
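The idea of sorting families by size and spreading them over jobs can be pictured with a small sketch. This is only an illustration, not guenomu's actual code: the function name assign_families, the "name size" input format and the simple round-robin rule are all assumptions made here.

```shell
# Hypothetical illustration: spread gene families over jobs by size.
# stdin : one "family_name size" pair per line
# $1    : number of jobs
# stdout: "job_index family_name size", largest families first
assign_families() {
    njobs="$1"
    # Sort by size (descending, ties broken by name), then deal the families
    # out to jobs 0..njobs-1 in turn, like dealing cards.
    sort -k2,2nr -k1,1 |
        awk -v n="$njobs" '{ print (NR - 1) % n, $1, $2 }'
}
```

With four families of sizes 10, 500, 20 and 400 and two jobs, each job receives one large and one small family, and both jobs end up with the same number of families.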
As seen above, the first step (MCMC sampling) is called through "guenomu -z 0 controlfile", while the second step (interpreting the binary output) is called by "guenomu -z 1 controlfile". When using the parallel version you must remember to call the program both times under the same parallel settings, that is, with the same number of jobs. Therefore if you run the MCMC sampling with "mpirun -np 8 guenomu -z 0 controlfile" you must analyze the output with "mpirun -np 8 guenomu -z 1 controlfile". (You don't need to be on the same supercomputer, though, as long as you have a working MPI environment. The issue here is that the distribution of gene families depends on the number of jobs.)
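One way to make sure both steps use the same number of jobs is to generate the two command lines from a single value. The sketch below is a hedged example: the function name guenomu_commands is hypothetical, it only prints the commands rather than running them, and it assumes the control file name contains no spaces.

```shell
# Hypothetical helper: print the sampling and analysis commands.
# $1 = control file name
# $2 = number of jobs (optional, default 1 = serial version)
guenomu_commands() {
    ctrl="$1"
    np="${2:-1}"
    launcher=""
    # Use the same "mpirun -np N" prefix for both steps when N > 1.
    if [ "$np" -gt 1 ]; then
        launcher="mpirun -np $np "
    fi
    # Step 0 is the MCMC sampling, step 1 the analysis of its binary output.
    for step in 0 1; do
        printf '%sguenomu -z %s %s\n' "$launcher" "$step" "$ctrl"
    done
}
```

Calling guenomu_commands controlfile 8 prints the two mpirun lines given above; piping the output to "sh -e" would run them one after the other, stopping if the sampling step fails.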