PhyloPhlAn: microbial Tree of Life using 400 universal proteins
PhyloPhlAn is a computational pipeline for reconstructing highly accurate and resolved phylogenetic trees based on whole-genome sequence information. The pipeline is scalable to thousands of genomes and uses the most conserved 400 proteins for extracting the phylogenetic signal. PhyloPhlAn also implements taxonomic curation, estimation, and insertion operations.
The main features of PhyloPhlAn are:
- completely automatic, as the user needs only to provide the (unannotated) protein sequences of the input genomes (as multifasta files of peptides - not nucleotides)
- very high topological accuracy and resolution because of the use of up to 400 previously identified most conserved proteins
- the possibility of integrating new genomes in the already reconstructed most comprehensive tree of life (3,171 microbial genomes)
- taxonomy estimation for the newly inserted genomes
- taxonomic curation for the produced phylogenetic trees
New PhyloPhlAn implementation [alpha version]
We are developing a new version of PhyloPhlAn and here you can find the new PhyloPhlAn wiki page.
Please note that it is still an alpha release available in the dev branch of the repository.
PhyloPhlAn can also be obtained using Mercurial as follows:
$ hg clone https://bitbucket.org/nsegata/phylophlan
PhyloPhlAn has been developed and tested on Unix-based systems. On Windows or Mac systems, PhyloPhlAn may require some tweaking.
If you find the software or methodology useful, please cite the accompanying manuscript:
PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes
Nicola Segata, Daniela Börnigen, Xochitl C. Morgan, and Curtis Huttenhower.
Nature Communications 4, 2013
You can download PhyloPhlAn's inferred phylogenetic tree as a Newick file (with bootstrap support) in which the genome labels are encoded with the IMG taxon ID (prefixed with 't'). Versions of the same tree with leaf nodes annotated with labels for species, genera, families, and phyla are also available. In addition, we provide the 400 alignments and the subsampled concatenated alignment.
The image below shows the comprehensive, automated, and high-resolution microbial tree of life with taxonomic annotations obtained with PhyloPhlAn. It contains a total of 3,737 microbial genomes.
A high-resolution version of this image can be downloaded here.
Updates and mailing list
Software updates will be posted on the Bitbucket repository. You are more than welcome to use the Issue Tracking system on Bitbucket (or email us) to provide feedback, report bugs, and suggest or request new features.
If you have questions or comments, or you would like to be notified about new versions, new features, or any other news related to PhyloPhlAn, please join our mailing list.
Common commands and examples
"De novo" phylogenetic tree building with any sets of genomes
If you would like to build a phylogenetic tree using any set of private or public genomes, all you need to do is create a folder inside the input folder and copy into it one multifasta file (with extension ".faa") for each genome, containing its peptide sequences. If you call this folder "my_genomes", here is the command you need to run:
$ ./phylophlan.py -u my_genomes
When finished, the resulting tree will appear in the project's output folder (a folder with the project basename inside output/).
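The folder preparation described above can also be scripted. The following is a minimal sketch, not part of PhyloPhlAn itself: the function name prepare_project and its parameters are hypothetical, and it only assumes the layout stated above (one ".faa" protein multifasta file per genome, placed in a project folder inside input/).

```python
import shutil
from pathlib import Path


def prepare_project(faa_files, project, phylophlan_dir="."):
    """Copy protein multifasta files into input/<project>/ and return that folder.

    Hypothetical helper: it only reproduces the manual step of creating a
    project folder inside the PhyloPhlAn input folder.
    """
    project_dir = Path(phylophlan_dir) / "input" / project
    project_dir.mkdir(parents=True, exist_ok=True)
    for src in map(Path, faa_files):
        # PhyloPhlAn expects one peptide multifasta per genome, named *.faa
        if src.suffix != ".faa":
            raise ValueError(f"{src.name}: expected a .faa protein multifasta file")
        shutil.copy(src, project_dir / src.name)
    return project_dir
```

For instance, prepare_project(["genomeA.faa", "genomeB.faa"], "my_genomes") would populate input/my_genomes, after which the -u command shown above can be run on that project (the file names here are placeholders).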
Example 1: Corynebacterium "de novo" phylogenetic tree building
You can try out this operation (-u) using an example included in the PhyloPhlAn package you downloaded, called example_corynebacteria and stored in the input folder. It contains a protein multifasta file for each of the 30 genomes available for the Corynebacterium genus as of February 2012, plus two Streptomyces genomes as a meaningful outgroup. As mentioned above, the command for obtaining the phylogenetic tree is:
$ ./phylophlan.py -u example_corynebacteria --nproc 4
Using 4 threads (specified with --nproc 4), this operation should take no more than 4-5 minutes; even with a single processor (the default), the results should be ready in about 10 minutes.
In the output/example_corynebacteria/ folder you'll find a Newick file of the resulting tree as provided by FastTree, and a PhyloXML file containing the same tree rerooted with a procedure that tries to maximize the distance from the root to any leaf. The two files are available for download (example_corynebacteria.tree.nwk, example_corynebacteria.tree.reroot.xml) and can be inspected with tree visualization software and drawn with GraPhlAn. Figure 3B in the PhyloPhlAn paper reports and discusses this example.
The full tree of life reported above was originally generated in the same way. Notice that the concatenated alignment used to generate the tree with FastTree is stored in data/example_corynebacteria/aln.fna and can be used as input for other phylogenetic reconstruction software, such as RAxML or MEGA, among many others.
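Before passing the concatenated alignment to another tool, it can be useful to check that it is well formed. The sketch below is an illustration only and makes one assumption not stated above: that aln.fna is a plain FASTA file in which every record is one aligned row. The function names are hypothetical, and only the Python standard library is used.

```python
def read_fasta(path):
    """Return a dict mapping each FASTA record ID to its sequence string."""
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first whitespace-delimited token of the header as ID
                header = line[1:].split()
                name = header[0] if header else ""
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


def alignment_summary(path):
    """Return (number of sequences, alignment length).

    Raises ValueError if the rows differ in length, since that would mean
    the file is not a valid alignment.
    """
    records = read_fasta(path)
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) > 1:
        raise ValueError("sequences differ in length; not a valid alignment")
    return len(records), (lengths.pop() if lengths else 0)
```

Calling alignment_summary("data/example_corynebacteria/aln.fna") would then report how many genomes and how many aligned columns the file contains, assuming the FASTA assumption holds.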
Inserting new genomes to the tree of life
PhyloPhlAn lets you insert a genome (or a set of genomes) into the already built microbial tree of life (containing >3,000 genomes, see figure and tree files above). In this case too, you need to create a dedicated folder (e.g. my_genomes_to_insert) in the input folder to store the protein multifasta files of interest. The command is:
$ ./phylophlan.py -i my_genomes_to_insert --nproc 16
We recommend using as many threads as possible (--nproc), because this operation is computationally demanding: the alignments with the more than 3,000 existing genomes must be updated and the full tree of life rebuilt.
The resulting tree file output/my_genomes_to_insert/my_genomes_to_insert.tree.int.nwk can be inspected with tree visualization software to check where the new genomes are placed and how they relate to already well-characterized strains.
Example 2: inserting Lactobacillus and Sulfolobus genomes into the tree of life
As an example of insertion, the input folder contained in the PhyloPhlAn package includes three recently sequenced genomes that are not yet part of the PhyloPhlAn tree and repository. These are two Lactobacillus genomes and one Sulfolobus genome available in IMG (accessions 2511231185, 2519899592, and 2524023197, respectively).
$ ./phylophlan.py -i example_insertion --nproc 16
The resulting file example_insertion.tree.int.nwk now contains the thousands of genomes in the PhyloPhlAn repository as well as the three "new" genomes.
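A quick way to confirm that the inserted genomes made it into the tree is to list the leaf labels of the Newick file. The sketch below is a rough illustration, not a full Newick parser: it assumes unquoted leaf labels, and it assumes (this is not stated above) that each inserted genome appears as a leaf named after its input file. Both function names are hypothetical.

```python
import re


def newick_leaves(newick):
    """Return the set of leaf labels in a Newick string (unquoted labels only).

    A leaf label is taken to be any name that directly follows "(" or ",".
    Internal node labels and support values follow ")" and are skipped.
    """
    matches = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return {m.strip() for m in matches if m.strip()}


def missing_genomes(newick, expected):
    """Return the sorted list of expected genome labels absent from the tree."""
    return sorted(set(expected) - newick_leaves(newick))
```

With the tree file read into a string, missing_genomes(tree_text, my_labels) returning an empty list would indicate that every expected label is present as a leaf.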
Imputing taxonomic labels for newly integrated genomes
You can also ask PhyloPhlAn to try to automatically assign taxonomic labels to the genomes integrated into the tree of life (-i option introduced above). This is done by simply adding the -t flag (for taxonomic analysis) to the same command line:
$ ./phylophlan.py -i -t my_genomes_to_insert --nproc 16
In addition to the output/my_genomes_to_insert/my_genomes_to_insert.tree.int.nwk file, you will obtain tab-separated text files with the most confident taxonomic predictions for your genomes in the project's output folder.
Example 3: predicting the taxonomic labels of three "new" genomes
Suppose you don't know the taxonomic labels of the Lactobacillus and Sulfolobus genomes used as examples above, possibly because of insufficient phenotypic characterization or because you obtained them by metagenomic assembly. You can call the PhyloPhlAn taxonomic imputation pipeline as:
$ ./phylophlan.py -i -t example_insertion --nproc 16
And check the predictions in the imputed_conf_high-conf.txt file that we report below:
Sulfolobus_acidocaldarius_N8	d__Archaea.p__Crenarchaeota.c__Thermoprotei.o__Sulfolobales.f__Sulfolobaceae.g__Sulfolobus.s__?.t__?
Lactobacillus_rhamnosus_K_ATCC_8530	d__Bacteria.p__Firmicutes.c__Bacilli.o__Lactobacillales.f__Lactobacillaceae.g__Lactobacillus.s__rhamnosus.t__?
Lactobacillus_rhamnosus_LRHMDP3	d__Bacteria.p__Firmicutes.c__Bacilli.o__Lactobacillales.f__Lactobacillaceae.g__Lactobacillus.s__rhamnosus.t__?
As expected, all three genomes are assigned to the right genera. The two lactobacilli are also assigned to the right species (s__rhamnosus), whereas PhyloPhlAn does not find enough support to assign the Sulfolobus genome to the "acidocaldarius" species.
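Prediction lines like those above are easy to process programmatically. The following sketch is an illustration under stated assumptions: the genome name and the taxonomy string are separated by whitespace (a tab in the real file), the levels are separated by "." and written as prefix__value, and "?" marks a level left unassigned. The function name parse_prediction is hypothetical, and taxon names that themselves contain a "." would not be handled.

```python
def parse_prediction(line):
    """Split one prediction line into (genome name, {prefix: label or None}).

    The single-letter prefixes (d, p, c, o, f, g, s, t) are kept as the
    dictionary keys; a "?" value is returned as None (no confident label).
    """
    genome, taxonomy = line.split(None, 1)
    labels = {}
    for field in taxonomy.strip().split("."):
        prefix, _, value = field.partition("__")
        labels[prefix] = None if value == "?" else value
    return genome, labels


def deepest_assigned(labels):
    """Return the (prefix, label) of the last level that has a label, or None."""
    assigned = [(prefix, value) for prefix, value in labels.items() if value]
    return assigned[-1] if assigned else None
```

Applied to the Sulfolobus line above, deepest_assigned reports the genus level ("g", "Sulfolobus"), which matches the observation that no species could be assigned for that genome.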
All command line options and parameters
$ ./phylophlan.py -h
usage: phylophlan.py [-h] [-i] [-u] [-t] [--tax_test TAX_TEST] [-c]
                     [--cleanall] [--nproc N] [-v]
                     [PROJECT NAME]

NAME AND VERSION:
    PhyloPhlAn version 0.99 (8 May 2013)

AUTHORS:
    Nicola Segata (email@example.com) and Curtis Huttenhower (firstname.lastname@example.org)

DESCRIPTION
    PhyloPhlAn is a computational pipeline for reconstructing highly accurate
    and resolved phylogenetic trees based on whole-genome sequence information.
    The pipeline is scalable to thousands of genomes and uses the most conserved
    400 proteins for extracting the phylogenetic signal. PhyloPhlAn also
    implements taxonomic curation, estimation, and insertion operations.

positional arguments:
  PROJECT NAME          The basename of the project corresponding to the name
                        of the input data folder inside input/. The input data
                        consist of a collection of multifasta files (extension
                        .faa) containing the proteins in each genome. If the
                        project already exists, the already executed steps are
                        not re-ran. The results will be stored in a folder
                        with the project basename in output/ Multiple project
                        can be generated and they safetely coexists.

optional arguments:
  -h, --help            show this help message and exit
  -i, --integrate       Integrate user genomes into the PhyloPhlAn tree
  -u, --user_tree       Build a phylogenetic tree using user genomes only
  -t, --taxonomic_analysis
                        Check taxonomic inconsistencies and refine/correct
                        taxonomic labels
  --tax_test TAX_TEST   nerrors:type:taxl:tmin:tex:name (alpha version,
                        experimental!)
  -c, --clean           Clean the final and partial data produced for the
                        specified project. (use --cleanall for removing
                        general installation and database files)
  --cleanall            Remove all instalation and database file leaving
                        untouched the initial compressed data that is
                        automatically extracted and formatted at the first
                        pipeline run. Projects are not remove (specify a
                        project and use -c for removing projects).
  --nproc N             The number of CPUs to use for parallelizing the
                        blasting [default 1, i.e. no parallelism]
  -v, --version         Prints the current PhyloPhlAn version and exit
External Software Dependencies
- muscle version v3.8.31 or higher must be present in the system path and called "muscle"
- usearch version v5.2.32 (notice that version 6 is currently NOT supported) must be present in the system path and called "usearch"
- FastTree version 2.1 or higher must be present in the system path and called "FastTree"
- Biopython (strictly a PyPhlAn dependency, but it is used inside PhyloPhlAn)
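The dependencies listed above can be checked before the first run. The sketch below is only an illustration using the Python standard library: it verifies that executables with the exact names given above are on the system path and that the Biopython package (imported as Bio) is installed. It does not check version numbers, and the function names are hypothetical.

```python
import importlib.util
import shutil

# Executable names exactly as required in the list above
REQUIRED_TOOLS = ("muscle", "usearch", "FastTree")


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its full path, or None if not on the PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=REQUIRED_TOOLS):
    """Return the executable names that could not be found on the PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


def has_biopython():
    """Return True if the Biopython package (module name "Bio") is importable."""
    return importlib.util.find_spec("Bio") is not None


if __name__ == "__main__":
    for tool in missing_tools():
        print(f"missing executable: {tool}")
    if not has_biopython():
        print("missing Python package: Biopython")
```

Versions still need to be checked by hand, in particular for usearch, since version 6 is not supported.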
The authors of PhyloPhlAn would like to thank Ashlee Earl and the Human Microbiome Project Strains Working Group for insightful suggestions, Morgan Price for his helpful comments on applying FastTree, and Levi Waldron, Joshua Reyes, and Timothy Tickle for their suggestions on methodology and tree visualization.
Changes in version 0.99 (8 May 2013)
Updates:
- PyPhlAn dependency removed
- command line arguments simplified
Changes in version 0.98 (28 July 2012)
Bug fixes:
- missing data file added
Changes in version 0.97 (24 July 2012)
First public release