PhyloPhlAn: microbial Tree of Life using 400 universal proteins

PhyloPhlAn is a computational pipeline for reconstructing highly accurate and resolved phylogenetic trees based on whole-genome sequence information. The pipeline is scalable to thousands of genomes and uses the most conserved 400 proteins for extracting the phylogenetic signal. PhyloPhlAn also implements taxonomic curation, estimation, and insertion operations.

The main features of PhyloPhlAn are:

  • completely automatic, as the user needs only to provide the (unannotated) protein sequences of the input genomes (as multifasta files of peptides - not nucleotides)
  • very high topological accuracy and resolution because of the use of up to 400 previously identified most conserved proteins
  • the possibility of integrating new genomes in the already reconstructed most comprehensive tree of life (3,171 microbial genomes)
  • taxonomy estimation for the newly inserted genomes
  • taxonomic curation for the produced phylogenetic trees

Obtaining PhyloPhlAn

PhyloPhlAn can be downloaded here or accessed from our live source code repository.

PhyloPhlAn can also be obtained using Mercurial as follows:

$ hg clone https://bitbucket.org/nsegata/phylophlan

The package can also be downloaded as a compressed file in zip and bz2 formats.

PhyloPhlAn has been developed and tested on Unix-based systems. On Windows or Mac systems, PhyloPhlAn may require some tweaking.


Citing PhyloPhlAn

If you find the software or methodology useful, please cite the accompanying manuscript:

PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes
Nicola Segata, Daniela Börnigen, Xochitl C. Morgan, and Curtis Huttenhower.
Nature Communications 4, 2013

You can download PhyloPhlAn's inferred phylogenetic tree as a Newick file (with bootstrapping support) in which the genome labels are encoded with IMG taxon IDs (prefixed with 't'). The same tree with leaf nodes annotated with labels for species, genera, families, and phyla is also available. In addition, we provide the 400 alignments and the subsampled concatenated alignment.


The image below shows the comprehensive, automated, and high-resolution microbial tree of life with taxonomic annotations obtained with PhyloPhlAn. It contains a total of 3,737 microbial genomes.

PhyloPhlAn

A high-resolution version of this image can be downloaded here.

Updates and mailing list

Software updates will be posted on the Bitbucket repository. You are more than welcome to use the Issue Tracking system on Bitbucket (or email us) to provide feedback, report bugs, and suggest or request new features.

If you have questions or comments, or would like to be notified about new versions, new features, or any other news related to PhyloPhlAn, please join our mailing list:

PhyloPhlAn google group


Common commands and examples

"De novo" phylogenetic tree building with any sets of genomes

If you would like to build a phylogenetic tree using any set of private or public genomes, all you need to do is create a folder inside the input folder and copy into it one multifasta file (with extension ".faa") per genome, containing that genome's peptide sequences. If you call this folder "my_genomes", this is the command to run:

$ ./phylophlan.py -u my_genomes

When finished, the resulting tree will appear in the output/my_genomes folder.
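The staging step described above can be scripted. The following is a minimal Python sketch, not part of PhyloPhlAn: the helper names stage_genomes, build_command, and run_de_novo are hypothetical, and the sketch assumes the package has been unpacked into a directory that contains phylophlan.py and the input folder. Only the -u and --nproc options documented on this page are used.

```python
import shutil
import subprocess
from pathlib import Path


def stage_genomes(faa_files, phylophlan_dir, project):
    """Copy one .faa protein multifasta per genome into input/<project>/."""
    project_dir = Path(phylophlan_dir) / "input" / project
    project_dir.mkdir(parents=True, exist_ok=True)
    for faa in map(Path, faa_files):
        # PhyloPhlAn expects peptide multifasta files with the .faa extension.
        if faa.suffix != ".faa":
            raise ValueError(f"{faa.name}: expected a .faa protein multifasta file")
        shutil.copy(faa, project_dir / faa.name)
    return project_dir


def build_command(project, nproc=1):
    """Return the de novo (-u) command line shown above as an argument list."""
    return ["./phylophlan.py", "-u", project, "--nproc", str(nproc)]


def run_de_novo(phylophlan_dir, project, nproc=1):
    """Run PhyloPhlAn from its own directory and return the expected output folder."""
    subprocess.run(build_command(project, nproc), cwd=phylophlan_dir, check=True)
    return Path(phylophlan_dir) / "output" / project
```

For instance, calling stage_genomes(["genomeA.faa", "genomeB.faa"], "phylophlan", "my_genomes") and then run_de_novo("phylophlan", "my_genomes", nproc=4) would mirror the manual steps; the file and directory names here are placeholders.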

Example 1: Corynebacterium "de novo" phylogenetic tree building

You can try out this operation (-u) using an example included in the PhyloPhlAn package you downloaded, called example_corynebacteria and stored in the input folder. It contains a protein multifasta file for each of the 30 genomes available for the Corynebacterium genus as of February 2012, plus two Streptomyces genomes as a meaningful outgroup. As mentioned above, the command for obtaining the phylogenetic tree is:

$ ./phylophlan.py -u example_corynebacteria --nproc 4

Using 4 threads (specified with --nproc 4), this operation should take no more than 4-5 minutes; even with a single processor (the default), you should get the results in about 10 minutes.

In the output/example_corynebacteria/ folder you'll find a Newick file of the resulting tree as produced by FastTree, and a PhyloXML file containing the same tree rerooted with a procedure that tries to maximize the distance from the root to any leaf. The two files are available for download (example_corynebacteria.tree.nwk, example_corynebacteria.tree.reroot.xml) and can be inspected with tree visualization software and drawn with GraPhlAn. Figure 3B in the PhyloPhlAn paper reports and discusses this example.

The full tree of life reported above was also originally generated in this way. Note that the concatenated alignment used to generate the tree with FastTree is stored in data/example_corynebacteria/aln.fna and can be used as input for other phylogenetic reconstruction software such as RAxML or MEGA, among many others.
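Before handing the concatenated alignment to another tool, it can be worth confirming that every row has the same length. The sketch below assumes that aln.fna is a plain FASTA file, which this page does not state explicitly; the function names read_fasta and alignment_length are illustrative and not part of PhyloPhlAn.

```python
def read_fasta(path):
    """Return {record_id: sequence} from a FASTA file, joining wrapped lines."""
    chunks = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first whitespace-delimited token of the header as the id.
                header = line[1:].strip()
                name = header.split()[0] if header else ""
                chunks[name] = []
            elif name is not None:
                chunks[name].append(line)
    return {record: "".join(parts) for record, parts in chunks.items()}


def alignment_length(records):
    """Return the common row length, or raise ValueError if rows differ."""
    lengths = {len(sequence) for sequence in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"rows have different lengths: {sorted(lengths)}")
    return lengths.pop()
```

Calling alignment_length(read_fasta("data/example_corynebacteria/aln.fna")) would then return the number of alignment columns, or raise an error if the file is not a rectangular alignment.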

Inserting new genomes to the tree of life

PhyloPhlAn lets you insert a genome (or a set of genomes) into the already built microbial tree of life (containing >3,000 genomes; see the figure and tree files above). In this case, too, you need to create a dedicated folder (e.g. my_genomes_to_insert) in the input folder to store the protein multifasta files of interest. The command is:

$ ./phylophlan.py -i my_genomes_to_insert --nproc 16

We recommend using as many threads as you can (--nproc), because this operation is computationally demanding: the alignments with the other 3,000 genomes must be updated and the full tree of life rebuilt.

The resulting tree file output/my_genomes_to_insert/my_genomes_to_insert.tree.int.nwk can be inspected with tree visualization software to check where the new genomes are rooted and their relations with already well characterized strains.

Example 2: inserting Lactobacillus and Sulfolobus genomes into the tree of life

As an example of insertion, the input folder of the PhyloPhlAn package includes three recently sequenced genomes that are not yet part of the PhyloPhlAn tree and repository. These are two Lactobacillus genomes and one Sulfolobus genome available in IMG (accessions 2511231185, 2519899592, and 2524023197, respectively).

$ ./phylophlan.py -i example_insertion --nproc 16

The resulting file example_insertion.tree.int.nwk now contains the thousands of genomes in the PhyloPhlAn repository as well as the three "new" genomes.
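A quick way to confirm that the inserted genomes made it into the integrated tree is to list the leaf labels of the Newick file. The following sketch uses a simple regular expression rather than a full Newick parser, so it assumes unquoted labels without spaces; the function names are hypothetical, and the assumption that leaves are labeled with the names of the input genomes is not confirmed by this page.

```python
import re

# A leaf label directly follows "(" or "," and ends at ":", ",", ")" or ";".
# Internal node labels and support values follow ")" and are therefore skipped.
_LEAF = re.compile(r"[(,]\s*([^\s(),:;]+)")


def leaf_labels(newick):
    """Return the set of leaf labels found in a Newick string."""
    return set(_LEAF.findall(newick))


def missing_genomes(newick, expected):
    """Return the expected genome labels that do not appear as leaves."""
    return sorted(set(expected) - leaf_labels(newick))
```

Reading example_insertion.tree.int.nwk into a string and passing it to missing_genomes together with the labels you expect should return an empty list when all of them are present.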

Imputing taxonomic labels for newly integrated genomes

You can also ask PhyloPhlAn to try to automatically assign taxonomic labels to the genomes integrated into the tree of life (with the -i option introduced above). To do so, simply add the -t flag (for taxonomic analysis) to the same command line:

$ ./phylophlan.py -i -t my_genomes_to_insert --nproc 16

In addition to the output/my_genomes_to_insert/my_genomes_to_insert.tree.int.nwk file, you will obtain tab-separated text files with the most confident taxonomic predictions for your genomes in the output/my_genomes_to_insert/ folder.

Example 3: predicting the taxonomic labels of three "new" genomes

Suppose you don't know the taxonomic labels of the Lactobacillus and Sulfolobus genomes used as examples above, possibly because of insufficient phenotypic characterization or because you obtained them by metagenomic assembly. You can call the PhyloPhlAn taxonomic imputation pipeline as:

$ ./phylophlan.py -i -t example_insertion --nproc 16

You can then check the predictions in the imputed_conf_high-conf.txt file, whose content we report below:

Sulfolobus_acidocaldarius_N8    d__Archaea.p__Crenarchaeota.c__Thermoprotei.o__Sulfolobales.f__Sulfolobaceae.g__Sulfolobus.s__?.t__?
Lactobacillus_rhamnosus_K_ATCC_8530     d__Bacteria.p__Firmicutes.c__Bacilli.o__Lactobacillales.f__Lactobacillaceae.g__Lactobacillus.s__rhamnosus.t__?
Lactobacillus_rhamnosus_LRHMDP3 d__Bacteria.p__Firmicutes.c__Bacilli.o__Lactobacillales.f__Lactobacillaceae.g__Lactobacillus.s__rhamnosus.t__?

As expected, all three genomes are assigned to the right genera. The two lactobacilli are also assigned to the right species (s__rhamnosus), whereas PhyloPhlAn does not find enough support to assign the Sulfolobus genome to the "acidocaldarius" species.
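The prediction lines shown above can be turned into a per-level lookup with a few lines of Python. This is a sketch based only on the three example lines: it assumes each line holds a genome name and a taxonomy string separated by whitespace, that levels are separated by "." and prefixed as "x__", and that "?" marks an unassigned level. Taxon names that themselves contain a "." would break this simple split. The function names are illustrative.

```python
def parse_taxonomy(label):
    """Split 'd__X.p__Y...' into {'d': 'X', 'p': 'Y', ...}; '?' becomes None."""
    levels = {}
    for field in label.strip().split("."):
        prefix, separator, name = field.partition("__")
        if not separator:
            raise ValueError(f"unexpected taxonomy field: {field!r}")
        levels[prefix] = None if name == "?" else name
    return levels


def parse_predictions(lines):
    """Map genome name -> parsed taxonomy for 'name<whitespace>label' lines."""
    predictions = {}
    for line in lines:
        if not line.strip():
            continue
        genome, label = line.split(None, 1)
        predictions[genome] = parse_taxonomy(label)
    return predictions
```

With the file opened as handle, parse_predictions(handle) returns a dictionary in which, for example, the "s" entry of the Sulfolobus genome is None because its species could not be assigned.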

All command line options and parameters

$ ./phylophlan.py -h
usage: phylophlan.py [-h] [-i] [-u] [-t] [--tax_test TAX_TEST] [-c]
                     [--cleanall] [--nproc N] [-v]
                     [PROJECT NAME]

NAME AND VERSION:
PhyloPhlAn version 0.99 (8 May 2013)

AUTHORS:
Nicola Segata (nsegata@hsph.harvard.edu) and Curtis Huttenhower (chuttenh@hsph.harvard.edu)

DESCRIPTION
PhyloPhlAn is a computational pipeline for reconstructing highly accurate and resolved 
phylogenetic trees based on whole-genome sequence information. The pipeline is scalable 
to thousands of genomes and uses the most conserved 400 proteins for extracting the 
phylogenetic signal.
PhyloPhlAn also implements taxonomic curation, estimation, and insertion operations.

positional arguments:
  PROJECT NAME          The basename of the project corresponding to the name of the input data folder inside 
                        input/. The input data consist of a collection of multifasta files (extension .faa)
                        containing the proteins in each genome. 
                        If the project already exists, the already executed steps are not re-ran.
                        The results will be stored in a folder with the project basename in output/
                        Multiple project can be generated and they safetely coexists.

optional arguments:
  -h, --help            show this help message and exit
  -i, --integrate       Integrate user genomes into the PhyloPhlAn tree 
  -u, --user_tree       Build a phylogenetic tree using user genomes only 
  -t, --taxonomic_analysis
                        Check taxonomic inconsistencies and refine/correct taxonomic labels
  --tax_test TAX_TEST   nerrors:type:taxl:tmin:tex:name (alpha version, experimental!)
  -c, --clean           Clean the final and partial data produced for the specified project.
                         (use --cleanall for removing general installation and database files)
  --cleanall            Remove all instalation and database file leaving untouched the initial compressed data 
                        that is automatically extracted and formatted at the first pipeline run.
                        Projects are not remove (specify a project and use -c for removing projects).
  --nproc N             The number of CPUs to use for parallelizing the blasting
                        [default 1, i.e. no parallelism]
  -v, --version         Prints the current PhyloPhlAn version and exit

External Software Dependencies

  • muscle version v3.8.31 or higher must be present in the system path and called "muscle"
  • usearch version v5.2.32 (notice that version 6 is currently NOT supported) must be present in the system path and called "usearch"
  • FastTree version 2.1 or higher must be present in the system path and called "FastTree"
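Whether the three executables are reachable under the required names can be checked before a run. The sketch below relies only on Python's standard shutil.which; it verifies that the names resolve on the search path and does not check the version requirements listed above. The helper names are hypothetical.

```python
import shutil

# Executable names exactly as PhyloPhlAn expects to find them on the system path.
REQUIRED_TOOLS = ("muscle", "usearch", "FastTree")


def find_tools(names=REQUIRED_TOOLS, path=None):
    """Map each executable name to its resolved location, or None if absent."""
    return {name: shutil.which(name, path=path) for name in names}


def missing_tools(names=REQUIRED_TOOLS, path=None):
    """Return the names that could not be found on the search path."""
    found = find_tools(names, path)
    return [name for name, location in found.items() if location is None]
```

An empty list from missing_tools() means all three names resolve; the path argument is optional and defaults to the PATH environment variable.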

Acknowledgements

The authors of PhyloPhlAn would like to thank Ashlee Earl and the Human Microbiome Project Strains Working Group for insightful suggestions, Morgan Price for his helpful comments on applying FastTree, and Levi Waldron, Joshua Reyes, and Timothy Tickle for their suggestions on methodology and tree visualization.

Change log

Changes in version 0.99 (8 May 2013)

Updates:
- Pyphlan dependency removal
- command line arguments simplified

Changes in version 0.98 (28 July 2012)

Bug fixes:
- missing data file added

Changes in version 0.97 (24 July 2012)

First public release
