# MetaPhlAn2 tutorial

MetaPhlAn2 is a tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data.

If you're using this tutorial as part of a workshop, or if you'd like to think about the algorithm and implementation details a bit, you'll occasionally see "discussion questions" at the end of tutorial sections, formatted like this:

• What is your quest?
• What is your favorite color?

## Overview

The basic steps of MetaPhlAn2 are:

## Prerequisites

• Python (version >= 2.7)
• Bowtie2
• Numpy
• Pandas (optional, only required by utility scripts)
• BioPython (optional, only required by utility scripts)
• SciPy (optional, only required by utility scripts)
• Matplotlib (optional, only required by utility scripts)
• biom (optional, only required for biom format input/output)

## Installation

### Homebrew

If you're using this tutorial from the bioBakery VM (locally or on the cloud), MetaPhlAn2 is already installed! Otherwise, you can install MetaPhlAn2 and other bioBakery tools automatically with Homebrew for MacOS or Linuxbrew for Linux platforms.

$brew tap biobakery/biobakery$ brew install metaphlan2


This will also install all MetaPhlAn2 dependencies.

### From Source

Alternatively, you can manually install MetaPhlAn2 from source and then manually install the dependencies.

Step 1: Download MetaPhlAn2 and unpack the software:

$tar xzvf biobakery-metaphlan2-<versionid>.tar.gz$ cd biobakery-metaphlan2-<versionid>/
$ls  Step 2: Add all of the scripts to your PATH: 1. Add this command export PATH=$MDIR:$MDIR/utils/:$PATH to your $HOME/.bashrc file 1. Replace$MDIR with the full path to the MetaPhlAn2 folder
2. To get the value of $MDIR run$ pwd from inside the MetaPhlAn2 folder
2. Source your $HOME/.bashrc file. 1.$ source $HOME/.bashrc 2. This only needs to be done once to update your PATH in your current session. 3. All future sessions will automatically source this file when you login. Alternatively, instead of adding the scripts to your PATH, you can provide the full path to the script each time you run. For example, the following command prints the MetaPhlAn2 help screen (replace$MDIR with the full path to the MetaPhlAn2 folder):

MDIR/metaphlan2.py --help


Step 3: Install the MetaPhlAn2 dependencies.

Install bowtie2 in a folder in your PATH or specify the path to the bowtie2 executable with each run of MetaPhlAn2 using the option --bowtie2_exe <bowtie2>.

## Create taxonomic profiles

MetaPhlAn2 accepts as input short reads from a single shotgun metagenomic sequencing experiment and outputs the list of detected microbes and their relative abundances.

### Input files

MetaPhlAn2 accepts metagenomic sequence in several formats, including .fasta, .fastq, and .tar.bz2.

To see all the input formats type metaphlan2.py -h | less. Use the arrow key to move up and down. Type q to quit back to the prompt.

• Which command line arguments are required?
• Which optional command line arguments seem most important or most commonly used?

For the purpose of this tutorial, we will use the following set of six input files that have been subsampled for rapid analysis:

The original files, and many others, can be downloaded from the HMP DACC.

If you're running this tutorial locally, click on the links to download each file. NOTE that if you are using the Google Chrome browser, it may automatically decompress (gunzip) the gzip-ed files. The files should download to your default Downloads folder location.

If you're running this tutorial on a server or cloud instance, you can use the curl or wget programs to download each file. Right click on each link, choose "Copy Link Address" (or equivalent), and in a terminal type:

$curl -O https://bitbucket.org/biobakery/biobakery/raw/tip/demos/biobakery_demos/data/metaphlan2/input/SRS014459-Stool.fasta.gz$ curl -O https://bitbucket.org/biobakery/biobakery/raw/tip/demos/biobakery_demos/data/metaphlan2/input/SRS014464-Anterior_nares.fasta.gz
...


Once you have downloaded all six input files, if you did not already place them in your tutorial working directory, make a new folder there and move the downloaded files to it using the following commands at the terminal prompt.

• Make a new folder in your current working directory

$mkdir metaphlan2_analysis  • Move the six input files from the Downloads folder (or wherever they were placed) to the metaphlan2_analysis folder $ mv ~/Downloads/SRS*.fasta.gz metaphlan2_analysis/

• Now change directories into the metaphlan2_analysis folder and list the directory contents to see the six files

$cd metaphlan2_analysis$ ls


We are now ready to run the MetaPhlAn2 analysis.

• If you're familiar with the command line and file structure, does it matter where you place MetaPhlAn2's input files, or where you run it from?
• What about these input files might make them particularly appropriate for a short demonstration?

### Run a single sample

Here is the basic example to profile a single metagenome from raw reads.

$metaphlan2.py SRS014476-Supragingival_plaque.fasta.gz --input_type fasta > SRS014476-Supragingival_plaque_profile.txt  ### Output files Running MetaPhlAn2, following the example in the prior section, will create two output files. This file contains the intermediate mapping results to unique sequence markers. Alignments are listed one per line in tab-separated columns of read and reference marker: $ less -S SRS014476-Supragingival_plaque.fasta.gz.bowtie2out.txt


This file contains the final computed organism abundances.

Organism abundances are listed one clade per line, tab-separated from the clade's percent abundance:

$less -S SRS014476-Supragingival_plaque_profile.txt  The first column lists clades, ranging from taxonomic kingdoms (Bacteria, Archaea, etc.) through species. The taxonomic level of each clade is prefixed to indicate its level: Kingdom: k__, Phylum: p__, Class: c__, Order: o__, Family: f__, Genus: g__, Species: s__. Since sequence-based profiling is relative and does not provide absolute cellular abundance measures, clades are hierarchically summed. Each level will sum to 100%; that is, the sum of all kindom-level clades is 100%, the sum of all genus-level clades (including unclassified) is also 100%, and so forth. OTU equivalents can be extracted by using only the species-level s__ clades from this file (again, making sure to include clades unclassified at this level). • What do the parts of the read identifiers in the first column of the first, per-read/marker output file indicate? • What do the parts of the gene identifiers in the second column of the first, per-read/marker output file indicate? • Since the second, per-clade abundance output file is already normalized, you never need to sum-normalize these relative abundances. However, if you tried to, what would the sum of all clades' relative abundances be? ### Run on multiple cores When available, MetaPhlAn2 can take advantage of multiple cores using the nproc flag: $ metaphlan2.py SRS014459-Stool.fasta.gz --input_type fasta --nproc 4 > SRS014459-Stool_profile.txt


• Under typical circumstances, how many cores (threads) should you ask MetaPhlAn2 to use?

### Run multiple samples

Each MetaPhlAn2 execution processes exactly one sample, but the resulting single-sample analyses can easily be combined into an abundance table spanning multiple samples. Let's finish the last four samples from the input files tutorial section:

$metaphlan2.py SRS014464-Anterior_nares.fasta.gz --input_type fasta --nproc 4 > SRS014464-Anterior_nares_profile.txt$ metaphlan2.py SRS014470-Tongue_dorsum.fasta.gz --input_type fasta --nproc 4 > SRS014470-Tongue_dorsum_profile.txt
$metaphlan2.py SRS014472-Buccal_mucosa.fasta.gz --input_type fasta --nproc 4 > SRS014472-Buccal_mucosa_profile.txt$ metaphlan2.py SRS014494-Posterior_fornix.fasta.gz --input_type fasta --nproc 4 > SRS014494-Posterior_fornix_profile.txt


Alternatively, if you're familiar with bash shell syntax, you can loop over the entire sample set:

$for f in *.fasta.gz > do > metaphlan2.py$f --input_type fasta --nproc 4 > ${f%.fasta.gz}_profile.txt > done  Either way, you will now have a complete set of six profile output files and six intermediate mapping files. If you'd like to skip this step to speed things up, the 12 demo file outputs can be downloaded from the following links (click the link and then right-click on the preview page to select "Save as...", or copy the URL to download on a server). Profile output files: Intermediate mapping output files: Finally, the MetaPhlAn2 distribution includes a utility script that will create a single tab-delimited table from these files: $ merge_metaphlan_tables.py *_profile.txt > merged_abundance_table.txt


The resulting table can be opened in Excel, any gene expression analysis program, less (example below), or visualized graphically as per subsequent tutorial sections.

$less -S merged_abundance_table.txt  • How might you convert these relative abundance measures to pseudo-RPKs that are sensitive to each sample's read depth? • In what important ways does analysis of a relative abundance table differ from that of a gene expression (microarray or RNA-seq) transcript table? In what ways are they similar? • Under what circumstances is this tab-delimited text data format particularly efficient or inefficient? Is this likely to be a problem for species-level taxonomic profiles? ## Visualize results ### Create a heatmap with hclust2 A heatmap is one way to visualize tabular abundance results such as those from MetaPhlAn2. The plotting tool we'll use here, hclust2, is a convenience script that can show any, some, or all of the microbes or samples in a MetaPhlAn2 table. In this tutorial we will plot the heatmap for all of the samples. If you're using this tutorial from the bioBakery VM (locally or on the cloud), hclust2 is already installed! Otherwise, you can install hclust2 and other bioBakery tools automatically with HomeBrew for MacOS or LinuxBrew for Linux platforms. $ brew tap biobakery/biobakery
$brew install hclust2  This will install hclust2 and all of its dependencies. Alternatively, you can manually install hclust2 from source by downloading hclust2 and then installing the hclust2 dependencies (numpy, pandas, biopython, scipy, and matplotlib). Step 1: Generate the species only abundance table Run the following command to create a species only abundance table, providing the abundance table ( merged_abundance_table.txt ) created in prior tutorial steps: $ grep -E "(s__)|(^ID)" merged_abundance_table.txt | grep -v "t__" | sed 's/^.*s__//g' > merged_abundance_table_species.txt


There are three parts to this command. The first grep searches the file for the regular expression "(s__)|(^ID)" which matches to those lines with species information and also to the header. The second grep does not print out any lines with strain information (labeled as t__). The sed removes the full taxonomy from each line so the first column only includes the species name.

The new abundance table (merged_abundance_table_species.txt) will contain only the species abundances with just the species names (instead of the full taxonomy).

The first few lines of the file will look like:

ID  SRS014459-Stool_profile SRS014464-Anterior_nares_profile        SRS014470-Tongue_dorsum_profile   SRS014472-Buccal_mucosa_profile       SRS014476-Supragingival_plaque_profile  SRS014494-Posterior_fornix_profile
Actinomyces_graevenitzii    0.0     0.0     0.85102 0.0     0.0     0.0
Corynebacterium_matruchotii 0.0     0.0     0.0     0.0     58.21595        0.0
Corynebacterium_pseudodiphtheriticum        0.0     14.07759        0.0     0.0     0.0     0.0
Rothia_dentocariosa 0.0     0.0     0.0     0.0     32.10966        0.0
Bacteroides_cellulosilyticus        3.82206 0.0     0.0     0.0     0.0     0.0
Bacteroides_massiliensis    10.61295        0.0     0.0     0.0     0.0     0.0
Bacteroides_ovatus  4.08051 0.0     0.0     0.0     0.0     0.0


• Why is it useful to visualize (and particularly to cluster) only the "tips" of the taxonomic tree?
• Why does the grep / sed command work the way that it does?
• What assumptions does it make about the taxonomy?

Step 2: Generate the heatmap

Next generate the species only heatmap by running the following command:

$hclust2.py -i merged_abundance_table_species.txt -o abundance_heatmap_species.png --ftop 25 --f_dist_f braycurtis --s_dist_f braycurtis --cell_aspect_ratio 0.5 -l --flabel_size 6 --slabel_size 6 --max_flabel_len 100 --max_slabel_len 100 --minv 0.1 --dpi 300  The command above includes options to select the top 25 features, use Bray-Curtis as the distance measure both between samples and between features (microbes), set the ratio between the width/height of cells to 0.5, use a log scale for assigning heatmap colors, set the sample and feature label size to 6, set the max sample and feature label length to 100, select the minimum value to display as 0.1, and select an image resolution of 300. Open the resulting heatmap (abundance_heatmap_species.png) to take a look. If you generated it on your local computer, just double click. If you're using a server with just a terminal interface, you might have to transfer the file locally first using a tool like scp. If you're using a server with a graphical interface, you can open the file using a command like see abundance_heatmap_species.png). Using any of these methods, the results should look like: Notice that due to the very large differences between body site communities in the human microbiome, we can still easily see site-specific species despite the small demonstration input files (each is subsampled to only 10,000 reads for efficiency). • Which microbes are most abundant at each body site in these demonstration data? • Under what circumstances is log-scaling the heatmap abundance colors good? When might it be bad (i.e. visually deceptive)? • Why do we show only the 25 most abundant features in this example? ### Create a cladogram with GraPhlAn You can also visualize microbial abundances on a tree of life (also referred to as a phylogeny or cladogram) that captures their taxonomic (or phylogenetic) relatedness. Here, we'll use a tool called GraPhlAn that can render trees and annotate them with microbial names or data such as abundances. The instructions here assume that you will run GraPhlAn from the command line, but if you'd like to use an online Galaxy module instead, see the section on GraPhlAn in Galaxy. For more information on this tool, refer to the GraPhlAn tutorial. If you're using this tutorial from the bioBakery VM (locally or on the cloud), GraPhlAn is already installed! Otherwise, you can install GraPhlAn and other bioBakery tools automatically with Homebrew for MacOS or Linuxbrew for Linux platforms. $ brew tap biobakery/biobakery
$brew install graphlan  This will install GraPhlAn and all of its dependencies. Alternatively, you can manually install GraPhlAn from source by downloading GraPhlAn and then install the GraPhlAn dependencies (numpy, pandas, biopython, scipy, and matplotlib). Step 1: Create the GraPhlAn input files GraPhlAn requires two inputs: (i) a tree structure to represent and (ii) graphical annotation options for the tree. Run the following command to generate the two input files for GraPhlAn (the tree and annotation files) providing the abundance table (merged_abundance_table.txt) created in prior tutorial steps: $ export2graphlan.py --skip_rows 1,2 -i merged_abundance_table.txt --tree merged_abundance.tree.txt --annotation merged_abundance.annot.txt --most_abundant 100 --abundance_threshold 1 --least_biomarkers 10 --annotations 5,6 --external_annotations 7 --min_clade_size 1


The command above has options to skip rows 1 and 2, select the top 100 most abundance clades to highlight, set a minimum abundance threshold for clades to be annotated, extract a minimum of 10 biomarkers, select taxonomic levels 5 and 6 to be annotated within the tree, select taxonomic level 7 to be used in the external legend, and set the minimum size of clades annotated as biomarkers to 1. The output files created are merged_abundance.tree.txt and merged_abunance.annot.txt.

• What are the contents and structure of the "tree" file?
• What are the contents and structure of the "annot" file?

Step 2: Create a cladogram

Run the following commands to generate the cladogram providing the tree ( merged_abundance.tree.txt ) and annotation ( merged_abundance.annot.txt ) files from the prior step

$graphlan_annotate.py --annot merged_abundance.annot.txt merged_abundance.tree.txt merged_abundance.xml$ graphlan.py --dpi 300 merged_abundance.xml merged_abundance.png --external_legends


The first command creates an xml file from the tree and annotation inputs. The second command creates the image, setting the image resolution to 300 and requesting external legends.

The first few lines of the xml file are:

<phyloxml xmlns="http://www.phyloxml.org" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.phyloxml.org http://www.phyloxml.org/1.10/phyloxml.xsd">
<phylogeny rooted="true">
<clade>
<clade>
<name>k__Viruses</name>
<branch_length>1.0</branch_length>


The generated cladogram (merged_abundance.png) is:

The annotation (merged_abundance_annot.png ) should be:

And the legend (merged_abundance_legend.png ) is:

As above, if you generated these images on your local computer, open them by simply double clicking. If you're using a server with only a terminal interface, transfer the file locally first using a tool like scp. If you're using a server with a graphical interface, you can open the file using a command like see merged_abundance.png).

• What is the PhyloXML format? Why might it be used in this context?
• Why is it often particularly useful to plot circular, rather than linear, cladograms?
• What other types of annotations might be useful on such a tree (either different graphical formats, or different types of data to take advantage of them)?

## Create a strain-level marker-based heatmap (PanPhlAn)

PanPhlAn provides one way to identify, track, and phylogenetically place individual strains from metagenomes. It uses phylogenomics, or the co-occurrence of gene presence / absence across strains in a metagenome collection. Here, we'll show a simple example of identifying strains in demonstration data and visualizing them as a marker-based heatmap.

PanPhlAn uses MetaPhlAn markers to identify the dominant strain within each (sufficiently abundant) species across one or more metagenomes. You will thus use a different sample set for this tutorial than in prior sections, since we provide multiple samples that all share the same abundant species (unlike the differing species across body sites above).

The files you will start this section with were generated by running MetaPhlAn2. They are intermediate mapping results output files (named "bowtie2out.txt"). If you would like to learn how to generate these files from the input fasta files, see the first step in the StrainPhlAn Tutorial.

For this tutorial section, please download the following MetaPhlAn intermediate bowtie2 output files (click the link and then right-click on the preview page to select "Save as...", or copy the URL to download as above):

• Different human individuals generally don't gain and lose different genes. Why do different microbial strains?
• What pros and cons are there in using gene presence / absence patterns for strain identification, as opposed to nucleotide polymorphisms or other types of variants?

Step 1: Create a file of abundances for all markers in the selected species

Run the following commands to create marker abundance files for all samples. For this tutorial we will be selecting the species s__Eubacterium_siraeum and requiring at least 1% abundance.

$metaphlan2.py --input_type bowtie2out -t clade_specific_strain_tracker --clade s__Eubacterium_siraeum --min_ab 1.0 13530241_SF05.fasta.gz.bowtie2out.txt > 13530241_SF05.siraeum.txt$ metaphlan2.py --input_type bowtie2out -t clade_specific_strain_tracker --clade s__Eubacterium_siraeum --min_ab 1.0 13530241_SF06.fasta.gz.bowtie2out.txt > 13530241_SF06.siraeum.txt
$metaphlan2.py --input_type bowtie2out -t clade_specific_strain_tracker --clade s__Eubacterium_siraeum --min_ab 1.0 19272639_SF05.fasta.gz.bowtie2out.txt > 19272639_SF05.siraeum.txt$ metaphlan2.py --input_type bowtie2out -t clade_specific_strain_tracker --clade s__Eubacterium_siraeum --min_ab 1.0 19272639_SF06.fasta.gz.bowtie2out.txt > 19272639_SF06.siraeum.txt
$metaphlan2.py --input_type bowtie2out -t clade_specific_strain_tracker --clade s__Eubacterium_siraeum --min_ab 1.0 40476924_SF05.fasta.gz.bowtie2out.txt > 40476924_SF05.siraeum.txt$ metaphlan2.py --input_type bowtie2out -t clade_specific_strain_tracker --clade s__Eubacterium_siraeum --min_ab 1.0 40476924_SF06.fasta.gz.bowtie2out.txt > 40476924_SF06.siraeum.txt


The commands above set options to indicate the input file is of type bowtie2 output, the analysis is of type clade_specific_strain_tracker, the clade to track, and the minimum abundance is 1.0.

The following output files will be created:

Step 2: Merge the marker abundance files

Merge the output files by running the following command:

$merge_metaphlan_tables.py *.siraeum.txt > siraeum_tracker.txt  This will generate a merged marker abundance file ( siraeum_tracker.txt ). Step 3: Create a heatmap Run the following command to create the heatmap: $ hclust2.py -i siraeum_tracker.txt -o siraeum_tracker.png --skip_rows 1 --f_dist_f hamming --no_flabels --dpi 300 --cell_aspect_ratio 0.01


The command above sets options to skip row 1 in the input file (it skips reading the sample id header), the feature distance is computed with hamming, no feature labels will be included, the image resolution is 300, and the ratio between the width/height of cells is 0.01.

For more information on hclust2, see the section on how to create a heatmap with hclust2.

The resulting figure ( siraeum_tracker.png ) will show the E. siraeum markers present (yellow) and absent (black) in the clustered heatmap:

The strain level marker heatmap is an example of PanPhlAn results. For additional information on strain-level profiling, see the PanPhlAn Tutorial.

## Additional tools

The LEfSe and MaAsLin tutorials describe additional tools to analyze MetaPhlAn2 output.

Updated