MetaPhlAn 2: Metagenomic Phylogenetic Analysis


MetaPhlAn 2.0 can be obtained

Thorugh Bioconda

$ conda install metaphlan2

by direct download from Bitbucket

or cloning the repository using the following command

$ hg clone


MetaPhlAn is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea, Eukaryotes and Viruses) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level. With the newly added StrainPhlAn module, it is now possible to perform accurate strain-level microbial profiling.

MetaPhlAn 2 relies on ~1M unique clade-specific marker genes (the marker information file mpa_v20_m200_marker_info.txt.bz2 can be found in the Download page here) identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic), allowing:

  • unambiguous taxonomic assignments;
  • accurate estimation of organismal relative abundance;
  • species-level resolution for bacteria, archaea, eukaryotes and viruses;
  • strain identification and tracking
  • orders of magnitude speedups compared to existing methods.
  • metagenomic strain-level population genomics

If you use MetaPhlAn version 1, please cite:

Metagenomic microbial community profiling using unique clade-specific marker genes. Nicola Segata, Levi Waldron, Annalisa Ballarini, Vagheesh Narasimhan, Olivier Jousson, & Curtis Huttenhower. Nature Methods 9, 811-814 (2012)

If you use MetaPhlAn2, please cite:

MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Duy Tin Truong, Eric A Franzosa, Timothy L Tickle, Matthias Scholz, George Weingart, Edoardo Pasolli, Adrian Tett, Curtis Huttenhower & Nicola Segata. Nature Methods 12, 902-903 (2015)

If you use StrainPhlAn, please cite the MetaPhlAn2 paper and the following StrainPhlAn paper:

Microbial strain-level population structure and genetic diversity from metagenomes. Duy Tin Truong, Adrian Tett, Edoardo Pasolli, Curtis Huttenhower, & Nicola Segata. Genome Research 27:626-638 (2017)

MetaPhlAn and StrainPhlAn tutorials and resources

In addition to the information on this page, you can refer to the following additional resources.


MetaPhlAn requires python 2.7 or higher with argparse, tempfile, numpy, and Biopython libraries installed (apart for numpy and Biopython, the others are usually installed together with the python distribution). Python3 is also now supported.

MetaPhlAn requires the script to be present in the system path, if not found MetaPhlAn will try to locate it in the folder containing the script under utils/ In case you moved the script, please export the script in your PATH bash variable.

If you provide the SAM output of BowTie2 as input, there are no additional prerequisite.

  • If you would like to use the BowTie2 integrated in MetaPhlAn, you need to have BowTie2 version 2.0.0 or higher and perl installed (bowtie2 needs to be in the system path with execute and read permission)

  • If you use the "utils/" script to plot and hierarchical cluster multiple MetaPhlAn-profiled samples you will also need the following python libraries: matplotlib, scipy, pylab (if not installed together with MatPlotLib).

  • If you want to produce the output as "biom" file you also need biom installed

  • MetaPhlAn is not tightly integrated with advanced heatmap plotting with hclust2 and cladogram visualization with GraPhlAn. If you use such visualization tool please refer to their prerequisites.

Basic Usage

This section presents some basic usages of MetaPhlAn2, for more advanced usages, please see at its wiki.

We assume here that is in the system path and that mpa_dir bash variable contains the main MetaPhlAn folder. You can set this two variables moving to your MetaPhlAn2 local folder and type:

$ export PATH=`pwd`:$PATH
$ export mpa_dir=`pwd`

Here is the basic example to profile a metagenome from raw reads (requires BowTie2 in the system path with execution and read permissions, Perl installed).

$ metagenome.fastq --input_type fastq > profiled_metagenome.txt

It is highly recommended to save the intermediate BowTie2 output for re-running MetaPhlAn extremely quickly (--bowtie2out), and use multiple CPUs (--nproc) if available:

$ metagenome.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq > profiled_metagenome.txt

If you already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you can obtain the results in few seconds by using the previously saved --bowtie2out file and specifying the input (--input_type bowtie2out):

$ metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out > profiled_metagenome.txt

You can also provide an externally BowTie2-mapped SAM if you specify this format with --input_type. Two steps here: first map your metagenome with BowTie2 and then feed MetaPhlAn2 with the obtained sam:

$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x databases/mpa_v20_m200 -U metagenome.fastq
$ metagenome.sam --input_type sam > profiled_metagenome.txt

MetaPhlAn 2 can also natively handle paired-end metagenomes (but does not use the paired-end information), and, more generally, metagenomes stored in multiple files (but you need to specify the --bowtie2out parameter):

$ metagenome_1.fastq,metagenome_2.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq > profiled_metagenome.txt

For advanced options and other analysis types (such as strain tracking) please refer to the full command-line options.

Full command-line options

usage: --input_type
                     [--mpa_pkl MPA_PKL] [--bowtie2db METAPHLAN_BOWTIE2_DB]
                     [--bt2_ps BowTie2 presets] [--bowtie2_exe BOWTIE2_EXE]
                     [--bowtie2out FILE_NAME] [--no_map] [--tmp_dir]
                     [--tax_lev TAXONOMIC_LEVEL] [--min_cu_len]
                     [--min_alignment_len] [--ignore_viruses]
                     [--ignore_eukaryotes] [--ignore_bacteria]
                     [--ignore_archaea] [--stat_q]
                     [--ignore_markers IGNORE_MARKERS] [--avoid_disqm]
                     [--stat] [-t ANALYSIS TYPE] [--nreads NUMBER_OF_READS]
                     [--pres_th PRESENCE_THRESHOLD] [--clade] [--min_ab] [-h]
                     [-o output file] [--sample_id_key name]
                     [--sample_id value] [-s sam_output_file]
                     [--biom biom_output] [--mdelim mdelim] [--nproc N] [-v]
                     [INPUT_FILE] [OUTPUT_FILE]

 MetaPhlAn version 2.1.0 (28 April 2015): 
 METAgenomic PHyLogenetic ANalysis for metagenomic taxonomic profiling.

AUTHORS: Nicola Segata (, Duy Tin Truong (


 We assume here that is in the system path and that mpa_dir bash variable contains the
 main MetaPhlAn folder. Also BowTie2 should be in the system path with execution and read
 permissions, and Perl should be installed.

========== MetaPhlAn 2 clade-abundance estimation ================= 

The basic usage of MetaPhlAn 2 consists in the identification of the clades (from phyla to species and 
strains in particular cases) present in the metagenome obtained from a microbiome sample and their 
relative abundance. This correspond to the default analysis type (--analysis_type rel_ab).

*  Profiling a metagenome from raw reads:
$ metagenome.fastq --input_type fastq

*  You can take advantage of multiple CPUs and save the intermediate BowTie2 output for re-running
   MetaPhlAn extremely quickly:
$ metagenome.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq

*  If you already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you
   can obtain the results in few seconds by using the previously saved --bowtie2out file and 
   specifying the input (--input_type bowtie2out):
$ metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out

*  You can also provide an externally BowTie2-mapped SAM if you specify this format with 
   --input_type. Two steps: first apply BowTie2 and then feed MetaPhlAn2 with the obtained sam:
$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x databases/mpa_v20_m200 -U metagenome.fastq
$ metagenome.sam --input_type sam > profiled_metagenome.txt

*  We can also natively handle paired-end metagenomes, and, more generally, metagenomes stored in 
  multiple files (but you need to specify the --bowtie2out parameter):
$ metagenome_1.fastq,metagenome_2.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq


========== MetaPhlAn 2 strain tracking ============================ 

MetaPhlAn 2 introduces the capability of charaterizing organisms at the strain level using non
aggregated marker information. Such capability comes with several slightly different flavours and 
are a way to perform strain tracking and comparison across multiple samples.
Usually, MetaPhlAn 2 is first ran with the default --analysis_type to profile the species present in
the community, and then a strain-level profiling can be performed to zoom-in into specific species
of interest. This operation can be performed quickly as it exploits the --bowtie2out intermediate 
file saved during the execution of the default analysis type.

*  The following command will output the abundance of each marker with a RPK (reads per kil-base) 
   higher 0.0. (we are assuming that metagenome_outfmt.bz2 has been generated before as 
   shown above).
$ -t marker_ab_table metagenome_outfmt.bz2 --input_type bowtie2out > marker_abundance_table.txt
   The obtained RPK can be optionally normalized by the total number of reads in the metagenome 
   to guarantee fair comparisons of abundances across samples. The number of reads in the metagenome
   needs to be passed with the '--nreads' argument

*  The list of markers present in the sample can be obtained with '-t marker_pres_table'
$ -t marker_pres_table metagenome_outfmt.bz2 --input_type bowtie2out > marker_abundance_table.txt
   The --pres_th argument (default 1.0) set the minimum RPK value to consider a marker present

*  The list '-t clade_profiles' analysis type reports the same information of '-t marker_ab_table'
   but the markers are reported on a clade-by-clade basis.
$ -t clade_profiles metagenome_outfmt.bz2 --input_type bowtie2out > marker_abundance_table.txt

*  Finally, to obtain all markers present for a specific clade and all its subclades, the 
   '-t clade_specific_strain_tracker' should be used. For example, the following command
   is reporting the presence/absence of the markers for the B. fragulis species and its strains
$ -t clade_specific_strain_tracker --clade s__Bacteroides_fragilis metagenome_outfmt.bz2 databases/mpa_v20_m200.pkl --input_type bowtie2out > marker_abundance_table.txt
   the optional argument --min_ab specifies the minimum clade abundance for reporting the markers


positional arguments:
  INPUT_FILE            the input file can be:
                        * a fastq file containing metagenomic reads
                        * a BowTie2 produced SAM file. 
                        * an intermediary mapping file of the metagenome generated by a previous MetaPhlAn run 
                        If the input file is missing, the script assumes that the input is provided using the standard 
                        input, or named pipes.
                        IMPORTANT: the type of input needs to be specified with --input_type
  OUTPUT_FILE           the tab-separated output file of the predicted taxon relative abundances 
                        [stdout if not present]

Required arguments:
  --mpa_pkl MPA_PKL     the metadata pickled MetaPhlAn file
  --input_type {fastq,fasta,multifasta,multifastq,bowtie2out,sam}
                        set whether the input is the multifasta file of metagenomic reads or 
                        the SAM file of the mapping of the reads against the MetaPhlAn db.
                        [default 'automatic', i.e. the script will try to guess the input format]

Mapping arguments:
  --bowtie2db METAPHLAN_BOWTIE2_DB
                        The BowTie2 database file of the MetaPhlAn database. 
                        Used if --input_type is fastq, fasta, multifasta, or multifastq
  --bt2_ps BowTie2 presets
                        presets options for BowTie2 (applied only when a multifasta file is provided)
                        The choices enabled in MetaPhlAn are:
                         * sensitive
                         * very-sensitive
                         * sensitive-local
                         * very-sensitive-local
                        [default very-sensitive]
  --bowtie2_exe BOWTIE2_EXE
                        Full path and name of the BowTie2 executable. This option allows 
                        MetaPhlAn to reach the executable even when it is not in the system 
                        PATH or the system PATH is unreachable
  --bowtie2out FILE_NAME
                        The file for saving the output of BowTie2
  --no_map              Avoid storing the --bowtie2out map file
  --tmp_dir             the folder used to store temporary files 
                        [default is the OS dependent tmp dir]

Post-mapping arguments:
                        The taxonomic level for the relative abundance output:
                        'a' : all taxonomic levels
                        'k' : kingdoms (Bacteria and Archaea) only
                        'p' : phyla only
                        'c' : classes only
                        'o' : orders only
                        'f' : families only
                        'g' : genera only
                        's' : species only
                        [default 'a']
  --min_cu_len          minimum total nucleotide length for the markers in a clade for
                        estimating the abundance without considering sub-clade abundances
                        [default 2000]
  --min_alignment_len   The sam records for aligned reads with the longest subalignment
                        length smaller than this threshold will be discarded.
                        [default None]
  --ignore_viruses      Do not profile viral organisms
  --ignore_eukaryotes   Do not profile eukaryotic organisms
  --ignore_bacteria     Do not profile bacterial organisms
  --ignore_archaea      Do not profile archeal organisms
  --stat_q              Quantile value for the robust average
                        [default 0.1]
  --ignore_markers IGNORE_MARKERS
                        File containing a list of markers to ignore. 
  --avoid_disqm         Descrivate the procedure of disambiguating the quasi-markers based on the 
                        marker abundance pattern found in the sample. It is generally recommended 
                        too keep the disambiguation procedure in order to minimize false positives
  --stat                EXPERIMENTAL! Statistical approach for converting marker abundances into clade abundances
                        'avg_g'  : clade global (i.e. normalizing all markers together) average
                        'avg_l'  : average of length-normalized marker counts
                        'tavg_g' : truncated clade global average at --stat_q quantile
                        'tavg_l' : trunated average of length-normalized marker counts (at --stat_q)
                        'wavg_g' : winsorized clade global average (at --stat_q)
                        'wavg_l' : winsorized average of length-normalized marker counts (at --stat_q)
                        'med'    : median of length-normalized marker counts
                        [default tavg_g]

Additional analysis types and arguments:
  -t ANALYSIS TYPE      Type of analysis to perform: 
                         * rel_ab: profiling a metagenomes in terms of relative abundances
                         * rel_ab_w_read_stats: profiling a metagenomes in terms of relative abundances and estimate the number of reads comming from each clade.
                         * reads_map: mapping from reads to clades (only reads hitting a marker)
                         * clade_profiles: normalized marker counts for clades with at least a non-null marker
                         * marker_ab_table: normalized marker counts (only when > 0.0 and normalized by metagenome size if --nreads is specified)
                         * marker_pres_table: list of markers present in the sample (threshold at 1.0 if not differently specified with --pres_th
                        [default 'rel_ab']
  --nreads NUMBER_OF_READS
                        The total number of reads in the original metagenome. It is used only when 
                        -t marker_table is specified for normalizing the length-normalized counts 
                        with the metagenome size as well. No normalization applied if --nreads is not 
                        Threshold for calling a marker present by the -t marker_pres_table option
  --clade               The clade for clade_specific_strain_tracker analysis
  --min_ab              The minimum percentage abundace for the clade in the clade_specific_strain_tracker analysis
  -h, --help            show this help message and exit

Output arguments:
  -o output file, --output_file output file
                        The output file (if not specified as positional argument)
  --sample_id_key name  Specify the sample ID key for this analysis. Defaults to '#SampleID'.
  --sample_id value     Specify the sample ID for this analysis. Defaults to 'Metaphlan2_Analysis'.
  -s sam_output_file, --samout sam_output_file
                        The sam output file
  --biom biom_output, --biom_output_file biom_output
                        If requesting biom file output: The name of the output file in biom format 
  --mdelim mdelim, --metadata_delimiter_char mdelim
                        Delimiter for bug metadata: - defaults to pipe. e.g. the pipe in k__Bacteria|p__Proteobacteria 

Other arguments:
  --nproc N             The number of CPUs to use for parallelizing the mapping
                        [default 1, i.e. no parallelism]
  -v, --version         Prints the current MetaPhlAn version and exit

Utility Scripts

MetaPhlAn's repository features a few utility scripts to aid in manipulation of sample output and its visualization. These scripts can be found under the utils folder in the metaphlan2 directory.

Merging Tables

The script allows to combine MetaPhlAn output from several samples to be merged into one table Bugs (rows) vs Samples (columns) with the table enlisting the relative normalized abundances per sample per bug.

To merge multiple output files, run the script as below

$ python utils/ metaphlan_output1.txt metaphlan_output2.txt metaphlan_output3.txt > output/merged_abundance_table.txt

Wildcards can be used as needed:

$ python utils/ metaphlan_output*.txt > output/merged_abundance_table.txt

There is no limit to how many files you can merge.

Heatmap Visualization

The script allows to visualize the MetaPhlAn results in the form of a hierarchically-clustered heatmap. To generate the heatmap for a merged MetaPhlAn output table (as described above), please run the script as below.

$ python utils/ -c bbcry --top 25 --minv 0.1 -s log --in output/merged_abundance_table.txt --out output_images/abundance_heatmap.png

For detailed command-line instructions, please refer to below:

$ utils/ -h
usage: [-h] --in INPUT_FILE --out OUTPUT_FILE
                                   [-m {single,complete,average,weighted,centroid,median,ward}]
                                   [-d {euclidean,minkowski,cityblock,seuclidean,sqeuclidean,cosine,correlation,hamming,jaccard,chebyshev,canberra,braycurtis,mahalanobis,yule,matching,dice,kulsinski,rogerstanimoto,russellrao,sokalmichener,sokalsneath,wminkowski,ward}]
                                   [-f {euclidean,minkowski,cityblock,seuclidean,sqeuclidean,cosine,correlation,hamming,jaccard,chebyshev,canberra,braycurtis,mahalanobis,yule,matching,dice,kulsinski,rogerstanimoto,russellrao,sokalmichener,sokalsneath,wminkowski,ward}]
                                   [-s scale norm] [-x X] [-y Y] [--minv MINV]
                                   [--maxv max value]
                                   [--tax_lev TAXONOMIC_LEVEL] [--perc PERC]
                                   [--top TOP] [--sdend_h SDEND_H]
                                   [--fdend_w FDEND_W] [--cm_h CM_H]
                                   [--cm_ticks label for ticks of the colormap]
                                   [--font_size FONT_SIZE]
                                   [--clust_line_w CLUST_LINE_W]
                                   [-c {Accent,Blues,BrBG,BuGn,BuPu,Dark2,GnBu,Greens,Greys,OrRd,Oranges,PRGn,Paired,Pastel1,Pastel2,PiYG,PuBu,PuBuGn,PuOr,PuRd,Purples,RdBu,RdGy,RdPu,RdYlBu,RdYlGn,Reds,Set1,Set2,Set3,Spectral,YlGn,YlGnBu,YlOrBr,YlOrRd,afmhot,autumn,binary,bone,brg,bwr,cool,copper,flag,gist_earth,gist_gray,gist_heat,gist_ncar,gist_rainbow,gist_stern,gist_yarg,gnuplot,gnuplot2,gray,hot,hsv,jet,ocean,pink,prism,rainbow,seismic,spectral,spring,summer,terrain,winter,bbcyr,bbcry}]

This scripts generates heatmaps with hierarchical clustering of both samples
and microbial clades. The script can also subsample the number of clades to                                                                                                                                                                                                    
display based on the their nth percentile abundance value in each sample                                                                                                                                                                                                       

optional arguments:                                                                                                                                                                                                                                                            
  -h, --help            show this help message and exit                                                                                                                                                                                                                        
  --in INPUT_FILE       The input file of microbial relative abundances. This                                                                                                                                                                                                  
                        file is typically obtained with the                                                                                                                                                                                                                    
  --out OUTPUT_FILE     The output image. The extension of the file determines                                                                                                                                                                                                 
                        the image format. png, pdf, and svg are the preferred                                                                                                                                                                                                  
  -m {single,complete,average,weighted,centroid,median,ward}                                                                                                                                                                                                                   
                        The hierarchical clustering method, default is                                                                                                                                                                                                         
  -d {euclidean,minkowski,cityblock,seuclidean,sqeuclidean,cosine,correlation,hamming,jaccard,chebyshev,canberra,braycurtis,mahalanobis,yule,matching,dice,kulsinski,rogerstanimoto,russellrao,sokalmichener,sokalsneath,wminkowski,ward}                                      
                        The distance function for samples. Default is                                                                                                                                                                                                          
  -f {euclidean,minkowski,cityblock,seuclidean,sqeuclidean,cosine,correlation,hamming,jaccard,chebyshev,canberra,braycurtis,mahalanobis,yule,matching,dice,kulsinski,rogerstanimoto,russellrao,sokalmichener,sokalsneath,wminkowski,ward}                                      
                        The distance function for microbes. Default is                                                                                                                                                                                                         
  -s scale norm                                                                                                                                                                                                                                                                
  -x X                  Width of heatmap cells. Automatically set, this option                                                                                                                                                                                                 
                        should not be necessary unless for very large heatmaps                                                                                                                                                                                                 
  -y Y                  Height of heatmap cells. Automatically set, this                                                                                                                                                                                                       
                        option should not be necessary unless for very large                                                                                                                                                                                                   
  --minv MINV           Minimum value to display. Default is 0.0, values                                                                                                                                                                                                       
                        around 0.001 are also reasonable                                                                                                                                                                                                                       
  --maxv max value      Maximum value to display. Default is maximum value                                                                                                                                                                                                     
                        present, can be set e.g. to 100 to display the full
                        The taxonomic level to display: 'a' : all taxonomic
                        levels 'k' : kingdoms (Bacteria and Archaea) only 'p'
                        : phyla only 'c' : classes only 'o' : orders only 'f'
                        : families only 'g' : genera only 's' : species only
                        [default 's']
  --perc PERC           Percentile to be used for ordering the microbes in
                        order to select with --top the most abundant microbes
                        only. Default is 90
  --top TOP             Display the --top most abundant microbes only
                        (ordering based on --perc)
  --sdend_h SDEND_H     Set the height of the sample dendrogram. Default is
  --fdend_w FDEND_W     Set the width of the microbes dendrogram. Default is
  --cm_h CM_H           Set the height of the colormap. Default = 0.03
  --cm_ticks label for ticks of the colormap
  --font_size FONT_SIZE
                        Set label font sizes. Default is 7
  --clust_line_w CLUST_LINE_W
                        Set the line width for the dendrograms
  -c {Accent,Blues,BrBG,BuGn,BuPu,Dark2,GnBu,Greens,Greys,OrRd,Oranges,PRGn,Paired,Pastel1,Pastel2,PiYG,PuBu,PuBuGn,PuOr,PuRd,Purples,RdBu,RdGy,RdPu,RdYlBu,RdYlGn,Reds,Set1,Set2,Set3,Spectral,YlGn,YlGnBu,YlOrBr,YlOrRd,afmhot,autumn,binary,bone,brg,bwr,cool,copper,flag,gist_earth,gist_gray,gist_heat,gist_ncar,gist_rainbow,gist_stern,gist_yarg,gnuplot,gnuplot2,gray,hot,hsv,jet,ocean,pink,prism,rainbow,seismic,spectral,spring,summer,terrain,winter,bbcyr,bbcry}
                        Set the colormap. Default is "jet".

GraPhlAn Visualization

The tutorial of using GraPhlAn can be found from the MetaPhlAn2 wiki.

Customizing the database

In order to add a marker to the database, the user needs the following steps:

  • Reconstruct the marker sequences (in fasta format) from the MetaPhlAn2 bowtie2 database by:
bowtie2-inspect metaphlan2/databases/mpa_v20_m200 > metaphlan2/markers.fasta
  • Add the marker sequence stored in a file new_marker.fasta to the marker set:
cat new_marker.fasta >> metaphlan2/markers.fasta
  • Rebuild the bowtie2 database:
mkdir metaphlan2/db_v21/mpa_v21_m200
bowtie2-build metaphlan2/markers.fasta metaphlan2/db_v21/mpa_v21_m200
  • Assume that the new marker was extracted from genome1, genome2. Update the taxonomy file from python console as follows:
import cPickle as pickle
import bz2

db = pickle.load(bz2.BZ2File('databases/mpa_v20_m200.pkl', 'r'))

# Add the taxonomy of the new genomes
db['taxonomy']['taxonomy of genome1'] = length of genome1
db['taxonomy']['taxonomy of genome2'] = length of genome2

# Add the information of the new marker as the other markers
db['markers'][new_marker_name] = {
                                   'clade': the clade that the marker belongs to,
                                   'ext': {the name of the first external genome where the marker appears, 
                                           the name of the second external genome where the marker appears, 
                                   'len': length of the marker,
                                   'score': score of the marker,
                                   'taxon': the taxon of the marker}
# To see an example, try to print the first marker information:
# print db['markers'].items()[0]

# Save the new mpa_pkl file
ofile = bz2.BZ2File('metaphlan2/db_v21/mpa_v21_m200.pkl', 'w')
pickle.dump(db, ofile, pickle.HIGHEST_PROTOCOL)
  • To use the new database, switch to metaphlan2/db_v21 instead of metaphlan2/db_v20 when running with option "--mpa_pkl".

Metagenomic strain-level population genomics

StrainPhlAn is a computational tool for tracking individual strains across large set of samples. The input of StrainPhlAn is a set of metagenomic samples and for each species, the output is a multiple sequence alignment (MSA) file of all species strains reconstructed directly from the samples. From this MSA, StrainPhlAn calls RAxML (or other phylogenetic tree builders) to build the phylogenetic tree showing the strain evolution of the sample strains. For each sample, StrainPhlAn extracts the strain of a specific species by merging and concatenating all reads mapped against that species markers in the MetaPhlAn2 database.

In detail, let us start from a toy example with 6 HMP gut metagenomic samples (SRS055982-subjectID_638754422, SRS022137-subjectID_638754422, SRS019161-subjectID_763496533, SRS013951-subjectID_763496533, SRS014613-subjectID_763840445, SRS064276-subjectID_763840445) from 3 three subjects (each was sampled at two time points) and one Bacteroides caccae genome G000273725. We would like to:

  • extract the Bacteroides caccae strains from these samples and compare them with the reference genome in a phylogenetic tree.
  • know how many snps between those strains and the reference genome.

Running StrainPhlAn on these samples, we will obtain the Bacteroides caccae phylogentic tree and its multiple sequence alignment in the following figure (produced with ete2 and Jalview):


We can see that the strains from the same subject are grouped together. The tree also highlights that the strains from subject "763840445" (red color) do not change between two sampling time points whereas the strains from the other subjects have slightly evolved. From the tree, we also know that the strains of subject "763496533" is closer to the reference genome than those of the others. In addition, the table below shows the number of snps between the sample strains and the reference genome based on the strain alignment returned by MetaPhlAn_Strainer.


In the next sections, we will illustrate step by step how to run MetaPhlAn_Strainer on this toy example to reproduce the above figures.


StrainPhlAn requires python 2.7 and the libraries pysam (tested on version 0.8.3), biopython, msgpack and numpy, dendropy (tested on version 3.12.0). Besides, StrainPhlAn also needs the following programs in the executable path:

  • bowtie2 for mapping reads against the marker database.
  • MUSCLE for the alignment step.
  • samtools, bcftools and which can be downloaded from here for building consensus markers. Note that is included in bcftools and StrainPhlAn only works with samtools version 0.1.19 as samtools has changed the output format after this version.
  • blastn for adding reference genomes to the phylogenetic tree.
  • raxmlHPC and raxmlHPC-PTHREADS-SSE3 for building the phylogenetic trees.

All dependence binaries on Linux 64 bit can be downloaded in the folder "bin" from this link.

The script files in folder "strainphlan_src" should be changed to executable mode by:

chmod +x strainphlan_src/*.py

and add to the executable path:

export PATH=$PATH:$(pwd -P)/strainphlan_src


Let's reproduce the toy example result in the introduction section. Note that all the commands to run the below steps are in the "strainer_tutorial/step?*.sh" files (? corresponding to the step number). All the below steps are excuted under the "strainer_tutorial" folder. The steps are as follows:

Step 1. Download 6 HMP gut metagenomic samples, the metadata.txt file and one reference genome from the folder "fastqs" and "reference_genomes" in this link and put these folders under the "strainer_tutorial" folder.

Step 2. Obtain the sam files from these samples by mapping them against MetaPhlAn2 database:

This step will run MetaPhlAn2 to map all metagenomic samples against the MetaPhlAn2 marker database and produce the sam files (*.sam.bz2). Each sam file (in SAM format) corresponding to each sample contains the reads mapped against the marker database of MetaPhlAn2. The commands to run are:

mkdir -p sams
for f in $(ls fastqs/*.bz2)
    echo "Running metaphlan2 on ${f}"
    bn=$(basename ${f} | cut -d . -f 1)
    tar xjfO ${f} | ../ --bowtie2db ../databases/mpa_v20_m200 --mpa_pkl ../databases/mpa_v20_m200.pkl --input_type multifastq --nproc 10 -s sams/${bn}.sam.bz2 --bowtie2out sams/${bn}.bowtie2_out.bz2 -o sams/${bn}.profile

After this step, you will have a folder "sams" containing the sam files (*.sam.bz2) and other MetaPhlAn2 output files. This step will take around 270 minutes. If you want to skip this step, you can download the sam files from the folder "sams" in this link.

Step 3. Produce the consensus-marker files which are the input for StrainPhlAn:

For each sample, this step will reconstruct all species strains found in it and store them in a marker file (*.markers). Those strains are referred as sample-reconstructed strains. Additional details in generating consensus sequences can be found here. The commands to run are:

mkdir -p consensus_markers
cwd=$(pwd -P)
export PATH=${cwd}/../strainphlan_src:${PATH}
python ../strainphlan_src/ --ifn_samples sams/*.sam.bz2 --input_type sam --output_dir consensus_markers --nprocs 10 &> consensus_markers/log.txt

The result is the same if you want run several scripts in parallel with each run for a sample (this maybe useful for some cluster-system settings). After this step, you will have a folder "consensus_markers" containing all sample-marker files (*.markers). This steps will take around 44 minutes. If you want to skip this step, you can download the consensus marker files from the folder "consensus_markers" in this link.

Step 4. Extract the markers of Bacteroides_caccae from MetaPhlAn2 database (to add its reference genome later):

This step will extract the markers of Bacteroides_caccae in the database and then StrainPhlAn will identify the sequences in the reference genomes that are closet to them (in the next step by using blast). Those will be concatenated and referred as reference-genome-reconstructed strains. The commands to run are:

mkdir -p db_markers
bowtie2-inspect ../databases/mpa_v20_m200 > db_markers/all_markers.fasta
python ../strainphlan_src/ --mpa_pkl ../databases/mpa_v20_m200.pkl --ifn_markers db_markers/all_markers.fasta --clade s__Bacteroides_caccae --ofn_markers db_markers/s__Bacteroides_caccae.markers.fasta

Note that the "all_markers.fasta" file consists can be reused for extracting other reference genomes. After this step, you should have two files in folder "db_markers": "all_markers.fasta" containing all marker sequences, and "s__Bacteroides_caccae.markers.fasta" containing the markers of Bacteroides caccae. This step will take around 1 minute and can skipped if you do not need to add the reference genomes to the phylogenetic tree. Those markers can be found in the folder "db_markers" in this link.

Before building the trees, we should get the list of all clades detected from the samples and save them in the "output/clades.txt" file by the following command:

python ../ --mpa_pkl ../databases/mpa_v20_m200.pkl --ifn_samples consensus_markers/*.markers --output_dir output --nprocs_main 10 --print_clades_only > output/clades.txt

The clade names in the output file "clades.txt" will be used for the next step.

Step 5. Build the multiple sequence alignment and phylogenetic tree:

This step will align and clean the sample-reconstructed strains (stored in the marker files produced in step 3) and reference-genome-reconstructed strains (extracted based on the database markers in step 4) to produce a multiple sequence alignment (MSA) and store it in the file "clade_name.fasta". From this MSA file, StrainPhlAn will call RAxML to build the phylogenetic tree. Note that: all marker files (*.markers) must be used together as the input for the script because StrainPhlAn needs to align all of the strains at once.

The commands to run are:

mkdir -p output
python ../ --mpa_pkl ../databases/mpa_v20_m200.pkl --ifn_samples consensus_markers/*.markers --ifn_markers db_markers/s__Bacteroides_caccae.markers.fasta --ifn_ref_genomes reference_genomes/G000273725.fna.bz2 --output_dir output --nprocs_main 10 --clades s__Bacteroides_caccae &> output/log_full.txt

This step will take around 2 minutes. After this step, you will find the tree "output/RAxML_bestTree.s__Bacteroides_caccae.tree". All the output files can be found in the folder "output" in this link. You can view it by Archaeopteryx or any other viewers.

By default, if you do not specify reference genomes (by --ifn_ref_genomes) and any specific clade (by --clades), will build the phylogenetic trees for all species that it can detect.

In order to add the metadata, we also provide a script called "" which can be used as follows:

python ../strainphlan_src/ --ifn_trees output/RAxML_bestTree.s__Bacteroides_caccae.tree --ifn_metadatas fastqs/metadata.txt --metadatas subjectID

The script "" can accept multiple metadata files (space separated, wild card can also be used) and multiple trees. A metadata file is a tab separated file where the first row is the meta-headers, and the following rows contain the metadata for each sample. Multiple metadata files are used in the case where your samples come from more than one dataset and you do not want to merge the metadata files. For more details of using "", please see its help (with option "-h"). An example of a metadata file is the "fastqs/metadata.txt" file with the below content:

sampleID        subjectID
SRS055982       638754422
SRS022137       638754422
SRS019161       763496533
SRS013951       763496533
SRS014613       763840445
SRS064276       763840445
G000273725  ReferenceGenomes

Note that "sampleID" is a compulsory field.

After adding the metadata, you will obtain the tree files "*.tree.metadata" with metadata and view them by Archaeopteryx as in the previous step.

If you have installed graphlan, you can plot the tree with the command:

python ../strainphlan_src/ --ifn_tree output/RAxML_bestTree.s__Bacteroides_caccae.tree.metadata --colorized_metadata subjectID

and obtain the following figure (output/RAxML_bestTree.s__Bacteroides_caccae.tree.metadata.png):


Step 6. If you want to remove the samples with high-probability of containing multiple strains, you can rebuild the tree by removing the multiple strains:

python ../strainphlan_src/ --ifn_alignments output/s__Bacteroides_caccae.fasta --nprocs 10 --log_ofn output/build_tree_single_strain.log
python ../strainphlan_src/ --ifn_trees output/RAxML_bestTree.s__Bacteroides_caccae.remove_multiple_strains.tree --ifn_metadatas fastqs/metadata.txt --metadatas subjectID

You will obtain the refined tree "output/RAxML_bestTree.s__Bacteroides_caccae.remove_multiple_strains.tree.metadata". This tree can be found in the folder "output" in this link.

Some useful options

All option details can be viewed by help:

python ../ -h

The default setting can be stringent for some cases where you have very few samples left in the phylogenetic tree. You can relax some parameters to add more samples back:

  1. marker_in_clade: In each sample, the clades with the percentage of present markers less than this threshold are removed. Default "0.8". You can set this parameter to "0.5" to add some more samples.
  2. sample_in_marker: If the percentage of samples that a marker present in is less than this threhold, that marker is removed. Default "0.8". You can set this parameter to "0.5" to add some more samples.
  3. N_in_marker: The consensus markers with the percentage of N nucleotides greater than this threshold are removed. Default "0.2". You can set this parameter to "0.5" to add some more samples.
  4. gap_in_sample: The samples with full sequences concatenated from all markers and having the percentage of gaps greater than this threshold will be removed. Default 0.2. You can set this parameter to "0.5" to add some more samples.
  5. relaxed_parameters: use this option to automatically set the above parameters to add some more samples by accepting some more gaps, Ns, etc. This option is equivalent to set: marker_in_clade=0.5, sample_in_marker=0.5, N_in_marker=0.5, gap_in_sample=0.5. Default "False".
  6. relaxed_parameters2: use this option to add more samples by accepting some noise. This is equivalent to set marker_in_clade=0.2, sample_in_marker=0.2, N_in_marker=0.8, gap_in_sample=0.8. Default "False".

Some other useful output files

In the output folder, you can find the following files:

  1. clade_name.fasta: the alignment file of all metagenomic strains.
  2. *.marker_pos: this file shows the starting position of each marker in the strains.
  3. *.info: this file shows the general information like the total length of the concatenated markers (full sequence length), number of used markers, etc.
  4. *.polymorphic: this file shows the statistics on the polymorphic site, where "sample" is the sample name, "percentage_of_polymorphic_sites" is the percentage of sites that are suspected to be polymorphic, "avg_freq" is the average frequency of the dominant alleles on all polymorphic sites, "avg_coverage" is the average coverage at all polymorphic sites.