1. bioBakery
  2. Untitled Project
  3. biobakery


Clone wiki

biobakery / strainphlan

StrainPhlAn tutorial

StrainPhlAn is a tool for strain-level resolution of species across large sample sets, based on single nucleotide polymorphisms (SNPs) within conserved and unique species marker genes. The first step in the StrainPhlAn workflow is to run MetaPhlAn2.


The following figure shows the workflow of StrainPhlAn.




NOTE if you are running StrainPhlAn in bioBakery (either locally in the VM or in Google Cloud) than you do not need to install StrainPhlAn because StrainPhlAn and its dependencies are already preinstalled. You can then proceed to the How to run section.


You can install StrainPhlAn and other bioBakery tools automatically with HomeBrew for MacOS or LinuxBrew for Linux platforms.

$ brew tap biobakery/biobakery
$ brew install strainphlan

This will also install all StrainPhlAn dependencies.

From Source

Alternatively, you can manually install StrainPhlAn from source and then manually install the dependencies.

Step 1: Download StrainPhlAn and unpack the software:

$ tar xzvf biobakery-strainphlan-<versionid>.tar.gz
$ cd biobakery-strainphlan-<versionid>/

Step 2: Add all of the scripts to your PATH:

  1. Add this command export PATH=$MDIR:$MDIR/utils/:$MDIR/strainphlan_src/:$PATH to your $HOME/.bashrc file
    1. Replace $MDIR with the full path to the StrainPhlAn folder
    2. To get the value of $MDIR run $ pwd from inside the StrainPhlAn folder
  2. Source your $HOME/.bashrc file.
    1. $ source $HOME/.bashrc
    2. This only needs to be done once to update your PATH in your current session.
    3. All future sessions will automatically source this file when you login.

Alternatively, instead of adding the scripts to your PATH, you can provide the full path to the script each time you run. For example, the following command prints the MetaPhlAn2 help screen (replace $MDIR with the full path to the MetaPhlAn2 folder):

$ $MDIR/metaphlan2.py --help

Step 3: Install the StrainPhlAn dependencies.

Install bowtie2 in a folder in your PATH or specify the path to the bowtie2 executable with each run of StrainPhlAn using the option --bowtie2_exe <bowtie2>.

How to run

Follow these steps to learn how to process a set of 6 samples with StrainPhlAn. For the purpose of this tutorial, we will use the following set of 6 input files that have been subsampled for rapid analysis. These 6 files are stool microbiome shotgun sequence files from 3 individuals, each sampled 6 months apart. Click on the links to download each file. NOTE that if you are using the Google Chrome browser, it will automatically decompress (gunzip) the gzip-ed files.

StrainPhlAn can take as input .fasta, .fastq, .fasta.gz, and .fastq.gz files. Download the set of 6 demo fastq.gz files to get started on the tutorial.

Click on each individual file to download. These files will download to your Downloads folder. Once you have downloaded all six input files, make a new folder in your working directory and move the downloaded files there by typing the following commands at the terminal prompt.

  • Make a new folder in your current working directory:

    $ mkdir strainphlan_analysis
  • Move the six input files from the Downloads folder in to the strainphlan_analysis folder:

    $ mv ~/Downloads/135*.fasta.gz ~/Downloads/404*.fasta.gz ~/Downloads/192*.fasta.gz strainphlan_analysis/
  • Now change directories into the strainphlan_analysis folder and list the directory contents to see the six files:

    $ cd strainphlan_analysis
    $ ls

  • If you've run the PanPhlAn tutorial, you'll recognize these file names - but these are FASTQ sequence files rather than MetaPhlAn2 bowtie2 marker intermediate files. Why do we need to start with raw sequences for StrainPhlAn?
  • Why is it more informative to perform strain tracking on a collection of related metagenomes, rather than a single metagenome, or metagenomes from extremely different environments?

There are four steps to run the subsequent StrainPhlAn analysis.

Step 1: Run MetaPhlAn2

The first step is to run MetaPhlAn2 to obtain the sam output files. The sam files contain the alignment information from mapping the reads of each sample against the MetaPhlAn2 marker database. It takes about a minute to run each command.

$ metaphlan2.py 13530241_SF05.fasta.gz 13530241_SF05_profile.txt --bowtie2out 13530241_SF05_bowtie2.txt --samout 13530241_SF05.sam.bz2 --input_type multifasta
$ metaphlan2.py 13530241_SF06.fasta.gz 13530241_SF06_profile.txt --bowtie2out 13530241_SF06_bowtie2.txt --samout 13530241_SF06.sam.bz2 --input_type multifasta
$ metaphlan2.py 19272639_SF05.fasta.gz 19272639_SF05_profile.txt --bowtie2out 19272639_SF05_bowtie2.txt --samout 19272639_SF05.sam.bz2 --input_type multifasta
$ metaphlan2.py 19272639_SF06.fasta.gz 19272639_SF06_profile.txt --bowtie2out 19272639_SF06_bowtie2.txt --samout 19272639_SF06.sam.bz2 --input_type multifasta
$ metaphlan2.py 40476924_SF05.fasta.gz 40476924_SF05_profile.txt --bowtie2out 40476924_SF05_bowtie2.txt --samout 40476924_SF05.sam.bz2 --input_type multifasta
$ metaphlan2.py 40476924_SF06.fasta.gz 40476924_SF06_profile.txt --bowtie2out 40476924_SF06_bowtie2.txt --samout 40476924_SF06.sam.bz2 --input_type multifasta

If you would like to run with multiple cores, add the option --nproc.

Alternatively, if you're familiar with bash shell syntax, you can loop over the entire sample set:

$ for file in *.fasta.gz
> do
>    metaphlan2.py $file ${file%.fasta.gz}_profile.txt --bowtie2out ${file%.fasta.gz}_bowtie2.txt --samout ${file%.fasta.gz}.sam.bz2 --input_type multifasta
> done

The sam output files generated from the commands above, can be downloaded from the following links:

  • Why can't this step use the "simple" bowtie2 intermediate files that MetaPhlAn2 saves by default?

Step 2: Run sample2markers

Providing the sam output files, from the prior step, to the sample2markers script will generate a marker file for each sample. The marker files contain the consensus of unique marker genes for each species found in the sample, which will be used for SNP profiling. Run the commands below to generate the marker files in your current working directory.

$ sample2markers.py --ifn_samples 13530241_SF05.sam.bz2 --input_type sam --output_dir .
$ sample2markers.py --ifn_samples 13530241_SF06.sam.bz2 --input_type sam --output_dir .
$ sample2markers.py --ifn_samples 19272639_SF05.sam.bz2 --input_type sam --output_dir .
$ sample2markers.py --ifn_samples 19272639_SF06.sam.bz2 --input_type sam --output_dir .
$ sample2markers.py --ifn_samples 40476924_SF05.sam.bz2 --input_type sam --output_dir .
$ sample2markers.py --ifn_samples 40476924_SF06.sam.bz2 --input_type sam --output_dir .

NOTE that after successfully executing this command there will be a message printed out that looks something like the output below. Running this command takes about 5 minutes per file so instead of running the remaining five you can download the remaining marker files from the links below.

$ sample2markers.py --ifn_samples 13530241_SF05.sam.bz2 --input_type sam --output_dir .
ooSubprocess: dump_file.py --input_file 13530241_SF05.sam.bz2 | samtools view -bS - | samtools sort -o - ./13530241_SF05.sam.bz2.bam.sorted | samtools mpileup -u - | bcftools view -c -g -p 1.1 - | fix_AF1.py --input_file - | vcfutils.pl vcf2fq

If you would like to run with multiple cores, add the option --nproc.

The marker output files generated from the commands above, can be downloaded from the following links:

  • Try looking at the .markers files using less. What happens, and why? What information is contained in these files, and why not store it as plain text?

Step 3: Identify clades detected in the samples and build reference databases

Run StrainPhlAn to identify the clades that were detected in all samples, providing the marker files generated in the prior step, to see which clades can be SNP-profiled.

$ strainphlan.py --ifn_samples *.markers --output_dir . --print_clades_only > clades.txt

If you would like to run with multiple cores, add the option --nprocs_main.

The clades.txt file contains on a single line with the species name s__Eubacterium_siraeum. In a real sample, many organisms might be listed, but these demonstration files were constructed by subsampling reads from this species across several metagenomes. When profiling full sequence files (which have not been subsampled), expect to see many more species listed.

Next a reference database will be built for the species. NOTE this only needs to be done once for each species for all projects. Since the input file required for the build is large (all_markers.fasta is 736M), for the purpose of the tutorial, feel free to skip this step and download the output files.

$ extract_markers.py --mpa_pkl mpa_v20_m200.pkl --ifn_markers all_markers.fasta --clade s__Eubacterium_siraeum --ofn_markers s__Eubacterium_siraeum.markers.fasta

The marker reference file generated from the command above, can be downloaded from the following link:

  • Why are the input files required for extract_markers.py unavoidably large?
  • Explain the contents of the s__Eubacterium_siraeum.markers.fasta file.

Step 4: Generate trees from alignments

Run StrainPhlAn to generate alignments and then trees, providing the sample marker files and the clade reference marker file generated in prior steps. Also provide a genome sequence for E. siraeum from the NCBI RefSeq database, for comparative purposes, which can be downloaded from the following link:

  • GCF_000154325.fna.bz2:

    $ strainphlan.py --ifn_samples *.markers --ifn_markers s__Eubacterium_siraeum.markers.fasta --ifn_ref_genomes GCF_000154325.fna.bz2 --output_dir . --clades s__Eubacterium_siraeum --marker_in_clade 0.2

The command above reduces the marker_in_clade option default because the input files are small sub-sampled demos. If you would like to run with multiple cores, add the option --nprocs_main.

The tree generated from the command above, can be downloaded from the following link:

The alignment file generated from the command above, can be downloaded from the following link:

  • Explain the format and contents of the phylogenetic tree output file.
  • Explain the format and contents of the multiple alignment output file.

Visualize results

The sections that follow cover some of the common methods to visualize the results of a StrainPhlAn anlaysis. We will use the ETE Toolkit for quick visualization online. We will also visualize StrainPhlAn trees with the GraPhlAn and the ggtree R package. NOTE that GraPhlAn and ggtree packages are already preinstalled in bioBakery so you don't need to install them.

A list of tree editors for exploring phylogenetic trees can be found here.

Multiple sequence alignment files can be explored with the R package msa and interactively with Jalview.

Visualization with ETE

The tree and multiple marker sequence alignment file from the previous tutorial steps can be input together for an online visualization using the ETE Toolkit.

In your browser, go to the ETE Toolkit Phylogenetic tree viewer. Upload the tree and merged marker alignment file from the previous tutorial steps. Select Condensed format from the Alignment image type drop down menu. Click on View Tree!.

The tree generated will appear as follows:


Visualization with GraPhlAn

Step 1: Install GraPhlAn with Homebrew:

$ brew tap biobakery/biobakery
$ brew install graphlan

This will install GraPhlAn and all of its dependencies.

Alternatively, you can manually install GraPhlAn from source by downloading GraPhlAn and then installing the GraPhlAn dependencies (numpy, pandas, biopython, scipy, and matplotlib). NOTE that if you are running this tutorial in the bioBakery VM then you do not need to install GraPhlAn because GraPhlAn and its dependencies are already preinstalled.

Step 2: Visualize the results

First add the metadata to your tree. Download the metadata.txt file for the steps that follow.

$ add_metadata_tree.py --ifn_trees RAxML_bestTree.s__Eubacterium_siraeum.tree --ifn_metadata metadata.txt

Next provide the tree with metadata to create a dendrogram.

$ plot_tree_graphlan.py --ifn_tree RAxML_bestTree.s__Eubacterium_siraeum.tree.metadata  --colorized_metadata SubjectID --leaf_marker_size 60 --legend_marker_size 60

The resulting image is shown below:


Visualization with ggtree

Follow these steps to create dendrograms and an ordination plot using StrainPhlAn output files, ggtree, distmat, and scripts from Breadcrumbs.

Step 1: Install Breadcrumbs with Homebrew.

$ brew tap biobakery/biobakery
$ brew install breadcrumbs

This will install breadcrumbs and all of its dependencies.

Alternatively, you can manually install Breadcrumbs from source and then install the dependencies. NOTE that if you are running this tutorial in the bioBakery VM then you do not need to install Breadcrumbs because Breadcrumbs and its dependencies are already preinstalled.

Step 2: Generate dendrograms

Run the following command, to generate two dendrograms, providing the output files from StrainPhlAn along with a metadata file. Download the metadata.txt file for the steps that follow.

$ strainphlan_ggtree.R RAxML_bestTree.s__Eubacterium_siraeum.tree metadata.txt s__Eubacterium_siraeum.fasta strainphlan_tree_1.png strainphlan_tree_2.png

The dendrograms created will look like the following images:


Here we are decorating the dendrogram with subject ID metadata. Notice that the two samples from subject ID13530241 have identical strains, without any SNPs between them. While this can be a true observation of phylogenetic relatedness, in this case it is an artifact of subsampling. The input files for this tutorial were subsampled to approximately 2,000 reads per file. The scale bar at the bottom of the dendrogram shows the units of branch length; i.e. the number of nucleotide substitutions per site.


The second dendrogram is also decorated with a slice of the multiple sequence alignment of unique gene marker sequences, where colors in the alignment reveal SNPs.

Step 3: Generate an ordination plot.

Next, we can use the multiple sequence alignment file to generate a phylogenetic distance matrix that contains the pairwise nucleotide substitution rate between strains. We will use the distmat function from the EMBOSS package. NOTE that if you are running the analysis in bioBakery then you do not need to install distmat because EMBOSS is already installed. Otherwise you can generate a distance matrix using the online implementation of distmat. Once we have the distance matrix, we will use it to perform PCoA (Principal Coordinate Analysis).

Enter this command from your terminal:

$ distmat

Following this command provide the MSA input file.

  • Create a distance matrix from a multiple sequence alignment
  • Input aligned sequence set: s__Eubacterium_siraeum.fasta

When prompted for the correction method, enter 2 (Kimura 2-parameter).

When prompted for output file, press the Enter key to accept the default name [s__eubacterium_siraeum.distmat].

If you open the distmat output file in a text editor, you should see something like this:


You can download the distance matrix file from this link: s__eubacterium_siraeum.distmat

The values in the distance matrix represent the nucleotide substitution rate, i.e. the number of substitutions per 100 nucleotides of sequence between a given pair of samples.

  • How does the Kimura 2-parameter distance differ from counting the number (or percentage) of non-identical sites in an alignment of two nucleotide sequences?
  • Given that 97% nucleotide identity is a common threshold for defining that two genomes derive from the same species, are the distances that you're measuring here reasonable for strains of the *same* species?

Now run the following command to generate an ordination plot from the distance matrix:

$ strainphlan_ordination.R s__eubacterium_siraeum.distmat metadata.txt strainphlan_ordination.png

The plot created will look like that below:


Notice again that, as in the dendrograms above, the two samples from subject ID13530241 have identical strains, without any SNPs between them and are hence overlaid one on top of the other in the ordination.

  • What do you observe about the relative similarity of strains in this visualizations?
  • In particular, how do strains from the same individual compare to those from other individuals?
  • How does the reference strain compare to the metagenomic strains?
  • How might you assign a measure of statistical confidence to your answers above?
  • Which visualization method do you prefer? What are the strengths and limitations of the different visualization methods?