1. bioBakery
  2. Untitled Project
  3. biobakery


Clone wiki

biobakery / humann2

HUMAnN2 Tutorial

Legacy HUMAnN1 tutorial

HUMAnN2 (the HMP Unified Metabolic Analysis Network) is a method for efficiently and accurately determining the presence, absence, and abundance of metabolic pathways in a microbial community from metagenomic or metatranscriptomic sequencing data. It is appropriate for any type of microbial community shotgun sequence profiling (i.e. not just the human microbiome; the name is a historical product of its origin in a Human Microbiome Project working group).

HUMAnN2 can be installed from PyPI, as source code from the HUMAnN2 Bitbucket repository, or using our quickstart documentation.

Support is available from the HUMAnN2 Google Group. Feel free to join the group or to post any questions directly by emailing humann-users@googlegroups.com.

If you have already installed HUMAnN2 and would like to upgrade your installed version to the latest version, please see the installation update section of the HUMAnN2 User Manual.

1. Setup

1.1 Requirements

NOTE: Bowtie2, Diamond, and MinPath are automatically installed when installing HUMAnN2.

1.2 Installation

Please note, if you are using bioBakery (Vagrant VM or Google Cloud) you do not need to install HUMAnN2 because the tool and its dependencies are already installed.

1.2.1 Install HUMAnN2

The easiest way to install HUMAnN2 is with pip or Brew (HomeBrew for MacOS or LinuxBrew for Linux platforms). For other options, please refer to the HUMAnN2 User Manual.

To install with pip:

$ pip install humann2

To install with brew (which will also install MetaPhlAn2 and utility dependencies):

$ brew tap biobakery/biobakery
$ brew install humann2

Note that by default, a pip or other HUMAnN2 install will NOT include the full nucleotide or amino acid search databases needed for raw metagenome or metatranscriptome profiling, as these are quite large and should be installed based on project needs. A typical environment, however, will require one of each, and we default to ChocoPhlAn for the former and UniRef90 for the latter (note that for the purposes of this tutorial alone, you do not need to run these commands, which are provided as examples):

$ humann2_databases --download chocophlan full humann2_database_downloads
$ humann2_databases --download uniref uniref90_ec_filtered_diamond humann2_database_downloads

The commands above download the databases to a folder named humann2_database_downloads. Please note you can download these databases to any location on your machine. HUMAnN2 will track where you have placed the databases and update its configuration file.

After installation from pip or another source, you may optionally test your local HUMAnN2 environment:

$ humann2_test

Which yields (abbreviated):

test_add_length_annotation (basic_tests_utilities.TestHumann2UtilitiesFunctions) ... ok
test_add_length_annotation_spaces (basic_tests_utilities.TestHumann2UtilitiesFunctions) ... ok
test_break_up_fasta_file (basic_tests_utilities.TestHumann2UtilitiesFunctions) ... ok
test_count_reads_fasta (basic_tests_utilities.TestHumann2UtilitiesFunctions) ... ok

1.2.2 Install MetaPhlAn2

HUMAnN2 provides organism-specific community functional profiles, and to do so it first detects the organisms present in a community using MetaPhlAn2. As MetaPhlAn2 is thus a prerequisite for running the HUMAnN2 software, please ensure that you've installed MetaPhlAn2 before running HUMAnN2.

2. Metagenome functional profiling

A note on viewing files from the command line: During this tutorial you'll inspect many tabular output files. For simplicity the tutorial uses less -S file.tsv for this task. To match the more Excel-like columns seen in the tutorial document, you would first use the column command, as in:

column -t -s $'\t' file.tsv | less -S

2.1 HUMAnN2 input formats

Each HUMAnN2 run can process either a single community sample's raw reads, or (in more advanced usage) pre-computed gene data from multiple samples. Input files can thus be of the following types:

  • Quality controlled shotgun metagenomes or metatranscriptomes (fastq, fastq.gz, fasta, or fasta.gz)
  • Mapping results (sam, bam or blastm8)
  • Gene abundance tables (tsv or biom)

This tutorial will use demo inputs distributed with the HUMAnN2 installation under the examples directory. If you installed from pip or Homebrew and don't have this handy, you can obtain a copy by right-clicking these links and selecting "save link as":

2.2 Running HUMAnN2: the basics

The most basic HUMAnN2 execution uses nucleotide mapping and translated search to provide organism-specific gene and pathway abundance profiles from a single metagenome. By default, gene families are annotated using UniRef90 identifiers and pathways using MetaCyc IDs. From the examples folder (or the location where you saved demo.fastq above) execute the following:

$ humann2 --input demo.fastq --output demo_fastq --gap-fill on

Which yields (blank lines removed):

Creating output directory: demo_fastq
Output files will be written to: demo_fastq
Running metaphlan2.py ........
Found g__Bacteroides.s__Bacteroides_dorei : 59.18% of mapped reads
Found g__Bacteroides.s__Bacteroides_vulgatus : 40.82% of mapped reads
Total species selected from prescreen: 2
Selected species explain 100.00% of predicted community composition
Creating custom ChocoPhlAn database ........
Running bowtie2-build ........
Running bowtie2 ........
Total bugs from nucleotide alignment: 2
g__Bacteroides.s__Bacteroides_vulgatus: 7416 hits
g__Bacteroides.s__Bacteroides_dorei: 8908 hits
Total gene families from nucleotide alignment: 3722
Unaligned reads after nucleotide alignment: 22.2666666667 %
Running diamond ........
Aligning to reference database: uniref90_demo_prots.dmnd
Total bugs after translated alignment: 3
g__Bacteroides.s__Bacteroides_vulgatus: 7416 hits
unclassified: 327 hits
g__Bacteroides.s__Bacteroides_dorei: 8908 hits
Total gene families after translated alignment: 3744
Unaligned reads after translated alignment: 18.2666666667 %
Computing gene families ...
Computing pathways abundance and coverage ...
Output files created:

Note that the flag --gap-fill is set to on for this demo. Gap-filling provides robustness to missing (undersampled) reactions when quantifying metabolic pathways, which occurs frequently in the demo datasets due to their low sequencing depth.

  • Gap filling allows us to quantify a pathway even when a small number of its reactions are conspicuously absent. What might be some pros/cons of this approach in real data?

2.3 HUMAnN2 default output

When HUMAnN2 is run from any input type, three main output files will be created:


For the demo.fastq file analyzed above, these look like:


Let's dig into the contents of the individual files:


This file lists the abundances of each gene family in the community in RPK (Reads Per Kilobase) units. Each gene family's total abundance is broken down into the contributions from individual organisms when possible (delineated by the pipe character, e.g. UniRef90_R6HHA8|g__Bacteroides.s__Bacteroides_dorei), or unclassified when mappings are only found using translated search (which is not organism-specific). We refer to this format as "stratified output" and the individual per-organism and unclassified contributions are referred to as "strata" (singular "stratum"). The RPKs of reads that map to gene families that lack UniRef90 IDs are captured as UniRef90_unknown, and unmapped read abundances as UNMAPPED. The total RPK of each community is thus captured by summing either 1) UNMAPPED, _unknown, and all UniRef IDs prior to organism-specific decomposition, or 2) UNMAPPED, _unknown|<organism>, and all UniRef|<organism> IDs. The sum of either set of values can then be normalized to CPMs (Copies Per Million) or relative abundances. To inspect the demo_genefamilies.tsv file, use:

$ less -S demo_fastq/demo_genefamilies.tsv

Which yields:

# Gene Family                                            demo_Abundance-RPKs
UNMAPPED                                                 3836.0000000000
UniRef90_unknown                                         1641.2181881965
UniRef90_unknown|g__Bacteroides.s__Bacteroides_dorei     898.7224083120
UniRef90_unknown|g__Bacteroides.s__Bacteroides_vulgatus  742.4957798844
UniRef90_R6HHA8                                          333.3333333333
UniRef90_R6HHA8|g__Bacteroides.s__Bacteroides_dorei      333.3333333333
UniRef90_G1UMF5                                          66.6666666667
UniRef90_G1UMF5|g__Bacteroides.s__Bacteroides_vulgatus   66.6666666667
UniRef90_R7P3A5                                          55.5555555556

Let's find some of the "unclassified" stratifications in this file using the grep command:

$ grep 'unclassified' demo_fastq/demo_genefamilies.tsv | less -S

Which yields:

UniRef90_U5FT06|unclassified  16.7693829744
UniRef90_Q9ZUH4|unclassified  13.7815543238
UniRef90_Z9K4C5|unclassified  12.7438684783
UniRef90_Q8GT21|unclassified  12.7217525466
UniRef90_W8YTG4|unclassified  12.6390305338
UniRef90_W1IYL7|unclassified  12.1580837819
UniRef90_U3KP22|unclassified  11.6979338377
UniRef90_W7DVV5|unclassified  11.5904796756
UniRef90_U3KI10|unclassified  11.5589897710
UniRef90_Z5XKU5|unclassified  11.3736016532

  • Out of the box, HUMAnN2 can quantify protein families at 90 or 50% identity (via UniRef90 and UniRef50, respectively). What are the advantages of quantifying UniRef90 gene families? UniRef50 families?


This file lists the abundances of each pathway in the community, also in RPK units as described for gene families. For details on the calculation of pathway abundances from constituent gene family abundances, see the HUMAnN2 User Manual. Like gene families, pathways are broken down into per-organism contributions when possible (again delineated by the pipe character), with the total read abundance consisting of pathways assigned to organisms, UNMAPPED reads, and UNINTEGRATED reads that are attributable to specific gene families (typically also within specific organisms), but for gene families that are not annotated to any pathway in the corresponding database (e.g. MetaCyc). As with gene families, RPKs can be normalized to copies per million or relative abundances by summing either 1) UNMAPPED reads, UNINTEGRATED organism-specific features, and organism-specific (i.e. pipe-containing) pathways, _or_ 2) UNMAPPED reads, total UNINTEGRATED features, and pathways that are not broken down per organism. To inspect the demo_pathabundance.tsv file, use:

$ less -S demo_fastq/demo_pathabundance.tsv

Which yields:

# Pathway                                                                                          demo_Abundance
UNMAPPED                                                                                           1144.4708467763
UNINTEGRATED                                                                                       5587.1252634558
UNINTEGRATED|unclassified                                                                          32.6650985274
UNINTEGRATED|g__Bacteroides.s__Bacteroides_vulgatus                                                2889.8374763450
UNINTEGRATED|g__Bacteroides.s__Bacteroides_dorei                                                   3137.3888860073
PWY-6305: putrescine biosynthesis IV                                                               12.3568523173
PWY-6305: putrescine biosynthesis IV|unclassified                                                  12.3568523173
PWY490-3: nitrate reduction VI (assimilatory)                                                      10.9963501548
PWY490-3: nitrate reduction VI (assimilatory)|unclassified                                         10.9963501548
PWY-1269: CMP-3-deoxy-D-manno-octulosonate biosynthesis I                                          10.6471569863
PWY-1269: CMP-3-deoxy-D-manno-octulosonate biosynthesis I|unclassified                             8.9740635174
COA-PWY-1: coenzyme A biosynthesis II (mammalian)                                                  6.7591572814
COA-PWY-1: coenzyme A biosynthesis II (mammalian)|g__Bacteroides.s__Bacteroides_vulgatus           5.7967729695
PWY-6700: queuosine biosynthesis                                                                   5.5677097144
PWY-6700: queuosine biosynthesis|g__Bacteroides.s__Bacteroides_dorei                               3.3937482013
PWY-7111: pyruvate fermentation to isobutanol (engineered)                                         3.3883947480
PWY-7111: pyruvate fermentation to isobutanol (engineered)|g__Bacteroides.s__Bacteroides_vulgatus  1.5541525012
VALSYN-PWY: L-valine biosynthesis                                                                  3.3883947480
VALSYN-PWY: L-valine biosynthesis|g__Bacteroides.s__Bacteroides_vulgatus                           1.5541525012

  • Unlike gene family abundance, stratified pathway abundance does not necessarily sum to yield the community total pathway abundance. Why not? (Hint: HUMAnN2 estimates the number of *complete* pathways in the community or a given stratification.)


This file lists the coverage of each pathway in the community, that is, the probability that it is provided as a unit to the metagenome or metatranscriptome either 1) within one or more particular organisms (again pipe-delimited) or 2) across one or more organisms in aggregate. For details on the calculation of pathway coverage from constituent gene family abundances, see the HUMAnN2 User Manual. Pathways' coverages should be considered probabilistic presence/absence calls for each pathway and NOT summed, and only cataloged pathways are individually analyzed this way. To inspect the demo_pathcoverage.tsv file, use:

$ less -S demo_fastq/demo_pathcoverage.tsv

Which yields:

# Pathway                                                                                          demo_Coverage
UNMAPPED                                                                                           1.0000000000
UNINTEGRATED                                                                                       1.0000000000
UNINTEGRATED|unclassified                                                                          1.0000000000
UNINTEGRATED|g__Bacteroides.s__Bacteroides_vulgatus                                                1.0000000000
UNINTEGRATED|g__Bacteroides.s__Bacteroides_dorei                                                   1.0000000000
PWY-6305: putrescine biosynthesis IV                                                               0.9552224767
PWY-6305: putrescine biosynthesis IV|unclassified                                                  0.7062665072
PWY490-3: nitrate reduction VI (assimilatory)                                                      0.9600881535
PWY490-3: nitrate reduction VI (assimilatory)|unclassified                                         0.7119750457
PWY-1269: CMP-3-deoxy-D-manno-octulosonate biosynthesis I                                          0.9435320772
PWY-1269: CMP-3-deoxy-D-manno-octulosonate biosynthesis I|unclassified                             0.5216465635
COA-PWY-1: coenzyme A biosynthesis II (mammalian)                                                  0.7994939780
COA-PWY-1: coenzyme A biosynthesis II (mammalian)|g__Bacteroides.s__Bacteroides_vulgatus           0.9113035262
PWY-6700: queuosine biosynthesis                                                                   0.7058918254
PWY-6700: queuosine biosynthesis|g__Bacteroides.s__Bacteroides_dorei                               0.7173738671
PWY-7111: pyruvate fermentation to isobutanol (engineered)                                         0.4364431285
PWY-7111: pyruvate fermentation to isobutanol (engineered)|g__Bacteroides.s__Bacteroides_vulgatus  0.4056642097
VALSYN-PWY: L-valine biosynthesis                                                                  0.4364431285
VALSYN-PWY: L-valine biosynthesis|g__Bacteroides.s__Bacteroides_vulgatus                           0.4056642097

2.4 Running HUMAnN2: pre-mapped SAM/BAM input

The raw results of mapping metagenome/metatranscriptome reads against a gene family nucleotide database can be saved as a SAM/BAM file and used to run HUMAnN2 after the mapping (e.g. bowtie2) step. The only requirement is that gene family names provided in such a file be annotated using either 1) identifiers compatible with HUMAnN2's default mapping to UniRef, or 2) UniRef identifiers directly. Unmapped reads in such an input file are searched against a non-organism-specific protein database as are unmapped reads in a "full" HUMAnN2 run.

$ less -S demo.sam

which yields:

s__Bacteroides_dorei_read000001  16  gi|423242213|ref|NZ_JH724160.1|:185833-186759|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_R7NX76|UniRef50_R5GDB8|927     423   11  100M  *  0  0  TTTCGTATTAGTGGAGTATGGCATTGAAAGTACGGATGATGACACGTTGCGCAGGATAAACCGGGGACATACTTTTGCCGTTTCTGCAGAGGCTGTTCGG  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-12  XS:i:-30  XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2              MD:Z:3T38A57       YT:Z:UU
s__Bacteroides_dorei_read000002  0   gi|423245752|ref|NZ_JH724135.1|:381614-382081|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_A6L5K0|UniRef50_A6L5K0|468     88    25  100M  *  0  0  TTTACTGTAGTCGCCCTATTGGGATTACAGGATTCAGACTCTAACATAACCATCGGAAACAACTTCGAGAGCAAAAAGCTAGGTAAAAAAGGAATTTTCA  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-6   XS:i:-42  XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1              MD:Z:96A3          YT:Z:UU
s__Bacteroides_dorei_read000003  16  gi|423238261|ref|NZ_JH724153.1|:187959-189296|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_A6L5D4|UniRef50_C6X591|1338    1159  0   100M  *  0  0  AATACACGACAGGAAGATTATTCCGGCAAGAATTTTTCGGAGTTTGAACTGAGCGCTATGGAAAAACGTCATATTGCTAAAGTTCTGCAACATACTAAAG  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-24  XS:i:-36  XN:i:0  XM:i:4  XO:i:0  XG:i:0  NM:i:4              MD:Z:5C28A16G0C47  YT:Z:UU
s__Bacteroides_dorei_read000004  16  gi|294777006|ref|NZ_ADKO01000074.1|:c27580-26633|821|g__Bacteroides.s__Bacteroides_vulgatus|UniRef90_A6L7A5|UniRef50_R7NQK6|948  802   1   100M  *  0  0  CAACACTCACGGGAAATCCCCTTCAATGAATTATGGGACAGAATGCTAAGAATCAGAAAAAACAATTATATACATCTTCCACTTTACCTAAAAATAAAGT  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-12  XS:i:-12  XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2              MD:Z:85C0T13       YT:Z:UU
s__Bacteroides_dorei_read000006  16  gi|423231083|ref|NZ_JH724113.1|:c774054-771490|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_unknown|UniRef50_D5EYI2|2565  184   7   100M  *  0  0  AATCCCGTTGCTTTATTGGAACGTTTGAGCTATGAAAAGCAGGAGGCATTAACCCAGGATAAGGTTATTGTAAAGCGCCTGAACGATGTATATGCTAAAT  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-18  XS:i:-30  XN:i:0  XM:i:3  XO:i:0  XG:i:0  NM:i:3              MD:Z:40T24G12A21   YT:Z:UU
s__Bacteroides_dorei_read000007  16  gi|423231083|ref|NZ_JH724113.1|:c74511-72820|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_R7P3C8|UniRef50_X5DCD9|1692     629   1   100M  *  0  0  TGGCAGGAGAAGCCTATCTTCGGATGGGTATGCGTGATGCTTCCTAATTCAAGAAGGCGGAAGATGCGGTGACGCCTATTATACCAGGTGGCAAATATGA  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-18  XS:i:-18  XN:i:0  XM:i:3  XO:i:0  XG:i:0  NM:i:3              MD:Z:46T36A7T8     YT:Z:UU
s__Bacteroides_dorei_read000009  0   gi|423227774|ref|NZ_JH724110.1|:782275-785145|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_unknown|UniRef50_R7DSH7|2871   804   2   100M  *  0  0  AGGTACAGCCGGCTTTGCCTCCACCGAGAAAAAACCAGAAAAACCCTTTGCACAATGGATTCCCCGTATTCCCGAAACAGGAAAATATGCCGTCGATGTG  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-24  XS:i:-30  XN:i:0  XM:i:4  XO:i:0  XG:i:0  NM:i:4              MD:Z:10C25G6G50T5  YT:Z:UU
s__Bacteroides_dorei_read000013  16  gi|423227774|ref|NZ_JH724110.1|:c157707-156019|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_R6AN66|UniRef50_P9WQK2|1689   1394  12  100M  *  0  0  TGGCACTGAAGGAAGAAGGCAACGTGCTGCTGCTGGACGAGCCTACCAATGATATTGACGTGAACTCGCTGCGCGCGCTGGAGGAAGGTCTGGACGATTT  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-6   XS:i:-18  XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1              MD:Z:65A34         YT:Z:UU

Here, we can also demonstrate speeding up a HUMAnN2 run by taking advantage of multiple cores using the --threads argument:

$ humann2 --input demo.sam --output demo_sam --threads 4 --gap-fill on

which yields (blank lines removed):

Creating output directory: demo_sam
Output files will be written to: demo_sam
Process the sam mapping results ...
Computing gene families ...
Computing pathways abundance and coverage ...
Output files created:

Note the creation of the three expected output files.

2.5 Running HUMAnN2: pre-searched protein BLASTX M8 input

HUMAnN2 typically searches metagenome/metatranscriptome reads against two databases: first a nucleotide database using rapid mapping, then a broader protein database using more computationally expensive translated search. One can execute HUMAnN2 using either search alone, however; using nucleotides only is more computationally efficient but less sensitive, and using translated search alone can emulate HUMAnN1 output or be carried out even for very novel or unusual microbial communities (i.e. those with few reference genomes available). If BLASTX tab-delimited M8 format search results have already been generated for HUMAnN2-compatible protein IDs (UniRef90 by default), they can be run by providing the appropriate input format. Let's examine such a file:

$ less -S demo.m8

Which yields:

s__Bacteroides_vulgatus_read000635|100  UniRef90_X8JRH1|978   57.9  19  8   0  95   39  189  207  2.0e-01  21.6
unclassified_read000001|100             UniRef90_W8YTG4|1614  90.6  32  3   0  98   3   382  413  9.0e-15  65.9
unclassified_read000002|100             UniRef90_U5FT06|1224  90.9  33  3   0  100  2   127  159  2.1e-14  64.7
unclassified_read000003|100             UniRef90_V5V2L3|2169  90.6  32  3   0  3    98  388  419  4.5e-14  63.5
unclassified_read000004|100             UniRef90_U5FT06|1224  90.6  32  3   0  99   4   148  179  4.6e-14  63.5
unclassified_read000005|100             UniRef90_X8FHU5|1080  90.6  32  3   0  98   3   240  271  3.1e-15  67.4
unclassified_read000006|100             UniRef90_Z9JXD8|1428  90.6  32  3   0  98   3   22   53   2.6e-14  64.3
unclassified_read000007|100             UniRef90_Q6XMI3|1137  90.6  32  3   0  3    98  155  186  2.0e-14  64.7
unclassified_read000007|100             UniRef90_Q9SPV4|1077  65.6  32  11  0  3    98  129  160  1.5e-09  48.5
unclassified_read000008|100             UniRef90_Z9JUT2|1308  90.6  32  3   0  98   3   222  253  2.4e-15  67.8

To run HUMAnN2 using this file as input, use:

$ humann2 --input demo.m8 --output demo_m8 --gap-fill on

This again produces the three expected outputs (blank lines removed):

Creating output directory: /demo_m8
Output files will be written to: /demo_m8
Process the blastm8 mapping results ...
Computing gene families ...
Computing pathways abundance and coverage ...
Output files created:

2.6 Running HUMAnN2: nucleotide search alone

Conversely, to run HUMAnN2 without any translated protein search (i.e. using only rapid nucleotide mapping alone), one can provide the --bypass-translated-search flag. This applies to FASTA, FASTQ, SAM/BAM, or other raw nucleotide inputs:

$ humann2 --input demo.fastq --output demo_fastq_noaa --threads 4 --bypass-translated-search --gap-fill on

Note the lack of DIAMOND execution in this example as compared to the FASTQ analysis above:

Creating output directory: demo_fastq_noaa
Output files will be written to: demo_fastq_noaa
Running metaphlan2.py ........
Found g__Bacteroides.s__Bacteroides_dorei : 59.18% of mapped reads
Found g__Bacteroides.s__Bacteroides_vulgatus : 40.82% of mapped reads
Total species selected from prescreen: 2
Selected species explain 100.00% of predicted community composition
Creating custom ChocoPhlAn database ........
Running bowtie2-build ........
Running bowtie2 ........
Total bugs from nucleotide alignment: 2
g__Bacteroides.s__Bacteroides_vulgatus: 7429 hits
g__Bacteroides.s__Bacteroides_dorei: 8916 hits
Total gene families from nucleotide alignment: 3898
Unaligned reads after nucleotide alignment: 22.1666666667 %
Bypass translated search
Computing gene families ...
Computing pathways abundance and coverage ...
Output files created:

It is informative to compare the results of this demo file with and without translated search; even in a small, subsampled input, including amino acid search is more computationally costly but recovers additional gene families. To count unique gene families in the original file, execute:

$ cut -f1 demo_fastq/demo_genefamilies.tsv | grep -v "|" | sort -u | wc -l

(cut -f1 extracts the first column of the file, the gene headers; grep -v "|" removes the stratified headers; sort -u isolates the unique headers; and wc -l counts those headers.) The resulting answering is 3746 (unique gene families). The same command executed on the file demo_fastq_noaa.demo_genefamilies.tsv yields 3900 unique gene families. We infer that 3900 - 3746 = 154 additional gene families were uncovered using translated search.

In general, any HUMAnN2 intermediate file can be saved and used to resume processing from an intermediate point in the analysis pipeline, and several steps including taxonomic pre-screening, nucleotide search, or amino acid search can be independently skipped. For more information, see the relevant sections of the HUMAnN2 User Manual.

  • Why is nucleotide-level search generally faster than translated search? What makes nucleotide-level search extra fast in HUMAnN2?
  • Under what circumstances would skipping translated search be reasonable? Under what circumstances would skipping translated search be a terrible idea?

3. Manipulating HUMAnN2 output tables

3.1 Attaching names to features

To reduce file size in HUMAnN2 tables containing large numbers of features, feature names are not included by default. To attach feature names to a table containing only feature identifiers, use the humann2_rename_table script:

$ humann2_rename_table --input demo_fastq/demo_genefamilies.tsv --output demo_fastq/demo_genefamilies-names.tsv --names uniref90

Features without names will be named "NO_NAME". Inspect the named features in the output file demo_fastq/demo_genefamilies-names.tsv using less:

# Gene Family                                                                                                   demo_Abundance-RPKs
UniRef90_C3R7K2: Polysaccharide deacetylase                                                                     45.7650273224
UniRef90_C3R7K2: Polysaccharide deacetylase|g__Bacteroides.s__Bacteroides_dorei                                 45.7650273224
UniRef90_A6KZR7: Fumarate reductase iron-sulfur cluster protein subunit                                         16.7427701674
UniRef90_A6KZR7: Fumarate reductase iron-sulfur cluster protein subunit|g__Bacteroides.s__Bacteroides_dorei     9.1324200913
UniRef90_A6KZR7: Fumarate reductase iron-sulfur cluster protein subunit|g__Bacteroides.s__Bacteroides_vulgatus  7.6103500761
UniRef90_A6L5B2: Ferredoxin domain containing protein                                                           16.2037037037
UniRef90_A6L5B2: Ferredoxin domain containing protein|g__Bacteroides.s__Bacteroides_vulgatus                    9.2592592593
UniRef90_A6L5B2: Ferredoxin domain containing protein|g__Bacteroides.s__Bacteroides_dorei                       6.9444444444
UniRef90_R6HM78: Hydrogenase maturation GTPase HydF                                                             15.4937654938

3.2 Normalizing RPKs to relative abundance

To facilitate comparisons between samples with different sequencing depths, it is beneficial to normalize gene family and pathway abundance RPK values to compositional units (copies per million or relative abundance). Pathway coverage does not need to be normalized in this manner. We can normalize a HUMAnN2 abundance file using the utility script humann2_renorm_table:

$ humann2_renorm_table --input demo_fastq/demo_genefamilies.tsv --output demo_fastq/demo_genefamilies-cpm.tsv --units cpm --update-snames

Inspect the output file demo_fastq/demo_genefamilies-cpm.tsv using the less command as we did above:

# Gene Family                                            demo_Abundance-CPM
UNMAPPED                                                 168420
UniRef90_unknown                                         72057.9
UniRef90_unknown|g__Bacteroides.s__Bacteroides_dorei     39458.5
UniRef90_unknown|g__Bacteroides.s__Bacteroides_vulgatus  32599.4
UniRef90_R6HHA8                                          14635
UniRef90_R6HHA8|g__Bacteroides.s__Bacteroides_dorei      14635
UniRef90_G1UMF5                                          2927.01
UniRef90_G1UMF5|g__Bacteroides.s__Bacteroides_vulgatus   2927.01
UniRef90_R7P3A5                                          2439.17

Note the difference in scale: units tend to be larger than in the original file as a result of being scaled to sum to 1 million. Also note that the sample header has been updated from demo_Abundance-RPK to demo_Abundance-CPM courtesy of the --update-snames flag. Let's verify that the unstratified abundance values sum up to 1 million:

$ grep -v "#" demo_fastq/demo_genefamilies-cpm.tsv | grep -v "|" | cut -f2 | python -c "import sys; print sum(float(l) for l in sys.stdin)"

The command above uses grep to exclude the header line (marked by #) and the stratified lines (containing |). The cut command then pulls out the second column (CPM values). The python -c command runs a one-line python program to sum up the CPM values. This returns 999999.98559 (i.e. ~1 million modulo a small amount of rounding error).

  • HUMAnN2 natively normalizes gene family abundance by gene length. Why is this step important? (In other words, why are RPK units more useful than raw read counts?)
  • Under what circumstances would it be useful to consider gene/pathway abundance before normalizing by sample read depth?

3.3 Regrouping genes to other functional categories

HUMAnN2's default "units" of microbial function are gene families and metabolic pathways. However, starting from gene family abundance, it is possible to reconstruct the abundance of other functional categories in a microbiome using the humann2_regroup_table script. Inspect the regrouping options using the humann2_regroup_table -h command.

Let's regroup our CPM-normalized gene family abundance values to enzyme commission (EC) categories:

$ humann2_regroup_table --input demo_fastq/demo_genefamilies-cpm.tsv --output demo_fastq/level4ec-cpm.tsv --groups uniref90_level4ec

We are given the output message:

Original Feature Count: 3745; Grouped 1+ times: 265 (%7.1); Grouped 2+ times: 3 (%0.1)

Indicating that the original file contained 3,745 features (UniRef90 gene families). Of these, 265 were annotated to a level-4 EC, and 3 were annotated to more than one level-4 EC. Inspect the output file demo_fastq/level4ec-cpm.tsv using the less command:

# Gene Family                                     demo_Abundance-RPKs
UNGROUPED                                         777314.062
UNGROUPED|g__Bacteroides.s__Bacteroides_dorei     444893.564
UNGROUPED|g__Bacteroides.s__Bacteroides_vulgatus  330270.225
UNGROUPED|unclassified                            2150.283
UNMAPPED                                          168420.0                                         54.004|g__Bacteroides.s__Bacteroides_dorei     54.004                                         127.538|g__Bacteroides.s__Bacteroides_vulgatus  127.538

We see that UniRef90 features have been regrouped as level-4 EC features, such as A new stratified feature also appears in the output: UNGROUPED. This feature reports the total abundance of features that were not otherwise included in a group (similar to the UNINTEGRATED feature of the pathways output). If each feature was regrouped into at most one new feature, the total community abundance of features in the regrouped file (including UNGROUPED and UNMAPPED) would still be 1 million. The actual sum (1,001,914.9) is slightly higher than 1 million due to the three UniRef90 features whose abundance was counted in two different level-4 EC features. Using the humann2_rename_table script, attach names to the unnamed level-4 EC features. 2-dehydropantoate 2-reductase                                      54.004 2-dehydropantoate 2-reductase|g__Bacteroides.s__Bacteroides_dorei  54.004 IMP dehydrogenase                                                  127.538 IMP dehydrogenase|g__Bacteroides.s__Bacteroides_vulgatus           127.53

  • From a statistical testing perspective, what is one major advantage of regrouping raw gene abundance to a smaller list of gene groups?
  • When regrouping genes to ECs (or KOs or MetaCyc reactions), we sum over the gene sequences belonging to a particular group. Under what circumstance would it make sense to average over the members of a group rather than sum? (Hint: the answer is related to the difference between the "is_a" and "part_of" relationships used in the Gene Ontology.)

4. Advanced HUMAnN2 topics

4.1 HUMAnN2 intermediate files

It is in some cases useful to understand the collection of intermediate files generated by the individual steps of a typical HUMAnN2 run:

$ ls -lh demo_fastq/demo_humann2_temp/

Which yields:

-rw-rw-r-- 1   11M Aug  2 11:22 demo_bowtie2_aligned.sam
-rw-rw-r-- 1  2.9M Aug  2 11:22 demo_bowtie2_aligned.tsv
-rw-rw-r-- 1   11M Aug  2 11:22 demo_bowtie2_index.1.bt2
-rw-rw-r-- 1  3.6M Aug  2 11:22 demo_bowtie2_index.2.bt2
-rw-rw-r-- 1  136K Aug  2 11:22 demo_bowtie2_index.3.bt2
-rw-rw-r-- 1  3.6M Aug  2 11:22 demo_bowtie2_index.4.bt2
-rw-rw-r-- 1   11M Aug  2 11:22 demo_bowtie2_index.rev.1.bt2
-rw-rw-r-- 1  3.6M Aug  2 11:22 demo_bowtie2_index.rev.2.bt2
-rw-rw-r-- 1  628K Aug  2 11:22 demo_bowtie2_unaligned.fa
-rw-rw-r-- 1   17M Aug  2 11:22 demo_custom_chocophlan_database.ffn
-rw-rw-r-- 1   86K Aug  2 11:22 demo_diamond_aligned.tsv
-rw-rw-r-- 1  521K Aug  2 11:22 demo_diamond_unaligned.fa
-rw-rw-r-- 1   18K Aug  2 11:23 demo.log
-rw-rw-r-- 1   55K Aug  2 11:21 demo_metaphlan_bowtie2.txt
-rw-rw-r-- 1   958 Aug  2 11:22 demo_metaphlan_bugs_list.tsv

These include SAM files containing the alignment of input reads against the default nucleotide database, which are first reduced by counting up and tabulating the hits to each annotated gene family. It can be useful to reprocess either the reads unaligned after nucleotide mapping (bowtie2_unaligned.fa) or, more likely, the reads still unaligned after amino acid search. Finally, the MetaPhlAn2 bowtie2.txt intermediate output is itself saved for e.g. PanPhlAn strain profiling. For more information, see the HUMAnN2 User Manual's output file section.

4.2 HUMAnN2 for multiple samples

It is typical to execute HUMAnN2 on multiple individual samples, then merge the resulting gene families or pathways into a single tab-delimited multi-sample table. When combining or comparing any HUMAnN2 output, it is critical to normalize the default RPK abundance units (which will be sensitive to the sequencing depth of each sample) into RPKMs (which will not).

We can demonstrate this by downloading a collection of six subsampled input files from the Human Microbiome Project. The original files, and many others, can be downloaded from the HMP DACC.

Click on each individual subsampled file to download. Open a terminal and change directories into the folder where you have downloaded the files.

You will see something like the following contents if you run $ ls -lh:

-rw-r--r-- 1  2.7M Jun  4 11:33 763577454-SRS014459-Stool.fasta
-rw-r--r-- 1  2.6M Jun  4 11:33 763577454-SRS014464-Anterior_nares.fasta
-rw-r--r-- 1  2.7M Jun  4 11:33 763577454-SRS014470-Tongue_dorsum.fasta
-rw-r--r-- 1  2.7M Jun  4 11:34 763577454-SRS014472-Buccal_mucosa.fasta
-rw-r--r-- 1  2.7M Jun  4 11:34 763577454-SRS014476-Supragingival_plaque.fasta
-rw-r--r-- 1  2.7M Jun  4 11:34 763577454-SRS014494-Posterior_fornix.fasta

Process each input file into a single output directory:

$ humann2 -i 763577454-SRS014459-Stool.fasta -o 763577454
$ humann2 -i 763577454-SRS014464-Anterior_nares.fasta -o 763577454
$ humann2 -i 763577454-SRS014470-Tongue_dorsum.fasta -o 763577454
$ humann2 -i 763577454-SRS014472-Buccal_mucosa.fasta -o 763577454
$ humann2 -i 763577454-SRS014476-Supragingival_plaque.fasta -o 763577454
$ humann2 -i 763577454-SRS014494-Posterior_fornix.fasta -o 763577454

Alternatively, if you're comfortable with shell scripting, you can use the following command to process all six at once:

$ for f in *.fasta; do humann2 -i $f -o 763577454; done

If you're feeling particularly impatient, add a & at the end (i.e. 763577454 &), but use that form with caution.

Regardless of how you run HUMAnN2, this will result in six output directories each containing the three main analysis types (gene families, pathway abundances, and pathway coverages) along with intermediate files. Let's merge the 6 per-sample gene family abundance outputs into a single table for this dataset using the script humann2_join_tables:

$ humann2_join_tables -i 763577454 -o 763577454_genefamilies.tsv --file_name genefamilies

Inspect this file using the less command:

# Gene Family                                                         763577454-SRS014459-Stool_Abundance-RPKs  763577454-SRS014464-Anterior_nares_Abundance-RPKs  763577454-SRS014470-Tongue_dorsum_Abundance-RPKs  763577454-SRS014472-Buccal_mucosa_Abundance-RPKs  763577454-SRS014476-Supragingival_plaque_Abundance-RPKs  763577454-SRS014494-Posterior_fornix_Abundance-RPKs
UNMAPPED                                                              15984.0000000000                          19128.0000000000                                   19982.0000000000                                  19994.0000000000                                  19985.0000000000                                         19999.0000000000
UniRef90_A0A015N407                                                   2.3310023310                              0                                                  0                                                 0                                                 0                                                        0
UniRef90_A0A015N407|g__Bacteroides.s__Bacteroides_cellulosilyticus    2.3310023310                              0                                                  0                                                 0                                                 0                                                        0
UniRef90_A0A015N416                                                   5.2910052910                              0                                                  0                                                 0                                                 0                                                        0
UniRef90_A0A015N416|g__Bacteroides.s__Bacteroides_cellulosilyticus    5.2910052910                              0                                                  0                                                 0                                                 0                                                        0
UniRef90_A0A015P3I5                                                   3.7453183521                              0                                                  0                                                 0                                                 0                                                        0
UniRef90_A0A015P3I5|g__Parabacteroides.s__Parabacteroides_distasonis  3.7453183521                              0                                                  0                                                 0                                                 0                                                        0
UniRef90_A0A015VBF4                                                   0.2816901408                              0                                                  0                                                 0                                                 0                                                        0
UniRef90_A0A015VBF4|g__Bacteroides.s__Bacteroides_stercoris           0.2816901408                              0                                                  0                                                 0                                                 0                                                        0

It is especially important to renormalize this merged file as we're now comparing samples with different sequencing depths:

$ humann2_renorm_table -i 763577454_genefamilies.tsv -o 763577454_genefamilies_cpm.tsv --units cpm

For more information, see the HUMAnN2 User Manual's description of a standard analysis workflow.

5. Downstream analyses

5.1 Associating HUMAnN2 functions with metadata

There are many approaches for associating microbial measurements with sample metadata, including several options in the bioBakery:

When associating metadata with HUMAnN2 features, it is often beneficial to associate with community totals and avoid testing each individual feature stratification (to improve statistical power). This is the approach used by the HUMAnN2 utility script humann2_associate, which can compare feature totals across samples with 1) a single continuous metadatum (via the Spearman Correlation) or 2) a single categorical metadatum (via the Kruskal-Wallis H-test). Notably, this is a naive approach to association, but it is useful for tutorial purposes.

For this and the next section, we'll operate on a table of pathway abundances from ~400 samples from the Human Microbiome Project:

Download this file to your working directory and examine it using less:

$ less -S hmp_pathabund.pcl

Which yields:

FEATURE \ SAMPLE                                                                                      SRS011084    SRS011086      SRS011090
STSite                                                                                                Stool        Tongue_dorsum  Buccal_mucosa
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis                                                  0.000498359  0.000628096    0.000304951
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Acidovorax.s__Acidovorax_ebreus               0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Bacteroides.s__Bacteroides_finegoldii         0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Haemophilus.s__Haemophilus_influenzae         0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Klebsiella.s__Klebsiella_pneumoniae           0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Staphylococcus.s__Staphylococcus_epidermidis  0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Veillonella.s__Veillonella_sp_oral_taxon_158  0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|unclassified                                     7.89502e-06  1.60101e-05    0

We can see three samples from three different body sites (via the STSite metadatum). A single pathway, 1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis, and its stratifications are visible. In these three samples, the pathway can only be assigned to the unclassified stratum, and with very low abundance.

Let's associate pathway totals in this file with sample body site:

$ humann2_associate --input hmp_pathabund.pcl --last-metadatum STSite --focal-metadatum STSite --focal-type categorical --output stats.txt

This call will perform a Kruskal-Wallis H-test between each feature's community total and sample body site, summarizing by per-site means and ranking by statistical significance. Let's examine the first few lines of the file:

$ less -S stats.txt

Which yields:

# kruskal-wallis analysis of metadatum: STSite
# feature                                                                        level means                                                                                                                                         p-value    q-value
METSYN-PWY: L-homoserine and L-methionine biosynthesis                           Supragingival_plaque:0.0004077|Stool:9.059e-05|Posterior_fornix:1.275e-05|Tongue_dorsum:0.0005423|Anterior_nares:2.585e-05|Buccal_mucosa:0.00112    6.477e-70  1.224e-67
PWY-5347: superpathway of L-methionine biosynthesis (transsulfuration)           Supragingival_plaque:0.0003854|Stool:9.416e-05|Posterior_fornix:9.412e-06|Tongue_dorsum:0.0005187|Anterior_nares:1.71e-05|Buccal_mucosa:0.0007963   7.074e-70  1.224e-67
PWY-6386: UDP-N-acetylmuramoyl-pentapeptide biosynthesis II (lysine-containing)  Supragingival_plaque:0.0007095|Stool:5.553e-06|Posterior_fornix:1.768e-06|Tongue_dorsum:0.0001989|Anterior_nares:0.0002186|Buccal_mucosa:0.001337   1.115e-69  1.286e-67
HOMOSER-METSYN-PWY: L-methionine biosynthesis I                                  Supragingival_plaque:0.0002616|Stool:5.078e-05|Posterior_fornix:7.601e-06|Tongue_dorsum:0.0003655|Anterior_nares:1.856e-05|Buccal_mucosa:0.0009729  3.124e-69  2.702e-67
NAD-BIOSYNTHESIS-II: NAD salvage pathway II                                      Supragingival_plaque:0.0001445|Stool:3.478e-06|Posterior_fornix:2.321e-07|Tongue_dorsum:0.0001515|Anterior_nares:6.176e-06|Buccal_mucosa:0.0005249  7.574e-68  5.241e-66
MET-SAM-PWY: superpathway of S-adenosyl-L-methionine biosynthesis                Supragingival_plaque:0.000311|Stool:7.31e-05|Posterior_fornix:8.425e-06|Tongue_dorsum:0.0004838|Anterior_nares:5.945e-06|Buccal_mucosa:0.0005264    6.56e-67   3.783e-65
PWY-5188: tetrapyrrole biosynthesis I (from glutamate)                           Supragingival_plaque:0.0007416|Stool:1.759e-05|Posterior_fornix:3.483e-06|Tongue_dorsum:0.0006484|Anterior_nares:0.0002229|Buccal_mucosa:0.0003094  1.098e-66  5.426e-65
PWY-3001: superpathway of L-isoleucine biosynthesis I                            Supragingival_plaque:0.0007748|Stool:0.0002933|Posterior_fornix:1.806e-05|Tongue_dorsum:0.000893|Anterior_nares:5.512e-05|Buccal_mucosa:0.001016    2.768e-66  1.197e-64

We observe, for example, that the first pathway in the table (METSYN-PWY: L-homoserine and L-methionine biosynthesis) tends to be more abundant on average at the three oral sites (buccal mucosa, tongue, and plaque) than at the three non-oral sites (stool, posterior fornix, and nares).

Now that we have identified this enrichment, use the grep command to explore the stratifications of this pathway from the hmp_pathabund.pcl file.

$ grep 'METSYN-PWY' hmp_pathabund.pcl | less -S

  • Do the major contributing species seem reasonable for an orally-enriched pathway?

5.2 Visualizing stratified HUMAnN2 output

HUMAnN2 includes a utility humann2_barplot to assist with visualizing stratified outputs. Let's use this tool to make a plot of the significantly differentially abundant pathway we identified above. (Use see plot1.png to open the resulting figure if working in a graphical environment.)

$ humann2_barplot --input hmp_pathabund.pcl --focal-feature METSYN-PWY --focal-metadatum STSite --last-metadatum STSite --output plot1.png


Not terribly informative... Let's try adding a "sort" on the summed stratified abundance:

$ humann2_barplot --sort sum --input hmp_pathabund.pcl --focal-feature METSYN-PWY --focal-metadatum STSite --last-metadatum STSite --output plot2.png


A pattern has started to emerge: we can clearly see that the oral body sites are enriched on the left (high) end of the plot, consistent with the summary statistics computed above. We can continue this line of analysis with an additional grouping by body site:

$ humann2_barplot --sort sum metadata --input hmp_pathabund.pcl --focal-feature METSYN-PWY --focal-metadatum STSite --last-metadatum STSite --output plot3.png


We see now that the pathway is most strongly enriched in the buccal mucosa, where it is contributed almost exclusively by Streptococcus species. The pathway's relative abundance in the other two body sites is generally lower, and is more often "unclassified" at the tongue body site.

  • How might you find other pathways with a similar abundance distribution in this functional profile?
  • What might be causing the "unclassified" contributors of homoserine/methionine synthesis in tongue dorsum communities?
  • Is it surprising that three apparently closely related communities (tongue, cheek, and tooth surface) differ so strikingly in metagenomic functional potential?

Let's examine one additional pathway, COA-PWY: coenzyme A biosynthesis, which is more broadly conserved across body sites:

$ humann2_barplot --sort sum --input hmp_pathabund.pcl --focal-feature COA-PWY --focal-metadatum STSite --last-metadatum STSite --output plot4.png


Now sorting by total abundance is less informative. Let's try some additional options: sorting by ecological similarity, normalizing pathway contributions within-sample, and expanding the list of species highlighted:

$ humann2_barplot --sort similarity --top-strata 12 --scaling normalize --input hmp_pathabund.pcl --focal-feature COA-PWY --focal-metadatum STSite --last-metadatum STSite --output plot5.png


Notably, sorting by similarity of pathway contributions has tended to group samples from the same body site even without the explicit metadata sort. This is because, for broadly distributed pathways, organismal contributions tend to follow overall sample ecology (which tends to be quite different from one body site to the next).

  • Is it surprising to see coenzyme A biosynthesis present in diverse human body site microbiomes?