Clone wiki

biobakery / humann2

HUMAnN2 Tutorial

Legacy HUMAnN1 tutorial

HUMAnN2 (the HMP Unified Metabolic Analysis Network) is a method for efficiently and accurately determining the presence, absence, and abundance of metabolic pathways in a microbial community from metagenomic or metatranscriptomic sequencing data. It is appropriate for any type of microbial community shotgun sequence profiling (i.e. not just the human microbiome; the name is a historical product of its origin in a Human Microbiome Project working group).

HUMAnN2 can be installed from PyPI, as source code from the HUMAnN2 Bitbucket repository, or using our quickstart documentation.

Support is available from the HUMAnN2 Google Group. Feel free to join the group or to post any questions directly by emailing

If you have already installed HUMAnN2 and would like to upgrade your installed version to the latest version, please see the installation update section of the HUMAnN2 User Manual.

1. Setup

1.1 Requirements

NOTE: Bowtie2, Diamond, and MinPath are automatically installed when installing HUMAnN2. If these dependencies do not appear to be installed after installing HUMAnN2 with pip, it might be that your environment is setup to use wheels instead of installing from source. HUMAnN2 must be installed from source for it to also be able to install dependencies. To force pip to install HUMAnN2 from source add one of the following options to your install command, "--no-use-wheel" or "--no-binary :all:".

1.2 Installation

Please note, if you are using bioBakery (Vagrant VM or Google Cloud) you do not need to install HUMAnN2 because the tool and its dependencies are already installed.

1.2.1 Install HUMAnN2

The easiest way to install HUMAnN2 is with pip, Conda, or Docker. For other options, please refer to the HUMAnN2 User Manual.

To install with pip:

$ pip install humann2

To install with conda (which will also install MetaPhlAn2 and utility dependencies):

$ conda install -c biobakery humann2
* Before installing with conda, make sure to configure your channels so biobakery is at the top of the list.

Note that by default, a pip or other HUMAnN2 install will NOT include the full nucleotide or amino acid search databases needed for raw metagenome or metatranscriptome profiling, as these are quite large and should be installed based on project needs. A typical environment, however, will require one of each, and we default to ChocoPhlAn for the former and UniRef90 for the latter (note that for the purposes of this tutorial alone, you will download demo databases to reduce run time):

$ humann2_databases --download chocophlan DEMO humann2_database_downloads
$ humann2_databases --download uniref DEMO_diamond humann2_database_downloads

The commands above download the databases to a folder named humann2_database_downloads. Please note you can download these databases to any location on your machine. HUMAnN2 will track where you have placed the databases and update its configuration file.

After installation from pip or another source, you may optionally test your local HUMAnN2 environment:

$ humann2_test

Which yields (abbreviated):

test_add_length_annotation (basic_tests_utilities.TestHumann2UtilitiesFunctions) ... ok
test_add_length_annotation_spaces (basic_tests_utilities.TestHumann2UtilitiesFunctions) ... ok
test_break_up_fasta_file (basic_tests_utilities.TestHumann2UtilitiesFunctions) ... ok
test_count_reads_fasta (basic_tests_utilities.TestHumann2UtilitiesFunctions) ... ok

1.2.2 Install MetaPhlAn2

HUMAnN2 provides organism-specific community functional profiles, and to do so it first detects the organisms present in a community using MetaPhlAn2. As MetaPhlAn2 is thus a prerequisite for running the HUMAnN2 software, please ensure that you've installed MetaPhlAn2 before running HUMAnN2.

2. Metagenome functional profiling

A note on viewing files from the command line: During this tutorial you'll inspect many tabular output files. For simplicity the tutorial uses less -S file.tsv for this task. To match the more Excel-like columns seen in the tutorial document, you would first use the column command, as in:

column -t -s $'\t' file.tsv | less -S

2.1 HUMAnN2 input formats

Each HUMAnN2 run can process either a single community sample's raw reads, or (in more advanced usage) pre-computed gene data from multiple samples. Input files can thus be of the following types:

  • Quality controlled shotgun metagenomes or metatranscriptomes (fastq, fastq.gz, fasta, or fasta.gz)
  • Mapping results (sam, bam or blastm8)
  • Gene abundance tables (tsv or biom)

This tutorial will use demo inputs distributed with the HUMAnN2 installation under the examples directory. If you installed from pip or Conda and don't have this handy, you can obtain a copy by right-clicking these links and selecting "save link as":

2.2 Running HUMAnN2: the basics

The most basic HUMAnN2 execution uses nucleotide mapping and translated search to provide organism-specific gene and pathway abundance profiles from a single metagenome. By default, gene families are annotated using UniRef90 identifiers and pathways using MetaCyc IDs. From the examples folder (or the location where you saved demo.fastq above) execute the following:

$ humann2 --input demo.fastq --output demo_fastq

Which yields (blank lines removed):

Creating output directory: demo_fastq
Output files will be written to: demo_fastq
Running ........
Found g__Bacteroides.s__Bacteroides_dorei : 59.18% of mapped reads
Found g__Bacteroides.s__Bacteroides_vulgatus : 40.82% of mapped reads
Total species selected from prescreen: 2
Selected species explain 100.00% of predicted community composition
Creating custom ChocoPhlAn database ........
Running bowtie2-build ........
Running bowtie2 ........
Total bugs from nucleotide alignment: 2
g__Bacteroides.s__Bacteroides_vulgatus: 7336 hits
g__Bacteroides.s__Bacteroides_dorei: 8820 hits
Total gene families from nucleotide alignment: 3685
Unaligned reads after nucleotide alignment: 23.0666666667 %
Running diamond ........
Aligning to reference database: uniref90_demo_prots.dmnd
Total bugs after translated alignment: 3
g__Bacteroides.s__Bacteroides_vulgatus: 7336 hits
unclassified: 974 hits
g__Bacteroides.s__Bacteroides_dorei: 8820 hits
Total gene families after translated alignment: 3728
Unaligned reads after translated alignment: 18.5571428571 %
Computing gene families ...
Computing pathways abundance and coverage ...
Output files created:

In newer versions of HUMAnN2 the flag --gap-fill is set to on by default. If you are using a version prior to v0.9.7, you will want to add --gap-fill on to the commands. Gap-filling provides robustness to missing (undersampled) reactions when quantifying metabolic pathways, which occurs frequently in the demo datasets due to their low sequencing depth.

Please note, the demo outputs in this tutorial were generated using HUMAnN2 v0.11.1 with the demo databases. If you are running a different version of HUMAnN2 or are using a different set of databases, the outputs might differ.

  • Gap filling allows us to quantify a pathway even when a small number of its reactions are conspicuously absent. What might be some pros/cons of this approach in real data?

2.3 HUMAnN2 default output

When HUMAnN2 is run from any input type, three main output files will be created:


For the demo.fastq file analyzed above, these look like:


Let's dig into the contents of the individual files:


This file lists the abundances of each gene family in the community in RPK (Reads Per Kilobase) units. Each gene family's total abundance is broken down into the contributions from individual organisms when possible (delineated by the pipe character, e.g. UniRef90_R6HHA8|g__Bacteroides.s__Bacteroides_dorei), or unclassified when mappings are only found using translated search (which is not organism-specific). We refer to this format as "stratified output" and the individual per-organism and unclassified contributions are referred to as "strata" (singular "stratum"). The RPKs of reads that map to gene families that lack UniRef90 IDs are captured as UniRef90_unknown, and unmapped read abundances as UNMAPPED. The total RPK of each community is thus captured by summing either 1) UNMAPPED, _unknown, and all UniRef IDs prior to organism-specific decomposition, or 2) UNMAPPED, _unknown|<organism>, and all UniRef|<organism> IDs. The sum of either set of values can then be normalized to CPMs (Copies Per Million) or relative abundances. To inspect the demo_genefamilies.tsv file, use:

$ less -S demo_fastq/demo_genefamilies.tsv

Which yields:

# Gene Family       demo_Abundance-RPKs
UNMAPPED    3897.0000000000
UniRef90_unknown    1633.3754216411
UniRef90_unknown|g__Bacteroides.s__Bacteroides_dorei        894.7763053061
UniRef90_unknown|g__Bacteroides.s__Bacteroides_vulgatus     738.5991163350
UniRef90_R6HHA8     333.3333333333
UniRef90_R6HHA8|g__Bacteroides.s__Bacteroides_dorei 333.3333333333
UniRef90_R7NYS9     166.6666666667
UniRef90_R7NYS9|g__Bacteroides.s__Bacteroides_vulgatus      166.6666666667
UniRef90_D1K9F5     66.6666666667

Let's find some of the "unclassified" stratifications in this file using the grep command:

$ grep 'unclassified' demo_fastq/demo_genefamilies.tsv | less -S

Which yields:

UniRef90_X6L320|unclassified        28.0905985144
UniRef90_U5FT06|unclassified        25.9926484436
UniRef90_W8YTG4|unclassified        25.2752668904
UniRef90_Q9ZUH4|unclassified        23.5421011503
UniRef90_Z5XVM9|unclassified        23.4598539432
UniRef90_Z9JRB3|unclassified        22.3949839456
UniRef90_Q8GT21|unclassified        21.6993097732
UniRef90_W7DVV5|unclassified        21.2467233744
UniRef90_W7T1N9|unclassified        20.6229542533
UniRef90_W1IYL7|unclassified        20.2552497738

  • Out of the box, HUMAnN2 can quantify protein families at 90 or 50% identity (via UniRef90 and UniRef50, respectively). What are the advantages of quantifying UniRef90 gene families? UniRef50 families?


This file lists the abundances of each pathway in the community, also in RPK units as described for gene families. For details on the calculation of pathway abundances from constituent gene family abundances, see the HUMAnN2 User Manual. Like gene families, pathways are broken down into per-organism contributions when possible (again delineated by the pipe character), with the total read abundance consisting of pathways assigned to organisms, UNMAPPED reads, and UNINTEGRATED reads that are attributable to specific gene families (typically also within specific organisms), but for gene families that are not annotated to any pathway in the corresponding database (e.g. MetaCyc). As with gene families, RPKs can be normalized to copies per million or relative abundances by summing either 1) UNMAPPED reads, UNINTEGRATED organism-specific features, and organism-specific (i.e. pipe-containing) pathways, _or_ 2) UNMAPPED reads, total UNINTEGRATED features, and pathways that are not broken down per organism. To inspect the demo_pathabundance.tsv file, use:

$ less -S demo_fastq/demo_pathabundance.tsv

Which yields:

# Pathway   demo_Abundance
UNMAPPED    1225.9730368639
UNINTEGRATED        5908.3755167593
UNINTEGRATED|g__Bacteroides.s__Bacteroides_dorei    3123.1902040899
UNINTEGRATED|g__Bacteroides.s__Bacteroides_vulgatus 2753.6043067990
UNINTEGRATED|unclassified   72.8101423152
PWY-6305: putrescine biosynthesis IV        30.3913024756
PWY-6305: putrescine biosynthesis IV|unclassified   30.3913024756
PWY-4203: volatile benzenoid biosynthesis I (ester formation)       22.5319052245
PWY-4203: volatile benzenoid biosynthesis I (ester formation)|unclassified  22.5319052245
PWY490-3: nitrate reduction VI (assimilatory)       21.3761301200
PWY490-3: nitrate reduction VI (assimilatory)|unclassified  21.3761301200
PWY-1269: CMP-3-deoxy-D-manno-octulosonate biosynthesis I   19.6850995143
PWY-1269: CMP-3-deoxy-D-manno-octulosonate biosynthesis I|unclassified      18.6035804538
PWY-5173: superpathway of acetyl-CoA biosynthesis   17.0384635071
PWY-5173: superpathway of acetyl-CoA biosynthesis|unclassified      15.1040744758
PWY-3781: aerobic respiration I (cytochrome c)      15.9398653899
PWY-3781: aerobic respiration I (cytochrome c)|unclassified 15.9398653899
PWY-7279: aerobic respiration II (cytochrome c) (yeast)     15.9398653899
PWY-7279: aerobic respiration II (cytochrome c) (yeast)|unclassified        15.9398653899

  • Unlike gene family abundance, stratified pathway abundance does not necessarily sum to yield the community total pathway abundance. Why not? (Hint: HUMAnN2 estimates the number of *complete* pathways in the community or a given stratification.)


This file lists the coverage of each pathway in the community, that is, the probability that it is provided as a unit to the metagenome or metatranscriptome either 1) within one or more particular organisms (again pipe-delimited) or 2) across one or more organisms in aggregate. For details on the calculation of pathway coverage from constituent gene family abundances, see the HUMAnN2 User Manual. Pathways' coverages should be considered probabilistic presence/absence calls for each pathway and NOT summed, and only cataloged pathways are individually analyzed this way. To inspect the demo_pathcoverage.tsv file, use:

$ less -S demo_fastq/demo_pathcoverage.tsv

Which yields:

# Pathway   demo_Coverage
UNMAPPED    1.0000000000
UNINTEGRATED        1.0000000000
UNINTEGRATED|g__Bacteroides.s__Bacteroides_dorei    1.0000000000
UNINTEGRATED|g__Bacteroides.s__Bacteroides_vulgatus 1.0000000000
UNINTEGRATED|unclassified   1.0000000000
PWY-6305: putrescine biosynthesis IV        0.9999818077
PWY-6305: putrescine biosynthesis IV|unclassified   0.9401999667
PWY-4203: volatile benzenoid biosynthesis I (ester formation)       0.9992342795
PWY-4203: volatile benzenoid biosynthesis I (ester formation)|unclassified  0.6687522557
PWY490-3: nitrate reduction VI (assimilatory)       0.9990091413
PWY490-3: nitrate reduction VI (assimilatory)|unclassified  0.6313428688
PWY-1269: CMP-3-deoxy-D-manno-octulosonate biosynthesis I   0.9971350238
PWY-1269: CMP-3-deoxy-D-manno-octulosonate biosynthesis I|unclassified      0.4525574805
PWY-5173: superpathway of acetyl-CoA biosynthesis   0.9953786390
PWY-5173: superpathway of acetyl-CoA biosynthesis|unclassified      0.2231965843
PWY-3781: aerobic respiration I (cytochrome c)      0.9883574322
PWY-3781: aerobic respiration I (cytochrome c)|unclassified 0.2386440470
PWY-7279: aerobic respiration II (cytochrome c) (yeast)     0.9883574322
PWY-7279: aerobic respiration II (cytochrome c) (yeast)|unclassified        0.2386440470

2.4 Running HUMAnN2: pre-mapped SAM/BAM input

The raw results of mapping metagenome/metatranscriptome reads against a gene family nucleotide database can be saved as a SAM/BAM file and used to run HUMAnN2 after the mapping (e.g. bowtie2) step. The only requirement is that gene family names provided in such a file be annotated using either 1) identifiers compatible with HUMAnN2's default mapping to UniRef, or 2) UniRef identifiers directly. Unmapped reads in such an input file are searched against a non-organism-specific protein database as are unmapped reads in a "full" HUMAnN2 run.

$ less -S demo.sam

which yields:

s__Bacteroides_dorei_read000001  16  gi|423242213|ref|NZ_JH724160.1|:185833-186759|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_R7NX76|UniRef50_R5GDB8|927     423   11  100M  *  0  0  TTTCGTATTAGTGGAGTATGGCATTGAAAGTACGGATGATGACACGTTGCGCAGGATAAACCGGGGACATACTTTTGCCGTTTCTGCAGAGGCTGTTCGG  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-12  XS:i:-30  XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2              MD:Z:3T38A57       YT:Z:UU
s__Bacteroides_dorei_read000002  0   gi|423245752|ref|NZ_JH724135.1|:381614-382081|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_A6L5K0|UniRef50_A6L5K0|468     88    25  100M  *  0  0  TTTACTGTAGTCGCCCTATTGGGATTACAGGATTCAGACTCTAACATAACCATCGGAAACAACTTCGAGAGCAAAAAGCTAGGTAAAAAAGGAATTTTCA  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-6   XS:i:-42  XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1              MD:Z:96A3          YT:Z:UU
s__Bacteroides_dorei_read000003  16  gi|423238261|ref|NZ_JH724153.1|:187959-189296|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_A6L5D4|UniRef50_C6X591|1338    1159  0   100M  *  0  0  AATACACGACAGGAAGATTATTCCGGCAAGAATTTTTCGGAGTTTGAACTGAGCGCTATGGAAAAACGTCATATTGCTAAAGTTCTGCAACATACTAAAG  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-24  XS:i:-36  XN:i:0  XM:i:4  XO:i:0  XG:i:0  NM:i:4              MD:Z:5C28A16G0C47  YT:Z:UU
s__Bacteroides_dorei_read000004  16  gi|294777006|ref|NZ_ADKO01000074.1|:c27580-26633|821|g__Bacteroides.s__Bacteroides_vulgatus|UniRef90_A6L7A5|UniRef50_R7NQK6|948  802   1   100M  *  0  0  CAACACTCACGGGAAATCCCCTTCAATGAATTATGGGACAGAATGCTAAGAATCAGAAAAAACAATTATATACATCTTCCACTTTACCTAAAAATAAAGT  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-12  XS:i:-12  XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2              MD:Z:85C0T13       YT:Z:UU
s__Bacteroides_dorei_read000006  16  gi|423231083|ref|NZ_JH724113.1|:c774054-771490|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_unknown|UniRef50_D5EYI2|2565  184   7   100M  *  0  0  AATCCCGTTGCTTTATTGGAACGTTTGAGCTATGAAAAGCAGGAGGCATTAACCCAGGATAAGGTTATTGTAAAGCGCCTGAACGATGTATATGCTAAAT  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-18  XS:i:-30  XN:i:0  XM:i:3  XO:i:0  XG:i:0  NM:i:3              MD:Z:40T24G12A21   YT:Z:UU
s__Bacteroides_dorei_read000007  16  gi|423231083|ref|NZ_JH724113.1|:c74511-72820|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_R7P3C8|UniRef50_X5DCD9|1692     629   1   100M  *  0  0  TGGCAGGAGAAGCCTATCTTCGGATGGGTATGCGTGATGCTTCCTAATTCAAGAAGGCGGAAGATGCGGTGACGCCTATTATACCAGGTGGCAAATATGA  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-18  XS:i:-18  XN:i:0  XM:i:3  XO:i:0  XG:i:0  NM:i:3              MD:Z:46T36A7T8     YT:Z:UU
s__Bacteroides_dorei_read000009  0   gi|423227774|ref|NZ_JH724110.1|:782275-785145|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_unknown|UniRef50_R7DSH7|2871   804   2   100M  *  0  0  AGGTACAGCCGGCTTTGCCTCCACCGAGAAAAAACCAGAAAAACCCTTTGCACAATGGATTCCCCGTATTCCCGAAACAGGAAAATATGCCGTCGATGTG  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-24  XS:i:-30  XN:i:0  XM:i:4  XO:i:0  XG:i:0  NM:i:4              MD:Z:10C25G6G50T5  YT:Z:UU
s__Bacteroides_dorei_read000013  16  gi|423227774|ref|NZ_JH724110.1|:c157707-156019|357276|g__Bacteroides.s__Bacteroides_dorei|UniRef90_R6AN66|UniRef50_P9WQK2|1689   1394  12  100M  *  0  0  TGGCACTGAAGGAAGAAGGCAACGTGCTGCTGCTGGACGAGCCTACCAATGATATTGACGTGAACTCGCTGCGCGCGCTGGAGGAAGGTCTGGACGATTT  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  AS:i:-6   XS:i:-18  XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1              MD:Z:65A34         YT:Z:UU

Here, we can also demonstrate speeding up a HUMAnN2 run by taking advantage of multiple cores using the --threads argument:

$ humann2 --input demo.sam --output demo_sam --threads 4

which yields (blank lines removed):

Creating output directory: demo_sam
Output files will be written to: demo_sam
Process the sam mapping results ...
Computing gene families ...
Computing pathways abundance and coverage ...
Output files created:

Note the creation of the three expected output files.

2.5 Running HUMAnN2: pre-searched protein BLASTX M8 input

HUMAnN2 typically searches metagenome/metatranscriptome reads against two databases: first a nucleotide database using rapid mapping, then a broader protein database using more computationally expensive translated search. One can execute HUMAnN2 using either search alone, however; using nucleotides only is more computationally efficient but less sensitive, and using translated search alone can emulate HUMAnN1 output or be carried out even for very novel or unusual microbial communities (i.e. those with few reference genomes available). If BLASTX tab-delimited M8 format search results have already been generated for HUMAnN2-compatible protein IDs (UniRef90 by default), they can be run by providing the appropriate input format. Let's examine such a file:

$ less -S demo.m8

Which yields:

s__Bacteroides_vulgatus_read000635|100  UniRef90_X8JRH1|978   57.9  19  8   0  95   39  189  207  2.0e-01  21.6
unclassified_read000001|100             UniRef90_W8YTG4|1614  90.6  32  3   0  98   3   382  413  9.0e-15  65.9
unclassified_read000002|100             UniRef90_U5FT06|1224  90.9  33  3   0  100  2   127  159  2.1e-14  64.7
unclassified_read000003|100             UniRef90_V5V2L3|2169  90.6  32  3   0  3    98  388  419  4.5e-14  63.5
unclassified_read000004|100             UniRef90_U5FT06|1224  90.6  32  3   0  99   4   148  179  4.6e-14  63.5
unclassified_read000005|100             UniRef90_X8FHU5|1080  90.6  32  3   0  98   3   240  271  3.1e-15  67.4
unclassified_read000006|100             UniRef90_Z9JXD8|1428  90.6  32  3   0  98   3   22   53   2.6e-14  64.3
unclassified_read000007|100             UniRef90_Q6XMI3|1137  90.6  32  3   0  3    98  155  186  2.0e-14  64.7
unclassified_read000007|100             UniRef90_Q9SPV4|1077  65.6  32  11  0  3    98  129  160  1.5e-09  48.5
unclassified_read000008|100             UniRef90_Z9JUT2|1308  90.6  32  3   0  98   3   222  253  2.4e-15  67.8

To run HUMAnN2 using this file as input, use:

$ humann2 --input demo.m8 --output demo_m8

This again produces the three expected outputs (blank lines removed):

Creating output directory: /demo_m8
Output files will be written to: /demo_m8
Process the blastm8 mapping results ...
Computing gene families ...
Computing pathways abundance and coverage ...
Output files created:

2.6 Running HUMAnN2: nucleotide search alone

Conversely, to run HUMAnN2 without any translated protein search (i.e. using only rapid nucleotide mapping alone), one can provide the --bypass-translated-search flag. This applies to FASTA, FASTQ, SAM/BAM, or other raw nucleotide inputs:

$ humann2 --input demo.fastq --output demo_fastq_noaa --threads 4 --bypass-translated-search

Note the lack of DIAMOND execution in this example as compared to the FASTQ analysis above:

Creating output directory: demo_fastq_noaa
Output files will be written to: demo_fastq_noaa
Running ........
Found g__Bacteroides.s__Bacteroides_dorei : 59.18% of mapped reads
Found g__Bacteroides.s__Bacteroides_vulgatus : 40.82% of mapped reads
Total species selected from prescreen: 2
Selected species explain 100.00% of predicted community composition
Creating custom ChocoPhlAn database ........
Running bowtie2-build ........
Running bowtie2 ........
Total bugs from nucleotide alignment: 2
g__Bacteroides.s__Bacteroides_vulgatus: 7436 hits
g__Bacteroides.s__Bacteroides_dorei: 8820 hits
Total gene families from nucleotide alignment: 3685
Unaligned reads after nucleotide alignment: 23.0666666667 %
Bypass translated search
Computing gene families ...
Computing pathways abundance and coverage ...
Output files created:

It is informative to compare the results of this demo file with and without translated search; even in a small, subsampled input, including amino acid search is more computationally costly but recovers additional gene families. To count unique gene families in the original file, execute:

$ cut -f1 demo_fastq/demo_genefamilies.tsv | tail -n +3 | grep -v "|" | sort -u | wc -l

(cut -f1 extracts the first column of the file, the gene headers; tail -n +3 removes the first two lines of the file, the header and UNMAPPED; grep -v "|" removes the stratified headers; sort -u isolates the unique headers; and wc -l counts those headers.) The resulting answering is 3730 (unique gene families). The same command executed on the file demo_fastq_noaa/demo_genefamilies.tsv yields 3687 unique gene families. We infer that 3728 - 3685 = 43 additional gene families were uncovered using translated search.

In general, any HUMAnN2 intermediate file can be saved and used to resume processing from an intermediate point in the analysis pipeline, and several steps including taxonomic pre-screening, nucleotide search, or amino acid search can be independently skipped. For more information, see the relevant sections of the HUMAnN2 User Manual.

  • Why is nucleotide-level search generally faster than translated search? What makes nucleotide-level search extra fast in HUMAnN2?
  • Under what circumstances would skipping translated search be reasonable? Under what circumstances would skipping translated search be a terrible idea?

3. Manipulating HUMAnN2 output tables

3.1 Attaching names to features

To reduce file size in HUMAnN2 tables containing large numbers of features, feature names are not included by default. To attach feature names to a table containing only feature identifiers, use the humann2_rename_table script:

$ humann2_rename_table --input demo_fastq/demo_genefamilies.tsv --output demo_fastq/demo_genefamilies-names.tsv --names uniref90

Features without names will be named "NO_NAME". Inspect the named features in the output file demo_fastq/demo_genefamilies-names.tsv using less:

# Gene Family                                                                                                   demo_Abundance-RPKs
UniRef90_C3R7K2: Polysaccharide deacetylase                                                                     45.7650273224
UniRef90_C3R7K2: Polysaccharide deacetylase|g__Bacteroides.s__Bacteroides_dorei                                 45.7650273224
UniRef90_A6KZR7: Fumarate reductase iron-sulfur cluster protein subunit                                         16.7427701674
UniRef90_A6KZR7: Fumarate reductase iron-sulfur cluster protein subunit|g__Bacteroides.s__Bacteroides_dorei     9.1324200913
UniRef90_A6KZR7: Fumarate reductase iron-sulfur cluster protein subunit|g__Bacteroides.s__Bacteroides_vulgatus  7.6103500761
UniRef90_A6L5B2: Ferredoxin domain containing protein                                                           16.2037037037
UniRef90_A6L5B2: Ferredoxin domain containing protein|g__Bacteroides.s__Bacteroides_vulgatus                    9.2592592593
UniRef90_A6L5B2: Ferredoxin domain containing protein|g__Bacteroides.s__Bacteroides_dorei                       6.9444444444
UniRef90_R6HM78: Hydrogenase maturation GTPase HydF                                                             14.6520146520

3.2 Normalizing RPKs to relative abundance

To facilitate comparisons between samples with different sequencing depths, it is beneficial to normalize gene family and pathway abundance RPK values to compositional units (copies per million or relative abundance). Pathway coverage does not need to be normalized in this manner. We can normalize a HUMAnN2 abundance file using the utility script humann2_renorm_table:

$ humann2_renorm_table --input demo_fastq/demo_genefamilies.tsv --output demo_fastq/demo_genefamilies-cpm.tsv --units cpm --update-snames

Inspect the output file demo_fastq/demo_genefamilies-cpm.tsv using the less command as we did above:

# Gene Family       demo_Abundance-CPM
UNMAPPED    167194
UniRef90_unknown    70077
UniRef90_unknown|g__Bacteroides.s__Bacteroides_dorei        38388.7
UniRef90_unknown|g__Bacteroides.s__Bacteroides_vulgatus     31688.2
UniRef90_R6HHA8     14301.1
UniRef90_R6HHA8|g__Bacteroides.s__Bacteroides_dorei 14301.1
UniRef90_R7NYS9     7150.53
UniRef90_R7NYS9|g__Bacteroides.s__Bacteroides_vulgatus      7150.53
UniRef90_D1K9F5     2860.21

Note the difference in scale: units tend to be larger than in the original file as a result of being scaled to sum to 1 million. Also note that the sample header has been updated from demo_Abundance-RPK to demo_Abundance-CPM courtesy of the --update-snames flag. Let's verify that the unstratified abundance values sum up to 1 million:

$ grep -v "#" demo_fastq/demo_genefamilies-cpm.tsv | grep -v "|" | cut -f2 | python -c "import sys; print sum(float(l) for l in sys.stdin)"

The command above uses grep to exclude the header line (marked by #) and the stratified lines (containing |). The cut command then pulls out the second column (CPM values). The python -c command runs a one-line python program to sum up the CPM values. This returns 1000000.46371 (i.e. ~1 million modulo a small amount of rounding error).

  • HUMAnN2 natively normalizes gene family abundance by gene length. Why is this step important? (In other words, why are RPK units more useful than raw read counts?)
  • Under what circumstances would it be useful to consider gene/pathway abundance before normalizing by sample read depth?

3.3 Regrouping genes to other functional categories

HUMAnN2's default "units" of microbial function are gene families and metabolic pathways. However, starting from gene family abundance, it is possible to reconstruct the abundance of other functional categories in a microbiome using the humann2_regroup_table script. Inspect the regrouping options using the humann2_regroup_table -h command.

Let's regroup our CPM-normalized gene family abundance values to enzyme commission (EC) categories:

$ humann2_regroup_table --input demo_fastq/demo_genefamilies-cpm.tsv --output demo_fastq/level4ec-cpm.tsv --groups uniref90_level4ec

We are given the output message:

Original Feature Count: 3729; Grouped 1+ times: 283 (%7.6); Grouped 2+ times: 4 (%0.1)

Indicating that the original file contained 3,729 features (UniRef90 gene families). Of these, 283 were annotated to a level-4 EC, and 4 were annotated to more than one level-4 EC. Inspect the output file demo_fastq/level4ec-cpm.tsv using the less command:

# Gene Family       demo_Abundance-CPM
UNMAPPED    167194.0
UNGROUPED   761276.764
UNGROUPED|g__Bacteroides.s__Bacteroides_dorei       433049.25
UNGROUPED|g__Bacteroides.s__Bacteroides_vulgatus    323309.361
UNGROUPED|unclassified      4918.036   52.771|g__Bacteroides.s__Bacteroides_dorei       52.771   124.628|g__Bacteroides.s__Bacteroides_vulgatus    124.628

We see that UniRef90 features have been regrouped as level-4 EC features, such as A new stratified feature also appears in the output: UNGROUPED. This feature reports the total abundance of features that were not otherwise included in a group (similar to the UNINTEGRATED feature of the pathways output). If each feature was regrouped into at most one new feature, the total community abundance of features in the regrouped file (including UNGROUPED and UNMAPPED) would still be 1 million. The actual sum (1,003,757.446) is slightly higher than 1 million due to the three UniRef90 features whose abundance was counted in two different level-4 EC features. Using the humann2_rename_table script, attach names to the unnamed level-4 EC features using the option --names ec. 2-dehydropantoate 2-reductase                                      52.771 2-dehydropantoate 2-reductase|g__Bacteroides.s__Bacteroides_dorei  52.771 IMP dehydrogenase                                                  124.628 IMP dehydrogenase|g__Bacteroides.s__Bacteroides_vulgatus           124.628

  • From a statistical testing perspective, what is one major advantage of regrouping raw gene abundance to a smaller list of gene groups?
  • When regrouping genes to ECs (or KOs or MetaCyc reactions), we sum over the gene sequences belonging to a particular group. Under what circumstance would it make sense to average over the members of a group rather than sum? (Hint: the answer is related to the difference between the "is_a" and "part_of" relationships used in the Gene Ontology.)

4. Advanced HUMAnN2 topics

4.1 HUMAnN2 intermediate files

It is in some cases useful to understand the collection of intermediate files generated by the individual steps of a typical HUMAnN2 run:

$ ls -lh demo_fastq/demo_humann2_temp/

Which yields:

-rw-rw-r-- 1   11M Aug  2 11:22 demo_bowtie2_aligned.sam
-rw-rw-r-- 1  2.9M Aug  2 11:22 demo_bowtie2_aligned.tsv
-rw-rw-r-- 1   11M Aug  2 11:22 demo_bowtie2_index.1.bt2
-rw-rw-r-- 1  3.6M Aug  2 11:22 demo_bowtie2_index.2.bt2
-rw-rw-r-- 1  136K Aug  2 11:22 demo_bowtie2_index.3.bt2
-rw-rw-r-- 1  3.6M Aug  2 11:22 demo_bowtie2_index.4.bt2
-rw-rw-r-- 1   11M Aug  2 11:22 demo_bowtie2_index.rev.1.bt2
-rw-rw-r-- 1  3.6M Aug  2 11:22 demo_bowtie2_index.rev.2.bt2
-rw-rw-r-- 1  628K Aug  2 11:22 demo_bowtie2_unaligned.fa
-rw-rw-r-- 1   17M Aug  2 11:22 demo_custom_chocophlan_database.ffn
-rw-rw-r-- 1   86K Aug  2 11:22 demo_diamond_aligned.tsv
-rw-rw-r-- 1  521K Aug  2 11:22 demo_diamond_unaligned.fa
-rw-rw-r-- 1   18K Aug  2 11:23 demo.log
-rw-rw-r-- 1   55K Aug  2 11:21 demo_metaphlan_bowtie2.txt
-rw-rw-r-- 1   958 Aug  2 11:22 demo_metaphlan_bugs_list.tsv

These include SAM files containing the alignment of input reads against the default nucleotide database, which are first reduced by counting up and tabulating the hits to each annotated gene family. It can be useful to reprocess either the reads unaligned after nucleotide mapping (bowtie2_unaligned.fa) or, more likely, the reads still unaligned after amino acid search. Finally, the MetaPhlAn2 bowtie2.txt intermediate output is itself saved for e.g. PanPhlAn strain profiling. For more information, see the HUMAnN2 User Manual's output file section.

4.2 HUMAnN2 for multiple samples

It is typical to execute HUMAnN2 on multiple individual samples, then merge the resulting gene families or pathways into a single tab-delimited multi-sample table. When combining or comparing any HUMAnN2 output, it is critical to normalize the default RPK abundance units (which will be sensitive to the sequencing depth of each sample) into RPKMs (which will not).

We can demonstrate this by downloading a collection of six subsampled input files from the Human Microbiome Project. The original files, and many others, can be downloaded from the HMP DACC.

Click on each individual subsampled file to download. Open a terminal and change directories into the folder where you have downloaded the files.

You will see something like the following contents if you run $ ls -lh:

-rw-r--r-- 1  2.7M Jun  4 11:33 763577454-SRS014459-Stool.fasta
-rw-r--r-- 1  2.6M Jun  4 11:33 763577454-SRS014464-Anterior_nares.fasta
-rw-r--r-- 1  2.7M Jun  4 11:33 763577454-SRS014470-Tongue_dorsum.fasta
-rw-r--r-- 1  2.7M Jun  4 11:34 763577454-SRS014472-Buccal_mucosa.fasta
-rw-r--r-- 1  2.7M Jun  4 11:34 763577454-SRS014476-Supragingival_plaque.fasta
-rw-r--r-- 1  2.7M Jun  4 11:34 763577454-SRS014494-Posterior_fornix.fasta

Process each input file into a single output directory:

$ humann2 -i 763577454-SRS014459-Stool.fasta -o 763577454
$ humann2 -i 763577454-SRS014464-Anterior_nares.fasta -o 763577454
$ humann2 -i 763577454-SRS014470-Tongue_dorsum.fasta -o 763577454
$ humann2 -i 763577454-SRS014472-Buccal_mucosa.fasta -o 763577454
$ humann2 -i 763577454-SRS014476-Supragingival_plaque.fasta -o 763577454
$ humann2 -i 763577454-SRS014494-Posterior_fornix.fasta -o 763577454

Alternatively, if you're comfortable with shell scripting, you can use the following command to process all six at once:

$ for f in *.fasta; do humann2 -i $f -o 763577454; done

If you're feeling particularly impatient, add a & at the end (i.e. 763577454 &), but use that form with caution.

Regardless of how you run HUMAnN2, this will result in six output directories each containing the three main analysis types (gene families, pathway abundances, and pathway coverages) along with intermediate files. Let's merge the 6 per-sample gene family abundance outputs into a single table for this dataset using the script humann2_join_tables:

$ humann2_join_tables -i 763577454 -o 763577454_genefamilies.tsv --file_name genefamilies

Inspect this file using the less command:

# Gene Family       763577454-SRS014459-Stool_Abundance-RPKs        763577454-SRS014464-Anterior_nares_Abundance-RPKs       763577454-SRS014470-Tongue_dorsum_Abundance-RPKs        763577454-SRS014472-Buccal_mucosa_Abundance-RPKs        763577454-SRS014476-Supragingival_plaque_Abundance-RPKs 763577454-SRS014494-Posterior_fornix_Abundance-RPKs
UNMAPPED    14210.0000000000        17696.0000000000        16623.00000000009766.0000000000 16605.0000000000        13612.0000000000
UniRef90_A0A015N407 2.3310023310    0       0       0       0       0
UniRef90_A0A015N407|g__Bacteroides.s__Bacteroides_cellulosilyticus  2.3310023310    0       0       0       0       0
UniRef90_A0A015N416 5.2910052910    0       0       0       0       0
UniRef90_A0A015N416|g__Bacteroides.s__Bacteroides_cellulosilyticus  5.2910052910    0       0       0       0       0
UniRef90_A0A015P3I5 3.7453183521    0       0       0       0       0
UniRef90_A0A015P3I5|g__Parabacteroides.s__Parabacteroides_distasonis        3.7453183521    0       0       0       0       0
UniRef90_A0A015QBU0 0.6485084306    0       0       0       0       0
UniRef90_A0A015QBU0|g__Bacteroides.s__Bacteroides_ovatus    0.6485084306    00      0       0       0

It is especially important to renormalize this merged file as we're now comparing samples with different sequencing depths:

$ humann2_renorm_table -i 763577454_genefamilies.tsv -o 763577454_genefamilies_cpm.tsv --units cpm

For more information, see the HUMAnN2 User Manual's description of a standard analysis workflow.

5. Downstream analyses

5.1 Associating HUMAnN2 functions with metadata

There are many approaches for associating microbial measurements with sample metadata, including several options in the bioBakery:

When associating metadata with HUMAnN2 features, it is often beneficial to associate with community totals and avoid testing each individual feature stratification (to improve statistical power). This is the approach used by the HUMAnN2 utility script humann2_associate, which can compare feature totals across samples with 1) a single continuous metadatum (via the Spearman Correlation) or 2) a single categorical metadatum (via the Kruskal-Wallis H-test). Notably, this is a naive approach to association, but it is useful for tutorial purposes.

For this and the next section, we'll operate on a table of pathway abundances from ~400 samples from the Human Microbiome Project:

Download this file to your working directory and examine it using less:

$ less -S hmp_pathabund.pcl

Which yields:

FEATURE \ SAMPLE                                                                                      SRS011084    SRS011086      SRS011090
STSite                                                                                                Stool        Tongue_dorsum  Buccal_mucosa
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis                                                  0.000498359  0.000628096    0.000304951
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Acidovorax.s__Acidovorax_ebreus               0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Bacteroides.s__Bacteroides_finegoldii         0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Haemophilus.s__Haemophilus_influenzae         0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Klebsiella.s__Klebsiella_pneumoniae           0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Staphylococcus.s__Staphylococcus_epidermidis  0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|g__Veillonella.s__Veillonella_sp_oral_taxon_158  0            0              0
1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis|unclassified                                     7.89502e-06  1.60101e-05    0

We can see three samples from three different body sites (via the STSite metadatum). A single pathway, 1CMET2-PWY: N10-formyl-tetrahydrofolate biosynthesis, and its stratifications are visible. In these three samples, the pathway can only be assigned to the unclassified stratum, and with very low abundance.

Let's associate pathway totals in this file with sample body site:

$ humann2_associate --input hmp_pathabund.pcl --last-metadatum STSite --focal-metadatum STSite --focal-type categorical --output stats.txt

This call will perform a Kruskal-Wallis H-test between each feature's community total and sample body site, summarizing by per-site means and ranking by statistical significance. Let's examine the first few lines of the file:

$ less -S stats.txt

Which yields:

# kruskal-wallis analysis of metadatum: STSite
# feature                                                                        level means                                                                                                                                         p-value    q-value
METSYN-PWY: L-homoserine and L-methionine biosynthesis                           Supragingival_plaque:0.0004077|Stool:9.059e-05|Posterior_fornix:1.275e-05|Tongue_dorsum:0.0005423|Anterior_nares:2.585e-05|Buccal_mucosa:0.00112    6.477e-70  1.224e-67
PWY-5347: superpathway of L-methionine biosynthesis (transsulfuration)           Supragingival_plaque:0.0003854|Stool:9.416e-05|Posterior_fornix:9.412e-06|Tongue_dorsum:0.0005187|Anterior_nares:1.71e-05|Buccal_mucosa:0.0007963   7.074e-70  1.224e-67
PWY-6386: UDP-N-acetylmuramoyl-pentapeptide biosynthesis II (lysine-containing)  Supragingival_plaque:0.0007095|Stool:5.553e-06|Posterior_fornix:1.768e-06|Tongue_dorsum:0.0001989|Anterior_nares:0.0002186|Buccal_mucosa:0.001337   1.115e-69  1.286e-67
HOMOSER-METSYN-PWY: L-methionine biosynthesis I                                  Supragingival_plaque:0.0002616|Stool:5.078e-05|Posterior_fornix:7.601e-06|Tongue_dorsum:0.0003655|Anterior_nares:1.856e-05|Buccal_mucosa:0.0009729  3.124e-69  2.702e-67
NAD-BIOSYNTHESIS-II: NAD salvage pathway II                                      Supragingival_plaque:0.0001445|Stool:3.478e-06|Posterior_fornix:2.321e-07|Tongue_dorsum:0.0001515|Anterior_nares:6.176e-06|Buccal_mucosa:0.0005249  7.574e-68  5.241e-66
MET-SAM-PWY: superpathway of S-adenosyl-L-methionine biosynthesis                Supragingival_plaque:0.000311|Stool:7.31e-05|Posterior_fornix:8.425e-06|Tongue_dorsum:0.0004838|Anterior_nares:5.945e-06|Buccal_mucosa:0.0005264    6.56e-67   3.783e-65
PWY-5188: tetrapyrrole biosynthesis I (from glutamate)                           Supragingival_plaque:0.0007416|Stool:1.759e-05|Posterior_fornix:3.483e-06|Tongue_dorsum:0.0006484|Anterior_nares:0.0002229|Buccal_mucosa:0.0003094  1.098e-66  5.426e-65
PWY-3001: superpathway of L-isoleucine biosynthesis I                            Supragingival_plaque:0.0007748|Stool:0.0002933|Posterior_fornix:1.806e-05|Tongue_dorsum:0.000893|Anterior_nares:5.512e-05|Buccal_mucosa:0.001016    2.768e-66  1.197e-64

We observe, for example, that the first pathway in the table (METSYN-PWY: L-homoserine and L-methionine biosynthesis) tends to be more abundant on average at the three oral sites (buccal mucosa, tongue, and plaque) than at the three non-oral sites (stool, posterior fornix, and nares).

Now that we have identified this enrichment, use the grep command to explore the stratifications of this pathway from the hmp_pathabund.pcl file.

$ grep 'METSYN-PWY' hmp_pathabund.pcl | less -S

  • Do the major contributing species seem reasonable for an orally-enriched pathway?

5.2 Visualizing stratified HUMAnN2 output

HUMAnN2 includes a utility humann2_barplot to assist with visualizing stratified outputs. Let's use this tool to make a plot of the significantly differentially abundant pathway we identified above. (Use see plot1.png to open the resulting figure if working in a graphical environment.)

$ humann2_barplot --input hmp_pathabund.pcl --focal-feature METSYN-PWY --focal-metadatum STSite --last-metadatum STSite --output plot1.png


Not terribly informative... Let's try adding a "sort" on the summed stratified abundance:

$ humann2_barplot --sort sum --input hmp_pathabund.pcl --focal-feature METSYN-PWY --focal-metadatum STSite --last-metadatum STSite --output plot2.png


A pattern has started to emerge: we can clearly see that the oral body sites are enriched on the left (high) end of the plot, consistent with the summary statistics computed above. We can continue this line of analysis with an additional grouping by body site:

$ humann2_barplot --sort sum metadata --input hmp_pathabund.pcl --focal-feature METSYN-PWY --focal-metadatum STSite --last-metadatum STSite --output plot3.png


We see now that the pathway is most strongly enriched in the buccal mucosa, where it is contributed almost exclusively by Streptococcus species. The pathway's relative abundance in the other two body sites is generally lower, and is more often "unclassified" at the tongue body site.

  • How might you find other pathways with a similar abundance distribution in this functional profile?
  • What might be causing the "unclassified" contributors of homoserine/methionine synthesis in tongue dorsum communities?
  • Is it surprising that three apparently closely related communities (tongue, cheek, and tooth surface) differ so strikingly in metagenomic functional potential?

Let's examine one additional pathway, COA-PWY: coenzyme A biosynthesis, which is more broadly conserved across body sites:

$ humann2_barplot --sort sum --input hmp_pathabund.pcl --focal-feature COA-PWY --focal-metadatum STSite --last-metadatum STSite --output plot4.png


Now sorting by total abundance is less informative. Let's try some additional options: sorting by ecological similarity, normalizing pathway contributions within-sample, and expanding the list of species highlighted:

$ humann2_barplot --sort similarity --top-strata 12 --scaling normalize --input hmp_pathabund.pcl --focal-feature COA-PWY --focal-metadatum STSite --last-metadatum STSite --output plot5.png


Notably, sorting by similarity of pathway contributions has tended to group samples from the same body site even without the explicit metadata sort. This is because, for broadly distributed pathways, organismal contributions tend to follow overall sample ecology (which tends to be quite different from one body site to the next).

  • Is it surprising to see coenzyme A biosynthesis present in diverse human body site microbiomes?