HUMAnN: The HMP Unified Metabolic Analysis Network
|To find...||...check out:|
|Contact||Curtis Huttenhower, firstname.lastname@example.org|
HUMAnN is a pipeline for efficiently and accurately determining the
presence/absence and abundance of microbial pathways in a community from
metagenomic data. Sequencing a metagenome typically produces millions of
short DNA/RNA reads. HUMAnN takes these reads as inputs and produces gene
and pathway summaries as outputs:
The abundance of each orthologous gene family in the community. Orthologous
families are groups of genes that perform roughly the same biological roles.
HUMAnN uses the KEGG Orthology (KO) by default, but any catalog of orthologs
can be employed with minor changes (COG, NOG, etc.).
For more information, please refer to the "Using alternatives to KEGG" section below.
The presence/absence of each pathway in the community. HUMAnN refers to
pathway presence/absence as "coverage," and defines a pathway as a set of
two or more genes. HUMAnN uses KEGG pathways and modules by default, but
again can easily be modified to use GO terms or other gene sets.
The abundance of each pathway in the community, i.e. how many "copies" of
that pathway are present.
HUMAnN can thus be used in tandem with any translated BLAST program to convert
sequence reads into coverage and abundance tables summarizing the gene families
and pathways in a microbial community. This lets you analyze a collection of
metagenomes as a matrix of gene/pathway abundances, just like you might analyze
a collection of microarrays.
If you use this software, please cite our paper:
Metabolic reconstruction for metagenomic data and its application to the human microbiome
Sahar Abubucker, Nicola Segata, Johannes Goll, Alyxandria M. Schubert,
Jacques Izard, Brandi L. Cantarel, Beltran Rodriguez-Mueller, Jeremy Zucker,
Mathangi Thiagarajan, Bernard Henrissat, Owen White, Scott T. Kelley,
Barbara Meth�, Patrick D. Schloss, Dirk Gevers, Makedonka Mitreva,
Many thanks to the NIH and to the entire Human Microbiome Project team for
making the HMP possible and for the many collaborators who helped to make
HUMAnN a reality. Sahar Abubucker and Makedonka Mitreva (Washington
University) co-led the Metabolic Reconstruction group, the pipeline
incorporates software from Yuzhen Ye (Indiana University), Beltran
Rodriguez-Mueller (SDSU), and Pat Schloss (University of Michigan), and
specific contributors include Alyx Schubert (University of Michigan), Jeremy
Zucker (Broad Institute), Brandi Cantarel (UMD), Qiandong Zeng (Broad
Institute), Johannes Goll (JCVI), and many others.
HUMAnN uses the scons build system to drive its scientific workflow
(see PREREQUISITES). scons works very much like make, converting a set of
inputs into a set of outputs one step at a time, and running only the steps
necessary to produce the desired output. Thus, to analyze your own data:
Place one or more translated BLAST results using KEGG gene identifiers in
the "input" directory (optionally gzipped or bzipped).
Edit the "SConstruct" file; in particular, make sure that the input
processors include one configured for your BLAST file name(s) and format(s).
Run the "scons" command, optionally parallelizing multiple analyses using the
"-j" flag. Results will be placed in the "output" directory.
HUMAnN is highly configurable in order to perform a collection of very
computationally intensive tasks efficiently and flexibly. Please see the
included sample input files, metadata files, and SConstruct settings for an
overview of the software's configuration and the file formats it consumes and
HUMAnN has no installation per se, but it does depend on the following items:
A network connection. HUMAnN downloads a number of data and software
components from standard repositories during exection. Please ensure
that a network connection is available at the least for the first run.
HUMAnN will use "curl" by default to download files, and this can be
changed (e.g. to use "wget") by editing the
A bunch of RAM. Processing one Illumina lane of metagenomic reads can take
as much as ~8-10GB of memory, although it will often take much less
depending on the data. Consider yourself warned.
Python >= 2.7. As of this writing, we use exactly
one feature unique to Python 2.7, math.gamma. If you're willing to edit
it out, the dependency is to Python 2.6 for various modules and syntax.
Please note that HUMAnN does not run blastx for you. It instead consumes
tabular BLAST results as input. We recommend the default "-outfmt 6" setting
as described below and in the provided SConstruct configuration.
Alternatively, input processors are also provided for accelerated BLAST
implementations such as mapx, mblastx, or usearch.
BioCyc (optional, automatically downloaded)
maq (optional, automatically downloaded)
R (optional, for synthetic performance evaluation)
R package ROCR (optional, for synthetic performance evaluation)
One or more of the following:
Tabular translated BLAST (blastx) output files files matching sequence read
IDs to gene IDs.
Mapping (bowtie, bwa, etc.) output in BAM format.
Tab-delimited text containing one or more pre-quantified gene abundances.
Place (or symlink) each file with a .txt, .txt.gz, .txt.bz, .bam, .csv, .tsv,
or .pcl extension as appropriate in the "input" directory
before running HUMAnN. The pipeline includes processors for three
tab-delimited text formats by default (below) and BAM binary format and can
easily by modified to accept more.
As an example, the default inputs provided with HUMAnN were generated using:
blastx -outfmt 6 -db 28_kegg_genomes < mock_even_lc.fasta.gz | gzip -c > mock_even_lc.txt.gz
Where "28_kegg_genomes" is a database of the amino acid sequences of 28
well-characterized KEGG organisms' ORFs (.pep files from
ban, bbr, bqu, bsu, cbo, cdf, cje, eco, efa, ftu, hin, hpy, hsl, lmo, mbo, mtu, ngo, nme, pae, rso, sau, sco, sgo, spn, vch, xfa, yen, and yps).
Please modify the SConstruct file as needed to specify the exact format of your input data:
blastx -outfmt 6 ---------------- qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore mapx ---- template-name frame read-name template-start template-end template-length read-start read-end read-length template-protein read-protein alignment identical %identical positive %positive mismatches raw-score bit-score e-score mblastx ------- Query_Id Reference_Id E_Value_Bit_Score Identity_Percentage_Length_of_HSP Number_of_Positives_Frame_#s Alignment_Start_in_Query Alignment_End_in_Query Alignment_Start_in_Reference Alignment_End_in_Reference BAM (bowtie/bwa/etc.) -------------------- qname flag rname pos mapq cigar rnext pnext tlen seq qual usearch ------- usearch can be treated identically to "blastx -outfmt 6" if the "--blast6out" flag is provided.
Additionally, tab-delimited (.csv, .tsv, or .pcl) files of tabulated gene
abundances (KEGG KO identifiers by default) can be used, with each column
representing one sample and each row one gene family. Any such files placed in
the input directory will be split per column and each sample will be
run through the HUMAnN pipeline separately.
Three or more files per input including:
The relative abundances of each gene in the input metagenome. By default
tagged as type "01". Two columns of tab-delimited text: geneid abundance.
The coverages of each pathway, expressed as a fraction between 0 and 1
inclusive. By default tagged as type "04a". Two columns of
tab-delimited text: pathid coverage.
The relative abundances of each pathway. By default tagged as type "04b".
Two columns of tab-delimited text: pathid abundance.
Optionally, a table of individual gene abundances appropriate for loading
into METAREP. By default tagged as type "99". Five columns of
tab-delimited text: geneid abundance e-score %identical identical. The
abundance is relative and calculated as in HUMAnN's gene family abundances;
e-score, percent identity, and identity length are averaged over all reads
mapping to each gene in the input translated BLAST results.
Three or more merged files are also produced, all tab-delimited text:
A table containing the relative abundances of all genes in any input
metagenomes. By default named as "01-*.txt".
A table containing the coverages of all pathways in any input files. By
default named as "04a-*.txt".
A table containing the relative abundances of all pathways in any input files.
By default named as "04b-*.txt".
A table containing the relative abundances of all individual genes in any
input files. By default named as "99-*.txt".
Please see below for a detailed description of the ways in which these files
are produced and named.
You can reproduce the entire performance evaluation using synthetic communities
in the HUMAnN manuscript with the tools in the "synth" subdirectory. This
optional step is omitted during default HUMAnN operation but will be
automatically incorporated into HUMAnN output if one or more synthetic
communities are built. To do this:
Place a .fa/.qual file pair or .fastq file in the "synth/output" directory
and edit the "synth/SConstruct" file accordingly. This can be any
representative sequencing data from which an error model will be built
with which to synthesize artificial reads.
Unfreeze the synthetic data by changing
c_fFrozento False in
Run "scons" and wait quite a while. High-quality genomes will be
downloaded from KEGG, shredded into artificial reads using maq, and
mixed into a synthetic community.
By default, four communities are built: two with staggered organismal
abundances (stg, using a lognormal distribution) and two with even
abundances (even), and two with 20 organisms (low complexity, lc) and
two with 100 (high complexity, hc).
The true gene and pathway abundances/coverages will be placed in
correspondingly named files in the "synth/output" directory. These will
be automatically detected the next time HUMAnN is executed using scons.
They will be merged into the overall output files, and if correspondingly
named input files ("mock_stg_lc", "mock_even_hc", etc.) are available,
their performance will be automatically plotted as PDF output files using R.
HUMAnN can also add a level of organism specificity to the output. To do this:
Turn on the feature by changing c_fOrg to True in SConstruct.
Run "scons" as you would on your samples without the organism specificity
HUMAnN's scons workflow runs on the assumption that each input file will be
converted into one or more output files by combining a series of pipeable
processing modules. It is also set up to allow easy configuration and
comparison of output settings, so think of it as a tree: any output processor
that can process an input file will. This allows, for example, the quick
and easy comparison of the results of metabolic reconstruction using many
slightly different parameter settings to determine which is optimal.
Note that while HUMAnN uses primarily KEGG Orthology KO identifiers for genes
and KEGG pathway (ko) or module (M) identifiers for pathways, this can be
easily modified to suit your analysis needs. For example, code to process
MetaCyc reaction identifiers (for gene families) and pathway identifiers is
also included by default. Any traceable identifiers (ECs, NOGs, etc.) can be
used for genes and for the pathways in which they're contained.
The SConstruct file includes a configurable set of processing modules used to
convert each input file into one or more outputs. Processing modules are defined
using the CProcessor class with seven arguments:
A file suffix (for input files) or type (for output files). This can be of
the form ".txt.gz" for input files or "##" for output files. This pattern
must match the filename consumed by the processor.
A file type. This is of the form "##" by default and can be any short tag
(numerical or text) used to identify a type of generated file. File types
are used to chain processing modules together; files of type ## will only
and always be further processed by modules that can input type ##.
An identifier. This is of the form "xyz" by default and can be any short
tag (numerical or text) used as a human-readable identifier for the
processing module and its specific configuration options. It is appended
to the resulting output file and provides a minimal form of data provenance.
A program. This is typically a Python script that consumes an input file
on stdin, produces output on stdout, and can optionally include additional
command line arguments (see below).
Zero or more supporting files as command line arguments. This array of file
names is passed to the processing program as command line arguments (and
will be understood by scons as dependencies, thus rebuilding the output
when they change).
Zero or more additional command line arguments. This array of arbitrary
strings is passed to the processing program as command line arguments (and
will be ignored by scons). This can be used to produce multiple output
files from the same processing scripts using command line arguments.
A boolean flag, True if the processor should be used for initial input files
and False (or omitted) otherwise. In the latter case, the processor will
only be used on intermediate files matching its input type.
A boolean flag, True if the output should be gzipped and False (or omitted)
otherwise. Subsequent processors will automatically ungzip compressed
By default, the following output file types are produced:
Generated from raw BLAST results.
A condensed binary representation of translated BLAST results, abstracted
from and independent of the specific format (blastx/mblastx/mapx) in which
they are provided.
Generated from _00-.txt.gz.
Relative gene abundances as calculated from BLAST results in which each read
has been mapped to zero or more gene identifiers based on quality of match.
Total weight of each read is 1.0, distributed over all gene (KO) matches by
Generated from _01-.txt.
Relative gene abundances distributed over all pathways in which the gene is
predicted to occur.
Generated from _02a-.txt.
Relative gene abundances with pathway assignments, taxonomically limited to
remove pathways that could only occur in low-abundance/absent organisms.
Generated from _02b-.txt.
Relative gene abundances with pathway assignments, smoothed so that zero
means zero and non-zero values are imputed to account for non-observed sequences.
Generated from _03a-.txt.
Relative gene abundances with pathway assignments, gap-filled so that
gene/pathway combinations with surprisingly low frequency are imputed to
contain a more plausible value.
Generated from _03b-.txt.
Pathway coverage (presence/absence) measure, i.e. relative confidence of each
pathway being present in the sample. Values are between 0 and 1 inclusive.
Generated from _03b-.txt.
Pathway abundance measure, i.e. relative "copy number" of each pathway in the
sample. On the same relative abundance scale (0 and up) as the original gene
Generated from _00-.txt.gz.
Per-sample gene abundance tables formatted for loading into METAREP. On the
same relative abundance scale (0 and up) as the original gene abundances
Combined pathway coverage matrix for all samples.
Combined pathway abundance matrix for all samples, normalized per column.
Pathway information formatted for input to LefSe (exported)
Pathway information formatted as a tree reflecting the KEGG BRITE hierarchy
for plotting in GraPhlAn.
Pathway abundances formatted for overplotting on the KEGG BRITE hierarchy
Combined KO abundance matrix for all samples, normalized per column.
Gene abundance calculated as the confidence (e-value/p-value) weighted sum of
all hits for each read.
Gene to pathway assignment performed using MinPath.
Gene to pathway assignment performed naively using all pathways.
Pathway abundances adjusted based on A) taxonomic limitation and B) the
expected copy number of each gene in the detected organisms.
Smoothing performed using Witten-Bell discounting, which shifts
sum_observed/(sum_observed + num_observed) probability mass into zero counts
and reduces others by the same fraction.
Smoothing performed naively by adding a constant value (0.1) to missing
No gap filling (no-operation, abundances left as is).
Gap filling by substituting any values below each pathway's median with the
median value itself.
Pathway coverage calculated as fraction of genes in pathway at or above global
Pathway coverage calculated as fraction of genes in pathway at or above global
median abundance, with low-abundance pathways set to zero using Xipe.
Pathway abundance calculated as average abundance of the most abundant half
of genes in the pathway.
Since HUMAnN's development, KEGG has gone commercial. This is a huge blow to
the academic community at large, and an unfortunate inconvenience for HUMAnN
in particular. In order to compromise between minimal disruption of HUMAnN
operation and respecting KEGG's intellectual property, the following guidelines
have been followed:
Files derived from KEGG necessary for normal HUMAnN operation are included in
the data directory. These include the basic structure of KEGG pathways and
modules, gene lengths, and ID/name mappings.
The more detailed files from KEGG needed for synthetic metagenome construction
and evaluation are not included, thus rendering the synth sub-project
inoperable without access to KEGG inputs. We apologize for this
inconvenience and urge any impacted users to contact KEGG directly to inform
them of the problem. Please let us know if you're interested in information
on additional workarounds to enable synthetic metagenome evaluation.
Using alternatives to KEGG
The process of using HUMAnN with a database other than KEGG (e.g. COG, NOG, etc.) requires:
A FASTA file of nucleotide sequences against which the meta'ome is searched, each labeled with a gene ID (for KEGG, the genes.pep file distributed with KEGG).
A file of nucleotide sequence lengths, each labeled with a gene ID (for KEGG, data/genels).
A file of gene-to-OG mappings (for KEGG, data/koc).
A file of OG-to-pathway mappings (for KEGG, data/keggc).
Each of those four files should be replaced for HUMAnN to run on the alternative database.
maq installation issues
Some users have reported freezes during the synthetic metagenome maq
installation process that are outside of HUMAnN's control. If maq locks up
during compilation or installation, feel free to install it directly from the
appropriate Sourceforge package in maq-0.7.1/maq (as expected by HUMAnN) and
replace the funcUntarMAQ function in synth/SConstruct with
def funcUntarMAQ( target, source, env ): return 0
Note that the synthetic metagenomes are an optional component not needed for
normal HUMAnN operation.
If you get the error:
sh: data/MinPath/glpk-4.6/examples/glpsol: cannot execute binary file
Minpath comes bundled with a version of glpk that will only run on Linux;
HUMAnN can't easily replace this itself "on the fly." To fix the problem
and run HUMAnN on MacOS X, after this error is produced, remove MinPath's
glpk folder and replace it with a version compiled for Mac from here:
Please do not provide HUMAnN with any input files including _ followed by a
an_00_example. HUMAnN uses filename tags of
this form to indicate output from each of its processing stages (as
described above), and input files that include such patterns should be
renamed before HUMAnN execution.
- Initial modularization of HUMAnN code
- First application to HMP production data (mblastx)
- Refinement of documentation and KEGG/MetaCyc version updates
- Addition of METAREP output module
- Addition of full (albeit crummy) CFG parser for KEGG modules
- Improvement to module FPs using X2 for enzyme presence/absence calls
- Add default calculation of basic ecology metrics: Shannon, Simpson, Pielou, and richness
- Improve memory usage in METAREP output generation
- Make default Xipe parameters slightly more lenient (retain a few more nonzero pathways)
- Improve gold standard generation in synthetic datasets by retaining full KEGG pathway set
- Fix a problem with non-pathway BRITE entities in synthetic gold standard
- Fix a typo in hits2enzymes.py (only affected unused filter option)
- Add complete parameter evaluation process to HMP pipeline
- MAJOR CHANGE: KEGG is now defunct, and HUMAnN has been updated accordingly
- KEGG derived information needed for normal operation is included
- KEGG files needed for synthetic metagenome construction are not included
- "Frozen" synthetic metagenome evaluation is still possible
- Please contact us directly for more information if needed
- Add documentation on potential maq issues (thanks to Shinichi Sunagawa!)
- Fix a typo in fastq2fasta.py formatting (thanks to Shinichi Sunagawa!)
- Fix a typo in module2modulec.py formatting (thanks to Kathryn Iverson!)
- Fix a typo in eco.py for overly sparse input files (thanks to Jeffrey Werner!)
- Work around Mac OS X zcat issues (thanks to Jeffrey Werner!)
- Add several internal evaluation pipelines in response to initial reviews
- Fix hits2*.py handling of zero/very small e-values (thanks to Fah Sathira!)
- Fix missing exclude.py (thanks to Brandi Cantarel!)
- Allow module2modulec.py to remove unusual duplicate enzymes from KEGG's files
- Allow input filenames to contain underscores
- Fix module size calculation in filter.py
- Fix a bug in hits2enzymes.py to allow a wider range of KEGG gene name detection
- Completely revamped pathway coverage calculation, much more accurate for low-abundance events (thanks to Kat Huang, Sean Sykes!)
- Fixed hits2metarep.py handling of empty hits files (thanks to Pavan Kumar!)
- Fixed missing KO gloss annotations in merged 01b*.txt per-gene tabular abundance quantification output files.
- Added tab-delimited input file formats.
- Added GraPhlAn tree output files to enable visualization of abundance overlays on KEGG hierarchies (thanks to Jovian Yu, Morgan Paull!).
- Added preliminary organism-specific output generation.