HUMAnN: The HMP Unified Metabolic Analysis Network
|To find...||...check out:|
|Contact||Curtis Huttenhower, firstname.lastname@example.org|
HUMAnN is a pipeline for efficiently and accurately determining the presence/absence and abundance of microbial pathways in a community from metagenomic data. Sequencing a metagenome typically produces millions of short DNA/RNA reads. HUMAnN takes these reads as inputs and produces gene and pathway summaries as outputs:
The abundance of each orthologous gene family in the community. Orthologous families are groups of genes that perform roughly the same biological roles. HUMAnN uses the KEGG Orthology (KO) by default, but any catalog of orthologs can be employed with minor changes (COG, NOG, etc.). For more information, please refer to the "Using alternatives to KEGG" section below.
The presence/absence of each pathway in the community. HUMAnN refers to pathway presence/absence as "coverage," and defines a pathway as a set of two or more genes. HUMAnN uses KEGG pathways and modules by default, but again can easily be modified to use GO terms or other gene sets.
The abundance of each pathway in the community, i.e. how many "copies" of that pathway are present.
HUMAnN can thus be used in tandem with any translated BLAST program to convert sequence reads into coverage and abundance tables summarizing the gene families and pathways in a microbial community. This lets you analyze a collection of metagenomes as a matrix of gene/pathway abundances, just like you might analyze a collection of microarrays.
If you use this software, please cite our paper: Metabolic reconstruction for metagenomic data and its application to the human microbiome Sahar Abubucker, Nicola Segata, Johannes Goll, Alyxandria M. Schubert, Jacques Izard, Brandi L. Cantarel, Beltran Rodriguez-Mueller, Jeremy Zucker, Mathangi Thiagarajan, Bernard Henrissat, Owen White, Scott T. Kelley, Barbara Meth�, Patrick D. Schloss, Dirk Gevers, Makedonka Mitreva, Curtis Huttenhower
Many thanks to the NIH and to the entire Human Microbiome Project team for making the HMP possible and for the many collaborators who helped to make HUMAnN a reality. Sahar Abubucker and Makedonka Mitreva (Washington University) co-led the Metabolic Reconstruction group, the pipeline incorporates software from Yuzhen Ye (Indiana University), Beltran Rodriguez-Mueller (SDSU), and Pat Schloss (University of Michigan), and specific contributors include Alyx Schubert (University of Michigan), Jeremy Zucker (Broad Institute), Brandi Cantarel (UMD), Qiandong Zeng (Broad Institute), Johannes Goll (JCVI), and many others.
HUMAnN uses the scons build system to drive its scientific workflow (see PREREQUISITES). scons works very much like make, converting a set of inputs into a set of outputs one step at a time, and running only the steps necessary to produce the desired output. Thus, to analyze your own data:
Place one or more translated BLAST results using KEGG gene identifiers in the "input" directory (optionally gzipped or bzipped).
Edit the "SConstruct" file; in particular, make sure that the input processors include one configured for your BLAST file name(s) and format(s).
Run the "scons" command, optionally parallelizing multiple analyses using the "-j" flag. Results will be placed in the "output" directory.
HUMAnN is highly configurable in order to perform a collection of very computationally intensive tasks efficiently and flexibly. Please see the included sample input files, metadata files, and SConstruct settings for an overview of the software's configuration and the file formats it consumes and produces.
HUMAnN has no installation per se, but it does depend on the following items:
A network connection. HUMAnN downloads a number of data and software components from standard repositories during exection. Please ensure that a network connection is available at the least for the first run. HUMAnN will use "curl" by default to download files, and this can be changed (e.g. to use "wget") by editing the
A bunch of RAM. Processing one Illumina lane of metagenomic reads can take as much as ~8-10GB of memory, although it will often take much less depending on the data. Consider yourself warned.
Python >= 2.7. As of this writing, we use exactly one feature unique to Python 2.7, math.gamma. If you're willing to edit it out, the dependency is to Python 2.6 for various modules and syntax.
Please note that HUMAnN does not run blastx for you. It instead consumes tabular BLAST results as input. We recommend the default "-outfmt 6" setting as described below and in the provided SConstruct configuration. Alternatively, input processors are also provided for accelerated BLAST implementations such as mapx, mblastx, or usearch.
BioCyc (optional, automatically downloaded)
maq (optional, automatically downloaded)
R (optional, for synthetic performance evaluation)
R package ROCR (optional, for synthetic performance evaluation)
One or more of the following:
Tabular translated BLAST (blastx) output files files matching sequence read IDs to gene IDs.
Mapping (bowtie, bwa, etc.) output in BAM format.
Tab-delimited text containing one or more pre-quantified gene abundances.
Place (or symlink) each file with a .txt, .txt.gz, .txt.bz, .bam, .csv, .tsv, or .pcl extension as appropriate in the "input" directory before running HUMAnN. The pipeline includes processors for three tab-delimited text formats by default (below) and BAM binary format and can easily by modified to accept more.
As an example, the default inputs provided with HUMAnN were generated using:
blastx -outfmt 6 -db 28_kegg_genomes < mock_even_lc.fasta.gz | gzip -c > mock_even_lc.txt.gz
Where "28_kegg_genomes" is a database of the amino acid sequences of 28
well-characterized KEGG organisms' ORFs (.pep files from
ban, bbr, bqu, bsu, cbo, cdf, cje, eco, efa, ftu, hin, hpy, hsl, lmo, mbo, mtu, ngo, nme, pae, rso, sau, sco, sgo, spn, vch, xfa, yen, and yps).
Please modify the SConstruct file as needed to specify the exact format of your input data:
blastx -outfmt 6 ---------------- qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore mapx ---- template-name frame read-name template-start template-end template-length read-start read-end read-length template-protein read-protein alignment identical %identical positive %positive mismatches raw-score bit-score e-score mblastx ------- Query_Id Reference_Id E_Value_Bit_Score Identity_Percentage_Length_of_HSP Number_of_Positives_Frame_#s Alignment_Start_in_Query Alignment_End_in_Query Alignment_Start_in_Reference Alignment_End_in_Reference BAM (bowtie/bwa/etc.) -------------------- qname flag rname pos mapq cigar rnext pnext tlen seq qual usearch ------- usearch can be treated identically to "blastx -outfmt 6" if the "--blast6out" flag is provided.
Additionally, tab-delimited (.csv, .tsv, or .pcl) files of tabulated gene abundances (KEGG KO identifiers by default) can be used, with each column representing one sample and each row one gene family. Any such files placed in the input directory will be split per column and each sample will be run through the HUMAnN pipeline separately.
Three or more files per input including:
The relative abundances of each gene in the input metagenome. By default tagged as type "01". Two columns of tab-delimited text: geneid abundance.
The coverages of each pathway, expressed as a fraction between 0 and 1 inclusive. By default tagged as type "04a". Two columns of tab-delimited text: pathid coverage.
The relative abundances of each pathway. By default tagged as type "04b". Two columns of tab-delimited text: pathid abundance.
Optionally, a table of individual gene abundances appropriate for loading into METAREP. By default tagged as type "99". Five columns of tab-delimited text: geneid abundance e-score %identical identical. The abundance is relative and calculated as in HUMAnN's gene family abundances; e-score, percent identity, and identity length are averaged over all reads mapping to each gene in the input translated BLAST results.
Three or more merged files are also produced, all tab-delimited text:
A table containing the relative abundances of all genes in any input metagenomes. By default named as "01-*.txt".
A table containing the coverages of all pathways in any input files. By default named as "04a-*.txt".
A table containing the relative abundances of all pathways in any input files. By default named as "04b-*.txt".
A table containing the relative abundances of all individual genes in any input files. By default named as "99-*.txt".
Please see below for a detailed description of the ways in which these files are produced and named.
You can reproduce the entire performance evaluation using synthetic communities in the HUMAnN manuscript with the tools in the "synth" subdirectory. This optional step is omitted during default HUMAnN operation but will be automatically incorporated into HUMAnN output if one or more synthetic communities are built. To do this:
Place a .fa/.qual file pair or .fastq file in the "synth/output" directory and edit the "synth/SConstruct" file accordingly. This can be any representative sequencing data from which an error model will be built with which to synthesize artificial reads.
Unfreeze the synthetic data by changing
c_fFrozento False in
SConstruct. Run "scons" and wait quite a while. High-quality genomes will be downloaded from KEGG, shredded into artificial reads using maq, and mixed into a synthetic community.
By default, four communities are built: two with staggered organismal abundances (stg, using a lognormal distribution) and two with even abundances (even), and two with 20 organisms (low complexity, lc) and two with 100 (high complexity, hc).
The true gene and pathway abundances/coverages will be placed in correspondingly named files in the "synth/output" directory. These will be automatically detected the next time HUMAnN is executed using scons. They will be merged into the overall output files, and if correspondingly named input files ("mock_stg_lc", "mock_even_hc", etc.) are available, their performance will be automatically plotted as PDF output files using R.
HUMAnN can also add a level of organism specificity to the output. To do this:
Turn on the feature by changing c_fOrg to True in SConstruct.
Run "scons" as you would on your samples without the organism specificity
HUMAnN's scons workflow runs on the assumption that each input file will be converted into one or more output files by combining a series of pipeable processing modules. It is also set up to allow easy configuration and comparison of output settings, so think of it as a tree: any output processor that can process an input file will. This allows, for example, the quick and easy comparison of the results of metabolic reconstruction using many slightly different parameter settings to determine which is optimal.
Note that while HUMAnN uses primarily KEGG Orthology KO identifiers for genes and KEGG pathway (ko) or module (M) identifiers for pathways, this can be easily modified to suit your analysis needs. For example, code to process MetaCyc reaction identifiers (for gene families) and pathway identifiers is also included by default. Any traceable identifiers (ECs, NOGs, etc.) can be used for genes and for the pathways in which they're contained.
The SConstruct file includes a configurable set of processing modules used to convert each input file into one or more outputs. Processing modules are defined using the CProcessor class with seven arguments:
A file suffix (for input files) or type (for output files). This can be of the form ".txt.gz" for input files or "##" for output files. This pattern must match the filename consumed by the processor.
A file type. This is of the form "##" by default and can be any short tag (numerical or text) used to identify a type of generated file. File types are used to chain processing modules together; files of type ## will only and always be further processed by modules that can input type ##.
An identifier. This is of the form "xyz" by default and can be any short tag (numerical or text) used as a human-readable identifier for the processing module and its specific configuration options. It is appended to the resulting output file and provides a minimal form of data provenance.
A program. This is typically a Python script that consumes an input file on stdin, produces output on stdout, and can optionally include additional command line arguments (see below).
Zero or more supporting files as command line arguments. This array of file names is passed to the processing program as command line arguments (and will be understood by scons as dependencies, thus rebuilding the output when they change).
Zero or more additional command line arguments. This array of arbitrary strings is passed to the processing program as command line arguments (and will be ignored by scons). This can be used to produce multiple output files from the same processing scripts using command line arguments.
A boolean flag, True if the processor should be used for initial input files and False (or omitted) otherwise. In the latter case, the processor will only be used on intermediate files matching its input type.
A boolean flag, True if the output should be gzipped and False (or omitted) otherwise. Subsequent processors will automatically ungzip compressed input files.
By default, the following output file types are produced:
Generated from raw BLAST results.
A condensed binary representation of translated BLAST results, abstracted from and independent of the specific format (blastx/mblastx/mapx) in which they are provided.
Generated from _00-.txt.gz.
Relative gene abundances as calculated from BLAST results in which each read has been mapped to zero or more gene identifiers based on quality of match. Total weight of each read is 1.0, distributed over all gene (KO) matches by quality.
Generated from _01-.txt.
Relative gene abundances distributed over all pathways in which the gene is predicted to occur.
Generated from _02a-.txt.
Relative gene abundances with pathway assignments, taxonomically limited to remove pathways that could only occur in low-abundance/absent organisms.
Generated from _02b-.txt.
Relative gene abundances with pathway assignments, smoothed so that zero means zero and non-zero values are imputed to account for non-observed sequences.
Generated from _03a-.txt.
Relative gene abundances with pathway assignments, gap-filled so that gene/pathway combinations with surprisingly low frequency are imputed to contain a more plausible value.
Generated from _03b-.txt.
Pathway coverage (presence/absence) measure, i.e. relative confidence of each pathway being present in the sample. Values are between 0 and 1 inclusive.
Generated from _03b-.txt.
Pathway abundance measure, i.e. relative "copy number" of each pathway in the sample. On the same relative abundance scale (0 and up) as the original gene abundances _01-*.txt.
Generated from _00-.txt.gz.
Per-sample gene abundance tables formatted for loading into METAREP. On the same relative abundance scale (0 and up) as the original gene abundances _01-*.txt.
Combined pathway coverage matrix for all samples.
Combined pathway abundance matrix for all samples, normalized per column.
Pathway information formatted for input to LefSe (exported)
Pathway information formatted as a tree reflecting the KEGG BRITE hierarchy for plotting in GraPhlAn.
Pathway abundances formatted for overplotting on the KEGG BRITE hierarchy using GraPhlAn.
Combined KO abundance matrix for all samples, normalized per column.
Gene abundance calculated as the confidence (e-value/p-value) weighted sum of all hits for each read.
Gene to pathway assignment performed using MinPath.
Gene to pathway assignment performed naively using all pathways.
Pathway abundances adjusted based on A) taxonomic limitation and B) the expected copy number of each gene in the detected organisms.
Smoothing performed using Witten-Bell discounting, which shifts sum_observed/(sum_observed + num_observed) probability mass into zero counts and reduces others by the same fraction.
Smoothing performed naively by adding a constant value (0.1) to missing gene/pathway combinations.
No gap filling (no-operation, abundances left as is).
Gap filling by substituting any values below each pathway's median with the median value itself.
Pathway coverage calculated as fraction of genes in pathway at or above global median abundance.
Pathway coverage calculated as fraction of genes in pathway at or above global median abundance, with low-abundance pathways set to zero using Xipe.
Pathway abundance calculated as average abundance of the most abundant half of genes in the pathway.
Since HUMAnN's development, KEGG has gone commercial. This is a huge blow to the academic community at large, and an unfortunate inconvenience for HUMAnN in particular. In order to compromise between minimal disruption of HUMAnN operation and respecting KEGG's intellectual property, the following guidelines have been followed:
Files derived from KEGG necessary for normal HUMAnN operation are included in the data directory. These include the basic structure of KEGG pathways and modules, gene lengths, and ID/name mappings.
The more detailed files from KEGG needed for synthetic metagenome construction and evaluation are not included, thus rendering the synth sub-project inoperable without access to KEGG inputs. We apologize for this inconvenience and urge any impacted users to contact KEGG directly to inform them of the problem. Please let us know if you're interested in information on additional workarounds to enable synthetic metagenome evaluation.
Using alternatives to KEGG
The process of using HUMAnN with a database other than KEGG (e.g. COG, NOG, etc.) requires:
A FASTA file of nucleotide sequences against which the meta'ome is searched, each labeled with a gene ID (for KEGG, the genes.pep file distributed with KEGG).
A file of nucleotide sequence lengths, each labeled with a gene ID (for KEGG, data/genels).
A file of gene-to-OG mappings (for KEGG, data/koc).
A file of OG-to-pathway mappings (for KEGG, data/keggc).
Each of those four files should be replaced for HUMAnN to run on the alternative database.
maq installation issues
Some users have reported freezes during the synthetic metagenome maq installation process that are outside of HUMAnN's control. If maq locks up during compilation or installation, feel free to install it directly from the appropriate Sourceforge package in maq-0.7.1/maq (as expected by HUMAnN) and replace the funcUntarMAQ function in synth/SConstruct with
def funcUntarMAQ( target, source, env ): return 0
Note that the synthetic metagenomes are an optional component not needed for normal HUMAnN operation.
If you get the error:
sh: data/MinPath/glpk-4.6/examples/glpsol: cannot execute binary file
Minpath comes bundled with a version of glpk that will only run on Linux; HUMAnN can't easily replace this itself "on the fly." To fix the problem and run HUMAnN on MacOS X, after this error is produced, remove MinPath's glpk folder and replace it with a version compiled for Mac from here:
Please do not provide HUMAnN with any input files including _ followed by a
an_00_example. HUMAnN uses filename tags of
this form to indicate output from each of its processing stages (as
described above), and input files that include such patterns should be
renamed before HUMAnN execution.
- Initial modularization of HUMAnN code
- First application to HMP production data (mblastx)
- Refinement of documentation and KEGG/MetaCyc version updates
- Addition of METAREP output module
- Addition of full (albeit crummy) CFG parser for KEGG modules
- Improvement to module FPs using X2 for enzyme presence/absence calls
- Add default calculation of basic ecology metrics: Shannon, Simpson, Pielou, and richness
- Improve memory usage in METAREP output generation
- Make default Xipe parameters slightly more lenient (retain a few more nonzero pathways)
- Improve gold standard generation in synthetic datasets by retaining full KEGG pathway set
- Fix a problem with non-pathway BRITE entities in synthetic gold standard
- Fix a typo in hits2enzymes.py (only affected unused filter option)
- Add complete parameter evaluation process to HMP pipeline
- MAJOR CHANGE: KEGG is now defunct, and HUMAnN has been updated accordingly
- KEGG derived information needed for normal operation is included
- KEGG files needed for synthetic metagenome construction are not included
- "Frozen" synthetic metagenome evaluation is still possible
- Please contact us directly for more information if needed
- Add documentation on potential maq issues (thanks to Shinichi Sunagawa!)
- Fix a typo in fastq2fasta.py formatting (thanks to Shinichi Sunagawa!)
- Fix a typo in module2modulec.py formatting (thanks to Kathryn Iverson!)
- Fix a typo in eco.py for overly sparse input files (thanks to Jeffrey Werner!)
- Work around Mac OS X zcat issues (thanks to Jeffrey Werner!)
- Add several internal evaluation pipelines in response to initial reviews
- Fix hits2*.py handling of zero/very small e-values (thanks to Fah Sathira!)
- Fix missing exclude.py (thanks to Brandi Cantarel!)
- Allow module2modulec.py to remove unusual duplicate enzymes from KEGG's files
- Allow input filenames to contain underscores
- Fix module size calculation in filter.py
- Fix a bug in hits2enzymes.py to allow a wider range of KEGG gene name detection
- Completely revamped pathway coverage calculation, much more accurate for low-abundance events (thanks to Kat Huang, Sean Sykes!)
- Fixed hits2metarep.py handling of empty hits files (thanks to Pavan Kumar!)
- Fixed missing KO gloss annotations in merged 01b*.txt per-gene tabular abundance quantification output files.
- Added tab-delimited input file formats.
- Added GraPhlAn tree output files to enable visualization of abundance overlays on KEGG hierarchies (thanks to Jovian Yu, Morgan Paull!).
- Added preliminary organism-specific output generation.