Large-scale Benchmarking of Microbial Multivariable Association Methods

Himel Mallick 2018-12-12


With the recent development of modern, massively parallel DNA sequencing technologies, there is growing interest in robust and powerful statistical methods to uncover relationships between organisms and multivariate clinical covariates of their hosts (e.g. disease status, diet, and lifestyle variables, among others). Existing methods for microbe-metadata association have primarily focused on univariable (pairwise) testing. While some of these methods can also perform multivariable association (e.g. edgeR, DESeq2, metagenomeSeq, limma-voom, and MaAsLin), there is a surprising dearth of systematic studies of multiple-covariate settings, with no clear consensus on the appropriate method for scalable epidemiological studies. As microbiome data continue to accumulate, there is a pressing need for a rigorous comparison of these multivariable tools to help guide researchers and practitioners alike. To this end, a large-scale benchmarking of microbial multivariable association tests is performed, essentially extending the differential abundance benchmarking of McMurdie and Holmes (2014), Weiss et al. (2017), and Hawinkel et al. (2017) to multivariable association analysis.

The basic idea is to input a table of feature counts (e.g. taxa counts or gene/transcript/metabolite abundances on a per-sample basis) and a table of metadata (such as clinical covariates on a per-sample basis) and evaluate performance across different statistical models with varied normalization and transformation schemes. This is done in two stages. First, Load_Simulator.R is executed with the desired parameters (details below) to generate synthetic datasets with user-defined spike-in information (both cross-sectional and repeated measures designs are supported). The null synthetic metagenomic counts are first generated using sparseDOSSA and subsequently combined with the user-defined metadata information for spiking-in. Next, the Simulation_Main.R script is run, which upon completion creates an output file containing various performance metrics (e.g. False Discovery Rate (FDR), False Positive Rate (FPR), and Area Under the Curve (AUC), among others), based on replicated runs (typically hundreds of iterations).
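With the spike-in ground truth in hand, the per-replicate error rates follow the usual confusion-matrix definitions. As a minimal illustration (the repository itself is written in R; the Python below, including all variable and function names, is purely illustrative):

```python
# Sketch: evaluating one simulation replicate against the spike-in ground truth.
# `truth` marks which features were spiked in; `called` marks which features a
# method flagged as significant (e.g. q-value < 0.05). Names are hypothetical.

def confusion_metrics(truth, called):
    """Return (FDR, FPR) for one replicate, given matched boolean lists."""
    tp = sum(t and c for t, c in zip(truth, called))
    fp = sum((not t) and c for t, c in zip(truth, called))
    tn = sum((not t) and (not c) for t, c in zip(truth, called))
    fdr = fp / (fp + tp) if (fp + tp) > 0 else 0.0  # false discovery rate
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0  # false positive rate
    return fdr, fpr

truth  = [True, True, False, False, False]   # 2 of 5 features spiked in
called = [True, False, True, False, False]   # one true hit, one false positive
fdr, fpr = confusion_metrics(truth, called)
print(fdr, fpr)  # 0.5 0.3333333333333333
```

In the actual pipeline these quantities are averaged over the replicated runs, and the AUC is computed from the full ranking of p-values rather than a single significance cutoff.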

Simulation Designs

At the top level, we experiment with different metadata structures (metadataType). For each metadataType, we vary a combination of other parameters such as number of subjects (nSubjects), number of samples per subject (nPerSubject), number of microbes (nMicrobes), proportion of spiked-in microbes (spikeMicrobes), number of metadata (nMetadata), proportion of spiked-in metadata (spikeMetadata), and effect size (effectSize). The metadata structures are as follows:

metadataType   Type of Association                Description
UVA            Univariable continuous             One continuous metadata
UVB            Univariable binary                 One dichotomous (binary) metadata
MVA            Multivariable mixed, independent   50% binary and 50% continuous, independent
MVB            Multivariable mixed, collinear     50% binary and 50% continuous, correlated
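To make the four designs concrete, here is a minimal sketch of how such metadata could be generated. The repository generates data in R via sparseDOSSA; this Python fragment, including the MVB correlation strength of 0.7, is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_metadata(n_subjects, n_metadata, metadata_type):
    """Sketch of the four metadata designs (one row per subject)."""
    if metadata_type == "UVA":                    # one continuous covariate
        return rng.standard_normal((n_subjects, 1))
    if metadata_type == "UVB":                    # one binary covariate
        return rng.integers(0, 2, (n_subjects, 1)).astype(float)
    # MVA/MVB: 50% binary and 50% continuous covariates
    rho = 0.0 if metadata_type == "MVA" else 0.7  # MVB correlation (0.7 is illustrative)
    cov = np.full((n_metadata, n_metadata), rho)
    np.fill_diagonal(cov, 1.0)
    latent = rng.multivariate_normal(np.zeros(n_metadata), cov, size=n_subjects)
    half = n_metadata // 2
    latent[:, :half] = (latent[:, :half] > 0).astype(float)  # dichotomize first half
    return latent

X = simulate_metadata(100, 4, "MVB")
print(X.shape)  # (100, 4)
```

Dichotomizing half of a correlated multivariate normal draw is one simple way to obtain mixed binary/continuous covariates that remain collinear, which is the distinguishing property of MVB versus MVA.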

For each of the above scenarios (metadataType), we vary the metadata-agnostic parameters as follows:

  • nSubjects: 10, 20, 50, 100, 200

  • nMicrobes: 100, 200, 500

  • spikeMicrobes: 0.1

  • effectSize: 1, 2, 5, 10

In addition, for multiple covariates, we set the following metadata-specific parameters (for univariable association these values are fixed at 1):

  • nMetadata: 5

  • spikeMetadata: 0.2

Similarly, for repeated measures, we set the following parameter (fixed at 1 for cross-sectional designs):

  • nPerSubject: 5
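Taken together, the core settings form a Cartesian product over the varied parameters. A quick sketch of the grid size (illustrative Python, not part of the repository):

```python
from itertools import product

# Core parameter grid as described above (spikeMicrobes is fixed at 0.1,
# and the metadata/repeated-measures parameters take their fixed values).
metadata_types = ["UVA", "UVB", "MVA", "MVB"]
n_subjects     = [10, 20, 50, 100, 200]
n_microbes     = [100, 200, 500]
effect_sizes   = [1, 2, 5, 10]

grid = list(product(metadata_types, n_subjects, n_microbes, effect_sizes))
print(len(grid))  # 4 * 5 * 3 * 4 = 240 core combinations
```

Each of these combinations is then run for many replicates, which is why the shell and workflow examples below parallelize over the grid.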

Simulation Methods

We use the following naming convention: <Model>.<Normalization>.<Transformation>, i.e. for each class of methods, we apply a set of models that can take different input tables based on user-defined normalization and/or transformation methods. For example, LM.TSS.LOG denotes a linear model (LM) with TSS normalization and LOG transformation. If no normalization/transformation is applied or a default pipeline is implemented, the corresponding entries are blank (e.g. metagenomeSeq, edgeR, DESeq2, etc.). Both fixed effects and random effects models are implemented for each method (when applicable). For GLM-based models, the logarithm of the normalized library size (with the exception of TSS, for which the original library size is retained) is included as an offset parameter.
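The convention can be made concrete with a small parser (illustrative Python; the repository handles method names in R):

```python
def parse_method_name(name):
    """Split '<Model>.<Normalization>.<Transformation>' into its parts.

    Components omitted for default pipelines (e.g. 'DESeq2') come back
    as empty strings, mirroring the blank entries in the convention.
    """
    parts = name.split(".")
    parts += [""] * (3 - len(parts))  # pad absent components
    return tuple(parts[:3])

print(parse_method_name("LM.TSS.LOG"))  # ('LM', 'TSS', 'LOG')
print(parse_method_name("DESeq2"))      # ('DESeq2', '', '')
```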

In summary, a set of normalization methods (e.g. TSS) is tested, along with a set of transformation methods (e.g. LOG).

The tested methods include the models mentioned above (e.g. LM, edgeR, DESeq2, metagenomeSeq, and limma-voom) as well as, among others:

  • CPLM (Compound Poisson Linear Model)

  • ZIB (Zero-inflated Beta)

  • ZICP (Zero-inflated Compound Poisson)

  • ZINB (Zero-inflated Negative Binomial)

In addition to the above core scenarios, a subset of non-core scenarios is also evaluated to investigate the effect of (i) rarefaction, (ii) filtering, (iii) multiple testing adjustment method, (iv) pseudocounts, (v) nPerSubject, (vi) nMetadata, (vii) zero-inflation, (viii) library size, (ix) multicollinearity, and (x) data-generating model.


The sample codes are as follows:

Rscript Load_Simulator.R --metadataType UVB --nSubjects 100 --nPerSubject 5 --nMicrobes 200 --spikeMicrobes 0.1 --nMetadata 1 --spikeMetadata 1 --effectSize 20 
Rscript Simulation_Main.R --methodName LM.TSS --metadataType UVB --nSubjects 100 --nPerSubject 5 --nMicrobes 200 --spikeMicrobes 0.1 --nMetadata 1 --spikeMetadata 1 --effectSize 20 


Running the above will produce an output file in the Output folder, containing various performance metrics for each test, which can be loaded in R and plotted and interpreted as desired.

Testing A New Method

To test a new method, add a function that runs the test and save it in the Library folder. As input, it should take two data.frame objects as above (with matched rows), containing features and metadata respectively, and return a feature-by-metadata table of coefficients, p-values, and q-values. Examples can be found in the Library folder.
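For illustration only, the expected input/output contract can be sketched as follows, with a per-pair simple linear regression standing in for a real method and Benjamini-Hochberg q-values computed across all tests (Python with hypothetical names; the repository's methods are implemented in R):

```python
import numpy as np
import pandas as pd
from scipy import stats

def run_new_method(features, metadata):
    """Sketch of the expected interface.

    `features` (samples x features) and `metadata` (samples x covariates)
    share matched rows; the output has one row per (feature, metadatum)
    pair with a coefficient, p-value, and q-value.
    """
    rows = []
    for f in features.columns:
        for m in metadata.columns:
            res = stats.linregress(metadata[m], features[f])
            rows.append({"feature": f, "metadata": m,
                         "coef": res.slope, "pval": res.pvalue})
    out = pd.DataFrame(rows)

    # Benjamini-Hochberg q-values across all (feature, metadatum) tests
    p = out["pval"].to_numpy()
    n = len(p)
    order = np.argsort(p)
    adjusted = p[order] * n / (np.arange(n) + 1)
    q_sorted = np.minimum.accumulate(adjusted[::-1])[::-1]  # enforce monotonicity
    qvals = np.empty_like(q_sorted)
    qvals[order] = q_sorted
    out["qval"] = np.clip(qvals, 0.0, 1.0)
    return out

# Deterministic toy data with matched rows (purely illustrative)
features = pd.DataFrame({
    "taxon1": np.arange(30, dtype=float),
    "taxon2": np.arange(30, dtype=float)[::-1],
    "taxon3": np.sin(np.arange(30, dtype=float)),
})
metadata = pd.DataFrame({
    "age": np.linspace(20, 60, 30),
    "case": np.tile([0.0, 1.0], 15),
})
result = run_new_method(features, metadata)
print(result.shape)  # (6, 5)
```

Any function honoring this contract (matched inputs in, long-format coefficient/p-value/q-value table out) can then be evaluated by Simulation_Main.R alongside the built-in methods.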

Folder descriptions

Codes - Two main simulation functions typically called from terminal with command-line arguments: Load_Simulator.R (for generating synthetic abundances, accompanied by spiked-in metadata) and Simulation_Main.R (univariable and multivariable tests).

Library - Utility functions and individual method scripts.

Input - Folder to store all synthetic data sets generated by Load_Simulator.R.

Output - Folder to store output from Simulation_Main.R.

Shell Scripts - Examples of how to parallelize the scripts with bash to obtain all parameter combinations on a high-performance computing cluster such as Harvard Odyssey.

Workflow Files - Examples of how to automate the above parallelization using ANADAMA2.


References

McMurdie, P. J., and Holmes, S. (2014). Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Computational Biology, 10(4):e1003531.

Weiss, S. et al. (2017). Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome, 5:27.

Hawinkel, S. et al. (2017). A broken promise: microbiome differential abundance methods do not control the false discovery rate. Briefings in Bioinformatics, bbx104.


Mallick, H. et al. (2019+). Multivariable association in population-scale meta'omic surveys (In Preparation).