Large-scale Benchmarking of Microbial Multivariable Association Methods
Himel Mallick 2018-12-12
- Simulation Designs
- Simulation Methods
- Testing A New Method
- Folder descriptions
With the recent development of modern, massively parallel DNA sequencing technologies, there is a growing interest in developing robust and powerful statistical methods to uncover relationships between organisms and multivariate clinical covariates of their hosts (e.g. disease status, diet, and lifestyle variables, among others). Existing methods for microbe-metadata association have primarily focused on univariable (pairwise) association. While some of these methods can also perform multivariable association (e.g. edgeR, DESeq2, metagenomeSeq, limma-voom, and MaAsLin), there is a surprising dearth of systematic studies aimed at multiple covariates, with no clear consensus on the appropriate method for scalable epidemiological studies. As microbiome data continue to accumulate, there is a pressing need for a rigorous comparison of these multivariable tools to help guide researchers and practitioners alike. To this end, a large-scale benchmarking of microbial multivariable association tests is performed, essentially extending the differential abundance benchmarking of McMurdie and Holmes (2014), Weiss et al. (2017), and Hawinkel et al. (2017) to multivariable association analysis.
The basic idea is to input a table of feature counts (e.g. taxa counts or gene/transcript/metabolite abundances on a per sample basis) and a table of metadata (such as clinical covariates on a per sample basis) and evaluate performance across different statistical models with varied normalization and transformation schemes. This is done in two stages. First,
Load_Simulator.R is executed with desired parameters (details below) to generate synthetic datasets with user-defined spike-in information (both cross-sectional and repeated-measures designs are supported). Null synthetic metagenomic counts are first generated using sparseDOSSA and subsequently combined with the user-defined metadata information for spiking-in. Next,
Simulation_Main.R script is run, which upon completion creates an output file containing various performance metrics (e.g. False Discovery Rate (FDR), False Positive Rate (FPR), and Area Under the Curve (AUC), among others), based on replicated runs (typically hundreds of iterations).
At the top level, we experiment with different metadata structures (metadataType). For each metadataType, we vary a combination of other parameters such as number of subjects (nSubjects), number of samples per subject (nPerSubject), number of microbes (nMicrobes), proportion of spiked-in microbes (spikeMicrobes), number of metadata (nMetadata), proportion of spiked-in metadata (spikeMetadata), and effect size (effectSize). The metadata structures are as follows:
| metadataType | Type of Association | Description |
|--------------|---------------------|-------------|
| UVA | Univariable continuous | One continuous metadata |
| UVB | Univariable binary | One dichotomous (binary) metadata |
| MVA | Multivariable mixed, independent | 50% binary and 50% continuous, independent |
| MVB | Multivariable mixed, collinear | 50% binary and 50% continuous, correlated |
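As a rough illustration, the four designs above can be mimicked in a few lines of base R. This is only a hedged sketch, not the actual sparseDOSSA spiking code; the function name, the latent-factor construction for MVB, and all defaults (e.g. `rho`, and `nMetadata >= 2` for the multivariable designs) are assumptions.

```r
# Illustrative generator for the four metadataType designs (assumed names/defaults)
simulate_metadata <- function(metadataType, nSubjects, nMetadata = 5, rho = 0.5) {
  if (metadataType == "UVA") {
    # One continuous metadata
    data.frame(Metadata1 = rnorm(nSubjects))
  } else if (metadataType == "UVB") {
    # One dichotomous (binary) metadata
    data.frame(Metadata1 = rbinom(nSubjects, 1, 0.5))
  } else if (metadataType == "MVA") {
    # 50% binary and 50% continuous, mutually independent
    half <- nMetadata %/% 2
    cbind(as.data.frame(replicate(half, rbinom(nSubjects, 1, 0.5))),
          as.data.frame(replicate(nMetadata - half, rnorm(nSubjects))))
  } else if (metadataType == "MVB") {
    # 50% binary and 50% continuous, correlated through a shared latent factor
    z <- rnorm(nSubjects)
    half <- nMetadata %/% 2
    bin <- replicate(half,
                     as.integer(rho * z + sqrt(1 - rho^2) * rnorm(nSubjects) > 0))
    con <- replicate(nMetadata - half,
                     rho * z + sqrt(1 - rho^2) * rnorm(nSubjects))
    as.data.frame(cbind(bin, con))
  }
}
```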
For each of the above scenarios (metadataType), we vary the metadata-agnostic parameters as follows:
- nSubjects: 10, 20, 50, 100, 200
- nMicrobes: 100, 200, 500
- effectSize: 1, 2, 5, 10
In addition, for multiple covariates, we vary the metadata parameters nMetadata and spikeMetadata (for univariable association both are fixed at 1):
Similarly, for repeated measures, we vary the parameter nPerSubject (fixed at 1 for cross-sectional designs):
We use the following naming convention: <Model>.<Normalization>.<Transformation>, i.e. for each class of methods we apply a set of models that can take different input tables based on user-defined normalization and/or transformation methods. For example, LM.TSS.LOG means a linear model (LM) with TSS normalization and LOG transformation is desired. If no normalization/transformation is applied or a default pipeline is implemented, the corresponding entries are blank (e.g. metagenomeSeq, edgeR, DESeq2, etc.). Both fixed effects and random effects models are implemented for each method (when applicable). For GLM-based models, the logarithm of the normalized library size (with the exception of TSS, for which the original library size is retained) is included as an offset parameter.
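For intuition, here is a minimal sketch of what an LM.TSS.LOG combination amounts to, assuming a samples-by-features count matrix and a row-matched metadata data.frame. The function name and pseudocount are illustrative, and the GLM offset handling mentioned above is omitted.

```r
# Sketch of LM.TSS.LOG: TSS normalization -> LOG transformation -> per-feature lm
run_LM.TSS.LOG <- function(features, metadata) {
  # TSS normalization: divide each sample (row) by its library size
  tss <- sweep(features, 1, rowSums(features), "/")
  # LOG transformation with a small pseudocount for zeros (pseudocount is an assumption)
  logged <- log(tss + 1e-6)
  # One linear model per feature, with all covariates entered jointly
  lapply(seq_len(ncol(logged)), function(j) {
    fit <- lm(logged[, j] ~ ., data = metadata)
    summary(fit)$coefficients  # per-covariate estimates and p-values
  })
}
```

Swapping the normalization, transformation, or model in this pipeline yields the other `<Model>.<Normalization>.<Transformation>` combinations.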
In summary, the following normalization methods are tested:
CSS, RLE, TMM, TSS, CLR
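Of these, TSS and CLR are simple enough to sketch in base R (CSS, RLE, and TMM come from dedicated packages: metagenomeSeq, DESeq2, and edgeR, respectively). The pseudocount below is an assumption.

```r
# TSS: divide each sample (row) by its library size
normalize_TSS <- function(counts) sweep(counts, 1, rowSums(counts), "/")

# CLR: centre log-transformed counts by the per-sample log mean
normalize_CLR <- function(counts, pseudo = 0.5) {
  logged <- log(counts + pseudo)            # pseudocount handles zeros
  sweep(logged, 1, rowMeans(logged), "-")
}
```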
In addition, the following transformation methods are tested:
AST, LOG, LOGIT
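These transformations, assuming inputs are relative abundances in [0, 1] (e.g. TSS-normalized), can be written as follows; the pseudocount choices are assumptions.

```r
# AST: variance-stabilizing arcsine square-root transformation
AST   <- function(p) asin(sqrt(p))

# LOG: log with a small pseudocount to handle zeros
LOG   <- function(p, eps = 1e-6) log(p + eps)

# LOGIT: log-odds, with proportions clamped away from 0 and 1
LOGIT <- function(p, eps = 1e-6) {
  p <- pmin(pmax(p, eps), 1 - eps)
  log(p / (1 - p))
}
```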
A complete list of tested methods is as follows:
ANCOM, CPLM (Compound Poisson Linear Model), DESeq2, edgeR, limma, limmaVOOM, LM, metagenomeSeq, metagenomeSeq2, negbin, Spearman, Wilcoxon, ZIB (Zero-inflated Beta), ZICP (Zero-inflated Compound Poisson), ZINB (Zero-inflated Negative Binomial)
In addition to the above core scenarios, a subset of noncore scenarios is also evaluated to investigate the effect of (i) rarefaction, (ii) filtering, (iii) multiple testing adjustment method, (iv) pseudocounts, (v) nMetadata, (vi) zero-inflation, (vii) library size, (viii) multicollinearity, and (ix) data-generating model.
Sample commands are as follows:
Rscript Load_Simulator.R --metadataType UVB --nSubjects 100 --nPerSubject 5 --nMicrobes 200 --spikeMicrobes 0.1 --nMetadata 1 --spikeMetadata 1 --effectSize 20
Rscript Simulation_Main.R --methodName LM.TSS --metadataType UVB --nSubjects 100 --nPerSubject 5 --nMicrobes 200 --spikeMicrobes 0.1 --nMetadata 1 --spikeMetadata 1 --effectSize 20
Running the above will produce an output file in the Output folder, containing various performance metrics for each test, which can be loaded into R, plotted, and interpreted as desired.
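As a hypothetical illustration of that downstream summarization (the actual column names and format of the Output file may differ), the toy data.frame below merely stands in for results loaded via, e.g., read.csv or readRDS:

```r
# Toy stand-in for loaded benchmark output (column names are assumptions)
results <- data.frame(
  methodName = rep(c("LM.TSS.LOG", "CPLM"), each = 3),
  AUC = c(0.81, 0.79, 0.83, 0.76, 0.74, 0.78),
  FDR = c(0.04, 0.06, 0.05, 0.09, 0.08, 0.10)
)

# Average the metrics over replicated runs, per method
summary_by_method <- aggregate(cbind(AUC, FDR) ~ methodName,
                               data = results, FUN = mean)
```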
Testing A New Method
To test a new method, one can add a function to run the test and save it in the Library folder. As input, it should take two data.frame objects as above (with matched rows) containing features and metadata, respectively, and output a feature-by-metadata table of coefficients, p-values, and q-values. Examples can be found in the Library folder.
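A minimal skeleton conforming to this interface might look as follows; the per-feature linear model and the BH adjustment are only placeholders for the new method's actual machinery.

```r
# Skeleton for a new method: two row-matched data.frames in,
# one feature-by-metadata table of coef, pval, and qval out.
fit.newMethod <- function(features, metadata) {
  out <- do.call(rbind, lapply(names(features), function(f) {
    # Placeholder fit: replace lm() with the new method's model
    fit <- summary(lm(features[[f]] ~ ., data = metadata))$coefficients
    fit <- fit[rownames(fit) != "(Intercept)", , drop = FALSE]
    data.frame(feature   = f,
               metadata  = rownames(fit),
               coef      = fit[, "Estimate"],
               pval      = fit[, "Pr(>|t|)"],
               row.names = NULL)
  }))
  # Multiplicity correction across all tests (BH is an assumption)
  out$qval <- p.adjust(out$pval, method = "BH")
  out
}
```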
- Two main simulation functions, typically called from the terminal with command-line arguments: Load_Simulator.R (for generating synthetic abundances, accompanied by spiked-in metadata) and Simulation_Main.R (for running univariable and multivariable tests).
- Library - Utility functions and individual method scripts.
- Folder to store all synthetic datasets generated by Load_Simulator.R.
- Output - Folder to store output from Simulation_Main.R.
- Shell Scripts - Examples of how to parallelize the scripts with bash to obtain all parameter combinations on a high-performance computing cluster such as Harvard Odyssey.
- Workflow Files - Examples of how to automate the above parallelization using AnADAMA2.
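For the Shell Scripts style of parallelization, a parameter sweep can be sketched as below. This is a hypothetical dry run that only echoes the commands; in practice each line would be submitted to the cluster scheduler (e.g. via sbatch on SLURM), and the parameter grids here are abbreviated.

```shell
# Dry run: enumerate one Rscript command per parameter combination
njobs=0
for metadataType in UVA UVB; do
  for nSubjects in 50 100; do
    for effectSize in 5 10; do
      echo "Rscript Simulation_Main.R --methodName LM.TSS --metadataType ${metadataType} --nSubjects ${nSubjects} --nPerSubject 1 --nMicrobes 200 --spikeMicrobes 0.1 --nMetadata 1 --spikeMetadata 1 --effectSize ${effectSize}"
      njobs=$((njobs + 1))
    done
  done
done
echo "generated ${njobs} jobs"
```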
McMurdie, P. J., and Holmes, S. (2014). Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Computational Biology, 10(4):e1003531.
Weiss, S. et al. (2017). Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome, 5:27.
Hawinkel, S. et al. (2017). A broken promise: microbiome differential abundance methods do not control the false discovery rate. Briefings in Bioinformatics, bbx104.
Mallick, H. et al. (2019+). Multivariable association in population-scale meta'omic surveys (In Preparation).