MaAsLin2 User Manual
MaAsLin2 is the next generation of MaAsLin.
MaAsLin is a multivariate statistical framework that finds associations between clinical metadata and potentially high-dimensional experimental data.
If you use the MaAsLin2 software, please cite our manuscript: Himel Mallick, Timothy L. Tickle, Lauren J. McIver, Gholamali Rahnavard, George Weingart, Joseph N. Paulson, Siyuan Ma, Boyu Ren, Emma Schwager, Ayshwarya Subramanian, Eric A. Franzosa, Hector Corrada Bravo, Curtis Huttenhower. "Multivariable Association in Population-scale Meta'omic Surveys" (In Preparation).
If you have questions, please email the MaAsLin Users Google Group.
MaAsLin2 was developed to find associations between microbiome multi'omics features and complex metadata in population-scale epidemiological studies. The software includes multiple analysis methods, normalization, and transform options to customize analysis for your specific study.
MaAsLin2 is an R package that can be run on the command line or as an R function. It requires the following R packages, available from Bioconductor and CRAN (the Comprehensive R Archive Network). Please install these packages before running MaAsLin2.
- Bioconductor packages
- edgeR: Empirical Analysis of Digital Gene Expression Data in R
- metagenomeSeq: Statistical analysis for sparse high-throughput sequencing
- These packages can be installed through Bioconductor by first sourcing biocLite with
source("https://bioconductor.org/biocLite.R") and then installing each package with biocLite, for example biocLite('edgeR').
- CRAN packages
- pscl: Political Science Computational Laboratory
- pbapply: Adding Progress Bar to '*apply' Functions
- car: Companion to Applied Regression
- dplyr: A Grammar of Data Manipulation
- vegan: Community Ecology Package
- chemometrics: Multivariate Statistical Analysis in Chemometrics
- ggplot2: Create Elegant Data Visualizations Using the Grammar of Graphics
- pheatmap: Pretty Heatmaps
- cplm: Compound Poisson Linear Models
- logging: R logging package
- data.table: Fast aggregation of large data
- lmerTest: Tests in Linear Mixed Effects Models
- These packages can be installed in R with
install.packages('pscl') or from the command line
$ R -q -e "install.packages('pscl', repos='http://cran.r-project.org')" individually (for those packages that you do not yet have installed) or as a set by providing the complete list as a vector.
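As a hedged sketch of the "as a set" approach, the snippet below compares the CRAN dependency list (taken from the install command later in this manual) against the packages already installed and collects only the missing ones into a single vector. The helper name `missing_packages` is hypothetical and not part of MaAsLin2, and the install call is left commented out so the snippet has no side effects.

```r
# CRAN dependencies, copied from the install command later in this manual.
cran_deps <- c("lmerTest", "pscl", "pbapply", "car", "dplyr", "vegan",
               "chemometrics", "ggplot2", "pheatmap", "cplm", "hash",
               "logging", "data.table", "MASS", "MuMIn")

# Hypothetical helper (not part of MaAsLin2): return the packages in
# `required` that are not in `installed`.
missing_packages <- function(required,
                             installed = rownames(installed.packages())) {
  setdiff(required, installed)
}

# Packages from the list that are not yet installed on this machine.
to_install <- missing_packages(cran_deps)
print(to_install)

# Uncomment to install everything that is missing in a single call.
# if (length(to_install) > 0) {
#   install.packages(to_install, repos = "http://cran.r-project.org")
# }
```

Passing the whole vector to one `install.packages` call avoids repeating the command once per package.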
MaAsLin2 can be run from the command line or as an R function. If only running from the command line, you do not need to install the MaAsLin2 package but you will need to install the MaAsLin2 dependencies.
From command line
- Download the source: MaAsLin2.tar.gz
- Decompress the download:
$ tar xzvf maaslin2.tar.gz
- Install the Bioconductor dependencies:
$ R -q -e "source('https://bioconductor.org/biocLite.R'); biocLite('edgeR'); biocLite('metagenomeSeq')"
- Install the CRAN dependencies:
$ R -q -e "install.packages(c('lmerTest','pscl','pbapply','car','dplyr','vegan','chemometrics','ggplot2','pheatmap','cplm','hash','logging','data.table','MASS','MuMIn'), repos='http://cran.r-project.org')"
- Install the MaAsLin2 package (only required if running as an R function):
$ R CMD INSTALL maaslin2
From R
- Install devtools:
- Install the Bioconductor dependencies:
> source('https://bioconductor.org/biocLite.R'); biocLite('edgeR'); biocLite('metagenomeSeq')
- Install MaAsLin2 (and also all dependencies from CRAN):
> devtools::install_bitbucket("biobakery/maaslin2@default", ref="0.2")
How to Run
MaAsLin2 can be run from the command line or as an R function. Both methods require the same arguments, have the same options, and use the same default settings.
To run from the command line:
$ Maaslin2.R $DATA $METADATA $OUTPUT
- Provide the full path to the MaAsLin2 executable (i.e., ./R/Maaslin2.R if you are in the source folder).
- Replace $DATA with the path to your data (or features) file.
- Replace $METADATA with the path to your metadata file.
- Replace $OUTPUT with the path to the folder to write the output.
To run from R as a function:
$ R
> library(Maaslin2)
> fit_data <- Maaslin2(data, metadata, output)
MaAsLin2 requires two input files.
- Data (or features) file
- This file is tab-delimited, with features as columns and samples as rows (the transpose is also okay).
- Possible features in this file include taxonomic or gene abundances.
- Metadata file
- This file is tab-delimited, with metadata as columns and samples as rows (the transpose is also okay).
- Possible metadata in this file include gender or age.
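To make the two layouts concrete, here is a minimal sketch that builds a toy feature table and a toy metadata table, writes them as tab-delimited files, and reads one back. All sample names, feature names, values, and file names are invented for illustration, and the use of row names as sample identifiers is an assumption about how the tables are keyed.

```r
# Toy feature table: samples as rows, features as columns (values are made up).
features <- data.frame(
  Bacteroides = c(120, 30, 0),
  Prevotella  = c(5, 200, 40),
  row.names   = c("sample1", "sample2", "sample3")
)

# Toy metadata table keyed by the same sample identifiers.
metadata <- data.frame(
  age       = c(34, 51, 27),
  gender    = c("F", "M", "F"),
  row.names = c("sample1", "sample2", "sample3"),
  stringsAsFactors = FALSE
)

# Write both tables as tab-delimited text. col.names = NA adds an empty
# header cell above the column of sample identifiers.
data_path <- file.path(tempdir(), "toy_data.tsv")
metadata_path <- file.path(tempdir(), "toy_metadata.tsv")
write.table(features, data_path, sep = "\t", quote = FALSE, col.names = NA)
write.table(metadata, metadata_path, sep = "\t", quote = FALSE, col.names = NA)

# Read the data file back to confirm the layout survives the round trip.
features_check <- read.table(data_path, header = TRUE, sep = "\t",
                             row.names = 1)

# The transposed layout (features as rows, samples as columns) carries the
# same information.
features_transposed <- t(features)
print(features_transposed)
```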
The data file can contain samples that are not in the metadata file, and vice versa. In both cases, samples that do not appear in both files are removed from the analysis. The samples also do not need to be in the same order in the two files.
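The sample matching described above can be sketched in base R as follows. This illustrates the idea only and is not MaAsLin2's internal code; the toy tables and the use of row names as sample identifiers are assumptions.

```r
# Toy tables keyed by sample identifier (row names); values are made up.
features <- data.frame(
  taxonA    = c(10, 0, 7, 3),
  row.names = c("s1", "s2", "s3", "s4")
)
metadata <- data.frame(
  age       = c(40, 25, 61),
  row.names = c("s3", "s1", "s5")  # different order, one unmatched sample
)

# Keep only the samples present in both tables.
shared <- intersect(rownames(features), rownames(metadata))

# Index both tables by the shared names so their rows line up.
features_matched <- features[shared, , drop = FALSE]
metadata_matched <- metadata[shared, , drop = FALSE]

# Samples that appear in only one table and are therefore dropped.
dropped <- setdiff(union(rownames(features), rownames(metadata)), shared)
print(dropped)
```

Indexing by name rather than by position is what makes the original row order of the two tables irrelevant.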
NOTE: If running MaAsLin2 as a function, the data and metadata inputs can be of type data.frame instead of a path to a file.
MaAsLin2 generates two types of output files: data and visualization.
- Data output files
all_results.tsv: This file contains all of the association results ordered by increasing q-value.
significant_results.tsv: This file is a subset of the data in the first file. It only includes those associations with q-values less than or equal to the significance threshold.
residuals.rds: This file contains a data frame with residuals for each feature analyzed from the model selected.
maaslin2.log: This file contains all of the debug information for the run. It includes all settings, warnings, errors, and steps run.
- Visualization output files
heatmap.pdf: This file contains a heatmap of the significant associations.
[0-9]+.pdf: These files are scatter plots with one generated for each significant association.
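The relationship between the two results tables can be illustrated with a small base R sketch: q-values are computed from p-values, the full table is ordered by increasing q-value, and the significant subset keeps rows at or below the threshold. The table contents and the column names (`Feature`, `Metadata`, `P.value`, `Q.value`) are assumptions made for illustration and may not match the actual file headers; the BH method and the 0.25 threshold are the defaults listed in the help output.

```r
# Made-up association results; the column names are assumptions.
results <- data.frame(
  Feature  = c("taxonA", "taxonB", "taxonC", "taxonD"),
  Metadata = c("age", "age", "gender", "gender"),
  P.value  = c(0.20, 0.001, 0.04, 0.90),
  stringsAsFactors = FALSE
)

# Benjamini-Hochberg correction (the default correction method, "BH").
results$Q.value <- p.adjust(results$P.value, method = "BH")

# All results, ordered by increasing q-value.
all_results <- results[order(results$Q.value), ]

# Subset with q-values less than or equal to the significance threshold.
threshold <- 0.25
significant_results <- all_results[all_results$Q.value <= threshold, ]
print(significant_results)
```

In this toy table taxonA has a p-value of 0.20, below 0.25, but its adjusted q-value is about 0.27, so it is excluded from the significant subset.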
Run a Demo
Example input files can be found in the tests folder of the MaAsLin2 source.
$ Maaslin2.R maaslin2/tests/example1_data.txt maaslin2/tests/example1_metadata.txt demo_output
When running this command, all output files will be written to a folder named demo_output.
Run MaAsLin2 help to print a list of the options and the default settings.
$ Maaslin2.R --help
Usage: ./R/Maaslin2.R [options] <data.tsv> <metadata.tsv> <output_folder>

Options:
    -h, --help
        Show this help message and exit

    -a MIN_ABUNDANCE, --min_abundance=MIN_ABUNDANCE
        The minimum abundance for each feature [ Default: 0 ]

    -p MIN_PREVALENCE, --min_prevalence=MIN_PREVALENCE
        The minimum percent of samples for which a feature is detected at minimum abundance [ Default: 0.1 ]

    -s MAX_SIGNIFICANCE, --max_significance=MAX_SIGNIFICANCE
        The q-value threshold for significance [ Default: 0.25 ]

    -n NORMALIZATION, --normalization=NORMALIZATION
        The normalization method to apply [ Default: TSS ] [ Choices: TSS, CLR, CSS, NONE, TMM ]

    -t TRANSFORM, --transform=TRANSFORM
        The transform to apply [ Default: LOG ] [ Choices: LOG, LOGIT, AST, NONE ]

    -m ANALYSIS_METHOD, --analysis_method=ANALYSIS_METHOD
        The analysis method to apply [ Default: LM ] [ Choices: LM, CPLM, ZICP, NEGBIN, ZINB ]

    -r RANDOM_EFFECTS, --random_effects=RANDOM_EFFECTS
        The random effects for the model, comma-delimited for multiple effects [ Default: none ]

    -f FIXED_EFFECTS, --fixed_effects=FIXED_EFFECTS
        The fixed effects for the model, comma-delimited for multiple effects [ Default: all ]

    -c CORRECTION, --correction=CORRECTION
        The correction method for computing the q-value [ Default: BH ]

    -z STANDARDIZE, --standardize=STANDARDIZE
        Apply z-score so continuous metadata are on the same scale [ Default: TRUE ]

    -e CORES, --cores=CORES
        The number of R processes to run in parallel [ Default: 1 ]
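To give a feel for what several of these options mean, the following base R sketch applies a prevalence filter, TSS normalization, a log transform, and z-score standardization to toy data. It is an illustration under stated assumptions, not MaAsLin2's implementation: the order of the steps, the strict "greater than" abundance comparison, and the pseudo-count used before taking logs are all assumptions, and the toy values are invented.

```r
# Toy counts: samples as rows, features as columns (values are made up).
counts <- matrix(
  c( 90, 10, 0,
    100,  0, 0,
     80, 19, 1,
     70, 30, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("s", 1:4), c("taxonA", "taxonB", "taxonC"))
)

min_abundance  <- 0    # default from the help output
min_prevalence <- 0.5  # raised from the 0.1 default so the toy filter drops a feature

# Prevalence: fraction of samples in which a feature exceeds min_abundance.
prevalence <- colMeans(counts > min_abundance)
kept <- counts[, prevalence >= min_prevalence, drop = FALSE]

# TSS (total sum scaling): divide each sample (row) by its total.
tss <- kept / rowSums(kept)

# LOG transform. Zeros need a pseudo-count; half the smallest non-zero
# value is an assumption made for this sketch.
pseudo_count <- min(tss[tss > 0]) / 2
log_tss <- log2(tss + pseudo_count)

# Standardize: z-score a continuous metadata variable (toy ages).
age <- c(34, 51, 27, 60)
age_z <- as.numeric(scale(age))

print(round(log_tss, 3))
print(round(age_z, 3))
```

Here taxonC is detected in only one of four samples, so it is removed by the raised prevalence threshold, and each remaining row of `tss` sums to 1.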
There are two functions in MaAsLin2 which visualize the outputs and provide ggplot2 plots that can be used to generate manuscript/report quality figures.
maaslin2_heatmap: this function generates an overview of all associations reported by MaAsLin2 and has the following parameters:
output_path: the path to the MaAsLin2 output
title: a title for the plot
cell_value: default 'Q.value'
data_label: default 'Data'
metadata_label: default 'Metadata'
border_color: default "grey93"
color: default colorRampPalette(c("blue","grey90", "red"))(500)
maaslin2_association_plots: this function produces a ggplot2 plot for each association; depending on the data types, the plot is a scatter plot or a boxplot. This function returns a vector of ggplot2 plots. The parameters for this function are as follows:
output_path: the path to the MaAsLin2 output
write_to_file: default TRUE
- Question: When I run from the command line I see the error Maaslin2.R: command not found. How do I fix this?
- Answer: Provide the full path to the executable when running Maaslin2.R.
- Question: When I run as a function I see the error Error in library(Maaslin2): there is no package called 'Maaslin2'. How do I fix this?
- Answer: Install the R package and then try loading the library again.
- Question: When I try to install the R package I see errors about dependencies not being installed. Why is this?
- Answer: Installing the R package will not automatically install the packages MaAsLin2 requires. Please install the dependencies and then install the MaAsLin2 R package.