MaAsLin is a multivariate statistical framework that finds associations between clinical metadata and potentially high-dimensional experimental data. MaAsLin performs boosted additive general linear models between one group of data (metadata/the predictors) and another group (in our case relative taxonomic abundances/the response).
MaAsLin is available as a Galaxy module, a Homebrew formula, a Docker image, and included in bioBakery (VM and cloud). For additional information, refer to the manuscript: Morgan XC, Tickle TL, Sokol H, Gevers D, Devaney KL, Ward DV, Reyes JA, Shah SA, LeLeiko N, Snapper SB, Bousvaros A, Korzenik J, Sands BE, Xavier RJ, Huttenhower C. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 2012 Apr 16;13(9):R79..
We provide support for MaAsLin users. Please join our Google group designated specifically for MaAsLin users. Feel free to post any questions on the google group by posting directly or emailing email@example.com.
- 1. Setup
- 2. Identify associations
- 3. MaAsLin in Galaxy
The following figure shows the workflow for MaAsLin.
MaAsLin can be run from the command line, as an R module, or through Galaxy. Follow these steps to install MaAsLin locally. Please note, if you are using bioBakery (Vagrant VM or Google Cloud) you do not need to install MaAsLin because the tool and its dependencies are already installed. For detailed information on how to run MaAsLin in Galaxy see the Galaxy section at the end of this tutorial.
MaAsLin requires the following R packages: agricolae, gam (version 1.14), gamlss, gbm, glmnet, inlinedocs, logging, MASS, nlme (version 3.1-127), optparse, outliers, penalized, pscl, robustbase. If you are installing MaAsLin with Docker or Homebrew, these dependencies will be automatically installed. If you are installing MaAsLin from source, install these packages before installing MaAsLin.
To install with Homebrew: $ brew install biobakery/biobakery/maaslin
To install with Docker: $ docker run -it biobakery/maaslin bash
To install from source, first download the latest version of MaAsLin. Next install MaAsLin, replacing X.Y.Z with the version number: $ R CMD INSTALL Maaslin_X.Y.Z.tar.gz
Once you have obtained the microbial abundance tables through MetaPhlAn2 (See MetaPhlAn2 tutorial for details) or other tools, MaAsLin allows you to determine associations between the microbial abundances (or functions) and the metadata.
MaAsLin requires a single input file that contains the metadata and the microbial abundances. Most likely you will have these as separate files. A utility script is included in MaAsLin to merge these into a single file.
The metadata file contains all of the clinical metadata you would like to analyze to determine if there exist any significant associations with any of the microbial abundances (or functions). This file should be formatted like the example below with the sample name as the first column and additional columns containing the metadata for each of the samples. This file is tab-delimited. Download this file from the following link: HMP.metadata.tsv .
sid RANDSID GENDER STSite SRS043001 550534656 female Stool SRS017127 159551223 male Buccal_mucosa SRS021473 158479027 male Buccal_mucosa
Bash command to download the file:
$ curl -O https://bitbucket.org/biobakery/maaslin/raw/tip/inst/extdata/HMP.metadata.tsv
The microbial abundance (or function) file contains values normalized to relative abundances. It includes data for all of the samples that are described in the metadata file. This file is tab-delimited with the first row the sample name and rows that follow include abundances. The file should be formatted like the example below. Download this file from the following link: HMP.abundances.tsv.
sid SRS043001 SRS017127 SRS021473 k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Porphyromonadaceae 2.57633 0.0 0.0 k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcus 0.09916 0.0 0.00175 k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Micrococcaceae 0.0 0.07358 0.71281 k__Bacteria|p__Proteobacteria|c__Epsilonproteobacteria 0.0 0.0049 1.10502
Bash command to download the file:
$ curl -O https://bitbucket.org/biobakery/maaslin/raw/tip/inst/extdata/HMP.abundances.tsv
Follow the instructions below to attach the metadata to the microbial abundance table. This section is optional if you are starting with a single input file of merged metadata and abundance data (relative abundance in the range of 0 to 1 is required). Or if you are using a separate method to attach the metadata, you may skip to the next section (All that is necessary is that the metadata needs to be listed first in the file (before the microbial abundance table), in the format shown by the sample input data). Please ensure that the format of your input files follows that of the demo files above.
Run the following command to create the merged file:
$ merge_metadata.py HMP.metadata.tsv -l < HMP.abundances.tsv > HMP.merged.metadata.abundances.species.renormed.pcl
The resulting file is HMP.merged.metadata.abundances.species.renormed.pcl. By default this MaAsLin script will normalize the values to relative abundances that sum to 1 for each sample. The abundances are required to be in the range of 0 to 1 in order to apply the arcsine square root transformation (default). If you are using a different method to merge your data and metadata, make sure your abundance data is normalized to proportions or relative abundances. Adding the option -l only the abundances at a species-level will be output. You may now use this file as an input for MaAsLin.
This file is of PCL format. It is the expected input file format for MaAsLin. This file format has the following requirements.
- Rows represent metadata and features (bugs), columns represent samples.
- The first row by default should be the sample ids.
- Metadata rows should be next.
- Lastly, rows containing features (bugs) measurements (like abundance) should be after metadata rows.
- The first column should contain the ID describing the column. For metadata, this may be, for example, "Age" for a row containing the age of the patients donating the samples. For measurements, this should be the feature name (bug name).
- By default, the file is expected to be TAB delimited.
- If a consensus lineage or hierarchy of taxonomy is contained in the feature name, the default delimiter between clades is the pipe ("|").
The read config file determines which rows/columns to process without modifying the input metadata-merged-microbial abundance table. A sample .read.config file (HMP.read.config) is shown below:
Matrix: Metadata Read_PCL_Rows: GENDER,STSite Matrix: Abundance Read_PCL_Rows: k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_viscosus-
Bash command to download the file:
$ curl -O https://bitbucket.org/biobakery/maaslin/raw/tip/inst/extdata/HMP.read.config
The text above dictates that the Metadata matrix includes the rows GENDER and STSite, while the Abundance matrix starts with the row the clade appears in the row (as the row name). For more examples, please refer to the MaAsLin documentation.
Once you have the metadata-merged-microbial abundance table, and the .read.config file (see the samples to ensure the format), you are ready to run MaAsLin. Use the demo input files provided in the prior sections to generate a demo output set.
$ Maaslin.R HMP.merged.metadata.abundances.species.renormed.pcl output -i HMP.read.config
The above command will create a directory: output, which will contain the results.
- Looking at the output analysis files there is a pdf of plots generated for the STSite but not the Gender metadata. What do you think this indicates? What file would you look at to confirm your hypothesis? If you need a hint, read the section of the tutorial about output files.
The first example shows how to run a data set with one sample per subject. If you were to have multiple samples per subject, you would want to run a mixed or random effects model, typically with the subject as random effects, although multiple random effects can also be specified.
Looking at the metadata and the merged PCL file, notice there is one additional metadata row that is not used in this run named RANDSID. This row indicates how the sample ids relate to each other with each subject having more than one sample. Modify the read config file to read in this metadata row. The new read config file will now look like the example below.
Matrix: Metadata Read_PCL_Rows: -STSite Matrix: Abundance Read_PCL_Rows: k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_viscosus-
- There is another way the read config file could be written to also include this metadata row. What does that read config file look like?
Now lets rerun with this new read config file and add an option to run with this new row as a random co-variate.
$ Maaslin.R HMP.merged.metadata.abundances.species.renormed.pcl output_random -i HMP.read.config --random RANDSID
The above command will create a directory: output_random, which will contain the results. These results are different than the prior run.
- How do these results differ from the prior results? Why do you think they differ?
Each MaAsLin run will generate multiple output files. The output files in the QC subfolder are all generated by the initial quality control steps that filter the data to remove clades and metadata that are sparse, remove outliers, and arcsine transforms the proportional abundance data. The file QC/*cleaned.tsv will contain the original input file after it has been run through all of the QC steps. The main output files of interest are the analysis files and the log file.
Each metadata row will have at most one txt output file. If significant associations were found for at least one taxon, a pdf of figures will also be generated.
Running the demo input files will generate a pair of analysis files for the STSite metadata. The txt output file will contain all of the significant associations (taxon-metadata pairs) along with the corresponding regression coefficient, number of observations, number of non-zero observations, P-value, and Q-value (FDR-adjusted P-value using Benjamini–Hochberg (BH) as default). The first few lines of the demo output file are included below (with the full taxonomy reduced to just the species to save space). Download the complete file from the following link, HMP-STSite.txt.
Variable Feature Value Coefficient N N not 0 P-value Q-value STSite s__Streptococcus_mitis STSiteStool -0.258471709381982 297 204 3.45919838845225e-166 8.02534026120922e-164 STSite s__Streptococcus_mitis STSitePosterior_fornix -0.261049166635981 297 204 4.22958832642797e-135 4.90632245865644e-133 STSite s__Streptococcus_oralis STSiteStool -0.0336438278067034 297 112 3.11171830769747e-85 2.39063208284818e-83
The corresponding pdf file of plots includes only those taxa which pass the significance threshold. An example plot from the demo output set is shown below.
The log file includes the formulas and values for all of the taxons processed. Each taxon has its own section including formulas and values. An example section from the demo outputs follows with the samples and values removed to save space. Download the complete file from the following link, HMP_log.txt .
#taxon s__Actinomyces_viscosus [full taxonomy removed to save space] #metadata GENDER STSite #samples SRS011061 [... removed to save space] #Boost formula adCur ~ `GENDER` + `STSite` #model-gbm [values ... removed to save space] adCur ~ GENDER + STSite attr(,"variables") list(adCur, GENDER, STSite) attr(,"factors") GENDER STSite adCur 0 0 GENDER 1 0 STSite 0 1 attr(,"term.labels")  "GENDER" "STSite" attr(,"order")  1 1 attr(,"intercept")  1 attr(,"response")  1 attr(,".Environment") <environment: 0x8079ad8> attr(,"predvars") list(adCur, GENDER, STSite) attr(,"dataClasses") adCur GENDER STSite "numeric" "factor" "factor" #summary-gbm var rel.inf STSite STSite 100 GENDER GENDER 0 #model Call: lm(formula = as.formula(strFormula), data = frmeTmp, na.action = c_strNA_Action) Coefficients: (Intercept) STSitePosterior_fornix STSiteStool 0.01745 -0.01737 -0.01744 #summary Call: lm(formula = as.formula(strFormula), data = frmeTmp, na.action = c_strNA_Action) Residuals: Min 1Q Median 3Q Max -0.0174529 -0.0000848 -0.0000121 -0.0000121 0.0214007 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.0174529 0.0005423 32.18 <2e-16 *** STSitePosterior_fornix -0.0173681 0.0009546 -18.20 <2e-16 *** STSiteStool -0.0174407 0.0007215 -24.17 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.00561 on 294 degrees of freedom Multiple R-squared: 0.6919, Adjusted R-squared: 0.6898 F-statistic: 330.1 on 2 and 294 DF, p-value: < 2.2e-16
- If running with random effects, how does the log differ from that shown?
Although we recommend using the default options, arguments exist to modify both the MaAsLin methodology and output files. To see an up-to-date listing of argument usage, type $ Maaslin.R --help (from the command line) or > help(Maaslin) (within the R environment).
strVerbosity = "DEBUG" dSignificanceLevel = 0.1
In the above example, MaAsLin is modified to produce verbose output for debugging and to change the threshold for reporting associations only when the corresponding q-values are less than or equal to 0.1.
All versus All
The 'all versus all' analysis is a way of manipulating how metadata are used. When this option is specified, a univariate analysis is run, i.e. each metadata is separately evaluated for associations with the feature relative abundances, as opposed to the default multivariable association testing that models all the metadata at once. To give a more concrete example, you may have metadata age, sex, and diet. Rather than fitting a full multivariable model (such as features ~ age + sex + diet), you may want to evaluate each metadata separately (i.e. features ~ age; features ~ sex; features ~ diet). This can be specified in the command line environment as an additional option as $ Maaslin.R --fAllvAll or within the R environment inside the Maaslin() function call as fAllvAll = TRUE.
strForcedPredictors indicates which metadata are forced in the analysis. In this method, there is a group of metadata that is always evaluated, in addition to another group of metadata that are not forced. Following up on the example above, you may want to evaluate the association between a set of features (i.e. relative abundances of microbes) and a set of metadata (age and diet), while always adjusting for sex. In this specific case, the sex metadata is the forced part of the evaluation while the others are not forced. In addition, you may also want to evaluate the non-forced set of metadata one at a time (i.e. 'all versus all'). This allows you to individually associate age and diet while adjusting for sex. This does not, however, bypass the feature selection method so the metadata that are not forced are subject to feature selection and may be removed before coming to the evaluation. If you want all the metadata that are not forced to be evaluated in serial you will need to turn off feature selection and will have final combined options (either within the R environment inside the Maaslin() function call or as additional arguments to the Maaslin.R function from the command line environment) as seen here:
fAllvAll = TRUE strForcedPredictors = "sex" strModelSelection = "none"
MaAsLin can be run from the Huttenhower lab Galaxy instance. With this method, you do not need to install MaAsLin on your local machine.
To run on Galaxy, MaAsLin requires a microbial abundance table with metadata attached (e.g.: maaslin_demo2.pcl).
Once you have your feature table with metadata ready, you can run it through MaAsLin on Galaxy by following these steps (these steps use the demo data as input):
- Go to the Huttenhower Galaxy Server: http://huttenhower.sph.harvard.edu/galaxy/.
- Click on the Get Data -> Upload File link on the left pane and upload the demo file maaslin_demo2.pcl. You can do this by clicking on the Browse button, selecting the demo file, and then pressing the Start button. Select format tabular.
- Click on the MaAsLin link on the left pane.
- Select the input data from the Input file drop-down menu.
- Specify the last metadata row in the sample, after which the microbial species are listed (this is Weight in our sample dataset).
- Click on Execute
The results will appear on the right pane. You may proceed with viewing it (by clicking on the Eye symbol) or downloading it on your computer (by clicking on the Save symbol).