# MaAsLin Tutorial

MaAsLin is a multivariate statistical framework that finds associations between clinical metadata and potentially high-dimensional experimental data. MaAsLin performs boosted additive general linear models between one group of data (metadata/the predictors) and another group (in our case relative taxonomic abundances/the response).

MaAsLin is available as a Galaxy module, a Homebrew formula, a Docker image, and included in bioBakery (VM and cloud). For additional information, refer to the manuscript: Morgan XC, Tickle TL, Sokol H, Gevers D, Devaney KL, Ward DV, Reyes JA, Shah SA, LeLeiko N, Snapper SB, Bousvaros A, Korzenik J, Sands BE, Xavier RJ, Huttenhower C. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 2012 Apr 16;13(9):R79..

## Overview

The following figure shows the workflow for MaAsLin.

## 1. Setup

MaAsLin can be run from the command line, as an R module, or through Galaxy. Follow these steps to install MaAsLin locally. Please note, if you are using bioBakery (Vagrant VM or Google Cloud) you do not need to install MaAsLin because the tool and its dependencies are already installed. For detailed information on how to run MaAsLin in Galaxy see the Galaxy section at the end of this tutorial.

### 1.1 Requirements

MaAsLin requires the following R packages: agricolae, gam (version 1.14), gamlss, gbm, glmnet, inlinedocs, logging, MASS, nlme (version 3.1-127), optparse, outliers, penalized, pscl, robustbase. If you are installing MaAsLin with Docker or Homebrew, these dependencies will be automatically installed. If you are installing MaAsLin from source, install these packages before installing MaAsLin.

### 1.2 Installation

To install with Homebrew: $brew install biobakery/biobakery/maaslin To install with Docker:$ docker run -it biobakery/maaslin bash



#### 2.1.2 Abundance file

The microbial abundance (or function) file contains values normalized to relative abundances. It includes data for all of the samples that are described in the metadata file. This file is tab-delimited with the first row the sample name and rows that follow include abundances. The file should be formatted like the example below. Download this file from the following link: HMP.abundances.tsv.

sid        SRS043001         SRS017127         SRS021473
k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcus 0.09916 0.0 0.00175
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Micrococcaceae 0.0 0.07358 0.71281
k__Bacteria|p__Proteobacteria|c__Epsilonproteobacteria 0.0 0.0049 1.10502


$curl -O https://bitbucket.org/biobakery/maaslin/raw/tip/inst/extdata/HMP.abundances.tsv  #### 2.1.3 Merged PCL file Follow the instructions below to attach the metadata to the microbial abundance table. This section is optional if you are starting with a single input file of merged metadata and abundance data (relative abundance in the range of 0 to 1 is required). Or if you are using a separate method to attach the metadata, you may skip to the next section (All that is necessary is that the metadata needs to be listed first in the file (before the microbial abundance table), in the format shown by the sample input data). Please ensure that the format of your input files follows that of the demo files above. Run the following command to create the merged file: $ merge_metadata.py HMP.metadata.tsv -l < HMP.abundances.tsv > HMP.merged.metadata.abundances.species.renormed.pcl


The resulting file is HMP.merged.metadata.abundances.species.renormed.pcl. By default this MaAsLin script will normalize the values to relative abundances that sum to 1 for each sample. The abundances are required to be in the range of 0 to 1 in order to apply the arcsine square root transformation (default). If you are using a different method to merge your data and metadata, make sure your abundance data is normalized to proportions or relative abundances. Adding the option -l only the abundances at a species-level will be output. You may now use this file as an input for MaAsLin.

This file is of PCL format. It is the expected input file format for MaAsLin. This file format has the following requirements.

1. Rows represent metadata and features (bugs), columns represent samples.
2. The first row by default should be the sample ids.
3. Metadata rows should be next.
4. Lastly, rows containing features (bugs) measurements (like abundance) should be after metadata rows.
5. The first column should contain the ID describing the column. For metadata, this may be, for example, "Age" for a row containing the age of the patients donating the samples. For measurements, this should be the feature name (bug name).
6. By default, the file is expected to be TAB delimited.
7. If a consensus lineage or hierarchy of taxonomy is contained in the feature name, the default delimiter between clades is the pipe ("|").

The read config file determines which rows/columns to process without modifying the input metadata-merged-microbial abundance table. A sample .read.config file (HMP.read.config) is shown below:

Matrix: Metadata

Matrix: Abundance


$curl -O https://bitbucket.org/biobakery/maaslin/raw/tip/inst/extdata/HMP.read.config  The text above dictates that the Metadata matrix includes the rows GENDER and STSite, while the Abundance matrix starts with the row the clade appears in the row (as the row name). For more examples, please refer to the MaAsLin documentation. ### 2.2 Running MaAsLin Once you have the metadata-merged-microbial abundance table, and the .read.config file (see the samples to ensure the format), you are ready to run MaAsLin. Use the demo input files provided in the prior sections to generate a demo output set.$ Maaslin.R HMP.merged.metadata.abundances.species.renormed.pcl output -i HMP.read.config

The above command will create a directory: output, which will contain the results.

• Looking at the output analysis files there is a pdf of plots generated for the STSite but not the Gender metadata. What do you think this indicates? What file would you look at to confirm your hypothesis? If you need a hint, read the section of the tutorial about output files.

#### 2.2.1 Random Effects

The first example shows how to run a data set with one sample per subject. If you were to have multiple samples per subject, you would want to run a mixed or random effects model, typically with the subject as random effects, although multiple random effects can also be specified.

Looking at the metadata and the merged PCL file, notice there is one additional metadata row that is not used in this run named RANDSID. This row indicates how the sample ids relate to each other with each subject having more than one sample. Modify the read config file to read in this metadata row. The new read config file will now look like the example below.

Matrix: Metadata

Matrix: Abundance


• There is another way the read config file could be written to also include this metadata row. What does that read config file look like?

Now lets rerun with this new read config file and add an option to run with this new row as a random co-variate.

$Maaslin.R HMP.merged.metadata.abundances.species.renormed.pcl output_random -i HMP.read.config --random RANDSID The above command will create a directory: output_random, which will contain the results. These results are different than the prior run. • How do these results differ from the prior results? Why do you think they differ? ### 2.3 Output files Each MaAsLin run will generate multiple output files. The output files in the QC subfolder are all generated by the initial quality control steps that filter the data to remove clades and metadata that are sparse, remove outliers, and arcsine transforms the proportional abundance data. The file QC/*cleaned.tsv will contain the original input file after it has been run through all of the QC steps. The main output files of interest are the analysis files and the log file. #### 2.3.1 Analysis files Each metadata row will have at most one txt output file. If significant associations were found for at least one taxon, a pdf of figures will also be generated. Running the demo input files will generate a pair of analysis files for the STSite metadata. The txt output file will contain all of the significant associations (taxon-metadata pairs) along with the corresponding regression coefficient, number of observations, number of non-zero observations, P-value, and Q-value (FDR-adjusted P-value using Benjamini–Hochberg (BH) as default). The first few lines of the demo output file are included below (with the full taxonomy reduced to just the species to save space). Download the complete file from the following link, HMP-STSite.txt. Variable Feature Value Coefficient N N not 0 P-value Q-value STSite s__Streptococcus_mitis STSiteStool -0.258471709381982 297 204 3.45919838845225e-166 8.02534026120922e-164 STSite s__Streptococcus_mitis STSitePosterior_fornix -0.261049166635981 297 204 4.22958832642797e-135 4.90632245865644e-133 STSite s__Streptococcus_oralis STSiteStool -0.0336438278067034 297 112 3.11171830769747e-85 2.39063208284818e-83  The corresponding pdf file of plots includes only those taxa which pass the significance threshold. An example plot from the demo output set is shown below. #### 2.3.2 Log The log file includes the formulas and values for all of the taxons processed. Each taxon has its own section including formulas and values. An example section from the demo outputs follows with the samples and values removed to save space. Download the complete file from the following link, HMP_log.txt . #taxon s__Actinomyces_viscosus [full taxonomy removed to save space] #metadata GENDER STSite #samples SRS011061 [... removed to save space] #Boost formula adCur ~ GENDER + STSite #model-gbm [values ... removed to save space] adCur ~ GENDER + STSite attr(,"variables") list(adCur, GENDER, STSite) attr(,"factors") GENDER STSite adCur 0 0 GENDER 1 0 STSite 0 1 attr(,"term.labels") [1] "GENDER" "STSite" attr(,"order") [1] 1 1 attr(,"intercept") [1] 1 attr(,"response") [1] 1 attr(,".Environment") <environment: 0x8079ad8> attr(,"predvars") list(adCur, GENDER, STSite) attr(,"dataClasses") adCur GENDER STSite "numeric" "factor" "factor" #summary-gbm var rel.inf STSite STSite 100 GENDER GENDER 0 #model Call: lm(formula = as.formula(strFormula), data = frmeTmp, na.action = c_strNA_Action) Coefficients: (Intercept) STSitePosterior_fornix STSiteStool 0.01745 -0.01737 -0.01744 #summary Call: lm(formula = as.formula(strFormula), data = frmeTmp, na.action = c_strNA_Action) Residuals: Min 1Q Median 3Q Max -0.0174529 -0.0000848 -0.0000121 -0.0000121 0.0214007 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.0174529 0.0005423 32.18 <2e-16 *** STSitePosterior_fornix -0.0173681 0.0009546 -18.20 <2e-16 *** STSiteStool -0.0174407 0.0007215 -24.17 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.00561 on 294 degrees of freedom Multiple R-squared: 0.6919, Adjusted R-squared: 0.6898 F-statistic: 330.1 on 2 and 294 DF, p-value: < 2.2e-16  • If running with random effects, how does the log differ from that shown? ### 2.4 Additional options Although we recommend using the default options, arguments exist to modify both the MaAsLin methodology and output files. To see an up-to-date listing of argument usage, type$ Maaslin.R --help (from the command line) or > help(Maaslin) (within the R environment).

Example args:

strVerbosity = "DEBUG"
dSignificanceLevel = 0.1


In the above example, MaAsLin is modified to produce verbose output for debugging and to change the threshold for reporting associations only when the corresponding q-values are less than or equal to 0.1.

All versus All

The 'all versus all' analysis is a way of manipulating how metadata are used. When this option is specified, a univariate analysis is run, i.e. each metadata is separately evaluated for associations with the feature relative abundances, as opposed to the default multivariable association testing that models all the metadata at once. To give a more concrete example, you may have metadata age, sex, and diet. Rather than fitting a full multivariable model (such as features ~ age + sex + diet), you may want to evaluate each metadata separately (i.e. features ~ age; features ~ sex; features ~ diet). This can be specified in the command line environment as an additional option as \$ Maaslin.R --fAllvAll or within the R environment inside the Maaslin() function call as fAllvAll = TRUE.

Forcing covariates

fAllvAll = TRUE
strForcedPredictors = "sex"
strModelSelection = "none"


## 3. MaAsLin in Galaxy

MaAsLin can be run from the Huttenhower lab Galaxy instance. With this method, you do not need to install MaAsLin on your local machine.

To run on Galaxy, MaAsLin requires a microbial abundance table with metadata attached (e.g.: maaslin_demo2.pcl).

Once you have your feature table with metadata ready, you can run it through MaAsLin on Galaxy by following these steps (these steps use the demo data as input):

• Go to the Huttenhower Galaxy Server: http://huttenhower.sph.harvard.edu/galaxy/.
• Click on the Get Data -> Upload File link on the left pane and upload the demo file maaslin_demo2.pcl. You can do this by clicking on the Browse button, selecting the demo file, and then pressing the Start button. Select format tabular.
• Click on the MaAsLin link on the left pane.
• Select the input data from the Input file drop-down menu.
• Specify the last metadata row in the sample, after which the microbial species are listed (this is Weight in our sample dataset).
• Click on Execute

The results will appear on the right pane. You may proceed with viewing it (by clicking on the Eye symbol) or downloading it on your computer (by clicking on the Save symbol).

Updated