Clone wiki

biobakery / maaslin

MaAsLin Tutorial

MaAsLin is a multivariate statistical framework that finds associations between clinical metadata and potentially high-dimensional experimental data. MaAsLin performs boosted additive general linear models between one group of data (metadata/the predictors) and another group (in our case relative taxonomic abundances/the response).

MaAsLin is available as a Galaxy module, a Homebrew formula, a Docker image, and included in bioBakery (VM and cloud). For additional information, refer to the manuscript: Morgan XC, Tickle TL, Sokol H, Gevers D, Devaney KL, Ward DV, Reyes JA, Shah SA, LeLeiko N, Snapper SB, Bousvaros A, Korzenik J, Sands BE, Xavier RJ, Huttenhower C. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 2012 Apr 16;13(9):R79..

We provide support for MaAsLin users. Please join our Google group designated specifically for MaAsLin users. Feel free to post any questions on the google group by posting directly or emailing maaslin-users@googlegroups.com.



Overview

The following figure shows the workflow for MaAsLin.

MaAsLin.png

1. Setup

MaAsLin can be run from the command line, as an R module, or through Galaxy. Follow these steps to install MaAsLin locally. Please note, if you are using bioBakery (Vagrant VM or Google Cloud) you do not need to install MaAsLin because the tool and its dependencies are already installed. For detailed information on how to run MaAsLin in Galaxy see the Galaxy section at the end of this tutorial.

1.1 Requirements

MaAsLin requires the following R packages: agricolae, gam (version 1.14), gamlss, gbm, glmnet, inlinedocs, logging, MASS, nlme (version 3.1-127), optparse, outliers, penalized, pscl, robustbase. If you are installing MaAsLin with Docker or Homebrew, these dependencies will be automatically installed. If you are installing MaAsLin from source, install these packages before installing MaAsLin.

1.2 Installation

To install with Homebrew: $ brew install biobakery/biobakery/maaslin

To install with Docker: $ docker run -it biobakery/maaslin bash

To install from source, first download the latest version of MaAsLin. Next install MaAsLin, replacing X.Y.Z with the version number: $ R CMD INSTALL Maaslin_X.Y.Z.tar.gz


2. Identify associations

Once you have obtained the microbial abundance tables through MetaPhlAn2 (See MetaPhlAn2 tutorial for details) or other tools, MaAsLin allows you to determine associations between the microbial abundances (or functions) and the metadata.

2.1 Input files

MaAsLin requires a single input file that contains the metadata and the microbial abundances. Most likely you will have these as separate files. A utility script is included in MaAsLin to merge these into a single file.

2.1.1 Metadata file

The metadata file contains all of the clinical metadata you would like to analyze to determine if there exist any significant associations with any of the microbial abundances (or functions). This file should be formatted like the example below with the sample name as the first column and additional columns containing the metadata for each of the samples. This file is tab-delimited. Download this file from the following link: HMP.metadata.tsv .

sid          RANDSID      GENDER     STSite
SRS043001    550534656    female     Stool
SRS017127    159551223    male       Buccal_mucosa
SRS021473    158479027    male       Buccal_mucosa

Bash command to download the file:

$ curl -O https://bitbucket.org/biobakery/maaslin/raw/tip/inst/extdata/HMP.metadata.tsv

2.1.2 Abundance file

The microbial abundance (or function) file contains values normalized to relative abundances. It includes data for all of the samples that are described in the metadata file. This file is tab-delimited with the first row the sample name and rows that follow include abundances. The file should be formatted like the example below. Download this file from the following link: HMP.abundances.tsv.

sid        SRS043001         SRS017127         SRS021473
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Porphyromonadaceae 2.57633 0.0 0.0
k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcus 0.09916 0.0 0.00175
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Micrococcaceae 0.0 0.07358 0.71281
k__Bacteria|p__Proteobacteria|c__Epsilonproteobacteria 0.0 0.0049 1.10502

Bash command to download the file:

$ curl -O https://bitbucket.org/biobakery/maaslin/raw/tip/inst/extdata/HMP.abundances.tsv

2.1.3 Merged PCL file

Follow the instructions below to attach the metadata to the microbial abundance table. This section is optional if you are starting with a single input file of merged metadata and abundance data (relative abundance in the range of 0 to 1 is required). Or if you are using a separate method to attach the metadata, you may skip to the next section (All that is necessary is that the metadata needs to be listed first in the file (before the microbial abundance table), in the format shown by the sample input data). Please ensure that the format of your input files follows that of the demo files above.

Run the following command to create the merged file:

$ merge_metadata.py HMP.metadata.tsv -l < HMP.abundances.tsv > HMP.merged.metadata.abundances.species.renormed.pcl

The resulting file is HMP.merged.metadata.abundances.species.renormed.pcl. By default this MaAsLin script will normalize the values to relative abundances that sum to 1 for each sample. The abundances are required to be in the range of 0 to 1 in order to apply the arcsine square root transformation (default). If you are using a different method to merge your data and metadata, make sure your abundance data is normalized to proportions or relative abundances. Adding the option -l only the abundances at a species-level will be output. You may now use this file as an input for MaAsLin.

This file is of PCL format. It is the expected input file format for MaAsLin. This file format has the following requirements.

  1. Rows represent metadata and features (bugs), columns represent samples.
  2. The first row by default should be the sample ids.
  3. Metadata rows should be next.
  4. Lastly, rows containing features (bugs) measurements (like abundance) should be after metadata rows.
  5. The first column should contain the ID describing the column. For metadata, this may be, for example, "Age" for a row containing the age of the patients donating the samples. For measurements, this should be the feature name (bug name).
  6. By default, the file is expected to be TAB delimited.
  7. If a consensus lineage or hierarchy of taxonomy is contained in the feature name, the default delimiter between clades is the pipe ("|").

2.1.4 Read config file

The read config file determines which rows/columns to process without modifying the input metadata-merged-microbial abundance table. A sample .read.config file (HMP.read.config) is shown below:

Matrix: Metadata
Read_PCL_Rows: GENDER,STSite

Matrix: Abundance
Read_PCL_Rows: k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_viscosus-

Bash command to download the file:

$ curl -O https://bitbucket.org/biobakery/maaslin/raw/tip/inst/extdata/HMP.read.config

The text above dictates that the Metadata matrix includes the rows GENDER and STSite, while the Abundance matrix starts with the row the clade appears in the row (as the row name). For more examples, please refer to the MaAsLin documentation.


2.2 Running MaAsLin

Once you have the metadata-merged-microbial abundance table, and the .read.config file (see the samples to ensure the format), you are ready to run MaAsLin. Use the demo input files provided in the prior sections to generate a demo output set.

$ Maaslin.R HMP.merged.metadata.abundances.species.renormed.pcl output -i HMP.read.config

The above command will create a directory: output, which will contain the results.


  • Looking at the output analysis files there is a pdf of plots generated for the STSite but not the Gender metadata. What do you think this indicates? What file would you look at to confirm your hypothesis? If you need a hint, read the section of the tutorial about output files.

2.2.1 Random Effects

The first example shows how to run a data set with one sample per subject. If you were to have multiple samples per subject, you would want to run a mixed or random effects model, typically with the subject as random effects, although multiple random effects can also be specified.

Looking at the metadata and the merged PCL file, notice there is one additional metadata row that is not used in this run named RANDSID. This row indicates how the sample ids relate to each other with each subject having more than one sample. Modify the read config file to read in this metadata row. The new read config file will now look like the example below.

Matrix: Metadata
Read_PCL_Rows: -STSite

Matrix: Abundance
Read_PCL_Rows: k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__Actinomyces_viscosus-

  • There is another way the read config file could be written to also include this metadata row. What does that read config file look like?

Now lets rerun with this new read config file and add an option to run with this new row as a random co-variate.

$ Maaslin.R HMP.merged.metadata.abundances.species.renormed.pcl output_random -i HMP.read.config --random RANDSID

The above command will create a directory: output_random, which will contain the results. These results are different than the prior run.


  • How do these results differ from the prior results? Why do you think they differ?

2.3 Output files

Each MaAsLin run will generate multiple output files. The output files in the QC subfolder are all generated by the initial quality control steps that filter the data to remove clades and metadata that are sparse, remove outliers, and arcsine transforms the proportional abundance data. The file QC/*cleaned.tsv will contain the original input file after it has been run through all of the QC steps. The main output files of interest are the analysis files and the log file.

2.3.1 Analysis files

Each metadata row will have at most one txt output file. If significant associations were found for at least one taxon, a pdf of figures will also be generated.

Running the demo input files will generate a pair of analysis files for the STSite metadata. The txt output file will contain all of the significant associations (taxon-metadata pairs) along with the corresponding regression coefficient, number of observations, number of non-zero observations, P-value, and Q-value (FDR-adjusted P-value using Benjamini–Hochberg (BH) as default). The first few lines of the demo output file are included below (with the full taxonomy reduced to just the species to save space). Download the complete file from the following link, HMP-STSite.txt.

Variable        Feature Value   Coefficient     N       N not 0 P-value Q-value
STSite  s__Streptococcus_mitis     STSiteStool     -0.258471709381982      297     204     3.45919838845225e-166   8.02534026120922e-164
STSite  s__Streptococcus_mitis     STSitePosterior_fornix  -0.261049166635981      297     204     4.22958832642797e-135   4.90632245865644e-133
STSite  s__Streptococcus_oralis    STSiteStool     -0.0336438278067034     297     112     3.11171830769747e-85    2.39063208284818e-83

The corresponding pdf file of plots includes only those taxa which pass the significance threshold. An example plot from the demo output set is shown below.

https://bitbucket.org/repo/49y6o9/images/1617609270-Screenshot%20from%202018-04-09%2018-27-16.png

2.3.2 Log

The log file includes the formulas and values for all of the taxons processed. Each taxon has its own section including formulas and values. An example section from the demo outputs follows with the samples and values removed to save space. Download the complete file from the following link, HMP_log.txt .

#taxon  s__Actinomyces_viscosus [full taxonomy removed to save space]
#metadata       GENDER  STSite
#samples SRS011061 [... removed to save space]
#Boost formula  adCur ~ `GENDER` + `STSite`
#model-gbm
[values ... removed to save space]
adCur ~ GENDER + STSite
attr(,"variables")
list(adCur, GENDER, STSite)
attr(,"factors")
       GENDER STSite
adCur       0      0
GENDER      1      0
STSite      0      1
attr(,"term.labels")
[1] "GENDER" "STSite"
attr(,"order")
[1] 1 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: 0x8079ad8>
attr(,"predvars")
list(adCur, GENDER, STSite)
attr(,"dataClasses")
    adCur    GENDER    STSite
"numeric"  "factor"  "factor"
#summary-gbm
          var  rel.inf
STSite STSite     100
GENDER GENDER       0
#model

Call:
lm(formula = as.formula(strFormula), data = frmeTmp, na.action = c_strNA_Action)

Coefficients:
       (Intercept)  STSitePosterior_fornix             STSiteStool
           0.01745                -0.01737                -0.01744

#summary

Call:
lm(formula = as.formula(strFormula), data = frmeTmp, na.action = c_strNA_Action)

Residuals:
       Min         1Q     Median         3Q        Max
-0.0174529 -0.0000848 -0.0000121 -0.0000121  0.0214007

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)             0.0174529  0.0005423   32.18   <2e-16 ***
STSitePosterior_fornix -0.0173681  0.0009546  -18.20   <2e-16 ***
STSiteStool            -0.0174407  0.0007215  -24.17   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.00561 on 294 degrees of freedom
Multiple R-squared:  0.6919,    Adjusted R-squared:  0.6898
F-statistic: 330.1 on 2 and 294 DF,  p-value: < 2.2e-16

  • If running with random effects, how does the log differ from that shown?


2.4 Additional options

Although we recommend using the default options, arguments exist to modify both the MaAsLin methodology and output files. To see an up-to-date listing of argument usage, type $ Maaslin.R --help (from the command line) or > help(Maaslin) (within the R environment).

Example args:

strVerbosity = "DEBUG"
dSignificanceLevel = 0.1

In the above example, MaAsLin is modified to produce verbose output for debugging and to change the threshold for reporting associations only when the corresponding q-values are less than or equal to 0.1.

All versus All

The 'all versus all' analysis is a way of manipulating how metadata are used. When this option is specified, a univariate analysis is run, i.e. each metadata is separately evaluated for associations with the feature relative abundances, as opposed to the default multivariable association testing that models all the metadata at once. To give a more concrete example, you may have metadata age, sex, and diet. Rather than fitting a full multivariable model (such as features ~ age + sex + diet), you may want to evaluate each metadata separately (i.e. features ~ age; features ~ sex; features ~ diet). This can be specified in the command line environment as an additional option as $ Maaslin.R --fAllvAll or within the R environment inside the Maaslin() function call as fAllvAll = TRUE.

Forcing covariates

strForcedPredictors indicates which metadata are forced in the analysis. In this method, there is a group of metadata that is always evaluated, in addition to another group of metadata that are not forced. Following up on the example above, you may want to evaluate the association between a set of features (i.e. relative abundances of microbes) and a set of metadata (age and diet), while always adjusting for sex. In this specific case, the sex metadata is the forced part of the evaluation while the others are not forced. In addition, you may also want to evaluate the non-forced set of metadata one at a time (i.e. 'all versus all'). This allows you to individually associate age and diet while adjusting for sex. This does not, however, bypass the feature selection method so the metadata that are not forced are subject to feature selection and may be removed before coming to the evaluation. If you want all the metadata that are not forced to be evaluated in serial you will need to turn off feature selection and will have final combined options (either within the R environment inside the Maaslin() function call or as additional arguments to the Maaslin.R function from the command line environment) as seen here:

fAllvAll = TRUE
strForcedPredictors = "sex"
strModelSelection = "none"

3. MaAsLin in Galaxy

MaAsLin can be run from the Huttenhower lab Galaxy instance. With this method, you do not need to install MaAsLin on your local machine.

To run on Galaxy, MaAsLin requires a microbial abundance table with metadata attached (e.g.: maaslin_demo2.pcl).

Once you have your feature table with metadata ready, you can run it through MaAsLin on Galaxy by following these steps (these steps use the demo data as input):

  • Go to the Huttenhower Galaxy Server: http://huttenhower.sph.harvard.edu/galaxy/.
  • Click on the Get Data -> Upload File link on the left pane and upload the demo file maaslin_demo2.pcl. You can do this by clicking on the Browse button, selecting the demo file, and then pressing the Start button. Select format tabular.
https://bitbucket.org/repo/49y6o9/images/3266810747-Screenshot%20from%202017-09-01%2018-41-15.png
  • Click on the MaAsLin link on the left pane.
    • Select the input data from the Input file drop-down menu.
    • Specify the last metadata row in the sample, after which the microbial species are listed (this is Weight in our sample dataset).
    • Click on Execute
https://bitbucket.org/repo/49y6o9/images/2497207550-Screenshot%20from%202017-09-05%2017-58-38.png

The results will appear on the right pane. You may proceed with viewing it (by clicking on the Eye symbol) or downloading it on your computer (by clicking on the Save symbol).

Updated