Tutorial

In this tutorial, we will guide the user through the main options of metaBIT by profiling and analysing the microbiota of two human skin samples. The original sequence data were retrieved from the archives of the Human Microbiome Project (HMP: Human Microbiome Project Consortium 2012a).

1. Configuring the pipeline

To run all the required programs, you must have installed and configured the metaBIT pipeline as explained in the installation section.

2. Description of the source data

In the example folder from metaBIT, we provide two subfolders labeled SRS018978 and SRS024655. Each contains a subset of 120,000 sequence reads sub-sampled from the corresponding HMP skin sample (SRS018978 or SRS024655), split into the following files:

  • tutorial_example_SRS018978.pair1.fastq.gz
  • tutorial_example_SRS018978.pair2.fastq.gz
  • tutorial_example_SRS018978.singleton.fastq.gz

and:

  • tutorial_example_SRS024655.pair1.fastq.gz
  • tutorial_example_SRS024655.pair2.fastq.gz
  • tutorial_example_SRS024655.singleton.fastq.gz

Because we will run metaBIT from the example folder, first change to that directory:

cd /path/to/metaBIT/example/

3. Preparing a YAML makefile

A single simple makefile, provided by the user in YAML format, defines all the analyses to be performed in metaBIT. This file contains the parameters of each type of analysis, the location of the fastq sequence files, and the sample data structure. Full instructions describing the format and structure of metaBIT YAML makefiles are given in the makefile documentation that accompanies metaBIT.

In this tutorial, we will walk through the content of a simple makefile, tutorial_example.yaml. You can follow the structure of the makefile using any text editor.

The first line specifies the file type:

# -*- mode: Yaml; -*-

YAML is a human-readable format for storing data hierarchically. Subcategories are indented with one or more spaces relative to their parent category; never use tabs for indentation. This section of the YAML documentation gives a quick overview of the language structure. Commented lines begin with a hash character (#).
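
Because a stray tab is hard to spot in a text editor, a quick check before running the pipeline can help. The following is a minimal sketch and not part of metaBIT; the function name find_tab_indented_lines is hypothetical, and it only inspects the leading whitespace of each line.

```python
def find_tab_indented_lines(text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Leading whitespace is whatever lstrip() would remove.
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad.append(number)
    return bad


# Small made-up makefile fragment: the third line is indented with a tab.
example = "Samples:\n  Skin:\n\tSRS018978:\n"
print(find_tab_indented_lines(example))
```

To check a real makefile, pass the file content instead of the example string, for instance the text read from tutorial_example.yaml.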

The 'Samples' section

You must first indicate the name and location of the data files to be used as input for the metaBIT pipeline. Input files of the metaBIT pipeline are trimmed reads (adaptors have been removed) from single-end or paired-end sequencing runs. In this tutorial, the two libraries SRS018978 and SRS024655 will be analysed separately, but they can be grouped in one single sample (Skin). The corresponding sample structure in the makefile should be:

Samples:          
  Skin:           # First skin sample name
    SRS018978:    # First library name

You must then report all paths to the sequence data files, which could consist of either Paired-End reads (Paired, Collapsed, Singletons) or Single-End reads.

The SRS018978 library contains paired reads. Using the key "{Pair}" the pipeline retrieves files 1 and 2 (the key "{Pair}" equates to 1 and 2 when searching for the files).

Samples:
  Skin:           # First skin sample name
    SRS018978:    # First library name
      Paired: SRS018978/tutorial_example_SRS018978.pair{Pair}.fastq.gz
      Singles: SRS018978/tutorial_example_SRS018978.singleton.fastq.gz

Do the same for the second library:

    SRS024655:    # Second library name
      Paired: SRS024655/tutorial_example_SRS024655.pair{Pair}.fastq.gz
      Singles: SRS024655/tutorial_example_SRS024655.singleton.fastq.gz
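
The "{Pair}" substitution described above can be pictured with a few lines of Python. This only illustrates the naming rule stated in this tutorial (the key stands for 1 and 2); it is not metaBIT's own code, and expand_pair_key is a hypothetical helper.

```python
def expand_pair_key(template):
    """Replace the {Pair} placeholder with the mate numbers 1 and 2."""
    return [template.replace("{Pair}", str(mate)) for mate in (1, 2)]


paths = expand_pair_key("SRS018978/tutorial_example_SRS018978.pair{Pair}.fastq.gz")
for path in paths:
    print(path)
```

The two resulting paths match the pair1 and pair2 files listed in the description of the source data.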

Setting up parameters of the taxonomic profiling

We will now prepare the section "Taxonomic profiling" of the makefile, which starts with the following text:

# ----------------------------------------------- #
#              Taxonomic profiling                #
# ----------------------------------------------- #

The first subsection provides instructions for Bowtie 2. For example, the flag --phred33 selects the Phred+33 quality-score encoding; it is switched on when its value is yes or true, or when no value is given (see below). For the quality scores, you can choose between --phred33, --phred64, Solexa scores (--solexa-quals), or integer scores (--int-quals), depending on the sequencing platform that generated the reads. In our example, we will use:

Bowtie2:
  --phred33: yes  # Sets the option. Default.

The second subsection provides options for MetaPhlAn. Most importantly, if you wish to analyze the two libraries as a single data set in MetaPhlAn, uncomment the Pool line:

Metaphlan:
  #Pool: Skin
  --ignore_eukaryotes:   # comment to include eukaryotes (MetaPhlAn2 only)
  --ignore_viruses:      # comment to include viruses (MetaPhlAn2 only)

The abundances will then be calculated for the whole Skin sample as given in the Samples section, instead of calculating separate abundances for each library (SRS018978 and SRS024655).

Note that a series of further options can be given for MetaPhlAn profiling (see the makefile documentation). In this tutorial, eukaryotes and viruses are excluded.

Setting up parameters of the statistical analyses

This section controls which statistical analyses are performed, as well as their parameters. It starts with the following text:

# ----------------------------------------------- #
# Statistical Analysis of the taxonomic profiles  #
# ----------------------------------------------- #

Krona hierarchical pie-charts

One visualization option in metaBIT is to use Krona hierarchical pie-charts. metaBIT creates one HTML file that includes one chart per sample and represents taxon abundances at each taxonomic level. Krona pie-charts are generated by default, but you can disable them:

Krona:
  #run: no  # will not produce Krona visualization files when uncommented.
  #-a: no   # If you uncomment, Krona charts will require an internet connection to use Krona resources.

Statistical analysis module: Statax

Most of the statistical analyses run by metaBIT are controlled by the Statax section. This section uses the table of abundances produced from the given samples, or the table supplied with the 'run_from_table' key. The available sub-analyses are:

  • doDiv: computes diversity indices for each sample and each taxonomic level (using the Shannon index by default).

  • doBarplot: shows a stacked barplot of the abundances for each sample. Useful option:

    • --order: reorders samples, for example based on a Euclidean distance (as in the heatmap). Alphabetical by default.

  • doHeatmap: shows a heatmap of taxonomic abundances (taxa x samples).

  • doPcoa: performs a Principal Coordinates Analysis of the samples, using the R package vegan. Some useful options are:

    • --distance: the distance method (bray by default).
    • --makefile: an R file defining the color and symbol arguments for the plot, and the legend text.
    • --inv-x (/ --inv-y): reverses the x (/ y) axis.

  • doClust: performs a hierarchical clustering with bootstrap on samples, using the pvclust and vegan R packages. Useful options are:

    • --dist.method: the distance method (bray by default).
    • --nboot: number of bootstrap replicates (10000 by default).
    • --ncores: number of cores used to parallelize the bootstrap.
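
For orientation, the Shannon index mentioned for doDiv is H = -sum(p_i * ln p_i), taken over the relative abundances p_i of the taxa in one sample. The sketch below illustrates the formula and is not the statax implementation; it assumes natural logarithms and rescales the input to proportions, and the abundance values are made up.

```python
import math


def shannon_index(abundances):
    """Shannon diversity H = -sum(p * ln p) over the non-zero proportions."""
    total = sum(abundances)
    proportions = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in proportions)


# Made-up relative abundances (in %) for two hypothetical samples.
even_sample = [25.0, 25.0, 25.0, 25.0]
skewed_sample = [97.0, 1.0, 1.0, 1.0]
print(shannon_index(even_sample), shannon_index(skewed_sample))
```

A sample dominated by a single taxon gets a lower index than a sample in which the same number of taxa are evenly represented.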

For full information on these sub-analyses and their respective options, go to the nodes/tools/statax_Rmodule/ subfolder of the metaBIT folder and look at the help text of the individual programs, for example:

$ /path/to/metabit/nodes/tools/statax_Rmodule/doDiv.R --help

In our example, we would like to estimate Shannon diversity indices and generate barplots and heatmaps of the detected microbial diversity, and also perform hierarchical clustering and Principal Coordinates Analyses. We therefore add the following text to the makefile:

Statax:
  Skin_only:
    taxlevels: pcofgs
    filterout: 1
    doDiv:
      --index: shannon
    doBarplot:
    doHeatmap:

In our example, we also wish to compare the two skin microbial profiles to those of 5 human body sites (mouth, nose, skin, stool, vagina), each represented by 2 HMP samples. This comparative panel is only an example; it can be extended to the 689 HMP microbiomes provided as part of metaBIT. A comparative panel can be any tabulated flat file providing relative abundances of the samples, tissues, environments, etc. to be compared (such tables can be generated using metaBIT). We therefore add the following text to the makefile:

  Skin_HMP:
    merge:
      #- HMP_10.tsv      # if you are using MetaPhlAn version 1
      - HMPII_10.tsv     # if you are using MetaPhlAn version 2
    doDiv:
    doBarplot:
      --order: euclidean
    doHeatmap:
    doPcoa:
      --makefile: pcoa_symbols_and_color.R
    doClust:
      --nboot: 1000
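
The default distance for doPcoa and doClust is "bray", which refers to the Bray-Curtis dissimilarity. As a rough sketch (not the R code that metaBIT calls), it can be computed for two abundance profiles as the sum of absolute differences divided by the sum of all abundances. The two profiles below are invented for illustration.

```python
def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two equally long abundance vectors."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same taxa in the same order")
    numerator = sum(abs(a - b) for a, b in zip(profile_a, profile_b))
    denominator = sum(a + b for a, b in zip(profile_a, profile_b))
    return numerator / denominator


# Invented abundances (in %) of three taxa in two hypothetical samples.
sample_1 = [80.0, 20.0, 0.0]
sample_2 = [60.0, 0.0, 40.0]
print(bray_curtis(sample_1, sample_2))
```

The value ranges from 0 for identical profiles to 1 for profiles that share no taxa.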

The last section provides instructions for running LEfSe, which performs Linear Discriminant Analysis (LDA) on predefined sample groups. LEfSe is disabled by default, because we recommend inspecting the output from Statax before selecting groups for LEfSe. However, to demonstrate LEfSe in this tutorial, we have enabled it and selected groups that match each human body site, as shown below.

Lefse:
  run: yes
  merge:
    #- HMP_10.tsv      # if you are using MetaPhlAn version 1
    - HMPII_10.tsv     # if you are using MetaPhlAn version 2
  Groups:
    Skin:
      - Skin_SRS018978
      - Skin_SRS024655
      - skin-SRS019063
      - skin-SRS046688
    Stool:
      - stool-SRS013800
      - stool-SRS048870

Important note: The additional table for comparison HMPII_10.tsv has been obtained with MetaPhlAn 2. If you are using MetaPhlAn 1, you should change HMPII_10.tsv to HMP_10.tsv, which is also provided in the example folder.

Your makefile is now ready. Save it, and you can then launch metaBIT.

4. Running metaBIT

If you have not written the options --metaphlan-path and --jar-root to the config file, remember to add them on the command line (see the installation section).

First, perform a dry run to see which actions will be executed. From the example/ directory, type the following command in a terminal (a summary of the nodes will then be displayed):

$ metaBIT tutorial_example.yaml --dry-run

Once you are ready, you can start metaBIT:

$ metaBIT tutorial_example.yaml

5. Going through the results

Once the analyses are completed, their respective results can be found in subfolders located in the main folder called out_tutorial_example/.

Here, we show you the results obtained with MetaPhlAn2.

The main output folder includes the summary_readcounts.tsv file, which provides summary statistics about the number of reads considered, and their respective types:

sample_lib        paired.tot    paired.mapped    paired.no_dup    singles.tot    singles.mapped    singles.no_dup    sum.tot    sum.mapped    sum.no_dup
Skin_SRS018978    70678         33296            18104            24956          19768             12963             95634      53064         31067
Skin_SRS024655    70896         34795            18819            37475          20562             20354             108371     55357         39173
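
One quick way to read this table is to compute, per library, the fraction of all reads that mapped and the fraction of mapped reads that remain in the no_dup column. The sketch below embeds the three sum.* values from the table above rather than reading the file; the variable names are illustrative only.

```python
# sum.tot, sum.mapped and sum.no_dup copied from the table above.
read_counts = {
    "Skin_SRS018978": (95634, 53064, 31067),
    "Skin_SRS024655": (108371, 55357, 39173),
}

fractions = {}
for library, (total, mapped, no_dup) in read_counts.items():
    # Fraction of all reads that mapped, and fraction of mapped reads in no_dup.
    fractions[library] = (mapped / total, no_dup / mapped)
    print(f"{library}: {mapped / total:.1%} mapped, "
          f"{no_dup / mapped:.1%} of mapped reads in no_dup")
```

Reading the same columns from summary_readcounts.tsv with the csv module would work equally well, assuming the file is tab-separated as its extension suggests.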

In the main output folder, you will also find all_taxa.tsv, which provides the relative abundances of all identified taxa as a tabulated flat file:

ID  Skin_SRS018978  Skin_SRS024655
k__Bacteria 88.44126    100.0
k__Bacteria|p__Actinobacteria   74.06891    94.51828
k__Bacteria|p__Actinobacteria|c__Actinobacteria 74.06891    94.51828
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales  74.06891    94.51828
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Propionibacteriaceae  74.06891    94.51828
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Propionibacteriaceae|g__Propionibacterium 74.06891    94.51828
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Propionibacteriaceae|g__Propionibacterium|s__Propionibacterium_acnes  74.06891    94.51828
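
Each ID in this table is a lineage whose ranks are joined by "|", and each rank carries a one-letter prefix (k, p, c, o, f, g or s). The following sketch, which is not part of metaBIT, shows one way to pull out the rows for a single rank; rows_at_rank is a hypothetical helper, and only two rows from the table above are embedded.

```python
def rows_at_rank(lines, rank_letter):
    """Yield (taxon, abundances) for rows whose deepest rank matches rank_letter."""
    for line in lines:
        lineage, *values = line.split()
        deepest = lineage.split("|")[-1]
        if deepest.startswith(rank_letter + "__"):
            # Drop the three-character prefix such as "g__".
            yield deepest[3:], [float(v) for v in values]


# Two rows copied from the table above (phylum level and genus level).
table = [
    "k__Bacteria|p__Actinobacteria 74.06891 94.51828",
    "k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales"
    "|f__Propionibacteriaceae|g__Propionibacterium 74.06891 94.51828",
]
genera = list(rows_at_rank(table, "g"))
print(genera)
```

The same one-letter codes appear in the taxlevels: pcofgs setting of the Statax section.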

Finally, four subfolders contain results from the different analyses requested, namely Krona visualization (krona/), statistical analyses (statax/), LEfSe figures (lefse/) and all intermediary alignment files (Skin/).

Krona visualisation outputs can be viewed in any web browser: http://htmlpreview.github.io/?https://bitbucket.org/Glouvel/metabit/wiki/img/all_taxa.krona.html

The two sets of statistical analyses are found in statax/Skin_only/ and statax/Skin_HMP/, respectively. Each includes barplots of taxonomic abundances at the requested taxonomic levels.

Heatmaps are also provided.

The underlying numeric tables are provided in an additional subfolder called tables/, and the Shannon diversity indices are provided in Skin_HMP_diversities.tsv.

The Skin_HMP/ subfolder also shows, as requested in the makefile, the results from PCoA (Skin_HMP_pcoa_Genera.pdf) and hierarchical clustering analyses (Skin_HMP_clust_Genera.pdf) at all taxonomic levels requested.

Finally, the results from the LEfSe LDA analyses are provided in the subfolder lefse/: graphics can be viewed in the all_taxa_merged.lefse.plot_res.pdf and all_taxa_merged.lefse.cladogram.pdf files.

