Tutorial
In this tutorial, we will guide the user through the main options of metaBIT by profiling and analysing the microbiota of two human skin samples. The original sequence data were retrieved from the archives of the Human Microbiome Project (HMP: Human Microbiome Project Consortium 2012a).
1. Configuring the pipeline
To run all the required programs, you must have installed and configured the metaBIT pipeline as explained in the installation section.
2. Description of the source data
In the example folder from metaBIT, we provide two subfolders labeled SRS018978 and SRS024655.
Each contains a subset of 120,000 sequence reads sub-sampled from the two HMP skin samples SRS018978 and SRS024655, split into the following files:
tutorial_example_SRS018978.pair1.fastq.gz
tutorial_example_SRS018978.pair2.fastq.gz
tutorial_example_SRS018978.singleton.fastq.gz
and:
tutorial_example_SRS024655.pair1.fastq.gz
tutorial_example_SRS024655.pair2.fastq.gz
tutorial_example_SRS024655.singleton.fastq.gz
As we will run metaBIT in the example folder, you should change the directory to that folder:
cd /path/to/metaBIT/example/
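Before going further, it can be useful to confirm that the six input files listed above are present. The following is a minimal sketch, not part of metaBIT: the helper names expected_files and missing_files are hypothetical, and the script assumes it is run from the example folder.

```python
from pathlib import Path

# Library names and file suffixes as listed in this tutorial.
LIBRARIES = ["SRS018978", "SRS024655"]
SUFFIXES = ["pair1", "pair2", "singleton"]


def expected_files(root="."):
    """Return the six FASTQ paths the tutorial expects under `root`."""
    return [
        Path(root) / lib / f"tutorial_example_{lib}.{suffix}.fastq.gz"
        for lib in LIBRARIES
        for suffix in SUFFIXES
    ]


def missing_files(root="."):
    """Return the expected paths that do not exist as regular files."""
    return [path for path in expected_files(root) if not path.is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        for path in missing:
            print(f"missing: {path}")
    else:
        print("all tutorial input files found")
```

If any file is reported as missing, check that you are in the example folder of your metaBIT installation.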
3. Preparing a YAML makefile
A single, simple makefile, provided by the user in YAML format, defines all the analyses to be performed in metaBIT. This file contains the parameters of each type of analysis, the location of the fastq sequence files, and the sample data structure. Full instructions describing the format and structure of metaBIT YAML makefiles are given in the makefile documentation that accompanies metaBIT.
In this tutorial, we will walk through the content of a simple makefile, tutorial_example.yaml. You can follow the structure of the makefile using any text editor.
The first line specifies the file type:
# -*- mode: Yaml; -*-
YAML is a human-readable format for storing data hierarchically. Subcategories are indented with one or more spaces relative to the parent category. Never use tabs for indentation. The YAML documentation gives a quick overview of the language structure. Commented lines begin with a hash character (#).
The 'Samples' section
You must first indicate the name and location of the data files to be used as input for the metaBIT pipeline. Input files of the metaBIT pipeline are trimmed reads (adaptors have been removed) from single-end or paired-end sequencing runs. In this tutorial, the two libraries SRS018978 and SRS024655 will be analysed separately, but they can be grouped in one single sample (Skin). The corresponding sample structure in the makefile should be:
Samples:
  Skin: # First skin sample name
    SRS018978: # First library name
You must then report all paths to the sequence data files, which could consist of either Paired-End reads (Paired, Collapsed, Singletons) or Single-End reads.
The SRS018978 library contains paired reads. Using the key "{Pair}", the pipeline retrieves both files of the pair: the key is replaced by 1 and 2 when searching for the files.
Samples:
  Skin: # First skin sample name
    SRS018978: # First library name
      Paired: SRS018978/tutorial_example_SRS018978.pair{Pair}.fastq.gz
      Singles: SRS018978/tutorial_example_SRS018978.singleton.fastq.gz
Do the same for the second library, at the same indentation level as the first one:
    SRS024655: # Second library name
      Paired: SRS024655/tutorial_example_SRS024655.pair{Pair}.fastq.gz
      Singles: SRS024655/tutorial_example_SRS024655.singleton.fastq.gz
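To make the role of the "{Pair}" key concrete, here is a minimal sketch of the substitution described above. It only illustrates the idea and is not metaBIT's own code; the function name expand_pair is hypothetical.

```python
def expand_pair(template):
    """Expand the {Pair} key into the mate 1 and mate 2 file names."""
    if "{Pair}" not in template:
        raise ValueError("template has no {Pair} key")
    # The key stands for 1 and 2, one value per mate file.
    return [template.replace("{Pair}", mate) for mate in ("1", "2")]


paths = expand_pair("SRS018978/tutorial_example_SRS018978.pair{Pair}.fastq.gz")
for path in paths:
    print(path)
```

The two printed paths correspond to the pair1 and pair2 files listed in the description of the source data.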
Setting up parameters of the taxonomic profiling
We will now prepare the section "Taxonomic profiling" of the makefile, which starts with the following text:
# ----------------------------------------------- #
# Taxonomic profiling #
# ----------------------------------------------- #
The first subsection provides instructions for "Bowtie 2".
For example, the flag --phred33 sets the quality score encoding to the Phred+33 system when yes, true, or no value is given (see below).
For the quality score, you can choose between --phred33, --phred64, Solexa scores (--solexa-quals), or integer scores (--int-quals), depending on the sequencing platform from which the reads were generated.
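If you are unsure which encoding your reads use, a common heuristic is to look at the lowest quality character in a sample of reads. The sketch below is an assumption-laden illustration, not part of metaBIT or Bowtie 2: the function names are hypothetical, the input is assumed to be a gzip-compressed FASTQ file with four lines per record, and the thresholds (ASCII 59 and 64) are the usual lower bounds of the Solexa and Phred+64 ranges.

```python
import gzip


def read_quality_lines(path, max_reads=1000):
    """Collect the quality line of up to `max_reads` FASTQ records."""
    qualities = []
    with gzip.open(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 3:  # fourth line of each record
                qualities.append(line.rstrip("\n"))
                if len(qualities) >= max_reads:
                    break
    return qualities


def guess_quality_flag(quality_strings):
    """Guess a Bowtie 2 quality flag from the lowest quality character.

    The guess is only reliable when low-quality bases were sampled:
    Phred+33 reads with uniformly high qualities are indistinguishable
    from Phred+64 reads with this heuristic.
    """
    codes = [ord(ch) for quality in quality_strings for ch in quality]
    if not codes:
        raise ValueError("no quality characters provided")
    lowest = min(codes)
    if lowest < 59:
        return "--phred33"
    if lowest < 64:
        return "--solexa-quals"
    return "--phred64"


print(guess_quality_flag(["IIII##", "IIIIII"]))
```

Treat the result as a hint and confirm it against the documentation of your sequencing platform.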
In our example, we will use:
Bowtie2:
  --phred33: yes # Sets the option. Default.
The second subsection provides options for MetaPhlAn. Most importantly, if you wish to analyze the two libraries as a single data set in MetaPhlAn, uncomment the Pool line:
Metaphlan:
  #Pool: Skin
  --ignore_eukaryotes: # comment to include eukaryotes (MetaPhlAn2 only)
  --ignore_viruses: # comment to include viruses (MetaPhlAn2 only)
The abundances will then be calculated for the whole Skin sample as given in the Samples section, instead of being calculated separately for each library (SRS018978 and SRS024655).
Note that a series of options could also be given for MetaPhlAn profiling (see Documentation makefile). In this tutorial, eukaryotes and viruses are excluded.
Setting up parameters of the statistical analyses
This section controls which statistical analyses should be performed, as well as their parameters. It starts with the following text:
# ----------------------------------------------- #
# Statistical Analysis of the taxonomic profiles #
# ----------------------------------------------- #
Krona hierarchical pie-charts
One visualization option in metaBIT is to use Krona hierarchical pie-charts. metaBIT creates one HTML file that includes one chart per sample and represents taxon abundances per taxonomic level. Generating Krona pie-charts is enabled by default, but you can disable it:
Krona:
  #run: no # will not produce Krona visualization files when uncommented.
  #-a: no # If you uncomment, Krona charts will require an internet connection to use Krona resources.
Statistical analysis module: Statax
Most of the statistical analyses run by metaBIT are piloted by the Statax section. This section uses the table of abundances produced from the given samples, or the table given at the 'run_from_table' key.
- doDiv: computes diversity indices for each sample and each taxonomic level (using the Shannon index by default).
- doBarplot: shows a stacked barplot of the abundances for each sample. Useful option:
  - --order: reorders samples, for example based on a Euclidean distance (as in the heatmap). Alphabetical by default.
- doHeatmap: shows a heatmap of taxonomic abundances (taxon x samples).
- doPcoa: performs a Principal Coordinates Analysis of the samples, using the R package vegan. Some useful options are:
  - --distance: the distance method (bray by default).
  - --makefile: an R file defining the color and symbol arguments for the plot, and the legend text.
  - --inv-x (/--inv-y): reverses the x (/y) axis.
- doClust: performs a hierarchical clustering with bootstrap on samples, using the Pvclust and Vegan R packages. Useful options are:
  - --dist.method: the distance method (bray by default).
  - --nboot: number of bootstrap replicates (10000 by default).
  - --ncores: number of cores used to parallelize the bootstrap.
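The default distance for doPcoa and doClust is "bray", that is, the Bray-Curtis dissimilarity. As a reminder of what this measures, here is a minimal sketch of the standard formula (sum of absolute differences divided by the sum of all abundances). It is an independent illustration, not the code run by metaBIT or vegan, and the two abundance profiles are hypothetical.

```python
def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two {taxon: abundance} profiles."""
    taxa = set(profile_a) | set(profile_b)
    # Taxa absent from one profile count as abundance 0 in that profile.
    numerator = sum(
        abs(profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) for t in taxa
    )
    denominator = sum(
        profile_a.get(t, 0.0) + profile_b.get(t, 0.0) for t in taxa
    )
    if denominator == 0:
        raise ValueError("both profiles are empty")
    return numerator / denominator


# Hypothetical genus-level profiles (percentages), for illustration only.
sample_1 = {"g__A": 70.0, "g__B": 30.0}
sample_2 = {"g__A": 90.0, "g__C": 10.0}
distance = bray_curtis(sample_1, sample_2)
print(distance)
```

The value ranges from 0 (identical profiles) to 1 (no shared taxa).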
For full information on these sub-analyses and their respective options, go to the nodes/tools/statax_Rmodule/ subfolder of the metaBIT folder and look at the help text of the individual programs, for example:
$ /path/to/metabit/nodes/tools/statax_Rmodule/doDiv.R --help
In our example, we would like to estimate Shannon diversity indices and generate barplots and heatmaps of the microbial diversity detected in the two skin libraries. Hierarchical clustering and Principal Coordinates Analysis are set up in the second, comparative analysis further down. We thus add the following text to the makefile:
Statax:
  Skin_only:
    taxlevels: pcofgs
    filterout: 1
    doDiv:
      --index: shannon
    doBarplot:
    doHeatmap:
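The "--index: shannon" setting requests the Shannon diversity index. As a reminder of what that index is, the sketch below computes H = -sum(p * ln(p)) over the relative abundances p. It is an independent illustration with hypothetical input values, not the code of doDiv.R. The natural logarithm is assumed here, which is the default in vegan's diversity function, but the tutorial does not state which base doDiv uses.

```python
import math


def shannon_index(abundances):
    """Shannon index (natural log) of a list of taxon abundances."""
    positive = [value for value in abundances if value > 0]
    total = sum(positive)
    if total == 0:
        raise ValueError("no positive abundances provided")
    # Normalise to proportions so that counts and percentages both work.
    return -sum((v / total) * math.log(v / total) for v in positive)


# Hypothetical profiles: a single dominant taxon versus two equal taxa.
print(shannon_index([100.0]))
print(shannon_index([50.0, 50.0]))
```

A community dominated by one taxon has an index of 0, and the index grows as abundances become more evenly spread over more taxa.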
In our example, we also wish to compare the two skin microbial profiles to those of 5 human body sites (mouth, nose, skin, stool, vagina), each represented by 2 HMP samples. This comparative panel is provided here as an example only, but it can be extended to the 689 HMP microbiomes provided as part of metaBIT. The comparative panel can be any tabulated flat file providing relative abundances of the samples, tissues, environments, etc. to be compared (such files can be generated using metaBIT). We therefore add the following text to the makefile:
  Skin_HMP:
    merge:
      #- HMP_10.tsv # if you are using MetaPhlAn version 1
      - HMPII_10.tsv # if you are using MetaPhlAn version 2
    doDiv:
    doBarplot:
      --order: euclidean
    doHeatmap:
    doPcoa:
      --makefile: pcoa_symbols_and_color.R
    doClust:
      --nboot: 1000
The last section provides instructions for running LEfSe, which performs Linear Discriminant Analysis (LDA) of predefined sample groups.
LEfSe is disabled by default, as we recommend that the user investigate the output from Statax prior to selecting groups for LEfSe.
However, to demonstrate LEfSe in this tutorial, we have enabled LEfSe and selected groups to match each human body site as shown below.
Lefse:
  run: yes
  merge:
    #- HMP_10.tsv # if you are using MetaPhlAn version 1
    - HMPII_10.tsv # if you are using MetaPhlAn version 2
  Groups:
    Skin:
      - Skin_SRS018978
      - Skin_SRS024655
      - skin-SRS019063
      - skin-SRS046688
    Stool:
      - stool-SRS013800
      - stool-SRS048870
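Since each group member must refer to a sample present in the abundance tables, a quick consistency check can catch typos before a run. The sketch below is a hypothetical helper, not a metaBIT feature. The group definitions are copied from the makefile above, while the list of available column names is an assumption made up for the example (it deliberately omits one sample).

```python
def unknown_group_members(groups, columns):
    """Return, per group, the members that are not among `columns`."""
    known = set(columns)
    problems = {}
    for group, members in groups.items():
        unknown = [sample for sample in members if sample not in known]
        if unknown:
            problems[group] = unknown
    return problems


# Groups as written in the tutorial makefile.
groups = {
    "Skin": ["Skin_SRS018978", "Skin_SRS024655",
             "skin-SRS019063", "skin-SRS046688"],
    "Stool": ["stool-SRS013800", "stool-SRS048870"],
}

# Assumed column names of the merged table; stool-SRS048870 is left out
# on purpose to show what a reported problem looks like.
columns = ["Skin_SRS018978", "Skin_SRS024655", "skin-SRS019063",
           "skin-SRS046688", "stool-SRS013800"]

problems = unknown_group_members(groups, columns)
print(problems)
```

An empty result means every group member was found among the available columns.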
Important note: The additional comparison table HMPII_10.tsv has been obtained with MetaPhlAn 2. If you are using MetaPhlAn 1, you should change HMPII_10.tsv to HMP_10.tsv, which is also provided in the example folder.
Your makefile is now ready. After saving it, you can launch metaBIT as described in the next section.
4. Running metaBIT
If you haven't written the options --metaphlan-path and --jar-root to the config file, remember to add them to the command line (see installation).
First, perform a dry run to see which actions will be executed by typing the following command in a terminal from the example/ directory (a summary of nodes will then be displayed):
$ metaBIT tutorial_example.yaml --dry-run
Once you are ready, you can start metaBIT:
$ metaBIT tutorial_example.yaml
5. Going through the results
Once the analyses are completed, their respective results can be found in subfolders of the main output folder, called out_tutorial_example/.
Here, we show you the results obtained with MetaPhlAn2.
The main output folder includes the summary_readcounts.tsv file, which provides summary statistics about the number of reads considered, and their respective types:
sample_lib paired.tot paired.mapped paired.no_dup singles.tot singles.mapped singles.no_dup sum.tot sum.mapped sum.no_dup
Skin_SRS018978 70678 33296 18104 24956 19768 12963 95634 53064 31067
Skin_SRS024655 70896 34795 18819 37475 20562 20354 108371 55357 39173
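As a worked example of reading this table, the sketch below parses the three lines shown above and computes, for each library, the fraction of reads that were mapped (sum.mapped divided by sum.tot). It is a hypothetical helper, not part of metaBIT, and it assumes that the sum.* columns are the paired and singles columns added together, which the numbers above are consistent with.

```python
# The table exactly as shown in this tutorial (whitespace-separated here;
# str.split() also handles the tab-separated file).
SUMMARY = """\
sample_lib paired.tot paired.mapped paired.no_dup singles.tot singles.mapped singles.no_dup sum.tot sum.mapped sum.no_dup
Skin_SRS018978 70678 33296 18104 24956 19768 12963 95634 53064 31067
Skin_SRS024655 70896 34795 18819 37475 20562 20354 108371 55357 39173
"""


def parse_summary(text):
    """Return {library: {column: count}} from the summary table text."""
    rows = [line.split() for line in text.strip().splitlines()]
    header, records = rows[0], rows[1:]
    return {
        record[0]: dict(zip(header[1:], (int(v) for v in record[1:])))
        for record in records
    }


def mapped_fraction(counts):
    """Fraction of all reads (paired plus singles) that were mapped."""
    return counts["sum.mapped"] / counts["sum.tot"]


table = parse_summary(SUMMARY)
for library, counts in table.items():
    print(f"{library}: {mapped_fraction(counts):.1%} of reads mapped")
```

For the two tutorial libraries, this gives roughly 55% and 51% of reads mapped, respectively.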
In the main output folder, you will also find the all_taxa.tsv file, which provides the relative abundances of all identified taxa as a tabulated flat file:
ID Skin_SRS018978 Skin_SRS024655
k__Bacteria 88.44126 100.0
k__Bacteria|p__Actinobacteria 74.06891 94.51828
k__Bacteria|p__Actinobacteria|c__Actinobacteria 74.06891 94.51828
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales 74.06891 94.51828
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Propionibacteriaceae 74.06891 94.51828
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Propionibacteriaceae|g__Propionibacterium 74.06891 94.51828
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Propionibacteriaceae|g__Propionibacterium|s__Propionibacterium_acnes 74.06891 94.51828
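Each ID in this table is a lineage whose ranks are separated by "|" and prefixed by the rank letter (k, p, c, o, f, g, s). If you want to post-process the table yourself, the sketch below shows one way to keep only the rows of a given rank. It is a hypothetical helper, not part of metaBIT. It assumes the file is tab-separated and uses a three-row excerpt of the table above as input.

```python
import csv
import io

# Excerpt of all_taxa.tsv from this tutorial, assumed to be tab-separated.
EXCERPT = (
    "ID\tSkin_SRS018978\tSkin_SRS024655\n"
    "k__Bacteria\t88.44126\t100.0\n"
    "k__Bacteria|p__Actinobacteria\t74.06891\t94.51828\n"
    "k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales"
    "|f__Propionibacteriaceae|g__Propionibacterium"
    "|s__Propionibacterium_acnes\t74.06891\t94.51828\n"
)


def rows_at_level(text, rank_letter):
    """Return {clade: {sample: abundance}} for rows ending at one rank."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    selected = {}
    for row in reader:
        clade = row["ID"].split("|")[-1]  # deepest rank of the lineage
        if clade.startswith(rank_letter + "__"):
            selected[clade] = {
                sample: float(value)
                for sample, value in row.items()
                if sample != "ID"
            }
    return selected


species = rows_at_level(EXCERPT, "s")
print(species)
```

To use it on a real run, read the content of all_taxa.tsv from the output folder and pass it as the text argument.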
Finally, four subfolders contain results from the different analyses requested, namely Krona visualization (krona/), statistical analyses (statax/), LEfSe figures (lefse/) and all intermediary alignment files (Skin/).
Krona visualisation outputs can be viewed in any web browser: http://htmlpreview.github.io/?https://bitbucket.org/Glouvel/metabit/wiki/img/all_taxa.krona.html
The two sets of statistical analyses are found in statax/Skin_only/ and statax/Skin_HMP/, respectively, and show barplots of taxonomic abundances at the requested taxonomic levels:
- e.g. at the species level: Skin_only_barplot_Species.pdf
- e.g. at the genus level: Skin_HMP_barplot_Genera.pdf
Heatmaps are also provided:
- e.g. at the species level: Skin_only_heatmap_Species.pdf
- e.g. at the genus level: Skin_HMP_heatmap_Genera.pdf
The underlying numeric tables are provided in an additional subfolder called tables/, and the Shannon diversity indices are provided in the Skin_HMP_diversities.tsv file.
The Skin_HMP/ subfolder also contains, as requested in the makefile, the results from the PCoA (e.g. Skin_HMP_pcoa_Genera.pdf) and hierarchical clustering analyses (e.g. Skin_HMP_clust_Genera.pdf) at all taxonomic levels requested.
Finally, the results from the LEfSe LDA analyses are provided in the lefse subfolder: graphics can be viewed in the all_taxa_merged.lefse.plot_res.pdf and all_taxa_merged.lefse.cladogram.pdf files.