
MetAML - Metagenomic prediction Analysis based on Machine Learning

MetAML is a computational tool for metagenomics-based prediction tasks and for quantitative assessment of the strength of potential microbiome-phenotype associations.

The tool (i) is based on machine learning classifiers, (ii) includes automatic model and feature selection steps, (iii) comprises cross-validation and cross-study analysis, and (iv) uses as features quantitative microbiome profiles including species-level relative abundances and presence of strain-specific markers.

It also provides species-level taxonomic profiles, marker presence data, and metadata for 3000+ publicly available metagenomes.


Prerequisites

MetAML is written in Python (tested on version 2.7) and requires some additional packages (matplotlib, numpy, pandas, scikit-learn, scipy), all included in the Anaconda platform.
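Outside Anaconda, the required packages can be verified before running the tool. A minimal, Python 2/3-compatible check (package list taken from above; note that scikit-learn is imported under the name "sklearn"):

```python
# Verify that MetAML's dependencies are importable.
# scikit-learn is imported under the name "sklearn".
required = ["matplotlib", "numpy", "pandas", "sklearn", "scipy"]

missing = []
for name in required:
    try:
        __import__(name)
    except ImportError:
        missing.append(name)

if missing:
    print("Missing packages: " + ", ".join(missing))
else:
    print("All MetAML dependencies are available.")
```

Any package reported as missing can then be installed with pip or conda.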


Installation

1) MetAML can be downloaded using wget

wget https://bitbucket.org/cibiocm/metaml/get/default.zip
unzip default.zip
mv CibioCM-metaml-*/ metaml/

2) or using the Mercurial hg command

hg clone https://bitbucket.org/CibioCM/metaml

Package structure

The main "metaml" folder is organized as follows:

  • "data" folder: available data for 3000+ metagenomes in terms of i) species-level relative abundances ("abundance.txt.bz2"), ii) presence of strain-specific markers ("marker_presence.txt.bz2"), and iii) abundance of strain-specific markers ("marker_abundance.txt.bz2"). iv) The file "markers2clades_DB.txt.bz2" is the lookup table that associates each marker identifier with the corresponding species. These files must be uncompressed before use:
cd metaml
bunzip2 data/abundance.txt.bz2
bunzip2 data/marker_presence.txt.bz2
bunzip2 data/marker_abundance.txt.bz2
bunzip2 data/markers2clades_DB.txt.bz2
  • dataset_selection.py: script to extract from the whole available data (e.g., from "abundance.txt") only the samples/features of interest;
  • classification.py: script to run the classification task on the selected data;
  • "tools" folder: additional scripts to generate the figures present in the published paper;
  • "scripts" folder: commands to replicate the results reported in the published paper.

How to select a subset of samples (using dataset_selection.py)

Help -h

python dataset_selection.py -h
usage: dataset_selection.py [-h] [-z FEATURE_IDENTIFIER] [-s SELECT]
                            [-r REMOVE] [-i INCLUDE] [-e EXCLUDE] [-t]
                            [INPUT_FILE] [OUTPUT_FILE]

positional arguments:
  INPUT_FILE            The input dataset file [stdin if not present]
  OUTPUT_FILE           The output dataset file

optional arguments:
  -h, --help            show this help message and exit
  -z FEATURE_IDENTIFIER
                        The feature identifier
  -s SELECT, --select SELECT
                        The samples to select
  -r REMOVE, --remove REMOVE
                        The samples to remove
  -i INCLUDE, --include INCLUDE
                        The fields to include
  -e EXCLUDE, --exclude EXCLUDE
                        The fields to exclude
  -t, --tout            Transpose output dataset file

An example

1) The following command selects the 440 samples, in terms of species-level relative abundances, belonging to the T2D and WT2D datasets considered in the published paper:

python dataset_selection.py data/abundance.txt data/abundance_t2d-WT2D.txt -z "k__" -s dataset_name:t2dmeta_long:t2dmeta_short:WT2D -r gender:"-":" -",disease:impaired_glucose_tolerance -i feature_level:s__,dataset_name:disease -e feature_level:t__
  • Input file: We consider as INPUT_FILE the matrix (metadata/features on the rows with the first column that denotes the metadata/feature identifier; samples on the columns) with the species-level relative abundances "data/abundance.txt";
  • Output file: The OUTPUT_FILE is a subset of this matrix and is saved as "data/abundance_t2d-WT2D.txt";
  • Feature identifier: All the rows that contain "k__" in their identifier (i.e., the first column) are treated as features; the rest is considered metadata;
  • Selection of samples: The pair of options -s (SELECT) and -r (REMOVE) defines which samples to select or remove. In this example, we SELECT all the samples whose metadata field "dataset_name" has the value "t2dmeta_long" OR "t2dmeta_short" OR "WT2D". At the same time, we REMOVE all the samples whose metadata field "gender" has the value "-" OR " -" (here this excludes the samples without metadata information) AND all the samples whose metadata field "disease" has the value "impaired_glucose_tolerance";
  • Selection of metadata/features: The pair of options -i (INCLUDE) and -e (EXCLUDE) defines which metadata/features to include or exclude. In this example, we SELECT all the features from the species level (included, denoted as "s__") down to the sub-species level (excluded, denoted as "t__"), i.e., features at the species level. Moreover, we keep only the fields "dataset_name" AND "disease" as metadata.
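The selection semantics above can be sketched in pandas on a toy matrix (hypothetical sample names and values; this illustrates the logic, not MetAML's actual implementation):

```python
import pandas as pd

# Toy input in the same layout as data/abundance.txt:
# metadata/features on the rows, samples on the columns.
data = pd.DataFrame(
    [["t2dmeta_long", "WT2D", "hmp"],   # metadata: dataset_name
     ["t2d", "n", "n"],                  # metadata: disease
     [0.7, 0.2, 0.1],                    # species-level feature
     [0.1, 0.3, 0.2]],                   # sub-species (strain-level) feature
    index=["dataset_name", "disease",
           "k__Bacteria|s__Escherichia_coli",
           "k__Bacteria|s__Escherichia_coli|t__GCF_000005845"],
    columns=["sample1", "sample2", "sample3"])

# -s dataset_name:t2dmeta_long:t2dmeta_short:WT2D -> select matching samples
keep = data.loc["dataset_name"].isin(["t2dmeta_long", "t2dmeta_short", "WT2D"])
subset = data.loc[:, keep]

# -z "k__" -> rows containing "k__" are features, the rest is metadata
is_feature = subset.index.str.contains("k__")

# -i feature_level:s__ -e feature_level:t__ -> keep species-level features only
species = subset.index.str.contains("s__") & ~subset.index.str.contains("t__")
subset = subset.loc[~is_feature | species]

print(subset)
```

Here "sample3" is dropped (its dataset is not selected) and the strain-level row is excluded, leaving the two metadata rows plus the species-level feature.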

2) We can extract the same set of samples, but in terms of presence of strain-specific markers, by slightly modifying the command:

python dataset_selection.py data/marker_presence.txt data/marker_presence_t2d-WT2D.txt -z "GeneID":"gi|" -s dataset_name:t2dmeta_long:t2dmeta_short:WT2D -r gender:"-":" -",disease:impaired_glucose_tolerance -i dataset_name:disease
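With -z "GeneID":"gi|", a row is treated as a feature if its identifier contains either token; a minimal sketch of that matching, with hypothetical identifiers:

```python
# Hypothetical row identifiers from a marker presence matrix.
identifiers = ["gi|483970126|ref|NZ_KB891629.1|:c6456-5752",
               "GeneID:2658371",
               "dataset_name"]

# -z "GeneID":"gi|" -> multiple feature-identifier tokens, OR-ed together.
tokens = ["GeneID", "gi|"]
is_feature = [any(tok in ident for tok in tokens) for ident in identifiers]
print(is_feature)  # -> [True, True, False]
```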

How to run a prediction analysis (using classification.py)

Help -h

python classification.py -h
usage: classification.py [-h] [-z FEATURE_IDENTIFIER] [-d DEFINE] [-t TARGET]
                         [-u UNIQUE] [-b] [-r RUNS_N] [-p RUNS_CV_FOLDS] [-w]
                         [-l {rf,svm,lasso,enet}] [-i {lasso,enet}]
                         [-f CV_FOLDS] [-g CV_GRID] [-s CV_SCORING]
                         [-j FS_GRID] [-e FIGURE_EXTENSION]
                         [INPUT_FILE] [OUTPUT_FILE]

positional arguments:
  INPUT_FILE            The input dataset file [stdin if not present]
  OUTPUT_FILE           The output file [stdout if not present]

optional arguments:
  -h, --help            show this help message and exit
  -z FEATURE_IDENTIFIER
                        The feature identifier
  -d DEFINE, --define DEFINE
                        Define the classification problem
  -t TARGET, --target TARGET
                        Define the target domain
  -u UNIQUE, --unique UNIQUE
                        The unique samples to select
  -b, --label_shuffling
                        Label shuffling
  -r RUNS_N, --runs_n RUNS_N
                        The number of runs
  -p RUNS_CV_FOLDS, --runs_cv_folds RUNS_CV_FOLDS
                        The number of cross-validation folds per run
  -w, --set_seed        Setting seed
  -l {rf,svm,lasso,enet}, --learner_type {rf,svm,lasso,enet}
                        The type of learner/classifier
  -i {lasso,enet}, --feature_selection {lasso,enet}
                        The type of feature selection
  -f CV_FOLDS, --cv_folds CV_FOLDS
                        The number of cross-validation folds for model selection
  -g CV_GRID, --cv_grid CV_GRID
                        The parameter grid for model selection
  -s CV_SCORING, --cv_scoring CV_SCORING
                        The scoring function for model selection
  -j FS_GRID, --fs_grid FS_GRID
                        The parameter grid for feature selection
  -e FIGURE_EXTENSION
                        The extension of the output figure

An example

1) With the following command we run a cross-validation analysis to discriminate between healthy subjects and subjects affected by T2D (these results are denoted as T2D+WT2D* in Figure 7(a) of the published paper):

mkdir results
python classification.py data/abundance_t2d-WT2D.txt results/abundance_t2d-WT2D_rf -d 1:disease:t2d -g [] -w
  • Input file: We consider as INPUT_FILE the data matrix "data/abundance_t2d-WT2D.txt" generated in the above paragraph;
  • Output file: The results are saved in multiple files with <prefix> "results/abundance_t2d-WT2D_rf". In particular, the main results with prediction accuracies and, when computed, feature importance are saved in "<prefix>.txt", the estimation values in "<prefix>_estimations.txt", the ROC curve values in "<prefix>_roccurve.txt", and the PCA plot in "<prefix>_pca.png" (the figure extension can be changed through -e);
  • Definition of the classification problem: We DEFINE (-d) the classification problem by setting to class "1" all the samples having in the metadata field "disease" the value "t2d". The remaining samples are automatically assigned to class "0" (note that in general we can use the syntax "-d 1:field_i:V1:V2,2:field_j:V3" to assign i) to class "1" all the samples having in the metadata field "field_i" the value "V1" or "V2", ii) to class "2" all the samples having in the metadata field "field_j" the value "V3", iii) and the remaining samples to class "0");
  • Definition of the learning setting: Prediction accuracies are estimated through cross-validation (NUMBER OF FOLDS defined with -f; default = 10) and averaged on independent runs (NUMBER OF RUNS defined with -r; default = 20);
  • Setting -w (SEED SETTING) guarantees that different executions of the script (even when changing some parameters, such as the type of learner) use the same training and validation sets in the cross-validation procedure. This is crucial, for example, when performing a statistical test between different classifiers or when comparing true and shuffled labels;
  • Definition of the learner: Different types of classifiers are implemented (LEARNER TYPE defined with -l; default = rf, i.e. Random Forests);
  • Definition of the feature selection strategy: Different feature selection strategies are implemented (FEATURE SELECTION defined with -i; default = none);
  • Definition of model selection and feature selection parameters: Default parameters can be changed by acting on the NUMBER OF CROSS-VALIDATION FOLDS FOR MODEL SELECTION (-f), the PARAMETER GRID FOR MODEL SELECTION (-g), the SCORING FUNCTION FOR MODEL SELECTION (-s), and the PARAMETER GRID FOR FEATURE SELECTION (-j). Setting "-g []" when using random forests as classifier ("-l rf") disables the re-training of the model on the most discriminative features (this saves time, although only the classification results obtained on the entire set of features are reported).
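The -d class-assignment syntax described above can be sketched with a small, hypothetical helper (toy metadata; not MetAML's actual code):

```python
def assign_classes(metadata, definition):
    """Parse a '-d'-style definition such as '1:disease:t2d' or
    '1:field_i:V1:V2,2:field_j:V3' and return one class label per sample.
    Samples matching no rule fall back to class 0."""
    labels = [0] * len(next(iter(metadata.values())))
    for rule in definition.split(","):
        parts = rule.split(":")
        cls, field, values = int(parts[0]), parts[1], set(parts[2:])
        for i, value in enumerate(metadata[field]):
            if value in values:
                labels[i] = cls
    return labels

# Toy metadata table: one list of values per field, aligned across samples.
metadata = {"disease": ["t2d", "n", "t2d", "n"]}
print(assign_classes(metadata, "1:disease:t2d"))  # -> [1, 0, 1, 0]
```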

2) Results using Lasso as feature selection and Support Vector Machine (SVM) as classifier can be obtained by acting on the parameters "-i" and "-l":

python classification.py data/abundance_t2d-WT2D.txt results/abundance_t2d-WT2D_lasso_svm -d 1:disease:t2d -i lasso -l svm -w

3) Results with shuffled labels can be obtained by just adding the option "-b":

python classification.py data/abundance_t2d-WT2D.txt results/abundance_t2d-WT2D_rf-shuffled -d 1:disease:t2d -g [] -w -b

4) We can conduct a cross-study analysis by first training the model on a specific dataset and then validating it on a different one. Adding the option "-t" is sufficient:

python classification.py data/abundance_t2d-WT2D.txt results/abundance_t2d-WT2D_rf_t-t2d -d 1:disease:t2d -g [] -w -t dataset_name:t2dmeta_long:t2dmeta_short 
  • Definition of the validation set: The option -t (TARGET DOMAIN) defines which samples to consider as the validation set. In this example, all the samples whose metadata field "dataset_name" has the value "t2dmeta_long" OR "t2dmeta_short" are used for validation. The remaining samples are automatically used for training.
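The train/validation split implied by -t can be sketched as follows (toy sample table with hypothetical sample names):

```python
# Toy sample -> dataset_name mapping (hypothetical values).
dataset_name = {
    "s1": "t2dmeta_long",
    "s2": "t2dmeta_short",
    "s3": "WT2D",
    "s4": "WT2D",
}

# -t dataset_name:t2dmeta_long:t2dmeta_short -> these samples form the
# validation set; everything else is used for training.
target = {"t2dmeta_long", "t2dmeta_short"}
validation = sorted(s for s, d in dataset_name.items() if d in target)
training = sorted(s for s, d in dataset_name.items() if d not in target)

print("validation:", validation)  # -> validation: ['s1', 's2']
print("training:", training)      # -> training: ['s3', 's4']
```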


Citation

E. Pasolli, D. T. Truong, F. Malik, L. Waldron, and N. Segata, Machine learning meta-analysis of large metagenomic datasets: tools and biological insights, PLOS Computational Biology, 12(7), Jul. 2016.

MetAML is a project of the Computational Metagenomics Lab at CIBIO, University of Trento, Italy.