
MetaMarker: a de novo pipeline to discover novel metagenomic biomarkers

MetaMarker is a pipeline for discovering metagenomic biomarkers from whole metagenome sequencing (WMS) samples. It uses a de novo approach that does not map raw reads to a reference database. The biomarkers found by MetaMarker are conserved domains shared by a subgroup of bacteria. These domains usually have specific biological functions that play important roles in microbial communities but are not yet represented in reference databases.

MetaMarker is published in Bioinformatics:

Mohamad Koohi-Moghadam, Mitesh Borad, Nhan Tran, Kristin Swanson, Lisa Boardman, Hongzhe Sun and Junwen Wang, "MetaMarker: a de novo pipeline for discovering novel metagenomic biomarkers", Bioinformatics


Installing MetaMarker

To install MetaMarker, simply download the file below and extract it on your computer.

https://bitbucket.org/mkoohim/metamarker/get/master.zip

MetaMarker needs the following dependencies (the external tools invoked in the steps below): NumPy, SPAdes, USEARCH, GenomeTools (gt), CD-HIT, Bowtie2, and the SRA toolkit (for downloading the samples).


Why MetaMarker?

Figure_1.png

MetaMarker can discover conserved domains that are shared by a subgroup of bacteria and use them as biomarkers. Such conserved domains may be missed by previous approaches that rely on an OTU table. In the OTU approach, a reference database is used to annotate the reads. The reference database contains a unique genome segment of each clade as the representative of that clade. For example, Clade1, Clade2 and Clade3 are well-known clades whose unique genome segments (red, pink and yellow, respectively) are already in the reference database, while Clade4, Clade5 and Clade6 are newly emerging bacteria that are not yet in it. Based on the OTU table alone, we see no difference between the case and control groups, because the table only covers Clade1, Clade2 and Clade3. However, a conserved domain (green) is more abundant in case than in control. We developed MetaMarker to discover such shared domains.


MetaMarker Pipeline:

Preparing input files:

Figure_2_1.png

The first step of MetaMarker is preparing the input files for the case and control groups. MetaMarker works with WMS sample files in FASTA format. In this tutorial we use the WMS samples of the French population (Zeller, Georg, et al. Molecular Systems Biology 10.11 (2014): 766). This population has 53 samples from CRC patients and 61 from healthy individuals. You can find the sample IDs here. You can download these samples from the NCBI FTP server and use the fastq-dump script from the SRA toolkit to convert the SRA files to FASTA format.


Build a hash table for each sample:

Figure_2_2.png

Here we use Step1_MakeHashTable.py to generate the hash table for each sample. We slide a window of length k over all reads of the case and control groups. A k-mer can take up to 4^k different values. We build a hash table for each sample and record the normalized abundance of each k-mer in that sample. We suggest k=12 for processing a large population such as the French cohort (a larger k needs more RAM). The output hash table is saved in compressed NPZ format. Listed below are the parameters of this script.

Input::
  --sample STRSAMPLE    Enter the path of the folder that contains the FASTA
                        read files.

Output::
  --output STROUT       Enter the path of the output directory. The final NPZ
                        format hash table will be saved into this folder.
  --npz STRNPZ          Enter the name of the output NPZ file. The default
                        name is the same as the input sample file name.

Parameters::
  --window_size WINDOWLENGTH
                        Enter the length of the window. The default value is
                        12. Increasing the window size requires more RAM. The
                        window size should be an even number.
  --step_size SAMPLINGRATE
                        Enter the sampling rate. The default value is 1,
                        meaning the window moves through the reads one base
                        at a time. Increasing this parameter may reduce
                        accuracy, but the pipeline will run faster.

We applied this script to each case sample and saved the results into the npz_case folder (here, as an example, we run it for one sample; you need to run this script for all case samples and save the results into one folder):

Step1_MakeHashTable.py --sample sample_case_1 --output npz_case

We also applied this script to each control sample and saved the results into npz_control (again, run this script for all control samples and save the results into one folder):

Step1_MakeHashTable.py --sample sample_control_1 --output npz_control

This step is I/O intensive and may take time for large WMS samples. We have already built the NPZ results for the French population; you can download the hash tables from npz_case and npz_control.
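The core idea of this step can be sketched in a few lines of Python. This is a minimal illustration, not the actual Step1_MakeHashTable.py: the function name and the exact normalization (count divided by the total number of windows) are our assumptions.

```python
from collections import Counter

def kmer_table(reads, k=12, step=1):
    """Slide a window of length k over each read, count every k-mer,
    and normalize counts by the total number of windows in the sample.
    A simplified stand-in for the pipeline's hash-table step."""
    counts = Counter()
    for read in reads:
        for i in range(0, len(read) - k + 1, step):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

# Toy sample: with k=4, "ACGT" covers 3 of the 10 windows.
table = kmer_table(["ACGTACGT", "ACGTTTTT"], k=4)
```

In the real pipeline the resulting table is written to disk as a compressed NPZ file (e.g. via numpy.savez_compressed) rather than kept as a Python dict.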


Generate short-markers:

Figure_2_3.png

Using npz_case and npz_control we built two NumPy matrices (case and control matrices) and used them as the profile tables of the case and control groups. Each row in these matrices holds the normalized abundance of one k-mer across samples. We selected the k-mers whose normalized abundances are statistically different between case and control. We then used SPAdes to assemble the selected k-mers, processed the SPAdes output, and extracted the sequences longer than k, which we call short-markers. This step can be done by Step2_MakeShortMarker.py. The input parameters of this script are:

Input::
  --caseHash STRCASEHASHTABLES
                        Enter the path of hash tables (NPZ files) which are
                        extracted for case.
  --controlHash STRCONTROLHASHTABLES
                        Enter the path of hash tables (NPZ files) which are
                        extracted for control.

Programs::
  --spade STRSPADE      Provide the path to SPAdes.

Output::
  --output STROUT       Enter the path and name of the output directory.

Parameters::
  --threads ITHREADS    Enter the number of CPUs available for SPAdes. The
                        default value is 1.
  --test STRTESTTYPE    Enter the type of test: 'wilcoxon' or 'ttest'. By
                        default the Wilcoxon-Mann-Whitney test is used.
  --pvalue STRPVALUE    Enter the p-value threshold for the statistical test.
                        The default value is 0.05.

Here we applied this script to the hash tables (npz_case and npz_control) extracted from the case and control groups in the previous step:

Step2_MakeShortMarker.py --caseHash npz_case --controlHash npz_control --spade /software/SPAdes-3.11.1/bin --output output

The script generates some intermediate files and two short-marker files. We need these two files, Short_Markers_case.fasta and Short_Markers_control.fasta, for the next step.
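The k-mer selection logic above can be sketched as follows. This is a hedged illustration using a hand-rolled rank-sum test (normal approximation to the U statistic, no tie correction), not the implementation inside Step2_MakeShortMarker.py; the function names are ours.

```python
import math

def mann_whitney_p(x, y):
    """Two-sided p-value for the rank-sum test via the normal
    approximation to the U statistic (no tie correction)."""
    n1, n2 = len(x), len(y)
    # U counts pairs where x beats y; ties contribute 0.5.
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    if sigma == 0:
        return 1.0
    z = (u - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided

def select_rows(case_matrix, control_matrix, pvalue=0.05):
    """Keep the row indices (k-mers) whose normalized abundances
    differ between the case and control groups."""
    return [i for i, (c, t) in enumerate(zip(case_matrix, control_matrix))
            if mann_whitney_p(c, t) < pvalue]
```

In the pipeline, the sequences of the selected k-mers are then handed to SPAdes for assembly.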


Generate long-markers:

Figure_2_4.png

In this step, for each case sample we extracted the reads that align (identity > 0.95) to at least one short-marker in Short_Markers_case.fasta.

Also, for each control sample we extracted the reads that align (identity > 0.95) to at least one short-marker in Short_Markers_control.fasta.

We then used SPAdes to assemble the selected reads in each sample and generate long-markers. This step can be done by applying Step3_MakeLongMarker.py to each sample. This script has the following parameters:

Input::
  --short_marker SHORTMARKERSTR
                        Enter the path of the short-marker file produced by
                        Step2_MakeShortMarker.py.
  --sample STRSAMPLE    Enter the path of the folder that contains the FASTA
                        files.

Output::
  --tmp STRTMP          Enter the path and name of the tmp directory.
  --output STROUT       Enter the path and name of the output directory.

Programs::
  --spade STRSPADE      Provide the path to SPAdes. The default call is
                        spades.py.
  --usearch USEARCHSTR  Enter the path of USEARCH tool. Please provide the full path of usearch. i.e. "tools/usearch11.0.667_i86linux32"
  --gt GTSTR            Enter the gt tool path.

Parameters::
  --threads ITHREADS    Enter the number of CPUs available for USEARCH and SPAdes. The
                        default value is 1.
  --id IDSTR            Enter similarity rate for the USEARCH. The default
                        value is 0.95.

For each case sample (here, as an example, we run it for one sample; run this script for all case samples and save the results into one folder):

Step3_MakeLongMarker.py --sample sample_case_1 --short_marker Short_Markers_case.fasta --output /case_long_marker --tmp /temp --threads 3 --usearch /software/usearch_tool/ --id 0.95 --spade /software/SPAdes-3.11.1/bin --gt /software/genometools-1.5.8/bin

For each control sample (again, run this script for all control samples and save the results into one folder):

Step3_MakeLongMarker.py --sample sample_control_1 --short_marker Short_Markers_control.fasta --output /control_long_marker --tmp /temp --threads 3 --usearch /software/usearch_tool/ --id 0.95 --spade /software/SPAdes-3.11.1/bin --gt /software/genometools-1.5.8/bin

We saved the case-enriched long-markers into case_long_marker and the control-enriched long-markers into control_long_marker. You can download the results of this step for the French population samples (case_long_marker and control_long_marker).
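The read-selection rule in this step (identity > 0.95) can be illustrated with a toy filter. Note that this sketch uses a naive ungapped positional identity as a stand-in for the local alignments that USEARCH actually computes; the function names are illustrative, not MetaMarker's.

```python
def identity(read, marker):
    """Fraction of matching positions in an ungapped, end-to-end
    comparison -- a crude stand-in for a USEARCH alignment."""
    matches = sum(a == b for a, b in zip(read, marker))
    return matches / max(len(read), len(marker))

def select_reads(reads, markers, threshold=0.95):
    """Keep reads that match at least one short-marker at or above
    the identity threshold; these reads feed the SPAdes assembly."""
    return [r for r in reads
            if any(identity(r, m) >= threshold for m in markers)]
```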


Cleaning long-markers:

Figure_2_5.png

Here, we removed the redundant long-markers that are repeated across samples. We first merged all long-markers from case and control into two separate FASTA files (case-enriched and control-enriched long-marker files), excluding markers shorter than 100 bp during merging. We used CD-HIT to cluster the long-markers and remove markers with more than 90% similarity. Users can also exclude low-prevalence biomarkers using the pRate parameter. Its default value is 0.5, which means a biomarker occurring in less than 50% of the case and control samples will be excluded. We recommend not setting this parameter below 0.1, as low-prevalence bacteria are not universal enough. This step can be done by the Step4_MarkerCleaning.py script. The input parameters of this script are:

Input::
  --caseMarker STRINPUTCASE
                        Enter the path of the folder that contains the case
                        sample markers.
  --caseSize STRCASESIZE
                        Enter the size of the case group (number of samples
                        in case).
  --controlMarker STRINPUTCONTROL
                        Enter the path of the folder that contains the
                        control sample markers.
  --controlSize STRCONTROLSIZE
                        Enter the size of the control group (number of
                        samples in control).
  --pRate STRPREVALENCERATE
                        Enter the prevalence rate of the markers. The default
                        value is 0.5.

Output::
  --output STROUT       Enter the path and name of the output directory.

Parameters::
  --cdhit STRCDHIT      Enter the path of CD-HIT tool.

We then ran this script on the long-marker results (case_long_marker and control_long_marker):

Step4_MarkerCleaning.py --caseMarker result_case --controlMarker result_control --caseSize 53 --controlSize 61 --cdhit /software/cd-hit-4.6.4 --output output

The result will be saved into Selected_Markers.fasta in the output folder.
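The pRate filter described in this step can be sketched as follows (the function name and data layout are our assumptions, not MetaMarker's):

```python
def prevalence_filter(marker_samples, group_size, p_rate=0.5):
    """marker_samples maps a marker ID to the set of samples containing it.
    Keep only markers present in at least p_rate * group_size samples."""
    cutoff = p_rate * group_size
    return [m for m, samples in marker_samples.items()
            if len(samples) >= cutoff]

# With the default p_rate=0.5 and 4 samples, a marker must occur in
# at least 2 samples to survive the filter.
kept = prevalence_filter({"m1": {"s1", "s2", "s3"}, "m2": {"s1"}}, group_size=4)
```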


Calculate normalized abundance of long-markers:

We then calculated the normalized abundance of the long-markers in Selected_Markers.fasta in the case and control groups using Step5_MarkerAbundace.py. This script has the following parameters:

Input::
  --marker STRMARKER    Enter the path of the marker that you want to search,
                        in FASTA format.
  --sample STRSAMPLE    Enter the path of the folder that contains the reads.
                        The read files should be in FASTA format.

Output::
  --tmp STRTMP          Enter the path of the tmp directory.
  --output STROUT       Enter the path of the output directory.

Programs::
  --bowtie2 STRBOWTIE2  Provide the path to Bowtie2.

Parameters::
  --id DID              Enter the percent identity for the match
  --alnlength ALNLENGTH
                        Enter the minimum alignment length. The default is 50
  --threads ITHREADS    Enter the number of CPUs available for Bowtie2.
  --removeTemp          If you use this option, the temp folder will be
                        deleted from your disk.

We run this script for each sample in the case and control groups.

For case samples (here, as an example, we run it for one sample; run this script for all case samples and save the results into one folder):

Step5_MarkerAbundace.py --sample sample_case_1 --marker Selected_Markers.fasta --output /result_abundance_case --tmp /temp --threads 3 --bowtie2 /data2/microbiome/software/bowtie2-2.2.6/bin

For control samples (again, run this script for all control samples and save the results into one folder):

Step5_MarkerAbundace.py --sample sample_control_1 --marker Selected_Markers.fasta --output /result_abundance_control --tmp /temp --threads 3 --bowtie2 /data2/microbiome/software/bowtie2-2.2.6/bin

The results will be saved into result_abundance_case and result_abundance_control.


Rank long-markers:

We then compared the normalized abundance score of each marker between case and control using the Wilcoxon-Mann-Whitney test and selected the markers with a p-value less than 0.05. We then applied CD-HIT to the selected markers to remove markers with more than 60% similarity. Finally, we ranked the unique markers by p-value and saved them into two CSV files. This step can be done by Step6_MarkerRank.py. This script has the following parameters:

Input::
  --case STRCASE        Enter the path of the marker abundance result for case
                        samples.
  --control STRCONTROL  Enter the path of the marker abundance result for
                        control samples.
  --marker STRMARKER    Enter the path of the marker file
                        (Selected_Markers.fasta from the previous step).

Programs::
  --cdhit STRCDHIT      Enter the path of the CD-HIT tool.

Output::
  --output STROUT       Enter the path of the output directory.

Parameters::
  --pvalue STRPVAL      Enter the cut-off for the p-value.
  --filter STRFILTER    Enter the cut-off to exclude markers with an average
                        abundance score less than this value.
  --topMarker STRTOPMARKER
                        Enter the number of top markers you would like to see
                        in the output. The number of final markers may be
                        less than your input value, as we remove redundant
                        markers from the top markers.

We run the script on the long-marker abundance results:

Step6_MarkerRank.py --case /result_abundance_case --control /result_abundance_control --marker Selected_Markers.fasta --cdhit /software/cd-hit-4.6.4 --output /output_ranking

The results will be saved into output_ranking; five files will be generated in this folder.
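The final ranking logic can be sketched as follows. This is a simplified illustration: the CD-HIT redundancy removal is omitted and the function name is ours, not from Step6_MarkerRank.py.

```python
def rank_markers(pvalues, cutoff=0.05, top=10):
    """pvalues: dict of marker_id -> p-value. Keep markers under the
    cutoff, sort them most-significant first, and truncate to top N."""
    kept = sorted((p, m) for m, p in pvalues.items() if p < cutoff)
    return [m for _, m in kept[:top]]

# m2 is dropped by the p-value cutoff; m1 outranks m3.
ranked = rank_markers({"m1": 0.001, "m2": 0.2, "m3": 0.01})
```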

Please report any bugs or issues here.
