SparseDOSSA introduces a hierarchical model of microbial ecological population structure. It is capable of simulating realistic metagenomic data with known correlation structures, and thus provides a gold standard to enable benchmarking of statistical metagenomics methods.
We provide support for SparseDOSSA users through a dedicated Google group. Please join the group and feel free to post questions there, either directly or by emailing email@example.com.
The following figure shows the workflow for SparseDOSSA.
SparseDOSSA can be installed with Homebrew or run from a Docker image. Please note, if you are using bioBakery (Vagrant VM or Google Cloud) you do not need to install SparseDOSSA because the tool and its dependencies are already installed.
Install with Homebrew: $ brew install biobakery/biobakery/sparsedossa
Install with Docker: $ docker run -it biobakery/sparsedossa bash
If you would like to install from source, refer to the SparseDOSSA user manual for the pre-requisites/dependencies and installation instructions.
This section presents some basic uses of SparseDOSSA.
SparseDOSSA's hierarchical model is calibrated on the PRISM dataset by default. If you have your own reference dataset and would like to simulate data based on it, follow the example below. Your dataset must be in QIIME OTU table format: taxonomic units in rows and samples in columns, with each cell giving the observed count. Assuming the file is named reference_OTU.txt, the following command simulates a microbiome dataset that has the same dimensions as reference_OTU.txt and follows similar patterns:
$ synthetic_datasets_script.R -c reference_OTU.txt
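Before calibrating on your own data, it can be worth confirming that the reference file has the layout described above. The following is a minimal sketch, not part of SparseDOSSA: `check_otu_table` is a hypothetical helper, and it assumes a tab-delimited file with exactly one header row of sample IDs followed by one row per taxonomic unit. Real QIIME exports may carry extra comment lines or a trailing taxonomy column, which this sketch does not handle.

```python
import csv
import io


def check_otu_table(handle):
    """Return (n_taxa, n_samples) for a tab-delimited OTU table.

    Assumes (hypothetically) one header row, then one row per taxon whose
    first field is the taxon label and whose other fields are counts.
    """
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    header, data = rows[0], rows[1:]
    for row_no, row in enumerate(data, start=2):
        if len(row) != len(header):
            raise ValueError(
                f"row {row_no}: expected {len(header)} fields, got {len(row)}"
            )
        for cell in row[1:]:
            value = float(cell)  # raises ValueError for non-numeric cells
            if value < 0 or not value.is_integer():
                raise ValueError(f"row {row_no}: {cell!r} is not a count")
    return len(data), len(header) - 1


# Toy table with made-up labels: 2 taxa (rows) by 3 samples (columns).
example = "#OTU ID\tS1\tS2\tS3\nOTU_1\t10\t0\t3\nOTU_2\t0\t7\t1\n"
print(check_otu_table(io.StringIO(example)))
```

To check a file on disk, open it and pass the handle, for example `with open("reference_OTU.txt", newline="") as fh: check_otu_table(fh)`.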
Here is a basic example that simulates a dataset with 150 features (OTUs), 180 samples, and 10 metadata variables of each type (binary, quaternary, and continuous), without any correlation structure. It uses the default model parameters:
$ synthetic_datasets_script.R -f 150 -n 180 -p 10
To add feature-metadata correlation, with 2% of the features spiked and each spiked feature correlated with one randomly selected metadata variable, use:
$ synthetic_datasets_script.R -f 150 -n 180 -i 2 -k 0.02 -p 10
You can also simulate a dataset with feature-feature correlation only. Assume that 10 of the features are spiked and that each spiked feature is correlated with two other randomly selected features:
$ synthetic_datasets_script.R -f 50 -b 10 -m 2 -n 10 -p 10 --runBugBug
As a final example, to show that SparseDOSSA can replicate results from the literature, we benchmark the CSS normalization introduced in metagenomeSeq (Paulson et al., 2013). We use their test dataset to calibrate our model and introduce a binary association to emulate the cluster structure present in the original dataset. The mice dataset can be found here. Assuming the data file is mice.txt, the command we used to simulate the data is:
$ synthetic_datasets_script.R -c mice.txt -m 1 -p 10
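When running several simulations with different settings, it can help to assemble the command line programmatically. The sketch below shows one way to do that in Python. `build_command` is a hypothetical helper, not part of SparseDOSSA, and the example uses only flags that appear in the commands above. It prints the command rather than running it.

```python
import shlex


def build_command(flags, script="synthetic_datasets_script.R"):
    """Turn a {flag: value} mapping into an argument list.

    A value of True marks a bare switch (no argument), such as --runBugBug.
    """
    args = [script]
    for flag, value in flags.items():
        args.append(flag)
        if value is not True:
            args.append(str(value))
    return args


# Feature-feature correlation example from this section.
cmd = build_command(
    {"-f": 50, "-b": 10, "-m": 2, "-n": 10, "-p": 10, "--runBugBug": True}
)
print(shlex.join(cmd))

# To actually run it (requires SparseDOSSA to be installed and on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems when file names contain spaces.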
For each simulation using default model parameters, SparseDOSSA produces three text files: SyntheticMicrobiome.pcl, SyntheticMicrobiome-Counts.pcl, and SyntheticMicrobiomeParameterFile.txt. The first two files contain the actual microbiome abundance data; the third records the values of the model parameters, diagnostic information, and the spike-in assignment.
SyntheticMicrobiome.pcl records relative abundance data for three communities: the null community (no spike-ins and no outliers), the outlier-added community without spike-ins, and the final spiked data. Samples are in columns and features are in rows. The first chunk of the file is metadata, with row names Metadata_*. The second chunk is the null community, with row names Feature_Lognormal_*. The third chunk is the outlier-added community, with row names Feature_Outlier_*. The last chunk is the spiked data, with row names Feature_spike.
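Because all chunks live in one file, a common first step is to separate them by row-name prefix. The following is a minimal sketch of that idea, not an official parser: `split_pcl` is a hypothetical helper, and it assumes the file is tab-delimited, that the row name is the first field, and that the first line is a header of sample IDs (set `has_header=False` if it is not). Rows whose names do not start with one of the three feature prefixes are treated as metadata. The toy rows are invented for illustration.

```python
import csv
import io

# Feature row-name prefixes as described in the text above.
FEATURE_PREFIXES = ("Feature_Lognormal", "Feature_Outlier", "Feature_spike")


def split_pcl(handle, has_header=True):
    """Group rows of a PCL-style table by chunk: {chunk: {row_name: values}}."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # assumed header row of sample IDs
    chunks = {"metadata": {}}
    chunks.update({prefix: {} for prefix in FEATURE_PREFIXES})
    for row in reader:
        if not row:
            continue
        name, values = row[0], row[1:]
        # Anything that is not a feature row is counted as metadata.
        key = next((p for p in FEATURE_PREFIXES if name.startswith(p)), "metadata")
        chunks[key][name] = values
    return chunks


# Invented toy content: 1 metadata row and 2 features per community.
toy = (
    "ID\tSample1\tSample2\n"
    "Metadata_1\t0\t1\n"
    "Feature_Lognormal_1\t0.6\t0.2\n"
    "Feature_Lognormal_2\t0.4\t0.8\n"
    "Feature_Outlier_1\t0.7\t0.2\n"
    "Feature_Outlier_2\t0.3\t0.8\n"
    "Feature_spike_1\t0.9\t0.2\n"
    "Feature_spike_2\t0.1\t0.8\n"
)
chunks = split_pcl(io.StringIO(toy))
for chunk_name, rows in chunks.items():
    print(chunk_name, len(rows))
```

Values are kept as strings here; convert them with `float` once you know which chunk you want to analyze.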
SyntheticMicrobiome-Counts.pcl has the same organization as SyntheticMicrobiome.pcl but records raw count data.
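To relate counts to relative abundances, the usual convention is to divide each sample's counts by that sample's total, so that every column sums to 1. The sketch below illustrates that convention on invented numbers. `to_relative_abundance` is a hypothetical helper, and whether SparseDOSSA derives one output file from the other in exactly this way is an assumption, not something stated here.

```python
def to_relative_abundance(counts):
    """Normalize {feature: [count per sample]} so each sample column sums to 1.

    Samples whose total count is 0 are left as all zeros.
    """
    n_samples = len(next(iter(counts.values())))
    totals = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        name: [c / t if t else 0.0 for c, t in zip(row, totals)]
        for name, row in counts.items()
    }


# Invented counts for two features across two samples.
toy_counts = {"Feature_spike_1": [30, 0], "Feature_spike_2": [10, 5]}
relative = to_relative_abundance(toy_counts)
print(relative)
```

Apply the normalization within a single community (null, outlier-added, or spiked) at a time, since each chunk describes a separate version of the data.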
SyntheticMicrobiomeParameterFile.txt records diagnostic information and the values of the model parameters, as well as the spike-in assignment. Most of this file is used only for debugging. Users can focus on the lines after Minimum Spiked-in Samples, which record which metadata are correlated with which feature: all metadata correlated with a specific feature are listed under the name of that feature.
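Since only the tail of this file matters for most users, a small helper can pull out the lines that follow the marker. The following is a minimal sketch: `lines_after_marker` is a hypothetical helper, and the toy text is an invented placeholder, not real SparseDOSSA output. The sketch makes no assumption about the layout of the assignment lines beyond what is described above; it simply returns the non-empty lines after the first line containing the marker.

```python
import io

MARKER = "Minimum Spiked-in Samples"


def lines_after_marker(handle, marker=MARKER):
    """Return the non-empty lines that follow the first line containing marker."""
    found = False
    kept = []
    for line in handle:
        if found:
            if line.strip():
                kept.append(line.rstrip("\n"))
        elif marker in line:
            found = True
    if not found:
        raise ValueError(f"marker {marker!r} not found")
    return kept


# Invented placeholder content; a real parameter file will look different.
toy = (
    "diagnostic placeholder\n"
    "Minimum Spiked-in Samples\n"
    "placeholder line A\n"
    "\n"
    "placeholder line B\n"
)
print(lines_after_marker(io.StringIO(toy)))
```

For a real run, pass an open handle to SyntheticMicrobiomeParameterFile.txt instead of the toy text.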