ConStrains: identifying Conspecific Strains within metagenomic species
One line pitcher
ConStrains reconstructs the within-species diversity and SNP types by deconvoluting SNPs patterns across different conserved genes and across different samples.
git clone https://bitbucket.org/luo-chengwei/constrains
hg clone https://bitbucket.org/luo-chengwei/constrains
You can also download the zip archive of it from bitbucket repository:
Python-2.7 or above
- Third party pipelines:
Note: older versions of these dependencies might work but it's not guaranteed.
You don't have to install it, just call ConStrains.py from wherever you put the whole folder.
The basic ConStrains analysis runs as below:
ConStrains.py [options] -c/--conf <config.file> -o/--outdir <output directory>
The format config file follows:
// sample: [sample1_ID] fq1: [forward reads fastq] fq2: [reverse/mate reads fastq] // sample: [sample2_ID] fq1: [forward reads fastq] fq2: [reverse/mate reads fastq] ...
Or, if only one fastq file per sample, the config file should look like this:
// sample: [sample1_ID] fq: [sample reads fastq] // sample: [sample2_ID] fq: [sample reads fastq] ...
If you already have MetaPhlAn results for some(or, all) of the samples, you can specify them to save some time as:
// sample: [sample1_ID] fq1: [forward reads fastq] fq2: [reverse/mate reads fastq] metaphlan2: [metaphlan2 results for sample1] // sample: [sample2_ID] fq1: [forward reads fastq] fq2: [reverse/mate reads fastq] metaphlan2: [metaphlan2 results for sample2]
Overall, you can have a config file for a 3-sample project that looks like this:
// sample: sample_1 fq1: /home/sample_1.1.fq fq2: /home/sample_1.2.fq metaphlan2: /home/sample_1.metaphlan2_results.rel_ab.txt // sample: sample_2 fq: /home/sample_2.fq metaphlan2: /home/sample_2.metaphlan2_results.rel_ab.txt // sample: sample_3 fq1: /home/sample_3.1.fq fq2: /home/sample_3.2.fq
Below is a detailed usage of ConStrains.py:
--version show program's version number and exit -h, --help show this help message and exit
There options are compulsory, and may be supplied in any order. -o DIR, --outdir=DIR The output directory of ConStrains. -c FILE, --config=FILE The configuration file of the ConStrains project.
There options are optional, and may be supplied in any order. -t INT, --num_proc=INT Number of processor for ConStrain to use [default: 1]. -d STRING, --ref_db=STRING The prefix of species reference. [default: ConStrains/db/ref_db]. -g STRING, --gsize_db=STRING The directory of species average genome size DB. [default: ConStrains/db/gsize.db]. --bowtie2=STRING Path to bowtie2 binary, specify if not in env path [default: bowtie2]. Bowtie2 citation: Langmead B. and Salzberg S., Nat. Methods, 2012. Bowtie2 page: http://bowtie- bio.sourceforge.net/bowtie2 --bowtie2_build=STRING Path to bowtie2-build binary, specify if not in env path [default: bowtie2-build]. Bowtie2 citation: Langmead B. and Salzberg S., Nat. Methods, 2012. Bowtie2 page: http://bowtie- bio.sourceforge.net/bowtie2 --samtools=STRING Path to samtools binary, specify if not in env path [default: samtools]. Samtools citation: Li H., et al, Bioinformatics, 2009. Samtools webpage: http://samtools.sourceforge.net -m STRING, --metaphlan2=STRING Path to metaphlan2 script, specify if not in env path [default: metaphlan.py]. MetaPhlAn citation: Segata N. et al, Nat. Methods, 2012. MetaPhlAn2 page: http://huttenhower.sph.harvard.edu/metaphlan2
There options are optional, and may be supplied in any order. --min_cov=FLOAT Minimum coverage of a species in a sample to be considered [default: 10, range: 5+].
There options are optional, and may be supplied in any order. -q, --quiet Suppress printing detailed runtime information, only important warnings/errors will show [default: False].
In the project/output directory, you will find a folder called "results", in which you can find the below files and directories:
Intra_sp_rel_ab.profiles # this is a tabular file with strain relative abundance within species. See header for details. Overall_rel_ab.profiles # this is a tabular file with strain relative abundance in overall samples. See header for details. uniGcode # this is the directory for all ".uniGcode" files, which are genotypes for strains.
The formats for "Intra_sp_rel_ab.profiles" and "Overall_rel_ab.profiles" are similar. An example is as below:
# Species strain_ID masked_samples sample_1 sample_2 Escherichia_coli str-1 NA 53.252835 37.212245 Escherichia_coli str-2 NA 46.747165 62.787755 Salmonella_typhi str-1 1 15.492194 41.212442 Salmonella_typhi str-2 1 38.313441 21.483291 Salmonella_typhi str-3 1 46.194365 37.304267
This means there are two species that passed the minimum coverage requirement for strain inference, and the relative abundance of each strain (E.coli has 2 strains and S.typhi has 3) are listed in sample_1 and sample_2 columns.
ConStrains tries to infer strain compositions in samples with lower coverage, but these results might not be reliable. We offer a column called "masked_samples" to label such samples; the index (1-offset) of samples are comma-delimited.
In the "uniGcode/" directory, you will find a few "*.uniGcode" files with names indicating the species. The format looks like the example shown below:
# *: not covered base; -: uncertain base #pid position ref str-1 str-2 p0387 1 A * * # <- insufficient mapped reads for inference p0387 2 T * * ...... p0212 317 A A A p0212 318 C T C # <- SNP site p0212 319 G G G p0212 320 C C C ...... p0027 271 G G G p0027 272 T T T p0027 273 G - - # <- uncertain base p0027 274 A A A p0027 275 A A A
As you probably already guessed, in the *.uniGcode files, each row is a position on the universally conserved genes, defined by the 1st column (gene's pid) and the 2nd column (1-offset position). The 3rd column is the reference base defined by the phylophlan seed DB, and from the 4th column, it lists the genotypes of each strain within the species.
As you can see in the above example, the .uniGcode files uses "" to label loci without sufficient coverage or read mapping for a confident inference, and '-' labels loci with umbiguous bases, usually caused by low-quality or spurious mapping.
Below is an example to walk you through what a real ConStrains analysis should look like.
First of all, you'll need some test data. We have prepared this for you. To download the data, simply run the utils script:
python utils/download_tutorial.py -o [outdir]
You just need to specify the output directory where you want to put the downloaded files. Let's call the outdir "constrains.test".
There are two files to be downloaded, you can see them with ls:
You should see:
You need to extract the fastq files from the "fq.tar.gz" tarball by running:
tar xzvf constrains.test/fq.tar.gz
that will extract the fastq files: "sample_1.1.fq", "sample_1.2.fq", "sample_2.1.fq", "sample_2.2.fq".
The two samples contain ART simulated E.coli Illumina reads. In sample_1, the composition is 5% (ETEC H10407) and 95% (HS), and in sample_2, the composition is 15.2% (IAI1), 26.3% (CFT073), and 58.5% (O157:H7).
Next is to run ConStrains.py. We simply run it as:
ConStrains.py -c test.conf -o constrains.test -t 2
This uses 2 threads to run the ConStrains.py. The output to stdout would look like below:
Project info saved to: constrains.test/proj.log Now constructing SNP flows... Profiling metagenome with MetaPhlAn... MetaPhlAn profiling done. There are 1 species selected for straining. Selected species list: # Species sample_1 sample_2 Escherichia_coli: 86.35X 98.52X Now Bowtie2 aligning reads to references, and call SNPs Done. Now Samtools mpileup to call SNPs, and construct SNP flows. Generating mpileup files... Done. Now merging mpileups for species... Done. Now calling SNPs. Done.
This should take about ten minutes to finish. You should see the following content in the constrains.test direcotry:
$ ls -lh constrains.test total 1.2M -rw-rw-r-- 1 cluo xavierlab 584K Mar 11 02:38 merged_ref.ffn drwxrwsr-x 2 cluo xavierlab 80 Mar 11 01:59 metaphlan/ drwxrwsr-x 2 cluo xavierlab 42 Mar 11 02:52 pileups/ -rw-rw-r-- 1 cluo xavierlab 1.2K Mar 11 02:38 proj.log drwxrwsr-x 2 cluo xavierlab 0 Mar 11 02:52 snpflows/ -rw-rw-r-- 1 cluo xavierlab 141 Mar 11 02:00 species.list
The results are stored at constrains.test/results:
$ ls -lh constrains.test/results total 55K -rw-rw-r-- 1 cluo xavierlab 277 Mar 11 03:26 Intra_sp_rel_ab.profiles -rw-rw-r-- 1 cluo xavierlab 277 Mar 11 03:26 Overall_rel_ab.profiles drwxrwsr-x 2 cluo xavierlab 43 Mar 11 03:27 uniGcode/
The Intra_sp_rel_ab.profiles and Overall_rel_ab.profile follow the same format.
As the name indicated, the former reports the relative abundance of each strain within the species, thus the strains within the same species' percentage relative abundance should sum to 100.0 The latter reports the overall relative abundance of strain within the community, thus all the strains together should sum to the sum of all the reported species' relative abundance.
As for the tutorial's case, the Intra_sp_rel_ab.profiles looks like:
# Species strain_ID masked_samples sample_1 sample_2 Escherichia_coli str-1 NA 0.00000000 14.88058489 Escherichia_coli str-2 NA 3.75299123 0.00000000 Escherichia_coli str-3 NA 96.24700877 85.11941511
The uniGcode/ folder has the uniGcodes for each strain deconvoluted. In out case, the E.coli uniGcodes look like:
# *: not covered base; -: uncertain base #pid position ref str-1 str-2 str-3 str-4 str-5 p0387 1 A * * * * * p0387 2 T * * * * * ...... p0212 179 A A A A A A p0212 180 G G G G G G p0212 181 G G G G G G p0212 182 G T G G G G p0212 183 C C C C C C ......