HTTPS SSH

ConStrains

ConStrains: identifying Conspecific Strains within metagenomic species

One line pitcher

ConStrains reconstructs the within-species diversity and SNP types by deconvoluting SNPs patterns across different conserved genes and across different samples.

Install

use git:

git clone https://bitbucket.org/luo-chengwei/constrains

use hg:

hg clone https://bitbucket.org/luo-chengwei/constrains

You can also download the zip archive of it from bitbucket repository:

https://bitbucket.org/luo-chengwei/constrains

Dependencies

  • Python-2.7 or above

  • Python libraries:

BioPython

Numpy

Scipy

NetworkX

  • Third party pipelines:

Bowtie2 2.2.1+

Metaphlan2 2.1+

Note: older versions of these dependencies might work but it's not guaranteed.

You don't have to install it, just call ConStrains.py from wherever you put the whole folder.

Usage

The basic ConStrains analysis runs as below:

ConStrains.py [options] -c/--conf <config.file> -o/--outdir <output directory>

The format config file follows:

//
sample: [sample1_ID]
fq1: [forward reads fastq]
fq2: [reverse/mate reads fastq]
//
sample: [sample2_ID]
fq1: [forward reads fastq]
fq2: [reverse/mate reads fastq]
...

Or, if only one fastq file per sample, the config file should look like this:

//
sample: [sample1_ID]
fq: [sample reads fastq]
//
sample: [sample2_ID]
fq: [sample reads fastq]
...

If you already have MetaPhlAn results for some(or, all) of the samples, you can specify them to save some time as:

//
sample: [sample1_ID]
fq1: [forward reads fastq]
fq2: [reverse/mate reads fastq]
metaphlan2: [metaphlan2 results for sample1]
//
sample: [sample2_ID]
fq1: [forward reads fastq]
fq2: [reverse/mate reads fastq]
metaphlan2: [metaphlan2 results for sample2]

Overall, you can have a config file for a 3-sample project that looks like this:

//
sample: sample_1
fq1: /home/sample_1.1.fq
fq2: /home/sample_1.2.fq
metaphlan2: /home/sample_1.metaphlan2_results.rel_ab.txt
//
sample: sample_2
fq: /home/sample_2.fq
metaphlan2: /home/sample_2.metaphlan2_results.rel_ab.txt
//
sample: sample_3
fq1: /home/sample_3.1.fq
fq2: /home/sample_3.2.fq

Below is a detailed usage of ConStrains.py:

Options:

--version             show program's version number and exit

-h, --help            show this help message and exit

Compulsory parameters:

There options are compulsory, and may be supplied in any order.

-o DIR, --outdir=DIR
                    The output directory of ConStrains.
-c FILE, --config=FILE
                    The configuration file of the ConStrains project.

Optional parameters:

There options are optional, and may be supplied in any order.

-t INT, --num_proc=INT
                    Number of processor for ConStrain to use [default: 1].
-d STRING, --ref_db=STRING
                    The prefix of species reference. [default:
                    ConStrains/db/ref_db].
-g STRING, --gsize_db=STRING
                    The directory of species average genome size DB.
                    [default: ConStrains/db/gsize.db].
--bowtie2=STRING    Path to bowtie2 binary, specify if not in env path
                    [default: bowtie2].
                    Bowtie2 citation: Langmead B. and Salzberg S., Nat.
                    Methods, 2012.
                    Bowtie2 page: http://bowtie-
                    bio.sourceforge.net/bowtie2
--bowtie2_build=STRING
                    Path to bowtie2-build binary, specify if not in env
                    path [default: bowtie2-build].
                    Bowtie2 citation: Langmead B. and Salzberg S., Nat.
                    Methods, 2012.
                    Bowtie2 page: http://bowtie-
                    bio.sourceforge.net/bowtie2
--samtools=STRING   Path to samtools binary, specify if not in env path
                    [default: samtools].
                    Samtools citation: Li H., et al, Bioinformatics, 2009.
                    Samtools webpage: http://samtools.sourceforge.net
-m STRING, --metaphlan2=STRING
                    Path to metaphlan2 script, specify if not in env path
                    [default: metaphlan.py].
                    MetaPhlAn citation: Segata N. et al, Nat. Methods,
                    2012.
                    MetaPhlAn2 page:
                    http://huttenhower.sph.harvard.edu/metaphlan2

Straining parameters:

There options are optional, and may be supplied in any order.

--min_cov=FLOAT     Minimum coverage of a species in a sample to be
                    considered [default: 10, range: 5+].

Runtime settings:

There options are optional, and may be supplied in any order.

-q, --quiet         Suppress printing detailed runtime information,
                    only important warnings/errors will show [default:
                    False].

Interpret output

In the project/output directory, you will find a folder called "results", in which you can find the below files and directories:

Intra_sp_rel_ab.profiles     # this is a tabular file with strain relative abundance within species. See header for details.
Overall_rel_ab.profiles      # this is a tabular file with strain relative abundance in overall samples. See header for details.
uniGcode                     # this is the directory for all ".uniGcode" files, which are genotypes for strains.

The formats for "Intra_sp_rel_ab.profiles" and "Overall_rel_ab.profiles" are similar. An example is as below:

# Species   strain_ID   masked_samples  sample_1   sample_2
Escherichia_coli    str-1   NA  53.252835   37.212245
Escherichia_coli    str-2   NA  46.747165   62.787755
Salmonella_typhi    str-1   1   15.492194   41.212442
Salmonella_typhi    str-2   1   38.313441   21.483291
Salmonella_typhi    str-3   1   46.194365   37.304267

This means there are two species that passed the minimum coverage requirement for strain inference, and the relative abundance of each strain (E.coli has 2 strains and S.typhi has 3) are listed in sample_1 and sample_2 columns.

ConStrains tries to infer strain compositions in samples with lower coverage, but these results might not be reliable. We offer a column called "masked_samples" to label such samples; the index (1-offset) of samples are comma-delimited.

In the "uniGcode/" directory, you will find a few "*.uniGcode" files with names indicating the species. The format looks like the example shown below:

# *: not covered base; -: uncertain base
#pid    position    ref str-1   str-2
p0387   1   A   *   *   # <- insufficient mapped reads for inference
p0387   2   T   *   *
......
p0212   317 A   A   A
p0212   318 C   T   C   # <- SNP site
p0212   319 G   G   G
p0212   320 C   C   C
......
p0027   271 G   G   G
p0027   272 T   T   T
p0027   273 G   -   -   # <- uncertain base
p0027   274 A   A   A
p0027   275 A   A   A

As you probably already guessed, in the *.uniGcode files, each row is a position on the universally conserved genes, defined by the 1st column (gene's pid) and the 2nd column (1-offset position). The 3rd column is the reference base defined by the phylophlan seed DB, and from the 4th column, it lists the genotypes of each strain within the species.

As you can see in the above example, the .uniGcode files uses "" to label loci without sufficient coverage or read mapping for a confident inference, and '-' labels loci with umbiguous bases, usually caused by low-quality or spurious mapping.

Tutorial

Below is an example to walk you through what a real ConStrains analysis should look like.

First of all, you'll need some test data. We have prepared this for you. To download the data, simply run the utils script:

python utils/download_tutorial.py -o [outdir]

You just need to specify the output directory where you want to put the downloaded files. Let's call the outdir "constrains.test".

There are two files to be downloaded, you can see them with ls:

ls constrains.test/

You should see:

constrains.test/fq.tar.gz
constrains.test/test.conf

You need to extract the fastq files from the "fq.tar.gz" tarball by running:

tar xzvf constrains.test/fq.tar.gz

that will extract the fastq files: "sample_1.1.fq", "sample_1.2.fq", "sample_2.1.fq", "sample_2.2.fq".

The two samples contain ART simulated E.coli Illumina reads. In sample_1, the composition is 5% (ETEC H10407) and 95% (HS), and in sample_2, the composition is 15.2% (IAI1), 26.3% (CFT073), and 58.5% (O157:H7).

Next is to run ConStrains.py. We simply run it as:

ConStrains.py -c test.conf -o constrains.test -t 2

This uses 2 threads to run the ConStrains.py. The output to stdout would look like below:

Project info saved to: constrains.test/proj.log
Now constructing SNP flows...
Profiling metagenome with MetaPhlAn...
MetaPhlAn profiling done.
There are 1 species selected for straining.
Selected species list:  
    # Species   sample_1    sample_2
      Escherichia_coli: 86.35X  98.52X

Now Bowtie2 aligning reads to references, and call SNPs
Done.
Now Samtools mpileup to call SNPs, and construct SNP flows.
  Generating mpileup files...
  Done.
  Now merging mpileups for species...
  Done.
  Now calling SNPs.
   Done.

...

This should take about ten minutes to finish. You should see the following content in the constrains.test direcotry:

$ ls -lh constrains.test
total 1.2M
-rw-rw-r-- 1 cluo xavierlab 584K Mar 11 02:38 merged_ref.ffn
drwxrwsr-x 2 cluo xavierlab   80 Mar 11 01:59 metaphlan/
drwxrwsr-x 2 cluo xavierlab   42 Mar 11 02:52 pileups/
-rw-rw-r-- 1 cluo xavierlab 1.2K Mar 11 02:38 proj.log
drwxrwsr-x 2 cluo xavierlab    0 Mar 11 02:52 snpflows/
-rw-rw-r-- 1 cluo xavierlab  141 Mar 11 02:00 species.list

The results are stored at constrains.test/results:

$ ls -lh constrains.test/results
total 55K
-rw-rw-r-- 1 cluo xavierlab 277 Mar 11 03:26 Intra_sp_rel_ab.profiles
-rw-rw-r-- 1 cluo xavierlab 277 Mar 11 03:26 Overall_rel_ab.profiles
drwxrwsr-x 2 cluo xavierlab  43 Mar 11 03:27 uniGcode/

The Intra_sp_rel_ab.profiles and Overall_rel_ab.profile follow the same format.

As the name indicated, the former reports the relative abundance of each strain within the species, thus the strains within the same species' percentage relative abundance should sum to 100.0 The latter reports the overall relative abundance of strain within the community, thus all the strains together should sum to the sum of all the reported species' relative abundance.

As for the tutorial's case, the Intra_sp_rel_ab.profiles looks like:

# Species   strain_ID   masked_samples  sample_1    sample_2
Escherichia_coli    str-1   NA  0.00000000  14.88058489
Escherichia_coli    str-2   NA  3.75299123  0.00000000
Escherichia_coli    str-3   NA  96.24700877 85.11941511

The uniGcode/ folder has the uniGcodes for each strain deconvoluted. In out case, the E.coli uniGcodes look like:

# *: not covered base; -: uncertain base
#pid    position    ref str-1   str-2   str-3   str-4   str-5
p0387   1   A   *   *   *   *   *
p0387   2   T   *   *   *   *   *
......
p0212   179 A   A   A   A   A   A
p0212   180 G   G   G   G   G   G
p0212   181 G   G   G   G   G   G
p0212   182 G   T   G   G   G   G
p0212   183 C   C   C   C   C   C
......