ShortBRED Tutorial

ShortBRED (Short, Better Representative Extract Dataset) is a pipeline that takes a set of protein sequences, clusters them into families, builds consensus sequences to represent the families, and then reduces these consensus sequences to a set of unique identifying strings ("markers"). The pipeline then searches for these markers in metagenomic data and determines the presence and abundance of the protein families of interest.

For additional information, please refer to the manuscript: Kaminski J, Gibson MK, Franzosa EA, Segata N, Dantas G, Huttenhower C. High-specificity targeted functional profiling in microbial communities with ShortBRED. PLoS Comput Biol. 2015 Dec 18;11(12):e1004557.

We provide support for ShortBRED users through a dedicated Google group. Please join the group and post your questions there, either directly or by emailing shortbred-users@googlegroups.com.



Overview

The following figure shows the workflow of ShortBRED.

http://huttenhower.sph.harvard.edu/sites/default/files/webfm/Jim/Figure1.png

1. Install

ShortBRED can be installed with Homebrew or run from a Docker image. If you are using bioBakery (Vagrant VM or cloud), you do not need to install ShortBRED, because the tool and most of its dependencies are already installed. The exception is USEARCH, which requires a license and must be installed separately. Follow the commands in the bioBakery instructions for installing dependencies that require licenses.

Install with Homebrew: $ brew install biobakery/biobakery/shortbred

Install with Docker: $ docker run -it biobakery/shortbred bash

If you would like to install from source, refer to the ShortBRED user manual for the prerequisites, dependencies, and installation instructions.

2. ShortBRED-Identify

ShortBRED-Identify clusters the input protein sequences into families, builds consensus sequences, and then identifies regions of overlap among the consensus sequences and between the consensus sequences and a set of reference proteins. This information is used to construct a set of representative markers for the families.

2.1 Input Files

The input files required for the script are the following (sample input files for the purpose of this tutorial are provided):

  • Proteins of interest (input_prots.faa)
  • Reference proteins (ref_prots.faa)

2.2 Running ShortBRED-Identify

To create markers for the sample data, run the following command from the shortbred working directory:

$ shortbred_identify.py --goi example/input_prots.faa --ref example/ref_prots.faa --markers mytestmarkers.faa --tmp example_identify

The above command will create a set of markers (mytestmarkers.faa). An example of the output is below:

>P23181_TM_#01
MLRSSNDVTQQGSRPKTKLGGSSMGIIRTCRLGPDQVKSMRAALDLFGREFGDVATYSQH
QPDSDYLGNLLRSKTFIALAAFDQEAVVGALAAYVLPKFEQARSEIYIYDLAVSGEHRRQ
GIATALINLLKHEANALGAYVIYVQADYGDDPAVALYTKLGIREEVMHFDIDPSTAT
>P13246_TM_#01
IGPVEGGAETVVAALRSAVGPTGTVMGYASWDRSPYEETLNGARLDD
>P13246_TM_#02
PFDPATAGTYRGFGLLNQFLVQAPGARRSAHPDASMVAVGPLAETLTEPHELGHALGEGS
P
>P13246_TM_#03
ERFVRLGGKALLLGAPLNSVTALHYAEAVADIPNKRWVTYEMPM
>P13246_TM_#04
GRDGEVAWKTASDYDSNGILDCFAIEGK
>P13246_TM_#05
DAVETIANAYVKLGRHREGV
>YP_884847_TM_#01
SHHGALIAHGAVVQRRLMYRGPDGRGHALRCGYVEAVAVREDRRGDGLGTAVLDALEQVI
RGAYQIGALSASDIARPMYIARGWLSWEGPTSVLTPTEGIVRTPEDDRSLFVLPVDLPDG
LELDTAREITCDWRSGDPW
>NP_753952_TM_#01
MQKYISEARLLLALAIPVILAQIAQTAMGFVDTVMAGGYSATDMAAVAIGTSIWLPAILF
GHGLLLALTPVIAQLNGSGRRERIAHQVRQGFWLAGFVSVLIMLVLWNAGYIIRSMQNID
PALADKAVGYLRALLWGAPGYLFFQVARNQCEGLAKTKPGMVMGFIGLLVNIPVNYIFIY
GHFGMPELGGVGCGVATAAV
>YP_001848841_TM_#01
HALGGMHALIWHRGAIIAHGAVVQRRLIYRGS
>Q49157_TM_#01
EGDFSDADWEHALGGMHAFICH
>Q49157_TM_#02
VEQVLRGAYQLGALSASDTARGMYLSRGWLPWQGPTSVLQPAGVTRTPEDDEGLFVLPVG
LPAGMELDTTAEITCDWRDGDVW
>NP_214776_TM_#01
DIRQMVTGAFAGDFTETDWEHTLGGMHALIWHHGAIIAHAAVIQRRLIYRGNALRCGYVE
GVAVRADWRGQRLVSALLDAVEQVMRGAYQLGALSSSAR
>ZP_02959935_TM_#01
MGIEYRSLHTSQLTLSEKEALYDLLIEGFEGDFSHDDFAHTLGGMHVMAFDQQKLVGHVA
IIQRHMALDNTPISVGYVEAMVVEQSYRRQGIGRQLMLQTNKIIASCYQLGLLSASDDGQ
KLYHSVGWQIWKGKLFELKQGSYIRSIEEEGGVMGWKADGEVDFTASLYCDFRGGDQW
>YP_001068559_TM_#01
MAGTPRWYNDGVLPQLSSEVRGHGVIHTARLVHTADLDNETREGARRMVSEAFRG
>YP_001068559_TM_#02
CRGQGLGSAVMDACEQVLRGAYQLGALATSTMARPMYRARGWVPWRGPTSVLSPGGRIST
P
>YP_001068559_TM_#03
DDGSVFVYPVGSALGSTDLDTTAELTCDWRHGDVW
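
To check a marker file like the one above, it can help to count the markers per family and add up their lengths. The following is a minimal Python sketch, not part of ShortBRED. It assumes that every header follows the `>FAMILY_TM_#NN` pattern seen in the example output (other header suffixes would need extra handling), and the function name `summarize_markers` is hypothetical. The embedded sample is copied from the output above.

```python
from collections import defaultdict
from io import StringIO


def summarize_markers(handle):
    """Return {family: (marker_count, total_marker_length)} for a marker FASTA stream."""
    counts = defaultdict(int)
    lengths = defaultdict(int)
    family = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: headers look like >FAMILY_TM_#NN, so the family name
            # is everything before the last "_TM_" (family IDs may contain "_").
            family = line[1:].rsplit("_TM_", 1)[0]
            counts[family] += 1
        elif family is not None:
            # Sequence lines may wrap, so lengths are accumulated per family.
            lengths[family] += len(line)
    return {fam: (counts[fam], lengths[fam]) for fam in counts}


# Two families taken from the example marker output in this tutorial.
sample = StringIO(
    ">YP_001848841_TM_#01\n"
    "HALGGMHALIWHRGAIIAHGAVVQRRLIYRGS\n"
    ">Q49157_TM_#01\n"
    "EGDFSDADWEHALGGMHAFICH\n"
    ">Q49157_TM_#02\n"
    "VEQVLRGAYQLGALSASDTARGMYLSRGWLPWQGPTSVLQPAGVTRTPEDDEGLFVLPVG\n"
    "LPAGMELDTTAEITCDWRDGDVW\n"
)

summary = summarize_markers(sample)
for fam, (n_markers, total_len) in sorted(summary.items()):
    print(fam, n_markers, total_len)
```

To summarize a real file, pass an open file handle instead of the sample, for example `summarize_markers(open("mytestmarkers.faa"))`. For these two families the summed lengths (105 and 32) match the TotMarkerLength values in the ShortBRED-Quantify example output in Section 3.2.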

The directory example_identify (the folder name given with the --tmp flag) should contain the processed data from ShortBRED, including the BLAST results. Please refer to the documentation for further details.


3. ShortBRED-Quantify

ShortBRED-Quantify searches for the markers in nucleotide data and returns a normalized relative abundance table of the protein families found in the data. The script takes the FASTA file of markers and quantifies their relative abundance in a FASTA file of nucleotide metagenomic reads.

3.1 Input Files

The input files required for the script are the following (sample input files for the purpose of this tutorial are provided):

  • Markers file (generated in Section 2)
  • Short nucleotide reads (wgs.fna)

3.2 Running ShortBRED-Quantify

To quantify the markers in the sample data, run the following command from the shortbred working directory:

$ shortbred_quantify.py --markers mytestmarkers.faa --wgs example/wgs.fna --results exampleresults.txt --tmp example_quantify

The above command will create an output file, exampleresults.txt, containing the relative abundance of each protein family in the WGS data. An example of the output is below:

Family          Count           Hits    TotMarkerLength
NP_214776       0.0             0       99
NP_753952       19569471.6243   1       200
P13246          0.0             0       200
P23181          0.0             0       177
Q49157          0.0             0       105
YP_001068559    0.0             0       151
YP_001848841    0.0             0       32
YP_884847       0.0             0       139
ZP_02959935     0.0             0       178
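
A results table like the one above is easy to filter for the families that were actually detected. The following is a minimal Python sketch, not part of ShortBRED. It assumes the file is whitespace-delimited with the header row shown above, and the function name `detected_families` is hypothetical. The sample rows are copied from the example output.

```python
from io import StringIO


def detected_families(handle):
    """Return (family, count, hits) for every row with at least one hit."""
    header = handle.readline().split()
    detected = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        # Pair each value with its column name from the header row.
        row = dict(zip(header, fields))
        hits = int(row["Hits"])
        if hits > 0:
            detected.append((row["Family"], float(row["Count"]), hits))
    return detected


# Three rows taken from the example results table in this tutorial.
sample = StringIO(
    "Family\tCount\tHits\tTotMarkerLength\n"
    "NP_214776\t0.0\t0\t99\n"
    "NP_753952\t19569471.6243\t1\t200\n"
    "P13246\t0.0\t0\t200\n"
)

detected = detected_families(sample)
for family, count, hits in detected:
    print(family, count, hits)
```

To filter a real file, pass an open file handle instead of the sample, for example `detected_families(open("exampleresults.txt"))`.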

The directory example_quantify (the folder name given with the --tmp flag) should contain the processed data from ShortBRED. Please refer to the documentation for further details.
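
The two steps of this tutorial can also be chained from a small driver script. The following is a minimal Python sketch under stated assumptions: the flags are exactly those used in the commands in this tutorial, both scripts are assumed to be on the PATH, and the helper names (`identify_cmd`, `quantify_cmd`, `run_all`) are hypothetical. The sketch only builds and prints the commands; call `run_all(commands)` to execute them.

```python
import shlex
import subprocess


def identify_cmd(goi, ref, markers, tmp):
    """Build the ShortBRED-Identify command used in Section 2.2."""
    return ["shortbred_identify.py", "--goi", goi, "--ref", ref,
            "--markers", markers, "--tmp", tmp]


def quantify_cmd(markers, wgs, results, tmp):
    """Build the ShortBRED-Quantify command used in Section 3.2."""
    return ["shortbred_quantify.py", "--markers", markers, "--wgs", wgs,
            "--results", results, "--tmp", tmp]


def run_all(cmds):
    """Run each command in order and stop at the first failure."""
    for cmd in cmds:
        print("Running:", shlex.join(cmd))
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)


# File names are the ones used throughout this tutorial.
commands = [
    identify_cmd("example/input_prots.faa", "example/ref_prots.faa",
                 "mytestmarkers.faa", "example_identify"),
    quantify_cmd("mytestmarkers.faa", "example/wgs.fna",
                 "exampleresults.txt", "example_quantify"),
]

for cmd in commands:
    print(shlex.join(cmd))
```

Passing each command as a list rather than a single shell string avoids quoting problems when file names contain spaces.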

