# ShortBRED Tutorial

ShortBRED (Short, Better Representative Extract Dataset) is a pipeline to take a set of protein sequences, cluster them into families, build consensus sequences to represent the families, and then reduce these consensus sequences to a set of unique identifying strings ("markers"). The pipeline then searches for these markers in metagenomic data and determines the presence and abundance of the protein families of interest.

For additional information, please refer to the manuscript: Kaminski J, Gibson MK, Franzosa EA, Segata N, Dantas G, Huttenhower C. High-specificity targeted functional profiling in microbial communities with ShortBRED. PLoS Comput Biol. 2015 Dec 18;11(12):e1004557.

## Overview

The following figure shows the workflow of ShortBRED.

## 1. Install

ShortBRED can be installed with Homebrew or run from a Docker image. If you are using bioBakery (Vagrant VM or cloud), you do not need to install ShortBRED, because the tool and its other dependencies are already installed. You will, however, need to install USEARCH, a dependency that requires a license. Follow the commands in the bioBakery instructions for installing dependencies that require licenses.

Install with Homebrew:

    $ brew install biobakery/biobakery/shortbred

Install with Docker:

    $ docker run -it biobakery/shortbred bash

To install from source, refer to the ShortBRED user manual for the prerequisites, dependencies, and installation instructions.
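As a quick sanity check after installation, the following Python sketch reports which executables are visible on your `PATH`. The executable names in `TOOLS` are assumptions based on the commands used in this tutorial and on the USEARCH requirement mentioned above; adjust them to match your setup (the USEARCH binary in particular may be named differently on your system).

```python
import shutil

# Assumed executable names; edit this list to match your installation.
TOOLS = ["shortbred_identify.py", "shortbred_quantify.py", "usearch"]


def find_tools(names=TOOLS):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

Any tool reported as `NOT FOUND` needs to be installed or added to `PATH` before running the commands in the following sections.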

## 2. ShortBRED-Identify

ShortBRED-Identify clusters the input protein sequences into families, builds consensus sequences, and then identifies regions of overlap among the consensus sequences and between the consensus sequences and a set of reference proteins. This information is used to construct a set of representative markers for the families.
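To build intuition for what a "unique identifying string" is, here is a toy Python sketch. It is not ShortBRED's actual algorithm, which works on clustered consensus sequences and their overlap regions as described above. It simply keeps the stretches of each family sequence whose k-mers do not occur in any other family or in any reference sequence. The function names, the k-mer approach, and the value of `k` are illustrative assumptions.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_regions(families, references, k=4):
    """For each family, return the stretches covered only by k-mers
    that appear in no other family and in no reference sequence."""
    regions = {}
    for name, seq in families.items():
        # Collect k-mers from every other family and from all references.
        background = set()
        for other, other_seq in families.items():
            if other != name:
                background |= kmers(other_seq, k)
        for ref_seq in references:
            background |= kmers(ref_seq, k)

        found, start = [], None
        for i in range(len(seq) - k + 1):
            is_unique = seq[i:i + k] not in background
            if is_unique and start is None:
                start = i  # a run of unique k-mers begins here
            elif not is_unique and start is not None:
                # The last unique k-mer started at i - 1 and ends at i + k - 2.
                found.append(seq[start:i + k - 1])
                start = None
        if start is not None:
            found.append(seq[start:])  # run extends to the end of the sequence
        regions[name] = found
    return regions
```

A call such as `unique_regions({"famA": "AAAACCCC", "famB": "AAAAGGGG"}, ["CCCC"], k=4)` keeps only the parts of each toy sequence that distinguish it from the other family and from the reference.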

### 2.1 Input Files

The script requires the following input files (sample input files are provided for this tutorial):

- Proteins of interest (example/input_prots.faa)
- Reference proteins (example/ref_prots.faa)

### 2.2 Running ShortBRED-Identify

To create markers for the sample data, run the following command from the shortbred working directory:

    $ shortbred_identify.py --goi example/input_prots.faa --ref example/ref_prots.faa --markers mytestmarkers.faa --tmp example_identify

The above command will create a set of markers (mytestmarkers.faa). An example of the output is below:

    >P23181_TM_#01
    MLRSSNDVTQQGSRPKTKLGGSSMGIIRTCRLGPDQVKSMRAALDLFGREFGDVATYSQH
    QPDSDYLGNLLRSKTFIALAAFDQEAVVGALAAYVLPKFEQARSEIYIYDLAVSGEHRRQ
    GIATALINLLKHEANALGAYVIYVQADYGDDPAVALYTKLGIREEVMHFDIDPSTAT
    >P13246_TM_#01
    IGPVEGGAETVVAALRSAVGPTGTVMGYASWDRSPYEETLNGARLDD
    >P13246_TM_#02
    PFDPATAGTYRGFGLLNQFLVQAPGARRSAHPDASMVAVGPLAETLTEPHELGHALGEGS
    P
    >P13246_TM_#03
    ERFVRLGGKALLLGAPLNSVTALHYAEAVADIPNKRWVTYEMPM
    >P13246_TM_#04
    GRDGEVAWKTASDYDSNGILDCFAIEGK
    >P13246_TM_#05
    DAVETIANAYVKLGRHREGV
    >YP_884847_TM_#01
    SHHGALIAHGAVVQRRLMYRGPDGRGHALRCGYVEAVAVREDRRGDGLGTAVLDALEQVI
    RGAYQIGALSASDIARPMYIARGWLSWEGPTSVLTPTEGIVRTPEDDRSLFVLPVDLPDG
    LELDTAREITCDWRSGDPW
    >NP_753952_TM_#01
    MQKYISEARLLLALAIPVILAQIAQTAMGFVDTVMAGGYSATDMAAVAIGTSIWLPAILF
    GHGLLLALTPVIAQLNGSGRRERIAHQVRQGFWLAGFVSVLIMLVLWNAGYIIRSMQNID
    PALADKAVGYLRALLWGAPGYLFFQVARNQCEGLAKTKPGMVMGFIGLLVNIPVNYIFIY
    GHFGMPELGGVGCGVATAAV
    >YP_001848841_TM_#01
    HALGGMHALIWHRGAIIAHGAVVQRRLIYRGS
    >Q49157_TM_#01
    EGDFSDADWEHALGGMHAFICH
    >Q49157_TM_#02
    VEQVLRGAYQLGALSASDTARGMYLSRGWLPWQGPTSVLQPAGVTRTPEDDEGLFVLPVG
    LPAGMELDTTAEITCDWRDGDVW
    >NP_214776_TM_#01
    DIRQMVTGAFAGDFTETDWEHTLGGMHALIWHHGAIIAHAAVIQRRLIYRGNALRCGYVE
    GVAVRADWRGQRLVSALLDAVEQVMRGAYQLGALSSSAR
    >ZP_02959935_TM_#01
    MGIEYRSLHTSQLTLSEKEALYDLLIEGFEGDFSHDDFAHTLGGMHVMAFDQQKLVGHVA
    IIQRHMALDNTPISVGYVEAMVVEQSYRRQGIGRQLMLQTNKIIASCYQLGLLSASDDGQ
    KLYHSVGWQIWKGKLFELKQGSYIRSIEEEGGVMGWKADGEVDFTASLYCDFRGGDQW
    >YP_001068559_TM_#01
    MAGTPRWYNDGVLPQLSSEVRGHGVIHTARLVHTADLDNETREGARRMVSEAFRG
    >YP_001068559_TM_#02
    CRGQGLGSAVMDACEQVLRGAYQLGALATSTMARPMYRARGWVPWRGPTSVLSPGGRIST
    P
    >YP_001068559_TM_#03
    DDGSVFVYPVGSALGSTDLDTTAELTCDWRHGDVW

The directory example_identify (folder name provided with the tmp flag) should contain the processed data from ShortBRED (including blast results).
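Marker IDs in the example output follow the pattern `<family>_TM_#NN`, so a single family can have several markers. The Python sketch below groups markers by family and sums their lengths, assuming that naming pattern holds for every record in the file. The helper names are hypothetical and not part of ShortBRED.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)


def marker_lengths_by_family(lines):
    """Sum marker lengths per family, taking the family name to be
    everything before '_TM_' in the record header (an assumption)."""
    totals = defaultdict(int)
    for header, seq in read_fasta(lines):
        family = header.split("_TM_")[0]
        totals[family] += len(seq)
    return dict(totals)


# Usage sketch (file name taken from the command above):
# with open("mytestmarkers.faa") as handle:
#     print(marker_lengths_by_family(handle))
```

For the example markers above, the five `P13246` markers sum to 200 residues, which matches the `TotMarkerLength` value reported for that family in the ShortBRED-Quantify example output in Section 3.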
Please refer to the documentation for further details.

## 3. ShortBRED-Quantify

ShortBRED-Quantify searches for the markers in nucleotide data and returns a normalized relative abundance table of the protein families found in the data. The script takes the FASTA file of markers and quantifies their relative abundance in a FASTA file of nucleotide metagenomic reads.

### 3.1 Input Files

The script requires the following input files (sample input files are provided for this tutorial):

- Markers file (generated in Section 2)
- Short nucleotide reads (wgs.fna)

### 3.2 Running ShortBRED-Quantify

To quantify the markers in the sample data, run the following command from the shortbred working directory:

    $ shortbred_quantify.py --markers mytestmarkers.faa --wgs example/wgs.fna --results exampleresults.txt --tmp example_quantify


The above command will create an output file exampleresults.txt containing relative abundance data of the protein families in the wgs data. An example of the output is below:

    Family          Count            Hits    TotMarkerLength
    NP_214776       0.0              0       99
    NP_753952       19569471.6243    1       200
    P13246          0.0              0       200
    P23181          0.0              0       177
    Q49157          0.0              0       105
    YP_001068559    0.0              0       151
    YP_001848841    0.0              0       32
    YP_884847       0.0              0       139
    ZP_02959935     0.0              0       178


The directory example_quantify (folder name provided with the tmp flag) should contain the processed data from ShortBRED. Please refer to the documentation for further details.
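The results table can be loaded with a few lines of Python, for example to list the families that received at least one hit. This is a minimal sketch that assumes the file is whitespace-delimited with exactly the four columns shown above; the function names are hypothetical and not part of ShortBRED.

```python
def parse_results(lines):
    """Parse a results table (header row plus four columns) into a list of dicts."""
    rows, header_seen = [], False
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if not header_seen:
            header_seen = True  # skip the "Family Count Hits TotMarkerLength" row
            continue
        family, count, hits, length = fields
        rows.append({
            "Family": family,
            "Count": float(count),
            "Hits": int(hits),
            "TotMarkerLength": int(length),
        })
    return rows


def detected_families(rows):
    """Return the names of families with at least one hit."""
    return [row["Family"] for row in rows if row["Hits"] > 0]


# Usage sketch (file name taken from the command above):
# with open("exampleresults.txt") as handle:
#     print(detected_families(parse_results(handle)))
```

With the example output above, `detected_families` returns only `NP_753952`, the one family with a non-zero `Hits` value.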
