A protocol for chemically guided functional profiling (CGFP) in meta’omics data
- A protocol for chemically guided functional profiling (CGFP) in meta’omics data
- Benjamin J. Levin (firstname.lastname@example.org)
- Eric Franzosa
This repository contains a protocol and several scripts to assist users in combining Sequence Similarity Network (SSN) information with ShortBRED output. This approach, called "chemically guided functional profiling," is useful for studying large protein families in metagenomic and metatranscriptomic datasets.
SSNs are visual tools used to analyze large protein families. SSNs consist of nodes, which represent protein sequences, and edges, which connect nodes that share a sequence identity greater than a user-defined cutoff value. By carefully selecting an edge threshold, an SSN can be used to divide large protein families into putatively isofunctional clusters. Experimental validation is needed to verify that clusters are isofunctional, but an SSN can provide an excellent starting point.
ShortBRED is a method for quantifying the abundance of protein-encoding genes in metagenomes and metatranscriptomes. Given a set of amino acid sequences, ShortBRED first identifies short peptide markers unique to highly similar representative sequences and then quantifies the abundance of these markers in meta’omic datasets.
By combining the knowledge embedded within an SSN with ShortBRED quantification, one can determine the abundances of functionally distinct members of an enzyme superfamily in meta’omic datasets. First, an SSN is constructed for a given protein family of interest (PFOI), with the goal of dividing the family into isofunctional clusters. The abundances of members of the PFOI in meta’omics data are then determined using ShortBRED. Each member of the PFOI is linked to a cluster on the SSN, and so abundance information of entire clusters can be found by summing the abundances of each cluster’s members. Assuming clusters in the SSN are isofunctional, the abundances of particular biochemical functions in microbiomes can be determined.
If you use the protocol or data files provided here, please cite the following publication:
B. J. Levin et al., Science 355, eaai8386 (2017). (DOI: 10.1126/science.aai8386)
Sequence Similarity Network (SSN) for Protein Family of Interest (PFOI). SSNs can be generated using the Enzyme Function Initiatives-Enzyme Similarity Tool (EFI-EST) at http://efi.igb.illinois.edu/efi-est/. In addition to the tutorial on the EFI-EST website, see this review for more information on SSN construction. For chemically guided functional profiling to be successful, SSNs must be well-constructed. Ideally, the clusters within an SSN should be known to represent isofunctional proteins. The code and tutorial below assume that the SSN has been generated from an InterPro family. If another option was used, such as with user-supplied sequences, the general workflow will be identical, but the scripts below will not apply.
Cytoscape. A program used to visualize SSNs. Cytoscape can be downloaded from www.cytoscape.org. The tools below have been validated with Cytoscape 3.2.1.
ShortBRED. A method to quantify protein families in metagenomes and metatranscriptomes with high specificity. Detailed instructions for installing and running ShortBRED can be found at https://bitbucket.org/biobakery/shortbred/wiki/Home. For users having issues with installing or running ShortBRED, please direct questions to email@example.com.
Background protein reference. ShortBRED requires a "protein universe" to aid in finding peptide markers for proteins of interest. We recommend UniRef90 for this purpose, which can be downloaded here. UniRef90 is a comprehensive, non-redundant representation of the proteins in UniProt (clustered at 90% amino acid identity). More information about UniRef90 can be found here.
Metagenomes of interest. User-provided metagenomes and metatranscriptomes can be used in this protocol. Publicly available datasets can be used for this analysis as well. The Human Microbiome Project has made available hundreds of metagenomes. MG-RAST and the EBI metagenomics database are additional, excellent resources for acquiring shotgun meta’omic sequencing data for a variety of projects and microbial community types.
CGFP scripts. Two scripts, compatible with Python 2.7+, are required to complete the protocol. Download those scripts + demo data here. Alternatively, you may directly clone this repository:
hg clone https://bitbucket.org/biobakery/cgfp.
Processing an SSN
1. Extracting Data from the SSN
Two tables must be exported from the SSN of interest. These tables, containing information about the nodes and edges in the SSN, can be exported from Cytoscape. Open the SSN of interest and select
Table and select the node table from the list that appears. Repeat the process, selecting the edge table to obtain both tables (see figure below).
From these two tables, two additional files need to be generated. We have provided a script that can construct those files:
parse_ssn.py. This script outputs 1) a list of UniProt accession codes for all sequences on the network and 2) a mapping file to link each node to the cluster it belongs to. The list of accession codes will be used to download the FASTA sequences for all members of the PFOI. The other file, the mapping file, is used to link members of the PFOI to the clusters they belong to. Although the EFI-EST and Cytoscape calculate and illustrate nodes and edges, for our purposes it is more helpful to divide the network into connected components and assign a cluster number to each component. The mapping file links each sequence to the cluster to which it belongs. If the SSN contains isofunctional clusters, then all sequences with the same cluster number (i.e. in the same connected component) correspond to proteins with the same function.
The script to generate the accession code list and the mapping file can be run as follows:
./scripts/parse_ssn.py $NODES $EDGES -e $N -a $ACCESSION -c $CLUSTERS
$NODESis the nodes table exported from the SSN.
$EDGESis the edges table exported from the SSN.
-a $ACCESSIONis the location of the output file containing the UniProt accession codes for all sequences in the network. This flag is optional.
-c $CLUSTERSis the location of the output file containing the cluster mapping file. This flag is optional.
-e $Nis an additional edge filter. Instead of forcing users to regenerate a new SSN or remove edges in Cytoscape if the SSN is to be refined (i.e. increase the edge threshold), this flag forces the script to only consider edges connecting nodes sharing greater than N% identity. This flag is optional.
If the SSN was not generated from an InterPro family, this script will not work. A list of accession codes and a cluster mapping file would need to be made manually from the table of nodes or with an application-specific script.
2. Obtaining FASTA Sequences
The sequences for all proteins in the SSN need to be obtained. Using the list of UniProt accession codes obtained in the previous step, sequences can be downloaded easily from http://www.uniprot.org/uploadlists/. After uploading the list of sequences, confirm that the default options to convert the accession codes (UniProtKB AC/ID) to UniProtKB output and select
Go (see figure below).
On the results page, select
Download and download all of the sequences as
FASTA (canonical) (see figure below). Be sure to unzip the file if a compressed version is downloaded.
If the SSN was not generated with an InterPro family, the sequence list would need to be constructed either manually or with an application-specific script.
3. Running ShortBRED-Identify
ShortBRED-Identify is used to find markers for similar members of the PFOI. It requires a FASTA file containing all sequences of interest and a reference list, typically UniRef90. Only the list of markers is needed for subsequent analysis, but temporary files should be retained if ShortBRED-Identify will need to be rerun with modified parameters.
A typical ShortBRED-Identify call is provided below. Note that ShortBRED-Identify takes ~100 CPU-hours to find markers for ~1,000 sequences, but the rate-limiting steps are parallelized and can benefit from the use of multiple CPU cores. For a full list of flags, run
./shortbred_identify.py --threads $THREADS --goi $GOIFASTA --ref $UNIREF90 --markers $MARKERS --tmpdir $TMP
$THREADSis the number of threads for ShortBRED to use. This is important when running on computational clusters.
$GOIFASTAis the location of a FASTA file containing all of the sequences for proteins on the SSN. This file was generated in Step 2 of this protocol.
$UNIREF90is the location of UniRef90 (see Prerequisites).
$MARKERSis the location of the output marker file generated by ShortBRED.
$TMPis the location for ShortBRED to store temporary files. Some of these files are useful for analysis and troubleshooting and it is recommended they be saved.
4. Running ShortBRED-Quantify
ShortBRED-Quantify is run with the markers generated in Step 3 against a user-provided metagenome (or metatranscriptome). A typical ShortBRED-Quantify call is provided below. This script will need to be run for each metagenome of interest. For a full list of flags, run
./shortbred_quantify.py --threads $THREADS --markers $MARKERS --wgs $METAGENOME --results $RESULTS.txt --tmp $TMP
$THREADSis the number of threads for ShortBRED to use.
$MARKERSis the location of the marker file generated in Step 2.
$METAGENOMEis the location of the metagenome to be queried (see Prerequisites).
$RESULTSis the location of where to store the resulting output file.
$TMPis the location for ShortBRED to store temporary files. This flag is not necessary, but highly useful for troubleshooting and detailed analysis.
Merging and analyzing ShortBRED results
5. Compiling ShortBRED Results
ShortBRED-Quantify outputs one
$RESULTS file for each metagenome analyzed. For analyzing more than one metagenome, we have written a script to combine the output from multiple results files:
merge_shortbred.py. It can be run as follows:
./scripts/merge_shortbred.py $RESULTS_1 $RESULTS_2 … $RESULTS_N -c $CLUSTERS -p $PROTEINS -C $CLUSTERS
$RESULTSfiles are ShortBRED-Quantify output files.
$CLUSTERSis the cluster mapping file generated in Step 1.
$PROTEINSis the name of the output file containing the abundances of individual proteins.
$CLUSTERSis the name of the output file containing the abundances of clusters on the SSN.
This script takes as inputs a list of locations of results files (from Step 4) and the cluster mapping file (from Step 2) and outputs two tables: one with results for each protein and one where proteins from the same cluster in the SSN have had their abundances summed. For more information and a full list of flags, run
6. Output normalization
merge_shortbred.py script can be used to normalize the output to counts per microbial genome. To do this calculation, append the flag
-g $AGS, where
$AGS is a two column file: the first column has metagenome identifiers and the second column lists the average genome size. Average genome sizes (AGS) have been previously computed for many of the HMP metagenomes using the software system MicrobeCensus. MicrobeCensus can be applied to compute AGS values for additional metagenomes. If you use MicrobeCensus or the AGS values bundled with the CGFP workflow, please cite this publication.
merge_shortbred.py can normalize protein abundance information to relative abundance (sum=1) units. To perform this type of normalization, append the flag
-n. This does not require any additional files. If no extra normalization options are specified, the table is output in RPKM units (reads per kilobase of sequence per million sample reads).
7. Downstream analysis
The tables produced by
merge_shortbred.py can be analyzed in any number of ways. For example, heatmap-style visualization provides a convenient summary of the entire merged abundance table.
hclust2 provides a Python interface for heatmap construction (along with feature selection and feature/sample clustering).
We have included sample files to use as test cases. The expected inputs and outputs for each of the steps in the protocol is given below:
- Step 1
- Input: IPR004184_nodes_10-13-16.csv; IPR004184_edges_10-13-16.csv
- Output: UniProtIDs.txt; Mapping.tsv
- Step 2
- Input: UniProtIDs.txt
- Output: Sequences.fasta
- Step 3
- Input: Sequence.fasta (UniRef90 was used as the reference set)
- Output: Markers.faa
- Step 4
- Input: Markers.faa (ShortBRED-Quantify was run separately against each metagenome; metagenomes used in this tutorial can be obtained from here)
- Output: SRS011061.txt; SRS011134.txt; SRS011239.txt
- Step 5
- Input: SRS011061.txt; SRS011134.txt; SRS011239.txt; Mapping.tsv
- Output: ProteinCounts.tsv; ClusterCounts.tsv
- Step 6
- Input: SRS011061.txt; SRS011134.txt; SRS011239.txt; Mapping.tsv; AvgGenomeSizes.txt
- Output: ProteinCountsNorm.tsv; ClusterCountsNorm.tsv