Wiki

General scheme

Profs is a database that integrates the sequence information provided by the profiles of the Pfam database with the structural information given by the CATH domain database. Starting from a protein sequence, domains are identified through Profs families. To construct Profs and annotate sequences, we follow this scheme:

Extraction of sequences

We read the ATOM and SEQRES records of each PDB entry and align them using Muscle in order to identify missing residues.
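As an illustration of this step, the sketch below takes a SEQRES sequence and an ATOM sequence that have already been aligned (for example by Muscle) and reports which SEQRES positions have no observed counterpart. The function name `missing_residues` and the use of `-` as the gap character are assumptions made for this example, not part of the Profs code.

```python
def missing_residues(seqres_aln: str, atom_aln: str) -> list[int]:
    """Return 1-based SEQRES positions that are absent from the ATOM records.

    Both arguments are rows of the same pairwise alignment, with '-' as the
    gap character (an assumption of this sketch).
    """
    if len(seqres_aln) != len(atom_aln):
        raise ValueError("aligned sequences must have equal length")

    missing = []
    pos = 0  # running position in the ungapped SEQRES sequence
    for s, a in zip(seqres_aln, atom_aln):
        if s == "-":
            continue  # column does not correspond to a SEQRES residue
        pos += 1
        if a == "-":
            missing.append(pos)  # residue in SEQRES but not observed in ATOM
    return missing


# Hypothetical aligned pair: residues 1, 2 and 7 are not resolved.
print(missing_residues("MKTAYIAK", "--TAYI-K"))
```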

Post-processing families

UniProt sequences are searched against the Pfam profile database using hmmscan v3.0 with the option --cut_ga (E-value < 0.001). For the PDB, we use the sequences obtained from the SEQRES records.
For the families identified, we calculate the overlap with the equations:

Overlap1 = (Lo / L1) × 100%   and   Overlap2 = (Lo / L2) × 100%

The equations relate the number of overlapping residues Lo to the family lengths L1 and L2. Families with overlaps greater than 10% in both equations are fused: only the family with the lowest E-value is preserved.
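The overlap rule can be sketched as follows. This is a minimal example under stated assumptions: the family length is taken to be the length of the matched region on the sequence, conflicts are resolved greedily from the lowest E-value upwards, and the names `Hit`, `overlap_percentages` and `resolve_overlaps` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    family: str
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive
    evalue: float

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlap_percentages(a: Hit, b: Hit) -> tuple[float, float]:
    """Return (Overlap1, Overlap2) in percent for two hits on one sequence."""
    lo = max(0, min(a.end, b.end) - max(a.start, b.start) + 1)
    return 100.0 * lo / a.length, 100.0 * lo / b.length


def resolve_overlaps(hits: list[Hit], threshold: float = 10.0) -> list[Hit]:
    """Keep the lowest E-value hit among hits overlapping by more than
    `threshold` percent in both directions (greedy; an assumption)."""
    kept: list[Hit] = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        clashes = False
        for other in kept:
            o1, o2 = overlap_percentages(hit, other)
            if o1 > threshold and o2 > threshold:
                clashes = True
                break
        if not clashes:
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


# Hypothetical hits: A and B overlap by 20 residues (20% of each).
hits = [Hit("A", 1, 100, 1e-20), Hit("B", 81, 180, 1e-5), Hit("C", 200, 250, 1e-3)]
print([h.family for h in resolve_overlaps(hits)])
```

In this example family B is dropped because it overlaps A in both directions and has the higher E-value, while C is kept untouched.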
Families whose description record in the Pfam-A database contains the word "repeat" are identified as repeat families.
We group repeat families that either have the same Pfam id or belong to the same clan, and rename the resulting families by their clan id.

Annotations of the PDB

The processed PDB entries are treated in two different ways, depending on whether or not they have entries in the CATH database. For the first group, our Pfam-based domain definitions are mapped to the CATH v4.0 definitions with an empirical minimum overlap of 15 amino acids. We consider two types of relationship: single, when a Pfam family corresponds to one or more CATH domains, and multiple or supradomain, when several Pfam families are associated with one CATH domain. In the latter case, supradomains are renamed by concatenating the Pfam ids with a plus sign (e.g. LRR+LRRNT), and new limits are taken so as to give the broadest domain definition. We have designed a set of rules based on the three-dimensional structural information of supradomains:

Supradomains formed by copies of domains from the same family are processed as a single domain.
Families are merged into a supradomain if supradomains formed by the same families exist in other PDB chains.
PDB chains that have consecutive supradomains with families in common are assigned to the second group, explained below.
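The mapping between Pfam regions and CATH domains can be illustrated with a simplified sketch. It assumes that every domain is a single contiguous segment (real CATH domains can be discontinuous), that the supradomain limits are the union of its Pfam regions, and that the Pfam ids are joined in alphabetical order. None of these details is stated above, and the function names are hypothetical. The second and third rules, which need information from other chains, are not covered.

```python
from collections import defaultdict

MIN_OVERLAP = 15  # empirical minimum overlap, in residues


def overlap_len(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Number of residues shared by two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def map_pfam_to_cath(pfam, cath, min_overlap=MIN_OVERLAP):
    """pfam: list of (family, start, end); cath: list of (domain_id, start, end).

    Returns (name, start, end, cath_id) tuples, one per CATH domain that has
    at least one Pfam family overlapping it by `min_overlap` residues.
    """
    members = defaultdict(list)
    for fam, ps, pe in pfam:
        for cid, cs, ce in cath:
            if overlap_len((ps, pe), (cs, ce)) >= min_overlap:
                members[cid].append((fam, ps, pe))

    result = []
    for cid, regions in members.items():
        # Copies of the same family collapse into one name (first rule above);
        # several distinct families give a supradomain name such as "LRR+LRRNT".
        name = "+".join(sorted({fam for fam, _, _ in regions}))
        start = min(s for _, s, _ in regions)
        end = max(e for _, _, e in regions)
        result.append((name, start, end, cid))
    return sorted(result, key=lambda r: r[1])


# Hypothetical chain with two CATH domains.
pfam = [("LRRNT", 1, 30), ("LRR", 35, 120), ("Kinase", 150, 400)]
cath = [("dom1", 1, 125), ("dom2", 140, 410)]
for row in map_pfam_to_cath(pfam, cath):
    print(row)
```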

Regions without Pfam families are discarded even if they have CATH domain definitions, whereas Pfam domain definitions without CATH definitions are preserved unchanged. Afterwards, the region between two domains is split equally between them to give larger domain definitions, but only if it contains fewer than 15 residues. In this way, we guarantee that no orphan regions are misallocated. The second group is annotated using the procedure described in the section Annotations of the Uniprot.
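The boundary extension can be sketched as below. The example assumes non-overlapping domains given as inclusive residue ranges, and it gives the extra residue of an odd-sized gap to the N-terminal domain. That tie-break and the name `extend_domains` are assumptions of this sketch.

```python
def extend_domains(domains, max_gap=15):
    """domains: list of (name, start, end) with inclusive, non-overlapping ranges.

    A gap of fewer than `max_gap` residues between consecutive domains is
    split equally between them; larger gaps are left unassigned.
    """
    out = [list(d) for d in sorted(domains, key=lambda d: d[1])]
    for left, right in zip(out, out[1:]):
        gap = right[1] - left[2] - 1  # residues strictly between the domains
        if 0 < gap < max_gap:
            half = gap // 2
            left[2] += gap - half     # left domain takes the extra residue if odd
            right[1] = left[2] + 1    # right domain starts right after it
    return [tuple(d) for d in out]


# Hypothetical chain: a 10-residue gap is split, a 29-residue gap is not.
print(extend_domains([("A", 1, 100), ("B", 111, 200), ("C", 230, 300)]))
```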

Generation of libraries

The libraries are constructed in two parts. First, we use the first group of PDB entries to generate a non-redundant library of domain sequences for each family, using CD-HIT at 100% identity. Then, the second group of PDB entries is annotated with the library created in the first part, and these annotations are used to generate the second part of the libraries. In both parts, some domains are formed by different repeat families of the same clan (e.g. Ank), and these may be named after the clan or after any member of the clan, so we often found the best templates in the libraries of other members of the clan. For this reason, we gather all repeat members of a clan in the same library. Finally, the libraries used to annotate UniProt sequences are constructed from both parts.

Annotations of the Uniprot

We identify homologous structures for use as templates. This can be done quickly using Blastp, a sequence alignment program. In the process, a target protein sequence is aligned against all family members using the libraries previously constructed.
The best templates are chosen with the following scoring function:

Score = α × (SeqId / 100) × (Cov / 100) + (1 - α) × (Cov / 100)

The scoring function takes into account the coverage of the sequence (Cov, in the range 0–100) and the sequence identity of the template to the target protein (SeqId, in the range 0–100). The parameter α balances the contribution of the two terms (α is set to 0.95).
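As a worked example of the scoring function, the short sketch below computes the score for two hypothetical templates and selects the better one. The function name `template_score` and the candidate values are invented for illustration.

```python
def template_score(seq_id: float, cov: float, alpha: float = 0.95) -> float:
    """Score a template from its sequence identity and coverage (both 0-100)."""
    return alpha * (seq_id / 100) * (cov / 100) + (1 - alpha) * (cov / 100)


# Hypothetical candidates: (template id, SeqId, Cov).
candidates = [("t1", 90.0, 40.0), ("t2", 60.0, 95.0)]
best = max(candidates, key=lambda c: template_score(c[1], c[2]))
print(best[0], round(template_score(best[1], best[2]), 4))
```

Here t1 scores about 0.362 and t2 about 0.589, so the template with lower identity but much higher coverage is preferred. Because both terms are multiplied by Cov, a template with poor coverage cannot score well however high its identity.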
The best-matching domain obtained with Blastp is realigned using Muscle in order to cover the whole domain by expanding the flanking regions. If domains overlap or the unassigned regions have fewer than 15 residues, the final domain boundaries are distributed uniformly.
