Fit3D - A tool for small structural motif screening in proteins and further analyses.

Fit3D is a command-line tool that searches for small structural motifs of amino acids in target proteins. It can also run in batch mode over a large number of target structures. The search is combinatorial and uses amino acid distributions in local environments, which allows Fit3D to locate inter- as well as intra-molecular structural motifs. The similarity between the query motif and a candidate hit is measured by the least root-mean-square deviation (LRMSD); a candidate is reported as a hit if its LRMSD is below a user-defined threshold. A variety of options allow highly customized searches and post-analyses.
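
For reference, the LRMSD can be written as the minimum RMSD over all rigid-body superpositions of the paired alignment atoms. The notation below is an illustration and is not taken from the Fit3D source: x_i and y_i are assumed to be the coordinates of the n paired atoms of the query motif and the candidate (the atoms selected with option -a), R is a rotation and t a translation.

```latex
\mathrm{LRMSD}(X, Y) = \min_{R \in SO(3),\; t \in \mathbb{R}^3}
  \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left\lVert R\,x_i + t - y_i \right\rVert^2}
```

A candidate counts as a hit when this value is below the threshold given with option -r.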

Features

Search

highly sensitive search for structural motifs in proteins
locate intra- and inter-molecular structural motifs, which is of particular interest for interaction sites
high-throughput processing of thousands of target structures
calculation of statistical significance (p-values) of hits
freely define any valid PDB atoms used for alignment calculation
definition of position-specific residue exchanges allows relaxation of physico-chemical and steric constraints
output aligned hits in Protein Data Bank (PDB) format for visual inspection
output extensive summary of results
pause computationally expensive searches and resume them later
multi-threading capability

Analyses

cluster hits based on structural similarity to estimate structural homology
mapping of enzyme classification (EC)
mapping of Pfam accession codes
EC based classification of hits to distinguish between biological and non-biological results
combination of clustering and classification to derive functional phylogeny

Input

The query motif must be in PDB format. The built-in extraction wizard (option -X) can be used to define a new query motif. You can either provide a single search target by its PDB-ID (e.g. 1PQS) or provide a file containing a list of PDB-IDs, one per line. Custom structures in PDB format can also be processed by providing a file path instead of a PDB-ID.
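
As a rough illustration of the target list format, the following sketch splits a list into PDB-IDs and file paths. It is not how Fit3D itself reads the list. The function name `read_targets`, the rule "anything that is not a four-character PDB-ID is a file path", the upper-casing of IDs, and the path `/data/models/model_01.pdb` are assumptions made for this example.

```python
import re

# Classic PDB-IDs have four characters: a digit followed by three alphanumerics.
PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def read_targets(text):
    """Split a target list (one entry per line) into PDB-IDs and file paths."""
    pdb_ids, paths = [], []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        if PDB_ID.match(entry):
            pdb_ids.append(entry.upper())  # assumption: IDs are case-insensitive
        else:
            # Assumption: everything else is a path to a custom PDB file.
            paths.append(entry)
    return pdb_ids, paths


# Hypothetical list mixing two PDB-IDs and one custom structure.
example = "1PQS\n2y42\n\n/data/models/model_01.pdb\n"
print(read_targets(example))
```

Checking a list this way before a long batch run can catch typos in PDB-IDs early.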

Output (for each hit)

LRMSD to query set
p-value (if option -P enabled)
PDB-ID
EC number (if option -E or -B enabled)
Pfam accession code (if option -M enabled)
class: positive (+), negative (-) or used for training (o) (if option -B enabled)
class probability (if option -B enabled)
intra- or inter-molecular occurrence
amino acid sequence
motif representation:

:::text
[chain ID]-[amino acid][residue number][insertion code]

title (if option -T enabled)

example Fit3D output:

:::text
LRMSD     p-value              PDB-ID    EC          class    prob      occ      seq    motif                   title
2.6156    2.563957525084E-1    2Y42      1.1.1.85    +        0.7461    intra    HDS    A-H213 A-D184 A-S182    "STRUCTURE OF ISOPROPYLMALATE DEHYDROGENASE FROM THERMUS THERMOPHILUS - COMPLEX WITH NADH AND MN"
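
The motif column can be split back into residues with a small parser. The sketch below assumes one-letter amino acid codes and whitespace-separated residues, as in the example line above; `MotifResidue` and `parse_motif` are hypothetical names, not part of Fit3D.

```python
import re
from typing import NamedTuple


class MotifResidue(NamedTuple):
    chain: str
    amino_acid: str
    number: int
    insertion_code: str


# [chain ID]-[amino acid][residue number][insertion code]
TOKEN = re.compile(
    r"^(?P<chain>[^-\s]+)-(?P<aa>[A-Z])(?P<num>-?\d+)(?P<icode>[A-Za-z]?)$"
)


def parse_motif(field):
    """Parse a motif field such as 'A-H213 A-D184 A-S182' into residues."""
    residues = []
    for token in field.split():
        match = TOKEN.match(token)
        if match is None:
            raise ValueError(f"unexpected motif token: {token!r}")
        residues.append(
            MotifResidue(
                match["chain"], match["aa"], int(match["num"]), match["icode"]
            )
        )
    return residues


hit = parse_motif("A-H213 A-D184 A-S182")
print("".join(r.amino_acid for r in hit))  # HDS
```

Comparing the chain IDs of the parsed residues gives a quick check against the intra/inter column: a hit whose residues all share one chain is intra-molecular.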

Parameters

:::text
 -a,--atoms <CA,CB,CG,CD,...>    PDB identifier of atoms used for alignment (default: all non-hydrogen motif atoms)
 -B,--classify-hits <arg>        classify hits based on EC class, e.g. 3.4.21 for peptide hydrolases (experimental)
 -C,--cluster-hits <arg>         hierarchical clustering of hits below LRMSD cutoff
 -c,--conserve                   conserve atoms for structure output
 -d,--distance-tolerance <arg>   allowed tolerance of query motif spatial extent (default: 1.0 Å)
                                 WARNING: performance decrease if raised, set lower value for larger motifs
 -E,--ec-mapping                 map EC numbers by getting RESTful data from RCSB
                                 WARNING: requires internet connection and decreases performance
 -e,--exchange-residues <arg>    allowed residue exchanges for input motif (default: none)
                                 syntax: [motif residue number]:[allowed residues],...
                                 e.g. 12:ASHPW,43:PR
 -F,--filter-environment         pre-filter micro environments based on local distance constraints
                                 WARNING: performance gain, can result in loss of some hits
 -f,--result-file <arg>          result file
 -G,--gap-mapping                map gap sequence between hit amino acids
                                 WARNING: hit amino acids are reordered according to sequential order
 -g,--align-output               align output structures (default: false)
 -h,--help                       show help dialog
 -i,--ignore-atoms               ignore missing atoms, force alignment (default: false)
 -l,--target-list <arg>          target list of PDB-IDs or files**
 -M,--pfam-mapping               map Pfam annotation by getting RESTful data from RCSB
                                 WARNING: requires internet connection and decreases performance
 -m,--motif <arg>                motif PDB structure file*
 -N,--ref-size <arg>             size of reference population for p-value calculation to estimate point-weight correction
                                 according to Fofanov et al. 2008 (default: 31133)
 -n,--num-threads <arg>          number of threads used for calculation (default: all available)
 -o,--output-structures <arg>    output structures directory
 -P,--pvalues <F|S>              calculate p-values for matches according to Fofanov et al. 2008 (F) or Stark et al. 2003 (S) (default: false)
                                 WARNING: F needs R in path with package sfsmisc installed
 -p,--pdb <arg>                  path to local PDB directory
 -q,--quiet                      show only results
 -R,--restore-session <arg>      restore session from file
 -r,--rmsd <arg>                 maximal allowed LRMSD for hits (default: 2.0 Å)
 -s,--no-pdb-split               disable PDB directory split (default: false)
 -T,--title-mapping              map structure titles assigned by PDB
 -t,--target <arg>               target PDB-ID or file**
 -v,--verbose                    verbose output
 -X,--extract <arg>              extract motif from structure input (-m) following the syntax [chain]-[residue type][residue
                                 number]_... (e.g. A-E651_A-D649_A-T177)
                                 INFO: a subsequent search is performed and the extracted motif is written in PDB format
 -x,--vverbose                   extra verbose output

* = required
** = one of these required

Advanced

Exchange definition

You can define position-specific alternative amino acid labels for the query motif. For example, when searching for the serine protease catalytic triad HDS, you may also want results in which glutamine (Q) is present instead of histidine (H). The exchange definition that allows histidine 56 of the query motif to be substituted by glutamine is 56:Q. Multiple alternative labels are allowed per position (e.g. 56:QE), and several exchange definitions can be concatenated with commas (e.g. 56:Q,102:DA).
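
The exchange syntax can be illustrated with a short parser. This is a sketch of the syntax only, not Fit3D's own implementation: `parse_exchanges` and `allowed_labels` are hypothetical helpers, and the assumption that the original residue stays allowed alongside its alternatives is our reading of "alternative amino acid labels".

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard one-letter codes


def parse_exchanges(definition):
    """Parse e.g. '56:Q,102:DA' into {56: {'Q'}, 102: {'D', 'A'}}."""
    exchanges = {}
    for item in definition.split(","):
        position, _, residues = item.strip().partition(":")
        try:
            number = int(position)
        except ValueError:
            raise ValueError(f"bad residue number in {item!r}") from None
        unknown = set(residues) - VALID_RESIDUES
        if not residues or unknown:
            raise ValueError(f"bad residue labels in {item!r}")
        exchanges.setdefault(number, set()).update(residues)
    return exchanges


def allowed_labels(original, position, exchanges):
    """Labels accepted at a motif position (assumes the original stays allowed)."""
    return {original} | exchanges.get(position, set())


rules = parse_exchanges("56:Q,102:DA")
print(sorted(allowed_labels("H", 56, rules)))  # ['H', 'Q']
```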

p-value calculation

Fit3D implements two statistical models to estimate match significance. If enabled (option -P), the statistical significance of each hit is calculated; the method of Fofanov et al. calls external R code (see System requirements). The following two models are currently implemented:

:::text
Fofanov, V.; Chen, B.; Bryant, D.; Moll, M.; Lichtarge, O.; Kavraki, L. & Kimmel, M.
A statistical model to correct systematic bias introduced by algorithmic thresholds in protein structural comparison algorithms.
Bioinformatics and Biomedicine Workshops, 2008. BIBMW 2008. IEEE International Conference on, 2008, 1-8

:::text
Stark, A.; Sunyaev, S. & Russell, R. B.
A model for statistical significance of local similarities in structure.
J. Mol. Biol., 2003, 326, 1307-1316

Classification (experimental)

Hits can be classified against a positive reference set (all hits belonging to a user-defined EC class). Classification uses a support vector machine trained by supervised learning. For each hit, several features are considered: secondary structure, accessible surface area (ASA) of the hit amino acids, and physico-chemical properties of the local environments. In addition, a class probability is calculated. Classification is intended to help the user decide whether a hit is biologically meaningful.

The calculation of ASA is based on:

:::text
Shrake, A. & Rupley, J. A.
Environment and exposure to solvent of protein atoms. Lysozyme and insulin.
J. Mol. Biol., 1973, 79, 351-371
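
The idea behind the Shrake-Rupley method can be sketched in a few lines: test points are placed on a sphere around each atom (van der Waals radius plus probe radius), and the fraction of points not buried inside a neighboring atom's expanded sphere gives the accessible area. The code below is a minimal brute-force illustration, not Fit3D's implementation. The probe radius of 1.4 Å, the 960 test points, the golden-spiral point placement and the function names are assumptions chosen for this example.

```python
import math


def sphere_points(n):
    """Roughly uniform points on the unit sphere (golden-spiral construction)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for k in range(n):
        z = 1.0 - (2.0 * k + 1.0) / n
        r = math.sqrt(1.0 - z * z)
        phi = k * golden_angle
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points


def shrake_rupley(atoms, probe=1.4, n_points=960):
    """atoms: list of ((x, y, z), vdw_radius). Returns the ASA per atom in Å^2."""
    unit = sphere_points(n_points)
    areas = []
    for i, (center, radius) in enumerate(atoms):
        expanded = radius + probe
        exposed = 0
        for ux, uy, uz in unit:
            point = (
                center[0] + expanded * ux,
                center[1] + expanded * uy,
                center[2] + expanded * uz,
            )
            # A test point is buried if it lies inside any other expanded sphere.
            buried = any(
                math.dist(point, other) < other_radius + probe
                for j, (other, other_radius) in enumerate(atoms)
                if j != i
            )
            if not buried:
                exposed += 1
        areas.append(4.0 * math.pi * expanded**2 * exposed / n_points)
    return areas


# Two carbon-like atoms (radius 1.7 Å) placed 1.5 Å apart shield each other.
isolated = shrake_rupley([((0.0, 0.0, 0.0), 1.7)])
pair = shrake_rupley([((0.0, 0.0, 0.0), 1.7), ((1.5, 0.0, 0.0), 1.7)])
print(isolated[0] > pair[0])  # True
```

The brute-force neighbor check is quadratic in the number of atoms; practical implementations restrict it to nearby atoms.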

The calculation of physico-chemical properties of local environments is based on:

:::text
To be published.

Clustering

To cluster hits below a given LRMSD cutoff, pairwise alignments of the hits are calculated. The LRMSD of each alignment is used as the dissimilarity measure for neighbor joining, which constructs a phylogenetic tree. The user receives the underlying distance matrix as well as the tree in Newick format.
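
To show how a distance matrix of pairwise LRMSD values drives neighbor joining, the sketch below computes the standard Q-criterion and picks the first pair of hits to join. It covers only this first step and is not Fit3D's code. The function name `first_join` and the distance values are made up for the example.

```python
def first_join(dist):
    """Return the pair (i, j) minimizing the neighbor-joining Q-criterion.

    dist is a symmetric matrix of pairwise distances (here: LRMSD values)
    with at least three rows.
    """
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    best_pair, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            # Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair


# Hypothetical pairwise LRMSD values (in Å) for five hits.
lrmsd = [
    [0.0, 0.5, 0.9, 0.9, 0.8],
    [0.5, 0.0, 1.0, 1.0, 0.9],
    [0.9, 1.0, 0.0, 0.8, 0.7],
    [0.9, 1.0, 0.8, 0.0, 0.3],
    [0.8, 0.9, 0.7, 0.3, 0.0],
]
print(first_join(lrmsd))  # (0, 1)
```

Note that the joined pair is not the one with the smallest raw distance (hits 3 and 4): the Q-criterion also accounts for how far each hit is from all others.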

Installation (Windows/Linux)

No installation of Fit3D is necessary. Run the software by executing the command:

:::text
java -jar Fit3D.jar -m [motif] -t [target] [OPTIONS]

Example

An example of a Fit3D search and its output is included in the software package. It searches for the serine protease catalytic triad (-m motif_HDS.pdb) in a list of target structures (-l targets.txt), allowing a maximal LRMSD of 2.0 Å (-r 2.0). The alignment for the LRMSD calculation matches C-alpha and C-beta atoms (-a CA,CB). Exchanges of histidine (H) for glutamine (Q) at position 57 of the query motif are allowed (-e 57:Q). A summary file is created (-f motif_HDS.csv) and aligned structures (-g) are written (-o motif_HDS). Furthermore, a local PDB installation is used to avoid downloading each structure (-p /opt/pdb or -p C:\PDB).
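
Put together, the options described above correspond to a command line like the one assembled by the following sketch. It only rebuilds the call from the options listed in this section; the bundled run_example scripts may differ in detail. The helper name `fit3d_command` and the assumption that Fit3D.jar sits in the working directory are ours.

```python
import shlex


def fit3d_command(jar="Fit3D.jar", pdb_dir="/opt/pdb"):
    """Assemble the example search as an argument list."""
    return [
        "java", "-jar", jar,
        "-m", "motif_HDS.pdb",   # query motif
        "-l", "targets.txt",     # list of target structures
        "-r", "2.0",             # maximal LRMSD
        "-a", "CA,CB",           # atoms used for alignment
        "-e", "57:Q",            # allowed residue exchange
        "-f", "motif_HDS.csv",   # summary file
        "-g",                    # align output structures
        "-o", "motif_HDS",       # output directory for structures
        "-p", pdb_dir,           # local PDB installation
    ]


print(shlex.join(fit3d_command()))
```

Passing the list to `subprocess.run(fit3d_command(), check=True)` would start the search, provided Java and the input files are available.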

To run the example on Linux systems execute the following command in the command shell:

:::text
bash run_example.sh

For Windows systems it is sufficient to execute:

:::text
run_example.bat

Example Catalytic Site Atlas

An example of an iterative Fit3D search for Catalytic Site Atlas (CSA) derived motifs is included in the software package. It simulates screening against a set of modeled Dengue virus protein structures derived from ModBase to discover and annotate function.

To run the CSA example on Linux systems execute the following command in the command shell:

:::text
bash run_example_CSA.sh

For Windows systems it is sufficient to execute:

:::text
run_example_CSA.bat

Troubleshooting

When many large structures are processed, the heap size allocated by Java may be insufficient and a java.lang.OutOfMemoryError is thrown. In that case, increase the maximum heap size manually by passing the parameter

:::text
-Xmx

to the Java virtual machine, followed by the desired size (e.g. -Xmx4g for 4 GB) and placed before -jar.

System requirements

Operating system(s): platform independent
Java Runtime Environment (JRE) 1.8 or higher
at least 4 GB system memory recommended
only for p-value calculation: R installation in your path with the following packages installed: sfsmisc

Changelog

2015-09-16 (v005)

implemented new motif extraction wizard (option -X)
mapping of Pfam accession codes
the "reference population" parameter of the Fofanov et al. model can now be set (option -N)
Java Runtime Environment (JRE) 1.8 is now necessary

2014-11-12 (v004)

many bug fixes and improvements

2014-07-10 (v003)

new main features: p-value calculation, classification, clustering, pre-filter environment
reimplementation of core algorithm
many bug fixes

2013-10-10 (v002)

huge performance improvement for larger motifs (up to 10 times faster)
new features added (ignore missing atoms, toggle PDB directory split)

2013-08-07 (v001)

performance improvement
several bugs fixed

Citation

:::text
Kaiser, F., Eisold, A. & Labudde, D. (2015)
A novel algorithm for enhanced structural motif matching in proteins,
J. Comput. Biol., 22, 698-713.
