Fit3D - A tool for small structural motif screening in proteins and further analyses.
Fit3D is a command-line tool that searches for small structural motifs of amino acids in target proteins. It also supports batch processing of large numbers of target structures. The search is combinatorial and uses amino acid distributions in local environments, which allows Fit3D to locate inter- as well as intra-molecular structural motifs. The similarity between the query motif and a candidate hit is measured by the least root-mean-square deviation (LRMSD); a candidate is reported as a hit if its LRMSD is below a user-defined threshold. A variety of options enables highly customized searches and post-analyses.
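The hit criterion can be illustrated with a minimal sketch. It only computes the RMSD of atom pairs that are assumed to be superimposed already and applies the threshold test; the search for the optimal superposition (the "least" in LRMSD) is omitted, and this is not Fit3D's own code. The function names and coordinates are hypothetical; the default of 2.0 Å mirrors the documented default of option -r.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) tuples, paired by index."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def is_hit(lrmsd, threshold=2.0):
    """A candidate counts as a hit if its LRMSD is below the threshold (in Å)."""
    return lrmsd < threshold


# Hypothetical, already superimposed atom pairs (coordinates in Å).
motif = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.5, 0.0)]
candidate = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (0.0, 1.5, 1.0)]

value = rmsd(motif, candidate)
print(f"RMSD = {value:.3f} Å, hit: {is_hit(value)}")
```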
- highly sensitive search for structural motifs in proteins
- locate intra- and inter-molecular structural motifs, which is especially of interest for interacting sites
- high-throughput processing of thousands of target structures
- calculation of statistical significance (p-values) of hits
- freely define any valid PDB atoms used for alignment calculation
- definition of position-specific residue exchanges allows relaxation of physico-chemical and steric constraints
- output aligned hits in Protein Data Bank (PDB) format for visual inspection
- output extensive summary of results
- pause computationally expensive searches and resume later
- multi-threading capability
- cluster hits based on structural similarity to estimate structural homology
- mapping of enzyme classification (EC)
- mapping of Pfam accession codes
- EC based classification of hits to distinguish between biological and non-biological results
- combination of clustering and classification to derive functional phylogeny
The query motif must be in PDB format. The built-in extraction wizard (option -X) can be used to define a new query motif. A search target can be given either as a single PDB-ID (e.g. 1PQS) or as a file containing a list of PDB-IDs, one per line. Custom structures in PDB format can also be processed by providing a file path instead of a PDB-ID.
Output (for each hit)
- LRMSD to query set
- p-value (if option -P enabled)
- EC number (if option -E or -B enabled)
- Pfam accession code (if option -M enabled)
- class: positive (+), negative (-) or used for training (o) (if option -B enabled)
- class probability (if option -B enabled)
- intra- or inter-molecular occurrence
- amino acid sequence
- motif representation:
[chain ID]-[amino acid][residue number][insertion code]
- title (if option -T enabled)
Example Fit3D output:
LRMSD   p-value            PDB-ID  EC              class  prob    occ    seq  motif                 title
2.6156  2.563957525084E-1  2Y42    220.127.116.11  +      0.7461  intra  HDS  A-H213 A-D184 A-S182  "STRUCTURE OF ISOPROPYLMALATE DEHYDROGENASE FROM THERMUS THERMOPHILUS - COMPLEX WITH NADH AND MN"
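For downstream scripting, the motif representation [chain ID]-[amino acid][residue number][insertion code] can be split into its parts. The following is a minimal sketch, assuming a single chain character, a one-letter amino acid code and an optional one-letter insertion code, as in the example hit above. The names MotifResidue and parse_motif are hypothetical and not part of Fit3D.

```python
import re
from typing import NamedTuple


class MotifResidue(NamedTuple):
    chain: str
    amino_acid: str
    number: int
    insertion_code: str


# Assumed token shape: chain character, "-", one-letter amino acid,
# residue number, optional one-letter insertion code.
_TOKEN = re.compile(r"([A-Za-z0-9])-([A-Z])(-?\d+)([A-Za-z]?)")


def parse_motif(text):
    """Parse a whitespace-separated motif string such as 'A-H213 A-D184'."""
    residues = []
    for token in text.split():
        match = _TOKEN.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognised motif token: {token!r}")
        chain, amino_acid, number, insertion_code = match.groups()
        residues.append(MotifResidue(chain, amino_acid, int(number), insertion_code))
    return residues


hit = parse_motif("A-H213 A-D184 A-S182")
# Joining the amino acids reproduces the "seq" column of the example hit.
print("".join(residue.amino_acid for residue in hit))
```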
-a,--atoms <CA,CB,CG,CD,...>  PDB identifier of atoms used for alignment (default: all non-hydrogen motif atoms)
-B,--classify-hits <arg>  classify hits based on EC class, e.g. 3.4.21 for peptide hydrolases (experimental)
-C,--cluster-hits <arg>  hierarchical clustering of hits below LRMSD cutoff
-c,--conserve  conserve atoms for structure output
-d,--distance-tolerance <arg>  allowed tolerance of query motif spatial extent (default: 1.0 Å)
    WARNING: performance decrease if raised, set lower value for larger motifs
-E,--ec-mapping  map EC numbers by getting RESTful data from RCSB
    WARNING: requires internet connection and decreases performance
-e,--exchange-residues <arg>  allowed residue exchanges for input motif (default: none)
    syntax: [motif residue number]:[allowed residues],... e.g. 12:ASHPW,43:PR
-F,--filter-environment  pre-filter micro environments based on local distance constraints
    WARNING: performance gain, can result in loss of some hits
-f,--result-file <arg>  result file
-G,--gap-mapping  map gap sequence between hit amino acids
    WARNING: hit amino acids are reordered according to sequential order
-g,--align-output  align output structures (default: false)
-h,--help  show help dialog
-i,--ignore-atoms  ignore missing atoms, force alignment (default: false)
-l,--target-list <arg>  target list of PDB-IDs or files**
-M,--pfam-mapping  map Pfam annotation by getting RESTful data from RCSB
    WARNING: requires internet connection and decreases performance
-m,--motif <arg>  motif PDB structure file*
-N,--ref-size <arg>  size of reference population for p-value calculation to estimate point-weight correction according to Fofanov et al. 2008 (default: 31133)
-n,--num-threads <arg>  number of threads used for calculation (default: all available)
-o,--output-structures <arg>  output structures directory
-P,--pvalues <F|S>  calculate p-values for matches according to Fofanov et al. 2008 (F) or Stark et al. 2003 (S) (default: false)
    WARNING: F needs R in path with package sfsmisc installed
-p,--pdb <arg>  path to local PDB directory
-q,--quiet  show only results
-R,--restore-session <arg>  restore session from file
-r,--rmsd <arg>  maximal allowed LRMSD for hits (default: 2.0 Å)
-s,--no-pdb-split  disable PDB directory split (default: false)
-T,--title-mapping  map structure titles assigned by PDB
-t,--target <arg>  target PDB-ID or file**
-v,--verbose  verbose output
-X,--extract <arg>  extract motif from structure input (-m) following the syntax [chain]-[residue type][residue number]_... (e.g. A-E651_A-D649_A-T177)
    INFO: a subsequent search is performed and the extracted motif is written in PDB format
-x,--vverbose  extra verbose output
* = required
** = one of these required
You can define position specific alternative amino acid labels of the query motif. For example, if one wants to search for serine proteases catalytic triad HDS you may also want to get results where glutamine (Q) instead of histidine (H) is present. The exchange definition to allow substitutions of histidine 56 of the query motif with glutamine therefore is: 56:Q. Multiple alternative amino acid labels are allowed (e.g. 56:QE) and several exchange definitions can be concatenated by comma (e.g. 56:Q,102:DA).
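To make the exchange syntax concrete, the following minimal sketch parses a definition string into a mapping from motif residue number to the set of additionally allowed one-letter amino acid codes. It only illustrates the syntax described above and is not Fit3D's own parser; the function names and the restriction to the 20 standard amino acids are assumptions.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_exchanges(definition):
    """Parse e.g. '56:Q,102:DA' into {56: {'Q'}, 102: {'D', 'A'}}."""
    exchanges = {}
    for item in definition.split(","):
        position, separator, residues = item.strip().partition(":")
        if not separator or not residues:
            raise ValueError(f"expected [number]:[residues], got {item!r}")
        unknown = set(residues) - STANDARD_AMINO_ACIDS
        if unknown:
            raise ValueError(f"unknown amino acid code(s): {sorted(unknown)}")
        # int() raises ValueError if the position is not a number.
        exchanges.setdefault(int(position), set()).update(residues)
    return exchanges


def is_allowed(position, motif_residue, candidate_residue, exchanges):
    """True if the candidate equals the motif residue or is a listed exchange."""
    return (candidate_residue == motif_residue
            or candidate_residue in exchanges.get(position, set()))


exchanges = parse_exchanges("56:Q,102:DA")
print(is_allowed(56, "H", "Q", exchanges))  # glutamine may replace histidine 56
print(is_allowed(56, "H", "E", exchanges))  # glutamate was not listed for 56
```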
Fit3D implements two statistical models to estimate the significance of a match. If enabled (option -P), the statistical significance of each hit is calculated; the Fofanov et al. model calls external R code (see System requirements). The following two models are currently implemented:
Fofanov, V.; Chen, B.; Bryant, D.; Moll, M.; Lichtarge, O.; Kavraki, L. & Kimmel, M. A statistical model to correct systematic bias introduced by algorithmic thresholds in protein structural comparison algorithms. Bioinformatics and Biomedicine Workshops, 2008. BIBMW 2008. IEEE International Conference on, 2008, 1-8
Stark, A.; Sunyaev, S. & Russell, R. B. A model for statistical significance of local similarities in structure. J. Mol. Biol., 2003, 326, 1307-1316
Hits can be classified with respect to a positive reference set (all hits belonging to a user-defined EC class). Classification uses supervised learning with a support vector machine. For each hit a variety of features is considered: secondary structure, accessible surface area (ASA) of the hit amino acids, and physico-chemical properties of local environments. In addition, a class probability is calculated. The classification is intended to help the user decide whether a hit is biologically meaningful.
The calculation of ASA is based on:
Shrake, A. & Rupley, J. A. Environment and exposure to solvent of protein atoms. Lysozyme and insulin. J. Mol. Biol., 1973, 79, 351-371
The calculation of physico-chemical properties of local environments is based on:
To be published.
To cluster hits below a given LRMSD cutoff, pairwise alignments are calculated. The LRMSD of each alignment serves as the dissimilarity measure for neighbor joining, which constructs a phylogenetic tree. The user receives the underlying distance matrix as well as the tree in Newick format.
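How a pairwise distance matrix turns into a Newick tree can be sketched with the textbook neighbor-joining procedure. This is a generic illustration under stated assumptions, not Fit3D's implementation: the function name, the example labels and the distance values are hypothetical, and the final join places the whole remaining branch length on one side as an arbitrary choice.

```python
from itertools import combinations


def neighbor_joining(names, matrix):
    """Textbook neighbor joining on a symmetric distance matrix; returns Newick."""
    n = len(names)
    if n < 2:
        raise ValueError("need at least two entries")
    newick = {i: names[i] for i in range(n)}
    # Store each distance once per unordered pair of node ids.
    dist = {frozenset((i, j)): float(matrix[i][j])
            for i, j in combinations(range(n), 2)}
    active = list(range(n))
    next_id = n

    def d(a, b):
        return dist[frozenset((a, b))]

    while len(active) > 2:
        m = len(active)
        # Net divergence: summed distance of each node to all other active nodes.
        r = {a: sum(d(a, b) for b in active if b != a) for a in active}
        # Join the pair that minimises the Q criterion.
        i, j = min(combinations(active, 2),
                   key=lambda p: (m - 2) * d(*p) - r[p[0]] - r[p[1]])
        len_i = 0.5 * d(i, j) + (r[i] - r[j]) / (2 * (m - 2))
        len_j = d(i, j) - len_i
        u = next_id
        next_id += 1
        # Distances from the new internal node to every remaining node.
        for k in active:
            if k not in (i, j):
                dist[frozenset((u, k))] = 0.5 * (d(i, k) + d(j, k) - d(i, j))
        newick[u] = f"({newick[i]}:{len_i:.4f},{newick[j]}:{len_j:.4f})"
        active = [k for k in active if k not in (i, j)] + [u]

    a, b = active
    return f"({newick[a]}:{d(a, b):.4f},{newick[b]}:0.0000);"


# Hypothetical pairwise dissimilarities (e.g. LRMSD values) between five hits.
labels = ["a", "b", "c", "d", "e"]
distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(neighbor_joining(labels, distances))
```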
No installation of Fit3D is necessary. Run the software by executing the command:
java -jar Fit3D.jar -m [motif] -t [target] [OPTIONS]
An example of a Fit3D search and its output is included in the software package. It shows a search for the catalytic triad of serine proteases (-m motif_HDS.pdb) in a list of target structures (-l targets.txt), allowing a maximal LRMSD of 2.0 (-r 2.0). The alignment for LRMSD calculation is performed by matching C alpha and C beta atoms (-a CA,CB). Exchanges of histidine (H) to glutamine (Q) at position 57 of the query motif are allowed (-e 57:Q). A summary file is created (-f motif_HDS.csv) and aligned structures (-g) are written (-o motif_HDS). Furthermore, a local PDB installation is provided to avoid downloading each structure (-p /opt/pdb or -p C:\PDB).
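The options described for this example can be assembled programmatically. The sketch below builds the argument list from the flags named above and prints it as a single shell line. It is an illustration only: the helper name is hypothetical, it assumes Fit3D.jar, the motif file and the target list are in the current directory, and it is not necessarily identical to the example command shipped with the package.

```python
import shlex


def build_fit3d_command(motif, target_list, rmsd, atoms, exchanges,
                        result_file, output_dir, pdb_dir, jar="Fit3D.jar"):
    """Assemble a Fit3D call as an argument list (flags as documented above)."""
    return [
        "java", "-jar", jar,
        "-m", motif,              # query motif in PDB format
        "-l", target_list,        # file with one PDB-ID or path per line
        "-r", str(rmsd),          # maximal allowed LRMSD
        "-a", ",".join(atoms),    # atoms used for alignment
        "-e", exchanges,          # position-specific residue exchanges
        "-f", result_file,        # summary file
        "-g",                     # align output structures
        "-o", output_dir,         # directory for output structures
        "-p", pdb_dir,            # local PDB directory
    ]


command = build_fit3d_command(
    motif="motif_HDS.pdb",
    target_list="targets.txt",
    rmsd=2.0,
    atoms=["CA", "CB"],
    exchanges="57:Q",
    result_file="motif_HDS.csv",
    output_dir="motif_HDS",
    pdb_dir="/opt/pdb",
)
print(shlex.join(command))
```

Passing the list directly to subprocess.run(command) avoids shell quoting issues, for example with Windows paths.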
To run the example on Linux systems execute the following command in the command shell:
For Windows systems it is sufficient to execute:
Example Catalytic Site Atlas
An example of an iterative Fit3D search for Catalytic Site Atlas (CSA) derived motifs is included in the software package. It simulates screening against a set of modeled Dengue virus protein structures derived from ModBase to discover and annotate function.
To run the CSA example on Linux systems execute the following command in the command shell:
For Windows systems it is sufficient to execute:
Sometimes the heap size allocated by Java is not sufficient for processing many large structures and an OutOfMemoryError is thrown. In that case it can be useful to increase the maximum heap size manually by passing the -Xmx parameter (e.g. -Xmx4g) to the Java virtual machine.
- Operating system(s): platform independent
- Java Runtime Environment (JRE) 1.8 or higher
- at least 4 GB system memory recommended
- only for p-value calculation: R installation in your path with the following packages installed: sfsmisc
- implemented new motif extraction wizard (option -X)
- mapping of Pfam accession codes
- the "reference population" parameter of the Fofanov et al. model can now be set (option -N)
- Java Runtime Environment (JRE) 1.8 is now necessary
- many bug fixes and improvements
- new main features: p-value calculation, classification, clustering, pre-filter environment
- reimplementation of core algorithm
- many bug fixes
- huge performance improvement for larger motifs (up to 10 times faster)
- new features added (ignore missing atoms, toggle PDB directory split)
- performance improvement
- several bugs fixed
Kaiser, F., Eisold, A. & Labudde, D. (2015) A novel algorithm for enhanced structural motif matching in proteins, J. Comput. Biol., 22, 698-713.