Welcome to the SURPASS wiki page! The method outline is provided on the SURPASS OVERVIEW PAGE.
1. SURPASS representation
SURPASS (Single United Residue per Pre-Averaged Secondary Structure fragment) is a new low-resolution coarse-grained model for protein simulations that offers an alternative to existing coarse-grained models. The deep simplification of the SURPASS representation yields a substantial computational speed-up.
The representation of SURPASS assumes a very drastic simplification of the protein structure. A single interaction centre in this model corresponds to one all-atom residue, and the number of pseudo atoms in the SURPASS chain is N - 3, where N is the length of the amino acid sequence. The united residue of SURPASS encodes the properties of four consecutive alpha carbons and is located in their centre of mass.
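The projection described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `surpass_from_ca` is hypothetical, and all alpha carbons are assumed to have equal mass, so the centre of mass reduces to the arithmetic mean.

```python
def surpass_from_ca(ca_coords):
    """Project a C-alpha trace onto SURPASS pseudo atoms.

    Each pseudo atom is the mean position of four consecutive alpha carbons
    (equal masses assumed), so N residues give N - 3 pseudo atoms.
    """
    n = len(ca_coords)
    if n < 4:
        raise ValueError("at least four alpha carbons are required")
    pseudo_atoms = []
    for i in range(n - 3):
        window = ca_coords[i:i + 4]
        # Average each Cartesian component over the 4-residue window.
        centre = tuple(sum(atom[k] for atom in window) / 4.0 for k in range(3))
        pseudo_atoms.append(centre)
    return pseudo_atoms


# Toy input: six alpha carbons on a straight line, 3.8 A apart.
ca_trace = [(3.8 * i, 0.0, 0.0) for i in range(6)]
pseudo = surpass_from_ca(ca_trace)
print(len(pseudo), "pseudo atoms from", len(ca_trace), "residues")
```

The sliding window moves by one residue at a time, which is why neighboring pseudo atoms share three of their four alpha carbons and the chain comes out so smooth.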
Averaging of 4-residue fragments of the structure is not accidental - it leads to almost linear shapes of secondary structure elements, both α-helix and β-strand, as well as significant smoothing of unstructured fragments (loops, turns). Statistical analysis of the local geometry of SURPASS chains showed that pseudo atoms differ according to the type of secondary structure. These differences manifest themselves mainly in the size and shape of the excluded volume of pseudo atoms, which quite realistically recreates the spatial packing of the structure.
Helices are much thicker than β-strands, the packing density of helical pseudo atoms is the highest, and their excluded volume remains isotropic. In contrast, the anisotropy of the excluded volume for β-strands encodes several rules controlling the fold of globular proteins containing beta-type elements. The pseudo atoms in the ribbon are ellipsoidal: the shorter radius lies in the β-sheet plane and can be identified with the vector of a coarse-grained hydrogen bond between neighboring strands, while the longer radius is perpendicular to the plane of the β-sheet and corresponds to the vector of contact interaction between strands lying in different β-sheets, e.g. in the architecture of a multilayered sandwich. On this basis, 3 types of pseudo atoms are distinguished in the SURPASS representation:
- H, helix
- S, β strand
- C, loop
Pseudo bonds are flexible and their length oscillates around the expected mean values characteristic for particular fragments of the secondary structure.
2. SURPASS Force Field
The SURPASS model employs a knowledge-based statistical force field, which consists of a set of general, sequence-independent short- and long-range potentials. In the basic variant of the SURPASS force field, sequence dependence is defined only by assigning the preferred secondary structure in the three-letter code (H, E, C) to pseudo atoms. The all-atom secondary structure pattern is averaged in the same way as the representation, with the rule that 3 consecutive positions in the 4-residue fragment must have the same type of secondary structure. The solvent is treated implicitly, and its influence is taken into account in the statistical potentials, which describe the multibody interactions between the united atoms.
A group of short-range distance (R12, R13, R14, R15) and planar angle (A13) potentials, which depend on the secondary structure at the interacting positions, controls the local geometry of the SURPASS chain and implicitly takes into account the hydrogen bonds between residues close in the sequence (e.g. H-bonds occurring in the main chain of a helix).
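The secondary structure averaging rule can be illustrated as follows. This is a sketch under stated assumptions: the function name `surpass_secondary_structure` is hypothetical, and the fallback to coil (C) when no three consecutive residues in the window agree is an assumption, since the text does not say what happens in that case.

```python
def surpass_secondary_structure(ss, default="C"):
    """Assign one secondary structure letter per SURPASS pseudo atom.

    `ss` is the per-residue string in the H/E/C code. A 4-residue window
    gets type X if it contains three consecutive residues of type X;
    otherwise it falls back to `default` (an assumption, not from the source).
    """
    result = []
    for i in range(len(ss) - 3):
        window = ss[i:i + 4]
        assigned = default
        # A 4-residue window holds exactly two overlapping triples.
        for start in (0, 1):
            triple = window[start:start + 3]
            if triple[0] == triple[1] == triple[2]:
                assigned = triple[0]
                break
        result.append(assigned)
    return "".join(result)


all_atom_ss = "CCHHHHHHCCEEEECC"
print(surpass_secondary_structure(all_atom_ss))  # prints CHHHHHCCCEEEC
```

The two triples in a window overlap in two positions, so they can never vote for different types, and the assignment is unambiguous.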
Hydrogen bonds between residues sequentially distant but close in space are treated more directly. This situation concerns mainly pairs of pseudo atoms lying in different β-strands, bound by a coarse-grained hydrogen bond and forming one β-sheet. Conditions for formation of pseudo hydrogen bonds in the SURPASS model are strongly restrictive, which forces a very regular structure of a β-sheet.
Characteristic distances between helices and β-sheets and other structural regularities resulting from the interaction of secondary structure elements are reproduced by the contact potential, containing the hard repulsion part due to the excluded volume of pseudo atoms and the soft attraction part for pairs of pseudo atoms being in spatial contact.
The last overall potential is repulsive and controls the local packing of pseudo atoms within a 6 Å radius of burial centres located near the hydrophobic core of the protein structure. Currently, the general force field contains an additional potential forcing dense packing of the central part of single-domain globular proteins, which should contain about half of the protein chain. In this way the target structures do not have to be spherical, but the hydrophobic core is preserved. The centrosymmetric component of the force field will be removed once a sequence-dependent contact potential for pairs of pseudo atoms is added.
The total interaction energy for single-domain globular proteins in the SURPASS force field is defined by the combination of the described statistical potentials. The weights of the individual components have been optimized in a series of long simulations. Interactions that are significant for stabilizing local geometry (i.e. the short-range R15 or hydrogen bond potentials) required weights several times larger than the other components.
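Assuming the combination is a weighted linear sum (the text does not state the functional form explicitly), the total energy can be written as below. The term labels are shorthand introduced here for the potentials described in this section, and the weights w_k are the optimized values mentioned above, which are not listed on this page.

```latex
E_{\mathrm{total}} = \sum_{k} w_k \, E_k ,
\qquad
k \in \{ \mathrm{R12},\ \mathrm{R13},\ \mathrm{R14},\ \mathrm{R15},\ \mathrm{A13},\
         \mathrm{HB},\ \mathrm{contact},\ \mathrm{burial},\ \mathrm{centrosymmetric} \}
```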
Knowledge-based statistics of the SURPASS force field can be downloaded from here.
3. SURPASS Reconstruction to CA-trace
Reconstruction of SURPASS representations to higher resolution levels is not trivial due to the strongly averaged nature of the model. Using SUReLib (SURPASS Rebuild Library of fragments), it is possible to switch between the SURPASS model and the chain composed of alpha carbons. The local geometry of the rebuilt structures reproduces the regularities observed in known experimental structures. The orientation of alpha carbons in the secondary structure fragments of β-type is also retained. This means that neighboring pseudo atoms in a single strand have opposite orientations relative to the β-sheet surface, while pseudo atoms lying in neighboring strands within one β-sheet and connected by a coarse-grained hydrogen bond have the same orientation. Further reconstruction from the level of the Cα-chain to the complete main chain or full-atomic detail is a solved problem. In this context, the structural accuracy of the model is in the range of 2 - 3 Å. This is an acceptable resolution range for known all-atom structure optimisation protocols.
3.1 SUReLib - SURPASS Rebuild Library
The SUReLib library of fragments is composed of 300 pairs of unique 5-residue long fragments in the SURPASS representation and the corresponding 6-residue long fragments of the chain made of alpha carbons. The repository is divided into 3 categories according to the type of secondary structure of the fragment: 198 helical fragments, 88 beta-type fragments, and 14 mainly unstructured fragments or loops. To build this library, all chains from the PISCES_4600 library were projected onto the SURPASS representation. Using geometric criteria for secondary structure assignment in the SURPASS representation, we obtained a large number of appropriate fragment pairs. The content of SUReLib was selected by clustering all 24106 observations and picking representatives of the densest clusters. Finally, the fragments were subjected to rototranslational superposition, and the SURPASS and Cα pairs were ordered to simplify further use of the library in chain reconstruction procedures.
Knowledge-based SURPASS Rebuild Library SUReLib can be downloaded from here.
SUReLib file content:
T - type of secondary structure: 0 - helix, 1 - β-strand, 2 - coil
from S1 to S5 - coordinates of 5-residue long SURPASS fragment
from CA1 to CA6 - coordinates of 6-residue long Cα-fragment
N - counts
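A record with the fields listed above could be read as sketched below. The on-disk layout is an assumption: the sketch supposes one whitespace-separated record per line in the order T, S1..S5 (x, y, z each), CA1..CA6 (x, y, z each), N, which gives 35 values. Check the actual file before relying on it. The names `SureLibEntry` and `parse_surelib_line` are hypothetical.

```python
from typing import List, NamedTuple, Tuple

Point = Tuple[float, float, float]
SS_TYPES = {0: "helix", 1: "strand", 2: "coil"}  # codes from the list above


class SureLibEntry(NamedTuple):
    ss_type: int          # T
    surpass: List[Point]  # S1..S5
    ca: List[Point]       # CA1..CA6
    count: int            # N


def parse_surelib_line(line):
    """Parse one record, assuming 35 whitespace-separated values per line."""
    fields = line.split()
    if len(fields) != 35:
        raise ValueError(f"expected 35 fields, got {len(fields)}")
    values = [float(x) for x in fields[1:34]]
    # Group the 33 coordinates into 11 points: 5 pseudo atoms, then 6 C-alphas.
    points = [tuple(values[i:i + 3]) for i in range(0, 33, 3)]
    return SureLibEntry(int(fields[0]), points[:5], points[5:], int(fields[34]))


# Made-up record, only to exercise the parser.
demo_line = "0 " + " ".join(str(float(i)) for i in range(33)) + " 17"
entry = parse_surelib_line(demo_line)
print(SS_TYPES[entry.ss_type], len(entry.surpass), len(entry.ca), entry.count)
```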
3.2 CA-trace reconstruction algorithm
The reconstruction procedure begins by selecting from the SUReLib library the 5-residue long fragment in the SURPASS representation that best fits any secondary structure fragment along the entire protein chain. The matching criterion is the minimum root-mean-square deviation of the corresponding pseudo atom positions between a 5-residue long fragment of the SURPASS chain and a fragment from the library. The best template for the reconstruction site is searched with a shift of one pseudo residue along the sequence, so that the entire length of the protein chain is scanned. The selected fragment is used to find the positions of four alpha carbons in the reconstructed chain. These are defined by the coordinates of the four central alpha carbons of the corresponding 6-residue Cα-fragment from the library. The Cα-trace reconstruction then continues in both directions along the protein chain, with a shift of one residue in each step. The final positions of the reconstructed alpha carbons are averages over the overlapping fragments from the database. The default reconstruction procedure starts from a 'seed', the best fitting SUReLib fragment along the reconstructed SURPASS structure. Because of the deep coarse-graining of SURPASS chains, usually several (15-20) pairs of SUReLib fragments are similarly accurate.
The reconstruction of more distorted structures, although based on the same fragment database and sequential rebuilding of Cα-traces as applied to native-like chains, needs some extensions. First, the reconstruction process starts not from a randomly selected position along the chain, but from a fragment that is expected to have a regular local structure, an α-helix or β-strand.
The reconstruction of helical fragments includes an additional optimisation step that superimposes the already rebuilt neighboring alpha carbons onto the best fragment from the SUReLib library.
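The window scan and seed selection can be sketched as follows. Note the substitution: the text uses coordinate RMSD after superposition, whereas this sketch uses distance RMSD (dRMSD) over internal pairwise distances, which needs no superposition but cannot distinguish a fragment from its mirror image. All function names and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations


def pair_distances(points):
    """All pairwise distances within a fragment, in a fixed order."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]


def drmsd(frag_a, frag_b):
    """Distance RMSD: a superposition-free stand-in for coordinate RMSD."""
    da, db = pair_distances(frag_a), pair_distances(frag_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)) / len(da))


def best_fragments(chain, library, size=5):
    """For each window of `size` pseudo atoms, shifted by one position,
    return (window start, index of best library fragment, its score)."""
    hits = []
    for start in range(len(chain) - size + 1):
        window = chain[start:start + size]
        scores = [drmsd(window, frag) for frag in library]
        best = min(range(len(library)), key=scores.__getitem__)
        hits.append((start, best, scores[best]))
    return hits


def pick_seed(hits):
    """The seed is the window with the lowest score along the whole chain."""
    return min(hits, key=lambda hit: hit[2])


# Toy data: two straight "library" fragments with different spacing (made up).
library = [[(1.5 * i, 0.0, 0.0) for i in range(5)],
           [(3.0 * i, 0.0, 0.0) for i in range(5)]]
chain = [(0.0, 3.0 * i, 0.0) for i in range(7)]  # rotated copy of spacing 3.0
hits = best_fragments(chain, library)
print(pick_seed(hits))
```

In a full reconstruction, the seed's library partner would supply the first alpha carbons, and the overlapping placements from neighboring windows would then be averaged as described above.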
Reconstruction of beta type fragments is more complicated. It takes into account local geometry of SURPASS folds (partly folded structures) and corrects fitting of fragments according to geometrical definition of strand-strand hydrogen pseudo-bonds and related exposed-buried patterns of amino acid residues (see Figure 4 (B) and SURPASS force field descriptions from previous publications). Taking these features into account is particularly important for the correct reconstruction of the orientation of side groups in the neighboring β-strands.
3.3 Quality of obtained CA-trace
Reconstruction of CG-native chains (SURPASS to Cα-trace) is quite accurate, and the reconstructed Cα-traces usually differ from the PDB data by 0.2 - 0.5 Å. The accuracy of this representation depends slightly on fold type and does not depend on protein size. Reconstruction of protein structures lacking any well-defined (sufficiently long) secondary structure fragments may be more difficult. The main chain geometry of SURPASS models is quite accurate, and most models from SURPASS simulations exhibited protein-like packing geometry after reconstruction. In most cases, the correct orientation of side chains in beta-type secondary structure fragments is also maintained.
3.4 Reconstruction to higher resolution
Reconstruction of Cα-traces from deeply coarse-grained SURPASS chains is a first step towards more detailed structural models. Cα-traces are good starting structures for higher resolution modeling using UNRES, CABS, ROSETTA or other models of protein structure. Reconstruction of main chain atoms, starting from realistic Cα-traces, is relatively easy, while rebuilding the all-atom structure usually requires some simulations that optimize local packing. Several efficient methods can be used for this purpose, such as PD2, BBQ, SABBAC, REMO, or PULCHRA.
4. SURPASS simulations
SURPASS sampling uses different Monte Carlo dynamics schemes. Isothermal simulations, simulated cooling and annealing, replica exchange simulations, and other conformational space sampling strategies, including molecular dynamics (after minor modifications of the force field equations to a continuous form), are possible.
The model implements the simplest possible scheme of MC dynamics, using long sequences of local random moves, which give a realistic picture of larger scale movements. Local random moves of short chain fragments, consisting of one or odd multiples of pseudo atoms along the sequence, are used as a sampling method. The range of motion during the simulation is dynamically changed (usually ~0.15 Å at the lowest temperatures and ~1.0 Å at the highest). The specific simplification of the SURPASS representation allows effective sampling of conformational space even for very large proteins. The algorithm can be sped up significantly if slightly longer fragments and a replica exchange scheme are used.
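A Metropolis scheme with single pseudo atom moves and a temperature-dependent move range can be sketched as below. Several parts are assumptions made for illustration: the linear interpolation of the move range between the quoted ~0.15 Å and ~1.0 Å limits, reduced temperature units with the Boltzmann constant set to 1, and the energy function `toy_energy`, which is a harmonic stand-in and not the SURPASS force field.

```python
import math
import random


def step_size(temperature, t_min, t_max, d_min=0.15, d_max=1.0):
    """Move range in Angstroms, linearly interpolated between the lowest
    and highest temperature (the interpolation rule is an assumption)."""
    frac = (temperature - t_min) / (t_max - t_min)
    frac = min(1.0, max(0.0, frac))
    return d_min + frac * (d_max - d_min)


def toy_energy(chain, bond_length=3.0):
    """Hypothetical stand-in energy: harmonic restraints on pseudo bonds.
    The 3.0 A reference length is arbitrary, not a SURPASS parameter."""
    return sum((math.dist(chain[i], chain[i + 1]) - bond_length) ** 2
               for i in range(len(chain) - 1))


def mc_sweep(chain, temperature, max_shift, energy, rng):
    """One sweep of single pseudo atom moves with Metropolis acceptance.
    Modifies `chain` in place and returns the number of accepted moves."""
    accepted = 0
    current = energy(chain)
    for _ in range(len(chain)):
        i = rng.randrange(len(chain))
        old = chain[i]
        chain[i] = tuple(c + rng.uniform(-max_shift, max_shift) for c in old)
        delta = energy(chain) - current
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current += delta
            accepted += 1
        else:
            chain[i] = old  # reject: restore the previous position
    return accepted


rng = random.Random(42)
chain = [(3.0 * i, 0.0, 0.0) for i in range(10)]
t_min, t_max = 0.5, 2.0
for temperature in (2.0, 1.25, 0.5):  # simple cooling schedule
    shift = step_size(temperature, t_min, t_max)
    accepted = mc_sweep(chain, temperature, shift, toy_energy, rng)
    print(f"T={temperature:.2f} shift={shift:.2f} A accepted={accepted}/{len(chain)}")
```

Recomputing the full energy after every move keeps the sketch short. A practical implementation would evaluate only the terms affected by the moved pseudo atom.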
5. Multiscale modeling using SURPASS
SURPASS efficiently samples all important areas of the conformational space and reproduces native topologies with surprisingly good accuracy at this level of coarse-graining. The current SURPASS force field forces polypeptide chains to collapse into compact protein-like structures. In most cases, however, the near-native structures, although usually present in simulation trajectories, are energetically similar to other compact structures with similar secondary structure content and are still indistinguishable from their mirror images.
The simplicity of the model and the resulting extension of the accessible time scale and system size allow very fast simulations of even larger proteins. This model can provide large sets of low resolution protein-like structures, which can be used as starting structures for more accurate methods. Therefore, the SURPASS model opens up the possibility of effective multiscale modelling of dynamics and structural transformations in long time simulations for large proteins or their complexes, which are currently beyond the reach of the available higher resolution coarse-grained methods. It is relatively easy to extend the range of application of the model to membrane and multidomain proteins and other complexes of biomolecular systems.
6. Papers on SURPASS development and applications
Aleksandra E. Dawid, Dominik Gront, and Andrzej Kolinski, Coarse-Grained Modeling of the Interplay between Secondary Structure Propensities and Protein Fold Assembly, J. Chem. Theory Comput. 2018, 14, 4, 2277-2287
Sebastian Kmiecik, Maksim Kouza, Aleksandra E. Badaczewska-Dawid, Andrzej Kloczkowski, Andrzej Kolinski, Modeling of Protein Structural Flexibility and Large-Scale Dynamics: Coarse-grained Simulations and Elastic Network Models, Int. J. Mol. Sci. 2018, 19, 3496
7. Reference protein datasets
The PISCES_4600 database contains 4600 polypeptide chains with lengths from 20 to 1193 amino acid residues, which are a representative and nonredundant subset of all families of known protein structures. The resolution of PISCES structures is not worse than 1.6 Å and sequence similarity does not exceed 60%. These data were used for statistical analysis, which resulted in reliable and universal statistical potentials for the SURPASS model. These structures were also used to prepare the SUReLib library of fragments, and 2650 of them (DUN_2650) were also used to optimise and test the reconstruction algorithm.
A collection of 195 chains from known globular protein structures, covering proteins of various architecture (alpha horseshoe, helix bundle, roll, sandwich, barrel), size (48-559 amino acid residues), and secondary structure content. The proteins from this set were used to model protein structure with the SURPASS model and to test the SUReLib reconstruction algorithm.
A collection of known protein structures containing 62 chains, which was developed by David Baker's team and is one of the basic test sets for ROSETTA. It was used to evaluate the effectiveness of the SUReLib algorithm applied to reconstructing alpha carbon positions from the coarse-grained SURPASS representation.