CABSdock / Home

tools-cabsdock.png

Welcome to the CABSdock wiki page! Installation instructions and an outline of the method are provided on the CABSdock OVERVIEW PAGE.

Table of contents

1. CABSdock modeling scheme

1.1 Pipeline

1.2 Example simulation movies

1.3 CABS simulation engine

1.4 Average simulation time

1.5 Papers on CABSdock development and applications

2. CABSdock options

2.1 Basic options

2.2 Protein structure input options

2.3 Peptide input options

2.4 Distance restraints options

2.5 Simulation options

2.6 All-atom reconstruction options

2.7 Results analysis options

2.8 Output options

2.9 Miscellaneous options

2.10 Options' index

3. Ready-to-use examples

3.1 Default docking and demo directory

3.2 Default docking, peptide sequence from PDB

3.3 Docking with contact information

3.4 Flexible protein loops

3.5 Intrinsically unstructured protein regions

3.6 Docking multiple peptides

3.7 Modifying protein restraints

3.8 Sampling near native binding modes

3.9 Calculating ligand-RMSD values to a reference complex

3.10 Refinement of CABS-dock models using Rosetta FlexPepDock

3.11 Docking to GPCRs

4. Output models

5. Output plots and additional analysis to reference complex

5.1 RMSD plot analysis

5.2 RMSD additional analysis

5.3 Contact map and contact histogram plot analysis

5.4 Handling of not identical input and reference models

6. Additional docking analysis

6.1 Analysis of an already finished simulation

6.2 Analysis with PyMOL plugin

7. CABSdock scoring

8. Advanced CABS data

9. CABSflex simulations of protein fluctuations


1. CABSdock modeling scheme

1.1 Pipeline

CABSdock is an efficient simulation method for protein-peptide docking. The method can simulate significant conformational changes during the docking search for a binding site. The CABSdock standalone package allows every simulation step to be controlled and modified. The picture below shows the CABSdock pipeline with default settings.

CABS-dock-pipeline4-800.png

1.2 Example simulation movies

The movie below shows an example trajectory from protein-peptide docking using CABSdock. Only 1 trajectory (system replica) out of 10 trajectories (system replicas) is presented. The docking was performed with default CABSdock settings:

Video 1

The movie below shows example trajectories from protein-peptide docking using CABSdock. It shows 10 trajectories (system replicas) and 1 selected trajectory together with RMSD analysis. The docking was performed with default CABSdock settings:

10-replica-YT.png

The movie below shows example simulation snapshots from the CABSdock study on molecular docking with large-scale conformational changes, the p53-MDM2 interaction. For details, see Sci Rep 6, 37532 (2016).

scirep-movie.png

1.3 CABS simulation engine

The CABSdock method uses an efficient simulation engine: the CABS coarse-grained protein model. The picture below compares the all-atom representation (left) with the CABS coarse-grained representation (right) for an example 4-residue protein fragment. In CABS, a single amino acid is represented by 4 atoms (or pseudo-atoms): C-alpha (CA), C-beta (CB), the center of mass of the side-chain group (SC) and the center of the peptide bond (cp).

Note that the CABSdock modeling scheme allows distance restraints to be applied or modified between selected CA atoms or between selected SC pseudo-atoms.

representation.png

CABS design and applications have been recently described in the review: Chemical Reviews, 116:7898–7936, 2016

1.4 Average simulation time

The plot below presents the average simulation time (in hours), together with the minimum and maximum job times, as a function of the protein-peptide system size (receptor + peptide), for the default number of simulation cycles (50), with each job run on a single 2.5 GHz processor. The plot is based on a few months of data from the CABS-dock web server.

time-cabs-dock.jpg

1.5 Papers on CABSdock development and applications

Papers describing the CABS-dock server and its example applications:

2. CABSdock options

2.1 Basic options

Click on an option link to read full description

2.2 Protein structure input options

Note that CABS-dock uses a cache directory (default location: ~/cabsPDBcache) to keep PDB files downloaded from the PDB database.

Click on an option link to read full description

2.3 Peptide input options

Click on an option link to read full description

2.4 Distance restraints options

Click on an option link to read full description

2.5 Simulation options

Click on an option link to read full description

  • -a, --mc-annealing NUM - sets number of Monte Carlo temperature annealing cycles to NUM (NUM > 0, default value = 20, changing default value is recommended only for advanced users).
  • -y, --mc-cycles NUM - sets number of Monte Carlo cycles to NUM (NUM>0, default value = 50).
  • -s, --mc-steps NUM - sets number of Monte Carlo cycles between trajectory frames to NUM (NUM > 0, default value = 50).
  • -r, --replicas NUM - sets number of replicas to be used in Replica Exchange Monte Carlo (NUM > 0, default value = 10, changing default value is recommended only for advanced users).
  • -D, --replicas-dtemp DELTA - sets temperature increment between replicas (DELTA > 0, default value = 0.5).
  • -t, --temperature TINIT TFINAL - sets temperature range for simulated annealing: TINIT - initial temperature, TFINAL - final temperature (default values TINIT = 2.0, TFINAL = 1.0).
  • -z, --random-seed SEED - sets seed for random number generator.

2.6 All-atom reconstruction options

CABS-dock uses the Modeller tool to reconstruct top-scored models from C-alpha to all-atom resolution (see the Modeller reconstruction script: https://bitbucket.org/lcbio/ca2all/). Note that the current version of the CABS-dock reconstruction protocol automatically reconstructs chain breaks in the receptor structure. If this is not the desired behavior, we recommend using your own reconstruction protocol (which may combine different available tools, for example Pulchra, PD2, SABBAC, SCWRL4 or a modified Modeller script). By the end of March 2019, we plan to provide a new, improved and customizable protocol for CABS-dock all-atom reconstruction that, among other new features, will enable preservation of chain breaks. The new reconstruction protocol, together with all-atom refinement options, will be described in a new article in Methods in Molecular Biology and announced in the CABS-dock repository.

Click on an option link to read full description

2.7 Results analysis options

Click on an option link to read full description

2.8 Output options

Click on an option link to read full description

  • -S, --save-cabs-files - Save CABSdock simulation file. The filename has the format yymmddHHMMSS<RANDOM 6-CHARACTER STRING>.cbs, for example: 181116161924knWPtn.cbs
  • -L, --load-cabs-files FILE - Load CABSdock simulation file(.cbs). This option allows for repeated scoring and analysis of CABSdock trajectories (with new settings, for example using a reference complex structure).
  • -C, --save-config - Save simulation parameters in config file.
  • -o, --pdb-output SELECTION - Select structures to be saved in the pdb format.

2.9 Miscellaneous options

Click on an option link to read full description

2.10 Options' index

-A, --aa-rebuild

Rebuild final models to all-atom representation. (default: True)


-P, --add-peptide PEPTIDE CONFORMATION LOCATION

Adds a peptide to the complex. This option can be used multiple times to add multiple peptides.

PEPTIDE must be either:

  • amino acid sequence in one-letter code (optionally annotated with secondary structure: H - helix, E - sheet, C - coil), e.g. -p HKILHRLLQD:CHHHHHHHHC loads the HKILHRLLQD peptide sequence with the secondary structure assignment CHHHHHHHHC

HINT: If possible, it is always recommended to use secondary structure information/prediction. For residues with ambiguous secondary structure prediction assignment it is better to assign coil (C) than the regular (H - helix or E - extended) type of structure.

  • pdb file (may be gzipped)

  • pdb code (optionally with chain_id, e.g. 1abc:D)

CONFORMATION sets initial conformation of the peptide. Must be either:

  • random - random conformation is generated (default)

  • keep - preserve conformation from file. This has no effect if PEPTIDE=SEQUENCE.

LOCATION sets initial location for the peptide. Must be either:

  • random - peptide is placed in a random location on the surface of a sphere centered at the protein's geometric center, at the distance defined by the --separation option from the surface of the protein.

  • keep - preserve location from file. This has no effect if PEPTIDE=SEQUENCE.

  • patch - list of protein residues (e.g. 123:A+125:A+17:B). The peptide will be placed above the geometric center of the listed residues, at the distance defined by the --separation option from the surface of the protein. WARNING: residues listed in the patch should be on the surface of the protein and close to each other.


--align METHOD

Method to be used to align target and peptides with reference. Available options are:

  • SW -- Smith-Waterman (default)

  • blastp -- protein BLAST (requires NCBI+ package installed)

  • trivial -- simple sequential alignment, useful only to speed up run (by omitting Smith-Waterman algorithm) in case of obvious one-chain input and reference of the same length (e.g. when input and reference are the same file).

  • CSV -- loads alignment from a given file (passed as the alignment setting called fname) in the format described by Berbalk et al. in 2009.


--alignment-options

Options to be passed to the alignment method. If --alignment-peptide-options is also given, these options apply only to the target alignment; otherwise they apply to both the target and the peptide alignments.

CABSdock --align blastp --alignment-options task=short-task

--alignment-peptide-options

Options to be passed to method aligning peptides. If this option is passed, options given to --alignment-options are ignored during peptide alignment.


--ca-rest-add RESI RESJ DIST WEIGHT

Adds a distance restraint between the C-alpha (CA) atom in residue RESI and the CA atom in residue RESJ.

DIST is the distance between these atoms and WEIGHT is the restraint weight from the range [0, 1].

In order to add restraints between the peptide and the protein, or between two peptides, use PEP1, PEP2, ... as chain identifiers of the peptides (even when peptide is read from a pdb file its chain identifier is ignored).

Example:

  • 123:A 5:PEP1 8.7 1.0 adds a restraint between the CA atom of the residue number 123 in the chain A of the protein and the CA atom of the 5th residue of the peptide.

Comments:

  • If you add only one peptide, both PEP and PEP1 are valid chain identifiers.

  • If you add multiple peptides they will be ordered as follows:

    1. from config file added by the peptide option
    2. from config file added by the add-peptide option
    3. from command line added by the --peptide option
    4. from command line added by the --add-peptide option
  • Peptides added by the same method preserve the order by which they appear in the config file, or on the command line.

  • Can be used multiple times to add multiple restraints.


--ca-rest-file FILE

Reads CA restraints from a file (use multiple times to add multiple files).


--ca-rest-weight WEIGHT

Sets a global weight for all CA restraints (including automatically generated restraints for the protein) (default: 1.0)


--clustering-iterations NUM

Set the number of iterations of the clustering k-medoids algorithm (default: 100).


-k, --clustering-medoids NUM

Sets the number of medoids in the k-medoids clustering algorithm. This option also sets the number of final models to be generated. (default: 10)


-c, --config CONFIG

Reads options from the configuration file CONFIG


--contact-map-colors COLORS

Sets 6 colors (hex code, e.g. #00FF00 for green etc.) to be used in contact map color bars.


-M, --contact-maps

Store contact maps matrix plots and histograms of contact frequencies.


--contact-threshold-aa DIST

Set contact distance between heavy atoms for contact map plotting (all-atom top scored models only). (default: 5.5 Angstroms)


-T, --contact-threshold DIST

Set contact distance between side-chain pseudo-atoms (SC) for contact map plotting. (default: 6.5 Angstroms)


--dssp-command PATH

Use the provided path to the dssp binary.

CABS-dock requires the DSSP program in order to assign the secondary structure to the protein receptor's residues. We recommend installation of the standalone DSSP program to be used with CABS-dock '--dssp-command' option. As a fallback, we have implemented a module which communicates with the DSSP server when no local DSSP binary is available, however recently the server's performance has been unstable, resulting in jobs getting stuck. In order to install the standalone DSSP program follow instructions available here.


-e, --exclude RESIDUES

Excludes protein residues listed in RESIDUES from the docking search, thereby enforcing a more effective search in other areas of the protein surface. For example, it may be known that some parts of the protein are not accessible to the peptide (due to binding to other proteins), so it can be useful to exclude these regions from the search procedure.

RESIDUES must be a single string of characters (no whitespace) consisting of residue identifiers (e.g. 123:A) or chain identifiers (e.g. A) joined with the + sign. The - sign may also be used to specify a continuous range of residues or chains.

Examples:

  • -e 123:A excludes residue 123 from chain A
  • -e 123:A+125:A residues 123 and 125 from chain A
  • -e 123:A-125:A residues 123, 124 and 125 from chain A
  • -e A whole chain A
  • -e A+C chains A and C
  • -e A-C chains A, B and C

Adding @PEP<N> at the end of the string limits the exclusion to the N-th peptide only, e.g. -e 123:A@PEP1 will exclude residue 123 in chain A from binding with the first peptide only. If @PEP<N> is omitted, the exclusion list affects all peptides.

This option can be used multiple times to add multiple sets of excluded residues.
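The RESIDUES syntax described above can be illustrated with a small Python parser. This is only a sketch of the documented grammar, not the parser CABSdock itself uses: the function name `parse_exclude` and its return shape are hypothetical, chain ranges are assumed to run over consecutive single-letter chain IDs, and residue ranges are assumed to stay within one chain and to use non-negative residue numbers.

```python
def parse_exclude(selection):
    """Expand a RESIDUES string such as '123:A-125:A+C@PEP1'.

    Returns (residues, chains, peptide), where residues is a list of
    (number, chain) tuples, chains is a list of whole-chain IDs and
    peptide is the N from '@PEP<N>' or None when the suffix is absent.
    """
    peptide = None
    if "@PEP" in selection:
        selection, peptide_number = selection.split("@PEP")
        peptide = int(peptide_number)

    residues, chains = [], []
    for token in selection.split("+"):
        if "-" in token:
            start, end = token.split("-")
            if ":" in start:
                # Residue range, e.g. 123:A-125:A
                first, chain = start.split(":")
                last, end_chain = end.split(":")
                if chain != end_chain:
                    raise ValueError("residue range spans two chains: " + token)
                for number in range(int(first), int(last) + 1):
                    residues.append((number, chain))
            else:
                # Chain range, e.g. A-C (assumed to mean A, B, C)
                for code in range(ord(start), ord(end) + 1):
                    chains.append(chr(code))
        elif ":" in token:
            number, chain = token.split(":")
            residues.append((int(number), chain))
        else:
            chains.append(token)
    return residues, chains, peptide


if __name__ == "__main__":
    print(parse_exclude("123:A-125:A"))
    print(parse_exclude("A+C"))
    print(parse_exclude("123:A@PEP1"))
```

Expanding the string before a run is a convenient way to check that a selection covers the residues you intended.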


--excluding-distance DISTANCE

Sets minimum distance between side chain atoms of peptide(s) and protein residues marked as excluded


-n, --filtering-count NUM

Sets the number of low-energy models from trajectories to be clustered (default 1000)


--filtering-mode MODE

Choose the filtering mode to select NUM (set by --filtering-count) models for clustering.

MODE can be either: (default: each)

  • each - models are ordered by protein-peptide(s) binding energy and top n = [NUM / R] (R is the number of replicas) is selected from EACH replica
  • all - models are ordered by protein-peptide(s) binding energy and top NUM is selected from ALL replicas combined
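The difference between the two modes can be sketched in Python. This is an illustration under stated assumptions, not the CABSdock implementation: `filter_models` is a hypothetical name, each model is reduced to a `(replica, energy)` tuple, lower binding energy is assumed to be better, and [NUM / R] is assumed to mean integer division.

```python
def filter_models(models, num, mode="each"):
    """Select low-energy models for clustering.

    models -- list of (replica_index, binding_energy) tuples
    num    -- value of --filtering-count
    mode   -- 'each' or 'all', as in --filtering-mode
    """
    if mode == "all":
        # Rank every model from every replica together, keep the top NUM.
        return sorted(models, key=lambda m: m[1])[:num]
    if mode == "each":
        replicas = sorted({replica for replica, _ in models})
        per_replica = num // len(replicas)  # assumed meaning of [NUM / R]
        selected = []
        for replica in replicas:
            ranked = sorted(
                (m for m in models if m[0] == replica), key=lambda m: m[1]
            )
            selected.extend(ranked[:per_replica])
        return selected
    raise ValueError("mode must be 'each' or 'all'")


if __name__ == "__main__":
    toy = [(0, -5.0), (0, -1.0), (1, -10.0), (1, -9.0)]
    print(filter_models(toy, 2, "each"))
    print(filter_models(toy, 2, "all"))
```

In the toy data, replica 1 holds the two lowest energies, so `all` draws both models from it, while `each` takes one model from every replica.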

--fortran-command PATH

Use the provided path to the fortran compiler binary.


--gauss-iterations NUM

Sets number of iterations of dynamic weighted-fit algorithm used for superposition of structures. This option has no effect when --weighted-fit is set to anything other than gauss. NUM = 100 by default


-h, --help

print help and exit program


--image-file-format FMT

Produce all the image files in given format.


-i, --input-protein INPUT

Loads input protein structure.

INPUT can be either:

  • PDB code (optionally with chain IDs), e.g. -i 1CE1:HL loads chains H and L of the 1CE1 protein structure downloaded from the PDB database

  • path to a local PDB file (optionally gzipped)


--insertion-attempts NUM

This option enables advanced settings for building the starting conformations of modelled complexes. It sets the number of attempts to insert the peptide while building the initial complex (default: 1000)


--insertion-clash DIST

This option enables advanced settings of building starting conformations of modelled complexes. The option sets distance in Angstroms between any two atoms (of different modeled chains) at which a clash occurs while building initial complex (default: 1.0 Angstrom)


-L, --load-cabs-files FILE

Loads CABSdock simulation files and allows for repeated scoring and analysis of CABSdock trajectories (with new settings, for example using a reference complex structure given with the --reference-pdb option).


--log

Automatically redirects output to the CABS.log file created in the working directory. It also stops the progress bar from showing at higher verbosity levels and turns off log coloring. Piping standard error will not work with this option. If the log file already exists, new output is appended to it.


-a, --mc-annealing NUM

Sets the number of Monte Carlo temperature annealing cycles to NUM (NUM > 0, default value = 20, changing the default value is recommended only for advanced users).


-y, --mc-cycles NUM

Sets the number of Monte Carlo cycles to NUM (NUM>0, default value = 50). Total number of snapshots generated for each replica/trajectory = [mc-annealing] x [mc-cycles], default: 20x50=1000.
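The snapshot arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function and argument names are hypothetical and not part of CABSdock, and the total over all replicas is simply the per-replica count multiplied by the number of replicas (10 by default).

```python
def snapshots_per_replica(mc_annealing=20, mc_cycles=50):
    """Snapshots stored for one replica: [mc-annealing] x [mc-cycles]."""
    return mc_annealing * mc_cycles


def snapshots_total(mc_annealing=20, mc_cycles=50, replicas=10):
    """Snapshots summed over all replicas (one trajectory per replica)."""
    return snapshots_per_replica(mc_annealing, mc_cycles) * replicas


if __name__ == "__main__":
    print(snapshots_per_replica())  # default settings, one replica
    print(snapshots_total())        # default settings, all replicas
```

With the defaults this gives 1000 snapshots per replica, in line with the 20x50 figure quoted above.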


-s, --mc-steps NUM

Sets the number of Monte Carlo cycles between trajectory frames to NUM (NUM > 0, default value = 50). NUM = 1 means that every generated conformation will appear in the trajectory. This option makes it possible to increase the simulation length (between printed snapshots) and doesn't affect the number of snapshots in trajectories.

loops-in-cabs-dock-wide.png


-m, --modeller-iterations NUM

Sets number of iterations for reconstruction procedure in MODELLER package (default: 3). Bigger numbers may result in more accurate models, but reconstruction will take longer.


-N, --no-protein-restraints

Do not automatically generate any protein restraints. This option has precedence over the --protein-restraints option and will overwrite any settings set by the latter. With this flag on, restraints can still be added with the --ca-rest-add or --ca-rest-file options.


-o, --pdb-output SELECTION

Select structures to be saved in the pdb format. Available options are:

  • A - all (default)
  • R - replicas
  • F - filtered
  • C - clusters
  • M - models
  • N - none

Example: -o RM - saves replicas and models
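A SELECTION string is a concatenation of the one-letter codes listed above. As a minimal sketch, the following Python helper (hypothetical, not part of CABSdock) translates a SELECTION into the names of the structure sets it requests, which can be handy when generating commands from a script.

```python
# One-letter codes as documented for -o, --pdb-output
PDB_OUTPUT_CODES = {
    "A": "all",
    "R": "replicas",
    "F": "filtered",
    "C": "clusters",
    "M": "models",
    "N": "none",
}


def describe_pdb_output(selection):
    """Return the structure sets named by a SELECTION string such as 'RM'."""
    unknown = [code for code in selection if code not in PDB_OUTPUT_CODES]
    if unknown:
        raise ValueError("unknown --pdb-output code(s): " + "".join(unknown))
    return [PDB_OUTPUT_CODES[code] for code in selection]


if __name__ == "__main__":
    print(describe_pdb_output("RM"))
```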


-p, --peptide PEPTIDE

Loads peptide sequence and optionally peptide secondary structure in one-letter code (can be used multiple times to add multiple peptides).

PEPTIDE can be either:

  • amino acid sequence in one-letter code (optionally annotated with secondary structure: H - helix, E - sheet, C - coil), e.g. -p HKILHRLLQD:CHHHHHHHHC loads the HKILHRLLQD peptide sequence with the secondary structure assignment CHHHHHHHHC

HINT: If possible, it is always recommended to use secondary structure information/prediction. For residues with ambiguous secondary structure prediction assignment it is better to assign coil (C) than the regular (H - helix or E - extended) type of structure.

  • PDB code (optionally with chain ID), e.g. -p 1CE1:P loads the sequence of chain P from the 1CE1 protein

  • path to a PDB file with peptide coordinates; only the peptide sequence is loaded from the file

--peptide PEPTIDE is an alias for --add-peptide PEPTIDE random random


-f, --protein-flexibility FLEXIBILITY

Modifies flexibility of selected protein residues:

  • 0 - fully flexible backbone,
  • 1 - almost stiff backbone (default value, given appropriate number of protein restraints),
  • >1 - increased stiffness.

FLEXIBILITY can be either:

  • a positive real number - all protein residues will be assigned flexibility equal to this number.

  • bf - flexibility for each residue is read from the beta factor column of the CA atom in the PDB input file. Note that the standard beta factors in PDB files have the opposite meaning to the CABSdock flexibility. Remember to edit the PDB file accordingly or use FLEXIBILITY = bfi.

  • bfi - each residue is assigned its flexibility based on the inverted beta factors stored in the input PDB file, so that bf = 0.0 -> f = 1.0 and bf >= 1.0 -> f = 0.0

  • <filename> - flexibility is read from file <filename> in the format of single residue entries: resid_ID <flexibility> i.e. 12:A 0.75, or residue ranges: resid_ID - resid_ID <flexibility> i.e. 12:A - 15:A 0.75

The default value for residues not explicitly specified can be set by inserting the following line at the top of the file: default <default flexibility value>. If this line is omitted, the default value is 1.0. Multiple entries can be used.
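The flexibility file format can be made concrete with a short Python reader. This is a sketch of the format as described here, not the code CABSdock uses: `parse_flexibility` is a hypothetical name, and residue ranges are assumed to stay within a single chain.

```python
def parse_flexibility(text):
    """Parse the contents of a flexibility file.

    Returns (default, flexibility), where flexibility maps
    (residue_number, chain) to a float and default applies to every
    residue that is not listed.
    """
    default = 1.0  # documented fallback when no 'default' line is given
    flexibility = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "default":
            default = float(fields[1])
        elif len(fields) == 4 and fields[1] == "-":
            # Range entry, e.g. "12:A - 15:A 0.75"
            first, chain = fields[0].split(":")
            last, end_chain = fields[2].split(":")
            if chain != end_chain:
                raise ValueError("range spans two chains: " + line)
            for number in range(int(first), int(last) + 1):
                flexibility[(number, chain)] = float(fields[3])
        elif len(fields) == 2:
            # Single-residue entry, e.g. "12:A 0.75"
            number, chain = fields[0].split(":")
            flexibility[(int(number), chain)] = float(fields[1])
        else:
            raise ValueError("unrecognized line: " + line)
    return default, flexibility


if __name__ == "__main__":
    default, flex = parse_flexibility("default 0.5\n12:A 0.75\n45:A - 54:A 0\n")
    print(default, flex.get((50, "A"), default), flex.get((1, "A"), default))
```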


-g, --protein-restraints MODE GAP MIN MAX

Generates a set of binary distance restraints for CA atoms that keep the protein in a predefined conformation (default: all, 5, 5.0, 15.0)

MODE can be either:

  • all - generates restraints for all protein residues
  • ss1 - generates restraints only when at least one restrained residue is assigned regular secondary structure (helix or sheet)
  • ss2 - generates restraints only when both restrained residues are assigned regular secondary structure (helix, sheet)

GAP specifies the gap along the main chain for the two residues to be restrained. MIN and MAX are the minimum and maximum distances in Angstroms between the two residues to be restrained.

The default setting, recommended for standard applications, is all 5 5.0 15.0
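To make the roles of MODE, GAP, MIN and MAX concrete, here is a minimal Python sketch of restraint generation for a single chain. It is an interpretation of the description above rather than the CABSdock algorithm: the function name is hypothetical, GAP is assumed to be the minimum sequence separation between the two residues, and MIN/MAX are assumed to bound the CA-CA distance in the input structure.

```python
import math


def generate_ca_restraints(ca_coords, mode="all", gap=5, dmin=5.0, dmax=15.0,
                           secondary=None):
    """Return (i, j, distance) tuples for residue pairs to be restrained.

    ca_coords -- list of (x, y, z) CA coordinates for one chain
    secondary -- string of H/E/C codes, required for modes ss1 and ss2
    """
    if mode not in ("all", "ss1", "ss2"):
        raise ValueError("mode must be 'all', 'ss1' or 'ss2'")
    regular = set("HE")  # helix or sheet
    restraints = []
    for i in range(len(ca_coords)):
        for j in range(i + gap, len(ca_coords)):
            if mode != "all":
                n_regular = (secondary[i] in regular) + (secondary[j] in regular)
                if mode == "ss1" and n_regular < 1:
                    continue
                if mode == "ss2" and n_regular < 2:
                    continue
            distance = math.dist(ca_coords[i], ca_coords[j])
            if dmin <= distance <= dmax:
                restraints.append((i, j, distance))
    return restraints


if __name__ == "__main__":
    # Toy chain: seven CA atoms on a line, 3 Angstroms apart
    coords = [(3.0 * k, 0.0, 0.0) for k in range(7)]
    print(generate_ca_restraints(coords))
    print(generate_ca_restraints(coords, mode="ss2", secondary="CHHHHHH"))
```

Under these assumptions, ss1 and ss2 produce fewer restraints than all, which leaves loops and coil regions freer to move.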


--protein-restraints-reduce FACTOR

Reduce the number of protein restraints by a FACTOR, where FACTOR is a number from [0, 1]. This option reduces the number of automatically generated restraints for the protein molecule in order to speed up computation. Restraints are randomly selected from all generated restraints, so that the final number of restraints #reduced = #all * FACTOR.
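The reduction rule #reduced = #all * FACTOR amounts to a random subsample. The Python sketch below shows the idea; the function name, the `seed` argument and the rounding (truncation to an integer) are assumptions for illustration and may differ from what CABSdock does internally.

```python
import random


def reduce_restraints(restraints, factor, seed=None):
    """Randomly keep a fraction of the generated restraints."""
    if not 0.0 <= factor <= 1.0:
        raise ValueError("FACTOR must be within [0, 1]")
    rng = random.Random(seed)  # local generator, so runs are reproducible
    keep = int(len(restraints) * factor)  # assumed rounding: truncate
    return rng.sample(restraints, keep)


if __name__ == "__main__":
    all_restraints = [(i, i + 5) for i in range(10)]
    print(reduce_restraints(all_restraints, 0.5, seed=1))
```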


-z, --random-seed SEED

Sets the seed for random number generator.


-R, --reference-pdb REF

Loads a reference complex structure. This option allows for comparison with the reference complex structure and triggers additional analysis features

REF must be either:

  • [pdb code]:[protein chains]:[peptide1 chain][peptide2 chain]...
  • [pdb file]:[protein chains]:[peptide1 chain][peptide2 chain]...

Examples:

  • 1abc:AB:C
  • 1abc:AB:CD
  • myfile.pdb:AB:C
  • myfile.pdb.gz:AB:CDE
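The REF format can be unpacked with a few lines of Python. This sketch only mirrors the format shown in the examples above; `parse_reference` is a hypothetical helper, and each peptide is assumed to be a single one-letter chain ID.

```python
def parse_reference(ref):
    """Split REF into (source, protein_chains, peptide_chains).

    source is a PDB code or a file name; the two chain fields are
    returned as lists of one-letter chain IDs.
    """
    # Split from the right so that the source field is left untouched.
    source, protein_chains, peptide_chains = ref.rsplit(":", 2)
    if not (source and protein_chains and peptide_chains):
        raise ValueError("REF must look like source:protein_chains:peptide_chains")
    return source, list(protein_chains), list(peptide_chains)


if __name__ == "__main__":
    print(parse_reference("1abc:AB:CD"))
    print(parse_reference("myfile.pdb.gz:AB:CDE"))
```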

-r, --replicas NUM

Sets the number of replicas to be used in Replica Exchange Monte Carlo (NUM > 0, default value = 10, changing the default value is recommended only for advanced users)


-D, --replicas-dtemp DELTA

Sets the temperature increment between replicas (DELTA > 0, default value = 0.5, changing the default value is recommended only for advanced users)


-S, --save-cabs-files

Saves CABSdock simulation files.


-C, --save-config

Save simulation parameters in config file.


-d, --separation SEP

The option sets separation distance in Angstroms between the peptide and the surface of the protein (default: 20.0 Angstroms)


--sc-rest-add RESI RESJ DIST WEIGHT

Adds a distance restraint between the SC pseudo-atom in residue RESI and the SC pseudo-atom in residue RESJ; DIST is the distance between these pseudo-atoms (the geometric centers of their side-chain atoms) and WEIGHT is the restraint weight from the range [0, 1]. Can be used multiple times to add multiple restraints.


--sc-rest-file FILE

Reads SC restraints from a file (use multiple times to add multiple files).


--sc-rest-weight WEIGHT

Sets a global weight for all SC restraints (default: 1.0)


-t, --temperature TINIT TFINAL

Sets the temperature range for simulated annealing procedure: TINIT - initial temperature, TFINAL - final temperature (default values TINIT=2.0, TFINAL=1.0).

CABSdock uses a temperature-like parameter that does not correspond straightforwardly to the real temperature. A temperature value around 1.0 roughly corresponds to a nearly frozen conformation, while the folding temperature of small proteins in the CABS model is usually around 2.0.


-V, --verbose VERBOSITY

Controls how explicit the program output is, 0 for silent mode (only critical messages), 4 for maximum verbosity, default 2.


--version

print version and exit program


--weighted-fit ARG

This option sets and customizes the way models are structurally aligned, which affects both the calculation of the RMSD/RMSF and the clustering together with the selection of the final models. Models are aligned by the Kabsch optimal fit algorithm. This option assigns weights to all atoms, which specify how 'important' each atom is in the structural fit process. Weights are numbers from the [0:1] range, with '0' meaning 'irrelevant in the fitting process'.

ARG can be either:

  • off Turns off weighted-fit (all weights are 1.0) (default).
  • gauss Weights are generated automatically in the iterative procedure described in Biophys J. 2006 Jun 15; 90(12): 4558-4573. The procedure consists of the following steps: (1) Set wi = 1.0 for i = [1,2 ... N], where N is the number of atoms. (2) Align structures using weights wi. (3) Calculate di - displacement of the i-th atom. (4) Update weights according to formula: wi = exp(-0.5 * di * di). Repeat (2) through (4) until convergence (max 100 iterations, can be changed with --gauss-iterations).
  • flex Weights are taken from the flexibility settings. (See help entry for --protein-flexibility).
  • ss Weights are taken from the secondary structure assignment. Atoms in helices and sheets are given w = 1.0, while those in loops and coil get w = 0.0.
  • <filename> Weights are read from a file <filename>. The file should follow this format:
    default 1.0 (default value, if omitted w = 1.0 is assumed)
    1:A 0.5
    5:A 0.1
    ...
    1:B 0.99
    ...
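The iterative weight update used by the gauss setting can be sketched in Python as follows. The sketch covers only the weight bookkeeping of steps (1) to (4); the structural alignment itself is passed in as a callback (`align`, a hypothetical function that takes the current weights and returns per-atom displacements). The convergence test on the largest weight change and the `tolerance` value are assumptions, since the stopping criterion is not spelled out here.

```python
import math


def gauss_weights(displacements):
    """Step (4): w_i = exp(-0.5 * d_i * d_i) for each atom."""
    return [math.exp(-0.5 * d * d) for d in displacements]


def iterate_weights(align, n_atoms, max_iterations=100, tolerance=1e-6):
    """Repeat align -> displacements -> weights until the weights settle.

    align -- callable taking a list of weights and returning a list of
             per-atom displacements after a weighted superposition
    """
    weights = [1.0] * n_atoms                         # step (1)
    for _ in range(max_iterations):
        displacements = align(weights)                # steps (2) and (3)
        new_weights = gauss_weights(displacements)    # step (4)
        change = max(abs(a - b) for a, b in zip(weights, new_weights))
        weights = new_weights
        if change < tolerance:                        # assumed criterion
            break
    return weights


if __name__ == "__main__":
    # Stand-in for a real alignment: displacements that never change.
    fixed = lambda weights: [0.0, 1.0, 2.0]
    print(iterate_weights(fixed, 3))
```

Atoms that move little between structures keep weights close to 1, while strongly displaced atoms are quickly down-weighted, so the fit is dominated by the rigid core.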
    

--work-dir DIR

Set working directory to DIR.


3. Ready-to-use examples

3.1 Default docking and demo directory

To run CABSdock using the default settings (recommended for inexperienced users) use the following syntax:

$ CABSdock -i protein-pdb-code -p peptide-sequence:peptide-secondary-structure

For example, to dock the HKLVQLLTTT peptide (with externally predicted secondary structure, CHHHHHHHCC) to the protein stored as chain A of the PDB structure 2FVJ, use

$ CABSdock -i 2FVJ:A -p HKLVQLLTTT:CHHHHHHHCC

This command will:

  • load the conformation of chain A from 2FVJ PDB file as the protein structure
  • load "HKLVQLLTTT" peptide sequence with the secondary structure assignment: "CHHHHHHHCC"
  • set default simulation settings (no knowledge about the binding site; almost rigid backbone of the protein receptor; random initial peptide conformations and positions).

To extend the outputs, it is possible to use additional flags discussed above:

$ CABSdock -i 2FVJ:A -p HKLVQLLTTT:CHHHHHHHCC -M -C -S

The outputs from this docking are available in the demo directory. The docking results are also presented in the pictures below. The surface of the protein (2FVJ) is white, and the peptide is shown in blue with the cartoon representation.

picture_1_lowquality.png

The 1000 top-scored structures are presented in light blue. The experimental structure is presented in dark blue.

picture_2_lowquality.png

The docking result (top-scored model) is presented in marine blue. The ligand-RMSD with respect to the native conformation for the presented model is 3.46 Å (l-RMSD was calculated automatically using the --reference-pdb option).

picture_4_lowquality.png

3.2 Default docking, peptide sequence from PDB

If the peptide sequence is available as a chain of any structure stored in the PDB database, it is possible to load it directly from the database using its structure ID. For example, the command:

$ CABSdock -i 2BZW:A -p 2BZW:B

loads chain A of 2BZW as the protein structure and loads the sequence of chain B of 2BZW as the peptide sequence. Using this option will not bias the results, because it does not load the secondary structure or conformation of the peptide.

An example result for this docking is presented in the picture below. The surface of the protein (2BZW:A) is white, and the peptide is shown in blue with the cartoon representation. The experimental structure is presented in dark blue and the docking result in marine blue. The ligand-RMSD with respect to the native conformation for the presented model is 2.70 Å (l-RMSD was calculated automatically using the --reference-pdb option):

picture_4_lowqality.png

3.3 Docking with contact information

It is possible to indicate preferred contacts for the complex modelled with CABS-dock. These will usually be experimentally identified contacts that are expected to be present in the resulting structures. To use this information as restraints for docking, use --sc-rest-add (restraints for side-chain to side-chain contacts) or --ca-rest-add (restraints for CA to CA contacts).

An example command to run docking with an additional restraint that enforces contact between residue 235 of protein chain E and the 6th residue of the peptide is:

$ CABSdock -i 2CPK:E -p TTYADFIASGRTGRRNAIHD:CHHHHHHHHCCCCCCCCCCC --sc-rest-add 235:E 6:PEP 5.0 1.0

The resulting set of top 1000 structures is presented below (peptide shown in light blue).

rysunek_4_lowquality.png

For comparison, an analogous set of structures is presented for a run without any contact information (a default run).

rysunek_2_lowquality.png

In the figure below, the experimental structure is presented in dark blue and the docking result in marine blue. For comparison, the best docking result for a run without contact information is presented in red. The ligand-RMSD with respect to the native conformation for the presented model is 2.70 Å (l-RMSD was calculated automatically using the --reference-pdb option, for details see below):

rysunek_6_lowquality.png

3.4 Flexible protein loops

CABSdock allows for increasing the flexibility of specified protein fragments -- for example flexible loops that cover the binding site in the unbound protein conformation.

To run a docking simulation in such a case, prepare a text file that specifies the flexible region and use the --protein-flexibility option:

$ CABSdock -i 2RTM:A -p HPQFEK:CHHHCC -f flexibility.txt

The flexibility.txt file is a one-line text file:

45:A - 54:A 0

3.5 Intrinsically unstructured protein regions

The --protein-flexibility option may also be used to simulate the behavior of intrinsically unstructured regions. To run a simulation in which a part of the protein is highly flexible, issue a command similar to:

$ CABSdock -i 1Z1M:A -p RFMDYWEGL -f flexibility.txt

where flexibility.txt is a simple text file containing ranges of increased flexibility. In this example it is:

1:A - 27:A 0
106:A - 119:A 0

This means that residues 1 to 27 and 106 to 119 of the protein will be treated as completely flexible. An animation presenting the results of this docking is shown below.

scirep-movie.png

3.6 Docking multiple peptides

The newly introduced functionality allows the user to predict binding poses of systems including multiple interacting peptides. To use it, simply use the --peptide option multiple times.

An example command is:

$ CABSdock -i 1EJL:I -p 1EJL:A -p 1EJL:B

An example result for this docking is presented in the pictures below. The surface of the protein (1EJL:I) is white, and the peptides are shown in blue with the cartoon representation. The image below presents the experimental structures.

twopeptide_picture_1_lowquality.png

The docking results - marked with light blue and green - are presented below.

twopeptide_picture_2_lowquality.png

3.7 Modifying protein restraints

It is also possible to adjust the protein rigidity to match experimental observations.

The RMSF graphs below present results obtained with option --protein-restraints set to:

  • default,

  • ss1 5 5.0 15.0,

  • ss2 5 5.0 15.0.

3.8 Sampling near native binding modes

CABS-dock may also be used to explore near-native binding modes and the dynamics of the bound complex. To do so, load a bound complex using the advanced peptide option --add-peptide with the keep keep flags, and set the maximum temperature to a lower value (to make sure the results only contain bound modes).

An example command is:

$ CABSdock -i 1AWR:C -P 1AWR:I keep keep --temperature 1.2 1.0

The input structure (1AWR chains C and I) is presented in the picture below.

local_picture_1_lowquality.png

The resulting set of near-native binding modes generated with CABSdock procedure is presented below.

local_picture_2_lowquality.png

3.9 Calculating ligand-RMSD values to a reference complex

The ligand-RMSD values for the peptide may be automatically calculated using the --reference-pdb. This option can be used while running a simulation with any other settings:

$ CABSdock -i 2FVJ -p HKLVQLLTTT:CHHHHHHHCC --reference-pdb 2FVJ:AB

where 2FVJ:AB is the reference protein-peptide complex. This option also activates additional analysis procedures. The additional output of those methods is described in 5.1 RMSD plot analysis.

3.10 Refinement of CABS-dock models using Rosetta FlexPepDock

By default, CABS-dock models are reconstructed to all-atom representation using the Modeller software. Additional structure refinement can improve the result. The pipeline proposed here uses the high-resolution FlexPepDock protocol, but other tools are also available.

For an individual case, you can use the online server: http://flexpepdock.furmanlab.cs.huji.ac.il/

Here we present a variant using the standalone version, so you need a locally installed copy of Rosetta.

INPUT PREPARATION

First, prepare the input files for Rosetta.

You need at most 3 files:

  • model.pdb, the protein-peptide complex structure produced by CABS-dock, in all-atom or backbone+CB representation (required)

  • native.pdb, if the structure of the protein-peptide complex is known (optional)

  • unbound.pdb, if the structure of the unbound receptor is known (optional)

In each file, the receptor coordinates must come first, followed by the peptide coordinates. The files should be stripped of unnecessary information such as headers: only the "ATOM" records of the selected chains should be kept. A convenient way to do this is the ready-made Python script located at: ~/Rosetta/tools/protein_tools/scripts/clean_pdb.py

USAGE:

~/Rosetta/tools/protein_tools/scripts/clean_pdb.py  model.pdb  receptor_chain_id  peptide_chain_id
e.g. ~/Rosetta/tools/protein_tools/scripts/clean_pdb.py  model.pdb A B

The output will be: model_AB.pdb.
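
For readers who want to see what this cleaning step amounts to, here is a minimal Python sketch of the rule stated above (keep only "ATOM" records of the selected chains, receptor first, then peptide). It is not the Rosetta clean_pdb.py script and does not renumber residues or fix occupancies. The function name keep_chains is hypothetical; the only format fact relied on is that the chain identifier occupies column 22 of a PDB ATOM record.

```python
def keep_chains(pdb_lines, receptor_chain, peptide_chain):
    """Keep ATOM records of two chains, receptor first, then peptide.

    Illustrative sketch only: headers, HETATM, TER and all other records
    are dropped, and nothing is renumbered.
    """
    receptor, peptide = [], []
    for line in pdb_lines:
        if not line.startswith("ATOM") or len(line) < 22:
            continue  # not a (complete) ATOM record
        chain_id = line[21]  # column 22 of the PDB format (0-based index 21)
        if chain_id == receptor_chain:
            receptor.append(line)
        elif chain_id == peptide_chain:
            peptide.append(line)
    return receptor + peptide


def clean_file(in_path, out_path, receptor_chain, peptide_chain):
    """Read a PDB file, filter it with keep_chains() and write the result."""
    with open(in_path) as handle:
        kept = keep_chains(handle.readlines(), receptor_chain, peptide_chain)
    with open(out_path, "w") as handle:
        handle.writelines(kept)
```

For example, clean_file("model.pdb", "model_AB.pdb", "A", "B") would write the receptor chain A followed by the peptide chain B.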

If you use a full-atom structure or retain the side chains, steric clashes can lead to unfavorable Rosetta energies. There are several ways to avoid this:

1) cut Cα coordinates and reconstruct with another tool, e.g. PRODART, REMO, BBQ, SAABAC

2) use the FlexPepDock protocol with -min_receptor_bb, which will allow for receptor backbone minimization

3) replace CABSdock receptor coordinates by the free receptor structure, but note that you should first align both structures using e.g. pymol or theseus software.

theseus USAGE:

theseus -sfrom-to -o reference_structure aligned_structure
theseus -s0-120 -o model_AB.pdb unbound.pdb
* -s option is the receptor residues selection

The output will be: theseus_sup.pdb – unbound receptor superposed on CABSdock receptor.

Then you should replace the coordinates of receptor:

~/Rosetta/tools/protein_tools/scripts/clean_pdb.py  model_AB.pdb B                #cut peptide coordinates
cat theseus_sup.pdb model_AB_B.pdb > model.pdb               #paste unbound receptor and peptide

INITIAL PREPACK

Before you start the actual FlexPepDock simulation, you should quickly prepack the input structure:

USAGE:

~/Rosetta/main/source/bin/FlexPepDocking.proper_compilation_version -database ~/Rosetta/main/database -s model.pdb -flexpep_prepack -ex1 -ex2aro

The output will be: model_0001.pdb

You can change file name: mv model_0001.pdb model_prepacked.pdb

FlexPepDock REFINEMENT

First, you should prepare a flexpepdock.flagfile:

#-bGDT
-nstruct 250    #how many decoys you need
-in::file::s    /PATH/model_prepacked.pdb  #input protein-peptide complex
-out:file:silent flexpepdock.silent #output silent file name
-out:file:silent_struct_type binary
-pep_refine
-ex1
-ex2aro
-use_input_sc
-unboundrot      PATH/unbound.pdb  #input unbound receptor for rotamers (not necessary)
Then, you can run FlexPepDock:

USAGE:

  ~/Rosetta/main/source/bin/FlexPepDocking.proper_compilation_version -database ~/Rosetta/main/database                 @flexpepdock.flagfile

EXTRACT PDBs

The silent file is a Rosetta output format used to store ensembles of structures. Each frame in a silent file has a unique identifier called the decoy-tag. The decoy-tag (the "description" column) appears at the end of each line that belongs to the respective frame, which makes it possible to identify and extract frames. The most popular selection criterion is the Rosetta score (the "score" column), which allows you to choose the models with the best Rosetta energy:

grep '^SCORE' flexpepdock.silent | cut -c 1-284,309- > tmp
cat tmp | sort -k1,1 -k2g  > silent_scores.sc
cat silent_scores.sc | sort -k2g | awk '{print $47}' | head -11 | tail -10 > top10.tag
* check if column 47 in silent_scores.sc is the 'description'

The extract_pdbs script extracts the structures selected by decoy-tag from a silent file and saves them in PDB format.

USAGE:

                ~/Rosetta/main/source/bin/extract_pdbs.proper_compilation_version -in:file:silent /PATH/flexpepdock.silent     -in:file:tagfile top10.tag
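
The tag selection above depends on a fixed column number (47), which has to be checked by hand. As an alternative, the columns can be looked up by name. The following is a minimal Python sketch, assuming that the first SCORE: line of the silent file is a header naming the columns and that the columns called "score" and "description" mentioned above are present (in some outputs the score column may be named differently, so the name is a parameter). The function name top_tags is hypothetical.

```python
def top_tags(silent_lines, n=10, score_column="score"):
    """Return the decoy-tags of the n lowest-scoring frames.

    Sketch under stated assumptions: the first 'SCORE:' line is treated as
    the column header, and every later 'SCORE:' line with the same number
    of fields is treated as one frame.
    """
    header = None
    rows = []
    for line in silent_lines:
        if not line.startswith("SCORE:"):
            continue
        fields = line.split()[1:]
        if header is None:
            header = fields
            if score_column not in header or "description" not in header:
                raise ValueError("expected columns not found in header")
            continue
        if len(fields) != len(header):
            continue  # malformed or truncated line
        row = dict(zip(header, fields))
        try:
            score = float(row[score_column])
        except ValueError:
            continue  # e.g. a repeated header line
        rows.append((score, row["description"]))
    rows.sort()  # lowest (best) score first
    return [tag for _, tag in rows[:n]]


def write_tagfile(silent_path, tag_path, n=10, score_column="score"):
    """Write the selected tags, one per line, for use with -in:file:tagfile."""
    with open(silent_path) as handle:
        tags = top_tags(handle, n=n, score_column=score_column)
    with open(tag_path, "w") as handle:
        handle.write("\n".join(tags) + "\n")
```

For example, write_tagfile("flexpepdock.silent", "top10.tag") would produce a tag file playing the same role as the one created by the shell commands above.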

3.11 Docking to GPCRs

Recently, we proposed a CABS-dock based protocol dedicated to modeling GPCR-peptide systems. The protocol details are provided in: Badaczewska-Dawid A, Kmiecik S, Kolinski M. Docking of peptides to GPCRs using a combination of CABS-dock with FlexPepDock refinement (submitted).

The protocol consists of three modeling stages: (1) docking of peptides to GPCRs using CABS-dock, with the peptide sampling space restricted to a spherical volume that includes all receptor fragments that may interact with bound peptides; (2) reconstruction of atomistic structures from C-alpha traces using PD2; (3) refinement of the protein-peptide complex structures and model scoring using Rosetta FlexPepDock.

Example commands

The example command lines for the peptide-GPCR complex (5GLH system):

STAGE 1: Running single docking simulation using CABS-dock

~/CABSdock -s 100 -M -C -S -v 4 -i 5GLH_struc.pdb:A -p 5GLH_struc.pdb:B --reference-pdb 5GLH_struc.pdb:A:B --ca-rest-add 1:PEP 15:PEP 5.3 1.0 --ca-rest-add 3:PEP 11:PEP 6.2 1.0 --sc-rest-add 249:A 1:PEP 30.0 5.0 --sc-rest-add 249:A 2:PEP 30.0 5.0 --sc-rest-add 249:A 3:PEP 30.0 5.0 --sc-rest-add 249:A 4:PEP 30.0 5.0 --sc-rest-add 249:A 5:PEP 30.0 5.0 --sc-rest-add 249:A 6:PEP 30.0 5.0 --sc-rest-add 249:A 7:PEP 30.0 5.0 --sc-rest-add 249:A 8:PEP 30.0 5.0 --sc-rest-add 249:A 9:PEP 30.0 5.0 --sc-rest-add 249:A 10:PEP 30.0 5.0 --sc-rest-add 249:A 11:PEP 30.0 5.0 --sc-rest-add 249:A 12:PEP 30.0 5.0 --sc-rest-add 249:A 13:PEP 30.0 5.0 --sc-rest-add 249:A 14:PEP 30.0 5.0 --sc-rest-add 249:A 15:PEP 30.0 5.0 --sc-rest-add 249:A 16:PEP 30.0 5.0 --sc-rest-add 249:A 17:PEP 30.0 5.0 --sc-rest-add 249:A 18:PEP 30.0 5.0 --sc-rest-add 249:A 19:PEP 30.0 5.0 --sc-rest-add 249:A 20:PEP 30.0 5.0 --sc-rest-add 249:A 21:PEP 30.0 5.0             
OUTPUT: model.pdb
mv model.pdb 5GLH_S1_M1.pdb
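
The STAGE 1 command repeats the same --sc-rest-add restraint for each of the 21 peptide residues. Such an argument list is easier to generate than to type. Below is a minimal Python sketch that builds only the repeated part, using the values from the command above (receptor residue 249:A, distance 30.0, weight 5.0). The function name sc_restraint_flags is hypothetical.

```python
def sc_restraint_flags(receptor_residue, peptide_length, distance, weight):
    """Build one --sc-rest-add restraint per peptide residue.

    Each restraint ties the given receptor residue to peptide residue
    i:PEP, following the pattern of the STAGE 1 command.
    """
    flags = []
    for i in range(1, peptide_length + 1):
        flags += [
            "--sc-rest-add",
            receptor_residue,
            f"{i}:PEP",
            str(distance),
            str(weight),
        ]
    return flags


# The 21 restraints used in the 5GLH example.
flags = sc_restraint_flags("249:A", 21, 30.0, 5.0)
print(" ".join(flags))
```

The printed string can be pasted after the other options of the CABSdock command, or the list can be passed to subprocess.run together with the remaining arguments.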

STAGE 2: Structure reconstruction from coarse-grained representation using PD2

~/bin/pd2_ca2main --database ./database/ -i 5GLH_S1_M1.pdb -o 5GLH_S1_M1-reconstructed.pdb --ca2main:new_fixed_ca --ca2main:bb_min_steps 500            

STAGE 3: Structure refinement using Rosetta FlexPepDock

1. Input preparation

use select_atoms.py for:

  • removing the side chains (select atoms by option -wa),
  • ordering the protein and peptide coordinates in the pdb file (receptor - first, peptide - second; use option -wc),

~/select_atoms.py -f 5GLH_S1_M1-reconstructed.pdb -wc A,B -wa CA,CB,C,O,N -o 5GLH_S1_M1-backboneCB.pdb
use rename_chains.py for:

  • renaming the protein and peptide chains: receptor - A, peptide - B,

~/rename_chains.py -f 5GLH_S1_M1-backboneCB.pdb -cho C,D -chn A,B -o 5GLH_S1_M1-chains.pdb
use ~/ROSETTA/tools/protein_tools/scripts/clean_pdb.py for:

  • renumbering the amino acid residues in the pdb file (starting from 1),
  • preparing properly formatted structure files (e.g. no 0.00 values in the occupancy column),

~<path_to_Rosetta>/tools/protein_tools/scripts/clean_pdb.py 5GLH_S1_M1-chains.pdb     #default output: 5GLH_S1_M1-chains_AB.pdb
mv 5GLH_S1_M1-chains_AB.pdb 5GLH_S1_M1-formated.pdb

2. Side chains reconstruction and pre-packing of initial complex structural components

~<path_to_Rosetta>main/source/bin/FlexPepDocking.linuxgccrelease -database <path_to_Rosetta>/main/database -s 5GLH_S1_M1-formated.pdb -flexpep_prepack -ex1 -ex2aro     #default output: 5GLH_S1_M1-formated_0001.pdb
mv 5GLH_S1_M1-formated_0001.pdb 5GLH_S1M1.pdb

3. Refinement of pre-packed initial complex structure

~<path_to_Rosetta>main/source/bin/FlexPepDocking.linuxgccrelease -database <path_to_Rosetta>/main/database @flexpepdock.flagfile
flexpepdock.flagfile content:
#-bGDT
-nstruct 300
-in::file::native <path>/ref.pdb
-in::file::s      <path>/5GLH_S1M1.pdb
-out:file:silent flexpepdock-lowres.silent
-out:file:silent_struct_type binary
-detect_disulf true
-rebuild_disulf true
-fix_disulf <path>/disulfide.dat
-lowres_preoptimize true
-pep_refine
-ex1
-ex2aro
-use_input_sc
disulfide.dat content:
293 307
295 303

3'. Minimization of pre-packed initial complex structure using Rosetta FlexPepDock (STAGE 3')

~<path_to_Rosetta>main/source/bin/FlexPepDocking.linuxgccrelease -database <path_to_Rosetta>/main/database @flexpepdock.flagfile
flexpepdock.flagfile content:
#-bGDT
-nstruct 1
-in::file::native <path>/ref.pdb
-in::file::s      <path>/5GLH_S1M1.pdb
-out:file:silent flexpepdock-minimize.silent
-out:file:silent_struct_type binary
-detect_disulf true
-rebuild_disulf true
-fix_disulf <path>/disulfide.dat
-flexPepDockingMinimizeOnly true
-ex1
-ex2aro
-use_input_sc

STAGE 4: Scoring and selecting top models

a) selecting final top-scored models using analyze_silent.sh

analyze_silent.sh content:

cat flexpepdock.silent | grep "SCORE:" | head -1 | awk '{for (i = 1; i <= NF; i++) print i, $i}' > p
t1=`cat p | grep "reweighted_sc" | awk '{print $1}'`
t2=`cat p | grep "I_sc" | awk '{print $1}'`
t3=`cat p | grep "pep_sc" | awk '{print $1}' | head -1`
t4=`cat p | grep "rmsBB_if" | awk '{print $1}'`
t5=`cat p | grep "description" | awk '{print $1}'`
rm p
echo "SCORE: total_score reweighted_sc I_sc pep_sc rmsBB_if description" > columns.sc
cat flexpepdock.silent | grep "SCORE:" | awk -v A=$t1 -v B=$t2 -v C=$t3 -v D=$t4 -v E=$t5 '{print $1,$2,$A,$B,$C,$D,$E}' >> columns.sc

#--- Select 1% top-scored models using total_score and reweighted_sc
cat columns.sc | tr . , | sort -gk2 | tr , . | head -3 > data.dat
cat columns.sc | tr . , | sort -gk3 | tr , . | head -3 >> data.dat
#--- Select final 10 top-scored models using reweighted_sc and pep_sc
cat data.dat | sort | uniq | tr . , | sort -gk3 | head -5 | tr , . > TOP10
cat data.dat | sort | uniq | tr . , | sort -gk5 | head -5 | tr , . >> TOP10
#--- Create file with tags for extracting pdbs
cat TOP10 | awk '{print $6}' > tags

b) extracting compressed structure coordinates to pdb file

~<path_to_Rosetta>main/source/bin/extract_pdbs.static.linuxgccrelease -database <path_to_Rosetta>/main/database -in:file:silent flexpepdock-lowres.silent -in:file:tagfile tags
c) calculating structural properties of final models using DockQ
~<path_to_DockQ>DockQ/scripts/fix_numbering.pl <path>/5GLH_S1M1_0001.pdb <path>/ref.pdb
~<path_to_DockQ>DockQ/DockQ.py <path>/5GLH_S1M1_0001.pdb.fixed <path>/ref.pdb > 5GLH_S1M1_0001-parameters

4. Output models

The resulting models are stored in the /output_pdb folder in the working directory. The number of CABS-dock top-scored models and sets of models can be modified by the user. CABS-dock modeling results in the following files containing models or sets of models (see also the Figure below):

  • model_*.pdb – by default, 10 top-scored models in all-atom representation numbered from 1 to 10 (PDB file)

  • cluster_*.pdb – clusters of models (groups of models that have been classified in structural clustering to particular clusters) in CA representation, by default numbered from 1 to 10 (PDB file). Cluster numbering corresponds to cluster ranking and to model numbering i.e. model_7.pdb is a representative model for models grouped in the seventh cluster (ranked as seventh) (cluster_7.pdb). Cluster_*.pdb files may be used, for example, for visual assessment of clustering quality or visualization of the near-bound conformations. If combined with custom scoring methods, it may be used to improve the quality of selected final models.

  • top*.pdb – top-scored models in CA representation, selected for further clustering and analysis from the 10 trajectories (PDB file). By default, a top1000.pdb file is generated containing 1000 top-scored models that passed a simple energy-based filtering procedure (100 lowest energy models are selected from each replica). If the user expects a non-standard clustering method to provide better results, this file may be used as its input.

  • replica_*.pdb – complete set of 10 trajectories in CA representation, numbered from 1 to 10 (PDB file). Each replica contains 1000 models. Combined, they consist of all the models saved during the CABS simulation and may be treated as raw output of the method. The user may then apply custom filtering and clustering procedures to improve the success rate of the final model selection.

The Figure below shows CABS-dock pipeline with default settings:

CABS-dock-pipeline4-800.png

5. Output plots and additional analysis to reference complex

CABSdock creates several plots during the analysis of its results. They are stored in the /plots and /contact_maps subdirectories of the working directory.

5.1 RMSD plot analysis

If you provide a reference protein-peptide complex to compare the modeling results with (using the --reference-pdb option), the CABSdock package will generate:

  • a plot of RMSF (root mean square fluctuation),
  • plots of energy (total and interaction) vs. peptide RMSD (to the reference peptide).

Sample output:

  • RMSF (root mean square fluctuation) of subsequent target residues (around the input position). For a long target protein, only some reference residues are marked on the x axis. RMSF values range from 0 to 1. Sample path: workdir/plots/RMSF_seq.svg. A plain text file containing this data is available in the corresponding workdir/RMSF.csv.

RMSF_seq.png

  • Energy vs. RMSD to the reference peptide. Two separate plots are made: one shows the total energy (for the entire complex structure), the other the interaction energy (between the peptide and the protein receptor). Each consists of an energy vs. RMSD plot and a histogram of the RMSD distribution along the trajectory. Upper plot: energy vs. RMSD; energy is given in CABS units. Sample paths to the plot and the plain text file are, respectively, workdir/plots/E_RMSD_<chain>_<energy>.svg and workdir/plots/E_RMSD_<chain>_<energy>.csv, where <chain> is the peptide chain character and <energy> is the type of energy (total or interaction). Lower histogram: counts of frames with a particular RMSD. Bins are at most 1 Å wide (narrower if the difference between the highest and lowest RMSD is less than 5). In both, data for all frames are plotted in gray and the top 1k models in dark orange.

E_RMSD_C_inter.png

E_RMSD_C_total.png

  • RMSD to the reference peptide vs. MC step. For each replica, CABS provides the history of RMSD changes (only if a reference PDB was given). A dotted line between points is drawn to make the sequence of points clearer. Sample paths: workdir/plots/RMSD_frame_<chain>_replica_<replica number>.svg for the plot and workdir/plots/RMSD_frame_<chain>_replica_<replica number>.csv for the plain text file.

RMSD_frame_C_replica_5.png

5.2 RMSD additional analysis

Additional data from the simulation are stored in the /output_data subdirectory of the working directory. It contains the following files ('C' in the examples refers to the peptide chain in CABSdock trajectories):

  • all_rmsds_C.txt - list of RMSD values to the reference structure of the peptide(s) calculated for each of the simulation frames.
  • filtered_rmsds_C.txt - list of RMSD values to the reference structure of the peptide(s) for each of the frames filtered before scoring.
  • medoids_rmsds_C.txt - list of the peptide RMSD values of the top scored models (cluster medoids).
  • lowest_rmsds_C.txt - summary of the lowest peptide RMSD values obtained in the simulations.
  • target_alignment_C.csv - sequential alignment to the reference structure of the C-chain.
  • config.ini - a CABSdock configuration file that stores all the options used to execute a docking run. This file may be used either to re-run docking or for analysis of an already finished docking (see below).

5.3 Contact map and contact histogram plot analysis

  • --contact-maps - if this flag is given, contact maps will be calculated, data and plots will be stored.

  • Target protein internal contact map. If target protein is too long - only some ticks will be marked on both axes. Sample path workdir/contact_maps/target_all.svg.

target_all.png

  • Maps of interface contact frequencies with all target residues for clusters, top models and replicas. Frequencies are presented in separate lines if needed. Sample path: workdir/contact_maps/<type>_<number>_ch_<chain character>.svg, with corresponding files with the txt extension containing plain data. <type> can be one of the following: cluster, top (for top model), or replica; <number> denotes the number of the corresponding cluster, top model or replica; and <chain character> distinguishes peptide chains. E.g. top_4_ch_X.svg would be the name of the contact map for chain X in the 4th of the top-scored models.

cluster_0_ch_A.png

  • Histograms of contact frequencies are divided into three sections: top, middle and bottom. The top section (upper histogram) shows the frequencies for the peptide chain residues. The middle section (all histograms but the first and the last) shows a detailed analysis of only those target protein residues that were in contact with the peptide at least once. The last section (last histogram) shows a summary of the contact frequencies of all target residues, whether they had contacts with the peptide or not.

all_contacts_histo_A.png

5.4 Handling of not identical input and reference models

  • Built-in sequence alignment. When the RMSD to the reference structure is calculated, sequence alignments between the simulated and the reference structures are created for both the peptide and the target protein. The --align option allows the user to choose the sequence alignment method. By default, CABS-dock uses its own implementation of the Smith-Waterman algorithm. If the NCBI BLAST+ package is installed, it is also possible to use protein BLAST. In that case, set the alignment method to blastp:
    CABSdock ... --align blastp ...
    
  • Loading an alignment from a file. If the alignment to the reference structure is known, or when the available sequence alignment methods cannot properly align the target or the peptide, a path to a reference alignment can be passed to CABS-dock. To do so, set the --align argument to CSV, which makes CABS-dock use the aligning method that loads an external file, and use --alignment-options to pass the file name as fname=<path>. E.g.:
    CABSdock ... --align CSV --alignment-options fname=external/file.csv
    
    If the alignments of the peptide and the target are stored in different files, the user can pass different options for loading the alignment of the target and of the peptide:
    CABSdock ... --align CSV --alignment-options fname=external/file.csv --alignment-peptide-options fname=external/file_peptide.csv
    
    If --alignment-peptide-options is not given, the file from --alignment-options is passed to the peptide file loader.

The given file needs to be in the CSV format described by Berbalk et al. in 2009 (doi: 10.1002/pro.213; alignments returned by CABSdock are in that particular CSV format). A sample file is given below:

                         reference   template
                         B:687:H C:687:H
                         B:688:K C:688:K
                         B:689:I C:689:I
                         B:690:L C:690:L
                         B:691:H C:691:H
                         B:692:R C:692:R
                         B:693:L C:693:L
                         B:694:L C:694:L
                         B:695:Q C:695:Q
                         B:696:D C:696:D
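
To inspect or validate such an alignment file before passing it to CABS-dock, it can be read with a few lines of Python. This is a minimal sketch, assuming the two-column layout of the sample above, with each entry written as chain:residue-number:one-letter-code and columns separated by whitespace or commas. The names Residue and parse_alignment are hypothetical and are not part of CABSdock.

```python
from typing import NamedTuple


class Residue(NamedTuple):
    chain: str
    number: str  # kept as text, in case a residue number is not a plain integer
    name: str    # one-letter amino acid code


def _parse_residue(token):
    chain, number, name = token.split(":")
    return Residue(chain, number, name)


def parse_alignment(lines):
    """Return a list of (reference, template) Residue pairs.

    Lines that do not consist of two chain:number:code entries (such as the
    'reference template' header or blank lines) are skipped.
    """
    pairs = []
    for line in lines:
        fields = line.replace(",", " ").split()
        if len(fields) != 2 or any(f.count(":") != 2 for f in fields):
            continue
        pairs.append((_parse_residue(fields[0]), _parse_residue(fields[1])))
    return pairs


def mismatches(pairs):
    """Return the pairs whose aligned residues differ in amino acid type."""
    return [(ref, tpl) for ref, tpl in pairs if ref.name != tpl.name]
```

A quick check such as mismatches(parse_alignment(open("external/file.csv"))) lists the aligned positions where the reference and the template carry different amino acids.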

6. Additional docking analysis

6.1 Analysis of an already finished simulation

It is sometimes necessary to perform additional analysis of the docking results, for example to calculate RMSD to another reference complex or to produce contact maps with a slightly changed cut-off. To perform this kind of analysis, remember to run your original job with the --save-cabs-files and --save-config options:

$ CABSdock -i 2P1T:A -p HKILHRLLQD:CHHHHHHHHC --save-cabs-files --save-config

These options will result in storing two additional files: a CABSdock config file config.ini and a compressed archive <timestamp><randomstring>.cbs.

To re-run the default analysis of your job, use the following command with the --config and --load-cabs-files options:

$ CABSdock -c SAVED_CONFIG_FILE --load-cabs-files SAVED_CBS_FILE

You can use this syntax to specify any additional analysis option (your command line options will overwrite any options specified in the CONFIG file). For example, you may want to select only 100 low-energy models and cluster them into 3 clusters, using the --filtering-count and --clustering-medoids options to alter the default settings:

$ CABSdock -c SAVED_CONFIG_FILE --load-cabs-files SAVED_CBS_FILE --filtering-count 100 --clustering-medoids 3

6.2 Analysis with PyMOL plugin

Recently, we developed a PyMOL plugin which enables visual analysis of CABSdock results. The plugin repository is temporarily available from here plugin documentation

7. CABSdock scoring

CABSdock scoring procedure can be modified by users.

The default procedure (using default settings) is as follows:

  • The simulation module produces a set of 10,000 models (10 trajectories consisting of 1000 models each) in CA representation.

  • The scoring module selects top-scored models from the simulation module output. Top-scored models are selected based on interaction energy values and structural clustering. The scoring module outputs sets of 10, 100 and 1000 top-scored models in CA representation.

  • The reconstruction module uses the Modeller package to reconstruct the set of 10 top-scored models from CA to all-atom representation.
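
The first filtering step (selecting the lowest-energy models from each replica, 100 per replica by default, which gives the 1000 models in top1000.pdb) can be illustrated with a short Python sketch. This is not the CABSdock implementation: it only shows the selection rule, assuming that one energy value per saved model is available for every replica. The function name filter_lowest_energy is hypothetical.

```python
def filter_lowest_energy(energies_by_replica, per_replica=100):
    """Pick the lowest-energy frames of every replica.

    `energies_by_replica` is a list with one list of energies per replica.
    Returns (replica_number, frame_index) tuples; replicas are numbered
    from 1, frames are 0-based indices into the replica's energy list.
    """
    selected = []
    for replica, energies in enumerate(energies_by_replica, start=1):
        # Frame indices ordered from the lowest to the highest energy.
        order = sorted(range(len(energies)), key=lambda i: energies[i])
        selected.extend((replica, frame) for frame in order[:per_replica])
    return selected


# Tiny example: two replicas with three models each, keep two per replica.
picked = filter_lowest_energy([[3.0, 1.0, 2.0], [0.5, 0.7, 0.1]], per_replica=2)
print(picked)
```

With the default settings (10 replicas of 1000 models, 100 kept per replica) this rule yields 1000 models, which are then passed on to structural clustering.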

8. Advanced CABS data

8.1 .cbs files

A .cbs file contains the complete set of both input and output text files read and written by the core CABS simulation module, compressed into a single archive. The CABSdock and CABSflex programs can read (-L, --load-cabs-files FILE) and write (-S, --save-cabs-files) .cbs files directly. It is, however, also possible to extract basic information from the underlying files.

8.2 Filename pattern

.cbs file names consist of a timestamp in the yymmddHHMMSS format followed by a random 6-character string and the .cbs extension, as in: 180129155704Enar5w.cbs.
To extract all files into the current directory, run:

tar xzf myfile.cbs

This will create (and possibly overwrite) five files in the current directory: INP, SEQ, TRAF, OUT and FCHAINS. To extract one specific file (e.g. SEQ) to the current directory, run:

tar xz SEQ < myfile.cbs
or:

tar xzO SEQ < myfile.cbs 

to only write its content to the screen.
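
The same can be done from Python without unpacking anything to disk. The following minimal sketch uses only the standard library and assumes, as the tar commands above imply, that a .cbs file is a gzip-compressed tar archive whose members are named INP, SEQ, TRAF, OUT and FCHAINS. The function names parse_cbs_name and read_cbs_member are hypothetical.

```python
import re
import tarfile
from datetime import datetime

# Twelve digits (yymmddHHMMSS), six arbitrary characters, then ".cbs".
CBS_NAME = re.compile(r"^(\d{12})(.{6})\.cbs$")


def parse_cbs_name(filename):
    """Split a .cbs file name into its timestamp and random suffix."""
    match = CBS_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a .cbs file name: {filename!r}")
    stamp = datetime.strptime(match.group(1), "%y%m%d%H%M%S")
    return stamp, match.group(2)


def read_cbs_member(cbs_path, member="SEQ"):
    """Return the text of one file stored in a .cbs archive."""
    with tarfile.open(cbs_path, mode="r:gz") as archive:
        handle = archive.extractfile(member)  # KeyError if the member is missing
        if handle is None:
            raise ValueError(f"{member!r} is not a regular file")
        return handle.read().decode()


stamp, suffix = parse_cbs_name("180129155704Enar5w.cbs")
print(stamp.isoformat(), suffix)
```

For example, read_cbs_member("myfile.cbs", "SEQ") returns the same text that the last tar command above writes to the screen.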

8.3 INP file (input)

INP is an input file for the CABS procedure. It has a very restrictive format in which most whitespace and newlines matter, so make sure you know what you are doing before modifying it. Specifically, empty lines and comment lines are not allowed. Fields within a line are separated by one or more whitespace characters (including tabs). The order of lines and of fields within a line is meaningful.

INP file is composed of four sections:

  1. general configuration (lines 1 - 4)
  2. CA restraints (lines 5 - 5+N) (N is the number of CA restraints)
  3. SC restraints (lines 6+N - 6+N+M) (M is the number of side-chain restraints)
  4. excluding (lines 7+N+M - 7+N+M+K) (K is the number of excluded contacts)

N, M and K could all be "0".

The general configuration section contains all of the parameters required to run CABS, such as the simulation temperature, scaling factors for the force field components, parameters controlling the simulation length etc.
The CA restraints section contains the list of the restraints imposed on pairs of the CA atoms.
The SC restraints section contains the list of the restraints imposed on pairs of the unified side-chain pseudo-atoms.
The excluding section contains the list of all of the forbidden contacts between any two residues (both CA and SC (pseudo-)atoms are considered when checking for contact).

Below is a detailed description of the INP file format, given as line number: fields.

1: RNG-seed

2: MC-anneal MC-cycles MC-steps #replicas #chains

3: T-initial T-final E-repulsion E-interaction dT-replicas

4: E-side-chain E-long-range E-centro-symmetric E-hydrogen-bond E-short-range

5: #CA-restraints(N) weight

6: chainI residueI chainJ residueJ distance weight

7: chainI residueI chainJ residueJ distance weight

. . .

N+6: #SC-restraints(M) weight SC-SC

N+7: chainI residueI chainJ residueJ distance weight

N+8: chainI residueI chainJ residueJ distance weight

. . .

N+M+7: #excluded-contacts excluding-cut-off

N+M+8: chainI residueI chainJ residueJ

N+M+9: chainI residueI chainJ residueJ

. . .

  • RNG-seed - integer to seed the Random Number Generator
  • MC-anneal/cycles/steps - integers controlling the length of the simulation
  • #replicas - number of replicas to be used
  • #chains - number of protein chains
  • T-initial/final - initial and final temperature of the simulation
  • dT-replicas - temperature difference between neighboring replicas
  • E-* - scaling factors for various energy terms
  • #CA/SC-restraints - number of CA/SC restraints
  • #excluded contacts - number of excluded contacts
  • chainI/J - identification number of protein chain: 1, 2 ... (not A, B)
  • residueI/J - identification number of a residue within a chain 1, 2 ... (not a number from the pdb file)

Example INP file:

1245
20 10 10 1 9
1.40 1.40 4.00 1.00 0.50
1.000 2.000 0.125 -2.000 0.375
1432 1.00
 1   2  1  48   6.58   1.00
 1   2  1  49   5.54   1.00
 1   2  1  50   4.45   1.00
 1   2  1  51   7.55   1.00
 1   3  1  48   5.73   1.00
 1   3  1  49   5.43   1.00
 1   3  1  50   6.42   1.00
 1   3  7  38   6.96   1.00
 1   4  1  46   6.21   1.00
 .
 .
 .
 9  78  9  81   4.97   1.00
 9  78  9  82   6.28   1.00
 9  79  9  82   4.78   1.00
 9  79  9  83   5.66   1.00
 9  80  9  83   4.68   1.00
 9  81  9  87   6.08   1.00
 9  82  9  87   6.65   1.00
2 1.00
 1   5  2  12   4.50   0.50
 3  17  3  35   5.00   0.75
6 5.000
7 5 1 66
7 1 1 66
7 4 1 66
7 3 1 66
7 2 1 66
7 6 1 66  
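
The line layout described above can be read with a short Python function. This is a minimal sketch for inspecting an INP file, not the reader used by CABS itself; the dictionary keys are illustrative names derived from the field names above, and any extra field on a section header line (such as the third field of the SC header) is ignored.

```python
def parse_inp(text):
    """Parse the text of an INP file into a dictionary (illustrative sketch)."""
    # The format forbids empty lines; blank ones are skipped here only so
    # that a trailing newline does not break the sketch.
    rows = iter([line.split() for line in text.splitlines() if line.strip()])

    result = {"seed": int(next(rows)[0])}
    keys = ("mc_anneal", "mc_cycles", "mc_steps", "replicas", "chains")
    result.update(zip(keys, map(int, next(rows))))
    keys = ("t_initial", "t_final", "e_repulsion", "e_interaction", "dt_replicas")
    result.update(zip(keys, map(float, next(rows))))
    keys = ("e_side_chain", "e_long_range", "e_centro_symmetric",
            "e_hydrogen_bond", "e_short_range")
    result.update(zip(keys, map(float, next(rows))))

    def read_restraints():
        # Header: number of restraints and a weight, then one line each:
        # chainI residueI chainJ residueJ distance weight
        header = next(rows)
        count, weight = int(header[0]), float(header[1])
        entries = []
        for _ in range(count):
            ci, ri, cj, rj, distance, w = next(rows)
            entries.append((int(ci), int(ri), int(cj), int(rj),
                            float(distance), float(w)))
        return {"weight": weight, "entries": entries}

    result["ca_restraints"] = read_restraints()
    result["sc_restraints"] = read_restraints()

    # Excluded contacts: header with count and cut-off, then four integers.
    header = next(rows)
    count, cutoff = int(header[0]), float(header[1])
    entries = []
    for _ in range(count):
        entries.append(tuple(int(v) for v in next(rows)))
    result["excluded"] = {"cutoff": cutoff, "entries": entries}
    return result
```

For example, parse_inp(open("INP").read())["replicas"] gives the number of replicas stored in an extracted INP file.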

8.4 SEQ file (input)

SEQ file contains information such as protein sequence, secondary structure and local flexibility. This file is used by the CABSdock and the CABSflex programs to generate pdb files with output structures and trajectories - residues' names and numbers and chains' IDs are taken from the SEQ file.

SEQ file contains as many lines as there are residues in the simulated system. Each line is organised into 5 columns:
residue-number residue-name chain-ID II-structure flexibility

  • residue-number - as it occurs in the input pdb file
  • residue-name - name of the residue in the 3-letter code
  • chain-ID - one character identifying protein chain as it occurs in the input pdb file
  • II-structure - single digit indicating the secondary structure assigned to each residue in the following code:
    • 1 - coil
    • 2 - helix
    • 3 - turn
    • 4 - sheet
  • flexibility - number from [0, 1] range indicating the level of flexibility assigned to each residue, where 0.0 means 'fully flexible', and 1.0 - 'rigid'.

Example SEQ file:

  135   GLU A  1  1.00
  136   ARG A  4  1.00
  137   ARG A  4  1.00
    .
    .
    .
  180   ARG A  4  1.00
  135   GLU B  1  1.00
  136   ARG B  1  1.00
  137   ARG B  4  1.00
    .
    .
    .
  180   ARG B  4  1.00
    3   GLN H  1  1.00
    4   LYS H  4  1.00
    5   THR H  4  1.00
    .
    .
    .
   30   ASP H  1  1.00
    1   MET J  1  1.00
    2   ALA J  4  1.00
    3   GLN J  4  1.00
    4   LYS J  4  1.00
    5   THR J  4  1.00
    6   PHE J  4  1.00
    7   LYS J  4  1.00
    8   VAL J  4  1.00
    9   THR J  1  1.00
   10   ALA J  1  1.00
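
A SEQ file is easy to read programmatically. Below is a minimal Python sketch based on the five-column layout and the secondary structure codes described above. The names SeqRecord, parse_seq and flexible_residues are hypothetical and not part of CABSdock; lines that do not have exactly five fields are skipped.

```python
from typing import NamedTuple

# Secondary structure codes as listed in the description above.
SECONDARY_STRUCTURE = {1: "coil", 2: "helix", 3: "turn", 4: "sheet"}


class SeqRecord(NamedTuple):
    residue_number: str  # kept as text, exactly as written in the file
    residue_name: str    # 3-letter code
    chain_id: str
    secondary_structure: str
    flexibility: float   # 0.0 = fully flexible, 1.0 = rigid


def parse_seq(lines):
    """Return one SeqRecord per residue line of a SEQ file."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # blank or otherwise unexpected line
        number, name, chain, ss_code, flexibility = fields
        records.append(SeqRecord(number, name, chain,
                                 SECONDARY_STRUCTURE[int(ss_code)],
                                 float(flexibility)))
    return records


def flexible_residues(records):
    """Return the records that are not fully rigid (flexibility below 1.0)."""
    return [r for r in records if r.flexibility < 1.0]
```

For example, flexible_residues(parse_seq(open("SEQ"))) lists the residues for which increased flexibility was set in an extracted SEQ file.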

8.5 FCHAINS file (input)

The FCHAINS file contains the coordinates of all of the CA atoms in the system in the initial conformation (before the simulation starts). The file is organised in sections, each corresponding to exactly one protein chain. Each section starts with a single integer number N in a line and is followed by N lines, each containing three integers: the x, y and z coordinates of one of the CA atoms in CABS lattice units (hence the integers).

The number of sections (chains) in the FCHAINS file also depends on how many replicas are to be used during the simulation. In general, each chain can have a different set of coordinates in different replicas. The structure of the FCHAINS file containing N replicas and M chains in each replica is as follows:

chain 1 replica 1
chain 1 replica 2
...
chain 1 replica N
chain 2 replica 1
chain 2 replica 2
...
chain 2 replica N
.
.
.
chain M replica 1
chain M replica 2
...
chain M replica N

Example FCHAINS file. The file contains coordinates for three protein chains (lengths: 19, 14 and 6) in two replicas:

19
51 5 25
53 1 30
59 0 30
61 -4 35
66 -2 39
69 -6 36
68 -4 31
69 2 33
75 0 35
76 -3 30
76 2 27
79 6 31
84 2 31
84 2 24
83 8 24
88 8 28
91 4 25
91 7 20
92 13 23
19
99 8 19
97 12 15
100 18 16
96 22 12
101 26 12
105 22 9
110 23 12
113 18 12
113 18 5
118 22 6
121 17 8
120 14 2
124 14 -3
122 19 -6
118 19 -10
120 20 -16
117 22 -21
120 23 -26
117 25 -31
14
121 1 -31
123 7 -28
119 6 -23
121 12 -20
116 12 -16
117 6 -14
121 7 -9
117 11 -7
112 8 -8
115 3 -6
116 6 -1
111 8 0
108 2 -1
14
116 6 -1
111 8 0
108 2 -1
112 0 3
111 4 8
105 3 6
105 -3 7
108 -2 13
103 2 14
99 -2 12
102 -6 16
101 -3 21
6
115 -37 3
113 -32 5
117 -28 9
115 -23 11
119 -21 15
118 -15 14
6
112 -19 19
115 -15 22
111 -10 21
106 -14 23
109 -14 28
104 -18 30

Note that protein chains in the FCHAINS file are always longer by two residues than respective chains in the input pdb file loaded by the CABSdock or the CABSflex programs, since upon casting a protein structure on the CABS lattice two "dummy" residues are added to the ends of all protein chains.

Also note that number and length(s) of protein chains in the FCHAINS file are sometimes different from what is loaded by the CABSdock or the CABSflex programs, as upon casting a protein structure on the CABS lattice the structure is tested for chain continuity and broken into smaller chains on gaps. Original chain composition is restored when result pdb files with structures and trajectories are generated.
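
The section layout described above (a count line followed by that many coordinate lines, with the sections ordered chain by chain and, within a chain, replica by replica) can be read with the following minimal Python sketch. It is an illustration rather than the reader used by CABS, and the function name parse_fchains is hypothetical. The number of replicas has to be supplied by the caller (for example from the INP file), because FCHAINS itself does not store it.

```python
def parse_fchains(lines, n_replicas):
    """Return {(chain, replica): [(x, y, z), ...]} with 1-based indices."""
    rows = [line.split() for line in lines if line.strip()]
    sections = []
    pos = 0
    while pos < len(rows):
        count = int(rows[pos][0])              # number of CA atoms in the section
        block = rows[pos + 1:pos + 1 + count]  # the coordinate lines
        if len(block) != count or any(len(r) != 3 for r in block):
            raise ValueError(f"malformed section starting at row {pos + 1}")
        sections.append([tuple(int(v) for v in r) for r in block])
        pos += 1 + count
    if len(sections) % n_replicas:
        raise ValueError("number of sections is not a multiple of n_replicas")

    layout = {}
    for index, coords in enumerate(sections):
        chain = index // n_replicas + 1    # sections are grouped by chain ...
        replica = index % n_replicas + 1   # ... and ordered by replica within it
        layout[(chain, replica)] = coords
    return layout


# Two chains (2 and 1 lattice points) in two replicas.
demo = ["2", "51 5 25", "53 1 30", "2", "99 8 19", "97 12 15",
        "1", "115 -37 3", "1", "112 -19 19"]
layout = parse_fchains(demo, n_replicas=2)
print(sorted(layout))
```

Keep in mind the two notes above: the chains stored in FCHAINS include the two "dummy" end residues and may have been split on gaps, so they need not match the chains of the input pdb file one to one.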

8.6 TRAF file (output)

The TRAF file contains the coordinates of the CA atoms of every k-th conformation generated during the simulation, where k equals the MC-steps parameter set in the INP file.

The TRAF file can be divided into blocks of data. Each block corresponds to a single protein chain within a single replica at exactly one moment in time. Each such block starts with a header line followed by multiple coordinate lines. Although these blocks are the only explicit data structures within the TRAF file, they are organised into three abstract structures: chains, replicas and frames, representing the respective data structures processed in the simulation. The following figure presents the data layout inside the TRAF file.

Figure

8.7 OUT file (output)

The OUT file contains a short summary of the simulation.

9. CABSflex simulations of protein fluctuations

In addition to the protein-peptide docking functionality, the CABSdock standalone package is equipped with an additional feature that enables fast simulations of protein fluctuations using the CABS-flex methodology (see the CABS-flex server website).

The following command:

$ CABSflex -i 2GB1

will run the CABSflex method with the default flexibility settings described here

Note that default settings for CABS-flex flexibility are different from those set for CABS-dock: restraints are calculated only for regions with secondary structure and are not so strict (created for residues distant up to 8 Angstroms).

All other options, except for those concerning peptides, work the same as in CABS-dock.

