Welcome to the CABSdock wiki page! Installation instructions and an outline of the method are provided on the CABSdock OVERVIEW PAGE.
Table of contents
2.2 Protein structure input options
2.4 Distance restraints options
3.1 Default docking and demo directory
3.2 Default docking, peptide sequence from PDB
3.3 Docking with contact information
3.5 Intrinsically unstructured protein regions
3.7 Modifying protein restraints
3.8 Sampling near native binding modes
3.9 Calculating ligand-RMSD values to a reference complex
3.10 Refinement of CABS-dock models using Rosetta FlexPepDock
5. Output plots and additional analysis to reference complex
6. Additional docking analysis
9. CABSflex simulations of protein fluctuations
1. CABSdock modeling scheme
1.1 Pipeline
CABSdock is an efficient simulation method for protein-peptide docking. The method can simulate significant conformational changes during the docking search for a binding site. The CABSdock standalone package allows control and modification of every simulation step. The picture below shows the CABSdock pipeline with default settings.
1.2 Example simulation movies
The movie below shows an example trajectory from protein-peptide docking using CABSdock. Only 1 trajectory (system replica) out of 10 is presented. The docking was performed with default CABSdock settings:
The movie below shows example trajectories from protein-peptide docking using CABSdock. The movie shows all 10 trajectories (system replicas) and 1 selected trajectory together with RMSD analysis. The docking was performed with default CABSdock settings:
The movie below shows example simulation snapshots from the CABSdock study on molecular docking with large-scale conformational changes: the p53-MDM2 interaction; see details in Sci Rep 6, 37532 (2016)
1.3 CABS simulation engine
The CABSdock method uses an efficient simulation engine: the CABS coarse-grained protein model. The picture below compares the all-atom representation (left) with the CABS coarse-grained representation (right) for an example 4-residue protein fragment. In CABS, a single amino acid is represented by 4 atoms (or pseudo-atoms): C-alpha (CA), C-beta (CB), the center of mass of the side-chain group (SC) and the center of the peptide bond (cp).
Note that the CABSdock modeling scheme allows distance restraints to be applied or modified between selected CA atoms or between selected SC pseudo-atoms.
CABS design and applications have been recently described in the review: Chemical Reviews, 116:7898–7936, 2016
1.4 Average simulation time
The plot below presents the average simulation time (in hours), together with the min/max job times, as a function of the protein-peptide system size (receptor + peptide), for the default number of simulation cycles (50), using a single 2.5 GHz processor. The plot was made using a few months of data from the CABS-dock web server.
1.5 Papers on CABSdock development and applications
Papers describing the CABS-dock server and its example applications:
- CABS-dock web server for flexible docking of peptides to proteins without prior knowledge of the binding site, Nucleic Acids Research, 43(W1): W419-W424, 2015 - the paper describes the CABS-dock web server implementation and the results obtained for the PeptiDB benchmark set in the default mode (using peptide sequence, peptide secondary structure and receptor structure as input information)
- Modeling of protein-peptide interactions using the CABS-dock web server for binding site search and flexible docking, Methods, 93, 72-83, 2016 - the paper presents: example CABS-dock results obtained in the default mode and using advanced options that increase the range of flexibility for chosen receptor fragments, examples of scoring of CABS-dock models using all-atom molecular dynamics, and a tutorial appendix for analysis and visualization of CABS-dock results using VMD
- Protein-peptide molecular docking with large-scale conformational changes: the p53-MDM2 interaction, Scientific Reports 6, 37532, 2016 - the paper presents a CABS-dock application to simulations of the binding of the p53-MDM2 complex, including large-scale structural rearrangements of MDM2 flexible regions
- Highly flexible protein-peptide docking using CABS-dock, Methods in Molecular Biology, 1561: 69-94, 2017 - the paper presents an example CABS-dock application for docking a potentially therapeutic peptide to a protein target, simulation contact maps (a new feature of the web server), and a tutorial for running the CABS-dock web server from the command line or command line scripts
- Modeling EphB4-EphrinB2 protein–protein interaction using flexible docking of a short linear motif, Biomedical engineering online, 16:71, 2017 - the paper presents an example test case of the protocol for protein-protein docking in which CABS-dock is used for docking a short linear motif as a peptide. Furthermore, based on the docking result, the protein–protein complex is reconstructed and refined.
- Protein–peptide docking using CABS-dock and contact information, Briefings in Bioinformatics, bby080, 2018 - the paper presents a specific CABS-dock protocol that enhances the docking procedure using fragmentary information about protein–peptide contacts. The contact information is used to narrow down the search for the binding peptide pose to the proximity of the binding site.
2. CABSdock options
2.1 Basic options
Click on an option link to read full description
- -i, --input-protein INPUT - Loads input protein structure.
- -p, --peptide PEPTIDE - Loads peptide sequence and optionally peptide secondary structure in one-letter code (can be used multiple times to add multiple peptides).
- -c, --config CONFIG - Reads options from the configuration file CONFIG.
2.2 Protein structure input options
Note that CABS-dock uses cache directory (default location is ~/cabsPDBcache) to keep pdb files downloaded from the PDB database.
Click on an option link to read full description
- -e, --exclude RESIDUES - Excludes protein residues listed in RESIDUES from the docking search, therefore enforces more effective search in other areas of a protein surface, for example, it may be known that some parts of the protein are not accessible to peptide (due to binding to other proteins) and therefore it could be useful to exclude these regions from the search procedure.
- --excluding-distance DISTANCE - Sets minimum distance between side chain atoms of peptide(s) and protein residues marked as 'excluded'.
- -f, --protein-flexibility FLEXIBILITY - Modifies flexibility of selected protein residues.
- -g, --protein-restraints MODE GAP MIN MAX - Generates a set of binary distance restraints for CA atoms that keep the protein in a predefined conformation.
- --protein-restraints-reduce FACTOR - Reduces the number of generated restraints.
- -N, --no-protein-restraints - Turns off automatic restraints generation.
- --weighted-fit ARG - Sets and customizes the way models are structurally aligned.
- --gauss-iterations NUM - Sets the maximum number of iterations of the dynamic weighted-fit algorithm to NUM.
2.3 Peptide input options
Click on an option link to read full description
- -P, --add-peptide PEPTIDE CONFORMATION LOCATION - Add peptide to the complex.
- -d, --separation SEP - This option enables advanced settings of building starting conformations of modelled complexes (to be used only in specific protocols).
- --insertion-clash DIST - This option enables advanced settings of building starting conformations of modelled complexes.
- --insertion-attempts NUM - This option enables advanced settings of building starting conformations of modelled complexes.
2.4 Distance restraints options
Click on an option link to read full description
- --ca-rest-add RESI RESJ DIST WEIGHT - Adds distance restraint between CA atom in residue RESI and CA atom in residue RESJ.
- --sc-rest-add RESI RESJ DIST WEIGHT - Adds distance restraint between SC pseudo-atom in residue RESI and SC pseudo-atom in residue RESJ.
- --ca-rest-weight WEIGHT - Set global weight for all CA restraints (including automatically generated restraints for a protein).
- --sc-rest-weight WEIGHT - Set global weight for all SC restraints.
- --ca-rest-file FILE - Read CA restraints from file (use multiple times to add multiple files).
- --sc-rest-file FILE - Read SC restraints from file (use multiple times to add multiple files).
2.5 Simulation options
Click on an option link to read full description
- -a, --mc-annealing NUM - sets number of Monte Carlo temperature annealing cycles to NUM (NUM > 0, default value = 20, changing default value is recommended only for advanced users).
- -y, --mc-cycles NUM - sets number of Monte Carlo cycles to NUM (NUM>0, default value = 50).
- -s, --mc-steps NUM - sets number of Monte Carlo cycles between trajectory frames to NUM (NUM > 0, default value = 50).
- -r, --replicas NUM - sets number of replicas to be used in Replica Exchange Monte Carlo (NUM > 0, default value = 10, changing default value is recommended only for advanced users).
- -D, --replicas-dtemp DELTA - sets temperature increment between replicas (DELTA > 0, default value = 0.5).
- -t, --temperature TINIT TFINAL - sets temperature range for simulated annealing TINIT - initial temperature, TFINAL - final temperature (default values TINIT = 2.0 TFINAL = 1.0).
- -z, --random-seed SEED - sets seed for random number generator.
2.6 All-atom reconstruction options
CABS-dock uses the Modeller tool to reconstruct top-scored models from C-alpha to all-atom resolution (see the [Modeller reconstruction script](https://bitbucket.org/lcbio/ca2all/)). Note that the current version of the CABS-dock reconstruction protocol automatically reconstructs chain breaks in the receptor structure. If this is not the desired behavior, we recommend using your own reconstruction protocol (which may be composed of different available tools, for example Pulchra, PD2, SAABAC, SCWRL4 or a modified Modeller script). By the end of March 2019, we plan to provide a new, improved and customizable protocol for the CABS-dock all-atom reconstruction that, among other new features, will enable preservation of chain breaks. The new reconstruction protocol, together with all-atom refinement options, will be described in a new article in Methods in Molecular Biology, as well as announced in the CABS-dock repository.
Click on an option link to read full description
- -A, --aa-rebuild - Rebuild final models to all-atom representation (requires MODELLER installed).
- -m, --modeller-iterations NUM - Set number of iterations for the reconstruction procedure in MODELLER (default: 3).
2.7 Results analysis options
Click on an option link to read full description
- -R, --reference-pdb REF - Load reference complex structure.
- -k, --clustering-medoids NUM - Sets number of medoids in k-medoids clustering algorithm.
- --clustering-iterations NUM - Sets number of iterations of the clustering k-medoids algorithm.
- -n, --filtering-count NUM - Sets number of low-energy models from trajectories to be clustered (default 1000).
- --filtering-mode MODE - Picks (filtering-number/replicas) models from each replica.
- -M, --contact-maps - Stores contact maps matrix plots and histograms of contact frequencies.
- -T, --contact-threshold DIST - Set contact distance between side chains pseudoatoms (SC) for contact map plotting.
- --contact-threshold-aa DIST - Set contact distance between heavy atoms for contact map plotting (all-atom top scored models only).
- --contact-map-colors COLORS - sets colors in hex code to be used in contact map color bars.
- --align METHOD - Method to be used to align the target with the reference pdb.
- --alignment-options - Options to be passed to method aligning (target).
- --alignment-peptide-options - Options to be passed to method aligning peptide.
2.8 Output options
Click on an option link to read full description
- -S, --save-cabs-files - Save CABSdock simulation file. The filename will have the following format: yymmddHHMMSS<RANDOM 6-CHARACTERS STRING>.cbs. For example: 181116161924knWPtn.cbs
- -L, --load-cabs-files FILE - Load CABSdock simulation file(.cbs). This option allows for repeated scoring and analysis of CABSdock trajectories (with new settings, for example using a reference complex structure).
- -C, --save-config - Save simulation parameters in config file.
- -o, --pdb-output SELECTION - Select structures to be saved in the pdb format.
2.9 Miscellaneous options
Click on an option link to read full description
- --work-dir DIR - set working directory to DIR.
- --dssp-command PATH - provide path to dssp binary.
- --fortran-command PATH - provide path to fortran compiler binary.
- --image-file-format FMT - produces all the image files in given format.
- -V, --verbose VERBOSITY - Controls how explicit the program output is, 0 for silent mode (only critical messages), 4 for maximum verbosity (default: 2).
- --log - redirect all output to the log file (CABS.log)
- --version - print version and exit program
- -h, --help - print help and exit program
2.10 Options' index
-A, --aa-rebuild
Rebuild final models to all-atom representation. (default: True)
-P, --add-peptide PEPTIDE CONFORMATION LOCATION
Adds a peptide to the complex. This option can be used multiple times to add multiple peptides.
PEPTIDE must be either:
- amino acid sequence in one-letter code (optionally annotated with secondary structure: H - helix, E - sheet, C - coil), e.g. -p HKILHRLLQD:CHHHHHHHHC loads the HKILHRLLQD peptide sequence with the secondary structure assignment CHHHHHHHHC
HINT: If possible, it is always recommended to use secondary structure information/prediction. For residues with an ambiguous secondary structure prediction it is better to assign coil (C) than a regular (H - helix or E - extended) type of structure.
- pdb file (may be gzipped)
- pdb code (optionally with chain_id, e.g. 1abc:D)
CONFORMATION sets the initial conformation of the peptide. Must be either:
- random - a random conformation is generated (default)
- keep - preserve the conformation from the file. This has no effect if PEPTIDE=SEQUENCE.
LOCATION sets the initial location of the peptide. Must be either:
- random - the peptide is placed in a random location on the surface of a sphere centered at the protein's geometrical center, at the distance defined by the --separation option from the surface of the protein.
- keep - preserve the location from the file. This has no effect if PEPTIDE=SEQUENCE.
- patch - list of protein residues (e.g. 123:A+125:A+17:B). The peptide will be placed above the geometrical center of the listed residues, at the distance defined by the --separation option from the surface of the protein. WARNING: residues listed in the patch should be on the surface of the protein and close to each other.
--align METHOD
Method to be used to align the target and peptides with the reference. Available options are:
- SW -- Smith-Waterman (default)
- blastp -- protein BLAST (requires NCBI+ package installed)
- trivial -- simple sequential alignment, useful only to speed up a run (by omitting the Smith-Waterman algorithm) in the case of an obvious one-chain input and reference of the same length (e.g. when the input and the reference are the same file).
- CSV -- loads the alignment from a given file (passed as an alignment setting called fname) in the format described by Berbalk et al. in 2009.
--alignment-options
Options to be passed to the method aligning the target if --alignment-peptide-options is also given, or to the methods aligning both the target and the peptides if it is not. Example:
CABSdock --align blastp --alignment-options task=short-task
--alignment-peptide-options
Options to be passed to the method aligning peptides. If this option is passed, options given to --alignment-options are ignored during peptide alignment.
--ca-rest-add RESI RESJ DIST WEIGHT
Adds a distance restraint between the C-alpha (CA) atom in residue RESI and the CA atom in residue RESJ.
DIST is the distance between these atoms and WEIGHT is the restraint's weight from [0, 1].
In order to add restraints between the peptide and the protein, or between two peptides, use PEP1, PEP2, ... as chain identifiers of the peptides (even when a peptide is read from a pdb file, its chain identifier is ignored).
Example:
123:A 5:PEP1 8.7 1.0
adds a restraint between the CA atom of residue number 123 in chain A of the protein and the CA atom of the 5th residue of the peptide.
Comments:
- If you add only one peptide, both PEP and PEP1 are valid chain identifiers.
- If you add multiple peptides they will be ordered as follows:
  - from config file added by the peptide option
  - from config file added by the add-peptide option
  - from command line added by the --peptide option
  - from command line added by the --add-peptide option
- Peptides added by the same method preserve the order in which they appear in the config file, or on the command line.
- Can be used multiple times to add multiple restraints.
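The restraint syntax above lends itself to being generated from a list of known contacts. The following is a minimal sketch, not part of CABSdock: `ca_restraint_args` is a hypothetical helper that only assembles command-line tokens in the documented `RESI RESJ DIST WEIGHT` order. The check that DIST is positive is an assumption; the [0, 1] range for WEIGHT comes from the description above.

```python
from typing import List, Tuple

# (RESI, RESJ, DIST, WEIGHT); peptide residues use PEP1, PEP2, ... as chain IDs.
Restraint = Tuple[str, str, float, float]


def ca_restraint_args(restraints: List[Restraint]) -> List[str]:
    """Build repeated --ca-rest-add arguments (hypothetical helper, not a CABSdock API)."""
    args: List[str] = []
    for resi, resj, dist, weight in restraints:
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"WEIGHT must be within [0, 1], got {weight}")
        if dist <= 0.0:  # assumption: a restraint distance should be positive
            raise ValueError(f"DIST must be positive, got {dist}")
        # The option may be repeated, once per restraint.
        args += ["--ca-rest-add", resi, resj, str(dist), str(weight)]
    return args


args = ca_restraint_args([("123:A", "5:PEP1", 8.7, 1.0)])
print(" ".join(args))  # --ca-rest-add 123:A 5:PEP1 8.7 1.0
```

The resulting list can be appended to the rest of a CABSdock command line, for example when launching the program from a script.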
--ca-rest-file FILE
Reads CA restraints from a file (use multiple times to add multiple files).
--ca-rest-weight WEIGHT
Sets a global weight for all CA restraints (including automatically generated restraints for the protein) (default: 1.0)
--clustering-iterations NUM
Sets the number of iterations of the k-medoids clustering algorithm (default: 100).
-k, --clustering-medoids NUM
Sets the number of medoids in the k-medoids clustering algorithm. This option also sets the number of final models to be generated. (default: 10)
-c, --config CONFIG
Reads options from the configuration file CONFIG
--contact-map-colors COLORS
Sets 6 colors (hex code, e.g. #00FF00 for green etc.) to be used in contact map color bars.
-M, --contact-maps
Store contact maps matrix plots and histograms of contact frequencies.
--contact-threshold-aa DIST
Set contact distance between heavy atoms for contact map plotting (all-atom top scored models only). (default: 5.5 Angstroms)
-T, --contact-threshold DIST
Set contact distance between side chain pseudo-atoms (SC) for contact map plotting. (default: 6.5 Angstroms)
--dssp-command PATH
Use the provided path to the dssp binary.
CABS-dock requires the DSSP program in order to assign the secondary structure to the protein receptor's residues. We recommend installation of the standalone DSSP program to be used with CABS-dock '--dssp-command' option. As a fallback, we have implemented a module which communicates with the DSSP server when no local DSSP binary is available, however recently the server's performance has been unstable, resulting in jobs getting stuck. In order to install the standalone DSSP program follow instructions available here.
-e, --exclude RESIDUES
Excludes protein residues listed in RESIDUES from the docking search, which enforces a more effective search in other areas of the protein surface. For example, it may be known that some parts of the protein are not accessible to the peptide (due to binding to other proteins), so it could be useful to exclude these regions from the search procedure.
RESIDUES must be a single string of characters (no whitespace) consisting of residue identifiers (e.g. 123:A) or chain identifiers (e.g. A) joined with the + sign. The - sign is also allowed, to specify a continuous range of residues or chains.
Examples:
- -e 123:A excludes residue 123 from chain A
- -e 123:A+125:A excludes residues 123 and 125 from chain A
- -e 123:A-125:A excludes residues 123, 124 and 125 from chain A
- -e A excludes the whole chain A
- -e A+C excludes chains A and C
- -e A-C excludes chains A, B and C
Adding @PEP<N> at the end of the string limits the exclusion to the N-th peptide only, e.g. -e 123:A@PEP1 will exclude residue 123 in chain A from binding with the first peptide only. If @PEP<N> is omitted, the exclusion list affects all peptides.
This option can be used multiple times to add multiple sets of excluded residues.
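To make the RESIDUES syntax concrete, the sketch below expands a selection string into explicit residue and chain identifiers. It is an illustration only: `expand_exclusion` is a hypothetical function, not CABSdock's own parser, and it assumes non-negative residue numbers, single-letter chain identifiers and residue ranges that stay within one chain.

```python
from typing import List, Optional, Tuple


def expand_exclusion(selection: str) -> Tuple[List[str], Optional[int]]:
    """Expand a RESIDUES string into explicit items plus an optional peptide index."""
    peptide = None
    if "@PEP" in selection:
        selection, pep = selection.split("@PEP")
        peptide = int(pep)  # the exclusion applies to the N-th peptide only
    items: List[str] = []
    for token in selection.split("+"):
        if "-" not in token:
            items.append(token)  # single residue (123:A) or whole chain (A)
            continue
        start, end = token.split("-")
        if ":" in start:
            # Residue range, e.g. 123:A-125:A
            first, chain = start.split(":")
            last, end_chain = end.split(":")
            if chain != end_chain:
                raise ValueError("this sketch only expands ranges within one chain")
            items += [f"{n}:{chain}" for n in range(int(first), int(last) + 1)]
        else:
            # Chain range, e.g. A-C
            items += [chr(c) for c in range(ord(start), ord(end) + 1)]
    return items, peptide


print(expand_exclusion("123:A-125:A"))  # (['123:A', '124:A', '125:A'], None)
print(expand_exclusion("A-C@PEP1"))     # (['A', 'B', 'C'], 1)
```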
--excluding-distance DISTANCE
Sets the minimum distance between side chain atoms of peptide(s) and protein residues marked as excluded.
-n, --filtering-count NUM
Sets the number of low-energy models from trajectories to be clustered (default 1000)
--filtering-mode MODE
Choose the filtering mode to select NUM (set by --filtering-count) models for clustering.
MODE can be either (default: each):
- each - models are ordered by protein-peptide(s) binding energy and the top n = [NUM / R] (R is the number of replicas) are selected from EACH replica
- all - models are ordered by protein-peptide(s) binding energy and the top NUM are selected from ALL replicas combined
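The difference between the two filtering modes can be sketched as follows. This is an illustration under stated assumptions, not CABSdock's implementation: `filter_models` is a hypothetical function, lower binding energy is treated as better (the option refers to "low-energy models"), and [NUM / R] is taken to be integer division.

```python
from typing import List, Tuple

Model = Tuple[int, int, float]  # (replica index, frame index, binding energy)


def filter_models(replicas: List[List[float]], num: int, mode: str = "each") -> List[Model]:
    """Select low-energy models either per replica ('each') or globally ('all')."""
    models = [
        [(r, f, energy) for f, energy in enumerate(energies)]
        for r, energies in enumerate(replicas)
    ]
    if mode == "each":
        per_replica = num // len(replicas)  # n = [NUM / R], assumed to round down
        picked: List[Model] = []
        for replica in models:
            picked += sorted(replica, key=lambda m: m[2])[:per_replica]
        return picked
    if mode == "all":
        combined = [m for replica in models for m in replica]
        return sorted(combined, key=lambda m: m[2])[:num]
    raise ValueError(f"unknown filtering mode: {mode}")


energies = [[-5.0, -1.0, -3.0], [-9.0, -8.0, -7.0]]
print(filter_models(energies, 2, "each"))  # one model from each replica
print(filter_models(energies, 2, "all"))   # both models come from the second replica
```

With `each`, every replica contributes equally even if one replica sampled much lower energies; with `all`, a single replica can dominate the filtered set.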
--fortran-command PATH
Use the provided path to the fortran compiler binary.
--gauss-iterations NUM
Sets the number of iterations of the dynamic weighted-fit algorithm used for superposition of structures.
This option has no effect when --weighted-fit is set to anything other than gauss.
NUM = 100 by default
-h, --help
print help and exit program
--image-file-format FMT
Produce all the image files in the given format.
-i, --input-protein INPUT
Loads input protein structure.
INPUT can be either:
- PDB code (optionally with chain IDs), e.g. -i 1CE1:HL loads chains H and L of the 1CE1 protein structure downloaded from the PDB database
- path to a local PDB file (optionally gzipped)
--insertion-attempts NUM
This option enables advanced settings of building starting conformations of modelled complexes. The option sets the number of attempts to insert the peptide while building the initial complex (default: 1000)
--insertion-clash DIST
This option enables advanced settings of building starting conformations of modelled complexes. The option sets the distance in Angstroms between any two atoms (of different modeled chains) at which a clash occurs while building the initial complex (default: 1.0 Angstrom)
-L, --load-cabs-files FILE
Loads CABSdock simulation files and allows for repeated scoring and analysis of CABSdock trajectories (with new settings, for example using a reference complex structure via the --reference-pdb option).
--log
Automatically redirects output to the CABS.log file created in the working directory, stops the progress bar from showing on higher verbosity levels and turns off log coloring. Piping standard error will not work with this option. If the log file already exists, it will be appended to.
-a, --mc-annealing NUM
Sets the number of Monte Carlo temperature annealing cycles to NUM (NUM > 0, default value = 20, changing the default value is recommended only for advanced users).
-y, --mc-cycles NUM
Sets the number of Monte Carlo cycles to NUM (NUM > 0, default value = 50). Total number of snapshots generated for each replica/trajectory = [mc-annealing] x [mc-cycles], default: 20x50=1000.
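The snapshot arithmetic above can be written out as a small worked example. The per-replica formula is the one given in the option description; multiplying by the number of replicas to get a total over all trajectories is a direct consequence, assuming every replica is saved in full. The function names are illustrative only.

```python
def snapshots_per_replica(mc_annealing: int = 20, mc_cycles: int = 50) -> int:
    """Snapshots in one trajectory: [mc-annealing] x [mc-cycles]."""
    return mc_annealing * mc_cycles


def snapshots_total(replicas: int = 10, mc_annealing: int = 20, mc_cycles: int = 50) -> int:
    """Snapshots over all replicas (assumes every replica trajectory is kept in full)."""
    return replicas * snapshots_per_replica(mc_annealing, mc_cycles)


print(snapshots_per_replica())  # 1000
print(snapshots_total())        # 10000
```

Note that --mc-steps does not appear in the formula: it lengthens the simulation between saved frames without changing how many frames are saved.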
-s, --mc-steps NUM
Sets the number of Monte Carlo cycles between trajectory frames to NUM (NUM > 0, default value = 50). NUM = 1 means that every generated conformation will appear in the trajectory. This option makes it possible to increase the simulation length (between printed snapshots) and doesn't impact the number of snapshots in trajectories.
-m, --modeller-iterations NUM
Sets the number of iterations for the reconstruction procedure in the MODELLER package (default: 3). Bigger numbers may result in more accurate models, but reconstruction will take longer.
-N, --no-protein-restraints
Do not automatically generate any protein restraints. This option has precedence over the --protein-restraints option and will override any settings made by the latter. With this flag on, restraints can still be added with the --ca-rest-add or --ca-rest-file options.
-o, --pdb-output SELECTION
Select structures to be saved in the pdb format.
Available options are:
* A - all (default)
* R - replicas
* F - filtered
* C - clusters
* M - models
* N - none
Example:
-o RM - saves replicas and models
-p, --peptide PEPTIDE
Loads peptide sequence and optionally peptide secondary structure in one-letter code (can be used multiple times to add multiple peptides).
PEPTIDE can be either:
- amino acid sequence in one-letter code (optionally annotated with secondary structure: H - helix, E - sheet, C - coil), e.g. -p HKILHRLLQD:CHHHHHHHHC loads the HKILHRLLQD peptide sequence with the secondary structure assignment CHHHHHHHHC
HINT: If possible, it is always recommended to use secondary structure information/prediction. For residues with an ambiguous secondary structure prediction it is better to assign coil (C) than a regular (H - helix or E - extended) type of structure.
- PDB code (optionally with chain ID), e.g. -p 1CE1:P loads the sequence of chain P from the 1CE1 protein
- path to a PDB file with peptide coordinates; only the peptide sequence is loaded from the PDB file
--peptide PEPTIDE is an alias for --add-peptide PEPTIDE random random
-f, --protein-flexibility FLEXIBILITY
Modifies flexibility of selected protein residues:
- 0 - fully flexible backbone
- 1 - almost stiff backbone (default value, given an appropriate number of protein restraints)
- >1 - increased stiffness
FLEXIBILITY can be either:
- a positive real number - all protein residues will be assigned flexibility equal to this number.
- bf - flexibility for each residue is read from the beta factor column of the CA atom in the PDB input file. Note that the standard beta factors in PDB files have an opposite meaning to the CABSdock flexibility. Remember to edit the PDB file accordingly (or use FLEXIBILITY = bfi).
- bfi - each residue is assigned its flexibility based on the inverted beta factors stored in the input PDB file, so that bf = 0.0 -> f = 1.0 and bf >= 1.0 -> f = 0.0
- <filename> - flexibility is read from the file <filename> in the format of single residue entries: resid_ID <flexibility>, e.g. 12:A 0.75, or residue ranges: resid_ID - resid_ID <flexibility>, e.g. 12:A - 15:A 0.75
The default value for residues not explicitly specified can be set by inserting the following line at the top of the file: default <default flexibility value>. If this line is omitted, the default value becomes 1.0. Multiple entries can be used.
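A flexibility file in the format described above can be produced with a few lines of scripting. The sketch below is illustrative: `flexibility_lines` is a hypothetical helper, and the rejection of negative values is an assumption based on the value range listed for this option.

```python
from typing import List, Optional, Tuple

# (first residue, last residue or None for a single residue, flexibility)
Entry = Tuple[str, Optional[str], float]


def flexibility_lines(entries: List[Entry], default: Optional[float] = None) -> List[str]:
    """Format flexibility entries as lines of a --protein-flexibility input file."""
    lines: List[str] = []
    if default is not None:
        lines.append(f"default {default}")  # must come first in the file
    for first, last, flexibility in entries:
        if flexibility < 0:  # assumption: negative flexibility is not meaningful
            raise ValueError("flexibility must not be negative")
        if last is None:
            lines.append(f"{first} {flexibility}")           # single residue entry
        else:
            lines.append(f"{first} - {last} {flexibility}")  # residue range entry
    return lines


lines = flexibility_lines([("12:A", None, 0.75), ("20:A", "25:A", 0.0)], default=1.0)
print("\n".join(lines))
```

Writing the returned lines to a text file and passing its path as FLEXIBILITY would make residue 12 of chain A slightly flexible and residues 20 to 25 fully flexible, with all remaining residues at the default value.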
-g, --protein-restraints MODE GAP MIN MAX
Generates a set of binary distance restraints for CA atoms that keep the protein in a predefined conformation (default: all, 5, 5.0, 15.0).
MODE can be either:
- all - generates restraints for all protein residues
- ss1 - generates restraints only when at least one restrained residue is assigned regular secondary structure (helix or sheet)
- ss2 - generates restraints only when both restrained residues are assigned regular secondary structure (helix, sheet)
GAP specifies the gap along the main chain for the two residues to be restrained. MIN and MAX are the minimum and maximum distances in Angstroms between the two residues to be restrained.
The default setting, recommended for standard applications, is all 5 5.0 15.0
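The selection of residue pairs can be illustrated with a short sketch for MODE = all. It is not CABSdock's code: `generate_ca_restraints` is a hypothetical function, and it assumes a single chain, that GAP is the minimal sequence separation between the two residues, and that MIN and MAX are inclusive bounds on the CA-CA distance in the input structure.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def generate_ca_restraints(
    ca_coords: Sequence[Coord], gap: int = 5, dmin: float = 5.0, dmax: float = 15.0
) -> List[Tuple[int, int, float]]:
    """Return (i, j, distance) for residue pairs that would be restrained (MODE = all)."""
    restraints = []
    n = len(ca_coords)
    for i in range(n):
        # Assumption: residues closer than `gap` along the chain are never restrained.
        for j in range(i + gap, n):
            distance = math.dist(ca_coords[i], ca_coords[j])
            if dmin <= distance <= dmax:
                restraints.append((i, j, distance))
    return restraints


# Toy example: 8 pseudo-residues on a straight line, 2 Angstroms apart.
coords = [(2.0 * k, 0.0, 0.0) for k in range(8)]
print(len(generate_ca_restraints(coords)))  # 6
```

Raising GAP or narrowing the MIN to MAX window reduces the number of restraints, which leaves the protein more freedom to move.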
--protein-restraints-reduce FACTOR
Reduce the number of protein restraints by a FACTOR, where FACTOR is a number from [0, 1]. This option reduces the number of automatically generated restraints for the protein molecule in order to speed up computation. Restraints are randomly selected from all generated restraints, so that the final number of restraints #reduced = #all * FACTOR.
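The random reduction can be sketched in a few lines. `reduce_restraints` is a hypothetical function for illustration; rounding #all * FACTOR down to an integer and the optional seed are assumptions, not documented CABSdock behavior.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def reduce_restraints(restraints: Sequence[T], factor: float, seed: Optional[int] = None) -> List[T]:
    """Randomly keep #all * FACTOR restraints (rounded down in this sketch)."""
    if not 0.0 <= factor <= 1.0:
        raise ValueError("FACTOR must be within [0, 1]")
    keep = int(len(restraints) * factor)
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return rng.sample(list(restraints), keep)


all_restraints = list(range(100))  # stand-ins for generated restraints
print(len(reduce_restraints(all_restraints, 0.25, seed=1)))  # 25
```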
-z, --random-seed SEED
Sets the seed for the random number generator.
-R, --reference-pdb REF
Loads a reference complex structure. This option allows for comparison with the reference complex structure and triggers additional analysis features.
REF must be either:
- [pdb code]:[protein chains]:[peptide1 chain][peptide2 chain]...
- [pdb file]:[protein chains]:[peptide1 chain][peptide2 chain]...
Examples:
1abc:AB:C
1abc:AB:CD
myfile.pdb:AB:C
myfile.pdb.gz:AB:CDE
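The three colon-separated fields of REF can be pulled apart as shown in this sketch. `parse_reference` is a hypothetical helper, not a CABSdock function; it assumes one-letter chain identifiers and splits from the right so that a colon inside a file path would not break the parsing.

```python
from typing import List, Tuple


def parse_reference(ref: str) -> Tuple[str, List[str], List[str]]:
    """Split REF into (pdb code or file, protein chains, peptide chains)."""
    # rsplit keeps any colon that belongs to the file path in the first field.
    source, protein_chains, peptide_chains = ref.rsplit(":", 2)
    return source, list(protein_chains), list(peptide_chains)


print(parse_reference("1abc:AB:CD"))            # ('1abc', ['A', 'B'], ['C', 'D'])
print(parse_reference("myfile.pdb.gz:AB:CDE"))  # ('myfile.pdb.gz', ['A', 'B'], ['C', 'D', 'E'])
```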
-r, --replicas NUM
Sets the number of replicas to be used in Replica Exchange Monte Carlo (NUM > 0, default value = 10, changing the default value is recommended only for advanced users)
-D, --replicas-dtemp DELTA
Sets the temperature increment between replicas (DELTA > 0, default value = 0.5, changing the default value is recommended only for advanced users)
-S, --save-cabs-files
Saves CABSdock simulation files.
-C, --save-config
Save simulation parameters in config file.
-d, --separation SEP
The option sets the separation distance in Angstroms between the peptide and the surface of the protein (default: 20.0 Angstroms)
--sc-rest-add RESI RESJ DIST WEIGHT
Adds a distance restraint between the SC pseudo-atom in residue RESI and the SC pseudo-atom in residue RESJ; DIST is the distance between these pseudo-atoms (the geometric centers of their side chain atoms) and WEIGHT is the restraint's weight from [0, 1]. Can be used multiple times to add multiple restraints.
--sc-rest-file FILE
Reads SC restraints from a file (use multiple times to add multiple files).
--sc-rest-weight WEIGHT
Sets a global weight for all SC restraints (default: 1.0)
-t, --temperature TINIT TFINAL
Sets the temperature range for the simulated annealing procedure: TINIT - initial temperature, TFINAL - final temperature (default values TINIT=2.0, TFINAL=1.0).
CABSdock uses a temperature-like parameter that does not correspond straightforwardly to the real temperature. A temperature value around 1.0 roughly corresponds to a nearly frozen conformation, while the folding temperature of small proteins in the CABS model is usually around 2.0.
-V, --verbose VERBOSITY
Controls how explicit the program output is: 0 for silent mode (only critical messages), 4 for maximum verbosity, default 2.
--version
print version and exit program
--weighted-fit ARG
This option sets and customizes the way models are structurally aligned, which affects both the calculation of the RMSD/RMSF and the clustering, together with the selection of the final models. Models are aligned by the Kabsch optimal fit algorithm. This option assigns weights to all atoms, which specify how 'important' each atom is in the structural fit process. Weights are numbers from the [0:1] range, with '0' meaning 'irrelevant in the fitting process'.
ARG can be either:
- off - Turns off weighted-fit (all weights are 1.0) (default).
- gauss - Weights are generated automatically in the iterative procedure described in Biophys J. 2006 Jun 15; 90(12): 4558-4573. The procedure consists of the following steps: (1) Set wi = 1.0 for i = [1,2 ... N], where N is the number of atoms. (2) Align structures using weights wi. (3) Calculate di - displacement of the i-th atom. (4) Update weights according to the formula: wi = exp(-0.5 * di * di). Repeat (2) through (4) until convergence (max 100 iterations, can be changed with --gauss-iterations).
- flex - Weights are taken from the flexibility settings. (See help entry for --protein-flexibility).
- ss - Weights are taken from the secondary structure assignment. Atoms in helices and sheets are given w = 1.0, while those in loops and coil get w = 0.0.
- <filename> - Weights are read from a file <filename>. The file should follow this format:
default 1.0 (default value, if omitted w = 1.0 is assumed)
1:A 0.5
5:A 0.1
...
1:B 0.99
...
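The iterative gauss weighting described for this option can be outlined as follows. This is a sketch of the loop structure only: `displacements_for` is a hypothetical callable standing in for steps (2) and (3), the weighted alignment and the per-atom displacement calculation, which are not implemented here, and the convergence test on the largest weight change is an assumption.

```python
import math
from typing import Callable, List, Sequence


def gauss_weights(
    displacements_for: Callable[[Sequence[float]], Sequence[float]],
    n_atoms: int,
    max_iterations: int = 100,
    tolerance: float = 1e-6,
) -> List[float]:
    """Iterate w_i = exp(-0.5 * d_i^2) until the weights stop changing."""
    weights = [1.0] * n_atoms  # step (1): all atoms start equally important
    for _ in range(max_iterations):
        displacements = displacements_for(weights)  # steps (2)-(3), supplied by the caller
        updated = [math.exp(-0.5 * d * d) for d in displacements]  # step (4)
        change = max(abs(new - old) for new, old in zip(updated, weights))
        weights = updated
        if change < tolerance:  # assumed convergence criterion
            break
    return weights


# Toy stand-in: one rigid atom (no displacement) and one mobile atom (2 Angstroms).
print(gauss_weights(lambda w: [0.0, 2.0], n_atoms=2))
```

Atoms that move little between models keep a weight close to 1, while mobile atoms are quickly down-weighted, so the superposition is driven by the rigid core of the structure.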
--work-dir DIR
Set working directory to DIR.
3. Ready-to-use examples
3.1 Default docking and demo directory
To run CABSdock using the default settings (recommended for inexperienced users) use the following syntax:
$ CABSdock -i protein-pdb-code -p peptide-sequence:peptide-secondary-structure
For example, to dock the HKLVQLLTTT peptide (with externally predicted secondary structure, CHHHHHHHCC) to the protein stored as chain A of the PDB structure 2FVJ, use
$ CABSdock -i 2FVJ:A -p HKLVQLLTTT:CHHHHHHHCC
This command will:
- load the conformation of chain A from 2FVJ PDB file as the protein structure
- load "HKLVQLLTTT" peptide sequence with the secondary structure assignment: "CHHHHHHHCC"
- set default simulation settings (no knowledge about the binding site; almost rigid backbone of the protein receptor; random initial peptide conformations and positions).
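Before launching a run, it can be convenient to check the peptide argument used in the command above. The sketch below is a hypothetical pre-check, not part of CABSdock: `split_peptide` assumes that only the 20 standard amino acid letters are accepted and that the secondary structure string, when given, must have the same length as the sequence.

```python
from typing import Optional, Tuple

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: standard residues only
SS_CODES = set("HEC")                      # H - helix, E - sheet, C - coil


def split_peptide(peptide: str) -> Tuple[str, Optional[str]]:
    """Split 'SEQUENCE[:SECONDARY_STRUCTURE]' and check basic consistency."""
    sequence, _, secondary = peptide.partition(":")
    if not sequence or set(sequence) - AMINO_ACIDS:
        raise ValueError(f"invalid peptide sequence: {sequence!r}")
    if not secondary:
        return sequence, None
    if len(secondary) != len(sequence):
        raise ValueError("sequence and secondary structure differ in length")
    if set(secondary) - SS_CODES:
        raise ValueError(f"invalid secondary structure: {secondary!r}")
    return sequence, secondary


print(split_peptide("HKLVQLLTTT:CHHHHHHHCC"))  # ('HKLVQLLTTT', 'CHHHHHHHCC')
```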
To extend the outputs, it is possible to use additional flags discussed above:
$ CABSdock -i 2FVJ:A -p HKLVQLLTTT:CHHHHHHHCC -M -C -S
The outputs from this docking are available in the demo directory. The docking results are also presented in the pictures below. The surface of the protein (2FVJ) is white, and the peptide is shown in blue with the cartoon representation.
The 1000 top-scored structures are presented in light blue. The experimental structure is presented in dark blue.
The docking result (top-scored model) is presented in marine blue. The ligand-RMSD with respect to the native conformation for the presented model is 3.46 Å (l-RMSD was calculated automatically using the --reference-pdb option).
3.2 Default docking, peptide sequence from PDB
If the peptide sequence is available as a chain of any structure stored in the PDB database, it is possible to load it directly from the database using its structure ID. For example:
$ CABSdock -i 2BZW:A -p 2BZW:B
An example result for this docking is presented in the picture below. The surface of the protein (2BZW:A) is white, and the peptide is shown in blue with the cartoon representation. The experimental structure is presented in dark blue, whereas the docking result is in marine blue. The ligand-RMSD with respect to the native conformation for the presented model is 2.70 Å (l-RMSD was calculated automatically using the --reference-pdb option):
3.3 Docking with contact information
It is possible to indicate preferred contacts for the complex modelled with CABS-dock. These will usually be contacts identified experimentally that are expected to be present in the resulting structures. To use this information as restraints for docking, use --sc-rest-add (restraints for side-chain to side-chain contacts) or --ca-rest-add (restraints for CA to CA contacts).
An example command to run a docking with additional restraints to enforce contact between residue 235 from the protein chain E and 6th residue of the peptide is:
$ CABSdock -i 2CPK:E -p TTYADFIASGRTGRRNAIHD:CHHHHHHHHCCCCCCCCCCC --sc-rest-add 235:E 6:PEP 5.0 1.0
The resulting set of top 1000 structures is presented below (peptide shown in light blue).
For comparison, an analogous set of structures is presented for a run without any contact information (a default run).
In the figure below, the experimental structure is presented in dark blue, whereas the docking result is in marine blue. For comparison, the best docking result for a run without contact information is presented in red. The ligand-RMSD with respect to the native conformation for the presented model is 2.70 Å (l-RMSD was calculated automatically using the --reference-pdb option, for details see below):
3.4 Flexible protein loops
CABSdock allows for increasing the flexibility of specified protein fragments -- for example flexible loops that cover the binding site in the unbound protein conformation.
To run a docking simulation in such a case, prepare a text file with the flexible region specified and use the --protein-flexibility option:
$ CABSdock -i 2RTM:A -p HPQFEK:CHHHCC -f flexibility.txt
The flexibility.txt file is a one-line text file:
45:A - 54:A 0
3.5 Intrinsically unstructured protein regions
The option --protein-flexibility may also be used to simulate the behavior of an intrinsically unstructured region. To run a simulation in which a part of the protein is highly flexible, issue a command similar to:
$ CABSdock -i 1Z1M:A -p RFMDYWEGL -f flexibility.txt
flexibility.txt is a simple text file containing ranges of increased flexibility. In the example case it is:
1:A - 27:A 0
106:A - 119:A 0
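Flexibility files like those used in sections 3.4 and 3.5 can also be generated from a list of ranges. The following is a minimal sketch, not a CABSdock utility: the helper name flexibility_lines is hypothetical, and writing one range per line is an assumption based on the examples above.

```python
def flexibility_lines(ranges):
    """Format (first_residue, last_residue, chain, value) tuples as flexibility file lines."""
    return ["{}:{} - {}:{} {}".format(first, chain, last, chain, value)
            for first, last, chain, value in ranges]


# The two flexible regions of the 1Z1M example, each with flexibility value 0
lines = flexibility_lines([(1, 27, "A", 0), (106, 119, "A", 0)])
print("\n".join(lines))
```

The printed text can be saved as flexibility.txt and passed with the -f option, as in the commands above.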
3.6 Docking multiple peptides
The newly introduced functionality allows the user to predict binding poses of systems including multiple interacting peptides. To use it, simply use the --peptide option multiple times.
An example command is:
CABSdock -i 1EJL:I -p 1EJL:A -p 1EJL:B
An example result for this docking is presented in the pictures below. The surface of the protein (1EJL:I) is white, and the peptides are shown in blue with the cartoon representation. The image below presents the experimental structures.
The docking results - marked with light blue and green - are presented below.
3.7 Modifying protein restraints
It is also possible to adjust the protein rigidity to match the experimental observations.
The RMSF graphs below present results obtained with option --protein-restraints set to:
- default,
- ss1 5 5.0 15.0,
- ss2 5 5.0 15.0.
3.8 Sampling near native binding modes
CABS-dock may also be used to explore near-native binding modes and bound complex dynamics. To do so, load a bound complex using the advanced peptide option --add-peptide with the keep keep flags, and set the maximum temperature to a lower value (to make sure the results will only contain bound modes).
An example command is:
CABSdock -i 1AWR:C -P 1AWR:I keep keep --temperature 1.2 1.0
The input structure (1AWR chain C and I) is presented in the picture below.
The resulting set of near-native binding modes generated with CABSdock procedure is presented below.
3.9 Calculating ligand-RMSD values to a reference complex
The ligand-RMSD values for the peptide may be calculated automatically using the --reference-pdb option. This option can be used while running a simulation with any other settings:
$ CABSdock -i 2FVJ -p HKLVQLLTTT:CHHHHHHHCC --reference-pdb 2FVJ:AB
where 2FVJ:AB is the reference protein-peptide complex. This option also activates additional analysis procedures. The additional output of those methods is described in 5.1 RMSD plot analysis.
3.10 Refinement of CABS-dock models using Rosetta FlexPepDock
In CABS-dock, models are reconstructed by default into all-atom representation using the Modeller software. Additional structure refinement can improve this result. The pipeline we propose here uses the high-resolution FlexPepDock protocol, but other tools are also available.
For an individual case, you can use an online available server: http://flexpepdock.furmanlab.cs.huji.ac.il/
Here we present a variant using the standalone version, so you need a locally installed Rosetta software.
INPUT PREPARATION
First, you need to properly prepare the inputs for Rosetta.
You need at most 3 files:
- model.pdb, the CABS-dock resulting protein-peptide complex structure in all-atom or backbone+CB representation (required)
- native.pdb, if the structure of the protein-peptide complex is known (optional)
- unbound.pdb, if the structure of the unbound receptor is known (optional)
In each of the files, the order of the coordinates of the receptor, then the peptide should be kept. Files should be cleared of unnecessary information, such as headers, only the "ATOM" section should be kept and only from the certain chains. An effective method is to use a ready-made python script located on the path: ~/Rosetta/tools/protein_tools/scripts/clean_pdb.py
USAGE:
~/Rosetta/tools/protein_tools/scripts/clean_pdb.py model.pdb receptor_chain_id peptide_chain_id
e.g.
~/Rosetta/tools/protein_tools/scripts/clean_pdb.py model.pdb A B
The output will be: model_AB.pdb.
If you use a full atomic structure or retain side chains, steric clashes can lead to poor Rosetta energies. There are several ways to avoid this:
1) cut Cα coordinates and reconstruct with another tool, e.g. PRODART, REMO, BBQ, SAABAC
2) use the FlexPepDock protocol with -min_receptor_bb, which will allow for receptor backbone minimization
3) replace CABSdock receptor coordinates by the free receptor structure, but note that you should first align both structures using e.g. pymol or theseus software.
theseus USAGE:
theseus -sfrom-to -o reference_structure aligned_structure
theseus -s0-120 -o model_AB.pdb unbound.pdb
* -s option is the receptor residues selection
The output will be: theseus_sup.pdb – unbound receptor superposed on CABSdock receptor.
Then you should replace the coordinates of receptor:
~/Rosetta/tools/protein_tools/scripts/clean_pdb.py model_AB.pdb B #cut peptide coordinates
cat theseus_sup.pdb model_AB_B.pdb > model.pdb #paste unbound receptor and peptide
INITIAL PREPACK
Before you start the proper FlexPepDock simulation, you should quickly prepack the input structure:
USAGE:
~/Rosetta/main/source/bin/FlexPepDocking.proper_compilation_version -database ~/Rosetta/main/database -s model.pdb -flexpep_prepack -ex1 -ex2aro
The output will be: model_0001.pdb
You can change file name: mv model_0001.pdb model_prepacked.pdb
FlexPepDock REFINEMENT
First, you should prepare a flexpepdock.flagfile:
#-bGDT
-nstruct 250 #how many decoys you need
-in::file::s /PATH/model_prepacked.pdb #input protein-peptide complex
-out:file:silent flexpepdock.silent #output silent file name
-out:file:silent_struct_type binary
-pep_refine
-ex1
-ex2aro
-use_input_sc
-unboundrot PATH/unbound.pdb #input unbound receptor for rotamers (not necessary)
USAGE:
~/Rosetta/main/source/bin/FlexPepDocking.proper_compilation_version -database ~/Rosetta/main/database @flexpepdock.flagfile
EXTRACT PDBs
The silent file is a Rosetta output format that is used to store ensembles of structures. Each frame in a silent file has a unique identifier, which is called the decoy-tag. The unique decoy-tag "description" is at the end of each line that belongs to the respective frame, which makes it possible to identify and extract frames. The most popular selection criterion is the Rosetta score "score", which allows you to choose the models with the best (lowest) Rosetta energy:
grep '^SCORE' flexpepdock.silent | cut -c 1-284,309- > tmp
cat tmp | sort -k1,1 -k2g > silent_scores.sc
cat silent_scores.sc | sort -k2g | awk '{print $47}' | head -11 | tail -10 > top10.tag
* check if column 47 in silent_scores is the 'description'
USAGE:
~/Rosetta/main/source/bin/extract_pdbs.proper_compilation_version -in:file:silent /PATH/flexpepdock.silent -in:file:tagfile top10.tag
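Because the position of the description column can differ between runs (hence the note about column 47 above), the tag selection can also be done by looking the columns up by name in the SCORE: header line. The following Python sketch illustrates this idea; the function name top_tags is hypothetical, and it assumes that the first SCORE: line of the silent file is the header and that the score column is named "score" (pass another name, such as "total_score", if your header differs).

```python
def top_tags(lines, n=10, score_column="score"):
    """Return the description tags of the n lowest-scoring frames."""
    rows = [line.split() for line in lines if line.startswith("SCORE:")]
    header = rows[0]
    score_idx = header.index(score_column)
    tag_idx = header.index("description")
    # Drop any repeated header lines before sorting numerically by score
    frames = [row for row in rows[1:] if row[tag_idx] != "description"]
    frames.sort(key=lambda row: float(row[score_idx]))
    return [row[tag_idx] for row in frames[:n]]


sample = [
    "SCORE: score I_sc description",
    "SCORE: -10.5 -3.0 model_0001",
    "SCORE: -12.0 -2.0 model_0002",
    "SCORE: -8.0 -1.0 model_0003",
]
print(top_tags(sample, n=2))
```

For a real run, the lines would come from the silent file (for example `open("flexpepdock.silent")`), and the returned tags would be written one per line to top10.tag.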
3.11 Docking to GPCRs
Recently, we've proposed a CABS-dock based protocol dedicated for modeling GPCR-peptide systems. The protocol details are provided in the work: Badaczewska-Dawid A, Kmiecik S, Kolinski M. Docking of peptides to GPCRs using a combination of CABS-dock with FlexPepDock refinement (submitted).
The protocol consists of three modeling stages:
(1) docking of peptides to GPCRs using CABS-dock; the peptide sampling space is restricted to a spherical volume that includes all receptor fragments that may interact with bound peptides,
(2) reconstruction of atomistic structures from C-alpha traces using PD2,
(3) refinement of protein-peptide complex structures and scoring of models using Rosetta FlexPepDock.
Example commands
The example command lines for the peptide-GPCR complex (5GLH system):
STAGE 1: Running single docking simulation using CABS-dock
~/CABSdock -s 100 -M -C -S -v 4 -i 5GLH_struc.pdb:A -p 5GLH_struc.pdb:B --reference-pdb 5GLH_struc.pdb:A:B --ca-rest-add 1:PEP 15:PEP 5.3 1.0 --ca-rest-add 3:PEP 11:PEP 6.2 1.0 --sc-rest-add 249:A 1:PEP 30.0 5.0 --sc-rest-add 249:A 2:PEP 30.0 5.0 --sc-rest-add 249:A 3:PEP 30.0 5.0 --sc-rest-add 249:A 4:PEP 30.0 5.0 --sc-rest-add 249:A 5:PEP 30.0 5.0 --sc-rest-add 249:A 6:PEP 30.0 5.0 --sc-rest-add 249:A 7:PEP 30.0 5.0 --sc-rest-add 249:A 8:PEP 30.0 5.0 --sc-rest-add 249:A 9:PEP 30.0 5.0 --sc-rest-add 249:A 10:PEP 30.0 5.0 --sc-rest-add 249:A 11:PEP 30.0 5.0 --sc-rest-add 249:A 12:PEP 30.0 5.0 --sc-rest-add 249:A 13:PEP 30.0 5.0 --sc-rest-add 249:A 14:PEP 30.0 5.0 --sc-rest-add 249:A 15:PEP 30.0 5.0 --sc-rest-add 249:A 16:PEP 30.0 5.0 --sc-rest-add 249:A 17:PEP 30.0 5.0 --sc-rest-add 249:A 18:PEP 30.0 5.0 --sc-rest-add 249:A 19:PEP 30.0 5.0 --sc-rest-add 249:A 20:PEP 30.0 5.0 --sc-rest-add 249:A 21:PEP 30.0 5.0
mv model.pdb 5GLH_S1_M1.pdb
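The long list of --sc-rest-add options in the STAGE 1 command repeats the same restraint for each of the 21 peptide residues, so it can be generated rather than typed by hand. The sketch below is an illustration only: the helper name sc_restraints is hypothetical, and the command it builds omits the other options of the full command above.

```python
def sc_restraints(receptor_residue, peptide_length, distance, weight):
    """Build --sc-rest-add arguments linking one receptor residue to every peptide residue."""
    args = []
    for i in range(1, peptide_length + 1):
        args += ["--sc-rest-add", receptor_residue, "{}:PEP".format(i),
                 str(distance), str(weight)]
    return args


# Restrain residue 249 of chain A to each of the 21 peptide residues (30.0, 5.0 as above)
command = ["CABSdock", "-i", "5GLH_struc.pdb:A", "-p", "5GLH_struc.pdb:B"]
command += sc_restraints("249:A", 21, 30.0, 5.0)
print(" ".join(command))
```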
STAGE 2: Structure reconstruction from coarse-grained representation using PD2
~/bin/pd2_ca2main --database ./database/ -i 5GLH_S1_M1.pdb -o 5GLH_S1_M1-reconstructed.pdb --ca2main:new_fixed_ca --ca2main:bb_min_steps 500
STAGE 3: Structure refinement using Rosetta FlexPepDock
1. Input preparation
use select_atoms.py for:
- removing the side chains (select atoms by option -wa),
- ordering the protein and peptide coordinates in the pdb file (receptor - first, peptide - second; use option -wc),
~/select_atoms.py -f 5GLH_S1_M1-reconstructed.pdb -wc A,B -wa CA,CB,C,O,N -o 5GLH_S1_M1-backboneCB.pdb
- renaming the protein and peptide chains: receptor - A, peptide - B,
~/rename_chains.py -f 5GLH_S1_M1-backboneCB.pdb -cho C,D -chn A,B -o 5GLH_S1_M1-chains.pdb
- renumbering the amino acid residues in the pdb file (starting from 1),
- preparing properly formatted structure files (e.g. no 0.00 values in the occupancy column),
~<path_to_Rosetta>/tools/protein_tools/clean_pdb.py 5GLH_S1_M1-chains.pdb #default output: 5GLH_S1_M1-chains_AB.pdb
mv 5GLH_S1_M1-chains_AB.pdb 5GLH_S1_M1-formated.pdb
2. Side chains reconstruction and pre-packing of initial complex structural components
~<path_to_Rosetta>main/source/bin/FlexPepDocking.linuxgccrelease -database <path_to_Rosetta>/main/database -s 5GLH_S1_M1-formated.pdb -flexpep_prepack -ex1 -ex2aro #default output: 5GLH_S1_M1-formated_0001.pdb
mv 5GLH_S1_M1-formated_0001.pdb 5GLH_S1M1.pdb
3. Refinement of pre-packed initial complex structure
~<path_to_Rosetta>main/source/bin/FlexPepDocking.linuxgccrelease -database <path_to_Rosetta>/main/database @flexpepdock.flagfile
#-bGDT
-nstruct 300
-in::file::native <path>/ref.pdb
-in::file::s <path>/5GLH_S1M1.pdb
-out:file:silent flexpepdock-lowres.silent
-out:file:silent_struct_type binary
-detect_disulf true
-rebuild_disulf true
-fix_disulf <path>/disulfide.dat
-lowres_preoptimize true
-pep_refine
-ex1
-ex2aro
-use_input_sc
disulfide.dat content:
293 307
295 303
3'. Minimization of pre-packed initial complex structure using Rosetta FlexPepDock (STAGE 3')
~<path_to_Rosetta>main/source/bin/FlexPepDocking.linuxgccrelease -database <path_to_Rosetta>/main/database @flexpepdock.flagfile
#-bGDT
-nstruct 1
-in::file::native <path>/ref.pdb
-in::file::s <path>/5GLH_S1M1.pdb
-out:file:silent flexpepdock-minimize.silent
-out:file:silent_struct_type binary
-detect_disulf true
-rebuild_disulf true
-fix_disulf <path>/disulfide.dat
-flexPepDockingMinimizeOnly true
-ex1
-ex2aro
-use_input_sc
STAGE 4: Scoring and selecting top models
a) selecting final top-scored models using analyze_silent.sh
analyze_silent.sh content:
cat flexpepdock.silent | grep "SCORE:" | head -1 | awk '{for (i = 1; i <= NF; i++) print i, $i}' > p
t1=`cat p | grep "reweighted_sc" | awk '{print $1}'`
t2=`cat p | grep "I_sc" | awk '{print $1}'`
t3=`cat p | grep "pep_sc" | awk '{print $1}' | head -1`
t4=`cat p | grep "rmsBB_if" | awk '{print $1}'`
t5=`cat p | grep "description" | awk '{print $1}'`
rm p
echo "SCORE: total_score reweighted_sc I_sc pep_sc rmsBB_if description" > columns.sc
cat flexpepdock.silent | grep "SCORE:" | awk -v A=$t1 -v B=$t2 -v C=$t3 -v D=$t4 -v E=$t5 '{print $1,$2,$A,$B,$C,$D,$E}' >> columns.sc
#--- Select 1% top-scored models using total_score and reweighted_sc
cat columns.sc | tr . , | sort -gk2 | tr , . | head -3 > data.dat
cat columns.sc | tr . , | sort -gk3 | tr , . | head -3 >> data.dat
#--- Select final 10 top-scored models using reweighted_sc and pep_sc
cat data.dat | sort | uniq | tr . , | sort -gk3 | head -5 | tr , . > TOP10
cat data.dat | sort | uniq | tr . , | sort -gk5 | head -5 | tr , . >> TOP10
#--- Create file with tags for extracting pdbs
cat TOP10 | awk '{print $6}' > tags
b) extracting compressed structure coordinates to pdb file
~<path_to_Rosetta>main/source/bin/extract_pdbs.static.linuxgccrelease -database <path_to_Rosetta>/main/database -in:file:silent flexpepdock-lowres.silent -in:file:tagfile tags
~<path_to_DockQ>DockQ/scripts/fix_numbering.pl <path>/5GLH_S1M1_0001.pdb <path>/ref.pdb
~<path_to_DockQ>DockQ/DockQ.py <path>/5GLH_S1M1_0001.pdb.fixed <path>/ref.pdb > 5GLH_S1M1_0001-parameters
4. Output models
The resulting models are stored in the /output_pdb folder in the working directory. The number of CABS-dock top-scored models and sets of models can be modified by users. CABS-dock modeling results in the following files containing models or sets of models (see also the Figure below):
- model_*.pdb – by default, 10 top-scored models in all-atom representation numbered from 1 to 10 (PDB file)
- cluster_*.pdb – clusters of models (groups of models that have been classified in structural clustering to particular clusters) in CA representation, by default numbered from 1 to 10 (PDB file). Cluster numbering corresponds to cluster ranking and to model numbering, i.e. model_7.pdb is the representative model for the models grouped in the seventh cluster (ranked as seventh) (cluster_7.pdb). Cluster_*.pdb files may be used, for example, for visual assessment of clustering quality or visualization of the near-bound conformations. If combined with custom scoring methods, they may be used to improve the quality of the selected final models.
- top *.pdb – top-scored models in CA representation, selected for further clustering and analysis from the 10 trajectories (PDB file). By default, a top1000.pdb file is generated containing 1000 top-scored models that passed a simple energy-based filtering procedure (100 lowest energy models are selected from each replica). If the user expects a non-standard clustering method to provide better results, this file may be used as its input.
- replica_*.pdb – complete set of 10 trajectories in CA representation, numbered from 1 to 10 (PDB file). Each replica contains 1000 models. Combined, they comprise all the models saved during the CABS simulation and may be treated as the raw output of the method. The user may then apply custom filtering and clustering procedures to improve the success rate of the final model selection.
The Figure below shows CABS-dock pipeline with default settings:
5. Output plots and additional analysis to reference complex
CABSdock creates several plots during analysis of its results, stored in the /plots and /contact_maps subdirectories of the working directory.
5.1 RMSD plot analysis
If you provide a reference protein-peptide complex (using the --reference-pdb option) to compare the modeling results with, the CABSdock package will generate:
- a plot of RMSF (root mean square fluctuation)
- energy (total and interaction) vs. peptide RMSD (to reference peptide)
Sample output:
- RMSF (root mean square fluctuation) of subsequent target residues (around input position). In the case of a long target protein, only some reference residues are marked on the x axis. Values of RMSF range from 0 to 1. Sample path: workdir/plots/RMSF_seq.svg. A plain text file containing this data is available in the corresponding workdir/RMSF.csv.
- Energy vs. RMSD to reference peptide. Two separate plots are made: one shows total energy (for the entire complex structure), the other shows interaction energy (for the interaction between the peptide and the protein receptor). Both consist of a plot and a histogram of the RMSD distribution along the trajectory. Upper plot: energy vs. RMSD plot. Energy is given in CABS units. Sample paths to the plot and the plain text file are, respectively, workdir/plots/E_RMSD_<chain>_<energy>.svg and workdir/plots/E_RMSD_<chain>_<energy>.csv, where <chain> is the peptide chain character and <energy> is the type of energy (total or interaction). Lower histogram: counts of frames with a particular RMSD. Bins are at most 1 Å wide (less if the difference between the highest and lowest RMSD is less than 5). Both: data for all frames is plotted in gray. Top 1k models are plotted in dark orange.
- RMSD to reference peptide vs. MC step. For each replica CABS provides the history of RMSD changes (only if a reference PDB was given). A dotted line between points is introduced to make the sequence of points clear. Sample path: workdir/plots/RMSD_frame_<chain>_replica_<replica number>.svg, and workdir/plots/RMSD_frame_<chain>_replica_<replica number>.svg for the plain text file.
5.2 RMSD additional analysis
Additional data from the simulation are stored in the /output_data subdirectory of the working directory. It consists of the following files ('C' in the examples refers to the peptide chain in CABSdock trajectories):
- all_rmsds_C.txt - list of RMSD values to the reference structure of the peptide(s) calculated for each of the simulation frames.
- filtered_rmsds_C.txt - list of RMSD values to the reference structure of the peptide(s) for each of the frames filtered before scoring.
- medoids_rmsds_C.txt - list of the peptide RMSD values of the top scored models (cluster medoids).
- lowest_rmsds_C.txt - summary of the lowest peptide RMSD values obtained in the simulations.
- target_alignment_C.csv - sequential alignment to the reference structure of the C-chain.
- config.ini - a CABSdock configuration file that stores all the options used to execute a docking run. This file may be used either to re-run docking or for analysis of an already finished docking (see below).
5.3 Contact map and contact histogram plot analysis
- --contact-maps - if this flag is given, contact maps will be calculated, and data and plots will be stored.
- Target protein internal contact map. If the target protein is too long, only some ticks will be marked on both axes. Sample path: workdir/contact_maps/target_all.svg.
- Maps of interface contact frequencies with all target residues for clusters, top models and replicas. Frequencies are presented in separate lines if needed. Sample path: workdir/contact_maps/<type>_<number>_ch_<chain character>.svg, with corresponding files with the txt extension containing plain data. <type> can be one of the following: cluster, top (for top model), or replica; <number> denotes the number of the corresponding type, i.e. cluster, top model or replica, and <chain character> distinguishes peptide chains. E.g. top_4_ch_X.svg would be the name of the contact map for chain X in the 4th of the top-scored models.
- Histograms of contact frequencies are divided into three sections: top, middle and bottom. Top (upper histogram) shows frequencies of peptide chain residues. The middle section (all histograms but the first and last) shows a detailed analysis of only those residues from the target protein that were in contact with the peptide at least once. The last section (last histogram) shows a summary of contact frequencies of all target residues, whether they had contacts with the peptide or not.
5.4 Handling of not identical input and reference models
- Built-in sequence alignment.
During calculation of RMSD to the reference structure, sequence alignments between simulation and reference are created for both the peptide and the target protein. The --align option allows the user to choose the method of sequence alignment to be used. By default CABS-dock uses its own implementation of the Smith-Waterman algorithm. If package NCBI+ is installed, it is also possible to use protein BLAST. In that case one can set the align method to blastp:
CABSdock ... --align blastp ...
- Loading alignment from file.
If the alignment to the reference structures is known, or when the available sequence alignments are not enough to properly align the target or the peptide, the path to a reference alignment can be passed to CABS-dock. To do so, set the --align argument to CSV, which orders CABS-dock to use the aligning method that loads an external file, and use --alignment-options to pass the file name as fname=<path>. E.g.:
CABSdock ... --align CSV --alignment-options fname=external/file.csv
If the alignments of the peptide and the target are stored in different files, the user can pass different options to be used while loading the alignment of the target or the peptide:
CABSdock ... --align CSV --alignment-options fname=external/file.csv --alignment-peptide-options fname=external/file_peptide.csv
If --alignment-peptide-options is not given, the file from --alignment-options is passed to the peptide file loader.
The given file needs to be in the CSV format described by Berbalk et al. in 2009 (doi: 10.1002/pro.213; alignments returned by CABSdock are in that particular CSV format). A sample file is given below:
reference template
B:687:H C:687:H
B:688:K C:688:K
B:689:I C:689:I
B:690:L C:690:L
B:691:H C:691:H
B:692:R C:692:R
B:693:L C:693:L
B:694:L C:694:L
B:695:Q C:695:Q
B:696:D C:696:D
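An alignment file like the sample above can be inspected with a short Python script before passing it to CABS-dock. This is only a sketch: the function name parse_alignment is hypothetical, and since the field separator is not visible in the sample, the code accepts commas, tabs or spaces (an assumption).

```python
import re


def parse_alignment(text):
    """Return a list of (reference, template) pairs, each a (chain, resnum, aa) tuple."""
    pairs = []
    for line in text.splitlines():
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        # Skip blank lines and the 'reference template' header
        if len(fields) != 2 or fields[0] == "reference":
            continue
        pairs.append(tuple(tuple(f.split(":")) for f in fields))
    return pairs


sample = "reference template\nB:687:H C:687:H\nB:688:K C:688:K\n"
for reference, template in parse_alignment(sample):
    print(reference, "->", template)
```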
6. Additional docking analysis
6.1 Analysis of an already finished simulation
It is sometimes necessary to perform additional analysis of the docking results, for example to calculate RMSD to another reference complex or to produce contact maps with a slightly changed cut-off. To perform this kind of analysis, remember to run your original job with the --save-cabs-files and --save-config options:
$ CABSdock -i 2P1T:A -p HKILHRLLQD:CHHHHHHHHC --save-cabs-files --save-config
These options will result in storing two additional files: a CABSdock config file config.ini and a compressed archive <timestamp><randomstring>.cbs.
To re-run the default analysis of your job use the following command using --config and --load-cabs-files options:
$ CABSdock -c SAVED_CONFIG_FILE --load-cabs-files SAVED_CBS_FILE
You can use this syntax to specify any additional analysis option (your command line options will overwrite any options specified in the CONFIG file). For example you may want to filter out only 100 low-energy models and cluster them into 3 clusters using --filtering-count and --clustering-medoids options to alter the default settings:
$ CABSdock -c SAVED_CONFIG_FILE --load-cabs-files SAVED_CBS_FILE --filtering-count 100 --clustering-medoids 3
6.2 Analysis with PyMOL plugin
Recently, we developed a PyMOL plugin that enables molecular visualization analysis of CABSdock results. The plugin repository is temporarily available here; see the plugin documentation.
7. CABSdock scoring
The CABSdock scoring procedure can be modified by users.
The default procedure (using default settings) is as follows:
- The simulation module produces a set of 10,000 models (10 trajectories consisting of 1000 models each) in CA representation.
- The scoring module selects top-scored models from the simulation module output. Top-scored models are selected based on interaction energy values and structural clustering. The scoring module outputs sets of 10, 100 and 1000 top-scored models in CA representation.
- The reconstruction module uses the Modeller package to reconstruct the set of 10 top-scored models from CA to all-atom representation.
8. Advanced CABS data
8.1 .cbs files
A .cbs file contains the complete set of both input and output text files read and written by the core CABS simulation module, compressed into a single archive. The CABSdock and CABSflex programs can read (-L, --load-cabs-files FILE) and write (-S, --save-cabs-files) .cbs files directly. It is, however, possible to extract basic information from the underlying files.
8.2 Filename pattern
.cbs file names consist of a timestamp in the yymmddHHMMSS format followed by a random 6-character string and the .cbs extension, as in: 180129155704Enar5w.cbs.
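The file name pattern can be decoded programmatically, for example to sort archives by creation time. The snippet below is a minimal sketch; the function name parse_cbs_name is hypothetical, and it assumes that the random string consists of letters and digits only.

```python
import re
from datetime import datetime

# 12 digits (yymmddHHMMSS) + 6 random characters + '.cbs'
CBS_NAME = re.compile(r"^(\d{12})([A-Za-z0-9]{6})\.cbs$")


def parse_cbs_name(filename):
    """Split a .cbs file name into its timestamp and random suffix."""
    match = CBS_NAME.match(filename)
    if match is None:
        raise ValueError("not a .cbs file name: " + filename)
    stamp = datetime.strptime(match.group(1), "%y%m%d%H%M%S")
    return stamp, match.group(2)


stamp, suffix = parse_cbs_name("180129155704Enar5w.cbs")
print(stamp.isoformat(), suffix)
```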
In order to extract all files into current directory run:
tar xzf myfile.cbs
This will create (and possibly overwrite) five files in the current directory: INP, SEQ, TRAF, OUT and FCHAINS.
To extract one specific file (e.g. SEQ) to the current directory run:
tar xz SEQ < myfile.cbs
or
tar xzO SEQ < myfile.cbs
to only write its content to the screen.
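The same can be done from Python with the standard tarfile module, since a .cbs file is a gzip-compressed tar archive (as the tar xzf command above implies). This is a sketch only; the function name read_cbs_member is hypothetical.

```python
import tarfile


def read_cbs_member(cbs_path, member="SEQ"):
    """Return the text content of one file (e.g. SEQ or INP) stored in a .cbs archive."""
    with tarfile.open(cbs_path, "r:gz") as archive:
        handle = archive.extractfile(member)  # raises KeyError if the member is missing
        if handle is None:  # member exists but is not a regular file
            raise KeyError(member)
        return handle.read().decode()


# Example (assumes myfile.cbs exists in the current directory):
# print(read_cbs_member("myfile.cbs", "SEQ"))
```

Unlike tar xzf, this reads the file into memory without writing anything to the current directory, so existing files are not overwritten.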
8.3 INP file (input)
INP is an input file for the CABS procedure. It has a very restrictive format, where most whitespaces and newlines matter, so know what you're doing before modifying it. Specifically, empty lines and comment lines are not allowed. Fields within a line are separated by one or more whitespace characters (including tabs). The order of lines and of fields within a line is meaningful.
INP file is composed of four sections:
- general configuration (lines 1 - 4)
- CA restraints (lines 5 - 5+N) (N is the number of CA restraints)
- SC restraints (lines 6+N, 6+N+M) (M is the number of side-chain restraints)
- excluding (lines 7+N+M, 7+N+M+K) (K is the number of excluded contacts)
N, M and K could all be "0".
general configuration section contains all of the parameters required to run CABS such as the simulation
temperature, scaling factors for the force field components, parameters controlling the simulation length etc.
CA restraints section contains the list of the restraints imposed on pairs of the CA atoms.
SC restraints section contains the list of the restraints imposed on pairs of the unified side-chain pseudo-atoms.
excluding section contains the list of all of the forbidden contacts between any two residues (both CA and SC
(pseudo-)atoms are considered when checking for contact).
Below is a detailed description of the INP file format.
Line number: field
1: RNG-seed
2: MC-anneal MC-cycles MC-steps #replicas #chains
3: T-initial T-final E-repulsion E-interaction dT-replicas
4: E-side-chain E-long-range E-centro-symmetric E-hydrogen-bond E-short-range
5: #CA-restraints(N) weight
6: chainI residueI chainJ residueJ distance weight
7: chainI residueI chainJ residueJ distance weight
. . .
N+6: #SC-restraints(M) weight SC-SC
N+7: chainI residueI chainJ residueJ distance weight
N+8: chainI residueI chainJ residueJ distance weight
. . .
N+M+7: #excluded-contacts excluding-cut-off
N+M+8: chainI residueI chainJ residueJ
N+M+9: chainI residueI chainJ residueJ
. . .
RNG-seed - integer to seed the Random Number Generator
MC-anneal/cycles/steps - integers controlling the length of the simulation
#replicas - number of replicas to be used
#chains - number of protein chains
T-initial/final - initial and final temperature of the simulation
dT-replicas - temperature difference between neighboring replicas
E-* - scaling factors for various energy terms
#CA/SC-restraints - number of CA/SC restraints
#excluded contacts - number of excluded contacts
chainI/J - identification number of protein chain: 1, 2 ... (not A, B)
residueI/J - identification number of a residue within a chain 1, 2 ... (not a number from the pdb file)
Example INP file:
1245
20 10 10 1 9
1.40 1.40 4.00 1.00 0.50
1.000 2.000 0.125 -2.000 0.375
1432 1.00
1 2 1 48 6.58 1.00
1 2 1 49 5.54 1.00
1 2 1 50 4.45 1.00
1 2 1 51 7.55 1.00
1 3 1 48 5.73 1.00
1 3 1 49 5.43 1.00
1 3 1 50 6.42 1.00
1 3 7 38 6.96 1.00
1 4 1 46 6.21 1.00
. . .
9 78 9 81 4.97 1.00
9 78 9 82 6.28 1.00
9 79 9 82 4.78 1.00
9 79 9 83 5.66 1.00
9 80 9 83 4.68 1.00
9 81 9 87 6.08 1.00
9 82 9 87 6.65 1.00
2 1.00
1 5 2 12 4.50 0.50
3 17 3 35 5.00 0.75
6 5.000
7 5 1 66
7 1 1 66
7 4 1 66
7 3 1 66
7 2 1 66
7 6 1 66
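The line layout described above translates directly into a small reader. The Python sketch below is an illustration under stated assumptions, not the parser used by CABS: the function name parse_inp and the dictionary keys are hypothetical, and the sample input is a shortened version of the example (restraint counts reduced to 2 so that the file is complete).

```python
LINE2 = ["MC-anneal", "MC-cycles", "MC-steps", "#replicas", "#chains"]
LINE3 = ["T-initial", "T-final", "E-repulsion", "E-interaction", "dT-replicas"]
LINE4 = ["E-side-chain", "E-long-range", "E-centro-symmetric",
         "E-hydrogen-bond", "E-short-range"]


def parse_inp(text):
    """Parse the four INP sections into a dictionary keyed by field name."""
    rows = [line.split() for line in text.splitlines()]
    inp = {"RNG-seed": int(rows[0][0])}
    inp.update(zip(LINE2, (int(x) for x in rows[1])))
    inp.update(zip(LINE3, (float(x) for x in rows[2])))
    inp.update(zip(LINE4, (float(x) for x in rows[3])))
    pos = 4
    for name in ("CA-restraints", "SC-restraints"):
        count, weight = int(rows[pos][0]), float(rows[pos][1])
        # each restraint: chainI residueI chainJ residueJ distance weight
        inp[name] = [tuple(int(x) for x in r[:4]) + (float(r[4]), float(r[5]))
                     for r in rows[pos + 1:pos + 1 + count]]
        inp[name + "-weight"] = weight
        pos += 1 + count
    count, cutoff = int(rows[pos][0]), float(rows[pos][1])
    inp["excluded-contacts"] = [tuple(int(x) for x in r)
                                for r in rows[pos + 1:pos + 1 + count]]
    inp["excluding-cut-off"] = cutoff
    return inp


sample = "\n".join([
    "1245", "20 10 10 1 9", "1.40 1.40 4.00 1.00 0.50",
    "1.000 2.000 0.125 -2.000 0.375",
    "2 1.00", "1 2 1 48 6.58 1.00", "1 2 1 49 5.54 1.00",
    "2 1.00", "1 5 2 12 4.50 0.50", "3 17 3 35 5.00 0.75",
    "2 5.000", "7 5 1 66", "7 1 1 66",
])
inp = parse_inp(sample)
print(inp["#chains"], len(inp["CA-restraints"]), inp["excluded-contacts"])
```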
8.4 SEQ file (input)
SEQ file contains information such as protein sequence, secondary structure and local flexibility. This file is used by the CABSdock and the CABSflex programs to generate pdb files with output structures and trajectories - residues' names and numbers and chains' IDs are taken from the SEQ file.
SEQ file contains as many lines as there are residues in the simulated system. Each line is organised into 5 columns:
residue-number residue-name chain-ID II-structure flexibility
residue-number - as it occurs in the input pdb file
residue-name - name of the residue in the 3-letter code
chain-ID - one character identifying protein chain as it occurs in the input pdb file
II-structure - single digit indicating the secondary structure assigned to each residue in the following code:
- 1 - coil
- 2 - helix
- 3 - turn
- 4 - sheet
flexibility - number from [0, 1] range indicating the level of flexibility assigned to each residue, where 0.0 means 'fully flexible', and 1.0 - 'rigid'.
Example SEQ file:
135 GLU A 1 1.00
136 ARG A 4 1.00
137 ARG A 4 1.00
. . .
180 ARG A 4 1.00
135 GLU B 1 1.00
136 ARG B 1 1.00
137 ARG B 4 1.00
. . .
180 ARG B 4 1.00
3 GLN H 1 1.00
4 LYS H 4 1.00
5 THR H 4 1.00
. . .
30 ASP H 1 1.00
1 MET J 1 1.00
2 ALA J 4 1.00
3 GLN J 4 1.00
4 LYS J 4 1.00
5 THR J 4 1.00
6 PHE J 4 1.00
7 LYS J 4 1.00
8 VAL J 4 1.00
9 THR J 1 1.00
10 ALA J 1 1.00
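The five-column SEQ format is easy to read into structured records. The following Python sketch is illustrative only: the names SeqRecord and parse_seq are hypothetical, and lines that do not have exactly five fields (such as the ". . ." elisions in the example) are skipped.

```python
from collections import namedtuple

SeqRecord = namedtuple("SeqRecord", "resnum resname chain sec_struct flexibility")

# Secondary structure code as listed above
SEC_STRUCT = {1: "coil", 2: "helix", 3: "turn", 4: "sheet"}


def parse_seq(text):
    """Parse SEQ lines into a list of SeqRecord tuples."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:  # skip blank or elided lines
            continue
        resnum, resname, chain, ss, flex = fields
        records.append(SeqRecord(int(resnum), resname, chain,
                                 SEC_STRUCT[int(ss)], float(flex)))
    return records


sample = "135 GLU A 1 1.00\n136 ARG A 4 1.00\n. . .\n3 GLN H 1 1.00\n"
records = parse_seq(sample)
print(records[1])
```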
8.5 FCHAINS file (input)
FCHAINS file contains the coordinates of all of the CA atoms in the system in the initial conformation (before the simulation starts). The file is organised in sections; each corresponding to exactly one protein chain. Section starts with a single integer number N in a line and is followed by N lines - each containing three integers: x, y and z coordinates of one of the CA atoms in CABS lattice units (hence the integers).
The number of sections (chains) in the FCHAINS file depends also on how many replicas are to be used during the simulation. In general each chain can have different set of coordinates in different replicas. Finally the structure of the FCHAINS file containing N replicas and M chains in each replica is as follows:
chain 1 replica 1
chain 1 replica 2
...
chain 1 replica N
chain 2 replica 1
chain 2 replica 2
...
chain 2 replica N
. . .
chain M replica 1
chain M replica 2
...
chain M replica N
### Example FCHAINS file:
The file contains coordinates for three protein chains (lengths: 19, 14 and 6) in two replicas:

    19
    51 5 25
    53 1 30
    59 0 30
    61 -4 35
    66 -2 39
    69 -6 36
    68 -4 31
    69 2 33
    75 0 35
    76 -3 30
    76 2 27
    79 6 31
    84 2 31
    84 2 24
    83 8 24
    88 8 28
    91 4 25
    91 7 20
    92 13 23
    19
    99 8 19
    97 12 15
    100 18 16
    96 22 12
    101 26 12
    105 22 9
    110 23 12
    113 18 12
    113 18 5
    118 22 6
    121 17 8
    120 14 2
    124 14 -3
    122 19 -6
    118 19 -10
    120 20 -16
    117 22 -21
    120 23 -26
    117 25 -31
    14
    121 1 -31
    123 7 -28
    119 6 -23
    121 12 -20
    116 12 -16
    117 6 -14
    121 7 -9
    117 11 -7
    112 8 -8
    115 3 -6
    116 6 -1
    111 8 0
    108 2 -1
    14
    116 6 -1
    111 8 0
    108 2 -1
    112 0 3
    111 4 8
    105 3 6
    105 -3 7
    108 -2 13
    103 2 14
    99 -2 12
    102 -6 16
    101 -3 21
    6
    115 -37 3
    113 -32 5
    117 -28 9
    115 -23 11
    119 -21 15
    118 -15 14
    6
    112 -19 19
    115 -15 22
    111 -10 21
    106 -14 23
    109 -14 28
    104 -18 30
Note that protein chains in the FCHAINS file are always longer by two residues than respective chains in the input pdb file loaded by the CABSdock or the CABSflex programs, since upon casting a protein structure on the CABS lattice two "dummy" residues are added to the ends of all protein chains.
Also note that number and length(s) of protein chains in the FCHAINS file are sometimes different from what is loaded by the CABSdock or the CABSflex programs, as upon casting a protein structure on the CABS lattice the structure is tested for chain continuity and broken into smaller chains on gaps. Original chain composition is restored when result pdb files with structures and trajectories are generated.
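The section layout and the chain/replica ordering described above can be sketched in Python as follows. This is an illustrative reader, not code from the CABSdock package: the function name `parse_fchains` is hypothetical, and the sketch assumes one header or coordinate record per line and that the caller knows the number of replicas. Sections are stored chain by chain, so the section for chain `c` in replica `r` (both counted from 0) sits at index `c * n_replicas + r`.

```python
def parse_fchains(text, n_replicas):
    """Parse FCHAINS-formatted text (hypothetical helper).

    Returns chains[c][r]: a list of (x, y, z) integer tuples, in CABS
    lattice units, for chain c in replica r (both 0-based).
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    sections = []
    i = 0
    while i < len(rows):
        # A section header is a line with a single integer N.
        if len(rows[i]) != 1:
            raise ValueError(f"expected a section header at record {i + 1}")
        n = int(rows[i][0])
        block = rows[i + 1 : i + 1 + n]
        # The header must be followed by exactly N three-integer lines.
        if len(block) != n or any(len(row) != 3 for row in block):
            raise ValueError(f"section at record {i + 1} does not have {n} coordinate lines")
        sections.append([tuple(int(v) for v in row) for row in block])
        i += 1 + n
    if len(sections) % n_replicas != 0:
        raise ValueError("number of sections is not a multiple of n_replicas")
    n_chains = len(sections) // n_replicas
    # Regroup the flat section list: all replicas of chain 1, then chain 2, ...
    return [
        [sections[c * n_replicas + r] for r in range(n_replicas)]
        for c in range(n_chains)
    ]
```

The strict check on section length is a deliberate choice: a section whose coordinate count does not match its header raises an error instead of silently shifting all later chains.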
8.6 TRAF file (output)
The TRAF file contains the coordinates of the CA atoms of every k-th conformation generated during the simulation, where k equals the MCsteps parameter set in the INP file.
The TRAF file can be divided into blocks of data. Each block corresponds to a single protein chain within a single replica at exactly one moment in time. A block starts with a header line followed by multiple coordinates lines. Although these blocks are the only explicit data structures within the TRAF file, they are organised into three abstract structures: chains, replicas and frames, representing the respective data structures processed in the simulation. The following figure presents the data layout inside the TRAF file.
8.7 OUT file (output)
The OUT file contains a short summary of the simulation.
9. CABSflex simulations of protein fluctuations
In addition to protein-peptide docking, the CABSdock standalone package provides a feature for fast simulations of protein fluctuations using the CABS-flex methodology (see the CABS-flex server website).
The following command:
$ CABSflex -i 2GB1
will run the CABSflex method with the default flexibility settings described here.
Note that the default flexibility settings for CABS-flex differ from those set for CABS-dock: restraints are calculated only for regions with secondary structure and are less strict (created for residues up to 8 Angstroms apart).
All other options, except for those concerning peptides, work the same as in CABS-dock.