Consensus correlation network analysis

Issue #307 resolved
Former user created an issue

How is significance calculated based on the p-value in the plot.fluct function for rmsf values? Also, I did not understand how you are able to state significance by comparing just two points (the rmsf values for each residue).

For consensus network analysis, is it enough to concatenate the two independent simulations into a single dcd file, or does the correlation matrix need to be calculated for each simulation and then averaged before the correlation network analysis, as in your Kinesin-5 paper?

Comments (10)

  1. Xinqiu Yao

    Hi,

    To calculate significance you need multiple rmsf/fluctuation values, not just two. For example, we have two nucleotide states (ATP and ADP) for kinesin. For each state, we ran 5 MD simulations in which the initial conformation is the same but the initial velocities are different. From each simulation trajectory, we can calculate rmsf values. With the total of 10 trajectories we can obtain a 10xN matrix (e.g. named rr), in which each row represents the rmsf values calculated from one trajectory and N is the total number of residues. Then, call plot.fluct(x=rr, col=c(rep('red', 5), rep('green', 5)), signif=TRUE) to get a plot with significance annotations. Note that 'col' here classifies the rows of rr into two groups, which should match the nucleotide state of the corresponding trajectory. Also check help(plot.fluct) for information about other options.
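
    As a rough sketch of this workflow (the list of pre-fitted trajectories, trjs, and the 5+5 replicate setup are illustrative placeholders, not code from the paper):

    ```r
    library(bio3d)

    # Assumed input: a list of 10 pre-superposed C-alpha trajectories
    # (5 ATP runs then 5 ADP runs), each a frames-by-coordinates xyz matrix.
    rr <- t(sapply(trjs, rmsf))   # 10 x N matrix: one row of rmsf values per trajectory

    # 'col' groups rows into the two states for the per-residue significance test
    plot.fluct(x = rr, col = c(rep('red', 5), rep('green', 5)), signif = TRUE)
    ```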

    Both ways of network analysis are doable, but we recommend the second one, as explained in the tutorial (the side-note below Figure 9). By concatenating simulations, you get one correlation matrix and the network can be built using a "contact map based" approach as done by Sethi et al. (2009 PNAS 106:6620). Also note that the second approach is not simply averaging correlation values: it also checks the "robustness" of an observed strong correlation across multiple independent simulations, and keeps or removes the corresponding edge from the network based on "certain" rules. Please read the Methods of the kinesin paper or check the help doc of the function filter.dccm() for more details.
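
    A minimal sketch of the second (consensus) approach, assuming cij.list holds one correlation matrix per replicate simulation and cm is a contact map from the cumulative frames (object names and cutoff are illustrative; check help(filter.dccm) for the exact arguments):

    ```r
    library(bio3d)

    # Filter correlations across replicates: keep consistently strong
    # couplings, apply the contact-map filter to the rest
    cij.filtered <- filter.dccm(x = cij.list, cutoff.cij = 0.6, cmap = cm)

    # Build the correlation network from the consensus matrix
    net <- cna(cij.filtered, cutoff.cij = 0.6)
    ```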

  2. Rakesh Ramachandran

    Hi Xin,

    Thanks for the reply. I have a few more questions. What statistical test is used for the significance calculation? Also, for the rmsf calculation, do you use the initial frame of the trajectory or the original crystal structure? What is the minimum number of simulations per state that you suggest for the significance test?

    Also in your paper you mention "Ci is the covariance matrix for the displacement of heavy atom i and Cij for atoms i and j. The LMI of residues A and B corresponds to the maximum value of the LMI among the atoms forming residue A and residue B (indicated with G(A) and G(B), respectively). This analysis was performed separately on each simulation, resulting in four matrices per protein state. Each group of four matrices was also used to obtain a consensus matrix containing average LMI values if the values from individual simulations were greater than or equal to a cutoff value of 0.6. Zero values were assigned to ij of the consensus matrix if any LMIij was <0.6 and the respective atoms were separated by >10 Å in 70% of cumulative simulation frames."

    So my question is: how are parameters like the LMI cutoff set? Is it subjective? I see that in the PNAS paper they used > 4.5 Å for separation. Moreover, in the last sentence you mention cumulative simulation frames; does that mean the consensus matrix is further refined after taking into account the contacts from all the replica simulations?

    You also mention that the LMI values were obtained using heavy atoms, but I was unable to complete the calculation even after a long time, and I also did not see multiple cores being used in spite of providing the ncore option to the lmi function. What is the typical number of frames you used in the 40 ns simulations? Do you also discard initial simulation frames and only use the frames over which the RMSD has converged?

    Finally, does the nma function also support normal mode analysis of protein-RNA complexes?

    Regards, Rakesh

  3. Xinqiu Yao

    Hi Rakesh,

    We used the standard Student's t-test with the two-sided alternative hypothesis (see help(t.test) for details). In the rmsf calculation, we first superimposed all frames onto the crystal structure used for the MD simulation, and then for each C-alpha atom we calculated the positional variance with respect to the mean position. For the significance test, I would suggest at least 5 trajectories per state.
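
    The superposition-plus-fluctuation procedure described above might look roughly like this for one replicate (the pdb/dcd file names are placeholders):

    ```r
    library(bio3d)

    pdb <- read.pdb("crystal.pdb")    # structure used to start the MD
    dcd <- read.dcd("run1.dcd")       # one replicate trajectory
    ca.inds <- atom.select(pdb, elety = "CA")

    # Superimpose every frame onto the crystal structure (C-alpha fit)
    xyz <- fit.xyz(fixed = pdb$xyz, mobile = dcd,
                   fixed.inds = ca.inds$xyz, mobile.inds = ca.inds$xyz)

    # Positional fluctuation of each C-alpha about its mean position
    r <- rmsf(xyz[, ca.inds$xyz])
    ```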

    I have to say that the parameters of correlation network analysis, e.g. the 0.6 cutoff used for LMI in the paper, are empirical and probably dependent on the system and on the method used to measure correlation (e.g. Pearson correlation or linear mutual information). You need to adjust the values for your own system and judge for yourself which value is the most suitable. A tip is to use the pymol.dccm() function to load gradually filtered correlations and visually check the effect of different correlation cutoffs.
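
    For instance, the visual check could be along these lines (cij and pdb are assumed to already exist, and PyMOL must be installed; see help(pymol.dccm) for the exact options):

    ```r
    # Launch PyMOL showing only couplings above a trial cutoff;
    # repeat with different 'omit' values to compare cutoffs visually
    pymol.dccm(cij, pdb, omit = 0.6, type = "launch")
    ```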

    The 4.5 Å cutoff used in the PNAS paper is for the distance between all heavy atoms, while the 10 Å cutoff in Guido's paper is for the C-alpha to C-alpha distance. In principle, you can use either way to calculate the contact map. Yes, we do use the contact map to further refine the network. The difference between our method and that in the PNAS paper is that we only apply the contact map filter to residue pairs that are not consistently strongly coupled (with correlation values above the 0.6 cutoff) across replicate simulations, whereas Sethi et al. applied the contact map to all residue pairs.

    What command did you use to calculate LMI? Let me know more details of your calculation and maybe I can help figure out what's going wrong.

    I will forward your questions regarding the kinesin paper to Guido, who may have comments and can tell you more technical details. As for the number of frames, I would say it is not critical and shouldn't change the results much as long as the number is large enough. I use 2000 in my own studies, but others may use a different number (usually larger than 1000). You can also check for yourself what difference there is between, e.g., 2000 and 5000 frames.

    Currently, we don't have functionality supporting NMA of systems containing RNA (Is it true, Lars?)

  4. guido scarabelli

    Hi Rakesh, in addition to Xin-Qiu's explanation, I just want to say that each simulation of the kinesin-5 motor domain was 40 ns long and had 2000 frames, without discarding any initial part. Let us know if you have other questions. Best, Guido

  5. Rakesh Ramachandran

    Hi Xin and Guido,

    Thanks for the reply. I used this command: cij.all <- lmi(trj[, ca.inds$xyz], ncore = 12), where trj is the object containing the superposed frames. I don't see 12 cores being used. If I use heavy atoms instead of C-alpha, it gets really slow and again the 12 cores are not used.

    Rakesh

  6. Xinqiu Yao

    Hi Rakesh,

    Yes, do provide what Barry suggested. Also, let us know the output of the following command: setup.ncore(12).

  7. Rakesh Ramachandran

    Hi Xin and Grant,

    This was the output of sessionInfo(), and the output of setup.ncore(12) is 12. I should mention that I find multiple cores not being used only for the lmi() function; the others, fit.xyz() and cmap(), work fine.

    R version 3.2.2 (2015-08-14) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS release 6.5 (Final)

    locale: [1] LC_CTYPE=en_IN LC_NUMERIC=C LC_TIME=en_IN
    [4] LC_COLLATE=en_IN LC_MONETARY=en_IN LC_MESSAGES=en_IN
    [7] LC_PAPER=en_IN LC_NAME=C LC_ADDRESS=C
    [10] LC_TELEPHONE=C LC_MEASUREMENT=en_IN LC_IDENTIFICATION=C

    attached base packages: character(0)

    other attached packages: [1] bio3d_2.2-3.9000

    loaded via a namespace (and not attached): [1] bigmemory_4.5.8 bigmemory.sri_0.1.3 magrittr_1.5
    [4] graphics_3.2.2 parallel_3.2.2 igraph_1.0.1
    [7] utils_3.2.2 grDevices_3.2.2 Rcpp_0.12.1
    [10] stats_3.2.2 datasets_3.2.2 grid_3.2.2
    [13] methods_3.2.2 base_3.2.2 lattice_0.20-33

  8. Xinqiu Yao

    Hi,

    I think your system is fine for multicore calculation. lmi() and other correlation functions call the internal R function var() first, which is not parallelized. When the number of frames in your trajectory is very large, the call to var() becomes the computational bottleneck. This is probably why you see only one core busy. Note that var() is much faster than an ordinary R implementation would be, even a parallelized one.

    One solution to the problem mentioned above is to reduce the number of frames (e.g. to 2000). Also, make sure you used the 'grpby' option when you tried all heavy atoms for the lmi calculation (see help(lmi)).
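
    A hedged example of the grpby usage mentioned above (the heavy-atom selection and the pdb/xyz object names are illustrative):

    ```r
    library(bio3d)

    # Select all heavy (non-hydrogen) atoms
    heavy.inds <- atom.select(pdb, "noh")

    # Group atomic correlations by residue so the result is a
    # residue-by-residue matrix; the grpby vector maps each selected
    # atom to its residue number
    cij <- lmi(xyz[, heavy.inds$xyz],
               grpby = pdb$atom$resno[heavy.inds$atom],
               ncore = 12)
    ```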

    Let me know if you still have any problems!
