DCCM (dccm.xyz.R and cov2dccm.R) - Mean substraction

Issue #388 resolved
Former user created an issue

Hello all,

I have been looking at the code which generates the DCCM plot from cartesian coordinates. I noticed that in dccm.xyz.R, there is an early call for subtraction of the mean of each column to each individual entry in that column.

  if(is.null(reference)) ref = colMeans(xyz)
  else ref = reference
  dxyz  <- sweep(xyz, 2, ref)

The matrix dxyz is then fed to the cov() function, generating variable covmat, which in turn is feed into the function cov2dccm as ccmat.

  covmat <- cov(dxyz)

  ccmat <- cov2dccm(covmat, ncore = ncore)

If what I have stated above is correct, then I have two questions:

1) This question pertains to the use of the R cov() function on matrix dxyz. I have been looking at the R code and what functions are called by cov(). It appears to me that cov() (in R/cov.c) goes ahead and performs a mean subtraction on the data set. If so, does this imply that the mean of each column is subtracted twice? Once generating dxyz (in dccm.xyz.R) and a second time by the cov() call? (I've included the code from cov.c that I found at the end of this message, or you can find it here: https://github.com/SurajGupta/r-source/blob/56becd21c75d104bfec829f9c23baa2e144869a2/src/library/stats/src/cov.c)

2) I noticed that the syntax for :

  if(is.null(reference)) ref = colMeans(xyz)
  else ref = reference
  dxyz  <- sweep(xyz, 2, ref)

is missing { } (after the if statement, else, and closing). Is this a problem? I can't seem to run that part of the code on my end without the brackets.

Lastly, hats off to you guys for the invaluable work you are doing for the community. I have written my own collection of scripts in different languages and programs to do analyses that are restricted to a gazillion limitations (trajectory type, file requirements, etc, etc), it is extremely nice to find such an efficient and elegant analysis suite. Many thanks again.

Carlos Andres, Sham Lab, Gao Lab, U of Minnesota

#define COV_SUM_UPDATE              \
            sum += xm * ym;     \
            if(cor) {           \
            xsd += xm * xm;     \
            ysd += ym * ym;     \
            }

#define ANS(I,J)  ans[I + J * ncx]

/* Note that "if (kendall)" and  "if (cor)" are used inside a double for() loop;
   which makes the code better readable -- and is hopefully dealt with
   by a smartly optimizing compiler
*/

/** Compute   Cov(xx[], yy[])  or  Cor(.,.)  with n = length(xx)
 */
#define COV_PAIRWISE_BODY                       \
    LDOUBLE sum, xmean = 0., ymean = 0., xsd, ysd, xm, ym;  \
        int k, nobs, n1 = -1;   /* -Wall initializing */        \
                                    \
        nobs = 0;                           \
        if(!kendall) {                      \
        xmean = ymean = 0.;                 \
        for (k = 0 ; k < n ; k++) {             \
            if(!(ISNAN(xx[k]) || ISNAN(yy[k]))) {       \
            nobs ++;                    \
            xmean += xx[k];                 \
            ymean += yy[k];                 \
            }                           \
        }                           \
        } else /*kendall*/                      \
        for (k = 0 ; k < n ; k++)               \
            if(!(ISNAN(xx[k]) || ISNAN(yy[k])))         \
            nobs ++;                    \
                                    \
        if (nobs >= 2) {                        \
        xsd = ysd = sum = 0.;                   \
        if(!kendall) {                      \
            xmean /= nobs;                  \
            ymean /= nobs;                  \
            n1 = nobs-1;                    \
        }                           \
        for(k=0; k < n; k++) {                  \
            if(!(ISNAN(xx[k]) || ISNAN(yy[k]))) {       \
            if(!kendall) {                  \
                xm = xx[k] - xmean;             \
                ym = yy[k] - ymean;             \
                                    \
                COV_SUM_UPDATE              \
            }                       \
            else { /* Kendall's tau */          \
                for(n1=0 ; n1 < k ; n1++)           \
                if(!(ISNAN(xx[n1]) || ISNAN(yy[n1]))) { \
                    xm = sign(xx[k] - xx[n1]);      \
                    ym = sign(yy[k] - yy[n1]);      \
                                    \
                    COV_SUM_UPDATE          \
                }                   \
            }                       \
            }                           \
} 

Comments (4)

  1. Xinqiu Yao

    Hi Carlos,

    Thanks for your encouraging comments on Bio3D. We are glad to see that you like it and find it useful.

    For your first question: Yes, it has double subtraction of the column mean but it won't be a problem. This is because after the first subtraction the column mean of the new matrix (i.e. dxyz) will be 0. Therefore the second subtraction in cov(dxyz) will have no effect. Note that this is true for reference=NULL. When reference is not NULL (e.g. to calculate 'moment' instead of 'covariance'; this option is rarely used in actual), calling 'cov()' will indeed cause some problems, and that is why we have additional codes (i.e. the lines enclosed in if(!is.null(reference)) {}) to correct it.

    For the second, missing curly brackets in 'if statement ' won't be a problem in function(). But it will indeed cause an error if typing directly in a R console, possibly because R interprets each line immediately stead of considering the "context". In a "coding style" point of view, yes it might be better to always have the brackets. Thanks for pointing out this.

  2. Log in to comment