Error in performance assessment

Issue #146 resolved
Benjamin Kellman created an issue

I'm running into an error in the perf() function. I built the following model:

mod=spls(Y=Y[-test,], X=X[-test,],
    ncomp = 10,
    mode = c("regression", "canonical", "invariant", "classic")[1],
    keepX=which(colSums(X)>10),
    keepY=which(colSums(abs(Y))>10),
    scale = TRUE,
    tol = 1e-6,
    max.iter = 1000,
    near.zero.var = FALSE,
    logratio = "none",
    multilevel = NULL,
    all.outputs = TRUE)

and get the following error

> perf(mod,'Mfold')
  |=                                                                     |   1%
Error in if ((crossprod(a.cv - a.old.cv) < tol) || (iter.cv == max.iter)) break : 
  missing value where TRUE/FALSE needed
In addition: Warning messages:
1: The SGCCA algorithm did not converge 
2: The SGCCA algorithm did not converge 
3: In cor(A[[k]], variates.A[[k]]) : the standard deviation is zero

Summaries of my X and Y matrices: Y is a binary inclusion matrix; X is a mix of continuous and binarized categorical variables.

It seems like an NA is sneaking in here through a.cv or a.old.cv and the crossprod calculations used to generate them. Any idea how this can be resolved?

Comments (4)

  1. Florian Rohart

    Hi Benjamin,

    First I'd like to remind you that the keepX and keepY parameters give the number of variables you want to keep on each component (one number per component, so ten values here since ncomp = 10). In your code you used

    keepX=which(colSums(X)>10)
    

    This gives you the indices of the columns whose sums exceed 10, not the number of variables to keep on each component.
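
    For example, a minimal sketch of what spls() expects for these arguments (the values 50 and 10 below are arbitrary, purely to show the shape):

    # one integer per component: how many variables to select on that component
    keepX <- rep(50, 10)   # keep 50 X variables on each of the 10 components
    keepY <- rep(10, 10)   # keep 10 Y variables on each of the 10 components
    mod <- spls(X = X[-test, ], Y = Y[-test, ],
                ncomp = 10, mode = "regression",
                keepX = keepX, keepY = keepY)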

    To answer your question: it seems that V8 in your Y is a constant (and all-zero) variable; you may want to set near.zero.var = TRUE, or remove this column. V9 and V10 don't look too informative either. This might result in constant variables during the CV process, and I think setting near.zero.var = TRUE should solve the problem.
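
    If you want to check for such columns yourself, a rough sketch along these lines should flag them (here "constant" means fewer than two distinct values in the training rows, which is just one possible criterion):

    # columns of Y that are constant on the training set
    constant.cols <- which(apply(Y[-test, ], 2, function(col) length(unique(col)) < 2))
    if (length(constant.cols) > 0) Y <- Y[, -constant.cols, drop = FALSE]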

    Let me know!

  2. Benjamin Kellman reporter

    Thanks for the quick response!

    Looks like it was my misuse of the keepX argument, as you said. What is the purpose of manually specifying the number of variables? Does sPLS have a tendency to overestimate the number of variables needed?

  3. Benjamin Kellman reporter

    On a related note, I'm running the predict.spls function to recapitulate my binary Y matrix, but the $predict output has 3 dimensions: observations x variables x ncomp. How do I collapse this back to a 2D matrix (observations x variables) to assess the quality of the prediction?

  4. Florian Rohart

    spls() chooses the best linear combination of keepX variables. If you want to optimise that number, you need to use the tune.splsda() function on a grid of keepX values.
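
    For a regression-mode spls like yours the counterpart is tune.spls(); here is a rough sketch (argument names such as test.keepX, validation and folds may differ slightly between mixOmics versions, and the grid of values is arbitrary):

    grid.keepX <- c(5, 10, 20, 50)                      # candidate numbers of X variables
    tuned <- tune.spls(X = X[-test, ], Y = Y[-test, ],
                       ncomp = 10,
                       test.keepX = grid.keepX,
                       validation = "Mfold", folds = 5)
    tuned$choice.keepX                                  # selected keepX per component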

    Regarding predict, you should use $predict[, , ncomp], as this is the prediction using all 1:ncomp components. If you only look at $predict[, , 1] then it is the prediction with the first component only.
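
    In code, that amounts to something like this (a sketch using your mod and test objects from above):

    pred <- predict(mod, newdata = X[test, ])
    Y.hat <- pred$predict[, , mod$ncomp]   # observations x Y variables, using all components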
