The function valid has been superseded by the perf function, in order to avoid a selection bias that affected the sparse methods. This issue has now been fixed.
sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.4 evaluate_0.5.5 formatR_0.10 htmltools_0.2.4
## [5] knitr_1.6 rmarkdown_0.2.46 stringr_0.6.2 tools_3.1.0
## [9] yaml_2.1.13
Load the latest version of the package (5.0-2):
library(mixOmics, lib='../../MyR')
Here is an example illustrating the transition to the perf function:
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic
res.pls = pls(X, Y, ncomp = 3, mode = "regression")
valid.pls = valid(res.pls, validation = "Mfold", folds = 5, criterion = "all",
progressBar = FALSE)
## The valid function has been supersided by the perf function to evaluate the performance of the PLS, sPLS, PLS-DA and sPLS-DA methods.
## See ?perf for more details
perf is an S3 method that applies to any object inheriting from the PLS, sPLS, PLS-DA or sPLS-DA classes. It retrieves the parameters input from the model run beforehand and assesses the performance of the model.
We will use the liver toxicity data for the PLS and sPLS methods where the Y response is a matrix composed of 10 clinical variables:
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic
The Q2.total criterion can be used to tune the number of components. The rule of thumb is that a PLS component should be included in the model if its value is greater than or equal to 0.0975 (Tenenhaus, 1998; this rule is applied in the SIMCA software). After the first few important components, the Q2.total values should all fall below 0.0975. We have often observed Q2.total values that do not decrease, which might be due to the large number of variables in the X data set.
Here is an example with PLS, where ncomp = 3 should be enough for the model:
res.pls1 = pls(X, Y, ncomp = 10, mode = 'regression')
tune.pls = perf(res.pls1, validation = 'Mfold', folds = 5, criterion = 'all', progressBar = FALSE)
tune.pls$Q2.total
## comp 1 comp 2 comp 3 comp 4 comp 5 comp 6 comp 7 comp 8
## 0.28028 0.04028 0.12094 0.01316 0.04511 -0.07670 -0.01609 -0.01811
## comp 9 comp 10
## -0.01214 0.02518
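The 0.0975 rule can also be applied programmatically. Below is a minimal sketch (choose.ncomp is a hypothetical helper, not part of mixOmics) that counts the leading components whose Q2.total stays at or above the threshold. Note that in this example the rule must be interpreted loosely, since Q2.total fluctuates around the threshold rather than decreasing monotonically.

```r
# Hypothetical helper: keep the components preceding the first one
# whose Q2.total falls below the 0.0975 threshold (Tenenhaus, 1998)
choose.ncomp <- function(Q2.total, threshold = 0.0975) {
  below <- which(Q2.total < threshold)
  if (length(below) == 0) length(Q2.total) else below[1] - 1
}
choose.ncomp(tune.pls$Q2.total)
```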
In order to save some computational time (and output space) we can now rerun the perf function with ncomp = 3 components and look at the other criteria Q2, R2 and MSEP, which are calculated for each response variable in Y. According to the tuning step above, we should only consider the column ncomp = 3, although these more ‘local’ criteria also give a better idea of the fit for each response variable.
res.pls2 = pls(X, Y, ncomp = 3, mode = 'regression')
perf.pls = perf(res.pls2, validation = 'Mfold', folds = 5, criterion = 'all', progressBar = FALSE)
perf.pls$Q2
## ncomp 1 ncomp 2 ncomp 3
## BUN.mg.dL. 0.46131 0.09182 -0.02863
## Creat.mg.dL. 0.09239 -0.01942 0.03229
## TP.g.dL. 0.06527 0.05494 0.16978
## ALB.g.dL. 0.09267 0.16226 0.23009
## ALT.IU.L. 0.63112 0.09993 0.10704
## SDH.IU.L. 0.08833 0.03258 0.11026
## AST.IU.L. 0.62150 0.07513 0.09956
## ALP.IU.L. 0.26706 -0.05063 -0.07242
## TBA.umol.L. 0.53169 -0.05099 0.32634
## Cholesterol.mg.dL. 0.21343 0.11661 0.01431
perf.pls$R2
## ncomp 1 ncomp 2 ncomp 3
## BUN.mg.dL. 0.522514 0.58024 0.56597
## Creat.mg.dL. 0.052655 0.03980 0.07944
## TP.g.dL. 0.001606 0.05848 0.22156
## ALB.g.dL. 0.029408 0.17625 0.38337
## ALT.IU.L. 0.637420 0.74457 0.76513
## SDH.IU.L. 0.026605 0.01066 0.05771
## AST.IU.L. 0.592691 0.71351 0.74534
## ALP.IU.L. 0.164270 0.12946 0.09054
## TBA.umol.L. 0.581582 0.56013 0.71150
## Cholesterol.mg.dL. 0.239225 0.36508 0.37478
With sPLS, we first specify the number of variables to select on each component.
res.spls = spls(X, Y, ncomp = 3, keepX = c(10,5, 10), keepY = c(5,8,4), mode = 'regression')
perf.spls = perf(res.spls, validation = 'Mfold', folds = 5, criterion = 'all', progressBar = FALSE)
perf.spls$Q2.total
## comp 1 comp 2 comp 3
## 0.29191 0.11793 -0.02768
perf.spls$MSEP
## ncomp 1 ncomp 2 ncomp 3
## BUN.mg.dL. 0.6394 0.6200 0.6140
## Creat.mg.dL. 0.9219 0.9266 0.9949
## TP.g.dL. 0.9219 0.7282 0.7255
## ALB.g.dL. 0.9219 0.6277 0.6229
## ALT.IU.L. 0.2960 0.2951 0.2951
## SDH.IU.L. 0.9219 0.6843 0.6990
## AST.IU.L. 0.3158 0.3157 0.3157
## ALP.IU.L. 0.9219 0.9398 0.9868
## TBA.umol.L. 0.4315 0.3900 0.3900
## Cholesterol.mg.dL. 0.6782 0.6207 0.6745
perf.spls$R2
## ncomp 1 ncomp 2 ncomp 3
## BUN.mg.dL. 3.920e-01 0.4154204 0.4023648
## Creat.mg.dL. 3.421e-05 0.0001862 0.0039095
## TP.g.dL. 3.677e-02 0.2898074 0.2965621
## ALB.g.dL. 3.596e-02 0.3998084 0.4067936
## ALT.IU.L. 8.451e-01 0.8462560 0.8462560
## SDH.IU.L. 2.615e-02 0.0760346 0.0563164
## AST.IU.L. 8.114e-01 0.8105985 0.8105985
## ALP.IU.L. 2.842e-03 0.1065213 0.0003963
## TBA.umol.L. 6.541e-01 0.7051423 0.7051423
## Cholesterol.mg.dL. 4.273e-01 0.4516098 0.3408542
An interesting output that we have added to the perf function is a ‘stability frequency’. At each training step (cross-validation run), the specified numbers of variables (keepX and keepY) are selected on the training data set. Since each training set is different, a different list of selected features is obtained. We can therefore calculate the frequency with which each variable is selected across the runs.
perf.spls$features$stable.X
## $`comp 1`
## A_42_P620915 A_43_P10606 A_43_P14131 A_43_P17415 A_43_P22616
## 1.0 1.0 1.0 1.0 1.0
## A_43_P23376 A_42_P675890 A_42_P840776 A_42_P681650 A_42_P705413
## 1.0 0.6 0.6 0.4 0.4
## A_42_P802628 A_43_P10003 A_43_P11724 A_42_P578246 A_42_P681533
## 0.4 0.4 0.4 0.2 0.2
## A_42_P758454 A_42_P825290
## 0.2 0.2
##
## $`comp 2`
## A_42_P505480 A_42_P576823 A_42_P591665 A_42_P678904 A_43_P11285
## 1.0 1.0 1.0 0.8 0.6
## A_42_P843603 A_43_P11288
## 0.4 0.2
##
## $`comp 3`
## A_42_P622652 A_43_P17889 A_42_P467372 A_42_P504507 A_43_P20814
## 0.6 0.6 0.4 0.4 0.4
## A_43_P22375 A_42_P505298 A_42_P512802 A_42_P592417 A_42_P603705
## 0.4 0.2 0.2 0.2 0.2
## A_42_P606133 A_42_P617554 A_42_P656513 A_42_P668600 A_42_P677172
## 0.2 0.2 0.2 0.2 0.2
## A_42_P693789 A_42_P716680 A_42_P720241 A_42_P740209 A_42_P754353
## 0.2 0.2 0.2 0.2 0.2
## A_42_P755687 A_42_P757296 A_42_P763572 A_42_P772304 A_42_P793123
## 0.2 0.2 0.2 0.2 0.2
## A_43_P10703 A_43_P10754 A_43_P11230 A_43_P13183 A_43_P13368
## 0.2 0.2 0.2 0.2 0.2
## A_43_P13563 A_43_P13777 A_43_P15173 A_43_P15957 A_43_P16009
## 0.2 0.2 0.2 0.2 0.2
## A_43_P16316 A_43_P16621 A_43_P16946 A_43_P19044 A_43_P19997
## 0.2 0.2 0.2 0.2 0.2
## A_43_P20149 A_43_P21505
## 0.2 0.2
#perf.spls$features$stable.Y
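These stability frequencies can be filtered to retain only the variables selected in most cross-validation runs. A minimal sketch (the 0.8 cut-off is an arbitrary choice, not a package default):

```r
# Keep the component 1 variables selected in at least 80% of the CV runs
stable.comp1 <- perf.spls$features$stable.X$`comp 1`
names(stable.comp1)[stable.comp1 >= 0.8]
```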
We also output the final variable selection given the input arguments keepX and keepY on the full data set, along with the weight values from the loading vector.
perf.spls$features$final.X
## $`comp 1`
## value.var.X
## A_42_P620915 0.54789
## A_43_P14131 0.53920
## A_43_P23376 0.37728
## A_43_P22616 0.32534
## A_43_P10606 0.27336
## A_43_P17415 0.23237
## A_42_P705413 0.13863
## A_42_P802628 0.08216
## A_42_P840776 0.06827
## A_42_P675890 0.03932
##
## $`comp 2`
## value.var.X
## A_42_P591665 -0.7181
## A_42_P576823 -0.5238
## A_43_P11285 -0.3390
## A_42_P678904 -0.2911
## A_42_P505480 -0.1011
##
## $`comp 3`
## value.var.X
## A_42_P622652 0.78117
## A_42_P656513 0.33316
## A_42_P504507 0.31844
## A_43_P17889 0.24404
## A_42_P467372 0.21493
## A_43_P22375 0.17676
## A_43_P13368 0.13226
## A_42_P757296 0.09641
## A_42_P606133 0.08656
## A_43_P10703 0.07814
#perf.spls$features$final.Y
A plot function is provided to display the different criteria per response variable.
plot(perf.spls, criterion = c('RMSEP'), type = 'l')
PLS-DA performs a classification task (as opposed to a regression task with PLS): the Y response is a factor indicating the class of each individual. Here we take the dose group as the Y factor.
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- as.factor(liver.toxicity$treatment[,"Dose.Group"]) # take a factor here
summary(Y)
## 50 150 1500 2000
## 16 16 16 16
In addition to the number of components, the user needs to specify the prediction method to be used with PLS-DA (see ?predict.plsda and ?perf). The criterion computed is the classification error rate averaged across all cross-validation runs. Note that if M-fold cross-validation is performed, this result should be averaged across several repetitions (in the example below we only ran it once).
res.plsda = plsda(X, Y, ncomp = 10)
tune.plsda = perf(res.plsda, method.predict = 'all' , validation = 'Mfold', folds = 5, progressBar = FALSE)
#! note that the function perf should be rerun several times and the error rates obtained below should then be averaged
tune.plsda$error.rate
## max.dist centroids.dist mahalanobis.dist
## ncomp 1 0.5782 0.5321 0.5321
## ncomp 2 0.4192 0.4346 0.3756
## ncomp 3 0.2179 0.2628 0.1872
## ncomp 4 0.2513 0.2795 0.2179
## ncomp 5 0.1872 0.2487 0.1564
## ncomp 6 0.2192 0.2487 0.1551
## ncomp 7 0.1577 0.2641 0.2038
## ncomp 8 0.1731 0.2333 0.2205
## ncomp 9 0.1423 0.2179 0.2513
## ncomp 10 0.1423 0.2179 0.2205
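As noted above, a single M-fold run is unstable. A minimal sketch of averaging the error rates over several repetitions (the number of repetitions, nrepeat, is an arbitrary choice here, and this can be slow to run):

```r
# Average the PLS-DA error rates over several M-fold repetitions
nrepeat <- 10
err <- replicate(nrepeat,
                 perf(res.plsda, method.predict = 'all',
                      validation = 'Mfold', folds = 5,
                      progressBar = FALSE)$error.rate)
# err is a 3-dimensional array: ncomp x distance x repetition
apply(err, c(1, 2), mean)
```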
The classification error rate might reach a minimum when a sufficient number of components is included in the model. This output is especially useful when using the sparse PLS-DA, for example if we want to assess the performance of sPLS-DA with a specified number of variables to select (keepX).
From our experience, the prediction distance ‘max.dist’ seems to outperform the other distances in most cases. This is the distance we choose in this example:
res.splsda = splsda(X, Y, ncomp = 3, keepX=c(50,40,20))
perf.splsda = perf(res.splsda, method.predict = 'max.dist' , validation = 'Mfold', folds = 5, progressBar = FALSE)
#! note that the function perf should be rerun several times and the error rates obtained below should then be averaged
perf.splsda$error.rate
## max.dist
## ncomp 1 0.6218
## ncomp 2 0.1397
## ncomp 3 0.1718
As with sPLS, we can also output the final variable selection (on the full data set) as well as the variables selected repeatedly across the cross-validation runs:
#perf.splsda$features$stable
#perf.splsda$features$final
matplot(perf.splsda$error.rate, type = 'l', col = 'blue', xlab = 'sPLS-DA components', ylab = 'classification error rate', main = 'Max distance')
Below is an example of code to tune the number of variables to select with sPLS-DA. There are a few things to take into account during the tuning step:

* vary the number of variables to select;
* the code below only assesses the model for a single cross-validation procedure, and the results should be averaged across several runs;
* the tuning should be performed iteratively, one component at a time.
Note: these computations may take a long time to run!
For component 1:
n <- nrow(X)
M <- 5
list.keepX <- c(seq(10, 200, 10))
error1 <- vector(length = length(list.keepX))
names(error1) <- list.keepX
ncomp = 1
for (i in 1:length(list.keepX)) {
model <- splsda(X, Y, ncomp = ncomp, keepX = c(list.keepX[i]))
error1[i] <- perf(model, method.predict = 'max.dist' , validation = 'Mfold', folds = M, progressBar = FALSE)$error.rate
}
error1
## 10 20 30 40 50 60 70 80 90 100
## 0.6385 0.6423 0.6256 0.6564 0.6090 0.6064 0.6103 0.6410 0.6244 0.6385
## 110 120 130 140 150 160 170 180 190 200
## 0.6231 0.5949 0.6269 0.6397 0.6077 0.5141 0.6269 0.6564 0.6244 0.6103
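The keepX value minimizing the error rate can be extracted directly rather than read off the table. With a single cross-validation run, as here, the minimum is noisy and the error rates are close to each other, so a smaller keepX with a comparable error rate (such as 20, used below) may be preferred over the strict minimum.

```r
# keepX value achieving the lowest single-run error rate on component 1
list.keepX[which.min(error1)]
```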
Given the previous results, we can decide to set the first keepX value to, say, 20 and continue to tune the next component:
ncomp = 2
error2 <- vector(length = length(list.keepX))
names(error2) <- list.keepX
# tuned value for the first component
keepX1 = 20
for (i in 1:length(list.keepX)) {
model <- splsda(X, Y, ncomp = ncomp, keepX = c(keepX1, list.keepX[i]))
error2[i] <- t(perf(model, method.predict = 'max.dist' , validation = 'Mfold', folds = M, progressBar = FALSE)$error.rate)[ncomp]
}
error2
## 10 20 30 40 50 60 70 80 90 100
## 0.2038 0.1090 0.2013 0.2192 0.1397 0.1718 0.2179 0.1872 0.2346 0.2795
## 110 120 130 140 150 160 170 180 190 200
## 0.1910 0.2013 0.2474 0.1731 0.2500 0.2026 0.3128 0.2192 0.3256 0.2218
Plot the error rate for either component 1, or the combination of component 1 and 2:
matplot(cbind(error1, error2), type = "l", axes = FALSE, xlab = "number of selected genes on the dimension 1",
ylab = "error rate", lwd = 2, lty = 1, col = 1:2, ylim = c(0, 0.8))
axis(1, c(1:length(list.keepX)), labels = list.keepX)
axis(2)
legend("topright", lty = 1, lwd = 2, legend = c("dim 1", "dim 1:2"), horiz = TRUE,
col = 1:2)