The function valid has been superseded by the perf function, in order to avoid a selection bias that affected the sparse methods. This issue has now been fixed.
sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.4 evaluate_0.5.5 formatR_0.10 htmltools_0.2.4
## [5] knitr_1.6 rmarkdown_0.2.46 stringr_0.6.2 tools_3.1.0
## [9] yaml_2.1.13
Load the latest version of the package (5.0-2):
library(mixOmics, lib='../../MyR')
Here is an example illustrating the transition to the perf function:
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic
res.pls = pls(X, Y, ncomp = 3, mode = "regression")
valid.pls = valid(res.pls, validation = "Mfold", folds = 5, criterion = "all",
progressBar = FALSE)
## The valid function has been supersided by the perf function to evaluate the performance of the PLS, sPLS, PLS-DA and sPLS-DA methods.
## See ?perf for more details
perf is an S3 method that applies to any object inheriting from the PLS, sPLS, PLS-DA or sPLS-DA classes. It retrieves the parameters input from the model run beforehand and assesses the performance of the model.
We will use the liver toxicity data for the PLS and sPLS methods where the Y response is a matrix composed of 10 clinical variables:
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic
The Q2.total criterion can be used to tune the number of components. The rule of thumb is that a PLS component should be included in the model if its value is greater than or equal to 0.0975 (Tenenhaus, 1998; this rule is applied in the SIMCA software). After the first few important components, the Q2.total values should all fall below 0.0975. We have often observed Q2.total values that do not decrease, which might be due to the large number of variables in the X data set.
Here is an example with PLS, where ncomp = 3 should be enough for the model:
res.pls1 = pls(X, Y, ncomp = 10, mode = 'regression')
tune.pls = perf(res.pls1, validation = 'Mfold', folds = 5, criterion = 'all', progressBar = FALSE)
tune.pls$Q2.total
## comp 1 comp 2 comp 3 comp 4 comp 5 comp 6 comp 7 comp 8
## 0.28028 0.04028 0.12094 0.01316 0.04511 -0.07670 -0.01609 -0.01811
## comp 9 comp 10
## -0.01214 0.02518
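The 0.0975 rule can also be applied programmatically. Below is a minimal sketch (choose.ncomp is a hypothetical helper, not part of mixOmics) that counts the leading components whose Q2.total stays at or above the threshold. Note that in this example the rule must be interpreted loosely, since Q2.total fluctuates around the threshold rather than decreasing monotonically.

```r
# Hypothetical helper: keep the components preceding the first one
# whose Q2.total falls below the 0.0975 threshold (Tenenhaus, 1998)
choose.ncomp <- function(Q2.total, threshold = 0.0975) {
  below <- which(Q2.total < threshold)
  if (length(below) == 0) length(Q2.total) else below[1] - 1
}
choose.ncomp(tune.pls$Q2.total)
```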
In order to save some computational time (and output space) we can now rerun the perf function with ncomp = 3 components and look at the other criteria Q2, R2 and MSEP, which are calculated for each response variable in Y. According to the tuning step above, we should only consider the column ncomp = 3, although these more ‘local’ criteria also give a better idea of the fit for each response variable.
res.pls2 = pls(X, Y, ncomp = 3, mode = 'regression')
perf.pls = perf(res.pls2, validation = 'Mfold', folds = 5, criterion = 'all', progressBar = FALSE)
perf.pls$Q2
## ncomp 1 ncomp 2 ncomp 3
## BUN.mg.dL. 0.46131 0.09182 -0.02863
## Creat.mg.dL. 0.09239 -0.01942 0.03229
## TP.g.dL. 0.06527 0.05494 0.16978
## ALB.g.dL. 0.09267 0.16226 0.23009
## ALT.IU.L. 0.63112 0.09993 0.10704
## SDH.IU.L. 0.08833 0.03258 0.11026
## AST.IU.L. 0.62150 0.07513 0.09956
## ALP.IU.L. 0.26706 -0.05063 -0.07242
## TBA.umol.L. 0.53169 -0.05099 0.32634
## Cholesterol.mg.dL. 0.21343 0.11661 0.01431
perf.pls$R2
## ncomp 1 ncomp 2 ncomp 3
## BUN.mg.dL. 0.522514 0.58024 0.56597
## Creat.mg.dL. 0.052655 0.03980 0.07944
## TP.g.dL. 0.001606 0.05848 0.22156
## ALB.g.dL. 0.029408 0.17625 0.38337
## ALT.IU.L. 0.637420 0.74457 0.76513
## SDH.IU.L. 0.026605 0.01066 0.05771
## AST.IU.L. 0.592691 0.71351 0.74534
## ALP.IU.L. 0.164270 0.12946 0.09054
## TBA.umol.L. 0.581582 0.56013 0.71150
## Cholesterol.mg.dL. 0.239225 0.36508 0.37478
With sPLS, we first specify the number of variables to select on each component.
res.spls = spls(X, Y, ncomp = 3, keepX = c(10,5, 10), keepY = c(5,8,4), mode = 'regression')
perf.spls = perf(res.spls, validation = 'Mfold', folds = 5, criterion = 'all', progressBar = FALSE)
perf.spls$Q2.total
## comp 1 comp 2 comp 3
## 0.29191 0.11793 -0.02768
perf.spls$MSEP
## ncomp 1 ncomp 2 ncomp 3
## BUN.mg.dL. 0.6394 0.6200 0.6140
## Creat.mg.dL. 0.9219 0.9266 0.9949
## TP.g.dL. 0.9219 0.7282 0.7255
## ALB.g.dL. 0.9219 0.6277 0.6229
## ALT.IU.L. 0.2960 0.2951 0.2951
## SDH.IU.L. 0.9219 0.6843 0.6990
## AST.IU.L. 0.3158 0.3157 0.3157
## ALP.IU.L. 0.9219 0.9398 0.9868
## TBA.umol.L. 0.4315 0.3900 0.3900
## Cholesterol.mg.dL. 0.6782 0.6207 0.6745
perf.spls$R2
## ncomp 1 ncomp 2 ncomp 3
## BUN.mg.dL. 3.920e-01 0.4154204 0.4023648
## Creat.mg.dL. 3.421e-05 0.0001862 0.0039095
## TP.g.dL. 3.677e-02 0.2898074 0.2965621
## ALB.g.dL. 3.596e-02 0.3998084 0.4067936
## ALT.IU.L. 8.451e-01 0.8462560 0.8462560
## SDH.IU.L. 2.615e-02 0.0760346 0.0563164
## AST.IU.L. 8.114e-01 0.8105985 0.8105985
## ALP.IU.L. 2.842e-03 0.1065213 0.0003963
## TBA.umol.L. 6.541e-01 0.7051423 0.7051423
## Cholesterol.mg.dL. 4.273e-01 0.4516098 0.3408542
An interesting output that we have added to the perf function is a ‘stability frequency’. At each training step (cross-validation run), the specified numbers of variables (keepX and keepY) are selected on the training data set. Since each training set is different, a different list of selected features is obtained. We can therefore calculate the frequency with which each variable is selected across the runs.
perf.spls$features$stable.X
## $`comp 1`
## A_42_P620915 A_43_P10606 A_43_P14131 A_43_P17415 A_43_P22616
## 1.0 1.0 1.0 1.0 1.0
## A_43_P23376 A_42_P675890 A_42_P840776 A_42_P681650 A_42_P705413
## 1.0 0.6 0.6 0.4 0.4
## A_42_P802628 A_43_P10003 A_43_P11724 A_42_P578246 A_42_P681533
## 0.4 0.4 0.4 0.2 0.2
## A_42_P758454 A_42_P825290
## 0.2 0.2
##
## $`comp 2`
## A_42_P505480 A_42_P576823 A_42_P591665 A_42_P678904 A_43_P11285
## 1.0 1.0 1.0 0.8 0.6
## A_42_P843603 A_43_P11288
## 0.4 0.2
##
## $`comp 3`
## A_42_P622652 A_43_P17889 A_42_P467372 A_42_P504507 A_43_P20814
## 0.6 0.6 0.4 0.4 0.4
## A_43_P22375 A_42_P505298 A_42_P512802 A_42_P592417 A_42_P603705
## 0.4 0.2 0.2 0.2 0.2
## A_42_P606133 A_42_P617554 A_42_P656513 A_42_P668600 A_42_P677172
## 0.2 0.2 0.2 0.2 0.2
## A_42_P693789 A_42_P716680 A_42_P720241 A_42_P740209 A_42_P754353
## 0.2 0.2 0.2 0.2 0.2
## A_42_P755687 A_42_P757296 A_42_P763572 A_42_P772304 A_42_P793123
## 0.2 0.2 0.2 0.2 0.2
## A_43_P10703 A_43_P10754 A_43_P11230 A_43_P13183 A_43_P13368
## 0.2 0.2 0.2 0.2 0.2
## A_43_P13563 A_43_P13777 A_43_P15173 A_43_P15957 A_43_P16009
## 0.2 0.2 0.2 0.2 0.2
## A_43_P16316 A_43_P16621 A_43_P16946 A_43_P19044 A_43_P19997
## 0.2 0.2 0.2 0.2 0.2
## A_43_P20149 A_43_P21505
## 0.2 0.2
#perf.spls$features$stable.Y
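These stability frequencies can be filtered to retain only the variables selected in most cross-validation runs. A minimal sketch (the 0.8 cut-off is an arbitrary choice, not a package default):

```r
# Keep the component 1 variables selected in at least 80% of the CV runs
stable.comp1 <- perf.spls$features$stable.X$`comp 1`
names(stable.comp1)[stable.comp1 >= 0.8]
```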
We also output the final variable selection given the input arguments keepX and keepY on the full data set, along with the weight values from the loading vector.
perf.spls$features$final.X
## $`comp 1`
## value.var.X
## A_42_P620915 0.54789
## A_43_P14131 0.53920
## A_43_P23376 0.37728
## A_43_P22616 0.32534
## A_43_P10606 0.27336
## A_43_P17415 0.23237
## A_42_P705413 0.13863
## A_42_P802628 0.08216
## A_42_P840776 0.06827
## A_42_P675890 0.03932
##
## $`comp 2`
## value.var.X
## A_42_P591665 -0.7181
## A_42_P576823 -0.5238
## A_43_P11285 -0.3390
## A_42_P678904 -0.2911
## A_42_P505480 -0.1011
##
## $`comp 3`
## value.var.X
## A_42_P622652 0.78117
## A_42_P656513 0.33316
## A_42_P504507 0.31844
## A_43_P17889 0.24404
## A_42_P467372 0.21493
## A_43_P22375 0.17676
## A_43_P13368 0.13226
## A_42_P757296 0.09641
## A_42_P606133 0.08656
## A_43_P10703 0.07814
#perf.spls$features$final.Y
A plot function is provided to display the different criteria per response variable.
plot(perf.spls, criterion = c('RMSEP'), type = 'l')
PLS-DA performs a classification task (as opposed to a regression task with PLS): the Y response is a factor indicating the class of each individual. Here we take the dose group as the Y factor.
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- as.factor(liver.toxicity$treatment[,"Dose.Group"]) # take a factor here
summary(Y)
## 50 150 1500 2000
## 16 16 16 16
In addition to the number of components, the user needs to specify the prediction method to be used with PLS-DA (see ?predict.plsda and ?perf). The criterion computed is the classification error rate averaged across all cross-validation runs. Note that if M-fold cross-validation is performed, this result should be averaged across several repetitions (in the example below we only ran it once).
res.plsda = plsda(X, Y, ncomp = 10)
tune.plsda = perf(res.plsda, method.predict = 'all' , validation = 'Mfold', folds = 5, progressBar = FALSE)
#! note that the function perf should be rerun several times and the error rates obtained below should then be averaged
tune.plsda$error.rate
## max.dist centroids.dist mahalanobis.dist
## ncomp 1 0.5782 0.5321 0.5321
## ncomp 2 0.4192 0.4346 0.3756
## ncomp 3 0.2179 0.2628 0.1872
## ncomp 4 0.2513 0.2795 0.2179
## ncomp 5 0.1872 0.2487 0.1564
## ncomp 6 0.2192 0.2487 0.1551
## ncomp 7 0.1577 0.2641 0.2038
## ncomp 8 0.1731 0.2333 0.2205
## ncomp 9 0.1423 0.2179 0.2513
## ncomp 10 0.1423 0.2179 0.2205
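As noted above, a single M-fold run is unstable. A minimal sketch of averaging the error rates over several repetitions (the number of repetitions, nrepeat, is an arbitrary choice here, and this can be slow to run):

```r
# Average the PLS-DA error rates over several M-fold repetitions
nrepeat <- 10
err <- replicate(nrepeat,
                 perf(res.plsda, method.predict = 'all',
                      validation = 'Mfold', folds = 5,
                      progressBar = FALSE)$error.rate)
# err is a 3-dimensional array: ncomp x distance x repetition
apply(err, c(1, 2), mean)
```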
The classification error rate might reach a minimum when a sufficient number of components is included in the model. This output is especially useful when using the sparse PLS-DA, for example if we want to assess the performance of sPLS-DA with a specified number of variables to select (keepX).
From our experience, the prediction distance ‘max.dist’ seems to outperform the other distances in most cases. This is the distance we choose in this example:
res.splsda = splsda(X, Y, ncomp = 3, keepX=c(50,40,20))
perf.splsda = perf(res.splsda, method.predict = 'max.dist' , validation = 'Mfold', folds = 5, progressBar = FALSE)
#! note that the function perf should be rerun several times and the error rates obtained below should then be averaged
perf.splsda$error.rate
## max.dist
## ncomp 1 0.6218
## ncomp 2 0.1397
## ncomp 3 0.1718
As with sPLS, we can also output the final variable selection (on the full data set) as well as the variables selected repeatedly across the cross-validation runs:
#perf.splsda$features$stable
#perf.splsda$features$final
matplot(perf.splsda$error.rate, type = 'l', col = 'blue', xlab = 'sPLS-DA components', ylab = 'classification error rate', main = 'Max distance')
Below is an example of code to tune the number of variables to select with sPLS-DA. There are a few things to take into account during the tuning step:

* vary the number of variables to select;
* the code below only assesses the model for a single cross-validation procedure, and the results should be averaged across several runs;
* the tuning should be performed iteratively, one component at a time.
Note: these computations may take a long time to run!
For component 1:
n <- nrow(X)
M <- 5
list.keepX <- c(seq(10, 200, 10))
error1 <- vector(length = length(list.keepX))
names(error1) <- list.keepX
ncomp = 1
for (i in 1:length(list.keepX)) {
model <- splsda(X, Y, ncomp = ncomp, keepX = c(list.keepX[i]))
error1[i] <- perf(model, method.predict = 'max.dist' , validation = 'Mfold', folds = M, progressBar = FALSE)$error.rate
}
error1
## 10 20 30 40 50 60 70 80 90 100
## 0.6385 0.6423 0.6256 0.6564 0.6090 0.6064 0.6103 0.6410 0.6244 0.6385
## 110 120 130 140 150 160 170 180 190 200
## 0.6231 0.5949 0.6269 0.6397 0.6077 0.5141 0.6269 0.6564 0.6244 0.6103
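The keepX value minimizing the error rate can be extracted directly rather than read off the table. With a single cross-validation run, as here, the minimum is noisy and the error rates are close to each other, so a smaller keepX with a comparable error rate (such as 20, used below) may be preferred over the strict minimum.

```r
# keepX value achieving the lowest single-run error rate on component 1
list.keepX[which.min(error1)]
```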
Given the previous results, we can decide to set the first keepX value to, say, 20 and continue to tune the next component:
ncomp = 2
error2 <- vector(length = length(list.keepX))
names(error2) <- list.keepX
# tuned value for the first component
keepX1 = 20
for (i in 1:length(list.keepX)) {
model <- splsda(X, Y, ncomp = ncomp, keepX = c(keepX1, list.keepX[i]))
error2[i] <- t(perf(model, method.predict = 'max.dist' , validation = 'Mfold', folds = M, progressBar = FALSE)$error.rate)[ncomp]
}
error2
## 10 20 30 40 50 60 70 80 90 100
## 0.2038 0.1090 0.2013 0.2192 0.1397 0.1718 0.2179 0.1872 0.2346 0.2795
## 110 120 130 140 150 160 170 180 190 200
## 0.1910 0.2013 0.2474 0.1731 0.2500 0.2026 0.3128 0.2192 0.3256 0.2218
Plot the error rate for either component 1, or the combination of component 1 and 2:
matplot(cbind(error1, error2), type = "l", axes = FALSE, xlab = "number of selected genes on the dimension 1",
ylab = "error rate", lwd = 2, lty = 1, col = 1:2, ylim = c(0, 0.8))
axis(1, c(1:length(list.keepX)), labels = list.keepX)
axis(2)
legend("topright", lty = 1, lwd = 2, legend = c("dim 1", "dim 1:2"), horiz = TRUE,
col = 1:2)