Dear mixOmics team,
First of all I would like to thank you so much for hard working to develop such an amazing tool. And also to congratulate Anh Le for getting The Moran Medal in Statistical Sciences from the Australian Academy of Science.
With the intern I am supervising, we are trying to define a universal transcriptomic signature for human cellular senescence. To achieve this, we have a compendium of 100 transcriptomes corresponding to timecourses where senescence have been induced with different triggers (DNA damage, oncogene induction, oxidative stress ...) as well as a quiescence control timecourse.
We use these data as the X matrix along a Y matrix which we parametrize so that we are matching senescence phases across inducers (CONTROL, EARLY, INTERMEDIATE, LATE senescence, and finally QUIESCENCE) as the input for an sPLS-DA.
After fine-tuning, we identify the optimal number of components to keep as well as the per-component number of variables to select. Using these parameter we finally run the sPLS-DA to get our signature.
Below is what we get when looking at the CIM.
As you can see, the classification is not optimal. And we noticed that by "artificially" increasing - even slightly - the number of variables to keep on each of the component, we managed to get better results.
Making this comparison, we "know" our data has a better potential than the one we reached with the parameters given by the fine tuning. So we are probably making something wrong here.
The questions that we have are :
1) Could you please give us some ideas on how to explain that there is still valuable biological information hidden in our data that is not considered during the tuning of the sPLS-DA ?
2) Is there something that we are likely to make wrong ? Especially while seting up the validation parameters for the tuning phases ?
3) Despite reading the documentation for the CIM function, we fail to understand how the matrix used as an input relate to the original data. Would it be possible to have some additional explanations ?
Below is the code we used.
# Prepare data X <- t(data_final) Y <- pdata$Phases Y <- droplevels(Y) #---> PCA pca.srbct = pca(X, ncomp = 10, center =TRUE, scale = TRUE) plot(pca.srbct) plotIndiv(pca.srbct, group = Y, ind.names = TRUE, legend = TRUE, title = 'PCA on SRBCT') #---> PLSDA srbct.plsda <- plsda(X, Y, ncomp = 10) # set ncomp to 10 for performance assessment later plotIndiv(srbct.plsda , comp = c(1,2), group = Y, ind.names = TRUE, ellipse = TRUE, legend = TRUE, title = 'PLSDA on SRBCT') # with background background = background.predict(srbct.plsda, comp.predicted=2, dist = "mahalanobis.dist") plotIndiv(srbct.plsda, comp = 1:2, group = Y, ind.names = TRUE, title = "Maximum distance", legend = TRUE, background = background) #---> sPLSDA # Define the number of component set.seed(2543) # for reproducibility, only when the `cpus' argument is not used perf.plsda.srbct <- perf(srbct.plsda, validation = "Mfold", folds = 4, progressBar = TRUE, auc = TRUE, nrepeat = 30) plot(perf.plsda.srbct, col = color.mixo(5:7), sd = TRUE, legend.position = "horizontal") auc.plsda = auroc(srbct.plsda, roc.comp = 6) # Define the number of variables per components list.keepX <- c(1:10, seq(20, 1000, 100)) t1 = proc.time() tune.splsda.srbct <- tune.splsda(X, Y, ncomp = 3, validation = "Mfold", folds = 4, progressBar = TRUE, dist = 'mahalanobis.dist', measure = "BER", test.keepX = list.keepX, nrepeat = 30) t2 = proc.time() running_time = t2 - t1; running_time # running time error <- tune.splsda.srbct$error.rate ncomp <- tune.splsda.srbct$choice.ncomp$ncomp select.keepX <- tune.splsda.srbct$choice.keepX[1:ncomp] select.keepX <- tune.splsda.srbct$choice.keepX[1:ncomp] # optimal number of variables to select select.keepX plot(tune.splsda.srbct, col = color.jet(2)) # Run the sPLSDA with defined parameters splsda.srbct <- splsda(X, Y, ncomp = ncomp, keepX = select.keepX) plotIndiv(splsda.srbct, comp = c(1,2), group = Y, ind.names = FALSE, ellipse = TRUE, legend = TRUE, title = 'sPLS-DA on SRBCT, comp 1 & 2') auc.splsda = auroc(splsda.srbct, roc.comp = 6) auc.splsda = auroc(splsda.srbct, roc.comp = ncomp) set.seed(40) # for reproducibility, only when the `cpus' argument is not used # takes about 1 min to run perf.srbct <- perf(splsda.srbct, validation = "Mfold", folds = 2, dist = 'mahalanobis.dist', measure = "BER", progressBar = TRUE, ncomp = 6) colors <- color.mixo(1:5) pdf("~/Desktop/LISA/CIM.pdf") cim(splsda.srbct, row.sideColors = colors[factor(Y)], margins = c(10,10), legend=list(legend = unique(levels(Y)), col=colors), row.cex = .1, col.cex = .1, transpose = TRUE) dev.off()