Suboptimal results obtained with sPLSDA
Dear mixOmics team,
First of all I would like to thank you so much for hard working to develop such an amazing tool. And also to congratulate Anh Le for getting The Moran Medal in Statistical Sciences from the Australian Academy of Science.
With the intern I am supervising, we are trying to define a universal transcriptomic signature for human cellular senescence. To achieve this, we have a compendium of 100 transcriptomes corresponding to timecourses where senescence have been induced with different triggers (DNA damage, oncogene induction, oxidative stress ...) as well as a quiescence control timecourse.
We use these data as the X matrix along a Y matrix which we parametrize so that we are matching senescence phases across inducers (CONTROL, EARLY, INTERMEDIATE, LATE senescence, and finally QUIESCENCE) as the input for an sPLS-DA.
After fine-tuning, we identify the optimal number of components to keep as well as the per-component number of variables to select. Using these parameter we finally run the sPLS-DA to get our signature.
Below is what we get when looking at the CIM.
As you can see, the classification is not optimal. And we noticed that by "artificially" increasing - even slightly - the number of variables to keep on each of the component, we managed to get better results.
Making this comparison, we "know" our data has a better potential than the one we reached with the parameters given by the fine tuning. So we are probably making something wrong here.
The questions that we have are :
1) Could you please give us some ideas on how to explain that there is still valuable biological information hidden in our data that is not considered during the tuning of the sPLS-DA ?
2) Is there something that we are likely to make wrong ? Especially while seting up the validation parameters for the tuning phases ?
3) Despite reading the documentation for the CIM function, we fail to understand how the matrix used as an input relate to the original data. Would it be possible to have some additional explanations ?
Below is the code we used.
# Prepare data
X <- t(data_final)
Y <- pdata$Phases
Y <- droplevels(Y)
#---> PCA
pca.srbct = pca(X, ncomp = 10, center =TRUE, scale = TRUE)
plot(pca.srbct)
plotIndiv(pca.srbct, group = Y, ind.names = TRUE,
legend = TRUE, title = 'PCA on SRBCT')
#---> PLSDA
srbct.plsda <- plsda(X, Y, ncomp = 10) # set ncomp to 10 for performance assessment later
plotIndiv(srbct.plsda , comp = c(1,2),
group = Y, ind.names = TRUE,
ellipse = TRUE, legend = TRUE, title = 'PLSDA on SRBCT')
# with background
background = background.predict(srbct.plsda, comp.predicted=2, dist = "mahalanobis.dist")
plotIndiv(srbct.plsda, comp = 1:2,
group = Y, ind.names = TRUE, title = "Maximum distance",
legend = TRUE, background = background)
#---> sPLSDA
# Define the number of component
set.seed(2543) # for reproducibility, only when the `cpus' argument is not used
perf.plsda.srbct <- perf(srbct.plsda, validation = "Mfold", folds = 4,
progressBar = TRUE, auc = TRUE, nrepeat = 30)
plot(perf.plsda.srbct, col = color.mixo(5:7), sd = TRUE, legend.position = "horizontal")
auc.plsda = auroc(srbct.plsda, roc.comp = 6)
# Define the number of variables per components
list.keepX <- c(1:10, seq(20, 1000, 100))
t1 = proc.time()
tune.splsda.srbct <- tune.splsda(X, Y, ncomp = 3, validation = "Mfold", folds = 4,
progressBar = TRUE, dist = 'mahalanobis.dist', measure = "BER",
test.keepX = list.keepX, nrepeat = 30)
t2 = proc.time()
running_time = t2 - t1; running_time # running time
error <- tune.splsda.srbct$error.rate
ncomp <- tune.splsda.srbct$choice.ncomp$ncomp
select.keepX <- tune.splsda.srbct$choice.keepX[1:ncomp]
select.keepX <- tune.splsda.srbct$choice.keepX[1:ncomp] # optimal number of variables to select
select.keepX
plot(tune.splsda.srbct, col = color.jet(2))
# Run the sPLSDA with defined parameters
splsda.srbct <- splsda(X, Y, ncomp = ncomp, keepX = select.keepX)
plotIndiv(splsda.srbct, comp = c(1,2),
group = Y, ind.names = FALSE,
ellipse = TRUE, legend = TRUE,
title = 'sPLS-DA on SRBCT, comp 1 & 2')
auc.splsda = auroc(splsda.srbct, roc.comp = 6)
auc.splsda = auroc(splsda.srbct, roc.comp = ncomp)
set.seed(40) # for reproducibility, only when the `cpus' argument is not used
# takes about 1 min to run
perf.srbct <- perf(splsda.srbct, validation = "Mfold", folds = 2,
dist = 'mahalanobis.dist', measure = "BER",
progressBar = TRUE, ncomp = 6)
colors <- color.mixo(1:5)
pdf("~/Desktop/LISA/CIM.pdf")
cim(splsda.srbct, row.sideColors = colors[factor(Y)], margins = c(10,10),
legend=list(legend = unique(levels(Y)), col=colors), row.cex = .1, col.cex = .1, transpose = TRUE)
dev.off()
Comments (4)
-
-
- attached CIM_SPLSDA_WITH_TUNING.pdf
- attached CIM_SPLSDA_WITH_TUNING_PLUS_VARIABLES.pdf
These are the two CIM I am mentioning earlier in the text, and which are also hosted on my GitHub. I am having a hard time to embedded them in the text using URL.
-
Hi, sorry we’re getting back to you now as we had not realised we have not updated our issues page on Bioconductor (we’re now on GitHub and discussions are on Discourse).
This is an analysis discussion and not a software issue. May I please ask you to submit it to Discourse, if you continue to have the question? That way other users will also be able to benefit from it.
-
- changed status to resolved
To move to Discourse
- Log in to comment
I am sorry, I completely forgot to log in to bitbucket before sending the issue.
But I am the writer of the issue "Issue
#162".Thank you so much for your advice and help.
Best,
Pierre-François