Suboptimal results obtained with sPLSDA

Dear mixOmics team,

First of all I would like to thank you so much for hard working to develop such an amazing tool. And also to congratulate Anh Le for getting The Moran Medal in Statistical Sciences from the Australian Academy of Science.

With the intern I am supervising, we are trying to define a universal transcriptomic signature for human cellular senescence. To achieve this, we have a compendium of 100 transcriptomes corresponding to timecourses where senescence have been induced with different triggers (DNA damage, oncogene induction, oxidative stress ...) as well as a quiescence control timecourse.

We use these data as the X matrix along a Y matrix which we parametrize so that we are matching senescence phases across inducers (CONTROL, EARLY, INTERMEDIATE, LATE senescence, and finally QUIESCENCE) as the input for an sPLS-DA.

After fine-tuning, we identify the optimal number of components to keep as well as the per-component number of variables to select. Using these parameter we finally run the sPLS-DA to get our signature.

Below is what we get when looking at the CIM.

CIM for sPLS-DA with parameters defined with tuning

As you can see, the classification is not optimal. And we noticed that by "artificially" increasing - even slightly - the number of variables to keep on each of the component, we managed to get better results.

CIM for sPLS-DA with parameters defined with tuning but with additional variables on each components

Making this comparison, we "know" our data has a better potential than the one we reached with the parameters given by the fine tuning. So we are probably making something wrong here.

The questions that we have are :

1) Could you please give us some ideas on how to explain that there is still valuable biological information hidden in our data that is not considered during the tuning of the sPLS-DA ?

2) Is there something that we are likely to make wrong ? Especially while seting up the validation parameters for the tuning phases ?

3) Despite reading the documentation for the CIM function, we fail to understand how the matrix used as an input relate to the original data. Would it be possible to have some additional explanations ?

Below is the code we used.

# Prepare data
X <- t(data_final)
Y <- pdata$Phases
Y <- droplevels(Y)


#---> PCA
pca.srbct = pca(X, ncomp = 10, center =TRUE, scale = TRUE)
plot(pca.srbct)
plotIndiv(pca.srbct, group = Y, ind.names = TRUE, 
          legend = TRUE, title = 'PCA on SRBCT')


#---> PLSDA
srbct.plsda <- plsda(X, Y, ncomp = 10)  # set ncomp to 10 for performance assessment later
plotIndiv(srbct.plsda , comp = c(1,2),
          group = Y, ind.names = TRUE, 
          ellipse = TRUE, legend = TRUE, title = 'PLSDA on SRBCT')

# with background
background = background.predict(srbct.plsda, comp.predicted=2, dist = "mahalanobis.dist") 


plotIndiv(srbct.plsda, comp = 1:2,
          group = Y, ind.names = TRUE, title = "Maximum distance",
          legend = TRUE,  background = background)


#---> sPLSDA

# Define the number of component

set.seed(2543) # for reproducibility, only when the `cpus' argument is not used
perf.plsda.srbct <- perf(srbct.plsda, validation = "Mfold", folds = 4, 
                         progressBar = TRUE, auc = TRUE, nrepeat = 30) 

plot(perf.plsda.srbct, col = color.mixo(5:7), sd = TRUE, legend.position = "horizontal")
auc.plsda = auroc(srbct.plsda, roc.comp = 6)

# Define the number of variables per components

list.keepX <- c(1:10,  seq(20, 1000, 100))

t1 = proc.time()  
tune.splsda.srbct <- tune.splsda(X, Y, ncomp = 3, validation = "Mfold", folds = 4,
                                 progressBar = TRUE, dist = 'mahalanobis.dist', measure = "BER",
                                 test.keepX = list.keepX, nrepeat = 30)
t2 = proc.time()
running_time = t2 - t1; running_time # running time

error <- tune.splsda.srbct$error.rate
ncomp <- tune.splsda.srbct$choice.ncomp$ncomp
select.keepX <- tune.splsda.srbct$choice.keepX[1:ncomp]

select.keepX <- tune.splsda.srbct$choice.keepX[1:ncomp]  # optimal number of variables to select
select.keepX

plot(tune.splsda.srbct, col = color.jet(2))

# Run the sPLSDA with defined parameters

splsda.srbct <- splsda(X, Y, ncomp = ncomp, keepX = select.keepX) 
plotIndiv(splsda.srbct, comp = c(1,2),
          group = Y, ind.names = FALSE, 
          ellipse = TRUE, legend = TRUE,
          title = 'sPLS-DA on SRBCT, comp 1 & 2')
auc.splsda = auroc(splsda.srbct, roc.comp = 6)
auc.splsda = auroc(splsda.srbct, roc.comp = ncomp)

set.seed(40) # for reproducibility, only when the `cpus' argument is not used
# takes about 1 min to run
perf.srbct <- perf(splsda.srbct, validation = "Mfold", folds = 2,
                   dist = 'mahalanobis.dist', measure = "BER",
                   progressBar = TRUE, ncomp = 6) 


colors <-  color.mixo(1:5)

pdf("~/Desktop/LISA/CIM.pdf")
cim(splsda.srbct, row.sideColors = colors[factor(Y)], margins = c(10,10),
    legend=list(legend = unique(levels(Y)), col=colors), row.cex = .1, col.cex = .1, transpose = TRUE)
dev.off()

Comments (4)