Suboptimal results obtained with sPLSDA

Issue #162 resolved
Former user created an issue

Dear mixOmics team,

First of all I would like to thank you so much for hard working to develop such an amazing tool. And also to congratulate Anh Le for getting The Moran Medal in Statistical Sciences from the Australian Academy of Science.

With the intern I am supervising, we are trying to define a universal transcriptomic signature for human cellular senescence. To achieve this, we have a compendium of 100 transcriptomes corresponding to timecourses where senescence have been induced with different triggers (DNA damage, oncogene induction, oxidative stress ...) as well as a quiescence control timecourse.

We use these data as the X matrix along a Y matrix which we parametrize so that we are matching senescence phases across inducers (CONTROL, EARLY, INTERMEDIATE, LATE senescence, and finally QUIESCENCE) as the input for an sPLS-DA.

After fine-tuning, we identify the optimal number of components to keep as well as the per-component number of variables to select. Using these parameter we finally run the sPLS-DA to get our signature.

Below is what we get when looking at the CIM.

CIM for sPLS-DA with parameters defined with tuning

As you can see, the classification is not optimal. And we noticed that by "artificially" increasing - even slightly - the number of variables to keep on each of the component, we managed to get better results.

CIM for sPLS-DA with parameters defined with tuning but with additional variables on each components

Making this comparison, we "know" our data has a better potential than the one we reached with the parameters given by the fine tuning. So we are probably making something wrong here.

The questions that we have are :

1) Could you please give us some ideas on how to explain that there is still valuable biological information hidden in our data that is not considered during the tuning of the sPLS-DA ?

2) Is there something that we are likely to make wrong ? Especially while seting up the validation parameters for the tuning phases ?

3) Despite reading the documentation for the CIM function, we fail to understand how the matrix used as an input relate to the original data. Would it be possible to have some additional explanations ?

Below is the code we used.

# Prepare data
X <- t(data_final)
Y <- pdata$Phases
Y <- droplevels(Y)


#---> PCA
pca.srbct = pca(X, ncomp = 10, center =TRUE, scale = TRUE)
plot(pca.srbct)
plotIndiv(pca.srbct, group = Y, ind.names = TRUE, 
          legend = TRUE, title = 'PCA on SRBCT')


#---> PLSDA
srbct.plsda <- plsda(X, Y, ncomp = 10)  # set ncomp to 10 for performance assessment later
plotIndiv(srbct.plsda , comp = c(1,2),
          group = Y, ind.names = TRUE, 
          ellipse = TRUE, legend = TRUE, title = 'PLSDA on SRBCT')

# with background
background = background.predict(srbct.plsda, comp.predicted=2, dist = "mahalanobis.dist") 


plotIndiv(srbct.plsda, comp = 1:2,
          group = Y, ind.names = TRUE, title = "Maximum distance",
          legend = TRUE,  background = background)


#---> sPLSDA

# Define the number of component

set.seed(2543) # for reproducibility, only when the `cpus' argument is not used
perf.plsda.srbct <- perf(srbct.plsda, validation = "Mfold", folds = 4, 
                         progressBar = TRUE, auc = TRUE, nrepeat = 30) 

plot(perf.plsda.srbct, col = color.mixo(5:7), sd = TRUE, legend.position = "horizontal")
auc.plsda = auroc(srbct.plsda, roc.comp = 6)

# Define the number of variables per components

list.keepX <- c(1:10,  seq(20, 1000, 100))

t1 = proc.time()  
tune.splsda.srbct <- tune.splsda(X, Y, ncomp = 3, validation = "Mfold", folds = 4,
                                 progressBar = TRUE, dist = 'mahalanobis.dist', measure = "BER",
                                 test.keepX = list.keepX, nrepeat = 30)
t2 = proc.time()
running_time = t2 - t1; running_time # running time

error <- tune.splsda.srbct$error.rate
ncomp <- tune.splsda.srbct$choice.ncomp$ncomp
select.keepX <- tune.splsda.srbct$choice.keepX[1:ncomp]

select.keepX <- tune.splsda.srbct$choice.keepX[1:ncomp]  # optimal number of variables to select
select.keepX

plot(tune.splsda.srbct, col = color.jet(2))

# Run the sPLSDA with defined parameters

splsda.srbct <- splsda(X, Y, ncomp = ncomp, keepX = select.keepX) 
plotIndiv(splsda.srbct, comp = c(1,2),
          group = Y, ind.names = FALSE, 
          ellipse = TRUE, legend = TRUE,
          title = 'sPLS-DA on SRBCT, comp 1 & 2')
auc.splsda = auroc(splsda.srbct, roc.comp = 6)
auc.splsda = auroc(splsda.srbct, roc.comp = ncomp)

set.seed(40) # for reproducibility, only when the `cpus' argument is not used
# takes about 1 min to run
perf.srbct <- perf(splsda.srbct, validation = "Mfold", folds = 2,
                   dist = 'mahalanobis.dist', measure = "BER",
                   progressBar = TRUE, ncomp = 6) 


colors <-  color.mixo(1:5)

pdf("~/Desktop/LISA/CIM.pdf")
cim(splsda.srbct, row.sideColors = colors[factor(Y)], margins = c(10,10),
    legend=list(legend = unique(levels(Y)), col=colors), row.cex = .1, col.cex = .1, transpose = TRUE)
dev.off()

Comments (4)

  1. Pierre-Francois Roux

    I am sorry, I completely forgot to log in to bitbucket before sending the issue.

    But I am the writer of the issue "Issue #162".

    Thank you so much for your advice and help.

    Best,

    Pierre-François

  2. Al J Abadi

    Hi, sorry we’re getting back to you now as we had not realised we have not updated our issues page on Bioconductor (we’re now on GitHub and discussions are on Discourse).

    This is an analysis discussion and not a software issue. May I please ask you to submit it to Discourse, if you continue to have the question? That way other users will also be able to benefit from it.

  3. Log in to comment