GCCA/Diablo versus MBPLS
Hello,
Can you please confirm my understanding that there are 2 differences between GCCA/Diablo and MultiBlock PLS (MBPLS):

1. In GCCA/Diablo, the sum of covariances that is maximized includes covariances between the Xk scores/components (besides the covariances between the Xk scores/components and the Y scores/components), whereas this is not the case with MBPLS.

2. In GCCA there's a design matrix to be supplied by the user, whereas in MBPLS there are "superweights" or "block weights", which are parameters estimated by the algorithm in order to maximize the weighted sum of covariances between the Xk scores/components and the Y scores/components (the weights being those same "block weights").
If this is correct, can you please explain what the advantages of GCCA over MBPLS are, and why you implemented GCCA rather than MBPLS in mixOmics?
The reason I ask is that I think an obvious drawback of the differences above, for GCCA, is that the Xk scores/components you get with GCCA should explain/predict Y less well than those you get with MBPLS. And by the way, this is probably the reason why, in the Diablo article, the classification performance of Diablo is not as good as that of the ElasticNet, isn't it?
Thanks in advance, Arnaud
Comments (6)

reporter 
Hi Arnaud,
What do you mean by MultiBlock PLS? Our block.pls function? Our wrapper.rgcca? Or something not in mixOmics?

reporter Hi Florian,
I think it's indeed not in mixOmics. By the way, if I'm correct about the 2 differences above, the fact that the mixOmics functions are called "block.pls" is confusing, because MBPLS is significantly different.
I'm referring to the MBPLS first introduced by Wold and then developed by Wangen and Kowalski (1988). For an application to multiomics data, there is "Identifying multilayer gene regulatory modules from multidimensional genomic data" by Li et al. (2012). A very good overview of the different MBPLS algorithms is given in "Deflation in multiblock PLS" by Westerhuis and Smilde (2001).
Please let me know what you think ...
Thanks, Arnaud

Hello Arnaud, sorry for the delay.
I'm not familiar with MBPLS, but the two methods seem quite similar to me:
- in MBPLS you maximise the sum of covariances between the Xk scores/components and the Y scores/components, and these covariances are weighted by the "block weights" that are calculated internally;
- in GCCA/DIABLO, the design matrix allows you to do all of the above and, in addition, to maximise covariances between the Xk scores/components if deemed necessary from a biological point of view. You can reproduce the MBPLS setting by giving the design matrix nonzero entries only for the links with Y, and you can even use precomputed weights for those entries.
For instance, the following code maximises only covariances between each Xk (gene or lipid) and Y (and not between the Xk)
design = matrix(c(0,0,1,
                  0,0,1,
                  1,1,0), ncol = 3, nrow = 3, byrow = TRUE)
rownames(design) = colnames(design) = c("gene", "lipid", "Y")
design
You can also replace the 1s by something else if you want to weight your blocks.
The only difference I see is that in GCCA/DIABLO you would need precomputed weights (which you can get from a non-sparse GCCA: output $weights from block.pls(da)), whereas in MBPLS they are calculated internally (I'm not sure by what criterion).
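The "Y-only" design above generalises to any number of blocks and to non-unit weights. The following is a minimal base-R sketch; `make_y_only_design` is a hypothetical helper written for this illustration, not a mixOmics function, and the weights 0.8 and 0.4 are arbitrary. Whether a given mixOmics version expects the Y row/column to be part of the design matrix is not covered here and should be checked against its documentation.

```r
# Hypothetical helper (not part of mixOmics): build a symmetric design matrix
# whose only nonzero entries link each X block to Y.
make_y_only_design <- function(x_blocks, weights = rep(1, length(x_blocks))) {
  stopifnot(length(weights) == length(x_blocks))
  blocks <- c(x_blocks, "Y")
  design <- matrix(0, nrow = length(blocks), ncol = length(blocks),
                   dimnames = list(blocks, blocks))
  design[x_blocks, "Y"] <- weights   # Xk -> Y links
  design["Y", x_blocks] <- weights   # keep the matrix symmetric
  design
}

# Same matrix as in the comment above
design_unit <- make_y_only_design(c("gene", "lipid"))

# Arbitrary illustrative weights instead of 1s
design_weighted <- make_y_only_design(c("gene", "lipid"), weights = c(0.8, 0.4))
design_weighted
```

All Xk-to-Xk entries stay at 0, so only the covariances with Y enter the criterion.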

reporter Hello Florian,
Thanks for your answer. I have several comments, which can be divided into 3 topics: A) the difference between the block weights of Diablo and those of MBPLS; B) the differences between GCCA and MBPLS and the respective advantages/drawbacks of the 2 methods; and C) some suggestions for mixOmics.
A) Difference between the block weights of Diablo and those of MBPLS
There are (at least) 3 big differences between the Diablo weights and the MBPLS weights:

1. The block weights in MBPLS are, conceptually, the slopes of the regressions of the Y scores on the Xk blocks' scores. This is very different from Diablo's weights, which are the absolute values of the correlations between the Xk scores and the Y scores. If we assume that there's only 1 component/score to simplify, this would be, for a linear regression model Y = a + bX, the difference between the slope "b" (<=> block weight of MBPLS) and the square root of the R2 (<=> block weight of Diablo).

2. The block weights in MBPLS are computed iteratively and jointly with the variable weights (which are also computed iteratively) inside a loop that stops when the Xk "super scores" (sum of each block's scores) converge.

3. The block weights of MBPLS are normalized to length 1 at each iteration of the loop, which isn't the case for the Diablo weights.

(If you want details about some MBPLS algorithms, you can look for example at appendix 1 of "Analysis of multiblock and hierarchical PCA and PLS models" by Westerhuis et al. (1998).)
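To make the first difference concrete, here is a small base-R sketch on simulated one-component scores (the data and variable names are made up for illustration). For a simple regression, slope = r * sd(Y) / sd(X), so the slope and the absolute correlation coincide only when both scores have the same standard deviation.

```r
set.seed(1)

# Simulated scores with very different scales (illustrative values only)
x_score <- rnorm(100, sd = 5)
y_score <- 0.2 * x_score + rnorm(100, sd = 0.5)

# Slope of the regression of the Y score on the X score
slope <- unname(coef(lm(y_score ~ x_score))[2])

# Correlation between the two scores; |r| equals sqrt(R2) in simple regression
r <- cor(x_score, y_score)

# The two quantities are linked by the ratio of standard deviations
all.equal(slope, r * sd(y_score) / sd(x_score))

c(slope = slope, abs_correlation = abs(r))
```

Rescaling `x_score` changes the slope but leaves the correlation untouched, which is why weights of the two kinds are not interchangeable.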
B) Similarity/differences between GCCA and MBPLS and advantages/drawbacks of the 2 methods
First, it seems to me that the general/"usual" way to use GCCA is not to set all the design-matrix coefficients related to the covariances between the Xk to 0 and to set the coefficients relating the Xk to Y to Diablo's block weights. And by the way, both the Diablo article and the mixOmics manual state that the design matrix's coefficients are supposed to have values of 0 or 1 (not weights between 0 and 1).
Second, even in the very specific case where you use GCCA in the way you suggest in your message above, you'd probably still get very different results because of the big differences between the block weights of Diablo and those of MBPLS described in A).
So, in summary, in both cases ("general" and "specific" uses of GCCA), GCCA and MBPLS seem quite different (or at least, even if they look superficially similar, they probably give quite different results in general).
And finally, regarding advantages/drawbacks of both methods, I understand that in some cases it can be interesting to keep/impose a correlation structure between the omics scores (via the design matrix), which is an advantage of GCCA.
But I think that MBPLS explains/predicts the Y better than GCCA even when used in the way of your above message, because MBPLS iteratively and jointly tunes 2 sets of parameters (the variable weights and the block weights), whereas GCCA used in the way you suggest can only tune 1 set of parameters (the variable weights, once the block weights are computed externally once and for all), to maximize the weighted sum of covariances.
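The joint, iterative tuning of variable weights and block weights can be sketched as follows for a single component. This is a simplified base-R illustration in the spirit of the super-score algorithms described in the multiblock PLS literature, not a reference implementation: `mbpls_component` is a hypothetical function, the data are simulated, and block scaling and deflation are deliberately left out.

```r
# One MBPLS-style component (sketch; no block scaling, no deflation)
mbpls_component <- function(blocks, Y, tol = 1e-10, max_iter = 500) {
  Y <- as.matrix(Y)
  u <- Y[, 1, drop = FALSE]                 # initial Y score
  t_super_old <- NULL
  for (iter in seq_len(max_iter)) {
    # Variable weights per block, normalized to length 1
    w_blocks <- lapply(blocks, function(X) {
      w <- crossprod(X, u)
      w / sqrt(sum(w^2))
    })
    # Block scores, one column per block
    T_blocks <- do.call(cbind, Map(function(X, w) X %*% w, blocks, w_blocks))
    # Block weights ("super weights"), normalized to length 1
    w_super <- crossprod(T_blocks, u)
    w_super <- w_super / sqrt(sum(w_super^2))
    # Super score: weighted combination of the block scores
    t_super <- T_blocks %*% w_super
    # Y loadings and updated Y score
    q <- crossprod(Y, t_super) / sum(t_super^2)
    u <- Y %*% q / sum(q^2)
    # Stop when the super score has converged
    if (!is.null(t_super_old) &&
        sum((t_super - t_super_old)^2) < tol * sum(t_super^2)) break
    t_super_old <- t_super
  }
  list(variable_weights = w_blocks, block_weights = drop(w_super),
       block_scores = T_blocks, super_score = drop(t_super), iterations = iter)
}

# Simulated, column-centered data (illustrative only)
set.seed(7)
n <- 50
latent <- rnorm(n)
blocks <- list(
  gene  = scale(matrix(rnorm(n * 20), n, 20) + latent, scale = FALSE),
  lipid = scale(matrix(rnorm(n * 8),  n, 8) + 0.3 * latent, scale = FALSE)
)
Y <- scale(cbind(latent + rnorm(n, sd = 0.2), -latent + rnorm(n, sd = 0.2)),
           scale = FALSE)

res <- mbpls_component(blocks, Y)
res$block_weights
```

Both sets of weights are re-estimated at every pass through the loop, whereas a design matrix with fixed entries keeps the block-level weighting constant while only the variable weights are updated.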
In fact, you could test this fairly easily, because some MBPLS algorithms give prediction results strictly equivalent to those obtained by applying PLS to the concatenated Xk blocks (cf. the Westerhuis and Smilde article). So you could apply the pls function in mixOmics to the datasets in the Diablo article (with concatenated Xk) and compare the classification performance to that of GCCA/Diablo (I might actually do it myself in the next weeks/months if I have the time).
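A rough sketch of that comparison baseline is given below. It uses simulated data rather than the Diablo article's datasets, and since the outcome there is a class label it assumes the discriminant variant `plsda` (instead of `pls`) together with `perf` for cross-validation; the fold and repeat settings are arbitrary. The mixOmics part only runs if the package is installed.

```r
set.seed(42)
n <- 40
Y <- factor(rep(c("A", "B"), each = n / 2))
shift <- as.numeric(Y) - 1                 # 0 for class A, 1 for class B

# Two simulated blocks with a class effect (illustrative only)
blocks <- list(
  gene  = matrix(rnorm(n * 30), n, 30) + shift,
  lipid = matrix(rnorm(n * 10), n, 10) + 0.5 * shift
)
# Prefix column names so each variable's block stays identifiable
colnames(blocks$gene)  <- paste0("gene_", seq_len(ncol(blocks$gene)))
colnames(blocks$lipid) <- paste0("lipid_", seq_len(ncol(blocks$lipid)))

# Concatenate the Xk blocks column-wise
X_concat <- do.call(cbind, blocks)
dim(X_concat)

if (requireNamespace("mixOmics", quietly = TRUE)) {
  # PLS-DA on the concatenated blocks
  fit <- mixOmics::plsda(X_concat, Y, ncomp = 2)
  # Repeated 5-fold cross-validation of the classification error
  cv <- mixOmics::perf(fit, validation = "Mfold", folds = 5, nrepeat = 5,
                       progressBar = FALSE)
  print(cv$error.rate$overall)
}
```

The same folds and number of components would have to be used for the block.plsda fit for the error rates to be comparable.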
C) Suggestions for mixOmics

1. In light of the differences between GCCA and MBPLS explained in A and B, I think you should change the name of the "block.pls" function in mixOmics to "gcca" or something like that, in order not to confuse users.

2. In case you haven't read the Westerhuis and Smilde article: there's a variant of MBPLS (the 3rd algorithm in the paper: deflation of Y using super scores) which, although it gives the same prediction results as PLS applied to the concatenated Xk, yields "good" Xk scores associated with each block (whereas PLS applied to the concatenated Xk yields X scores which can include variables from different blocks/omics). This can be useful for interpretation purposes. So it would be great to add this variant of MBPLS to mixOmics, in order to make mixOmics a very comprehensive package (and then it would be legitimate to call the associated function "block.pls" :) ). On top of this, it would also be useful for the wider R community, because I don't think there's an R package implementing the "deflation of Y using super scores" variant of MBPLS, which is supposed to be the best one (I believe the ade4 package implements the "deflation of X using super scores" variant of MBPLS, which is not as good for interpretation purposes).
Thanks in advance to the mixOmics team if you decide to do it!
Regards, Arnaud


repo owner Hi Arnaud,
As discussed yesterday, we have noted your request and will follow up in a few months. Thank you for your suggestions and input.
To clarify this issue: our block.plsda function is NOT the same implementation as the MultiBlock PLS method. We start from the Generalised CCA algorithm, but with asymmetric deflations and further improvements to GCCA in order to perform integration of multiple data sets and feature selection.