GCCA/Diablo versus MB-PLS

Issue #84 new
Arnaud created an issue

Hello,

Can you please confirm my understanding that there are two differences between GCCA/Diablo and Multi-Block PLS (MB-PLS):

  1. In GCCA/Diablo, the sum of covariances being maximized includes the covariances between the Xk scores/components themselves (in addition to the covariances between the Xk scores/components and the Y scores/components), whereas this is not the case with MB-PLS.

  2. In GCCA there is a design matrix to be supplied by the user, whereas in MB-PLS there are “super-weights” or “block weights”, which are parameters estimated by the algorithm so as to maximize the weighted sum (the weights being those block weights) of covariances between the Xk scores/components and the Y scores/components.

If this is correct, can you please explain the advantages of GCCA over MB-PLS, and why you implemented GCCA rather than MB-PLS in mixOmics?

The reason I ask is that I think an obvious drawback of GCCA, given the differences above, is that the Xk scores/components you get with GCCA should explain/predict Y less well than those you get with MB-PLS. Incidentally, this is probably why, in the Diablo article, the classification performance of Diablo is not as good as that of the Elastic Net, isn't it?

Thanks in advance, Arnaud

Comments (6)

  1. Florian Rohart

    Hi Arnaud,

    What do you mean by Multi-Block PLS? Our block.pls function? Our wrapper.rgcca? Or something not in mixOmics?

  2. Arnaud reporter

    Hi Florian,

    I think it's indeed not in mixOmics, and if I'm correct about the two differences above, the fact that the mixOmics functions are called "block.pls" is confusing, because MB-PLS is significantly different.

    I'm referring to the MB-PLS first introduced by Wold and then developed by Wangen and Kowalski (1989). For an application to multi-omics data, there is "Identifying multi-layer gene regulatory modules from multi-dimensional genomic data" by Li et al. (2012). A very good overview of different MB-PLS algorithms is given in "Deflation in multiblock PLS" by Westerhuis and Smilde (2001).

    Please let me know what you think ...

    Thanks, Arnaud

  3. Florian Rohart

    Hello Arnaud, sorry for the delay.

    I'm not familiar with MB-PLS, but it seems to me that the two methods are quite similar:

    - in MB-PLS, you maximise the sum of covariances between the Xk scores/components and the Y scores/components, and these covariances are weighted by the "block weights", which are calculated internally;

    - in GCCA/DIABLO, the design matrix allows you to do all of the above, plus maximise the covariances between the Xk scores/components themselves if that is deemed necessary from a biological point of view. You can recover the MB-PLS setting by putting non-zero entries in the design matrix only for Y, and you can even use pre-computed weights.

    For instance, the following code builds a design matrix that maximises only the covariances between each Xk (gene or lipid) and Y, and not between the Xk blocks:

    # rows/columns: gene, lipid, Y; a 1 links two blocks, a 0 leaves them unconnected
    design = matrix(c(0, 0, 1,
                      0, 0, 1,
                      1, 1, 0), ncol = 3, nrow = 3, byrow = TRUE)
    rownames(design) = colnames(design) = c("gene", "lipid", "Y")
    design
    

    You can also replace the 1s by something else if you want to weight your blocks.

    The only difference I see is that in GCCA/DIABLO you would need pre-computed block weights (which you can get from a non-sparse GCCA, i.e. the $weights output of block.pls or block.plsda) instead of having them calculated internally (I'm not sure according to what criterion) as in MB-PLS.
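    For concreteness, such a call could look like this (a hypothetical sketch with simulated data standing in for real gene/lipid blocks; block.plsda and its design argument are the only real mixOmics features, everything else is made up for illustration):

```r
# Hypothetical sketch: passing the design matrix above to block.plsda,
# with simulated data standing in for real gene/lipid blocks.
library(mixOmics)
set.seed(4)
n <- 24
X <- list(gene  = matrix(rnorm(n * 8), n, 8),
          lipid = matrix(rnorm(n * 5), n, 5))
colnames(X$gene)  <- paste0("g", 1:8)
colnames(X$lipid) <- paste0("l", 1:5)
Y <- factor(rep(c("A", "B"), each = n / 2))

# non-zero entries only between each block and Y, as in the comment above
design <- matrix(c(0, 0, 1,
                   0, 0, 1,
                   1, 1, 0), ncol = 3, byrow = TRUE,
                 dimnames = list(c("gene", "lipid", "Y"),
                                 c("gene", "lipid", "Y")))

fit <- block.plsda(X, Y, ncomp = 2, design = design)
```

    The 1s could be replaced by pre-computed weights between 0 and 1 to weight the blocks, as discussed above.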

  4. Arnaud reporter

    Hello Florian,

    Thanks for your answer. I have several comments, which can be divided into three topics: A) the difference between the block weights of Diablo and those of MB-PLS; B) the differences between GCCA and MB-PLS and the respective advantages/drawbacks of the two methods; and C) some suggestions for mixOmics.

    A) Difference between the block weights of Diablo and those of MB-PLS

    There are (at least) three big differences between the Diablo weights and the MB-PLS weights:

    1. The block weights in MB-PLS are, conceptually, the slopes of the regressions of the Y scores on the Xk blocks' scores. This is very different from Diablo's weights, which are the absolute values of the correlations between the Xk scores and the Y scores. Assuming a single component/score for simplicity, for a linear regression model Y = a + bX this is the difference between the slope b (the MB-PLS block weight) and the square root of the R² (the Diablo block weight).

    2. The block weights in MB-PLS are computed iteratively and jointly with the variable weights (which are also computed iteratively), inside a loop that stops when the Xk "super scores" (the weighted sum of the blocks' scores) converge.

    3. The block weights of MB-PLS are normalized to unit length at each iteration of the loop, which is not the case for the Diablo weights.

    (If you want details about some MB-PLS algorithms, see for example Appendix 1 of "Analysis of multiblock and hierarchical PCA and PLS models" by Westerhuis et al. (1998).)
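    For illustration, the iterative/joint estimation described in points 2 and 3 can be sketched in a few lines of base R (a hypothetical one-component sketch following the NIPALS-style loop of the references above, not mixOmics code; the data are simulated):

```r
# Hypothetical one-component MB-PLS sketch (NIPALS-style loop): variable
# weights and block "super weights" are estimated jointly, and the super
# weights are re-normalized to unit length at every iteration.
set.seed(1)
n <- 20
X <- list(gene  = matrix(rnorm(n * 5), n, 5),
          lipid = matrix(rnorm(n * 3), n, 3))
Y <- matrix(rnorm(n * 2), n, 2)

mbpls_one_comp <- function(X, Y, tol = 1e-10, max_iter = 500) {
  u <- Y[, 1]                       # initialise the Y score
  tT_old <- rep(0, nrow(Y))
  for (iter in seq_len(max_iter)) {
    # variable weights and block scores, one set per block
    w <- lapply(X, function(Xk) { wk <- crossprod(Xk, u); wk / sqrt(sum(wk^2)) })
    Tb <- sapply(seq_along(X), function(k) drop(X[[k]] %*% w[[k]]))  # n x K
    # super weights, normalized to unit length inside each loop instance
    wT <- crossprod(Tb, u); wT <- wT / sqrt(sum(wT^2))
    tT <- Tb %*% wT                 # super score (weighted sum of block scores)
    # Y loadings and Y score
    q <- crossprod(Y, tT); q <- q / sqrt(sum(q^2))
    u <- Y %*% q
    if (sum((tT - tT_old)^2) < tol) break  # stop at convergence of the super score
    tT_old <- tT
  }
  list(variable_weights = w, super_weights = wT, super_score = tT, iterations = iter)
}

fit <- mbpls_one_comp(X, Y)
sum(fit$super_weights^2)            # 1: super weights have unit length
```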

    B) Similarity/differences between GCCA and MB-PLS and advantages/drawbacks of the 2 methods

    First, it seems to me that the general/"usual" way to use GCCA is not to set all the design-matrix coefficients relating the Xk to each other to 0 while setting the coefficients relating the Xk to Y to Diablo's block weights. Moreover, both the Diablo article and the mixOmics manual state that the design matrix's coefficients are supposed to take the values 0 or 1 (not weights between 0 and 1).

    Second, even in the very specific case where GCCA is used in the way you suggest in your message above, you would probably still get very different results, because of the big differences between Diablo's block weights and those of MB-PLS described in A).

    So, in summary, in both cases ("general" and "specific" uses of GCCA), GCCA and MB-PLS seem quite different (or at least, even if they look superficially similar, they probably give quite different results in general).

    And finally, regarding the advantages/drawbacks of the two methods, I understand that in some cases it can be useful to keep/impose a correlation structure between the omics scores (via the design matrix), which is an advantage of GCCA.

    But I think that MB-PLS explains/predicts Y better than GCCA, even when GCCA is used in the way described in your message above, because MB-PLS iteratively and jointly tunes two sets of parameters (the variable weights and the block weights) to maximize the weighted sum of covariances, whereas GCCA used in that way can only tune one set of parameters (the variable weights, once the block weights have been computed externally once and for all).

    In fact, you could test this fairly easily, because some MB-PLS algorithms give prediction results strictly equivalent to those obtained by applying PLS to the concatenated Xk blocks (cf. the Westerhuis and Smilde article). So you could apply the pls function in mixOmics to the datasets of the Diablo article (with the Xk concatenated) and compare the classification performance to that of GCCA/Diablo (I might actually do it myself in the next weeks/months if I have the time).
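    The skeleton of such a comparison could look like this (a hypothetical sketch with simulated data standing in for the Diablo datasets; plsda, block.plsda and perf are real mixOmics functions, the rest is made up for illustration):

```r
# Hypothetical sketch of the suggested comparison: PLS-DA on the
# concatenated blocks (equivalent in prediction, per Westerhuis & Smilde,
# to some MB-PLS variants) versus block.plsda on the separate blocks.
library(mixOmics)
set.seed(3)
n <- 30
X <- list(gene  = matrix(rnorm(n * 10), n, 10),
          lipid = matrix(rnorm(n * 6),  n, 6))
colnames(X$gene)  <- paste0("g", 1:10)
colnames(X$lipid) <- paste0("l", 1:6)
Y <- factor(rep(c("A", "B"), each = n / 2))

fit_concat <- plsda(do.call(cbind, X), Y, ncomp = 2)  # "MB-PLS-like" baseline
fit_block  <- block.plsda(X, Y, ncomp = 2)            # GCCA/Diablo side

# classification error rates could then be compared with perf(), e.g.
# perf(fit_concat, validation = "Mfold", folds = 5)
```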

    C) Suggestions for mixOmics

    1. In light of the differences between GCCA and MB-PLS explained in A) and B), I think you should rename the "block.pls" function in mixOmics to "gcca" or something like that, in order not to confuse users.

    2. If you haven't read the Westerhuis and Smilde article, there's a variant of MB-PLS (the third algorithm in the paper: deflation of Y using super scores) which, although it gives the same prediction results as PLS applied to the concatenated Xk, yields "good" Xk scores associated with each block (whereas PLS applied to the concatenated Xk yields X scores which can mix variables from different blocks/omics), which is useful for interpretation purposes. So it would be great to add this variant of MB-PLS to mixOmics, to make mixOmics a very comprehensive package (and then it would be legitimate to call the associated function "block.pls" :-) …).

    On top of this, it would also be useful for the wider R community, because I don't think there's an R package implementing the "deflation of Y using super scores" variant of MB-PLS, which is supposed to be the best one (I believe the ade4 package implements the "deflation of X using super scores" variant, which is not as good for interpretation purposes).
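    For what it's worth, the "deflation of Y using super scores" step itself is small; here is a hypothetical base-R sketch (tT stands for the super score of the current component, and the data are simulated):

```r
# Hypothetical sketch of the "deflation of Y using super scores" step
# (the third variant above): after extracting the super score tT of a
# component, only Y is deflated; the X blocks are left intact.
set.seed(2)
n  <- 10
Y  <- matrix(rnorm(n * 2), n, 2)
tT <- rnorm(n)                      # super score from the current component

deflate_Y <- function(Y, tT) {
  b <- crossprod(Y, tT) / sum(tT^2) # regress each Y column on tT
  Y - tcrossprod(tT, b)             # remove the part of Y explained by tT
}

Y_new <- deflate_Y(Y, tT)
crossprod(Y_new, tT)                # ~0: deflated Y is orthogonal to tT
```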

    Thanks in advance to the mixOmics team if it does it !

    Regards, Arnaud

  5. Kim-Anh Le Cao repo owner

    Hi Arnaud,

    As discussed yesterday we have noted your request and will follow up in a few months. Thank you for your suggestions and input.

    To clarify this issue: our block.plsda function is NOT the same implementation as the Multi-Block PLS method. We start from the Generalised CCA algorithm, but with asymmetric deflations and further improvements of GCCA to perform integration of multiple data sets and feature selection.
