Suggestions for Mfold, relevance network, explained variance, and general help.

Issue #97 resolved
Former user created an issue

Hello, I have a few small suggestions for the nice mixOmics package:

First, in the perf function, I would suggest including nrepeat in the examples of Mfold cross-validation, to remind users that it is important when using Mfold, especially if the sample number is low. Although it is stipulated elsewhere, it does not appear in the examples.
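For instance, such an example could look like the sketch below (the dataset and parameter values are purely illustrative, not a recommendation):

```r
# Illustrative sketch: M-fold CV repeated nrepeat times to stabilise the
# performance estimates when the sample size is small.
library(mixOmics)
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic
res.pls <- pls(X, Y, ncomp = 3)
# nrepeat repeats the random 5-fold partition 10 times
perf.pls <- perf(res.pls, validation = "Mfold", folds = 5, nrepeat = 10)
```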

In the relevance network function, network, I noticed that the text size can be chosen using the parameter cex.node.name = 0.8, which automatically influences the size of the text box (circle or rectangle). The box feels pretty big; an option to modify the size of the text box would be really nice. Also, the possibility of setting lwd.edge depending on the intensity of the similarity (higher correlations resulting in thicker connections) could be nice to visualize the network. I will try Cytoscape to see what is possible.
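For reference, a minimal sketch of the current behaviour (dataset and cutoff are illustrative): the label size is set via cex.node.name, and the box is drawn around the label, so shrinking the label currently also shrinks the box.

```r
# Illustrative call: cex.node.name controls the node label size, which in
# turn drives the size of the surrounding circle/rectangle.
library(mixOmics)
data(nutrimouse)
res <- pls(nutrimouse$gene, nutrimouse$lipid, ncomp = 2)
network(res, comp = 1:2, cutoff = 0.55, cex.node.name = 0.8)
```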

Regarding the network function, I would like more information about the calculation of the similarity performed in the function, without having to refer to the original papers. Since we input the PLS or sPLS model, I thought the similarity would somehow be determined from the model, but in truth the similarity value is not calculated from the PLS model, if I understood correctly? The variables are just selected from the model and the threshold, right?

Regarding the explained variance of PLS and sPLS models, which I check using PLSMODEL$explained_variance: the help says the explained variance may not decrease as in PCA. Indeed, I noticed that the total is not 100%, and that some components gain explained variance, which is very counter-intuitive. A bit more explanation would be useful. Why this behaviour? How should the explained variance be reported/used then?
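To make the question concrete, this is the kind of check I mean (slot name as used above; dataset illustrative):

```r
# Inspect per-component explained variance of a PLS model; the values need
# not decrease across components, and they may not sum to 100%.
library(mixOmics)
data(liver.toxicity)
res <- pls(liver.toxicity$gene, liver.toxicity$clinic, ncomp = 3)
ev <- res$explained_variance   # list with one vector per block (X and Y)
ev$X; ev$Y
sum(ev$X); sum(ev$Y)
```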

More information about the differences between the "classic" and "regression" modes of PLS and sPLS, without having to check the references, could be useful to users and guide their choice. As an example, I tried to compare the modes of PLS using the explained variance I obtained. The explained variance of the first component was the same for the canonical and regression modes for both the X and Y blocks, but much higher for Y in the classic mode. Should I therefore expect the classic mode to work better for my dataset, since the goal is to predict Y from a small number of components? I thank the authors for providing many references; however, it is a bit confusing to know where to look for the differences between the models, and unfortunately paywalls may block access to some journals.
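The comparison I describe above could be scripted roughly as follows (a sketch; the dataset is illustrative):

```r
# Fit the same PLS model under each mode and compare the per-component
# explained variance of the X and Y blocks.
library(mixOmics)
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic
for (m in c("regression", "canonical", "classic")) {
  res <- pls(X, Y, ncomp = 2, mode = m)
  cat("mode =", m, "\n")
  print(res$explained_variance)
}
```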

Thank you very much for your time, Best, Arno Germond

Comments (2)

  1. Kim-Anh Le Cao repo owner

    Hi Arno,

    We are currently adding a few details to the help files for:

    • perf: repeats of the CV folds. Repeated cross-validation means that the whole CV process is repeated a number of times (\code{nrepeat}) to reduce variability across the different subset partitions. In the case of leave-one-out CV (\code{validation = 'loo'}), each sample is left out once (\code{folds = N} is set internally), and therefore \code{nrepeat} is 1 by default.

    • PLS/sPLS: the explained variance and the modes. In your case the regression mode is probably best (we are considering removing the classic mode, as it seems redundant with the regression mode). For the explained variance, I specified: "explained variance: amount of variance explained per component (note that contrary to PCA, this amount may not decrease, as the aim of the method is not to maximise the variance but the covariance between data sets)".

    • networks description: Display a relevance association network for (regularized) canonical correlation analysis and (sparse) PLS regression. The function avoids the intensive computation of Pearson correlation matrices on large data sets by instead calculating a pair-wise similarity matrix obtained directly from the latent components of our integrative approaches (CCA, PLS, block.pls methods). The similarity value between a pair of variables is obtained by summing the correlations between the original variables and each of the latent components of the model. The values in the similarity matrix can be seen as a robust approximation of the Pearson correlation (see González et al. 2012 for a mathematical demonstration and the exact formula). The advantage of relevance networks is their ability to simultaneously represent positive and negative correlations, which are missed by methods based on Euclidean distances or mutual information. These networks are bipartite, and thus only links between two variables of different types can be represented. The network can be saved in .gml format using the \code{igraph} package and the function \code{write.graph}, after extracting the output \code{object$gR}; see details. We recommend that users use Cytoscape to fine-tune their plots.
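    A rough sketch of the idea behind that similarity computation (the exact formula, including details specific to each method, is in González et al. 2012; this is only an approximation of the principle):

```r
# Correlate each original variable with the latent components of the fitted
# model, then sum the cross-products over components to obtain a pairwise
# X-Y similarity matrix (a Pearson-like measure).
library(mixOmics)
data(nutrimouse)
X <- nutrimouse$gene
Y <- nutrimouse$lipid
res <- pls(X, Y, ncomp = 2)
cor.X <- cor(X, res$variates$X)   # p x ncomp correlations
cor.Y <- cor(Y, res$variates$X)   # q x ncomp correlations
sim   <- cor.X %*% t(cor.Y)       # p x q similarity matrix
```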

    While we try to be as thorough as possible in the help files, we cannot give all possible details of the different methods. We thank you for your suggestions.
