print.pca and explained variance

Issue #56 resolved
Florian Rohart created an issue

The explained variance in print.pca is calculated relative to the first ncomp components instead of all the components. It should be based on the total variance of the data, not just the variance of the first few PCA components.
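To make the two denominators concrete, here is a minimal sketch using base R's prcomp rather than the package's own pca function. The toy data, the value of ncomp, and the variable names are illustrative assumptions, not code from the package.

```r
# Toy data (illustrative only): 50 samples, 6 variables
set.seed(1)
X <- matrix(rnorm(50 * 6), nrow = 50, ncol = 6)
X[, 2] <- X[, 1] + 0.1 * X[, 2]   # make two variables strongly correlated

pc  <- prcomp(X, center = TRUE, scale. = FALSE)
eig <- pc$sdev^2                  # variance carried by each component
ncomp <- 2                        # hypothetical number of retained components

# Denominator restricted to the retained components (the reported behaviour)
prop_retained <- eig[1:ncomp] / sum(eig[1:ncomp])

# Denominator equal to the total variance of the data (the requested behaviour)
prop_total <- eig[1:ncomp] / sum(eig)

print(round(rbind(prop_retained, prop_total), 3))
```

The first row always sums to 1 whatever ncomp is, while the second row sums to the share of the total variance that the retained components actually capture.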

Comments (5)

  1. Florian Rohart reporter

    Fixed in v6. Also fixed in plotIndiv, where PCA with NIPALS was showing explained variance.

    Explained variance is now calculated for PCA and NIPALS via a new output, x$var.tot, which is now used in plotIndiv and plot.pca.

  2. Sven Krackow

    To me, the problem with the explained variance is not that it refers to the retained components only. That is actually what all the other PCA functions I have used do, including the summary function. To me, the bug is that the SD is used to calculate the percentage instead of the variance (eigenvalues).

  3. Florian Rohart reporter

    Dear Sven,

    Those are two different problems in my opinion:

    1- the explained variance was wrong because it was a proportion of the variance explained by the first ncomp PCs instead of the variance of all the data. This has been fixed for the next release. It does not make much sense to say that PC1 explains 75% of the data when it is actually 75% of the first 2 PCs only.

    2- the explained variance was calculated from the object$sdev instead of object$sdev^2. This has been corrected as well, and is what you report here if I understood your point.

    For future reference, the following link shows the relationship between the SVD of X (with singular values s_i) and the variance-covariance matrix of X (with eigenvalues lambda_i = s_i^2 / (n-1)).
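The relation lambda_i = s_i^2 / (n-1) and the sdev versus sdev^2 point can both be checked numerically. The following is a small sketch in base R, assuming X is column-centred; the simulated data and variable names are illustrative only.

```r
# Simulated, column-centred data (illustrative only)
set.seed(42)
n <- 30
X <- scale(matrix(rnorm(n * 4), nrow = n), center = TRUE, scale = FALSE)

s      <- svd(X)$d                                  # singular values of X
lambda <- eigen(cov(X), symmetric = TRUE)$values    # eigenvalues of cov(X)

# lambda_i = s_i^2 / (n - 1)
print(all.equal(lambda, s^2 / (n - 1)))

# prcomp stores the square roots of those eigenvalues in $sdev
sdev <- prcomp(X)$sdev
print(all.equal(sdev^2, lambda))

# Percentages computed from SDs versus from variances
pct_sd  <- sdev / sum(sdev)       # the incorrect calculation
pct_var <- sdev^2 / sum(sdev^2)   # proportion of variance
print(round(rbind(pct_sd, pct_var), 3))
```

Using the SDs understates the share of the first component, since squaring gives more weight to the largest values.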

  4. Sven Krackow

    Dear Florian,

    ad 2: Yes, thank you!

    ad 1: Intuitively, I also took this stance at first. However, there may be a reason why other functions restrict the sum of variances to the number of components kept. If you take component reduction to mean that some variables are of no importance (nuisance variables, so to speak), then it makes no sense to consider their spurious contribution. Furthermore, if one wants to know the % relative to the complete model, one can easily set the number of components accordingly and get it. However, if the sum of variances of all components is now used for any ncomp, one has to calculate the % relative to the significant contributors "by hand" and then has trouble getting those figures into the graphs.

    However, I am not a mathematician, so I do not know whether some strict argument or rationale speaks for one method or the other. I am only arguing as a practitioner and R-scriptor (I have no means to follow the calculus you are referring to...). If both approaches are mathematically valid and differ only in inference, I would recommend not implementing that change. If the other approach is not mathematically valid, one might think of contacting the authors of other packages!?

    Thanks again for your concerns!
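For readers who want the percentages relative to the retained components once the package reports them relative to the total variance, the "by hand" step can be sketched as below. This uses base R's prcomp with illustrative data; ncomp, the variable names, and the axis-label format are assumptions, not part of the package.

```r
# Illustrative data: 40 samples, 5 variables
set.seed(7)
X   <- matrix(rnorm(40 * 5), nrow = 40)
eig <- prcomp(X)$sdev^2
ncomp <- 3                                   # hypothetical number of retained components

# Percentages relative to the total variance
pct_total <- 100 * eig[1:ncomp] / sum(eig)

# Rescale so that the retained components sum to 100%
pct_retained <- 100 * pct_total / sum(pct_total)

# Build axis labels that can be passed to a plotting function
axis_labels <- sprintf("PC%d: %.1f%% of retained variance", 1:ncomp, pct_retained)
print(axis_labels)
```

Because the rescaling is a single division, either convention can be recovered from the other as long as the total variance is available.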
