Matrix of genotypes is singular

Issue #105 resolved
Nikolay Oskolkov created an issue

Dear developers of mixOmics,

I am a big fan of your package and use it a lot in my research. Recently I have been trying to process a matrix of genotypes (coded as 0, 1, 2) through the plsda function and got this error:

"Error in checkForRemoteErrors(val) : one node produced an error: system is computationally singular: reciprocal condition number = 1.692e-16"

I did a harsh LD pruning of the matrix and removed highly correlated genotypes, but it did not help. I do not know what else I should do to fix the problem and would really appreciate your help. Thanks!

Best wishes, Nikolay


Nikolay Oskolkov, PhD Bioinformatician, SciLifeLab Bioinformatics Long-term Support (WABI) www.scilifelab.se/facilities/wabi/

Biology Department, Lund University Sölvegatan 35 , 22362 Lund

Phone: 0761463349 E-mail: nikolay.oskolkov@scilifelab.se


Comments (6)

  1. Kim-Anh Le Cao repo owner

    Hello Nikolay,

    thank you for your feedback. Genotype data are still difficult for us to deal with. Could you try filtering your data first using nearZeroVar()? I assume X = genotypes, what is your Y?

    Regards, Kim-Anh -- Please update my new email address: kimanh.lecao@unimelb.edu.au Dr. Kim-Anh Lê Cao, Senior Lecturer, Statistical Genomics, NHMRC Career Development Fellow

    School of Mathematics and Statistics Centre for Systems Genomics Bld 184 The University of Melbourne | VIC 3010 T: +61 (0)3834 43971

    mixOmics: http://mixomics.org/

  2. Nikolay Oskolkov reporter

    Thank you very much Kim-Anh for your fast reply! Y is a categorical (factor) variable: "sick" vs. "healthy" individual. I tried to filter X with nearZeroVar() and it did not solve the problem. Here are the command lines I used:

    my_nearZeroVar <- nearZeroVar(X, uniqueCut = 40)
    X[, rownames(my_nearZeroVar$Metrics)] <- NULL

    my_folds <- 5
    my_nrepeat <- 10
    my_progressBar <- TRUE
    my_cpus <- 4

    my_plsda <- plsda(X, Y, ncomp = 20)
    my_perf.plsda <- perf(my_plsda, validation = 'Mfold', folds = my_folds,
                          progressBar = my_progressBar, nrepeat = my_nrepeat,
                          auc = FALSE, cpus = my_cpus)

    On the 18th component it throws the error:

    "Error in checkForRemoteErrors(val) : one node produced an error: system is computationally singular: reciprocal condition number = 8.75659e-17"
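The `<- NULL` column assignment in the commands above works when X is a data frame. If X is stored as a matrix, one alternative is to drop the flagged columns by index. The sketch below uses toy data and relies on the `Position` element that mixOmics documents for the value returned by `nearZeroVar()`. The matrix, the SNP names and the `X_filtered` variable are illustrative assumptions, and the filter is skipped when mixOmics is not installed.

```r
# Toy genotype matrix (hypothetical data): 20 samples, 6 SNPs coded 0/1/2.
set.seed(42)
X <- matrix(sample(0:2, 20 * 6, replace = TRUE), nrow = 20,
            dimnames = list(NULL, paste0("snp", 1:6)))
X[, "snp3"] <- 0                      # a monomorphic SNP, i.e. zero variance

X_filtered <- X
if (requireNamespace("mixOmics", quietly = TRUE)) {
  nzv <- mixOmics::nearZeroVar(X)     # default freqCut / uniqueCut
  # nzv$Position holds the column indices of the flagged predictors.
  if (length(nzv$Position) > 0) {
    X_filtered <- X[, -nzv$Position, drop = FALSE]
  }
}
dim(X_filtered)
```

Indexing with `drop = FALSE` keeps the result a matrix even if only one column survives the filter.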

  3. Kim-Anh Le Cao repo owner

    Dear Nikolay,

    What happens is that the data become too sparse after the deflations (i.e. the residual matrices are deflated at each component step). Usually with our models we retain only a handful of components, not more. In the genotype data world I know this is not the norm, but for a supervised analysis with PLSDA you will have to cut the number of components to fewer than 18. You can then visualise the error rate (plot function on your perf object) to see whether it decreases or reaches a plateau.

    Regards, Kim-Anh
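The effect of repeated deflation can be illustrated in base R. This is a minimal sketch, assuming a simplified NIPALS-style PLS deflation with a single 0/1 response; it is not the mixOmics implementation, and the matrix sizes and variable names are made up for illustration. It shows that each deflation step removes one dimension from the centred data, so the number of extractable components is bounded by the rank of the (training) matrix.

```r
# Toy data (hypothetical sizes): 12 samples, 50 SNPs coded 0/1/2.
set.seed(1)
n <- 12
p <- 50
X <- matrix(sample(0:2, n * p, replace = TRUE), n, p)
y <- rep(c(0, 1), each = n / 2)       # "healthy" / "sick" coded 0/1

Xd <- scale(X, center = TRUE, scale = FALSE)   # centred copy that gets deflated
yc <- y - mean(y)
ranks <- qr(Xd)$rank                  # numerical rank before any deflation

for (h in 1:5) {
  w <- crossprod(Xd, yc)              # weight vector, proportional to X'y
  w <- w / sqrt(sum(w^2))
  t_h <- Xd %*% w                     # component (score vector)
  p_h <- crossprod(Xd, t_h) / sum(t_h^2)   # loading vector
  Xd <- Xd - t_h %*% t(p_h)           # deflation: remove what t_h explains
  ranks <- c(ranks, qr(Xd)$rank)      # numerical rank after this step
}
print(ranks)
```

With cross-validation, each training fold has fewer samples than the full data set, so the residual matrix can run out of rank at a smaller number of components than in the full fit.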

  4. Nikolay Oskolkov reporter

    Thanks a lot Kim-Anh for the clarification! I understand better now what is going on regarding the deflated matrix of genotypes. Yes, I will cut the number of components. Thanks a lot again for your fantastic package!

    Best, Nikolay

  5. Nikolay Oskolkov reporter

    Sorry, Kim-Anh, just one last question. After I specified ncomp=17 in the "plsda" function and then ran "perf" with the command lines I posted previously, I got a very strange-looking BER plot; please see the attached BER_genotypes.png.

    How should I interpret this? Why is the BER constant? By the way, the scree plot looked like this: scree_plot_genotypes.png. I would really appreciate your opinion, thanks!

    Best, Nikolay
