Calculating the optimal number for k clustering

Issue #378 resolved
Former user created an issue

Hi There,

I would like to estimate the "Sum of Squares/Total Sum of Squares" in order to find the optimal value for the K cluster. Is there a way to do it in Bio3d?

Thanks, Subha

Comments (10)

  1. Xinqiu Yao

    Can you give a few more details about what you are trying to do? Which "Sum of Squares" do you want to calculate? What is "the optimal value for the K cluster"? Do you mean the number of clusters?

  2. Xinqiu Yao

    Hi Subha,

    To my knowledge there is no function in Bio3D that directly calculates the ratio, but it is not difficult to write a script to do it. See below for an example; you are encouraged to improve it yourself:

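    # Compute betweenss/totss for k = 2, ..., k.max clusters
    guess_k <- function(x, k.max=20) {
      ratio <- rep(NA, k.max)   # ratio[1] stays NA (k = 1 is not evaluated)
      for(i in 2:k.max) {
         grps <- kmeans(x, i)   # k-means with i clusters
         # fraction of the total sum of squares explained by the clustering
         ratio[i] <- grps$betweenss/grps$totss
      }
      return(ratio)
    }
    
    # Suppose using the PC1-PC2 coordinates to do the clustering
    y <- guess_k(pc$z[, 1:2])
    # Plot the ratio against the number of clusters
    plot(y, type='l')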
    
  3. skal24

    Hi Yao,

    Sorry to bother you with a question about this again. Why do the plot and the ratio values change every time I run the function?

    Thanks, Subha

  4. Xinqiu Yao

    Because the k-means algorithm is intrinsically random: kmeans() starts from randomly chosen initial centers, so the result can differ between runs. For reproducibility, hclust() may be a better choice. It is not difficult to write a script that calculates the ratio with hclust(), as long as you understand the meaning of betweenss and totss (see ?kmeans for an explanation). You are encouraged to write it yourself, but feel free to let me know if you have any trouble getting it to work.
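
    As a rough illustration of the hclust() idea, here is a minimal sketch. The function name guess_k_hclust and the choice of Ward linkage ("ward.D2") on Euclidean distances are assumptions made for this example, not something provided by Bio3D. It computes totss as the sum of squared distances to the overall centroid, the within-cluster sum of squares for each cut of the tree, and takes betweenss as their difference. k.max must not exceed the number of rows in x.

    ```r
    # Hypothetical helper (not part of Bio3D): betweenss/totss for k = 2..k.max
    # using hierarchical clustering, which is deterministic for a given input.
    guess_k_hclust <- function(x, k.max = 20, method = "ward.D2") {
      x <- as.matrix(x)
      # Total sum of squares: squared distances of all points to the global centroid
      totss <- sum(scale(x, center = TRUE, scale = FALSE)^2)
      # Build the tree once; linkage method is an assumption, others may suit better
      hc <- hclust(dist(x), method = method)
      ratio <- rep(NA_real_, k.max)   # ratio[1] stays NA, as in guess_k()
      for (k in 2:k.max) {
        grps <- cutree(hc, k = k)     # cluster membership for k clusters
        # Within-cluster sum of squares, summed over all clusters
        withinss <- sum(sapply(split(seq_len(nrow(x)), grps), function(idx) {
          sum(scale(x[idx, , drop = FALSE], center = TRUE, scale = FALSE)^2)
        }))
        # betweenss = totss - withinss
        ratio[k] <- (totss - withinss) / totss
      }
      ratio
    }
    ```

    It can be used the same way as guess_k(), e.g. y <- guess_k_hclust(pc$z[, 1:2]) followed by plot(y, type='l'). If you prefer to stay with kmeans(), calling set.seed() with a fixed number before the loop also makes the results repeatable.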

  5. skal24

    Hi Yao,

    Thanks for that info. I am not quite familiar with these methods. Will go through the methods and see if I can do one for hclust.

    Best Regards, Subha
