Calculating the optimal number for k clustering
Hi There,
I would like to estimate the "Sum of Squares/Total Sum of Squares" in order to find the optimal value for the K cluster. Is there a way to do it in Bio3d?
Thanks, Subha
Comments (10)
-
-
Hi Yao,
I am looking to choose an optimal value of k for k-means clustering and found some papers that calculate SSR/SST for different numbers of clusters and choose the value of k based on an "elbow criterion". Is there a way to do that in Bio3D?
Please see Fig. S2 of the article http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0145758
Thanks, Subha
-
Hi Subha,
To my knowledge there is no function in Bio3D that directly calculates the ratio, but it is not difficult to write a script to do it. See below for an example; you are encouraged to improve it yourself:
guess_k <- function(x, k.max=20) {
   ratio <- rep(NA, k.max)
   for(i in 2:k.max) {
      grps <- kmeans(x, i)
      ratio[i] <- grps$betweenss/grps$totss
   }
   return(ratio)
}

# Suppose using the PC1-PC2 coordinates to do the clustering
y <- guess_k(pc$z[, 1:2])
plot(y, type='l')
-
Hi Yao,
Thanks a lot for that info and sample code.
Regards, Subha
-
- changed status to resolved
-
Hi Yao,
Sorry to bother you with a question about this again. Why does the plot (and/or the ratio values) change every time I run the function?
Thanks, Subha
-
- changed status to open
-
Because the k-means algorithm is intrinsically random: kmeans() starts from randomly chosen initial centers, so the clustering (and hence the ratio) can differ between runs. For reproducibility, hclust() may be a better choice. It is not difficult to write a script that calculates the ratio with hclust(), as long as you understand the meaning of betweenss and totss (you can find the explanation in ?kmeans). You are encouraged to write it yourself, but feel free to let me know if you have any trouble with it.
-
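For readers following the hclust() suggestion above, here is a minimal sketch of how the ratio could be computed for a hierarchical clustering, together with a seeded k-means variant for comparison. Nothing here is part of Bio3D: the function names ss_ratio, guess_k_hclust and guess_k_seeded are hypothetical, the "ward.D2" linkage and the seed/nstart defaults are assumptions, and the simulated matrix x only stands in for real coordinates such as pc$z[, 1:2].

```r
# Between-cluster / total sum of squares for an arbitrary partition.
# x: numeric matrix (rows = observations); grps: cluster label per row.
# Follows the betweenss and totss definitions given in ?kmeans.
ss_ratio <- function(x, grps) {
  x <- as.matrix(x)
  # Total sum of squares around the global centroid
  totss <- sum(scale(x, center = TRUE, scale = FALSE)^2)
  # Within-cluster sum of squares, summed over all clusters
  withinss <- sum(sapply(unique(grps), function(g) {
    m <- x[grps == g, , drop = FALSE]
    sum(scale(m, center = TRUE, scale = FALSE)^2)
  }))
  (totss - withinss) / totss
}

# Deterministic variant: build one tree, then cut it at k = 2..k.max.
# The linkage method is an assumption; other hclust() methods also work.
guess_k_hclust <- function(x, k.max = 20, method = "ward.D2") {
  hc <- hclust(dist(x), method = method)
  ratio <- rep(NA, k.max)
  for (i in 2:k.max) {
    ratio[i] <- ss_ratio(x, cutree(hc, k = i))
  }
  ratio
}

# k-means variant made repeatable by fixing the random seed; nstart > 1
# keeps the best of several random starts for each k.
guess_k_seeded <- function(x, k.max = 20, seed = 1, nstart = 25) {
  set.seed(seed)
  ratio <- rep(NA, k.max)
  for (i in 2:k.max) {
    grps <- kmeans(x, centers = i, nstart = nstart)
    ratio[i] <- grps$betweenss / grps$totss
  }
  ratio
}

# Simulated stand-in for pc$z[, 1:2]: three 2-D blobs of 50 points each
set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))

r_hc <- guess_k_hclust(x, k.max = 8)
r_km <- guess_k_seeded(x, k.max = 8)
```

Either vector can be plotted as before, e.g. plot(r_hc, type='l'). The hclust() curve is identical on every run because no random numbers are involved; the seeded k-means curve is repeatable only as long as the same seed and nstart are used.
-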
Hi Yao,
Thanks for that info. I am not quite familiar with these methods. Will go through the methods and see if I can do one for hclust.
Best Regards, Subha
-
- changed status to resolved
Can you give a bit more detail about what you are trying to do? Which "Sum of Squares" do you want to calculate? What is "the optimal value for the K cluster"? Do you mean the number of clusters?