k number in geostas

Issue #214 resolved
leleonp created an issue

What would be a good criterion for choosing the number of clusters k in the geostas function? In other words, how can I choose k? Thanks

Luis

Comments (4)

  1. Lars Skjærven

    Hi Luis, I don't think there is a trivial answer to this question, so I would recommend consulting other sources on how to choose the number of clusters. See e.g. http://www.statmethods.net/advstats/cluster.html, and also the original geostas paper for more information.

    On the more technical side, you can follow the recipe from the above link and plot the within-groups sum of squares to get a feeling for which k could be a good choice:

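    library(bio3d)

    # Read a multi-model PDB file and compute the AMSM
    pdb <- read.pdb("1d1d", multi=TRUE)
    gs  <- geostas(pdb)

    # Within-groups sum of squares for k = 2..10
    # (wss[1] stays NA, so the plotted curve starts at k = 2)
    wss <- rep(NA, 10)
    for (i in 2:10) {
      km <- kmeans(1-gs$amsm, centers=i)
      wss[i] <- sum(km$withinss)
    }

    # Look for the "elbow" where adding clusters stops paying off
    plot(1:10, wss, type="b", xlab="Number of Clusters",
         ylab="Within groups sum of squares")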
    

    Note that the bio3d implementation of geostas uses k-means clustering by default, but you can switch to hierarchical clustering with the argument clustalg="hclust". Also note that, as shown in the code above, you can cluster the AMSM yourself with kmeans(1-gs$amsm, centers=k), or with hclust(as.dist(1-gs$amsm)), since hclust expects a dist object rather than a plain matrix.
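
    As a rough sketch of clustering a similarity matrix yourself, the snippet below runs both routes on a toy matrix using base R only. The toy matrix `amsm`, its three-block layout, and the choices `k <- 3` and `method = "average"` are assumptions made for illustration; with real data you would use gs$amsm instead.

    ```r
    # Toy stand-in for gs$amsm (an assumption for illustration): a symmetric
    # similarity matrix with three blocks of mutually similar atoms.
    set.seed(1)
    block <- rep(1:3, each = 10)                 # block membership of each atom
    amsm  <- outer(block, block, function(a, b) ifelse(a == b, 0.9, 0.2))
    noise <- matrix(runif(length(amsm), -0.05, 0.05), nrow(amsm))
    amsm  <- amsm + (noise + t(noise)) / 2       # add noise, keep it symmetric
    diag(amsm) <- 1

    k <- 3                                       # hypothetical choice of k

    # K-means route: rows of 1 - amsm are treated as feature vectors.
    km <- kmeans(1 - amsm, centers = k, nstart = 50)

    # Hierarchical route: hclust() needs a "dist" object, so convert first,
    # then cut the tree into k groups.
    hc      <- hclust(as.dist(1 - amsm), method = "average")
    grps_hc <- cutree(hc, k = k)

    # Cross-tabulate the two partitions; if the methods agree, each row
    # has a single non-zero cell.
    print(table(kmeans = km$cluster, hclust = grps_hc))
    ```

    If the two partitions disagree strongly for a given k, that can be a hint that the grouping at that k is not very stable.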

    Hope this helps. Lars
