Clustering PC-space using hclust

Issue #305 resolved
Karan Kapoor created an issue

Hello,

I am trying to cluster the individual conformers in the PC space generated with the pca.xyz command. Because the trajectory is large (~100,000 frames), I get a memory error when using hclust (Error: cannot allocate vector of size 37.6 Gb).

Is there a way to reduce the number of conformers that the clustering is carried out on? For example, by adding something like step=10 to hclust?

Thanks, Karan
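
For context, the size in the error message is roughly what a full pairwise distance matrix over ~100,000 frames would need. The sketch below estimates that size, assuming dist() stores n(n-1)/2 distances as 8-byte doubles; the helper name dist_mem_gib is made up for illustration and is not part of bio3d or base R.

```r
# Estimated memory (in GiB) for the dist object over n frames:
# n * (n - 1) / 2 pairwise distances, stored as 8-byte doubles.
dist_mem_gib <- function(n) {
  n <- as.numeric(n)  # avoid integer overflow for large n
  n * (n - 1) / 2 * 8 / 1024^3
}

dist_mem_gib(100000)  # full trajectory: about 37 GiB
dist_mem_gib(10000)   # every 10th frame: about 0.37 GiB
dist_mem_gib(1000)    # every 100th frame: a few MiB
```

Because the cost grows with the square of the number of frames, keeping every 10th frame cuts the memory needed by about a factor of 100.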

Comments (3)

  1. Lars Skjærven

    Hi Karan, the easiest approach is probably to reduce the size of your trajectory before calling pca.xyz(). Alternatively, you can filter structures out of the projection onto the PCs (on which I assume your clustering is based):

    # read every 100th frame
    trj <- read.ncdf("mytrj.nc", stride=100)
    
    # or, with the full trajectory loaded in trj, index a subset of frames
    trj2 <- trj[seq(1, nrow(trj), by = 100), ]
    
    # or use the trim function
    trj3 <- trim(trj, row.inds = seq(1, nrow(trj), by = 100))
    
    # alternatively, run PCA on the full trajectory
    pc <- pca.xyz(trj)
    
    # then keep every 100th projected frame on the first 10 PCs
    z <- pc$z[seq(1, nrow(pc$z), by=100), 1:10]
    
    # clustering
    hc <- hclust(dist(z))
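
    Once hc exists, cluster memberships can be extracted with cutree() from base R. The following is only a sketch: z is simulated here as a stand-in for the trimmed pc$z so that the block runs on its own, and k = 3 is an arbitrary choice, not a recommendation for this system.

    ```r
    set.seed(1)

    # Simulated stand-in for the trimmed projection pc$z[, 1:10]:
    # 1000 frames by 10 PCs. Replace with the real z in practice.
    z <- matrix(rnorm(1000 * 10), ncol = 10)

    # Hierarchical clustering on pairwise distances in PC space
    hc <- hclust(dist(z))

    # Cut the tree into k groups (k = 3 is an arbitrary assumption)
    grps <- cutree(hc, k = 3)

    # Number of frames assigned to each cluster
    table(grps)
    ```

    The grps vector can then be used to colour a PC1 versus PC2 scatter plot, for example with plot(z[, 1:2], col = grps).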
    
  2. Karan Kapoor reporter

    Attachments: PCA_all.png, PCA_reduced.png, cluster_all.png, cluster_reduced.png

    Lars,

    Thanks for the quick reply. I was able to do the analysis using the suggested changes.

    I am not very experienced with PCA and am not sure whether I am losing a lot of information by reducing the trajectory size. I have attached the two plots I get before and after reducing the trajectory (keeping every 10th frame). Both plots show similar groupings of conformations along PC1, but the jumps between these groups are less smooth in the second plot (white shows interconversion between different groups, correct?).

    But if I cluster this PC space (attached), the separation between the clusters is much clearer when I use the reduced trajectory.

    Do you have any suggestions on which plots I should use to interpret the data?

    Thanks, Karan

  3. Barry Grant

    This all depends on your purpose, Karan. It looks like both the full and reduced set analyses are giving you similar distributions and clusters in PC space, so you could start further analysis with the more tractable reduced set and then verify any conclusions by referring back to the full set.
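
    One possible way to refer back to the full set without building a full distance matrix is to assign every frame to the nearest cluster centroid found on the reduced set. This is a sketch of that idea only, not a method taken from this thread or from bio3d; the data are simulated, and the names z_full, z_red and grps_full, the stride of 10 and k = 3 are all assumptions.

    ```r
    set.seed(42)

    # Simulated stand-in for the full projection pc$z[, 1:10]
    z_full <- matrix(rnorm(5000 * 10), ncol = 10)

    # Reduced set: every 10th frame, clustered as before
    idx <- seq(1, nrow(z_full), by = 10)
    z_red <- z_full[idx, ]
    grps_red <- cutree(hclust(dist(z_red)), k = 3)

    # Centroid of each reduced-set cluster (k rows, one column per PC)
    centroids <- apply(z_red, 2, function(col) tapply(col, grps_red, mean))

    # Squared distance from every frame to every centroid (n rows, k columns)
    d2 <- sapply(seq_len(nrow(centroids)), function(j) {
      rowSums(sweep(z_full, 2, centroids[j, ])^2)
    })

    # Assign each frame of the full set to its nearest centroid
    grps_full <- max.col(-d2, ties.method = "first")

    # Fraction of reduced-set frames whose nearest centroid
    # agrees with their hclust cluster
    mean(grps_full[idx] == grps_red)

    # Cluster populations over the full set
    table(grps_full)
    ```

    Memory use here scales with the number of frames times the number of clusters, so it stays small even for the full trajectory. Nearest-centroid assignment will not always agree with the hierarchical clusters, so the agreement fraction is worth checking before drawing conclusions.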
