Error in reading netcdf file + Memory issues with fit.xyz()
Hello, I have two questions: 1) I am trying to read a large trajectory file (50000 frames) generated by Amber12. The original format of the file is mdcrd, but since there is no function in bio3d to read that format, I converted it to netcdf format using ptraj. But whenever I try to read this file with the read.ncdf function, I get the following:
traj <- read.ncdf("trj.netcdf", time=TRUE, verbose=TRUE)
[1] "Reading file trj.netcdf"
[1] "Produced by program: cpptraj"
[1] "File conventions AMBER version 1.0"
[1] "Frames: 50001"
[1] "Atoms: 4542"
Error: cannot allocate vector of size 5.1 Gb
Unfortunately, there is no multicore option in read.ncdf() (or is there?), so I cannot seem to read big netcdf files. I have tried restarting R and even used the at.sel option to select only backbone atoms. Is there any workaround for this?
2) Due to the first issue, and as discussed in the documentation, I converted my trajectory file to dcd format using VMD, which works fine. But now when I want to use fit.xyz() to fit my trajectory to a PDB structure (backbone atoms only), it takes forever and the system hangs, even with the maximum number of cores (ncore=8). Ultimately I get the error:
Error in fork() : Unable to fork R
Are there optimal values of ncore and nseg.scale that would make this work? Or would I be better off running it on a cluster with many more processors?
System/R Session info:
> sessionInfo(package="bio3d")
R version 3.0.2 (2013-09-25)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8
[7] LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
character(0)
other attached packages:
[1] bio3d_2.0-1
loaded via a namespace (and not attached):
[1] base_3.0.2 BH_1.51.0-4 bigmemory.sri_0.1.2
[4] datasets_3.0.2 graphics_3.0.2 grDevices_3.0.2
[7] methods_3.0.2 ncdf_1.6.6 parallel_3.0.2
[10] stats_3.0.2 tools_3.0.2 utils_3.0.2
[13] XML_3.98-1.1
I have already spent over 2 days trying to fix these issues, so any help would be much appreciated!
Thank you
Best regards
Rohit
Comments (7)
-
-
Hi Rohit,
To your first question:
read.ncdf doesn't support multicore (neither ncore nor nseg.scale is available). As Lars suggested, you may need to reduce the file size before reading it into R, or use the at.sel option. It probably didn't work with at.sel on your first attempt because the function basically reads the whole data set and only then applies the atom selection. So, one way is to split the single big file into several smaller files and then read them all at once with the command:
trj <- read.ncdf(file_names, at.sel=sele)
The second question:
Using ncore > 1 usually requires more RAM, which possibly caused the system hang in your case. Have you tried ncore=1? Alternatively, you can set nseg.scale > 1 to split the data beforehand, e.g. ncore=8 and nseg.scale=20.
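As a minimal sketch (the file names "top.pdb" and "traj.dcd" are placeholders for your own files):

```r
library(bio3d)

pdb  <- read.pdb("top.pdb")    # placeholder: reference structure
trj  <- read.dcd("traj.dcd")   # placeholder: trajectory converted with VMD
sele <- atom.select(pdb, "backbone")

# Try single-core first to rule out fork/memory problems:
xyz <- fit.xyz(fixed = pdb$xyz, mobile = trj,
               fixed.inds  = sele$xyz,
               mobile.inds = sele$xyz,
               ncore = 1)

# If that works but is slow, raise ncore and split the data into
# smaller segments to keep per-core memory usage down:
xyz <- fit.xyz(fixed = pdb$xyz, mobile = trj,
               fixed.inds  = sele$xyz,
               mobile.inds = sele$xyz,
               ncore = 8, nseg.scale = 20)
```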
Btw, I don't think it would help to run R on a cluster, because the multicore support in Bio3D does not parallelize across CPU cores on different nodes.
Let me know if you still have problems. Good luck!
-
reporter Dear Lars and Xin-Qiu,
Thank you both for your very useful comments.
Lars:
Your first suggestion:
pdb <- read.pdb("11_build_TIP4P/prot_sol.amb.pdb")
sele <- atom.select(pdb, "calpha")
trj <- read.ncdf("14_production/01_classA/hei.nc", at.sel=sele)
dim(trj)
[1] 15000   387
This still doesn't work :/ Same error. What are your system specs? That might be the difference.
Your second suggestion:
trj <- read.ncdf("14_production/01_classA/hei.nc", first=1, last=10)
dim(trj)
[1]    10 92232
This seems like an interesting proposal, but could you please tell me how to combine all the frames into one single trajectory matrix after they have been read in as multiple parts? R is not really known for its loops, so that might not be the best idea. We tried something similar by making lists of trajectory matrices and working with the chunks in the list for various functions, which also output lists. But from the viewpoint of structural biology that might not be correct, especially when calculating the average structure etc., and the resulting RMSD plots also seemed a little off. Perhaps I can take another look at this. Do let me know what you think about this?
Xin-Qiu
This solution does seem rather elegant, if only you could please tell me how to pass multiple files to the command you provided (the syntax). Say I have 5 files: traj1.netcdf, traj2.netcdf, ..., traj5.netcdf.
My apologies if my questions sound naive, but I am not exactly a pro in R. It's only Bio3D that has compelled me to work with R, and quite frankly it has been a good learning experience so far!
Thanks again to both of you for your help.
Regards Rohit
-
Hi Rohit,
The command is very simple. Suppose the PDB file used for your MD simulation (with exactly the same number of atoms as in the trajectory file) is "top.pdb":
pdb <- read.pdb("top.pdb")
sele <- atom.select(pdb, "calpha")     # choose CA atoms only
# sele <- atom.select(pdb, "backbone") # or choose backbone atoms
files <- paste("traj", 1:5, ".netcdf", sep="")
trj <- read.ncdf(files, at.sel=sele)
Note: check the location of your files to make sure they are in the same directory where you start R. Otherwise, you need to add the path to both the PDB and the trajectory files, e.g. files <- file.path(your_path, files)
-
You could do something wicked like:
library(bio3d)

# read in PDB (same atom numbering as the netcdf file)
pdb <- read.pdb("11_build_TIP4P/prot_sol.amb.pdb")
sele <- atom.select(pdb, "calpha")

# allocate vector and matrix for RMSD values and fitted CA coordinates
all.rmsd <- rep(NA, 1500)
ca.trj <- matrix(NA, nrow=1500, ncol=length(sele$xyz))

step <- 10
for ( i in 1:150 ) {
  first <- ((i-1)*step)+1
  last <- first+step-1
  trj <- read.ncdf("14_production/01_classA/hei.nc", first=first, last=last)
  all.rmsd[first:last] <- rmsd(pdb$xyz[sele$xyz], trj[,sele$xyz], fit=TRUE)
  ca.trj[first:last, ] <- fit.xyz(pdb$xyz[sele$xyz], trj[,sele$xyz])
}

rd <- rmsd(pdb$xyz[sele$xyz], ca.trj)

# sanity check against reading the whole selection in one go
trj2 <- read.ncdf("14_production/01_classA/hei.nc", at.sel=sele, first=1, last=1500)
rd2 <- rmsd(pdb$xyz[sele$xyz], trj2, fit=TRUE)
identical(rd, rd2)
[1] TRUE
With a low "step" you'll avoid some of the heavy memory usage.
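The same chunked read can also be written without manual index bookkeeping, by collecting the chunks in a list and binding the rows at the end (a sketch under the same assumptions about file names and frame counts as the loop above):

```r
library(bio3d)

pdb  <- read.pdb("11_build_TIP4P/prot_sol.amb.pdb")
sele <- atom.select(pdb, "calpha")

step   <- 10
starts <- seq(1, 1500, by = step)

# Read each chunk of frames, keep only the CA coordinates,
# then stack all chunks row-wise into one trajectory matrix.
chunks <- lapply(starts, function(first) {
  trj <- read.ncdf("14_production/01_classA/hei.nc",
                   first = first, last = first + step - 1)
  trj[, sele$xyz, drop = FALSE]
})
ca.trj <- do.call(rbind, chunks)

dim(ca.trj)   # 1500 rows, one per frame
```

This trades a little extra memory churn (the list of chunks) for simpler code; the pre-allocated loop above is the leaner option for very long trajectories.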
-
- changed status to resolved
-
reporter Dear Lars and Xin-Qiu, My sincere apologies for responding with such a delay; I have been busy with other projects. The last 'wicked' solution provided by Lars works perfectly! Thank you very much for your help.
-
Hi Rohit, These big trajectory files can certainly be a problem since R will attempt to read the entire trajectory into memory. I would therefore consider reducing the size of the input trajectory (e.g. you could divide it into several parts, or do [...] in ptraj?).
Alternatively, as you suggested, use the at.sel argument. see examples below:
Does setting nseg.scale > 1 not help?
See also arguments first and last:
Hope it helps. L