Large pdb files

Issue #127 resolved
Maria Kalimeri created an issue

Hi,

I am sorry if this has been mentioned before. I did look for it and didn't find anything relevant.

I have two issues to report.

The first is that the function read.pdb is unable to read PDB files that contain more than 99999 atoms. Other software, such as VMD, switches to hexadecimal numbering after the first 99999 atom positions have been filled, e.g. 186a0 for 100000. From what I saw in the source of read.pdb, it expects a numeric value in the eleno field, so it crashes with the error: scan() expected 'a real', got '186a0'

If you ever decide to cure that, note that the resno field would also need to be character instead of numeric.

At the moment I cannot estimate whether this modification would impact other functions that use read.pdb, or whether it was implemented this way on purpose, but since it is giving me a hard time right now I thought I should mention it.
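For what it's worth, VMD-style hexadecimal serials can be decoded in base R with strtoi(); a minimal sketch (the helper name decode.eleno is invented here and is not part of bio3d):

```r
# Decode atom serials that may be decimal or, past 99999, VMD-style hex.
# e.g. "186a0" (hex) == 100000 (decimal).
decode.eleno <- function(x) {
  n <- suppressWarnings(as.numeric(x))  # decimal entries parse directly
  hex <- is.na(n)                       # the rest failed: try hexadecimal
  n[hex] <- strtoi(x[hex], base = 16L)
  n
}

decode.eleno(c("99999", "186a0"))  # -> 99999 100000
```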

Another, perhaps more important, issue is that even if one changes the fields eleno and resno from numeric to character, the call

    atom <- read.table(text = sapply(raw.atom, split.fields),
                       stringsAsFactors = FALSE, sep = ",", quote = "",
                       colClasses = atom.format[!drop.ind, "what"],
                       col.names = atom.format[!drop.ind, "name"],
                       comment.char = "", na.strings = "")

takes practically forever to finish when the PDB has more than 250000 or 300000 atoms (my case is 700000 atoms). I think the culprit is the sapply(raw.atom, split.fields) call, but I don't know why it is so slow.
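The per-call overhead can be illustrated with a toy comparison (a sketch; the 50000 synthetic lines and the serial-number columns are invented for the example): applying an R-level extraction function to each line, as sapply(raw.atom, split.fields) effectively does, versus one vectorized substr() pass over the whole character vector.

```r
# Toy fixed-width lines resembling PDB ATOM records
lines <- sprintf("ATOM  %5d  CA  ALA A%4d", 1:50000, 1:50000)

# One R-level function call per line (the sapply-style pattern)
t1 <- system.time(
  f1 <- vapply(lines, function(x) substr(x, 7, 11), character(1),
               USE.NAMES = FALSE)
)

# One vectorized call over all lines at once
t2 <- system.time(
  f2 <- substr(lines, 7, 11)
)

identical(f1, f2)  # same result; the vectorized version avoids 50000 calls
```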

Best

Comments (12)

  1. Lars Skjærven

    Hi Maria, thanks for noting this and contributing interesting suggestions. We are aware of some of these limitations, in particular the speed-related issue. Barry can probably answer more in depth on this.

    The PDB files you are working with are self-made, right (i.e. they are not in the PDB)? In that case I would suggest stripping off the solvent in an external program prior to reading the structure into R (you've probably already thought of that, though).

    Another option, if you're using Amber, is to explore the 'feature_amber' branch of bio3d. This branch contains functionality to read and parse Amber parameter/topology and coordinate files, e.g. you can do:

    > prmtop <- read.prmtop("1SX4_sol.prmtop")
    > prmtop
    
     Call:
      read.prmtop(file = "1SX4_sol.prmtop")
    
     Class:
      amber, prmtop
    
     System information: 
          Total number of atoms:  715845
          Solute residues:  8043 (of 206422)
          Solute molecules:  49 (of 198428)
          Box dimensions:  109.47 x 215.25 x 215.25 x 215.25
    

    Reading this Amber prmtop file containing 715,845 atoms takes 2.8 minutes on my computer.

  2. Maria Kalimeri reporter

    Thanks Lars for the prompt reply.

    Yes, the file is a self-made pdb. In fact the whole of it is some sort of solvent for which I am trying to calculate orientational properties and such, so I need to include all the atoms. For the time being I wrote a quick-and-dirty equivalent of read.pdb (nowhere near as neat).

    As a side note: if I had a large file (more than 99999 atoms) including water, but the analyzable part of my pdb/dcd, say the protein, was within the first 99999 atoms, then, if I am not terribly mistaken, there is no particular need for stripping the waters off. read.pdb can handle it fine as long as the maxlines argument is set; it will, however, warn that not all of the PDB was read.

    Best Maria

  3. Barry Grant

    These are all good points Maria, thanks! The read.pdb() section in question dates back to when 60,000 atoms was a large system. It is clear now that we should indeed alter things for the next version to overcome these limitations. Hopefully the read.prmtop() that Lars mentioned can help. If not, let us know and we will come up with an intermediate solution - but it sounds like you might have one already: what approach did you take?

  4. Maria Kalimeri reporter

    It is true. One year ago I would not have guessed that I'd soon need to include atoms beyond position 99,999 in my analysis. Needs grow fast, apparently :-). Anyhow, for the moment I used readLines to read the whole file (since this doesn't take much time), then used grep to keep only the lines with the ATOM field, and finally built a kind of "pdb" with substring(raw.atom, colStart, colEnd).
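    Such a readLines/grep/substring approach might look roughly like this (a sketch, not the actual code; the function name and the choice of fields are invented here, with column positions taken from the fixed-width PDB ATOM record):

```r
# Sketch of the readLines/grep/substring approach described above.
# Column positions follow the fixed-width PDB ATOM record layout.
read.pdb.quick <- function(file) {
  raw <- readLines(file)                        # fast whole-file read
  raw.atom <- grep("^ATOM", raw, value = TRUE)  # keep ATOM records only

  data.frame(
    eleno = substring(raw.atom, 7, 11),           # atom serial (character:
                                                  # may be hex beyond 99999)
    elety = trimws(substring(raw.atom, 13, 16)),  # atom name
    resid = substring(raw.atom, 18, 20),          # residue name
    resno = substring(raw.atom, 23, 26),          # residue number
    x = as.numeric(substring(raw.atom, 31, 38)),
    y = as.numeric(substring(raw.atom, 39, 46)),
    z = as.numeric(substring(raw.atom, 47, 54)),
    stringsAsFactors = FALSE
  )
}
```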

    However, I am not accounting for the "if" cases and the other subtleties that the read.pdb function handles. Probably structural modelers will never encounter my problem; I am coming from molecular dynamics and, as I mentioned previously, the pdb I use is self-made. So hopefully this issue is not that urgent.

    Maria

  5. Barry Grant

    Thanks Maria. We will add this to the To Do list but keep this issue open so we can keep track of it. I may change this thread from type: "bug" to type: "proposal" at a later point once we have a fix on the Master branch.

  6. Lars Skjærven

    A faster version of read.pdb is now provided as function read.pdb2 in branch feature_cpp. See commit 5dae18 (and updates thereafter).

    Test on a 219,000-atom Amber-generated PDB file:

    > system.time(read.pdb("amb.pdb"))
       user  system elapsed 
    135.978   0.031 136.731 
    
    > system.time(read.pdb2("amb.pdb"))
       user  system elapsed 
      2.034   0.018   2.057 
    

    read.pdb2 also parses the hexadecimal element numbers produced by VMD (use argument hex=TRUE). I'm not completely sure this functionality is needed, though.

  7. Barry Grant

    This looks great, thanks Lars! I am up for having these types of functions feature heavily in the next major version, and I like all you are doing on the feature_cpp branch. Btw: is there much Windows-related installation pain associated with these?

  8. Lars Skjærven

    Thanks. I'm not very familiar with this, but I guess we can provide pre-compiled binaries for Windows (and possibly for Mac?), and Rcpp is also available for Windows.
