pdbsplit - lower and upper case

Issue #89 resolved
Lars Skjærven created an issue

some PDBs are using lower and upper case chain identifiers - see e.g. PDB id 4aol. here both chains a and A exist. with the current code of pdbsplit both are returned, even though I only seek chain a (not chain A):

path=tempdir()
ids = c("4aol_a")
rawfiles = get.pdb(ids, path=path)
files = pdbsplit(rawfiles, ids = ids, path=path)

According to http://www.wwpdb.org/procedure.html : Lower case letters should be used only after all upper letters and numbers have been used.. we should therefore probably remove all function calls to toupper in pdbsplit to make pdbsplit() case sensitive. agree?

PDB id 3R1C is another example of this:

files = pdbsplit(get.pdb("3R1C", path=path))

Blast seems to deal with this by using chain identifier AA instead of a:

seq <- get.seq("P12612")
blast <- blast.pdb(seq)
hits <- plot.blast(blast, cutoff = 100)

ids = hits$pdb.id[grep("4AOL", hits$pdb.id)]
> ids
 [1] "4AOL_A"  "4AOL_AA" "4AOL_H"  "4AOL_HH" "4AOL_B"  "4AOL_BB" "4AOL_E" 
 [8] "4AOL_EE" "4AOL_G"  "4AOL_GG" "4AOL_D"  "4AOL_DD" "4AOL_Z"  "4AOL_ZZ"
[15] "4AOL_Q"  "4AOL_QQ"

so that the call to get.pdb and pdbsplit will not catch the chains with lower case identifiers in the PDB file:

files = pdbsplit(rawfiles, ids = ids, path=path)

Warning message:
In pdbsplit(rawfiles, ids = ids, path = path) :
  unmatched ids: 4AOL_AA, 4AOL_HH, 4AOL_BB, 4AOL_EE, 4AOL_GG, 4AOL_DD, 4AOL_ZZ, 4AOL_QQ

hmmm...

Comments (6)

  1. Barry Grant

    My favorite - lovely PDB format inconsistencies ;-)

    I just had a quick look at pdbsplit() and was shocked by how much it has changed and how complicated it now appears. It is obviously much more flexible and useful too.

    I am fine with removing the toupper() calls but guess that it might require a little more familiarity with the function than I have currently - as toupper() is called on both the 'pdbId_chainId' components. If you have the stomach for it please go ahead and edit.

    Regarding the NCBI PDB BLAST database it might be better to catch the possible double character chain IDs and convert them to lowercase single character entries within the new get.blast() function itself (along with a warning and flag to disable conversion).

    Personally I don't like the double character convention and we should complain to them about this.

    Either way this will likely come back to bite someone - so having it documented here is a good start - if only the search on bitbucket worked a little better.

  2. Lars Skjærven reporter

    jepps. I made it a bit messy, but it works.. somehow. I'll definitively add some tests for this function so we have better control.

    do you want to add the conversion to get.blast? I'll update the pdbsplit.

  3. Barry Grant

    Ok commit 0e2259d on master should have blast.pdb() working the way we discussed with a new 'chain.single' argument to go back to the old way. See below.

    Note that I have not checked the RCSB PDB files themselves to see if these lowercase chains all really exist.

    > library(bio3d)
    > aa <- get.seq("P12612")
    > blast <- blast.pdb(aa)
    > head(blast$pdb.id)
    [1] "3P9D_A" "3P9D_I" "3P9E_a" "3P9E_i" "4D8Q_A" "4D8Q_I"
    
    > url <- blast$url
    > i <- get.blast(url)
    > head(i$pdb.id)
    [1] "3P9D_A" "3P9D_I" "3P9E_a" "3P9E_i" "4D8Q_A" "4D8Q_I"
    
    > i2 <- get.blast(url, chain.single=FALSE)
    > head(i2$pdb.id)
    [1] "3P9D_A"  "3P9D_I"  "3P9E_AA" "3P9E_II" "4D8Q_A"  "4D8Q_I"
    
  4. Barry Grant

    Marking this as resolved for now

    > path=tempdir()
    > ids = c("4aol_a")
    > rawfiles = get.pdb(ids, path=path)
    trying URL 'http://www.rcsb.org/pdb/files/4aol.pdb'
    > files = pdbsplit(rawfiles, ids = ids, path=path)
    |==================================| 100%
    > ids
    [1] "4aol_a"
    > file.exists(files)
    [1] TRUE
    > basename(files)
    [1] "4aol_a.1.pdb"
    

    The blast issue has been solved as noted above. The "_a.1.pdb" may still be an issue, in which case we can re-open this issue.

  5. Log in to comment