Extracting Binding Site Residues for Multiple PDB Files using Bio3D

Issue #382 resolved

Former user created an issue 2016-07-30

Greetings!

I am trying to generate a PCA for the binding site residues for several crystal structures and MD files. First, I stored all the paths to the pdbfiles in an object, and then applied the read.pdb command to read in each pdb file, which worked fine:

pdbfiles <-list.files("pathtopdbfiles", ".pdb$", full.names=TRUE) pdb <- lapply(pdbfiles, read.pdb)

Next, I tried using the lapply command to extract the binding site residues for each pdb file and received the following error:

bs <- lapply(pdb, binding.site) Error in FUN(X263L, ...) : insufficent 'ligand' atoms in structure

I know that all the structures have a bound ligand because I checked manually, so I'm not sure how to resolve this issue.

Any feedback would be greatly appreciated! Tavina Offutt UCSD, Amaro Lab

Comments (17)

Xinqiu Yao
Hi Tavina,

Does the error occur for every pdb file or just some of them? If you are not sure, try some tests like:
- Does it work for pdb #1? (binding.site(pdb[[1]])
- Which pdb files does it not work? (This can be detected by adding some output message within lapply, e.g. (bs <- lapply(1:length(pdb), function(i) {cat('Working on pdb #', i, '\n', sep=''); binding.site(pdb[[i]])}))
If you can find one example pdb that gives the error, check what ligand it contains (e.g. simply print the pdb object can show you some basic information about the structure).

Also, it would be helpful for diagnosis if you can tell what bio3d version, R build, and OS are you using.

Let me know what you get by above testing.
- 2016-07-30T02:06:22+00:00
Tavina Offutt
Hi Xin-Qiu,

Thanks for your response. The error message does occur for some pdb files, and not all of them. When I run the command, binding.site(pdb1)), it runs fine. When I run the command to see which pdb files it doesn't work for, I get the following:

ba <- lapply(1:length(pdb), function(i) {cat('Working on pdb #', i, '\n', sep=''); binding.site(pdbi)}) Working on pdb #1 Working on pdb #2..... . . . . Working on pdb #263 Error in binding.site(pdbi) : insufficent 'ligand' atoms in structure

PDB # 263 is PDB code 5d4j, and the ligand is: C21 H18 Cl F N6 O2. Also, if I load in this pdb separately, I can run the binding.site command perfectly fine. However, it doesn't work when its in the list of several pdbfiles for some reason:

h <- binding.site(pdb263) Error in binding.site(pdb263) : insufficent 'ligand' atoms in structure

I am using R version 3.1.1, Bio3D version 2.2-4, and my operating system is Linux hana 2.6.32-642.3.1.el6.x86_64.

Thanks, Tavina
- 2016-08-01T18:57:28+00:00

Xinqiu Yao

That's indeed weird. I guess it might be something wrong in reading the pdb file with 'lapply()'. Try following steps and let me know what output is for each of them:

#!
# 1. Simply print the pdb # 263. 
pdb[[263]]

#2. Print the file name of pdb #263 (Check if the output corresponds to a proper pdb file).
pdbfiles[263]

#3. If the step #2 gives a proper pdb file name, try following commands; Otherwise, stop here.
pdb <- read.pdb(pdbfiles[263])
pdb
bs <- binding.site(pdb)
bs

2016-08-01T19:49:20+00:00

Tavina Offutt
#1. When I print the pdb, I get all the file paths, and the pdb information at the end:

pdb263 .......... . . .

, "/scratch/tclaibor/Ensemble_docking_project/4GCJ/PCA_xtalstructures/raw_pdbs/5FP6.pdb", "/scratch/tclaibor/Ensemble_docking_project/4GCJ/PCA_xtalstructures/raw_pdbs/5IEV.pdb", "/scratch/tclaibor/Ensemble_docking_project/4GCJ/PCA_xtalstructures/raw_pdbs/5IEX.pdb", "/scratch/tclaibor/Ensemble_docking_project/4GCJ/PCA_xtalstructures/raw_pdbs/5IEY.pdb", "/scratch/tclaibor/Ensemble_docking_project/4GCJ/PCA_xtalstructures/raw_pdbs/5K4J.pdb" )263L)

Total Models#: 1 Total Atoms#: 2372, XYZs#: 7116 Chains#: 1 (values: A)
```
 Protein Atoms#: 2229  (residues/Calpha atoms#: 277)
 Nucleic acid Atoms#: 0  (residues/phosphate atoms#: 0)

 Non-protein/nucleic Atoms#: 143  (residues: 143)
 Non-protein/nucleic resid values: [ HOH (143) ]
```
Protein sequence: MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKISTAIREISLLKELNHPNIVKLLDVI HTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHSHRVLHRDLKP QNLLINTEGAIKLADFGLARAVVTLWYRAPEILLGCKYYSTAVDIWSLGCIFAEMVTRRA LFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSFPKWARQ...<cut>...HLRL
- attr: atom, helix, sheet, seqres, xyz, calpha, remark, call #2. When I print the filename of pdb 263, I get the following:
pdbfiles[263]

[1] "/scratch/tclaibor/Ensemble_docking_project/4GCJ/PCA_xtalstructures/raw_pdbs/5ANO.pdb"

#3. Results for the following commands:

x <- read.pdb(pdbfiles[263])

HEADER TRANSFERASE 07-SEP-15 5ANO

x

Call: read.pdb(file = pdbfiles[263])

Total Models#: 1 Total Atoms#: 2372, XYZs#: 7116 Chains#: 1 (values: A)
```
 Protein Atoms#: 2229  (residues/Calpha atoms#: 277)
 Nucleic acid Atoms#: 0  (residues/phosphate atoms#: 0)

 Non-protein/nucleic Atoms#: 143  (residues: 143)
 Non-protein/nucleic resid values: [ HOH (143) ]
```
Protein sequence: MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKISTAIREISLLKELNHPNIVKLLDVI HTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHSHRVLHRDLKP QNLLINTEGAIKLADFGLARAVVTLWYRAPEILLGCKYYSTAVDIWSLGCIFAEMVTRRA LFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSFPKWARQ...<cut>...HLRL
- attr: atom, helix, sheet, seqres, xyz, calpha, remark, call
b <- binding.site(x)

Error in binding.site(x) : insufficent 'ligand' atoms in structure
- 2016-08-01T20:06:54+00:00
Xinqiu Yao
Well, it is clear now. The problematic pdb (i.e. /scratch/tclaibor/Ensemble_docking_project/4GCJ/PCA_xtalstructures/raw_pdbs/5ANO.pdb) has only protein and water molecules (See the output of step #1, Non-protein/nucleic resid values: [ HOH (143) ] means 143 water molecules). Water will not be counted as "ligand" and so the error "insufficent 'ligand' atoms in structure".

Remove this file from the list and then you can proceed with other files (pdbfiles <- pdbfiles[-263]). If error still occurs, you may have more "APO" pdb files. Using similar steps to identify and remove them from the list.

Let me know if you have more questions!
- 2016-08-01T20:47:42+00:00
Xinqiu Yao
Are you sure you've checked 5ANO? This is the file corresponding to the 263th entry in the list. Make sure you remove this file not others. Good luck!
- 2016-08-01T21:05:03+00:00
Tavina Offutt
I just realized it is indeed 5ANO, which is an apo structure. Thanks!!
- 2016-08-01T21:11:10+00:00
Tavina Offutt
Hi Xin-Qiu,

I have an additional question about using the binding.site command. Ultimately, I want to generate a PCA plot of the active site residues only for a list of crystal structures and MD trajectory frames. I thought I should first identify the binding site residues, and then I could perform the alignments, and generate the covariance matrix, so forth. However, I am not able to perform alignments/fits on the binding site residues. Any thoughts/suggestions for how I should go about organizing the code to be able to do this?

Thanks!
- 2016-08-01T22:08:49+00:00
Xinqiu Yao
Hi Tavina,

I would suggest create an alignment (specifically an pdbs object) based on whole chains first. Then, following analysis including binding site identification, PCA, etc. can all start from this alignment. You may also save the created alignment object as a R binary data file ('.RData'), and then next time you can simply load the file instead of re-calculating everything from scratch.

Generate a 'pdbs' from raw pdb files.

You are encouraged to go over the Tutorial for PCA, starting from the section 'Constructing Experimental Structure Ensembles for a Protein Family'.

Basically, you need to split each raw pdb file into single-chain based files (if it contain multiple chains), with the function 'pdbsplit()' or 'get.pdb(split=TRUE)'. Then, read the chain files of your interest and do the alignment, using the function 'pdbaln()'.

A key step here is to know what chains are those you want. This information typically comes from the Blast searching results as indicated in the tutorial. However, if you somehow skipped the Blast but have the prior knowledge of chain IDs to analyze, or you know that all the chains are subjected to analysis, you can create the file list manually. Typical steps of making a 'pdbs' look like:
```
# If you have blast results
hits <- plot.blast(blast)
files <- get.pdb(hits, path="raw_pdbs", split = TRUE)

# Or, if you have a bunch of raw pdb files and want to include all chains of them for further analysis
files <- pdbsplit(list.files("pathtopdbfiles", ".pdb$", full.names=TRUE))

# Create pdbs and save it.
pdbs <- pdbaln(files)
save(pdbs, file='pdbs.RData')
```
It's better to have a visual inspection of the alignment before moving on for further analysis. For example, use seaview or any other your favorite alignment viewing program to check if there are apparent alignment errors due to e.g. large unsolved parts in some of the structures. (There should be a FASTA formated alignment file named 'aln.fa' generated for you to check. If there is no such a file, you can write one with write.fasta(pdbs, file='aln.fa').)

Map binding site residues to 'pdbs' and do focused PCA.

Using the generated 'pdbs', you can identify binding site residues for each structure, map them to the alignment, find consensus binding site residues, and do structural fitting and PCA based on these consensus residues. Following code snippet is adapted from my previous working script and may be a reference for you to write your own script.
```
load('pdbs.RData')

# find binding site for each structure.
bs <- lapply(pdbs$id, function(x) {
  pdb <- read.pdb(x)
  binding.site(pdb)
} )

# map binding site residues to alignment positions
bs.alipos <- lapply(1:length(bs), function(i) {
   resno <- bs[[i]]$resno
   which(pdbs$resno[i, ] %in% resno)
})

# find the consensus binding site
count <- table(unlist(bs.alipos))
bs.consensus <- as.numeric( names(count)[count == length(bs)] )
bs.consensus <- as.select(bs.consensus)

# fit based on consensus binding site
xyz <- fit.xyz(pdbs$xyz, pdbs, bs.consensus$xyz, bs.consensus$xyz)

# PCA
pc <- pca(xyz[, bs.consensus$xyz])
```
Note that the above codes are not tested and should be used as a reference only. Do feel free to let me know if you have any question!
- 2016-08-02T22:20:49+00:00
Tavina Offutt
#Thanks for the code. I tried it on a small test set of 3 crystal structures, and 3 MD frames, and it works fine. When I try it on 257 crystal structures and 2500 MD frames, the code breaks at the step where it finds a consensus binding site. If I understand the code correctly, it goes through each file/pdb and look to see which aligned binding site residues (bs.alipos) are found across all pdb files. When I print bs.consensus for the entire dataset, I get the following:

bs.consensus

numeric(0)

#I'm not sure why a consensus binding site is not found when all the pdb files is of the same protein, just with different bound inhibitors. Could the numbering of residues have something to do with this?
- 2016-08-04T20:50:08+00:00

Xinqiu Yao

It seems the inhibitors have distinct binding site... One way to check is to open all structures in VMD and see how ligands bind. Use following commands to generate fitted structures:

load('pdbs.RData')
xyz <- pdbfit(pdbs, outpath='fitlsq')

Then, in shell, type vmd -m fitlsq/*. Follow VMD main menu -> Extensions -> Tk Console. In the opened terminal copy&paste following codes:

user add key a {
        # Applies representations from the top molecule to all other molecules
        # based on save_state script by John Stone
  set srcmol [molinfo top]
  foreach mol [molinfo list] {
    if {$mol == $srcmol} continue
    #delete current representations
    set numreps [molinfo $mol get numreps]
    for {set i 0} {$i < $numreps} {incr i} {
      mol delrep 0 $mol
    }
  }
  for {set i 0} {$i < [molinfo $srcmol get numreps]} {incr i} {
    set rep [molinfo $srcmol get "{rep $i} {selection $i} {color $i} {material $i}"]
    lappend rep [mol showperiodic $srcmol $i]
    lappend rep [mol numperiodic $srcmol $i]
    lappend rep [mol showrep $srcmol $i]
    lappend rep [mol selupdate $i $srcmol]
    lappend rep [mol colupdate $i $srcmol]
    lappend rep [mol scaleminmax $srcmol $i]
    lappend rep [mol smoothrep $srcmol $i]
    lappend rep [mol drawframes $srcmol $i]
    foreach mol [molinfo list] {
      if {$mol == $srcmol} continue
      foreach {r s c m pbc numpbc on selupd colupd colminmax smooth framespec} $rep { break }
      eval "mol representation $r"
      eval "mol color $c"
      eval "mol selection {$s}"
      eval "mol material $m"
      eval "mol addrep $mol"
      if {[string length $pbc]} {
        eval "mol showperiodic $mol $i $pbc"
        eval "mol numperiodic $mol $i $numpbc"
      }
      eval "mol selupdate $i $mol $selupd"
      eval "mol colupdate $i $mol $colupd"
      eval "mol scaleminmax $mol $i $colminmax"
      eval "mol smoothrep $mol $i $smooth"
      eval "mol drawframes $mol $i {$framespec}"
      if { !$on } {
        eval "mol showrep $mol $i 0"
      }
    }
  }
}

In the 'Graphics -> Representations', input 'not protein and not water' and return. Choose 'Licorice' for 'Drawing Method'. Left click the Graphic Window and type 'a'. That would show ligand only for all structures. You may also 'Create Rep' for 'all' with 'NewCartoon' drawing (but without typing 'a' in the Graphic Window!).

When you have spotted any inhibitor that binds to different site from the 'major cluster', identify the structure ID of that inhibitor and exclude it from the dataset. Repeat the process until all structures have the same ligand binding location.

The residue numbering shouldn't be the problem, because we are using 'alignment position' here that is consistent across structures.

Hope it could help.

2016-08-05T14:30:15+00:00

Tavina Offutt
#Thanks for the vmd code. It worked fine, and I was able to identify the crystal structures with an alternate binding site, and eliminate them. When I ran the commands to generate the PCA for the crystal structures only, and the MD structures only, it worked fine. However, when I run them together, I am getting an error with the xyz fitting.

xyz <- fit.xyz(pdbs$xyz, pdbs, bs.consensus$xyz, bs.consensus$xyz) Error in fit.xyz(pdbs$xyz, pdbs, bs.consensus$xyz, bs.consensus$xyz) : NA elements selected for fitting (check indices)

#I didn't receive this error with the crystal structures dataset or the MD frames dataset. When looking at the documentation for fit.xyz, I thought I specified the indices as the atoms that are conserved across all the binding sites.
# Alternatively, I tried the fitting the same way as in demo for pca, which worked fine but I'm not sure if it is fitting on the binding site residues only as opposed to the entire protein.

xyz <- pdbfit(pdbs, bs.consensus)
- 2016-08-13T23:08:09+00:00
Xinqiu Yao
The 'pdbfit()' call has used the bs residues but not sure why it did not give errors as 'fit.xyz()' did. They should be consistent...

Anyway, I guess 'fit.xyz()' is correct and in the fitting some NAs were selected. Can you tell me how did you combine the crystal structure data with MD simulations? Also, is the starting structure of MD from one of the crystal structures?
- 2016-08-16T19:39:00+00:00
Tavina Offutt
I wrote out each frame of the MD trajectory as an individual pdb file. The pdb files of the MD frames and the crystal structure pdb files were all in the same directory path that I loaded into a list using list.file command. Also, the MD trajectory is from the same protein as in the crystal structures. However, the specific crystal structure used to generate the MD trajectory was eliminated as it had missing residues; I do plan to add back in the one that was prepared for MD simulations.

Thanks,

Mrs. Tavina L Offutt, M.S. Chemistry and Biochemistry PhD Candidate Rommie Amaro Lab tclaibor@ucsd.edu
- 2016-08-16T22:15:08+00:00
Xinqiu Yao
I guess it might be because the 'bs.consensus' was calculated based on crystal structures, while after including MD frames the alignment in the original 'pdbs' is changed to some extent. Therefore the 'bs.consensus' does not match the "new" pdbs.

Including the structure used for MD could solve the problem, but still make sure the alignment of crystal structures in the new 'pdbs' is the same as the original 'pdbs'.
- 2016-08-17T14:48:14+00:00
Tavina Offutt
The bs.consensus was calculated based on both the crystal and MD frames. After loading in both the crystal and MD frames pdb files, I performed the alignment on all of the structures so the consensus binding site is based on both the crystal structures and the MD frames, not the crystal structures only. I will try including the crystal structure used for MD and see if this helps.

Thanks!

Mrs. Tavina L Offutt, M.S. Chemistry and Biochemistry PhD Candidate Rommie Amaro Lab tclaibor@ucsd.edu
- 2016-08-17T15:50:06+00:00
Xinqiu Yao
- changed status to resolved
- 2016-09-21T22:33:51+00:00
Log in to comment

Assignee: Bio3D Project

Type: task

Priority: minor

Status: resolved

Component: Q&A

Version: –

Votes: 0

Watchers: 1

Comments (17)

Generate a 'pdbs' from raw pdb files.

Map binding site residues to 'pdbs' and do focused PCA.