Extracting Binding Site Residues for Multiple PDB Files using Bio3D
Greetings!
I am trying to generate a PCA for the binding site residues for several crystal structures and MD files. First, I stored all the paths to the pdbfiles in an object, and then applied the read.pdb command to read in each pdb file, which worked fine:
pdbfiles <-list.files("pathtopdbfiles", ".pdb$", full.names=TRUE) pdb <- lapply(pdbfiles, read.pdb)
Next, I tried using the lapply command to extract the binding site residues for each pdb file and received the following error:
bs <- lapply(pdb, binding.site) Error in FUN(X263L, ...) : insufficent 'ligand' atoms in structure
I know that all the structures have a bound ligand because I checked manually, so I'm not sure how to resolve this issue.
Any feedback would be greatly appreciated! Tavina Offutt UCSD, Amaro Lab
Comments (17)
-
-
Hi Xin-Qiu,
Thanks for your response. The error message does occur for some pdb files, and not all of them. When I run the command, binding.site(pdb1)), it runs fine. When I run the command to see which pdb files it doesn't work for, I get the following:
ba <- lapply(1:length(pdb), function(i) {cat('Working on pdb #', i, '\n', sep=''); binding.site(pdbi)}) Working on pdb
#1Working on pdb#2..... . . . . Working on pdb #263 Error in binding.site(pdbi) : insufficent 'ligand' atoms in structurePDB # 263 is PDB code 5d4j, and the ligand is: C21 H18 Cl F N6 O2. Also, if I load in this pdb separately, I can run the binding.site command perfectly fine. However, it doesn't work when its in the list of several pdbfiles for some reason:
h <- binding.site(pdb263) Error in binding.site(pdb263) : insufficent 'ligand' atoms in structure
I am using R version 3.1.1, Bio3D version 2.2-4, and my operating system is Linux hana 2.6.32-642.3.1.el6.x86_64.
Thanks, Tavina
-
That's indeed weird. I guess it might be something wrong in reading the pdb file with 'lapply()'. Try following steps and let me know what output is for each of them:
#! # 1. Simply print the pdb # 263. pdb[[263]] #2. Print the file name of pdb #263 (Check if the output corresponds to a proper pdb file). pdbfiles[263] #3. If the step #2 gives a proper pdb file name, try following commands; Otherwise, stop here. pdb <- read.pdb(pdbfiles[263]) pdb bs <- binding.site(pdb) bs
-
#1. When I print the pdb, I get all the file paths, and the pdb information at the end:pdb263 .......... . . .
, "/scratch/tclaibor/Ensemble_docking_project/4GCJ/PCA_xtalstructures/raw_pdbs/5FP6.pdb", "/scratch/tclaibor/Ensemble_docking_project/4GCJ/PCA_xtalstructures/raw_pdbs/5IEV.pdb", "/scratch/tclaibor/Ensemble_docking_project/4GCJ/PCA_xtalstructures/raw_pdbs/5IEX.pdb", "/scratch/tclaibor/Ensemble_docking_project/4GCJ/PCA_xtalstructures/raw_pdbs/5IEY.pdb", "/scratch/tclaibor/Ensemble_docking_project/4GCJ/PCA_xtalstructures/raw_pdbs/5K4J.pdb" )263L)
Total Models#: 1 Total Atoms#: 2372, XYZs#: 7116 Chains#: 1 (values: A)
Protein Atoms#: 2229 (residues/Calpha atoms#: 277) Nucleic acid Atoms#: 0 (residues/phosphate atoms#: 0) Non-protein/nucleic Atoms#: 143 (residues: 143) Non-protein/nucleic resid values: [ HOH (143) ]
Protein sequence: MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKISTAIREISLLKELNHPNIVKLLDVI HTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHSHRVLHRDLKP QNLLINTEGAIKLADFGLARAVVTLWYRAPEILLGCKYYSTAVDIWSLGCIFAEMVTRRA LFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSFPKWARQ...<cut>...HLRL
- attr: atom, helix, sheet, seqres, xyz,
calpha, remark, call
#2. When I print the filename of pdb 263, I get the following:
pdbfiles[263]
[1] "/scratch/tclaibor/Ensemble_docking_project/4GCJ/PCA_xtalstructures/raw_pdbs/5ANO.pdb"
#3. Results for the following commands:x <- read.pdb(pdbfiles[263])
HEADER TRANSFERASE 07-SEP-15 5ANO
x
Call: read.pdb(file = pdbfiles[263])
Total Models#: 1 Total Atoms#: 2372, XYZs#: 7116 Chains#: 1 (values: A)
Protein Atoms#: 2229 (residues/Calpha atoms#: 277) Nucleic acid Atoms#: 0 (residues/phosphate atoms#: 0) Non-protein/nucleic Atoms#: 143 (residues: 143) Non-protein/nucleic resid values: [ HOH (143) ]
Protein sequence: MENFQKVEKIGEGTYGVVYKARNKLTGEVVALKKISTAIREISLLKELNHPNIVKLLDVI HTENKLYLVFEFLHQDLKKFMDASALTGIPLPLIKSYLFQLLQGLAFCHSHRVLHRDLKP QNLLINTEGAIKLADFGLARAVVTLWYRAPEILLGCKYYSTAVDIWSLGCIFAEMVTRRA LFPGDSEIDQLFRIFRTLGTPDEVVWPGVTSMPDYKPSFPKWARQ...<cut>...HLRL
- attr: atom, helix, sheet, seqres, xyz, calpha, remark, call
b <- binding.site(x)
Error in binding.site(x) : insufficent 'ligand' atoms in structure
- attr: atom, helix, sheet, seqres, xyz,
calpha, remark, call
-
Well, it is clear now. The problematic pdb (i.e. /scratch/tclaibor/Ensemble_docking_project/4GCJ/PCA_xtalstructures/raw_pdbs/5ANO.pdb) has only protein and water molecules (See the output of step
#1, Non-protein/nucleic resid values: [ HOH (143) ] means 143 water molecules). Water will not be counted as "ligand" and so the error "insufficent 'ligand' atoms in structure".Remove this file from the list and then you can proceed with other files (
pdbfiles <- pdbfiles[-263]
). If error still occurs, you may have more "APO" pdb files. Using similar steps to identify and remove them from the list.Let me know if you have more questions!
-
Are you sure you've checked 5ANO? This is the file corresponding to the 263th entry in the list. Make sure you remove this file not others. Good luck!
-
I just realized it is indeed 5ANO, which is an apo structure. Thanks!!
-
Hi Xin-Qiu,
I have an additional question about using the binding.site command. Ultimately, I want to generate a PCA plot of the active site residues only for a list of crystal structures and MD trajectory frames. I thought I should first identify the binding site residues, and then I could perform the alignments, and generate the covariance matrix, so forth. However, I am not able to perform alignments/fits on the binding site residues. Any thoughts/suggestions for how I should go about organizing the code to be able to do this?
Thanks!
-
Hi Tavina,
I would suggest create an alignment (specifically an pdbs object) based on whole chains first. Then, following analysis including binding site identification, PCA, etc. can all start from this alignment. You may also save the created alignment object as a R binary data file ('.RData'), and then next time you can simply load the file instead of re-calculating everything from scratch.
Generate a 'pdbs' from raw pdb files.
You are encouraged to go over the Tutorial for PCA, starting from the section 'Constructing Experimental Structure Ensembles for a Protein Family'.
Basically, you need to split each raw pdb file into single-chain based files (if it contain multiple chains), with the function 'pdbsplit()' or 'get.pdb(split=TRUE)'. Then, read the chain files of your interest and do the alignment, using the function 'pdbaln()'.
A key step here is to know what chains are those you want. This information typically comes from the Blast searching results as indicated in the tutorial. However, if you somehow skipped the Blast but have the prior knowledge of chain IDs to analyze, or you know that all the chains are subjected to analysis, you can create the file list manually. Typical steps of making a 'pdbs' look like:
# If you have blast results hits <- plot.blast(blast) files <- get.pdb(hits, path="raw_pdbs", split = TRUE) # Or, if you have a bunch of raw pdb files and want to include all chains of them for further analysis files <- pdbsplit(list.files("pathtopdbfiles", ".pdb$", full.names=TRUE)) # Create pdbs and save it. pdbs <- pdbaln(files) save(pdbs, file='pdbs.RData')
It's better to have a visual inspection of the alignment before moving on for further analysis. For example, use
seaview
or any other your favorite alignment viewing program to check if there are apparent alignment errors due to e.g. large unsolved parts in some of the structures. (There should be a FASTA formated alignment file named 'aln.fa' generated for you to check. If there is no such a file, you can write one withwrite.fasta(pdbs, file='aln.fa')
.)Map binding site residues to 'pdbs' and do focused PCA.
Using the generated 'pdbs', you can identify binding site residues for each structure, map them to the alignment, find consensus binding site residues, and do structural fitting and PCA based on these consensus residues. Following code snippet is adapted from my previous working script and may be a reference for you to write your own script.
load('pdbs.RData') # find binding site for each structure. bs <- lapply(pdbs$id, function(x) { pdb <- read.pdb(x) binding.site(pdb) } ) # map binding site residues to alignment positions bs.alipos <- lapply(1:length(bs), function(i) { resno <- bs[[i]]$resno which(pdbs$resno[i, ] %in% resno) }) # find the consensus binding site count <- table(unlist(bs.alipos)) bs.consensus <- as.numeric( names(count)[count == length(bs)] ) bs.consensus <- as.select(bs.consensus) # fit based on consensus binding site xyz <- fit.xyz(pdbs$xyz, pdbs, bs.consensus$xyz, bs.consensus$xyz) # PCA pc <- pca(xyz[, bs.consensus$xyz])
Note that the above codes are not tested and should be used as a reference only. Do feel free to let me know if you have any question!
-
#Thanks for the code. I tried it on a small test set of 3 crystal structures, and 3 MD frames, and it works fine. When I try it on 257 crystal structures and 2500 MD frames, the code breaks at the step where it finds a consensus binding site. If I understand the code correctly, it goes through each file/pdb and look to see which aligned binding site residues (bs.alipos) are found across all pdb files. When I print bs.consensus for the entire dataset, I get the following:
bs.consensus
numeric(0)
#I'm not sure why a consensus binding site is not found when all the pdb files is of the same protein, just with different bound inhibitors. Could the numbering of residues have something to do with this?
-
It seems the inhibitors have distinct binding site... One way to check is to open all structures in VMD and see how ligands bind. Use following commands to generate fitted structures:
load('pdbs.RData') xyz <- pdbfit(pdbs, outpath='fitlsq')
Then, in shell, type
vmd -m fitlsq/*
. Follow VMD main menu -> Extensions -> Tk Console. In the opened terminal copy&paste following codes:user add key a { # Applies representations from the top molecule to all other molecules # based on save_state script by John Stone set srcmol [molinfo top] foreach mol [molinfo list] { if {$mol == $srcmol} continue #delete current representations set numreps [molinfo $mol get numreps] for {set i 0} {$i < $numreps} {incr i} { mol delrep 0 $mol } } for {set i 0} {$i < [molinfo $srcmol get numreps]} {incr i} { set rep [molinfo $srcmol get "{rep $i} {selection $i} {color $i} {material $i}"] lappend rep [mol showperiodic $srcmol $i] lappend rep [mol numperiodic $srcmol $i] lappend rep [mol showrep $srcmol $i] lappend rep [mol selupdate $i $srcmol] lappend rep [mol colupdate $i $srcmol] lappend rep [mol scaleminmax $srcmol $i] lappend rep [mol smoothrep $srcmol $i] lappend rep [mol drawframes $srcmol $i] foreach mol [molinfo list] { if {$mol == $srcmol} continue foreach {r s c m pbc numpbc on selupd colupd colminmax smooth framespec} $rep { break } eval "mol representation $r" eval "mol color $c" eval "mol selection {$s}" eval "mol material $m" eval "mol addrep $mol" if {[string length $pbc]} { eval "mol showperiodic $mol $i $pbc" eval "mol numperiodic $mol $i $numpbc" } eval "mol selupdate $i $mol $selupd" eval "mol colupdate $i $mol $colupd" eval "mol scaleminmax $mol $i $colminmax" eval "mol smoothrep $mol $i $smooth" eval "mol drawframes $mol $i {$framespec}" if { !$on } { eval "mol showrep $mol $i 0" } } } }
In the 'Graphics -> Representations', input 'not protein and not water' and return. Choose 'Licorice' for 'Drawing Method'. Left click the Graphic Window and type 'a'. That would show ligand only for all structures. You may also 'Create Rep' for 'all' with 'NewCartoon' drawing (but without typing 'a' in the Graphic Window!).
When you have spotted any inhibitor that binds to different site from the 'major cluster', identify the structure ID of that inhibitor and exclude it from the dataset. Repeat the process until all structures have the same ligand binding location.
The residue numbering shouldn't be the problem, because we are using 'alignment position' here that is consistent across structures.
Hope it could help.
-
#Thanks for the vmd code. It worked fine, and I was able to identify the crystal structures with an alternate binding site, and eliminate them. When I ran the commands to generate the PCA for the crystal structures only, and the MD structures only, it worked fine. However, when I run them together, I am getting an error with the xyz fitting.
xyz <- fit.xyz(pdbs$xyz, pdbs, bs.consensus$xyz, bs.consensus$xyz) Error in fit.xyz(pdbs$xyz, pdbs, bs.consensus$xyz, bs.consensus$xyz) : NA elements selected for fitting (check indices)
#I didn't receive this error with the crystal structures dataset or the MD frames dataset. When looking at the documentation for fit.xyz, I thought I specified the indices as the atoms that are conserved across all the binding sites.
# Alternatively, I tried the fitting the same way as in demo for pca, which worked fine but I'm not sure if it is fitting on the binding site residues only as opposed to the entire protein.xyz <- pdbfit(pdbs, bs.consensus)
-
The 'pdbfit()' call has used the bs residues but not sure why it did not give errors as 'fit.xyz()' did. They should be consistent...
Anyway, I guess 'fit.xyz()' is correct and in the fitting some NAs were selected. Can you tell me how did you combine the crystal structure data with MD simulations? Also, is the starting structure of MD from one of the crystal structures?
-
I wrote out each frame of the MD trajectory as an individual pdb file. The pdb files of the MD frames and the crystal structure pdb files were all in the same directory path that I loaded into a list using list.file command. Also, the MD trajectory is from the same protein as in the crystal structures. However, the specific crystal structure used to generate the MD trajectory was eliminated as it had missing residues; I do plan to add back in the one that was prepared for MD simulations.
Thanks,
Mrs. Tavina L Offutt, M.S. Chemistry and Biochemistry PhD Candidate Rommie Amaro Lab tclaibor@ucsd.edu
-
I guess it might be because the 'bs.consensus' was calculated based on crystal structures, while after including MD frames the alignment in the original 'pdbs' is changed to some extent. Therefore the 'bs.consensus' does not match the "new" pdbs.
Including the structure used for MD could solve the problem, but still make sure the alignment of crystal structures in the new 'pdbs' is the same as the original 'pdbs'.
-
The bs.consensus was calculated based on both the crystal and MD frames. After loading in both the crystal and MD frames pdb files, I performed the alignment on all of the structures so the consensus binding site is based on both the crystal structures and the MD frames, not the crystal structures only. I will try including the crystal structure used for MD and see if this helps.
Thanks!
Mrs. Tavina L Offutt, M.S. Chemistry and Biochemistry PhD Candidate Rommie Amaro Lab tclaibor@ucsd.edu
-
- changed status to resolved
- Log in to comment
Hi Tavina,
Does the error occur for every pdb file or just some of them? If you are not sure, try some tests like:
#1? (binding.site(pdb[[1]]
)bs <- lapply(1:length(pdb), function(i) {cat('Working on pdb #', i, '\n', sep=''); binding.site(pdb[[i]])})
)If you can find one example pdb that gives the error, check what ligand it contains (e.g. simply
print
the pdb object can show you some basic information about the structure).Also, it would be helpful for diagnosis if you can tell what bio3d version, R build, and OS are you using.
Let me know what you get by above testing.