blast.pdb does not return the PDBids of PDBs with identical sequences (so called "identical proteins" by NCBI) present in the PDB database.

Issue #921 new
Matthieu created an issue

Hi,

This must be a common issue but I didn't see it discussed here. On the NCBI server, when doing a pblast on the sequences from the pdb (database: “Protein Data Bank proteins (pdb)”), the pblast output groups the PDBs for which the sequences are the same (called "Identical proteins") under a single entry from that group. This does not mean the structures in that group with identical protein sequences are the same however. How do I get all the PDBids (not only the ones with non-redundant sequences from the PDB) from within bio3d that match a sequence of interest? The strategy I see in a bio3d tutorial using blast.pdb to retrieve similar structure does not provide what I'm expecting i.e. all the similar structures (given a cutoff). Below is an example to illustrate that.

# Here is a list of 20 pdbs of the kinesin KIF14 from the same study. Their sequences contain the same kinesin domain with 5 different constructs differing by their length at their Cterminal end. Several of these PDBs are from the same construct, i.e. have exactly the same KIF14 sequence but their structure are different (they can have a different nucleotide bound for instance):

c("6WWS","6WWE","6WWF","6WWG","6WWH","6WWI","6WWJ","6WWK","6WWL","6WWM","6WWN","6WWO","6WWP","6WWQ","6WWR","6WWT","6WWU","6WWV","7LVQ","7LVR")->K14cryoEMpdbs

# Using blast.pdb

read.pdb(K14cryoEMpdbs[1])->pdb
pdbseq(atom.select(pdb, chain="K",value=TRUE))->seq # kinesin is the chain K in 6WWS
blast.pdb(seq)->blast
str(blast) # I don't see where the PDBs of 'Identical proteins' are reported (NCBI provides this information).

# Therefore, if I extract the PDBids from pdb.id, which are the main entries of the blast output...

unique(unlist(lapply(strsplit(blast$hit.tbl$pdb.id,split="_"),FUN="[[",1)))->pdbsObtained

# ... a lot of the PDBids from K14cryoEMpdbs are missing:

setdiff(K14cryoEMpdbs[-1],pdbsObtained)

# [1] "6WWF" "6WWG" "6WWH" "6WWJ" "6WWK" "6WWL" "6WWN" "6WWO" "6WWQ" "6WWR"

# [11] "6WWU" "6WWV" "7LVQ" "7LVR"

intersect(K14cryoEMpdbs[-1],pdbsObtained)
# [1] "6WWE" "6WWI" "6WWM" "6WWP" "6WWT" # each corresponds to one instance of the 5 protein constructs used.

sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 11 (bullseye)

(…)

other attached packages:
[1] bio3d_2.4-4

Best,

Matthieu

Comments (0)

  1. Log in to comment