get.seq changing PDB IDs leading to seg fault
Hello,
I am performing a blast search on the following sequence. For the purposes of this, lets assume it it contained in a file called "fasta_file.fasta":
>575584.HMPREF0010_00009
MLPPVPKTKSSEVTDIINSAVLTGSISEFQYFRCKRLLNDIKETEPLDWFLLSNSIIEMYFDNPILAHQYAREVLKISNS
VSILSNLYFVFLSSVDFSSANENIDKIISLCSKQNLPLESFIPIDFKPITYFLDGILNDDLNYYKRFKKEDFNEFIQFFE
IKNKLEIDSRVLKHIGSILFKCFNSRNVRCRKYEYSFIDDEFLILLYVDRSFDEIDAMNSEIFSKCYDEGLIDELNKLSY
FIIPYEVGVD
My code for analysis is as follows:
seq <- read.fasta( fasta_file.fasta )
blast <- blast.pdb(seq$ali[1,], database = 'pdb', time.out = NULL)
seqs <- get.seq(blast$hit.tbl$acc)
aln <- seqaln(seqbind(seq$ali[1,], seqs))
ide <- seqidentity(aln)
for (k in 1:length(blast$hit.tbl$evalue) ){
if(blast$hit.tbl$evalue[k] < 0.0001) {
pdbname <- blast$hit.tbl$pdb.id[k]
if(ide["seq1",pdbname] >= 0.7){
file <- get.pdb(blast$hit.tbl$pdb.id[k], path='output', split= FALSE)
}
}
}
However, the code loves to segfault at the following line
if(ide["seq1",pdbname] >= 0.7){
I have determined that this is due to blast hit pdb ID 5X3X. In the blast$hit.tbl$pdb.id array, four names are reported: 5X3X_A, 5X3X_B, 5X3X_a, 5X3X_b. However, in the array seqs, the following ids are present: 5X3X_A, 5X3X_B, 5X3X_AA, 5X3X_BB.
Can someone explain to me why the get.seq call is renaming the PDB filenames, and switching lower case letters with dual upper case letters? Are there any other of these filename chainging conventions that I should be aware of? I am about to do a computationally demanding blast search, and I would hate if the code segfaults have way through because of an issue like this.
Comments (6)
-
-
reporter Thank you. Unfortunately, the rownames command doesn't seem to induce a change in my vector. Also, I have noticed that length of the vector seqs is not the same as blast$hit.tbl$pdb.id. Could you explain why this is?
I have engineered a fix:
alphabet_lower=c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z") alphabet_upper=c("_AA","_BB","_CC","_DD","_EE","_FF","_GG","_HH","_II","_JJ","_KK","_LL","_MM","_NN","_OO","_PP","_QQ","_RR","_SS","_TT","_UU","_VV","_WW","_XX","_YY","_ZZ") pdbname <- blast$hit.tbl$pdb.id[k] for( z in 1:length(alphabet_lower) ){ if(grepl( alphabet_lower[z], pdbname) ==TRUE){ pdbname = substr(pdbname,1,4) pdbname = paste(pdbname, alphabet_upper[z], sep="") } }
However, I am wondering if there any other situations like this, where they may be a difference in filenames. Would it possible to suggest a fix where bio3d handled these kind of scenarios internally?
-
A bit late seeing this but as was already noted you are jumping between databases with different nameing conventions (PDB and NR). Probably best to stick to one database throughout. Again it is not bio3d that is doing this but the places you are requesting data from.
See example code below which may be doing similar things to what you were in the first post. Hope this helps!
library(bio3d) s <- read.fasta( "fasta_file.fasta" ) raw <- blast.pdb(s) ## Side-Note: ## The evalues are quite poor for all our hits and the %ide is low also ~30% > summary(raw$hit.tbl$evalue) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.730 0.800 2.300 2.181 2.600 9.700 > summary(raw$hit.tbl$identity) Min. 1st Qu. Median Mean 3rd Qu. Max. 21.78 26.92 26.92 29.88 35.00 38.78 ## Download sequence records aa <- get.seq( raw$hit.tbl$pdb.id ) ## Align sequences along with our query aln <- seqaln( seqbind(s, aa)) ## Side-note: This alignment will be of questionable quality with respect to your query sequence ## Sequence identity matrix ide <- seqidentity(aln) ## Note again that your query sequence has very low values in this matrix! > summary(ide[1,-1]) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0850 0.1070 0.1090 0.1138 0.1170 0.1400 ## No hit is above 0.7 (70%) but just to demo your for loop replacment we can use 12% names( which(ide[1,-1] > 0.12) ) [1] "3QGM_A" "3QGM_B" "3QGM_C" "3QGM_D" "4JDP_A" "4JDP_B" ## obviously you can do get PDB from there...
-
reporter Thank you, professor Grant. Could you please comment further on why the alignment will be of questionable quality?
-
- marked as minor
- changed component to ToDo
- changed version to v2.3 [devel]
-
assigned issue to
Cautions must be applied if
blast.pdb()
returns hits of large systems containing many chains. In these systems, chains may be labeled with both upper and lower cases, such as 'A' and 'a', and both chains may show up in the blast hit table. If this occurs,get.seq()
usingblast$hit.tbl$pdb.id
may return incomplete results because the APIs for fetching sequences are case insensitive. For example,pdb <- read.pdb('5X3X') pdb <- trim(pdb, chain='A') aa <- pdbseq(pdb) blast <- blast.pdb(aa, database='pdb') blast$hit.tbl[1:10, ] ## Note row #1 and #3. queryid subjectids identity positives alignmentlength mismatches gapopens q.start 1 Query_131483 gi|1175129182|pdb|5X3X|A 100.000 100 280 0 0 1 2 Query_131483 gi|1175129183|pdb|5X3X|B 100.000 100 280 0 0 1 3 Query_131483 gi|1175129186|pdb|5X3X|AA 100.000 100 280 0 0 1 4 Query_131483 gi|1175129187|pdb|5X3X|BB 100.000 100 280 0 0 1 5 Query_131483 gi|1182607627|pdb|5X41|A 100.000 100 280 0 0 1 6 Query_131483 gi|1182607628|pdb|5X41|B 100.000 100 280 0 0 1 7 Query_131483 gi|1182607631|pdb|5X41|C 100.000 100 280 0 0 1 8 Query_131483 gi|1182607632|pdb|5X41|D 100.000 100 280 0 0 1 9 Query_131483 gi|1175129190|pdb|5X40|A 99.643 100 280 1 0 1 10 Query_131483 gi|1175129191|pdb|5X40|B 99.643 100 280 1 0 1 q.end s.start s.end evalue bitscore mlog.evalue pdb.id acc 1 280 1 280 0 536 709.1962 5X3X_A 1175129182 2 280 1 280 0 536 709.1962 5X3X_B 1175129183 3 280 1 280 0 536 709.1962 5X3X_a 1175129186 4 280 1 280 0 536 709.1962 5X3X_b 1175129187 5 280 1 280 0 536 709.1962 5X41_A 1182607627 6 280 1 280 0 536 709.1962 5X41_B 1182607628 7 280 1 280 0 536 709.1962 5X41_C 1182607631 8 280 1 280 0 536 709.1962 5X41_D 1182607632 9 280 13 292 0 535 709.1962 5X40_A 1175129190 10 280 13 292 0 535 709.1962 5X40_B 1175129191 ## Use NCBI server and "nr" database. Note that we lose entries ## with lower-case chain IDs such as "5X3X_a". seqs.nr <- get.seq(blast$hit.tbl$pdb.id[1:10]) seqs.nr$id [1] "5X3X_A" "5X3X_B" "5X41_A" "5X41_B" "5X41_C" "5X41_D" "5X40_A" "5X40_B" ## Use EBI and "pdb" database. The same results. ## Side-note: The naming convention is different between databases. seqs.pdb <- get.seq(blast$hit.tbl$pdb.id[1:10], db="pdb") seqs.pdb$id [1] "PDB:5X3X_A" "PDB:5X3X_B" "PDB:5X41_A" "PDB:5X41_B" "PDB:5X41_C" "PDB:5X41_D" "PDB:5X40_A" "PDB:5X40_B" ## The only way to get a full list is to use "acc" instead of ## "pdb.id" and the "nr" database. However, sequence ids ## will change for lower-case entries from e.g. "a" to "AA". seqs <- get.seq(blast$hit.tbl$acc[1:10]) seqs$id [1] "5X3X_A" "5X3X_B" "5X3X_AA" "5X3X_BB" "5X41_A" "5X41_B" "5X41_C" "5X41_D" "5X40_A" "5X40_B" ## Note that get.pdb() does not have this problem. files <- get.pdb(blast$hit.tbl$pdb.id[1:10], split=TRUE) files [1] "./split_chain/5X3X_A.pdb" "./split_chain/5X3X_B.pdb" "./split_chain/5X3X_a.pdb" [4] "./split_chain/5X3X_b.pdb" "./split_chain/5X41_A.pdb" "./split_chain/5X41_B.pdb" [7] "./split_chain/5X41_C.pdb" "./split_chain/5X41_D.pdb" "./split_chain/5X40_A.pdb" [10] "./split_chain/5X40_B.pdb"
Although this is not a bio3d "bug", we might consider shooting a warning to let users be aware of this situation.
-
- marked as proposal
Not a bug.
- Log in to comment
Hi,
Bio3D doesn't change the file names of your PDBs - it is the 'nr' database from which the sequences were retrieved. I don't know why they replace 'a' to 'AA' but one possible reason is that the API might be case sensitive (i.e. can't distinguish 'a' from 'A').
If the name mismatch caused any problems or errors, try rename the column and row names of 'ide' with 'blast$hit.tbl$pdb.id', e.g.
rownames(ide) <- c('seq1', blast$hit.tbl$pdb.id)
. Let me know if you still get errors after renaming.