get.seq changing PDB IDs leading to seg fault

Issue #509 new

coparks2012 created an issue 2017-10-08

Hello,

I am performing a blast search on the following sequence. For the purposes of this, lets assume it it contained in a file called "fasta_file.fasta":

>575584.HMPREF0010_00009
MLPPVPKTKSSEVTDIINSAVLTGSISEFQYFRCKRLLNDIKETEPLDWFLLSNSIIEMYFDNPILAHQYAREVLKISNS
VSILSNLYFVFLSSVDFSSANENIDKIISLCSKQNLPLESFIPIDFKPITYFLDGILNDDLNYYKRFKKEDFNEFIQFFE
IKNKLEIDSRVLKHIGSILFKCFNSRNVRCRKYEYSFIDDEFLILLYVDRSFDEIDAMNSEIFSKCYDEGLIDELNKLSY
FIIPYEVGVD

My code for analysis is as follows:

  seq    <- read.fasta( fasta_file.fasta )
  blast  <- blast.pdb(seq$ali[1,], database = 'pdb', time.out = NULL)
  seqs  <- get.seq(blast$hit.tbl$acc)
  aln     <- seqaln(seqbind(seq$ali[1,], seqs))
  ide    <- seqidentity(aln)

 for (k in 1:length(blast$hit.tbl$evalue) ){
        if(blast$hit.tbl$evalue[k] < 0.0001) {
        pdbname <- blast$hit.tbl$pdb.id[k]

        if(ide["seq1",pdbname] >= 0.7){
        file <- get.pdb(blast$hit.tbl$pdb.id[k], path='output', split= FALSE)
       }

        }
     }

However, the code loves to segfault at the following line

        if(ide["seq1",pdbname] >= 0.7){

I have determined that this is due to blast hit pdb ID 5X3X. In the blast$hit.tbl$pdb.id array, four names are reported: 5X3X_A, 5X3X_B, 5X3X_a, 5X3X_b. However, in the array seqs, the following ids are present: 5X3X_A, 5X3X_B, 5X3X_AA, 5X3X_BB.

Can someone explain to me why the get.seq call is renaming the PDB filenames, and switching lower case letters with dual upper case letters? Are there any other of these filename chainging conventions that I should be aware of? I am about to do a computationally demanding blast search, and I would hate if the code segfaults have way through because of an issue like this.

Comments (6)

Xinqiu Yao
Hi,

Bio3D doesn't change the file names of your PDBs - it is the 'nr' database from which the sequences were retrieved. I don't know why they replace 'a' to 'AA' but one possible reason is that the API might be case sensitive (i.e. can't distinguish 'a' from 'A').

If the name mismatch caused any problems or errors, try rename the column and row names of 'ide' with 'blast$hit.tbl$pdb.id', e.g. rownames(ide) <- c('seq1', blast$hit.tbl$pdb.id). Let me know if you still get errors after renaming.
- 2017-10-08T23:08:05+00:00

coparks2012 reporter

Thank you. Unfortunately, the rownames command doesn't seem to induce a change in my vector. Also, I have noticed that length of the vector seqs is not the same as blast$hit.tbl$pdb.id. Could you explain why this is?

I have engineered a fix:

alphabet_lower=c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z")
alphabet_upper=c("_AA","_BB","_CC","_DD","_EE","_FF","_GG","_HH","_II","_JJ","_KK","_LL","_MM","_NN","_OO","_PP","_QQ","_RR","_SS","_TT","_UU","_VV","_WW","_XX","_YY","_ZZ")
 pdbname <- blast$hit.tbl$pdb.id[k]

 for( z in 1:length(alphabet_lower) ){
             if(grepl( alphabet_lower[z], pdbname) ==TRUE){
                pdbname = substr(pdbname,1,4)
                pdbname = paste(pdbname, alphabet_upper[z], sep="")
            }
     }

However, I am wondering if there any other situations like this, where they may be a difference in filenames. Would it possible to suggest a fix where bio3d handled these kind of scenarios internally?

2017-10-09T17:42:42+00:00

Barry Grant

A bit late seeing this but as was already noted you are jumping between databases with different nameing conventions (PDB and NR). Probably best to stick to one database throughout. Again it is not bio3d that is doing this but the places you are requesting data from.

See example code below which may be doing similar things to what you were in the first post. Hope this helps!

library(bio3d)
s <- read.fasta( "fasta_file.fasta" )
raw <- blast.pdb(s)


## Side-Note: 
## The evalues are quite poor for all our hits and the %ide is low also ~30%
> summary(raw$hit.tbl$evalue)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.730   0.800   2.300   2.181   2.600   9.700

> summary(raw$hit.tbl$identity)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  21.78   26.92   26.92   29.88   35.00   38.78

## Download sequence records
aa <- get.seq( raw$hit.tbl$pdb.id )

## Align sequences along with our query
aln <- seqaln( seqbind(s, aa)) 


## Side-note: This alignment will be of questionable quality with respect to your query sequence

## Sequence identity matrix 
ide    <- seqidentity(aln)

## Note again that your query sequence has very low values in this matrix! 
> summary(ide[1,-1])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0850  0.1070  0.1090  0.1138  0.1170  0.1400

## No hit is above 0.7 (70%) but just to demo your for loop replacment we can use 12%
names( which(ide[1,-1] > 0.12) )

[1] "3QGM_A" "3QGM_B" "3QGM_C" "3QGM_D" "4JDP_A" "4JDP_B"

## obviously you can do get PDB from there...

2017-10-13T21:54:48+00:00

coparks2012 reporter
Thank you, professor Grant. Could you please comment further on why the alignment will be of questionable quality?
- 2017-10-24T02:23:47+00:00

Xinqiu Yao

marked as minor
changed component to ToDo
changed version to v2.3 [devel]
assigned issue to

Xinqiu Yao

Cautions must be applied if blast.pdb() returns hits of large systems containing many chains. In these systems, chains may be labeled with both upper and lower cases, such as 'A' and 'a', and both chains may show up in the blast hit table. If this occurs, get.seq() using blast$hit.tbl$pdb.id may return incomplete results because the APIs for fetching sequences are case insensitive. For example,

pdb <- read.pdb('5X3X')
pdb <- trim(pdb, chain='A')
aa <- pdbseq(pdb)
blast <- blast.pdb(aa, database='pdb')

blast$hit.tbl[1:10, ]

## Note row #1 and #3.

        queryid                subjectids identity positives alignmentlength mismatches gapopens q.start
1  Query_131483  gi|1175129182|pdb|5X3X|A  100.000       100             280          0        0       1
2  Query_131483  gi|1175129183|pdb|5X3X|B  100.000       100             280          0        0       1
3  Query_131483 gi|1175129186|pdb|5X3X|AA  100.000       100             280          0        0       1
4  Query_131483 gi|1175129187|pdb|5X3X|BB  100.000       100             280          0        0       1
5  Query_131483  gi|1182607627|pdb|5X41|A  100.000       100             280          0        0       1
6  Query_131483  gi|1182607628|pdb|5X41|B  100.000       100             280          0        0       1
7  Query_131483  gi|1182607631|pdb|5X41|C  100.000       100             280          0        0       1
8  Query_131483  gi|1182607632|pdb|5X41|D  100.000       100             280          0        0       1
9  Query_131483  gi|1175129190|pdb|5X40|A   99.643       100             280          1        0       1
10 Query_131483  gi|1175129191|pdb|5X40|B   99.643       100             280          1        0       1
   q.end s.start s.end evalue bitscore mlog.evalue pdb.id        acc
1    280       1   280      0      536    709.1962 5X3X_A 1175129182
2    280       1   280      0      536    709.1962 5X3X_B 1175129183
3    280       1   280      0      536    709.1962 5X3X_a 1175129186
4    280       1   280      0      536    709.1962 5X3X_b 1175129187
5    280       1   280      0      536    709.1962 5X41_A 1182607627
6    280       1   280      0      536    709.1962 5X41_B 1182607628
7    280       1   280      0      536    709.1962 5X41_C 1182607631
8    280       1   280      0      536    709.1962 5X41_D 1182607632
9    280      13   292      0      535    709.1962 5X40_A 1175129190
10   280      13   292      0      535    709.1962 5X40_B 1175129191

## Use NCBI server and "nr" database. Note that we lose entries 
##    with lower-case chain IDs such as "5X3X_a".
seqs.nr <- get.seq(blast$hit.tbl$pdb.id[1:10])
seqs.nr$id
[1] "5X3X_A" "5X3X_B" "5X41_A" "5X41_B" "5X41_C" "5X41_D" "5X40_A" "5X40_B"

## Use EBI and "pdb" database. The same results.
## Side-note: The naming convention is different between databases.
seqs.pdb <- get.seq(blast$hit.tbl$pdb.id[1:10], db="pdb")
seqs.pdb$id
[1] "PDB:5X3X_A" "PDB:5X3X_B" "PDB:5X41_A" "PDB:5X41_B" "PDB:5X41_C" "PDB:5X41_D" "PDB:5X40_A" "PDB:5X40_B"

## The only way to get a full list is to use "acc" instead of 
##    "pdb.id" and the "nr" database. However, sequence ids 
##    will change for lower-case entries from e.g. "a" to "AA".
seqs <- get.seq(blast$hit.tbl$acc[1:10])
seqs$id
 [1] "5X3X_A"  "5X3X_B"  "5X3X_AA" "5X3X_BB" "5X41_A"  "5X41_B"  "5X41_C"  "5X41_D"  "5X40_A"  "5X40_B" 

## Note that get.pdb() does not have this problem.
files <- get.pdb(blast$hit.tbl$pdb.id[1:10], split=TRUE)
files
 [1] "./split_chain/5X3X_A.pdb" "./split_chain/5X3X_B.pdb" "./split_chain/5X3X_a.pdb"
 [4] "./split_chain/5X3X_b.pdb" "./split_chain/5X41_A.pdb" "./split_chain/5X41_B.pdb"
 [7] "./split_chain/5X41_C.pdb" "./split_chain/5X41_D.pdb" "./split_chain/5X40_A.pdb"
[10] "./split_chain/5X40_B.pdb"

Although this is not a bio3d "bug", we might consider shooting a warning to let users be aware of this situation.

2018-01-10T21:40:02+00:00

Xinqiu Yao
- marked as proposal
Not a bug.
- 2018-02-04T20:30:33+00:00
Log in to comment

Assignee: Xinqiu Yao

Type: proposal

Priority: minor

Status: new

Component: ToDo

Version: v2.3 [devel]

Votes: 0

Watchers: 1