Blast and sequence identity question

Issue #497 new

Former user created an issue 2017-09-08

Hello,

I just started my work in Bio3D, and bioinformatics in general, and have some general questions about the output from blast, and how to best parse the sequence identity output.

Specifically, I am trying to do a blast search to return all crystal structures of the gamma specific isoform of PI3K (called PI3K gamma) in the PDB:

pdb   <- read.pdb('3ENE.pdb')
seq   <- pdbseq(pdb)
blast <- blast.pdb(seq, database = 'pdb', time.out= NULL)
hits  <- plot.blast(blast, cutoff=240)
files <- get.pdb(hits, path='raw_pdbs', split=TRUE)
pdbs  <- pdbaln(files

I find that many of the sequence identity values are in the 30-40% range, much lower than would be if all the returned blast results were of the gamma isoform. Indeed, I am getting many hits for the alpha, beta, and delta isoforms as well. I am using the following command to check the % sequence identity of the PDB files, with respect to each other.

pdbs$id <- substr(basename(pdbs$id),1, 6)
seqidentity(pdbs)

What is the best way for me to perform my blast search, and prune the results, such that the returned hits are all of the gamma isoform? Also, I would like to check the % sequence identity of every returned blast hit against the queried 3ENE.pdb sequence only. Currently, seqidentity(pdbs) returns a matrix of values, containing % sequence identity between blast hits as well, which is more information than I need, and complicates my analysis. What is the most efficient way for me to check the % sequence identity between my queried sequence, and my returned hits?

```

Comments (4)

Lars Skjærven

Hello and welcome to Bio3D. Your script looks good for your purpose, but you might have overlooked the functionality of plot.blast() here. Use this function and the cutoff argument to adjust the inclusion criteria. Using argument cutoff=600 you will get rid of those lower similarity hits.

Some suggestions which should answer your questions:

> hmm <- hmmer(get.seq("3ene_A"))
> hits <- plot(hmm, cutoff=600)
> files <- get.pdb(hits, path='raw_pdbs', split=TRUE, gzip=TRUE)

# annotate hits
> anno <- pdb.annotate(hits$pdb.id)

# most (or all?) is the gamma subunit
> grepl("gamma", tolower(anno$compound))
 [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[25]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[85]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

> pdbs  <- pdbaln(files)
> pdbs$id <- basename.pdb(pdbs$id)

> ide <- seqidentity(pdbs)
> summary(c(ide))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.9510  0.9950  0.9970  0.9931  0.9990  1.0000 

# seqide between query and all hits
> head(ide["3ene_A", ])
3ene_A 1e8y_A 4xz4_A 4xx5_A 3tjp_A 3ibe_A 
 1.000  0.999  1.000  1.000  0.999  0.996

2017-09-08T07:50:33+00:00

coparks2012

Hello, Lars. Your response has been very helpful. I have been playing around today, and have a couple more questions.

I would like to specify a moderately low cut off, loop through, check the sequence identity, and log those pdbs that must be removed, due to a low % sequence identity. This is done in the following code snippet:

number_of_files <- length(files)

#--- loop through and check seq identity
ide       <- seqidentity(pdbs)
seq_mask   <- integer( number_of_files )
remove_ids <- c()
num_remove <- 0


FileConn <- file('remove_pdbs.tsv', 'w')
for (i in 1:number_of_files ){    
    if( ide['raw_pdbs/split_chain/1E7U_A.pdb', i] < 0.90){

        seq_mask[i] <- 1
    num_remove  <- num_remove + 1
    remove_ids[num_remove] <- i

    write( hits$pdb.id[i], FileConn, append=TRUE )
    }
}

if(num_remove > 0){
     pdbs <- pdbs[-remove_ids]
     print( c("total number of files removed:", num_remove) )
}


#---find core, and make core.inds those w/elipsoid volume <= 1.0 A3
core <- core.find(pdbs)
core.inds <-print(core, vol=1.0)

This seems to work until the portion of the code that calls core. Perhaps I am not removing the indicies correctly from pdbs. Specifically, I error out, and get the following:

Error in core.find.pdbs(xyz, ...) : input 'pdbs' should either be:
               a list object from 'read.fasta.pdb'
               or a numeric 'xyz' matrix of aligned coordinates
Calls: core.find -> core.find.default -> core.find.pdbs

If I make the cut off 600, as you mentioned, the code proceeds through with no error. This is because I do not have to remove any files, as they all have a sequence identity % > 0.90. I would like to cluster my PDB files by RMSD, and extract the structures closest to each cluster average. I do this with the following code snippet:

xyz <- pdbfit( pdbs, core.inds, outpath ='aligned_files')


#---  compute RMSD, histogram RMSDs, cluster, and make dendogram
number_of_clusters <- 5
rd       <- rmsd(xyz)
hc.rd    <- hclust(as.dist(rd))


#pdbs$id  <- substr( basename(pdbs$id), 1, 6)
hist(rd, breaks= 40, xlab = 'RMSD(A)', main  = 'histogram of RMSD')
hclustplot(hc.rd, k=number_of_clusters)

###divide frames into k rows
grps <- cutree(hc.rd, k=number_of_clusters)


###calculate the mean of the group, and write out closest structure
for (i in 1:number_of_clusters ){ 
    gp.ave     <- colMeans(xyz[grps==i,])
    gp.min.ind <- which.min(rmsd(gp.ave, xyz[grps==i,]))
    gp.min.xyz <- xyz[grps==i,][gp.min.ind,]
    print( 'found min of group:' )
    print(  gp.min.ind )
    print( '---------' )
}

However, here I always get an error when the loop index i is equal to number of clusters. Specifically, on the last loop iteration, I get:

Error in colMeans(xyz[grps == i, ]) : 
  'x' must be an array of at least two dimensions
In addition: Warning messages:
1: In rmsd(gp.ave, xyz[grps == i, ]) :
  No indices provided, using the 824 non NA positions

2: In rmsd(gp.ave, xyz[grps == i, ]) :
  No indices provided, using the 746 non NA positions

3: In rmsd(gp.ave, xyz[grps == i, ]) :
  No indices provided, using the 808 non NA positions

4: In rmsd(gp.ave, xyz[grps == i, ]) :
  No indices provided, using the 798 non NA positions

Execution halted

What is it about the last loop iteration that throws this error?

Just in case you want to run the complete code, I will post the whole script. Just remember, I get the second error with cutoff=600, as I get an earlier error when cutoff=200 due to the core function.

library(bio3d)

blast_flag = 1
#---do this if you have not done blast already
if( blast_flag == 1){
    print('we are blasting')
    pdb     <- read.pdb('1E7U')
    seq     <- pdbseq(pdb)  
    blast   <- blast.pdb(seq, database = 'pdb', time.out= NULL)
    hits    <- plot.blast(blast, cutoff = 200)
    files   <- get.pdb(hits, path='raw_pdbs', split =TRUE)
    pdbs    <- pdbaln(files)
}else {
    print('we are not blasting. getting files')
    #---do this if you have already performed blast
    files   <- get.pdb(list.files('raw_pdbs', '.pdb$', full.names=TRUE), path = 'raw_pdbs/split_chain', split = TRUE)
    pdbs    <- pdbaln(files)
}
number_of_files <- length(files)

#--- loop through and check seq identity
ide       <- seqidentity(pdbs)
seq_mask   <- integer( number_of_files )
remove_ids <- c()
num_remove <- 0


FileConn <- file('remove_pdbs.tsv', 'w')
for (i in 1:number_of_files ){    
    if( ide['raw_pdbs/split_chain/1E7U_A.pdb', i] < 0.90){

        seq_mask[i] <- 1
    num_remove  <- num_remove + 1
    remove_ids[num_remove] <- i

    write( hits$pdb.id[i], FileConn, append=TRUE )
    }
}

if(num_remove > 0){
     pdbs <- pdbs[-remove_ids]
     print( c("total number of files removed:", num_remove) )
}


#---find core, and make core.inds those w/elipsoid volume <= 1.0 A3
core <- core.find(pdbs)
core.inds <-print(core, vol=1.0)


###----write out core positions to PDB file
write.pdb(xyz=pdbs$xyz[1, core.inds$xyz], file = 'quick_core.pdb')


###--- align all pdbs to core.inds. dump new coordinates to xyz
xyz <- pdbfit( pdbs, core.inds, outpath ='aligned_files')


#---  compute RMSD, histogram RMSDs, cluster, and make dendogram
number_of_clusters <- 5
rd       <- rmsd(xyz)
hc.rd    <- hclust(as.dist(rd))


#pdbs$id  <- substr( basename(pdbs$id), 1, 6)
hist(rd, breaks= 40, xlab = 'RMSD(A)', main  = 'histogram of RMSD')
hclustplot(hc.rd, k=number_of_clusters)

###divide frames into k rows
grps <- cutree(hc.rd, k=number_of_clusters)


###calculate the mean of the group, and write out closest structure
for (i in 1:number_of_clusters ){ 
    gp.ave     <- colMeans(xyz[grps==i,])
    gp.min.ind <- which.min(rmsd(gp.ave, xyz[grps==i,]))
    gp.min.xyz <- xyz[grps==i,][gp.min.ind,]
    print( 'found min of group:' )
    print(  gp.min.ind )
    print( '---------' )
}

2017-09-08T22:10:49+00:00

Lars Skjærven

Use function trim() to subset the pdbs object, e.g:

sele <- ide["1e7u", ] > 0.90)
pdbs2 <- trim(pdbs, row.inds=sele)

check the content of pdbs to explore why you can not subset like you do in your first code chunk.

In the last example do you want the cluster representatives? In that case I would look at the rmsd matrix, and not the xyz matrix of cartesian coordatinates:

  rep.inds <- rep(NA, length(unq))
  unq <- unique(grps)

  for(i in 1:length(uniq)) {
    grp <- unq[i]
    inds <- which(grps==grp)

    if(length(inds) > 1) {
      tmp <- rd[inds, inds]

      m <- rowMeans(tmp)
      rep.ind <- which.min(m)
      rep.ind <- all.inds[inds][rep.ind]
    }
    else {
      rep.ind <- inds[1]
    }

2017-09-09T06:11:20+00:00

coparks2012
Thank, you. Yes, I am trying to extract the cluster representatives. My goal is to find, amongst the blast hits, the structures that are most different, and then use these for ensemble docking. Now that I am thinking about it, would it be better to take RMSDs of active site residues only, and do the clustering based off those? If so, how would I trim my xyz matrix to consist of only the xyz coordinates of active site residues from each pdb? I think that would be the only modification required. I naively tried the following to get the binding site residues:
```
    pdb     <- read.pdb('1E7U')
    seq     <- pdbseq(pdb)  
    blast   <- blast.pdb(seq, database = 'pdb', time.out= NULL)
    hits    <- plot.blast(blast, cutoff = 200)
    files   <- get.pdb(hits, path='raw_pdbs', split =TRUE)
    pdbs    <- pdbaln(files)

    bs <- binding.site( pdbs )
```
Active site issue aside, I think I understand your code in general. However, I get the following error:
```
Error: object 'all.inds' not found
```
It also appears that you do not use rep.inds anywhere in the code you posted.

My interpretation of your code is that you find the pdb with the lowest RMSD on average with any other PDB within the grp, and return this as your structure. Based off my interpretation of your code snippet, I modified and arrived at the following code:
```
grps <- cutree(hc.rd, k=number_of_clusters)
unq <- unique(grps)
for (i in 1:length(unq) ){

      grp  <- i
      inds <- which(grps==grp)


      if(length(inds) > 1) {
          tmp <- rd[inds, inds]
          m       <- rowMeans(tmp)
          rep.ind <- which.min(m)

             print( 'found structure:' )
             print(  c(rep.ind, pdbs2$id[ inds[rep.ind] ] ) )
             print( '---------' )
    }
    else {
             print( 'found structure:' )
             print(  pdbs2$id[ inds[1] ] )
             print( '---------' )
    }
}
```
Can you please comment on whether I modified your code correctly, and arrived at a correct result? I am not asking you to debug, but rather whether the code is conceptually correct for extracting the representative cluster.

I have noticed that I will frequently get the same PDB, but different chains, as the representative structure of two different groups. For instance, if I run the full code with number_of_clusters = 5, and 90% sequence identity cut off, I will get 4BFR_A.pdb and 4BFR_B.pdb as the representative structure for two different clusters. This seems incorrect to me. Is my intuition that this is indeed incorrect wrong here?
- 2017-09-10T22:40:41+00:00
Log in to comment

Assignee: –

Type: task

Priority: major

Status: new

Component: –

Version: –

Votes: 0

Watchers: None