sequence analysis question

Hello,

I have a multi-sequence FASTA file containing the protein sequences of the tuberculosis proteome. I would like to run a BLAST search and determine whether any of the tuberculosis sequences share at least 50% sequence identity with any protein in the human proteome, over an alignment covering more than 50% of the bacterial query, with an E-value below 0.0001.

So far, I have this code:

library(bio3d)

seq    <- read.fasta('test.fa')
print( length(seq$id) )

for (i in seq_along(seq$id)) {

    print( c('performing blast search for sequence:', i) )
    blast <- blast.pdb(seq$ali[i,], database = 'swissprot', time.out = NULL)

    # test each hit individually: E-value cutoff first, then human origin
    for (k in seq_along(blast$hit.tbl$evalue)) {
        if (blast$hit.tbl$evalue[k] < 0.0001) {
            if (grepl('HUMAN', blast$hit.tbl$subjectids[k])) {
                print( c('we have a match', blast$hit.tbl$subjectids[k]) )
            }
        }
    }

}

In the code above, I can check whether a hit has an E-value < 0.0001. However, I am not sure this is the best way to determine whether the hit comes from the human proteome, or whether the same method would work if I BLAST against the nr database. I am also not sure how to then determine the % sequence identity between my query sequence and the hit, or the length of the aligned region. I have used the seqidentity function before, but that was after BLASTing against the PDB database and downloading all the files. As I am not BLASTing against the PDB here, I am not sure how to proceed.
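To make the three criteria concrete, here is a rough sketch of how the filtering might look once a hit table is available. It rests on assumptions: that the hit table is a data frame with columns named subjectids, identity (in percent), alignmentlength and evalue (worth confirming with colnames(blast$hit.tbl)), and that Swiss-Prot subject IDs carry the `_HUMAN` entry-name suffix. The helper filter_human_hits and the mock table are hypothetical and only there for illustration; coverage is approximated as alignment length divided by query length.

```r
# Hypothetical helper (not part of bio3d): keep hits that pass all cutoffs.
# Assumes `hits` is a data frame with columns subjectids, identity (percent),
# alignmentlength and evalue; check colnames(blast$hit.tbl) before relying on it.
filter_human_hits <- function(hits, query_len,
                              min_identity = 50,
                              min_coverage = 0.5,
                              max_evalue   = 1e-4) {
    identity <- as.numeric(hits$identity)
    # approximate query coverage: aligned length relative to the query length
    coverage <- as.numeric(hits$alignmentlength) / query_len
    evalue   <- as.numeric(hits$evalue)
    # Swiss-Prot entry names end in an organism mnemonic such as _HUMAN
    is_human <- grepl("_HUMAN", hits$subjectids, fixed = TRUE)

    keep <- is_human & evalue < max_evalue &
            identity >= min_identity & coverage > min_coverage
    cbind(hits[keep, , drop = FALSE], coverage = coverage[keep])
}

# Made-up hit table for illustration only (not real BLAST output).
mock_hits <- data.frame(
    subjectids      = c("sp|P00001|AAAA_HUMAN", "sp|P00002|BBBB_MOUSE",
                        "sp|P00003|CCCC_HUMAN", "sp|P00004|DDDD_HUMAN"),
    identity        = c(62.5, 80.0, 35.0, 55.0),
    alignmentlength = c(180, 190, 170, 60),
    evalue          = c(1e-30, 1e-40, 1e-10, 1e-6),
    stringsAsFactors = FALSE
)

result <- filter_human_hits(mock_hits, query_len = 200)
print(result)
```

Inside the loop, query_len could be taken as sum(seq$ali[i, ] != "-"), assuming read.fasta pads shorter sequences with "-". The `_HUMAN` test relies on Swiss-Prot entry names, so it would probably not carry over to nr, where the subject IDs do not include an organism mnemonic.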
