sequence analysis question
Hello,
I have a multiple sequence fasta file, containing the sequences of proteins in the tuberculosis proteome. I would like to perform a blast search, and determine if any of the tuberculosis sequences share a 50% sequence identity, for a length of more than 50% of the bacterial query, with an E-value less than 0.0001, with any protein in the humna proteome.
So far, I have this code:
library(bio3d)
seq <- read.fasta('test.fa')
print( length(seq$id) )
for (i in 1:length(seq$id) ) {
print( c('performing blast search for sequence:', i) )
blast <- blast.pdb(seq$ali[i,], database = 'swissprot', time.out = NULL)
for (k in 1:length(blast$hit.tbl$evalue) ){
if(blast$hit.tbl$evalue < 0.0001) {
if( grepl('HUMAN', blast$hit.tbl$subjectids[k]) ){
print( c('we have a match', blast$hit.tbl$subjectids[k] ) )
}
}
}
}
In the code above, I can determine if the hit is e_value < 0.0001. However, I am not sure if this is the best way to determine if the hit is from the human proteome, and whether the method would apply if I blast against the nr database. Also, I am also not sure how to then determine what the % sequence identity is between my query sequence and this hit, and what the length is. I have used the seqidentity function before, but this is after I blasted against the PDB database, and downloaded all the files. As I am not blasting against the pdb here, I am not sure how to proceed.
Comments (7)
-
-
You can do this in bio3d along the lines suggested but if you are going to be doing this type of potential ortholog mapping often then you might want to explore tools designed to do this more robustly. I have had good results with shmlast.
https://pypi.python.org/pypi/shmlast
You can read the results table from shmlast into R/bio3d for further selection and analysis.
-
reporter Thank, you.
Xin-Qui Yao, could you please tell me how to obtain what the length is between my query sequence and the blast hit?
-
reporter Does anyone have anyone else have an idea about how to tell the organism of the nr database blast hits?
-
As I said, you can download sequences and lengths will be very easy to know. For example, for a single sequence (a character vector) you can use
length(aa)
, or for a set of sequences in 'fasta' format, you can useapply(aln$ali, 1, length)
.For the question regarding organism, you might consider exploring other tools as advised. Besides the suggested website above, you can also try NCBI Blast directly, which I found already provides searching against human genome https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&BLAST_SPEC=OGP__9606__9558&LINK_LOC=blasttab&LAST_PAGE=blastn.
-
Again, bio3d is probably not the best tool to do this as we don't allow you to use other databases beyond those already noted in the docs for
blast.pdb()
. -
- changed status to resolved
- Log in to comment
Hi,
You can download sequences according to the hit.tbl and get sequence identity or length directly. For example,
About human genome, yes if you change the database your method might not work since the header of hits could change. I guess the best way is to search against just a 'human genome' database. I am not so familiar with genomics and so have no idea if such database exists or not.
Do others have any idea about the human genome?