sequence analysis question

Issue #498 resolved
coparks2012 created an issue

Hello,

I have a multiple sequence fasta file, containing the sequences of proteins in the tuberculosis proteome. I would like to perform a blast search, and determine if any of the tuberculosis sequences share a 50% sequence identity, for a length of more than 50% of the bacterial query, with an E-value less than 0.0001, with any protein in the humna proteome.

So far, I have this code:

library(bio3d)

seq    <- read.fasta('test.fa')
print( length(seq$id) )

for (i in 1:length(seq$id) ) {


    print( c('performing blast search for sequence:', i) )
    blast <- blast.pdb(seq$ali[i,], database = 'swissprot', time.out = NULL)


    for (k in 1:length(blast$hit.tbl$evalue) ){
        if(blast$hit.tbl$evalue < 0.0001) {
            if( grepl('HUMAN', blast$hit.tbl$subjectids[k]) ){
            print( c('we have a match', blast$hit.tbl$subjectids[k] ) ) 
        }
        }
     }

}

In the code above, I can determine if the hit is e_value < 0.0001. However, I am not sure if this is the best way to determine if the hit is from the human proteome, and whether the method would apply if I blast against the nr database. Also, I am also not sure how to then determine what the % sequence identity is between my query sequence and this hit, and what the length is. I have used the seqidentity function before, but this is after I blasted against the PDB database, and downloaded all the files. As I am not blasting against the pdb here, I am not sure how to proceed.

Comments (7)

  1. Xinqiu Yao

    Hi,

    You can download sequences according to the hit.tbl and get sequence identity or length directly. For example,

    seqs <- get.seq(blast$hit.tbl$acc)
    aln <- seqaln(seqbind(seqs, seq$ali[i, ]))
    seqidentity(aln)
    

    About human genome, yes if you change the database your method might not work since the header of hits could change. I guess the best way is to search against just a 'human genome' database. I am not so familiar with genomics and so have no idea if such database exists or not.

    Do others have any idea about the human genome?

  2. Barry Grant

    You can do this in bio3d along the lines suggested but if you are going to be doing this type of potential ortholog mapping often then you might want to explore tools designed to do this more robustly. I have had good results with shmlast.

    https://pypi.python.org/pypi/shmlast

    You can read the results table from shmlast into R/bio3d for further selection and analysis.

  3. coparks2012 reporter

    Thank, you.

    Xin-Qui Yao, could you please tell me how to obtain what the length is between my query sequence and the blast hit?

  4. coparks2012 reporter

    Does anyone have anyone else have an idea about how to tell the organism of the nr database blast hits?

  5. Xinqiu Yao

    As I said, you can download sequences and lengths will be very easy to know. For example, for a single sequence (a character vector) you can use length(aa), or for a set of sequences in 'fasta' format, you can use apply(aln$ali, 1, length).

    For the question regarding organism, you might consider exploring other tools as advised. Besides the suggested website above, you can also try NCBI Blast directly, which I found already provides searching against human genome https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&BLAST_SPEC=OGP__9606__9558&LINK_LOC=blasttab&LAST_PAGE=blastn.

  6. Barry Grant

    Again, bio3d is probably not the best tool to do this as we don't allow you to use other databases beyond those already noted in the docs for blast.pdb().

  7. Log in to comment