hmmer() and blast() output

Issue #296 resolved
Lars Skjærven created an issue

at some point the output of blast.pdb() and hmmer() should be made equivalent - it is not ideal to have these so different in their output data structures.

Comments (10)

  1. Lars Skjærven reporter

    It might be. The problem is that output differs depending on the type of search you perform. We could potentially rename the most important columns: acc (pdb.id in blast), bitscore (score in hmmer), and evalue. These three, which I think is always present, can be called the same: acc, bitscore, evalue in both hmmer and blast.

    What do you think?

  2. Barry Grant

    The inconsistency is annoying. However, changing blast.pdb() pdb.id for acc will break lots of things in our workflows - and the vignettes.

    Note that the **blast.pdb() output has a gi.id and a pdb.id. We could keep the pdb.id (for backward compatibility and utility) and consider changing the gi.id for acc.

    The other thing to change in hmmer(), besides those already listed by Lars, is hit.tbl and raw. The later can be renamed in both or either function.

  3. Lars Skjærven reporter

    I'm not in favour of adding additional matrices / data frames to the output of hmmer. The output of hmmer is currently a dataframe with each row being a 'hit'. In my opinion this is cleaner than what is currently provided by blast.pdb. For the latter I think the matrix hit.tbl is sufficient if we add columns pdb.id, mlog.evalue, and acc, and transform the output to a data frame. raw seems to contain redundant information to hit.tbljust with column subjectids parsed. If this is needed I suggest we add an argument raw=FALSE in which TRUE would output the raw matrix / data frame. In addition e.g. blast$hit.tbl[,"bitscore"] and blast$bitscore is the same - correct?

    We can certainly do a few name changes to hmmer which will make life easier. How about:

    • add column pdb.id, but only when db = "pdb" - this will be a copy of acc
    • add column bitscore to hmmer output - copy of score
    • add column mlog.evalue to hmmer output - calculated as -log(evalue)

    Alternatively to adding columns, we can rename them. Any thoughts?

  4. Barry Grant

    Sounds fine, lets not add the additional data.frames to hmmer().

    For blast.pdb() the raw matrix is returned because we have to do quite a bit of parsing - due to the way NCBI combines db entries. The parsing/unwrapping used to sometimes fail when there dbs were updated so the raw output was and still is useful in these cases. I don't really want to add a raw=FALSE option - Lets just return it as we already have it.

    Indeed, the blast hit.tbl should be a data.frame. This would also negate the need for the separate bitscore vector (if these are finally stored properly - i.e. not character as currently (note blast$hit.tbl[,"bitscore"] and blast$bitscore are not the same type which is silly).

    Note there is no hit.tbl$pdb.id col or $acc col. If we add them to hmmer() data.frame then we should probably have then in the blast.pdb() also.

    btw: where is the URL to go get results from hmmer() outside of bio3d? Is it possible to get one like with the blast.pdb() function?

    Otherwise, all the changes listed for hmmer() sound good:

    • add column pdb.id
    • add column bitscore
    • add column 'mlog.evalue`

    Would be cool and totally appropriate if the plot.blast() function could take the output from hmmer() - this would be the goal for getting these two function sets to the same place.

  5. Lars Skjærven reporter

    should be resolved with commit 26502c1

    > names(bl)
    [1] "hit.tbl" "raw"     "url"    
    > head(bl$hit.tbl)
          queryid               subjectids identity positives alignmentlength
    1 Query_96855  gi|157834528|pdb|2ABL|A      100       100             163
    2 Query_96855   gi|30749934|pdb|1OPK|A      100       100             162
    3 Query_96855   gi|93279684|pdb|2FO0|A      100       100             162
    4 Query_96855   gi|30749935|pdb|1OPL|A      100       100             162
    5 Query_96855   gi|30749936|pdb|1OPL|B      100       100             162
    6 Query_96855 gi|1002351501|pdb|5DC0|B      100       100             107
      mismatches gapopens q.start q.end s.start s.end    evalue bitscore
    1          0        0       1   163       1   163 1.20e-120      337
    2          0        0       2   163      37   198 3.19e-116      338
    3          0        0       2   163      34   195 5.49e-116      337
    4          0        0       2   163      76   237 2.47e-115      337
    5          0        0       2   163      76   237 2.47e-115      337
    6          0        0      57   163       1   107  9.18e-74      217
      mlog.evalue pdb.id        acc
    1    276.1279 2ABL_A  157834528
    2    265.9398 1OPK_A   30749934
    3    265.3969 2FO0_A   93279684
    4    263.8931 1OPL_A   30749935
    5    263.8931 1OPL_B   30749936
    6    168.1743 5DC0_B 1002351501
    
    
    > hmm <- hmmer(seq, db="pdb")
    > names(hmm)
    [1] "hit.tbl" "url" 
    
    > head(bl$hit.tbl)
          queryid               subjectids identity positives alignmentlength
    1 Query_96855  gi|157834528|pdb|2ABL|A      100       100             163
    2 Query_96855   gi|30749934|pdb|1OPK|A      100       100             162
    3 Query_96855   gi|93279684|pdb|2FO0|A      100       100             162
    4 Query_96855   gi|30749935|pdb|1OPL|A      100       100             162
    5 Query_96855   gi|30749936|pdb|1OPL|B      100       100             162
    6 Query_96855 gi|1002351501|pdb|5DC0|B      100       100             107
      mismatches gapopens q.start q.end s.start s.end    evalue bitscore
    1          0        0       1   163       1   163 1.20e-120      337
    2          0        0       2   163      37   198 3.19e-116      338
    3          0        0       2   163      34   195 5.49e-116      337
    4          0        0       2   163      76   237 2.47e-115      337
    5          0        0       2   163      76   237 2.47e-115      337
    6          0        0      57   163       1   107  9.18e-74      217
      mlog.evalue pdb.id        acc
    1    276.1279 2ABL_A  157834528
    2    265.9398 1OPK_A   30749934
    3    265.3969 2FO0_A   93279684
    4    263.8931 1OPL_A   30749935
    5    263.8931 1OPL_B   30749936
    6    168.1743 5DC0_B 1002351501
    
    
    
    > hmm <- hmmer(seq, db="pdb")                         
    > names(hmm)
    [1] "hit.tbl" "url"    
    
    > head(hmm$hit.tbl)
        name    acc                             arch archScore       archindex bias
    1 2abl_A 2abl_A            PF00018.26 PF00017.22         4 215298286219076  2.8
    2 2fo0_A 2fo0_A PF00018.26 PF00017.22 PF07714.15         5 167299098334483  3.0
    3 1opk_A 1opk_A PF00018.26 PF00017.22 PF07714.15         5 167299098334483  3.0
    4 1opl_A 1opl_A PF00018.26 PF00017.22 PF07714.15         5 167299098334483  3.0
    5 1opl_A 1opl_B PF00018.26 PF00017.22 PF07714.15         5 167299098334483  3.0
    7 3t04_A 3t04_A                       PF00017.22         2    104928242891  3.9
        dcl                                          desc   evalue
    1 67152                           ABL TYROSINE KINASE 2.5e-110
    2 68088 Proto-oncogene tyrosine-protein kinase ABL1 ( 1.7e-108
    3 69020   Proto-oncogene tyrosine-protein kinase ABL1 1.7e-108
    4 69953        proto-oncogene tyrosine-protein kinase 2.0e-108
    5 69953        proto-oncogene tyrosine-protein kinase 2.0e-108
    7 70885                  Tyrosine-protein kinase ABL1  1.4e-67
                                          extlink flags        kg ndom nincluded
    1 https://www.ebi.ac.uk/pdbe/entry/pdb/2abl_A     3 Eukaryota    1         1
    2 https://www.ebi.ac.uk/pdbe/entry/pdb/2fo0_A     3 Eukaryota    1         1
    3 https://www.ebi.ac.uk/pdbe/entry/pdb/1opk_A     3 Eukaryota    1         1
    4 https://www.ebi.ac.uk/pdbe/entry/pdb/1opl_A     3 Eukaryota    1         1
    5 https://www.ebi.ac.uk/pdbe/entry/pdb/1opl_A     3 Eukaryota    1         1
    7 https://www.ebi.ac.uk/pdbe/entry/pdb/3t04_A     3 Eukaryota    1         1
      niseqs nregions nreported    pvalue score    sindex      species taxid
    1     11        1         1 -264.9962 369.8 115806351 Homo sapiens  9606
    2     11        1         1 -260.7520 363.8 115905927 Homo sapiens  9606
    3     11        1         1 -260.7509 363.8 116233168 Mus musculus 10090
    4     11        1         1 -260.6023 363.6 115749607 Homo sapiens  9606
    5     11        1         1 -260.6023 363.6 115749607 Homo sapiens  9606
    7     11        1         1 -166.5537 230.8 115740259 Homo sapiens  9606
                                                                    taxlink pdb.id
    1 http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id= 2abl_A
    2 http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id= 2fo0_A
    3 http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id= 1opk_A
    4 http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id= 1opl_A
    5 http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id= 1opl_B
    7 http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id= 3t04_A
      bitscore mlog.evalue
    1    369.8    252.3681
    2    363.8    248.1486
    3    363.8    248.1486
    4    363.6    247.9860
    5    363.6    247.9860
    7    230.8    153.9367
    

    looks alright?

  6. Xinqiu Yao

    These outputs look very nice. I hope the removal of pdb.id in blast won't break examples and vignettes (I assume we always use plot.blast() to generate the pdb id list, which has been fixed accordingly in Lars' recent commit). Will have a test soon. Thanks for the update!

  7. Log in to comment