- changed version to v2.3 [devel]
hmmer() and blast() output
at some point the output of blast.pdb() and hmmer() should be made equivalent - it is not ideal to have these so different in their output data structures.
Comments (10)
-
-
- changed version to v2.4/3.0 [future]
Let me know if this can be done quickly...
-
reporter It might be. The problem is that output differs depending on the type of search you perform. We could potentially rename the most important columns: acc (pdb.id in blast), bitscore (score in hmmer), and evalue. These three, which I think is always present, can be called the same: acc, bitscore, evalue in both hmmer and blast.
What do you think?
-
The inconsistency is annoying. However, changing blast.pdb()
pdb.id
foracc
will break lots of things in our workflows - and the vignettes.Note that the **blast.pdb() output has a
gi.id
and apdb.id
. We could keep thepdb.id
(for backward compatibility and utility) and consider changing thegi.id
foracc
.The other thing to change in hmmer(), besides those already listed by Lars, is
hit.tbl
andraw
. The later can be renamed in both or either function. -
reporter I'm not in favour of adding additional matrices / data frames to the output of hmmer. The output of hmmer is currently a dataframe with each row being a 'hit'. In my opinion this is cleaner than what is currently provided by blast.pdb. For the latter I think the matrix
hit.tbl
is sufficient if we add columnspdb.id
,mlog.evalue
, andacc
, and transform the output to a data frame.raw
seems to contain redundant information tohit.tbl
just with columnsubjectids
parsed. If this is needed I suggest we add an argumentraw=FALSE
in which TRUE would output the raw matrix / data frame. In addition e.g.blast$hit.tbl[,"bitscore"]
andblast$bitscore
is the same - correct?We can certainly do a few name changes to hmmer which will make life easier. How about:
- add column
pdb.id
, but only whendb = "pdb"
- this will be a copy ofacc
- add column
bitscore
to hmmer output - copy ofscore
- add column
mlog.evalue
to hmmer output - calculated as-log(evalue)
Alternatively to adding columns, we can rename them. Any thoughts?
- add column
-
Sounds fine, lets not add the additional data.frames to hmmer().
For blast.pdb() the
raw
matrix is returned because we have to do quite a bit of parsing - due to the way NCBI combines db entries. The parsing/unwrapping used to sometimes fail when there dbs were updated so theraw
output was and still is useful in these cases. I don't really want to add araw=FALSE
option - Lets just return it as we already have it.Indeed, the blast
hit.tbl
should be a data.frame. This would also negate the need for the separatebitscore
vector (if these are finally stored properly - i.e. not character as currently (noteblast$hit.tbl[,"bitscore"]
andblast$bitscore
are not the same type which is silly).Note there is no
hit.tbl$pdb.id
col or $acc col. If we add them to hmmer() data.frame then we should probably have then in the blast.pdb() also.btw: where is the URL to go get results from hmmer() outside of bio3d? Is it possible to get one like with the blast.pdb() function?
Otherwise, all the changes listed for hmmer() sound good:
- add column
pdb.id
- add column
bitscore
- add column 'mlog.evalue`
Would be cool and totally appropriate if the plot.blast() function could take the output from hmmer() - this would be the goal for getting these two function sets to the same place.
- add column
-
reporter should be resolved with commit 26502c1
> names(bl) [1] "hit.tbl" "raw" "url" > head(bl$hit.tbl) queryid subjectids identity positives alignmentlength 1 Query_96855 gi|157834528|pdb|2ABL|A 100 100 163 2 Query_96855 gi|30749934|pdb|1OPK|A 100 100 162 3 Query_96855 gi|93279684|pdb|2FO0|A 100 100 162 4 Query_96855 gi|30749935|pdb|1OPL|A 100 100 162 5 Query_96855 gi|30749936|pdb|1OPL|B 100 100 162 6 Query_96855 gi|1002351501|pdb|5DC0|B 100 100 107 mismatches gapopens q.start q.end s.start s.end evalue bitscore 1 0 0 1 163 1 163 1.20e-120 337 2 0 0 2 163 37 198 3.19e-116 338 3 0 0 2 163 34 195 5.49e-116 337 4 0 0 2 163 76 237 2.47e-115 337 5 0 0 2 163 76 237 2.47e-115 337 6 0 0 57 163 1 107 9.18e-74 217 mlog.evalue pdb.id acc 1 276.1279 2ABL_A 157834528 2 265.9398 1OPK_A 30749934 3 265.3969 2FO0_A 93279684 4 263.8931 1OPL_A 30749935 5 263.8931 1OPL_B 30749936 6 168.1743 5DC0_B 1002351501 > hmm <- hmmer(seq, db="pdb") > names(hmm) [1] "hit.tbl" "url" > head(bl$hit.tbl) queryid subjectids identity positives alignmentlength 1 Query_96855 gi|157834528|pdb|2ABL|A 100 100 163 2 Query_96855 gi|30749934|pdb|1OPK|A 100 100 162 3 Query_96855 gi|93279684|pdb|2FO0|A 100 100 162 4 Query_96855 gi|30749935|pdb|1OPL|A 100 100 162 5 Query_96855 gi|30749936|pdb|1OPL|B 100 100 162 6 Query_96855 gi|1002351501|pdb|5DC0|B 100 100 107 mismatches gapopens q.start q.end s.start s.end evalue bitscore 1 0 0 1 163 1 163 1.20e-120 337 2 0 0 2 163 37 198 3.19e-116 338 3 0 0 2 163 34 195 5.49e-116 337 4 0 0 2 163 76 237 2.47e-115 337 5 0 0 2 163 76 237 2.47e-115 337 6 0 0 57 163 1 107 9.18e-74 217 mlog.evalue pdb.id acc 1 276.1279 2ABL_A 157834528 2 265.9398 1OPK_A 30749934 3 265.3969 2FO0_A 93279684 4 263.8931 1OPL_A 30749935 5 263.8931 1OPL_B 30749936 6 168.1743 5DC0_B 1002351501 > hmm <- hmmer(seq, db="pdb") > names(hmm) [1] "hit.tbl" "url" > head(hmm$hit.tbl) name acc arch archScore archindex bias 1 2abl_A 2abl_A PF00018.26 PF00017.22 4 215298286219076 2.8 2 2fo0_A 2fo0_A PF00018.26 PF00017.22 PF07714.15 5 167299098334483 3.0 3 1opk_A 1opk_A PF00018.26 PF00017.22 PF07714.15 5 167299098334483 3.0 4 1opl_A 1opl_A PF00018.26 PF00017.22 PF07714.15 5 167299098334483 3.0 5 1opl_A 1opl_B PF00018.26 PF00017.22 PF07714.15 5 167299098334483 3.0 7 3t04_A 3t04_A PF00017.22 2 104928242891 3.9 dcl desc evalue 1 67152 ABL TYROSINE KINASE 2.5e-110 2 68088 Proto-oncogene tyrosine-protein kinase ABL1 ( 1.7e-108 3 69020 Proto-oncogene tyrosine-protein kinase ABL1 1.7e-108 4 69953 proto-oncogene tyrosine-protein kinase 2.0e-108 5 69953 proto-oncogene tyrosine-protein kinase 2.0e-108 7 70885 Tyrosine-protein kinase ABL1 1.4e-67 extlink flags kg ndom nincluded 1 https://www.ebi.ac.uk/pdbe/entry/pdb/2abl_A 3 Eukaryota 1 1 2 https://www.ebi.ac.uk/pdbe/entry/pdb/2fo0_A 3 Eukaryota 1 1 3 https://www.ebi.ac.uk/pdbe/entry/pdb/1opk_A 3 Eukaryota 1 1 4 https://www.ebi.ac.uk/pdbe/entry/pdb/1opl_A 3 Eukaryota 1 1 5 https://www.ebi.ac.uk/pdbe/entry/pdb/1opl_A 3 Eukaryota 1 1 7 https://www.ebi.ac.uk/pdbe/entry/pdb/3t04_A 3 Eukaryota 1 1 niseqs nregions nreported pvalue score sindex species taxid 1 11 1 1 -264.9962 369.8 115806351 Homo sapiens 9606 2 11 1 1 -260.7520 363.8 115905927 Homo sapiens 9606 3 11 1 1 -260.7509 363.8 116233168 Mus musculus 10090 4 11 1 1 -260.6023 363.6 115749607 Homo sapiens 9606 5 11 1 1 -260.6023 363.6 115749607 Homo sapiens 9606 7 11 1 1 -166.5537 230.8 115740259 Homo sapiens 9606 taxlink pdb.id 1 http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id= 2abl_A 2 http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id= 2fo0_A 3 http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id= 1opk_A 4 http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id= 1opl_A 5 http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id= 1opl_B 7 http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id= 3t04_A bitscore mlog.evalue 1 369.8 252.3681 2 363.8 248.1486 3 363.8 248.1486 4 363.6 247.9860 5 363.6 247.9860 7 230.8 153.9367
looks alright?
-
These outputs look very nice. I hope the removal of pdb.id in blast won't break examples and vignettes (I assume we always use plot.blast() to generate the pdb id list, which has been fixed accordingly in Lars' recent commit). Will have a test soon. Thanks for the update!
-
reporter I can update the enma vignettes. Started briefly yesterday
-
reporter - changed status to resolved
resolved in v2.3
- Log in to comment