Remove un-linked concepts from initial search result

Issue #41 resolved
b created an issue

The problem is that many searches return concepts that have no data associated with them in GBK. This produces problems when trying to find something that is in there (e.g. try to find type 2 diabetes), as well as frustration when you think you have something and there is no data (e.g. a search for NGLY1 returns two records that presumably point to the same concept; clicking on one, 'NGLY1', gets you good content, while the other, 'NGLY1 gene', gets nothing).

I think this could be most easily approached with some curation of the database. Remove concepts that don't exist in the semmedDB triples table from the table that is queried for this service.

Comments (17)

  1. b reporter

    A question here is whether or not to return hits from the Implicitome if no Explicit relations can be found. For example, the disease XPCC has no explicit relations, but does show up in implicitome. If it were possible, I think we should make it possible to get there - for example, in the case that no explicit relations were present, but implicit relations were, land the user on the Implicit tab.

  2. Richard Bruskiewich

    XPCC is mislabeled I think. Kenneth had trouble pulling out the details from mygenes.info. The symbol 'XPC' seems to correspond to this gene.

    Not sure how we should handle such curation outliers. I'm sure there are a fair number of such cases.

  3. b reporter

    No, it's correct, XPCC is a disease (see omim). Just happens to be related to the XPC gene.

    Curation wise, we just have to look for patterns that we can detect and fix. UMLS is far from perfect. Downstream we can work on community fixing of problems as they find them.

  4. Richard Bruskiewich

    Explains why XPCC didn't show up as a symbol search on MyGenes.info... :-)

    Not sure where the notion arose that it was an Entrez concept.. Sorry about that...

  5. Richard Bruskiewich

    The Explicit concept list (search by text popup results) no longer displays concepts that don't have explicit predications. These are simply marked as orphans in the database, not deleted from the Concepts table. If they get predication data in the future, rerunning the data audit will make these concepts visible again.

    I've not yet taken a look at the counter idea of displaying implicit concepts when explicit concepts come up empty-handed... I'll review this next.

  6. Richard Bruskiewich

    SemMedDb concepts are marked with the "IS_ORPHAN" flag, based on a data_audit review of concept representation in Predications (flag = true if no predications are found). This flag is used to hide concepts without predications.
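A minimal sketch of the orphan-flag audit described above, using an in-memory SQLite database. The table and column names (`concept`, `predication`, `subject_id`, `object_id`, `is_orphan`) are assumptions made for illustration and are not taken from the actual schema.

```python
import sqlite3


def mark_orphans(conn):
    """Set is_orphan = 1 for concepts absent from every predication, else 0."""
    conn.execute(
        """
        UPDATE concept
           SET is_orphan = NOT EXISTS (
               SELECT 1 FROM predication p
                WHERE p.subject_id = concept.concept_id
                   OR p.object_id = concept.concept_id
           )
        """
    )
    conn.commit()


def search_concepts(conn, text):
    """Text search that hides concepts flagged as orphans."""
    rows = conn.execute(
        "SELECT name FROM concept "
        "WHERE is_orphan = 0 AND name LIKE ? ORDER BY name",
        (f"%{text}%",),
    )
    return [r[0] for r in rows]


def demo():
    # Hypothetical schema and data, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE concept (
            concept_id INTEGER PRIMARY KEY,
            name TEXT,
            is_orphan INTEGER DEFAULT 0
        );
        CREATE TABLE predication (subject_id INTEGER, object_id INTEGER);
        INSERT INTO concept (concept_id, name)
            VALUES (1, 'NGLY1'), (2, 'NGLY1 gene'), (3, 'XPC');
        INSERT INTO predication VALUES (1, 3);
        """
    )
    mark_orphans(conn)
    return search_concepts(conn, "NGLY1")
```

Because the flag is recomputed rather than rows being deleted, rerunning `mark_orphans` after new predications are loaded would make a previously hidden concept visible again.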

    The Concept by text search now also concurrently shows "Explicit" and "Implicit" hits in separate tables (merging the two tables would be somewhat difficult at this point...).

    Clicking a concept in either table brings up the desktop session for that concept.

  7. Richard Bruskiewich

    Ben, you asked "...If it were possible, I think we should make it possible to get there - for example, in the case that no explicit relations were present, but implicit relations were, land the user on the Implicit tab. ..."

    Following up on the splitting of concept matches into 'Explicit' and 'Implicit' concepts, it is now coded so that when the user picks an 'Explicit' concept, the Explicit relations concept table is brought up, whereas if the user picks an 'Implicit' concept, the Implicit relations concept table is brought up.

  8. Richard Bruskiewich

    I've added the "is_orphan" tag to the Implicitome concept table as well, and have written a data auditor similar to the one I previously wrote for SemMedDb, which checks whether or not a given concept has any tuples in the Tuples table of the database. If not, it is marked as 'orphan' and is filtered out of the initial concept by text search.

  9. Richard Bruskiewich

    I'm just testing it to make sure it works as expected, then I'll unleash it on the production database. Hopefully, this will filter out concepts in the user's initial query by text, which don't have any relationship data associated with them.

  10. b reporter

    As of today, this issue is resolved for the explicit relations but still open for the implicit. I understand the orphan tagging process is ongoing. (First test on a query for 'arf' led to implicit concept (R)-4-Hydroxy-3-(3-oxo-1-phenylbutyl)-2-benzopyrone which has no links.)

  11. Richard Bruskiewich

    Yes, that is correct. The remaining background task is doing exactly that: striving to mark as "orphans" the implicitome concepts without associated tuples. Once it is complete, we'll have resolved this issue.

  12. Richard Bruskiewich

    The Implicitome orphan data audit background process is taking a veeeerrry loooong tiiiime to run.

    The way it is coded, it grabs subsets of 10,000 tuples at a time and records the observed concept_id's.

    I thought that such a batch size wouldn't overwhelm the RAM. I guess I could have aimed a bit higher.

    However, since the search for batches of tuples to read in is an "OFFSET... LIMIT.." SQL statement, each progressive search takes longer and longer to run.
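One common way around the growing cost of "OFFSET... LIMIT.." is keyset pagination: each batch starts after the last key seen, so the database can seek through an index instead of rescanning skipped rows. The sketch below is illustrative only; the `tuples` table, its `tuple_id` integer primary key, and the `sub_id`/`obj_id` columns are assumed names, not the real Implicitome schema.

```python
import sqlite3


def observed_concepts(conn, batch_size=10_000):
    """Collect every concept id seen in the tuples table, batch by batch."""
    seen = set()
    last_id = 0  # assumes tuple_id is a positive integer primary key
    while True:
        rows = conn.execute(
            "SELECT tuple_id, sub_id, obj_id FROM tuples "
            "WHERE tuple_id > ? ORDER BY tuple_id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        for _tuple_id, sub_id, obj_id in rows:
            seen.add(sub_id)
            seen.add(obj_id)
        # Resume after the last key read instead of using OFFSET.
        last_id = rows[-1][0]
    return seen


def demo():
    # Hypothetical data: 25 tuples linking concept i to concept i + 1.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tuples ("
        "tuple_id INTEGER PRIMARY KEY, sub_id INTEGER, obj_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO tuples (sub_id, obj_id) VALUES (?, ?)",
        [(i, i + 1) for i in range(1, 26)],
    )
    return observed_concepts(conn, batch_size=10)
```

Since `last_id` is the only state the loop carries, writing it to disk after each batch would also give a natural restart point, though that is not shown here.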

    I also don't know if the process is now page thrashing on the server(?) since the implicitome table is so huge (hence, cannot be held entirely in memory). Not sure how I'd assess that, but that would really slow things down.

    Time is thus slowing down for the process, rather like falling in towards the event horizon of a black hole...

    I worry that as time goes on, the process might increasingly be challenging its mean-time-between-failure limit. Again, it is not designed to be resilient for restart simply because I didn't anticipate such an excruciatingly long process run.

    I'll let it continue running. However, I don't know how long it will take to run and if it fails, the whole data audit strategy will need to be rethought to identify a way forward.

    A number of ideas come to mind (some expressed before):

    1) If, in fact, our application only queries the top 1% of the Tuples table (by percentile) anyway, then we could limit the parsing of tuples for identification of orphans to that subset. That would reduce the run time at least 100-fold, if not more, since the Tuples table is already ordered by percentile.

    2) We should perhaps run the algorithm on an offline copy of the database, to update a copy of the concepts table (which is much smaller, only a few hundred thousand records), then transfer the updates to the production concept table (perhaps, with a suitably fabricated SQL update script)

    3) We could perhaps try to partition the Tuples table into smaller subsets, then parse them separately for concepts, then merge the resulting concept status (i.e. a concept is not an orphan if it is observed in at least one of the subset tuples tables). This is a kind of "MapReduce" strategy. One wonders if the Implicitome can be converted into a Hadoop-compatible environment for this purpose.

    4) (More radically), could or should the Implicitome database be moved into a NoSQL database (e.g. Neo4j) to accelerate the updates(?)
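Idea 3 above can be sketched without Hadoop: split the tuple key range into partitions, collect the concepts seen in each partition, and merge by set union, so a concept is an orphan only if no partition saw it. All table and column names here (`tuples`, `concept`, `tuple_id`, `sub_id`, `obj_id`, `concept_id`) are assumptions for illustration, and the partitions are processed sequentially, although each could in principle be handed to a separate worker.

```python
import sqlite3


def id_ranges(lo, hi, n_parts):
    """Split the inclusive range [lo, hi] into up to n_parts contiguous ranges."""
    size = max(1, -(-(hi - lo + 1) // n_parts))  # ceiling division
    return [(s, min(s + size - 1, hi)) for s in range(lo, hi + 1, size)]


def concepts_in_range(conn, lo, hi):
    """'Map' step: concept ids observed in one partition of the tuples table."""
    rows = conn.execute(
        "SELECT sub_id FROM tuples WHERE tuple_id BETWEEN ? AND ? "
        "UNION "
        "SELECT obj_id FROM tuples WHERE tuple_id BETWEEN ? AND ?",
        (lo, hi, lo, hi),
    )
    return {r[0] for r in rows}


def find_orphans(conn, n_parts=4):
    """'Reduce' step: union the per-partition sets, then subtract from all concepts."""
    lo, hi = conn.execute(
        "SELECT MIN(tuple_id), MAX(tuple_id) FROM tuples"
    ).fetchone()
    seen = set()
    if lo is not None:
        for part_lo, part_hi in id_ranges(lo, hi, n_parts):
            seen |= concepts_in_range(conn, part_lo, part_hi)
    all_ids = {r[0] for r in conn.execute("SELECT concept_id FROM concept")}
    return all_ids - seen


def demo():
    # Hypothetical data: six concepts, three tuples; concepts 4 and 6 are unlinked.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE concept (concept_id INTEGER PRIMARY KEY);
        CREATE TABLE tuples (
            tuple_id INTEGER PRIMARY KEY, sub_id INTEGER, obj_id INTEGER
        );
        INSERT INTO concept VALUES (1), (2), (3), (4), (5), (6);
        INSERT INTO tuples (sub_id, obj_id) VALUES (1, 2), (2, 3), (3, 5);
        """
    )
    return find_orphans(conn, n_parts=2)
```

The resulting set of orphan ids could then be applied to the much smaller concepts table in one update, which fits with idea 2 of computing the result offline and transferring it to production.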

  13. Richard Bruskiewich

    BTW, as of 9 pm on the 27th July, the Implicitome orphan data audit is only ~ 60% complete. I have, however, verified that it is still apparently running (the log file reflects that fact), hopefully correctly.

  14. b reporter

    This appears to be working just fine now. Casual experimentation shows no search results that have no underlying content.
