ABC data loading defect

Issue #56 resolved
Richard Bruskiewich created an issue

A number of defects are apparent relative to Implicitome release needs:

1) 502 proxy errors with some data access, e.g. http://knowledge.bio/query/concept/details/C1855923

2) The ABC model data loader appears to have failed to load significant data present in the original file, e.g. CWH43. Why? It needs to be loaded.

3) Also review availability of B model data for CYP2R1 and Seckel Syndrome

Comments (18)

  1. Richard Bruskiewich reporter

    Item 1) is fixed: the request now fails gracefully instead of returning a 502. The concept C1855923, though, is absent from the SemMedDb version in production.

    For items 2), 3) and PGAP2, the ABC data excerpt was loaded as a segmental (not whole file) upload. The failure to load the data in the first place is perplexing, but remedial actions were taken in the script design to mitigate this failure.

  2. Richard Bruskiewich reporter

    File uploading finally appears to be working, so B concept definitions are widely available for several new concepts. Closing this issue for now.

  3. b
    • changed status to open

    It appears that the large majority of B concepts have not been loaded. Examples for testing from the Implicitome paper (supplementary table S3):

    • RTF1: hyperparathyroidism-jaw tumor syndrome
    • GPIHBP1: combined lipase deficiency
    • FBXW8: gloomy face syndrome, Dolichospondylic dysplasia, 3M1, 3M syndrome
    • PMVK: Mevalonic Aciduria
    • DDX19B: LCCS
    • ZWINT: MCPH2
    • DNAJC13: photic sneeze reflex

  4. b

    Adding another example to test from the paper: MCPH2 should show a link to Seckel Syndrome via the B concept Primary microcephaly. Currently MCPH2 shows up in the 'implicit' search results but not in the explicit ones, and no implicit relations are then reported for it.

  5. Richard Bruskiewich reporter

    MCPH2 exists as a concept in the implicitome.concept table, but there are no tuples for it in the version of implicitome.tuples table I have:

    mysql> select * from concept where name='MCPH2';
    +------------+-------+------------+
    | concept_id | name  | definition |
    +------------+-------+------------+
    |    3000570 | MCPH2 |            |
    +------------+-------+------------+
    1 row in set (0.13 sec)

    mysql> select * from tuples where sub_id=3000570 or obj_id=3000570;
    Empty set (2 min 39.02 sec)

    To date, I've only filtered out SemMedDb.CONCEPT table entries that lack an associated Predication. I've not filtered out the Implicitome in a similar manner. That is why MCPH2 comes up: it's in the Implicitome concept table... even if it has no tuples.

    I have not (yet) checked the raw ABC file for MCPH2 hits. I guess that would be the next step of my detective work.
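    A minimal sketch of such a raw-file check, assuming only that concepts appear in the ABC file as "<concept_id>|<label>" tokens (the form '1858535|MCPH2' quoted in this thread). The function name and the sample lines are hypothetical, and the real file layout may differ.

```python
import re
from collections import defaultdict
from typing import Iterable

# Assumption: concepts are written as "<concept_id>|<label>" tokens.
# Nothing else about the line layout is relied upon.
TOKEN = re.compile(r"(\d+)\|([^\s|]+)")


def concept_ids_for_label(lines: Iterable[str], label: str) -> dict[int, int]:
    """Count, per concept_id, how many "<id>|<label>" tokens occur."""
    hits: dict[int, int] = defaultdict(int)
    for line in lines:
        for concept_id, found in TOKEN.findall(line):
            if found == label:
                hits[int(concept_id)] += 1
    return dict(hits)


if __name__ == "__main__":
    # Hypothetical sample lines; a real run would pass an open file handle.
    sample = [
        "1858535|MCPH2\t12345|some_b_concept\t0.42",
        "999|OTHER\t1858535|MCPH2",
    ]
    print(concept_ids_for_label(sample, "MCPH2"))  # {1858535: 2}
```

    Because the function accepts any iterable of lines, a large file can be streamed through it without loading it into memory. Seeing which concept_id values the file actually uses for a label shows whether the tuples should be expected under a different ID than the one the search returns.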

  6. Richard Bruskiewich reporter

    The plot thickens a bit:

    mysql> select * from concept where concept_id=1858535;
    +------------+---------------------------------------------------------+------------+
    | concept_id | name                                                    | definition |
    +------------+---------------------------------------------------------+------------+
    |    1858535 | microcephaly, primary autosomal recessive, 2 (disorder) |            |
    +------------+---------------------------------------------------------+------------+
    1 row in set (0.00 sec)

    This is the concept that the ABC model file records as '1858535|MCPH2' (as an A or C concept). Using this concept_id instead, the tuples table does have entries:

    mysql> select count(*) from tuples where sub_id=1858535 or obj_id=1858535;
    +----------+
    | count(*) |
    +----------+
    |    13432 |
    +----------+
    1 row in set (2 min 7.03 sec)

  7. Richard Bruskiewich reporter

    The Implicitome 'Term' table may be useful. I've not really used it so far, but perhaps I need to. Here are the results of searching on the two concept IDs associated with MCPH2:

    mysql> select * from term where concept_id=1858535;
    +---------+------------+------------+-------+---------------+----------------+------------+
    | term_id | concept_id | subterm_id | text  | casesensitive | ordersensitive | normalised |
    +---------+------------+------------+-------+---------------+----------------+------------+
    |  328901 |    1858535 |          0 | MCPH2 |             0 |              1 |          1 |
    +---------+------------+------------+-------+---------------+----------------+------------+
    1 row in set (0.08 sec)

    mysql> select * from term where concept_id=3000570;
    +---------+------------+------------+----------------------------------------------+---------------+----------------+------------+
    | term_id | concept_id | subterm_id | text                                         | casesensitive | ordersensitive | normalised |
    +---------+------------+------------+----------------------------------------------+---------------+----------------+------------+
    |  386192 |    3000570 |          0 | MCPH2                                        |             0 |              1 |          0 |
    |  386193 |    3000570 |          1 | microcephaly, primary autosomal recessive 2  |             0 |              1 |          0 |
    |  386194 |    3000570 |          2 | Microcephaly, primary autosomal recessive 2  |             0 |              1 |          0 |
    |  386195 |    3000570 |          3 | MCPH-II                                      |             1 |              1 |          0 |
    |  386196 |    3000570 |          4 | MCPH-2                                       |             0 |              1 |          0 |
    |  386197 |    3000570 |          5 | microcephaly, primary autosomal recessive II |             0 |              1 |          0 |
    |  386198 |    3000570 |          6 | microcephaly, primary autosomal recessive2   |             0 |              1 |          0 |
    |  386199 |    3000570 |          7 | microcephaly, primary autosomal recessive-2  |             0 |              1 |          0 |
    |  386200 |    3000570 |          8 | Microcephaly, primary autosomal recessive II |             0 |              1 |          0 |
    |  386201 |    3000570 |          9 | Microcephaly, primary autosomal recessive2   |             0 |              1 |          0 |
    |  386202 |    3000570 |         10 | Microcephaly, primary autosomal recessive-2  |             0 |              1 |          0 |
    +---------+------------+------------+----------------------------------------------+---------------+----------------+------------+
    11 rows in set (0.03 sec)
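    The two results above show the same term text, MCPH2, attached to two different concept IDs. A sketch of how such collisions could be listed across the whole term table follows. It runs against an in-memory SQLite database as a stand-in for MySQL, uses only the term_id, concept_id, subterm_id and text columns seen in the output above, and the function name is hypothetical.

```python
import sqlite3


def ambiguous_terms(conn: sqlite3.Connection) -> dict[str, list[int]]:
    """Map each term text to its concept_ids, for texts used by more than one concept."""
    rows = conn.execute(
        """
        SELECT DISTINCT text, concept_id
        FROM term
        WHERE text IN (
            SELECT text FROM term
            GROUP BY text
            HAVING COUNT(DISTINCT concept_id) > 1
        )
        ORDER BY text, concept_id
        """
    ).fetchall()
    result: dict[str, list[int]] = {}
    for text, concept_id in rows:
        result.setdefault(text, []).append(concept_id)
    return result


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE term (term_id INTEGER, concept_id INTEGER, subterm_id INTEGER, text TEXT)"
    )
    # A few rows copied from the query results in this thread.
    conn.executemany(
        "INSERT INTO term VALUES (?, ?, ?, ?)",
        [
            (328901, 1858535, 0, "MCPH2"),
            (386192, 3000570, 0, "MCPH2"),
            (386195, 3000570, 3, "MCPH-II"),
        ],
    )
    print(ambiguous_terms(conn))  # {'MCPH2': [1858535, 3000570]}
```

    Note that SQLite compares text case-sensitively by default, whereas MySQL's behaviour depends on the column collation, so the set of collisions found on the production database may be larger.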

  8. Richard Bruskiewich reporter

    B concepts for all of the Implicitome supplemental table genes are now completely loaded, and a global B concept loading background process, which uses the same underlying code base, continues to run. Unless the global loader has an unknown mode of failure, all the known B concepts should soon be loaded. However, this should be reviewed further in the future.

  9. Richard Bruskiewich reporter

    I've devised and am running an Implicitome data audit script that marks as "orphan" every concept that has no tuple entries in the database, thus filtering those concepts out of view. This should fix the issue with MCPH2 above.
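    A minimal sketch of what an orphan audit of this kind might look like, not the actual script. It uses an in-memory SQLite database as a stand-in for MySQL, the concept and tuples columns shown earlier in this thread, and a hypothetical is_orphan flag column. The function names are also hypothetical.

```python
import sqlite3


def mark_orphans(conn: sqlite3.Connection) -> int:
    """Flag concepts that never occur as sub_id or obj_id in tuples; return how many."""
    cur = conn.execute(
        """
        UPDATE concept
        SET is_orphan = 1
        WHERE NOT EXISTS (SELECT 1 FROM tuples t WHERE t.sub_id = concept.concept_id)
          AND NOT EXISTS (SELECT 1 FROM tuples t WHERE t.obj_id = concept.concept_id)
        """
    )
    conn.commit()
    return cur.rowcount


def visible_concept_ids(conn: sqlite3.Connection) -> list[int]:
    """Concept IDs that a search should still show after the audit."""
    rows = conn.execute(
        "SELECT concept_id FROM concept WHERE is_orphan = 0 ORDER BY concept_id"
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE concept (concept_id INTEGER PRIMARY KEY, name TEXT, "
        "definition TEXT, is_orphan INTEGER DEFAULT 0)"  # is_orphan is hypothetical
    )
    conn.execute("CREATE TABLE tuples (sub_id INTEGER, obj_id INTEGER)")
    conn.executemany(
        "INSERT INTO concept (concept_id, name, definition) VALUES (?, ?, ?)",
        [
            (3000570, "MCPH2", ""),
            (1858535, "microcephaly, primary autosomal recessive, 2 (disorder)", ""),
        ],
    )
    # Made-up tuple: 42 is a placeholder object concept ID.
    conn.execute("INSERT INTO tuples VALUES (?, ?)", (1858535, 42))
    print(mark_orphans(conn), visible_concept_ids(conn))
```

    On tables of the size discussed here, the two NOT EXISTS checks would only be practical with indexes on tuples.sub_id and tuples.obj_id. Whether such indexes exist in the production schema is not stated in this thread.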

  10. b

    As of now, there are still records with missing B concepts. They are not hard to find by browsing around, so I would guess there are a lot:

    • Linking Concept Co-Occurrence for 'Endometriosis' and 'FLCN'
    • Linking Concept Co-Occurrence for 'Endometriosis' and 'GNRH1'
    • Linking Concept Co-Occurrence for 'Endometriosis' and 'MUC16'

  11. Richard Bruskiewich reporter

    What is strange to me is that my work loading the Implicitome supplemental genes seemed to suggest that my batch script for loading B concepts normally works for small batch loading. I'll have to go back to the drawing board to figure out why the global loading process, which takes a very long time to run (at least a week, I think, across all the >200 million entries), still fails to snag all the B concepts.

  12. b

    @rbruskiewich this is the last outstanding data bug. Why don't you point me to the problematic data loading code? Maybe fresh eyes from me or others in the group can help crush it once and for all. I think it's important enough that we should hold off publishing until it's resolved.
