mason.malone Citations corruption data flow

Created by mason.malone
  1. Script calls create_item on a corrupted CrossrefItem (ci_item). Item 1939454 is created.
  2. Script calls ci_item.create_others

    1. ProviderItem.create_others iterates over other provider classes. When it gets to WosItem, it calls WosItem.find_or_create_by_doi(10.1007/s004240050236)
      1. WosItem.find_or_create_by_doi() creates a WosItem for that doi, which invokes the WosItem.complete callback.
        1. WosItem.complete calls WosItem.pubmed_item
          1. WosItem.pubmed_item calls PubmedItem.find_or_create_by_pubmed_id(8781202)
            1. PubmedItem.find_or_create_by_pubmed_id() creates a PubmedItem for that pubmed_id, which invokes the PubmedItem.complete callback.
              1. PubmedItem.complete calls ProviderItem.complete

                1. ProviderItem.complete calls ProviderItem.create_item

                  1. ProviderItem.create_item calls Item.find_or_create_by_provider_item(self)
                    1. The doi column of PubmedItem 8781202 is blank, so Item.find_or_create_by_provider_item(...) calls where(...).first_or_create using the PubmedItem record.
                      1. Item.first_or_create creates Item 1939455 because of slight differences in the author list:
                        Item.find(1939454).author_list => "L Kornet, JR Jansen, EJ Gussenhoven, A Versprille"
                        Item.find(1939455).author_list => "L Kornet, J R Jansen, E J Gussenhoven, A Versprille"
                        
                2. Control returns to ProviderItem.complete, which calls item.complete

                  1. Item.complete calls Item.merge_similar
                    1. Item.merge_similar calls Item.similar_items(false)
                      1. Item.similar_items searches Elasticsearch using the title and author list of Item 1939455. If Item 1939454 has already been indexed by Elasticsearch (which is likely, since about 2 seconds elapse between its creation and this step), it will match.
                    2. If item 1939454 was matched, it will be returned to Item.merge_similar, which will call Item.replace_with(self).
                      1. Item.replace_with will delete Item 1939454 and update everything to use Item 1939455.
    2. Control returns to ProviderItem.create_others, which sees that WosItem.item.nil? is true. This is because WosItem.complete does not call ProviderItem.complete, which is what normally sets the item. Thus, every newly created WosItem has a nil item.
    3. ProviderItem.create_others calls WosItem.instance.update_column(:item_id, 1939454). This results in corruption, since Item 1939454 has been deleted.
  3. Control returns to the script, which calls ci_item.item.complete. ci_item.item still refers to Item 1939454: although its row was deleted, the in-memory object still exists.

    1. Item.complete calls Item.merge_similar
      1. Item.merge_similar calls Item.similar_items(false). If Item 1939455 has already been indexed, it will match. Unlike before, the timing here is much more sensitive. The last time I caused this to happen, the time between the creation of Item 1939455 and this step was 60 milliseconds. The index refresh interval for Elasticsearch is set to 1 second, so there is roughly a 6% chance (60 ms out of 1000 ms) that the item has been indexed at this point.
    2. If item 1939455 was matched, it will be returned to Item.merge_similar, which will call Item.replace_with(self).
      1. Item.replace_with will delete Item 1939455 and update everything to use itself (Item 1939454). Since 1939454 was deleted, this will corrupt everything that was updated in the previous call to Item.replace_with, which can number in the thousands of rows.
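
The flow above can be replayed with a toy in-memory model to show where the dangling references come from. This is only a sketch: ToyDb, run_trace, the row names and the second_merge_matches flag are hypothetical stand-ins, not the application's classes, and replace_with here only imitates the behaviour described in the trace (delete one item, repoint its rows to the other, without checking that the survivor still exists).

```ruby
# Toy in-memory model of the trace; not the application's real classes.
class ToyDb
  attr_reader :items, :references

  def initialize
    @items = {}       # item id => author_list
    @references = {}  # name of a referencing row => item_id
  end

  # Imitates Item.replace_with as described above: delete old_id and
  # repoint every row that used it to new_id. It never checks that
  # new_id still exists, which is the assumption this sketch relies on.
  def replace_with(old_id, new_id)
    @items.delete(old_id)
    @references.keys.each do |row|
      @references[row] = new_id if @references[row] == old_id
    end
  end

  # Rows whose item_id points at an item that no longer exists.
  def dangling
    @references.reject { |_row, item_id| @items.key?(item_id) }
  end
end

# second_merge_matches models the timing-sensitive case in step 3,
# where Item 1939455 has already been indexed when the search runs.
def run_trace(second_merge_matches:)
  db = ToyDb.new

  # Step 1: the script creates Item 1939454 from the CrossrefItem.
  db.items[1939454] = "L Kornet, JR Jansen, EJ Gussenhoven, A Versprille"
  db.references["crossref_item"] = 1939454

  # Step 2: the nested PubmedItem creation produces Item 1939455.
  db.items[1939455] = "L Kornet, J R Jansen, E J Gussenhoven, A Versprille"
  db.references["pubmed_item"] = 1939455

  # merge_similar on 1939455 matches 1939454 and replaces it.
  db.replace_with(1939454, 1939455)

  # create_others then writes the stale id straight into the WosItem row.
  db.references["wos_item"] = 1939454

  # Step 3: the stale in-memory Item 1939454 runs merge_similar itself.
  db.replace_with(1939455, 1939454) if second_merge_matches

  db
end

puts run_trace(second_merge_matches: false).dangling.inspect
puts run_trace(second_merge_matches: true).dangling.inspect
```

In this model, the first run leaves only the WosItem row pointing at the deleted Item 1939454. The second run deletes Item 1939455 as well, so every row repointed by the earlier replace_with call now references an item that no longer exists.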
