How do I detect changes relevant for my needs?

Some metadata may be more important to you than others, and your business processes may need to track when they change. This question arises often, and there is no silver bullet: as providers of large sets of metadata, we cannot design a solution that fits everyone's use cases.

Why?

The harmonized data models provide large amounts of metadata, more than most consumers will actually need in their specialised use cases. Some may be interested in tracking the availability of new files. Some may want to detect new projects, or only changes in stages. Some are interested in the texts of abstracts or titles.

As providers of metadata, we expose only ONE element, the lastChangeTimestamp, that allows detecting any change to the metadata. This is explained in detail in this section. In summary, this single element cannot solve all possible use cases optimally, and we are not able to provide other "dates of changes" for all your cases, as this would be an endless quest.

Moreover, each of you may be interested in detecting several patterns of changes, to trigger different data updates or business processes. Below we provide an approach that allows all of this. Internally, we use this approach to generate the API contents as well.

How to do it

The algorithm is:

  1. Request a broader set of entities than you may actually need
  2. Process all response entities with a sort of "funnel" that will detect precisely what you need
  3. For that, you associate every entity's "urn" identifier with a custom-made digital "signature" of only your metadata of interest
  4. You then compare the generated signature with the previous one, held in a local database
  5. If they differ, you have pinpointed a change in the exact metadata you were interested in!

The last 3 steps can be parallelized to detect multiple patterns of interest. This sounds complicated but is actually straightforward to implement and extremely robust to operate.

Practically

If you (very wisely) query the APIs to retrieve only recently modified items, you will query on the lastChangeTimestamp, which is updated on any change to any entity (see here). You can therefore use it to query all entities, or to refine (limit) your search. This takes care of step (1).

If you do this properly, this will guarantee that you process changes only once: the response will NOT contain any entity you have already handled.

Yet you may still end up processing a large number of response entities in which the data you are interested in was not actually modified. Remember to use pagination to scroll through the whole result set.
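A minimal Python sketch of step (1) with pagination. The endpoint, the query parameter names and the page format are not described on this page, so the request itself is left to a caller-supplied `fetch_page` function (a hypothetical name). It is assumed to return the list of entities for a given page number, and an empty list once the result set is exhausted.

```python
from typing import Callable, Iterator


def iter_modified_entities(
    fetch_page: Callable[[str, int], list],
    since: str,
) -> Iterator[dict]:
    """Yield every entity changed since `since`, one page at a time.

    `fetch_page(since, page)` is supplied by the caller (assumption): it
    performs the real API request filtered on lastChangeTimestamp and
    returns the entities of that page, or an empty list when done.
    """
    page = 0
    while True:
        entities = fetch_page(since, page)
        if not entities:
            break  # no more results: the whole result set was scrolled
        yield from entities
        page += 1
```

Keeping the HTTP call outside the loop makes the paging logic easy to test and lets you plug in whatever client and page size you already use.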

Entities in the response will be represented in XML or JSON, depending on your query choice (via the "Accept" request header). You will have to process all of them, in sequence or in parallel. The goal is to selectively pick out the few metadata values that interest you.

There are several options; choose the one most convenient to you:

  * map XML or JSON to objects using a framework library
  * parse XML or JSON and listen to events representing the metadata of choice
  * convert XML or JSON to DOM and traverse to the metadata of choice
  * parse to match for XPath or JSONPath expressions
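As an illustration of the "traverse to the metadata of choice" option, here is a small Python sketch for a JSON entity. The field names used in the example (`title`, `stage.code`, `abstract`) are hypothetical and only stand in for whatever metadata you care about; the dotted-path notation is a simplification, not JSONPath.

```python
import json


def pick_values(entity: dict, paths: list[str]) -> list[str]:
    """Return one string per dotted path, in the order of `paths`.

    A missing value yields "" so that the position of every value stays
    stable; present values are serialised with json.dumps so that nested
    structures also get a deterministic text form.
    """
    values = []
    for path in paths:
        node = entity
        for key in path.split("."):
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break  # path not present in this entity
        values.append(
            "" if node is None
            else json.dumps(node, sort_keys=True, ensure_ascii=False)
        )
    return values


# Hypothetical entity, for illustration only.
example = json.loads(
    '{"urn": "urn:example:1", "title": "A title", "stage": {"code": "DRAFT"}}'
)
picked = pick_values(example, ["title", "stage.code", "abstract"])
```

Here `picked` holds the title, the stage code and an empty placeholder for the missing abstract, always in the same order.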

For one entity

  1. Extract the "urn" that is the entity's unique identifier
  2. Initialize a message digest from your favorite software library. Avoid short, weak hash functions such as MD5; SHA256, for example, is a good fit
  3. For all metadata values of interest encountered in the entity, using the option chosen above, append the value to the message digest, always in the same order so that signatures stay comparable between runs
  4. Once all values have been appended, generate the string "signature" representation of the digest, typically a long hexadecimal string
  5. Associate this signature to the urn of the entity

You do this for all entities (and optionally for each metadata pattern whose changes you want to detect).
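The per-entity steps above can be sketched in Python with the standard `hashlib` module. The function names are illustrative. One detail is an addition of this sketch and not part of the description above: each value is prefixed with its length, so that the values ("ab", "c") and ("a", "bc") do not produce the same signature.

```python
import hashlib
from typing import Iterable


def signature(values: Iterable[str]) -> str:
    """Return the SHA-256 hex signature of the values, in the given order."""
    digest = hashlib.sha256()
    for value in values:
        data = value.encode("utf-8")
        # Length prefix (assumption of this sketch): keeps value
        # boundaries unambiguous inside the digest.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


def signatures_by_urn(pairs: Iterable[tuple[str, list[str]]]) -> dict[str, str]:
    """Map each urn to the signature of its values of interest."""
    return {urn: signature(values) for urn, values in pairs}
```

To track several patterns, call `signatures_by_urn` once per pattern, each time with the values extracted for that pattern.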

Then you compare the signatures generated above with the previously generated ones stored in a local database. This gives you the list of entities whose metadata of interest has changed.

Finally, upon successful processing, you update your local database with the signatures generated above.
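A minimal sketch of the comparison and update steps, assuming SQLite (via Python's standard `sqlite3` module) as the local database. The table layout and the `pattern` column, which lets several patterns share one table, are choices made for this example only.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the local signature store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS signatures ("
        " urn TEXT NOT NULL, pattern TEXT NOT NULL, signature TEXT NOT NULL,"
        " PRIMARY KEY (urn, pattern))"
    )
    return conn


def changed_urns(conn, pattern: str, new_signatures: dict) -> list:
    """Return the urns whose signature is new or differs from the stored one."""
    changed = []
    for urn, sig in new_signatures.items():
        row = conn.execute(
            "SELECT signature FROM signatures WHERE urn = ? AND pattern = ?",
            (urn, pattern),
        ).fetchone()
        if row is None or row[0] != sig:
            changed.append(urn)
    return changed


def save_signatures(conn, pattern: str, new_signatures: dict) -> None:
    """Persist the new signatures in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO signatures (urn, pattern, signature)"
            " VALUES (?, ?, ?)",
            [(urn, pattern, sig) for urn, sig in new_signatures.items()],
        )
```

Call `save_signatures` only after your own processing of the changed entities has succeeded; otherwise a failed run would hide the change from the next one.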

Notes

This algorithm detects changes only: it won't tell you what the old and new values are. For that, you will have to compare your local storage with the entities returned in the API responses.

For projects and publications, we usually do not delete items. At the moment there is no "event" system that would allow you to detect that a previously existing item no longer exists in our databases. If you need to detect that case, you will need to periodically "lookup" all "urn"s. If an HTTP 404 is returned, the item has been deleted since the last invocation.
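The periodic lookup could be sketched as follows in Python, using only the standard library. The mapping from a "urn" to a lookup URL is not described on this page, so it is passed in as a caller-supplied `url_for` function (hypothetical). The status check is also injectable, so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_status(url: str) -> int:
    """Return the HTTP status code of a GET request on `url`."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as HTTPError


def find_deleted(
    urns: Iterable[str],
    url_for: Callable[[str], str],
    status_of: Callable[[str], int] = http_status,
) -> list[str]:
    """Return the urns whose lookup answers HTTP 404."""
    return [urn for urn in urns if status_of(url_for(urn)) == 404]
```

Only a 404 is treated as a deletion here; other error codes (for example a temporary 5xx) are deliberately ignored so that an outage is not mistaken for deleted items.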
