JummpIndexer / dev

Implementation considerations

Jummp will invoke the indexer every time a new model or an update to an existing model is submitted to an instance. On each occasion, Jummp will generate a JSON file containing basic information about the content to be indexed and will invoke the indexer, passing the JSON file as an argument.

The application is responsible for updating the information used to power the Jummp search engine. There is currently support for Solr and the EBI Search via OmicsDI, although it is possible to plug in alternative technologies. The current Solr schema is described here.

Since Jummp is capable of supporting multiple formats for representing models, the indexer has a dedicated component responsible for dealing with each model format recognised by Jummp.

The typical sequence of events for processing a submission is as follows:

  1. construct a RequestContext from the JSON file
  2. establish connections to the external systems it needs (e.g. relational database, search engine)
  3. instantiate a ModelIndexer capable of extracting information from the submission (see below)
  4. launch the indexing process, sending any gathered data to the appropriate store
  5. close all resources and shut down.
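
The sequence above can be pictured with a minimal Java sketch. Everything here is a stand-in written for illustration: the nested `RequestContext`, `ModelIndexer` and `SearchStore` types, their fields and their method names are assumptions and do not reproduce the real classes, and JSON parsing is replaced by plain constructor arguments.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: all types and method names below are hypothetical stand-ins.
final class ProcessingSketch {

    // Step 1 stand-in: holds what would be read from the JSON file.
    static final class RequestContext {
        final String modelId;
        final String format;
        final List<String> gathered = new ArrayList<>();

        RequestContext(String modelId, String format) {
            this.modelId = modelId;
            this.format = format;
        }
    }

    // Step 3 stand-in: extracts information from a submission.
    interface ModelIndexer {
        void index(RequestContext context);
    }

    // Steps 2 and 5 stand-in: an external system that is opened and later closed.
    static final class SearchStore implements AutoCloseable {
        private final List<String> log;

        SearchStore(List<String> log) {
            this.log = log;
            log.add("connect");
        }

        void send(List<String> data) {
            log.add("send:" + data.size());
        }

        @Override
        public void close() {
            log.add("close");
        }
    }

    // Runs the five steps in order and returns a log of what happened.
    static List<String> process(String modelId, String format) {
        List<String> log = new ArrayList<>();
        RequestContext context = new RequestContext(modelId, format); // step 1
        log.add("context:" + context.modelId);
        try (SearchStore store = new SearchStore(log)) {               // step 2
            ModelIndexer indexer = c -> c.gathered.add("name=" + c.modelId); // step 3
            log.add("indexer:" + context.format);
            indexer.index(context);                                     // step 4
            store.send(context.gathered);
        }                                                               // step 5: store closed here
        return log;
    }

    public static void main(String[] args) {
        System.out.println(process("MODEL1", "SBML"));
    }
}
```

Using try-with-resources for the external connection is one way to guarantee that step 5 runs even if indexing fails; whether the real indexer does this is not stated here.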

Class diagram

A high-level overview of the project is available as a class diagram.

Key classes

  • RequestHandler: the master of ceremonies -- responsible for overseeing the indexing of the submission
  • RequestParser: parses the information from the JSON file provided by Jummp and constructs a RequestContext
  • RequestContext: stores the information extracted from the submission being processed
  • GormUtil: facade for interacting with the database and the ORM (GORM)
  • MiriamRegistryService: facade for extracting data about MIRIAM-compliant cross references from the Identifiers.org Registry
  • IndexingStrategy: contract for components capable of indexing model metadata according to a particular metadata schema defined in Jummp
  • ModelIndexer: interface for services capable of indexing models in a particular format using a given strategy for processing model metadata
  • ModelIndexerFactory: factory class for creating the ModelIndexer instance for a particular request
  • AnnotationReference: builder of ResourceReference objects given their URI. Delegates the work of retrieving information to implementations of the following interfaces:
    • TermInformationProvider: contract for services that extract basic information like label or description about cross references
    • SynonymProvider: concrete implementations of this interface retrieve synonyms for terms coming from ontologies
    • AncestryProvider: concrete implementations fetch the parents for terms coming from ontologies
  • OLSBasedAnnoProcessor and DatabaseBasedAnnoProcessor: superclasses for the services implementing the interfaces above; these services are invoked by AnnotationReference when constructing a fully-populated ResourceReference
  • AnnotationPersister: interface for saving a ResourceReference instance into the database
  • ProcessingIndexDataStrategy: interface for services defining callbacks that get invoked by RequestHandler once the indexing process has finished. The Solr implementation sends the data from the current RequestContext to the Solr server.

Indexing new content

Jummp instances can define their own strategies for capturing model metadata. This means that models in the same format could be annotated differently in different instances of Jummp. To accommodate this level of flexibility, there is a dedicated indexing strategy for every metadata schema defined in an instance of Jummp. ModelIndexerFactory abstracts this complexity, providing a suitable ModelIndexer instance for the current request context.
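
As a rough illustration of the factory idea, the sketch below picks an indexer from two inputs: the model format and the metadata schema. The registration methods, the `IndexerConstructor` helper, and the `schema()` and `describe()` methods are all hypothetical; the real ModelIndexerFactory may be organised quite differently.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: the nested types and every method name are assumptions.
final class ModelIndexerFactorySketch {

    interface IndexingStrategy {
        String schema();
    }

    interface ModelIndexer {
        String describe();
    }

    // Hypothetical helper: builds an indexer for one format, given a strategy.
    interface IndexerConstructor {
        ModelIndexer create(IndexingStrategy strategy);
    }

    private final Map<String, IndexerConstructor> constructorsByFormat = new HashMap<>();
    private final Map<String, IndexingStrategy> strategiesBySchema = new HashMap<>();

    void registerFormat(String format, IndexerConstructor constructor) {
        constructorsByFormat.put(format, constructor);
    }

    void registerSchema(IndexingStrategy strategy) {
        strategiesBySchema.put(strategy.schema(), strategy);
    }

    // Combines the format-specific indexer with the schema-specific strategy.
    ModelIndexer create(String format, String schema) {
        IndexerConstructor constructor = constructorsByFormat.get(format);
        IndexingStrategy strategy = strategiesBySchema.get(schema);
        if (constructor == null || strategy == null) {
            throw new IllegalArgumentException(
                    "No indexer for format '" + format + "' and schema '" + schema + "'");
        }
        return constructor.create(strategy);
    }

    // Small pre-populated factory used by the demo below.
    static ModelIndexerFactorySketch demo() {
        ModelIndexerFactorySketch factory = new ModelIndexerFactorySketch();
        factory.registerSchema(() -> "default");
        factory.registerFormat("SBML",
                strategy -> () -> "SBML indexer using schema " + strategy.schema());
        return factory;
    }

    public static void main(String[] args) {
        System.out.println(demo().create("SBML", "default").describe());
    }
}
```

Keeping formats and schemas in separate registries means a new schema can be added without touching the format-specific code, which matches the flexibility described above.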

Typically, the ModelIndexer will then parse the submission files seeking to populate the search index. The actual data that is extracted varies according to the model format, but will normally consist of textual information (e.g. model name, element descriptions, publication details), and cross references.

Extracting and resolving cross references

Cross references or resource references are effectively ways of uniquely identifying an external entity. They are used to link an entity from a model to another entity in an external location (typically, a publicly-available ontology or online database). Which cross reference is associated with a model entity can be indicated in several ways, but the most commonly used approaches are URNs and URIs. Both are supported by Jummp Indexer.
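
To make the two notations concrete, a MIRIAM URN typically looks like `urn:miriam:uniprot:P12345`, while the equivalent Identifiers.org URI looks like `http://identifiers.org/uniprot/P12345`. The sketch below splits either form into a collection and an accession. The `CrossReferenceSketch` class and its `Ref` type are hypothetical and are not part of Jummp Indexer; the real parsing logic may handle more cases.

```java
// Illustrative only: this class is a hypothetical helper, not the indexer's own parser.
final class CrossReferenceSketch {

    static final class Ref {
        final String collection;
        final String accession;

        Ref(String collection, String accession) {
            this.collection = collection;
            this.accession = accession;
        }
    }

    private static final String URN_PREFIX = "urn:miriam:";
    private static final String URI_MARKER = "identifiers.org/";

    // Accepts either a MIRIAM URN or an Identifiers.org URI.
    static Ref parse(String reference) {
        if (reference.startsWith(URN_PREFIX)) {
            // In URNs a colon inside the accession is percent-encoded as %3A;
            // only that one escape is handled in this sketch.
            return split(reference.substring(URN_PREFIX.length()), ':', reference, true);
        }
        int at = reference.indexOf(URI_MARKER);
        if (at >= 0) {
            return split(reference.substring(at + URI_MARKER.length()), '/', reference, false);
        }
        throw new IllegalArgumentException("Unrecognised cross reference: " + reference);
    }

    // Splits "collection<sep>accession" at the first separator.
    private static Ref split(String rest, char separator, String original, boolean decodeColon) {
        int sep = rest.indexOf(separator);
        if (sep <= 0 || sep == rest.length() - 1) {
            throw new IllegalArgumentException("Malformed cross reference: " + original);
        }
        String accession = rest.substring(sep + 1);
        if (decodeColon) {
            accession = accession.replace("%3A", ":");
        }
        return new Ref(rest.substring(0, sep), accession);
    }

    public static void main(String[] args) {
        Ref ref = parse("urn:miriam:uniprot:P12345");
        System.out.println(ref.collection + " / " + ref.accession);
    }
}
```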

For certain cross references, including those coming from ontologies hosted by OLS or from public databases like UniProt, Ensembl, Reactome or KEGG, to name a few, Jummp Indexer provides utilities capable of retrieving various types of additional information (label, description, synonyms, parent terms). These details are used by Jummp to improve the search results and to produce a simpler model display.
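
The delegation between AnnotationReference and its providers might look roughly like the sketch below. The interface names come from the key classes list, but every method signature, the shape of `ResourceReference`, and the constructor are assumptions made for illustration. The providers in `main` return canned values rather than querying OLS or a database.

```java
import java.util.List;

// Illustrative only: method signatures and the ResourceReference layout are assumptions.
final class AnnotationReferenceSketch {

    interface TermInformationProvider {
        String label(String uri);
    }

    interface SynonymProvider {
        List<String> synonyms(String uri);
    }

    interface AncestryProvider {
        List<String> parents(String uri);
    }

    static final class ResourceReference {
        final String uri;
        final String label;
        final List<String> synonyms;
        final List<String> parents;

        ResourceReference(String uri, String label, List<String> synonyms, List<String> parents) {
            this.uri = uri;
            this.label = label;
            this.synonyms = synonyms;
            this.parents = parents;
        }
    }

    private final TermInformationProvider termInformation;
    private final SynonymProvider synonymProvider;
    private final AncestryProvider ancestryProvider;

    AnnotationReferenceSketch(TermInformationProvider termInformation,
                              SynonymProvider synonymProvider,
                              AncestryProvider ancestryProvider) {
        this.termInformation = termInformation;
        this.synonymProvider = synonymProvider;
        this.ancestryProvider = ancestryProvider;
    }

    // Asks each provider for its piece and assembles a fully-populated reference.
    ResourceReference build(String uri) {
        return new ResourceReference(
                uri,
                termInformation.label(uri),
                synonymProvider.synonyms(uri),
                ancestryProvider.parents(uri));
    }

    public static void main(String[] args) {
        // Stub providers with canned values; real ones would call an external service.
        AnnotationReferenceSketch builder = new AnnotationReferenceSketch(
                uri -> "apoptotic process",
                uri -> List.of("apoptosis"),
                uri -> List.of("http://identifiers.org/go/GO:0012501"));
        ResourceReference ref = builder.build("http://identifiers.org/go/GO:0006915");
        System.out.println(ref.label + " " + ref.synonyms + " " + ref.parents);
    }
}
```

Splitting retrieval into three small interfaces lets an ontology-backed service and a database-backed service implement only the parts that make sense for them.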

Linking a metadata schema to the appropriate search fields

To improve model search and convey the context of a model, Jummp provides mechanisms for defining a metadata schema, which specifies the properties that can be set for a model along with the range of expected value types for each. The metadata accompanying a model should be mapped to the appropriate search engine fields. This is achieved through implementations of the IndexingStrategy interface. A default one is provided, but alternatives can be developed and plugged into the indexing process.
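
One way to picture such a mapping is a lookup from metadata property names to search field names. In the sketch below, the `toSearchFields` method, the `MappedStrategy` class and the example property and field names are all invented for illustration; the real IndexingStrategy contract and the actual Solr field names are not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: the method signature and all property/field names are hypothetical.
final class IndexingStrategySketch {

    interface IndexingStrategy {
        Map<String, List<String>> toSearchFields(Map<String, String> metadata);
    }

    // Maps each known metadata property to one search field.
    static final class MappedStrategy implements IndexingStrategy {
        private final Map<String, String> fieldByProperty;

        MappedStrategy(Map<String, String> fieldByProperty) {
            this.fieldByProperty = fieldByProperty;
        }

        @Override
        public Map<String, List<String>> toSearchFields(Map<String, String> metadata) {
            Map<String, List<String>> document = new TreeMap<>();
            for (Map.Entry<String, String> entry : metadata.entrySet()) {
                String field = fieldByProperty.get(entry.getKey());
                if (field == null) {
                    continue; // properties without a mapping are skipped in this sketch
                }
                document.computeIfAbsent(field, key -> new ArrayList<>()).add(entry.getValue());
            }
            return document;
        }
    }

    public static void main(String[] args) {
        IndexingStrategy strategy = new MappedStrategy(
                Map.of("organism", "taxonomy", "modellingApproach", "approach"));
        System.out.println(strategy.toSearchFields(
                Map.of("organism", "Homo sapiens", "internalNote", "not indexed")));
    }
}
```

An alternative strategy for a different metadata schema would only need to supply a different mapping, or override the method entirely if several properties feed one field in a non-trivial way.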
