Implementation considerations
Jummp will invoke the indexer every time a new model or an update to an existing model is submitted to an instance. On each occasion, Jummp will generate a JSON file containing basic information about the content to be indexed and will invoke the indexer, passing the JSON file as an argument.
The application is responsible for updating the information used to power the Jummp search engine. There is currently support for Solr and the EBI Search via OmicsDI, although it is possible to plug in alternative technologies. The current Solr schema is described here.
Since Jummp is capable of supporting multiple formats for representing models, the indexer has a dedicated component responsible for dealing with each model format recognised by Jummp.
The typical sequence of events for processing a submission is as follows:
- construct a RequestContext from the JSON file
- establish connections to the external systems it needs (e.g. relational database, search engine)
- instantiate a ModelIndexer capable of extracting information from the submission (see below)
- launch the indexing process, sending any gathered data to the appropriate store
- close all resources and shut down.
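The sequence above can be sketched in Java as follows. This is only an illustration of the control flow under stated assumptions, not the indexer's actual code: ExternalConnection, IndexingRunSketch and the trace list are hypothetical names, RequestContext is reduced to a single field, and JSON parsing is omitted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the context built from the JSON file.
final class RequestContext {
    final String jsonPath;

    RequestContext(String jsonPath) {
        this.jsonPath = jsonPath;
    }
}

// Hypothetical handle on an external system such as the database or search engine.
final class ExternalConnection implements AutoCloseable {
    private final String name;
    private final List<String> trace;

    ExternalConnection(String name, List<String> trace) {
        this.name = name;
        this.trace = trace;
        trace.add("open " + name);
    }

    @Override
    public void close() {
        trace.add("close " + name);
    }
}

final class IndexingRunSketch {
    // Walks through the five steps and returns a trace of what happened.
    static List<String> run(String jsonPath) {
        List<String> trace = new ArrayList<>();
        // 1. Build the request context (JSON parsing omitted in this sketch).
        RequestContext context = new RequestContext(jsonPath);
        trace.add("context " + context.jsonPath);
        // 2. Open the external systems. try-with-resources guarantees step 5:
        //    they are closed in reverse order, even if indexing throws.
        try (ExternalConnection database = new ExternalConnection("database", trace);
             ExternalConnection search = new ExternalConnection("search", trace)) {
            trace.add("indexer created"); // 3. Pick a ModelIndexer for the submission.
            trace.add("indexed");         // 4. Extract data and send it to the store.
        }
        return trace;
    }

    public static void main(String[] args) {
        String jsonPath = args.length > 0 ? args[0] : "request.json";
        run(jsonPath).forEach(System.out::println);
    }
}
```

Tying resource clean-up to try-with-resources keeps the final step from being skipped when an earlier step fails.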
Class diagram
A high-level overview of the project is available as a class diagram.
Key classes
RequestHandler
: the master of ceremonies, responsible for overseeing the indexing of the submission
RequestParser
: parses the information from the JSON file provided by Jummp and constructs a RequestContext
RequestContext
: stores the information extracted from the submission being processed
GormUtil
: facade for interacting with the database and the ORM (GORM)
MiriamRegistryService
: facade for extracting data about MIRIAM-compliant cross references from the Identifiers.org Registry
IndexingStrategy
: contract for components capable of indexing model metadata according to a particular metadata schema defined in Jummp
ModelIndexer
: interface for services capable of indexing models in a particular format using a given strategy for processing model metadata
ModelIndexerFactory
: factory class for creating the ModelIndexer instance for a particular request
AnnotationReference
: builder of ResourceReference objects given their URI. Delegates the work of retrieving information to implementations of the following interfaces:
TermInformationProvider
: contract for services that extract basic information like label or description about cross references
SynonymProvider
: concrete implementations of this interface retrieve synonyms for terms coming from ontologies
AncestryProvider
: concrete implementations fetch the parents for terms coming from ontologies
OLSBasedAnnoProcessor and DatabaseBasedAnnoProcessor
: superclasses for the services implementing the interfaces above; they are invoked by AnnotationReference in the process of constructing a fully populated ResourceReference
AnnotationPersister
: interface for saving a ResourceReference instance into the database
ProcessingIndexDataStrategy
: interface for services defining callbacks that get invoked by RequestHandler once the indexing process has finished. The Solr implementation sends the data from the current RequestContext to the Solr server.
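To make the delegation performed by AnnotationReference more concrete, here is a minimal Java sketch. The interface names come from the list above, but every method name (labelFor, synonymsFor, parentsOf, build), the fields of ResourceReference, and the class AnnotationReferenceSketch itself are assumptions made for illustration; the real signatures may differ.

```java
import java.util.List;
import java.util.Optional;

// Method names on these interfaces are assumptions made for this sketch.
interface TermInformationProvider {
    Optional<String> labelFor(String uri);
}

interface SynonymProvider {
    List<String> synonymsFor(String uri);
}

interface AncestryProvider {
    List<String> parentsOf(String uri);
}

// Simplified value object used only in this sketch.
final class ResourceReference {
    final String uri;
    final String label;
    final List<String> synonyms;
    final List<String> parents;

    ResourceReference(String uri, String label, List<String> synonyms, List<String> parents) {
        this.uri = uri;
        this.label = label;
        this.synonyms = synonyms;
        this.parents = parents;
    }
}

final class AnnotationReferenceSketch {
    private final TermInformationProvider terms;
    private final SynonymProvider synonyms;
    private final AncestryProvider ancestry;

    AnnotationReferenceSketch(TermInformationProvider terms,
                              SynonymProvider synonyms,
                              AncestryProvider ancestry) {
        this.terms = terms;
        this.synonyms = synonyms;
        this.ancestry = ancestry;
    }

    // Builds a populated reference by asking each provider in turn.
    ResourceReference build(String uri) {
        String label = terms.labelFor(uri).orElse(uri); // fall back to the URI itself
        return new ResourceReference(uri, label,
                List.copyOf(synonyms.synonymsFor(uri)),
                List.copyOf(ancestry.parentsOf(uri)));
    }

    public static void main(String[] args) {
        // Stub providers with hard-coded answers stand in for OLS or database lookups.
        AnnotationReferenceSketch builder = new AnnotationReferenceSketch(
                uri -> Optional.of("apoptotic process"),
                uri -> List.of("apoptosis"),
                uri -> List.of("programmed cell death"));
        ResourceReference ref = builder.build("http://identifiers.org/go/GO:0006915");
        System.out.println(ref.label + " " + ref.synonyms + " " + ref.parents);
    }
}
```

Keeping each kind of lookup behind its own interface means an OLS-backed and a database-backed implementation can be swapped without touching the builder.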
Indexing new content
Jummp instances can define their own strategies for capturing model metadata. This means that models
in the same format could be annotated differently in different instances of Jummp. To accommodate
this level of flexibility, there is a dedicated indexing strategy for every metadata schema defined
in an instance of Jummp. ModelIndexerFactory abstracts this complexity, providing an appropriate
ModelIndexer instance for the current request context.
Typically, the ModelIndexer will then parse the submission files to populate the search index.
The actual data that is extracted varies according to the model format, but will normally consist of
textual information (e.g. model name, element descriptions, publication details) and cross references.
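One way to picture the role of ModelIndexerFactory is a registry keyed by model format that hands each indexer the strategy it should use. The sketch below is an assumption-laden illustration rather than the project's implementation: the methods register, createIndexer, schemaName and describe are hypothetical, and the format string "SBML" is used only as an example key.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Reduced to one hypothetical method each so the sketch stays short.
interface IndexingStrategy {
    String schemaName();
}

interface ModelIndexer {
    String describe();
}

final class ModelIndexerFactorySketch {
    // Maps a normalised format name to a function that builds an indexer
    // around the strategy chosen for the current instance's metadata schema.
    private final Map<String, Function<IndexingStrategy, ModelIndexer>> creators = new HashMap<>();

    void register(String format, Function<IndexingStrategy, ModelIndexer> creator) {
        creators.put(normalise(format), creator);
    }

    ModelIndexer createIndexer(String format, IndexingStrategy strategy) {
        Function<IndexingStrategy, ModelIndexer> creator = creators.get(normalise(format));
        if (creator == null) {
            throw new IllegalArgumentException("No indexer registered for format: " + format);
        }
        return creator.apply(strategy);
    }

    private static String normalise(String format) {
        return format.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        ModelIndexerFactorySketch factory = new ModelIndexerFactorySketch();
        factory.register("SBML", strategy -> () -> "SBML indexer using " + strategy.schemaName());
        ModelIndexer indexer = factory.createIndexer("sbml", () -> "default");
        System.out.println(indexer.describe());
    }
}
```

With this shape, supporting a new format or a new metadata schema is a matter of registering another creator or passing a different strategy; callers only ever see ModelIndexer.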
Extracting and resolving cross references
Cross references or resource references are effectively ways of uniquely identifying an external entity. They are used to link an entity from a model to another entity in an external location (typically, a publicly-available ontology or online database). Which cross reference is associated with a model entity can be indicated in several ways, but the most commonly used approaches are URNs and URIs. Both are supported by Jummp Indexer.
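As a rough illustration of handling both notations, the Java sketch below splits a MIRIAM URN (urn:miriam:collection:accession) and an Identifiers.org-style URI (http://identifiers.org/collection/accession) into a collection and an accession. CrossReference and CrossReferenceParserSketch are hypothetical names, and this is not how Jummp Indexer necessarily parses references. In particular, the two notations can use different collection names for the same resource, so a real implementation would need a registry lookup to reconcile them.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical value object: which collection the entity lives in, and its identifier there.
final class CrossReference {
    final String collection;
    final String accession;

    CrossReference(String collection, String accession) {
        this.collection = collection;
        this.accession = accession;
    }
}

final class CrossReferenceParserSketch {
    private static final String MIRIAM_PREFIX = "urn:miriam:";

    static CrossReference parse(String reference) {
        if (reference.startsWith(MIRIAM_PREFIX)) {
            // URN form: the accession is percent-encoded, e.g. "GO%3A0006915".
            String[] parts = reference.substring(MIRIAM_PREFIX.length()).split(":", 2);
            if (parts.length != 2 || parts[1].isEmpty()) {
                throw new IllegalArgumentException("Malformed MIRIAM URN: " + reference);
            }
            return new CrossReference(parts[0], URLDecoder.decode(parts[1], StandardCharsets.UTF_8));
        }
        // URI form: the path is "/collection/accession".
        String path = URI.create(reference).getPath();
        if (path == null || !path.startsWith("/")) {
            throw new IllegalArgumentException("Unsupported reference: " + reference);
        }
        String[] parts = path.substring(1).split("/", 2);
        if (parts.length != 2 || parts[1].isEmpty()) {
            throw new IllegalArgumentException("Malformed reference URI: " + reference);
        }
        return new CrossReference(parts[0], parts[1]);
    }

    public static void main(String[] args) {
        for (String ref : new String[] {
                "urn:miriam:uniprot:P12345",
                "http://identifiers.org/uniprot/P12345"}) {
            CrossReference parsed = parse(ref);
            System.out.println(parsed.collection + " / " + parsed.accession);
        }
    }
}
```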
For certain cross references, including those coming from ontologies hosted by OLS or from public databases such as UniProt, Ensembl, Reactome or KEGG, Jummp Indexer provides utilities capable of retrieving various types of additional information (label, description, synonyms, parent terms). These details are used by Jummp to improve the search results and also to produce a simpler model display.
Linking a metadata schema to the appropriate search fields
In order to improve model search and understanding of the context of a model, Jummp provides mechanisms for
defining a metadata schema, which specifies the properties that can be set for a model, along with the
range of value types expected for each. The metadata accompanying a model should be mapped to the appropriate
search engine fields. This is achieved through implementations of the IndexingStrategy interface. A default
implementation is provided, but alternatives can be developed and plugged into the indexing process.
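A minimal Java sketch of such a mapping is shown below. It assumes a strategy can be reduced to a lookup table from metadata property names to search field names; the method toSearchFields, the class DefaultIndexingStrategySketch, and the property and field names used in the example are all hypothetical and do not reflect the actual IndexingStrategy contract or the Solr schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical single-method view of the contract, for illustration only.
interface IndexingStrategy {
    Map<String, Object> toSearchFields(Map<String, Object> metadata);
}

final class DefaultIndexingStrategySketch implements IndexingStrategy {
    // Metadata property name -> search engine field name.
    private final Map<String, String> fieldByProperty;

    DefaultIndexingStrategySketch(Map<String, String> fieldByProperty) {
        this.fieldByProperty = Map.copyOf(fieldByProperty);
    }

    @Override
    public Map<String, Object> toSearchFields(Map<String, Object> metadata) {
        Map<String, Object> document = new LinkedHashMap<>();
        metadata.forEach((property, value) -> {
            String field = fieldByProperty.get(property);
            if (field != null) {
                document.put(field, value); // properties without a mapping are not indexed
            }
        });
        return document;
    }

    public static void main(String[] args) {
        IndexingStrategy strategy = new DefaultIndexingStrategySketch(
                Map.of("modellingApproach", "approach_field"));
        Map<String, Object> metadata = Map.of(
                "modellingApproach", "ODE",
                "internalNote", "not mapped");
        System.out.println(strategy.toSearchFields(metadata));
    }
}
```

An alternative strategy for a different metadata schema would then only need to supply a different table, or override the mapping logic entirely.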