Overview of ACCENUMB prefixes for an institute

Issue #46 new
Matija Obreza created an issue

The feature described in #45 cannot correctly extract the real sequential number of an accession unless the prefix is stripped out first.

Use case: ACCENUMBs "W6 1000", "W6 1001" or "0X1 1000", "0x1 1001". The method described in #45 will produce 6.1, 6.1001, 0.1 and 0.1 respectively.

Genesys should determine the prefixes used by genebanks for their collections by analyzing all ACCENUMBs of the genebank. The expected output is a list of commonly used prefixes (Strings). We can then analyze the composition of the collection for each individual prefix with the existing "startsWith" filter.

Method

A set of longest common sub-strings is generated from the set of all ACCENUMBs. Unfortunately, that set will also include the very common candidates "PREFIX0" and "PREFIX00", where zero-padding of the number is picked up as part of the prefix. A score is calculated for each suggested prefix based on the total number of accession numbers analyzed, the number of records matching the prefix, and the prefix length. Weights need to be determined.
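A rough sketch of the candidate generation and scoring described above, in Python. It is only an illustration under stated assumptions: it counts leading substrings rather than general longest common sub-strings, and the function names, the `min_count` threshold, the linear score formula and both weights are hypothetical placeholders, since the issue leaves the weights open.

```python
from collections import Counter


def candidate_prefixes(accenumbs, min_count=2):
    """Count how many accession numbers start with each leading substring.

    Simplification (assumption): only leading substrings are considered,
    not arbitrary common sub-strings.
    """
    counts = Counter()
    for number in accenumbs:
        # Proper prefixes only: the full ACCENUMB is never a candidate.
        for end in range(1, len(number)):
            counts[number[:end]] += 1
    # Drop candidates shared by fewer than min_count records.
    return {p: c for p, c in counts.items() if c >= min_count}


def score(prefix, matches, total, w_coverage=1.0, w_length=0.1):
    """Combine coverage and prefix length; both weights are placeholders."""
    return w_coverage * matches / total + w_length * len(prefix)


accenumbs = ["W6 1000", "W6 1001", "W6 2000", "TMp-1", "TMp-2"]
candidates = candidate_prefixes(accenumbs)
ranked = sorted(
    candidates,
    key=lambda p: score(p, candidates[p], len(accenumbs)),
    reverse=True,
)
```

Note that "W6 100" survives as a candidate here, which is the "PREFIX0" effect mentioned above; the score is what would have to push such candidates down once suitable weights are found.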

Results

The analysis results will not be persisted directly. Instead, we will allow further inspection of the records matching each prefix and let the institute admin select which of the determined prefixes should be included on their genebank page. Each selected prefix can be explained with a user-provided comment, e.g. "TMp-" is the prefix for the plantains in the collection.

Using prefixes

The stored prefixes will be considered when generating the seqNo for each accession record. This will provide for more accurate filtering and sorting. The prefix itself can then also be persisted (to sort by prefix + seqNo?).
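As a minimal sketch of how stored prefixes could feed into seqNo generation, assuming the prefixes are available as a plain list of strings and that matching is case-sensitive: the helper name `split_accenumb` and the rule of trying the longest prefix first are assumptions for illustration, not the Genesys implementation.

```python
import re


def split_accenumb(accenumb, prefixes):
    """Return (prefix, seq_no) for the first stored prefix that matches.

    Longest prefixes are tried first so that a more specific prefix
    wins over a shorter one. Returns (None, None) when nothing matches.
    """
    for prefix in sorted(prefixes, key=len, reverse=True):
        if accenumb.startswith(prefix):
            rest = accenumb[len(prefix):]
            # Skip optional whitespace, then read the leading digits.
            match = re.match(r"\s*(\d+)", rest)
            if match:
                return prefix, int(match.group(1))
    return None, None


stored = ["W6", "0X1"]
records = ["W6 1001", "W6 200", "0X1 1000"]
# Sorting by (prefix, seqNo) orders numbers numerically within a prefix.
by_prefix_and_seq = sorted(records, key=lambda a: split_accenumb(a, stored))
```

With this approach "W6 200" sorts before "W6 1001", which a plain string sort would not do. How to treat case variants such as "0X1" and "0x1" is left open here.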
