OYSTER / MATRIXTOKENIZER

The primary purpose of the MatrixTokenizer index function ("MatrixTokenizer") is to support indexing (blocking) for the multivalued comparator function Matrix Comparator (MatrixOverlap). MatrixTokenizer is a parsing function used as an index generator (<Index> element). It parses a character string into tokens (substrings); which substrings are selected as tokens depends on the set of delimiter characters provided by the user. MatrixTokenizer has three control parameters. The first specifies the token delimiting characters, the second specifies the minimum length of a token to be indexed, and the third is an optional string of tokens to be excluded from the index (the exclusion list).

Semantics

The operation of MatrixTokenizer is as follows:

1. The input string from the reference is separated into a list of tokens based on the delimiter characters provided as the first control parameter.

2. After tokenization, any token that meets one of the following conditions is dropped from the list of tokens to be indexed:

a. The length of the token is less than the minimum token length specified by the second control parameter (default length = 2).

b. The token matches one of the excluded tokens in the exclusion list specified by the third control parameter.

3. The reference is then indexed for each remaining token in concatenation with all other hash values produced within the same index generator.
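
The tokenizing and filtering steps above can be illustrated with a minimal Python sketch. This is not the OYSTER implementation; the function name matrix_tokenize and its argument names are hypothetical. It assumes that the uppercasing required by MXT.3.5 happens after the length check and before the exclusion check, and that excluded tokens are compared exactly as written in the exclusion list.

```python
def matrix_tokenize(value, extra_delimiters="", min_length=2, excluded=()):
    """Return the tokens of `value` that survive filtering (hypothetical helper)."""
    # The blank is always a delimiter, so map every extra delimiter to a blank.
    table = str.maketrans({ch: " " for ch in extra_delimiters})
    # Splitting on single blanks and discarding empty strings treats a run of
    # delimiters as one separator.
    tokens = [t for t in value.translate(table).split(" ") if t]
    # Step 2a: drop tokens shorter than the minimum length.
    tokens = [t for t in tokens if len(t) >= min_length]
    # MXT.3.5: uppercase the remaining tokens (assumed to happen before step 2b).
    tokens = [t.upper() for t in tokens]
    # Step 2b: drop tokens that appear in the exclusion list as written.
    excluded_set = set(excluded)
    return [t for t in tokens if t not in excluded_set]


print(matrix_tokenize("abc, john-smith de efg", ",-", 3, ["ABC", "EFG"]))
# prints ['JOHN', 'SMITH']
```

In this run "de" is dropped for being shorter than 3 characters, and "ABC" and "EFG" are dropped because they are on the exclusion list.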

Syntax

The syntax for MatrixTokenizer is "MatrixTokenizer(C1, C2, C3)" where

C1 is a string of characters to be used in addition to the blank character as token delimiters. MatrixTokenizer always uses the blank character as a token delimiter, but the user may specify additional characters through this parameter. The string of additional token delimiters is enclosed in apostrophes.

C2 is a non-negative integer value indicating the minimum length of tokens to index. A value of 0 indexes all tokens.

C3 is an optional string enclosed in apostrophes representing a list of excluded items separated by the pipe (|) character.

For example, if the additional token delimiters are the comma (,) and hyphen (-) characters, the minimum token length is 3, and the excluded items are "ABC" and "EFG", then the MatrixTokenizer function would be encoded in the <Segment> element of an <Index> element as <Segment Item="VarOne" Hash="MatrixTokenizer(',-', 3, 'ABC|EFG')"/>

MXT Requirements

MXT.1 Syntax for the MatrixTokenizer Function

MXT.1.1 The name of the function shall be "MatrixTokenizer".

MXT.1.2 The name shall not be case sensitive.

MXT.2 Control parameters for MatrixTokenizer

MXT.2.1 The MatrixTokenizer index function shall have 3 control parameters (arity of 3) C1, C2, and C3 represented as "MatrixTokenizer(C1, C2, C3)"

MXT.2.2 The first control parameter C1 shall specify the additional tokenizing characters.

MXT.2.2.1 The token delimiting characters shall be enclosed in apostrophes.

MXT.2.2.2 The token delimiting characters may be any characters except the ampersand (&), less-than (<), greater-than (>), quote ("), apostrophe ('), and pipe (|) characters.

MXT.2.2.3 If no additional token delimiting characters are given, or they are not given in the proper format, the system shall tokenize only by the blank character.

MXT.2.3 The second control parameter C2 shall specify the minimum length of a token to be indexed.

MXT.2.3.1 The minimum length shall be given as an integer value.

MXT.2.3.2 If the minimum length given is 0, then all tokens shall be indexed.

MXT.2.3.3 If the minimum length is not given or is not given in the proper format, the system shall default the minimum length to 2.

MXT.2.4 The third control parameter C3 shall be the list of excluded token values.

MXT.2.4.1 The list of excluded token values shall be given as a string of characters enclosed by apostrophe characters.

MXT.2.4.2 If there is more than one excluded token value in the list, then consecutive values shall be separated from each other by the pipe (|) character.

MXT.2.4.3 Excluded token values may contain any characters except the ampersand (&), less-than (<), greater-than (>), quote ("), apostrophe ('), and pipe (|) characters.

MXT.2.4.4 If the string of excluded token values is not given or not given in the proper format, no excluded token values shall be defined.
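
The fallback rules in MXT.2.2 through MXT.2.4 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the OYSTER parser: the names resolve_params and _unquote are hypothetical, the three raw parameter texts are assumed to have been separated already, a negative minimum length is assumed to count as "not in the proper format", and a forbidden character anywhere in C1 or C3 is assumed to invalidate the whole parameter.

```python
# Characters that MXT.2.2.2 and MXT.2.4.3 do not allow.
FORBIDDEN = set("&<>\"'|")


def _unquote(raw):
    """Return the text between enclosing apostrophes, or None if malformed."""
    if raw is None:
        return None
    raw = raw.strip()
    if len(raw) >= 2 and raw[0] == "'" and raw[-1] == "'":
        return raw[1:-1]
    return None


def resolve_params(c1=None, c2=None, c3=None):
    """Map raw parameter text to (extra_delimiters, min_length, excluded)."""
    # C1: fall back to blank-only tokenizing when missing or malformed.
    delimiters = _unquote(c1)
    if delimiters is None or any(ch in FORBIDDEN for ch in delimiters):
        delimiters = ""

    # C2: fall back to 2 when missing or not an integer (assumption: also
    # when negative). A value of 0 is kept and means "index all tokens".
    try:
        min_length = int((c2 or "").strip())
    except ValueError:
        min_length = 2
    if min_length < 0:
        min_length = 2

    # C3: fall back to an empty exclusion list when missing or malformed.
    # The pipe is the separator here, so it is removed before validation.
    inner = _unquote(c3)
    if inner is None or any(ch in FORBIDDEN - {"|"} for ch in inner):
        excluded = []
    else:
        excluded = [item for item in inner.split("|") if item]

    return delimiters, min_length, excluded
```

With the parameter texts from the syntax example, resolve_params("',-'", "3", "'ABC|EFG'") gives (",-", 3, ["ABC", "EFG"]); with unquoted or unparsable input, such as resolve_params(",-", "x"), it gives ("", 2, []).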

MXT.3 Input Tokenization

MXT.3.1 The input for MatrixTokenizer shall be a single character string.

MXT.3.2 The input string shall be divided into token values where each token is a substring of the input string delimited by one or more of the token delimiting characters.

MXT.3.3 The token delimiting characters shall not be included in the token values they define.

MXT.3.4 If the length of a token value is less than the user-specified minimum length, then the token value shall not be indexed.

MXT.3.5 Each letter in the remaining token values shall be changed to an uppercase letter.

MXT.3.6 If a token value extracted from the input string matches a token value in the list of excluded token values, then the token value shall not be indexed.

MXT.4 Reference Indexing

MXT.4.1 If the final list of token values is empty, the reference shall not be indexed.

MXT.4.2 Otherwise, the reference shall be indexed for each token value in the final list of token values in concatenation with each hash value produced by other segments in the same <Index> definition including other MatrixTokenizer functions.
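
One way to read MXT.4 is that the index keys for a reference are the combinations of one value from each segment of the <Index> definition. The Python sketch below illustrates that reading only; the function name index_keys is hypothetical, the values are assumed to be joined in segment order with no separator, and "S530" stands in for a hash value produced by some other segment.

```python
from itertools import product


def index_keys(segment_values):
    """Build index keys from per-segment value lists (hypothetical helper).

    `segment_values` holds one list per <Segment>, in definition order. A
    MatrixTokenizer segment contributes its final token list; a single-valued
    hash contributes a one-element list.
    """
    # MXT.4.1: an empty token list means the reference is not indexed.
    if not segment_values or any(not values for values in segment_values):
        return []
    # MXT.4.2: one key per combination of values across the segments.
    return ["".join(combo) for combo in product(*segment_values)]


print(index_keys([["JOHN", "SMITH"], ["S530"]]))
# prints ['JOHNS530', 'SMITHS530']
```

Under this reading, two MatrixTokenizer segments with m and n surviving tokens yield m × n keys for the reference, so short minimum lengths and empty exclusion lists can enlarge the index quickly.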
