Creating Oyster Functions

Abstract Parent Classes and Interfaces

Abstract parent classes and interfaces are used to standardize functionality and behavior and to make it easy to add new functions. The class hierarchy is:

  • OysterAction base interface class
    • Compare
    • Score
    • Tokenize
    • Transform
  • OysterFunction abstract base class for all functions
    • Inner enum FunctionType
    • Inner class Builder
    • OysterFunctionDeterministic abstract base class for deterministic functions implementing these interfaces:
      • Compare
      • Transform
      • Tokenize
    • OysterFunctionProbabilistic abstract base class for probabilistic functions implementing these interfaces:
      • Compare
      • Score
    • OysterFunctionTokenizer abstract base class for tokenizer-only functions implementing these interfaces:
      • Tokenize

To create a new function, identify whether it is deterministic, probabilistic, or a tokenizer and extend the appropriate base class. This provides the core and default functionality needed for that type of function. In most cases you will only need to implement the one abstract method (transform for deterministic or score for probabilistic) and possibly the configure method if the function takes additional parameters. The base classes provide the rest of the needed functionality. It is often easiest to find an existing function with similar behavior, copy it, and modify the copy.

OysterFunction

OysterFunction is the abstract base class of all functions. It uses inner classes to provide a mechanism for Oyster to safely instantiate and use the function classes. It also provides helpers and default functionality to all child classes.

  • void configure(String) - configuration method that throws an error if any arguments are passed for functions that don't take any arguments
  • String[] parseArgs(String) - function signature argument parsing that uses
    • comma (,) as the parameter separator
    • single quote (') to quote strings
    • space ( ) is ignored outside of quoted strings
  • boolean isArgValid(String...) - validates one or more arguments and returns false if any argument is null or all whitespace
  • boolean rightAnswer(boolean) - if the function is specified with a negation character (~ or !) at the beginning of the function signature, the result value is inverted. All compare functions should use this method to return their result:
return rightAnswer(result);
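
The helper behavior listed above can be sketched as follows. This is a minimal illustration, not the Oyster source: the class name FunctionHelperSketch, the constructor that receives the function signature, and the handling of empty input are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the OysterFunction helpers (not the project code).
class FunctionHelperSketch {
    private final boolean negated;

    // Assumption: a leading '~' or '!' in the signature marks the function as negated.
    FunctionHelperSketch(String signature) {
        String s = signature.trim();
        this.negated = s.startsWith("~") || s.startsWith("!");
    }

    // Splits on commas, treats single quotes as string delimiters,
    // and drops spaces that appear outside quoted strings.
    static String[] parseArgs(String args) {
        if (args == null || args.trim().isEmpty()) {
            return new String[0]; // assumption: no text means no parameters
        }
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuote = false;
        for (char c : args.toCharArray()) {
            if (c == '\'') {
                inQuote = !inQuote;           // the quote characters themselves are not kept
            } else if (c == ',' && !inQuote) {
                out.add(current.toString());  // end of one parameter
                current.setLength(0);
            } else if (inQuote || c != ' ') {
                current.append(c);            // spaces survive only inside quotes
            }
        }
        out.add(current.toString());
        return out.toArray(new String[0]);
    }

    // Returns false if any argument is null or all whitespace.
    static boolean isArgValid(String... args) {
        if (args == null) {
            return false;
        }
        for (String a : args) {
            if (a == null || a.trim().isEmpty()) {
                return false;
            }
        }
        return true;
    }

    // Inverts the result when the function was specified with a negation character.
    boolean rightAnswer(boolean result) {
        return negated ? !result : result;
    }
}
```

With this sketch, parseArgs("0.8, 'a, b'") yields the two parameters 0.8 and a, b, because the comma inside the quoted string is not treated as a separator.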

FunctionType

FunctionType is an enumeration that lists all of the functions available to Oyster. Each entry defines the referential name, such as "SCAN", and the class that provides the functionality, such as "edu.ualr.oyster.functions.deterministic.Scan". To add a new function, add a new entry to the enumeration and create the class it names, as in this definition for SCAN:

SCAN("edu.ualr.oyster.functions.deterministic.Scan"),

Builder

Oyster uses the builder class to actually instantiate and initialize a function when it is referenced in a rule or index. The builder is used in the attribute parser to construct and initialize the defined functions.
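
The relationship between the enumeration and the builder can be sketched roughly as below. All names here (SketchFunction, SketchScan, SketchFunctionType, SketchBuilder, and the build method) are hypothetical stand-ins, and the reflective lookup is an assumption about how a class name string could be turned into a configured instance; the actual Builder may work differently.

```java
import java.util.Locale;

// Hypothetical stand-in for the function base class (not the Oyster source).
abstract class SketchFunction {
    // Default configure: reject arguments for functions that take none.
    void configure(String args) {
        if (args != null && !args.trim().isEmpty()) {
            throw new IllegalArgumentException("Function takes no arguments: " + args);
        }
    }
}

// Hypothetical function class with no parameters.
class SketchScan extends SketchFunction { }

enum SketchFunctionType {
    // The wiki's enum stores a literal class name string. getName() is used here
    // only so the sketch works in whatever package it is compiled in.
    SCAN(SketchScan.class.getName());

    final String className;

    SketchFunctionType(String className) {
        this.className = className;
    }
}

class SketchBuilder {
    // Looks up the enum entry by its referential name, instantiates the named
    // class reflectively, then passes the signature arguments to configure.
    static SketchFunction build(String name, String args) {
        SketchFunctionType type = SketchFunctionType.valueOf(name.toUpperCase(Locale.ROOT));
        SketchFunction fn;
        try {
            Class<?> cls = Class.forName(type.className);
            fn = (SketchFunction) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + type.className, e);
        }
        fn.configure(args);
        return fn;
    }
}
```

Keeping the class name in the enumeration means that registering a new function is a one-line change, which matches the workflow described for FunctionType.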

OysterAction

OysterAction is the parent interface definition for all of the other function interfaces. It is normally not used directly. Its purpose is to provide access to common OysterFunction methods when an interface is being used as the function reference.

Compare

The Compare interface defines the compare(String, String) method. When this interface is implemented by a function, that function can be used in the "Similarity" parameter of a Rule Term. All Compare methods should be careful to always use the rightAnswer() method to normalize the boolean result value based on the function signature.

Deterministic functions apply the transform method to both arguments and return the true/false result of String.equals on the two transformed values. Some comparators are case sensitive and some are not, depending on the requirements of the algorithm.

Probabilistic functions return true/false depending on whether the result of the score method is greater than or equal to the configured threshold value. The threshold is passed as a parameter in the function signature in the Similarity specification and must be greater than 0.0 and less than or equal to 1.0. If the calculated normalized score is less than the specified threshold the result is false; otherwise the result is true. Note that a threshold of 0.0 is normally considered invalid because every score would satisfy it, and 0.0 is the score returned when one or both of the supplied attributes for comparison are invalid.

Transform

The Transform interface defines the transform(String) method. Transform takes a string input and returns an output that is altered by the indicated algorithm.

Tokenize

The Tokenize interface defines the tokenize(String) method. This method is used by indexing to generate an array of strings containing 0 to n elements. For functions that are not list or matrix functions, tokenize returns an array containing a single value that is the output of the transform method for the supplied string parameter.

Score

The Score interface defines the score(String, String) method. This method uses the named algorithm to compute a normalized score based on the calculated distance between the two strings. This computed value is used by the compare method to determine if the threshold value has been reached.

OysterFunctionDeterministic

OysterFunctionDeterministic is a child class of OysterFunction and the abstract parent class of deterministic functions. It provides default implementations of the compare and tokenize methods so that the only method that must be defined in the child function class is transform. The reason for this is that transform provides the unique functionality of each defined function.

compare(String, String) default behavior

Calls transform on each string and compares the results using the String.equals() method [case sensitive] and returns the resulting true or false boolean value.

tokenize(String) default behavior

Calls transform on the string and stores the result in a single-element String array, which is returned.

transform(String) abstract method

The transform method is defined as abstract so that it must be implemented by the child function class.
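
The division of labor described in this section can be sketched as follows. SketchDeterministic and SketchAlnumUpper are hypothetical names for this example only. The sketch leaves out the rightAnswer() negation step and assumes that a null transform result never matches.

```java
import java.util.Locale;

// Hypothetical stand-in for the deterministic base class (not the Oyster source).
abstract class SketchDeterministic {
    // The only method a child class must supply.
    abstract String transform(String value);

    // Default compare: transform both values, then case-sensitive String.equals.
    // Assumption: a null transform result never matches.
    boolean compare(String a, String b) {
        String ta = transform(a);
        String tb = transform(b);
        return ta != null && ta.equals(tb);
    }

    // Default tokenize: a single-element array holding the transform output.
    String[] tokenize(String value) {
        return new String[] { transform(value) };
    }
}

// Hypothetical function: keeps letters and digits only, then upper-cases.
class SketchAlnumUpper extends SketchDeterministic {
    @Override
    String transform(String value) {
        if (value == null) {
            return null;
        }
        return value.replaceAll("[^A-Za-z0-9]", "").toUpperCase(Locale.ROOT);
    }
}
```

Because compare and tokenize are inherited, the child class consists of the transform method alone. Here compare("O'Brien", "obrien") is true since both values transform to OBRIEN.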

OysterFunctionProbabilistic

OysterFunctionProbabilistic is a child class of OysterFunction and the abstract parent class of probabilistic functions. It provides default configure and compare methods.

configure(String parameters)

The configure method takes a single "threshold" parameter. The class implements the threshold property and its setter and getter methods. It also exposes the minimum and maximum threshold values, with their setters and getters, for special cases. The parameter is validated, and an IllegalArgumentException is thrown if the supplied value is non-numeric, less than the minimum [default 0.0], or greater than the maximum [default 1.0].

If the child function class has more complex configuration requirements it can override the configure method and still use the parent class threshold property and accessors. Threshold validation is performed in the setter methods.

compare(String,String) default behavior

Calls the score(String, String) method with the supplied parameters and then compares the resulting score against the threshold value. If the threshold has not been set (configured) it throws an IllegalArgumentException. This is lazy validation, so the parameter is only required in cases where the compare method is used (Similarity).

score(String,String) abstract method

The score method is defined as abstract so that it must be defined in the child function class.
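
A rough sketch of the threshold handling and the lazy validation in compare is shown below. SketchProbabilistic and SketchPositionalMatch are hypothetical names, the positional scoring rule is invented purely for illustration, and the rightAnswer() negation step is omitted.

```java
// Hypothetical stand-in for the probabilistic base class (not the Oyster source).
abstract class SketchProbabilistic {
    private double minThreshold = 0.0;
    private double maxThreshold = 1.0;
    private Double threshold; // stays null until configured

    // The only method a child class must supply.
    abstract double score(String a, String b);

    // Validation lives in the setter so that overriding configure methods can reuse it.
    void setThreshold(double value) {
        if (Double.isNaN(value) || value < minThreshold || value > maxThreshold) {
            throw new IllegalArgumentException("Threshold out of range: " + value);
        }
        threshold = value;
    }

    Double getThreshold() {
        return threshold;
    }

    // Default configure: a single numeric threshold parameter.
    void configure(String args) {
        if (args == null) {
            throw new IllegalArgumentException("Threshold is required");
        }
        try {
            setThreshold(Double.parseDouble(args.trim()));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Threshold is not numeric: " + args, e);
        }
    }

    // Lazy validation: the threshold is only required when compare is called.
    boolean compare(String a, String b) {
        if (threshold == null) {
            throw new IllegalArgumentException("Threshold has not been configured");
        }
        return score(a, b) >= threshold;
    }
}

// Hypothetical scorer: share of positions holding the same character.
class SketchPositionalMatch extends SketchProbabilistic {
    @Override
    double score(String a, String b) {
        if (a == null || b == null || a.isEmpty() || b.isEmpty()) {
            return 0.0; // invalid input scores 0.0
        }
        int shortest = Math.min(a.length(), b.length());
        int longest = Math.max(a.length(), b.length());
        int same = 0;
        for (int i = 0; i < shortest; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                same++;
            }
        }
        return (double) same / longest;
    }
}
```

In this sketch SMITH and SMYTH agree in four of five positions, so compare returns true with a threshold of 0.75 and false with a threshold of 0.9.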

OysterFunctionTokenizer

OysterFunctionTokenizer is a child class of OysterFunction and the abstract parent class of stand-alone tokenizer functions (not deterministic or probabilistic). It provides common configuration properties along with their accessors:

  • List Delimiters (getListDelimiters & setListDelimiters) - a String containing the characters that are used to delimit elements of a list. There can be one or more delimiters, which are evaluated separately. The default value if not specified is "|" (pipe)
  • Pair Delimiters (getPairDelimiters & setPairDelimiters) - a String containing the characters that are used to delimit key-value pairs. There can be one or more delimiters, which are evaluated separately. The default value if not specified is ":" (colon)
  • Minimum token length (getMinimumLength & setMinimumLength) - This is normally used as the minimum length for tokens that are included in the tokenizer output. The actual use is implementation dependent and can vary by function.

There is no default configuration function because the parameters and their order are not consistent between functions, so no default pattern could be implemented. This means that the child class must implement its own configure method if it takes any parameters, but it can use the parseArgs method of the top-level parent OysterFunction class to transform the method signature into an array of parameters.
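
A child tokenizer that supplies its own configure method might look roughly like this. SketchTokenizer and SketchListTokens are hypothetical names, the parameter order ('<list delimiters>', <minimum length>) is invented for the example, the default minimum length of 0 is an assumption, and the small parseArgs here only stands in for the real helper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the tokenizer base class (not the Oyster source).
abstract class SketchTokenizer {
    private String listDelimiters = "|";
    private String pairDelimiters = ":";
    private int minimumLength = 0; // assumption: no minimum unless configured

    String getListDelimiters() { return listDelimiters; }
    void setListDelimiters(String value) { listDelimiters = value; }
    String getPairDelimiters() { return pairDelimiters; }
    void setPairDelimiters(String value) { pairDelimiters = value; }
    int getMinimumLength() { return minimumLength; }
    void setMinimumLength(int value) { minimumLength = value; }

    // No default pattern exists, so each child defines its own parameters.
    abstract void configure(String args);

    abstract String[] tokenize(String value);

    // Stand-in for parseArgs: comma separated, single-quoted strings,
    // spaces ignored outside quotes.
    static String[] parseArgs(String args) {
        List<String> out = new ArrayList<>();
        if (args == null || args.trim().isEmpty()) {
            return new String[0];
        }
        StringBuilder current = new StringBuilder();
        boolean inQuote = false;
        for (char c : args.toCharArray()) {
            if (c == '\'') {
                inQuote = !inQuote;
            } else if (c == ',' && !inQuote) {
                out.add(current.toString());
                current.setLength(0);
            } else if (inQuote || c != ' ') {
                current.append(c);
            }
        }
        out.add(current.toString());
        return out.toArray(new String[0]);
    }
}

// Hypothetical tokenizer configured as: '<list delimiters>', <minimum length>
class SketchListTokens extends SketchTokenizer {
    @Override
    void configure(String args) {
        String[] p = parseArgs(args);
        if (p.length != 2) {
            throw new IllegalArgumentException("Expected '<list delimiters>', <minimum length>");
        }
        setListDelimiters(p[0]);
        // parseInt throws NumberFormatException, a kind of IllegalArgumentException.
        setMinimumLength(Integer.parseInt(p[1]));
    }

    // Splits on any single delimiter character and drops tokens that are too short.
    @Override
    String[] tokenize(String value) {
        List<String> tokens = new ArrayList<>();
        if (value == null) {
            return new String[0];
        }
        StringBuilder current = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (getListDelimiters().indexOf(c) >= 0) {
                keep(tokens, current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        keep(tokens, current.toString());
        return tokens.toArray(new String[0]);
    }

    private void keep(List<String> tokens, String token) {
        String trimmed = token.trim();
        if (!trimmed.isEmpty() && trimmed.length() >= getMinimumLength()) {
            tokens.add(trimmed);
        }
    }
}
```

Each character of the list delimiter string is treated as its own delimiter, which is one reading of "evaluated separately"; the real tokenizers may interpret it differently.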

Unit Testing

It is imperative that you also create unit tests for your new function. Again, the easiest approach is to copy an existing test class and modify it to suit your specific needs. You will note that the existing unit tests instantiate the function class directly, validate any setters and getters, and invoke the OysterFunction builder class for the new function. Unit tests are stored under src/test/java/edu/ualr/oyster/functions/ in the same package as the class they test.

General notes

Of course, any of these methods may be overridden in the child class for special requirements. This would be the case when a function requires more than the threshold parameter. The child implementation can still call the threshold setter setThreshold() to take advantage of its validation logic. See SmithWatterman for an example.
