1. Jesús Lanchas
  2. Aho-Corasick

Overview

Introduction

This is the source code distribution for an implementation of the Aho-Corasick automaton in Java. For license information, please see LICENSE.

That is a modified version of the original code written by Danny Yoo, and located in https://hkn.eecs.berkeley.edu/~dyoo/java/index.html.

The following commands require Gradle, which can be found here: http://www.gradle.org

Building the jar

To compile the jar, run gradle jar. The resulting jar should be created in:

build/libs/ahocorasick-1.4-SNAPSHOT.jar

Building the documentation

To build the javadocs, run gradle javadoc. The javadocs should be created in:

build/docs/javadoc

New features

Serialization

Now all the objects used to locate the strings are serializable, so now you can prepare a tree, serialize it to a file and in the future deserialize it and use it directly, without re-prepare it.

Deserializing a tree is slower than prepare it and if the tree is big the number of recursive calls may exceed the default limit. You can use the JVM argument ThreadStackSize (see http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html). So, I recommend the serialization of the tree only when it would be necessary :-)

    AhoCorasick tree = new AhoCorasick();
    tree.add("Apple");
    tree.add("Microsoft");
    tree.prepare();

    // Serialization to a temp file 
    File file = File.createTempFile("aho", "corasick");
    FileOutputStream fos = new FileOutputStream(file);
    ObjectOutputStream oos = new ObjectOutputStream(fos);
    oos.writeObject(tree);

    // Deserialization
    ObjectInputStream ois = new ObjectInputStream(new FileInputStream(file));
    tree = (AhoCorasick) ois.readObject();

Helper methods in the AhoCorasick class

To add strings to a tree now you can use the method prepare(String), instead of prepare(byte[] bytes, Object output).

To search strings now you have two options:

  • A progressive search, like in the previous version. The progressiveSearch call makes the first search and the next method advances in the search, providing the successive results. See example 1.

  • A complete search, in one call. The flags in the completeSearch method are used to indicate if overlapped results are allowed (true) or not (false). See example 2. if the method should return only outputs formed with valid tokens (using the StandardTokenizer provided by Lucene). See example 3.

Considering only tokens to create valid outputs

Optionally, you can indicate in the completeSearch methods that only tokens in the input text should be considered to located substrings. In the basic use, if you add to your tree the string al Ma and you search it in the input text Real Madrid, you will get one result. If you force the algorithm to consider only tokens (see example 3) you will not get results, because neither al nor Ma are tokens.

The tokenizer used is the StandardTokenizer provided by Lucene.

Examples

Example 1

A progressive search, like in the previous version.

    AhoCorasick tree = new AhoCorasick();
    tree.add("Input");
    tree.prepare();
    String inputText = "Input text";
    for (Iterator<SearchResult> iter = tree.progressiveSearch(inputText); iter.hasNext();) {
        SearchResult result = (SearchResult) iter.next();
        termsThatHit.addAll(result.getOutputs());
    }

Example 2

A complete search in one call, removing the overlapped results.

    AhoCorasick tree = new AhoCorasick();
    tree.add("Input");
    tree.add("In");
    tree.add("put");
    tree.add("Input text");
    tree.prepare();
    String inputText = "Input text";
    List<OutputResult> results = tree.completeSearch(inputText, false, false); // One result: 'Input text'

Example 3

Considering only tokens to create valid outputs.

    AhoCorasick tree = new AhoCorasick();
    tree.add("Input");
    tree.add("ut text");
    tree.add("text");
    tree.prepare();
    String inputText = "Input text";
    List<OutputResult> results = tree.completeSearch(inputText, true, true); // Two results: 'Input' and 'text'