Wiki

Clone wiki

acfsa / Home

acfsa - Aho-Corasick implementation in Python

This repository is a simple, fast implementation of the Aho-Corasick algorithm in Cython. This algorithm, widely used in text mining, allows a query text to be searched for exact matches of a potentially huge collections of terms.

Dependencies

  • Cython
  • Numpy
  • a C++ compiler (tested with GCC)

(I know that it requires relatively recent versions of both packages, but I do not know the minimum version that will work. If you try building the package with a version, and it doesn't work, edit this page and the setup.py if you have write access to the repository, or leave an issue if not). To minimize hassle, upgrade these packages to their most recent versions before attempting installation.

I have not tested this package on platforms other than Linux. Probably it would require MinGW or Cygwin to compile on Windows -- please update if you get it to work on Windows.

Installation

$ pip install --user git+https://bitbucket.org/wrenlab/acfsa.git

Usage

acfsa offers two public objects: Trie, and MixedCaseSensitivityTrie. They have similar APIs, but Trie is more primitive in the sense that all terms entered into it will be matched either case-sensitively (CS) or case-insensitively (CI), depending on the option you specify in the constructor, whereas MixedCaseSensitivityTrie allows you to match a collection of terms, some CI and some CS (this option is set per-term).

Generally, these are the steps to using either kind of Trie:

  1. Construct the Trie object, specifying any options
  2. Add dictionary terms to the Trie with Trie.add(text, key) (key is optional, but allows you to easily associate term IDs with terms to be searched)
  3. Call Trie.build(). After this, you can no longer add any new terms to the Trie. If you try to call Trie.search(text) without first calling Trie.build(), the behavior is unspecified.
  4. Search query text with Trie.search(text), which will return a (possibly empty) list of acfsa.Match objects.
  5. Use the match attributes in your application: Match.start and Match.end will return 0-indexed start and end locations for the match, and Match.key will contain the key you associated with the term (or None, if you didn't associate one).

Here is a simple example:

>> from acfsa import Trie
>> t = Trie()
>> t.add("hello", key=1)
>> t.add("you", key=2)
>> t.build()
>> matches = t.search("hello sir, how are you?")
>> matches
[<acfsa.Match @ 0-5, key=1>, <acfsa.Match @ 19-22, key=2>]

From this point, use the Match objects in your application with the Match.start, Match.end, and Match.key attributes.

Trie options

There are three options to the Trie() constructor. All but case_sensitive also apply to the MixedCaseSensitivityTrie:

  • case_sensitive (bool)
  • allow_overlaps (bool) - If True, returns all matches. If False, returns longest nonoverlapping matches.
  • boundary_characters (str) - If provided, matches are required to be bounded on both sides by one of the given characters, or by the query string beginning or end. If you want to find tokens in sentences, for instance, you might provide the string " .?" to split on spaces and ending punctuation. For better results on natural language, you may want to combine this option with a real sentence splitter, such as the Punkt sentence tokenizer from NLTK.

Advanced usage

This package is not made with thread/multiprocessing safety in mind. If you want to use this package in a parallel/multiprocessing environment, it would be safest to build a new Trie in each process. Trying to build the Trie once before fork()ing new processes is untested -- it may work, or may cause weird bugs.

Bugs

Please submit on the issue tracker.

Updated