marisa-trie / README.rst

Full commit


MARISA-Trie structure for Python (2.x and 3.x). Uses marisa-trie C++ library.

MARISA-Trie is a read-only trie that is very memory efficient.

There are official SWIG-based Python bindings included in C++ library distribution; this package provides an alternative unofficial Cython-based pip-installable Python bindings.


pip install marisa-trie


Create a new trie:

>>> import marisa_trie
>>> trie = marisa_trie.Trie()

Build a trie:

>>>[u'key1', u'key2', u'key12'])
<marisa_trie.Trie at ...>

Check if key is in trie:

>>> u'key1' in trie
>>> u'key20' in trie

Each key is assigned an unique ID from 0 to (n - 1), where n is the number of keys; you can use this ID to store a value in a separate structure (e.g. python list):

>>> trie.key_id(u'key2')


In future versions dict-like interface may become builtin.

Key can be reconstructed from the ID:

>>> trie.restore_key(1)

Find all prefixes of a given key:

>>> trie.prefixes(u'key12')
[u'key1', u'key12']

There is also a generator version of .prefixes method called .iter_prefixes.

Find all keys from this trie that starts with a given prefix:

>> trie.keys(u'key1')
[u'key1', u'key12']

(iterator version .iterkeys(prefix) is also available).

It is possible to save a trie to a file:

>>> with open('my_trie.marisa', 'w') as f:
...     trie.write(f)



Load a trie:

>>> trie2 = marisa.Trie()
>>> with open('my_trie.marisa', 'r') as f:
...     trie.load(f)


>>> trie2.load('my_trie.marisa')

Alternatively, you could build a trie using marisa-build command-line utility (provided by underlying C library; it should be downloaded and compiled separately) and then load it from a file.


There are no dedicated benchmarks for this package yet.

My quick tests show that memory usage is quite decent. For a list of 3000000 (3 million) Russian words memory consumption with different data structures (under Python 2.7):

  • list(unicode words) : about 300M
  • Trie from datrie library: about 70M
  • marisa_trie.Trie: 7M

Lookup speed seems to be about 2x slower than with datrie, but I haven't checked this with a good benchmark suite.


Development happens at github and bitbucket:

The main issue tracker is at github:

Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.

If you found a bug in a C++ part please report it to the original bug tracker.

Running tests and benchmarks

Make sure tox is installed and run

$ tox

from the source checkout. Tests should pass under python 2.6, 2.7, 3.2 and 3.3.


At the moment of writing the latest pip release (1.1) does not support Python 3.3; in order to run tox tests under Python 3.3 find the "virtualenv_support" directory in site-packages (of the env you run tox from) and place an sdist zip/tarball of the newer pip (from github) there.

$ tox -c bench.ini

runs benchmarks.

Authors & Contributors

This module is based on marisa-trie C++ library by Susumu Yata & contributors.


Wrapper code is licensed under MIT License. Bundled marisa-trie C++ library is licensed under BSD license.