Fuzzy

Fuzzy is a Python library that provides fast implementations of common phonetic algorithms. These are typically used in string similarity exercises, but they're pretty versatile.

It uses C Extensions (via Pyrex) for speed.

The algorithms are Soundex, Double Metaphone (DMetaphone), and NYSIIS.

Installation

Installation should be easy if you have a C compiler such as gcc. All you should need to do is install it with easy_install or pip. If you have Pyrex installed, the C code will be regenerated; otherwise the pre-generated C code is used. Here's a basic installation on a clean virtualenv:

(fuzzy_cean)Kotai:~ chmullig$ pip install https://bitbucket.org/yougov/fuzzy/get/1.0.tar.gz
Downloading/unpacking https://bitbucket.org/yougov/fuzzy/get/1.0.tar.gz
  Downloading 1.0.tar.gz
  Running setup.py egg_info for package from https://bitbucket.org/yougov/fuzzy/get/1.0.tar.gz
Installing collected packages: Fuzzy
  Running setup.py install for Fuzzy
    building 'fuzzy' extension
    gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes
        -DENABLE_DTRACE -arch i386 -arch ppc -arch x86_64 -pipe -I/System/Library/Frameworks/Python.framework/Versions/2.6/include/python2.6
        -c src/fuzzy.c -o build/temp.macosx-10.6-universal-2.6/src/fuzzy.o
    gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes
        -DENABLE_DTRACE -arch i386 -arch ppc -arch x86_64 -pipe -I/System/Library/Frameworks/Python.framework/Versions/2.6/include/python2.6
        -c src/double_metaphone.c -o build/temp.macosx-10.6-universal-2.6/src/double_metaphone.o
    gcc-4.2 -Wl,-F. -bundle -undefined dynamic_lookup -arch i386 -arch ppc -arch x86_64
        build/temp.macosx-10.6-universal-2.6/src/fuzzy.o build/temp.macosx-10.6-universal-2.6/src/double_metaphone.o
        -o build/lib.macosx-10.6-universal-2.6/fuzzy.so
Successfully installed Fuzzy
Cleaning up...
(fuzzy_cean)Kotai:~ chmullig$

Usage

The functions are quite easy to use!

>>> import fuzzy
>>> soundex = fuzzy.Soundex(4)
>>> soundex('fuzzy')
'F200'
>>> dmeta = fuzzy.DMetaphone()
>>> dmeta('fuzzy')
['FS', None]
>>> fuzzy.nysiis('fuzzy')
'FASY'
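
To make the 'F200' result above less opaque, here is a minimal pure-Python sketch of the classic American Soundex rules. It is not Fuzzy's C implementation, and the name soundex_sketch is hypothetical; it assumes the common variant in which h and w do not separate duplicate codes, so the library may differ from it on some inputs.

```python
# Classic Soundex digit groups; vowels, y, h and w have no digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex_sketch(word, size=4):
    """Illustrative Soundex encoder (hypothetical helper, not part of Fuzzy)."""
    # Keep only ASCII letters, lowercased.
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()          # the first letter is kept verbatim
    prev = _CODES.get(letters[0], "")    # its code still suppresses duplicates
    for ch in letters[1:]:
        if ch in "hw":
            continue                     # assumption: h/w do not separate duplicates
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code                      # vowels and y reset prev to ""
        if len(result) == size:
            break
    return result.ljust(size, "0")       # pad short codes with zeros
```

With these rules, 'fuzzy' keeps its F, the two z's collapse into a single 2, and the code is padded with zeros, which matches the 'F200' shown in the session above.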

Performance

Fuzzy's Double Metaphone was ~10 times faster than the pure Python implementation by Andrew Collins in some recent testing. Soundex and NYSIIS should see similar speedups. Using IPython's timeit:

In [3]: timeit soundex('fuzzy')
1000000 loops, best of 3: 326 ns per loop

In [4]: timeit dmeta('fuzzy')
100000 loops, best of 3: 2.18 us per loop

In [5]: timeit fuzzy.nysiis('fuzzy')
100000 loops, best of 3: 13.7 us per loop
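
Outside IPython, a comparable measurement can be sketched with the standard library's timeit module. The helper name best_ns_per_call is hypothetical, and the lambda wrapper adds a small per-call overhead, so the numbers will not match IPython's exactly. The demo times str.upper as a stand-in; in practice you would pass an encoder such as fuzzy.Soundex(4) from the Usage section.

```python
import timeit


def best_ns_per_call(func, arg, number=100000, repeat=3):
    """Best-of-`repeat` average time per call of func(arg), in nanoseconds."""
    # timeit.repeat returns one total time (in seconds) per repetition.
    totals = timeit.repeat(lambda: func(arg), number=number, repeat=repeat)
    return min(totals) / number * 1e9


if __name__ == "__main__":
    # Stand-in callable so the sketch runs without Fuzzy installed.
    print("%.0f ns per loop" % best_ns_per_call(str.upper, "fuzzy"))
```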

Distance Metrics

We recommend the Python-Levenshtein module for fast, C-based string distance/similarity metrics. Among other functions, it includes Levenshtein edit distance, Jaro distance, and Jaro-Winkler distance.

In testing it's been several times faster than comparable pure Python implementations of those algorithms.
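
For reference, this is roughly what a pure Python implementation of Levenshtein edit distance looks like. It is a minimal sketch of the textbook two-row dynamic programme, not code from Python-Levenshtein, and the function name levenshtein_sketch is hypothetical.

```python
def levenshtein_sketch(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a                      # keep the row over the shorter string
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]
```

The nested loops do work proportional to len(a) * len(b) in interpreted Python, which is the kind of cost a C extension avoids.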
