Tagging: Adding some notion of ‘close enough’

Issue #3 new
Anton Kolechkin created an issue

Perhaps we should consider using more permissive matching. Right now we look for exact matches, but do you think it might be useful to relax that condition? For example, for words between 5 and 9 characters we could allow matches with a Levenshtein distance < 2, and for 9 or more characters a Levenshtein distance < 3. This would help with hyphenated words as well as with other small typos (for example, I just ran into the word “knee--length” and of course it isn’t picking it up). Should we be less strict in the matching?
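A minimal sketch of the proposed rule, in pure Python. The function names `levenshtein` and `is_close_match` are illustrative, not part of the existing code; the thresholds are exactly the ones suggested above (exact match below 5 characters, distance < 2 for 5–8, distance < 3 for 9 or more):

```python
def levenshtein(s1: str, s2: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    return prev[-1]


def is_close_match(word: str, candidate: str) -> bool:
    """Proposed rule: exact match for short words, distance < 2 for
    words of 5-8 characters, distance < 3 for 9 or more."""
    n = len(word)
    if n < 5:
        return word == candidate
    limit = 2 if n < 9 else 3
    return levenshtein(word, candidate) < limit
```

With this rule, “knee--length” would match “kneelength” (distance 2, word length 12), while short words like “word” would still require an exact match.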

Comments (3)

  1. Anton Kolechkin reporter

    Explanation from @tomasrojo

    by the way, this is NLTK’s definition of Levenshtein distance: Calculate the Levenshtein edit-distance between two strings. The edit distance is the number of characters that need to be substituted, inserted, or deleted, to transform s1 into s2. For example, transforming "rain" to "shine" requires three steps, consisting of two substitutions and one insertion: "rain" -> "sain" -> "shin" -> "shine". These operations could have been done in other orders, but at least three steps are needed.

    Allows specifying the cost of substitution edits (e.g., "a" -> "b"), because sometimes it makes sense to assign greater penalties to substitutions.

    This also optionally allows transposition edits (e.g., "ab" -> "ba"), though this is disabled by default.
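The optional transposition edit described above (in NLTK it is the `transpositions=False` parameter of `nltk.edit_distance`) can be sketched in pure Python as the optimal-string-alignment variant; `osa_distance` is an illustrative name, not NLTK’s implementation:

```python
def osa_distance(s1: str, s2: str) -> int:
    """Edit distance that also counts an adjacent transposition
    (e.g. "ab" -> "ba") as a single edit."""
    d = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        d[i][0] = i                      # delete all of s1[:i]
    for j in range(len(s2) + 1):
        d[0][j] = j                      # insert all of s2[:j]
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            # one extra case: swap of two adjacent characters
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

For example, `osa_distance("ab", "ba")` is 1, whereas plain Levenshtein distance would count it as 2 (one deletion plus one insertion, or two substitutions).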
