A tokenizer for the real world, in Python.

A pipeline-based architecture for transforming messy real-world (MySpace, Twitter) text streams into something that makes sense.

(or: why a flexible pipeline architecture is needed)

source sentence: 
Heyyyyyyyyyy There u
 Want-    to 
go shopping?!?!-!-!-!  :) <3<3<3<3<3<3 jreily@mit.edu <body>gap.com</body> omg xoxoxoxoxoxoxoxoxoxoxoxoxoxoxoxoxoxo

possible tokenization 0: 
hey there you want to go shopping? :) <3 jreily@mit.edu gap.com omg xoxo

possible tokenization 1: 
hey want shopping happy_emoticon heart_emoticon my god xoxoxoxoxoxo

possible tokenization 2: 
hey want shopping xoxo

possible tokenization 3: 
hey want shop

Hook up your pipelines like this:

from Tokenizer import *

myBasicFilterSet = [CaseFilter(), WhiteSpaceNormalizationFilter(), StopWordsFilter(), NormalizeFilter(), DoubleWhiteSpaceFilter()]

print(filterText(myBasicFilterSet, "Hey There you Want-    to go shopping?!?!-!-!-!"))
-> hey want shopping
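
Every filter in the set follows the same simple contract: take a string in, return a transformed string. A minimal sketch of that idea, chaining an emoticon-mapping filter the way filterText chains the filters above (the `filter` method name, the `filter_text` helper, and the emoticon table are assumptions for illustration, not the library's actual API):

```python
class EmoticonToWordFilter:
    """Map common emoticons to emotion words (illustrative table only)."""

    EMOTICONS = {":)": "happy_emoticon", "<3": "heart_emoticon"}

    def filter(self, text):
        # replace each known emoticon with its word form
        for emoticon, word in self.EMOTICONS.items():
            text = text.replace(emoticon, word)
        return text


def filter_text(filters, text):
    # run the text through each filter in pipeline order
    for f in filters:
        text = f.filter(text)
    return text


print(filter_text([EmoticonToWordFilter()], "go shopping :) <3"))
# -> go shopping happy_emoticon heart_emoticon
```

Because the contract is just string-in/string-out, reordering or dropping filters changes the aggressiveness of the tokenization without touching any filter's code.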

* Case filter (Hi -> hi)
* White space to normal space normalization (hey\nhi->hey hi, where \n is new line)
* Internet address (web, email) detection
* Emoticon detection
* Emoticon->language emotion mapping filter (  :) -> happy  )
* Excessive letter normalization (heeelllooo -> hello)
* Auto-spell correction  (teh -> the)
* Contraction normalization/expansion/compression (can't <-> cant <-> cannot)
* Abbreviation expansion/compression (lol <-> laughing out loud)
* Stop words removal, both normal (you) and in acronym (u)
* Non-ASCII letter -> whitespace filter  (why.are-you© -> why are you )
* Excessive white space normalization (hey  hi   there -> hey hi there)
* Word size normalization (hey k love you xoxoxoxoxoxoxoxoxoxoxoxoxoxoxoxoxoxoxoxoxoxoxoxo -> hey love you xoxoxoxoxoxo)
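
The word-size normalization in the last bullet can be sketched with a simple length heuristic: drop one-character noise tokens and truncate very long repetitive runs. The function name and the length thresholds below are illustrative assumptions, not the library's actual values:

```python
def normalize_word_sizes(text, min_len=2, max_len=12):
    """Drop words shorter than min_len; truncate words longer than max_len."""
    kept = []
    for word in text.split():
        if len(word) < min_len:
            continue  # e.g. a stray "k"
        kept.append(word[:max_len])  # e.g. a 48-char "xoxo..." run becomes 12 chars
    return " ".join(kept)


print(normalize_word_sizes("hey k love you " + "xo" * 24))
# -> hey love you xoxoxoxoxoxo
```

A dictionary-aware filter could be smarter about where to cut, but a character cap like this already tames the unbounded "xoxoxo..." case.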

Aaron Zinman <azinman@media.mit.edu>
Alex Dragulescu <dragu@mit.edu>

April 2008
Sociable Media Group
MIT Media Lab