Commits

Eric Rochester committed 71916ce

Updated the Python tokenizer to have output closer to the Penn TB.
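
A minimal sketch of what the pattern change in this commit does, assuming the two regular expressions are used exactly as they appear in the diff for bin/pytokenize.py. The names OLD_TOKEN, NEW_TOKEN, and tokenize, and the sample sentence, are hypothetical and exist only for this illustration. It shows the punctuation effect only: the old pattern drops punctuation, while the new one emits each non-space, non-word character as its own token, which is closer to how Penn Treebank style output treats punctuation.

```python
import re

# Patterns copied from the diff; the variable names are illustrative only.
OLD_TOKEN = re.compile(r"\w+('+\w+)?")
NEW_TOKEN = re.compile(r"(\w+('+\w+)?)|(\S)")


def tokenize(pattern, line):
    """Return the full matched text (group 0) of every token in line."""
    return [match.group(0) for match in pattern.finditer(line)]


line = "Don't stop, please!"
old_tokens = tokenize(OLD_TOKEN, line)
new_tokens = tokenize(NEW_TOKEN, line)

print(old_tokens)  # ["Don't", 'stop', 'please']
print(new_tokens)  # ["Don't", 'stop', ',', 'please', '!']
```

In the new pattern, group 1 holds a word token (with its optional apostrophe suffix in group 2) and group 3 holds a single punctuation character. Both alternatives consume at least one character, so match.group(0) is never empty for either pattern.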

  • Parent commits 131de2a
  • Branches monad-stack

Files changed (1)

File bin/pytokenize.py

 
 
 Token = namedtuple('Token', 'text source raw offset length')
-reTOKEN = re.compile(r"\w+('+\w+)?")
+reTOKEN = re.compile(r"(\w+('+\w+)?)|(\S)")
 
 
 def get_files(args):
         for line in fin:
             for match in reTOKEN.finditer(line):
                 raw = match.group(0)
+                if not raw:
+                    raw = match.group(2)
+                if not raw:
+                    continue
                 (start, end) = match.span()
 
                 yield Token(