Eric Rochester / bakers12


Eric Rochester  committed 71916ce

Updated the Python tokenizer to produce output closer to Penn Treebank (Penn TB) tokenization.

  • Parent commits 131de2a
  • Branches monad-stack

Files changed (1)

File bin/pytokenize.py

 Token = namedtuple('Token', 'text source raw offset length')
-reTOKEN = re.compile(r"\w+('+\w+)?")
+reTOKEN = re.compile(r"(\w+('+\w+)?)|(\S)")
 def get_files(args):
         for line in fin:
             for match in reTOKEN.finditer(line):
                 raw = match.group(0)
+                if not raw:
+                    raw = match.group(2)
+                if not raw:
+                    continue
                 (start, end) = match.span()
                 yield Token(
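
To make the effect of the pattern change concrete, here is a minimal, self-contained sketch that runs the old and new `reTOKEN` patterns from the diff side by side. The names `OLD_TOKEN`, `NEW_TOKEN`, `tokenize`, and `raw_tokens` are hypothetical, and the way the `Token` fields are filled is an assumption, since the `yield Token(` call is cut off in the diff. Two things may be worth checking against the real file: `match.group(0)` is the whole match and cannot be empty for either pattern, so the added `if not raw` branches look unreachable; and in the new pattern the `(\S)` alternative is group 3, while group 2 is the optional `('+\w+)` suffix.

```python
import re
from collections import namedtuple

# Token fields as declared in the diff above.
Token = namedtuple('Token', 'text source raw offset length')

# The two patterns from the diff: before and after the commit.
OLD_TOKEN = re.compile(r"\w+('+\w+)?")
NEW_TOKEN = re.compile(r"(\w+('+\w+)?)|(\S)")


def tokenize(line, source='<string>', pattern=NEW_TOKEN):
    """Yield a Token for every match of `pattern` in `line`."""
    for match in pattern.finditer(line):
        raw = match.group(0)  # whole match; never empty for these patterns
        start, end = match.span()
        # Assumption: `text` simply mirrors `raw`; any normalization done
        # by the original script is not visible in the diff.
        yield Token(raw, source, raw, start, end - start)


def raw_tokens(line, pattern):
    """Return only the raw token strings, for easy comparison."""
    return [token.raw for token in tokenize(line, pattern=pattern)]


if __name__ == '__main__':
    sample = "Don't stop, okay?"
    print(raw_tokens(sample, OLD_TOKEN))  # ["Don't", 'stop', 'okay']
    print(raw_tokens(sample, NEW_TOKEN))  # ["Don't", 'stop', ',', 'okay', '?']
```

With the old pattern, punctuation is silently dropped; with the new one, every non-whitespace character that is not part of a word becomes its own token, which is the behavior the commit message describes as closer to the Penn TB output.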