1. david_walker
  2. Kiva Editor's Assistant


david_walker  committed 7db736a

Token.pos was a single Penn Treebank token type, such as 'NN'. With this checkin, it becomes a list of PosTag namedtuple objects, each of which has a token type and a probability value. In most cases there will be only a single entry in the list, but there can be three or more. This change is necessary because the parser fails to parse some sentences given only the highest-probability part-of-speech tag for each token, but succeeds if lower-probability alternatives are present.

  • Participants
  • Parent commits 465478a
  • Branches parse

Comments (0)

Files changed (1)

File tagger.py

View file
  • Ignore whitespace
 import subprocess
 import tempfile
 import codecs
+import collections
+PosTag = collections.namedtuple('PosTag', 'pos prob')
 TNT_BIN = '/home/david/delphin/bin/tnt'
 TRIGRAM_PATH = '/home/david/delphin/components/tnt/models/wsj.tnt'
         # Execute TNT; capture stderr so it doesn't pollute the console
-        process = subprocess.Popen([TNT_BIN, TRIGRAM_PATH, token_file.name],
+        # the option '-z100' requests that alternative tags be emitted
+        # if they have probability at least one hundredth the best one.
+        process = subprocess.Popen([TNT_BIN, '-z100', TRIGRAM_PATH,
+                                    token_file.name],
         # add part of speech tag to tokens, being careful to align the
-        # pos assignments with the printable tokens we sent
+        # pos assignments with the printable tokens we sent.
         i = 0
         for line in process.communicate()[0].split('\n'):
             if i == len(tokens):
             # find the next token that needs a part of speech assignment
             while tokens[i].non_printing or tokens[i].is_para:
                 i += 1
-            # TNT output for tokens is the token, some spaces, and the
-            # POS tag.
-            tokens[i].pos = line.split()[1]
+            # TNT output for tokens is the token and at least one token
+            # and probability.
+            #
+            # an example of a token "living" with multiple alternative
+            # tags is:
+            #
+            # living NN 8.941239e-01 VBG 8.748627e-02 JJ 1.838984e-02
+            #
+            # Get just the tag and probability values in a list
+            tag_prob_list = line.split()[1:]
+            # The following line produces two iterators over
+            # tag_prob_list that are NOT independent of each other,
+            # which means that when map calls each to provide arguments
+            # to the PosTag namedtuple constructor, they will alternate
+            # elements from tag_prob_list.
+            tokens[i].pos = map(PosTag, *([iter(tag_prob_list)] * 2))
             i += 1