david_walker committed 7db736a

Token.pos was a single Penn Treebank token type, such as 'NN'. With this checkin, it becomes a list of PosTag namedtuple objects, each of which has a token type and a probability value. In most cases there will be only a single entry in the list, but there can be three or more. This change is necessary because the parser fails to parse some sentences given only the highest-probability part-of-speech tag for each token, but succeeds if lower-probability alternatives are present.
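
As an illustration of the new shape of Token.pos, here is a minimal sketch of how calling code might read the list. The helper names `best_tag` and `all_tags` are hypothetical and not part of this commit, and the probabilities are shown as floats for clarity, whereas the commit stores whatever strings TnT prints.

```python
import collections

PosTag = collections.namedtuple('PosTag', 'pos prob')

# Hypothetical value of Token.pos for the token "living" after this change.
# Assumption: probabilities have been converted to float; the commit itself
# keeps them as the strings emitted by TnT.
pos = [
    PosTag('NN', 0.8941239),
    PosTag('VBG', 0.08748627),
    PosTag('JJ', 0.01838984),
]


def best_tag(pos_tags):
    """Return the tag string with the highest probability."""
    return max(pos_tags, key=lambda t: t.prob).pos


def all_tags(pos_tags):
    """Return every candidate tag string, in the order given."""
    return [t.pos for t in pos_tags]
```

Code that previously compared Token.pos directly against a string such as 'NN' would need something like `best_tag(token.pos)` instead.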

Files changed (1)

 import subprocess
 import tempfile
 import codecs
+import collections
+
+PosTag = collections.namedtuple('PosTag', 'pos prob')
 
 TNT_BIN = '/home/david/delphin/bin/tnt'
 TRIGRAM_PATH = '/home/david/delphin/components/tnt/models/wsj.tnt'
         token_file_writer.flush()
 
         # Execute TNT; capture stderr so it doesn't pollute the console
-        process = subprocess.Popen([TNT_BIN, TRIGRAM_PATH, token_file.name],
+        # the option '-z100' requests that alternative tags be emitted
+        # if they have probability at least one hundredth the best one.
+        process = subprocess.Popen([TNT_BIN, '-z100', TRIGRAM_PATH,
+                                    token_file.name],
                                    stdin=subprocess.PIPE,
                                    stdout=subprocess.PIPE,
                                    stderr=subprocess.PIPE)
 
         # add part of speech tag to tokens, being careful to align the
-        # pos assignments with the printable tokens we sent
+        # pos assignments with the printable tokens we sent.
         i = 0
         for line in process.communicate()[0].split('\n'):
             if i == len(tokens):
             # find the next token that needs a part of speech assignment
             while tokens[i].non_printing or tokens[i].is_para:
                 i += 1
-            # TNT output for tokens is the token, some spaces, and the
-            # POS tag.
-            tokens[i].pos = line.split()[1]
+            # TNT output for each token is the token followed by at
+            # least one tag and its probability.
+            #
+            # an example of a token "living" with multiple alternative
+            # tags is:
+            #
+            # living NN 8.941239e-01 VBG 8.748627e-02 JJ 1.838984e-02
+            #
+            # Get just the tag and probability values in a list
+            tag_prob_list = line.split()[1:]
+
+            # The following line produces two iterators over
+            # tag_prob_list that are NOT independent of each other,
+            # which means that when map calls each to provide arguments
+            # to the PosTag namedtuple constructor, they will alternate
+            # elements from tag_prob_list.
+            tokens[i].pos = map(PosTag, *([iter(tag_prob_list)] * 2))
             i += 1
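
The pairing idiom in the last changed line can be tried in isolation. The sketch below is a Python 3 rendering under a few assumptions: the function name `parse_tnt_line` is hypothetical, the probability is converted to float (the commit keeps it as a string), and `zip` over one shared iterator stands in for the `map(PosTag, *([iter(...)] * 2))` call. Under Python 3, `map` returns a lazy iterator rather than a list, so the committed line would need wrapping in `list(...)` to behave the same way there.

```python
import collections

PosTag = collections.namedtuple('PosTag', 'pos prob')


def parse_tnt_line(line):
    """Parse one TnT output line into a list of PosTag tuples.

    Hypothetical helper: the line is the token followed by alternating
    tag and probability fields.
    """
    # Drop the token itself; keep only the tag/probability fields.
    tag_prob_list = line.split()[1:]
    # Passing the same iterator twice makes zip consume the fields
    # pairwise: (tag, prob), (tag, prob), ...
    it = iter(tag_prob_list)
    return [PosTag(tag, float(prob)) for tag, prob in zip(it, it)]


tags = parse_tnt_line(
    "living NN 8.941239e-01 VBG 8.748627e-02 JJ 1.838984e-02")
```

A line with no tag fields yields an empty list, and a trailing unpaired field would be silently dropped by `zip`.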