david_walker / Kiva Editor's Assistant


david_walker  committed 64385e6

Don't split at apostrophes. The simple approach of keeping a list of words that can contain them ("you'll", etc.) fails to account for names, some of which contain multiple apostrophes (e.g. "Ng'ang'a").
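
To illustrate the apostrophe problem, here is a minimal sketch of the old and new checks, reconstructed from the lines this commit touches in rules.py. The function names `split_points_old` and `split_points_new` and the module-level `CONTRACTION_ENDINGS` list are hypothetical stand-ins for illustration; the real logic lives inside PunctSplitRule and operates on tokens, not bare strings.

```python
# Hypothetical stand-in for PunctSplitRule._contraction_endings.
CONTRACTION_ENDINGS = ["t", "s", "d", "ll"]


def split_points_old(word):
    """Return the apostrophe indices where the old check would split.

    The old check skipped an apostrophe only when everything after it
    was a known contraction ending.
    """
    return [i for i, char in enumerate(word)
            if char == "'" and word[i + 1:] not in CONTRACTION_ENDINGS]


def split_points_new(word):
    """The new check never splits at an apostrophe."""
    return []


for word in ["you'll", "don't", "Ng'ang'a"]:
    print(word, split_points_old(word), split_points_new(word))
```

Under the old check the contractions are left alone, but "Ng'ang'a" would be split at both apostrophes, since neither "ang'a" nor "a" is a contraction ending.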

Disable SpellDigitsRule: there are so many exceptions where it shouldn't be applied that it causes more work than it saves.
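
The kind of exception meant here can be shown with a deliberately naive digit speller. This is only a sketch for illustration, not the project's SpellDigitsRule: the `spell_single_digits` function and the `DIGIT_WORDS` table are assumptions, and the two sample strings are the ones quoted in the comment this commit adds above the class.

```python
import re

# Hypothetical lookup table for the digits 1..9.
DIGIT_WORDS = {
    "1": "one", "2": "two", "3": "three", "4": "four", "5": "five",
    "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def spell_single_digits(text):
    """Naively replace every standalone digit 1..9 with its word."""
    return re.sub(r"\b[1-9]\b", lambda m: DIGIT_WORDS[m.group()], text)


print(spell_single_digits("7 a.m. to 9 p.m."))
print(spell_single_digits("children aged 3, 7, and 10"))
```

A rule this simple spells out the times and produces a list that mixes words with digits ("three, seven, and 10"), so each such case would need its own exception.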

  • Parent commits 7c8d5d8
  • Branches default

Files changed (1)

File rules.py

     Avoid splitting numeric punctuation, e.g., 11,000.34 should not be
     split at the comma or the decimal. Also avoid splitting at
-    apostrophes in contractions.
+    apostrophes.
     # this is the same as Token.delimited_decimal_re except that
         re.U | re.X)
-    # TODO: names containing apostrophes are not recognized here. Note
-    # some names may contain more than one apostrophe, e.g. Ng'ang'a.
-    _contraction_endings = [u't', u's', u'd', u'll']
     def __init__(self):
         """Set rule priority and name. """
         Rule.__init__(self, 60, 1.0,
                     # Found punctuation character, and it is not
                     # embedded within a number as a thousands separator
                     # or a decimal point. Check to see if it is an
-                    # apostrophe in a contraction.
-                    if (char == u"'" and
-                        token.str[i + 1:] in PunctSplitRule._contraction_endings):
-                            continue
+                    # apostrophe.
+                    if char == u"'":
+                        continue
                     # Create a transform to split the token at this
                     # point.
         return transforms
+# This rule is currently disabled because it turns out to be more
+# trouble than it's worth: there are too many exceptions that it isn't
+# aware of. For example, times shouldn't be spelled ("7 a.m. to 9 p.m.")
+# and numbers in lists should either all be spelled or all in digits
+# ("children aged 3, 7, and 10").
 class SpellDigitsRule(Rule):
     """Spell out numbers 1..9.
     def __init__(self):
         """Set rule priority and name. """
         Rule.__init__(self, 80, 1.0, "Spell out single digit  numbers.")
+        # DISABLED, see comment at head of class
+        self.enabled = False
     def get_transforms(self, tokens):
         """Return an array of transform objects."""
+        # DISABLED, see comment at head of class
+        self.enabled = False
+        return []
         self.tokens = tokens
         transforms = []