
\title{Kiva Editor's Assistant\\Progress Report}
\author{David Walker}

\newcommand{\tkn}[2]{\ensuremath{\langle\!} #1 #2 \ensuremath{\!\rangle}}

The Kiva Editor's Assistant (KEA) provides automated support for a microfinance charity's volunteer editors, whose job is to clean up English-language loan descriptions that are generally written by non-native English speakers. KEA iteratively applies rules to its input; these rules perform not only surface structure tasks (such as tokenizing, tagging, determining sentence boundaries, applying regular expression search/replace operations, and expanding ISO currency abbreviations) but also deep structure analysis, for example to correct the pluralization of phrases. The user is presented with the edited result and a report of the changes made. This paper describes the current status of the project and its implementation goals.

The Kiva Editor's Assistant (KEA) uses TnT and PET's ``cheap'' parser loaded with the ERG to provide automated support for a microfinance charity's volunteer editors, whose task is to clean up English-language loan descriptions that are generally written by non-native English speakers. KEA makes multiple passes over its input to tokenize it and apply rules. These range from simple regular expression search/replace operations, to expanding (only the first occurrence of) ISO currency abbreviations, to correcting the pluralization of phrases like ``is a 20 years old farmer''. The last of these rules consults the ERG parse tree, so it knows that a phrase like ``is a farmer who is 20 years old'' should be left untouched.


KEA uses a rule-based approach to processing text.  Rules are grouped into phases and ordered by priority within each phase. All the rules of a phase are run in priority order, pass after pass, until a pass produces no changes to the text; the next phase is then invoked. When the last phase no longer produces changes to the text, KEA converts its tokens into output text and delivers that along with a report of the changes made to the original text.
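
The phase loop described above can be sketched as follows. This is a minimal illustration and not KEA's actual code: the representation of a rule as a function that returns \texttt{True} when it changed the tokens, and the toy rule itself, are assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical rule type: mutates the token list, returns True if it changed it.
Rule = Callable[[List[str]], bool]


def run_phases(tokens: List[str], phases: List[List[Rule]]) -> List[str]:
    """Run each phase to a fixed point before moving on to the next phase.

    Each phase is a list of rules assumed to be sorted in priority order.
    """
    for rules in phases:
        changed = True
        while changed:
            # One pass: apply every rule of the phase in priority order.
            changed = False
            for rule in rules:
                if rule(tokens):
                    changed = True
    return tokens


def collapse_repeated_the(tokens: List[str]) -> bool:
    """Toy rule: delete one duplicated 'the' per call."""
    for i in range(len(tokens) - 1):
        if tokens[i] == tokens[i + 1] == "the":
            del tokens[i]
            return True
    return False
```

Calling \texttt{run\_phases} with a single phase containing the toy rule keeps making passes until one of them leaves the tokens unchanged.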

The following sections describe the rules in priority order.

\subsection{Initial Phase}
\subsubsection{Splitting at Spaces}
KEA initially creates one token that encompasses the entire input, then subdivides that token. The first subdivision is performed by splitting the input text at spaces, resulting in one token for every space-delimited stretch of characters.  Each token records the beginning and ending offset into the original text that it represents, and this indexing information is preserved when splitting (or, later, merging) tokens.
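
A minimal sketch of offset-preserving whitespace splitting is given below. The \texttt{Token} class and its field names are hypothetical; the only property taken from the description above is that every token remembers the start and end offsets of the original text it covers.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str
    start: int  # offset of the first character in the original text
    end: int    # offset one past the last character


def split_at_spaces(text: str) -> List[Token]:
    """Create one token per space-delimited stretch of characters."""
    return [Token(m.group(), m.start(), m.end())
            for m in re.finditer(r"[^ ]+", text)]
```

Because the offsets come straight from the regular expression match, they always index into the unmodified input text.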

\subsubsection{Splitting at Newlines}

Every run of one or more newline characters is converted into a single paragraph delimiter token. In addition to providing a hard paragraph break, these tokens are also used to provide a clue to the sentence boundary detection rule.

\subsubsection{Splitting at Dots}

A common error in the source text is a lack of spaces after sentence-final punctuation, leading to sentences of the form ``One sentence.And another.'' This rule splits periods into separate tokens, unless they are part of a URL, an abbreviation, or a decimal number.

The preceding example, after the whitespace split, would result in these tokens:

\tkn{One}{0:3} \tkn{sentence.And}{4:16} \tkn{another.}{17:25} 

After splitting at dots, the resulting tokens are:

\tkn{One}{0:3} \tkn{sentence}{4:12} \tkn{.}{12:13} \tkn{And}{13:16} \tkn{another}{17:24} \tkn{.}{24:25} 
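
The dot-splitting step can be sketched as below. The exemption tests are deliberately crude assumptions (a short sample abbreviation list and simple patterns for URLs and decimal numbers); KEA's real checks are not shown in this report.

```python
import re
from typing import List, Tuple

Token = Tuple[str, int, int]  # (text, start, end)

ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e."}  # hypothetical sample
URL_RE = re.compile(r"^(https?://|www\.)", re.IGNORECASE)
DECIMAL_RE = re.compile(r"^\d[\d,]*\.\d+$")


def split_at_dots(token: Token) -> List[Token]:
    """Split a token at periods unless it looks like an exempt form."""
    text, start, _ = token
    if (text.lower() in ABBREVIATIONS or URL_RE.match(text)
            or DECIMAL_RE.match(text)):
        return [token]
    pieces = []
    # Each match is either a single period or a run of non-period characters.
    for m in re.finditer(r"\.|[^.]+", text):
        pieces.append((m.group(), start + m.start(), start + m.end()))
    return pieces
```

Adding the parent token's start offset to each match position keeps the new tokens aligned with the original text.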

\subsubsection{Converting Decimal Delimiters}

Nearly all loans use American-style delimiters for decimal numbers, e.g., ``12,345.67''.  A very small number of loans use European-style delimiters, e.g., ``12.345,67''. For consistency, this rule converts European-style decimal numbers into American-style.
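
One way to sketch this conversion is shown below. The pattern is an assumption: it only recognizes numbers that contain both a dot-separated thousands group and a comma-separated decimal part, so ambiguous forms such as ``1.234'' are left alone.

```python
import re

# European style: groups of three digits separated by dots, then a comma
# and the decimal part, e.g. "12.345,67".
EUROPEAN_RE = re.compile(r"^\d{1,3}(\.\d{3})+,\d+$")


def to_american_style(number: str) -> str:
    """Swap '.' and ',' in a European-style number; leave others untouched."""
    if not EUROPEAN_RE.match(number):
        return number
    return number.translate(str.maketrans(".,", ",."))
```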

\subsubsection{Splitting at Other Punctuation}

This rule splits tokens at punctuation other than dots.  It is constructed to avoid splitting tokens at numeric punctuation (thousands and decimal delimiters), apostrophes used as contractions and possessive markers (but not as quotes), and asterisks that appear at the start of words (these occur in loans as footnote markers).

This rule also avoids splitting hyphenated words; with the way the token lattice is currently constructed, PET will fail to parse separate tokens like ``47'' ``year'' ``-'' ``old'' but will succeed with a sequence like ``47'' ``year-old''.

\subsubsection{Alphanumeric Split}

This rule splits one token into two for a number of cases, illustrated in table \ref{tab:alpha_split}:

\begin{table}
\centering
\begin{tabular}{ll}
\hline
 input      &  output      \\
\hline
 10am       &  10 am       \\
 10.00am    &  10.00 am    \\
 10:00am    &  10:00 am    \\
 10:00a.m.  &  10:00 a.m.  \\
 500foo     &  500 foo     \\
 bar200     &  bar 200     \\
 ksh.1000   &  ksh. 1000   \\
 1,200.     &  1,200 .     \\
 1200.      &  1200 .      \\
\hline
\end{tabular}
\caption{Alphanumeric token splitting.}
\label{tab:alpha_split}
\end{table}
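
The cases in the table can be reproduced with three regular expressions, as in the sketch below. The patterns are assumptions reverse-engineered from the table rows, not KEA's own expressions, and the function works on plain strings rather than offset-carrying tokens.

```python
import re
from typing import List

NUMBER = r"\d(?:[\d.,:]*\d)?"
NUM_THEN_ALPHA = re.compile(rf"({NUMBER})([A-Za-z][A-Za-z.]*)")
ALPHA_THEN_NUM = re.compile(rf"([A-Za-z]+\.?)({NUMBER})")
NUM_THEN_DOT = re.compile(r"(\d[\d,]*)(\.)")


def alphanumeric_split(text: str) -> List[str]:
    """Split a token into two parts if it matches one of the patterns."""
    for pattern in (NUM_THEN_ALPHA, ALPHA_THEN_NUM, NUM_THEN_DOT):
        m = pattern.fullmatch(text)
        if m:
            return [m.group(1), m.group(2)]
    return [text]
```

Tokens that match none of the patterns, such as ``year-old'', are returned unchanged.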

\subsubsection{Regular Expression Search and Replace}

The user must have the opportunity to review every (non-whitespace) change that KEA has made to the original text. This requirement affects the regular expression search and replace (regex) rules. It would be convenient to have the regex rules operate on the original all-encompassing token before any whitespace split occurs, but the changes would then be internal to that single token, and the textual changes would have to be reconstructed by performing a diff operation between the original and the generated output.

Maintaining a link between individual tokens and the indexes of the original text they were initialized from not only makes generating a change report a simpler task but also makes for much easier debugging of the system, as tracking the changes to tokens as the result of successive rule applications is straightforward.

For example, consider the sentence ``Pat joined in the year 2009.''  This is tokenized as:

\tkn{Pat}{0:3} \tkn{joined}{4:10} \tkn{in}{11:13} \tkn{the}{14:17} \tkn{year}{18:22} \tkn{2009}{23:27} \tkn{.}{27:28} 

and is changed by a regex rule to:

\tkn{Pat}{0:3}  \tkn{joined}{4:10}  \tkn{in}{11:13}  \tkn{2009}{23:27}  \tkn{.}{27:28} 

The regex rule processor uses dynamic programming to compute the Levenshtein distance at the token level between the source and target strings, which allows it to perform the minimal number of insertions, deletions, and edits to the source tokens to transform them so they represent the target strings.
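
The token-level edit distance computation can be sketched as follows. This is a textbook Levenshtein table with a backtrace, offered as an illustration under the assumption that tokens are compared by their text; the operation names and tuple layout are hypothetical.

```python
from typing import List, Tuple


def token_edit_script(source: List[str],
                      target: List[str]) -> List[Tuple[str, str, str]]:
    """Return a minimal list of (op, source_token, target_token) steps.

    op is one of "keep", "edit", "delete", "insert"; a missing token is "".
    """
    n, m = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete
                             dist[i][j - 1] + 1,         # insert
                             dist[i - 1][j - 1] + cost)  # keep or edit
    # Walk back from the bottom-right corner to recover the operations.
    script = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1]:
            script.append(("keep", source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            script.append(("edit", source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            script.append(("delete", source[i - 1], ""))
            i -= 1
        else:
            script.append(("insert", "", target[j - 1]))
            j -= 1
    script.reverse()
    return script
```

For the ``Pat joined in the year 2009.'' example, the only non-keep steps are the deletions of ``the'' and ``year'', so every surviving token keeps its original offsets.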

\subsubsection{Spelling Single Digits}

As a purely stylistic measure, single digits are spelled out, unless they indicate a percentage (7\%), are part of a list [e.g. 1) 2) 3)], or indicate an amount of currency (\$7).  Items for further work include recognizing when single digits are part of an address, or occur in a list that contains multi-digit numbers. A frequent example of the latter is a list of children's ages, such as ``Pat has 3 children aged 4, 8, and 12.'' In this case the ``3'' should be spelled out, but the ``4'' and ``8'' should not.
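
A rough sketch of the implemented part of this rule follows. It assumes that punctuation has already been split into separate tokens and uses a sample set of currency symbols; like the current rule, it does not yet handle addresses or lists that mix single-digit and multi-digit numbers.

```python
from typing import List

WORDS = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]
DIGITS = "0123456789"
CURRENCY_SYMBOLS = {"$"}  # hypothetical sample


def spell_single_digits(tokens: List[str]) -> List[str]:
    """Spell out single digits unless they are a percentage, list item, or amount."""
    result = []
    for i, tok in enumerate(tokens):
        prev_tok = tokens[i - 1] if i > 0 else ""
        next_tok = tokens[i + 1] if i + 1 < len(tokens) else ""
        is_single_digit = len(tok) == 1 and tok in DIGITS
        exempt = next_tok in {"%", ")"} or prev_tok in CURRENCY_SYMBOLS
        if is_single_digit and not exempt:
            result.append(WORDS[int(tok)])
        else:
            result.append(tok)
    return result
```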

\subsubsection{Delimiting Currency}

Currency amounts with more than four digits are delimited with commas separating thousands. Identifying a number as a currency amount requires recognizing currency symbols like ``\$'', ISO abbreviations such as ``PHP'' (Philippine Peso), and local currency terms, such as ``/='' (a Ugandan abbreviation for shilling).
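
The delimiting step itself is small, as the sketch below suggests. It reads the threshold as ``more than four digits'' and assumes that the token has already been identified as a currency amount consisting only of digits; the function name is hypothetical.

```python
def delimit_amount(digits: str) -> str:
    """Insert thousands separators into an all-digit amount of 5+ digits."""
    if not (digits.isascii() and digits.isdigit()) or len(digits) <= 4:
        return digits
    return f"{int(digits):,}"
```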

\subsubsection{Concatenating Numbers}

This is the first of the rules that merges tokens. It searches for consecutive numeric tokens that either have spaces where thousands separators should be, or spaces in addition to thousands separators. 
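
A sketch of the merge is given below. The two patterns are assumptions about what counts as a number prefix and a continuing three-digit group; the point of the example is that the merged token spans the offsets of all the tokens it replaces.

```python
import re
from typing import List, Tuple

Token = Tuple[str, int, int]  # (text, start, end)

PREFIX_RE = re.compile(r"\d{1,3}(?:,?\d{3})*,?")  # number so far
GROUP_RE = re.compile(r"\d{3},?")                 # next thousands group


def concatenate_numbers(tokens: List[Token]) -> List[Token]:
    """Merge numeric tokens separated by spaces into a single token."""
    merged: List[Token] = []
    for text, start, end in tokens:
        if (merged and PREFIX_RE.fullmatch(merged[-1][0])
                and GROUP_RE.fullmatch(text)):
            prev_text, prev_start, _ = merged[-1]
            # Keep the start of the first token and the end of the last one.
            merged[-1] = (prev_text + text, prev_start, end)
        else:
            merged.append((text, start, end))
    return merged
```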

\subsubsection{Currency Abbreviation Placement}

This rule swaps the position of an ISO currency abbreviation and a following number, unless the abbreviation is also preceded by a number. For example, 

\tkn{PHP}{0:3} \tkn{1000}{4:8}

becomes

\tkn{1000}{0:3} \tkn{PHP}{4:8}
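
The placement rule can be sketched on token texts as follows. The set of ISO codes is a small sample and the number pattern is an assumption; offset handling is omitted.

```python
import re
from typing import List

ISO_CODES = {"PHP", "KES", "USD"}  # sample of ISO 4217 codes
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def place_currency_after_number(tokens: List[str]) -> List[str]:
    """Swap 'CODE number' to 'number CODE' unless a number also precedes CODE."""
    result = list(tokens)
    i = 0
    while i + 1 < len(result):
        is_code = result[i] in ISO_CODES
        followed = NUMBER_RE.fullmatch(result[i + 1]) is not None
        preceded = i > 0 and NUMBER_RE.fullmatch(result[i - 1]) is not None
        if is_code and followed and not preceded:
            result[i], result[i + 1] = result[i + 1], result[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return result
```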

\subsubsection{ISO Currency Abbreviation Expansion}

The initial occurrence of an ISO currency abbreviation is spelled out.  For example, ``one 5000 KES loan and another 7000 KES loan'' becomes ``one 5000 Kenyan Shilling (KES) loan and another 7000 KES loan''.
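
A sketch of first-occurrence expansion follows. The lookup table holds only two sample entries and the function name is hypothetical.

```python
from typing import List

CURRENCY_NAMES = {  # sample entries only
    "KES": "Kenyan Shilling",
    "PHP": "Philippine Peso",
}


def expand_first_occurrence(tokens: List[str]) -> List[str]:
    """Spell out each ISO currency abbreviation the first time it appears."""
    seen = set()
    result = []
    for tok in tokens:
        if tok in CURRENCY_NAMES and tok not in seen:
            seen.add(tok)
            result.append(f"{CURRENCY_NAMES[tok]} ({tok})")
        else:
            result.append(tok)
    return result
```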

\subsection{POS Phase Rules}

Rules in this phase depend on the TnT part-of-speech tagger having supplied POS tags to all tokens.  Before processing any of the rules in this phase, KEA writes the text of each token to its own line in a temporary file, then invokes TnT on it, using the WSJ PTB tag set.  Early experimentation showed that providing PET with only the highest-priority tag for each token led to parse failures, so KEA launches TnT as a child process with command-line parameters that request not just the highest-probability tag but all tags with at least one hundredth that probability for each token.
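
The effect of the probability threshold can be illustrated with the sketch below. In KEA the filtering is done by TnT itself through its command-line parameters; this function is only a hypothetical stand-in that shows which tags survive a one-hundredth cutoff.

```python
from typing import Dict


def keep_likely_tags(tag_probs: Dict[str, float],
                     ratio: float = 0.01) -> Dict[str, float]:
    """Keep tags whose probability is at least `ratio` times the best one."""
    best = max(tag_probs.values())
    return {tag: p for tag, p in tag_probs.items() if p >= best * ratio}
```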

\subsubsection{Sentence Boundary Detection}

This rule ends a sentence at any token whose POS tag indicates sentence-final punctuation, and also at every hard newline.  A newline is of course generally \emph{not} a reliable indicator of a sentence boundary, but in Kiva loan description text, it is quite reliable.
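
A sketch of this rule on (text, tag) pairs follows. It assumes the Penn Treebank tag ``.'' marks sentence-final punctuation and uses a hypothetical marker string for paragraph delimiter tokens.

```python
from typing import List, Tuple

PARAGRAPH = "<P>"  # hypothetical text of a paragraph delimiter token


def split_sentences(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Group token texts into sentences at '.' tags and paragraph breaks."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for text, tag in tagged:
        if text == PARAGRAPH:
            # A hard newline closes any sentence that is still open.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(text)
        if tag == ".":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```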

\subsection{Parsed Phase Rules}

These rules expect that a parse can at least be attempted, which is why this phase must follow the part of speech tagging phase. For performance reasons, sentences are only parsed on demand.  KEA uses PET (specifically the binary ``cheap'') as an XML-RPC server; if the server isn't detected, KEA will launch it as a daemon.

KEA uses PET Input Chart (PIC) input mode, which requires an XML document\footnote{See}. 
To request the parse of a sentence, KEA converts a sequence of tokens to an XML file using the document type definition pic.dtd from the Heart of Gold\footnote{See} and supplies that to PET. Currently only the w, surface, and pos elements are used; see figure \ref{fig:pic} for an example. 
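
Building such a document can be sketched with Python's standard XML library. The tuple layout, the function name, and the one-based character offsets are assumptions made for the example; only the element names w, surface, and pos and their attribute names come from the PIC format as used here.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# (surface, start, end, [(tag, probability), ...]) -- hypothetical layout
TaggedToken = Tuple[str, int, int, List[Tuple[str, float]]]


def build_pic(tokens: List[TaggedToken]) -> str:
    """Serialize tagged tokens as a pet-input-chart XML string."""
    root = ET.Element("pet-input-chart")
    for n, (surface, start, end, tags) in enumerate(tokens, start=1):
        # Offsets are shifted by one here; this convention is an assumption.
        w = ET.SubElement(root, "w", id=f"W{n}",
                          cstart=str(start + 1), cend=str(end + 1))
        ET.SubElement(w, "surface").text = surface
        for tag, prob in tags:
            ET.SubElement(w, "pos", tag=tag, prio=f"{prob:e}")
    return ET.tostring(root, encoding="unicode")
```

The XML declaration and DOCTYPE line are not emitted by this sketch and would have to be prepended before the document is handed to PET.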

A high priority work item is to either extend the KEA code to create a richer token lattice, or to outsource that job to Heart of Gold's TnTpiXML tokenizer, since some test inputs fail to parse in a reasonable time using the current simplistic token lattice, but are manageable using a lattice created by TnTpiXML.  

In either case, there is another opportunity to improve the lattice that is unique to this application. Every loan description contains the name of at least one borrower; these names are available in an HTML table on the Kiva editing web page.  A future work item, then, is to extract the HTML table contents and use those to mark the names in the token lattice as named entities. As an example, see the handling of the name ``Kim Novak'' at

\subsubsection{Age Expression}

The first (and currently only) parse phase rule searches sentences for expressions like ``x years old'', then examines the parse provided by PET to determine whether the pluralization and hyphenation need to be corrected.  Currently it considers eleven cases. 

The first three cases are considered correct usages and are not changed:

\begin{enumerate}
\item Pat is 47 years old.
\item Pat is a 47-year-old farmer.
\item Pat is 1 year old.
\end{enumerate}

\flushleft{These cases have the correct pluralization, but lack one or more hyphens:}

\begin{enumerate}
\setcounter{enumi}{3}
\item Pat is a 47 year old farmer.
\item Pat is a 47 year old lady who is a farmer.
\item Pat is a 47-year old farmer.
\item Pat is a 47 year-old farmer.
\end{enumerate}

The remaining cases all have incorrect pluralization and, except for the last, are also missing one or more hyphens:

\begin{enumerate}
\setcounter{enumi}{7}
\item Pat is a 47 years old farmer. \label{it:years}
\item Pat is a 47-years old farmer.
\item Pat is a 47 years-old farmer.
\item Pat is a 47-years-old farmer.
\end{enumerate}

Consider case \ref{it:years}. Having determined that the sentence contains the sequence ``years old'', KEA requests a parse tree (figure \ref{fig:farmer}). The grandparent of the \tkn{years}{12:17} token is \texttt{plur\_noun\_orule} and its great-grandparent's sibling is \texttt{npadv}, which together satisfy the criteria for case \ref{it:years}, so KEA replaces the tokens \tkn{47}{9:11} \tkn{years}{12:17} \tkn{old}{18:21} with \tkn{47-year-old}{9:21}.

\begin{figure}
\begin{verbatim}
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE pet-input-chart SYSTEM "/home/david/Projects/Kiva-dev/pic.dtd">
<pet-input-chart>
  <w id="W1" cend="4" cstart="1">
    <surface>Pat</surface>
    <pos tag="NNP" prio="1.000000e+00" />
  </w>
  <w id="W2" cend="7" cstart="5">
    <surface>is</surface>
    <pos tag="VBZ" prio="1.000000e+00" />
  </w>
  <w id="W3" cend="9" cstart="8">
    <surface>a</surface>
    <pos tag="DT" prio="1.000000e+00" />
  </w>
  <w id="W4" cend="12" cstart="10">
    <surface>47</surface>
    <pos tag="CD" prio="1.000000e+00" />
  </w>
  <w id="W5" cend="18" cstart="13">
    <surface>years</surface>
    <pos tag="NNS" prio="1.000000e+00" />
  </w>
  <w id="W6" cend="22" cstart="19">
    <surface>old</surface>
    <pos tag="JJ" prio="1.000000e+00" />
  </w>
  <w id="W7" cend="29" cstart="23">
    <surface>farmer</surface>
    <pos tag="NN" prio="1.000000e+00" />
  </w>
  <w id="W8" cend="31" cstart="30">
    <surface>.</surface>
    <pos tag="." prio="1.000000e+00" />
  </w>
</pet-input-chart>
\end{verbatim}
\caption{PET Input Chart for ``Pat is a 47 years old farmer.''}
\label{fig:pic}
\end{figure}

\begin{figure}
\begin{verbatim}
          <Pat 0:3>
        <is 4:6>
            <a 7:8>
                <47 9:11>
                  <years 12:17>
                    <old 18:21>
                      <farmer 22:28>
                        <. 28:29>
\end{verbatim}
\caption{Parse tree for ``Pat is a 47 years old farmer.''}
\label{fig:farmer}
\end{figure}


As an example of the number concatenation and currency delimiting rules, given the initial tokens:

\tkn{10}{0:2} \tkn{000}{3:6} \tkn{PHP}{7:10}

the concatenation rule will alter them (notice the conservation of the indexes to the original text ranging from 0:6):

\tkn{10000}{0:6} \tkn{PHP}{7:10} 

then the currency delimiting rule will fire, producing:

\tkn{10,000}{0:6} \tkn{PHP}{7:10} 

\begin{table}
\centering
\begin{tabular}{ll}
\hline
 input      &  output      \\
\hline
 10 000     &  10000       \\
\hline
\end{tabular}
\caption{Concatenating numeric tokens.}
\end{table}

\section{Future Work}
\begin{itemize}
\item Improve the token lattice with Heart of Gold.
\item Pluralize currency names.
\item Provide a user-friendly GUI.
\item Support other operating systems.
\end{itemize}