
\title{Kiva Editor's Assistant\\Progress Report}
        David Walker \\



% Use 1 inch margins

%\ensuremath{\!\langle\!} #1 #2 \ensuremath{\!\rangle\!}



The Kiva Editor's Assistant (KEA) provides automated support for a microfinance charity's volunteer editors, whose job is to clean up English-language loan descriptions that are generally written by non-native English speakers. KEA iteratively applies rules to its input. These rules perform surface-structure tasks (tokenizing, tagging, determining sentence boundaries, applying regular expression search/replace operations, and expanding ISO currency abbreviations) as well as deep-structure analysis, such as correcting the pluralization of phrases. The user is presented with the edited result and a report of the changes made. This paper describes the current status of the project and its implementation goals.

The Kiva Editor's Assistant (KEA) uses TnT\cite{website:tnt} and PET's ``cheap'' parser\cite{website:delphin}, loaded with the ERG, with the goal of providing automated support for a microfinance charity's volunteer editors, whose task is to clean up English-language loan descriptions that are generally written by non-native English speakers. KEA makes multiple passes over its input to tokenize it and apply rules. These rules range from simple regular expression search/replace operations, to expanding (only the first occurrence of) ISO currency abbreviations, to correcting the pluralization of phrases like ``is a 20 years old farmer''. The last of these consults the ERG parse tree, so that a phrase like ``is a farmer who is 20 years old'' is left untouched.


KEA uses a rule-based approach to processing text. Rules are grouped in phases and by priority. All the rules of a given phase are run in priority order, repeatedly, until a pass produces no changes to the text, at which point the next phase is invoked. When the last phase no longer produces changes to the text, KEA converts its tokens into output text and delivers that along with a report of the changes made to the original text.
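
The control flow just described can be sketched in Python as follows. This is only an illustration under stated assumptions, not KEA's code: the rule signature, the function \texttt{run\_phases}, and the two toy rules are all hypothetical.

```python
from typing import Callable, List

# Assumed rule signature: mutate the token list in place and
# return True if anything changed.
Rule = Callable[[List[str]], bool]


def run_phases(tokens: List[str], phases: List[List[Rule]]) -> int:
    """Run every phase to a fixed point; return the number of passes made."""
    passes = 0
    for rules in phases:          # each phase's rules are pre-sorted by priority
        changed = True
        while changed:            # repeat until one full pass changes nothing
            changed = False
            for rule in rules:
                if rule(tokens):
                    changed = True
            passes += 1
    return passes


def split_spaces(tokens: List[str]) -> bool:
    """Toy rule: split the first token that contains a space."""
    for i, tok in enumerate(tokens):
        if " " in tok:
            tokens[i:i + 1] = tok.split()
            return True
    return False


def split_final_dot(tokens: List[str]) -> bool:
    """Toy rule: detach a trailing period from the first token that has one."""
    for i, tok in enumerate(tokens):
        if len(tok) > 1 and tok.endswith("."):
            tokens[i:i + 1] = [tok[:-1], "."]
            return True
    return False


tokens = ["One sentence."]
passes = run_phases(tokens, [[split_spaces, split_final_dot]])
```

In this toy run the first pass changes the tokens and the second pass changes nothing, so the phase ends after two passes.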

The following sections describe the rules in priority order.

\subsection{Initial Phase}
\subsubsection{Space Splitting Rule}
KEA initially creates one token that encompasses the entire input, then subdivides that token. The first subdivision is performed by splitting the input text at spaces, resulting in one token for every space-delimited stretch of characters.  Each token records the beginning and ending offset into the original text that it represents, and this indexing information is preserved when splitting (or, later, merging) tokens.
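
A minimal Python sketch of offset-preserving splitting is shown here. The \texttt{Token} class and the function name are assumptions made for the illustration; KEA's actual token objects carry many more features.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str
    start: int  # offset of the first character in the original text
    end: int    # offset one past the last character


def split_at_spaces(token: Token) -> List[Token]:
    """Split one token at runs of spaces, keeping original-text offsets."""
    return [
        Token(m.group(), token.start + m.start(), token.start + m.end())
        for m in re.finditer(r"[^ ]+", token.text)
    ]


source = "A kshs.50 000 loan."
pieces = split_at_spaces(Token(source, 0, len(source)))
```

Because each new token's offsets are computed relative to its parent's start offset, the same function can be applied again to any token produced by an earlier split.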

\subsubsection{Newline Splitting Rule}

Every run of one or more newline characters is converted into a single paragraph delimiter token. In addition to providing a hard paragraph break, these tokens are also used to provide a clue to the sentence boundary detection rule.

\subsubsection{Dot Splitting Rule}

A common error in the source text is a lack of spaces after sentence-final punctuation, leading to sentences of the form ``One sentence.And another.'' This rule splits periods into separate tokens, unless they are part of a URL, an abbreviation, or a decimal number.

The preceding example, after whitespace split, would result in these tokens\footnote{For clarity only two features of the tokens are shown: their text and the offsets into the original text. Many additional features accompany the actual token objects, including part-of-speech tags and various boolean features describing their type.}:

\tkn{One}{0:3} \tkn{sentence.And}{4:16} \tkn{another.}{17:25} 

After splitting at dots, the resulting tokens are:

\tkn{One}{0:3} \tkn{sentence}{4:12} \tkn{.}{12:13} \tkn{And}{13:16} \tkn{another}{17:24} \tkn{.}{24:25} 
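
The split above can be reproduced with a short Python sketch. The abbreviation list and the URL test are placeholders chosen for the example, and tokens are modelled as plain (text, start, end) tuples; none of this is taken from KEA's source.

```python
import re
from typing import List, Tuple

Tok = Tuple[str, int, int]  # (text, start offset, end offset)

ABBREVIATIONS = {"e.g.", "i.e.", "Mr.", "Mrs.", "Dr."}  # assumed sample list


def split_at_dots(tok: Tok) -> List[Tok]:
    """Split a token at periods unless it looks like an exception."""
    text, start, _ = tok
    is_url = "://" in text or text.startswith("www.")
    has_digit = any(c.isdigit() for c in text)  # decimals are handled elsewhere
    if text in ABBREVIATIONS or is_url or has_digit:
        return [tok]
    # Each period becomes its own token; runs of other characters stay together.
    return [
        (m.group(), start + m.start(), start + m.end())
        for m in re.finditer(r"\.|[^.]+", text)
    ]
```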

\subsubsection{Decimal Delimiters Conversion Rule}

Nearly all loans use American-style delimiters for decimal numbers, e.g., ``12,345.67''. A very small number of loans use European-style delimiters, e.g., ``12.345,67''. For consistency, this rule converts European-style decimal numbers into American-style.
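
One way to express this conversion is sketched below in Python. The pattern used to recognize a European-style number (dot-separated groups of three digits followed by a decimal comma) is an assumption; the rule KEA actually uses may be stricter or looser.

```python
import re

# Assumed shape of a European-style number: dot-separated thousands groups
# followed by a decimal comma and one or two decimal digits.
EUROPEAN = re.compile(r"\d{1,3}(?:\.\d{3})+,\d{1,2}")
SWAP = str.maketrans(".,", ",.")  # exchange the two delimiter characters


def to_american(text: str) -> str:
    """Convert European-style delimiters; leave everything else unchanged."""
    if EUROPEAN.fullmatch(text):
        return text.translate(SWAP)
    return text
```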

\subsubsection{Punctuation Splitting Rule}

This rule splits tokens at punctuation other than dots. It is constructed to avoid splitting tokens at numeric punctuation (thousands and decimal delimiters), at apostrophes used in contractions and as possessive markers (but not as quotes), and at asterisks that appear at the start of words (these occur in loans as footnote markers).

This rule also avoids splitting hyphenated words; with the way the token lattice is currently constructed, PET will fail to parse separate tokens like ``47'' ``year'' ``-'' ``old'' but will succeed with a sequence like ``47'' ``year-old''.

\subsubsection{Alphanumeric Splitting Rule}

This rule splits one token into two for a number of cases, illustrated in table \ref{tab:alpha_split}:

\begin{table}[h]
\centering
\begin{tabular}{ll}
\hline
 Input      &  Output      \\
\hline
 10am       &  10 am       \\
 10.00am    &  10.00 am    \\
 10:00am    &  10:00 am    \\
 10:00a.m.  &  10:00 a.m.  \\
 500foo     &  500 foo     \\
 bar200     &  bar 200     \\
 ksh.1000   &  ksh. 1000   \\
 1,200.     &  1,200 .     \\
 1200.      &  1200 .      \\
\hline
\end{tabular}
\caption{Alphanumeric token splitting.}
\label{tab:alpha_split}
\end{table}
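
The cases in the table can be covered by three regular expressions, as in the following Python sketch. The patterns and the ordinal exception are reconstructed from the examples only and are assumptions, not KEA's actual expressions.

```python
import re
from typing import List

NUM = r"\d(?:[\d.,:]*\d)?"  # digits with optional internal . , : delimiters
ORDINAL = re.compile(r"\d+(?:st|nd|rd|th)", re.IGNORECASE)  # 2nd, 10th: keep whole
PATTERNS = [
    re.compile(rf"({NUM})([A-Za-z][A-Za-z.]*)"),  # 10am, 10:00a.m., 500foo
    re.compile(rf"([A-Za-z]+\.?)({NUM})"),        # bar200, ksh.1000
    re.compile(rf"({NUM})(\.)"),                  # 1,200. and 1200.
]


def split_alphanumeric(text: str) -> List[str]:
    """Split a token into two parts if it matches one of the patterns."""
    if ORDINAL.fullmatch(text):
        return [text]
    for pattern in PATTERNS:
        match = pattern.fullmatch(text)
        if match:
            return [match.group(1), match.group(2)]
    return [text]  # e.g. hyphenated forms such as 65-year-old match nothing
```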

\subsubsection{Regular Expression Replacing Rule}

The user must have the opportunity to review every (non-whitespace) change that KEA has made to the original text. This requirement has an impact on the regular expression search and replace (regex) rules. It would be convenient to have the regex rules operate on the original all-encompassing token before any whitespace split occurs, but doing so would make these changes internal to that single token. Textual changes would then have to be reconstructed by performing a diff operation between the original text and the generated output.

Maintaining a link between individual tokens and the indexes of the original text they were initialized from not only makes generating a change report a simpler task but also makes for much easier debugging of the system, as tracking the changes to tokens as the result of successive rule applications is straightforward.

For example, consider the sentence ``Pat joined in the year 2009.''  This is tokenized as:

\tkn{Pat}{0:3} \tkn{joined}{4:10} \tkn{in}{11:13} \tkn{the}{14:17} \tkn{year}{18:22} \tkn{2009}{23:27} \tkn{.}{27:28} 

and is changed by a regex rule to:

\tkn{Pat}{0:3}  \tkn{joined}{4:10}  \tkn{in}{11:13}  \tkn{2009}{23:27}  \tkn{.}{27:28} 

The regex rule processor uses dynamic programming to compute the Levenshtein distance at the token level between the source and target strings, which allows it to perform the minimal number of insertions, deletions, and edits to the source tokens to transform them so they represent the target strings.
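
A textbook version of this computation is sketched below in Python, working on lists of token strings. It is not KEA's implementation: the function name and the operation labels are assumptions, and ties between equally cheap edit scripts are broken arbitrarily.

```python
from typing import List, Tuple


def edit_script(src: List[str], tgt: List[str]) -> List[Tuple[str, str, str]]:
    """Return a minimal list of (operation, source token, target token)."""
    n, m = len(src), len(tgt)
    # dist[i][j] = number of edits needed to turn src[:i] into tgt[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete
                             dist[i][j - 1] + 1,         # insert
                             dist[i - 1][j - 1] + cost)  # keep or edit
    # Walk back from the bottom-right corner to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and src[i - 1] == tgt[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            ops.append(("keep" if same else "edit", src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", src[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", tgt[j - 1]))
            j -= 1
    ops.reverse()
    return ops


script = edit_script(["Pat", "joined", "in", "the", "year", "2009", "."],
                     ["Pat", "joined", "in", "2009", "."])
```

For the example sentence the script keeps five tokens and deletes ``the'' and ``year'', so the surviving tokens retain their original offsets.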

\subsubsection{Single Digit Spelling Rule}

As a purely stylistic measure, single digits are spelled out, unless they indicate a percentage (7\%), are part of a list [e.g. 1) 2) 3)], or indicate an amount of currency (\$7).  Items for further work include recognizing when single digits are part of an address, or occur in a list that contains multi-digit numbers. A frequent example of the latter is a list of children's ages, such as ``Pat has 3 children aged 4, 8, and 12.'' In this case the ``3'' should be spelled out, but the ``4'' and ``8'' should not.
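
The basic rule, without the further-work items, might look like the following Python sketch. It assumes that percent signs, closing parentheses, and dollar signs have already been split into their own tokens, which is an assumption about the token stream and not something stated above.

```python
from typing import List

WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
         "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}


def spell_single_digits(tokens: List[str]) -> List[str]:
    """Spell out lone digits that are not percentages, list markers, or currency."""
    out = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1] if i > 0 else ""
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        exempt = nxt in {"%", ")"} or prev == "$"
        out.append(WORDS[tok] if tok in WORDS and not exempt else tok)
    return out
```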

\subsubsection{Currency Delimiting Rule}

Currency amounts with more than four digits are delimited with commas separating thousands. Identifying a number as being a currency amount requires recognizing currency symbols like ``\$'', ISO abbreviations such as ``PHP'' (Philippine Peso), and local currency terms, such as ``/='' (a Ugandan abbreviation for shilling).
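
As a rough Python illustration, the rule can be written as a pass over token strings. The marker set is a small assumed sample, and treating any immediately adjacent marker as evidence of a currency amount is a simplification.

```python
from typing import List

CURRENCY_MARKERS = {"$", "PHP", "KES", "/="}  # assumed sample, not KEA's full list


def delimit_currency(tokens: List[str]) -> List[str]:
    """Add thousands separators to long numbers next to a currency marker."""
    out = list(tokens)
    for i, tok in enumerate(tokens):
        neighbours = tokens[max(i - 1, 0):i] + tokens[i + 1:i + 2]
        is_currency = any(n in CURRENCY_MARKERS for n in neighbours)
        if tok.isdecimal() and len(tok) > 4 and is_currency:
            out[i] = f"{int(tok):,}"  # e.g. 50000 becomes 50,000
    return out
```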

\subsubsection{Number Concatenating Rule}

This is the first of the rules that merges tokens. It searches for consecutive numeric tokens that either have spaces where thousands separators should be, or spaces in addition to thousands separators. 
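
The merge can be sketched in Python as below, with tokens again modelled as (text, start, end) tuples. The two patterns are assumptions: a number may be extended only if it began with a token of one to three digits, and only by tokens of exactly three digits.

```python
import re
from typing import List, Tuple

Tok = Tuple[str, int, int]  # (text, start offset, end offset)

HEAD = re.compile(r"\d{1,3},?")  # assumed: first group of a spaced-out number
GROUP = re.compile(r"\d{3},?")   # assumed: each following group of thousands


def concat_numbers(tokens: List[Tok]) -> List[Tok]:
    """Merge runs such as 50 + 000 into one token spanning both offsets."""
    out: List[Tok] = []
    open_number = False  # True while the last output token may absorb a group
    for text, start, end in tokens:
        if open_number and GROUP.fullmatch(text):
            prev_text, prev_start, _ = out[-1]
            out[-1] = (prev_text + text, prev_start, end)
        else:
            out.append((text, start, end))
            open_number = bool(HEAD.fullmatch(text))
    return out
```

The merged token keeps the start offset of its first part and the end offset of its last part, so it still maps onto a contiguous stretch of the original text.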

\subsubsection{Currency Abbreviation Ordering Rule}

This rule swaps the position of an ISO currency abbreviation and a following number, unless the abbreviation is also preceded by a number. For example, 

\tkn{PHP}{0:3} \tkn{1000}{4:8}

becomes

\tkn{1000}{4:8} \tkn{PHP}{0:3}
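
A Python sketch of the swap follows. Whole (text, start, end) tuples are exchanged, so each token keeps the offsets of the original text it came from. The set of ISO codes and the number test are assumptions made for the example.

```python
from typing import List, Tuple

Tok = Tuple[str, int, int]  # (text, start offset, end offset)

ISO_CODES = {"PHP", "KES", "UGX"}  # assumed sample of ISO currency codes


def is_number(text: str) -> bool:
    return text.replace(",", "").replace(".", "").isdecimal()


def order_currency(tokens: List[Tok]) -> List[Tok]:
    """Move an ISO code after the number that follows it."""
    out = list(tokens)
    i = 0
    while i + 1 < len(out):
        preceded = i > 0 and is_number(out[i - 1][0])
        if out[i][0] in ISO_CODES and is_number(out[i + 1][0]) and not preceded:
            out[i], out[i + 1] = out[i + 1], out[i]  # swap whole token objects
            i += 2
        else:
            i += 1
    return out
```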

\subsubsection{Currency Abbreviation Expanding Rule}

The initial occurrence of an ISO currency abbreviation is spelled out.  For example, ``one 5000 KES loan and another 7000 KES loan'' becomes ``one 5000 Kenyan Shilling (KES) loan and another 7000 KES loan''.
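
The first-occurrence behaviour can be illustrated with the Python sketch below, which works on plain token strings. The lookup table is a two-entry assumed sample.

```python
from typing import List

CURRENCY_NAMES = {"KES": "Kenyan Shilling", "PHP": "Philippine Peso"}  # assumed sample


def expand_first(tokens: List[str]) -> List[str]:
    """Spell out each ISO code the first time it appears, keeping the code in parentheses."""
    seen = set()
    out: List[str] = []
    for tok in tokens:
        if tok in CURRENCY_NAMES and tok not in seen:
            seen.add(tok)
            out.extend(CURRENCY_NAMES[tok].split() + ["(", tok, ")"])
        else:
            out.append(tok)
    return out


expanded = expand_first("one 5000 KES loan and another 7000 KES loan".split())
```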

As an example of how these rules work together, consider the sentence ``A kshs.50 000 loan.''

\item The initial all-encompassing token is created. 
\tkn{A kshs.50 000 loan.}{0:19} 

\item The Space Splitting Rule splits the initial token. 
\tkn{A}{0:1} \tkn{kshs.50}{2:9} \tkn{000}{10:13} \tkn{loan.}{14:19}

\item The Dot Splitting Rule splits the token \tkn{loan.}{14:19}. 
\tkn{A}{0:1} \tkn{kshs.50}{2:9} \tkn{000}{10:13} \tkn{loan}{14:18} \tkn{.}{18:19}

\item The Alphanumeric Splitting Rule separates ``kshs.'' from ``50'' in the token \tkn{kshs.50}{2:9}.\footnote{The Dot Splitting Rule doesn't perform this split because it doesn't involve itself with tokens that contain digits: some alphanumeric sequences, such as ``2nd'' or ``65-year-old'' should not be split; all this specialized knowledge is collected in the Alphanumeric Splitting Rule.}
\tkn{A}{0:1} \tkn{kshs.}{2:7} \tkn{50}{7:9} \tkn{000}{10:13} \tkn{loan}{14:18} \tkn{.}{18:19}

\item Now that ``kshs.'' stands apart from any digits, the Dot Splitting Rule applies and separates it at the dot. 
\tkn{A}{0:1} \tkn{kshs}{2:6} \tkn{.}{6:7} \tkn{50}{7:9} \tkn{000}{10:13} \tkn{loan}{14:18} \tkn{.}{18:19}

\item The Regular Expression Replacing Rule transforms the adjacent tokens \tkn{kshs}{2:6} \tkn{.}{6:7}, which form the regional abbreviation for Kenyan Shillings, into the ISO standard currency abbreviation KES. 
\tkn{A}{0:1} \tkn{KES}{2:6} \tkn{50}{7:9} \tkn{000}{10:13} \tkn{loan}{14:18} \tkn{.}{18:19} 

\item The Number Concatenating Rule merges \tkn{50}{7:9} \tkn{000}{10:13} into \tkn{50000}{7:13}.
\tkn{A}{0:1} \tkn{KES}{2:6} \tkn{50000}{7:13} \tkn{loan}{14:18} \tkn{.}{18:19} 

\item The Currency Delimiting Rule, finding a five-digit number adjacent to an ISO currency abbreviation, adds a comma to separate the thousands, producing \tkn{50,000}{7:13}.
\tkn{A}{0:1} \tkn{KES}{2:6} \tkn{50,000}{7:13} \tkn{loan}{14:18} \tkn{.}{18:19} 

\item The Currency Abbreviation Ordering Rule swaps the tokens \tkn{KES}{2:6} \tkn{50,000}{7:13}. Note that it is the entire token objects, complete with indexes to the original text, that are swapped, rather than just their string values. The change reporting code can take advantage of this to indicate to the user that these elements were transposed. This illustrates the advantage of splitting the initial token by whitespace before performing regular expression search and replace operations: the indexes into the original text are preserved. The fact that a swap was performed could not be recovered and described to the user if the change reporting code relied merely on a Levenshtein distance computation (or even Damerau-Levenshtein, since the next step inserts tokens between the ones just swapped).
\tkn{A}{0:1} \tkn{50,000}{7:13} \tkn{KES}{2:6} \tkn{loan}{14:18} \tkn{.}{18:19} 

\item Finally, the Currency Abbreviation Expanding Rule expands ``KES''.
\tkn{A}{0:1} \tkn{50,000}{7:13} \tkn{Kenyan}{} \tkn{Shilling}{} \tkn{(}{} \tkn{KES}{2:6} \tkn{)}{} \tkn{loan}{14:18} \tkn{.}{18:19} 

\subsection{POS Phase Rules}

Rules in this phase depend on the TnT part-of-speech tagger having supplied POS tags to all tokens.  Before processing any of the rules in this phase, KEA writes the text of each token to its own line in a temporary file, then invokes TnT on it, using the WSJ PTB tag set.  Early experimentation showed that providing PET with only the highest-priority tag for each token led to parse failures, so KEA launches TnT as a child process with command-line parameters that request not just the highest-probability tag but all tags with at least one hundredth that probability for each token.
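
The threshold is applied by TnT itself, not by KEA, but its effect on one token's candidate tags can be illustrated with a short Python sketch. The function name and the dictionary representation of tag probabilities are assumptions.

```python
from typing import Dict


def keep_likely_tags(tag_probs: Dict[str, float], ratio: float = 0.01) -> Dict[str, float]:
    """Keep every tag whose probability is at least `ratio` times the best one."""
    best = max(tag_probs.values())
    return {tag: p for tag, p in tag_probs.items() if p >= best * ratio}


kept = keep_likely_tags({"NN": 0.9, "VB": 0.05, "JJ": 0.001})
```

Here the best tag has probability 0.9, so the cutoff is 0.009: the second tag survives and the third is dropped.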

\subsubsection{Sentence Boundary Detecting Rule}

This rule looks for a POS tag indicating sentence-final punctuation, as well as terminating sentences when a hard newline is found.  The latter is of course generally \emph{not} a reliable indicator of a sentence boundary, but in Kiva loan description text, it is quite reliable.
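
A Python sketch of the two triggers follows. Tokens are modelled as (text, tag) pairs; the tag ``.'' is the Penn Treebank tag for sentence-final punctuation, while the \texttt{PARA} marker for paragraph delimiter tokens is a hypothetical name.

```python
from typing import List, Tuple

PARAGRAPH = "PARA"  # hypothetical tag carried by paragraph delimiter tokens


def split_sentences(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Group token texts into sentences at final punctuation or paragraph breaks."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for text, tag in tagged:
        if tag == PARAGRAPH:
            if current:                 # a hard newline ends any open sentence
                sentences.append(current)
                current = []
            continue
        current.append(text)
        if tag == ".":                  # sentence-final punctuation
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```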

\subsection{Parsed Phase Rules}

These rules expect that a parse can at least be attempted, which is why this phase must follow the part of speech tagging phase. For performance reasons, sentences are only parsed on demand.  KEA uses PET (specifically the binary ``cheap'') as an XML-RPC server; if the server isn't detected, KEA will launch it as a daemon.

KEA uses PET Input Chart (PIC) input mode, which requires an XML document\footnote{See}. 
To request the parse of a sentence, KEA converts a sequence of tokens to an XML file using the document type definition pic.dtd from the Heart of Gold\footnote{See} and supplies that to PET. Currently only the w, surface, and pos elements are used; see Appendix \ref{apdx:pic} for an example. 
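
Building such a document from tokens can be sketched with Python's standard \texttt{xml.etree.ElementTree} module. The element and attribute names follow the sample chart in the appendix; the function name and the input tuple layout are assumptions, and the XML declaration and DOCTYPE line are omitted for brevity.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Assumed input layout: (surface, cstart, cend, [(tag, priority), ...])
Word = Tuple[str, int, int, List[Tuple[str, float]]]


def build_pic(words: List[Word]) -> str:
    """Serialize tokens as w elements with surface and pos children."""
    root = ET.Element("pet-input-chart")
    for n, (surface, cstart, cend, tags) in enumerate(words, start=1):
        w = ET.SubElement(root, "w", id=f"W{n}", cstart=str(cstart), cend=str(cend))
        ET.SubElement(w, "surface").text = surface
        for tag, prio in tags:
            ET.SubElement(w, "pos", tag=tag, prio=f"{prio:e}")
    return ET.tostring(root, encoding="unicode")


pic = build_pic([("Pat", 1, 4, [("NNP", 1.0)]), ("is", 5, 7, [("VBZ", 1.0)])])
```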

A high priority work item is to either extend the KEA code to create a richer token lattice, or to outsource that job to Heart of Gold's TnTpiXML tokenizer, since some test inputs fail to parse in a reasonable time using the current simplistic token lattice, but are manageable using a lattice created by TnTpiXML.  

In either case, there is another opportunity to improve the lattice that is unique to this application. Every loan description contains the name of at least one borrower; these names are available in an HTML table on the Kiva editing web page.  A future work item, then, is to extract the HTML table contents and use those to mark the names in the token lattice as named entities.\footnote{See the handling of the name ``Kim Novak'' at for an example.}

\subsubsection{Years-Old Rule}

The first (and currently only) parse phase rule searches sentences for expressions like ``x years old'', then examines the parse provided by PET to determine whether the pluralization and hyphenation need to be corrected.

The first three cases are considered correct usages and are not changed:

\item Pat is 47 years old.
\item Pat is a 47-year-old farmer.
\item Pat is 1 year old.

\flushleft{These cases have the correct pluralization, but lack one or more hyphens:}

\item Pat is a 47 year old farmer.
\item Pat is a 47-year old farmer.
\item Pat is a 47 year-old farmer.

The remaining cases all have incorrect pluralization and, except for the last, are missing one or more hyphens:

\item Pat is a 47 years old farmer. \label{it:years}
\item Pat is a 47-years old farmer.
\item Pat is a 47 years-old farmer.
\item Pat is a 47-years-old farmer.

Consider case \ref{it:years}. Having determined that the sentence contains the sequence ``years old'', KEA requests a parse tree (see Appendix \ref{apdx:farmer}). Seeing that the grandparent of the \tkn{years}{12:17} token is $\tt plur\_noun\_orule$ and that its great-grandparent's sibling is $\tt npadv$, which satisfy the criteria for case \ref{it:years}, KEA merges the tokens \tkn{47}{9:11} \tkn{years}{12:17} \tkn{old}{18:21} into \tkn{47-year-old}{9:21}.
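
The final merge step can be sketched in Python as follows. The sketch covers only the token merge for the simple three-token case; the decision to merge, which depends on the parse tree, is assumed to have been made already, and the function name is hypothetical.

```python
from typing import List, Tuple

Tok = Tuple[str, int, int]  # (text, start offset, end offset)


def merge_years_old(tokens: List[Tok], i: int) -> List[Tok]:
    """Merge tokens[i:i+3] (number, year or years, old) into one hyphenated token."""
    (number, start, _), _, (_, _, end) = tokens[i:i + 3]
    merged = (f"{number}-year-old", start, end)  # singular form, fully hyphenated
    return tokens[:i] + [merged] + tokens[i + 3:]


before = [("a", 7, 8), ("47", 9, 11), ("years", 12, 17), ("old", 18, 21), ("farmer", 22, 28)]
after = merge_years_old(before, 1)
```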

\section{Future Work}
\begin{itemize}
\item Improve the token lattice with Heart of Gold.
\item Pluralize currency names.
\item Provide a user-friendly GUI.
\item Support other operating systems.
\end{itemize}


\section{Sample PET Input Chart}
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE pet-input-chart SYSTEM "pic.dtd">
<pet-input-chart>
  <w id="W1" cend="4" cstart="1">
    <surface>Pat</surface>
    <pos tag="NNP" prio="1.000000e+00" />
  </w>
  <w id="W2" cend="7" cstart="5">
    <surface>is</surface>
    <pos tag="VBZ" prio="1.000000e+00" />
  </w>
  <w id="W3" cend="9" cstart="8">
    <surface>a</surface>
    <pos tag="DT" prio="1.000000e+00" />
  </w>
  <w id="W4" cend="12" cstart="10">
    <surface>47</surface>
    <pos tag="CD" prio="1.000000e+00" />
  </w>
  <w id="W5" cend="18" cstart="13">
    <surface>years</surface>
    <pos tag="NNS" prio="1.000000e+00" />
  </w>
  <w id="W6" cend="22" cstart="19">
    <surface>old</surface>
    <pos tag="JJ" prio="1.000000e+00" />
  </w>
  <w id="W7" cend="29" cstart="23">
    <surface>farmer</surface>
    <pos tag="NN" prio="1.000000e+00" />
  </w>
  <w id="W8" cend="31" cstart="30">
    <surface>.</surface>
    <pos tag="." prio="1.000000e+00" />
  </w>
</pet-input-chart>
\caption{PET Input Chart for ``Pat is a 47 years old farmer.''}

\section{Sample Parse Tree}
          <Pat 0:3>
        <is 4:6>
            <a 7:8>
                <47 9:11>
                  <years 12:17>
                    <old 18:21>
                      <farmer 22:28>
                        <. 28:29>
\caption{Parse tree for ``Pat is a 47 years old farmer.''}