\documentclass[12pt]{article}

\title{Kiva Editor's Assistant\\Progress Report}
\author{
        David Walker \\
        david.walker64@gmail.com
}
\date{\today}

\renewcommand{\textfraction}{0.05}
\renewcommand{\topfraction}{0.95}
\renewcommand{\bottomfraction}{0.95}
\renewcommand{\floatpagefraction}{0.35}
\setcounter{totalnumber}{5}

% Use 1 inch margins
\addtolength{\oddsidemargin}{-.875in}
\addtolength{\evensidemargin}{-.875in}
\addtolength{\textwidth}{1.75in}
\addtolength{\topmargin}{-.875in}
\addtolength{\textheight}{1.75in}

\newcommand{\tkn}[2]{
\textbar#1~#2~\hspace{-0.3em}\textbar
}

\newcommand{\tkno}[1]{
\textbar#1~\hspace{-0.3em}\textbar
}

\newenvironment{compactenumerate}%
  {\begin{enumerate}%
    \setlength{\itemsep}{0pt}%
    \setlength{\parskip}{0pt}}%
  {\end{enumerate}}


\begin{document}
\maketitle

\begin{abstract}
The Kiva Editor's Assistant (KEA) provides automated support for the volunteer editors of the microfinance charity Kiva.org, whose job is to clean up English-language loan descriptions that are generally written by non-native English speakers. KEA iteratively applies rules to its input. These rules perform not only surface-structure tasks (such as tokenizing, tagging, determining sentence boundaries, applying regular-expression search/replace operations, and expanding ISO currency abbreviations) but also deep-structure analysis using DELPH-IN technology, for example correcting the pluralization of phrases. The user is presented with the edited result and a report of the changes made. This paper describes the current status of the project and its implementation goals.
\end{abstract}

\section{Introduction}
The Kiva Editor's Assistant (KEA) uses TnT\cite{website:tnt} and PET's ``cheap'' parser\cite{website:delphin} loaded with the English Resource Grammar (ERG), with the goal of providing automated support for the volunteer editors of the microfinance charity Kiva.org, whose task is to clean up English-language loan descriptions that are generally written by non-native English speakers. KEA uses multiple passes to tokenize the input and apply rules. These rules range from simple regular-expression search/replace operations, to expanding only the first occurrence of each ISO currency abbreviation, to correcting the pluralization of phrases like ``is a 20 years old farmer'' while leaving phrases like ``is a farmer who is 20 years old'' untouched; the latter distinction is made by consulting the ERG parse tree.

\section{Rules}

KEA uses a rule-based approach to processing text.  Rules are grouped into phases and ordered by priority; within a phase, all rules are run in priority order, pass after pass, until a pass produces no changes to the text, at which point the next phase is invoked. When the last phase no longer produces changes, KEA converts its tokens into output text and delivers that along with a report of the changes made to the original text.
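
This control flow can be illustrated with a short Python sketch; the \texttt{phase.rules}, \texttt{rule.priority}, and \texttt{rule.apply} names used here are illustrative assumptions rather than KEA's actual interface.
\begin{verbatim}
# Minimal sketch of the phase/priority loop; Rule and Phase attribute
# names are illustrative, not KEA's actual API.
def run_phases(phases, tokens):
    for phase in phases:
        changed = True
        while changed:                    # repeat until a pass changes nothing
            changed = False
            for rule in sorted(phase.rules, key=lambda r: r.priority):
                if rule.apply(tokens):    # assume apply() reports whether it
                    changed = True        # altered the token list
    return tokens
\end{verbatim}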

The following sections describe the rules in priority order.

\subsection{Initial Phase}
\subsubsection{Space Splitting Rule}
KEA initially creates one token that encompasses the entire input, then subdivides that token. The first subdivision is performed by splitting the input text at spaces, resulting in one token for every space-delimited stretch of characters.  Each token records the beginning and ending offset into the original text that it represents, and this indexing information is preserved when splitting (or, later, merging) tokens.
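
A minimal Python sketch of offset-preserving whitespace splitting follows; the \texttt{Token} class here is illustrative, not KEA's actual implementation.
\begin{verbatim}
import re

class Token:
    # Illustrative token: remembers the span of original text it covers.
    def __init__(self, text, start, end):
        self.text, self.start, self.end = text, start, end
    def __repr__(self):
        return '|%s %d:%d|' % (self.text, self.start, self.end)

def split_at_spaces(original):
    # One token per maximal run of non-space characters.
    return [Token(m.group(), m.start(), m.end())
            for m in re.finditer(r'\S+', original)]

# split_at_spaces("One sentence.And another.") yields
# |One 0:3| |sentence.And 4:16| |another. 17:25|
\end{verbatim}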

\subsubsection{Newline Splitting Rule}

Every run of one or more newline characters is converted into a single paragraph delimiter token. In addition to providing a hard paragraph break, these tokens are also used to provide a clue to the sentence boundary detection rule.

\subsubsection{Dot Splitting Rule}

A common error in the source text is a lack of spaces after sentence-final punctuation, leading to sentences of the form ``One sentence.And another.'' This rule splits periods into separate tokens, unless they are part of a URL, an abbreviation, or a decimal number.

The preceding example, after whitespace split, would result in these tokens\footnote{For clarity only two features of the tokens are shown: their text and the offsets into the original text. Many additional features accompany the actual token objects, including part-of-speech tags and various boolean features describing their type.}:

\medskip
\tkn{One}{0:3} \tkn{sentence.And}{4:16} \tkn{another.}{17:25} 
\begin{flushleft}
after splitting at dots, the resulting tokens are:
\end{flushleft}

\tkn{One}{0:3} \tkn{sentence}{4:12} \tkn{.}{12:13} \tkn{And}{13:16} \tkn{another}{17:24} \tkn{.}{24:25} 
\medskip
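
A hedged sketch of the dot-splitting heuristic, reusing the \texttt{Token} class from the previous sketch, appears below; the abbreviation list and the URL test are simplified placeholders, not KEA's actual data.
\begin{verbatim}
ABBREVIATIONS = {'mr', 'mrs', 'dr', 'e.g', 'i.e'}   # illustrative only

def split_dots(token):
    # Leave URLs, known abbreviations, and tokens containing digits alone.
    text = token.text
    if ('/' in text
            or text.lower().rstrip('.') in ABBREVIATIONS
            or any(ch.isdigit() for ch in text)):
        return [token]
    pieces, start = [], token.start
    for part in re.split(r'(\.)', text):    # keep each dot as its own piece
        if part:
            pieces.append(Token(part, start, start + len(part)))
            start += len(part)
    return pieces

# split_dots(Token('sentence.And', 4, 16)) yields
# |sentence 4:12| |. 12:13| |And 13:16|
\end{verbatim}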

\subsubsection{Decimal Delimiters Conversion Rule}

Nearly all loans use American-style delimiters for decimal numbers, e.g., ``12,345.67''.  A very small number of loans use European-style delimiters, e.g., ``12.345,67''. For consistency, this rule converts European-style decimal numbers into American style.
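
A sketch of the conversion itself is below; detecting that a token really is a European-style number (rather than, say, a date) is omitted here.
\begin{verbatim}
def to_american(number_text):
    # "12.345,67" -> "12,345.67": swap the roles of '.' and ','
    return (number_text.replace('.', '\x00')
                       .replace(',', '.')
                       .replace('\x00', ','))
\end{verbatim}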

\subsubsection{Punctuation Splitting Rule}

This rule splits tokens at punctuation other than dots.  It is constructed to avoid splitting tokens at numeric punctuation (thousands and decimal delimiters), apostrophes used as contractions and possessive markers (but not as quotes), and asterisks that appear at the start of words (these occur in loans as footnote markers).

This rule also avoids splitting hyphenated words; with the way the token lattice is currently constructed, PET will fail to parse separate tokens like \tkno{47} \tkno{year} \tkno{-} \tkno{old} but will succeed with a sequence like \tkno{47} \tkno{year-old}.

\subsubsection{Alphanumeric Splitting Rule}

This rule splits one token into two in a number of cases, illustrated in Table~\ref{tab:alpha_split}:

\begin{table}[h]
\begin{center}
\begin{tabular}{|l|l|}
\hline
 input      &  output      \\
\hline
 10am       &  10 am       \\
 10.00am    &  10.00 am    \\
 10:00am    &  10:00 am    \\
 10:00a.m.  &  10:00 a.m.  \\
 500foo     &  500 foo     \\
 bar200     &  bar 200     \\
 ksh.1000   &  ksh. 1000   \\
 1,200.     &  1,200 .     \\
 1200.      &  1200 .      \\
\hline
\end{tabular}
\caption{Alphanumeric token splitting.}
\label{tab:alpha_split}
\end{center}
\end{table}
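
A rough sketch of the boundary test behind Table~\ref{tab:alpha_split} follows; the real rule also guards against forms such as ``2nd'' and ``65-year-old'', which this simplified version does not attempt to handle.
\begin{verbatim}
import re

def split_alnum(text):
    # Trailing dot after a number: "1,200." -> ["1,200", "."]
    m = re.match(r'^([\d,.:]*\d)\.$', text)
    if m:
        return [m.group(1), '.']
    # Number followed by letters: "10:00am" -> ["10:00", "am"]
    m = re.match(r'^([\d,.:]*\d)([A-Za-z].*)$', text)
    if m:
        return [m.group(1), m.group(2)]
    # Letters (optionally dot-terminated) then a number:
    # "ksh.1000" -> ["ksh.", "1000"]
    m = re.match(r'^([A-Za-z]+\.?)(\d.*)$', text)
    if m:
        return [m.group(1), m.group(2)]
    return [text]
\end{verbatim}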



\subsubsection{Regular Expression Replacing Rule}

The user must have the opportunity to review every (non-whitespace) change that KEA has made to the original text. This requirement has an impact on the regular-expression search-and-replace (regex) rules: it would be convenient to have the regex rules operate on the original all-encompassing token before any whitespace split occurs, but doing so would render these changes internal to that single token, and the textual changes would then have to be reconstructed by diffing the original text against the generated output.

Maintaining a link between individual tokens and the indexes of the original text they were initialized from not only makes generating a change report a simpler task but also makes for much easier debugging of the system, as tracking the changes to tokens as the result of successive rule applications is straightforward.

For example, consider the sentence ``Pat joined in the year 2009.''  This is tokenized as:

\medskip
\tkn{Pat}{0:3} \tkn{joined}{4:10} \tkn{in}{11:13} \tkn{the}{14:17} \tkn{year}{18:22} \tkn{2009}{23:27} \tkn{.}{27:28} 
\begin{flushleft}
and is changed by a regex rule to:
\end{flushleft}

\tkn{Pat}{0:3}  \tkn{joined}{4:10}  \tkn{in}{11:13}  \tkn{2009}{23:27}  \tkn{.}{27:28} 
\medskip

The regex rule processor uses dynamic programming to compute the Levenshtein distance, at the token level, between the source and target strings, which allows it to perform the minimal number of insertions, deletions, and edits on the source tokens to transform them into the target.
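
The following sketch shows one way to recover such an edit script with the standard dynamic-programming recurrence; the operation names are illustrative, and KEA's own implementation may differ in detail.
\begin{verbatim}
def token_edit_script(source, target):
    # source, target: lists of token strings; returns a minimal edit script.
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete
                             dist[i][j - 1] + 1,         # insert
                             dist[i - 1][j - 1] + cost)  # keep/replace
    # Trace back to recover the operations.
    script, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1]
                               + (source[i - 1] != target[j - 1])):
            op = 'keep' if source[i - 1] == target[j - 1] else 'replace'
            script.append((op, source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            script.append(('delete', source[i - 1], None))
            i -= 1
        else:
            script.append(('insert', None, target[j - 1]))
            j -= 1
    return list(reversed(script))
\end{verbatim}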

\subsubsection{Single Digit Spelling Rule}

As a purely stylistic measure, single digits are spelled out, unless they indicate a percentage (7\%), are part of a list [e.g. 1) 2) 3)], or indicate an amount of currency (\$7).  Items for further work include recognizing when single digits are part of an address, or occur in a list that contains multi-digit numbers. A frequent example of the latter is a list of children's ages, such as ``Pat has 3 children aged 4, 8, and 12.'' In this case the ``3'' should be spelled out, but the ``4'' and ``8'' should not.
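
A hedged sketch of the decision follows; the neighbour tests here stand in for the richer token features that the rule actually consults.
\begin{verbatim}
SPELLED = {'1': 'one', '2': 'two', '3': 'three', '4': 'four', '5': 'five',
           '6': 'six', '7': 'seven', '8': 'eight', '9': 'nine'}

def spell_out_digit(prev_text, text, next_text):
    if text not in SPELLED:
        return text
    if next_text in ('%', ')'):      # percentage "7%" or list marker "1)"
        return text
    if prev_text == '$':             # currency amount "$7"
        return text
    return SPELLED[text]
\end{verbatim}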

\subsubsection{Currency Delimiting Rule}
\label{sec:delim_currency}

Currency amounts of more than four digits are delimited with commas separating thousands. Identifying a number as a currency amount requires recognizing currency symbols like ``\$'', ISO abbreviations such as ``PHP'' (Philippine Peso), and local currency terms, such as ``/='' (a Ugandan abbreviation for shilling).
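
The delimiting itself is simple once a token has been identified as a currency amount; a sketch (which leaves amounts containing decimals or existing commas untouched) is:
\begin{verbatim}
def delimit_thousands(amount_text):
    # "50000" -> "50,000"; applies only to amounts of more than four digits
    if amount_text.isdigit() and len(amount_text) > 4:
        return '{:,}'.format(int(amount_text))
    return amount_text
\end{verbatim}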

\subsubsection{Number Concatenating Rule}
\label{sec:concat_num}

This is the first of the rules that merge tokens. It searches for consecutive numeric tokens that either have spaces where thousands separators should be, or have spaces in addition to thousands separators.

\subsubsection{Currency Abbreviation Ordering Rule}

This rule swaps the position of an ISO currency abbreviation and a following number, unless the abbreviation is also preceded by a number. For example, 

\medskip
\tkn{PHP}{0:3} \tkn{1000}{4:8}
\begin{flushleft}
becomes:
\end{flushleft}

\tkn{1000}{4:8} \tkn{PHP}{0:3}
\medskip

\subsubsection{Currency Abbreviation Expanding Rule}

The initial occurrence of an ISO currency abbreviation is spelled out.  For example, ``one 5000 KES loan and another 7000 KES loan'' becomes ``one 5000 Kenyan Shilling (KES) loan and another 7000 KES loan''.
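
A sketch of the first-occurrence expansion, operating on token strings for brevity, is below; \texttt{ISO\_NAMES} holds a few illustrative entries rather than KEA's full table.
\begin{verbatim}
ISO_NAMES = {'KES': 'Kenyan Shilling', 'PHP': 'Philippine Peso'}

def expand_first_occurrence(tokens):
    seen, out = set(), []
    for tok in tokens:
        if tok in ISO_NAMES and tok not in seen:
            out.extend([ISO_NAMES[tok], '(', tok, ')'])
            seen.add(tok)
        else:
            out.append(tok)
    return out

# ['one', '5000', 'KES', 'loan'] becomes
# ['one', '5000', 'Kenyan Shilling', '(', 'KES', ')', 'loan']
\end{verbatim}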

As an example of how this and the previously described rules work together, consider the sentence ``A kshs.50 000 loan.''

\begin{enumerate}
\item The initial all-encompassing token is created. 
\\
\tkn{A kshs.50 000 loan.}{0:19} 

\item The Space Splitting Rule splits the initial token. 
\\
\tkn{A}{0:1} \tkn{kshs.50}{2:9} \tkn{000}{10:13} \tkn{loan.}{14:19}

\item The Dot Splitting Rule splits the token \tkn{loan.}{14:19}. 
\\
\tkn{A}{0:1} \tkn{kshs.50}{2:9} \tkn{000}{10:13} \tkn{loan}{14:18} \tkn{.}{18:19}

\item The Alphanumeric Splitting Rule separates ``kshs.'' from ``50'' in the token \tkn{kshs.50}{2:9}.\footnote{The Dot Splitting Rule doesn't perform this split because it doesn't involve itself with tokens that contain digits: some alphanumeric sequences, such as ``2nd'' or ``65-year-old'' should not be split; all this specialized knowledge is collected in the Alphanumeric Splitting Rule.}
\\ 
\tkn{A}{0:1} \tkn{kshs.}{2:7} \tkn{50}{7:9} \tkn{000}{10:13} \tkn{loan}{14:18} \tkn{.}{18:19}

\item Now that ``kshs.'' stands apart from any digits, the Dot Splitting Rule applies and separates it at the dot. 
\\
\tkn{A}{0:1} \tkn{kshs}{2:6} \tkn{.}{6:7} \tkn{50}{7:9} \tkn{000}{10:13} \tkn{loan}{14:18} \tkn{.}{18:19}

\item The Regular Expression Replacing Rule transforms the adjacent tokens \tkn{kshs}{2:6} \tkn{.}{6:7}, which form the regional abbreviation for Kenyan Shillings, into the ISO standard currency abbreviation KES. 
\\
\tkn{A}{0:1} \tkn{KES}{2:6} \tkn{50}{7:9} \tkn{000}{10:13} \tkn{loan}{14:18} \tkn{.}{18:19} 

\item The Number Concatenating Rule merges \tkn{50}{7:9} \tkn{000}{10:13} into \tkn{50000}{7:13}.
\\
\tkn{A}{0:1} \tkn{KES}{2:6} \tkn{50000}{7:13} \tkn{loan}{14:18} \tkn{.}{18:19} 

\item The Currency Delimiting Rule, finding a five-digit number adjacent to an ISO currency abbreviation, adds a comma to separate the thousands, producing \tkn{50,000}{7:13}.
\\
\tkn{A}{0:1} \tkn{KES}{2:6} \tkn{50,000}{7:13} \tkn{loan}{14:18} \tkn{.}{18:19} 

\item The Currency Abbreviation Ordering Rule swaps the tokens \tkn{KES}{2:6} \tkn{50,000}{7:13}. Note that it is the entire token objects, complete with indexes into the original text, that are swapped, rather than just their string values. The change reporting code can take advantage of this to indicate to the user that these elements were transposed. This illustrates the advantage of splitting the initial token at whitespace before performing regular expression search and replace operations, so that the indexes into the original text are preserved: the fact that a swap was performed could not be recovered and described to the user if the change reporting code relied only on a Levenshtein (or even Damerau-Levenshtein, since the next step inserts tokens between the ones just swapped) distance computation.
\\
\tkn{A}{0:1} \tkn{50,000}{7:13} \tkn{KES}{2:6} \tkn{loan}{14:18} \tkn{.}{18:19} 

\item Finally, the Currency Abbreviation Expanding Rule expands ``KES''.
\\
\tkn{A}{0:1} \tkn{50,000}{7:13} \tkno{Kenyan} \tkno{Shilling} \tkno{(} \tkn{KES}{2:6} \tkno{)} \tkn{loan}{14:18} \tkn{.}{18:19} 
\end{enumerate}

\subsection{POS Phase Rules}

Rules in this phase depend on the TnT part-of-speech tagger having supplied POS tags to all tokens.  Before processing any of the rules in this phase, KEA writes the text of each token to its own line in a temporary file, then invokes TnT on it using the WSJ PTB tag set.  Early experimentation showed that providing PET with only the highest-priority tag for each token led to parse failures, so KEA launches TnT as a child process with command-line parameters that request not just the highest-probability tag but all tags whose probability is at least one hundredth of it.
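
A sketch of the tagging step is below; the model name and the beam option shown are placeholders, not the exact TnT command line KEA uses.
\begin{verbatim}
import subprocess, tempfile

def tag_tokens(token_texts, tnt_binary='tnt', model='wsj'):
    # One token per line, as TnT expects.
    with tempfile.NamedTemporaryFile('w', suffix='.tnt',
                                     delete=False) as f:
        f.write('\n'.join(token_texts) + '\n')
        path = f.name
    # '-z100' stands in for the option requesting all tags within 1/100
    # of the best tag's probability; consult the TnT manual for the flag.
    result = subprocess.run([tnt_binary, '-z100', model, path],
                            capture_output=True, text=True, check=True)
    return result.stdout
\end{verbatim}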

\subsubsection{Sentence Boundary Detecting Rule}

This rule looks for POS tags indicating sentence-final punctuation, and also terminates a sentence when a hard newline is found.  The latter is of course generally \emph{not} a reliable indicator of a sentence boundary, but in Kiva loan description text it is quite reliable.  The rule surrounds each sentence with tokens whose features mark them as sentence delimiters.

\subsection{Parsed Phase Rules}

These rules expect that a parse can at least be attempted, which is why this phase must follow the part-of-speech tagging phase. For performance reasons, sentences are only parsed on demand.  KEA uses PET (specifically the binary ``cheap'') as an XML-RPC server; if the server isn't detected, KEA will launch it as a daemon.

KEA uses PET Input Chart (PIC) input mode, which requires an XML document\footnote{See http://moin.delph-in.net/PetInputChart}. 
To request the parse of a sentence, KEA converts a sequence of tokens to an XML file using the document type definition pic.dtd from the Heart of Gold\footnote{See http://moin.delph-in.net/HeartofgoldTop} and supplies that to PET. Currently only the \texttt{w}, \texttt{surface}, and \texttt{pos} elements are used; see Appendix \ref{apdx:pic} for an example. 
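
A sketch of how such a document can be produced with Python's \texttt{xml.etree.ElementTree} is shown below; the element and attribute names follow the sample in Appendix \ref{apdx:pic}, while the DOCTYPE declaration and the token/tag bookkeeping are omitted.
\begin{verbatim}
import xml.etree.ElementTree as ET

def build_pic(words):
    # words: list of (surface, cstart, cend, [(tag, prio), ...]) tuples.
    root = ET.Element('pet-input-chart')
    for i, (surface, cstart, cend, tags) in enumerate(words, start=1):
        w = ET.SubElement(root, 'w', id='W%d' % i,
                          cstart=str(cstart), cend=str(cend))
        ET.SubElement(w, 'surface').text = surface
        for tag, prio in tags:
            ET.SubElement(w, 'pos', tag=tag, prio='%e' % prio)
    return ET.tostring(root, encoding='unicode')
\end{verbatim}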

A high-priority work item is either to extend the KEA code to create a richer token lattice, or to outsource that job to Heart of Gold's TnTpiXML tokenizer, since some test inputs fail to parse in a reasonable time using the current simplistic token lattice but are manageable using a lattice created by TnTpiXML.

In either case, there is another opportunity to improve the lattice that is unique to this application. Every loan description contains the name of at least one borrower; these names are available in an HTML table on the Kiva editing web page.  A future work item, then, is to extract the HTML table contents and use those to mark the names in the token lattice as named entities.\footnote{See the handling of the name ``Kim Novak'' at http://moin.delph-in.net/PetInputChart for an example.}

\subsubsection{Years-Old Rule}

The first (and currently only) parse-phase rule searches sentences for expressions like ``x years old'', then examines the parse provided by PET to determine whether the pluralization and hyphenation need to be corrected.


The first three cases are considered correct usages and are not changed:

\begin{enumerate}
\item Pat is 47 years old.
\item Pat is a 47-year-old farmer.
\item Pat is 1 year old.
\end{enumerate}

These cases have the correct pluralization, but lack one or more hyphens:

\begin{enumerate}
\setcounter{enumi}{3}
\item Pat is a 47 year old farmer.
\item Pat is a 47-year old farmer.
\item Pat is a 47 year-old farmer.
\end{enumerate}

The remaining cases all have incorrect pluralization and, except for the last, are missing one or more hyphens:

\begin{enumerate}
\setcounter{enumi}{7}
\item Pat is a 47 years old farmer. \label{it:years}
\item Pat is a 47-years old farmer.
\item Pat is a 47 years-old farmer.
\item Pat is a 47-years-old farmer.
\end{enumerate}

Consider case \ref{it:years}. Having determined that the sentence contains the sequence ``years old'', KEA requests a parse tree (see Appendix \ref{apdx:farmer}). Seeing that the grandparent of the \tkn{years}{12:17} token is $\tt plur\_noun\_orule$ and that its great-grandparent's sibling is $\tt npadv$, which satisfy the criteria for case \ref{it:years}, KEA merges the tokens \tkn{47}{9:11} \tkn{years}{12:17} \tkn{old}{18:21} into \tkn{47-year-old}{9:21}.
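
A hedged sketch of this test is given below; the node interface (\texttt{parent}, \texttt{children}, \texttt{label}) is illustrative and does not correspond to the actual structures KEA builds from PET's output.
\begin{verbatim}
def needs_rewrite(years_node):
    # True for trees like the one in the appendix: the grandparent of
    # <years> is plur_noun_orule and the great-grandparent has an npadv
    # sibling.
    grandparent = years_node.parent.parent
    if grandparent is None or grandparent.label != 'plur_noun_orule':
        return False
    great_grandparent = grandparent.parent
    if great_grandparent is None or great_grandparent.parent is None:
        return False
    return any(sibling.label == 'npadv'
               for sibling in great_grandparent.parent.children
               if sibling is not great_grandparent)
\end{verbatim}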


\section{Future Work}
Much work remains to be done before KEA is ready for a production environment.  This section summarizes the highest-priority work items.

The most critical work item is to improve the token lattice generated by KEA by incorporating techniques from (or using as a standalone tokenizer) Heart of Gold's TnTpiXML module. This will improve the likelihood of successful parses from PET and also greatly reduce the parser's time and memory requirements. Doing so will also facilitate the introduction of more parse-phase rules, which are required for robust correction of syntactic errors. Automating the correction of these errors will save volunteer editors a great deal of mechanical work, allowing them to concentrate on the tasks that require the most human judgement.

Another area that must be improved is KEA's user interface.  Currently it is a command-line driven program, and the author's setup uses a web-browser plugin to invoke an instance of the Emacs editor client on the loan description text in the Kiva editor's web page's TEXTAREA. The Emacs client detects its method of invocation and launches KEA, supplying the loan description text, then replaces its buffer's contents with KEA's output. Finally, when the user has completed editing the loan and closes the Emacs client, the browser plugin replaces the contents of the TEXTAREA with the edited text.

While perfectly workable as a development environment, this setup is not practical for production use. To be useful to the general audience of volunteer editors, KEA must require minimal setup and operate unobtrusively. Options for doing so include using a web browser plugin to seamlessly incorporate KEA's functionality in a modified version of the loan-editing web page, or producing a standalone GUI version which uses copy-and-paste transfer with the web page.

KEA currently runs only in the Linux environment, which is not widely used amongst Kiva volunteers. Therefore another critical work item is to make KEA portable and provide a simple installation package that works on Mac OS X and Windows. The challenge here will be producing TnT and PET binaries compatible with these operating systems; while source for both of these components is available and should in principle compile on any platform, ultimate success may hinge on the availability of platform-compatible versions of the libraries on which cheap and TnT depend.

Other work items are:
\begin{itemize}
\item Recognizing when single digits are part of an address or are being used to enumerate items in a list; these should not be spelled out but currently are.
\item Spelling out sentence-initial numbers.
\item Pluralizing expanded ISO currency names as appropriate.
\end{itemize}


\pagebreak
\appendix

\section{Sample PET Input Chart}
\label{apdx:pic}
\begin{figure}[!ht]
  \centering
\begin{verbatim}
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE pet-input-chart SYSTEM "pic.dtd">
<pet-input-chart>
  <w id="W1" cend="4" cstart="1">
    <surface>Pat</surface>
    <pos tag="NNP" prio="1.000000e+00" />
  </w>
  <w id="W2" cend="7" cstart="5">
    <surface>is</surface>
    <pos tag="VBZ" prio="1.000000e+00" />
  </w>
  <w id="W3" cend="9" cstart="8">
    <surface>a</surface>
    <pos tag="DT" prio="1.000000e+00" />
  </w>
  <w id="W4" cend="12" cstart="10">
    <surface>47</surface>
    <pos tag="CD" prio="1.000000e+00" />
  </w>
  <w id="W5" cend="18" cstart="13">
    <surface>years</surface>
    <pos tag="NNS" prio="1.000000e+00" />
  </w>
  <w id="W6" cend="22" cstart="19">
    <surface>old</surface>
    <pos tag="JJ" prio="1.000000e+00" />
  </w>
  <w id="W7" cend="29" cstart="23">
    <surface>farmer</surface>
    <pos tag="NN" prio="1.000000e+00" />
  </w>
  <w id="W8" cend="31" cstart="30">
    <surface>.</surface>
    <pos tag="." prio="1.000000e+00" />
  </w>
</pet-input-chart>
\end{verbatim}
\caption{PET Input Chart for ``Pat is a 47 years old farmer.''}
\label{fig:pic}
\end{figure}

\pagebreak
\section{Sample Parse Tree}
\label{apdx:farmer}
\begin{figure}[!ht]
\begin{verbatim}
root_informal
  subjh
    proper_np
      sing_noun_irule
        pat
          <Pat 0:3>
    hcomp
      be_c_is
        <is 4:6>
      npadv_mnp
        adjh_s_xp
          a_one_adj
            <a 7:8>
          nadj_rr
            measure_np
              generic_card_ne
                <47 9:11>
              plur_noun_orule
                year_n1
                  <years 12:17>
            npadv
              proper_np
                adjn
                  old_a1
                    <old 18:21>
                  noun_n_cmpnd
                    farmer_n1
                      <farmer 22:28>
                    sing_noun_irule
                      generic_date_ne
                        <. 28:29>
\end{verbatim}
\caption{Parse tree for ``Pat is a 47 years old farmer.''}
\label{fig:farmer}
\end{figure}


\bibliographystyle{plain}
\bibliography{kea}

\end{document}