david_walker committed 54d4b4d

completed draft of first progress report

Files changed (1)

report/progress.tex

 \renewcommand{\floatpagefraction}{0.35}
 \setcounter{totalnumber}{5}
 
-
 % Use 1 inch margins
 \addtolength{\oddsidemargin}{-.875in}
 \addtolength{\evensidemargin}{-.875in}
 \addtolength{\textheight}{1.75in}
 
 \newcommand{\tkn}[2]{
-%\ensuremath{\!\langle\!} #1 #2 \ensuremath{\!\rangle\!}
 \textbar#1~#2~\hspace{-0.3em}\textbar
 }
 
+\newcommand{\tkno}[1]{
+\textbar#1~\hspace{-0.3em}\textbar
+}
+
 \newenvironment{compactenumerate}%
   {\begin{enumerate}%
     \setlength{\itemsep}{0pt}%
 \maketitle
 
 \begin{abstract}
-The Kiva Editor's Assistant (KEA) provides automated support for microfinance charity Kiva.org's volunteer editors, whose job is to clean up English-language loan descriptions that are generally written by non-native English speakers. KEA iteratively applies rules to its input; these rules perform not only surface structure tasks (such as tokenizing, tagging, determining sentence boundaries, applying regular expression search/replace operations, and expanding ISO currency abbreviations) but also deep structure analysis to do things like correcting pluralization of phrases. The user is presented with the edited result and a report of the changes made. This paper describes the current status of the project and its implementation goals.
+The Kiva Editor's Assistant (KEA) provides automated support for microfinance charity Kiva.org's volunteer editors, whose job is to clean up English-language loan descriptions that are generally written by non-native English speakers. KEA iteratively applies rules to its input; these rules perform not only surface structure tasks (such as tokenizing, tagging, determining sentence boundaries, applying regular expression search/replace operations, and expanding ISO currency abbreviations) but also deep structure analysis using DELPH-IN technology to do things like correcting pluralization of phrases. The user is presented with the edited result and a report of the changes made. This paper describes the current status of the project and its implementation goals.
 \end{abstract}
 
 \section{Introduction}
 
 The preceding example, after whitespace split, would result in these tokens\footnote{For clarity only two features of the tokens are shown: their text and the offsets into the original text. Many additional features accompany the actual token objects, including part-of-speech tags and various boolean features describing their type.}:
 
+\medskip
 \tkn{One}{0:3} \tkn{sentence.And}{4:16} \tkn{another.}{17:25} 
-
-After splitting at dots, the resulting tokens are:
+\begin{flushleft}
+after splitting at dots, the resulting tokens are:
+\end{flushleft}
 
 \tkn{One}{0:3} \tkn{sentence}{4:12} \tkn{.}{12:13} \tkn{And}{13:16} \tkn{another}{17:24} \tkn{.}{24:25} 
+\medskip
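
The two tokenization steps shown above can be sketched in Python. This is a simplified illustration, not KEA's actual code: the `(text, start, end)` tuple layout and the function names are assumptions, and this naive version splits at every dot, whereas the real rule evidently leaves some dots alone (for example, `kshs.50` survives dot splitting in the worked example later in the report).

```python
import re
from typing import List, Tuple

# Hypothetical token layout: (text, start offset, end offset).
Token = Tuple[str, int, int]


def whitespace_split(text: str) -> List[Token]:
    """Create one token per run of non-whitespace characters."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def split_at_dots(tokens: List[Token]) -> List[Token]:
    """Split every token at each dot, keeping offsets into the original text."""
    result = []
    for text, start, _end in tokens:
        # Each match is either a single dot or a maximal run of non-dot characters.
        for m in re.finditer(r"\.|[^.]+", text):
            result.append((m.group(), start + m.start(), start + m.end()))
    return result


tokens = split_at_dots(whitespace_split("One sentence.And another."))
```

Because each piece's offsets are computed relative to its parent token's start, the offsets always refer back to the original text, matching the listing above.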
 
 \subsubsection{Decimal Delimiters Conversion Rule}
 
 
 This rule splits tokens at punctuation other than dots.  It is constructed to avoid splitting tokens at numeric punctuation (thousands and decimal delimiters), apostrophes used in contractions and as possessive markers (but not as quotes), and asterisks that appear at the start of words (these occur in loans as footnote markers).
 
-This rule also avoids splitting hyphenated words; with the way the token lattice is currently constructed, PET will fail to parse separate tokens like ``47'' ``year'' ``-'' ``old'' but will succeed with a sequence like ``47'' ``year-old''.
+This rule also avoids splitting hyphenated words; with the way the token lattice is currently constructed, PET will fail to parse separate tokens like \tkno{47} \tkno{year} \tkno{-} \tkno{old} but will succeed with a sequence like \tkno{47} \tkno{year-old}.
 
 \subsubsection{Alphanumeric Splitting Rule}
 
 
 For example, consider the sentence ``Pat joined in the year 2009.''  This is tokenized as:
 
+\medskip
 \tkn{Pat}{0:3} \tkn{joined}{4:10} \tkn{in}{11:13} \tkn{the}{14:17} \tkn{year}{18:22} \tkn{2009}{23:27} \tkn{.}{27:28} 
-
+\begin{flushleft}
 and is changed by a regex rule to:
+\end{flushleft}
 
 \tkn{Pat}{0:3}  \tkn{joined}{4:10}  \tkn{in}{11:13}  \tkn{2009}{23:27}  \tkn{.}{27:28} 
+\medskip
 
 The regex rule processor uses dynamic programming to compute the token-level Levenshtein distance between the source and target strings, which lets it apply the minimal number of insertions, deletions, and edits needed to transform the source tokens into tokens that represent the target string.
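
A minimal sketch of a token-level Levenshtein computation follows. It is not KEA's implementation: the function name and the operation labels (`keep`, `edit`, `delete`, `insert`) are illustrative assumptions, and tokens are represented as plain strings without offsets.

```python
from typing import List, Tuple


def token_edit_script(source: List[str], target: List[str]) -> Tuple[int, list]:
    """Return the token-level Levenshtein distance and one minimal edit script."""
    m, n = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete source token
                             dist[i][j - 1] + 1,         # insert target token
                             dist[i - 1][j - 1] + cost)  # keep or edit

    # Walk back from the bottom-right corner to recover the operations.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1] \
                and dist[i][j] == dist[i - 1][j - 1]:
            ops.append(("keep", source[i - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("edit", source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", source[i - 1]))
            i -= 1
        else:
            ops.append(("insert", target[j - 1]))
            j -= 1
    ops.reverse()
    return dist[m][n], ops


distance, ops = token_edit_script(
    ["Pat", "joined", "in", "the", "year", "2009", "."],
    ["Pat", "joined", "in", "2009", "."],
)
```

For the "Pat joined in the year 2009." example, the script keeps five tokens and deletes `the` and `year`, so the surviving tokens retain their original offsets.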
 
 
 This rule swaps the position of an ISO currency abbreviation and a following number, unless the abbreviation is also preceded by a number. For example, 
 
+\medskip
 \tkn{PHP}{0:3} \tkn{1000}{4:8}
+\begin{flushleft}
+becomes:
+\end{flushleft}
 
-becomes:
-
-\tkn{1000}{0:3} \tkn{PHP}{4:8}
+\tkn{1000}{4:8} \tkn{PHP}{0:3}
+\medskip
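
The swap can be sketched as below. This is an illustration under stated assumptions rather than KEA's code: the `ISO_CODES` set is a small hypothetical subset, the number pattern is a guess at what counts as a number, and tokens are `(text, start, end)` tuples. Note that each token keeps its original offsets after the swap, as in the listing above.

```python
import re
from typing import List, Tuple

Token = Tuple[str, int, int]  # hypothetical layout: (text, start, end)

ISO_CODES = {"PHP", "KES", "USD"}  # illustrative subset only


def is_number(text: str) -> bool:
    """Assumed number pattern: digits with optional , or . delimiters."""
    return re.fullmatch(r"\d+(?:[.,]\d+)*", text) is not None


def swap_currency_and_number(tokens: List[Token]) -> List[Token]:
    """Move a currency code after the number that follows it, unless the
    code is also preceded by a number."""
    out = list(tokens)
    i = 0
    while i < len(out) - 1:
        follows_number = i > 0 and is_number(out[i - 1][0])
        if out[i][0] in ISO_CODES and is_number(out[i + 1][0]) and not follows_number:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return out


swapped = swap_currency_and_number([("PHP", 0, 3), ("1000", 4, 8)])
```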
 
 \subsubsection{Currency Abbreviation Expanding Rule}
 
 The initial occurrence of an ISO currency abbreviation is spelled out.  For example, ``one 5000 KES loan and another 7000 KES loan'' becomes ``one 5000 Kenyan Shilling (KES) loan and another 7000 KES loan''.
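
A rough sketch of first-occurrence expansion, operating on plain word lists for brevity. The `CURRENCY_NAMES` table and the function name are assumptions made for illustration; KEA's real rule works on token objects with offsets and further features.

```python
from typing import List

# Illustrative lookup table; a real table would cover more ISO 4217 codes.
CURRENCY_NAMES = {
    "KES": "Kenyan Shilling",
    "PHP": "Philippine Peso",
}


def expand_first_occurrence(words: List[str]) -> List[str]:
    """Spell out each currency code the first time it appears."""
    seen = set()
    out = []
    for word in words:
        if word in CURRENCY_NAMES and word not in seen:
            seen.add(word)
            # Emit the name, then the code in parentheses as separate tokens.
            out.extend(CURRENCY_NAMES[word].split())
            out.extend(["(", word, ")"])
        else:
            out.append(word)
    return out


expanded = expand_first_occurrence(
    "one 5000 KES loan and another 7000 KES loan".split()
)
```

The parentheses are emitted as separate tokens, mirroring the token listing in the worked example that follows.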
 
-As an example of how these rules work together, consider the sentence ``A kshs.50 000 loan.''
+As an example of how this and the previously described rules work together, consider the sentence ``A kshs.50 000 loan.''
 
 \begin{enumerate}
 \item The initial all-encompassing token is created. 
 \\
 \tkn{A}{0:1} \tkn{kshs.50}{2:9} \tkn{000}{10:13} \tkn{loan.}{14:19}
 
-\item The Dot Splitting Rule splits the token \tkn{loan}{14:18}. 
+\item The Dot Splitting Rule splits the token \tkn{loan.}{14:19}. 
 \\
 \tkn{A}{0:1} \tkn{kshs.50}{2:9} \tkn{000}{10:13} \tkn{loan}{14:18} \tkn{.}{18:19}
 
 
 \item Finally, the Currency Abbreviation Expanding Rule expands ``KES''.
 \\
-\tkn{A}{0:1} \tkn{50,000}{7:13} \tkn{Kenyan}{} \tkn{Shilling}{} \tkn{(}{} \tkn{KES}{2:6} \tkn{)}{} \tkn{loan}{14:18} \tkn{.}{18:19} 
+\tkn{A}{0:1} \tkn{50,000}{7:13} \tkno{Kenyan} \tkno{Shilling} \tkno{(} \tkn{KES}{2:6} \tkno{)} \tkn{loan}{14:18} \tkn{.}{18:19} 
 \end{enumerate}
 
 \subsection{POS Phase Rules}
 
 \subsubsection{Sentence Boundary Detecting Rule}
 
-This rule looks for a POS tag indicating sentence-final punctuation, as well as terminating sentences when a hard newline is found.  The latter is of course generally \emph{not} a reliable indicator of a sentence boundary, but in Kiva loan description text, it is quite reliable.
+This rule looks for a POS tag indicating sentence-final punctuation, as well as terminating sentences when a hard newline is found.  The latter is of course generally \emph{not} a reliable indicator of a sentence boundary, but in Kiva loan description text, it is quite reliable.  The rule surrounds each sentence with tokens containing features indicating they are sentence delimiters.
 
 \subsection{Parsed Phase Rules}
 
 \item Pat is 1 year old.
 \end{enumerate}
 
-\flushleft{These cases have the correct pluralization, but lack one or more hyphens:}
+These cases have the correct pluralization, but lack one or more hyphens:
 
 \begin{enumerate}
 \setcounter{enumi}{3}
 
 
 \section{Future Work}
-- improve token lattice with Heart of Gold
-- pluralizing currency names
-- user-friendly GUI
-- support for other O/Ss
+Much work remains to be done before KEA is ready for a production environment.  This section summarizes the highest-priority work items.
+
+The most critical work item is to improve the token lattice generated by KEA, either by incorporating techniques from Heart of Gold's TnTpiXML module or by using that module as a standalone tokenizer. This will improve the likelihood of successful parses from PET and also greatly reduce the parser's time and memory requirements. Doing so will also facilitate the introduction of more parse-phase rules, which are required for robust correction of syntactic errors. Automating the correction of these errors will spare volunteer editors much mechanical work, allowing them to concentrate on the tasks that require the most human judgement.
+
+Another area that must be improved is KEA's user interface.  Currently it is a command-line driven program, and the author's setup uses a web-browser plugin to invoke an instance of the Emacs editor client on the loan description text in the TEXTAREA of the Kiva editor's web page. The Emacs client detects its method of invocation and launches KEA, supplying the loan description text, then replaces its buffer's contents with KEA's output. Finally, when the user has completed editing the loan and closes the Emacs client, the browser plugin replaces the contents of the TEXTAREA with the edited text.
+
+While perfectly workable as a development environment, this setup is not practical for production use. To be useful to the general audience of volunteer editors, KEA must require minimal setup and operate unobtrusively. Options for doing so include using a web browser plugin to seamlessly incorporate KEA's functionality in a modified version of the loan-editing web page, or producing a standalone GUI version which uses copy-and-paste transfer with the web page.
+
+KEA currently runs only in the Linux environment, which is not widely used amongst Kiva volunteers. Therefore another critical work item is to render KEA portable and provide a simple installation package that works on Mac OS X and Windows. The challenge here will be producing TnT and PET binaries compatible with these operating systems; while source for both of these components is available and should in principle compile on any platform, ultimate success may hinge on the availability of platform-compatible versions of the libraries on which cheap and TnT depend.
+
+Other work items are:
+\begin{itemize}
+\item Recognizing when single digits are part of an address or are being used to enumerate items in a list; these should not be spelled out but currently are.
+\item Spelling out sentence-initial numbers.
+\item Pluralizing expanded ISO currency names as appropriate.
+\end{itemize}
+
 
 \pagebreak
 \appendix