Cleaner/CleanupStep enhancement to keep CRs and NLs, and small changes

Issue #1304 resolved
Kuro Kurosaka (BH Lab) created an issue

Currently, Cleaner, the class that implements the main logic of the Cleanup Step, is implemented so that its run method always calls normalizeWhitespace, which collapses every run of consecutive whitespace characters into a single ASCII space. Under the current implementation, a whitespace character is one of: ASCII space, tab (\t), CR (\r), and NL (\n). This approach is problematic because it erases line breaks that exist in some documents on purpose. For example, a line break entered with Shift+Enter in Word is converted to \n by OpenXMLFilter, and if the Cleanup Step is applied, the line break and its intended visual effect are lost. There needs to be an option to skip this.

(Discussion and Proposal)

One fix I considered was adding a new option to skip the whitespace normalization completely. But I found the current code has this comment in the run method:

                // normalized whitespace.
                // all subsequent steps assume only single spaces
                normalizeWhitespace(tu, srcSeg, targetLocale);

After reviewing the rest of the code, it seems that the normalizePunctuation method is the one that assumes normalizeWhitespace has been called. Instead of not calling the method at all, I'd like to propose a new option that keeps CRs and NLs intact and converts only runs of consecutive spaces and tabs to a single ASCII space.
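To make the proposed behavior concrete, here is a minimal sketch on plain strings. It is not the actual Cleaner code: the real normalizeWhitespace works on text units and segments, and the class name WhitespaceSketch, the method normalize, and the keepLineBreaks flag are all hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Okapi.
class WhitespaceSketch {
    // Runs of ASCII space and tab only; CR and NL are deliberately excluded.
    private static final Pattern SPACE_TAB_RUN = Pattern.compile("[ \\t]+");
    // Runs of all four characters the issue lists as whitespace today.
    private static final Pattern ALL_WS_RUN = Pattern.compile("[ \\t\\r\\n]+");

    // keepLineBreaks stands in for the proposed option (assumed name).
    static String normalize(String text, boolean keepLineBreaks) {
        Pattern p = keepLineBreaks ? SPACE_TAB_RUN : ALL_WS_RUN;
        return p.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        String s = "line one \t \nline\t\ttwo";
        // Proposed behavior: the \n survives, space/tab runs collapse.
        System.out.println(normalize(s, true).replace("\n", "\\n"));
        // Current behavior: everything collapses to single spaces.
        System.out.println(normalize(s, false));
    }
}
```

With the option on, a run of spaces and tabs next to a line break still leaves one space beside the \n, so later steps that assume "only single spaces" would see at most one space between any two non-space characters on a line.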

If this idea is acceptable to the Okapi team, I have an implementation that I'd like to submit as a PR.

Side note: the current code doesn't treat NBSP or IDEOGRAPHIC SPACE as whitespace. This should probably be addressed for the Cleaner to be useful to a wider audience. My implementation does not address this.
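For reference, one possible way to cover those characters later is sketched below. It is not part of the proposed implementation. It relies on the \h predefined class in java.util.regex (horizontal whitespace, which includes U+00A0 and U+3000 but not CR or NL); the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Okapi or the proposed PR.
class UnicodeSpaceSketch {
    // \h matches horizontal whitespace such as space, tab, NBSP (U+00A0)
    // and IDEOGRAPHIC SPACE (U+3000). It does not match \r or \n.
    private static final Pattern HORIZONTAL_RUN = Pattern.compile("\\h+");

    static String normalizeHorizontal(String text) {
        return HORIZONTAL_RUN.matcher(text).replaceAll(" ");
    }
}
```

Whether NBSP should be collapsed at all is a separate design question, since it is often inserted on purpose to prevent a line wrap.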

Comments (3)

  1. jhargrave-straker

    Kuro, that sounds like a good solution to me. Updating to use the better Unicode support in Java 11 would also be appreciated!
