Regex Filter fails with empty lines

Issue #71 resolved
Former user created an issue

Original [issue 71](https://code.google.com/p/okapi/issues/detail?id=71) created by @ysavourel on 2009-05-12T23:21:09.000Z:

While writing a plain-text filter, I noticed that the Regex Filter is unable to handle this input:

"Line 1\r\n\r\nLine 2\r\n\r\nLine 3\r\n\r\nLine 4"

The rule is "^(.\*?)$" with the Pattern.MULTILINE option, meant to extract each line. The filter creates a TEXT\_UNIT event for "Line 1" and then a DOCUMENT\_PART for all of the remaining text.
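To see what the rule itself matches, independent of the filter, the sketch below lists every match that `java.util.regex` reports for it. The class name `MultilineMatches` and the method `listMatches` are made up for illustration and are not part of Okapi, and the sample input assumes `\r\n` line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineMatches {

    // Returns every match of the rule as "start-end:text", in order.
    static List<String> listMatches (String input) {
        Pattern pattern = Pattern.compile("^(.*?)$", Pattern.MULTILINE);
        Matcher m = pattern.matcher(input);
        List<String> found = new ArrayList<>();
        // find() with no argument steps past zero-length matches by itself
        while ( m.find() ) {
            found.add(m.start() + "-" + m.end() + ":" + m.group(1));
        }
        return found;
    }

    public static void main (String[] args) {
        String input = "Line 1\r\n\r\nLine 2\r\n\r\nLine 3\r\n\r\nLine 4";
        for ( String s : listMatches(input) ) {
            System.out.println(s);
        }
    }
}
```

Each empty line shows up as a zero-length match (start equal to end) between two non-empty ones. Any code that restarts the search by hand with `find(int)` has to account for those zero-length matches.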

I wrote a simple failing test for net.sf.okapi.filters.regex.tests:

    @Test
    public void testEmptyLines () {
        String inputText = "Line 1\r\n\r\nLine 2\r\n\r\nLine 3\r\n\r\nLine 4";

        Parameters params = new Parameters();
        Rule rule = new Rule();
        rule.setRuleType(Rule.RULETYPE_CONTENT);
        params.regexOptions = Pattern.MULTILINE;
        rule.setExpression("^(.*?)$");
        rule.setSourceGroup(1);
        params.rules.add(rule);
        filter.setParameters(params);

        FilterTestDriver testDriver = new FilterTestDriver();
        testDriver.setDisplayLevel(2);
        testDriver.setShowSkeleton(true);

        filter.open(new RawDocument(inputText, "en"));
        if ( !testDriver.process(filter) ) Assert.fail();
        filter.close();
    }

I fixed the issue with the code below. With this change the filter creates TextUnits for the non-empty lines and DocumentParts for the empty lines between consecutive line breaks. The code passes the 2 existing tests for the Regex Filter and the one I added.

1. In net.sf.okapi.filters.regex line 77 a private boolean is added to the class RegexFilter:

private boolean lastRangeWasEmpty = false;

2. In net.sf.okapi.filters.regex from line 107 down the method next() is modified:

    public Event next () {
        // Cancel if requested
        if ( canceled ) {
            parseState = 0;
            queue.clear();
            queue.add(new Event(EventType.CANCELED));
        }

        // Process queue if it's not empty yet
        if ( queue.size() > 0 ) {
            return nextEvent();
        }

        // Get the first best match among the rules
        // trying to match expression
        Rule bestRule;
        int bestPosition = inputText.length()+99;
        MatchResult result = null;

        Matcher m = null;
        if (lastRangeWasEmpty) startSearch++;

        while ( true ) {
            bestRule = null;
            for ( Rule rule : params.rules ) {
                m = rule.pattern.matcher(inputText);
                if ( m.find(startSearch) ) {
                    if ( m.start() < bestPosition ) {
                        bestPosition = m.start();
                        bestRule = rule;
                    }
                }
            }

            if ( bestRule != null ) {
                // Get the matching result
                result = m.toMatchResult();
                lastRangeWasEmpty = (result.start() == result.end());

                if ( result.start() < inputText.length() ) {
                    // Process the match we just found
                    return processMatch(bestRule, result);
                }
                else break; // Done
            }
            else break; // Done
        }

        // Else: Send end of the skeleton if needed
        if ( startSearch <= inputText.length() ) {
            // Treat strings outside rules
            // TODO: implement extract string out of rules
            // Send the skeleton
            addSkeletonToQueue(inputText.substring(startSkl, inputText.length()), true);
        }

        // Any group to close automatically?
        closeGroups();

        // End finally set the end
        // Set the ending call
        Ending ending = new Ending(String.format("%d", ++otherId));
        queue.add(new Event(EventType.END_DOCUMENT, ending));
        return nextEvent();
    }
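The role of `lastRangeWasEmpty` can be shown in isolation. `Matcher.find(int)` resets the matcher and searches from the given index, so restarting at the end of a zero-length match finds that same match again. The sketch below is a standalone illustration under that assumption, not the Okapi code: the class `EmptyMatchScan` and the method `scan` are hypothetical, and it handles a single pattern rather than a list of rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyMatchScan {

    // Collects the text of every match, restarting the search by hand
    // with find(int) the way a filter tracking its own offset would.
    static List<String> scan (Pattern pattern, String input) {
        List<String> found = new ArrayList<>();
        int startSearch = 0;
        boolean lastRangeWasEmpty = false;

        while ( true ) {
            // After a zero-length match, move one character forward;
            // otherwise find(startSearch) would return the same match.
            if ( lastRangeWasEmpty ) startSearch++;
            // find(int) throws if the index is past the end of the input
            if ( startSearch > input.length() ) break;

            Matcher m = pattern.matcher(input);
            if ( !m.find(startSearch) ) break;

            // Snapshot the result while m still holds this match
            MatchResult result = m.toMatchResult();
            lastRangeWasEmpty = (result.start() == result.end());
            found.add(result.group());
            startSearch = result.end();
        }
        return found;
    }

    public static void main (String[] args) {
        Pattern rule = Pattern.compile("^(.*?)$", Pattern.MULTILINE);
        String input = "Line 1\r\n\r\nLine 2\r\n\r\nLine 3\r\n\r\nLine 4";
        System.out.println(scan(rule, input));
    }
}
```

With the rule and input from this issue, `scan` returns the four lines with an empty string between each pair. The sketch takes the `MatchResult` from the matcher that just succeeded, so with several rules the snapshot would need to be taken at the point where the best candidate is chosen.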
