Regex Filter fails with empty lines
Original [issue 71](https://code.google.com/p/okapi/issues/detail?id=71) created by @ysavourel on 2009-05-12T23:21:09.000Z:
As I was writing a plain-text filter, I noticed that the Regex Filter is unable to handle this input:
"Line 1\n\nLine 2\n\nLine 3\n\nLine 4"
The rule is "^(.\*?)$" with the Pattern.MULTILINE option, intended to extract each line. The filter creates a TEXT\_UNIT event for "Line 1", and then a single DOCUMENT\_PART for the rest of the text.
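To see what this rule matches on its own, the following standalone sketch runs the same pattern through plain `java.util.regex`, outside of Okapi. The class and method names (`EmptyLineMatchDemo`, `matchSpans`, `spanFrom`) are illustrative only and do not exist in the filter. It suggests one plausible cause of the problem: the pattern yields a zero-length match on each empty line, and `find(int)` called again at the end of such a match returns the same empty match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyLineMatchDemo {
    // Same expression and option as the rule in the report.
    static final Pattern LINE = Pattern.compile("^(.*?)$", Pattern.MULTILINE);

    // Collects "start-end" spans using the no-argument find(),
    // which steps past zero-length matches by itself.
    static List<String> matchSpans(String input) {
        List<String> spans = new ArrayList<>();
        Matcher m = LINE.matcher(input);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    // Returns the span found by find(int), which resets the matcher
    // and searches again from the given index.
    static String spanFrom(String input, int start) {
        Matcher m = LINE.matcher(input);
        return m.find(start) ? m.start() + "-" + m.end() : "none";
    }

    public static void main(String[] args) {
        String input = "Line 1\n\nLine 2\n\nLine 3\n\nLine 4";
        // prints [0-6, 7-7, 8-14, 15-15, 16-22, 23-23, 24-30]
        System.out.println(matchSpans(input));
        // prints 7-7: searching from the end of the empty match finds it again
        System.out.println(spanFrom(input, 7));
    }
}
```

The spans `7-7`, `15-15` and `23-23` are the empty lines. A caller that sets its next search position to `end()` and then calls `find(int)` makes no progress after any of them unless it advances the position itself.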
I wrote a simple failing test for net.sf.okapi.filters.regex.tests:

    @Test
    public void testEmptyLines() {
        String inputText = "Line 1\n\nLine 2\n\nLine 3\n\nLine 4";

        Parameters params = new Parameters();
        Rule rule = new Rule();
        rule.setRuleType(Rule.RULETYPE_CONTENT);
        params.regexOptions = Pattern.MULTILINE;
        rule.setExpression("^(.*?)$");
        rule.setSourceGroup(1);
        params.rules.add(rule);
        filter.setParameters(params);

        FilterTestDriver testDriver = new FilterTestDriver();
        testDriver.setDisplayLevel(2);
        testDriver.setShowSkeleton(true);

        filter.open(new RawDocument(inputText, "en"));
        if ( !testDriver.process(filter) ) Assert.fail();
        filter.close();
    }

Actually I fixed the issue with the code below. In my case it creates TextUnits for non-empty lines, and DocumentParts for the empty lines between consecutive line breaks. The modified code passes the two existing Regex Filter tests and the one I added.
1. In net.sf.okapi.filters.regex line 77 a private boolean is added to the class RegexFilter:
private boolean lastRangeWasEmpty = false;
2. In net.sf.okapi.filters.regex from line 107 down the method next() is modified:

    public Event next () {
        // Cancel if requested
        if ( canceled ) {
            parseState = 0;
            queue.clear();
            queue.add(new Event(EventType.CANCELED));
        }

        // Process queue if it's not empty yet
        if ( queue.size() > 0 ) {
            return nextEvent();
        }

        // Get the first best match among the rules
        // trying to match expression
        Rule bestRule;
        int bestPosition = inputText.length()+99;
        MatchResult result = null;

        Matcher m = null;
        if (lastRangeWasEmpty) startSearch++;

        while ( true ) {
            bestRule = null;
            for ( Rule rule : params.rules ) {
                m = rule.pattern.matcher(inputText);
                if ( m.find(startSearch) ) {
                    if ( m.start() < bestPosition ) {
                        bestPosition = m.start();
                        bestRule = rule;
                    }
                }
            }

            if ( bestRule != null ) {
                // Get the matching result
                result = m.toMatchResult();
                lastRangeWasEmpty = (result.start() == result.end());

                if ( result.start() < inputText.length() ) {
                    // Process the match we just found
                    return processMatch(bestRule, result);
                }
                else break; // Done
            }
            else break; // Done
        }

        // Else: Send end of the skeleton if needed
        if ( startSearch <= inputText.length() ) {
            // Treat strings outside rules
            //TODO: implement extract string out of rules
            // Send the skeleton
            addSkeletonToQueue(inputText.substring(startSkl, inputText.length()), true);
        }

        // Any group to close automatically?
        closeGroups();

        // End finally set the end
        // Set the ending call
        Ending ending = new Ending(String.format("%d", ++otherId));
        queue.add(new Event(EventType.END_DOCUMENT, ending));
        return nextEvent();
    }

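The core of the change is the `lastRangeWasEmpty` flag: after a zero-length match, the search position is moved forward by one before the next `find(int)` call. The sketch below isolates that idea in a small standalone class. It is not the Okapi code: `LineScanner`, `next()` returning a string and `scanAll` are hypothetical names chosen for illustration, and only one rule is modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineScanner {
    private static final Pattern LINE = Pattern.compile("^(.*?)$", Pattern.MULTILINE);

    private final String input;
    private int startSearch = 0;
    private boolean lastRangeWasEmpty = false;

    LineScanner(String input) {
        this.input = input;
    }

    // Returns "TEXT:<line>" for a non-empty line, "EMPTY@<index>" for an
    // empty one, or null when nothing is left to match.
    String next() {
        // Step past the previous zero-length match so find(int) can progress.
        if (lastRangeWasEmpty) startSearch++;
        // find(int) throws IndexOutOfBoundsException beyond the input length.
        if (startSearch > input.length()) return null;

        Matcher m = LINE.matcher(input);
        if (!m.find(startSearch)) return null;

        lastRangeWasEmpty = (m.start() == m.end());
        startSearch = m.end();
        return lastRangeWasEmpty ? "EMPTY@" + m.start() : "TEXT:" + m.group(1);
    }

    // Convenience helper (hypothetical): drains the scanner into a list.
    static List<String> scanAll(String input) {
        LineScanner scanner = new LineScanner(input);
        List<String> events = new ArrayList<>();
        for (String e = scanner.next(); e != null; e = scanner.next()) {
            events.add(e);
        }
        return events;
    }

    public static void main(String[] args) {
        // prints [TEXT:Line 1, EMPTY@7, TEXT:Line 2]
        System.out.println(scanAll("Line 1\n\nLine 2"));
    }
}
```

Without the increment, the call following `EMPTY@7` would search from index 7 again and return the same empty match. The bounds check is an addition of this sketch, needed because the increment can push the position past the end of the input.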
Comments (3)
Comment [1.](https://code.google.com/p/okapi/issues/detail?id=71#c1) originally posted by @ysavourel on 2009-05-12T23:27:57.000Z:
Also, lastRangeWasEmpty = false needs to be added in several spots (probably wherever startSearch is set to 0).
Comment [2.](https://code.google.com/p/okapi/issues/detail?id=71#c2) originally posted by @ysavourel on 2009-05-12T23:28:41.000Z:
Comment [3.](https://code.google.com/p/okapi/issues/detail?id=71#c3) originally posted by @ysavourel on 2009-05-14T14:10:08.000Z:
Changed status to resolved.
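The remark in comment 1 about resetting lastRangeWasEmpty wherever startSearch is set to 0 can be illustrated with a reusable scanner. This is a sketch under assumptions, not the filter's code: `ReusableLineScanner`, `open(String, boolean)` and `readAll()` are hypothetical, and the `resetFlag` parameter exists only to show both behaviours side by side. If a document ends right after an empty line, the flag stays true, and a later document would start searching at index 1 instead of 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReusableLineScanner {
    private static final Pattern LINE = Pattern.compile("^(.*?)$", Pattern.MULTILINE);

    private String input = "";
    private int startSearch = 0;
    private boolean lastRangeWasEmpty = false;

    // Hypothetical stand-in for re-opening the filter on a new document.
    // resetFlag = false models forgetting to clear lastRangeWasEmpty.
    void open(String newInput, boolean resetFlag) {
        input = newInput;
        startSearch = 0;
        if (resetFlag) lastRangeWasEmpty = false;
    }

    // Returns the non-empty lines of the current input.
    List<String> readAll() {
        List<String> lines = new ArrayList<>();
        while (true) {
            if (lastRangeWasEmpty) startSearch++;
            if (startSearch > input.length()) break;
            Matcher m = LINE.matcher(input);
            if (!m.find(startSearch)) break;
            lastRangeWasEmpty = (m.start() == m.end());
            startSearch = m.end();
            if (!lastRangeWasEmpty) lines.add(m.group(1));
        }
        return lines;
    }

    public static void main(String[] args) {
        ReusableLineScanner scanner = new ReusableLineScanner();
        scanner.open("a\n\n", true);
        scanner.readAll();                      // ends on an empty match, flag stays true
        scanner.open("x\ny", false);            // flag not cleared
        System.out.println(scanner.readAll());  // prints [y]: "x" is skipped
        scanner.open("a\n\n", true);
        scanner.readAll();
        scanner.open("x\ny", true);             // flag cleared together with startSearch
        System.out.println(scanner.readAll());  // prints [x, y]
    }
}
```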