SRXSegmenter does not handle parts covered by previous match
Original issue 426 created by @ysavourel on 2014-12-05T16:07:00.000Z:
See https://groups.yahoo.com/neo/groups/okapitools/conversations/topics/4478 for details.
The second rule really means ".", i.e. always break.
Then, yes, I think there is a problem in the code: we don’t check the rule on the parts of the text included in the previous match.
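To make the problem concrete, here is a minimal standalone sketch (not Okapi code). It assumes, as the SRXSegmenter loop in this issue does, that group 1 of the rule pattern is the text before the break. The class name, method name, rule and sample text are made up for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Okapi.
class SkippedBreakDemo {

    // Collects candidate break positions with a plain find() loop.
    // Assumption: group 1 holds the text that precedes the break.
    static List<Integer> breaksWithPlainFind(Pattern rule, String text) {
        List<Integer> breaks = new ArrayList<>();
        Matcher m = rule.matcher(text);
        while (m.find()) {
            // find() resumes at the END of the previous match, so a
            // candidate position inside that match is never tested.
            breaks.add(m.start() + m.group(1).length());
        }
        return breaks;
    }

    public static void main(String[] args) {
        // Made-up rule: break after a period, whatever character follows.
        Pattern rule = Pattern.compile("(\\.)(.)");
        System.out.println(breaksWithPlainFind(rule, "a...b")); // prints [2, 4]
    }
}
```

The rule also applies at position 3 (after the second period), but that period was consumed as the "after" part of the first match, so the loop never reports it.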
Changing the code in SRXSegmenter.java from this:
    m = rule.pattern.matcher(codedText);
    while ( m.find() ) {
        int n = m.start()+m.group(1).length();
        if ( n > codedText.length() ) continue;
To this:
    m = rule.pattern.matcher(codedText);
    int start = 0;
    while ( m.find(start) ) {
        int n = m.start()+m.group(1).length();
        start++;
        if ( n > codedText.length() ) continue;
Should resolve this.
But there is a side effect in the Aligner step tests.
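One property of the `start++` variant that can be checked in isolation: `Matcher.find(int)` resets the matcher and searches from the given index, so advancing `start` by one character at a time can return the same match several times. The sketch below is a standalone illustration with made-up names and sample data. Whether this is the side effect seen in the Aligner step tests is not stated here, so treat it only as an observation about the loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Okapi.
class IncrementStartDemo {

    // Same loop shape as the first proposed change: restart the search
    // one character further on each iteration.
    static List<Integer> breaksWithIncrementedStart(Pattern rule, String text) {
        List<Integer> breaks = new ArrayList<>();
        Matcher m = rule.matcher(text);
        int start = 0;
        while (m.find(start)) {
            // Assumption: group 1 holds the text that precedes the break.
            breaks.add(m.start() + m.group(1).length());
            start++;
        }
        return breaks;
    }

    public static void main(String[] args) {
        // Made-up rule: break after a period, whatever character follows.
        Pattern rule = Pattern.compile("(\\.)(.)");
        // The single match at index 2 is found from start = 0, 1 and 2.
        System.out.println(breaksWithIncrementedStart(rule, "xx.y")); // prints [3, 3, 3]
    }
}
```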
Comments (2)
Account Deleted - changed status to resolved
Comment 2. originally posted by @ysavourel on 2014-12-07T16:39:30.000Z:
This issue was closed by revision af5c6a381dcc.
Comment 1. originally posted by @ysavourel on 2014-12-07T16:37:24.000Z:
Aligner tests showed an issue with the first solution (e.g. for a pattern like "1.2.3. "). A better one:
    int start = 0;
    int prevStart = -1;
    while (( start != prevStart ) && m.find(start) ) {
        int n = m.start()+m.group(1).length();
        // Set next start
        prevStart = start;
        start = n;
        ...
It passes all existing tests and additional ones.
I'll push this soon.
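For reference, a standalone sketch of the loop shape from Comment 1, outside Okapi. The class name, method name, rules and sample texts are assumptions made for the demonstration, and the elided part of the loop body is replaced here by simply recording the position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Okapi.
class RestartAtBreakDemo {

    // Restarts each search at the previous break position, so text that
    // was part of the previous match's "after" portion is tested again.
    static List<Integer> breaksRestartingAtBreak(Pattern rule, String text) {
        List<Integer> breaks = new ArrayList<>();
        Matcher m = rule.matcher(text);
        int start = 0;
        int prevStart = -1;
        // The start != prevStart guard stops the loop when a match does
        // not advance the search position.
        while ((start != prevStart) && m.find(start)) {
            // Assumption: group 1 holds the text that precedes the break.
            int n = m.start() + m.group(1).length();
            // Set next start
            prevStart = start;
            start = n;
            breaks.add(n);
        }
        return breaks;
    }

    public static void main(String[] args) {
        // Made-up rule: break after a period, whatever character follows.
        Pattern rule = Pattern.compile("(\\.)(.)");
        System.out.println(breaksRestartingAtBreak(rule, "a...b")); // prints [2, 3, 4]
    }
}
```

With a rule whose group 1 is empty and which matches at the current start, such as the made-up `()(a)` on `"aa"`, `n` equals `start`, so the guard ends the loop after the first match instead of spinning forever.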