SRXSegmenter does not handle parts covered by previous match

Issue #426 resolved
Former user created an issue

Original issue 426 created by @ysavourel on 2014-12-05T16:07:00.000Z:

See https://groups.yahoo.com/neo/groups/okapitools/conversations/topics/4478 for details.

The second rule really means „.“, i.e. always break.

Then, yes, I think there is a problem in the code: we don’t check the rule on the parts of the text included in the previous match.
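
The behaviour described above can be reproduced outside Okapi with plain `java.util.regex`. The following is a minimal sketch, not the SRXSegmenter code: the class `OverlapDemo`, the helper `breakPositions` and the rule `(\.)(.)` are assumptions made up for illustration. It relies only on the documented fact that the no-argument `Matcher.find()` resumes searching after the end of the previous match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapDemo {
    // Hypothetical helper: collects break positions the way the original loop
    // does, i.e. match start plus the length of group 1 (the "before break" part).
    static List<Integer> breakPositions(Pattern rule, String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = rule.matcher(text);
        // find() with no argument continues after the END of the previous match,
        // so characters consumed by group 2 are never tried as a new match start.
        while (m.find()) {
            positions.add(m.start() + m.group(1).length());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Illustrative rule: break after a period that is followed by any character.
        Pattern rule = Pattern.compile("(\\.)(.)");
        // The second period is consumed as context of the first match,
        // so the break after it (position 3) is never reported.
        System.out.println(breakPositions(rule, "a..b")); // prints [2]
    }
}
```

In this sketch the text `a..b` yields only position 2, although the rule also applies at the second period.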

Changing the code in SRXSegmenter.java from this:

m = rule.pattern.matcher(codedText);
while ( m.find() ) {
    int n = m.start()+m.group(1).length();
    if ( n > codedText.length() ) continue;

To this:

m = rule.pattern.matcher(codedText);
int start = 0;
while ( m.find(start) ) {
    int n = m.start()+m.group(1).length();
    start++;
    if ( n > codedText.length() ) continue;

This should resolve the problem.

However, the change has a side effect in the Aligner step tests.

Comments (2)

  1. Former user Account Deleted

    Comment 1. originally posted by @ysavourel on 2014-12-07T16:37:24.000Z:

    The Aligner tests showed an issue with the first solution (e.g. for a pattern like "1.2.3. "). A better one:

    int start = 0;
    int prevStart = -1;
    while (( start != prevStart ) && m.find(start) ) {
        int n = m.start()+m.group(1).length();
        // Set next start
        prevStart = start;
        start = n;
        ...

    It passes all existing tests and additional ones.
    I'll push this soon.
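
The revised loop from the comment above can be tried in isolation with a small sketch. This is an illustration under stated assumptions, not the code that was pushed: the class `RescanDemo`, the helper `breakPositions` and the rule `(\.)(.)` are hypothetical, and the elided part of the original loop is not reproduced. Only `Matcher.find(int)`, which resets the matcher and searches from the given index, is taken from the standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RescanDemo {
    // Hypothetical helper mirroring the revised loop: each search restarts at
    // the previous break position, so text matched as trailing context
    // (group 2) is examined again as a possible match start.
    static List<Integer> breakPositions(Pattern rule, String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = rule.matcher(text);
        int start = 0;
        int prevStart = -1;
        // The prevStart guard ends the loop if a match does not move the
        // start index forward (for example when group 1 is empty).
        while ((start != prevStart) && m.find(start)) {
            int n = m.start() + m.group(1).length();
            prevStart = start;
            start = n; // n never exceeds text.length(), so find(start) is safe
            positions.add(n);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Illustrative rule: break after a period that is followed by any character.
        Pattern rule = Pattern.compile("(\\.)(.)");
        System.out.println(breakPositions(rule, "a..b")); // prints [2, 3]
    }
}
```

With this illustrative rule, the break after the second period in `a..b` is now found, and each search resumes from the last break position rather than advancing one character at a time.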
