SRX issue with empty "beforebreak" in rule

Issue #456 resolved
Former user created an issue

Original issue 456 created by @ysavourel on 2015-04-16T19:46:26.000Z:

Original post:

[[
Now I stumbled on another issue that might be a bug. Please have a look at the following two simple rules:

<rule break="no">
<beforebreak></beforebreak>
<afterbreak>[a-zA-Z0-9]</afterbreak>
</rule>

<rule break="yes">
<beforebreak>\.</beforebreak>
<afterbreak></afterbreak>
</rule>

The first rule means: Don’t break if an alphanumeric character follows (after a full stop).
The second rule is obvious: Do break if there is a full stop.

Given is the following example that includes a German date: First sentence. Second sentence 01.07.2014.
I expect two segments to be created: [First sentence.][ Second sentence 01.07.2014.]
But I get: [First sentence.][ Second sentence 01.][07.][2014.]

The reason seems to be that „beforebreak“ of the first rule is empty. If I change it to „\.“ the segmentation works fine.

I also took a look at the source code and suspect there is a bug in SRXDocument. The implementation of „compileRules“ does not distinguish whether beforebreak or afterbreak is empty. But if beforebreak is empty, „AUTO_INLINECODES“ should probably not be added to the pattern.
]]

The issue was confirmed.
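
To make the expected result in the report concrete, here is a minimal sketch of rule-order segmentation. It assumes that at each candidate position the first rule whose clauses both match decides whether to break, and that an empty clause is treated as an empty regex that matches anywhere. The class `SrxOrderSketch` and its members are hypothetical and are not Okapi's SRXDocument or SrxSegmenter code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical, simplified segmenter for illustration only.
class SrxOrderSketch {

    // One SRX-style rule; field names are illustrative.
    static final class Rule {
        final boolean isBreak;
        final Pattern before; // must match text ending at the candidate position
        final Pattern after;  // must match text starting at the candidate position

        Rule(boolean isBreak, String before, String after) {
            this.isBreak = isBreak;
            // Assumption: an empty clause becomes an empty regex, which matches at any position.
            this.before = Pattern.compile("(?s).*(?:" + before + ")");
            this.after = Pattern.compile("(?:" + after + ")");
        }

        boolean appliesAt(String text, int pos) {
            return before.matcher(text.substring(0, pos)).matches()
                && after.matcher(text.substring(pos)).lookingAt();
        }
    }

    static List<String> segment(String text, List<Rule> rules) {
        List<String> segments = new ArrayList<>();
        int start = 0;
        for (int pos = 1; pos < text.length(); pos++) {
            for (Rule rule : rules) {
                if (rule.appliesAt(text, pos)) {
                    // The first rule that applies decides; later rules are not consulted.
                    if (rule.isBreak) {
                        segments.add(text.substring(start, pos));
                        start = pos;
                    }
                    break;
                }
            }
        }
        segments.add(text.substring(start));
        return segments;
    }

    // The two rules and the sample text from the report.
    static List<String> demo() {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule(false, "", "[a-zA-Z0-9]"));
        rules.add(new Rule(true, "\\.", ""));
        return segment("First sentence. Second sentence 01.07.2014.", rules);
    }

    public static void main(String[] args) {
        for (String s : demo()) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Under these assumptions the sketch yields the two segments the reporter expected, because the no-break rule wins at the positions inside the date where a digit follows the full stop.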

Thread:
https://groups.yahoo.com/neo/groups/okapitools/conversations/messages/4536

Comments (10)

  1. Jim Hargrave

    The SRX spec says that BeforeBreak and AfterBreak rules content must be a valid regex. If we strictly follow the spec both of the rules below are illegal. However, you commonly see rules like this:

    <rule break="yes">
        <beforebreak>\n</beforebreak>
        <afterbreak></afterbreak>
    </rule>
    

    And this:

    <rule break="no">
    <beforebreak></beforebreak>
    <afterbreak>[a-zA-Z0-9]</afterbreak>
    </rule>
    

    Should we accept these types of rules as a de facto standard? How do we interpret the empty rules? The underlying SRX engine must convert them to some type of regex in a consistent way.

  2. Okapi Framework repo owner

    fix issue 456 - srx with empty rules...

    It's unclear what the full implications of this change are. The SRX spec does not allow empty rules, but common use implies they should be treated the same as a regex "match anything" or "match anything or nothing".

    → <<cset fede24627089>>

  3. YvesS

    I've sent a question to Rodolfo, the implementer of SRX for Swordfish (and the specification editor). Let's see what he thinks the behavior should be.

  4. YvesS

    Here is Rodolfo's answer:

    =====

    Check the rules for Japanese in the example included in the SRX 2.0 specification, at http://www.gala-global.org/oscarStandards/srx/srx20.html#AppSample

    The sample shows an empty <afterbreak> element meaning that anything matches so you have to break segments after any ideographic punctuation mark included in <beforebreak>.

    You can also have empty <beforebreak> elements, as they are defined exactly like <afterbreak>. Nevertheless, it is better to simply omit the element when there is no regular expression to use (<rule> needs to have at least one child, it is not required to have both <beforebreak> and <afterbreak>)

    =====

  5. Jim Hargrave

    If I read this correctly, then an empty before/afterBreak should be implemented as the regex (.|\n|), which means match anything or nothing. Should I merge my changes?

    Note we would have one failing unit test if we make the default regex the same for before and after Break.

  6. Jim Hargrave

    Actually, a cleaner option would be to simply remove the group associated with the empty rule part. This would be faster overall and avoid any strange side effects the "match anything" regex might have.

  7. Jim Hargrave

    Our SrxSegmenter (engine) isn't wired to handle single-"clause" rules built with code like the snippet below, so it looks like we need some kind of regex for the empty part. But I'm still having trouble making the ticket test pass with the match-anything-or-nothing regex ".|\n|" (it does work with the match-anything regex ".|\n").

    if (rule.before.isEmpty()) {
      // must add empty group to maintain group count
      pattern = "()" + afterPattern;
    } else if (rule.after.isEmpty()) {
      // must add empty group to maintain group count
      pattern = beforePattern + "()";
    } else {
      pattern = beforePattern + afterPattern;
    }
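
The placeholder options discussed in the comments above can be compared with plain `java.util.regex`. The sketch below assumes each clause is wrapped in exactly one capturing group, which is a guess at what "maintain group count" refers to; `EmptyClauseSketch` and its methods are hypothetical and are not the actual compileRules code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class EmptyClauseSketch {

    // Wraps a clause in one capturing group; an empty clause gets the placeholder instead.
    static String group(String clause, String placeholder) {
        return "(" + (clause.isEmpty() ? placeholder : clause) + ")";
    }

    static String compose(String before, String after, String placeholder) {
        return group(before, placeholder) + group(after, placeholder);
    }

    // Returns {group 1, group 2} for a match anchored at the start of text, or null if none.
    static String[] groupsAtStart(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.lookingAt()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    static String show(String[] groups) {
        return groups == null ? "no match" : "[" + groups[0] + "][" + groups[1] + "]";
    }

    public static void main(String[] args) {
        String after = "[a-zA-Z0-9]";
        String emptyGroup = compose("", after, "");           // ()([a-zA-Z0-9])
        String anyOrNothing = compose("", after, ".|\\n|");   // (.|\n|)([a-zA-Z0-9])
        String anything = compose("", after, ".|\\n");        // (.|\n)([a-zA-Z0-9])

        for (String text : new String[] { "ab", "a" }) {
            System.out.println(text + " emptyGroup:   " + show(groupsAtStart(emptyGroup, text)));
            System.out.println(text + " anyOrNothing: " + show(groupsAtStart(anyOrNothing, text)));
            System.out.println(text + " anything:     " + show(groupsAtStart(anything, text)));
        }
    }
}
```

All three variants keep two capturing groups, but they consume text differently. On "ab" the empty group leaves the first group empty and puts "a" in the second, while both placeholder regexes take "a" into the first group and "b" into the second. On "a" the ".|\n" variant does not match at all, whereas ".|\n|" falls back to an empty first group. Whether this difference is what affects the ticket test is not established here.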
    