SRX issue with empty "beforebreak" in rule

Issue #456 resolved

Former user created an issue 2015-04-16

Original issue 456 created by @ysavourel on 2015-04-16T19:46:26.000Z:

Original post:

[[
Now I stumbled on another issue that might be a bug. Please have a look at the following two simple rules:

The first rule means: Don’t break if there is a non-whitespace character (after a full stop).
The second rule is obvious: Do break if there is a full stop.

Given the following example, which includes a German date: First sentence. Second sentence 01.07.2014.
I expect two segments to be created: [First sentence.][ Second sentence 01.07.2014.]
But I get: [First sentence.][ Second sentence 01.][07.][2014.]

The reason seems to be that „beforebreak“ of the first rule is empty. If I change it to „\.“ the segmentation works fine.

I also took a look at the source code and suspect there is a bug in SRXDocument. The implementation of „compileRules“ does not treat an empty beforebreak or afterbreak any differently from a non-empty one. But if beforebreak is empty, „AUTO_INLINECODES“ should probably not be added to the pattern.
]]
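
To make the expected result concrete, here is a minimal Java sketch of first-match-wins rule evaluation. It is not Okapi's implementation: `EmptyBeforeBreakSketch`, `Rule`, `matchesAt`, and `segment` are hypothetical names. The two rules are an assumption reconstructed from the prose description in the post (a no-break rule with an empty beforebreak and `\S` as afterbreak, followed by a break rule with `\.` as beforebreak and an empty afterbreak). An empty rule part is treated here as matching at every position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not Okapi code: the first matching rule decides each position.
class EmptyBeforeBreakSketch {

    // Illustrative rule holder; empty strings stand for empty SRX elements.
    static final class Rule {
        final boolean isBreak;
        final Pattern before; // must match text ending at the candidate position
        final Pattern after;  // must match text starting at the candidate position

        Rule(boolean isBreak, String before, String after) {
            this.isBreak = isBreak;
            // An empty part yields a pattern that matches the empty string.
            this.before = Pattern.compile("(?:" + before + ")\\z");
            this.after = Pattern.compile(after);
        }

        boolean matchesAt(String text, int pos) {
            return before.matcher(text.substring(0, pos)).find()
                && after.matcher(text.substring(pos)).lookingAt();
        }
    }

    static List<String> segment(String text, List<Rule> rules) {
        List<String> segments = new ArrayList<>();
        int start = 0;
        for (int pos = 1; pos < text.length(); pos++) {
            for (Rule rule : rules) {
                if (rule.matchesAt(text, pos)) {
                    if (rule.isBreak) {
                        segments.add(text.substring(start, pos));
                        start = pos;
                    }
                    break; // earlier rules take precedence over later ones
                }
            }
        }
        segments.add(text.substring(start));
        return segments;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(
            new Rule(false, "", "\\S"), // assumed first rule: no break before non-whitespace
            new Rule(true, "\\.", "")   // assumed second rule: break after a full stop
        );
        for (String s : segment("First sentence. Second sentence 01.07.2014.", rules)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Under these assumptions the sketch yields the two segments the reporter expects, because the no-break rule fires at every position followed by a non-whitespace character, including the positions inside the date.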

The issue was confirmed.

Thread:
https://groups.yahoo.com/neo/groups/okapitools/conversations/messages/4536

Comments (10)

Former user Account Deleted
- changed status to open
Comment 1. originally posted by @ysavourel on 2015-04-16T19:49:36.000Z:
- 2015-04-16T19:49:36+00:00
ysavourel
- assigned issue to Jim Hargrave
- edited description
- 2015-04-23T15:48:23+00:00
Jim Hargrave
The SRX spec says that the content of BeforeBreak and AfterBreak must be a valid regex. If we follow the spec strictly, both of the rules below are illegal. However, you commonly see rules like this:
```
<rule break="yes">
    <beforebreak>\n</beforebreak>
    <afterbreak></afterbreak>
</rule>
```
And this:
```
<rule break="no">
    <beforebreak></beforebreak>
    <afterbreak>[a-zA-Z0-9]</afterbreak>
</rule>
```
Should we accept these types of rules as a de facto standard? How do we interpret the empty rules? The underlying SRX engine must convert them to some type of regex in a consistent way.
- 2015-05-15T18:33:11+00:00
Okapi Framework repo owner
- changed status to resolved
fix issue 456 - srx with empty rules...

The full implications of this change are unclear. The SRX spec does not allow empty rules, but common use implies they should be treated the same as a regex meaning "match anything" or "match anything or nothing".

→ <<cset fede24627089>>
- 2015-05-15T19:21:44+00:00
ysavourel
I've sent a question to Rodolfo, the implementer of SRX for Swordfish (and the specification editor). Let's see what he thinks the behavior should be.
- 2015-05-18T17:02:23+00:00
ysavourel
Here is Rodolfo's answer:

=====

Check the rules for Japanese in the example included in the SRX 2.0 specification, at http://www.gala-global.org/oscarStandards/srx/srx20.html#AppSample

The sample shows an empty <afterbreak> element meaning that anything matches so you have to break segments after any ideographic punctuation mark included in <beforebreak>.

You can also have empty <beforebreak> elements, as they are defined exactly like <afterbreak>. Nevertheless, it is better to simply omit the element when there is no regular expression to use (<rule> needs to have at least one child, it is not required to have both <beforebreak> and <afterbreak>)

=====
- 2015-05-18T20:49:49+00:00
Jim Hargrave
If I read this correctly, then an empty before/afterBreak should be implemented as the regex (.|\n|), which means "match anything or nothing". Should I merge my changes?

Note that we would have one failing unit test if we made the default regex the same for beforeBreak and afterBreak.
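
As a quick illustration of the difference between the two candidate regexes, the following Java sketch measures how much each one consumes at the start of an input. The class and method names are hypothetical and only `java.util.regex` is used; it says nothing about how Okapi wires these patterns into its engine.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for comparing "match anything" with "match anything or nothing".
class EmptyAlternativeSketch {

    // Length of the match at the start of input, or -1 if the regex does not match there.
    static int matchLengthAtStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String anythingOrNothing = "(.|\\n|)";
        String anything = "(.|\\n)";
        System.out.println(matchLengthAtStart(anythingOrNothing, "a"));  // 1
        System.out.println(matchLengthAtStart(anythingOrNothing, "\n")); // 1
        System.out.println(matchLengthAtStart(anythingOrNothing, ""));   // 0
        System.out.println(matchLengthAtStart(anything, ""));            // -1
        System.out.println(matchLengthAtStart(anything, "\r"));          // -1
    }
}
```

The trailing empty alternative is what lets the first pattern succeed at the end of the text. Note also that, with Java's default flags, `.` does not match line terminators such as `\r`, so `(.|\n)` is not strictly "anything".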
- 2015-05-18T21:17:11+00:00
Jim Hargrave
Actually, a cleaner option would simply be to remove the group associated with the empty rule part. This would be faster overall and avoid any strange side effects the "match anything" regex might have.
- 2015-05-18T21:22:38+00:00
ysavourel
If that behaves the same, yes.
- 2015-05-18T21:24:28+00:00
Jim Hargrave
Our SrxSegmenter (engine) isn't wired to deal with single-"clause" rules using code like the snippet below. It looks like we have to have some kind of regex. But I'm still having trouble making the ticket test pass with the "match anything or nothing" regex ".|\n|" (it does work with the "match anything" regex ".|\n").
```
if (rule.before.isEmpty()) {
  // must add empty group to maintain group count
  pattern = "()" + afterPattern;
} else if (rule.after.isEmpty()) {
  // must add empty group to maintain group count
  pattern = beforePattern + "()";
} else {
  pattern = beforePattern + afterPattern;
}
```
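
The idea of keeping the group count stable can be sketched in isolation as follows. This is an illustrative reduction, not the SrxSegmenter code: `GroupCountSketch`, `buildPattern`, and `firstBoundary` are hypothetical names, and the sketch assumes that rule parts contain no capturing groups of their own and that the engine reads the candidate boundary from the end of group 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: every compiled rule has exactly two capturing groups.
class GroupCountSketch {

    // Wrap each rule part in one capturing group; an empty part becomes "()".
    static String buildPattern(String before, String after) {
        return "(" + before + ")(" + after + ")";
    }

    // Offset of the first candidate boundary: where group 1 ends and group 2 begins.
    static int firstBoundary(String before, String after, String text) {
        Matcher m = Pattern.compile(buildPattern(before, after)).matcher(text);
        return m.find() ? m.end(1) : -1;
    }

    public static void main(String[] args) {
        // Both rule shapes from the discussion compile to two groups.
        System.out.println(Pattern.compile(buildPattern("\\n", "")).matcher("").groupCount());
        System.out.println(Pattern.compile(buildPattern("", "[a-zA-Z0-9]")).matcher("").groupCount());
        System.out.println(firstBoundary("\\n", "", "a\nb"));         // boundary after the newline
        System.out.println(firstBoundary("", "[a-zA-Z0-9]", ". x")); // boundary before "x"
    }
}
```

Because an empty group matches the empty string at any position, code that reads group offsets keeps working without a special case for single-part rules.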
- 2015-05-18T22:51:38+00:00

Assignee: Jim Hargrave

Type: bug

Priority: minor

Status: resolved

Milestone: –

Version: –

Votes: 0

Watchers: 3