Ratel: Segmentation not possible when inline codes present
Original issue 445 created by m...@sebastianebert.com on 2015-02-23T15:04:23.000Z:
What steps will reproduce the problem?
Segment the following Text in Ratel:
This is the first sentence.<x0/>This is the second sentence.
Using the following rules:
Before break: \.
After break: (<\w\d+/?>)
What is the expected output?
[This is the first sentence.][<x0/>This is the second sentence.]
What do you see instead?
[This is the first sentence.<x0/>This is the second sentence.]
So the break rule does not seem to work when the <x0/> tag is present. The regexp itself works (I checked it in other tools), but segmentation seems to fail.
What version of the product are you using? On what operating system?
0.26
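For illustration, here is a minimal Python sketch of the rule pair from the report applied directly to the raw text, with the tag left in place. The helper name `split_raw` is hypothetical and this is not Ratel's or any SRX library's actual code; it only shows that the two regular expressions do meet at the expected position when the tag is visible to the matcher.

```python
import re

def split_raw(text, before_break, after_break):
    """Cut wherever a before_break match ends and an after_break match begins."""
    after = re.compile(after_break)
    cuts = [0]
    for m in re.finditer(before_break, text):
        end = m.end()
        # Skip a break at the very end of the text; it would create an empty segment.
        if end < len(text) and after.match(text, end):
            cuts.append(end)
    cuts.append(len(text))
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]

raw = "This is the first sentence.<x0/>This is the second sentence."
print(split_raw(raw, r"\.", r"(<\w\d+/?>)"))
# ['This is the first sentence.', '<x0/>This is the second sentence.']
```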
Comments (4)
-
Account Deleted Comment 2. originally posted by m...@sebastianebert.com on 2015-02-24T20:33:43.000Z:
I think you might be right. I tested the following example:
First sentence.<x0/>Sencond sentence.
Before break: \.
After break: \s
results in:
[first sentence.<x0/>Sencond sentence.]
However, if I put a space either before or after the tag, segmentation works:
[first sentence.][ <x0/>Sencond sentence.]
So I would propose not stripping the tags and reinserting them. In my case the source is an IDML file and <x0/> represents a line break. With the current behaviour it's not possible to get an adequate segmentation: I get huge segments (whole paragraphs) consisting of multiple sentences. Any chance to change this?
-
Account Deleted Comment 3. originally posted by @ysavourel on 2015-02-27T06:30:44.000Z:
I think this change is unlikely to be made, since I believe that something close to the opposite change has previously been made to arrive at the current behavior (see issue #169).
The real issue is SRX itself, which doesn't actually specify a method for matching against an inline code. Your regex -- which matches the literal text "<w0>" -- won't match real codes if it was used as part of a segmentation step in a processing pipeline. SRX is a broken standard, basically.
I noticed, however, that if I try these rules, I get the result you want:
Before break: \.
After break:
i.e., the "after break" rule is the empty string. This produces
[This is the first sentence.][<x0/>This is the second sentence.]
for me.
-
- changed status to wontfix
See "The real issue is SRX itself, which doesn't actually specify a method for matching against an inline code. Your regex -- which matches the literal text "<w0>" -- won't match real codes if it was used as part of a segmentation step in a processing pipeline. SRX is a broken standard, basically."
Comment 1. originally posted by @ysavourel on 2015-02-24T20:05:05.000Z:
I believe the segmentation library strips codes from the content before applying the rules, then reinserts them after segmentation.
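The behaviour described here can be sketched in Python as follows. This is only a model of the assumed strip, segment, reinsert sequence, not the library's actual code: the function name, the inline-code pattern `CODE`, and the choice to attach a code sitting exactly on a break to the following segment are all assumptions made for illustration.

```python
import re

# Assumption: inline codes look like the literal tags in the report, e.g. <x0/>.
CODE = re.compile(r"<\w\d+/?>")

def segment_with_stripped_codes(text, before_break, after_break):
    # 1. Strip the codes, remembering each one's offset in the code-free text.
    parts, codes, pos, plain_len = [], [], 0, 0
    for m in CODE.finditer(text):
        chunk = text[pos:m.start()]
        parts.append(chunk)
        plain_len += len(chunk)
        codes.append((plain_len, m.group()))
        pos = m.end()
    parts.append(text[pos:])
    plain = "".join(parts)

    # 2. Apply the rules to the code-free text only.
    after = re.compile(after_break)
    breaks = [m.end() for m in re.finditer(before_break, plain)
              if m.end() < len(plain) and after.match(plain, m.end())]

    # 3. Map break offsets back to the original text. A code located exactly
    #    at a break is not counted, so it lands in the following segment.
    def to_original(p):
        return p + sum(len(c) for at, c in codes if at < p)

    cuts = [0] + [to_original(b) for b in breaks] + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]

text = "This is the first sentence.<x0/>This is the second sentence."
print(segment_with_stripped_codes(text, r"\.", r"(<\w\d+/?>)"))  # one segment
print(segment_with_stripped_codes(text, r"\.", r""))             # two segments
```

Under these assumptions the tag-matching "after break" rule never fires, because the tag is no longer in the text the rule sees, while an empty "after break" rule or a whitespace character next to the tag still produces a break.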