Ratel: Segmentation not possible when inline codes present

Issue #445 wontfix
Former user created an issue

Original issue 445 created by m...@sebastianebert.com on 2015-02-23T15:04:23.000Z:

What steps will reproduce the problem?

Segment the following text in Ratel:
This is the first sentence.<x0/>This is the second sentence.

Using the following rules:
Before break: \.
After break: (<\w\d+/?>)

What is the expected output?
[This is the first sentence.][<x0/>This is the second sentence.]

What do you see instead?
[This is the first sentence.<x0/>This is the second sentence.]

So the break rule does not seem to work when the <x0/> tag is present. The regexp itself works (I checked it in other tools), but segmentation seems to fail.
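
For comparison, here is a minimal Python sketch of what "the regexp is working" means when the tag is treated as literal text. This is not Ratel or Okapi code; the function name `segment` is hypothetical, and it assumes a break is placed wherever the "before" pattern ends and the "after" pattern matches at that same position.

```python
import re

def segment(text, before, after):
    """Split text wherever `before` ends at a position and `after` matches from it."""
    before_re, after_re = re.compile(before), re.compile(after)
    segments, start = [], 0
    for m in before_re.finditer(text):
        pos = m.end()
        # Skip breaks that would produce an empty first or last segment.
        if start < pos < len(text) and after_re.match(text, pos):
            segments.append(text[start:pos])
            start = pos
    segments.append(text[start:])
    return segments

text = "This is the first sentence.<x0/>This is the second sentence."
print(segment(text, r"\.", r"(<\w\d+/?>)"))
# ['This is the first sentence.', '<x0/>This is the second sentence.']
```

On the raw string the rules give the expected two segments, so the difference in Ratel would have to come from how inline codes are handled before the rules are applied.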

What version of the product are you using? On what operating system?
0.26

Comments (4)

  1. Former user Account Deleted

    Comment 1. originally posted by @ysavourel on 2015-02-24T20:05:05.000Z:

    I believe the segmentation library strips codes from the content before applying the rules, then reinserts them after segmentation.

  2. Former user Account Deleted

    Comment 2. originally posted by m...@sebastianebert.com on 2015-02-24T20:33:43.000Z:

    I think you might be right. I tested the following example:

    First sentence.<x0/>Second sentence.

    Before break: \.
    After break: \s

    results in:
    [First sentence.<x0/>Second sentence.]

    However, if I put a space either before or after the tag, segmentation works:
    [First sentence.][ <x0/>Second sentence.]

    So I would propose not stripping the tags before segmentation. In my case the source is an IDML file and <x0/> represents a line break. With the current behaviour, adequate segmentation is not possible: I get huge segments (whole paragraphs) consisting of multiple sentences. Is there any chance of changing this?

  3. Former user Account Deleted

    Comment 3. originally posted by @ysavourel on 2015-02-27T06:30:44.000Z:

    I think this change is unlikely to be made, since I believe that something close to the opposite change has previously been made to arrive at the current behavior (see issue #169).

    The real issue is SRX itself, which doesn't actually specify a method for matching against an inline code. Your regex -- which matches the literal text "<w0>" -- won't match real codes if it was used as part of a segmentation step in a processing pipeline. SRX is a broken standard, basically.

    I noticed, however, that if I try these rules, I get the result you want:
    Before break: \.
    After break:

    i.e., the "after break" rule is the empty string. This produces
    [This is the first sentence.][<x0/>This is the second sentence.]

    for me.
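
The results reported in this thread are consistent with a strip, segment, reinsert model. The Python sketch below illustrates that model only; it is not the Okapi implementation. The names `CODE`, `break_offsets` and `segment_ignoring_codes` are hypothetical, the textual code pattern is an assumption, and attaching a code that sits exactly at a break to the following segment is an assumption chosen to match the output reported above.

```python
import re

CODE = re.compile(r"<\w\d+/?>")  # assumed textual form of an inline code

def break_offsets(plain, before, after):
    """Offsets in the code-free text where `before` ends and `after` matches."""
    before_re, after_re = re.compile(before), re.compile(after)
    return [m.end() for m in before_re.finditer(plain)
            if 0 < m.end() < len(plain) and after_re.match(plain, m.end())]

def segment_ignoring_codes(text, before, after):
    # 1. Strip the codes, remembering where each one sat in the code-free text.
    codes, parts, plain_len, last = [], [], 0, 0
    for m in CODE.finditer(text):
        parts.append(text[last:m.start()])
        plain_len += m.start() - last
        codes.append((plain_len, m.group()))
        last = m.end()
    parts.append(text[last:])
    plain = "".join(parts)

    # 2. Apply the rules to the code-free text, then
    # 3. map each break back to the original text. A code located exactly
    #    at the break is not counted, so it lands in the following segment.
    segments, start = [], 0
    for b in break_offsets(plain, before, after):
        end = b + sum(len(code) for offset, code in codes if offset < b)
        segments.append(text[start:end])
        start = end
    segments.append(text[start:])
    return segments

text = "This is the first sentence.<x0/>This is the second sentence."
print(segment_ignoring_codes(text, r"\.", r"(<\w\d+/?>)"))  # one segment: the rule never sees the tag
print(segment_ignoring_codes(text, r"\.", r""))             # two segments, tag starts the second
```

Under this model an "after break" rule that tries to match the tag can never fire, because the tag is absent when the rules run, while an empty "after break" rule or one that matches a neighbouring space still produces a break.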

  4. Jim Hargrave (OLD)

    See "The real issue is SRX itself, which doesn't actually specify a method for matching against an inline code. Your regex -- which matches the literal text "<w0>" -- won't match real codes if it was used as part of a segmentation step in a processing pipeline. SRX is a broken standard, basically."
