Ratel: Segmentation not possible when inline codes present

Former user Account Deleted

Comment 1. originally posted by @ysavourel on 2015-02-24T20:05:05.000Z:

I believe the segmentation library strips codes from the content before applying the rules, then reinserts them after segmentation.


Former user Account Deleted

Comment 2. originally posted by m...@sebastianebert.com on 2015-02-24T20:33:43.000Z:

I think you might be right. I tested the following example:

First sentence.<x0/>Sencond sentence.

Before break: \.
After break: \s

results in:
[first sentence.<x0/>Sencond sentence.]

However, if I put a space either before or after the tag, segmentation works:
[first sentence.][ <x0/>Sencond sentence.]

So I would propose not stripping and reinserting the tags. In my case the source is an IDML file, and <x0/> represents a line break. With the current behaviour, adequate segmentation is not possible: I get huge segments (whole paragraphs) consisting of multiple sentences. Is there any chance of changing this?


Former user Account Deleted

Comment 3. originally posted by @ysavourel on 2015-02-27T06:30:44.000Z:

I think this change is unlikely to be made: I believe something close to the opposite change was made previously to arrive at the current behavior (see issue ~~#169~~).

The real issue is SRX itself, which doesn't actually specify a method for matching against an inline code. Your regex -- which matches the literal text "<w0>" -- won't match real codes if it was used as part of a segmentation step in a processing pipeline. SRX is a broken standard, basically.

I noticed, however, that if I try these rules, I get the result you want:
Before break: \.
After break:

i.e., the "after break" rule is the empty string. This produces
[This is the first sentence.][<x0/>This is the second sentence.]

for me.
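
To make the behaviour discussed in this thread concrete, here is a minimal Python sketch of a strip, segment, reinsert approach. It is not Okapi's actual implementation: the `segment` function, the `CODE` pattern, and the choice to attach a boundary code to the following segment are all assumptions made for illustration, and real inline codes are not stored as literal `<x0/>` text.

```python
import re

# Assumption: inline codes appear as literal tags such as <x0/>.
CODE = re.compile(r"<[^<>]+>")


def segment(text, before, after):
    """Split text using one before/after break rule applied to code-free text."""
    # 1. Strip the codes, remembering where each plain character sat originally.
    plain_chars, offsets = [], []
    pos = 0
    for m in CODE.finditer(text):
        for i in range(pos, m.start()):
            plain_chars.append(text[i])
            offsets.append(i)
        pos = m.end()
    for i in range(pos, len(text)):
        plain_chars.append(text[i])
        offsets.append(i)
    plain = "".join(plain_chars)

    # 2. Apply the rule to the code-free text. An empty "after" pattern
    #    places no condition on what follows the break.
    lookahead = f"(?={after})" if after else ""
    rule = re.compile(f"(?:{before}){lookahead}")
    breaks = [m.end() for m in rule.finditer(plain) if 0 < m.end() < len(plain)]

    # 3. Map each break back to the original text, cutting right after the
    #    last plain character before the break, so a code sitting on the
    #    boundary opens the next segment (an assumption of this sketch).
    segments, start = [], 0
    for b in breaks:
        cut = offsets[b - 1] + 1
        segments.append(text[start:cut])
        start = cut
    segments.append(text[start:])
    return segments


for sample, after in [
    ("First sentence.<x0/>Sencond sentence.", r"\s"),
    ("First sentence. <x0/>Sencond sentence.", r"\s"),
    ("This is the first sentence.<x0/>This is the second sentence.", ""),
]:
    print(segment(sample, r"\.", after))
```

Under these assumptions the first sample stays in one piece, because once the code is stripped there is no whitespace after the period for `\s` to match. The second sample, and the third one with the empty "after break" rule, each split in two.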


Jim Hargrave (OLD)

changed status to wontfix

See "The real issue is SRX itself, which doesn't actually specify a method for matching against an inline code. Your regex -- which matches the literal text "<w0>" -- won't match real codes if it was used as part of a segmentation step in a processing pipeline. SRX is a broken standard, basically. "

2021-03-10T19:29:56+00:00
