Segmentation: Isolated markers problem in stripping process

Email thread for the proposal to replace isolated markers with a space when stripping codes during the segmentation process, to resolve the "Text.<x/>text" issue.

==============================================

From: Alvaro Reneses Sent: Tuesday, July 28, 2015 8:53 PM Subject: Re: Isolated markers problem in stripping process

Sorry for the delay!

I also think that the approach suggested by Jim would be perfect, but it may take too much effort at this point. A basic implementation, as Giuseppe said, would be enough for our short- to mid-term goals.

I would just like to add one remark to the conversation:

I'm not as concerned as Chase and Giuseppe about the possibility of making a wrong replacement for an <x> tag. The replacement is made for segmentation purposes only; after this step, the space is replaced again by the original <x> tag.

Therefore, even if we replaced an <x> tag encoding an inline marker that should not be replaced (something that we have not seen in our tests), only a few specific rules could be negatively affected. For example, let's consider the case of decimal numbers:

123.<x>456 → 123. 456 → [123.] [456] (Expected: [123.456])

But... what is the probability of having an <x> tag in the middle of a number, if that number appears as a single unit in the original document? The only possible explanations are that the <x> is replacing metadata, which should have been cleaned before by the filters; or that it is a formatting element (such as <b> or <i> in HTML), but I cannot recall any formatting element encoded as an isolated marker.

To summarize, my point of view is that for the vast majority of the segmentation rules, adding a whitespace does not negatively affect the result, and it is hard to find an <x> tag representing an entity that does not separate the text. On the other hand, several issues will be solved by this replacement. Considering both the improvements and the potential problems, I have no doubt about the need to implement it.

Happy Tuesday!

On Tue, Jul 28, 2015 at 9:56 AM, Giuseppe Silvano wrote: I confirm what Chase said, we can wait a bit in order to allow a better implementation, it's not so urgent. Thank you!

On Tue, Jul 28, 2015 at 12:49 AM, Jim Hargrave wrote: That would be appreciated! I am very close to finalizing the code in the srx_bug branch.

Jim

On 07/27/2015 04:47 PM, Chase Tingley wrote: Ok, let's start there then. It can be a stepping-stone to more advanced functionality later.

Jim, my assumption is that we would want to wait until you merge your changes before attempting this work. What do you think? (I don't think this is the highest priority on our end, so you wouldn't be blocking us if you hold off a bit longer.)

On Mon, Jul 27, 2015 at 1:22 PM, Jim Hargrave wrote: I don't have a problem adding this as a general option to the segmenter. As long as it is turned off by default.

Jim

On 07/27/2015 07:42 AM, Giuseppe Silvano wrote: I like this idea too, I think it would be the optimal solution.
However, I think it requires a lot of development effort, and so a lot of time too.

While we discuss and work on this optimal solution, I suggest changing the default behavior for <x> tags in the meantime, so that they are treated as whitespace.

I say this because in almost all the segments I analyzed, <x> tags could properly be considered text separators. I understand that in some cases this is wrong, but what I see is that in the vast majority of cases <x> tags are text separators, and only in very few cases they are not.

What do you think about this?

Hope to be helpful in the discussion, bye!

On Fri, Jul 24, 2015 at 6:45 PM, Chase Tingley wrote: I am looping Alvaro and Giuseppe back in to the discussion.

I like Jim's idea. It would address the small tickling fear that I have that "isolated tag" is not specific enough as a trigger to segment intelligently. And it would allow us to selectively apply this behavior in different content types and contexts.

Of course, it would require a bit more intelligence in the filters, but that's where it belongs.

Let's talk about it next week. (I will probably dial in late, as usual.)

ct

On Fri, Jul 24, 2015 at 8:53 AM, Jim Hargrave wrote: Instead of a global segmentation option, would it be better to label any inline codes that should be treated as whitespace? I think it would allow more nuanced control. The filters should understand the formatting and pass that info downstream. To "label" the codes, we could add another field to Code or rely on a specific type "whitespace". I would prefer a new field.
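As an illustration of the labeling idea only, here is a minimal Java sketch. Everything in it is hypothetical: the InlineCode class, its actsAsWhitespace field, and the marker and index characters are stand-ins, not Okapi's Code class or TextFragment constants. It assumes a filter has already decided, per code, whether the code separates text.

```java
import java.util.Arrays;
import java.util.List;

final class LabeledCodeSketch {
    // Hypothetical inline code; 'actsAsWhitespace' plays the role of the proposed new field
    static final class InlineCode {
        final String data;
        final boolean actsAsWhitespace;

        InlineCode(String data, boolean actsAsWhitespace) {
            this.data = data;
            this.actsAsWhitespace = actsAsWhitespace;
        }
    }

    static final char MARKER = '\uE103';     // stand-in for an isolated-code marker
    static final char INDEX_BASE = '\uE110'; // stand-in base for the code index character

    // Strip codes, leaving a space only where the code is labeled as whitespace
    static String strip(String coded, List<InlineCode> codes) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < coded.length(); i++) {
            char c = coded.charAt(i);
            if (c == MARKER) {
                // The character after the marker identifies the code
                InlineCode code = codes.get(coded.charAt(++i) - INDEX_BASE);
                if (code.actsAsWhitespace) {
                    out.append(' ');
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<InlineCode> codes = Arrays.asList(
            new InlineCode("<br/>", true),    // line break: separates text
            new InlineCode("<wbr/>", false)); // break opportunity: does not
        String coded = "One." + MARKER + INDEX_BASE
            + "Exam" + MARKER + (char) (INDEX_BASE + 1) + "ple.";
        System.out.println(strip(coded, codes)); // prints "One. Example."
    }
}
```

With a per-code label, the decimal-number case would only break if a filter wrongly labeled the code inside the number.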

Keep in mind that I haven't merged my segmentation changes yet. Some of the inline code logic has changed, but looking at these changes I don't think there will be any conflicts.

Can we discuss this in next week's meeting? I think we may be able to make this feature more general.

Jim

On 07/24/2015 07:55 AM, Yves Savourel wrote: Hi Chase,

I can’t think of any issue with an option for this.

-yves

From: Chase Sent: Friday, July 24, 2015 1:11 AM Subject: Fwd: Isolated markers problem in stripping process

Hi guys,

I'm forwarding this one to get your thoughts on it, since you've spent a lot more time thinking about segmentation recently than I have.

I think Alvaro's argument is an interesting one. What he's proposing would be optional behavior. What do you think?

---------- Forwarded message ---------- From: Alvaro Reneses Date: Thu, Jul 16, 2015 at 2:46 AM Subject: Isolated markers problem in stripping process

Good morning, Chase,

This is a recap of the problem with the isolated markers that we discussed yesterday.

Problem

In the segmentation step, when the tags are stripped, the segmentation is performed like this:

Example.<x> Example. → Example. Example. → [Example.] [Example.] → [Example.<x>] [Example.]

This is the desired output. However, when a linebreak, image or any other inline element is encoded as an <x> and there are no additional spaces, the output is:

Example.<x>Example. → Example.Example. → [Example.Example.] → [Example.<x>Example.]

As we can see, the inline element is stripped without leaving anything in its place, so no break is found between the two sentences.
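The effect can be reproduced outside Okapi with a minimal sketch. The marker value, the index character, and the single break rule below are stand-ins chosen for illustration; they are assumptions, not Okapi's actual constants or a real SRX rule set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IsolatedMarkerProblem {
    // Hypothetical stand-ins for an isolated-code marker and its index character
    static final char MARKER = '\uE103';
    static final char INDEX = '\uE110';

    // Simplified break rule: break at whitespace that follows a period
    static final Pattern BREAK = Pattern.compile("(?<=\\.)\\s+");

    // Strip each marker/index pair, optionally leaving a space in its place
    static String strip(String coded, boolean spaceForMarker) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < coded.length(); i++) {
            if (coded.charAt(i) == MARKER) {
                if (spaceForMarker) {
                    out.append(' ');
                }
                i++; // skip the index character as well
            } else {
                out.append(coded.charAt(i));
            }
        }
        return out.toString();
    }

    // Split the plain text wherever the simplified rule matches
    static List<String> segment(String plain) {
        List<String> segments = new ArrayList<>();
        Matcher m = BREAK.matcher(plain);
        int start = 0;
        while (m.find()) {
            segments.add(plain.substring(start, m.start()));
            start = m.end();
        }
        if (start < plain.length()) {
            segments.add(plain.substring(start));
        }
        return segments;
    }

    public static void main(String[] args) {
        String coded = "Example." + MARKER + INDEX + "Example.";
        System.out.println(segment(strip(coded, false)).size()); // prints 1
        System.out.println(segment(strip(coded, true)).size());  // prints 2
    }
}
```

Without the substituted space, the rule never sees whitespace after the first period, so the text stays in one segment.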

Solution

In the tag-stripping step, replace each isolated marker with a space. This space will not be taken into account in the segmentation because it will be trimmed by the rules.

Implementation

First of all, we added a new option in SRXSegmenter:

private boolean spaceIsolatedMarker;

We included it in the reset:

@Override
public void reset () {
   currentLanguageCode = null;
   rules = new ArrayList<CompiledRule>();
   maskRule = null;
   splits = null;
   segmentSubFlows = true; // SRX default
   cascade = false; // There is no SRX default for this
   includeStartCodes = false; // SRX default
   includeEndCodes = true; // SRX default
   includeIsolatedCodes = false; // SRX default
   oneSegmentIncludesAll = false; // Extension
   trimLeadingWS = false; // Extension IN TEST (was true for StringInfo)
   trimTrailingWS = false; // Extension IN TEST (was true for StringInfo)
   useJavaRegex = false; // Extension
   trimCodes = false; // Extension IN TEST (was false for StringInfo) NOT USED for now
   spaceIsolatedMarker = false;
   icuRegex.reset();
}

And we created a setter:

public void setSpaceIsolatedMarker(boolean spaceIsolatedMarker) {
   this.spaceIsolatedMarker = spaceIsolatedMarker;
}

Then, we modified the removeCodes function (in TextUnitUtil):

public static String removeCodes (String codedText, boolean spaceIsolatedCodes) {
   StringBuilder tmp = new StringBuilder();
   for (int i=0; i<codedText.length(); i++) {
      switch (codedText.charAt(i)) {
         case TextFragment.MARKER_ISOLATED:
            // If requested, replace the isolated code by a space
            if (spaceIsolatedCodes) {
               tmp.append(" ");
            }
            // Intentional fall-through: the index marker is skipped below
         case TextFragment.MARKER_OPENING:
         case TextFragment.MARKER_CLOSING:
            i++; // skip index marker as well
            break;
         default:
            tmp.append(codedText.charAt(i));
            break;
      }
   }
   return tmp.toString();          
}

And we added an overload to stay backward compatible:

public static String removeCodes (String codedText) {
   return removeCodes(codedText, false);
}

And finally, we corrected the break indexes to take these changes into account:

// Adjust the split positions for in-line codes inclusion/exclusion options
// And create the list of final splits at the same time
finalSplits = new ArrayList<Integer>();
if ( hasCode ) { // Do this only if we have in-line codes
   // All breaks are before codes, as we restore a code at its original pos, and if 
   // there's a break at that pos, the code will always find itself after the break
   int lastPos = 0;
   int correctValue = 1;
   for ( int pos : splits.keySet() ) {

     ..........

      // Correct the spaces that we have added with the isolated markers
      if (spaceIsolatedMarker) {
         for (int i=0; i < codePositions.size(); i++) {
            int codePos = codePositions.get(i) + i*2;
            if (codePos < lastPos) continue;
            if (codePos > pos  ||  codePos >= codedText.length()) break;
            if (codedText.charAt(codePos) == (char) TextFragment.MARKER_ISOLATED)
               pos -= correctValue++;
         }
      }
      // Store the updated position
      finalSplits.add(pos);
      // Update last position
      lastPos = pos;
   }
}
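The correction above subtracts the inserted spaces from each split position. A different way to keep plain-text and coded-text offsets aligned, shown here only as a sketch and not as what SRXSegmenter does, is to record while stripping which coded-text offset each plain-text character came from. The marker and index characters are again hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitOffsetSketch {
    static final char MARKER = '\uE103'; // stand-in isolated marker
    static final char INDEX = '\uE110';  // stand-in index character

    // Strips markers (substituting a space) and records, for every character
    // of the plain text, the offset in the coded text that it came from.
    static String strip(String coded, List<Integer> origin) {
        StringBuilder plain = new StringBuilder();
        for (int i = 0; i < coded.length(); i++) {
            origin.add(i);
            if (coded.charAt(i) == MARKER) {
                plain.append(' ');
                i++; // skip the index character as well
            } else {
                plain.append(coded.charAt(i));
            }
        }
        origin.add(coded.length()); // offset for a break at the very end
        return plain.toString();
    }

    // Maps a break offset in the plain text back to the coded text
    static int toCodedOffset(List<Integer> origin, int plainOffset) {
        return origin.get(plainOffset);
    }

    public static void main(String[] args) {
        String coded = "Example." + MARKER + INDEX + "Example.";
        List<Integer> origin = new ArrayList<>();
        strip(coded, origin);
        System.out.println(toCodedOffset(origin, 8)); // prints 8 (before the marker)
        System.out.println(toCodedOffset(origin, 9)); // prints 10 (after the marker pair)
    }
}
```

The lookup table costs one integer per character, but it avoids cumulative arithmetic on the split positions.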

Do you think it would be possible to implement this, or a similar approach, in Okapi? We want to use this behavior, but it would be much easier if it were included directly in Okapi, rather than having to update our fork each time. We do not need it implemented in Longhorn as well; in fact, we are trying to use Okapi directly without it.

Regards!
