XLIFFFilter: `<ph></ph>` are coded as two TagType.PLACEHOLDER with TextFragment.MARKER_ISOLATED

Issue #966 resolved
hao created an issue

Hi Okapi,

In short:

<ph></ph> is coded as two TagType.PLACEHOLDER codes with TextFragment.MARKER_ISOLATED, and it’s not possible to recover the OPEN/CLOSE pairing between them when processing TextFragment.codes from Events.

Background:

We’re using Okapi to parse XLIFF, extract the text together with InlineElements (MARKER_OPENING/MARKER_CLOSING elements) and Markers (MARKER_ISOLATED), send them to our Machine-Translation Service, get the results back, and merge the translation with its InlineElements and Markers back into Okapi.

Recently we found that the XLIFF tag <ph></ph> is coded as two TagType.PLACEHOLDER codes rather than as an OPENING and a CLOSING. We understand that <ph> stands for PLACEHOLDER, but in this case (<ph></ph>), should it be an OPEN and CLOSE element?

Details:

Xliff:

<source>
    <g ctype="x-html-p" id="1" dgo:tag_name="p">
        <ph ctype="image" id="2" htm:src="B9BD5C75F6951B0.gif" htm:width="350" htm:height="350" htm:border="2px solid rgb(255, 0, 0)" htm:float="left" htm:margin="10px">
            <sub ctype="x-html-img-alt">display</sub>
        </ph>
    </g>
</source>
<target xml:lang="de-DE">
    <ph ctype="image" id="2" htm:src="B9BD5C75F6951B0.gif" htm:width="350" htm:height="350" htm:border="2px solid rgb(255, 0, 0)" htm:float="left" htm:margin="10px">
        <sub ctype="x-html-img-alt">
            <g ctype="x-html-p" id="1" dgo:tag_name="p">Anzeige</sub>
        </ph>
    </g>
</target>

TextFragment.codes of the source section:

0 = {Code@21675} ""
 tagType = {TextFragment$TagType@21682} "OPENING"
 outerData = {StringBuilder@21685} "<g ctype="x-html-p" id="1" dgo:tag_name="p">"
1 = {Code@21656} ""
 tagType = {TextFragment$TagType@21244} "PLACEHOLDER"
 outerData = {StringBuilder@21693} "<ph ctype="image" id="2" htm:src="B9BD5C75F6951B0.gif" htm:width="350" htm:height="350" htm:border="2px solid rgb(255, 0, 0)" htm:float="left" htm:margin="10px"><sub ctype="x-html-img-alt">"
2 = {Code@21676} ""
 tagType = {TextFragment$TagType@21244} "PLACEHOLDER"
 outerData = {StringBuilder@21700} "</sub></ph>"
3 = {Code@21677} ""
 tagType = {TextFragment$TagType@21721} "CLOSING"
 outerData = {StringBuilder@21724} "</g>"

Question:

  1. Should <ph></ph> be coded as "PLACEHOLDER" or as "OPENING" and "CLOSING" tags?
  2. If it’s right to make <ph></ph> PLACEHOLDERs, is it possible to retain the OPENING and CLOSING information from TextFragment?

Thank you very much!

Hao

Comments (11)

  1. YvesS

    <ph></ph> should definitely be seen as a PLACEHOLDER code normally. I think in this case the sub element causes the split into 2.

    But it should not be an OPENING and a CLOSING.

  2. Patrick Huy

    Thank you for your insights, @YvesS. Can you share some ideas on how to handle the situation? If <ph></ph> mapped to a single PLACEHOLDER, everything would be fine. With <ph><sub> and </sub></ph> each parsed as a PLACEHOLDER, our application cannot know that they must be treated as a pair so that Okapi does not produce broken XML.

    It seems that for us it would be ideal if

    Begin <ph><sub>subflow</sub></ph> End

    Could be parsed as 2 segments (or similar structure?)

    Begin <PLACEHOLDER/> End

    and subflow

    With the current approach we get

    Begin <PLACEHOLDER outerData="<ph><sub>"/>subflow<PLACEHOLDER outerData="</sub></ph>"/> End

    And we can’t tell that swapping the two PLACEHOLDERS is an illegal operation.

    An alternative could also be an indicator inside the TextFragment that a subflow begins and ends.

    When the inner <sub> is omitted everything is better and <ph></ph> is a single PLACEHOLDER which works great for us.

  3. YvesS

    Yes, full support for sub as a separate segment would be ideal. It’s just not implemented in the current version of the XLIFF reader/filter. One of the reasons is that it’s a recursive feature.

    A half-way solution would be to support plain-text sub content. Not ideal, but better than nothing.

  4. Patrick Huy

    I also noticed that the Wiki here https://okapiframework.org/wiki/index.php/XLIFF_Filter states that

    The content of the <sub> element is currently not supported as text. Any element found inside a <bpt>, <ept>, <ph>, and <it> (including <sub>) is included in the code of the parent inline element. A warning is generated when a <sub> element is detected.

    This seems to be different from what we are seeing (the text is included “as text” but not marked as being in a subflow in the Okapi object model.)

  5. YvesS

    Looks like that may work. But shouldn’t we have append(tagType… with tagType set to the proper marker by the calling statement, instead of overwriting it inside the function?
    I have to say that I have not looked at this code in a long time and cannot remember all the intricacies of that part. But you should feel free to submit a PR. You have tests, which helps build some trust in the change.

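On the pairing problem raised in comment 2 (two PLACEHOLDER codes that must not be swapped), one possible consumer-side workaround is to inspect each code's outerData for a trailing <sub> opener and a leading </sub>. The sketch below is only a heuristic under stated assumptions: InlineCode and SubflowPairing are hypothetical names, not Okapi classes, and they model just the tagType and outerData values visible in the debugger dump above. It does not change how the XLIFF filter parses <ph>.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the two fields shown in the debugger dump.
// This is NOT Okapi's Code class; copy the values over from your own codes.
final class InlineCode {
    final String tagType;
    final String outerData;

    InlineCode(String tagType, String outerData) {
        this.tagType = tagType;
        this.outerData = outerData;
    }
}

final class SubflowPairing {
    // Matches outerData that ends with an opening <sub> or <sub attr="...">.
    private static final Pattern ENDS_WITH_SUB_OPEN =
            Pattern.compile("<sub(\\s[^>]*)?>\\s*$");

    // Returns {openIndex, closeIndex} for every PLACEHOLDER that ends by
    // opening a <sub> and is later followed by a PLACEHOLDER that starts
    // with </sub>. A closer that appears before its opener is not paired.
    static List<int[]> findSubflowPairs(List<InlineCode> codes) {
        List<int[]> pairs = new ArrayList<>();
        Deque<Integer> open = new ArrayDeque<>();
        for (int i = 0; i < codes.size(); i++) {
            InlineCode code = codes.get(i);
            if (!"PLACEHOLDER".equals(code.tagType)) {
                continue;
            }
            String data = code.outerData.trim();
            // Check the closer first: one code may close a sub and open the next.
            if (data.startsWith("</sub>") && !open.isEmpty()) {
                pairs.add(new int[] {open.pop(), i});
            }
            if (ENDS_WITH_SUB_OPEN.matcher(data).find()) {
                open.push(i);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Shortened versions of the four source codes from the issue.
        List<InlineCode> codes = Arrays.asList(
                new InlineCode("OPENING", "<g ctype=\"x-html-p\" id=\"1\">"),
                new InlineCode("PLACEHOLDER",
                        "<ph ctype=\"image\" id=\"2\"><sub ctype=\"x-html-img-alt\">"),
                new InlineCode("PLACEHOLDER", "</sub></ph>"),
                new InlineCode("CLOSING", "</g>"));
        for (int[] pair : findSubflowPairs(codes)) {
            System.out.println(pair[0] + " -> " + pair[1]); // prints "1 -> 2"
        }
    }
}
```

One way to use this is to compare the pairs found in the source codes with the pairs found after the translation is merged back: if a pair disappears, the two placeholders were reordered and the result would not be well-formed XML. Because it relies on string matching over outerData, this approach would need adjusting for nested <sub> elements or unusual tag spellings.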