WARNING when dataRefEnd is used with subFlowsStart

Issue #12 wontfix
Patrice Ferrot
created an issue

When using the XLIFFReader to read the following document:

<?xml version="1.0"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-US" trgLang="fr-FR">
<file id="1" canResegment="no">
<unit id="2">
<ignorable id="5">
<source>&lt;a href="https://browser-update.org/update.html" target="_blank"></source>
</ignorable>
<segment canResegment="yes">
<source></source>
</segment>
</unit>
<unit id="1">
<originalData>
<data id="d1">&lt;/a></data>
</originalData>
<segment id="6">
<source><mrk id="6_sid" translate="no" value="GET_SUPPORTED_BROWSER_BODY_2"></mrk>To have the most optimal experience on A360’s cloud-based application, we recommend to <pc id="1__6_ph" subFlowsStart="2" dataRefEnd="d1">download an advanced browser</pc></source>
</segment>
</unit>
</file>
</xliff>

I get the following warning:

WARNING: Error in <file> id='1', <unit> id='1'
Last element read: '{urn:oasis:names:tc:xliff:document:2.0}pc':
Both 'dataRefStart' and 'dataRefEnd' should be present or absent.

I could not find in the XLIFF 2.0 spec a word about not being allowed to use dataRefEnd with subFlowsStart rather than dataRefStart. Did I miss something?

Also, the above XLIFF 2.0 snippet was generated using the Okapi XLIFF toolkit itself...

Thanks, Patrice

Comments (15)

  1. Patrice Ferrot reporter

    Note that I could work around this by adding a fake whitespace dataRefStart (see the <data id="d1"> </data> in the snippet below).

    But this is obviously not ideal as I added a whitespace that does not exist in the source file...

    Thank you for providing Okapi!

    <?xml version="1.0"?>
    <xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-US" trgLang="fr-FR">
    <file id="1" canResegment="no">
    <unit id="2">
    <ignorable id="5">
    <source>&lt;a href="https://browser-update.org/update.html" target="_blank"></source>
    </ignorable>
    <segment canResegment="yes">
    <source></source>
    </segment>
    </unit>
    <unit id="1">
    <originalData>
    <data id="d1"> </data>
    <data id="d2">&lt;/a></data>
    </originalData>
    <segment id="6">
    <source><mrk id="6_sid" translate="no" value="GET_SUPPORTED_BROWSER_BODY_2"></mrk>To have the most optimal experience on A360’s cloud-based application, we recommend to <pc id="1__6_ph" subFlowsStart="2" dataRefEnd="d2" dataRefStart="d1">download an advanced browser</pc></source>
    </segment>
    </unit>
    </file>
    </xliff>
    
  2. Yves Savourel

    It seems the issue is the use of <pc> with only dataRefEnd: The <pc> element is used for spanning codes, so if it has a closing code pointed to by dataRefEnd, it should probably have a corresponding starting code too. Otherwise it would be a standalone code, and <ph> would be the element to use instead of <pc>. (There is a processing requirement that says "Extractors MUST NOT use the <pc> element to represent standalone codes.")

    This is not an error because both dataRefStart and dataRefEnd are optional, but it is a very unusual case to use only one of them, and it probably indicates that something is not "normal".
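
    The condition behind this warning can be sketched in a few lines. This is only an illustration in Python using the standard library, not the Okapi implementation; the function name find_unbalanced_pc and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:2.0"

def find_unbalanced_pc(xliff_text):
    """Return the ids of <pc> elements that carry only one of dataRefStart/dataRefEnd."""
    root = ET.fromstring(xliff_text)
    flagged = []
    for pc in root.iter(f"{{{XLIFF_NS}}}pc"):
        has_start = "dataRefStart" in pc.attrib
        has_end = "dataRefEnd" in pc.attrib
        if has_start != has_end:  # exactly one of the two is present
            flagged.append(pc.get("id"))
    return flagged

# Hypothetical, simplified document: p1 has only dataRefEnd, p2 has neither attribute.
SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-US">
 <file id="1">
  <unit id="1">
   <originalData><data id="d1">&lt;/a&gt;</data></originalData>
   <segment>
    <source>See <pc id="p1" dataRefEnd="d1">this</pc> and <pc id="p2">that</pc>.</source>
   </segment>
  </unit>
 </file>
</xliff>"""

print(find_unbalanced_pc(SAMPLE))
```

    A <pc> with neither attribute is not flagged, since both attributes are optional; only the lopsided case is reported.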

    And looking at the example, IMHO, it does reflect something not quite right with the extraction. The sub-flow is used to store the start code. Normally "...a sub-flow is a section of text embedded inside an inline code, or inside another section of text".

    A better way to store the same data would be:

    <xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-US" trgLang="fr-FR">
     <file id="1" canResegment="no">
      <unit id="1">
       <originalData>
        <data id="d1">&lt;a href="https://browser-update.org/update.html" target="_blank"></data>
        <data id="d2">&lt;/a></data>
       </originalData>
       <segment id="6">
        <source><mrk id="6_sid" translate="no" value="GET_SUPPORTED_BROWSER_BODY_2"></mrk>To have the most optimal experience on A360’s cloud-based application, we recommend to <pc id="1__6_ph" dataRefStart="d1" dataRefEnd="d2">download an advanced browser</pc></source>
       </segment>
      </unit>
     </file>
    </xliff>
    

    There is also an unusual use of <mrk> in the example:

    <mrk id="6_sid" translate="no" value="GET_SUPPORTED_BROWSER_BODY_2"></mrk>
    

    It seems this should either be a <ph> rather than a <mrk>, or be marked up differently:

    <mrk id="6_sid" translate="no">GET_SUPPORTED_BROWSER_BODY_2</mrk>
    

    The purpose of the Translate Annotation is to delimit a span of content and say whether it's translatable or not. Technically, value is not used in such an annotation. A tool may find the annotation with empty content, assume it is not used (since it delimits nothing), and remove it.
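
    To see which markers are exposed to that risk, one could list the <mrk> elements that enclose nothing. The following is a rough Python sketch using only the standard library; find_empty_markers and the sample content are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:2.0"

def find_empty_markers(xliff_text):
    """Return the ids of <mrk> elements that have no text and no child elements."""
    root = ET.fromstring(xliff_text)
    empty = []
    for mrk in root.iter(f"{{{XLIFF_NS}}}mrk"):
        has_text = bool(mrk.text)       # None or "" means nothing is enclosed
        has_children = len(mrk) > 0     # nested inline elements count as content
        if not has_text and not has_children:
            empty.append(mrk.get("id"))
    return empty

# Hypothetical sample: m1 keeps its information in an attribute, m2 encloses content.
SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-US">
 <file id="1">
  <unit id="1">
   <segment>
    <source><mrk id="m1" translate="no" value="KEY_1"></mrk>Hello <mrk id="m2" translate="no">KEY_2</mrk></source>
   </segment>
  </unit>
 </file>
</xliff>"""

print(find_empty_markers(SAMPLE))
```

    A marker that encloses only whitespace is treated as non-empty here, because it still delimits a span.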

    A possibly safer notation could be:

    <xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-US" trgLang="fr-FR">
     <file id="1" canResegment="no">
      <unit id="1">
       <originalData>
        <data id="d0">GET_SUPPORTED_BROWSER_BODY_2</data>
        <data id="d1">&lt;a href="https://browser-update.org/update.html" target="_blank"></data>
        <data id="d2">&lt;/a></data>
       </originalData>
       <segment id="6">
        <source><ph id="6_sid" dataRef="d0"/>To have the most optimal experience on A360’s cloud-based application, we recommend to <pc id="1__6_ph" dataRefStart="d1" dataRefEnd="d2">download an advanced browser</pc></source>
       </segment>
      </unit>
     </file>
    </xliff>
    

    To go back to the warning, I suppose we could have some kind of option to not generate it, but I'd like to get the feedback from other XLIFF2 users like @Martin Wunderlich, @Chase Tingley, @Jim Hargrave, or @David Filip on this.

  3. Patrice Ferrot reporter

    Thanks a lot for the great feedback!

    Let me try to give some context to this whole story: I am trying to build a library that can basically segment files thanks to Okapi filters and generate the resulting XLIFF 2.0 file. Then that same library could be reused to generate the target file given the same source file and the fully translated XLIFF 2.0 file.

    Hopefully I am not reinventing the wheel. This seems like a fairly standard use case, but I could not find anything doing this out of the box.

    Back to our story:

    I agree with you regarding the not "normal" use of a subflow, but the way I automated the whole thing, my code basically detects if a Code contains referred content, and will generate a subflow in that case. I could enhance the logic to do what you suggest for the example above, but what about cases where subflows are still required, e.g. for the title attribute of the <a> tag in HTML? I am not sure I can avoid using subflows in that case. By the way, in the example above, my code generated a subflow because the URL (https://browser-update.org/update.html) is a reference to some property in the Code that I need to "manually" replace (I hope that you understand what I mean...).

    Thanks for your suggestions regarding the <mrk>. I can definitely have the value within as content instead of an attribute value. Not sure about using <ph> though: I understood that <mrk> were precisely there to give extra information to e.g. the translator. I would not want that piece of info to be e.g. stored in some TM or anything like that.

    Thanks again!

  4. Chase Tingley

    Hi Patrice!

    "Hopefully I am not reinventing the wheel. This seems like a fairly standard use case, but I could not find anything doing this out of the box."

    If you haven't seen it already, maybe take a look at the Matecat-Filters project, which does something similar but uses XLIFF 1.2. It might be possible to adapt.

  5. Patrice Ferrot reporter

    Hi Chase!

    Thanks for the link, I did not know that project!

    Regarding the XLIFF 2.0 discussion, I wanted to illustrate what I meant in my previous comment with the HTML title attribute requiring a subflow.

    With the following basic HTML document:

    <html>
    <head>
    <title>This is the title</title>
    </head>
    <body>
    A basic HTML document with a link to <a href="http://www.google.com" title="Link to Google">google.com</a>.
    </body>
    </html>
    

    My code generates the following XLIFF 2.0 (note that I have implemented the change that Yves suggested for the <mrk> tag and that I disabled the whitespace workaround that I described above):

    <?xml version="1.0"?>
    <xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-US" trgLang="fr-FR">
        <file id="1" canResegment="no">
            <unit id="2">
                <segment id="7">
                    <source>
                        <mrk id="7_type" translate="no">title</mrk>Link to Google</source>
                </segment>
            </unit>
            <unit id="3">
                <ignorable id="8">
                    <source>&lt;a href="http://www.google.com <ph id="1__8_ph" subFlows="2"/>></source>
                </ignorable>
                <segment canResegment="yes">
                    <source/>
                </segment>
            </unit>
            <unit id="1">
                <originalData>
                    <data id="d1">&lt;/a></data>
                </originalData>
                <segment id="4">
                    <source>
                        <mrk id="4_type" translate="no">title</mrk>This is the title</source>
                </segment>
                <segment id="9">
                    <source>A basic HTML document with a link to <pc id="1__9_ph" subFlowsStart="3" dataRefEnd="d1">google.com</pc>.</source>
                </segment>
            </unit>
        </file>
    </xliff>
    

    Note the "cascade" of subflows (unit 1 --> unit 3 --> unit 2). Does that look like a valid use of subflows?

    Thanks, Patrice

  6. Yves Savourel

    This is how Rainbow would represent your example:

    <xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-us" trgLang="fr-fr">
     <file id="f1" original="example.html">
      <unit id="tu1">
       <segment>
        <source>This is the title</source>
       </segment>
      </unit>
      <unit id="tu3">
       <segment>
        <source>Link to Google</source>
       </segment>
      </unit>
      <unit id="tu2">
       <originalData>
        <data id="d1">[#$dp6]</data>
        <data id="d2">&lt;/a></data>
       </originalData>
       <segment>
        <source>A basic HTML document with a link to <pc id="1" canCopy="no" canDelete="no" subFlowsStart="tu3" dataRefEnd="d2" dataRefStart="d1">google.com</pc>.</source>
       </segment>
      </unit>
     </file>
    </xliff>
    

    In our case all the non-inline codes (like <title>) are left in the "skeleton" file. The start tag for the anchor is also stored there and we just place a reference to it in the XLIFF document ([#$dp6]). I'm not saying it is the best way to do this, but the idea is to generate an XLIFF with as few codes as possible.

    A better output, IMO, would be something like this:

    <xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-us" trgLang="fr-fr">
     <file id="f1" original="example.html">
      <unit id="1">
       <segment>
        <source>This is the title</source>
       </segment>
      </unit>
      <unit id="3">
       <segment>
        <source>Link to Google</source>
       </segment>
      </unit>
      <unit id="2">
       <originalData>
        <data id="d1">&lt;a href="http://www.google.com" title="[#REF:unit-3]"></data>
        <data id="d2">&lt;/a></data>
       </originalData>
       <segment>
        <source>A basic HTML document with a link to <pc id="1" canCopy="no" canDelete="no" subFlowsStart="3" dataRefEnd="d2" dataRefStart="d1">google.com</pc>.</source>
       </segment>
      </unit>
     </file>
    </xliff>
    

    Where [#REF:unit-3] would be your own reference (with whatever syntax you need) to place back the sub-flow text on merge.
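
    The merge step for such a private reference could look roughly like the Python sketch below (standard library only). The [#REF:unit-N] syntax is the ad hoc convention from the example above, not part of XLIFF, and resolve_data, unit_text and the sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:2.0"
REF_PATTERN = re.compile(r"\[#REF:unit-([^\]]+)\]")  # private syntax, assumed for this sketch

def unit_text(unit):
    """Plain text of the unit's targets, falling back to its sources when untranslated."""
    for tag in ("target", "source"):
        texts = ["".join(e.itertext()) for e in unit.iter(f"{{{XLIFF_NS}}}{tag}")]
        if texts:
            return "".join(texts)
    return ""

def resolve_data(xliff_text, unit_id, data_id):
    """Return the original data with each private reference replaced by the unit's text."""
    root = ET.fromstring(xliff_text)
    units = {u.get("id"): u for u in root.iter(f"{{{XLIFF_NS}}}unit")}
    for data in units[unit_id].iter(f"{{{XLIFF_NS}}}data"):
        if data.get("id") == data_id:
            # A real merger would also escape the text for the HTML attribute context.
            return REF_PATTERN.sub(lambda m: unit_text(units[m.group(1)]), data.text or "")
    raise KeyError(data_id)

SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-US" trgLang="fr-FR">
 <file id="f1">
  <unit id="3">
   <segment><source>Link to Google</source><target>Lien vers Google</target></segment>
  </unit>
  <unit id="2">
   <originalData>
    <data id="d1">&lt;a href="http://www.google.com" title="[#REF:unit-3]"&gt;</data>
   </originalData>
   <segment><source>A link to <pc id="1" dataRefStart="d1">google.com</pc>.</source></segment>
  </unit>
 </file>
</xliff>"""

print(resolve_data(SAMPLE, "2", "d1"))
```

    Because the reference syntax is private, other tools will treat the placeholder as opaque text inside the original data.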

    It seems sometimes you keep the HTML structural element (like <title>) along with the extracted content, sometimes you don't (like <body>, etc.), so you have some ways to re-construct the original without storing it in the document.

    In any case, using several cascading sub-flows, IMO, is not a good idea. I would try to always keep the following guidelines:

    • Avoid or limit the sub-flows
    • Avoid ignorable if possible
    • Avoid using <mrk> to store codes
    • Keep the inline codes at a minimum
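
    To check how deep sub-flows cascade in a generated file, one can follow the subFlows, subFlowsStart and subFlowsEnd references from unit to unit. The following is a rough Python sketch using only the standard library; subflow_graph, subflow_depth and the simplified sample (modelled on the unit 1 --> unit 3 --> unit 2 cascade discussed above) are hypothetical.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:2.0"
SUBFLOW_ATTRS = ("subFlows", "subFlowsStart", "subFlowsEnd")

def subflow_graph(xliff_text):
    """Map each unit id to the list of unit ids it references as sub-flows."""
    root = ET.fromstring(xliff_text)
    graph = {}
    for unit in root.iter(f"{{{XLIFF_NS}}}unit"):
        refs = []
        for elem in unit.iter():
            for attr in SUBFLOW_ATTRS:
                refs.extend(elem.get(attr, "").split())  # values are space-separated ids
        graph[unit.get("id")] = refs
    return graph

def subflow_depth(graph, unit_id, seen=()):
    """Length of the longest sub-flow chain starting at unit_id (0 means no sub-flows)."""
    if unit_id in seen:  # guard against circular references
        return 0
    children = graph.get(unit_id, [])
    return max((1 + subflow_depth(graph, c, seen + (unit_id,)) for c in children), default=0)

SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-US">
 <file id="1">
  <unit id="2"><segment><source>Link to Google</source></segment></unit>
  <unit id="3">
   <ignorable><source>&lt;a title="<ph id="p1" subFlows="2"/>"&gt;</source></ignorable>
   <segment><source/></segment>
  </unit>
  <unit id="1">
   <originalData><data id="d1">&lt;/a&gt;</data></originalData>
   <segment><source>A link to <pc id="p2" subFlowsStart="3" dataRefEnd="d1">google.com</pc>.</source></segment>
  </unit>
 </file>
</xliff>"""

graph = subflow_graph(SAMPLE)
print({unit_id: subflow_depth(graph, unit_id) for unit_id in graph})
```

    A depth greater than 1 indicates a cascade, which a generator could log or use as a cue to move the code into originalData instead.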

    I hope this helps.

  7. Patrice Ferrot reporter

    I am precisely trying to avoid having something like [#$dp6] in the XLIFF 2.0, as this is of no use to the translator, for example, whereas the actual placeholder value might be of interest in some situations.

    Your suggestion with [#REF:unit-3] makes sense, but I was hoping to come up with something standard, without having to use custom references (so that it could be used by other tools).

    Anyway, thanks again for your input, I will see what I can do!

  8. Chase Tingley

    In answer to the original question, I think in this case the warning helped identify a situation where the XLIFF representation could have been better, so in that sense it was useful. I'd be fine with removing it if we found a more typical use case that hit the same problem, but I don't think this was it.

  9. David Filip

    I second @Chase Tingley and @Yves Savourel that the warning should not be removed, although @Patrice Ferrot is right that the co-occurrence of dataRefStart and dataRefEnd is not explicitly enforced in the spec. It is kind of implied by the <pc> processing requirement (PR):

    Extractors MUST NOT use the <pc> element to represent standalone codes.

    Since <pc> always represents a paired code, there should always be data referenceable from both dataRefEnd and dataRefStart. We shouldn't say that a <pc> that has dataRefStart but no dataRefEnd (or the reverse) is absolutely invalid, but it's always worth a warning.

    I wanted to comment on the other aspects that developed in the discussion. I support the intentions @Patrice Ferrot is trying to achieve in his representation, albeit not the currently proposed representations ;-). I think cascaded subflows are fine if they serve a need, and that there is value in not hiding the subflow relationship in a private referencing mechanism within the original data. I also fully support not hiding potentially important context from translators in the skeleton.

    On the other hand, I agree with the other principles Yves listed. Avoiding ignorables is a good principle. Ignorables are good for extra whitespace, representations of XML-illegal characters, and content that was in the extracted segment before resegmentation and that you need to handle somehow after applying your segmentation. Most of the time it doesn't make sense to use them if you control the Extraction. Neither segments nor ignorables should be used for storing codes, and neither should <mrk> or <sm/>/<em/> spans.

    This

    &lt;a href="http://www.google.com
    

    should be stored as originalData rather than within an ignorable.

    This

    <mrk id="7_type" translate="no">title</mrk>
    

    is simply wrong. You could use the fs module http://docs.oasis-open.org/xliff/xliff-core/v2.0/os/xliff-core-v2.0-os.html#fs to convey this information, like this:

    <unit id="2" fs:fs="title">
        <segment id="7">
            <source>
                Link to Google
            </source>
        </segment>
    </unit>
    

    If you want to stay in core, the title tag data should be referenced as original data from a <pc>. You can also use the equivStart/equivEnd or dispStart/dispEnd attributes, which can to some extent replace the stored original data. They're core and you can use them to facilitate roundtrip. fs is intended just for making previews, not to support roundtrip.

    I hope this helps

  10. Patrice Ferrot reporter

    Hi David,

    Sorry for the late reply, I got distracted by other projects.

    Thanks a lot for your message and thanks again to Chase and Yves! I actually agree with everything you said in this thread. XLIFF 2.0 is still very new to me and everything here was very enlightening.

    My generated XLIFF 2.0 looks like this now:

    <?xml version="1.0"?>
    <xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-US" trgLang="fr-FR">
        <file id="1" canResegment="no">
            <unit id="2">
                <segment id="7">
                    <source>Link to Google</source>
                </segment>
            </unit>
            <unit id="1">
                <originalData>
                    <data id="d1">&lt;a href="http://www.google.com" title="[#f=1/u=2]"></data>
                    <data id="d2">&lt;/a></data>
                </originalData>
                <segment id="4">
                    <source>This is the title</source>
                </segment>
                <segment id="9">
                    <source>A basic HTML document with a link to <pc id="1__9_ph" dataRefEnd="d2" dataRefStart="d1">google.com</pc>.</source>
                </segment>
            </unit>
        </file>
    </xliff>
    

    According to your comments, it seems more in line with the XLIFF 2.0 "philosophy". My original XLIFF 2.0 was valid, but I understand and agree that what's valid is not always good!

    Regarding my use of <mrk>, I actually wanted to use it as kind of a metadata holder. I will use the metadata module instead, that will be much cleaner (and simply correct).

    Thanks again, Patrice

  11. David Filip

    Looks good to me too, just two more philosophical remarks.

    1) <mrk> tags (equivalent to <sm/>/<em/> pairs) were designed as an extensible annotation mechanism. They are used by core annotations, as well as module and extension based annotations. So they're sort of designed to express metadata. BUT there is a general principle http://docs.oasis-open.org/xliff/xliff-core/v2.0/os/xliff-core-v2.0-os.html#d0e10828 that doesn't let you express features already expressed by core or modules with custom mechanisms. This is true for both extended <mrk> and <mda:metadata>. Also, <mrk> is meant to hold metadata as attribute values (or point to module or extension based metadata outside the inline content) rather than enclose the metadata mixed with the payload content. That would be analogous to the bad <bpt>/<ept> design used in TMX and XLIFF 1.2, which XLIFF 2 abandoned. You are supposed to use markers to enclose payload that you want to enrich with metadata, rather than to hide metadata disguised as non-translatable content. That was the chief offense of your previous example ;-) I am just calling out the principle to warn you not to use <mda:metadata> for something that can be done with core or modules. If you use <mda:metadata> to roundtrip formatting information, you would certainly infringe on several core and module features ;-) which would make it semantically invalid, even though it would syntactically validate.

    2) Obviously the pointing mechanism you're using from the data is private (because the data specification doesn't specify a pointing mechanism), but I like that you're using the XLIFF fragid syntax, which makes it transparent.
