XLIFF filter shows pre-translated message in a different language when the target language is overridden

Issue #936 new
Former user created an issue

We can see this bug from testOutputOverrideTargetlanguage test in XLIFFFilterTest.java

Reproduce the issue:

step1: Use the content:

<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<xliff version=\"1.2\">\r
<file source-language=\"en\" target-language=\"fr\" datatype=\"x-test\" original=\"file.ext\">
\r<body>
<trans-unit id=\"1\">
<source xml:lang=\"en\">en message</source>
<target xml:lang=\"fr\">fr message</target>
</trans-unit>
<trans-unit id=\"2\">
<source xml:lang=\"en\">en message2</source>
<target>fr message2</target>
</trans-unit></body></file></xliff>

step2: Create an XLIFF filter and set filter.getParameters().setOverrideTargetLanguage(true).

step3: Use the filter to generate output for another language, such as "de".

Observe what we get:

<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<xliff version=\"1.2\">\r
<file source-language=\"en\" target-language=\"de\" datatype=\"x-test\" original=\"file.ext\">
\r<body>
<trans-unit id=\"1\">
<source xml:lang=\"en\">en message</source>
<target xml:lang=\"de\">fr message</target>
</trans-unit>
<trans-unit id=\"2\">
<source xml:lang=\"en\">en message2</source>
<target>fr message2</target>
</trans-unit></body></file></xliff>

You can see <target xml:lang=\"de\">fr message</target>.
The "de" target should not contain the "fr" message.
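
One way to spot the affected units before extraction is to list every <target> whose xml:lang differs from the locale about to be requested. The following is a minimal sketch using only the JDK's DOM parser, not Okapi; the class name TargetLangPrecheck and the method mismatches are made up for illustration, and it assumes the XLIFF elements are unprefixed, as in the sample above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Okapi.
final class TargetLangPrecheck {

    /**
     * Returns "unitId:declaredLang" for every <target> whose xml:lang is present
     * and differs from requestedLocale. Targets without xml:lang are skipped.
     */
    static List<String> mismatches(String xliff, String requestedLocale) throws Exception {
        // Default factory is not namespace-aware, so elements and attributes
        // are matched by their qualified names ("trans-unit", "xml:lang").
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xliff)));

        List<String> found = new ArrayList<>();
        NodeList units = doc.getElementsByTagName("trans-unit");
        for (int i = 0; i < units.getLength(); i++) {
            Element unit = (Element) units.item(i);
            // Note: this also picks up <target> elements nested in <alt-trans>.
            NodeList targets = unit.getElementsByTagName("target");
            for (int j = 0; j < targets.getLength(); j++) {
                Element target = (Element) targets.item(j);
                String lang = target.getAttribute("xml:lang"); // "" when absent
                if (!lang.isEmpty() && !lang.equalsIgnoreCase(requestedLocale)) {
                    found.add(unit.getAttribute("id") + ":" + lang);
                }
            }
        }
        return found;
    }
}
```

For the sample content and a requested locale of "de", this reports trans-unit 1 (declared "fr"). Trans-unit 2 is skipped because its <target> carries no xml:lang.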

Comments (9)

  1. Mihai Nita

    Removed \ in front of " and some \r

    <?xml version="1.0" encoding="UTF-8"?>
    <xliff version="1.2">
    <file source-language="en" target-language="fr" datatype="x-test" original="file.ext">
    <body>
    
    <trans-unit id="1">
      <source xml:lang="en">en message</source>
      <target xml:lang="fr">fr message</target>
    </trans-unit>
    
    <trans-unit id="2">
      <source xml:lang="en">en message2</source>
      <target>fr message2</target>
    </trans-unit>
    
    </body>
    </file>
    </xliff>
    

  2. Mihai Nita

    What is the operation used to generate that output?
    Extract? Merge? Do something in tikal?

    Just creating a filter with an input will not create an output.
    Can you provide some steps that we can follow to reproduce this?

    Thank you,
    Mihai

  3. Chenhui Zhou

    Thanks for cleaning up my poor input!

    Sorry for using “generate”; I just copied the term from the test. To be more precise, “Extract” reproduces the issue.

    I couldn’t find a way in tikal to override the XLIFF filter’s “overrideTargetLanguage” parameter. Its default value is false, so the filter always takes the target-language in <file> as the document target language. To avoid this, I just removed the target-language attribute from <file>.

    input.xlf:

    <?xml version="1.0" encoding="UTF-8"?>
    <xliff version="1.2">
    <file source-language="en" datatype="x-test" original="file.ext">
    <body>
    <trans-unit id="1">
    <source xml:lang="en">en message</source>
    <target xml:lang="fr">fr message</target>
    </trans-unit>
    <trans-unit id="2">
    <source xml:lang="en">en message2</source>
    <target>fr message2</target>
    </trans-unit>
    </body>
    </file>
    </xliff>
    

    Then use tikal to extract:

    $./tikal.sh -x -tl de -od . input.xlf
    

    And this is the output:

    <?xml version="1.0" encoding="UTF-8"?>
    <xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:okp="okapi-framework:xliff-extensions" xmlns:its="http://www.w3.org/2005/11/its" xmlns:itsxlf="http://www.w3.org/ns/its-xliff/" its:version="2.0">
    <file original="file.ext" source-language="en" target-language="de" datatype="x-test" okp:inputEncoding="UTF-8">
    <body>
    <trans-unit id="1">
    <source xml:lang="en">en message</source>
    <target xml:lang="de">fr message</target>
    </trans-unit>
    <trans-unit id="2">
    <source xml:lang="en">en message2</source>
    <target xml:lang="de">fr message2</target>
    </trans-unit>
    </body>
    </file>
    </xliff>
    

    So my concern is that “fr message” should not appear under a <target> whose language is “de”.

    Hope this can help describe the issue.

    Thanks!

  4. Mihai Nita

    It looks like the root cause is that the XLIFFFilter “does not understand” TextUnit(s) with multiple locales.

    It loads the text in the first <target> tag, ignores the xml:lang even if present, and declares the target locale to be the file level one.
    The file level target locale is either the one declared in <file> target-language attribute (without setOverrideTargetLanguage) or the one declared in the filter.

    See attached code that reproduces the problem.

    The skeleton is also messed up if we set setOverrideTargetLanguage(true). I don’t know what that would do to a merge operation:

    ===== *, setOverrideTargetLanguage(false) =====
    
        skeleton : 
            <trans-unit id="tu2" restype="x-paragraph"[#$$self$@%locQualityIssuesRef][#$$self$@%provenanceRecordsRef][#$$self$@%approved]>
              <source xml:lang="en"[#$$self$@%locQualityIssuesRef][#$$self$@%provenanceRecordsRef]>[#$$self$]</source>
              [@#$SEGSRC$#@]<target xml:lang="de"[#$$self$@%mtConfidence][#$$self$@%locQualityIssuesRef][#$$self$@%provenanceRecordsRef]>[#$$self$]</target>
              <target xml:lang="es"[#$$self$@%mtConfidence][#$$self$@%locQualityIssuesRef][#$$self$@%provenanceRecordsRef]>A second Spanish text (2).</target>
              <target xml:lang="fr"[#$$self$@%mtConfidence][#$$self$@%locQualityIssuesRef][#$$self$@%provenanceRecordsRef]>A second French text (3).</target>
              <target xml:lang="ja"[#$$self$@%mtConfidence][#$$self$@%locQualityIssuesRef][#$$self$@%provenanceRecordsRef]>A second Japanese text (4).</target>
              [@#$ALTTRANS$#@][@#$NOTE$#@]
            </trans-unit>
    
    ===== es, setOverrideTargetLanguage(true) =====
    
        skeleton : 
            <trans-unit id="tu2" restype="x-paragraph"[#$$self$@%locQualityIssuesRef][#$$self$@%provenanceRecordsRef][#$$self$@%approved]>
              <source xml:lang="en"[#$$self$@%locQualityIssuesRef][#$$self$@%provenanceRecordsRef]>[#$$self$]</source>
              [@#$SEGSRC$#@]<target xml:lang="es"[#$$self$@%mtConfidence][#$$self$@%locQualityIssuesRef][#$$self$@%provenanceRecordsRef]>[#$$self$]</target>
              <target xml:lang="es"[#$$self$@%mtConfidence][#$$self$@%locQualityIssuesRef][#$$self$@%provenanceRecordsRef]>A second Spanish text (2).</target>
              <target xml:lang="es"[#$$self$@%mtConfidence][#$$self$@%locQualityIssuesRef][#$$self$@%provenanceRecordsRef]>A second French text (3).</target>
              <target xml:lang="es"[#$$self$@%mtConfidence][#$$self$@%locQualityIssuesRef][#$$self$@%provenanceRecordsRef]>A second Japanese text (4).</target>
              [@#$ALTTRANS$#@][@#$NOTE$#@]
            </trans-unit>
    

    The XLIFFWriter is also unable to write multilingual TextUnit(s). Also see attached code.
    I did not check to see what XLIFFSkeletonWriter does.

  5. Mihai Nita

    It does not look like a quick fix (something that can be done a week or two before a release) :-)

    But we can try to define what the desired behavior would be.
    We have several “knobs”:

    • file level target locale (attribute target-language in <file>). Optional.
    • the RawDocument target locale (propagated to XLIFFFilter). Optional(?)
    • setOverrideTargetLanguage (if called / true / false).
    • the xml:lang attributes on <target> . Can also be missing, so: Optional.

    What do we expect to see in the TextUnit, and what do we expect to see in the skeleton?


    Step 2 would be to decide what the merge behavior would be.
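
    As a starting point for that discussion, the locale selection described earlier in this thread can be written down as a small function. This is only a sketch of the observed behavior, not Okapi code: TargetLocaleSketch and resolve are hypothetical names, and the xml:lang on individual <target> elements is deliberately left out because, as noted above, it is currently ignored.

    ```java
    // Sketch only: these names are hypothetical and not part of the Okapi API.
    final class TargetLocaleSketch {

        /**
         * @param fileTargetLanguage value of target-language on <file>, or null if absent
         * @param requestedLocale    target locale passed in with the document (e.g. "de")
         * @param override           value given to setOverrideTargetLanguage
         * @return the locale that ends up labelling every target in the output
         */
        static String resolve(String fileTargetLanguage, String requestedLocale, boolean override) {
            // With override on, or with no file-level value, the requested locale is used.
            if (override || fileTargetLanguage == null) {
                return requestedLocale;
            }
            // Otherwise the file-level target-language takes precedence.
            return fileTargetLanguage;
        }

        public static void main(String[] args) {
            System.out.println(resolve("fr", "de", false)); // file-level value is kept
            System.out.println(resolve("fr", "de", true));  // override forces the requested locale
            System.out.println(resolve(null, "de", false)); // no file-level value to fall back on
        }
    }
    ```

    Any desired behavior that takes the per-target xml:lang into account would need a fourth input here, which is exactly the part that still has to be defined.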

  6. Mihai Nita

    I agree that in general working with multilingual xliff files is messy (file management becomes a pain), and I am not aware of any company doing it.

    I've seen cases where a client sends a file (partially) translated into one language (let’s say French), and wants X more languages (in separate files) (let's say Spanish + German)
    A workaround for such cases would be to create separate projects: a French one with the original xliff, and a Spanish + German project with the original xliff and the target removed.

    So we can say: this is not supported, and leave it at that.

    Although it is a bit disappointing if we don’t properly support the standard, at least at read / write level (even if we don’t promise that all steps are multilingual-aware)

  7. Mihai Nita

    I wanted to fix this, wrote some unit tests, and wanted to make sure that what I do conforms to the spec.

    So:

    So my initial analysis of the problem was wrong. This is correct behavior:

    “XLIFFFilter “does not understand” TextUnit(s) with multiple locales”

    I’ll look again and see if there is a problem with the merge.

  8. Mihai Nita

    I couldn’t find the way in tikal to overwrite XLIFF filter’s parameter to set “overrideTargetLanguage”, the default value is false so filter will always take the target-language in <file> as document target language.

    To avoid this, I just remove the target-language in <file> .

    In light of my fresh reading of the spec, this sounds suspicious.

    By removing the target-language from <file>, the target language of the file becomes “Undefined”:
    A language code as described in the [RFC 4646], the successor to [RFC 3066]

    Default value: Undefined.
    http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#target-language

    And “Undefined” does not really mean “it can be anything”; in RFC 4646 that is a real locale, with the language code und.

    So the rule saying that the xml:lang in <target> must be the same as target-language means that xml:lang="fr" is invalid.
    The only valid value would be xml:lang="und".
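
    Under this reading of the spec, the check could be sketched as follows. The class and method names are hypothetical, the comparison is case-insensitive because language tags are, and treating a missing xml:lang as acceptable is an assumption based on the attribute being optional.

    ```java
    // Sketch of the rule as read in this comment; not Okapi code.
    final class TargetLangRuleSketch {

        // Language code that stands for "Undefined".
        static final String UNDEFINED = "und";

        /**
         * @param fileTargetLanguage target-language on <file>, or null if absent
         * @param targetXmlLang      xml:lang on <target>, or null if absent
         */
        static boolean isValid(String fileTargetLanguage, String targetXmlLang) {
            if (targetXmlLang == null) {
                return true; // assumption: nothing to contradict when xml:lang is omitted
            }
            // A missing target-language is treated as the default value, "und".
            String effective = (fileTargetLanguage == null) ? UNDEFINED : fileTargetLanguage;
            return effective.equalsIgnoreCase(targetXmlLang);
        }
    }
    ```

    With target-language removed from <file>, isValid(null, "fr") is false and isValid(null, "und") is true, which matches the conclusion above.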

    And I think that by specifying setOverrideTargetLanguage(true) we are basically saying “ignore all the target locales specified in the file and override them with what I’m telling you”

    So “junk” (French text in a German target) is not that surprising.

    I still have to think what a decent “error recovery” behavior should be.
