Rainbow Kit Creation Step strips source segmentation from XLIFF

Issue #760 resolved
Chase Tingley created an issue

To reproduce:

  • Open Rainbow, add the attached segsource.xlf as an input document
  • Create a pipeline: Raw Documents to Filter Events, Rainbow Kit Creation Step
  • Execute the pipeline

Compare the source XLIFF to the output XLIFF produced in the work directory. The source contains <seg-source> data, but this has been stripped in the working XLIFF.

My expectation would be that if there is source segmentation, we would preserve it, unless it was overridden by an explicit segmentation step in the pipeline.

Comments (9)

  1. ysavourel

    No, I don't think it's intentional. This was simply coded long ago, when 1) input documents were rarely XLIFF and 2) most XLIFF files were not segmented.

  2. Sebastian Ebert

    I can confirm this behavior. First, the <seg-source> is missing from the final XLIFF files (after they have gone through the post-processing step). Second, the additional content that the source file carried on the <target> element is also missing afterwards.

    I found this out when I wanted to translate MadCap Flare files.

    This was the original source file:

    <trans-unit id="1" restype="x-xml-h1" phase-name="pretrans">
        <source>Datensicherung</source>
            <seg-source><mrk mtype="seg" mid="1">Datensicherung</mrk></seg-source>
        <target state="translated">
                <mrk
                    mtype="seg"
                    mid="1"
                    MadCap:segmentStatus="Accepted"
                    MadCap:matchPercent="101">Backup dei dati</mrk></target></trans-unit>
    

    This is the result after the post processing:

    <trans-unit id="1" restype="x-xml-h1" phase-name="pretrans">
        <source>Datensicherung</source>
        <target state="translated">Backup dei dati</target>
    

    No matter what filter settings I use in Rainbow or what segmentation settings I try, I am not able to produce a "done" file that still includes the segmented source and the annotations/elements that were placed on the target element in the source file.

    As a result, MadCap Flare refuses to import the files.

  3. Denis Konovalyenko

    @tingley , @ysavourel , it seems that the behaviour is intentional: since the introduction of the net.sf.okapi.filters.xliff.Parameters#ALWAYSUSESEGSOURCE parameter, the XLIFFFilter does not process seg-sources by default (please refer to the related commit for more information).

  4. Chase Tingley reporter

    Oh interesting. @DenisKonovalyenko is correct for this example. The problem is that my segsource.xlf example has a mismatch between the contents of <source> and <seg-source>. If the "Always use Segmented Source" option is not set (it is disabled by default), the XLIFF filter resolves this disagreement in favor of <source>. This can be seen in the warnings generated by tikal:

    $ tikal.sh -fc okf_xliff segsource.xlf -x
    -------------------------------------------------------------------------------
    Okapi Tikal - Localization Toolset
    Version: 2.0.37-SNAPSHOT
    -------------------------------------------------------------------------------
    Error: Cannot find filter configuration 'test1'
    Error: Cannot find filter with ID: test1. Cannot add configuration
    Extraction
    Source language: en-US
    Target language: es-ES
    Default input encoding: UTF-8
    Filter configuration: okf_xliff
    Output: /home/tingley/Downloads/segsource.xlf.xlf
    Input: /home/tingley/Downloads/segsource.xlf
    Error: The <seg-source> content for the entry id='NFDBB2FA9-tu1' is different from its <source>. The un-segmented content of <source> will be used.
    

    If I enable the option, that problem goes away, and the <seg-source> content appears in the extracted XLIFF.

    However, in the example I attached to this bug, this is only an issue because I constructed the file sloppily: there is a capitalization difference between the <source> and <seg-source> content! If you correct this error, as in the attached segsource-corrected.xlf file, then the <seg-source> content extracts correctly even with the option disabled.
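
    To check a file for this kind of disagreement before running it through the pipeline, something like the following Python sketch may help. It is not how the Okapi XLIFF filter performs its comparison: it only compares the plain text of <source> with the concatenated text of <seg-source>, ignoring inline codes. The function names and the sample document are made up for illustration (the sample is not the attached segsource.xlf).

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def find_child(parent, name):
    """Return the first direct child with the given local name, or None."""
    for child in parent:
        if local(child.tag) == name:
            return child
    return None


def find_mismatches(xliff_text):
    """List (id, source text, seg-source text) for units whose texts differ.

    Assumption: a plain-text comparison is good enough as a first check;
    the real filter may compare content differently.
    """
    root = ET.fromstring(xliff_text)
    mismatches = []
    for unit in root.iter():
        if local(unit.tag) != "trans-unit":
            continue
        source = find_child(unit, "source")
        seg_source = find_child(unit, "seg-source")
        if source is None or seg_source is None:
            continue  # nothing to compare in this unit
        src_text = "".join(source.itertext())
        seg_text = "".join(seg_source.itertext())
        if src_text != seg_text:
            mismatches.append((unit.get("id"), src_text, seg_text))
    return mismatches


# Hypothetical sample: tu1 has a capitalization difference, tu2 is consistent.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="demo" source-language="en-US" target-language="es-ES" datatype="plaintext">
    <body>
      <trans-unit id="tu1">
        <source>Hello world. Goodbye.</source>
        <seg-source><mrk mtype="seg" mid="1">Hello World. </mrk><mrk mtype="seg" mid="2">Goodbye.</mrk></seg-source>
      </trans-unit>
      <trans-unit id="tu2">
        <source>Datensicherung</source>
        <seg-source><mrk mtype="seg" mid="1">Datensicherung</mrk></seg-source>
      </trans-unit>
    </body>
  </file>
</xliff>"""

for unit_id, src, seg in find_mismatches(SAMPLE):
    print(f"{unit_id}: <source>={src!r} <seg-source>={seg!r}")
```

    Any unit this reports is a candidate for the "will be used" fallback shown in the tikal output above, unless the "Always use Segmented Source" option is enabled.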

  5. Chase Tingley reporter

    So @DenisKonovalyenko to answer your original question -- we need a new testcase to prove this is a real bug.

    I need to go back to the file that caused me to open this issue and see if it was the result of a broken file and whether the option would have helped.

    @eraser17 you can help with this too. In the example you posted above, I didn't see any difference between source and seg-source content. Can you confirm this? (If you have a testcase you can attach, that would be ideal.) Also, can you confirm that among the things you tried in Rainbow, this option was one of them?
