XLIFFFilter: 3 consecutive identical tags with different id attributes are not output correctly

Issue #353 new
Former user created an issue

Original issue 353 created by joerg.beisie...@translatissimo.de on 2013-07-22T13:05:47.000Z:

What steps will reproduce the problem?
1. Make sure to change root directory to file folder.
2. Extract the attached archive to a folder
3. Open the project file AuthorIt_debug_20130722.rnb
4. Make sure that the filter okf_xml@AuthorIt_debug is selected
5. Execute Translation Kit Creation
6. Open the Translation Package AuIT_debug_20130722.rkp
7. Execute Translation Kit Post-Processing

What is the expected output?
<p class="none" id="5966" image="none" translate="yes">This chapter describes the XXX function, the <tref id="5952"/> series runtime system for control applications. It has been designed according to <tref id="5969"/> and is engineered with the programming and debugging system <tref id="5969"/>.</p><p id="5966">The XXX function is a licensed software package that generally enables the YYYY to run XXX applications and to communicate with <tref id="6558"/> for loading and debugging applications. ...

What do you see instead?
This chapter describes the XXX function, the <tref id="5952"/> series runtime system for control applications. It has been designed according to <tref id="5952"/> and is engineered with the programming and debugging system <tref id="5969"/>.</p><p id="5966">The XXX function is a licensed software package that generally enables the YYYY to run XXX applications and to communicate with <tref id="6558"/> for loading and debugging applications.

What version of the product are you using? On what operating system?
okapi-apps_win32-x86_64_0.22 (same issue in okapi-apps_win32-x86_64_0.21)
Win7, 64bit

Please provide any additional information below.
The empty tag pair <ModifiedComments></ModifiedComments> is output as <ModifiedComments/>. This is correct as per standard XML processing, but AuthorIT does not tolerate it. The original order of attributes is lost, the attribute output is sorted. Why?

Comments (10)

  1. Former user Account Deleted

    Comment 1. originally posted by @ysavourel on 2013-07-22T13:44:29.000Z:

    3 consecutive identical Tags with different id
    attribute are not output correctly

    For the trans-unit id='2' I'm getting:

    This chapter describes the XXX function, the <ph id="1"><tref id="5952"/></ph>
    series runtime system for control applications. It has been designed according to <ph id="1"><tref id="5969"/></ph> and is engineered with the programming and debugging system <ph id="2"><tref id="6558"/></ph>.

    So the three <tref> tags have different IDs, unlike your output. The problem may be linked to the segmentation. Your test files didn't include the SRX document you are using, so I used the default one. Could you provide your SRX?

    The empty tag pair <ModifiedComments></ModifiedComments>
    is output as <ModifiedComments/>.
    This is correct as per standard XML processing,
    but AuthorIT does not tolerate it.

    We can try to have some option for this. But obviously it would be a lot better to have the problem fixed in AuthorIT.

    The original order of attributes is lost,
    the attribute output is sorted. Why?

    Not much we can do about this: when the original XML document is parsed, the attributes of a given element are put in some hash table and we just put them back in the order we get them. Some XML parsers will try to keep the original order, others will sort; we have no control over that, unfortunately.
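
    To illustrate the point about attribute order: the order in which attributes come back depends only on the container they were stored in. The sketch below is not the XML parser's code. The class name `AttributeOrderDemo` and the attribute order are hypothetical, and a `TreeMap` is used only to mimic the sorted output reported above, next to a `LinkedHashMap` that keeps insertion order.

    ```java
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    class AttributeOrderDemo {
        // Stores three attributes into the given map, in a hypothetical document order.
        static Map<String, String> fill(Map<String, String> attrs) {
            attrs.put("translate", "yes");
            attrs.put("id", "5966");
            attrs.put("class", "none");
            return attrs;
        }

        // Returns the attribute names in the order the map iterates over them.
        static List<String> namesInMapOrder(Map<String, String> attrs) {
            return new ArrayList<>(attrs.keySet());
        }

        public static void main(String[] args) {
            // Sorted container: names come back alphabetically.
            System.out.println(namesInMapOrder(fill(new TreeMap<String, String>())));
            // Insertion-ordered container: names come back as they were stored.
            System.out.println(namesInMapOrder(fill(new LinkedHashMap<String, String>())));
        }
    }
    ```

    A writer that only sees the container cannot recover the document order once the container has discarded it, which is why the order depends on the parser in use.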

    -ys

  2. Former user Account Deleted

    Comment 2. originally posted by joerg.beisie...@translatissimo.de on 2013-07-22T14:34:24.000Z:

    Hello Yves,
    thank you very much for your fast response.
    Please find my comments below.

    BTW, you are doing a really great job with the rainbow tools. I am very happy with the ITS implementation and looking forward to further developments.

    Thank you very much.

    3 consecutive identical Tags with different id
    attribute are not output correctly

    For the trans-unit id='2' I'm getting:

    This chapter describes the XXX function, the <ph id="1"><tref id="5952"/></ph>
    series runtime system for control applications. It has been designed according to
    <ph id="1"><tref id="5969"/></ph> and is engineered with the programming and
    debugging system <ph id="2"><tref id="6558"/></ph>.

    So the three <tref> tags have different IDs, unlike your output. The problem may
    be linked to the segmentation. Your test files didn't have the SRX document you
    are using, so I used the default one. Could you provide your SRX?

    Please find it attached

    The empty tag pair <ModifiedComments></ModifiedComments>
    is output as <ModifiedComments/>.

    This is correct as per standard XML processing,
    but AuthorIT does not tolerate it.

    We can try to have some option for this.
    But obviously it would be a lot better to have the problem fixed in AuthorIT.

    Right you are, I am using a work-around by writing dummy text into the empty segments that I delete after target file creation.

    The original order of attributes is lost,
    the attribute output is sorted. Why?

    Not much we can do about this: when the original XML document is parsed,
    the attributes of a given element are put in some hash table and we just
    put them back in the order we get them. Some XML parsers will try to keep the
    original order, others will sort; we have no control over that, unfortunately.

    Thank you for explaining. This is only a matter of "cosmetics"; fortunately I have had no issues with this so far.

    Regards,
    Jörg

  3. Former user Account Deleted

    Comment 3. originally posted by KFLi... on 2013-07-22T21:26:01.000Z:

    Hi Yves and Jörg,

    With the current configuration the code IDs start over for each segment. So for this TU 2, Seg 1 has id=1 and Seg 2 has id=1 and id=2. I think this is normal behavior for the most part. There's an option in the Segmentation step, "Renumber code IDs": if you uncheck that option during kit creation, the code IDs are sequential across segments and the merge then works. However, it seems we still need to update TextUnitUtil.copySrcCodeDataToMatchingTrgCodes to account for both cases.
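
    A minimal sketch of why restarted IDs can collide at merge time, assuming the merge looks up each target code's original markup by ID across the whole text unit and the first match wins. This is not Okapi's implementation: `CodeIdCollisionDemo`, `Code`, and `restoreById` are hypothetical names, and the `<tref>` values are the ones from comment 1.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class CodeIdCollisionDemo {
        // Hypothetical stand-in for an inline code: a numeric ID plus the original markup.
        static final class Code {
            final int id;
            final String data;
            Code(int id, String data) { this.id = id; this.data = data; }
        }

        // Looks up the original markup for each target code ID.
        // When several source codes share an ID, the first one registered wins.
        static List<String> restoreById(List<Code> sourceCodes, List<Integer> targetIds) {
            Map<Integer, String> dataById = new HashMap<>();
            for (Code c : sourceCodes) {
                dataById.putIfAbsent(c.id, c.data);
            }
            List<String> restored = new ArrayList<>();
            for (int id : targetIds) {
                restored.add(dataById.get(id));
            }
            return restored;
        }

        public static void main(String[] args) {
            String a = "<tref id=\"5952\"/>";
            String b = "<tref id=\"5969\"/>";
            String c = "<tref id=\"6558\"/>";
            // IDs restart per segment: Seg 1 has id 1, Seg 2 has ids 1 and 2.
            System.out.println(restoreById(
                Arrays.asList(new Code(1, a), new Code(1, b), new Code(2, c)),
                Arrays.asList(1, 1, 2)));
            // IDs sequential across segments: every lookup is unambiguous.
            System.out.println(restoreById(
                Arrays.asList(new Code(1, a), new Code(2, b), new Code(3, c)),
                Arrays.asList(1, 2, 3)));
        }
    }
    ```

    Under that assumption, the first call returns the first `<tref>` twice, a duplicated reference of the same kind as in the report, while the second call returns all three in order.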

    Fredrik

  4. Former user Account Deleted
    • changed status to open

    Comment 4. originally posted by @ysavourel on 2013-07-23T03:54:37.000Z:

    Right. That option is not the default and has been added recently.
    There may be a way to renumber back that was implemented along with it. I'll look at it.

  5. Former user Account Deleted

    Comment 5. originally posted by @ysavourel on 2013-07-23T04:21:35.000Z:

    There is an option in the Desegmentation step to restore the original IDs. But we don't use the Desegmentation step when we merge back: what happens when merging depends on extraction options and file formats.
    But there is code to do the restoring, so maybe we can capture the segmentation option and act on it when merging. I'll look at that, and also at your tentative fix Fredrik (maybe that is a better option).

    The two options, "Renumber code IDs" in Segmentation and "Restore original IDs to renumbered codes" in Desegmentation, are also not documented currently. We need to fix that too.

    I'm adding Chase in CC so he is updated.

  6. Former user Account Deleted

    Comment 6. originally posted by @ysavourel on 2013-07-23T04:58:59.000Z:

    Temporary workaround:

    Instead of using "Translation Kit Post-Processing" in Rainbow, you can merge with a custom pipeline that includes the Desegmentation step.

    1) Make sure the target language is set to your target language. (You usually don't have to do this for post-processing because the manifest provides the correct target to the Rainbow Translation Kit Merging step. But here the Desegmentation step needs the correct info too.)

    2) Go to "Edit/Execute Pipeline" and create a pipeline with the following steps:
    - Raw Document to Filter Events
    - Desegmentation
    - Rainbow Translation Kit Merging

    All defaults should be OK, except for "Restore original IDs to renumbered codes" which must be set.

    3) Click Execute

    That seems to generate the correct merged document, where the <tref>'s IDs are the correct ones.

  7. Former user Account Deleted

    Comment 7. originally posted by joerg.beisie...@translatissimo.de on 2013-07-23T07:54:57.000Z:

    Yves,
    thank you very much for your help and for providing a temporary workaround.
    As we are still in the testing phase I simply created a new Translation Kit unchecking the option "Renumber code IDs".
    Postprocessing now runs without warnings and returns all trefs with correct IDs (and many other inline elements on the "real" documents with much more code).
    AuthorIT now imports the documents back and we can start using this process for translation.

    Thanks a lot for your fast and efficient help!

    Regards,
    Jörg

  8. Former user Account Deleted

    Comment 8. originally posted by @ysavourel on 2013-08-13T06:10:34.000Z:

    Conclusion: for now the best solution seems to be to set a flag in the file or manifest indicating that the IDs have been renumbered,
    and then to fix the numbering just before the merge in the Merger code by calling the method already available for this.
    We just need to find the time to do it.
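
    The idea can be sketched as follows, assuming renumbering restarted the code IDs at 1 in every segment and the original IDs were contiguous across the text unit. This is not the existing Okapi method mentioned above: `RestoreCodeIdsDemo`, `restoreIds`, and the `idsWereRenumbered` flag are hypothetical names for illustration only.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    class RestoreCodeIdsDemo {
        // Takes the code IDs of each segment (in order) and, if the flag says they
        // were renumbered per segment, shifts each segment's IDs past the highest
        // ID used by the segments before it. Otherwise the IDs are copied unchanged.
        static List<List<Integer>> restoreIds(List<List<Integer>> segmentCodeIds,
                                              boolean idsWereRenumbered) {
            List<List<Integer>> result = new ArrayList<>();
            int offset = 0; // sum of the highest IDs of the segments seen so far
            for (List<Integer> segment : segmentCodeIds) {
                List<Integer> shifted = new ArrayList<>();
                int maxInSegment = 0;
                for (int id : segment) {
                    shifted.add(idsWereRenumbered ? id + offset : id);
                    maxInSegment = Math.max(maxInSegment, id);
                }
                result.add(shifted);
                offset += maxInSegment;
            }
            return result;
        }

        public static void main(String[] args) {
            // TU 2 from this issue: Seg 1 has id 1, Seg 2 has ids 1 and 2.
            List<List<Integer>> renumbered =
                Arrays.asList(Arrays.asList(1), Arrays.asList(1, 2));
            System.out.println(restoreIds(renumbered, true));
        }
    }
    ```

    Shifting by an offset, rather than counting codes one by one, keeps codes that share an ID within a segment (such as an opening and closing pair) on the same ID after the fix.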
