XLIFF Word-Count Splitter does not preserve context-group metadata

Issue #1156 resolved
Chase Tingley created an issue

To reproduce:

  • In Rainbow, add the attached file as an input document
  • Open the Pipeline Editor and add “XLIFF Word-Count Splitter”. Set the maximum word-count per part to “2”
  • Click execute

The source file contains several pieces of metadata embedded via <context-group> elements in nested groups. All of the TUs are within the innermost group. However, after the split is performed, the inherited metadata is missing from the second file. The splitter preserves the nested group structure, but doesn’t preserve the context-group data.

Desired behavior: the context-group data should be retained in each split as part of the replicated group structure.
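
To make the desired output concrete, here is a minimal Python sketch of the behavior, not the Okapi implementation. It assumes a simplified, namespace-free XLIFF 1.2 body, counts words by splitting the <source> text on whitespace, and closes a part once the running count reaches the threshold. The helper names (`iter_units`, `shell_of`, `build_part`, `split_body`) and the sample document are made up for illustration.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical sample: two nested groups, each carrying its own context-group.
SAMPLE = """<body>
  <group id="outer">
    <context-group name="outer-meta"><context context-type="x-origin">outer</context></context-group>
    <group id="inner">
      <context-group name="inner-meta"><context context-type="x-origin">inner</context></context-group>
      <trans-unit id="1"><source>one</source></trans-unit>
      <trans-unit id="2"><source>two</source></trans-unit>
      <trans-unit id="3"><source>three</source></trans-unit>
      <trans-unit id="4"><source>four</source></trans-unit>
    </group>
  </group>
</body>"""


def iter_units(elem, ancestors=()):
    """Yield (ancestor groups, trans-unit) pairs in document order."""
    for child in elem:
        if child.tag == "group":
            yield from iter_units(child, ancestors + (child,))
        elif child.tag == "trans-unit":
            yield ancestors, child


def shell_of(group):
    """Copy a group without its nested groups and trans-units,
    keeping its attributes and metadata children such as context-group."""
    shell = ET.Element(group.tag, dict(group.attrib))
    for child in group:
        if child.tag not in ("group", "trans-unit"):
            shell.append(copy.deepcopy(child))
    return shell


def build_part(entries):
    """Rebuild the group chain, metadata included, around one part's units."""
    body = ET.Element("body")
    built = {}  # id(original group) -> its copy inside this part
    for ancestors, unit in entries:
        parent = body
        for group in ancestors:
            if id(group) not in built:
                built[id(group)] = shell_of(group)
                parent.append(built[id(group)])
            parent = built[id(group)]
        parent.append(copy.deepcopy(unit))
    return body


def split_body(body, max_words):
    """Split into parts, closing a part once it holds max_words words."""
    parts, current, count = [], [], 0
    for ancestors, unit in iter_units(body):
        current.append((ancestors, unit))
        count += len((unit.findtext("source") or "").split())
        if count >= max_words:
            parts.append(current)
            current, count = [], 0
    if current:  # only emit a trailing part if it actually has units
        parts.append(current)
    return [build_part(p) for p in parts]


if __name__ == "__main__":
    for part in split_body(ET.fromstring(SAMPLE), max_words=2):
        print(ET.tostring(part, encoding="unicode"))
```

With the sample above, every part carries both the outer and the inner <context-group>, because each replicated group is copied together with its metadata children rather than as a bare wrapper.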

(Also, it looks like there’s a secondary bug: if the word-count threshold divides evenly into the total word count of the file, an empty part is produced. In this case, there are 4 words in the source file, and splitting at 2 words produces 3 split parts, one of them containing no trans-units. This is an unlikely edge case in the real world, so it doesn’t have to be fixed here unless it’s easy to add.)

Comments (4)

  1. Denis Konovalyenko

    @Chase Tingley the secondary bug that produces the extra empty part (without text units) is connected with how the document is read: in a single pass. A fix would probably need to pre-read events and look ahead for text-unit tags; I assume that if no text unit is found, a new part must not be started. That was not covered in the scope of the solution provided in pull request #625, so I will create a new issue for it.
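
As a rough illustration of the one-pass behavior described here, the following Python sketch models a splitter that opens the next part as soon as the threshold is reached, next to a variant that defers opening a part until another text unit actually arrives. This is a toy model, not Okapi code: the function names are hypothetical, text units are reduced to their word counts, and the input of four one-word units is an assumption about the attached file.

```python
def split_eager(word_counts, max_words):
    """Open the next part immediately when the threshold is reached."""
    parts = [[]]
    total = 0
    for n in word_counts:
        parts[-1].append(n)
        total += n
        if total >= max_words:
            parts.append([])  # opened without knowing whether more units follow
            total = 0
    return parts


def split_lazy(word_counts, max_words):
    """Defer opening a part until a text unit is actually seen."""
    parts = []
    total = 0
    start_new = True
    for n in word_counts:
        if start_new:
            parts.append([])
            start_new = False
        parts[-1].append(n)
        total += n
        if total >= max_words:
            start_new = True  # remember to open a part on the next unit, if any
            total = 0
    return parts


if __name__ == "__main__":
    units = [1, 1, 1, 1]  # assumed: four one-word text units
    print(split_eager(units, 2))  # [[1, 1], [1, 1], []]
    print(split_lazy(units, 2))   # [[1, 1], [1, 1]]
```

The deferred variant gets the same effect as looking ahead for the next text unit, without buffering events: a part is only created once there is something to put in it.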
