Too many codes for underlying representation: part 2 (aka, the seq-U+E110)

Issue #615 new
Chase Tingley created an issue

The attached file is cleaned and very heavily stripped down -- I expect SDL Studio will no longer open it. However, it is based on a real file that included a single TU containing over 18,000 inline codes.

Parsing this file hits the same core issue as issue 293:

java.lang.ArrayIndexOutOfBoundsException: -57616
    at java.util.ArrayList.elementData(ArrayList.java:418)
    at java.util.ArrayList.get(ArrayList.java:431)
    at net.sf.okapi.common.resource.TextFragment.balanceMarkers(TextFragment.java:1950)
    at net.sf.okapi.common.resource.TextFragment.getCodedText(TextFragment.java:937)
    at net.sf.okapi.common.resource.TextFragment.insert(TextFragment.java:785)
    at net.sf.okapi.common.resource.TextFragment.append(TextFragment.java:586)
    at net.sf.okapi.common.resource.TextContainer.append(TextContainer.java:500)
    at net.sf.okapi.filters.xliff.XLIFFFilter.processContent(XLIFFFilter.java:2129)
    at net.sf.okapi.filters.xliff.XLIFFFilter.processSource(XLIFFFilter.java:1675)
    at net.sf.okapi.filters.xliff.XLIFFFilter.processTransUnit(XLIFFFilter.java:1367)
    at net.sf.okapi.filters.xliff.XLIFFFilter.read(XLIFFFilter.java:548)
    at net.sf.okapi.filters.xliff.XLIFFFilter.next(XLIFFFilter.java:293)

This overflows the inline code representation in TextFragment. Unlike in the IDML case, there's no obvious problem in the filter that we can fix to prevent it, since this is how the SDLXLIFF was constructed. (This TU is one of the "hidden" TUs that Studio produces. It contains only markup and is not exposed for translation in the Studio UI.)
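
To make the overflow concrete, here is a minimal sketch of how a char-based index wraps around. It assumes each code's list index is stored as a single char offset from U+E110 and decoded by subtracting the same base. The class and method names below are illustrative only and are not copied from the Okapi source.

```java
class MarkerOverflowDemo {
    // Assumed base of the index characters, taken from the "U+E110" in the issue title.
    static final int CHARBASE = 0xE110;

    // Assumed encoding: add the code's list index to the base and keep 16 bits.
    static char toChar(int index) {
        return (char) (CHARBASE + index);
    }

    // Assumed decoding: subtract the base again.
    static int toIndex(char c) {
        return c - CHARBASE;
    }

    public static void main(String[] args) {
        // 6127 lands on U+F8FF, 7919 on U+FFFF, and 7920 wraps to U+0000.
        for (int index : new int[] {0, 6127, 7919, 7920, 18000}) {
            char c = toChar(index);
            System.out.printf("index %5d -> U+%04X -> decoded %d%n",
                    index, (int) c, toIndex(c));
        }
    }
}
```

Under these assumptions, index 7920 is the first one that no longer fits in 16 bits: it wraps to U+0000 and decodes to -57616, the same value reported in the trace above.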

Fixing this would require changing the inline code serialization.

Comments (4)

  1. Chase Tingley reporter

    I've pushed code that will throw a clearer exception when we exceed the maximum allowed codes. This is currently defined (in a commit Jim put in a while back) as 6127, which is how many we can fit in on top of U+E110 before we run out of room in the private use range.
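
    For reference, the 6127 figure can be reproduced from the code point values. The sketch below is only an illustration of that arithmetic plus a guard of the kind described above; the class name, method name, exception type, and message are hypothetical and are not the code that was pushed.

    ```java
    class InlineCodeLimit {
        // First index character, from the issue title.
        static final int CHARBASE = 0xE110;
        // Last code point of the Unicode Private Use Area in the BMP.
        static final int PRIVATE_USE_END = 0xF8FF;
        // 0xF8FF - 0xE110 = 6127
        static final int MAX_INLINE_CODES = PRIVATE_USE_END - CHARBASE;

        // Hypothetical guard: fail early with a readable message instead of
        // letting the index wrap around later.
        static void checkCodeCount(int count) {
            if (count > MAX_INLINE_CODES) {
                throw new IllegalStateException("Too many inline codes: " + count
                        + " (maximum is " + MAX_INLINE_CODES + ")");
            }
        }

        public static void main(String[] args) {
            System.out.println("max inline codes = " + MAX_INLINE_CODES);
            checkCodeCount(18000); // throws IllegalStateException
        }
    }
    ```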

  2. stevebpdx

    Hello, I have been tasked with fixing this issue. My question is this:

    A single TU containing over 18,000 inline codes? And this is something SDL Studio generated? My initial reaction is that this is a problem with SDL Studio. Should the Okapi filter support 18,000 inline codes?

  3. Chase Tingley reporter

    Hi Steve,

    SDL Studio has a behavior where it writes out trans-units and then decides they are not translatable. Rather than marking them as translate="no", it simply omits the seg-source data, and the tool is supposed to infer that the trans-unit should be ignored.

    For a long time Okapi exposed these as translatable segments, but this was fixed recently in issue #466. However, the fix works by parsing the trans-unit, deciding it's not translatable, and re-serializing it, so it still hits this condition.
