Too many inline codes for underlying 2-char encoding (e.g. in some IDML paragraphs)

Issue #293 resolved
Former user created an issue

Original issue 293 created by polytrans2... on 2012-12-02T05:27:51.000Z:

What steps will reproduce the problem?

When I tried generating translation packages from an IDML file exported
from InDesign CS5.5 for Windows, I got the following message:

ERROR: Error processing story file 'ue5'.
Error simplifiying codes.
-57616

When I disabled "simplifying codes", I got the following error instead:

ERROR: Error writing a text unit.

And a zero-length file was generated.

Comments (11)

  1. Former user Account Deleted

    Comment 2. originally posted by @ysavourel on 2013-01-17T13:16:35.000Z:

    The issue comes from something in one massive paragraph (the one with "No. 652 [Tribunale di Padova").
    I suspect the number of inline codes in it (11091) reaches some boundaries that cause a problem.
    I won't be able to track down the issue quickly, but I suspect that breaking the paragraph into several smaller paragraphs would help (I don't have InDesign available to test at this time). Maybe that can get you moving in this specific case.
    I'll continue working on it.

  2. Former user Account Deleted

    Comment 3. originally posted by @ysavourel on 2013-01-17T19:06:08.000Z:

    For reference:

    Caused by: java.lang.ArrayIndexOutOfBoundsException: -57616
    at java.util.ArrayList.get(ArrayList.java:324)
    at java.util.Collections$UnmodifiableList.get(Collections.java:1152)
    at net.sf.okapi.common.resource.CodeSimplifier.prepare(CodeSimplifier.java:142)
    at net.sf.okapi.common.resource.CodeSimplifier.simplifyAll(CodeSimplifier.java:207)
    ... 16 more

    That is a pretty unusual index. Relevant line is:
    PhCodeNode cn = new PhCodeNode(i, TextFragment.toIndex(codedText.charAt(i+1)), codedText.charAt(i+1), pCodes.get(TextFragment.toIndex(codedText.charAt(i+1))));

    So the index is coming from the codedText itself (via |pCodes.get(TextFragment.toIndex(codedText.charAt(i+1)))|). Given that TextFragment.toIndex() is implemented as |return ((int)index)-CHARBASE;|, it's probably a safe bet that the huge number of placeholders is overflowing the way that we encode the code index.
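
    The suspected overflow can be reproduced in isolation. The sketch below is not Okapi code: the class name and toChar() are stand-ins, and the CHARBASE value 0xE110 is an assumption inferred from the reported index (-57616 is exactly -0xE110, which is what toIndex() would return for the character U+0000). Only the body of toIndex() is taken from the comment above.

    ```java
    // Sketch only. CHARBASE is an assumed value inferred from the reported
    // index (-57616 == -0xE110); the real constant lives in TextFragment.
    class CodeIndexOverflowDemo {
        static final int CHARBASE = 0xE110; // assumption, see lead-in

        // Stand-in encoder: store a code index in a single UTF-16 code unit.
        static char toChar(int index) {
            return (char) (index + CHARBASE); // silently wraps past 0xFFFF
        }

        // Decoder, as quoted in the comment above.
        static int toIndex(char index) {
            return ((int) index) - CHARBASE;
        }

        public static void main(String[] args) {
            System.out.println(toIndex(toChar(7919))); // prints 7919 (U+FFFF)
            System.out.println(toIndex(toChar(7920))); // prints -57616 (U+0000)
        }
    }
    ```

    Under that assumption, index 7920 is the first one that no longer fits in a char, and decoding it yields the same -57616 seen in the stack trace.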

  3. Former user Account Deleted

    Comment 4. originally posted by @ysavourel on 2013-01-17T20:06:50.000Z:

    Yes. It looks like we reach the limit for a TextFragment.
    The only solution for this I can think of would be to lower the CHARBASE value, but that would make the index character move out of the Unicode Private Use Area.
    That would also affect any existing stored data.

  4. Former user Account Deleted

    Comment 6. originally posted by @ysavourel on 2013-01-17T23:02:59.000Z:

    Another possible option would be to keep track of the number of codes and when we reach a given threshold we could trigger the Code Simplifier. That would reduce the number of markers. Then we would resume the parsing on the same Text Fragment.
    Probably not very easy to implement.

  5. Former user Account Deleted

    Comment 7. originally posted by @ysavourel on 2013-01-18T05:47:33.000Z:

    The solution you propose in comment 6 sounds like it would make things better, but the problem could still exist in pathological cases. And moving the CHARBASE value out of the private use area is probably dangerous.

    I'm reading this code for the first time (and have only read part of it), so maybe this is a stupid suggestion, but would it be possible to change the way that the codes are written out so that the code index was not written out as part of the character itself? In other words, instead of writing out U+(CHARBASE + index), we could write out multiple characters: (MARKER_OPENING, (char index), MARKER_OPENING). Or some similar variant. The point is, write out the index as a codepoint, demarcated by private-use sentinels. That would break any stored data in this format, and maybe cause other side effects if the text fragment was processed in this format. However, at least it would expand the index count.
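
    A minimal sketch of that sentinel-delimited variant, to make the idea concrete. Everything here is hypothetical: the class, encode() and decode() are not Okapi APIs, and the sentinel value U+E101 is an assumed private-use character standing in for MARKER_OPENING.

    ```java
    // Hypothetical sketch of the (marker, index, marker) layout; not Okapi code.
    class SentinelCodeDemo {
        static final char MARKER = '\uE101'; // assumed private-use sentinel

        // Write marker, raw index, marker. The index is stored as its own
        // code unit, so it can use the full 0..0xFFFF range.
        static String encode(int index) {
            if (index < 0 || index > 0xFFFF) {
                throw new IllegalArgumentException("index out of range: " + index);
            }
            return new String(new char[] { MARKER, (char) index, MARKER });
        }

        // Read the index of the code that starts at position pos.
        static int decode(String codedText, int pos) {
            if (pos + 2 >= codedText.length()
                    || codedText.charAt(pos) != MARKER
                    || codedText.charAt(pos + 2) != MARKER) {
                throw new IllegalArgumentException("no code at position " + pos);
            }
            return codedText.charAt(pos + 1);
        }
    }
    ```

    For example, decode("ab" + encode(11090) + "cd", 2) returns 11090. The sketch also exposes two costs of the layout: an index between 0xD800 and 0xDFFF would leave an unpaired surrogate in the string, and an index equal to the sentinel itself is only unambiguous because decoding is positional.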

    Lastly, it's possible that in this case we could dodge the problem by improving the IDML filter. It's hard to tell just from looking at the markup (I also don't have InDesign), but I know that InDesign is very verbose about writing out its character runs. It may be that some of the markup within this paragraph could be collapsed. At least, I hope so, because 11k inline codes is ridiculous.

  6. Jim Hargrave (OLD)

    New code has been added to stop adding inline codes to a TextUnit if the max number is reached. This is not a complete solution, as the filter should be updated to remove these pathological codes. However, this workaround does prevent the crash and may allow extraction and merge.

    boolean moreThanMaxCodes(TextFragment tf)
    TextFragment removeMoreThanMaxCodes(TextFragment tf)
    
  7. Chase Tingley

    Quick update here. With the use of the code simplifier and Jim's fix, we are allocating 6000 inline codes and then simplifying them to about 600. However, since the simplifier doesn't renumber the merged codes, our inline code IDs count by ~10 up to the max (6126) and then stop. In other words, even though there are only ~600 codes in the segment, we are still tripping the max code limit because we are counting the allocated IDs, not the simplified codes.

    (Another way of saying this is that because we must first create a TextFragment in order to later simplify it, we are still hitting the threshold.)
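
    A toy model of that behaviour, for illustration only. It does not use Okapi classes: the class, addCode() and simplify() are hypothetical, the merge is a crude stand-in for the real simplifier, and MAX_ID simply reuses the maximum quoted above.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Toy model, not Okapi code: IDs are handed out from a counter that
    // simplification never resets.
    class CodeIdLimitDemo {
        static final int MAX_ID = 6126; // maximum ID quoted above

        final List<Integer> ids = new ArrayList<>(); // IDs of surviving codes
        int nextId = 0;                              // never reset by simplify()

        // Allocate one code; refuse once the ID space is used up.
        boolean addCode() {
            if (nextId > MAX_ID) {
                return false;
            }
            ids.add(nextId++);
            return true;
        }

        // Stand-in for simplification: merge each run of `run` codes into
        // one, keep the first ID of the run, and leave nextId untouched.
        void simplify(int run) {
            List<Integer> kept = new ArrayList<>();
            for (int i = 0; i < ids.size(); i += run) {
                kept.add(ids.get(i));
            }
            ids.clear();
            ids.addAll(kept);
        }
    }
    ```

    In this model, filling the ID space and then calling simplify(10) leaves 613 codes with IDs 0, 10, 20 and so on, yet addCode() still returns false because the limit is checked against the allocation counter rather than the number of surviving codes.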
