OpenXML Filter: improve the segmentation quality and merge with complex script formatting

Issue #947 resolved
Denis Konovalyenko created an issue

The creation of this issue was inspired by issue #933.

We can keep only the meaningful formatting, which raises the probability that consecutive runs are merged together on extraction and thus improves the segmentation quality.

On extraction: if the original document contains no complex script content ([\u0590-\u074F\u0780-\u07BF\u0900-\u109F\u1780-\u18AF\u200C-\u200F\u202A-\u202F\u2670-\u2671\uFB1D-\uFB4F]) and w:cs is not specified, the complex script formatting can be removed (the w:csTheme and w:cs attributes of w:rFonts, plus w:iCs, w:bCs and w:szCs). Similarly, if the content is complex script only or w:cs is specified, all non-complex script formatting can be removed (the w:asciiTheme, w:ascii, w:hAnsiTheme, w:hAnsi, w:eastAsiaTheme and w:eastAsia attributes of w:rFonts, plus w:b, w:i and w:sz).
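The extraction rule above could be sketched roughly as follows. This is only an illustration, not the Okapi implementation: the class `ComplexScriptCleanup`, the method `removableProperties` and the `w:rFonts/@...` naming of attributes are hypothetical, and the sketch assumes that whitespace does not count against "complex script only" content.

```java
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Okapi: picks the run properties that are safe to drop.
class ComplexScriptCleanup {
    // Character class copied from the issue description.
    static final Pattern COMPLEX_SCRIPT = Pattern.compile(
        "[\\u0590-\\u074F\\u0780-\\u07BF\\u0900-\\u109F\\u1780-\\u18AF"
        + "\\u200C-\\u200F\\u202A-\\u202F\\u2670-\\u2671\\uFB1D-\\uFB4F]");

    // Property names as listed in the issue; the attribute notation is illustrative.
    static final Set<String> COMPLEX = Set.of(
        "w:rFonts/@w:csTheme", "w:rFonts/@w:cs", "w:iCs", "w:bCs", "w:szCs");
    static final Set<String> NON_COMPLEX = Set.of(
        "w:rFonts/@w:asciiTheme", "w:rFonts/@w:ascii", "w:rFonts/@w:hAnsiTheme",
        "w:rFonts/@w:hAnsi", "w:rFonts/@w:eastAsiaTheme", "w:rFonts/@w:eastAsia",
        "w:b", "w:i", "w:sz");

    static boolean isComplex(int ch) {
        // All listed ranges are in the Basic Multilingual Plane, so a char cast is enough.
        return COMPLEX_SCRIPT.matcher(String.valueOf((char) ch)).matches();
    }

    // Returns the properties that can be removed for the given text and w:cs state.
    static Set<String> removableProperties(String text, boolean csSpecified) {
        boolean hasComplex = COMPLEX_SCRIPT.matcher(text).find();
        if (!hasComplex && !csSpecified) {
            return COMPLEX;
        }
        // Assumption: whitespace is ignored when checking for "complex script only".
        boolean onlyComplex = hasComplex && text.chars()
            .filter(ch -> !Character.isWhitespace(ch))
            .allMatch(ComplexScriptCleanup::isComplex);
        if (onlyComplex || csSpecified) {
            return NON_COMPLEX;
        }
        // Mixed content without w:cs: keep both sets of formatting.
        return Set.of();
    }
}
```

With this sketch, purely Latin text without w:cs yields the complex script set, purely Hebrew text yields the non-complex script set, and mixed text yields nothing to remove.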

On merge: if the translated run content contains non-complex script characters and any complex script formatting is present, the related non-complex script formatting has to be specified (e.g. w:bCs will trigger adding w:b if true). Similarly, if the translated run content contains complex script characters and any non-complex script formatting is present, the related complex script formatting has to be specified (e.g. w:b will trigger adding w:bCs).
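A minimal sketch of the merge rule, under stated assumptions: run properties are modelled as a plain name-to-value map, the class `ComplexScriptMirror` and its `mirror` method are hypothetical, a counterpart that is already present is never overwritten, and w:rFonts is left out because the issue notes it is already handled elsewhere.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Okapi: adds the missing counterpart properties on merge.
class ComplexScriptMirror {
    // Character class copied from the issue description.
    static final Pattern COMPLEX_SCRIPT = Pattern.compile(
        "[\\u0590-\\u074F\\u0780-\\u07BF\\u0900-\\u109F\\u1780-\\u18AF"
        + "\\u200C-\\u200F\\u202A-\\u202F\\u2670-\\u2671\\uFB1D-\\uFB4F]");

    // Complex script property mapped to its non-complex script counterpart.
    static final Map<String, String> CS_TO_PLAIN = Map.of(
        "w:bCs", "w:b", "w:iCs", "w:i", "w:szCs", "w:sz");

    static boolean isComplex(int ch) {
        return COMPLEX_SCRIPT.matcher(String.valueOf((char) ch)).matches();
    }

    // Returns a copy of props with counterparts added for the translated text.
    static Map<String, String> mirror(Map<String, String> props, String translated) {
        Map<String, String> result = new LinkedHashMap<>(props);
        boolean hasComplex = translated.chars().anyMatch(ComplexScriptMirror::isComplex);
        // Assumption: whitespace alone does not count as non-complex script content.
        boolean hasPlain = translated.chars()
            .anyMatch(ch -> !Character.isWhitespace(ch) && !isComplex(ch));
        for (Map.Entry<String, String> pair : CS_TO_PLAIN.entrySet()) {
            String cs = pair.getKey();
            String plain = pair.getValue();
            if (hasPlain && props.containsKey(cs)) {
                // e.g. w:bCs present and Latin text: add w:b with the same value.
                result.putIfAbsent(plain, props.get(cs));
            }
            if (hasComplex && props.containsKey(plain)) {
                // e.g. w:b present and Hebrew text: add w:bCs with the same value.
                result.putIfAbsent(cs, props.get(plain));
            }
        }
        return result;
    }
}
```

Copying the value rather than forcing it to true keeps an explicitly disabled property disabled on both sides; whether that matches the intended behaviour is an open design choice.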

By the way, there was some work done in relation to the w:rFonts run property already (for more information please refer to the corresponding part of the net.sf.okapi.filters.openxml.RunMerger).

Comments (4)

  1. Denis Konovalyenko reporter

    The bPreferenceAggressiveCleanup conditional parameter has no effect on the bold and italic run properties (w:b, w:i, w:iCs, w:bCs).

    Also, if it is set to false, the size run properties (w:sz, w:szCs) are cleaned up only if the run content is not empty. If it is set to true, the size properties are cleaned up unconditionally (without the emptiness check).
