OpenXML Filter: improve the segmentation quality and merge with complex script formatting
The creation of this issue was inspired by issue #933.
We can keep only the meaningful formatting, which raises the probability that consecutive runs are merged together on extraction and thus improves the segmentation quality.
On extraction: if there is no complex script content ([\u0590-\u074F\u0780-\u07BF\u0900-\u109F\u1780-\u18AF\u200C-\u200F\u202A-\u202F\u2670-\u2671\uFB1D-\uFB4F]) in the original document and no w:cs is specified, the complex script formatting can be removed (w:rFonts w:csTheme w:cs, w:iCs, w:bCs, w:szCs). Similarly, if there is complex script content only, or w:cs is specified, all non-complex script formatting can be removed (w:rFonts w:asciiTheme w:ascii w:hAnsiTheme w:hAnsi w:eastAsiaTheme w:eastAsia, w:b, w:i and w:sz).
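The extraction rule above can be sketched as follows. This is only an illustration in Python, not the filter's actual code: the function name, the `element@attribute` notation for the w:rFonts attributes, and the decision to ignore whitespace when judging whether content is "complex script only" are all assumptions.

```python
import re

# Character class copied from the issue text; it defines "complex script content".
COMPLEX_SCRIPT = re.compile(
    r"[\u0590-\u074F\u0780-\u07BF\u0900-\u109F\u1780-\u18AF"
    r"\u200C-\u200F\u202A-\u202F\u2670-\u2671\uFB1D-\uFB4F]"
)

# Hypothetical "element@attribute" notation for the w:rFonts attributes.
CS_PROPS = frozenset(
    {"w:rFonts@w:csTheme", "w:rFonts@w:cs", "w:iCs", "w:bCs", "w:szCs"}
)
NON_CS_PROPS = frozenset({
    "w:rFonts@w:asciiTheme", "w:rFonts@w:ascii",
    "w:rFonts@w:hAnsiTheme", "w:rFonts@w:hAnsi",
    "w:rFonts@w:eastAsiaTheme", "w:rFonts@w:eastAsia",
    "w:b", "w:i", "w:sz",
})


def removable_properties(text: str, cs_specified: bool) -> frozenset:
    """Return the formatting that may be dropped on extraction."""
    # Assumption: whitespace is neutral and does not count as either kind.
    chars = [c for c in text if not c.isspace()]
    has_complex = any(COMPLEX_SCRIPT.match(c) for c in chars)
    has_other = any(not COMPLEX_SCRIPT.match(c) for c in chars)

    if not has_complex and not cs_specified:
        return CS_PROPS        # no complex script involved at all
    if cs_specified or not has_other:
        return NON_CS_PROPS    # complex script only, or w:cs is set
    return frozenset()         # mixed content: keep everything
```

With mixed content and no w:cs, the sketch removes nothing, since the issue text does not describe that case.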
On merge: if the translated content contains non-complex script characters and there is any complex script formatting present, the related non-complex script formatting has to be specified (e.g. w:bCs will trigger adding w:b if true). Similarly, if the translated run content contains complex script characters and there is any non-complex script formatting present, the related complex script formatting has to be specified (e.g. w:b will trigger adding w:bCs).
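The merge rule above can be sketched in the same spirit. The dictionary representation of run properties, the function names, and the choice to copy a property's value to its counterpart only when that value is truthy are assumptions made for illustration; they are not taken from the Okapi code base.

```python
import re

# Character class copied from the issue text.
COMPLEX_SCRIPT = re.compile(
    r"[\u0590-\u074F\u0780-\u07BF\u0900-\u109F\u1780-\u18AF"
    r"\u200C-\u200F\u202A-\u202F\u2670-\u2671\uFB1D-\uFB4F]"
)

# Complex script properties and their non-complex script counterparts.
CS_TO_PLAIN = {"w:bCs": "w:b", "w:iCs": "w:i", "w:szCs": "w:sz"}
PLAIN_TO_CS = {plain: cs for cs, plain in CS_TO_PLAIN.items()}


def _mirror(source: dict, target: dict, mapping: dict) -> None:
    """Copy each set property in `source` to its counterpart in `target`."""
    for name, counterpart in mapping.items():
        value = source.get(name)
        if value:  # skip absent and explicitly false properties
            target.setdefault(counterpart, value)


def complete_formatting(props: dict, translated: str) -> dict:
    """Return run properties completed for the translated content."""
    # Assumption: whitespace is neutral and does not count as either kind.
    chars = [c for c in translated if not c.isspace()]
    has_complex = any(COMPLEX_SCRIPT.match(c) for c in chars)
    has_other = any(not COMPLEX_SCRIPT.match(c) for c in chars)

    result = dict(props)
    if has_other:
        _mirror(props, result, CS_TO_PLAIN)   # e.g. w:bCs adds w:b
    if has_complex:
        _mirror(props, result, PLAIN_TO_CS)   # e.g. w:b adds w:bCs
    return result
```

Mirroring always reads from the original properties, so a counterpart added in the first step cannot trigger a second addition.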
By the way, some work on the w:rFonts run property has already been done (for more information, please refer to the corresponding part of net.sf.okapi.filters.openxml.RunMerger).
Comments (4)
- reporter - A related pull request #646 was opened.
- reporter - changed milestone to 1.45.0
- assigned issue to
- reporter - changed status to resolved. Pull request #646 was merged.
The bPreferenceAggressiveCleanup conditional parameter has no effect on the bold and italic run properties (w:b, w:i, w:iCs, w:bCs). Also, if it is set to false, the size run properties (w:sz, w:szCs) are groomed if the run content is not empty. If this parameter is true, the size properties are cleaned without any other condition (the check for emptiness is skipped).
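The effect of the parameter on the size properties can be summarized as a small predicate. This is a sketch of the behaviour as described above, with a hypothetical function name; it does not reproduce the filter's implementation.

```python
def size_properties_cleaned(aggressive_cleanup: bool, run_text: str) -> bool:
    """Tell whether w:sz and w:szCs are cleaned for a run.

    Bold and italic properties are not covered here, because the
    parameter has no effect on them.
    """
    if aggressive_cleanup:
        return True          # cleaned without the emptiness check
    return run_text != ""    # cleaned only when the run has content
```

For example, with the parameter set to false an empty run keeps its size properties, while a run containing text has them cleaned.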