Incorrect segmentation with inline code near break
Original issue 169 created by lukas.sta... on 2011-03-09T09:45:01.000Z:
Tikal from Okapi version 0.10 does not segment docx files correctly.
The attached SRX file contains a rule for dividing sentences into separate segments, but Tikal does not separate the sentences in some cases, e.g. if both are underlined and the second one is bold. Example documents are attached.
Commands used (on Linux):
tikal.sh -x sentence.docx -seg test.srx
tikal.sh -x tst-alter.docx -seg test.srx
Comments (5)
- changed title to Incorrect segmentation
- changed status to open
Comment [2.](https://code.google.com/p/okapi/issues/detail?id=169#c2) originally posted by @ysavourel on 2011-08-04T13:39:37.000Z:
The `\p{C}` is probably handling the marker of an inline code. This needs to be looked at, and possibly a generic meta-symbol should be used for inline markers.
Comment [3.](https://code.google.com/p/okapi/issues/detail?id=169#c3) originally posted by @ysavourel on 2012-07-05T03:12:25.000Z:
- attached sentence.docx.xlf
- attached tst-alter.docx.xlf
Comment 4. originally posted by @ysavourel on 2013-05-18T22:00:55.000Z:
0.22-SNAPSHOT contains the fix. The segmenter is now able to work around inline codes on a prospective segment boundary. Attached are two XLIFF files for the provided docx samples. Both appear properly segmented. Regards from Hradec Králové, Sergej
- changed status to resolved
Comment 5. originally posted by @ysavourel on 2013-05-18T22:03:38.000Z:
Comment [1.](https://code.google.com/p/okapi/issues/detail?id=169#c1) originally posted by lukas.sta... on 2011-03-09T10:19:58.000Z:
There are probably some control characters between the sentences. Adding `[\p{C}]*` to the rule leads to the required segmentation:
`<rule break="yes"> <beforebreak>(\.|\?|\!)</beforebreak> <afterbreak>[\p{Z}]+[\p{C}]*\p{Lu}</afterbreak> </rule>`
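
For anyone who wants to see the effect of the extra `[\p{C}]*` outside of Tikal, here is a rough Java sketch. It is not Okapi's segmenter code. It only emulates the rule with a lookbehind for the before-break part and a lookahead for the after-break part. The class name, the `countBreaks` helper, and the use of U+E101 as a stand-in for an inline-code marker are assumptions made for illustration, as is the "rule without `\p{C}`" variant, which is only a guess at what the rule looked like before the change.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SrxRuleSketch {
    // Hypothetical stand-in for an inline-code marker: U+E101 is a
    // Private Use Area character (general category Co, part of \p{C}).
    static final char MARKER = '\uE101';

    // Assumed earlier form of the rule: no allowance for \p{C} characters.
    static final Pattern WITHOUT_C =
            Pattern.compile("(?<=[.?!])(?=[\\p{Z}]+\\p{Lu})");

    // Rule from the comment above: optional \p{C} run before the uppercase letter.
    static final Pattern WITH_C =
            Pattern.compile("(?<=[.?!])(?=[\\p{Z}]+[\\p{C}]*\\p{Lu})");

    // Counts the positions where the emulated rule would place a break.
    static int countBreaks(Pattern rule, String text) {
        Matcher m = rule.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String plain = "First sentence. Second sentence.";
        String coded = "First sentence. " + MARKER + "Second sentence.";

        System.out.println("plain, without \\p{C}: " + countBreaks(WITHOUT_C, plain)); // 1
        System.out.println("plain, with \\p{C}:    " + countBreaks(WITH_C, plain));    // 1
        System.out.println("coded, without \\p{C}: " + countBreaks(WITHOUT_C, coded)); // 0
        System.out.println("coded, with \\p{C}:    " + countBreaks(WITH_C, coded));    // 1
    }
}
```

In this sketch the marker character sits between the space and the uppercase letter, so the rule without `[\p{C}]*` finds no break there. That would be consistent with the observation that the sentences are not separated when the formatting changes at the boundary.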