Incorrect segmentation with inline code near break

Issue #169 resolved
Former user created an issue

Original issue 169 created by lukas.sta... on 2011-03-09T09:45:01.000Z:

Tikal (Okapi version 0.10) does not segment DOCX files correctly.
The attached SRX file contains a rule for splitting sentences into separate segments, but Tikal does not separate the sentences in some cases, e.g. if both are underlined and the second one is also bold. The example documents are attached.
Commands used (on Linux):
tikal.sh -x sentence.docx -seg test.srx
tikal.sh -x tst-alter.docx -seg test.srx

Comments (5)

  1. Former user Account Deleted

    Comment [1.](https://code.google.com/p/okapi/issues/detail?id=169#c1) originally posted by lukas.sta... on 2011-03-09T10:19:58.000Z:

    There are probably some control characters between the sentences. Adding `[\p{C}]*` to the rule leads to the required segmentation:

    `<rule break="yes"> <beforebreak>(\.|\?|\!)</beforebreak> <afterbreak>[\p{Z}]+[\p{C}]*\p{Lu}</afterbreak> </rule>`
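
The effect of the extra `[\p{C}]*` can be sketched outside Tikal with plain `java.util.regex`. This is only an illustration under stated assumptions, not how Okapi applies SRX rules: the class `SrxRuleDemo` and the helper `firstBreak` are hypothetical, the "strict" pattern is simply the quoted rule with the addition removed, and the private-use character U+E101 is an assumed stand-in for whatever character the inline code leaves between the two sentences.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SrxRuleDemo {
    // <beforebreak> pattern from the rule quoted in the comment above.
    static final String BEFORE_BREAK = "(\\.|\\?|\\!)";
    // The quoted <afterbreak> pattern with the added [\p{C}]* removed.
    static final String AFTER_STRICT = "[\\p{Z}]+\\p{Lu}";
    // The <afterbreak> pattern exactly as quoted, including [\p{C}]*.
    static final String AFTER_LENIENT = "[\\p{Z}]+[\\p{C}]*\\p{Lu}";

    // Hypothetical helper: returns the offset of the first break position,
    // or -1 if the rule does not fire anywhere in the text.
    static int firstBreak(String text, String afterBreak) {
        // The break sits right after the <beforebreak> match, provided the
        // <afterbreak> pattern matches the text that follows it.
        Pattern rule = Pattern.compile(BEFORE_BREAK + "(?=" + afterBreak + ")");
        Matcher m = rule.matcher(text);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        // Assumption: U+E101 stands in for the inline code marker.
        String marker = "\uE101";
        String text = "One. " + marker + "Two.";

        System.out.println(firstBreak(text, AFTER_STRICT));   // -1: no break found
        System.out.println(firstBreak(text, AFTER_LENIENT));  // 4: break after "One."
    }
}
```

With no marker between the sentences, both patterns report the same break offset, so the added class only matters when something non-printing sits in front of the capital letter.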

  2. Former user Account Deleted

    Comment 4. originally posted by @ysavourel on 2013-05-18T22:00:55.000Z:

    0.22-SNAPSHOT contains the fix. The segmenter is now able to work around inline codes at a prospective segment boundary. Attached are two XLIFF files for the provided DOCX samples; both appear properly segmented. Regards from Hradec Králové, Sergej
