Newlines are replaced with empty string in merge with tikal using openxml filter

Issue #678 new
Csaba Oravecz created an issue

We segment sentences in docx documents into separate lines, then merge the translated sentences back into the original format with tikal using the okf_openxml filter. However, during the merge the newline separating the segments is deleted instead of being replaced by a space. This does not happen with the openoffice or plaintext filters, where a space is inserted in place of the newline. Is this intended behavior, or is it a bug?
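
To make the difference concrete, here is a minimal sketch in plain Python. It is not Okapi or tikal code; the helper `merge_segments` and the sample sentences are hypothetical and only model the two outcomes described above.

```python
def merge_segments(segmented_text: str, replacement: str) -> str:
    """Join newline-separated segments with the given replacement string."""
    return replacement.join(segmented_text.split("\n"))


# Hypothetical segmented text: one sentence per line.
segmented = "First sentence.\nSecond sentence."

# Outcome reported for the openoffice and plaintext filters:
# the newline between segments becomes a space.
expected = merge_segments(segmented, " ")

# Outcome reported for okf_openxml:
# the newline between segments is dropped, so the sentences run together.
observed = merge_segments(segmented, "")

print(expected)
print(observed)
```

The second result has no space between the two sentences, which is the symptom reported here for docx.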

Comments (6)

  1. Chase Tingley
    • removed milestone

    Hi Csaba,

    Let me make sure I understand the process:

    • You extract text from the source docx
    • During translation, you insert newlines at various places to indicate where segment boundaries are
    • When you merge the translated file, these newlines have been removed.

    Is that correct? And the behavior that you'd like to see is the newlines are converted to spaces, rather than removed entirely?

    (I've cleared the Milestone field, which is for fixed bugs.)

  2. Csaba Oravecz reporter

    Hi, yes. The extraction is done with tikal (-xm -seg), using for example defaultSegmentation.srx, and the segmented output contains the inserted newlines. When merging back with a docx source and the okf_openxml filter, the newlines are indeed removed. With a similar LibreOffice (odt) or plaintext source (using okf_openoffice or okf_plaintext), the newlines are converted to spaces. I would like to get the same behaviour with docx as well.

  3. Xavier Richez

    $> tikal.sh -fc okf_openxml -seg -xm example.docx -sl en

    $> tikal.sh -fc okf_openxml -seg -lm -sl en -ie utf8 -oe utf8 -overtrg -from example.docx.en example.docx

  4. Xavier Richez

    I could have added a message, but I don't think any further explanation is needed ^^

    So far, it seems to occur only when using the -seg option.
