- removed milestone
Newlines are replaced with an empty string when merging with tikal using the openxml filter
We segment the sentences in docx documents into separate lines, then merge the translated sentences back into the original format with tikal using the okf_openxml filter. However, during the merge, the newline separating the segments is deleted instead of being replaced by a space. This does not happen with the openoffice or plaintext filters, where a space is inserted in place of the newline. Is this intended behavior, or is it a bug?
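To make the difference concrete, here is a minimal sketch that mimics the two outcomes with standard shell tools. It does not call tikal and says nothing about how the filters work internally; the sample sentences and variable names are made up for illustration.

```shell
# Two segments, one per line, standing in for the segmented extraction output.
# (Hypothetical sample text, not taken from example.docx.)
segments='First sentence.
Second sentence.'

# Mimics the result reported for okf_openxml: the newline is dropped,
# so the two sentences run together.
removed=$(printf '%s' "$segments" | tr -d '\n')

# Mimics the result reported for okf_openoffice and okf_plaintext:
# the newline is turned into a space.
spaced=$(printf '%s' "$segments" | tr '\n' ' ')

echo "$removed"   # First sentence.Second sentence.
echo "$spaced"    # First sentence. Second sentence.
```

The second form is the one we would expect from the docx merge as well.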
Comments (6)
-
-
reporter Hi, yes, the extraction is done with tikal (-xm -seg), using for example defaultSegmentation.srx, and the segmented output contains a newline between segments. When merging back with the docx source and the okf_openxml filter, the newlines are indeed removed. With a similar LibreOffice (odt) or plaintext source (using okf_openoffice or okf_plaintext), the newlines are converted to spaces. I would like to get the same behaviour with docx as well.
-
- attached example.docx
$> tikal.sh -fc okf_openxml -seg -xm example.docx -sl en
$> tikal.sh -fc okf_openxml -seg -lm -sl en -ie utf8 -oe utf8 -overtrg -from example.docx.en example.docx
-
Oh, I see. Thanks for the tikal command, that's very helpful.
-
I could have added a message, but I think no further explanation is needed ^^
So far, it seems to occur only when using the -seg option.
-
@tingley, any idea how to solve this?
Hi Csaba,
Let me make sure I understand the process:
Is that correct? And the behavior that you'd like to see is that the newlines are converted to spaces, rather than removed entirely?
(I've cleared the Milestone field, which is for fixed bugs.)