DOCX/OpenXML: Standalone tags have paired meaning and defined order but are translated as "<x/>"

Issue #337 new
Former user created an issue

Original issue 337 created by karlis.ged... on 2013-05-09T10:04:44.000Z:

I am using Tikal for DOCX translation.

This will be slightly harder to explain.
The problem starts with tags like these:
<w:commentRangeStart w:id="0"/>
<w:commentRangeEnd w:id="0"/>

<w:bookmarkStart w:id="0" w:name="BookmarkName"/>
<w:bookmarkEnd w:id="0"/>

<w:fldChar w:fldCharType="begin"/>
<w:fldChar w:fldCharType="end"/>

<w:proofErr w:type="gramStart"/> (probably should be filtered as unnecessary 'smart tags')
<w:proofErr w:type="gramEnd"/>

They are being transformed to:
<x id="0"/>
<x id="1"/>

There is no way to distinguish them from truly standalone tags (standalone bookmarks, pictures, etc.) that are anchored to nearby words or phrases and should move freely with them as the text is translated.

I am not sure whether these tags can be represented as paired tags, but as-is there is no guarantee that even a human translator would be able to translate the text so that the formatting is restored correctly.
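
As an illustration of how such tags might be recognised as pairs, here is a minimal Python sketch. It is not the filter's actual code, and the function name find_pairs is hypothetical. It assumes that commentRangeStart/End and bookmarkStart/End can be matched on w:id within their own kind, and that fldChar begin/end can be matched by nesting. w:proofErr is left out, since it would probably be filtered anyway.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Start -> end element names that are matched on w:id (tags from the report).
ID_PAIRS = {
    "commentRangeStart": "commentRangeEnd",
    "bookmarkStart": "bookmarkEnd",
}


def w(name):
    """Return the namespace-qualified form of a w: element or attribute name."""
    return "{%s}%s" % (W, name)


def find_pairs(paragraph_xml):
    """Return (kind, start_index, end_index) for every paired marker.

    Indices are positions in document order among all elements of the fragment.
    """
    root = ET.fromstring(paragraph_xml)
    open_by_id = {}   # (end element name, w:id) -> index of the start marker
    field_stack = []  # indices of fldChar "begin" markers that are still open
    pairs = []
    for index, elem in enumerate(root.iter()):
        if not elem.tag.startswith("{" + W + "}"):
            continue
        name = elem.tag.split("}", 1)[1]
        if name in ID_PAIRS:
            open_by_id[(ID_PAIRS[name], elem.get(w("id")))] = index
        elif name in ID_PAIRS.values():
            start = open_by_id.pop((name, elem.get(w("id"))), None)
            if start is not None:
                pairs.append((name[:-3], start, index))  # drop the "End" suffix
        elif name == "fldChar":
            kind = elem.get(w("fldCharType"))
            if kind == "begin":
                field_stack.append(index)
            elif kind == "end" and field_stack:
                pairs.append(("fldChar", field_stack.pop(), index))
    return pairs


# Hypothetical paragraph: a bookmark and a comment range that both use w:id="0".
SAMPLE = (
    '<w:p xmlns:w="%s">' % W
    + '<w:bookmarkStart w:id="0" w:name="BookmarkName"/>'
    + '<w:r><w:t>HEADING</w:t></w:r>'
    + '<w:bookmarkEnd w:id="0"/>'
    + '<w:commentRangeStart w:id="0"/>'
    + '<w:r><w:t>ONE</w:t></w:r>'
    + '<w:commentRangeEnd w:id="0"/>'
    + '</w:p>'
)

print(find_pairs(SAMPLE))
```

The lookup is keyed on the element kind as well as the id because, as in the tags listed above, a bookmark and a comment range can both carry w:id="0".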

To keep up with the issue formatting:
The expected result would be at least some way to tell that these tags must stay in a specific order and enclose specific words. Preferably they would be normal paired tags.

The attached file contains a lot of these tags, and the problem is easily reproducible if the "HEADING ONE" paragraph is translated as "ONE HEADING":
HEADING <x id="1"/>ONE<x id="2"/><x id="3"/>
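
To make the risk concrete, here is a small Python sketch of an order check that a tool could run on a translated segment. It is only an illustration: the function name misordered_pairs is hypothetical, and the assumption that placeholders 1 and 2 form a start/end pair around "ONE" is made up for the example, since the extracted segment itself carries no pairing information.

```python
import re

# Matches standalone placeholders such as <x id="1"/> in a segment.
PLACEHOLDER = re.compile(r'<x id="(\d+)"/>')


def misordered_pairs(target, pairs):
    """Return the (start_id, end_id) pairs that are broken in the target.

    A pair is broken if either placeholder is missing or the end placeholder
    appears before the start placeholder.
    """
    position = {m.group(1): m.start() for m in PLACEHOLDER.finditer(target)}
    broken = []
    for start_id, end_id in pairs:
        start, end = position.get(start_id), position.get(end_id)
        if start is None or end is None or end < start:
            broken.append((start_id, end_id))
    return broken


# Assumption for illustration only: ids 1 and 2 wrap the word "ONE".
PAIRS = [("1", "2")]

good_target = '<x id="1"/>ONE<x id="2"/><x id="3"/> HEADING'
bad_target = 'ONE<x id="2"/><x id="3"/> HEADING<x id="1"/>'

print(misordered_pairs(good_target, PAIRS))
print(misordered_pairs(bad_target, PAIRS))
```

The check only covers ordering. It does not verify that the pair still encloses the right words, which is the other half of the expected result described above.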

Comments (3)

  1. Former user Account Deleted

    Comment 1. originally posted by @ysavourel on 2013-05-15T18:30:30.000Z:

    Yes, I think it would probably be good if the filter did a better job of mapping these tags to pairs where possible.

    I also agree w:proofErr should be filtered out. There are a few other things that are also cluttering up the segments we produce right now (<w:rsId> and some other tags related to spelling/grammatical errors). We should split that off into a separate bug. Comments are an interesting case. I think you could argue that a comment in the source document can be safely stripped in the target, but there may be some use case I haven't thought of.

  2. Former user Account Deleted

    Comment 2. originally posted by karlis.ged... on 2013-05-16T14:52:39.000Z:

    I would like to add that the comments themselves seem to be handled correctly (comments are translated from comments.xml like normal paragraphs, and the pointers in the document work), and translating them is useful.
    The actual problem here is that the comment is saved as a comment field, which marks the part of the text the comment refers to (usually a single word), with a pointer to the comment itself in the middle.
    Because they are all translated as independent standalone tags, the text can be translated so that the end of the comment field comes before its start. In those cases the generated XML is invalid, and even if it were valid the document would be damaged (part of the information is lost).
    In short:
    Comments are useful, but the "comment range" handling should be improved with some indication that the tags should not be moved/mixed.
