#160 - okf_XML its:withinTextRule - Incorrect XLIFF tagging

Former user Account Deleted

Comment [1.](https://code.google.com/p/okapi/issues/detail?id=160#c1) originally posted by @ysavourel on 2011-01-21T18:15:15.000Z:

The <ph> element is an inline code and reflects an XML element kept within the text.

If the output were using the g/x notation, that code would be <x...> instead of <ph...>, so it may be correct depending on the context. (You can change the output notation in the Options if you are using the Create Translation Package utility.)
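To illustrate the difference between the two notations, here is a minimal Python sketch, not Okapi code. The helper `inline_code` and its `notation` argument are hypothetical names; the sketch only assumes the usual XLIFF 1.2 convention that `<ph>` carries the native markup as escaped text while `<x/>` is an empty placeholder.

```python
import xml.etree.ElementTree as ET


def inline_code(code_id, native, notation="ph"):
    """Render one standalone inline code (hypothetical helper, not an Okapi API)."""
    if notation == "gx":
        # g/x notation: empty placeholder, the native markup is not kept in the segment
        return ET.Element("x", id=str(code_id))
    # ph notation: the native markup is stored as the (escaped) text of <ph>
    el = ET.Element("ph", id=str(code_id))
    el.text = native
    return el


native = '<icEo inAR="b_f" inGP="b_t"/>'
ph = inline_code(7, native)
x = inline_code(7, native, notation="gx")

print(ET.tostring(ph, encoding="unicode"))
print(ET.tostring(x, encoding="unicode"))
```

In both cases the element stands for the same XML element kept within the text; only the way it is represented in the XLIFF differs.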

Would it be possible to have an example of the ITS rules and the XML input file? Thanks, -ys


attached works.fprm
attached test.inx.xlf
attached problem.fprm
attached test.inx

Comment [2.](https://code.google.com/p/okapi/issues/detail?id=160#c2) originally posted by al...@yahoo.com on 2011-02-02T19:53:28.000Z:

Hi Yves,

I have attached the XML file being used. This is an Adobe InDesign export file (INX), which is XML-based. I have also attached my approach to extracting the text content (works.fprm); however, it is not optimal. I am looking to group all text within a <cflo> tag and make the txsr and pcnt tags inline. The attached rules (problem.fprm) describe my goal, but the result is that all content ends up within <ph> tags.

One problem is that there are also inline codes that need to be hidden from translation. These codes are also grouped with the <ph> tag, so there is no way to expose one without exposing both. I'm trying to have a different tag for inline codeFinder matches and withinTextRules matches.

My ideal is to be able to expose text within <pcnt> tags, while keeping the <pcnt> and <txsr> tags inline (to prevent segmentation breaks), while also hiding the 'codes' described in the fprms.

It seems like a defect that the withinTextRule(s) not only put the inline tag into <ph> or <x/g>, but also the content between the tags that the rule identifies.

The attached xlf is the result using the problem.fprm rules.
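The goal described above can be pictured with a small Python sketch. The sample markup is invented and heavily simplified (it is not taken from test.inx), and `strip_prefix` is a hypothetical helper; the point is only to contrast one text unit per <pcnt> with one text unit per <cflo>.

```python
import xml.etree.ElementTree as ET

# Invented, simplified INX-like fragment: one sentence split across three <pcnt> tags
SAMPLE = (
    "<cflo>"
    '<txsr font="c_Arial Black"><pcnt>c_Typically text uses a </pcnt></txsr>'
    '<txsr font="c_Arial Narrow"><pcnt>c_font inline</pcnt></txsr>'
    '<txsr font="c_Arial Black"><pcnt>c_ change.</pcnt></txsr>'
    "</cflo>"
)


def strip_prefix(text):
    """Drop the leading 'c_' marker from pcnt content (hypothetical helper)."""
    return text[2:] if text.startswith("c_") else text


cflo = ET.fromstring(SAMPLE)

# One unit per <pcnt>: the sentence ends up as three fragments
fragments = [strip_prefix(p.text or "") for p in cflo.iter("pcnt")]

# One unit per <cflo>: the same text stays together as a single sentence
grouped = "".join(fragments)

print(fragments)
print(grouped)
```

Treating txsr and pcnt as inline is what keeps the sentence in a single unit; the tags themselves would then appear as inline codes inside it.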


Comment [3.](https://code.google.com/p/okapi/issues/detail?id=160#c3) originally posted by @ysavourel on 2011-02-02T20:35:32.000Z:

I think the problem is that you have to make the distinction between the extractable <pcnt> and the non-extractable one. And that seems to be done by the "c\_" at the front of the content.

Something like this:

    <its:translateRule selector="*" translate="no"/>
    <its:translateRule selector="cflo" translate="yes"/>
    <its:translateRule selector="SyPf|cMep|icEo" translate="no"/>
    <its:translateRule selector="pcnt[starts-with(.,'c_')]" translate="yes"/>
    <its:withinTextRule selector="cflo/descendant::*" withinText="yes"/>

seems to be working as you requested. You get:

<source xml:lang="en-us"><ph id="1"><SyPf omAL="b\_f" omBZ="U\_12" sdir="e\_L2Rd" sgON="e\_Txft" sorn="e\_horz"/></ph> <bpt id="2"><cMep></bpt><ph id="3"> </ph><bpt id="4"><pcnt></bpt><ph id="5"><![CDATA[]]></ph><ept id="4"></pcnt></ept><ph id="6"> </ph><ept id="2"></cMep></ept> <ph id="7"><icEo inAR="b\_f" inGP="b\_t"/></ph> <bpt id="8"><txsr crst="o\_u67" font="c\_Arial Black" prst="o\_u6b" ptfs="c\_Regular"></bpt><ph id="9"> </ph><bpt id="10"><pcnt></bpt><ph id="20">c\_</ph>This is a sample INX export of a InDesign file. Typically text flow is not broken up in <pcnt> tags, however, using a different <ept id="10"></pcnt></ept><ph id="11"> </ph><ept id="8"></txsr></ept> <bpt id="12"><txsr crst="o\_u67" font="c\_Arial Narrow" prst="o\_u6b" ptfs="c\_Italic"></bpt><ph id="13"> </ph><bpt id="14"><pcnt></bpt><ph id="21">c\_</ph>font inline<ept id="14"></pcnt></ept><ph id="15"> </ph><ept id="12"></txsr></ept> <bpt id="16"><txsr crst="o\_u67" font="c\_Arial Black" prst="o\_u6b" ptfs="c\_Regular"></bpt><ph id="17"> </ph><bpt id="18"><pcnt></bpt><ph id="22">c\_</ph> puts the alternate font text in a new <pcnt> tag, as well as the following text.

This is not desireable for translation as it breaks up a sentence and results in sentence fragments in a translation memory.<ept id="18"></pcnt></ept><ph id="19"> </ph><ept id="16"></txsr></ept></source>
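As a rough illustration of how the translate rules above interact, here is a Python sketch that mimics them with plain predicates instead of a real XPath/ITS engine. It assumes, as in ITS, that when several global rules select the same node the later rule wins. The sample document is invented, and `RULES`, `string_value`, and `translate_flags` are hypothetical names, not part of Okapi.

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified INX-like sample (not taken from test.inx)
SAMPLE = (
    "<doc><cflo>"
    '<txsr font="c_Arial"><pcnt>c_Hello world</pcnt></txsr>'
    "<SyPf/>"
    "<txsr><pcnt>e_Txft</pcnt></txsr>"
    "</cflo><other>skip</other></doc>"
)


def string_value(el):
    """Approximate the XPath string-value of an element (all descendant text)."""
    return "".join(el.itertext())


# (predicate, translate) pairs, in the same order as the translate rules
RULES = [
    (lambda el: True, False),                                # selector="*"
    (lambda el: el.tag == "cflo", True),                     # selector="cflo"
    (lambda el: el.tag in {"SyPf", "cMep", "icEo"}, False),  # selector="SyPf|cMep|icEo"
    (lambda el: el.tag == "pcnt"
        and string_value(el).startswith("c_"), True),        # pcnt[starts-with(.,'c_')]
]


def translate_flags(root):
    """Return (tag, translate) per element; the last matching rule wins."""
    result = []
    for el in root.iter():
        flag = None
        for matches, translate in RULES:
            if matches(el):
                flag = translate
        result.append((el.tag, flag))
    return result


for tag, flag in translate_flags(ET.fromstring(SAMPLE)):
    print(tag, "yes" if flag else "no")
```

Because the first rule matches every element, each element gets an explicit value and inheritance never comes into play in this sketch. Only the <pcnt> whose content starts with "c_" is switched back to translatable.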

But to be honest, it's highly unlikely that you'll manage to extract INX files properly using ITS, unless they are very simple.

INX is one of those XML formats that is so bad for text extraction that you have to write a specific filter for it. You have c\_ all over the place and those weird processing instructions in the middle of the text...

I know INX is your input format, but, if you have the possibility, you may want to look at IDML as the export format for InDesign. We do have a filter specifically for it.

-ys


changed status to resolved

Comment [4.](https://code.google.com/p/okapi/issues/detail?id=160#c4) originally posted by @ysavourel on 2011-05-26T12:35:50.000Z:
