XML to HTML subfilter with regex replace missing data - Rainbow Translation Kit Creation to XLIFF

Issue #231 new
Former user created an issue

Original issue 231 created by esogasimm... on 2012-05-18T02:39:02.000Z:

What steps will reproduce the problem?

  1. With an XML File that has CDATA in it as follows:

<?xml version="1.0" encoding="UTF-8"?>
<message-templates>

<![CDATA[

<strong> this is a bolded text between html strong tags </strong>

Here we want to pull out the freemarker into tags, but pass through
the text and the brace value: <#if contactInfo??> or
${contactInfo}</#if>.

Here we ensure that we get the single else and handle the double quotes:
<#if smb.maxAlarmLevel == "notice">Important<#else>Urgent</#if>
information regarding ${sbsName}

Another test with lots of regex like text:
<#if commentAuthorURL?has_content>
<a href="${commentAuthorURL}">${commentAuthorName}</a>
<#else>${commentAuthorName}</#if>
<a href="${commentURL}">View all comments on this event</a>

Ensure we pull don't break early on parentheses:
${totalTaskCount} upcoming <#if (overdueTasks.size() > 0)>
(${overdueTasks.size()} overdue)</#if>

]]>

</message-templates>

  2. Use a custom XML filter that has an HTML subfilter:

assumeWellformed: true
preserve_whitespace: true
global_cdata_subfilter: okf_html@spaces_freemarker_regex
attributes:
  xml:lang:
    ruleTypes: [ATTRIBUTE_WRITABLE]
  xml:id:
    ruleTypes: [ATTRIBUTE_ID]
  id:
    ruleTypes: [ATTRIBUTE_ID]
  xml:space:
    ruleTypes: [ATTRIBUTE_PRESERVE_WHITESPACE]
    preserve: ['xml:space', EQUALS, preserve]
    default: ['xml:space', EQUALS, default]

  3. And an HTML subfilter that has a regex to find those FreeMarker tags (this regex appears to work fine in the regex editor):

assumeWellformed: true
preserve_whitespace: true
useCodeFinder: true
codeFinderRules: |-
  #v1
  count.i=1
  rule0=(\A([^>]|([^&][^g][^t][^;]))*?(?<!\(\) )>)|((<|<)[\w!?/#@ ].*?(?<!\(\) )(>|\Z))


Notice that the generated pack1 files are missing the first part of the CDATA block, everything up to the first FreeMarker tag (<#if contactInfo??>). Namely, we are missing this text:

<strong> this is a bolded text between html strong tags </strong>

Here we want to pull out the freemarker into tags, but pass through the text and the brace value:

<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:okp="okapi-framework:xliff-extensions">
<file original="/xml-freemarker.xml" source-language="en-us" target-language="fr-fr" datatype="xml">
<body>
<group id="P1BC00-1" resname="sub-filter:">
<trans-unit id="tu1" xml:space="preserve">
<source xml:lang="en-us"><x id="1"/> or
${contactInfo}<x id="2"/>.

Here we ensure that we get the single else and handle the double quotes:
<x id="3"/>Important<x id="4"/>Urgent<x id="5"/>
information regarding ${sbsName}

Another test with lots of regexee text:
<x id="6"/>
<g id="7">${commentAuthorName}</g>
<x id="8"/>${commentAuthorName}<x id="9"/>
<g id="10">View all comments on this event</g>

Ensure we pull don't break early on parentheses:
${totalTaskCount} upcoming <x id="11"/>
(${overdueTasks.size()} overdue)<x id="12"/>

</source>
</trans-unit>
</group>
</body>
</file>
</xliff>
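
The output above is consistent with the first alternative of rule0 (the one anchored with \A) matching everything from the start of the text up to the first ">" and turning all of it into a single inline code. The sketch below illustrates that reading with Python's `re` module rather than the Java regex engine Okapi uses; the two engines should agree on the constructs in this rule, but that is an assumption. The `sample` string and the `find_codes` helper are hypothetical stand-ins: the sketch assumes the code finder sees text in which the <strong> tags have already been replaced by inline codes, so the first literal ">" is the one closing the first FreeMarker tag.

```python
import re

# rule0 copied from the subfilter configuration in this report.
RULE0 = r'(\A([^>]|([^&][^g][^t][^;]))*?(?<!\(\) )>)|((<|<)[\w!?/#@ ].*?(?<!\(\) )(>|\Z))'

# Hypothetical stand-in for the text handed to the code finder
# (assumption: HTML tags such as <strong> are already inline codes here).
sample = (
    "Here we want to pull out the freemarker into tags, but pass through "
    "the text and the brace value: <#if contactInfo??> or ${contactInfo}</#if>."
)


def find_codes(pattern, text):
    """Return every span the pattern would turn into an inline code."""
    return [m.group(0) for m in re.finditer(pattern, text)]


codes = find_codes(RULE0, sample)
for code in codes:
    print(repr(code))

# The first match starts at offset 0 and runs through the first '>',
# so the leading prose is captured as part of the code.
```

Under these assumptions the first reported span contains the translatable prose as well as the tag, which would explain why that prose ends up hidden behind `<x id="1"/>` in the XLIFF.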

What version of the product are you using? On what operating system?

Rainbow - Okapi Localization Toolbox
Version 6.0.17-Snapshot
Windows

Rainbow - Okapi Localization Toolbox
Version 0.17 gtk2 Linux x86 64

Please provide any additional information below.

Comments (4)

  1. Former user Account Deleted

    Comment [3.](https://code.google.com/p/okapi/issues/detail?id=231#c3) originally posted by gsin...@lingotek.com on 2012-05-29T17:41:24.000Z:

    Note: if you change the regex to look for one or more spaces, the subfilter and regex replace does in fact work (see [\s]+?).

    However, if you omit this, or if you look for zero or more spaces ([\s]*?), it fails:

    ([\s]+?\A([^>]|([^&][^g][^t][^;]))*?(?<!\(\) )>)|((<|<)[\w!?/#@ ].*?(?<!\(\) )(>|\Z))

    <![CDATA[

    <strong> this is a bolded text between html strong tags </strong>

    Here we want to pull out the freemarker into tags, but pass through the text and the brace value: <#if contactInfo??> or ${contactInfo}</#if>.

    Here we ensure that we get the single else and handle the double quotes: <#if smb.maxAlarmLevel == "notice">Important<#else>Urgent</#if> information regarding ${sbsName}

    Another test with lots of regex like text: <#if commentAuthorURL?has_content> <a href="${commentAuthorURL}">${commentAuthorName}</a> <#else>${commentAuthorName}</#if> <a href="${commentURL}">View all comments on this event</a>

    Ensure we pull don't break early on parentheses: ${totalTaskCount} upcoming <#if (overdueTasks.size() > 0)> (${overdueTasks.size()} overdue)</#if>

    ]]>

    YIELDS:

    <source xml:lang="en-us">

    <g id="1"> this is a bolded text between html strong tags </g>

    Here we want to pull out the freemarker into tags, but pass through the text and the brace value: <x id="2"/> or ${contactInfo}<x id="3"/>.

    Here we ensure that we get the single else and handle the double quotes: <x id="4"/>Important<x id="5"/>Urgent<x id="6"/> information regarding ${sbsName}
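
One possible explanation for the difference between the two variants: a pattern that requires at least one whitespace character before \A can never match, because nothing precedes the start of the input, so the one-or-more form effectively switches off the first alternative. The zero-or-more form leaves it active. The sketch below compares the two using Python's `re` module (not the Java engine Okapi runs on, so treat the equivalence as an assumption); `sample` and `find_codes` are hypothetical names introduced only for illustration.

```python
import re

# One-or-more whitespace before \A: reported as working in the comment above.
ONE_OR_MORE = r'([\s]+?\A([^>]|([^&][^g][^t][^;]))*?(?<!\(\) )>)|((<|<)[\w!?/#@ ].*?(?<!\(\) )(>|\Z))'

# Zero-or-more whitespace before \A: reported as failing.
ZERO_OR_MORE = r'([\s]*?\A([^>]|([^&][^g][^t][^;]))*?(?<!\(\) )>)|((<|<)[\w!?/#@ ].*?(?<!\(\) )(>|\Z))'

# Hypothetical one-line stand-in for the text seen by the code finder.
sample = " the brace value: <#if contactInfo??> or ${contactInfo}</#if>."


def find_codes(pattern, text):
    """Return every span the pattern would turn into an inline code."""
    return [m.group(0) for m in re.finditer(pattern, text)]


working = find_codes(ONE_OR_MORE, sample)
failing = find_codes(ZERO_OR_MORE, sample)

print("one or more :", working)
print("zero or more:", failing)
```

If this reading is right, dropping the \A-anchored alternative from the rule altogether should behave like the one-or-more variant, though that has not been verified against Rainbow itself.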
