okf_html subfiltering results in & escaped after merging

Issue #827 new
Jack Cole created an issue

Explanation

When creating a translation kit and then merging the kit back into its original file, any ampersands in the original file's contents are converted to &amp;. Single quotes are also converted to their escaped form.

I’ve attached a package containing this example.

Note: I’m on dev branch when doing this.

What should happen

If I create a translation kit with ampersands in the original file, and don’t change the work file, the done file should have those same ampersands unaltered.

Steps to reproduce

  1. Create a JSON file with some strings that contain &s, and other encoded HTML characters like &amp;. You can use the example below:
    {"example1": "Test &&&&&& & & & & < "}
  2. Open Rainbow and add the JSON file to the list of files
  3. Open input document properties for that JSON
  4. Click Create…
  5. Give it any name and hit OK
  6. In the JSON Filter Parameters Window, select Content Processing
  7. Select “Process text content with this sub-filter…” and input “okf_html”
  8. Hit OK, and hit OK again to close Input Document Properties
  9. Select Utilities > Translation Kit Creation
  10. Use XLIFF or XLIFF2 in the type of package and hit Execute
  11. Remove the JSON file and Add the Manifest file that was created
  12. Utilities > Translation Kit Post-Processing
  13. Execute
  14. Open the done folder and look at the resulting JSON file. It should look something like this:
    {"example1": "Test &amp;&amp;&amp;&amp;&amp;&amp; &amp; &amp; &amp; &amp; &lt; "}
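One way to confirm the result without reading the done file by eye is to compare the string values of the original and merged JSON. This is a minimal sketch: the function name `changed_values` and the shortened sample strings are illustrative assumptions, not part of the attached package.

```python
import json

def changed_values(original_json: str, merged_json: str) -> dict:
    """Return {key: (original, merged)} for each top-level value that differs."""
    original = json.loads(original_json)
    merged = json.loads(merged_json)
    return {
        key: (value, merged.get(key))
        for key, value in original.items()
        if merged.get(key) != value
    }

# Shortened, hypothetical stand-ins for the original file and the done file.
original = '{"example1": "Test && & <"}'
merged = '{"example1": "Test &amp;&amp; &amp; &lt;"}'
print(changed_values(original, merged))
```

An empty dictionary would mean the merge left every value untouched, which is the behavior requested above.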

Comments (8)

  1. Jack Cole reporter
    • edited description

    Explanation

    When creating a translation kit and then merging the kit back into its original file, any ampersands in the original file’s contents are converted to &amp;. Single quotes are also converted to their escaped form.

    I’ve attached a package containing this example.

    Note: I’m on dev branch when doing this.

    What should happen

    If I create a translation kit with ampersands in the original file, and don’t change the work file, the done file should have those same ampersands unaltered.

    Steps to reproduce

    1. Create a JSON file with some strings that contain &s, and other encoded HTML characters like &amp;. You can use the example below:
      {"example1": "Test &&&&&& & & & & <"}
    2. Open Rainbow and add the JSON file to the list of files
    3. Open input document properties for that JSON
    4. Click Create…
    5. Give it any name and hit OK
    6. In the JSON Filter Parameters Window, select Content Processing
    7. Select “Process text content with this sub-filter…” and input “okf_html”
    8. Hit OK, and hit OK again to close Input Document Properties
    9. Select Utilities > Translation Kit Creation
    10. Use XLIFF or XLIFF2 in the type of package and hit Execute
    11. Remove the JSON file and Add the Manifest file that was created
    12. Utilities > Translation Kit Post-Processing
    13. Execute
    14. Open the done folder and look at the resulting JSON file. It should look something like this:
      {"example1": "Test &amp;&amp;&amp;&amp;&amp;&amp; &amp; &amp; &amp; &amp; &lt;"}
  2. Jack Cole reporter

    Hey Clement,

    I loaded up Rainbow after creating a filter config, and saw the config in the list. I then set the JSON subfilter to okf_html-test_filter and tried all 4 levels of quote mode.

    Unfortunately, the & symbols were all still converted to &amp; in the done file.

  3. Kuro Kurosaka (BH Lab)

    In HTML, && (or more) is illegal, I think, because & is the leading character of a character reference, and a character sequence that forms a valid character reference is expected to follow. The HTML Filter could just stop parsing when it sees an illegal sequence. But it actually tries to be lenient and treats it as just an ampersand character (as if &amp; were written). Now, we have an associated filter writer that does the reverse conversion. When it sees an ampersand in the data, it has to write back a legal HTML sequence. Since an ampersand is supposed to be written as &amp; in HTML, that is what it does. The same applies to the less-than symbol: < always becomes &lt;.

    In other words, this happens because the filter is feeding a character sequence that is not HTML text to the HTML Filter. It is unavoidable in my opinion.
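The lenient-read, strict-write behavior described here can be imitated with Python's standard `html` module. This is only an analogy under that assumption, not the Okapi HTML Filter's actual code; `html_round_trip` is a hypothetical helper name.

```python
import html

def html_round_trip(text: str) -> str:
    """Decode character references leniently, then write back legal HTML text."""
    decoded = html.unescape(text)              # a bare "&" is left as a literal ampersand
    return html.escape(decoded, quote=False)   # every "&" and "<" is escaped on output

# Input that is not valid HTML text comes back changed...
print(html_round_trip("Test && & <"))
# ...while input that already is HTML text survives the round trip.
print(html_round_trip("Tom &amp; Jerry"))
```

The round trip is only lossless when the input was legal HTML text to begin with, which matches the explanation above.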

  4. Jack Cole reporter

    So when converting JSON to XLIFF, it’s escaping certain characters. That’s fine, it’s a limitation of the XML spec.

    But then shouldn't converting XLIFF to JSON perform the reverse operation? &amp; would be converted to &, and same with other symbols.

    Also, shouldn't this operation occur in the XLIFF toolkit, and not within Okapi? The fact that this issue occurs with the okf_html filter implies this handling is done in the filter.

    If you tell the XLIFF toolkit that you want the text hello & goodbye inside a source element, it should write out <source>hello &amp; goodbye</source>. When reading the target, the string provided should be hello & goodbye. Isn’t this how XML libraries work in general? Since the ampersands are preserved when I don’t use the okf_html filter, I assume this is already happening.
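The general XML behavior referred to here can be seen with Python's standard `xml.etree.ElementTree`. This sketch illustrates how a typical XML library treats text; it makes no claim about how the XLIFF toolkit is implemented.

```python
import xml.etree.ElementTree as ET

# Give the library the plain text; it escapes the ampersand when serializing.
source = ET.Element("source")
source.text = "hello & goodbye"
serialized = ET.tostring(source, encoding="unicode")
print(serialized)

# Reading the element back returns the plain text again.
parsed_text = ET.fromstring(serialized).text
print(parsed_text)
```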

  5. Kuro Kurosaka (BH Lab)

    The HTML Subfilter expects the main filter to feed it an HTML-conforming document, but a stand-alone & is not one. It decides to treat it as though the 5-character sequence &amp; were given. So when it is asked to write it back, it outputs &amp;. But I don't understand why the XLIFF just has &amp; instead of the &amp;amp; that I thought it would have. I guess I’m not understanding the whole situation.

    But the main problem of subfiltering in general is that content sent to a subfilter must be what the subfilter is designed to handle. If we unconditionally send anything found in the main document, bad things can happen.
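A pre-check before handing a string to an HTML subfilter could look roughly like the following. It is a heuristic sketch only: `has_bare_ampersand` is a hypothetical helper, not an Okapi option, and it checks only the shape of character references, not whether a named reference actually exists.

```python
import re

# An ampersand that does not start a named, decimal, or hexadecimal
# character reference terminated by a semicolon.
_BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def has_bare_ampersand(text: str) -> bool:
    """Return True if the text contains an '&' that is not a character reference."""
    return _BARE_AMP.search(text) is not None

print(has_bare_ampersand("Test && & <"))        # bare ampersands present
print(has_bare_ampersand("Tom &amp; Jerry"))    # only a well-formed reference
```

Strings that fail such a check are the ones a round trip through an HTML reader and writer would alter.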

  6. Jack Cole reporter

    I am using the XML Filter, and it appears to handle all this fine. One exception: with the escapeGT option set to "no", the filter still converts &gt; to > in the segment, but does not convert it back when writing. So every &gt; becomes > in the final translated XML, even when the source XML content is exactly the same. Should I report this as a separate bug, or is it related to this?
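The asymmetry reported here can be modeled in a few lines. This is a sketch of the described behavior, assuming a reader that always decodes and a writer whose handling of > depends on a flag; `write_text` and its `escape_gt` parameter are hypothetical stand-ins, not the XML Filter's code.

```python
from xml.sax.saxutils import unescape

def write_text(text: str, escape_gt: bool) -> str:
    """Escape '&' and '<' always; escape '>' only when escape_gt is set."""
    out = text.replace("&", "&amp;").replace("<", "&lt;")
    if escape_gt:
        out = out.replace(">", "&gt;")
    return out

source = "a &gt; b"
segment = unescape(source)            # reading always decodes &gt; to >
print(write_text(segment, escape_gt=False))
print(write_text(segment, escape_gt=True))
```

With the flag off, the output is still well-formed XML, since a literal > is allowed in character data, but it no longer matches the source byte for byte.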
