okf_html subfiltering results in & escaped after merging
Explanation
When creating a translation kit and then merging the kit back into its original file, if ampersands exist in the original file's contents, they will be converted to &amp;. Single quotes are also converted to &#39;.
I’ve attached a package containing this example.
Note: I’m on dev branch when doing this.
What should happen
If I create a translation kit with ampersands in the original file, and don’t change the work file, the done file should have those same ampersands unaltered.
Steps to reproduce
- Create a JSON file with some strings that contain &s, and other encoded HTML characters like &amp;. You can use the example below:
{"example1": "Test &&&&&& & & & & < "}
- Open Rainbow and add the JSON file to the list of files
- Open input document properties for that JSON
- Click Create…
- Give it any name and hit OK
- In the JSON Filter Parameters Window, select Content Processing
- Select “Process text content with this sub-filter…” and input “okf_html”
- Hit OK, and hit OK again to close Input Document Properties
- Select Utilities > Translation Kit Creation
- Use XLIFF or XLIFF2 in the type of package and hit Execute
- Remove the JSON file and Add the Manifest file that was created
- Utilities > Translation Kit Post-Processing
- Execute
- Open the done folder and look at the resulting JSON file. It should look something like this:
{"example1": "Test &amp;&amp;&amp;&amp;&amp;&amp; &amp; &amp; &amp; &amp; &lt; "}
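For a quicker check than eyeballing the done file, the first and last steps can be scripted. This is only a sketch: the function names and file paths are placeholders (nothing here comes from the attached package), and it assumes the merged file is still valid JSON.

```python
import json
from pathlib import Path

# Same sample string as in the steps above.
SAMPLE = {"example1": "Test &&&&&& & & & & < "}


def write_sample(path: Path) -> None:
    """Write the test JSON file used as input for the translation kit."""
    path.write_text(json.dumps(SAMPLE), encoding="utf-8")


def changed_strings(original: Path, done: Path) -> dict:
    """Return {key: (before, after)} for every value that differs after merging."""
    before = json.loads(original.read_text(encoding="utf-8"))
    after = json.loads(done.read_text(encoding="utf-8"))
    return {k: (v, after.get(k)) for k, v in before.items() if after.get(k) != v}
```

Call write_sample() before adding the file to Rainbow, then call changed_strings() with the original file and the file from the done folder after post-processing. An empty dict means the ampersands survived the round trip.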
Comments (8)
-
Hi Jack,
Have you tried using the Quote mode option? http://okapiframework.org/wiki/index.php?title=HTML_Filter#Quote_Mode
Cheers
Clement
-
reporter Hey Clement,
I loaded up Rainbow after creating a filter config, and saw the config in the list. I then set the JSON subfilter to okf_html-test_filter and tried all 4 levels of quote mode.
Unfortunately, the & symbols were all still converted to &amp; in the done file.
-
In HTML, && (or more) is illegal, I think, because & is the leading character of a character reference, and a character sequence that forms a valid character reference is expected to follow. The HTML Filter could just stop parsing when it sees an illegal sequence. But it actually tries to be lenient and treats it as just an ampersand character (as if &amp; were written). Now, we have an associated filter writer that does the reverse conversion. When it sees an ampersand in the data, it has to write back a legal HTML sequence. Since an ampersand is supposed to be written as &amp; in HTML, that is what it does. The same situation applies to the less-than symbol. So < always becomes &lt;.
In other words, this happens because the filter is feeding a character sequence that is not HTML text to the HTML Filter. It is unavoidable in my opinion.
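The asymmetry described here can be reproduced outside Okapi. The sketch below uses Python's standard html module as a stand-in for a lenient HTML reader and a strict HTML writer. It is an analogy only, not the HTML Filter's actual code, and the function names are made up for illustration.

```python
import html


def lenient_read(markup: str) -> str:
    """Decode character references; a bare '&' or '<' passes through unchanged."""
    return html.unescape(markup)


def strict_write(text: str) -> str:
    """Write text back as legal HTML: every '&', '<' and '>' is escaped."""
    return html.escape(text, quote=False)


source = "Test &&&&&& & & & & < "
text = lenient_read(source)   # unchanged: nothing in it is a valid reference
merged = strict_write(text)   # every & becomes &amp; and < becomes &lt;

print(merged == source)                             # False: the round trip is lossy
print(lenient_read("&amp;") == lenient_read("&"))   # True: the reader cannot tell them apart
```

Once the reader has accepted both & and &amp; as the same character, the writer has no way to know which form the input used, so it picks the legal one.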
-
reporter So when converting JSON to XLIFF, it’s escaping certain characters. That’s fine, it’s a limitation of the XML spec.
But then shouldn't converting XLIFF to JSON perform the reverse operation? &amp; would be converted to &, and same with other symbols.
Also, shouldn't this operation occur in the XLIFF toolkit, and not within Okapi? The fact that this issue occurs with the okf_html filter implies this handling is done in the filter.
If you tell the XLIFF toolkit that you want the text hello & goodbye inside a source element, it should write out <source>hello &amp; goodbye</source>. When reading the target, the string provided should be hello & goodbye. Isn’t this how XML libraries work in general? Since the ampersands are preserved when I don’t use the okf_html filter, I assume this is already happening.
-
The HTML Subfilter expects the main filter to feed it an HTML-conforming document. But the stand-alone & is not. It decides to treat it as though the 5-character sequence &amp; were given. So when it is asked to write it back, it outputs &amp;.
But I don't understand why XLIFF just has &amp; instead of the &amp;amp; that I thought it would have. I guess I’m not understanding the whole situation.
But the main problem of subfiltering in general is that content sent to a subfilter must be what the subfilter is designed to handle. If we unconditionally send anything found in the main document, bad things can happen.
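The two layers discussed in the last two comments can be separated in a small experiment. The sketch below uses Python's xml.etree.ElementTree and html modules as stand-ins for the XLIFF writer and the HTML subfilter. Neither is Okapi code, and the variable names are illustrative only.

```python
import html
import xml.etree.ElementTree as ET

# XML layer alone: the library escapes on write and unescapes on read.
src = ET.Element("source")
src.text = "hello & goodbye"
serialized = ET.tostring(src, encoding="unicode")
round_tripped = ET.fromstring(serialized).text

# With an HTML-style reader in front: the bare '&' is handed on as a plain
# ampersand, so the XML layer escapes it exactly once.
via_html = ET.Element("source")
via_html.text = html.unescape("hello & goodbye")
with_subfilter = ET.tostring(via_html, encoding="unicode")

# A doubly escaped form appears only if the literal five characters
# "&amp;" are themselves the text handed to the XML layer.
literal = ET.Element("source")
literal.text = "hello &amp; goodbye"
double_escaped = ET.tostring(literal, encoding="unicode")

print(serialized)       # <source>hello &amp; goodbye</source>
print(round_tripped)    # hello & goodbye
print(with_subfilter)   # <source>hello &amp; goodbye</source>
print(double_escaped)   # <source>hello &amp;amp; goodbye</source>
```

Under these assumptions the XML layer round-trips cleanly on its own, which matches the observation that ampersands are preserved without okf_html. The change would then come from the HTML layer's writer, not from the XLIFF step.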
-
reporter I am using the XML Filter, and it appears to handle all this fine. Except that the escapeGT option, when set to "no", will still convert &gt; to > in the segment, but then when writing from the filter it will not convert it back. So all &gt; are converted to > in the final translated XML when using the exact same source XML content. Should I report this as a separate bug, or is it related to this?
It seems like even Bitbucket is having issues handling ampersands…