CDATA sections escaped when writing with Okapi XLIFF 2.0 lib
I am using okapi-lib-xliff2:1.44.0 to create an .xlf file. I want to add CDATA sections to some of the elements. According to the XLIFF 2.0 specification, this is allowed:
http://docs.oasis-open.org/xliff/xliff-core/v2.0/xliff-core-v2.0.html#d0e7792
However, in the output file the Okapi XLIFF 2.0 writer escapes the CDATA markers along with all inline codes used in the values. This is the code I have so far. Thanks a lot for looking into this issue.
try (XLIFFWriter writer = new XLIFFWriter()) {
    writer.setUseIndentation(true);
    writer.create(
        new File("cdata.xlf"),
        Locale.US.toString(),
        Locale.FRANCE.toString());

    // <file> element
    StartFileData fileElementAttribute = new StartFileData(null);
    String originalFile = "with_cdata.xlf";
    fileElementAttribute.setId("1");
    fileElementAttribute.setOriginal(originalFile);
    writer.writeStartFile(fileElementAttribute);

    // <unit> element with xml:space="preserve"
    Unit unit = new Unit("1");
    ExtAttributes additionalAttributes = new ExtAttributes();
    additionalAttributes.setAttribute(new ExtAttribute(QName.valueOf("xml:space"), "preserve"));
    unit.setExtAttributes(additionalAttributes);
    String segmentId = "test-key-1";
    unit.setName(segmentId);
    unit.setCanResegment(false);

    // <segment> whose source and target are wrapped in CDATA markers
    Segment segment = unit.appendSegment();
    segment.setCanResegment(false);
    segment.setSource(new CDATAEncoder("UTF-8", "\\n").encode("<b>Hello<\\b>", EncoderContext.TEXT));
    segment.setTarget(new CDATAEncoder("UTF-8", "\\n").encode("<b>Bonjour<\\b>", EncoderContext.TEXT));

    // <note> with CDATA-wrapped text
    Note originalComment = new Note();
    originalComment.setCategory("engineer-comment");
    originalComment.setText(new CDATAEncoder("UTF-8", "\\n").encode("This is translation for <b>Hello<\\b>", EncoderContext.TEXT));
    unit.addNote(originalComment);

    // Metadata module entry with CDATA-wrapped data
    Metadata unitMetadata = new Metadata();
    MetaGroup metaGroup = new MetaGroup();
    metaGroup.setCategory("unitMetadata");
    Meta meta = new Meta("key-1");
    meta.setData(new CDATAEncoder("UTF-8", "\\n").encode("This is translation for <b>Hello<\\b>", EncoderContext.TEXT));
    metaGroup.add(meta);
    unitMetadata.addGroup(metaGroup);
    unit.setMetadata(unitMetadata);

    writer.writeUnit(unit);
}
And this is the document it produces:
<?xml version="1.0"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en_US" trgLang="fr_FR">
    <file id="1" original="with_cdata.xlf">
        <unit id="1" canResegment="no" name="test-key-1" xml:space="preserve">
            <mda:metadata xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0">
                <mda:metaGroup category="unitMetadata">
                    <mda:meta type="key-1"><![CDATA[This is translation for <b>Hello<\b>]]></mda:meta>
                </mda:metaGroup>
            </mda:metadata>
            <notes>
                <note category="engineer-comment"><![CDATA[This is translation for <b>Hello<\b>]]></note>
            </notes>
            <segment>
                <source><![CDATA[<b>Hello<\b>]]></source>
                <target><![CDATA[<b>Bonjour<\b>]]></target>
            </segment>
        </unit>
    </file>
</xliff>
Expected output would be:
<?xml version="1.0"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en_US" trgLang="fr_FR">
    <file id="1" original="with_cdata.xlf">
        <unit id="1" canResegment="no" name="test-key-1" xml:space="preserve">
            <mda:metadata xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0">
                <mda:metaGroup category="unitMetadata">
                    <mda:meta type="key-1"><![CDATA[This is translation for <b>Hello<\b>]]></mda:meta>
                </mda:metaGroup>
            </mda:metadata>
            <notes>
                <note category="engineer-comment"><![CDATA[This is translation for <b>Hello<\b>]]></note>
            </notes>
            <segment>
                <source><![CDATA[<b>Hello<\b>]]></source>
                <target><![CDATA[<b>Bonjour<\b>]]></target>
            </segment>
        </unit>
    </file>
</xliff>
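For illustration, the escaping described above can be reproduced with the JDK alone. This is only a minimal sketch and does not use Okapi: the class name `CdataEscapingDemo` and the method `serialize` are made up for this example. It shows that CDATA markers passed to an XML serializer as ordinary text are escaped like any other markup, whereas content added as a dedicated CDATA node is written as a CDATA section.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class (JDK only, not part of Okapi).
public class CdataEscapingDemo {

    /**
     * Builds a single <source> element and serializes it.
     * asCdataNode == false: the CDATA markers are part of a plain text node.
     * asCdataNode == true:  the content is added as a DOM CDATA section node.
     */
    public static String serialize(String content, boolean asCdataNode) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element source = doc.createElement("source");
            doc.appendChild(source);
            if (asCdataNode) {
                source.appendChild(doc.createCDATASection(content));
            } else {
                // The serializer sees these markers as text and escapes the '<'.
                source.appendChild(doc.createTextNode("<![CDATA[" + content + "]]>"));
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize("<b>Hello</b>", false)); // markers escaped
        System.out.println(serialize("<b>Hello</b>", true));  // real CDATA section
    }
}
```

If the XLIFF writer treats the segment content as plain text in the same way, then a string that already contains CDATA markers would be expected to come out escaped, which matches what I am seeing.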
Comments (4)
-
reporter -
- changed milestone to 1.45.0
-
assigned issue to
-
@Masha Buka We have a possible solution for this: https://bitbucket.org/okapiframework/okapi/pull-requests/
But I wanted to point out that CDATA sections will not survive XLIFF 2 filtering, as most XML parsers convert CDATA to escaped PCDATA.
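A rough sketch of that round-trip effect, using only the JDK DOM parser with coalescing turned on (an assumption chosen for the demo; the Okapi reader may behave differently in detail). The class `CdataRoundTripDemo` and the method `roundTrip` are hypothetical names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class (JDK only, not part of Okapi).
public class CdataRoundTripDemo {

    /** Parses the given XML and writes it straight back out. */
    public static String roundTrip(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Coalescing turns CDATA sections into ordinary text nodes,
            // so the information that CDATA was used is lost at parse time.
            factory.setCoalescing(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The CDATA section comes back as escaped character data.
        System.out.println(roundTrip("<source><![CDATA[<b>Hello</b>]]></source>"));
    }
}
```

The text content is unchanged for an XML consumer, but the CDATA markers themselves are gone after the round trip.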
We do have a TODO in the code to preserve CDATA on output - we’ll consider adding this functionality as well.
-
We've decided to reject PR 651 (https://bitbucket.org/okapiframework/okapi/pull-requests/651) as it dirties the xliff2 data model and encourages bad practices like embedding native formatting as CDATA in source and target.
One alternative is to update the Xliff2Writer.write() methods with a boolean cdata parameter. If a user sets cdata=true, the writer will wrap all PCDATA with CDATA markers. This is a bit coarse, as an object like Unit may have several notes, metadata, etc.
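To make the idea concrete, here is a minimal sketch of what such a flag could do at the point where text is written. It is not the actual writer code: `CdataWrapSketch` and `writeText` are hypothetical names, and the escaping in the non-CDATA branch is simplified.

```java
// Hypothetical sketch, not the actual Okapi writer code.
public class CdataWrapSketch {

    /**
     * Returns the text as it would appear in the output file.
     * cdata == true:  wrap the raw text in a CDATA section.
     * cdata == false: escape the text as regular PCDATA (simplified).
     */
    public static String writeText(String text, boolean cdata) {
        if (cdata) {
            // "]]>" cannot appear inside a CDATA section, so an occurrence
            // is split across two adjacent sections.
            return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
        }
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(writeText("<b>Hello</b>", true));
        System.out.println(writeText("<b>Hello</b>", false));
    }
}
```

Because the flag would apply to a whole write call, every note, meta and segment of the unit would be wrapped the same way, which is the coarseness mentioned above.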
I would also like to add a new Property object to the xliff2 data model (just like the okapi resource model) so we can add filter-specific information, such as whether the original content was a CDATA section. I don't see any other way to do this other than creating new fields.
Is there any pushback to adding a new Property/IProperty class? If there is a native xliff2 class I could use, let me know, but nothing seemed obvious.
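As a starting point for discussion, something along these lines is what I have in mind. This is purely a hypothetical sketch: none of these classes exist in lib-xliff2, and the property name "cdata" is an assumption made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; these types are not part of the xliff2 data model.
public class PropertySketch {

    /** Simple immutable name/value pair. */
    public static final class Property {
        private final String name;
        private final String value;

        public Property(String name, String value) {
            this.name = name;
            this.value = value;
        }

        public String getName() { return name; }
        public String getValue() { return value; }
    }

    /** Holder that an object such as a note or a meta item could embed. */
    public static final class PropertyHolder {
        private final Map<String, Property> properties = new LinkedHashMap<>();

        public void setProperty(Property property) {
            properties.put(property.getName(), property);
        }

        public Property getProperty(String name) {
            return properties.get(name);
        }

        /** True only if a "cdata" property is present and set to "true". */
        public boolean isCdata() {
            Property p = properties.get("cdata");
            return p != null && Boolean.parseBoolean(p.getValue());
        }
    }

    public static void main(String[] args) {
        PropertyHolder note = new PropertyHolder();
        note.setProperty(new Property("cdata", "true"));
        System.out.println(note.isCdata());
    }
}
```

A reader could set the flag when it encounters a CDATA section, and the writer could consult it per object instead of relying on one global switch.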
I found this test resource which presumably is supposed to test this expected behavior, but it does not seem to be used in any of the test cases: https://bitbucket.org/okapiframework/okapi/src/0987e3843ac7587d99cb7af2a772046c350c840c/okapi/libraries/lib-xliff2/src/test/resources/valid/withCDataSections.xlf?at=dev