CDATA sections escaped when writing with Okapi XLIFF 2.0 lib

Issue #1167 new
Masha Buka created an issue

I am using okapi-lib-xliff2:1.44.0 to create an .xlf file, and I want to add CDATA sections to some of the elements. According to the XLIFF 2.0 specification, this is allowed:
http://docs.oasis-open.org/xliff/xliff-core/v2.0/xliff-core-v2.0.html#d0e7792

However, in the output file the Okapi XLIFF 2.0 writer escapes the CDATA markers along with all inline codes used in the values. This is the code I have so far. Thanks a lot for looking into this issue.

try (XLIFFWriter writer = new XLIFFWriter()) {
      writer.setUseIndentation(true);
      writer.create(
          new File("cdata.xlf"),
          Locale.US.toString(),
          Locale.FRANCE.toString());

      StartFileData fileElementAttribute = new StartFileData(null);
      String originalFile = "with_cdata.xlf";
      fileElementAttribute.setId("1");
      fileElementAttribute.setOriginal(originalFile);
      writer.writeStartFile(fileElementAttribute);

      Unit unit = new Unit("1");

      ExtAttributes additionalAttributes = new ExtAttributes();
      additionalAttributes.setAttribute(new ExtAttribute(QName.valueOf("xml:space"), "preserve"));
      unit.setExtAttributes(additionalAttributes);

      String segmentId = "test-key-1";
      unit.setName(segmentId);
      unit.setCanResegment(false);

      Segment segment = unit.appendSegment();
      segment.setCanResegment(false);
      segment.setSource(new CDATAEncoder("UTF-8", "\\n").encode("<b>Hello<\\b>", EncoderContext.TEXT));
      segment.setTarget(new CDATAEncoder("UTF-8", "\\n").encode("<b>Bonjour<\\b>", EncoderContext.TEXT));

      Note originalComment = new Note();
      originalComment.setCategory("engineer-comment");
      originalComment.setText(new CDATAEncoder("UTF-8", "\\n").encode("This is translation for <b>Hello<\\b>", EncoderContext.TEXT));
      unit.addNote(originalComment);

      Metadata unitMetadata = new Metadata();
      MetaGroup metaGroup = new MetaGroup();
      metaGroup.setCategory("unitMetadata");
      Meta meta = new Meta("key-1");
      meta.setData(new CDATAEncoder("UTF-8", "\\n").encode("This is translation for <b>Hello<\\b>", EncoderContext.TEXT));
      metaGroup.add(meta);
      unitMetadata.addGroup(metaGroup);

      unit.setMetadata(unitMetadata);

      writer.writeUnit(unit);
    }

And this is the document it produces:

<?xml version="1.0"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en_US" trgLang="fr_FR">
 <file id="1" original="with_cdata.xlf">
  <unit id="1" canResegment="no" name="test-key-1" xml:space="preserve">
   <mda:metadata xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0">
   <mda:metaGroup category="unitMetadata">
   <mda:meta type="key-1">&lt;![CDATA[This is translation for &lt;b>Hello&lt;\b>]]></mda:meta>
   </mda:metaGroup>
</mda:metadata>
   <notes>
    <note category="engineer-comment">&lt;![CDATA[This is translation for &lt;b>Hello&lt;\b>]]></note>
   </notes>
   <segment>
    <source>&lt;![CDATA[&lt;b&gt;Hello&lt;\b&gt;]]&gt;</source>
    <target>&lt;![CDATA[&lt;b&gt;Bonjour&lt;\b&gt;]]&gt;</target>
   </segment>
  </unit>
 </file>
</xliff>

The expected output would be:

<?xml version="1.0"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en_US" trgLang="fr_FR">
 <file id="1" original="with_cdata.xlf">
  <unit id="1" canResegment="no" name="test-key-1" xml:space="preserve">
   <mda:metadata xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0">
   <mda:metaGroup category="unitMetadata">
   <mda:meta type="key-1"><![CDATA[This is translation for <b>Hello<\b>]]></mda:meta>
   </mda:metaGroup>
</mda:metadata>
   <notes>
    <note category="engineer-comment"><![CDATA[This is translation for <b>Hello<\b>]]></note>
   </notes>
   <segment>
    <source><![CDATA[<b>Hello<\b>]]></source>
    <target><![CDATA[<b>Bonjour<\b>]]></target>
   </segment>
  </unit>
 </file>
</xliff>
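
The difference between the two documents can be reproduced outside Okapi. The sketch below uses the JDK's StAX API (javax.xml.stream), not the Okapi writer, and the class and method names are made up for illustration. It assumes the Okapi writer treats the pre-encoded string as ordinary character data, which is what the actual output above suggests: text written as characters has its CDATA markers escaped, while text written through a dedicated CDATA call keeps them.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical demo class; it is not part of Okapi.
final class StaxCdataDemo {

    // Writes <source>text</source>, either as a CDATA section or as plain character data.
    static String writeSource(String text, boolean asCdata) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("source");
            if (asCdata) {
                xml.writeCData(text);      // emitted inside <![CDATA[ ... ]]>
            } else {
                xml.writeCharacters(text); // markup characters are escaped
            }
            xml.writeEndElement();
            xml.flush();
            xml.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Pre-encoded CDATA passed as character data: the markers get escaped.
        System.out.println(writeSource("<![CDATA[<b>Hello<\\b>]]>", false));
        // Raw text passed through the CDATA call: the markers survive.
        System.out.println(writeSource("<b>Hello<\\b>", true));
    }
}
```

If that assumption holds, encoding the string with CDATAEncoder before handing it to the data model cannot work by itself; the writer would need to be told to emit a CDATA section.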

Comments (4)

  1. jhargrave-straker

    We've decided to reject PR 651 (https://bitbucket.org/okapiframework/okapi/pull-requests/651) as it dirties the xliff2 data model and encourages bad practices like embedding native formatting as CDATA in source and target.

    One alternative is to update the Xliff2Writer.write() methods with a boolean cdata parameter. If a user sets cdata=true, the writer will wrap all PCDATA in CDATA markers. This is a bit coarse, as an object like Unit may have several notes, metadata entries, etc.

    I would also like to add a new Property object to the xliff2 data model (just like the okapi resource model) so we can add filter-specific information, such as whether the original content was a CDATA section. I don't see any way to do this other than creating new fields.

    Is there any pushback to adding a new Property/IProperty class? If there is a native xliff2 class I could use, let me know, but nothing seemed obvious.
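
A rough sketch of the coarse-grained alternative described above, for illustration only. It is plain Java and does not touch the Okapi API: the class CdataWrapSketch, the method writeContent, and the cdata parameter are all hypothetical names. The one detail worth noting is that a CDATA section cannot contain the sequence "]]>", so the sketch splits it across two sections.

```java
// Hypothetical helper; not part of okapi-lib-xliff2.
final class CdataWrapSketch {

    private CdataWrapSketch() {
    }

    // Escapes the characters that are unsafe in XML element content.
    static String escapeText(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Wraps text in a CDATA section. Any "]]>" inside the text is split
    // so that the first section ends after "]]" and a new one starts at ">".
    static String wrapInCdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    // Stand-in for a writer method taking the proposed boolean flag.
    static String writeContent(String text, boolean cdata) {
        return cdata ? wrapInCdata(text) : escapeText(text);
    }

    public static void main(String[] args) {
        System.out.println(writeContent("<b>Hello<\\b>", false));
        System.out.println(writeContent("<b>Hello<\\b>", true));
    }
}
```

Because the flag applies to everything the call writes, every note and metadata value of a Unit would be wrapped the same way, which is the coarseness mentioned above.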
