Inline Content

The <source> and <target> elements of XLIFF hold the content extracted from the original document. That content can consist of the following:

  • The text
  • The inline codes (<pc>, <sc>, <ec> and <ph>)
  • The annotation markers (<mrk>, <sm> and <em>)

Inline codes

The inline codes correspond to original inline data (usually formatting markup) that is optionally preserved in the <originalData> element of each <unit>. For example, the following unit has some bolded HTML text:

<unit id="1">
   <originalData>
      <data id="d1">&lt;b></data>
      <data id="d2">&lt;/b></data>
   </originalData>
   <source>Text in <pc id="1" dataRefStart="d1" dataRefEnd="d2">bold</pc>.</source>
</unit>

Markers

The annotation markers allow you to associate XLIFF-readable information with given spans of content. That information may or may not be part of the original document. For example, the following unit has an annotation indicating that the word "doppelgänger" is a term and should not be translated:

<unit id="1">
   <source>He saw his <mrk id="m1" type="term" translate="no">doppelgänger</mrk>.</source>
</unit>

Coded text representation

In the library, the parsed content of <source> and <target> is represented by the Fragment class. The text is stored as a coded text string in which each inline object is denoted by a pair of special characters called a tag reference.

The first character of the pair indicates the type of tag and the type of inline object (e.g. opening tag for a code, closing tag for a code, etc.). The second character is an index for that type of tag/object. Combined, the two characters make a key that points to the object holding the information about that tag.

The first character is one of the following values:

| Unicode Value | Constant | Definition | XLIFF Elements |
|---------------|----------|------------|----------------|
| U+E101 | Fragment.CODE_OPENING | Opening tag of an inline code | <pc> or <sc/> |
| U+E102 | Fragment.CODE_CLOSING | Closing tag of an inline code | </pc> or <ec/> |
| U+E103 | Fragment.CODE_STANDALONE | Standalone tag for an inline code | <ph/> |
| U+E104 | Fragment.MARKER_OPENING | Opening tag for a marker | <mrk> or <sm/> |
| U+E105 | Fragment.MARKER_CLOSING | Closing tag for a marker | </mrk> or <em/> |
| U+E106 | Fragment.PCONT_STANDALONE | Standalone tag for protected content | not applicable |

The second character is one of the 6128 values between U+E110 and U+F8FF (inclusive).
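To make the idea of a two-character key more concrete, here is a minimal stand-alone sketch. It is not the library's implementation: the class `TagKeySketch`, the bit-packing in `packKey()`, and the assumption that index 0 corresponds to U+E110 are all hypothetical and chosen only for illustration. In real code, use Fragment.toKey().

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is not the toolkit's implementation.
final class TagKeySketch {

   // One possible packing (an assumption): the first character goes in the
   // high 16 bits of the key and the second character in the low 16 bits.
   static int packKey (char first, char second) {
      return (first << 16) | second;
   }

   // Assumed mapping: index 0 corresponds to U+E110, index 1 to U+E111, etc.
   static char indexToSecondChar (int index) {
      int value = 0xE110 + index;
      if (index < 0 || value > 0xF8FF) {
         throw new IllegalArgumentException("Index out of range: " + index);
      }
      return (char) value;
   }

   public static void main (String[] args) {
      // A plain map stands in for the object store that a key points into.
      Map<Integer, String> tags = new HashMap<>();
      char opening = '\uE101'; // Opening tag of an inline code
      char closing = '\uE102'; // Closing tag of an inline code
      char index0 = indexToSecondChar(0);
      tags.put(packKey(opening, index0), "<b>");
      tags.put(packKey(closing, index0), "</b>");

      // The same pair of characters always yields the same key.
      System.out.println(tags.get(packKey('\uE102', '\uE110'))); // prints </b>
   }
}
```

Because the first character is part of the key, an opening tag and a closing tag can share the same index value without colliding.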

Note that both characters are in the Unicode Private Use Area (PUA), which means they are not affected by most operations such as toLowerCase() and are not part of most regular expression character classes such as punctuation. Note also that the first and second characters never have the same value, so you can tell which one you are looking at from its value.
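Both properties can be checked with plain JDK calls. The following is a minimal sketch that does not use the library: the class `TagRefChars` and its helper methods are hypothetical, and the range bounds are copied from the values given on this page.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical stand-alone demo; it does not depend on the toolkit.
final class TagRefChars {

   // Bounds taken from this page: first characters are U+E101 to U+E106,
   // second characters are U+E110 to U+F8FF.
   static final char FIRST_MIN = '\uE101';
   static final char FIRST_MAX = '\uE106';
   static final char SECOND_MIN = '\uE110';
   static final char SECOND_MAX = '\uF8FF';

   // Matches letters, numbers, punctuation and whitespace.
   static final Pattern COMMON_CLASSES = Pattern.compile("[\\p{L}\\p{N}\\p{P}\\s]");

   // True if c lies in the range used for the first character of a tag reference.
   static boolean looksLikeFirst (char c) {
      return c >= FIRST_MIN && c <= FIRST_MAX;
   }

   // True if c lies in the range used for the second character of a tag reference.
   static boolean looksLikeSecond (char c) {
      return c >= SECOND_MIN && c <= SECOND_MAX;
   }

   // True if converting the string to lower case and to upper case leaves it unchanged.
   static boolean survivesCaseChange (String s) {
      return s.toLowerCase(Locale.ROOT).equals(s)
         && s.toUpperCase(Locale.ROOT).equals(s);
   }

   // True if any character of the string falls in one of the common classes.
   static boolean matchesCommonClasses (String s) {
      return COMMON_CLASSES.matcher(s).find();
   }

   public static void main (String[] args) {
      String pair = "\uE101\uE110"; // Opening code tag followed by the lowest second-character value
      System.out.println(survivesCaseChange(pair));   // prints true
      System.out.println(matchesCommonClasses(pair)); // prints false
      // The two ranges do not overlap, so one range check identifies the position.
      System.out.println(looksLikeFirst(pair.charAt(0)) && !looksLikeSecond(pair.charAt(0))); // prints true
   }
}
```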

The library provides Fragment.isChar1(), which returns true if its parameter is the first character of a tag reference. Once you have found the first character of a tag reference, you can read the second one and pass both to Fragment.toKey() to get the key for that tag. The key gives you access to the referenced object.

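#!java
String ct = fragment.getCodedText();
for (int i=0; i<ct.length(); i++ ) {
   // Is this the first character of a tag reference?
   if ( Fragment.isChar1(ct.charAt(i)) ) {
      // Combine it with the second character (and skip over that one)
      int key = Fragment.toKey(ct.charAt(i), ct.charAt(++i));
      // Look up the object this tag reference points to
      Tag tag = fragment.getTag(key);
      // Do something with the tag...
   }
}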
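The same walk can be tried without the library on the classpath. The sketch below is self-contained and removes every tag reference from a coded text string, leaving only the text. The class `CodedTextSketch` and its methods are hypothetical. The local `isFirstChar()` check is a stand-in for Fragment.isChar1() that assumes the first characters are exactly the six values listed in the table above, and the second-character values in the sample string are picked arbitrarily.

```java
// Hypothetical stand-alone sketch; it mimics the loop above without the toolkit.
final class CodedTextSketch {

   // Local stand-in for Fragment.isChar1(), assuming the first characters
   // are exactly U+E101 to U+E106 as listed in the table.
   static boolean isFirstChar (char c) {
      return c >= '\uE101' && c <= '\uE106';
   }

   // Returns the coded text with every two-character tag reference removed.
   static String stripTagRefs (String codedText) {
      StringBuilder sb = new StringBuilder(codedText.length());
      for (int i = 0; i < codedText.length(); i++) {
         char c = codedText.charAt(i);
         if (isFirstChar(c)) {
            i++; // Skip the second character of the pair as well
            continue;
         }
         sb.append(c);
      }
      return sb.toString();
   }

   // Counts the tag references found in the coded text.
   static int countTagRefs (String codedText) {
      int count = 0;
      for (int i = 0; i < codedText.length(); i++) {
         if (isFirstChar(codedText.charAt(i))) {
            count++;
            i++; // Skip the second character of the pair
         }
      }
      return count;
   }

   public static void main (String[] args) {
      // Models "Text in <pc>bold</pc>." with each tag replaced by a tag reference
      String ct = "Text in \uE101\uE110bold\uE102\uE110.";
      System.out.println(stripTagRefs(ct)); // prints Text in bold.
      System.out.println(countTagRefs(ct)); // prints 2
   }
}
```

Skipping two characters at a time matters here: the second character must never be examined on its own, or it would end up in the output as stray text.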

The class CTag represents inline code tags, and the class MTag represents annotation markers. Both classes are derived from Tag.

Note that there is another type of inline object: PCont, which is not derived from Tag and represents a section of folded protected content. See the Protected Content section for details.