Filter Overview

The role of a filter (IFilter interface) is to parse an input document and split it into standardized events that can be carried through the pipeline.

The filter event is carried in a FilterEvent object that holds the type of the event and the resource that may be associated with it. The following table shows the correspondence between event types and resource classes:

FilterEventType value | Associated resource
START | none
START_DOCUMENT | StartDocument
START_SUBDOCUMENT | StartSubDocument
START_GROUP | StartGroup
DOCUMENT_PART | DocumentPart
TEXT_UNIT | TextUnit
END_GROUP | Ending
END_SUBDOCUMENT | Ending
END_DOCUMENT | Ending
FINISHED | none
CANCELED | none

A document has, at least, the following events: START, START_DOCUMENT, END_DOCUMENT and FINISHED.
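The event/resource correspondence and the minimal event sequence can be sketched in plain Java. This is only an illustration: EventType and EventSequenceSketch are hypothetical stand-ins rather than Okapi classes, and the assumption that START comes first and FINISHED comes last is ours, not something this page states.

```java
import java.util.List;

// Hypothetical stand-in for the event types listed in the table above;
// this is not the Okapi FilterEventType class itself.
enum EventType {
    START, START_DOCUMENT, START_SUBDOCUMENT, START_GROUP, DOCUMENT_PART,
    TEXT_UNIT, END_GROUP, END_SUBDOCUMENT, END_DOCUMENT, FINISHED, CANCELED
}

class EventSequenceSketch {

    // Name of the resource class associated with an event type, per the table.
    // Returns null for START, FINISHED and CANCELED, which carry no resource.
    static String resourceFor(EventType type) {
        switch (type) {
            case START_DOCUMENT: return "StartDocument";
            case START_SUBDOCUMENT: return "StartSubDocument";
            case START_GROUP: return "StartGroup";
            case DOCUMENT_PART: return "DocumentPart";
            case TEXT_UNIT: return "TextUnit";
            case END_GROUP:
            case END_SUBDOCUMENT:
            case END_DOCUMENT: return "Ending";
            default: return null;
        }
    }

    // True when the list holds the four mandatory events in the assumed order:
    // START first, START_DOCUMENT before END_DOCUMENT, FINISHED last.
    static boolean hasMinimalEvents(List<EventType> events) {
        if (events.size() < 4) return false;
        if (events.get(0) != EventType.START) return false;
        if (events.get(events.size() - 1) != EventType.FINISHED) return false;
        int startDoc = events.indexOf(EventType.START_DOCUMENT);
        int endDoc = events.lastIndexOf(EventType.END_DOCUMENT);
        return startDoc > 0 && endDoc > startDoc;
    }
}
```

A consumer of real filter events would typically switch on the event type in the same way and cast the resource to the class given in the table.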

Resources

The resource object contains the data associated with the event, such as: text, read-only properties, localizable properties, information about groups, etc.

Some resources can be referents (i.e. be referred to). Referents are used, for example, when a text content has an embedded code that corresponds to a separate run of text. For instance, the text of a footnote embedded inside a paragraph may be treated as a referent. The resources that can be referents are:

  • StartGroup
  • DocumentPart
  • TextUnit

The resources that can have references are:

  • StartDocument
  • StartSubDocument
  • StartGroup
  • DocumentPart
  • TextUnit (including inside inline codes)

There are three interfaces that are common to the resources:

  • IResource is implemented by all resources.
  • INameable is implemented by the resources that hold properties, names, and various other data.
  • IReferenceable is implemented by the resources that can be referents.

The following table indicates what common methods are available for each resource:

Method | Ending | StartDocument | StartSubDocument | StartGroup | DocumentPart | TextUnit
IResource.getId | yes | yes | yes | yes | yes | yes
IResource.setId | yes | yes | yes | yes | yes | yes
IResource.getSkeleton | yes | yes | yes | yes | yes | yes
IResource.setSkeleton | yes | yes | yes | yes | yes | yes
IResource.getAnnotation | yes | yes | yes | yes | yes | yes
IResource.setAnnotation | yes | yes | yes | yes | yes | yes
INameable.getName | - | yes | yes | yes | yes | yes
INameable.setName | - | yes | yes | yes | yes | yes
INameable.getType | - | yes | yes | yes | yes | yes
INameable.setType | - | yes | yes | yes | yes | yes
INameable.getMimeType | - | yes | yes | yes | yes | yes
INameable.setMimeType | - | yes | yes | yes | yes | yes
INameable.isTranslatable | - | yes | yes | yes | yes | yes
INameable.setIsTranslatable | - | yes | yes | yes | yes | yes
INameable.preserveWhitespaces | - | yes | yes | yes | yes | yes
INameable.setPreserveWhitespaces | - | yes | yes | yes | yes | yes
INameable.getProperty | - | yes | yes | yes | yes | yes
INameable.setProperty | - | yes | yes | yes | yes | yes
INameable.getPropertyNames | - | yes | yes | yes | yes | yes
INameable.hasProperty | - | yes | yes | yes | yes | yes
INameable.getSourceProperty | - | yes | yes | yes | yes | yes
INameable.setSourceProperty | - | yes | yes | yes | yes | yes
INameable.getSourcePropertyNames | - | yes | yes | yes | yes | yes
INameable.hasSourceProperty | - | yes | yes | yes | yes | yes
INameable.getTargetProperty | - | yes | yes | yes | yes | yes
INameable.setTargetProperty | - | yes | yes | yes | yes | yes
INameable.getTargetPropertyNames | - | yes | yes | yes | yes | yes
INameable.hasTargetProperty | - | yes | yes | yes | yes | yes
INameable.createTargetProperty | - | yes | yes | yes | yes | yes
INameable.getTargetLanguages | - | yes | yes | yes | yes | yes
getParentId | - | - | yes | yes | yes | yes
setParentId | - | - | yes | yes | yes | yes
IReferenceable.isReferent | - | - | - | yes | yes | yes
IReferenceable.setIsReferent | - | - | - | yes | yes | yes

In addition, some resources have additional specific methods:

  • StartDocument: getLanguage, setLanguage, getEncoding, setEncoding, isMultilingual, setIsMultilingual, getParameters, setParameters
  • TextUnit: getSource, setSource, getTarget, setTarget, hasTarget, createTarget, removeTarget, getSourceContent, setSourceContent, getTargetContent, setTargetContent, getEncoder, setEncoder, isEmpty

Skeleton

TODO

The parts of the input document that are not directly used by the caller are called the 'skeleton' and correspond to the underlying original codes that make up the source document.

The skeleton is carried at the resource level, and is accessible with getSkeleton().

Each filter is responsible for creating its skeleton data and for deciding which resource each part is associated with (different filters can share classes for this). The internal representation of the skeleton is specific to each skeleton implementation.

TODO

Text Handling

Modifiable data is accessible in two places in the resources:

  • Modifiable properties
  • Translatable text

Modifiable properties are data that may need to be modified but are not text in the sense of translatable text. For example, a modifiable property could be the value of the dir attribute, or the URL of an href attribute in HTML.

Note that there is no relationship between how the data is stored in the original format and whether it is a modifiable property or a translatable text in the resource. For example, the HTML title attribute is a translatable text and therefore should be extracted as a TextUnit content rather than a modifiable property.

Most resources can have modifiable properties. Note that there are three kinds of modifiable properties:

  • Modifiable properties at the resource level
  • Source modifiable properties (stored in the source TextContainer)
  • Target modifiable properties (stored in the target TextContainer objects)

An example of these different kinds of modifiable properties can be found in TMX where the changeid attribute exists for the <tu> element, as well as for the source and the target <tuv> elements.
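The three levels can be pictured with a small self-contained model. This is a sketch under stated assumptions: PropertyLevelsSketch is a hypothetical class that borrows accessor names from the table of common methods, plain maps stand in for the real property storage, and the changeid values are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the three property levels; not the Okapi classes.
class PropertyLevelsSketch {
    private final Map<String, String> resourceProps = new HashMap<>();
    private final Map<String, String> sourceProps = new HashMap<>();
    // language -> (property name -> value), one map per target language
    private final Map<String, Map<String, String>> targetProps = new HashMap<>();

    void setProperty(String name, String value) {
        resourceProps.put(name, value);
    }

    void setSourceProperty(String name, String value) {
        sourceProps.put(name, value);
    }

    void setTargetProperty(String lang, String name, String value) {
        targetProps.computeIfAbsent(lang, k -> new HashMap<>()).put(name, value);
    }

    String getProperty(String name) {
        return resourceProps.get(name);
    }

    String getSourceProperty(String name) {
        return sourceProps.get(name);
    }

    // Returns null when the language or the property is unknown.
    String getTargetProperty(String lang, String name) {
        Map<String, String> props = targetProps.get(lang);
        return props == null ? null : props.get(name);
    }

    // Mirrors the TMX case: one changeid on <tu>, one on each <tuv>.
    // The values are made up for the example.
    static PropertyLevelsSketch tmxExample() {
        PropertyLevelsSketch tu = new PropertyLevelsSketch();
        tu.setProperty("changeid", "user-tu");
        tu.setSourceProperty("changeid", "user-src");
        tu.setTargetProperty("fr", "changeid", "user-fr");
        return tu;
    }
}
```

The same property name can therefore hold three independent values, one per level, without any of them overwriting the others.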

The TextUnit, TextContainer and TextFragment Classes

The TextUnit holds the source text and all the target texts, as well as all the source and target properties for a given extractable item (e.g. an HTML paragraph).

The source object is a TextContainer that can be accessed with getSource() and setSource().

The target objects are also TextContainer objects (one for each language). They are available using: hasTarget(), getTarget(), setTarget(), createTarget(). You can also get the list of targets available with getTargetLanguages().

The text of a TextContainer object is stored in a TextFragment object that can be accessed with getContent() and setContent(). There are also helper methods at the TextUnit level to access the text directly: getSourceContent(), setSourceContent(), getTargetContent(), and setTargetContent().

Here are some examples on how to access the resources:

TextUnit tu = new TextUnit("id1", "Source text");
TextContainer srcCont = tu.getSource();
TextFragment srcText1 = tu.getSourceContent();
TextFragment srcText2 = srcCont.getContent();
assert(srcText1==srcText2);

This:

TextContainer trgCont;
if ( tu.hasTarget("FR") ) {
   trgCont = tu.getTarget("FR");
}
else {
   trgCont = tu.setTarget("FR", tu.getSource().clone());
}

Is the same as this:

trgCont = tu.getTarget("FR");
if ( trgCont == null ) {
   trgCont = tu.setTarget("FR", tu.getSource().clone());
}

And is also the same as this:

trgCont = tu.createTarget("FR", false, IResource.COPY_CONTENT);
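The get-or-create pattern shared by the three forms above can be reproduced in a self-contained model. This is only a sketch: TargetSketch is a hypothetical class, plain strings stand in for TextContainer objects, and the meaning given to the boolean argument (overwrite an existing target) is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a text unit with per-language targets.
// Plain strings stand in for TextContainer objects; this is not the Okapi API.
class TargetSketch {
    private final String source;
    private final Map<String, String> targets = new HashMap<>();

    TargetSketch(String source) {
        this.source = source;
    }

    boolean hasTarget(String lang) {
        return targets.containsKey(lang);
    }

    // Returns null when no target exists for the language.
    String getTarget(String lang) {
        return targets.get(lang);
    }

    // Stores the target and returns it, so the call can be used in an assignment.
    String setTarget(String lang, String text) {
        targets.put(lang, text);
        return text;
    }

    // Creates the target from a copy of the source when it is missing,
    // or when overwrite is true; otherwise returns the existing target.
    String createTarget(String lang, boolean overwrite) {
        if (overwrite || !targets.containsKey(lang)) {
            targets.put(lang, source);
        }
        return targets.get(lang);
    }
}
```

With overwrite set to false, an existing translation is returned untouched, which is what makes the one-line form safe to call repeatedly.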

The Coded Text

TODO

The TextFragment class is responsible for storing and manipulating the text and any inline codes that may be within the text (e.g. <b> in an HTML paragraph). To make both easier to handle, the TextFragment stores the text and the codes separately.

The text part is represented in a format called coded text. The coded text is a normal Java String object that you can manipulate almost like any other String. The difference is that when the TextFragment has codes, each one is marked up as a pair of special Unicode characters:

  • The first one is a prefix that indicates what kind of code it is.
  • The second one is a value used to retrieve the code itself when needed.

Because both characters are in the Unicode Private Use Area, most normal string functions leave them unchanged. For example, you can safely call String.toLowerCase() on a coded text.

Use the method getCodedText() to access the coded text, and setCodedText() to set it back. For example:

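TextFragment tf = new TextFragment("string");
tf.append(TagType.PLACEHOLDER, "br", "<br/>");
// JUnit convention: expected value first, actual value second
assertEquals("string<br/>", tf.toString());
String tmp = tf.getCodedText();
// Upper-casing changes the text but leaves the two marker characters untouched
tf.setCodedText(tmp.toUpperCase());
assertEquals("STRING<br/>", tf.toString());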

In the example above, the original value of the String tmp is stringXY, where XY is the pair of special Unicode characters representing the <br/> code.
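The mechanism behind the example above can be imitated without the library. The following is a minimal sketch, not the TextFragment implementation: CodedTextSketch is a hypothetical class, it supports only one kind of code, and the two marker code points are illustrative values picked from the Private Use Area.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of a coded-text container; not the Okapi TextFragment.
class CodedTextSketch {
    // Illustrative Private Use Area code points (assumed values).
    static final char MARKER = '\uE103';     // prefix: a placeholder code follows
    static final char INDEX_BASE = '\uE110'; // second char = INDEX_BASE + index in codes

    private final StringBuilder codedText = new StringBuilder();
    private final List<String> codes = new ArrayList<>();

    CodedTextSketch(String text) {
        codedText.append(text);
    }

    // Appends the two marker characters and stores the real code in the list.
    void appendCode(String data) {
        codedText.append(MARKER).append((char) (INDEX_BASE + codes.size()));
        codes.add(data);
    }

    String getCodedText() {
        return codedText.toString();
    }

    void setCodedText(String newCodedText) {
        codedText.setLength(0);
        codedText.append(newCodedText);
    }

    // Expands each marker pair back into the code it points at.
    @Override
    public String toString() {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < codedText.length(); i++) {
            char c = codedText.charAt(i);
            if (c == MARKER && i + 1 < codedText.length()) {
                out.append(codes.get(codedText.charAt(++i) - INDEX_BASE));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Because private-use characters have no case mappings, upper-casing the coded text of this model changes "string" but leaves the marker pair intact, so the code still expands to <br/> afterwards.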

Note that adding and removing codes should be done with TextFragment methods most of the time. Any change to the codes in a coded text must be synchronized with changes in the list of the real codes in the TextFragment object.
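One way to picture what "synchronized" means here is a check that every marker pair in a coded text still points at an existing code. The sketch below is hypothetical: CodedTextCheckSketch is not part of the library, it assumes a single marker prefix, and the code points are illustrative Private Use Area values.

```java
// Hypothetical consistency check for a coded text; not part of the Okapi API.
class CodedTextCheckSketch {
    // Illustrative Private Use Area code points (assumed values).
    static final char MARKER = '\uE103';
    static final char INDEX_BASE = '\uE110';

    // Returns true when every marker in codedText is followed by an index
    // character that refers to one of the codeCount codes in the list.
    static boolean isConsistent(String codedText, int codeCount) {
        for (int i = 0; i < codedText.length(); i++) {
            if (codedText.charAt(i) != MARKER) {
                continue;
            }
            if (i + 1 >= codedText.length()) {
                return false; // dangling prefix without an index character
            }
            int index = codedText.charAt(i + 1) - INDEX_BASE;
            if (index < 0 || index >= codeCount) {
                return false; // marker points at a code that does not exist
            }
            i++; // skip the index character
        }
        return true;
    }
}
```

A string edit that deletes only one character of a pair, or a code removed from the list without removing its markers, would make such a check fail.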

Some of the methods of the TextFragment are: append(), hasCode(), insert(), getCodedText(), remove(), setCodedText(), clear(), getCodes(), and isEmpty(). See the TextFragment documentation for a complete list and a description of each method.

How To...

How do I get the language of a document?

Use StartDocument.getLanguage().

How do I know if a document can have content in more than one language?

Use StartDocument.isMultilingual().

How do I access the target text of a TextUnit?

Use something like TextUnit.hasTarget("fr") to determine whether a target text is available for a given language.

Then use TextUnit.getTarget("fr") to retrieve the TextContainer object for that language.
