Filter Overview
The role of a filter (IFilter interface) is to parse an input document and split it into standardized events that can be carried through the pipeline.
The filter event is carried in a FilterEvent object that holds the type of the event and the resource that may be associated with it. The following table shows the correspondence between event types and resource classes:
FilterEventType value | Associated Resource |
---|---|
START | none |
START_DOCUMENT | StartDocument |
START_SUBDOCUMENT | StartSubDocument |
START_GROUP | StartGroup |
DOCUMENT_PART | DocumentPart |
TEXT_UNIT | TextUnit |
END_GROUP | Ending |
END_SUBDOCUMENT | Ending |
END_DOCUMENT | Ending |
FINISHED | none |
CANCELED | none |
A document has, at least, the following events: START, START_DOCUMENT, END_DOCUMENT and FINISHED.
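The mapping between event types and resources can be sketched as a small lookup. This is only an illustration: the enum below is a stand-in that mirrors the table, not the framework's own FilterEventType, and the class name EventStreamSketch and the helpers resourceFor() and hasMandatoryEvents() are hypothetical.

```java
import java.util.EnumSet;
import java.util.List;

class EventStreamSketch {
    // Stand-in for the event types listed in the table; not the framework's own enum.
    enum FilterEventType {
        START, START_DOCUMENT, START_SUBDOCUMENT, START_GROUP, DOCUMENT_PART,
        TEXT_UNIT, END_GROUP, END_SUBDOCUMENT, END_DOCUMENT, FINISHED, CANCELED
    }

    // Name of the resource class associated with each event type, as given in the table.
    static String resourceFor(FilterEventType type) {
        switch (type) {
            case START_DOCUMENT:    return "StartDocument";
            case START_SUBDOCUMENT: return "StartSubDocument";
            case START_GROUP:       return "StartGroup";
            case DOCUMENT_PART:     return "DocumentPart";
            case TEXT_UNIT:         return "TextUnit";
            case END_GROUP:
            case END_SUBDOCUMENT:
            case END_DOCUMENT:      return "Ending";
            default:                return "none"; // START, FINISHED, CANCELED
        }
    }

    // True if the stream contains the four events that every document has at least.
    static boolean hasMandatoryEvents(List<FilterEventType> events) {
        return events.containsAll(EnumSet.of(
            FilterEventType.START, FilterEventType.START_DOCUMENT,
            FilterEventType.END_DOCUMENT, FilterEventType.FINISHED));
    }

    public static void main(String[] args) {
        // A hand-written event stream for a document with a single text unit.
        List<FilterEventType> events = List.of(
            FilterEventType.START, FilterEventType.START_DOCUMENT,
            FilterEventType.TEXT_UNIT,
            FilterEventType.END_DOCUMENT, FilterEventType.FINISHED);
        for (FilterEventType type : events) {
            System.out.println(type + " -> " + resourceFor(type));
        }
        System.out.println("mandatory events present: " + hasMandatoryEvents(events));
    }
}
```

A pipeline step would typically switch on the event type in the same way and cast the resource to the class named in the table.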
Resources
The resource object contains the data associated with the event, such as: text, read-only properties, localizable properties, information about groups, etc.
Some resources can be referents (i.e. be referred to). Referents are used, for example, when a text content has some embedded code that corresponds to a separate run of text. For instance, the text of a footnote embedded inside a paragraph may be treated as a referent. The resources that can be referents are:
- StartGroup
- DocumentPart
- TextUnit
The resources that can have references are:
- StartDocument
- StartSubDocument
- StartGroup
- DocumentPart
- TextUnit (including inside inline codes)
There are three interfaces that are common to the resources:
- IResource is implemented for all resources.
- INameable is implemented for the resources that hold properties, names, and various other data.
- IReferenceable is implemented by the resources that can be referents.
The following table indicates what common methods are available for each resource:
Method | Ending | StartDocument | StartSubDocument | StartGroup | DocumentPart | TextUnit |
---|---|---|---|---|---|---|
IResource.getId | yes | yes | yes | yes | yes | yes |
IResource.setId | yes | yes | yes | yes | yes | yes |
IResource.getSkeleton | yes | yes | yes | yes | yes | yes |
IResource.setSkeleton | yes | yes | yes | yes | yes | yes |
IResource.getAnnotation | yes | yes | yes | yes | yes | yes |
IResource.setAnnotation | yes | yes | yes | yes | yes | yes |
INameable.getName | - | yes | yes | yes | yes | yes |
INameable.setName | - | yes | yes | yes | yes | yes |
INameable.getType | - | yes | yes | yes | yes | yes |
INameable.setType | - | yes | yes | yes | yes | yes |
INameable.getMimeType | - | yes | yes | yes | yes | yes |
INameable.setMimeType | - | yes | yes | yes | yes | yes |
INameable.isTranslatable | - | yes | yes | yes | yes | yes |
INameable.setIsTranslatable | - | yes | yes | yes | yes | yes |
INameable.preserveWhitespaces | - | yes | yes | yes | yes | yes |
INameable.setPreserveWhitespaces | - | yes | yes | yes | yes | yes |
INameable.getProperty | - | yes | yes | yes | yes | yes |
INameable.setProperty | - | yes | yes | yes | yes | yes |
INameable.getPropertyNames | - | yes | yes | yes | yes | yes |
INameable.hasProperty | - | yes | yes | yes | yes | yes |
INameable.getSourceProperty | - | yes | yes | yes | yes | yes |
INameable.setSourceProperty | - | yes | yes | yes | yes | yes |
INameable.getSourcePropertyNames | - | yes | yes | yes | yes | yes |
INameable.hasSourceProperty | - | yes | yes | yes | yes | yes |
INameable.getTargetProperty | - | yes | yes | yes | yes | yes |
INameable.setTargetProperty | - | yes | yes | yes | yes | yes |
INameable.getTargetPropertyNames | - | yes | yes | yes | yes | yes |
INameable.hasTargetProperty | - | yes | yes | yes | yes | yes |
INameable.createTargetProperty | - | yes | yes | yes | yes | yes |
INameable.getTargetLanguages | - | yes | yes | yes | yes | yes |
getParentId | - | - | yes | yes | yes | yes |
setParentId | - | - | yes | yes | yes | yes |
IReferenceable.isReferent | - | - | - | yes | yes | yes |
IReferenceable.setIsReferent | - | - | - | yes | yes | yes |
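One way to read the table is as a statement of which interfaces each resource implements. The sketch below models three of the columns with heavily simplified stand-in interfaces (one getter and setter each) and checks them with instanceof. The helper classes BaseResource and NamedResource and the method capabilities() are hypothetical, introduced only for this illustration; the real classes carry many more methods.

```java
class ResourceCapabilitiesSketch {
    // Simplified stand-ins for the three common interfaces; only one method pair each.
    interface IResource { String getId(); void setId(String id); }
    interface INameable { String getName(); void setName(String name); }
    interface IReferenceable { boolean isReferent(); void setIsReferent(boolean value); }

    // Hypothetical helper holding the IResource part.
    static class BaseResource implements IResource {
        private String id;
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
    }

    // Hypothetical helper adding the INameable part.
    static class NamedResource extends BaseResource implements INameable {
        private String name;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    // Ending column: IResource only.
    static class Ending extends BaseResource { }

    // StartDocument column: IResource and INameable, but not a referent.
    static class StartDocument extends NamedResource { }

    // TextUnit column: all three interfaces.
    static class TextUnit extends NamedResource implements IReferenceable {
        private boolean referent;
        public boolean isReferent() { return referent; }
        public void setIsReferent(boolean value) { this.referent = value; }
    }

    // Lists the common interfaces a resource implements.
    static String capabilities(Object resource) {
        StringBuilder sb = new StringBuilder();
        if (resource instanceof IResource) sb.append("IResource ");
        if (resource instanceof INameable) sb.append("INameable ");
        if (resource instanceof IReferenceable) sb.append("IReferenceable ");
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println("Ending: " + capabilities(new Ending()));
        System.out.println("StartDocument: " + capabilities(new StartDocument()));
        System.out.println("TextUnit: " + capabilities(new TextUnit()));
    }
}
```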
In addition, some resources have additional specific methods:
StartDocument | TextUnit |
---|---|
getLanguage | getSource |
setLanguage | setSource |
getEncoding | getTarget |
setEncoding | setTarget |
isMultilingual | hasTarget |
setIsMultilingual | createTarget |
getParameters | removeTarget |
setParameters | getSourceContent |
 | setSourceContent |
 | getTargetContent |
 | setTargetContent |
 | getEncoder |
 | setEncoder |
 | isEmpty |
Skeleton
TODO
The parts of the input document that are not directly used by the caller are called the 'skeleton' and correspond to the underlying original codes that make up the source document.
The skeleton is carried at the resource level and is accessible with getSkeleton().
Each filter is responsible for creating its skeleton data and for deciding which resource each part of it is associated with (different filters can, of course, share classes for this). The internal representation of the skeleton is specific to each skeleton implementation.
TODO
Text Handling
Modifiable data is accessible in two places in the resources:
- Modifiable properties
- Translatable text
Modifiable properties are data that may need to be modified but that are not text in the sense of translatable text. For example, a modifiable property could be the value of the dir attribute, or the URL of an href attribute in HTML.
Note that there is no relationship between how the data is stored in the original format and whether it is a modifiable property or translatable text in the resource. For example, the HTML title attribute is translatable text and therefore should be extracted as TextUnit content rather than as a modifiable property.
Most resources can have modifiable properties. Note that there are three kinds of modifiable properties:
- Modifiable properties at the resource level
- Source modifiable properties (stored in the source TextContainer)
- Target modifiable properties (stored in the target TextContainer objects)
An example of these different kinds of modifiable properties can be found in TMX, where the changeid attribute exists for the <tu> element as well as for the source and the target <tuv> elements.
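The three levels can be pictured as three separate property stores. The following is a minimal sketch using plain string maps, assuming one store for the resource, one for the source, and one per target language. The class ModifiablePropertiesSketch and its string-based signatures are simplifications made for this illustration and are not the framework's actual property API; the changeid values are made up.

```java
import java.util.HashMap;
import java.util.Map;

class ModifiablePropertiesSketch {
    // Resource-level properties (in TMX terms: attributes of <tu>).
    private final Map<String, String> resourceProps = new HashMap<>();
    // Source properties (attributes of the source <tuv>).
    private final Map<String, String> sourceProps = new HashMap<>();
    // Target properties, one map per target language (attributes of each target <tuv>).
    private final Map<String, Map<String, String>> targetProps = new HashMap<>();

    void setProperty(String name, String value) { resourceProps.put(name, value); }
    String getProperty(String name) { return resourceProps.get(name); }

    void setSourceProperty(String name, String value) { sourceProps.put(name, value); }
    String getSourceProperty(String name) { return sourceProps.get(name); }

    void setTargetProperty(String language, String name, String value) {
        targetProps.computeIfAbsent(language, k -> new HashMap<>()).put(name, value);
    }

    // Returns null when the language or the property is not present.
    String getTargetProperty(String language, String name) {
        Map<String, String> props = targetProps.get(language);
        return props == null ? null : props.get(name);
    }

    public static void main(String[] args) {
        ModifiablePropertiesSketch tu = new ModifiablePropertiesSketch();
        // The same property name can hold a different value at each level.
        tu.setProperty("changeid", "alice");
        tu.setSourceProperty("changeid", "bob");
        tu.setTargetProperty("fr", "changeid", "carol");
        System.out.println(tu.getProperty("changeid"));
        System.out.println(tu.getSourceProperty("changeid"));
        System.out.println(tu.getTargetProperty("fr", "changeid"));
    }
}
```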
The TextUnit, TextContainer and TextFragment Classes
The TextUnit holds the source text and all the target texts, as well as all the source and target properties for a given extractable item (e.g. an HTML paragraph).
The source object is a TextContainer that can be accessed with getSource() and setSource().
The target objects are also TextContainer objects (one for each language). They are available using hasTarget(), getTarget(), setTarget() and createTarget(). You can also get the list of available targets with getTargetLanguages().
The text of a TextContainer object is stored in a TextFragment object that can be accessed with getContent() and setContent(). There are also helper methods at the TextUnit level to access the text directly: getSourceContent(), setSourceContent(), getTargetContent(), and setTargetContent().
Here are some examples of how to access the resources:

    TextUnit tu = new TextUnit("id1", "Source text");
    TextContainer srcCont = tu.getSource();
    TextFragment srcText1 = tu.getSourceContent();
    TextFragment srcText2 = srcCont.getContent();
    // Both calls return the same TextFragment object
    assert(srcText1==srcText2);

This:

    TextContainer trgCont;
    if ( tu.hasTarget("FR") ) {
        trgCont = tu.getTarget("FR");
    }
    else {
        trgCont = tu.setTarget("FR", tu.getSource().clone());
    }

Is the same as this:

    trgCont = tu.getTarget("FR");
    if ( trgCont == null ) {
        trgCont = tu.setTarget("FR", tu.getSource().clone());
    }

And is also the same as this:

    trgCont = tu.createTarget("FR", false, IResource.COPY_CONTENT);
The Coded Text
TODO
The TextFragment class is responsible for storing and manipulating the text and any inline codes that may be within the text (e.g. <b> in an HTML paragraph). To allow easier handling of the text and the codes, they are stored separately in a TextFragment.
The text part is represented in a format called coded text. The coded text is a normal Java String object that you can manipulate almost like any String object. The difference is that when the TextFragment has codes, they are marked up as a pair of special Unicode characters:
- The first one is a prefix that indicates what kind of code it is.
- The second one is a value used to retrieve the code itself when needed.
Because both characters are in the user-defined Unicode range, most normal string functions have no effect on them. For example, you can safely call String.toLowerCase() on a coded text.
Use the method getCodedText() to access the coded text, and setCodedText() to set it back. For example:

    TextFragment tf = new TextFragment("string");
    tf.append(TagType.PLACEHOLDER, "br", "<br/>");
    assertEquals(tf.toString(), "string<br/>");
    String tmp = tf.getCodedText();
    tf.setCodedText(tmp.toUpperCase());
    assertEquals(tf.toString(), "STRING<br/>");

In the example above, the original value of the String tmp is stringXY, where XY is the pair of special Unicode characters representing the <br/> code.
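The marker-pair idea can be tried out without the framework. The sketch below is a toy model, not the TextFragment implementation: the two marker values are assumptions picked from the Unicode Private Use Area for illustration, the class CodedTextSketch and its appendCode() method are hypothetical, and only placeholder codes are handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CodedTextSketch {
    // Assumed marker values from the Private Use Area; the real constants may differ.
    static final char PLACEHOLDER_MARKER = '\uE103'; // prefix: what kind of code this is
    static final char INDEX_BASE = '\uE110';         // second character = INDEX_BASE + code index

    private final StringBuilder codedText = new StringBuilder();
    private final List<String> codes = new ArrayList<>();

    CodedTextSketch(String text) {
        codedText.append(text);
    }

    // Stores the real code in the list and appends its two-character marker to the coded text.
    void appendCode(String data) {
        codes.add(data);
        codedText.append(PLACEHOLDER_MARKER).append((char) (INDEX_BASE + codes.size() - 1));
    }

    String getCodedText() {
        return codedText.toString();
    }

    // Replaces the coded text; the marker pairs must still match the list of codes.
    void setCodedText(String newCodedText) {
        codedText.setLength(0);
        codedText.append(newCodedText);
    }

    // Expands each marker pair back into the code it stands for.
    @Override
    public String toString() {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < codedText.length(); i++) {
            char c = codedText.charAt(i);
            if (c == PLACEHOLDER_MARKER && i + 1 < codedText.length()) {
                int index = codedText.charAt(++i) - INDEX_BASE;
                out.append(codes.get(index));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        CodedTextSketch tf = new CodedTextSketch("string");
        tf.appendCode("<br/>");
        String coded = tf.getCodedText();
        System.out.println(coded.length()); // 8: six letters plus the two marker characters
        // Upper-casing changes the letters but leaves the private-use markers untouched.
        tf.setCodedText(coded.toUpperCase(Locale.ROOT));
        System.out.println(tf);             // STRING<br/>
    }
}
```

Locale.ROOT is used so that the letter casing does not depend on the default locale of the machine running the sketch.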
Note that adding and removing codes should be done with TextFragment methods most of the time. Any change to the codes in a coded text must be synchronized with changes in the list of the real codes in the TextFragment object.
Some of the methods of the TextFragment are: append(), hasCode(), insert(), getCodedText(), remove(), setCodedText(), clear(), getCodes(), isEmpty(), etc. See the TextFragment documentation for a complete list and a description of each method.
How To...
How do I get the language of a document?
Use StartDocument.getLanguage().
How do I know if a document can have content in more than one language?
Use StartDocument.isMultilingual().
How do I access the target text of a TextUnit?
Use something like TextUnit.hasTarget("fr") to determine if there is a target text available for a given language.
Then use TextUnit.getTarget("fr") to retrieve the TextContainer object for that language.