DocBook 5.0 XML Filter
DocBook 5.0 is an XML-based standard format for document publishing.
DocBook defines various elements, including inline elements such as emphasis, link, and literal.
Here is a short DocBook example:
<article xmlns='http://docbook.org/ns/docbook'>
<title>Example emphasis</title>
<para>The <emphasis>most</emphasis> important example of this phenomenon occurs in
A. Nonymous's book <citetitle>Power Snacking</citetitle>.
</para>
</article>
If we apply a vanilla XMLFilter to this, we get six trans-units, one per text run:
Example emphasis
The
most
important example of this phenomenon … book
Power Snacking
.
This isn’t what we want. For this DocBook example, we’d like just two trans-units:
Example emphasis
The <g id="1">most</g> important example of this phenomenon ... book <g id="2">Power Snacking</g>.
This issue proposes creating a predefined XML Filter configuration, perhaps named okf_xml-docbook.
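As a rough sketch of what such a configuration might contain: the fragment below assumes the XML Filter reads standard ITS 2.0 rules, and it only covers the inline elements mentioned in this issue. The db prefix and the exact selector list are illustrative choices, not the final content of okf_xml-docbook.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the selector list is an assumption based on the
     elements named in this issue, not a complete DocBook inline set. -->
<its:rules xmlns:its="http://www.w3.org/2005/11/its"
           xmlns:db="http://docbook.org/ns/docbook"
           version="2.0">
  <!-- Treat these elements as part of the surrounding text flow,
       so they become inline codes instead of splitting the paragraph. -->
  <its:withinTextRule withinText="yes"
      selector="//db:emphasis | //db:link | //db:literal | //db:citetitle"/>
</its:rules>
```

With a rule like this, the para in the example above should be extracted as a single unit, with emphasis and citetitle kept inline.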
Comments (4)
-
-
reporter Thank you, @ysavourel, for these pointers. I did not know about them.
My experiment and these lines of code in ITSFilter.java:
switch ( trav.getWithinText() ) {
case ITraversal.WITHINTEXT_NESTED: //TODO: deal with nested elements
    // For now treat them as inline
case ITraversal.WITHINTEXT_YES:
make me think that withinTextRule[@withinText="nested"] isn’t really supported.
Do you know how difficult it might be to implement this? I think implementing this is needed for the footnote element.
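For reference, the kind of rule in question might look like the sketch below. It assumes ITS 2.0 syntax and uses footnote as the only nested element; the db prefix is an illustrative choice.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: marks footnote content as a separate (nested) text flow
     inside its parent paragraph, per the ITS 2.0 Elements Within Text
     data category. -->
<its:rules xmlns:its="http://www.w3.org/2005/11/its"
           xmlns:db="http://docbook.org/ns/docbook"
           version="2.0">
  <its:withinTextRule withinText="nested" selector="//db:footnote"/>
</its:rules>
```

Given the switch statement quoted above, the filter would presumably treat such a footnote the same as a withinText="yes" element for now.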
-
The nested case: Indeed it’s not supported properly in the filter.
It’s likely because it’s not easily implemented with the Okapi model (a bit like the <sub> inside the inline codes for XLIFF Filter). Someone fixed the <sub> issue not long ago, maybe the nested case can be implemented in similar ways.
-
reporter - changed status to resolved
The pull request #502 has been merged. Also, the new config id has been added to the Filters wiki page.
You may have seen this already, but just in case:
Some work has been done on DocBook and XML/ITS extraction.
See for example: https://xmlguru.cz/2013/05/docbook-and-its2
And some of the rules are already worked out: https://www.w3.org/International/its/wiki/RulesRepository