DocBook 5.0 XML Filter

Issue #1036 resolved
Kuro Kurosaka (BH Lab) created an issue

DocBook (5.0) is an XML-based standard format for document publishing.
DocBook defines various elements, including inline elements such as emphasis, link, and literal. If the vanilla XMLFilter is used for extraction, these inline elements are not treated as inline, and the text is split at each of them.
Here is a short DocBook example:

<article xmlns='http://docbook.org/ns/docbook'>
  <title>Example emphasis</title>
  <para>The <emphasis>most</emphasis> important example of this phenomenon occurs in
  A. Nonymous's book <citetitle>Power Snacking</citetitle>.
  </para>
</article>

If we apply a vanilla XMLFilter to this, we’ll get 6 trans-units for:

  • Example emphasis
  • The
  • most
  • important example of this phenomenon … book
  • Power Snacking
  • .

This isn’t what we want. What we’d like for this DocBook example is two trans-units:

  • Example emphasis
  • The <g id="1">most</g> important example of this phenomenon ... book <g id="2">Power Snacking</g>.

This issue proposes creating a predefined XML Filter configuration, perhaps named okf_xml-docbook.
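
A rough sketch of what such a configuration might contain, assuming it is written as an ITS rules document and only needs to cover the inline elements that appear in this issue. The `db` prefix and the choice of elements are illustrative assumptions, not a complete list of DocBook inline elements:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<its:rules xmlns:its="http://www.w3.org/2005/11/its"
           xmlns:db="http://docbook.org/ns/docbook"
           version="1.0">
  <!-- Mark these DocBook elements as part of the surrounding text flow,
       so they are extracted as inline codes instead of separate trans-units.
       Illustrative subset only; a real configuration would list more elements. -->
  <its:withinTextRule withinText="yes"
      selector="//db:emphasis | //db:citetitle | //db:link | //db:literal"/>
</its:rules>
```

With a rule like this, the `emphasis` and `citetitle` elements in the example above would be expected to stay inside the paragraph's trans-unit rather than splitting it.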

Comments (4)

  1. Kuro Kurosaka (BH Lab) reporter

    Thank you, @YvesS, for these pointers. I did not know about them.

    My experiment and these lines of code in ITSFilter.java:

                switch ( trav.getWithinText() ) {
                case ITraversal.WITHINTEXT_NESTED: //TODO: deal with nested elements
                    // For now treat them as inline
                case ITraversal.WITHINTEXT_YES:
    

    make me think that withinTextRule[@withinText="nested"] isn’t really supported.

    Do you know how difficult it might be to implement this? I think implementing this is needed for the footnote element.

  2. YvesS

    The nested case: indeed, it’s not supported properly in the filter.
    That is likely because it’s not easily implemented with the Okapi model (a bit like the <sub> inside the inline codes for the XLIFF Filter). Someone fixed the <sub> issue not long ago, so maybe the nested case can be implemented in a similar way.
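
For reference, the nested rule discussed in these comments would look roughly like the following in an ITS rules document. This is only a sketch: it assumes `footnote` is selected in the DocBook namespace via an illustrative `db` prefix, and, as the comments note, the filter currently treats the `nested` value the same as `yes`:

```xml
<its:rules xmlns:its="http://www.w3.org/2005/11/its"
           xmlns:db="http://docbook.org/ns/docbook"
           version="1.0">
  <!-- A footnote sits inside a paragraph but carries its own independent text,
       which is the situation withinText="nested" is meant to describe. -->
  <its:withinTextRule withinText="nested" selector="//db:footnote"/>
</its:rules>
```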
