xhtml2stm / Home


Part of the AMI system for extracting facts from STM literature. See also PDF2SVG, SVG and SVG2XML.

Extracts scientific, technical and medical (STM) information from HTML/SVG/PDF.

This is being developed as the page where most users will land in the AMI system. See Current activity.

XHTML2STM takes the XHTML+SVG emitted by SVG2XML and uses a Visitor pattern to apply domain-specific analysis and indexing. Visitors visit Visitables and carry out operations on them, such as transformation and indexing. Thus a ChemVisitor knows how to create a CMLMolecule from an SVGVisitable if the SVGVisitable contains a picture of the molecule.

The most important Visitables are currently (2013-11-06):

  • HTMLVisitable (XHTML "text")
  • SVGVisitable (SVG shapes and text)
  • ImageVisitable (bitmaps, not yet highly developed)
  • TableVisitable (XHTML table, whose cell might contain other visitables)

In general a Visitable is passive: it doesn't know who will visit it or why. Its primary role is to organize its content in a way that is easy to visit. The number of Visitables will be roughly constant and will depend on the types of abstract information supplied. Removing a Visitable will cause many Visitors to break.

A Visitor (e.g. ChemVisitor) is active and knows what it wants to do. Most Visitors can visit two or more Visitables and may find different and complementary information in each. There can be a large number of Visitors, and Visitors can be added or removed without affecting the others or causing code to break. We will probably have several Visitors iterating over one Visitable (e.g. from an HTMLVisitable the ChemVisitor would extract formulae, the SequenceVisitor sequences, the SpeciesVisitor species and so on). Visitors should normally be in distinct packages and may well become separate projects. Current Visitors:

  • ChemVisitor. Extracts formulae from SVG and XHTML and (to come) builds reactions.
  • PlotVisitor. Interprets x-y plots from SVG.
  • SequenceVisitor. (to come - volunteers?)
  • SpeciesVisitor. Extracts species from SVG, Tables and XHTML.
  • TreeVisitor. Extracts trees from SVG.
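The split between passive Visitables and active Visitors described above can be sketched in a few lines of Java. This is only an illustration of the pattern: every name in it (HtmlContent, SvgContent, KeywordVisitor, VisitorSketch) is hypothetical and is not taken from the xhtml2stm code base, whose real interfaces may look quite different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout: this is NOT the xhtml2stm API, only an
// illustration of the Visitor/Visitable split described in the text.
interface Visitor {
    void visit(HtmlContent html);   // called for XHTML "text"
    void visit(SvgContent svg);     // called for SVG text fragments
}

interface Visitable {
    // A Visitable is passive: it only hands itself to whichever Visitor arrives.
    void accept(Visitor visitor);
}

final class HtmlContent implements Visitable {
    final String text;
    HtmlContent(String text) { this.text = text; }
    @Override public void accept(Visitor visitor) { visitor.visit(this); }
}

final class SvgContent implements Visitable {
    final List<String> textFragments;
    SvgContent(List<String> textFragments) { this.textFragments = textFragments; }
    @Override public void accept(Visitor visitor) { visitor.visit(this); }
}

// An active Visitor: it knows what it is looking for in each kind of content.
final class KeywordVisitor implements Visitor {
    private final String keyword;
    final List<String> hits = new ArrayList<>();

    KeywordVisitor(String keyword) { this.keyword = keyword; }

    @Override public void visit(HtmlContent html) {
        if (html.text.contains(keyword)) hits.add("html:" + keyword);
    }

    @Override public void visit(SvgContent svg) {
        for (String fragment : svg.textFragments) {
            if (fragment.contains(keyword)) hits.add("svg:" + keyword);
        }
    }
}

final class VisitorSketch {
    // Sends one Visitor over several Visitables and collects what it found.
    static List<String> run(String keyword, List<Visitable> contents) {
        KeywordVisitor visitor = new KeywordVisitor(keyword);
        for (Visitable content : contents) content.accept(visitor);
        return visitor.hits;
    }
}
```

In this sketch a second Visitor can be written without touching HtmlContent or SvgContent, whereas removing one of the content classes would force every Visitor to change, which mirrors the trade-off noted above.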

We are planning Named Entity Recognizers, many of which will be implemented through a generic regex engine:

  • MuseumVisitor (abbreviations and names)
  • AccessionVisitor (accession numbers)
  • DOIVisitor (Digital Object Identifiers for papers)
  • GeonamesVisitor
  • DateVisitor
  • AuthorVisitor
  • CitationVisitor

Note that these can be section-specific (e.g. species in titles, authors in running text, etc.).
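As a rough idea of what a regex-driven recognizer could look like, here is a minimal sketch. The class RegexRecognizer and its methods are hypothetical and are not the planned engine. The DOI pattern is a common approximation rather than a complete definition, and the date pattern only covers ISO-style dates such as 2013-11-06.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class: a generic regex-driven recognizer, not the project's engine.
final class RegexRecognizer {
    private final String entityType;
    private final Pattern pattern;

    RegexRecognizer(String entityType, String regex) {
        this.entityType = entityType;
        this.pattern = Pattern.compile(regex);
    }

    // Returns every match in the text, prefixed with the entity type.
    List<String> find(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            found.add(entityType + ":" + matcher.group());
        }
        return found;
    }

    // Approximate DOI pattern (an assumption; real DOIs allow more characters).
    static RegexRecognizer doi() {
        return new RegexRecognizer("doi", "10\\.\\d{4,9}/[-._;()/:A-Za-z0-9]+");
    }

    // ISO-style dates only; other date formats are ignored by this sketch.
    static RegexRecognizer isoDate() {
        return new RegexRecognizer("date", "\\b\\d{4}-\\d{2}-\\d{2}\\b");
    }
}
```

A recognizer of this kind could be handed only the text of one section (titles, running text and so on) to make it section-specific. Note that the DOI pattern would also swallow a full stop that directly follows a DOI at the end of a sentence, so real use would need some post-processing.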


To create scripts for running from the command line, build the system normally, then run mvn appassembler:assemble, which will create a separate target/bin directory and an uber-jar.


Current capabilities of ChemVisitor

ChemVisitor (in createCML()) runs MoleculeCreator.createReactions() to create CML reactions. If none are found in the SVG input, it runs MoleculeCreator.createMolecules() instead. Reaction creation:

  • Tries to interpret every connected component as an arrow
  • For each arrow found, looks for the reactant and the product (createReactionsAndAddMolecules(), with MoleculeCreator.createMolecule() being used to interpret them)
  • For each caption found, looks for the molecule to which it refers and adds the label to the molecule as a CML label (addLabelsToMolecules())
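The "reactions first, molecules otherwise" control flow of createCML() can be sketched as follows. The helper below is hypothetical: the two suppliers merely stand in for MoleculeCreator.createReactions() and MoleculeCreator.createMolecules(), whose real signatures and return types are not shown on this page.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper, not part of xhtml2stm: the suppliers stand in for
// MoleculeCreator.createReactions() and MoleculeCreator.createMolecules(),
// whose real signatures and return types are assumptions here.
final class CmlFallbackSketch {
    static <T> List<T> reactionsElseMolecules(Supplier<List<T>> createReactions,
                                              Supplier<List<T>> createMolecules) {
        List<T> reactions = createReactions.get();
        if (reactions != null && !reactions.isEmpty()) {
            return reactions;              // reactions found: use them
        }
        return createMolecules.get();      // none found: fall back to molecules
    }
}
```

The molecule step is only evaluated when the reaction step comes back empty, so in this sketch the more expensive interpretation is attempted once and the fallback is lazy.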

MoleculeCreator.createMolecules() and MoleculeCreator.createReactions() first run ChemistryBuilder.createHigherPrimitives() to convert SVG to chemistry, which:

  • Pulls together groups of lines that should be represented as a single object (currently tram lines and hatched triangles)
  • Determines which objects are conceptually joined to others (creating Junction objects)
  • Joins characters to make words (as part of the above)
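The page does not say how characters are joined into words. One plausible heuristic, assumed here purely for illustration, is to sort the characters on a text line by their x position and start a new word whenever the horizontal gap exceeds a threshold. Glyph, WordJoiner and the maxGap parameter are all hypothetical names, and the real ChemistryBuilder may work differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical type: a positioned character, as might be read from SVG text.
final class Glyph {
    final char ch;
    final double x;      // left edge
    final double width;  // advance width
    Glyph(char ch, double x, double width) { this.ch = ch; this.x = x; this.width = width; }
}

final class WordJoiner {
    // Joins glyphs on one text line into words: a new word starts whenever the
    // horizontal gap to the previous glyph exceeds maxGap (an assumed threshold).
    static List<String> join(List<Glyph> glyphs, double maxGap) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble(g -> g.x));
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double previousRight = 0.0;
        for (Glyph glyph : sorted) {
            if (current.length() > 0 && glyph.x - previousRight > maxGap) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(glyph.ch);
            previousRight = glyph.x + glyph.width;
        }
        if (current.length() > 0) words.add(current.toString());
        return words;
    }
}
```

This sketch ignores vertical position, so subscripts and superscripts are treated like any other character on the line. A fuller version would first have to group glyphs by baseline.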

Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:

Example reaction scheme

SVG of part of it:

Example reaction

CML generated from above part

SVG animation of scheme

Another reaction scheme from that paper:

Example reaction scheme

The annotated SVG including OCR results:

Example annotated reaction scheme

Finally, an example of a complex molecule that ChemVisitor correctly interprets:

Complex molecule