Part of the AMI system for extracting facts from STM literature. See also PDF2SVG, SVG, SVG2XML

Extracts scientific, technical and medical (STM) information from HTML/SVG/PDF.

This is being developed as the page where most users will land in the AMI system.

Current activity

XHTML2STM takes the XHTML+SVG emitted by SVG2XML and uses a Visitor pattern to apply domain-specific analysis and indexing. Visitors visit Visitables and carry out operations on them, such as transformation and indexing. Thus a ChemVisitor knows how to create a CMLMolecule from an SVGVisitable if it contains a picture of the molecule.

The most important Visitables are currently (2013-11-06):

  • HTMLVisitable (XHTML "text")
  • SVGVisitable (SVG shapes and text)
  • ImageVisitable (bitmaps, not yet highly developed)
  • TableVisitable (XHTML table, whose cell might contain other visitables)

In general a Visitable is passive: it does not know who will visit it or why. Its primary role is to organize its content in a way that is easy to visit. The number of Visitables will be roughly constant and depends on the types of abstract information supplied. Removing a Visitable will cause many Visitors to break.

A Visitor (e.g. ChemVisitor) is active and knows what it wants to do. Most Visitors can visit two or more Visitables and may find different and complementary information in each. There can be a large number of Visitors, and new ones can be added or deleted without affecting the others or causing code to break. We will probably have several Visitors iterating over one Visitable (e.g. from an HTMLVisitable the ChemVisitor would extract formulae, the SequenceVisitor sequences, the SpeciesVisitor species, and so on). Visitors should normally be in distinct packages and may well become separate projects. Current Visitors:

  • ChemVisitor. Extracts formulae from SVG and XHTML and (to come) builds reactions.
  • PlotVisitor. Interprets x-y plots from SVG.
  • SequenceVisitor. (to come - volunteers?)
  • SpeciesVisitor. Extracts species from SVG, Tables and XHTML.
  • TreeVisitor. Extracts trees from SVG.
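
The Visitor/Visitable relationship described above can be sketched in a few lines of Java. This is only an illustration, not the XHTML2STM code: the interfaces, the class names (HtmlVisitable, SvgVisitable, FormulaSketchVisitor, VisitorSketch) and the crude formula regex are all assumptions made for the example. It shows a passive Visitable handing itself to whichever Visitor arrives, and one Visitor collecting complementary information from two kinds of Visitable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified interfaces; the real XHTML2STM types may differ.
interface Visitable {
    void accept(Visitor visitor);
}

interface Visitor {
    void visit(HtmlVisitable html);
    void visit(SvgVisitable svg);
}

// A Visitable is passive: it stores content and hands itself to the Visitor.
class HtmlVisitable implements Visitable {
    final String text;
    HtmlVisitable(String text) { this.text = text; }
    @Override public void accept(Visitor visitor) { visitor.visit(this); }
}

class SvgVisitable implements Visitable {
    final List<String> textFragments;
    SvgVisitable(List<String> textFragments) { this.textFragments = textFragments; }
    @Override public void accept(Visitor visitor) { visitor.visit(this); }
}

// A Visitor is active: it decides what to look for in each kind of Visitable.
class FormulaSketchVisitor implements Visitor {
    // Toy pattern (assumption): two or more element-like symbols with optional counts.
    private static final Pattern FORMULA =
            Pattern.compile("\\b(?:[A-Z][a-z]?\\d*){2,}\\b");
    final List<String> formulae = new ArrayList<>();

    @Override public void visit(HtmlVisitable html) { collect(html.text); }

    @Override public void visit(SvgVisitable svg) {
        for (String fragment : svg.textFragments) {
            collect(fragment);
        }
    }

    private void collect(String text) {
        Matcher m = FORMULA.matcher(text);
        while (m.find()) {
            formulae.add(m.group());
        }
    }
}

class VisitorSketch {
    // Sends one Visitor over every Visitable and returns what it collected.
    static List<String> extractFormulae(List<Visitable> visitables) {
        FormulaSketchVisitor visitor = new FormulaSketchVisitor();
        for (Visitable v : visitables) {
            v.accept(visitor);
        }
        return visitor.formulae;
    }

    public static void main(String[] args) {
        List<Visitable> page = Arrays.asList(
                new HtmlVisitable("Glucose is C6H12O6 and water is H2O."),
                new SvgVisitable(Arrays.asList("NaCl", "arrow")));
        System.out.println(extractFormulae(page));
    }
}
```

In this sketch, adding another Visitor means writing one new class that implements Visitor; the Visitables are not touched. Adding or removing a Visitable, by contrast, changes the Visitor interface and therefore every Visitor.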

We are planning Named Entity Recognizers, many of which will work through a generic regex engine:

  • MuseumVisitor (abbreviations and names)
  • AccessionVisitor (accession numbers)
  • DOIVisitor (Digital Object Identifiers for papers)
  • GeonamesVisitor
  • DateVisitor
  • AuthorVisitor
  • CitationVisitor

Note that these can be section-specific (e.g. species in titles, authors in running text, etc.)
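
A generic regex engine of the kind mentioned above could look roughly like the following. This is a minimal sketch under stated assumptions: the class RegexRecognizerSketch, its methods, the entity-type names and both regular expressions are hypothetical and are not part of AMI. The DOI pattern is a commonly used approximation, not a complete definition of valid DOIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical generic recognizer: one named regex per entity type.
class RegexRecognizerSketch {
    private final Map<String, Pattern> patterns = new LinkedHashMap<>();

    // Registers a regex under an entity-type name (names are illustrative).
    void register(String entityType, String regex) {
        patterns.put(entityType, Pattern.compile(regex));
    }

    // Returns every match in the text, grouped by entity type.
    Map<String, List<String>> recognize(String text) {
        Map<String, List<String>> hits = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            List<String> found = new ArrayList<>();
            Matcher m = entry.getValue().matcher(text);
            while (m.find()) {
                found.add(m.group());
            }
            hits.put(entry.getKey(), found);
        }
        return hits;
    }

    // Example configuration; both patterns are assumptions for illustration.
    static RegexRecognizerSketch withExamplePatterns() {
        RegexRecognizerSketch recognizer = new RegexRecognizerSketch();
        recognizer.register("doi", "\\b10\\.\\d{4,9}/[-._;()/:A-Za-z0-9]+");
        recognizer.register("date", "\\b\\d{4}-\\d{2}-\\d{2}\\b");
        return recognizer;
    }

    public static void main(String[] args) {
        String text = "See doi:10.1234/example.5678 (accessed 2013-11-06)";
        System.out.println(withExamplePatterns().recognize(text));
    }
}
```

With this design a new recognizer is a new entry in the pattern table rather than new code, and a section-specific recognizer would simply be run only on the text of the relevant section.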


ChemVisitor (in createCML()) runs MoleculeCreator.createReactions() to create CML reactions. If none are found in the SVG input, it runs MoleculeCreator.createMolecules() instead. Reaction creation:

  • Tries to interpret every connected component as an arrow
  • For each arrow found, looks for the reactant and the product (createReactionsAndAddMolecules(), with MoleculeCreator.createMolecule() being used to interpret them)
  • For each caption found, looks for the molecule to which it refers and adds the label to the molecule as a CML label (addLabelsToMolecules())
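
The control flow above (try reactions first, fall back to molecules) can be sketched as follows. This is not the MoleculeCreator code: the classes Molecule, Arrow and ReactionSketch are hypothetical stand-ins, the output is a list of plain strings rather than CML, and pairing an arrow with the molecule nearest its tail and head is an assumption made for the example, since the source does not say how reactants and products are located.

```java
import java.util.ArrayList;
import java.util.List;

class ReactionSketch {
    // Hypothetical stand-ins for interpreted SVG content (not the AMI classes).
    static class Molecule {
        final String id;
        final double x, y;
        Molecule(String id, double x, double y) { this.id = id; this.x = x; this.y = y; }
    }

    static class Arrow {
        final double tailX, tailY, headX, headY;
        Arrow(double tailX, double tailY, double headX, double headY) {
            this.tailX = tailX; this.tailY = tailY; this.headX = headX; this.headY = headY;
        }
    }

    // Assumption: the molecule belonging to an arrow end is the closest one.
    static Molecule nearest(List<Molecule> molecules, double x, double y) {
        Molecule best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Molecule m : molecules) {
            double distance = Math.hypot(m.x - x, m.y - y);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = m;
            }
        }
        return best;
    }

    // For each arrow, look for a reactant at the tail and a product at the head.
    static List<String> createReactions(List<Arrow> arrows, List<Molecule> molecules) {
        List<String> reactions = new ArrayList<>();
        for (Arrow arrow : arrows) {
            Molecule reactant = nearest(molecules, arrow.tailX, arrow.tailY);
            Molecule product = nearest(molecules, arrow.headX, arrow.headY);
            if (reactant != null && product != null && reactant != product) {
                reactions.add("reaction: " + reactant.id + " -> " + product.id);
            }
        }
        return reactions;
    }

    // Fallback used when no reactions are found.
    static List<String> createMolecules(List<Molecule> molecules) {
        List<String> result = new ArrayList<>();
        for (Molecule m : molecules) {
            result.add("molecule: " + m.id);
        }
        return result;
    }

    // Mirrors the described order: reactions first, molecules only if none are found.
    static List<String> createOutput(List<Arrow> arrows, List<Molecule> molecules) {
        List<String> reactions = createReactions(arrows, molecules);
        return reactions.isEmpty() ? createMolecules(molecules) : reactions;
    }

    public static void main(String[] args) {
        List<Molecule> molecules = new ArrayList<>();
        molecules.add(new Molecule("A", 0, 0));
        molecules.add(new Molecule("B", 100, 0));
        List<Arrow> arrows = new ArrayList<>();
        arrows.add(new Arrow(30, 0, 70, 0));
        System.out.println(createOutput(arrows, molecules));
    }
}
```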

MoleculeCreator.createMolecules() and MoleculeCreator.createReactions() first run ChemistryBuilder.createHigherPrimitives() to convert SVG to chemistry, which:

  • Pulls together groups of lines that should be represented as a single object (currently tram lines and hatched triangles)
  • Determines which objects are conceptually joined to others (creating Junction objects)
  • Joins characters to make words (as part of the above)
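
As an illustration of the last step, characters can be joined into words by looking at the horizontal gaps between them. The sketch below is an assumption about how such joining might work, not the ChemistryBuilder implementation: the Glyph class, the joinIntoWords method and the maxGap threshold are hypothetical, and it only handles characters on a single horizontal line.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class WordJoinSketch {
    // Hypothetical character with its left edge and width in SVG units.
    static class Glyph {
        final char ch;
        final double x;
        final double width;
        Glyph(char ch, double x, double width) {
            this.ch = ch; this.x = x; this.width = width;
        }
    }

    // Joins glyphs on one text line into words: a gap wider than maxGap
    // between one glyph's right edge and the next glyph's left edge
    // starts a new word. The threshold is an assumed tuning parameter.
    static List<String> joinIntoWords(List<Glyph> glyphs, double maxGap) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));

        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double previousRight = 0;
        for (Glyph g : sorted) {
            if (current.length() > 0 && g.x - previousRight > maxGap) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(g.ch);
            previousRight = g.x + g.width;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph('C', 30, 6));
        glyphs.add(new Glyph('O', 0, 6));
        glyphs.add(new Glyph('H', 37, 6));
        glyphs.add(new Glyph('H', 7, 6));
        glyphs.add(new Glyph('3', 44, 5));
        System.out.println(joinIntoWords(glyphs, 3.0));
    }
}
```

A real implementation would also need to handle vertical position (for subscripts and multiple lines) and font size, which this sketch ignores.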

[Figure: Example reaction. Taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY]

[Figure: CML generated from the above example]

