Extracts scientific, technical and medical (STM) information from HTML/SVG/PDF.
This is being developed as the page where most users will land in the AMI system. See Current activity.
XHTML2STM takes XHTML+SVG emitted from SVG2XML and uses a
Visitor pattern to apply domain-specific analysis and indexing.
Visitors visit Visitables and carry out operations on them, such as transformation and indexing. Thus a
ChemVisitor knows how to create a
CMLMolecule from a
SVGVisitable if it contains a picture of the molecule.
The most important
Visitables are currently (2013-11-06):
SVGVisitable (SVG shapes and text)
ImageVisitable (bitmaps, not yet highly developed)
TableVisitable (XHTML table, whose cells might contain other Visitables)
In general a
Visitable is passive: it doesn't know who will visit it or why. Its primary role is to organize its content so that it is easy to visit. The number of
Visitables will be roughly constant and depend on the type of abstract information supplied. Removing a
Visitable will cause many
Visitors to break.
A Visitor (e.g.
ChemVisitor) is active and knows what it wants to do. Most
Visitors can visit 2 or more
Visitables and may find different and complementary information. There can be a large number of
Visitors and new ones can be added or deleted without affecting the others or causing code to break. It's probable that we can have several
Visitors iterating over one
Visitable (e.g. from
ChemVisitor would extract formulae, the
SequenceVisitor sequences, the
SpeciesVisitor species, and so on).
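The division of labour described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `TextLike`, `TableLike` and `MarkerVisitor` are hypothetical stand-ins and are not AMI's real classes or interfaces. It only shows the shape of the pattern: passive Visitables that accept any Visitor, a table that delegates to its cells, and two independent Visitors iterating over the same Visitable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the Visitor/Visitable split; not AMI's real classes.
class VisitorSketch {

    /** Passive side: organizes content and lets any visitor in. */
    interface Visitable {
        void accept(Visitor visitor);
    }

    /** Active side: one visit method per kind of Visitable. */
    interface Visitor {
        void visit(TextLike text);
        void visit(TableLike table);
    }

    /** Stand-in for a Visitable that holds text strings. */
    static class TextLike implements Visitable {
        final List<String> strings = new ArrayList<>();
        TextLike(String... items) {
            for (String s : items) strings.add(s);
        }
        @Override public void accept(Visitor visitor) { visitor.visit(this); }
    }

    /** Stand-in for a table whose cells may hold other Visitables. */
    static class TableLike implements Visitable {
        final List<Visitable> cells = new ArrayList<>();
        @Override public void accept(Visitor visitor) { visitor.visit(this); }
    }

    /** Toy visitor: keeps every string that contains a given marker. */
    static class MarkerVisitor implements Visitor {
        final String marker;
        final List<String> hits = new ArrayList<>();
        MarkerVisitor(String marker) { this.marker = marker; }

        @Override public void visit(TextLike text) {
            for (String s : text.strings) {
                if (s.contains(marker)) hits.add(s);
            }
        }

        @Override public void visit(TableLike table) {
            // A table delegates to whatever its cells contain.
            for (Visitable cell : table.cells) cell.accept(this);
        }
    }

    public static void main(String[] args) {
        TableLike table = new TableLike();
        table.cells.add(new TextLike("C6H6", "Homo sapiens"));
        table.cells.add(new TextLike("H2O"));

        // Two independent visitors iterate over the same Visitable.
        MarkerVisitor digits = new MarkerVisitor("6");
        MarkerVisitor species = new MarkerVisitor("sapiens");
        table.accept(digits);
        table.accept(species);
        System.out.println(digits.hits);
        System.out.println(species.hits);
    }
}
```

In this sketch, adding a new Visitor means writing one more class, with no change to the Visitables, whereas removing a Visitable changes the `Visitor` interface and therefore every Visitor that implements it.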
Visitors should normally be in distinct packages and may well become separate projects. Current Visitors:
ChemVisitor. Extracts formulae from
XHTML and (to come) builds reactions.
PlotVisitor. Interprets x-y plots from
SequenceVisitor. (to come - volunteers?)
SpeciesVisitor. Extracts species from
TreeVisitor. Extracts trees from
We are planning Named Entity Recognizers (many will be through a generic regex engine).
MuseumVisitor (abbreviations and names)
DOIVisitor (Digital Object Identifier for papers)
Note that these can be section-specific (e.g. species in titles, authors in running text, etc.).
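As an illustration of what a generic regex engine for named entities might look like, here is a minimal sketch. It is not AMI's implementation: the class `RegexRecognizerSketch`, its methods, and both regular expressions are assumptions chosen for the example, and the patterns are deliberately simplified (real DOIs and museum abbreviations are more varied). The identifiers in the sample text are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a generic regex-driven recognizer; not AMI's API.
class RegexRecognizerSketch {

    // Entity type -> compiled pattern, kept in registration order.
    private final Map<String, Pattern> patterns = new LinkedHashMap<>();

    void register(String entityType, String regex) {
        patterns.put(entityType, Pattern.compile(regex));
    }

    /** Returns every match in the text, grouped by entity type. */
    Map<String, List<String>> recognize(String text) {
        Map<String, List<String>> found = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            List<String> hits = new ArrayList<>();
            Matcher matcher = entry.getValue().matcher(text);
            while (matcher.find()) hits.add(matcher.group());
            found.put(entry.getKey(), hits);
        }
        return found;
    }

    /** Two illustrative patterns; both are simplifying assumptions. */
    static RegexRecognizerSketch defaults() {
        RegexRecognizerSketch recognizer = new RegexRecognizerSketch();
        // Simplified DOI shape: "10.", a registrant code, "/", then a suffix.
        recognizer.register("doi", "\\b10\\.\\d{4,9}/[^\\s\"<>]+");
        // Simplified accession shape: 2-6 capital letters, a space, digits.
        recognizer.register("museum", "\\b[A-Z]{2,6} \\d+\\b");
        return recognizer;
    }

    public static void main(String[] args) {
        // Made-up identifiers, used only to exercise the patterns.
        String text = "Specimen XYZM 4521 is described in doi:10.1234/example.5678 (made-up identifiers)";
        System.out.println(defaults().recognize(text));
    }
}
```

Section-specific recognition could then be approximated by calling `recognize` on the text of one section at a time, with a different set of registered patterns per section.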
To create scripts for command-line running, build the system normally and then
will create a separate
target/bin directory and an uber-jar.
ChemVisitor uses
MoleculeCreator.createReactions() to create CML reactions. If none are found in the SVG input, it runs
MoleculeCreator.createMolecules() instead. Reaction creation:
- Tries to interpret every connected component as an arrow
- For each arrow found, looks for the reactant and the product (
MoleculeCreator.createMolecule() being used to interpret them)
- For each caption found, looks for the molecule to which it refers and adds the label to the molecule as a CML label (
MoleculeCreator.createReactions() first run
ChemistryBuilder.createHigherPrimitives() to convert SVG to chemistry, which:
- Pulls together groups of lines that should be represented as a single object (currently tram lines and hatched triangles)
- Determines which objects are conceptually joined to others (creating
- Joins characters to make words (as part of the above)
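The overall control flow described in this section (try to build reactions first, fall back to molecules if no reaction is found) can be sketched as follows. This is a toy under stated assumptions: the method names echo `createReactions()` and `createMolecules()` from the text, but the signatures, the string-based "components" and the arrow detection are all hypothetical. The real code works on SVG geometry, not on strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "reactions first, molecules as fallback" flow.
// Method names echo the text; signatures and bodies are assumptions.
class FallbackSketch {

    static final String ARROW = "arrow";

    /** Toy reaction creation: an arrow links its left and right neighbours. */
    static List<String> createReactions(List<String> components) {
        List<String> reactions = new ArrayList<>();
        for (int i = 1; i < components.size() - 1; i++) {
            if (components.get(i).equals(ARROW)) {
                String reactant = components.get(i - 1);
                String product = components.get(i + 1);
                reactions.add(reactant + " -> " + product);
            }
        }
        return reactions;
    }

    /** Toy molecule creation: every non-arrow component is a molecule. */
    static List<String> createMolecules(List<String> components) {
        List<String> molecules = new ArrayList<>();
        for (String component : components) {
            if (!component.equals(ARROW)) molecules.add(component);
        }
        return molecules;
    }

    /** Tries reactions first; falls back to molecules if none are found. */
    static List<String> interpret(List<String> components) {
        List<String> reactions = createReactions(components);
        return reactions.isEmpty() ? createMolecules(components) : reactions;
    }

    public static void main(String[] args) {
        List<String> scheme = new ArrayList<>();
        scheme.add("A");
        scheme.add(ARROW);
        scheme.add("B");
        System.out.println(interpret(scheme));

        List<String> picture = new ArrayList<>();
        picture.add("A");
        picture.add("B");
        System.out.println(interpret(picture));
    }
}
```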
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
SVG of part of it:
Another reaction scheme from that paper:
The annotated SVG including OCR results:
Finally, an example of a complex molecule that ChemVisitor correctly interprets: