pdf2svg / Home


Part of the AMI system for extracting facts from STM literature. See also SVG, SVG2XML, XHTML2STM.

Converts PDF documents to SVG (uses PDFBOX)

Converting PDFs to Science PDF2STM

How PDF2SVG Works


PDFBOX reads PDF documents and provides an API to extract content from each page. PDF2SVG deliberately does not access metadata (e.g. XMP), because metadata is not used consistently across documents. PDF2SVG is independent of the application domain of the PDF, though it has been developed for scientific, technical and medical (STM) documents.

Note that PDF does NOT contain words, paragraphs, subscripts, tables, figures, circles, squares, etc. These have to be created by heuristics operating on the output of PDF2SVG. That output is produced in two stages:

  • PDFBOX does much of the transformation of low-level commands and coordinates.
  • PDF2SVG transforms these into primitives with page coordinates.
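
One part of moving to page coordinates can be illustrated with a small sketch. PDF's default user space has its origin at the bottom-left of the page with y increasing upwards, while SVG places the origin at the top-left with y increasing downwards. The function below is a hypothetical helper, not PDF2SVG's actual code, and it assumes an unrotated page with no crop-box offset or extra transformation matrix.

```python
def pdf_to_page_coords(x, y, page_height):
    """Flip a point from PDF default user space (origin bottom-left, y up)
    to SVG-style page coordinates (origin top-left, y down).

    Assumption: the page is unrotated and no other transform applies.
    """
    return (x, page_height - y)


# A US Letter page is 612 x 792 points; a point 720 pt above the bottom
# edge sits 72 pt below the top edge.
print(pdf_to_page_coords(72.0, 720.0, 792.0))  # (72.0, 72.0)
```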

There are three types of primitive:

  • characters represented as <svg:text x="" y="">Unicode code point</svg:text>
  • images represented as <svg:image x="" y="" width="" height="" xlink:href="[bitmap as base64]"/>
  • paths as <svg:path d="[Move][Line][Cubic Bezier][Quadratic Bezier][Z close]"/>
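
As a rough illustration of the three primitive types, the sketch below counts them in a small SVG fragment using Python's standard library. The sample document is invented for this example and is not real PDF2SVG output; the image reference is written as `xlink:href` with a data URI, which is the standard SVG way to embed a bitmap.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical sample, shaped like the primitives described above.
SAMPLE_SVG = f"""<svg:svg xmlns:svg="{SVG_NS}" xmlns:xlink="{XLINK_NS}">
  <svg:text x="72.0" y="100.0">H</svg:text>
  <svg:text x="80.1" y="100.0">i</svg:text>
  <svg:image x="72.0" y="120.0" width="10" height="10"
             xlink:href="data:image/png;base64,AAAA"/>
  <svg:path d="M72 200 L172 200 L172 250 Z"/>
</svg:svg>"""


def count_primitives(svg_text):
    """Count text, image and path elements in the SVG namespace."""
    root = ET.fromstring(svg_text)
    counts = Counter()
    prefix = "{" + SVG_NS + "}"
    for elem in root.iter():
        # ElementTree spells namespaced tags as "{uri}localname".
        if elem.tag.startswith(prefix):
            local = elem.tag[len(prefix):]
            if local in ("text", "image", "path"):
                counts[local] += 1
    return counts


print(dict(count_primitives(SAMPLE_SVG)))  # {'text': 2, 'image': 1, 'path': 1}
```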

Primitives can be decorated with additional attributes (e.g. font-size, stroke-width, stroke; see the SVG spec).
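
Because the output contains only single characters, words have to be rebuilt downstream. The following is a minimal sketch of one such heuristic, assuming each character carries a `font-size` attribute: characters on the same baseline are joined when the distance between their x positions is small relative to the font size. The function name, the `gap_factor` threshold and the sample data are all hypothetical, and real heuristics would also need character widths and a tolerance on y.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical sample: "pH" followed by a separate "7" further right.
SAMPLE = f"""<svg:svg xmlns:svg="{SVG_NS}">
  <svg:text x="72.0" y="100.0" font-size="10">p</svg:text>
  <svg:text x="77.5" y="100.0" font-size="10">H</svg:text>
  <svg:text x="90.0" y="100.0" font-size="10">7</svg:text>
</svg:svg>"""


def group_words(svg_text, gap_factor=0.8):
    """Join characters into words using a crude start-to-start x gap test."""
    root = ET.fromstring(svg_text)
    chars = []
    for t in root.iter("{" + SVG_NS + "}text"):
        size = float(t.get("font-size", "10"))  # assumed default size
        chars.append((float(t.get("y")), float(t.get("x")), size, t.text or ""))
    chars.sort(key=lambda c: (c[0], c[1]))  # reading order: by line, then x

    words = []
    prev = None
    for y, x, size, ch in chars:
        # Simplification: same line means exactly the same y value.
        same_line = prev is not None and y == prev[0]
        if same_line and (x - prev[1]) <= gap_factor * prev[2]:
            words[-1] += ch
        else:
            words.append(ch)
        prev = (y, x, size)
    return words


print(group_words(SAMPLE))  # ['pH', '7']
```

Exact y matching means a subscript or superscript always starts a new group here, which is one reason real documents need more careful rules.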