Converts PDF documents to SVG (uses PDFBOX)
PDFBOX reads PDF documents and provides an API to extract content from each page. PDF2SVG deliberately does not access metadata (e.g. XMP), as it is not used consistently. PDF2SVG is independent of the application domain of the PDF, though it has been developed for Scientific, Technical and Medical (STM) documents.
Note that a PDF does NOT contain words, paragraphs, subscripts, tables, figures, circles, squares, etc. These have to be created by heuristics operating on the output of PDF2SVG. The conversion itself has two stages:
- PDFBOX does much of the transformation of low-level commands and coordinates.
- PDF2SVG transforms these into primitives with page coordinates.
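To illustrate the kind of downstream heuristic mentioned above, here is a minimal Python sketch that groups individual characters into words by their page coordinates. It is not part of PDF2SVG: the function name `group_into_words`, the `(character, x, y)` tuple layout, and the `max_gap` and `baseline_step` thresholds are all assumptions chosen for illustration, and real documents would likely need font-size-aware thresholds.

```python
from typing import List, Tuple

# Hypothetical input format: (character, x, y) in page coordinates.
Char = Tuple[str, float, float]


def group_into_words(
    chars: List[Char],
    max_gap: float = 8.0,        # assumed: largest x distance between letters of one word
    baseline_step: float = 0.5,  # assumed: y values are quantized to this step to find lines
) -> List[str]:
    """Group characters into words using only their coordinates."""

    def line_key(y: float) -> int:
        # Quantize the baseline so that tiny y differences land on the same line.
        # Characters straddling a bucket boundary may still be split (known limitation).
        return round(y / baseline_step)

    # Order by line first, then left to right within the line.
    ordered = sorted(chars, key=lambda c: (line_key(c[2]), c[1]))

    words: List[str] = []
    current = ""
    previous = None
    for ch, x, y in ordered:
        same_line = previous is not None and line_key(y) == line_key(previous[2])
        close_enough = previous is not None and (x - previous[1]) <= max_gap
        if same_line and close_enough:
            current += ch
        else:
            if current:
                words.append(current)
            current = ch
        previous = (ch, x, y)
    if current:
        words.append(current)
    return words


# Characters may arrive in any order; only the coordinates matter here.
sample = [
    ("S", 10.0, 32.0), ("t", 40.0, 20.0), ("P", 10.0, 20.0), ("D", 16.0, 20.0),
    ("F", 22.0, 20.0), ("o", 45.0, 20.0), ("V", 16.0, 32.0), ("G", 22.0, 32.0),
]
print(group_into_words(sample))
```

The sketch deliberately ignores character widths and fonts; it only shows that words are a derived construct, not something stored in the PDF.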
There are three types of primitive:
- characters represented as
<svg:text x="" y="">Unicode point</svg:text>
- images represented as
<svg:image x="" y="" width="" height="" xlink:href="[bitmap as base64]"/>
- paths as
<svg:path d="[Move][Line][Cubic Bezier][Quadratic Bezier][Z close]"/>
Primitives can be decorated with additional attributes (e.g. font-size, stroke-width, stroke, see the SVG spec).
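As a rough illustration of how the three primitive types might be consumed, the following Python sketch parses an SVG string with the standard library and sorts the elements into characters, images, and paths. The sample page is hand-written, not real PDF2SVG output, and the helper name `classify_primitives` is hypothetical. The sample assumes the standard SVG namespace; the image's link attribute is not read, so the sketch does not depend on its exact name.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hand-written sample page (an assumption, not actual PDF2SVG output).
SAMPLE = """<svg:svg xmlns:svg="http://www.w3.org/2000/svg"
                    xmlns:xlink="http://www.w3.org/1999/xlink">
  <svg:text x="10.0" y="20.0" font-size="9.0">H</svg:text>
  <svg:text x="16.5" y="20.0" font-size="9.0">i</svg:text>
  <svg:image x="50" y="60" width="2" height="2" xlink:href="data:image/png;base64,AAAA"/>
  <svg:path d="M 0 0 L 10 0 L 10 10 Z" stroke="black" stroke-width="0.5"/>
</svg:svg>"""


def classify_primitives(svg_source: str) -> dict:
    """Return the text, image, and path primitives found in an SVG string."""
    root = ET.fromstring(svg_source)
    prefix = "{" + SVG_NS + "}"
    result = {"text": [], "image": [], "path": []}
    for element in root.iter():
        if not element.tag.startswith(prefix):
            continue  # skip anything outside the SVG namespace
        local_name = element.tag[len(prefix):]
        if local_name == "text":
            # One character (Unicode point) with its page position.
            result["text"].append(
                (element.text or "", float(element.get("x")), float(element.get("y")))
            )
        elif local_name == "image":
            # Only the bounding box is recorded here; the bitmap itself is ignored.
            result["image"].append(
                tuple(float(element.get(name)) for name in ("x", "y", "width", "height"))
            )
        elif local_name == "path":
            # The raw path data: move, line, curve, and close commands.
            result["path"].append(element.get("d"))
    return result


primitives = classify_primitives(SAMPLE)
print(len(primitives["text"]), len(primitives["image"]), len(primitives["path"]))
```

Decorating attributes such as font-size or stroke-width remain available on each element via `element.get(...)` if a later heuristic needs them.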