Source

pdf2xml /

Filename Size Date modified Message
share/lib
t
327 B
34.3 KB
570 B
1.7 KB
8.1 KB
pdf2xml - convert PDF files to XML
----------------------------------

This script heavily relies on Apache Tika and pdftotext for the
extraction of text and the conversion to XML. It tries to combine
information from both tools and different conversion modes:

 - pdftotext in standard mode puts some hyphenated words together
 - pdftotext in 'raw' mode is better in keeping characters together
 - Apache Tika can add XML markup to mark paragraphs and lists


pdf2xml uses several heuristics to use information from both tools. It
collects the vocabulary from different conversions and tries to merge
tokens if they would form a valid word (which is part of the
vocabulary). The longest possibe string is prefered. Hyphens at the
end of a line may mark a hyphenated word. pdf2xml tries to put the
token before the hyphen together with the first token of the next line
and accepts the new string if it is part of the vocabulary.


Look at the example PDF-file in the test directory (t/). If you look
at the difference between raw conversion (without cleanup heuristics),
you will find a lot of changes. For example:


  raw:    <p>PRESENTATION ET R A P P E L DES PRINCIPAUX RESULTATS 9</p>
  clean:  <p>PRESENTATION ET RAPPEL DES PRINCIPAUX RESULTATS 9</p>

  raw: <p>2. Les c r i t è r e s de choix : la c o n s o m m a t i o n
          de c o m b u s - t ib les et l e u r moda l i t é 
          d ' u t i l i s a t i on d 'une p a r t , 
          la concen t r a t ion d ' a u t r e p a r t 16</p>

  clean: <p>2. Les critères de choix : la consommation 
            de combustibles et leur modalité 
            d'utilisation d'une part, 
            la concentration d'autre part 16</p>

Note: The heuristics may not always produce correct results!