1. Jörg Tiedemann
  2. pdf2xml

Commits

tiedeman  committed 91b1360

autofill in README

  • Participants
  • Parent commits 6505032
  • Branches master

Comments (0)

Files changed (1)

File README

View file
 pdf2xml - convert PDF files to XML
 ----------------------------------
 
-This script heavily relies on Apache Tika and pdftotext for the extraction of text and the conversion to XML. It tries to combine information from both tools and different conversion modes:
+This script heavily relies on Apache Tika and pdftotext for the
+extraction of text and the conversion to XML. It tries to combine
+information from both tools and different conversion modes:
 
  - pdftotext in standard mode puts some hyphenated words together
  - pdftotext in 'raw' mode is better in keeping characters together
  - Apache Tika can add XML markup to mark paragraphs and lists
 
 
-pdf2xml uses several heuristics to use information from both tools. It collects the vocabulary from different conversions and tries to merge tokens if they would form a valid word (which is part of the vocabulary). The longest possibe string is prefered. Hyphens at the end of a line may mark a hyphenated word. pdf2xml tries to put the token before the hyphen together with the first token of the next line and accepts the new string if it is part of the vocabulary.
+pdf2xml uses several heuristics to use information from both tools. It
+collects the vocabulary from different conversions and tries to merge
+tokens if they would form a valid word (which is part of the
+vocabulary). The longest possibe string is prefered. Hyphens at the
+end of a line may mark a hyphenated word. pdf2xml tries to put the
+token before the hyphen together with the first token of the next line
+and accepts the new string if it is part of the vocabulary.
 
 
-Look at the example PDF-file in the test directory (t/). If you look at the difference between raw conversion (without cleanup heuristics), you will find a lot of changes. For example:
+Look at the example PDF-file in the test directory (t/). If you look
+at the difference between raw conversion (without cleanup heuristics),
+you will find a lot of changes. For example:
 
 
   raw:    <p>PRESENTATION ET R A P P E L DES PRINCIPAUX RESULTATS 9</p>
   clean:  <p>PRESENTATION ET RAPPEL DES PRINCIPAUX RESULTATS 9</p>
 
-  raw:    <p>2. Les c r i t è r e s de choix : la c o n s o m m a t i o n de c o m b u s - t ib les et l e u r moda l i t é d ' u t i l i s a t i on d 'une p a r t , la concen t r a t ion d ' a u t r e p a r t 16</p>
-
-  clean:  <p>2. Les critères de choix : la consommation de combustibles et leur modalité d'utilisation d'une part, la concentration d'autre part 16</p>
+  raw: <p>2. Les c r i t è r e s de choix : la c o n s o m m a t i o n
+          de c o m b u s - t ib les et l e u r moda l i t é 
+          d ' u t i l i s a t i on d 'une p a r t , 
+          la concen t r a t ion d ' a u t r e p a r t 16</p>
 
+  clean: <p>2. Les critères de choix : la consommation 
+            de combustibles et leur modalité 
+            d'utilisation d'une part, 
+            la concentration d'autre part 16</p>
 
 Note: The heuristics may not always produce correct results!