PDFTableExtract2 is a command-line program for extracting tables from PDF files.

PDFTableExtract2 is fully automatic; it can:

  • Detect tables on a PDF page (supports several tables per page) (*)
  • Recognize vertical and horizontal lines
  • Recognize empty spaces as col/row separators (*)
  • Detect merged cells (i.e. rowspan and colspan)
  • Extract the text in each cell
  • Extract the text outside tables (*)
  • Optionally, use an OCR program (such as Tesseract) for non-text PDFs (*)
  • Output pseudo-HTML, CSV, JSON or Python lists
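
To illustrate the idea of recognizing empty spaces as column separators, here is a minimal sketch in Python. It is not PDFTableExtract2's actual code: the function name find_column_gaps, the min_gap parameter and the 0/1 bitmap representation of the page are all assumptions made for this example. It only shows the general principle: a run of pixel columns that are blank in every row is treated as a candidate separator.

```python
def find_column_gaps(bitmap, min_gap=2):
    """Return (start, end) ranges of pixel columns that are blank in every row.

    bitmap is a list of rows; each row is a list of 0 (blank) or 1 (ink).
    This representation and the min_gap threshold are assumptions made for
    this illustration, not PDFTableExtract2's internal data structures.
    """
    width = len(bitmap[0])
    # A pixel column is blank only if no row has ink at that position.
    blank = [all(row[x] == 0 for row in bitmap) for x in range(width)]

    gaps = []
    start = None
    for x, is_blank in enumerate(blank):
        if is_blank and start is None:
            start = x                      # a blank run begins here
        elif not is_blank and start is not None:
            if x - start >= min_gap:       # keep only runs that are wide enough
                gaps.append((start, x))
            start = None
    # Handle a blank run that reaches the right edge of the bitmap.
    if start is not None and width - start >= min_gap:
        gaps.append((start, width))
    return gaps


# Hypothetical 2-row page fragment: two text columns separated by blank space.
page = [
    [1, 1, 0, 0, 0, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 1, 1, 0, 1],
]
print(find_column_gaps(page))  # [(2, 5)]
```

The single blank column at position 7 is ignored because it is narrower than min_gap; a threshold of this kind is one way to avoid mistaking ordinary spaces between words for column separators. The same approach applied to rows instead of columns would give candidate row separators.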

PDFTableExtract2 is written in Python 3. It is based on PDF-table-extract from Ashima Research and has been improved (in particular, the features marked with an *) by Jean-Baptiste Lamy at the LIMICS research lab. It is available under the GNU LGPL licence v3. In case of trouble, please contact Jean-Baptiste Lamy <jean-baptiste.lamy @ univ-paris13 . fr>

University Paris 13, Sorbonne Paris Cité
Bureau 149
74 rue Marcel Cachin