Open Disclosure — extract structured data from house disclosure PDF forms
To run tests
    $ cd open-disclosure
    $ ./sbt
    > test
Most dependencies are managed by SBT, so you don't need to worry about them,
but you will need to install Tesseract yourself (on OS X, brew install tesseract does the
needful). You also need Poppler to pre-process the PDFs and extract the images.
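Before running the tests, it may help to confirm that both external tools are on your PATH. The snippet below is only a minimal sketch and is not part of the project: the function name check_deps is hypothetical, and it assumes the binaries are called tesseract and pdfimages, as they are in the usual Tesseract and Poppler packages.

```shell
#!/bin/sh
# check_deps: hypothetical helper, not part of Open Disclosure.
# Reports each command that cannot be found and returns non-zero
# if at least one of them is missing.
check_deps() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (assumed binary names):
#   check_deps tesseract pdfimages
```

If the function prints nothing and returns 0, both tools were found.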
Extract (some) Data
If all that works, you can try extracting some data. Right now, it only works for
electronic schedule 3 pages, so run pdfimages (part of Poppler) on an electronically
submitted PDF to get the .pbm files (you can discard the .ppm files). Then run the app:
    $ ./sbt
    > run /tmp/N00029139_2012-009.pbm   # <- use the path to your extracted schedule 3 .pbm file, of course
It will extract a bunch of smaller images (dumped to /tmp) and tell you what it thinks the text in each one is. Right now, it just dumps this to the console, but I am working on putting it in a database.
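The pdfimages step described above could be scripted roughly as follows. This is a sketch under assumptions, not the project's own tooling: the function name extract_pbm and the example paths are hypothetical, and using the PDF's base name as the image root is a guess based on the example file name above. It relies on pdfimages' default behavior of writing monochrome images as .pbm and all other images as .ppm.

```shell
#!/bin/sh
# extract_pbm: hypothetical helper, not part of Open Disclosure.
# Usage: extract_pbm <pdf> <output-dir>
extract_pbm() {
    pdf=$1
    outdir=$2
    # Assumption: name the images after the PDF, e.g. foo.pdf -> foo-000.pbm
    root=$(basename "$pdf" .pdf)
    mkdir -p "$outdir" || return 1
    # pdfimages writes <root>-NNN.pbm for monochrome images and
    # <root>-NNN.ppm for the rest (its default output formats).
    pdfimages "$pdf" "$outdir/$root" || return 1
    # Only the .pbm files are needed, so discard the .ppm files.
    rm -f "$outdir"/*.ppm
    # List what is left so you can pick out the schedule 3 page.
    ls "$outdir"/*.pbm
}

# Example (hypothetical paths):
#   extract_pbm ~/Downloads/N00029139_2012.pdf /tmp
```

You still have to work out by eye which of the listed .pbm files is the schedule 3 page before passing it to the app.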
My vision for this is to have two apps. The first is the data extraction app, written in Scala, which extracts the smaller images and loads the data into an SQL database. The second is a web app, probably written in Python/Django, which presents the data as a REST API and as a web interface listing each field alongside its extracted image and OCR'd text, possibly with a mechanism for users to submit corrections to the OCR text.
- Patrick Kaeding