Open Disclosure — extract structured data from house disclosure PDF forms

To run tests

$ cd open-disclosure
$ ./sbt
> test


Most dependencies are managed by SBT, so you don't need to worry about them, but you will need to install Tesseract yourself (on OS X, brew install tesseract takes care of it). You also need Poppler to pre-process the PDFs and extract the images.
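
As a quick sanity check before running anything, the two external tools can be looked up on the PATH. This is only a sketch and not part of the project: the helper name missing_tools is hypothetical, and it assumes the executables are named tesseract and pdfimages, which is how they are usually installed.

```python
import shutil

# Executables that have to be installed outside of SBT.
REQUIRED_TOOLS = ["tesseract", "pdfimages"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All external tools found.")
```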

Extract (some) Data

If all that works, you can try extracting some data. Right now, it only works for electronic schedule 3 pages, so run pdfimages (part of Poppler) on an electronically submitted PDF to get the .pbm files (you can discard the .ppm files). Then, run the app:

> run /tmp/N00029139_2012-009.pbm # <- use the path to your extracted schedule 3 pbm file, of course
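
The pdfimages step that comes before the run command could be scripted roughly as follows. This is a sketch under assumptions, not part of the project: the function names, the disclosure.pdf input, and the /tmp/page output prefix are all made up for illustration. The only real interface used is the basic pdfimages invocation (pdfimages PDF-file image-root), which writes numbered .pbm and .ppm files next to the given prefix.

```python
import subprocess
from pathlib import Path


def extract_images(pdf_path, out_prefix):
    """Run Poppler's pdfimages as `pdfimages <pdf> <image-root>`.

    Raises CalledProcessError if pdfimages exits with an error.
    """
    subprocess.run(["pdfimages", str(pdf_path), str(out_prefix)], check=True)


def pbm_files(directory, prefix=""):
    """Return the .pbm files in `directory`, sorted by name.

    Files with any other extension (such as .ppm) are ignored, and
    `prefix` optionally restricts the result to one pdfimages run.
    """
    return sorted(
        p
        for p in Path(directory).iterdir()
        if p.suffix == ".pbm" and p.name.startswith(prefix)
    )


# Example usage (hypothetical file names):
#   extract_images("disclosure.pdf", "/tmp/page")
#   for pbm in pbm_files("/tmp", prefix="page-"):
#       print(pbm)  # pass one of these paths to `run` in sbt
```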

It will extract a bunch of smaller images (dumped to /tmp) and tell you what it thinks the text in each one is. Right now, it just prints this to the console, but I am working on putting it in a database.

My vision for this is to have two apps. The first is the data extraction app, written in Scala, which loads data into an SQL database and extracts the smaller images. The second is a web app, probably written in Python/Django, which presents the data as a REST API and as a web interface listing each field with its extracted image and OCR'd text, possibly with a mechanism for users to submit corrections to the OCR text.
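
To make the shared database a bit more concrete, here is one possible shape for it, sketched with Python's built-in sqlite3 module. Nothing here exists in the project yet: the table names, column names, and helper functions are all assumptions, and the real schema may end up looking quite different.

```python
import sqlite3

# Hypothetical schema: one row per extracted field, plus user corrections.
SCHEMA = """
CREATE TABLE IF NOT EXISTS field (
    id          INTEGER PRIMARY KEY,
    source_file TEXT NOT NULL,  -- the .pbm page the field was cut from
    image_path  TEXT NOT NULL,  -- the smaller extracted image
    ocr_text    TEXT NOT NULL   -- what the OCR thinks the image says
);
CREATE TABLE IF NOT EXISTS correction (
    id             INTEGER PRIMARY KEY,
    field_id       INTEGER NOT NULL REFERENCES field(id),
    corrected_text TEXT NOT NULL
);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_field(conn, source_file, image_path, ocr_text):
    """Store one extracted field and return its id."""
    cur = conn.execute(
        "INSERT INTO field (source_file, image_path, ocr_text) VALUES (?, ?, ?)",
        (source_file, image_path, ocr_text),
    )
    return cur.lastrowid


def add_correction(conn, field_id, corrected_text):
    """Record a user-submitted correction for a field."""
    conn.execute(
        "INSERT INTO correction (field_id, corrected_text) VALUES (?, ?)",
        (field_id, corrected_text),
    )


def best_text(conn, field_id):
    """Return the newest correction if there is one, else the raw OCR text."""
    row = conn.execute(
        "SELECT corrected_text FROM correction"
        " WHERE field_id = ? ORDER BY id DESC LIMIT 1",
        (field_id,),
    ).fetchone()
    if row is not None:
        return row[0]
    row = conn.execute(
        "SELECT ocr_text FROM field WHERE id = ?", (field_id,)
    ).fetchone()
    return row[0] if row is not None else None
```

Keeping corrections in their own table, rather than overwriting ocr_text, means the original OCR output stays available for comparison.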


  • Patrick Kaeding