Legislative Explorer: Data Driven Discovery

This is the repository for code used to create the interactive visualization, LegEx. All of the data driving this web app is in the public domain, and this repository explains how the raw text from Thomas was converted into usable data. These scripts are not "plug-and-play" outside of our workflow, but should be modular enough to be integrated into other workflows with some minor tweaks.

Thomas is scheduled to be retired at the end of 2014, and the Library of Congress has not yet provided a public API for its new site, http://beta.congress.gov. Visit the Congressional Data Coalition for more information on the push for open Congressional data.

Data and Codebook

  • Data (.csv): Files for bills, members, and actions, updated nightly from http://thomas.loc.gov data.
  • Codebook: An in-depth description of the variables produced by these scripts and available in the data files.


Each individual parsing function loads a local .json file with information about a single Congressional bill, as packaged by the unitedstates/congress bills scraper. These .json files contain dozens of pieces of information about each bill, and the parsers described below each return a python dictionary with a subset of the information collected from the .json file. We then loaded this data into a MySQL database, but the dictionaries can also easily be written to a flat file such as .csv.
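The parser pattern above can be sketched as follows. This is a minimal illustration, not code from the repository: the function name is ours, and the keys read from the scraper's data.json (`bill_id`, `official_title`, `sponsor`) are a small illustrative subset.

```python
import json

def parse_bill(json_path):
    """Load one bill's data.json (as packaged by the unitedstates/congress
    scraper) and return a dictionary with a subset of its fields.
    The keys selected here are illustrative, not the full set the
    repository's parsers extract."""
    with open(json_path) as f:
        bill = json.load(f)
    return {
        'bill_id': bill.get('bill_id'),
        'title': bill.get('official_title'),
        'sponsor': (bill.get('sponsor') or {}).get('name'),
    }
```

A dictionary like this can be inserted as a database row or written out with `csv.DictWriter`.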

Extracting Bill Metadata

binfoParser.py takes as input a bill number, congress, billtype (i.e. 'hr'), id number (user defined), and the path to the directory structure containing the .json files. The output includes various titles for the bill, committee referral information, sponsor information, as well as indicators of the sponsor's status on the referral committee(s).
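Locating the right .json file from those inputs can be sketched as below. The helper name is ours; the directory layout shown is the standard tree produced by the unitedstates/congress scraper.

```python
import os

def bill_json_path(base_dir, congress, billtype, number):
    """Build the path to one bill's data.json, assuming the
    unitedstates/congress scraper's layout:
    {base_dir}/{congress}/bills/{billtype}/{billtype}{number}/data.json"""
    return os.path.join(base_dir, str(congress), 'bills', billtype,
                        '%s%d' % (billtype, number), 'data.json')
```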

Extracting Bill Progress

actionParser.py takes the same inputs as binfoParser.py and returns a list of dictionaries, with each dictionary containing information for a specific change in status for the bill. We call this an action. Note that the initial .json file contains many more actions than we capture (in particular, we don't capture subcommittee actions or procedural motions once a bill is on the floor). This script could easily be modified to capture additional actions.
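Filtering the scraper's action list down to the status changes we keep can be sketched as below. The function name and the particular set of action types retained are illustrative assumptions; the `type`/`acted_at`/`text` keys follow the scraper's data.json format.

```python
def parse_actions(bill):
    """Return a list of dictionaries, one per captured status change
    ('action'). Subcommittee and procedural actions are skipped by
    keeping only a whitelist of action types (illustrative set)."""
    keep = {'introduced', 'referral', 'vote', 'enacted'}
    actions = []
    for a in bill.get('actions', []):
        if a.get('type') in keep:
            actions.append({
                'date': a.get('acted_at'),
                'type': a.get('type'),
                'text': a.get('text'),
            })
    return actions
```

Capturing additional actions is then a matter of widening the `keep` set.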

Member Data

In order to build a table of members of Congress we used memberTable.py. This file relies on memberParser.py, commMemb.py, and congress-legislators. As a disclaimer, congress-legislators does not provide member information by session of Congress; we inferred this from the starting and ending dates provided for each member.
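The inference can be sketched as follows. This is our simplified illustration, not the repository's code: it maps each calendar year of a member's term to a (Congress, session) pair, treating odd years as session 1 and even years as session 2 and ignoring the exact January start/end days.

```python
from datetime import date

def sessions_served(start, end):
    """Infer (congress, session) pairs from a member's term dates,
    since congress-legislators provides only start/end dates.
    Simplification: Congress n spans the odd year 2n + 1787 (session 1)
    and the following even year (session 2)."""
    pairs = []
    for year in range(start.year, end.year + 1):
        congress = (year - 1789) // 2 + 1
        session = 1 if year % 2 == 1 else 2
        pairs.append((congress, session))
    return pairs
```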

Putting It All Together

To automate collecting bill metadata and progress for an entire Congress, we created congressParser.py. After loading information on members, this file runs binfoParser.py followed by actionParser.py and uploads all the information to our MySQL database. It can be run from the command line with a single argument, the Congress of interest, e.g.:

python congressParser.py 113

Keeping It Up To Date

One of the main goals of this project was for the data to always be current. Each night, congressUpdater.py runs on our server as one step in a shell script. The script downloads any changes to bills in the current Congress, processes each changed bill (looking for additional actions or changes to the metadata), and updates our database accordingly.
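The per-bill update step can be sketched as an upsert keyed on the bill id. This is a minimal illustration under our own names: `db` stands in for the MySQL table as a plain dictionary, and `parse` is whatever parser produces a row for one bill.

```python
def apply_updates(db, changed_bill_paths, parse):
    """Sketch of the nightly update: re-parse each changed bill and
    insert or update its row. `db` is a dict keyed on bill_id,
    standing in for a MySQL table."""
    for path in changed_bill_paths:
        row = parse(path)
        db[row['bill_id']] = row  # upsert: replaces any existing row
    return db
```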

What Actually Drives LegEx?

LegEx is hosted on an Amazon EC2 instance within an Auto Scaling Group, which keeps hosting costs to a minimum while still letting us scale up quickly to meet demand. To lessen demand on the server and provide the best experience possible for users, the data in the MySQL database is written back to .json files on a nightly basis; these files are downloaded to the user's computer on a congress-by-congress basis.
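The nightly export can be sketched as below. The function name, file naming, and row shape are our illustrative assumptions; the point is simply that each Congress's rows are serialized to their own .json file for the browser to fetch.

```python
import json

def export_congress(rows, congress, out_path):
    """Write all database rows for one Congress back to a .json file
    that the client downloads. `rows` is an iterable of dictionaries,
    each with a 'congress' key (illustrative row shape)."""
    subset = [r for r in rows if r.get('congress') == congress]
    with open(out_path, 'w') as f:
        json.dump(subset, f)
    return len(subset)
```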


Additional scripts located in the /depreciated/ folder were used to correct errors in the initial scripts, but have since been integrated into the main workflow and are kept for reference only. Variables drawn from other datasets are stored in the database and applied to newly introduced legislation nightly by maintUpdates.sql.