Issue #79 closed
Greg Kochanski
created an issue

There's a publicly accessible database with an HTML RESTful API. Nicely documented. A bit complex, with lots of tables. One could clearly download this!

It has greenhouse gas source info and other stuff.

Comments (37)

  1. Greg Kochanski reporter

    Got it figured out. I'm starting to download the summary table. There are a lot of tables (~50) and it takes a fair bit of manual cutting-and-pasting to set up each one. But, it should be worth it.

  2. Jan Galkowski

    Just a comment that these data are highly at risk today, although, as always, we do not know what the administration intends, nor whether it is public access to the datasets that is to be removed, or the datasets themselves.

  3. Greg Kochanski reporter

    GPK is re-collecting the database in XML format. It's proceeding well, but slowly. We're limited by the database's throughput and the fact that the database is sometimes overloaded (not by me!). If the EPA went under right now, we could do a pretty good reconstruction with some careful csv->xml conversion work.

    I have the greenhouse gas section, the ICIS, and PCS water pollution sections, the ICIS-AIR air pollution section, and the TRI toxics section.

    Basically, there are tables with more than a million rows in there, and a request for 100 rows takes more than 10 seconds to complete (and sometimes 30 minutes).
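    The 100-rows-at-a-time access pattern described above can be sketched as a generator. The `/rows/start:end` URL suffix here is a hypothetical placeholder, not the database's actual addressing scheme:

```python
def fetch_pages(base_url, fetch, page=100):
    """Yield successive row-range responses until an empty page.

    `fetch` is any callable mapping a URL to a response body; the
    `/rows/start:end` suffix is an illustrative assumption, not the
    real API's scheme.
    """
    start = 0
    while True:
        body = fetch("%s/rows/%d:%d" % (base_url, start, start + page - 1))
        if not body:
            return  # an empty page marks the end of the table
        yield body
        start += page
```

    With per-request latencies of 10 seconds or more, a million-row table at 100 rows per request works out to well over a day of wall-clock time, which matches the slow progress reported here.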

  4. Greg Kochanski reporter

    Found another failure mode. Not sure if it's a bug in the Python httplib library or what, but under rare conditions urllib2.urlopen(url).read() hangs forever (i.e. > 2 days). So four of the download scripts were stuck in that mode. The code is improved, a timeout added, and it's restarted. We're about at the 90% point, certainly in terms of number of columns; it's less clear in terms of total data bytes.
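    A minimal sketch of the timeout fix described above, written with modern Python's `urllib.request` (the original scripts used Python 2's `urllib2`). The timeout and retry values are illustrative, not the ones actually used:

```python
import urllib.request

def fetch(url, timeout=60, retries=3):
    """Fetch a URL, giving up after `timeout` seconds per attempt.

    Without a timeout, a stalled connection can block .read()
    indefinitely, which is the multi-day hang described above.
    """
    last_err = None
    for _ in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except OSError as err:  # covers URLError and socket timeouts
            last_err = err
    raise last_err
```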

  5. Greg Kochanski reporter

    Found yet another rare failure mode. They sent me an incomplete chunk of XML and I didn't validate it before writing it to the output stream. So, I'm restarting one column, but otherwise, it's coming along.
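    The validate-before-write guard described above might look like the following sketch (not the actual download code); it parses each chunk before anything touches the output stream, so a truncated server response is rejected rather than archived:

```python
import xml.etree.ElementTree as ET

def write_if_valid(chunk, out):
    """Append an XML chunk to `out` only if it parses cleanly.

    Guards against incomplete/truncated responses from the server
    being written into the archive.
    """
    try:
        ET.fromstring(chunk)  # raises ParseError on truncated XML
    except ET.ParseError:
        return False
    out.write(chunk)
    return True
```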

  6. Sakari Maaranen

    I am assuming your backup also includes the specification documents -- especially the data model. Even if the REST service is not reproduced, having the documentation can help any such effort.

  7. Greg Kochanski reporter

    The backup explicitly includes one documentation page for each column, e.g.

    Other documentation pages (and those) ought to have been picked up in the general web crawl. For example, was picked up correctly. I've checked several, and they all seem to exist and be good HTML. So I think we're OK there.

  8. Greg Kochanski reporter

    I now have all the tables in the greenhouse gas model, minus two that seem to be misdocumented / misnamed. I'm still downloading various kinds of other information from the database.


  9. Greg Kochanski reporter

    Started the download of the last tables (RCRA, hazardous waste handlers).

    Found that some of the tables appeared in two models. This led to occasional data corruption, which has not been a serious problem because there has been code to detect the corruption for a while. (If corruption was detected, I just truncated the file before the corruption and restarted the download. The restart code was smart enough to do the right thing.) Anyway, to prevent that from happening again, I added calls to fcntl.flock() to lock files, so that only one program can access them at once. That led to a certain amount of refactoring.
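    The `fcntl.flock()` locking described above, in a minimal form. The function name and append-mode usage are illustrative, not the actual refactored code:

```python
import fcntl

def locked_append(path, data):
    """Append to `path` while holding an exclusive flock().

    Two downloader processes that reach the same table (because it
    appears in two models) block on each other instead of
    interleaving writes and corrupting the file.
    """
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            f.write(data)
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

    Note that `flock()` is advisory: it only works because every writer goes through the same locking call, which is presumably why the refactoring was needed.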

    It's progressing well. But note that I am ignoring certain tables that simply do not work as documented.

  10. Greg Kochanski reporter
    • Downloading has been going well for the last couple of weeks. But some of the tables are large (4.3M entries) and will potentially take a long time (8 weeks) to complete.

    • I did a re-read of the download code with an eye to security. There is one known denial-of-service problem where malicious XML generated by the EPA servers could potentially cause our machine to OOM. That's internal to the XML parsing library; XML allows recursive entity expansion, so a small document can blow up to something very large when parsed. But I see no obvious security risks beyond that.
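    One stdlib-only way to guard against the entity-expansion blowup described above is to refuse any document that carries a DTD at all, since plain data tables don't need one. This is a sketch, not what the download code does; the third-party `defusedxml` package is the more thorough standard defense:

```python
import xml.etree.ElementTree as ET

def parse_safely(text):
    """Parse XML, rejecting documents that declare a DTD.

    Entity-expansion bombs ("billion laughs") need <!ENTITY>
    declarations inside a DTD, so refusing the DTD outright bounds
    memory use at the cost of rejecting some legitimate documents.
    """
    if "<!DOCTYPE" in text or "<!ENTITY" in text:
        raise ValueError("DTD/entity declarations rejected")
    return ET.fromstring(text)
```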

  11. Greg Kochanski reporter

    What you have on pub04 is the database tables. (Two duplicate copies of most: the *.csv files are an early attempt, incomplete and buggy. The *.xml files are the final download.)

    I believe that the associated HTML files are issue 78, pub04:/var/local/tmp/2017-01-18T2001Z_78_www.epa.gov_climatechange.
