Testing group for detecting and incorporating dataset updates

Issue #80 new
Jan Galkowski
created an issue

In Issue #48, the data and datasets from the U.S. Energy Information Administration were reported as replicated and saved. An open question for The Azimuth Backup Project, and for ClimateMirror (or Datarefuge) in general, is how to respond when government sites continue to operate, uninterrupted or at diminished capacity, rather than "going dark."

Even if such sites were, at some point, to be shut down, in part or en masse, we do not know when that might be. Aiming to have the data fully copied by the Friday of the Inauguration is (a) optimistic, and (b) may not be a realistic reading of the incoming administration's priorities. Accordingly, for the datasets we have captured, and speaking of the collection overall, there is a gap between what has been saved and whatever is added afterwards.

Updating our sets to reflect additions is a challenging technical problem. We do not receive notice of updates in the disciplined manner that, for instance, git provides, and there is no mechanism for pushing change notices via RSS or Atom feeds. Accordingly, such a mechanism will need to be invented, and it will need to distinguish between changes, additions, and faults in our initial data collection that are remedied by a second try.
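As a first sketch of what such a mechanism might look like, the following assumes our captures sit as plain files under a local directory and that we keep a manifest of content hashes; the directory name and manifest layout are only illustrative, not how our store is actually organized.

    # Illustrative sketch: record a SHA-256 digest for every captured file, then
    # compare an old manifest against a new one to separate additions from
    # changes.  A file our first pass fetched badly would also show up here as
    # "changed"; telling that apart from a genuine update needs the fetch logs.
    import hashlib
    import json
    from pathlib import Path

    def build_manifest(root: Path) -> dict:
        """Map each file's path, relative to root, to its SHA-256 digest."""
        return {
            str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in root.rglob("*") if p.is_file()
        }

    def classify(old: dict, new: dict) -> dict:
        """Bucket the differences between two manifests."""
        return {
            "added":   sorted(p for p in new if p not in old),
            "removed": sorted(p for p in old if p not in new),
            "changed": sorted(p for p in new if p in old and new[p] != old[p]),
        }

    if __name__ == "__main__":
        manifest = build_manifest(Path("eia_mirror"))  # hypothetical capture dir
        Path("manifest.json").write_text(json.dumps(manifest, indent=2))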

Doing this will require a test platform. I propose that the U.S. EIA data we have serve as that platform, and that we initially limit our efforts to testing against one element of it, namely the state-level carbon dioxide emissions dataset, which is updated about every 9 months. An update, which I believe we did not capture, is available at:

http://www.eia.gov/environment/emissions/state/analysis/
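For that one page, a first cut at update detection could be as simple as polling the HTTP headers and comparing them with what we recorded the last time. This is only a sketch: it works only when the server sends Last-Modified or ETag headers, and the state-file name below is made up.

    # Sketch: ask the server for headers only (HEAD) and compare against the
    # values we saw last time.  Header support varies from server to server.
    import json
    import urllib.request
    from pathlib import Path

    URL = "http://www.eia.gov/environment/emissions/state/analysis/"
    STATE = Path("eia_state_co2_headers.json")  # hypothetical local record

    def head(url: str) -> dict:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request) as response:
            headers = {k.lower(): v for k, v in response.headers.items()}
        return {k: headers.get(k) for k in ("last-modified", "etag", "content-length")}

    def check() -> bool:
        seen = json.loads(STATE.read_text()) if STATE.exists() else {}
        current = head(URL)
        STATE.write_text(json.dumps(current))
        return current != seen

    if __name__ == "__main__":
        print("possible update" if check() else "no change detected")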

I do not know how sophisticated we want to get in doing this, or what kind of funding we might be able to raise to take it on. Surely there are powerful ideas for monitoring a set of documents, in the cloud, say, checking for duplicates or updates. Our datasets are not organized, at present, in such a manner.

I do not know if doing a differential walk of a site against our store makes sense, or how we would deal with a wholesale reorganization of data at a site.
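For what it is worth, here is one rough sketch of such a walk, assuming the store mirrors the site's own path layout under a fixed base URL; the base URL and directory below are placeholders, and a wholesale reorganization of the site would defeat the mapping immediately.

    # Rough sketch of a differential walk: for each file in the local store,
    # re-fetch the corresponding remote URL and compare content hashes.
    import hashlib
    import urllib.request
    from pathlib import Path

    BASE_URL = "http://www.eia.gov/"   # hypothetical site root
    LOCAL_ROOT = Path("eia_mirror")    # hypothetical local store

    def sha256(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def walk_and_compare() -> None:
        for path in LOCAL_ROOT.rglob("*"):
            if not path.is_file():
                continue
            url = BASE_URL + path.relative_to(LOCAL_ROOT).as_posix()
            try:
                with urllib.request.urlopen(url) as response:
                    remote = sha256(response.read())
            except Exception as error:  # removed, moved, or unreachable
                print(f"UNREACHABLE {url}: {error}")
                continue
            if remote != sha256(path.read_bytes()):
                print(f"CHANGED {url}")

    if __name__ == "__main__":
        walk_and_compare()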

These are interesting things to think about, and this task exists to encapsulate this effort.

Comments (4)

  1. Sakari Maaranen

    Yeah, just for information: wget also has the --mirror option, which is a combination of the above and other flags. Its time-stamping mechanism does support incremental updates.

  2. Greg Kochanski

    The timestamping only helps when the server sets the appropriate header, which is not always the case.

    You'd think someone would have written code that does this: it keeps a copy, randomly samples from the URLs in the copy, and takes the diff between the copy and the remote server. When the total quantity of diffs gets sufficiently large, it could take another copy of the remote server.
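    A minimal sketch of that sampling loop, assuming a content hash was recorded for each URL when the copy was taken; the sample size and threshold below are arbitrary placeholders.

        # Sketch of the sampling approach: pick a handful of URLs already in
        # the copy, hash the live versions, and flag a full re-mirror once the
        # fraction that differ crosses a threshold.
        import hashlib
        import random
        import urllib.request

        def live_hash(url: str) -> str:
            with urllib.request.urlopen(url) as response:
                return hashlib.sha256(response.read()).hexdigest()

        def drift_fraction(stored: dict, sample_size: int = 20) -> float:
            """stored maps URL -> hash recorded when the copy was taken."""
            sample = random.sample(list(stored), min(sample_size, len(stored)))
            if not sample:
                return 0.0
            differing = sum(1 for url in sample if live_hash(url) != stored[url])
            return differing / len(sample)

        def needs_recapture(stored: dict, threshold: float = 0.25) -> bool:
            return drift_fraction(stored) >= threshold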
