I'm currently looking into which data sets to bring down.

  1. Greg Kochanski

    I'm grabbing it with wget. However, a manual walk through the site doesn't suggest that it contains any interesting data.

  2. Marc Rosen reporter

    To reiterate what I said in my email: the site itself does not host datasets. Rather, it links to datasets hosted on other websites. Because space is an issue, I said it would be sensible not to download the datasets through it, since they should already have been downloaded from their original hosts. That is what I was referring to in what you quoted.

    Additionally, as I explained via email, what the site really has to offer is its graph data, which links datasets with authors, models, and other attributes, along with a graph query interface for them. For this reason, simply using wget to mirror the website will still lose most of the data it has to offer. To that end, I have sent emails asking whether it would be possible to get a database dump of the website, so that we could make a fully functioning clone without losing any data. I have not yet received a response, however.

  3. Greg Kochanski

    I have a pretty good image: 11 GB, 201,377 files. I terminated the crawl, but not until it was down in the weeds, repeatedly fetching the same files through slightly different paths.
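
One way to check how much of a mirror is the same content reached through different paths is to group the downloaded files by content hash. This is only a rough sketch: the function name `duplicate_groups` and the directory passed to it are hypothetical, and it assumes each file fits comfortably in memory.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def duplicate_groups(root):
    """Group files under `root` by SHA-256 digest; keep groups of 2 or more."""
    by_digest = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Reads the whole file at once; fine for small pages, an
            # assumption that may not hold for large data files.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return {d: sorted(paths) for d, paths in by_digest.items() if len(paths) > 1}
```

If most of the files fall into a small number of large groups, the crawl is probably revisiting the same pages rather than finding new data.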

  4. Greg Kochanski

    I can chug on it more, but it was getting URLs analogous to this one: "[0]=field_state:West Virginia&f[1]=field_state:Missouri&f[2]=field_state:Iowa"

    Since it's a site that is dynamically generated from a database, one never knows whether the set of URLs is infinite. You can always add f[3], f[4], f[5], and so on.

    When I come back from work, I'll take a closer look at it to see if there's a reasonable hope that it's finite. (And the * site is suffering from the same problem; for the last day or more, it's been -- apparently -- finding many ways to reveal the same set of reports.)
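
To estimate whether the set of distinct pages is finite, faceted URLs like the one quoted above can be reduced to a canonical form before counting them. The sketch below assumes that the order of the `f[N]` filters and repeated filter values do not change the page returned, which has not been verified for this site. The function name `canonical_url` and the host `example.org` in the usage note are placeholders.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url):
    """Rewrite a URL so its f[N]=... facet filters are deduplicated and sorted."""
    parts = urlsplit(url)
    facets = set()
    others = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key.startswith("f[") and key.endswith("]"):
            # Assumption: only the set of filter values matters, not the index.
            facets.add(value)
        else:
            others.append((key, value))
    # Renumber the surviving filters in sorted order.
    query = sorted(others) + [(f"f[{i}]", v) for i, v in enumerate(sorted(facets))]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))
```

Under that assumption, `http://example.org/search?f[0]=field_state:Iowa&f[1]=field_state:Missouri` and the same URL with the two filters swapped map to the same canonical string, so counting distinct canonical URLs in the crawl log gives a rough upper bound on the number of genuinely different pages.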

