Issue #44 on hold
Marc Rosen
created an issue

I'm currently looking into which data sets to bring down.

Comments (17)

  1. Jan Galkowski

    I am marking Issue #61 as a duplicate of this. However, I am quoting below what I wrote there:

    "I do not know how important this site is to scientific progress, but it is the federal site connecting the public with the U.S. Global Change Research Program. To the degree that it might be perceived to be at risk, it's worthwhile saving this material. As mentioned for Issue #60, though, it's possible a spider could wander off and duplicate data we already have, so the options would need to be chosen with care. As in Issue #60, I don't think I (Jan) am well qualified to do this, so I'd ask others. Still, I'm not sure we do have this data, e.g.,"

    I also draw your attention to the practices and discipline we've imposed on ourselves at:

    These are key to our success. Note that I have changed the status of this ticket. Please use the ticketing system properly.

    For wget in particular, note the --span-hosts and --domains= settings.
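
As a rough illustration of how those two settings fit together, here is a minimal Python sketch that only assembles a wget command line. The host names, the output directory, and the helper name build_wget_command are placeholders invented for this example, not the real sites discussed in this ticket. The idea is that --span-hosts lets the crawl leave the starting host, while --domains= limits it to an explicit allow-list so the spider does not wander off into data we already hold.

```python
import shlex


def build_wget_command(start_url, allowed_domains, output_dir="mirror"):
    """Assemble (but do not run) a wget invocation for a bounded mirror.

    start_url, allowed_domains and output_dir are hypothetical inputs.
    """
    return [
        "wget",
        "--mirror",                 # recursive download with timestamping
        "--page-requisites",        # also fetch CSS, images, etc.
        "--span-hosts",             # allow following links to other hosts...
        "--domains=" + ",".join(allowed_domains),  # ...but only these domains
        "--wait=1",                 # pause between requests to be polite
        "--directory-prefix=" + output_dir,
        start_url,
    ]


if __name__ == "__main__":
    # Placeholder hosts; substitute the real ones before use.
    cmd = build_wget_command(
        "https://www.example.gov/",
        ["www.example.gov", "data.example.gov"],
    )
    print(shlex.join(cmd))
```

Building the command as a list keeps the domain allow-list in one reviewable place before anything is actually fetched.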

    Also, if the assessment is "the site doesn't suggest that it contains any interesting data", what about

  2. Marc Rosen reporter

    To reiterate what I said in my email, the site itself does not host datasets. Rather, it links to datasets hosted on other websites. Because space is an issue, I said that it would therefore be sensible not to download the datasets it links to, since they should have already been downloaded. That is what I was referring to in what you quoted.

    Additionally, as I explained via email, what the site really has to offer is on a related website. That website provides graph data linking datasets with authors, models, and other attributes, and provides a graph query interface for them. For this reason, simply using wget to mirror the website will still lose most of the data it has to offer. To that end, I have emailed the site's operators to ask whether we could get a database dump of their website, so that we could make a fully functioning clone without losing any data. I have not yet received a response from them, however.

  3. Jan Galkowski

    Thanks for this clarification, Marc. Sorry I was dense, but this shows the kind of analysis our work increasingly needs, something I either just miss or am too busy with other things to do comprehensively.

    - Jan

  4. Greg Kochanski

    I have a pretty good image: 11 GB; 201,377 files. I terminated the crawl, but not before it got down in the weeds, repeatedly fetching the same files through slightly different paths.

  5. Greg Kochanski

    I can chug on it more, but it was getting URLs analogous to this one: "[0]=field_state:West Virginia&f[1]=field_state:Missouri&f[2]=field_state:Iowa"

    Since it's a site that is dynamically generated from a database, one never knows whether the set of URLs is infinite. You can always add f[3] and f[4], and f[5], etc.

    When I come back from work, I'll take a closer look at it to see if there's a reasonable hope that it's finite. (And the * site is suffering from the same problem; for the last day or more, it's been -- apparently -- finding many ways to reveal the same set of reports.)
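
One way to test whether the URL set is effectively finite is to collapse URLs that differ only in the order or repetition of their f[n] facet parameters before counting them. The Python sketch below does that. It assumes the facets behave as an unordered set (so f[0]=A&f[1]=B shows the same page as f[0]=B&f[1]=A), which has not been verified for this site, and the function name and example host are made up for illustration.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Matches facet keys such as f[0], f[1], f[12].
FACET_KEY = re.compile(r"^f\[\d+\]$")


def canonicalize(url):
    """Return a canonical form of a faceted URL for de-duplication.

    Assumption: facet order and repeated facet values do not change the page.
    """
    parts = urlsplit(url)
    facets = set()
    others = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if FACET_KEY.match(key):
            facets.add(value)            # drop duplicates, forget the index
        else:
            others.append((key, value))  # keep non-facet parameters in order
    # Re-number the facets in sorted order so equivalent URLs compare equal.
    renumbered = [(f"f[{i}]", v) for i, v in enumerate(sorted(facets))]
    query = urlencode(others + renumbered)
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


if __name__ == "__main__":
    seen = set()
    crawled = [
        "http://example.org/reports?f[0]=field_state:Iowa&f[1]=field_state:Missouri",
        "http://example.org/reports?f[0]=field_state:Missouri&f[1]=field_state:Iowa",
    ]
    for url in crawled:
        seen.add(canonicalize(url))
    print(len(seen), "distinct page(s) out of", len(crawled), "URLs")
```

Even under this assumption the number of facet combinations grows as the number of subsets of facet values, so the canonical set is finite but can still be very large. Running this over the URLs already fetched would at least show how much of the 201,377-file image is duplication.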
