NOAA Carbon Tracker FTP site, aftp.cmdl.noaa.gov

Issue #7 resolved
Jan Galkowski
created an issue

FTP download from:

aftp.cmdl.noaa.gov/products/carbontracker

Destined for /media/jan-one/ in

/media/jan-one/aftp.cmdl.noaa.gov.products.carbontracker
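
For reference, a gather like this can be scripted. The following is a minimal sketch, not the capture tool actually used: it assumes the server permits anonymous FTP login and supports the MLSD listing command, and it writes into the destination directory named above.

```python
#!/usr/bin/env python3
"""Minimal sketch of mirroring the CarbonTracker FTP tree.
Assumes anonymous login is permitted and the server supports MLSD."""
import os
from ftplib import FTP

HOST = "aftp.cmdl.noaa.gov"
REMOTE_ROOT = "/products/carbontracker"
LOCAL_ROOT = "/media/jan-one/aftp.cmdl.noaa.gov.products.carbontracker"

def mirror(ftp, remote_dir, local_dir):
    """Recursively copy remote_dir into local_dir."""
    os.makedirs(local_dir, exist_ok=True)
    ftp.cwd(remote_dir)
    for name, facts in ftp.mlsd():
        if facts.get("type") == "dir":
            mirror(ftp, remote_dir + "/" + name, os.path.join(local_dir, name))
            ftp.cwd(remote_dir)  # restore position after recursing
        elif facts.get("type") == "file":
            with open(os.path.join(local_dir, name), "wb") as out:
                ftp.retrbinary("RETR " + name, out.write)

with FTP(HOST) as ftp:
    ftp.login()  # anonymous login
    mirror(ftp, REMOTE_ROOT, LOCAL_ROOT)
```

In practice a single `wget -m ftp://aftp.cmdl.noaa.gov/products/carbontracker/` accomplishes much the same thing; the script above just makes the steps explicit.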

Comments

  1. Yuval Marcus

    Can you make a torrent for this data? In fact, I think it would make a lot of sense if all of the data in this project became torrents. Then, everyone could help host the data.

  2. Jan Galkowski reporter

    I don't think we are going the torrent route, @Yuval Marcus. @Sakari Maaranen, I also think we want to know precisely where our data is hosted. Someone could claim to have a copy of our data, say something malicious about it, and the refutation would be too complicated for most people to understand. We owe our Kickstarter investors better than that.

  3. Sakari Maaranen

    I wouldn't mind using torrents for any high-quality files that have good metadata. However, due to our download method, many of our backup copies are kind of shotgun efforts that contain lots and lots of redundant files in addition to the actual data. We can return to this question later on, if and when we have the kind of files suitable for torrents.

  4. Jan Galkowski reporter

    @Sakari Maaranen @Yuval Marcus Different people on Azimuth Backup have different perceptions of what our goals should be. These have not been clearly articulated, particularly in light of realistic constraints. As I've written elsewhere, those constraints include the inability to consult with the originators of these datasets, due to political and job pressures put on them; our inability to access files on the servers by anything other than publicly available means; and, finally, the absence of published documentation on how the sites are organized.

    These projects began with the intent of saving climate data, so I have always taken the objective to be saving datasets, not the structure of Web sites. For sites which are accessible only by http or https, that means using that protocol to get at the datasets. For sites with ftp, which are increasingly rare among government sites, things are more direct. Frankly, I don't see how we can preserve the structure of sites, given that they, like most of the Web today, increasingly rely upon Javascript to operate. (As an aside, I see that as a weakness of Javascript and CMS-like delivery options in general.)

    It varies from government site to government site, but many of the NOAA and NASA sites cross-link to one another. So, again, as I have written elsewhere, if the goal is to grab a semantically intact copy of a site, as opposed to one artificially limited to whatever happens to live on certain hostnames, a gather has to pull from multiple hostnames, and the success of the project depends upon knowing how the site is organized. That either demands cooperation from the developers, or deep and diligent study of each site, which takes many hours apiece, with no guarantee that nothing is missed, e.g., because of Javascript.

    Finally, there are ancillary benefits to what @Sakari Maaranen characterizes as shotgun efforts, which I, rather, think are optimal. First, we are in a hurry, or *were in a hurry*, to capture these data. The less thinking each capture requires, the better. Sure, with more time we could have automated more, but we did not have that time, and, at least at the beginning, we did not have much staff.

    Second, not all sites in the universe of geophysical sites are being preserved, whether because of their size, or of people's interest, or of lack of organization, and not all sites will be faithfully recorded. The lack of organization was somewhat deliberate: we wanted independent gathers, in case one or another of them was imperfect, and also because there was -- and continues to be -- a concern that coordination would allow a concerted attack against the overall project, compromising its methods, means, and possibly its data.

    Third and finally, our interest is exclusively climate and environmental data. However, the Datarefuge Project, of which we are a publicly announced ally, is interested in preserving as much data from before this administration as possible. The evidence is that this is wise, as some datasets pertaining to animal care and to poverty in cities have already been taken offline. So, if our spiders happen to wander from, say, the EPA site to the Housing and Urban Development site, as I know one did, and grab datasets there, I say so much the better. At least we are archiving things which might be at risk, even if it will take the archivists time to figure this out.

    We are rescuing things from a burning building, not executing a corporate project in replication. The administration is necessarily viewed as hostile, despite their pronouncements in selected instances, and I feel we need to continue assuming they have no respect for evidence or the work it takes to gather these measurements and make sense of them. Moreover, compared to the staff at agencies and people who are paid through federal grants, we are relatively free to pursue such gathers anywhere we need to go, up to the point the datasets are taken offline.

    A final comment: Whatever success might be had in recovering Web site structure, that capture won't be robust against changes in the structure going forward, and, in particular, it will be relatively useless for distinguishing changes which are simple updates from those which reflect "new policy", which should at least be kept separate. I see no commitment in our project or funding to help the present administration preserve anything they do.

    I've written this enough times that I wonder if I oughtn't put this personal statement on the Wiki or something.

  5. Yuval Marcus

    Thanks for the update. Yes, I believe this information should be added to the wiki. On a side note (because I don't know where else to ask this): how can I access the mirrored data if I want to keep a backup of some of it on my own computer? I have an SSH key, but where do I put it if I don't have access to the authorized_keys file?
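A note for readers with the same question: on a typical setup, someone with administrative access appends your public key to the target account's ~/.ssh/authorized_keys on the server; without write access to that file, you would send the public key to whoever manages the host. Once access is granted, a partial pull could look like the sketch below, which shells out to rsync over SSH. The host name and account are placeholders, not the project's actual server; only the remote path comes from this issue.

```python
#!/usr/bin/env python3
"""Sketch of pulling part of a mirror over SSH with rsync.
Assumes an admin has already added your public key to the server
account's ~/.ssh/authorized_keys. Host and account are placeholders."""
import subprocess

REMOTE = "backup@azimuth-backup.example.org"  # hypothetical account and host
REMOTE_PATH = "/media/jan-one/aftp.cmdl.noaa.gov.products.carbontracker/"
LOCAL_PATH = "carbontracker-mirror/"

# -a preserves the directory structure and timestamps, -z compresses in
# transit, --partial lets an interrupted transfer resume where it left off.
subprocess.run(
    ["rsync", "-az", "--partial", f"{REMOTE}:{REMOTE_PATH}", LOCAL_PATH],
    check=True,
)
```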
