I don't think we are going the torrent route, @Yuval Marcus. @Sakari Maaranen I also think we want to know precisely where our data is hosted. Someone could claim to have a copy of our data and say something malicious about it, and the refutation would be too complicated for most people to understand. We owe our Kickstarter investors better than that.
I wouldn't mind using torrent for any high quality files that have good metadata. However, due to our download method, many of our backup copies are kind of shotgun efforts that contain lots and lots of redundant files in addition to the actual data. We can return to this question later on if and when we have the kind of files suitable for torrent.
@Sakari Maaranen @Yuval Marcus Different people on Azimuth Backup have different perceptions of what our goals should be. These have not been clearly articulated, particularly in the light of realistic constraints. As I've written elsewhere, those constraints include the inability to consult with the originators of these datasets, due to the political and job pressures put on them; our inability to access files on the servers using anything other than publicly available means; and, finally, the lack of published documentation on how the sites are organized.
Finally, there are ancillary benefits to what @Sakari Maaranen characterizes as shotgun efforts which I, rather, think are optimal. First, we are in a hurry, or *were in a hurry*, to capture these data. The less thinking needs to be done to enable each capture, the better. Sure, had we more time, we could have automated that more. But we did not. And we did not, at least at the beginning, have a lot of staff. Second, not all sites in the universe of geophysical sites are being preserved, whether because of their size, or of people's interest, or because of lack of organization, and not all sites will be faithfully recorded. The lack of organization was somewhat deliberate: We wanted independent gathers, in case one or another of them was imperfect, and also because there was -- and continues to be -- a concern that coordination allows a concerted attack against the overall project, compromising its methods, means, and possibly its data. Third and finally, our interest is exclusively climate and environmental data. However, the Datarefuge Project, of which we are a publicly announced ally, is interested in preserving as much data from before this administration as possible. The evidence is that this is wise, as some datasets pertaining to animal care and to poverty in cities have been taken offline. So, if our spiders happen to wander from, say, the EPA site to the Housing and Urban Development site, as I know one did, and grab datasets there, I say so much the better. At least we are archiving things which might be at risk, even if it will take the archivists time to figure this out.
We are rescuing things from a burning building, not executing a corporate project in replication. The administration is necessarily viewed as hostile, despite their pronouncements in selected instances, and I feel we need to continue assuming they have no respect for evidence or the work it takes to gather these measurements and make sense of them. Moreover, compared to the staff at agencies and people who are paid through federal grants, we are relatively free to pursue such gathers anywhere we need to go, up to the point the datasets are taken offline.
A final comment: Whatever success we might have in recovering Web site structure, that capture won't be robust against changes in that structure going forward and, in particular, it will be relatively useless for distinguishing changes which are simple updates from ones which reflect "new policy", which should at least be kept separate. I see no commitment in our project or funding to help the present administration preserve anything they do.
I've written this enough times that I wonder if I oughtn't put this personal statement on the Wiki or something.
Thanks for the update. Yes, I believe this information should be added to the wiki. On a side note (because I don't know where else to ask this), how can I access the mirrored data if I want to keep a backup of some of it on my own computer? I have an SSH key, but where do I put it if I don't have access to the authorized_keys file?
@Yuval Marcus Okay, I will, but it probably won't be a high priority. And, to avoid the appearance that it is a widely held view at Azimuth Project, or even a view held by anyone other than me, I may do it as an update to my blog post on the subject.
@Yuval Marcus I have added you to the project team. Please see the azimuth-inventory README.md file and follow the links to read the documentation. Put your key in the private_cmdb that is documented there. I hope you can use Git.