I've run various local downloads, starting at http://www.epa.gov/climatechange, and also starting at URLs identified by Google searches like <<".zip" site:.epa.gov climate>>. I was trying to find data-heavy sections of the EPA main website.
Since Greg is already on this, I am not starting additional processes unless @Greg Kochanski, who has already spent some time analyzing the site, suggests a suitable wget syntax that does not overlap with his efforts. Let's not accidentally overload the target.
I see @Jan Galkowski has a process downloading this on azi03. Please document your work as instructed, so people who read the issue tracker can coordinate. We cannot work together effectively if you do not communicate what you are doing.
1.6G azi03:/home/jan/local_data/epa.gov.climatechange (epa.gov/climatechange via httrack)
2.2G pub05:/var/local/jan/epa.gov.climatechange-wget (epa.gov.climatechange via wget)
httrack 23857 jan cwd DIR 253,7 32768 356909057 ./epa.gov.climatechange
[jan@azi03 local_data]$ ps -ef | grep 23857
jan 23857 23841 6 06:03 pts/3 00:46:56 httrack https://www.epa.gov/climatechange -O . --mirror --depth=8\
--ext-depth=3 --max-rate=100000000 %c500 --sockets=30 --retries=30\
--host-control=0 TN 60 --near --robots=0 %s
jan 24923 13546 0 06:08 pts/2 00:00:20 wget --dns-timeout=10 --connect-timeout=20 --read-timeout=120 --wait=2\
--random-wait -e robots=off --prefer-family=IPv4 --tries=40 --timestamping=on \
--mirror --recursive --level=8 --no-remove-listing --follow-ftp -nv \
--output-file=epa.gov.climatechange.log --no-check-certificate https://www.epa.gov/climatechange
The httrack report for https://www.epa.gov/climatechange says 1.56 Gb saved, 557 out of 8246 (estimated) scanned, running for 11h50m, using 1 active connection. (It can use more, but for some reason it is not doing so.) 6656 files saved; transfer rate is 26 Kb/sec. It is currently parsing an HTML file.
wget is having issues, e.g., for https://www.epa.gov/climatechange:
> Cannot write to ‘www.epa.gov/newsreleases/search/subject/grants/year/2016’ (Success).
> www.epa.gov/newsreleases/search/subject/hazardous-waste/year: Not a directory
> www.epa.gov/newsreleases/search/subject/hazardous-waste/year/2016: Not a directory
> Cannot write to ‘www.epa.gov/newsreleases/search/subject/hazardous-waste/year/2016’ (Success).
> www.epa.gov/newsreleases/search/subject/international/year: Not a directory
> www.epa.gov/newsreleases/search/subject/international/year/2016: Not a directory
> Cannot write to ‘www.epa.gov/newsreleases/search/subject/international/year/2016’ (Success).
> www.epa.gov/newsreleases/search/subject/other-news-topics/year: Not a directory
> www.epa.gov/newsreleases/search/subject/other-news-topics/year/2016: Not a directory
> Cannot write to ‘www.epa.gov/newsreleases/search/subject/other-news-topics/year/2016’ (Success).
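"Not a directory" is the error you get when one component of a path already exists as a plain file. My reading, which is not confirmed here, is that wget saved a page such as .../year as a file and then could not create .../year/2016 beneath it. A minimal sketch for listing the affected paths from the log follows. The function name list_dir_conflicts is made up for illustration, and it assumes log lines of the form "<path>: Not a directory", as in the excerpt above.

```shell
# List each distinct path that wget reported as "Not a directory".
# Assumes log lines shaped like "<path>: Not a directory".
list_dir_conflicts() {
    # $1: path to the wget log file
    grep 'Not a directory$' "$1" \
        | sed 's/: Not a directory$//' \
        | sort -u
}
```

For the run above, that would be `list_dir_conflicts epa.gov.climatechange.log`, since that is the file named by --output-file.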
As a precursor to killing them, I am stopping the additional pulls I started against https://www.epa.gov/climatechange. I will continue to work on Issue #88, though. I must say, however, that given Climate Mirror's Issue #123, I imagine the site would be loaded.
Re: [climate-mirror/datasets] Any EPA pages we can save? FAST? (#123)
Diane Trout <email@example.com>
to climate-mirror/datasets, cc Jan Galkowski
Wednesday, January 25, 2017 12:55
I have two wgets running, against www.epa.gov/energy/ and www.epa.gov/warm/:
wget --mirror --warc-file= --warc-cdx --page-requisites --html-extension --convert-links --execute robots=off --directory-prefix=. --domains www.epa.gov --user-agent=Mozilla --wait=5 --random-wait
I was hoping the timeout would be enough to slow things down. I can abort if you think it's a good idea.
Probably the more important thing was that I copied the PDF and Excel files for the WARM model. I am not sure whether the website mirror tools download the supporting files.
@Jan Galkowski @Greg Kochanski a couple of backups should be sufficient - or even one, if you're confident it has at least the important files. If each of you already has one copy, we might want to free up further resources for other data sets.
When you work with multiple copies of the (supposedly) same data set, use a naming convention to identify each copy.
Usually, ISO 8601 timestamps are a good way of doing this for archives.
Let's use a naming convention as follows when we want to keep multiple copies:
The time-stamp is that of work completion, after which the copy should no longer change for reasons related to the source.
Time-stamp format: YYYY-MM-DDThhmmZ
T is a literal separator (ISO 8601).
Z indicates the UTC+0 time zone.
The precision may vary; for example, you may drop the time of day (ThhmmZ) if the date alone is already sufficient.
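A small sketch of producing such a time-stamp from the shell, using the standard `date` utility in UTC. The function names utc_stamp and utc_date_stamp are only illustrative.

```shell
# Print the current UTC time as YYYY-MM-DDThhmmZ.
utc_stamp() {
    date -u +%Y-%m-%dT%H%MZ
}

# Reduced-precision variant: the date alone, still in UTC.
utc_date_stamp() {
    date -u +%Y-%m-%d
}
```

Using `-u` matters here: without it, `date` prints local time and the trailing Z would be wrong.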
You might as well use the same convention for both your work in progress and for publishing, if we want to publish multiple copies.
With our current process, I would rather maintain the latest copy only, with zero or a limited number of older copies. We can keep more of the history later on, when we have better systems in place. This is only a suggestion, and if you insist on saving multiple copies, just consider your resource usage. Keep it economical.
Later on, with more sophisticated systems, we can do things like incremental deltas and/or eliminating duplicate data. This is not economical with our current plain-old-file-systems approach.
Given the urgency, intensity, and importance of these data, I have just completed an independent capture of the epa.gov/climatechange site, using wget and httrack. I will be storing these separately. It is important, I believe, to go at this with independent mindsets in order to capture as much as possible. There is, in fact, some optimization theory behind doing this in what otherwise seems a haphazard manner, and I would suggest it is important to safeguard both Greg's and my versions. I am open to naming conventions, but if these are especially important, I'd suggest that those recommending them provide them, and I'll happily oblige.
The other aspect of this is the WARM model, per Issue #88.
Naming conventions have been provided. Link above. Please follow the instructions. You only need to issue a mkdir command - that will take seconds of your time - to create a timestamped directory name with issue number suffixed.
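As a rough sketch of that mkdir step: the function below creates a directory named with the UTC time-stamp and an issue-number suffix. The exact layout of the name (here `<timestamp>_issue<N>`) and the function name make_copy_dir are assumptions for illustration only; where the linked convention differs, follow the link.

```shell
# Create <base>/<UTC timestamp>_issue<N> and print the resulting path.
# The "_issue<N>" suffix layout is an assumption, not the official convention.
make_copy_dir() {
    base=$1    # parent directory for your copies
    issue=$2   # issue number, digits only
    dir="$base/$(date -u +%Y-%m-%dT%H%MZ)_issue$issue"
    mkdir -p "$dir" && printf '%s\n' "$dir"
}
```

For example, `make_copy_dir /home/jan/local_data 123` would create and print a directory under the working area mentioned earlier in this thread.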
@Jan Galkowski that's okay. The sole purpose of the timestamped/issue-numbered directory name is to distinguish between different copies of the same data set. Whatever works for that purpose is fine. Time and author are among the best possible pieces of information you could use there, so you are fine.
Data should be moved to /var/local/pub before being marked published, I imagine, but I should coordinate with @Greg Kochanski and @Sakari Maaranen. So, is the set_read_only.sh script good to go now? Does it live someplace on pub04, so I don't need to pull it from the repository and download it?
@Jan Galkowski @Greg Kochanski
The script is in the path on all pub servers, in /usr/local/bin/. You can run it directly by typing its name on the command line. You can do that yourself or leave it to me, if you like, but only root can now truly set the files read-only. When you run it yourself, you keep your own write permissions, so the data is effectively read-only for everyone except yourself and root.
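To illustrate the permission state described here, and not the actual contents of set_read_only.sh, which are not shown in this thread: a non-root owner can remove write permission for group and others while keeping their own. The function name restrict_write is hypothetical.

```shell
# Remove write permission for group and others, recursively.
# The owner's own permissions are left untouched.
# Illustration only; this is not the set_read_only.sh script.
restrict_write() {
    # $1: directory (or file) to protect
    chmod -R go-w "$1"
}
```

After this, `ls -l` on a file that was rw-rw-rw- would show rw-r--r--.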