Issue #78 closed
Greg Kochanski
created an issue

Exploring to see what might be interesting. Downloading some, just as a historical reference.

Comments (46)

  1. Sakari Maaranen

    I see @Jan Galkowski has a process downloading this on azi03. Please document your work as instructed, so people who read the issue tracker can coordinate. We cannot work together effectively if you do not communicate what you are doing.

    [sam@azi03 ~]$ sudo ps -U jan -F --cols 4096
    jan      23857 23841 12 42995 22004   2 06:03 pts/3    00:39:54 httrack https://www.epa.gov/climatechange -O . --mirror --depth=8 --ext-depth=3 --max-rate=100000000 %c500 --sockets=30 --retries=30 --host-control=0 TN 60 --near --robots=0 %s
    
  2. Jan Galkowski

    Status:

    1.6G  azi03:/home/jan/local_data/epa.gov.climatechange (epa.gov/climatechange via httrack)
    2.2G pub05:/var/local/jan/epa.gov.climatechange-wget (epa.gov.climatechange via wget)
    
    AZI03:
    httrack 23857  jan  cwd    DIR  253,7    32768 356909057 ./epa.gov.climatechange
    [jan@azi03 local_data]$ ps -ef | grep 23857
    jan      23857 23841  6 06:03 pts/3    00:46:56 httrack https://www.epa.gov/climatechange -O . --mirror --depth=8 \
                                                            --ext-depth=3 --max-rate=100000000 %c500 --sockets=30 --retries=30 \
                                                            --host-control=0 TN 60 --near --robots=0 %s
    PUB05:
    jan      24923 13546  0 06:08 pts/2    00:00:20 wget --dns-timeout=10 --connect-timeout=20 --read-timeout=120 --wait=2 \
                                                         --random-wait -e robots=off --prefer-family=IPv4 --tries=40 --timestamping=on \
                                                         --mirror --recursive --level=8 --no-remove-listing --follow-ftp -nv \
                                                         --output-file=epa.gov.climatechange.log --no-check-certificate https://www.epa.gov/climatechange
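
    (For reference, a rough gloss of the main flags above, per the httrack and wget manuals; worth double-checking against httrack --help and wget --help. The %c500 and %s shorthands are httrack option aliases not glossed here.)

    # httrack: --mirror (mirror mode), --depth=8 (link depth from the start page),
    #   --ext-depth=3 (depth followed on external hosts), --max-rate (bandwidth
    #   cap in bytes/s), --sockets=30 (parallel connections), --retries=30
    #   (retries per file), --near (also fetch non-HTML files linked near pages),
    #   --robots=0 (ignore robots.txt).
    # wget: --mirror (recursion + timestamping), --level=8 (recursion depth),
    #   -e robots=off (ignore robots.txt), --tries=40 (retries per file),
    #   --no-remove-listing (keep FTP .listing files), -nv (terse logging).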
    
    httrack's report for https://www.epa.gov/climatechange says 1.56 GB saved, 557 out of 8246 (estimated) links scanned, running for 11h50m, using 1 active
    connection. (It can use more, but is either opting not to or something.) 6656 files saved; the transfer rate is 26 KB/s. It is currently parsing some HTML file.
    --
    wget is having issues, e.g., for https://www.epa.gov/climatechange (apparently because the site serves a page at paths like .../year and also pages under .../year/2016, so wget saves year as a regular file and then cannot create a directory of the same name):
    >
    > Cannot write to ‘www.epa.gov/newsreleases/search/subject/grants/year/2016’ (Success).
    > www.epa.gov/newsreleases/search/subject/hazardous-waste/year: Not a directory
    > www.epa.gov/newsreleases/search/subject/hazardous-waste/year/2016: Not a directory
    > Cannot write to ‘www.epa.gov/newsreleases/search/subject/hazardous-waste/year/2016’ (Success).
    > www.epa.gov/newsreleases/search/subject/international/year: Not a directory
    > www.epa.gov/newsreleases/search/subject/international/year/2016: Not a directory
    > Cannot write to ‘www.epa.gov/newsreleases/search/subject/international/year/2016’ (Success).
    > www.epa.gov/newsreleases/search/subject/other-news-topics/year: Not a directory
    > www.epa.gov/newsreleases/search/subject/other-news-topics/year/2016: Not a directory
    > Cannot write to ‘www.epa.gov/newsreleases/search/subject/other-news-topics/year/2016’ (Success).
    >
    
  3. Jan Galkowski

    As a precursor to killing them, I am stopping the additional pulls I started against https://www.epa.gov/climatechange. I will continue to work Issue #88, though. I must say, however, given Climate Mirror's Issue #123, I imagine the site is heavily loaded.

  4. Jan Galkowski

    Continuing my jobs until I hear more from @Greg Kochanski ... Using:

    httrack "https://www.epa.gov/climatechange" -O . -i --mirror --depth=8 --ext-depth=3 --max-rate=100000000 %c500 --sockets=30 \
            --retries=30 --host-control=0 TN 60 --near --robots=0 %s
    

    because I had to kill the job and restart it. Note the -i, which tells httrack to continue an interrupted mirror using its cache.

  5. Jan Galkowski
    Forwarded from climate-mirror/datasets Issue #123:

    From: Diane Trout <notifications@github.com>
    To: climate-mirror/datasets <datasets@noreply.github.com>
    Subject: Re: [climate-mirror/datasets] Any EPA pages we can save? FAST? (#123)
    Date: Wednesday, January 25, 2017 12:55

    I have two wgets running www.epa.gov/energy/ and www.epa.gov/warm/
    
    wget --mirror --warc-file= --warc-cdx --page-requisites --html-extension --convert-links --execute robots=off --directory-prefix=. --domains www.epa.gov --user-agent=Mozilla --wait=5 --random-wait
    
    I was hoping the timeout would be enough to slow things down. I can abort if you think it's a good idea.
    
    Probably the more important thing is that I copied the PDF and Excel files for the WARM model. I'm not sure whether the website mirror tools download the supporting files.
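
    (For reference, a complete invocation of this shape also needs a WARC basename and a target URL; both values below are illustrative, since they were elided in the forwarded email:)

    wget --mirror --warc-file=epa-energy --warc-cdx --page-requisites \
         --html-extension --convert-links --execute robots=off \
         --directory-prefix=. --domains www.epa.gov --user-agent=Mozilla \
         --wait=5 --random-wait https://www.epa.gov/energy/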
    
    
  6. Greg Kochanski reporter

    I got the website. Downloading ended on Jan 19. I'll start uploading it.

    (I'm still working on the database; I have most of that. I'm going over it again to get it all uniformly in .xml format.)

  7. Greg Kochanski reporter

    It's going to pub04.rz21.azimuthproject-kickstarter.org:/var/local/gpk/i78_www.epa.gov_climatechange.

    (Actually, I'm not sure it's the whole website, as it was done with a finite depth and centered on the climate change pages, but it's 45494 files and 16 GB.)

  8. Greg Kochanski reporter

    My copy is uploaded to pub04.rz21.azimuthproject-kickstarter.org:/var/local/gpk/i78_www.epa.gov_climatechange.

    We'll have to think a bit about what to do with two copies. Maybe this is the first act of our monitoring process.

  9. Sakari Maaranen

    When you work with multiple copies of the (supposedly) same data set, use a naming convention to identify each data set.

    Usually, ISO 8601 timestamps are a good way of doing this for archives.

    Let's use a naming convention like the following when we want to keep multiple copies:

    /var/local/user/Data_Set_Name/TIMESTAMP/
    

    The timestamp is that of work completion, after which the data set should no longer change for reasons related to the source.

    Time-stamp format: YYYY-MM-DDThhmmZ

    • T is a literal separator (ISO 8601).
    • Z indicates the UTC+0 time zone.

    The precision may vary; for example, you may drop the time of day (ThhmmZ) if the date alone is already sufficient.
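
    As a minimal sketch, a conforming timestamp can be generated with GNU date:

    # Print the current UTC time in the agreed format, e.g. 2017-02-10T0602Z:
    date -u +%Y-%m-%dT%H%MZ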

    You might as well use the same convention for both your work in progress and for publishing, if we want to publish multiple copies.

    With our current process, I would rather maintain the latest copy only, with zero or a limited number of older copies. We can keep more of the history later on, when we have better systems in place. This is only a suggestion, and if you insist on saving multiple copies, just consider your resource usage. Keep it economical.

    Later on, with more sophisticated systems, we can do things like incremental deltas and/or eliminating duplicate data. This is not economical with our current plain-old-file-systems approach.

  10. Greg Kochanski reporter

    In this case, because it's small and because one copy is from before Jan 20, we want both.

    Timestamping is a good idea. I'll do that on mine, when I'm back online.

  11. Sakari Maaranen

    Let's use the Data Set Name = EPA for publishing this. I am creating /var/local/pub/EPA/ and moving borislav's copies there. Please also move other work targeting epa.gov there when finished.

  12. Greg Kochanski reporter

    Set read-only and moved to pub04.rz21.azimuthproject-kickstarter.org:/var/local/pub/EPA/2017-01-18T2001Z_78_www.epa.gov_climatechange

    Note that the EPA database downloads are a separate issue (#79).

  13. Jan Galkowski

    Given the urgency, intensity, and importance of these data, I have just completed an independent capture of the epa.gov/climatechange site, using both wget and httrack. I will be storing these separately. It is important, I believe, to approach this with independent mindsets in order to capture as much as possible. There is, in fact, some optimization theory behind doing this in what otherwise seems a haphazard manner, and I would suggest it is important to safeguard both Greg's and my versions. I am open to naming conventions, but if these are especially important, I'd suggest that those recommending them provide them, and I'll happily oblige.

    The other aspect of this is the WARM model, per Issue #88.

    As far as sizes go, this is what I ended up with:

       12827384165   ./epa.gov.climatechange
      114610817919   ./epa.gov.climatechange-wget
      111102666081   ./epa.gov.warm-model-wget
       12664964909   ./epa.gov.warm-model
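
    (Byte totals like these can be produced with du; the exact command used isn't stated, but something like:)

    # Summarize each tree in bytes (-s: per argument, -b: byte counts):
    du -sb ./epa.gov.climatechange ./epa.gov.climatechange-wget \
           ./epa.gov.warm-model-wget ./epa.gov.warm-model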
    
  14. Sakari Maaranen

    Naming conventions have been provided. Link above. Please follow the instructions. You only need to issue a mkdir command - that will take seconds of your time - to create a timestamped directory name with the issue number suffixed. For example:
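
    # Hypothetical directory for this issue, following the convention above
    # and mirroring the existing 2017-01-18T2001Z_78_www.epa.gov_climatechange:
    mkdir -p "/var/local/pub/EPA/$(date -u +%Y-%m-%dT%H%MZ)_78_www.epa.gov_climatechange"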

  15. Jan Galkowski

    Yeah, but how do I distinguish between Greg's stuff and my stuff? And how do I distinguish between my stuff collected by HTTRACK and my stuff collected by WGET? These are not at all equivalent.

    Again, naming conventions.

  16. Sakari Maaranen

    @Jan Galkowski that's okay. The sole purpose of the timestamped, issue-numbered directory name is to distinguish between different copies of the same data set. Whatever works for that purpose is fine. Time and author are among the best possible pieces of information you could use there, so you are fine.

  17. Jan Galkowski

    Okay, I have grabbed my httrack and wget copies and put them both into:

    /var/local/jan/epa.gov-climatechange-jtgalkowski/2017-02-10T0602Z

    soon to be

    /var/local/pub/epa.gov-climatechange-jtgalkowski/2017-02-10T0602Z

    on pub04.

    SHA sums attached.

    Size is 127438210276 bytes, or 127 GB.
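
    (A typical way to produce the SHA sums mentioned above; the exact digest tool used isn't stated, so sha256sum here is an assumption:)

    # Hash every file under the current directory into one manifest file.
    find . -type f -print0 | xargs -0 sha256sum > SHA256SUMS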

  18. Sakari Maaranen

    @Jan Galkowski @Greg Kochanski The script is in the path on all pub servers, in /usr/local/bin/. You can run it directly simply by typing its name on the command line. You can also leave that to me, if you like, or do it yourself, but note that only root can truly set the files read-only. When you run it yourself, you will still retain your own write permissions, effectively making the data read-only for everyone except yourself and root.
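
    (In other words, a sketch of the distinction, not the script itself, whose name is elided above:)

    # As an ordinary user: remove write access for group and others;
    # the owner can still modify the files.
    chmod -R go-w /var/local/pub/EPA/2017-02-10T0602Z_78_epa.gov-climatechange-jtgalkowski
    # Only root can go further, e.g. by transferring ownership and then
    # removing all write access:
    sudo chown -R ftp:ftp /var/local/pub/EPA/2017-02-10T0602Z_78_epa.gov-climatechange-jtgalkowski
    sudo chmod -R a-w /var/local/pub/EPA/2017-02-10T0602Z_78_epa.gov-climatechange-jtgalkowski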

  19. Sakari Maaranen

    I have now set the owner and group to ftp. Renamed to better follow our naming convention:

    [sam@pub04 EPA]$ pwd
    /var/local/pub/EPA
    [sam@pub04 EPA]$ ls -1 | grep galkowski
    2017-02-10T0602Z_78_epa.gov-climatechange-jtgalkowski
    2017-02-10T0602Z_88_epa.gov-warm-jtgalkowski
    

    Calculated total usage and updated README.txt:

    $ du -sbcBG /var/local/pub/EPA/*
    1G      /var/local/pub/EPA/2016-12-26T0312Z_41_www.epa.gov_superfund
    16G     /var/local/pub/EPA/2017-01-18T2001Z_78_www.epa.gov_climatechange
    119G    /var/local/pub/EPA/2017-02-10T0602Z_78_epa.gov-climatechange-jtgalkowski
    116G    /var/local/pub/EPA/2017-02-10T0602Z_88_epa.gov-warm-jtgalkowski
    250G    total
    

    Note that there is a huge difference in disk usage between what @Jan Galkowski and @Greg Kochanski have done, because they used different approaches.
