EPA WARM model, as much as we can get

Issue #88 closed
Jan Galkowski
created an issue


Doing this on azi03 (because there's room) and pub05, the former using httrack and the latter using wget.

Tried azi01, but because I don't have sudo I couldn't install httrack, which should have been installed by default. Will move later. This is an emergency.

  1. Sakari Maaranen

    You have space on pub01.rz13:/var/local/jan/. You don't need sudo, Jan. The server has httrack.

    I see this is your current process:

    [sam@azi03 ~]$ sudo ps -U jan -F --cols 4096
    jan      23872 23858 33 45818 31676   1 06:03 pts/7    01:48:41 httrack https://www.epa.gov/warm -O . --mirror --depth=8 --ext-depth=3 --max-rate=100000000 %c500 --sockets=30 --retries=30 --host-control=0 TN 60 --near --robots=0 %s
  2. Jan Galkowski reporter


    12.6G azi03:/home/jan/local_data/epa.gov.warm-model (epa.gov/warm-model via httrack)
    1.7G pub05:/var/local/jan/epa.gov.warm-model-wget (epa.gov.warm-model via wget)
    httrack 23872  jan  cwd    DIR  253,7    69632 52035585 ./epa.gov.warm-model
    [jan@azi03 local_data]$ ps -ef | grep 23872
    jan      23872 23858 25 06:03 pts/7    02:58:07 httrack https://www.epa.gov/warm -O . --mirror --depth=8 \
                                                              --ext-depth=3 --max-rate=100000000 %c500 --sockets=30 --retries=30 \
                                                              --host-control=0 TN 60 --near --robots=0 %s
    jan      24948 24924  0 06:09 pts/8    00:00:20 wget --dns-timeout=10 --connect-timeout=20 --read-timeout=120 --wait=2 \
                                                         --random-wait -e robots=off --prefer-family=IPv4 --tries=40 --timestamping=on \
                                                         --mirror --recursive --level=8 --no-remove-listing --follow-ftp -nv \
                                                         --output-file=epa.gov.warm-model.log --no-check-certificate https://www.epa.gov/warm
    The httrack report for https://www.epa.gov/warm is hard to summarize, but the crawl appears to have been somewhat sidetracked onto allied, non-EPA Web sites. That cannot be helped on an emergency basis. A number of these are PR items advocating for why EPA is helping.
    wget having trouble:
    > 2017-01-25 17:47:13 URL:https://www.epa.gov/newsreleases/search/field_press_office/region-06?filter=&page=17 [45722/45722] -> "www.epa.gov/newsreleases/search/field_press_office/region-06?filter=&page=17" [1]
    > www.epa.gov/newsreleases/search/field_press_office/region-06/field_press_office: Not a directorywww.epa.gov/newsreleases/search/field_press_office/region-06/field_press_office/region-07: Not a directory
    > Cannot write to ‘www.epa.gov/newsreleases/search/field_press_office/region-06/field_press_office/region-07’ (Success).
    > www.epa.gov/newsreleases/search/field_press_office/region-06/field_press_office: Not a directorywww.epa.gov/newsreleases/search/field_press_office/region-06/field_press_office/region-08: Not a directory
    > Cannot write to ‘www.epa.gov/newsreleases/search/field_press_office/region-06/field_press_office/region-08’ (Success).
    > www.epa.gov/newsreleases/search/field_press_office/region-06/field_press_office: Not a directorywww.epa.gov/newsreleases/search/field_press_office/region-06/field_press_office/region-09: Not a directory
    > Cannot write to ‘www.epa.gov/newsreleases/search/field_press_office/region-06/field_press_office/region-09’ (Success).
  3. Jan Galkowski reporter
    I have two wgets running www.epa.gov/energy/ and www.epa.gov/warm/
    wget --mirror --warc-file= --warc-cdx --page-requisites --html-extension --convert-links --execute robots=off --directory-prefix=. --domains www.epa.gov --user-agent=Mozilla --wait=5 --random-wait
    I was hoping the timeout would be enough to slow things down. I can abort if you think it's a good idea.
    Probably the more important thing is that I copied the PDF and Excel files for the WARM model. I'm not sure whether the website mirroring tools download the supporting files.
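
One way to check whether a mirror picked up the supporting files is to count the PDF and Excel files under the mirror directory. The following is only a sketch: the function name `count_supporting_files` is hypothetical, the extension list is an assumption about which formats the WARM downloads use, and `-iname` assumes GNU or BSD `find`.

```shell
#!/bin/sh
# Hypothetical helper: count PDF and Excel files under a mirror directory.
# The extension list is an assumption, not taken from the EPA site.
count_supporting_files() {
    # ${1:-.} falls back to the current directory when no argument is given
    find "${1:-.}" -type f \
        \( -iname '*.pdf' -o -iname '*.xls' -o -iname '*.xlsx' -o -iname '*.xlsm' \) \
        | wc -l | tr -d ' '
}

# Example: count supporting files under the current directory
count_supporting_files .
```

Running it from inside each mirror directory (the httrack one and the wget one) would give a rough comparison against the files copied by hand.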

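The "Not a directory" errors in the wget log above look like a file-versus-directory collision: a page seems to have been saved as a regular file named `region-06`, and a later URL needs `region-06/` to be a directory. That reading is an assumption based on the log lines, not something confirmed in the thread. A minimal sketch that reproduces the same kind of failure in a scratch directory (all paths here are illustrative):

```shell
#!/bin/sh
# Reproduce a file-vs-directory collision in a throwaway directory.
scratch=$(mktemp -d)
mkdir -p "$scratch/field_press_office"

# First download: the page is stored as a regular file named region-06
: > "$scratch/field_press_office/region-06"

# Later download: the same name is now needed as a directory component
if mkdir -p "$scratch/field_press_office/region-06/field_press_office" 2>/dev/null; then
    result="created"
else
    result="not a directory"
fi
echo "$result"

rm -rf "$scratch"
```

If that is what is happening, the affected pages are search-result listings under `newsreleases/search`, so the loss may matter little for the WARM model itself.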