NOAA's National Centers for Environmental Information (NCEI) Web site

Issue #5 resolved
Jan Galkowski
created an issue

Web site:

    https://www.ncdc.noaa.gov

Downloading to:

azi03:/home/jan/local_data/eclipse.ncdc.noaa.gov-web

using this:

wget --dns-timeout=10 --connect-timeout=300 --read-timeout=120 --wait=5 --mirror --random-wait --user-agent="Lynx/2.8.8dev.5 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/2.8.6" --page-requisites --retry-connrefused --prefer-family=IPv4  --tries=40 --timestamping=on --recursive --level=8 --no-remove-listing  --follow-ftp -nv --output-file=ncdc-noaa-gov-web.log --no-check-certificate https://www.ncdc.noaa.gov

Comments (24)

  1. Jan Galkowski reporter

    This task was attempted with:

    wget --wait=10 -nv -4 --output-file=www-ncdc-noaa-gov-errors.log --no-check-certificate -t 40 -nc -r -p https://www.ncdc.noaa.gov
    

    and failed.

  2. Sakari Maaranen

    We can use the milestones 0-Identified and 1-Specified to distinguish between sources that have merely been identified and those that have been analyzed so that we know what should be backed up.

    It probably takes some manual work to explore the structure of each source and to determine the best way to process it. Some may be trivial, some rather complicated.

    Do we have people to do those analyses properly? Each source takes anywhere from a few minutes to a few hours.

  3. Jan Galkowski reporter

    @John Baez That "eclipse.ncdc.noaa.gov-web" is the name of a directory on Sakari's FTP server. When the data is transferred, that's where it will live. Right now, there's nothing to put there because of the difficulties described in this ticket.

    A number of the other transfers are "in progress", but they are going to an intermediate spot first and not directly to the FTP server. It's a two-step process.

  4. Jan Galkowski reporter

    Trying this:

    wget --dns-timeout=10 --connect-timeout=300 --read-timeout=120 --wait=5 --mirror --random-wait --user-agent="Lynx/2.8.8dev.5 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/2.8.6" --page-requisites --retry-connrefused --prefer-family=IPv4  --tries=40 --timestamping=on --recursive --level=8 --no-remove-listing  --follow-ftp -nv --output-file=ncdc-noaa-gov-web.log --no-check-certificate https://www.ncdc.noaa.gov
    
  5. Jan Galkowski reporter

    Found the problem. The site has a very restrictive robots.txt:

    Disallow: /paleo/pubs/
    Disallow: /prototypes/
    Disallow: /records/*.php?
    Disallow: /snow-and-ice/extent/*/
    Disallow: /snow-and-ice/recent-snow/*/
    Disallow: /snow-and-ice/rsi/societal-impacts/
    Disallow: /snow-and-ice/snow-cover/*/
    Disallow: /societal-impacts/air-stagnation/*/
    Disallow: /societal-impacts/apparent-temp/*/
    Disallow: /societal-impacts/csig/*/
    Disallow: /societal-impacts/redti/*/
    Disallow: /societal-impacts/wildfires/*/
    Disallow: /societal-impacts/wind/*/
    Disallow: /stormevents/*.jsp?
    Disallow: /stormevents/csv
    Disallow: /swdiws/
    Disallow: /thredds/
    Disallow: /teleconnections/*.php?
    Disallow: /temp-and-precip/*.php?
    Disallow: /temp-and-precip/alaska/*/
    Disallow: /temp-and-precip/asos/*/
    Disallow: /temp-and-precip/climatological-rankings/?
    Disallow: /temp-and-precip/climatological-rankings/download.xml
    Disallow: /temp-and-precip/drought/nadm/nadm-maps.php/?
    Disallow: /temp-and-precip/drought/nadm/climatology/*/
    Disallow: /temp-and-precip/drought/nadm/indices/*/
    Disallow: /temp-and-precip/drought/nadm/maps/*/
    Disallow: /temp-and-precip/drought/recovery/climatology/*/
    Disallow: /temp-and-precip/drought/recovery/current/*/
    Disallow: /temp-and-precip/drought/weekly-palmers/
    Disallow: /temp-and-precip/global-temps/*/
    Disallow: /temp-and-precip/msu/*/
    Disallow: /temp-and-precip/national-temperature-index/*?
    Disallow: /temp-and-precip/time-series/?
    Disallow: /temp-and-precip/us-weekly/
    

    Changing the wget command above to add -e robots=off, so that wget ignores the site's robots.txt.
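    For reference, here is a minimal sketch of what the adjusted command might look like: the flags from comment 4, with -e robots=off added. The DRY_RUN guard is a hypothetical addition for illustration and is not part of the original command; by default the script only prints the command line so it can be reviewed before starting a long mirror run.

```shell
#!/bin/sh
# Sketch only: the wget options from comment 4 plus "-e robots=off".
# DRY_RUN is a hypothetical guard added for this example;
# set DRY_RUN=0 to actually start the mirror.
DRY_RUN="${DRY_RUN:-1}"

# Collect the wget arguments as positional parameters so that the
# user-agent string (which contains spaces) stays a single argument.
set -- \
  -e robots=off \
  --dns-timeout=10 --connect-timeout=300 --read-timeout=120 \
  --wait=5 --random-wait \
  --mirror --recursive --level=8 --page-requisites \
  --user-agent="Lynx/2.8.8dev.5 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/2.8.6" \
  --retry-connrefused --prefer-family=IPv4 --tries=40 \
  --timestamping=on --no-remove-listing --follow-ftp \
  -nv --output-file=ncdc-noaa-gov-web.log \
  --no-check-certificate \
  https://www.ncdc.noaa.gov

if [ "$DRY_RUN" = "1" ]; then
  # Print the command instead of running it.
  printf 'wget'
  printf ' %s' "$@"
  printf '\n'
else
  wget "$@"
fi
```

    With robots.txt ignored, the --wait=5 and --random-wait options are the main things keeping the crawl polite, so it is probably worth leaving them in.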
