Issue #23 closed
Sakari Maaranen
created an issue

2016-12-18 Hetzner storage box in Germany

wget --dns-timeout=10 --connect-timeout=20 --read-timeout=120 \
 --wait=12 --random-wait --follow-ftp --progress=dot:mega \
 --prefer-family=IPv4 --tries=40 --timestamping=on --recursive --level=inf \
 --no-remove-listing --no-check-certificate \
 --output-file=pds.nasa.gov.log \
 -H -Dpds.nasa.gov https://pds.nasa.gov

Comments (19)

  1. Sakari Maaranen reporter

    This job also seems to have gotten stuck in dynamically generated pages, following an endless chain of links. I assume the valuable content has already been downloaded, but I cannot be sure until someone familiar with the data inspects it.

    Interrupting the download.
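
One possible way to keep a recursive crawl out of dynamically generated pages is wget's `--reject-regex` option, which skips any URL matching a pattern. The sketch below is only an illustration: the pattern (reject every URL that carries a query string) and the first sample URL are assumptions, not something verified against the site. It checks the pattern with grep first and prints the resulting command instead of running it.

```shell
# Hypothetical pattern: treat any URL carrying a query string as dynamic.
REJECT_PATTERN='[?]'

# Return 0 (true) if a URL matches the reject pattern.
is_rejected() {
    printf '%s\n' "$1" | grep -Eq "$REJECT_PATTERN"
}

# Print the crawl command instead of running it, so it can be reviewed first.
print_crawl_command() {
    printf 'wget --recursive --level=inf --wait=12 --random-wait --reject-regex=%s -H -Dpds.nasa.gov https://pds.nasa.gov\n' "'$REJECT_PATTERN'"
}

# The first URL is a made-up sample; the second has the shape of a
# dynamically rendered search page.
for url in \
    'https://pds.nasa.gov/data/index.html' \
    'http://ppi.pds.nasa.gov/search/render/?id=pds://PPI/MEX-M-MARSIS-3-RDR-AIS-EXT4-V1.0'
do
    if is_rejected "$url"; then
        echo "skip  $url"
    else
        echo "fetch $url"
    fi
done
print_crawl_command
```

A pattern this broad would also skip any real data that is only reachable through query-string URLs, so it would need narrowing by someone who knows the site.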

  2. Sakari Maaranen reporter

    The following directories and files were created as a part of this job.

    0 Dec 18 23:17 atmos.pds.nasa.gov
    0 Dec 20 12:47 geo.pds.nasa.gov
    0 Dec 21 03:27 img.pds.nasa.gov
    0 Dec 21 00:34 mgmt.pds.nasa.gov
    0 Dec 19 03:05 naif.pds.nasa.gov
    0 Dec 21 02:25 pds.nasa.gov
    139824623 Jan  9 10:20 pds.nasa.gov.log
    0 Jan  5 13:38 ppi.pds.nasa.gov
    0 Dec 18 23:29 rings.pds.nasa.gov
    0 Dec 21 07:31 sbn.pds.nasa.gov
    

    All the above directories and files should now be moved together, away from datarefuge. I have a disk usage command running and will report size as soon as it completes.

  3. Sakari Maaranen reporter
    [sam@azi03 datarefuge]$ nice ionice du -s -c -b *
    364     atmos.pds.nasa.gov
    1949771752      geo.pds.nasa.gov
    167670838       img.pds.nasa.gov
    257551594       mgmt.pds.nasa.gov
    50579609        naif.pds.nasa.gov
    466522276       pds.nasa.gov
    139824623       pds.nasa.gov.log
    2946156955381   ppi.pds.nasa.gov
    16165   rings.pds.nasa.gov
    621398926403    sbn.pds.nasa.gov
    
  4. Sakari Maaranen reporter
    • changed status to resolved
    • edited description

    Not sure if this is all, but it is a lot:

    [root@azi03 datarefuge]# du --summarize --total --apparent-size -BG *pds*
    2G      geo.pds.nasa.gov
    1G      img.pds.nasa.gov
    1G      mgmt.pds.nasa.gov
    1G      naif.pds.nasa.gov
    1G      pds-datarefuge.listing
    1G      pds.nasa.gov
    1G      pds.nasa.gov.log
    2744G   ppi.pds.nasa.gov
    1G      rings.pds.nasa.gov
    579G    sbn.pds.nasa.gov
    3326G   total
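
For reading the listing above: with `-BG`, du rounds every entry up to the next whole GiB, which is why even tiny directories show as 1G. A minimal sketch of that rounding, applied to three of the byte counts reported earlier (the function name is made up for illustration):

```shell
# Round a byte count up to whole GiB (2^30 bytes), the way du -BG reports sizes.
bytes_to_gib_ceil() {
    echo $(( ($1 + 1073741823) / 1073741824 ))
}

bytes_to_gib_ceil 2946156955381   # ppi.pds.nasa.gov   -> 2744
bytes_to_gib_ceil 621398926403    # sbn.pds.nasa.gov   -> 579
bytes_to_gib_ceil 16165           # rings.pds.nasa.gov -> 1
```

The total line is rounded from the summed bytes in the same way, so it does not have to equal the sum of the rounded rows.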
    
  5. Sakari Maaranen reporter

    I see you already did. I have not asked you to delete any data on azi03.

    I have asked you to let processes targeting your local data there finish, so I can remount. Data will be kept.

    Here is the message that shows on the server every time you log in. It has been there since January 18:

    [root@azi03 ~]# cat /etc/motd
    Jan, no need to interrupt work on this server (azi03). Data is safe here.
    However, please don't start new long running processes
    reading or writing ~jan/local_data/.
    
    Let running ones finish.
    
    Make sure no screen has shell open currently in that directory.
    
    When your local_data is idle, I will re-mount it under /var/local/.
    You can then continue. Data will be kept.
    
    Thank you!
    
  6. Sakari Maaranen reporter

    Well, at least it's clear what to do next. I'll download it again. Will take some weeks.

    Meanwhile, please observe the above message for other work on the same server.

    I am assuming you take your lesson without me stating the obvious.

  7. Sakari Maaranen reporter

    I already downloaded this data set once, and I am not in the habit of giving up. Since I reported it as done, I will see it done.

    It only takes a long time because the job is configured to go easy on resources. We could of course download faster, but that would put more load on the target server. I could remove the 'wait' parameter.
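
As a rough illustration of the cost of waiting: with `--wait=12 --random-wait`, wget pauses between 0.5 and 1.5 times the wait value between requests, so about 12 seconds on average. The sketch below estimates the days spent in those pauses alone; the request count is a hypothetical figure, not taken from the actual job, and transfer time is ignored.

```shell
# Days spent purely in --wait pauses.
# $1 = number of requests (hypothetical), $2 = mean wait in seconds.
estimate_wait_days() {
    echo $(( $1 * $2 / 86400 ))   # integer division, rounds down
}

estimate_wait_days 100000 12   # prints 13
```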

  8. Sakari Maaranen reporter

    Note that your httrack command has a different depth configuration, and you have limited it to 1 Mbps. That won't work. Please terminate your job.

    If you want to try it, do it on some other server, because both jobs won't fit on pub05.

  9. Sakari Maaranen reporter

    I was able to restore 896 G of the data. Then the remote end disconnected me:

    2017-01-27 17:46:31 URL:http://ppi.pds.nasa.gov/search/render/?id=pds://PPI/MEX-M-MARSIS-3-RDR-AIS-EXT4-V1.0 [14116] -> "ppi.pds.nasa.gov/search/render?id=pds:%2F%2FPPI%2FMEX-M-MARSIS-3-RDR-AIS-EXT4-V1.0.html" [1]
    Last-modified header missing -- time-stamps turned off.
    Read error (Connection timed out) in headers.
    Read error (Connection timed out) in headers.
    Read error (Connection timed out) in headers.
    Read error (Connection timed out) in headers.
    Read error (Connection timed out) in headers.
    Read error (Connection timed out) in headers.
    Read error (Connection timed out) in headers.
    Read error (Connection timed out) in headers.
    Read error (Connection timed out) in headers.
    Read error (Connection timed out) in headers.
    Read error (Connection timed out) in headers.
    Read error (Connection reset by peer) in headers.
    Read error (Connection reset by peer) in headers.
    Read error (Connection reset by peer) in headers.
    

    The 896 G that I got is now on pub01.
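
To see how often the remote end dropped the connection, the read errors in the wget log can be tallied. A log like this may carry several messages on one physical line, so the sketch counts matches rather than lines. The helper name is made up for illustration, and the log file name is assumed from the `--output-file` option of the original command.

```shell
# Count every occurrence of a fixed string in a file, even when several
# occurrences share one line (grep -c would count lines, not matches).
count_matches() {
    grep -oF -- "$1" "$2" | wc -l | tr -d ' '
}

# Assumed log name, taken from the --output-file option of the original job.
LOG=pds.nasa.gov.log
if [ -f "$LOG" ]; then
    echo "timed out: $(count_matches 'Read error (Connection timed out)' "$LOG")"
    echo "reset:     $(count_matches 'Read error (Connection reset by peer)' "$LOG")"
fi
```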
