Issue #27 closed

NASA-ESDIS: Oak Ridge National Laboratory DAAC, data site

Jan Galkowski
created an issue

https://daac.ornl.gov/get_data.shtml

Transfer https://daac.ornl.gov/get_data.shtml to Sakari's FTP server in daac.ornl.gov-get_data

Comments (22)

  1. Sakari Maaranen

    Please move these from datarefuge to ~jan/local_data/.

    0 Dec 18 02:12 cdiac.ornl.gov-pub_o
    0 Dec 20 22:18 daac.ornl.gov-get_data
    0 Dec 20 22:19 daac.ornl.gov-web
    

    Delete from datarefuge after move.

  2. Jan Galkowski reporter

    0 Dec 20 22:18 daac.ornl.gov-get_data

    0 Dec 20 22:19 daac.ornl.gov-web

    These are/were placeholders for data coming up from my workstation.

    0 Dec 18 02:12 cdiac.ornl.gov-pub_o

    is being rsync'd now.

  3. Jan Galkowski reporter

    This has been completed and is being moved to /var/local/pub on pub04. SHA sums are being attached.

    Two points. First, because this unit of the DOE and NASA is directly under threat of abolishment by both the administration and the Republican-controlled Congress in their draft proposal for DOE funding, this was started early in our process. As a result, it was buffeted by all the growing pains and reorganization of files, disks, and servers.

    Second, there were originally two tickets: one to preserve the Web site, and this Issue #27, which was to preserve the data. As experience grew with this dataset and with government data sites in general, keeping them apart seemed silly, so both are included together in one directory in /var/local/pub.

    The SHA sums show files for both tickets.
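For anyone who wants to regenerate or check a list of sums like the one attached, here is a minimal sketch. The thread does not say which SHA variant was used, so SHA-256 is an assumption here, and the function names are illustrative only. Each line is written as the digest, two spaces, then the path relative to the mirror root.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash one file in chunks so large data files are never fully in memory."""
    digest = hashlib.sha256()  # assumption: the thread only says "SHA sums"
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root: Path) -> list[str]:
    """Return sorted 'digest  relative/path' lines for every regular file under root."""
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        lines.append(f"{sha256_of(path)}  {path.relative_to(root).as_posix()}")
    return lines
```

Calling `manifest(Path("/var/local/pub/..."))` on the actual directory (its name is not given in the thread) and writing the lines to a file would give something that can be compared against a manifest produced on another server.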

  4. Jan Galkowski reporter

    There is a possibility that some of the data at https://daac.ornl.gov/get_data.shtml was missed, according to a report in the ClimateMirror issues. I don't really know any way of checking this apart from measuring sizes and then comparing directories to see what was gotten and what was not. Also, while httrack appears to do a better job of mirroring the structure of a Web site, it does not handle following FTP links as well as wget does with --follow-ftp. So I'm doing a wget on pub04 and comparing sizes. If this were an FTP pull, I could use du -s -c -b, but it is not, and I do not have a technique for estimating the size of a Web site.
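The size-and-directory comparison described above could be sketched roughly as follows, once both pulls exist as local directory trees. This is only an illustration: the function names are made up, and it assumes the two trees use the same relative paths, which may not hold when one comes from httrack and the other from wget.

```python
import os
from pathlib import Path


def file_sizes(root: Path) -> dict[str, int]:
    """Map each regular file's path (relative to root) to its size in bytes."""
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            if full.is_file():  # skip broken symlinks and other oddities
                sizes[full.relative_to(root).as_posix()] = full.stat().st_size
    return sizes


def compare_mirrors(a: Path, b: Path) -> dict[str, object]:
    """Report total bytes per tree plus files that are missing or differ in size."""
    sizes_a, sizes_b = file_sizes(a), file_sizes(b)
    common = sizes_a.keys() & sizes_b.keys()
    return {
        "total_a": sum(sizes_a.values()),
        "total_b": sum(sizes_b.values()),
        "only_in_a": sorted(sizes_a.keys() - sizes_b.keys()),
        "only_in_b": sorted(sizes_b.keys() - sizes_a.keys()),
        "size_mismatch": sorted(p for p in common if sizes_a[p] != sizes_b[p]),
    }
```

Matching sizes do not prove matching content, so this is a first pass at best; it also cannot say anything about files that neither tool fetched.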

  5. Jan Galkowski reporter

    I don't know, Sakari. I just tried a wget and it returned with very little.

    I have two Web pages that are examples:

    https://daac.ornl.gov/get_data.shtml
    

    and

    https://daac.ornl.gov/cgi-bin/catalog.pl?l
    

    The site requires registration and a sign-in. You can put things in a shopping cart for later download. (No cost.) Not all datasets are available for online download.

    Here are some other pages which are examples:

    I collected a bunch, as an example, in a shopping cart ... all seemingly mediated by JavaScript: 2017-02-06_155001.png

    But there's no hint in the HTML of the page of a directory that one can go to and get all this from, at least none that I could find. (I used the raw HTML view in Chrome's developer tools to look.) What there is instead is a Perl script, which downloads a dataset given a parameter:

    https://daac.ornl.gov/cgi-bin/download.pl?ds_id=1306
    

    I'm thinking of something like a cURL script that issues commands and pulls all these down, starting with ?ds_id=1 and incrementing, but dunno. Could use a scripter here.
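As a starting point for that idea, here is a small sketch that only builds the list of candidate URLs by stepping through ds_id values. The base URL is the one quoted above; the ID range and the function name are placeholders, since the thread does not say how many datasets exist or which IDs are valid. It does not handle the sign-in, which would still have to be solved separately.

```python
from urllib.parse import urlencode

# Download endpoint as quoted in this thread.
BASE_URL = "https://daac.ornl.gov/cgi-bin/download.pl"


def candidate_urls(first_id: int, last_id: int) -> list[str]:
    """Build one download URL per ds_id in the inclusive range first_id..last_id."""
    return [
        f"{BASE_URL}?{urlencode({'ds_id': ds_id})}"
        for ds_id in range(first_id, last_id + 1)
    ]


def write_url_list(path: str, first_id: int, last_id: int) -> int:
    """Write the URLs one per line and return how many were written."""
    urls = candidate_urls(first_id, last_id)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")
    return len(urls)
```

A file written this way could be handed to a downloader that reads URLs from a file (wget's -i option does this), though IDs that do not exist would need to be detected and skipped.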

    What do you all think?

    Greg Kochanski marsroverdriver Benjamin Rose Sakari Maaranen

    Anyway, this explains why just going after this with httrack or wget missed these.

  6. Jan Galkowski reporter

    I was trying to get sizes of the remnant up to Ben's 18 TB using du -s -b -c on azi03 and apparently the connection has either been blocked or throttled. I went in using pub05 and it worked fine. Here's what I learned so far about the /pub subdirectory on eclipse, at the same level as /cdr:

    lftp eclipse.ncdc.noaa.gov:/pub> du -s -c -b ./gacp
    0       ./gacp
    0       ./gpcp
    124528278013    ./gridsat
    868107579283    ./hursat
    lftp eclipse.ncdc.noaa.gov:/pub> du -s -c -b ./ibtracs
    ./ibtracs/v03r03/all/shp/storm: Getting directory contents (12521795) [Waiting for response...]
    Interrupt
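To keep a running tally from pasted output like the above, a small parsing sketch may help. It assumes each useful line is a byte count followed by a path, skips everything else (prompts, progress messages, "Interrupt"), and reports sizes in decimal units. The function names are illustrative.

```python
def parse_du_lines(text: str) -> dict[str, int]:
    """Collect 'bytes  path' lines from pasted du output, ignoring other lines."""
    sizes = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].isdigit():
            sizes[parts[1].strip()] = int(parts[0])
    return sizes


def human(num_bytes: int) -> str:
    """Format a byte count using decimal (power-of-1000) units."""
    value = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if value < 1000:
            return f"{value:.1f} {unit}"
        value /= 1000
    return f"{value:.1f} TB"
```

Summing the values of `parse_du_lines(...)` over several sessions gives a rough total to compare against the remaining space budget.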
    
  7. Jan Galkowski reporter

    So, I tried curl on this, without success. If I do

    curl --anyauth -O https://daac.ornl.gov/cgi-bin/download.pl?ds_id=818

    even without offering a username or password, I get in the download file:

    [jan@azi03 cdiac]$ cat download.pl\?ds_id\=818
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html><head>
    <title>302 Found</title>
    </head><body>
    <h1>Found</h1>
    <p>The document has moved <a href="http://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=818">here</a>.</p>
    <hr>
    <address>Apache Server at daac.ornl.gov Port 443</address>
    </body></html>

    If I try wget with a username and password, I get the same thing.

    I also tried Lynx (!). The server doesn't quite know what to do with it.
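The saved file above is the body of the 302 response itself: curl does not follow redirects unless -L/--location is given. Whether following it would yield data is unknown, since the target is the dataset viewer page rather than a file. For a bulk pull it would at least be useful to flag such stubs automatically. A minimal sketch, assuming the stub always looks like the Apache page shown above (the function name is made up):

```python
import re
from typing import Optional

# Matches the <title> of an Apache-style redirect page, e.g. "302 Found".
_STUB_TITLE = re.compile(r"<title>\s*30[1237] [^<]*</title>", re.IGNORECASE)
# Captures the first link target in the page body.
_HREF = re.compile(r'<a\s+href="([^"]+)"', re.IGNORECASE)


def redirect_target(body: str) -> Optional[str]:
    """Return the link target if body looks like a redirect stub, else None."""
    if not _STUB_TITLE.search(body):
        return None
    match = _HREF.search(body)
    return match.group(1) if match else None
```

Running this over the first few kilobytes of each downloaded file would separate real data from redirect pages, and the returned target shows where the server was trying to send the client.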
