Benjamin Rose
There have been more than a few data targets that are simply too large for Azimuth to hold on to. Soon, I will have over 300T of free space available (500T total), pending arrival of hardware. Once it arrives and is set up, I'd like to hit the ground running with a list of larger targets that other efforts haven't been able to mirror in full, especially those that so far exist only as a collection of parts. This ticket can be a good starting point for these targets.

So far I have a few size estimates that will be no problem to fit: 165T, 18T, and 13T, with one more size estimation ongoing.

Please suggest any other targets available to be copied wholesale and, if possible, a size estimate of the complete dataset. I am also happy to start size calculations on my own infrastructure if given a target suggestion. I'm mainly interested in ftp & rsync targets that I can hit and mirror completely; http sites are a lot messier, and I frankly don't have much time available to curate a website scrape.
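For anyone who wants to attach a size estimate to an rsync suggestion, here is a minimal sketch of one way to do it: ask rsync for a recursive listing without transferring anything, then add up the sizes of the regular files. It assumes the usual `rsync --list-only` line layout (permissions, size, date, time, path) and tolerates the comma-grouped sizes that newer rsync versions print. The function names are made up for illustration, and the target URL is whatever you pass in.

```python
import subprocess
from typing import Tuple


def total_bytes_from_listing(listing: str) -> Tuple[int, int]:
    """Return (file_count, total_bytes) for regular files in a listing.

    Assumes each entry looks like:
        -rw-r--r--      1,234,567 2017/01/15 10:22:31 path/to/file
    Directories, symlinks, and non-listing lines (MOTD, etc.) are skipped.
    """
    files = 0
    total = 0
    for line in listing.splitlines():
        parts = line.split(None, 4)
        # Regular files have a permission string starting with "-".
        if len(parts) < 5 or not parts[0].startswith("-"):
            continue
        try:
            size = int(parts[1].replace(",", ""))
        except ValueError:
            continue  # not a listing line after all
        files += 1
        total += size
    return files, total


def estimate_rsync_target(url: str) -> Tuple[int, int]:
    """Run rsync in list-only mode against `url` and total the file sizes."""
    result = subprocess.run(
        ["rsync", "--recursive", "--list-only", url],
        check=True,
        capture_output=True,
        text=True,
    )
    return total_bytes_from_listing(result.stdout)
```

Calling `estimate_rsync_target()` with an rsync URL returns a file count and a byte total. On very large targets the listing itself can take a long time, so it may be worth running it per top-level directory.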

  1. Jeremiah Curtis

    Langley Atmospheric Science Data Center 144 TB

    Oak Ridge National Laboratory DAAC (this has been started elsewhere with about 285 GB saved that I know of; I don't know if it's complete or not). It requires ordering the data (a pretty simple process), waiting a few minutes for an email with the relevant https data links, and using a download manager to grab those links. ORNL download page:

    From what I can tell here on Azimuth and from the mirroring effort at GitHub, it would appear that the Langley effort has barely started. If I'm wrong, someone please let me know. I would grab this data first unless the ORNL effort is truly incomplete (see issue #26).

  2. Jeremiah Curtis

    Also, the LP DAAC contains several PB of data. I don't know if you know anyone who has that capacity or not. Obviously you would need several spectacularly fast connections.

  3. Jeremiah Curtis

    FWIW, has several issues over at github

    that have either been started or finished

    For example, has a post explaining that at (which presumably mirrors the ftp site), there is 5.1 TB of data, but there is a large .tar.gz file containing all relevant data. You may want to look into this; just a heads up (and obviously not every ncdc subdirectory is going to have such a streamlined archive).

  4. Jeremiah Curtis

    ORNL data is NOT completely mirrored; far from it, in fact


    almost 5TB

    I ran through all the "field collections" at and came up with 157600 files totalling 4782436.84 MB (655 datasets)

    Several things to note:

    1) You have to order the data from ORNL; they will send you a single link to an https directory containing ALL the data in your order. Signup and ordering are ridiculously easy, and you will usually have the link in your email inbox within 30 minutes. The https directory is rather simple insofar as you don't have to recursively download more than 3 or 4 levels (you can see what I mean when you get the links).

    2) Internet Download Manager and the DownThemAll extension in Firefox work splendidly with the https link (I have been able to download up to 10 files simultaneously).

    3) The link is valid for one week, and cart orders are limited to 1 TB (so you would need at least five orders for the field collections data).

    4) I am currently working on the global/regional datasets, but to my knowledge no one is working on the field collections, either at GitHub or Azimuth. I would grab these datasets, but I have neither the space nor the speed to do so.
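The order count follows from the numbers above: 4782436.84 MB against a 1 TB cart limit. Below is a rough sketch of that arithmetic, plus one possible way to group datasets into orders that stay under the cap (first-fit decreasing). It assumes 1 TB means 1,000,000 MB and that a dataset cannot be split across orders; neither is confirmed by ORNL. The function names and the dataset names in the usage note are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Assumption: the 1 TB cart limit is decimal (1 TB = 1,000,000 MB).
ORDER_CAP_MB = 1_000_000


def min_orders(total_mb: float, cap_mb: float = ORDER_CAP_MB) -> int:
    """Lower bound on the number of orders needed for total_mb of data."""
    return math.ceil(total_mb / cap_mb)


def plan_orders(
    dataset_sizes_mb: Dict[str, float], cap_mb: float = ORDER_CAP_MB
) -> List[List[Tuple[str, float]]]:
    """Group datasets into orders with first-fit decreasing.

    Largest datasets are placed first; each goes into the first order
    that still has room, or into a new order if none does.
    """
    orders: List[dict] = []
    by_size = sorted(dataset_sizes_mb.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size:
        if size > cap_mb:
            raise ValueError(f"{name} alone exceeds the order cap")
        for order in orders:
            if order["free"] >= size:
                order["items"].append((name, size))
                order["free"] -= size
                break
        else:
            # No existing order had room, so open a new one.
            orders.append({"free": cap_mb - size, "items": [(name, size)]})
    return [order["items"] for order in orders]


# Field collections total quoted in this thread.
print(min_orders(4782436.84))
```

`plan_orders({"set_a": 600_000, "set_b": 500_000, "set_c": 400_000, "set_d": 300_000})` packs those four made-up datasets into two orders. With real per-dataset sizes, the packed result may need more orders than the lower bound from `min_orders`, since datasets don't divide evenly.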

  5. Benjamin Rose reporter
  6. Benjamin Rose reporter

    progress report (* = completed initial download):

    23G     /var/www/html/pub/climatemirror/dscovr *
    29G     /var/www/html/pub/climatemirror/ *
    451G    /var/www/html/pub/climatemirror/ *
    1.3T    /var/www/html/pub/climatemirror/ *
    9.9T    /var/www/html/pub/climatemirror/
    11T     /var/www/html/pub/climatemirror/
    16T     /var/www/html/pub/climatemirror/azimuth
    31T     /var/www/html/pub/climatemirror/
    35T     /var/www/html/pub/climatemirror/
    35T     /var/www/html/pub/climatemirror/
    45T     /var/www/html/pub/climatemirror/
    182T    total
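A small sketch for sanity-checking a report like the one above: it parses `du -h`-style rows and sums them, skipping the `total` row. It assumes the suffixes are powers of 1024, as GNU `du -h` uses by default; the function names are made up for illustration.

```python
# Assumption: du -h style suffixes, powers of 1024.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}


def parse_size(text: str) -> float:
    """Convert a du -h size such as '451G' or '1.3T' to bytes."""
    text = text.strip()
    if text and text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def sum_du_rows(report: str) -> float:
    """Sum the size column of a du -h report, ignoring the 'total' row."""
    total = 0.0
    for line in report.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[1] == "total":
            continue
        total += parse_size(fields[0])
    return total
```

Summing the rows as printed gives roughly 184.7T rather than 182T. As far as I know, `du -h` rounds each displayed figure up, so a sum of rounded rows landing a little above the reported total is expected rather than a sign of missing data.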
