Issue #105 open
Benjamin Rose
created an issue

There have been more than a few data targets that are simply too large for Azimuth to hold on to. Soon, I will have over 300 TB of free space available (500 TB total), pending arrival of hardware. Once it arrives and is set up, I'd like to hit the ground running with a list of larger targets that other efforts haven't been able to mirror in full, or have mirrored only as a collection of parts. This ticket can be a good starting point for these targets.

So far I have a few size estimates that will be no problem to fit:

ftp.cdc.noaa.gov: 165T

eclipse.ncdc.noaa.gov: 18T

airbornescience.nsstc.nasa.gov: 13T

ftp.coast.noaa.gov: Size Estimation Ongoing

Please suggest any other targets available to be copied wholesale and, if possible, a size estimate of the complete dataset. I am also happy to start size calculations on my own infrastructure if given a target suggestion. I'm mainly interested in ftp & rsync targets that I can hit and mirror completely; http sites are a lot messier, and I frankly don't have much time available to curate a website scrape.
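For anyone who wants to attach a size estimate to a suggested ftp target, the calculation amounts to walking the directory tree and summing file sizes. The following is a minimal sketch, assuming the server supports the MLSD command (many older ftp servers do not, in which case the listing would have to be parsed differently). The names `tree_size`, `ftp_lister`, and `estimate` are hypothetical helpers made up for this illustration, not part of any existing mirroring tool.

```python
from ftplib import FTP
from typing import Callable, Dict, Iterable, Tuple

# One directory entry as returned by FTP.mlsd(): (name, facts)
Entry = Tuple[str, Dict[str, str]]


def tree_size(list_dir: Callable[[str], Iterable[Entry]], root: str = "/") -> Tuple[int, int]:
    """Return (file_count, total_bytes) for everything under root.

    list_dir is any callable that maps a directory path to its entries,
    so the walk can be exercised without a live server.
    """
    files = 0
    total = 0
    pending = [root]  # explicit stack instead of recursion, for deep trees
    while pending:
        path = pending.pop()
        for name, facts in list_dir(path):
            kind = facts.get("type", "").lower()
            child = path.rstrip("/") + "/" + name
            if kind == "dir":
                pending.append(child)
            elif kind == "file":
                files += 1
                total += int(facts.get("size", 0))
            # "cdir" and "pdir" (the "." and ".." entries) are skipped
    return files, total


def ftp_lister(ftp: FTP) -> Callable[[str], Iterable[Entry]]:
    """Adapt a logged-in FTP connection to the list_dir interface."""
    return lambda path: ftp.mlsd(path, facts=["type", "size"])


def estimate(host: str, root: str = "/") -> Tuple[int, int]:
    """Anonymous login to host, then walk root and sum the sizes."""
    with FTP(host) as ftp:
        ftp.login()
        return tree_size(ftp_lister(ftp), root)
```

Calling `estimate("ftp.coast.noaa.gov")` would return a file count and a byte total. On a tree of this scale the walk itself can take a long time, since it issues one listing per directory.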

Comments (16)

  1. Jeremiah Curtis

    Langley Atmospheric Science Data Center 144 TB ftp://l5eil01.larc.nasa.gov

    Oak Ridge National Laboratory DAAC https://daac.ornl.gov/ (this has been started elsewhere with about 285 GB saved that I know of; I don't know if it's complete or not). It requires ordering the data (a pretty simple process), waiting a few minutes for an email with the relevant https data links, and using a download manager to grab those links. ORNL download page: https://daac.ornl.gov/get_data.shtml

    From what I can tell here on Azimuth and from the mirroring effort at GitHub, it would appear that the Langley effort has barely started. If I'm wrong, someone please let me know. I would grab this data first unless the ORNL effort is truly incomplete (see issue #26).

  2. Jeremiah Curtis

    Also, the LP DAAC contains several PB of data. I don't know if you know anyone who has that capacity or not. Obviously you would need several spectacularly fast connections.

  3. Jeremiah Curtis

    FWIW, ftp.cdc.noaa.gov has several issues over at GitHub (https://github.com/climate-mirror/datasets/issues/) that have either been started or finished.

    For example, https://github.com/climate-mirror/datasets/issues/20 has a post explaining that at https://www.ncdc.noaa.gov/oa/climate/ghcn-daily/ (which presumably mirrors the ftp site) there is 5.1 TB of data, but there is also a large .tar.gz file containing all relevant data. You may want to look into this; just a heads up (and obviously not every ncdc subdirectory is going to have such a streamlined archive).

  4. Jeremiah Curtis

    ORNL data is NOT completely mirrored; far from it, in fact

    see https://github.com/climate-mirror/datasets/issues/316

    almost 5TB

    I ran through all the "field collections" at https://daac.ornl.gov/get_data.shtml and came up with 157600 files totalling 4782436.84 MB (655 datasets)

    Several things to note:

    1) You have to order the data from ORNL. They will send you a single link to an https directory containing ALL data in your order. Signup and ordering are ridiculously easy, and you will usually have the link in your email inbox within 30 minutes. The https directory is rather simple insofar as you don't have to recursively download more than 3 or 4 levels (you can see what I mean when you get the links).

    2) Internet Download Manager and the DownThemAll extension in Firefox work splendidly with the https link (I have been able to download up to 10 files simultaneously).

    3) The link is valid for one week, and cart orders are limited to 1 TB, so you would need at least five orders for the field collections data.

    4) I am currently working on the global/regional datasets, but to my knowledge no one is working on the field collections, either at GitHub or Azimuth. I would grab these datasets, but I have neither the space nor the speed to do so.
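The "at least five orders" figure follows directly from the totals quoted above. A minimal sketch of the arithmetic, assuming the 1 TB cart limit means either 10^6 MB or 1024 × 1024 MB (the thread does not say which, and the answer is the same either way); `orders_needed` is a hypothetical helper name:

```python
import math

TOTAL_MB = 4782436.84  # field collections total quoted in this comment
FILES = 157600         # file count quoted in this comment
DATASETS = 655         # dataset count quoted in this comment


def orders_needed(total_mb: float, cart_limit_mb: float) -> int:
    """Lower bound on the number of cart orders needed to cover total_mb."""
    return math.ceil(total_mb / cart_limit_mb)


# Assumption: "1 TB" is either decimal (10**6 MB) or binary (1024 * 1024 MB).
print(orders_needed(TOTAL_MB, 10**6))        # decimal terabyte
print(orders_needed(TOTAL_MB, 1024 * 1024))  # binary terabyte
print(f"average file size: {TOTAL_MB / FILES:.1f} MB")
print(f"average dataset size: {TOTAL_MB / DATASETS / 1000:.1f} GB")
```

This is only a lower bound: if a dataset cannot be split across carts, packing whole datasets under the limit may take more than five orders.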

  5. Benjamin Rose reporter

    progress report (* = completed initial download):

    29G     /var/www/html/airbornescience.nasa.gov *
    2.5T    /var/www/html/eclipse.ncdc.noaa.gov
    9.1T    /var/www/html/ftp.ngdc.noaa.gov
    12T     /var/www/html/ftp.cdc.noaa.gov
    23T     /var/www/html/ftp.ncdc.noaa.gov
    26T     /var/www/html/ftp.coast.noaa.gov
    28T     /var/www/html/ftp.star.nesdis.noaa.gov
    
  6. Benjamin Rose reporter

    progress report (* = completed initial download):

    23G     /var/www/html/pub/climatemirror/dscovr *
    29G     /var/www/html/pub/climatemirror/airbornescience.nasa.gov *
    451G    /var/www/html/pub/climatemirror/ftp.epa.gov *
    1.3T    /var/www/html/pub/climatemirror/public.sos.noaa.gov *
    9.9T    /var/www/html/pub/climatemirror/eclipse.ncdc.noaa.gov
    11T     /var/www/html/pub/climatemirror/ftp.ngdc.noaa.gov
    16T     /var/www/html/pub/climatemirror/azimuth
    31T     /var/www/html/pub/climatemirror/ftp.cdc.noaa.gov
    35T     /var/www/html/pub/climatemirror/ftp.ncdc.noaa.gov
    35T     /var/www/html/pub/climatemirror/ftp.star.nesdis.noaa.gov
    45T     /var/www/html/pub/climatemirror/ftp.coast.noaa.gov
    182T    total
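A report like the one above can be re-totalled or compared against available capacity with a few lines of code. This is a minimal sketch, assuming the listing comes from `du -h`-style output with binary unit suffixes (the thread does not say how the report was produced); `parse_size` and `sum_report` are hypothetical helper names.

```python
# Assumption: suffixes are powers of 1024, as in `du -h` without --si.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4, "P": 1024**5}


def parse_size(text: str) -> float:
    """Convert a human-readable size such as '9.9T' or '451G' to bytes."""
    text = text.strip()
    if text and text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def sum_report(report: str) -> float:
    """Sum the per-directory sizes in a report, skipping the 'total' line."""
    total = 0.0
    for line in report.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[1] == "total":
            continue
        total += parse_size(parts[0])
    return total


if __name__ == "__main__":
    sample = """\
    23G     /var/www/html/pub/climatemirror/dscovr *
    1.3T    /var/www/html/pub/climatemirror/public.sos.noaa.gov *
    """
    print(f"{sum_report(sample) / UNITS['T']:.2f}T")
```

Because each displayed figure is rounded, the sum of the per-directory lines will generally not match the tool's own total exactly.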
    