FTP site, EPA, don't know how much deals with CO2 or climate change

Issue #89 resolved
Jan Galkowski
created an issue

ftp://ftp.epa.gov

Downloading to mirror.math.princeton.edu

Comments (27)

  1. Benjamin Rose

    wget reports:

    FINISHED --2017-01-25 14:06:42--
    Total wall clock time: 13h 27m 30s
    Downloaded: 91212 files, 278G in 10h 0m 41s (7.91 MB/s)

    Not sure what caused the size difference between this and what lftp reported.

  2. Benjamin Rose

    Checksums of all files have been created, gzipped, moved into the directory, and attached here. This data set should be all done. Just in time, too: ftp.epa.gov seems absolutely slammed right now.
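
The thread does not say which tool or hash algorithm produced the checksums. As a minimal sketch only, assuming SHA-256 and a `sha256sum`-style line format (both are assumptions, and `write_manifest` is a hypothetical helper), a gzipped manifest for a mirrored tree could be built like this:

```python
import gzip
import hashlib
import os


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root, manifest_path):
    """Hash every regular file under root and write a gzipped manifest.

    Each line is "<hex digest>  <path relative to root>".
    Returns the number of files hashed. Keep manifest_path outside
    root, otherwise the manifest would be hashed while being written.
    """
    count = 0
    # surrogateescape lets undecodable file names round-trip unchanged.
    with gzip.open(manifest_path, "wt", encoding="utf-8",
                   errors="surrogateescape") as out:
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # visit subdirectories in a stable order
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                # Skip symlinks and anything that is not a regular file.
                if os.path.islink(full) or not os.path.isfile(full):
                    continue
                rel = os.path.relpath(full, root)
                out.write(f"{sha256_of_file(full)}  {rel}\n")
                count += 1
    return count
```

Reading in chunks keeps memory use flat even for very large data files, and sorting the traversal makes two manifests of the same tree directly comparable.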

  3. Benjamin Rose
    • changed status to open

    I have discovered that wget missed some files on the first run. A second run is picking up the deltas now; they are minor, so it shouldn't take too long. I will upload the revised hashes as soon as it's done. Sorry, all!
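
Here the gaps were filled by simply running wget a second time. As a hedged illustration of how missed files could be spotted in general, the sketch below compares a list of expected paths against the local tree. The expected list is assumed to come from some remote listing; how it is obtained is outside this sketch, and `local_files` and `missing_files` are hypothetical helper names.

```python
import os


def local_files(root):
    """Return the set of file paths under root, relative and '/'-separated."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.add(rel.replace(os.sep, "/"))
    return found


def missing_files(expected_paths, root):
    """Return the expected paths that are absent locally, sorted."""
    return sorted(set(expected_paths) - local_files(root))
```

This only checks presence by name; a file that exists but was truncated mid-transfer would still need a size or checksum comparison.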

  4. Benjamin Rose

    FINISHED --2017-01-26 18:12:41--
    Total wall clock time: 20h 24m 29s
    Downloaded: 230225 files, 117G in 9h 43m 48s (3.41 MB/s)

    Took longer than expected because there are lots and lots of tiny files. But I am glad now:

    391G ./ftp.epa.gov

    This exactly matches the total remote file size that lftp reported. Hooray!

    Hashing now.
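
The `391G` line above looks like `du`-style output. As a rough sketch of the same sanity check, the total size of a local tree can be summed as shown below; `tree_size_bytes` and `as_gib` are hypothetical helper names. Note the assumption: this adds up apparent file sizes, whereas `du` by default reports disk usage in blocks, so the two can differ slightly, especially with many tiny files.

```python
import os


def tree_size_bytes(root):
    """Sum the apparent sizes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # do not count symlink targets
            total += os.path.getsize(full)
    return total


def as_gib(num_bytes):
    """Format a byte count in binary gigabytes with one decimal place."""
    return f"{num_bytes / 2**30:.1f} GiB"
```

Comparing this number with the total reported for the remote side is a cheap first check before the more expensive hashing pass.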

  5. Jan Galkowski reporter

    And we've just gone public with this mirror:

    The Azimuth Backup Project respectfully submits the FTP site of the EPA, mirrored at:
    
    https://mirror.math.princeton.edu/pub/climatemirror/ftp.epa.gov/
    
    If you make your own copy, consider making life easy by using rsync against the same path:
    
    rsync -Hhartv --progress rsync://mirror.math.princeton.edu/pub/climatemirror/ftp.epa.gov/ /var/local/whereever
    
  6. Ken Miller

    Great mirror. I think there is a problem with the EmisInventory folder, though. It should contain ~50 subfolders chock-full of data files, but instead it is just an archived web page (and an old one at that). This is what it should look like; this is what it looks like instead.

  7. Benjamin Rose

    @Ken Miller There is no problem. The ftp.epa.gov link isn't auto-delivering the index.html file, while the mirror.math.princeton.edu link is delivering it. The dirs are all still there, manually accessible, for instance:

  8. Ken Miller

    Thanks Benjamin! For what it's worth, that index file is ancient and belongs to just one of several EPA websites that call data from the EmisInventory folder. By delivering it, the mirror hides all the directories, so unless someone knows what they are, they won't be able to access them manually even though they're all there. So I'd suggest not delivering the index file and showing the list of directories instead, like the EPA site does.

  9. Benjamin Rose

    @Ken Miller Usually I am mirroring open source projects, many of which rely on having an index.html that links to their main webpage to initiate downloads. But I see what you are saying, and since for climate data I am hitting FTP servers that don't auto-serve index.html, I've now recursively disabled AutoIndex inside the /pub/climatemirror/ directory.

    So now you should see: http://mirror.math.princeton.edu/pub/climatemirror/ftp.epa.gov/EmisInventory/ and all other subdirectories of climatemirror display a Directory Listing instead of an AutoIndex.

    This may make it browse differently from Azimuth's, but since the data in Azimuth isn't extensively curated, this probably doesn't make matters there much worse.
