Issue #34 open
marsroverdriver
created an issue

This one is from near the end of the Climate Datasets spreadsheet. It was not marked as claimed.

https://modis.gsfc.nasa.gov/data/dataprod/

I'm backing this up using a very crude script:

[maxwell@azi01 ~]$ cat ~/bin/wg
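#!/bin/bash
# Mirror a site with wget, logging to <short-name>.log in the current directory.
# Any arguments after the first two are passed through to wget unchanged.

URL="$1"
SHORT="${2?Usage: $0 url short-name}"
shift ; shift

# --wait=1 pauses one second between requests; -4 forces IPv4;
# --no-check-certificate skips TLS certificate validation.
exec wget \
        --wait=1 \
        --mirror \
        --no-verbose \
        -4 \
        --output-file="$SHORT".log \
        --no-check-certificate \
        --tries=40 \
        --page-requisites \
        "$@" \
        "$URL"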

Run as:

~/bin/wg https://modis.gsfc.nasa.gov/data/dataprod/ modis

If this works, I'll use the same script for other sites.

Comments (12)

  1. marsroverdriver reporter

    The script is fine, but the MODIS site seems to be a low-value candidate. Most of the interesting data appears to be hosted off site and not amenable to spidering. :-(

    I guess I'll leave the script running for now because why not, but I'll scope out more promising backup candidates meanwhile.

  2. marsroverdriver reporter

    It's sadly more complicated than that. The other sites also suck and seem to discourage direct, bulk downloading -- or I'd just mirror them directly.

    I don't want to say that there's nothing worthwhile here, just that it's not the most productive use of our time right now.

    As it happens, what is there is now complete -- the download of that site has finished. But I'm not going to mark this issue as complete, because IMHO we don't really have all of the interesting MODIS data yet.

  3. Jan Galkowski

    Just curious, Scott, have you figured out how to SHA256 at the source? Were you able to figure out anything about checking logs so you know if you are missing files? I've focused on the latter, but not the former, since recent evidence shows my efforts have missed files in the past, due to naming frobs.

  4. marsroverdriver reporter

    There's no such thing: you can't SHA-256 the files without having a copy of them to run over, and you can't get a copy of them without downloading unless you have shell access on the server.

    You probably haven't seen what I added to the SHA-256 documentation page about inferring completeness from the logs. (Which is still suboptimal, but a whole lot better than nothing.)
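
The two checks discussed in this thread (hashing the copy you already have, and looking through the wget log for failures) could be sketched roughly as below. This is only a minimal sketch, not necessarily what the SHA-256 documentation page describes. It assumes GNU coreutils, findutils and grep, and that the log was written with --no-verbose as in the wg script, so failed requests show up as lines containing "ERROR <code>:". The function names make_manifest and count_log_errors are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helpers; they are not part of the wg script above.

# Print a SHA-256 manifest for every file under a local mirror directory.
# Paths are sorted bytewise so two manifests can be compared with diff.
# Assumes GNU find, sort, xargs and sha256sum.
make_manifest() {
    local dir="$1"
    ( cd "$dir" && find . -type f -print0 | LC_ALL=C sort -z | xargs -0 -r sha256sum )
}

# Print the number of lines in a wget log that report an error.
# Assumption: failures are logged as "<timestamp> ERROR <code>: <reason>".
count_log_errors() {
    grep -c ' ERROR [0-9][0-9]*:' "$1"
}
```

For the mirror in this issue that would be something like `make_manifest modis.gsfc.nasa.gov > modis.sha256` and `count_log_errors modis.log`. A count of zero only says that wget reported no failures; it says nothing about files that were never linked from a crawled page.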
