Issue #36 resolved
marsroverdriver
created an issue

Long story short, I stumbled onto https://data.noaa.gov/dataset, which seems a rich prize. I didn't see it in our issues database or on the climate dataset spreadsheet -- which seems odd to me. Maybe it's just not there because it's a superset of data they're already tracking, or something?

Comments (12)

  1. Jan Galkowski

    Yes, actually, I identified this a couple of days ago and brought it to the attention not only of the Azimuth Backup Project but also of the archivists at the University of Pennsylvania. The latter group reported that the spreadsheet people are working from should not be considered comprehensive, and that we are encouraged to replicate data in the universe of interest on our own.

  2. marsroverdriver reporter

    Cool. So far the download has fetched 132 GB, but nothing obviously worthwhile. :-( It's apparently all just HTML junk and RDF files so far, no datasets per se.

    It looks like the actual data is mostly hosted at other sites and this one is mainly a clearinghouse. So I've stopped the wget for now. I'm running a process to extract the domains from the pages grabbed so far, and I'll rerun the wget with -H -D (those domains) as a start.
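
    For reference, -H is wget's --span-hosts and -D is --domains, which takes a comma-separated list. A minimal plain-wget sketch of that kind of rerun, with a made-up domain list standing in for the real extracted one:

    wget --recursive --span-hosts --domains=host1.example.gov,host2.example.org https://data.noaa.gov/dataset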

  3. marsroverdriver reporter

    To keep a rough record of what I'm doing ....

    First, get a rough-and-ready list of all domain-like things in the files grabbed so far:

    find . -type f -print0 | xargs -0 perl -nle 'print $1 if m!(?:https?|ftp)://([^"/]+)!' | sort -u >unique-domains-list.txt
    

    Look for things that don't belong in domain names:

    cat unique-domains-list.txt | grep -P '[^-\w\.]'
    

    I hand-edited the list (emacs) to remove the junk and re-uniquified it. Then I re-ran the wget with this list:

    ~/bin/wg --span-hosts --domains=$(cat unique-domains-list.txt | perl -le '@a = <STDIN>; chomp(@a); print join(",", @a);')
    

    wget in progress; we'll see how this goes.
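
    For the record, the perl comma-join above is equivalent to a single coreutils paste call (-s joins all input lines into one line, -d, makes the separator a comma), which would shorten the rerun:

    ~/bin/wg --span-hosts --domains=$(paste -sd, unique-domains-list.txt)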

    (When this wget has finished, I'll need to re-run the domain extraction over the full data set, in case more domains have come in, and then kick off another wget. For that run I should use --timestamping, so as not to re-fetch a whole bunch of already-fetched data.)
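
    A sketch of what that timestamped rerun could look like, assuming ~/bin/wg passes extra flags straight through to wget (--timestamping is wget's -N: skip files whose remote copies are no newer than what's already on disk):

    ~/bin/wg https://data.noaa.gov/dataset --span-hosts --timestamping --domains=$(paste -sd, unique-domains-list.txt)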

  4. marsroverdriver reporter

    Latest restart (on azi02):

    ~/bin/wg https://data.noaa.gov/dataset data.noaa.gov-3 --span-hosts --domains=$(cat unique-domains-fixed.txt | perl -le '@a = <STDIN>; chomp @a; print join(",", @a)')
    