Issue #38 closed
Bryce A. Lynch
created an issue

I've just begun mirroring https://www.waterqualitydata.us/.

Comments (17)

  1. Greg Kochanski

    I've looked into the website a bit. First, it claims to have more than 1.5 million sampling locations in the database, so --wait 15 will lead to 1.5M * 15 / 86400 = about 260 days of download time.
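
    Working the estimate through as a quick sketch, assuming one request per sampling location and counting only the --wait delay (transfer time would add to this):

```python
SECONDS_PER_DAY = 86_400


def crawl_days(num_pages: int, wait_seconds: float) -> float:
    """Lower bound on crawl time: the politeness delay alone."""
    return num_pages * wait_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # 1.5 million locations at 15 seconds each
    print(f"{crawl_days(1_500_000, 15):.1f} days")  # prints "260.4 days"
```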

    Second, it looks like there's data in the database that's not displayed on static web pages. It looks like you need to go to https://www.waterqualitydata.us/portal/ and enter a site-ID to get all the data. Once you do that, download URLs can be seen by hitting the "Show Web Service Calls" button.

    I'm working on a Python script to snarf that data.

  2. Greg Kochanski

    I'll look into that idea tomorrow morning before work. FYI, so far the Python script seems to be behaving well. From the look of the site, it's a database-backed design, and it would not surprise me if the database stores the actual data. (The total amount of data isn't huge, so that's not an unreasonable design...)

  3. Greg Kochanski

    I searched for the string "ftp:" in all the files I've downloaded so far, and found no examples. So, the HTML doesn't point to any ftp access. Also, just trying to connect via ftp to names like ftp.waterqualitydata.us has no success.

    find www.waterqualitydata.us/ -exec grep 'ftp:' {} \; -print

    So, it looks like a "no".
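
    The same search could be sketched in Python, for anyone without find and grep to hand. The function name files_containing is made up for illustration, and each file is read whole into memory, which is fine for HTML pages but not for very large downloads:

```python
import os


def files_containing(root: str, needle: bytes = b"ftp:") -> list[str]:
    """Return the paths of all files under root whose bytes contain needle."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    if needle in fh.read():
                        hits.append(path)
            except OSError:
                # Unreadable file (permissions, broken link): skip it.
                continue
    return sorted(hits)


if __name__ == "__main__":
    for path in files_containing("www.waterqualitydata.us/"):
        print(path)
```

    An empty result corresponds to the "no" above.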

  4. Jan Galkowski

    Thanks. That's one way. The other is to interrogate the DNS to see if there's anything. A simple way is to just use Google to search for:

    site:*ftp*.waterqualitydata.*
    

    The other way is to use a tool like that at http://viewdns.info/dnsreport/?domain=waterqualitydata.us. If you check down at the bottom of that report, you'll find, among other things, no FTP hostname, and, moreover:

    www.waterqualitydata.us. CNAME wqp.esas-er2-usgs.gov.akadns.net. [TTL=3600]
    wqp.esas-er2-usgs.gov.akadns.net. A 141.8.225.31 [TTL=3600]
    

    The akadns.net CNAME means the site is served through Akamai, which hides the identity of the origin server and protects it from DoS and DDoS attacks.
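
    A third option is to probe a few likely hostnames directly. This is only a sketch: the prefix list is a guess, the function names are invented here, and a domain with wildcard DNS would make every name appear to resolve. The lookup argument exists so the check can be exercised without network access.

```python
import socket
from typing import Callable, Iterable


def resolves(hostname: str, lookup: Callable = socket.getaddrinfo) -> bool:
    """True if the resolver returns any address for hostname."""
    try:
        lookup(hostname, None)
        return True
    except socket.gaierror:
        return False


def find_ftp_hosts(
    domain: str,
    prefixes: Iterable[str] = ("ftp", "ftp1", "files"),  # hypothetical guesses
    lookup: Callable = socket.getaddrinfo,
) -> list[str]:
    """Return the candidate hostnames under domain that resolve."""
    candidates = [f"{prefix}.{domain}" for prefix in prefixes]
    return [host for host in candidates if resolves(host, lookup)]


if __name__ == "__main__":
    print(find_ftp_hosts("waterqualitydata.us"))
```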

  5. ebovine

    Rather than a site-scrape, I began to grab the backend database through the Download link. The sites database was small (2,483,824 records). The Physical/chemical metadata is much larger (after 12 hours, I am at 12.02 GB of a ZIP archive). Does this sound consistent with what others are getting?

  6. Greg Kochanski

    Apparently done. Hashed. Uploading to pub04.rz21.azimuthproject-kickstarter.org:/var/local/gpk/i38_www.waterqualitydata.us . 13 GB, 809884 files. Of that, 4.3 GB is the database, in 88022 files (three files per geographic location). I'm somewhat concerned that the database download was only partial. I expected the total number of files to be dominated by the database-related files, but the database is much too small for that.

    • ebovine: did I miss a link to the contents of the database?

    • Me: need to look at logs and the database download scripts. Try to make a 1:1 correspondence.
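
    One quick sanity check: 88022 is not a multiple of three, so at least one location cannot have its full set of files. A minimal sketch of the 1:1 check follows. The file naming scheme (`<site_id>__<kind>.csv`) and the three kind names are assumptions made for illustration, not the layout the download script actually uses.

```python
from collections import defaultdict
from typing import Iterable

# Hypothetical names for the three per-location files.
EXPECTED_KINDS = frozenset({"station", "result", "activity"})


def incomplete_sites(
    filenames: Iterable[str], expected_ids: Iterable[str]
) -> dict[str, set[str]]:
    """Map each expected site ID to the file kinds still missing for it."""
    seen: dict[str, set[str]] = defaultdict(set)
    for name in filenames:
        if not name.endswith(".csv"):
            continue
        site_id, sep, kind = name[: -len(".csv")].rpartition("__")
        if sep and kind in EXPECTED_KINDS:
            seen[site_id].add(kind)
    missing = {}
    for site_id in expected_ids:
        gap = EXPECTED_KINDS - seen.get(site_id, set())
        if gap:
            missing[site_id] = set(gap)
    return missing


if __name__ == "__main__":
    print(88022 % 3)  # prints 2, so the file count is not a whole number of triples
```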

  7. Greg Kochanski

    The database download apparently terminated early. I'm guessing it stopped when I ran out of disk space a week ago. It started right up again, though, picking up from where it left off. So, I believe I have all of the non-database parts of the website; the database is only ~10% downloaded, but back in business.
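
    For a restart like this to be safe, a file cut short by a full disk must not be mistaken for a finished one. A minimal sketch of one way to get that property, assuming one output file per site; the names resume_downloads and fetch and the .zip/.part suffixes are hypothetical, and this is not the script used here:

```python
import os
from typing import Callable, Iterable


def resume_downloads(
    site_ids: Iterable[str], out_dir: str, fetch: Callable[[str], bytes]
) -> list[str]:
    """Fetch every site not already present in out_dir; return the IDs fetched."""
    os.makedirs(out_dir, exist_ok=True)
    fetched = []
    for site_id in site_ids:
        final = os.path.join(out_dir, f"{site_id}.zip")
        if os.path.exists(final):
            continue  # completed in an earlier run
        tmp = final + ".part"
        with open(tmp, "wb") as fh:
            fh.write(fetch(site_id))
        # Rename only after a complete write, so an interrupted run
        # leaves a .part file rather than a truncated final file.
        os.replace(tmp, final)
        fetched.append(site_id)
    return fetched
```

    Running it a second time with the same arguments skips everything that already finished, which is the "picking up from where it left off" behavior.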
