Ignoring robots restrictions with wget

By default, wget honors websites' robots restrictions and refuses recursive downloads when a site asks robots to stay away. This guide shows how to override that behavior.

NB! If you are going to override robot restrictions, please act responsibly. If you want to download significant amounts of data, use options like --limit-rate and --wait. Many websites will block your IP address if you exceed their quotas.
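For example, a throttled, polite variant of the command used later in this guide might look like this (the limits here are purely illustrative -- pick values that suit the site and your connection):

    % wget http://http.de.scene.org/pub/music/artists/bad_loop/ -r -np -A .mp3 -A .ogg --limit-rate=200k --wait=2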

Often you find that these restrictions are pointless -- you could download all of the files anyway by clicking on them, one by one, in a web browser; it's just more convenient to grab them in bulk with wget. In many cases the restrictions are aimed at search engines, not at recursive downloaders.

I'm using http://http.de.scene.org/pub/music/artists/bad_loop/ as an example target for this article.

Step 1: Recognizing robot-restricted HTML

For bulk downloading, you'd normally use a command like this:

    % wget http://http.de.scene.org/pub/music/artists/bad_loop/ -r -np -A .mp3 -A .ogg
  • -r (recursive) tells wget to also download linked elements
  • -np (no parent) tells wget not to download outside of the current directory (i.e. only within /pub/music/artists/bad_loop/)
  • -A .mp3 -A .ogg accept files with .mp3 and .ogg extensions

However, executing this command gives the following results:

    --2009-06-29 16:05:25--  http://http.de.scene.org/pub/music/artists/bad_loop/
    [...]
    2009-06-29 16:05:26 (89.3 KB/s) - `http.de.scene.org/pub/music/artists/bad_loop/index.html' saved [7678]

    Removing http.de.scene.org/pub/music/artists/bad_loop/index.html since it should be rejected.

    FINISHED --2009-06-29 16:05:26--
    Downloaded: 1 files, 7.5K in 0.08s (89.3 KB/s)

Two things happened here:

  • wget doesn't tell you so, but it refused to download the .mp3 and .ogg files because the HTML file contains robot restrictions.
  • wget deleted index.html because it was not in your accept list.
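If you want to see wget's reasoning for yourself, you can rerun the command with the debug switch (available if your wget build was compiled with debug support); the exact wording varies by version, but the output mentions the robots/no-follow check:

    % wget http://http.de.scene.org/pub/music/artists/bad_loop/ -r -np -A .mp3 -A .ogg -d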

Step 2: Defeating robot-restricted HTML

In order to override the restrictions, you first have to download index.html to disk. Add -A index.html to the wget switches:

    % wget http://http.de.scene.org/pub/music/artists/bad_loop/ -r -np -A .mp3 -A .ogg -A index.html
    [...]
    Saving to: `http.de.scene.org/pub/music/artists/bad_loop/index.html'
    [...]

Now open up the file http.de.scene.org/pub/music/artists/bad_loop/index.html in a text editor; you'll find something like:

    <META NAME="ROBOTS" CONTENT="NOARCHIVE">
    <META NAME="ROBOTS" CONTENT="NOINDEX">
    <META NAME="ROBOTS" CONTENT="NOFOLLOW">

Delete all such lines and save the file. Next, you have to add the -nc (no clobber) switch to wget, which tells it not to overwrite previously downloaded files -- so it will respect your newly edited index.html:

    % wget http://http.de.scene.org/pub/music/artists/bad_loop/ -r -np -A .mp3 -A .ogg -A index.html -nc
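If you prefer not to edit the file by hand, a one-liner along these lines (GNU sed assumed, using the download path from above) should strip the META ROBOTS lines just as well:

    % sed -i '/META NAME="ROBOTS"/d' http.de.scene.org/pub/music/artists/bad_loop/index.html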

Step 3: Recognizing robots.txt restrictions

Whenever you use the recursive (-r) option, wget consults the site's robots.txt file, which states which directories of the site are allowed for robots. If the website also uses robots.txt restrictions, the output of the last command will look like this:

    % wget http://http.de.scene.org/pub/music/artists/bad_loop/ -r -np -A .mp3 -A .ogg -A index.html -nc
    File `http.de.scene.org/pub/music/artists/bad_loop/index.html' already there; not retrieving.

    Loading robots.txt; please ignore errors.
    --2009-06-29 16:18:41--  http://http.de.scene.org/robots.txt
    [...]
    2009-06-29 16:18:41 (1.40 MB/s) - `http.de.scene.org/robots.txt' saved [26/26]

    FINISHED --2009-06-29 16:18:41--
    Downloaded: 1 files, 26 in 0s (1.40 MB/s)
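Just like with index.html, you can inspect the local copy that wget saved:

    % cat http.de.scene.org/robots.txt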

Step 4: Defeating robots.txt restrictions

In this case, the robots.txt file contains:

    User-agent: *
    Disallow: /

This disallows the whole site for all robots. To defeat this restriction, simply clear the local copy that wget just saved:

    % echo > http.de.scene.org/robots.txt
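An empty file imposes no restrictions, which is enough here. If you prefer, you can write an explicitly permissive robots.txt instead -- an empty Disallow line means nothing is disallowed:

    % printf 'User-agent: *\nDisallow:\n' > http.de.scene.org/robots.txt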

Step 5: Leech!

After this has been done, wget should give you the green light to download everything you intended. Just repeat the last command:

    % wget http://http.de.scene.org/pub/music/artists/bad_loop/ -r -np -A .mp3 -A .ogg -A index.html -nc

Voila!
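
As a side note: depending on your wget version, the single switch -e robots=off (which sets robots=off as if it were in your .wgetrc) may let wget ignore both robots.txt and the META ROBOTS tags in one go, without editing any files:

    % wget http://http.de.scene.org/pub/music/artists/bad_loop/ -r -np -A .mp3 -A .ogg -e robots=off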
