Linkchecker: option to override user agent

Kristian Kolev created an issue

linkcheck.py currently hardcodes a 'Mozilla/5.0' user agent to simulate a browser, which works with most sites.

However, SourceForge resets the connection when it receives that particular string. Interestingly enough, it works fine with other user agents, including 'Mozilla/4.0'.

It may be the case that other websites exhibit similar quirks, and it would be nice if we could specify a string to be used as the user agent in conf.py.
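To make the request concrete, here is a minimal sketch of how a configurable user agent could be applied when building the request. The option name `linkcheck_user_agent` and the helper `build_head_request` are hypothetical, not existing Sphinx names, and a plain dict stands in for the values loaded from conf.py.

```python
import urllib.request

# Current hardcoded value, kept as the fallback.
DEFAULT_USER_AGENT = "Mozilla/5.0"


def build_head_request(url, config):
    """Build a HEAD request whose User-Agent comes from the config.

    `config` is a plain dict standing in for conf.py values, and
    `linkcheck_user_agent` is a hypothetical option name.
    """
    user_agent = config.get("linkcheck_user_agent") or DEFAULT_USER_AGENT
    return urllib.request.Request(
        url,
        headers={"User-Agent": user_agent},
        method="HEAD",
    )
```

With something like `linkcheck_user_agent = 'Mozilla/4.0'` in conf.py, the SourceForge links would then be checked with a string the server accepts, while projects that set nothing keep the current behaviour.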

Comments (7)

  1. MarioVilas

    I'm also getting strange errors from other sites. For example, code.activestate.com returns 405 Method Not Allowed errors (it's not the only site that does so, which suggests the cause is the web server software rather than a site-specific configuration), and WordPress blogs also don't seem to like it (they give empty responses).

    The 405 errors appear to be related to the use of HEAD, which some servers reject even though they accept GET. Instead of failing, linkcheck.py should retry with the GET method.

  2. Takayuki Shimizukawa

    I confirmed with sourceforge.net:

    >>> requests.head('http://docutils.sourceforge.net/docs/ref/rst/directives.html')
    <Response [200]>
    
    >>> requests.head('http://docutils.sourceforge.net/docs/ref/rst/directives.html', headers={'User-agent': 'Mozilla/5.0'})
    Traceback (most recent call last):
    ...
    requests.exceptions.ConnectionError: HTTPConnectionPool(host='docutils.sourceforge.net', port=80): Max retries exceeded with url: /docs/ref/rst/directives.html
    (Caused by <class 'socket.error'>: [Errno 10054] Connection reset by peer.)
    
    >>> requests.head('http://docutils.sourceforge.net/docs/ref/rst/directives.html', headers={'User-agent': 'Mozilla/4.0'})
    <Response [200]>
    

    and also confirmed with code.activestate.com:

    >>> requests.head('http://code.activestate.com/', headers={'User-agent': 'Mozilla/5.0'})
    <Response [200]>
    >>> requests.head('http://code.activestate.com/recipes/578788/', headers={'User-agent': 'Mozilla/5.0'})
    <Response [405]>
    >>> requests.head('http://code.activestate.com/recipes/578788/', headers={'User-agent': 'Mozilla/4.0'})
    <Response [405]>
    >>> requests.head('http://code.activestate.com/recipes/578788/')
    <Response [405]>
    >>> requests.get('http://code.activestate.com/recipes/578788/')
    <Response [200]>
    

    I think linkcheck should:

    • If a HEAD request receives 405 error, retry with a GET request.
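The retry proposed above could look roughly like the sketch below. It uses the standard library's urllib rather than requests, and the function names `check_link` and `_status` are illustrative only, not part of linkcheck.py. The `opener` parameter is an assumption added so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request


def _status(url, method, headers, opener, timeout):
    """Send one request with the given method and return its status code."""
    request = urllib.request.Request(url, headers=headers, method=method)
    with opener(request, timeout=timeout) as response:
        return response.status


def check_link(url, user_agent="Mozilla/5.0",
               opener=urllib.request.urlopen, timeout=10):
    """Return (method, status), retrying with GET if HEAD yields 405.

    Connection-level failures (for example a reset by the peer) are not
    handled here and propagate to the caller.
    """
    headers = {"User-Agent": user_agent}
    try:
        return "HEAD", _status(url, "HEAD", headers, opener, timeout)
    except urllib.error.HTTPError as err:
        if err.code != 405:
            return "HEAD", err.code
    # HEAD was rejected with 405 Method Not Allowed: fall back to GET.
    try:
        return "GET", _status(url, "GET", headers, opener, timeout)
    except urllib.error.HTTPError as err:
        return "GET", err.code
```

Only a 405 triggers the second request, so sites that answer HEAD normally are still checked without downloading the page body.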

    However, I wonder why the ('User-agent', 'Mozilla/5.0') header causes a "Connection reset by peer" exception.
