:mod:`urllib2` --- extensible library for opening URLs

Note

The :mod:`urllib2` module has been split across several modules in Python 3.0 named :mod:`urllib.request` and :mod:`urllib.error`. The :term:`2to3` tool will automatically adapt imports when converting your sources to 3.0.

The :mod:`urllib2` module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world --- basic and digest authentication, redirections, cookies and more.

The :mod:`urllib2` module defines the following functions:

The following exceptions are raised as appropriate:

The following classes are provided:

This class is an abstraction of a URL request.

url should be a string containing a valid URL.

data may be a string specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided. data should be a buffer in the standard :mimetype:`application/x-www-form-urlencoded` format. The :func:`urllib.urlencode` function takes a mapping or sequence of 2-tuples and returns a string in this format.
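The effect of the data argument on the request method can be checked without sending anything. The following is a minimal sketch: the form fields and the URL are hypothetical, and the import fallback is an addition so that the snippet also runs on Python 3 (where data would have to be bytes before the request is actually sent).

```python
try:                                    # Python 2, as documented here
    import urllib2
    from urllib import urlencode
except ImportError:                     # Python 3 fallback (an addition)
    import urllib.request as urllib2
    from urllib.parse import urlencode

# Hypothetical form fields and URL, used only for illustration.
fields = [('name', 'Ada Lovelace'), ('lang', 'python')]
body = urlencode(fields)                # 'name=Ada+Lovelace&lang=python'

url = 'http://www.example.com/cgi-bin/register'
get_req = urllib2.Request(url)          # no data: the request will be a GET
post_req = urllib2.Request(url, body)   # data supplied: the request will be a POST

# Nothing has been sent yet; get_method() only reports what would be used.
methods = (get_req.get_method(), post_req.get_method())  # ('GET', 'POST')
```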

headers should be a dictionary, and will be treated as if :meth:`add_header` was called with each key and value as arguments. This is often used to "spoof" the User-Agent header, which is used by a browser to identify itself -- some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while :mod:`urllib2`'s default user agent string is "Python-urllib/2.6" (on Python 2.6).
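As a small illustration of that equivalence, the sketch below builds one request with a headers dictionary and one with :meth:`add_header`, then compares the stored headers. The URL is a placeholder, and the import fallback is an addition so the snippet also runs on Python 3.

```python
try:
    import urllib2
except ImportError:                     # Python 3 fallback (an addition)
    import urllib.request as urllib2

agent = 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'
url = 'http://www.example.com/'         # placeholder URL

# Headers passed to the constructor...
via_dict = urllib2.Request(url, headers={'User-Agent': agent})

# ...are stored exactly as if add_header() had been called for each item.
via_call = urllib2.Request(url)
via_call.add_header('User-Agent', agent)

same = sorted(via_dict.header_items()) == sorted(via_call.header_items())

# Header names are stored capitalized, so the lookup key is 'User-agent'.
stored = via_dict.get_header('User-agent')
```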

The final two arguments are only of interest for correct handling of third-party HTTP cookies:

origin_req_host should be the request-host of the origin transaction, as defined by RFC 2965. It defaults to cookielib.request_host(self). This is the host name or IP address of the original request that was initiated by the user. For example, if the request is for an image in an HTML document, this should be the request-host of the request for the page containing the image.

unverifiable should indicate whether the request is unverifiable, as defined by RFC 2965. It defaults to False. An unverifiable request is one whose URL the user did not have the option to approve. For example, if the request is for an image in an HTML document, and the user had no option to approve the automatic fetching of the image, this should be true.
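The image scenario described above might be expressed as follows. This is only a sketch: both host names are hypothetical, and the values are read back through the instance attributes, which exist on Python 2 and Python 3 alike (Python 2 documents the accessor methods :meth:`get_origin_req_host` and :meth:`is_unverifiable` instead).

```python
try:
    import urllib2
except ImportError:                     # Python 3 fallback (an addition)
    import urllib.request as urllib2

# The page the user asked for: both arguments keep their defaults.
page_req = urllib2.Request('http://www.example.com/index.html')

# An image embedded in that page and fetched automatically from another
# (hypothetical) host: the origin is still the page's host, and the user
# had no chance to approve the fetch.
image_req = urllib2.Request('http://images.example.net/logo.png',
                            origin_req_host='www.example.com',
                            unverifiable=True)

page_origin = page_req.origin_req_host      # defaults to the URL's own host
image_origin = image_req.origin_req_host    # the host of the containing page
```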

The :class:`OpenerDirector` class opens URLs via :class:`BaseHandler` instances chained together. It manages the chaining of handlers, and recovery from errors.

This is the base class for all registered handlers; it handles only the simple mechanics of registration.

A class which defines a default handler for HTTP error responses; all responses are turned into :exc:`HTTPError` exceptions.

A class to handle redirections.

A class to handle HTTP Cookies.
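A typical setup passes a cookie jar to the processor and the processor to :func:`build_opener`. The sketch below makes no network request; it only shows that the jar handed in is the one the processor keeps in its ``cookiejar`` attribute. The import fallback is an addition so the snippet also runs on Python 3.

```python
try:
    import urllib2
    import cookielib
except ImportError:                     # Python 3 fallback (an addition)
    import urllib.request as urllib2
    import http.cookiejar as cookielib

jar = cookielib.CookieJar()
processor = urllib2.HTTPCookieProcessor(jar)
opener = urllib2.build_opener(processor)

# Requests made with opener.open(...) would now store received cookies in
# jar and send matching cookies back on later requests.
shared = processor.cookiejar is jar     # the processor keeps our jar
cookies_so_far = len(jar)               # nothing stored before any request
```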

Cause requests to go through a proxy. If proxies is given, it must be a dictionary mapping protocol names to URLs of proxies. The default is to read the list of proxies from the environment variables :envvar:`<protocol>_proxy`. If no proxy environment variables are set, proxy settings are obtained from the registry's Internet Settings section on Windows, and from the OS X System Configuration Framework on Mac OS X.

To disable autodetected proxies, pass an empty dictionary.
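The difference between an explicit mapping and an empty one can be sketched as follows. The proxy URL is hypothetical, and ``opener.handlers`` is an implementation detail that is inspected here only for illustration.

```python
try:
    import urllib2
except ImportError:                     # Python 3 fallback (an addition)
    import urllib.request as urllib2

def active_proxy_handlers(opener):
    # opener.handlers is an implementation detail, read only for illustration.
    return [h for h in opener.handlers if isinstance(h, urllib2.ProxyHandler)]

# Explicit mapping: HTTP requests go through the given (hypothetical) proxy.
explicit = urllib2.ProxyHandler({'http': 'http://proxy.example.com:3128/'})
active_with = active_proxy_handlers(urllib2.build_opener(explicit))

# Empty mapping: the default, autodetecting ProxyHandler is replaced by one
# that has nothing to do, so no proxy handling takes place.
disabled = urllib2.ProxyHandler({})
active_without = active_proxy_handlers(urllib2.build_opener(disabled))
```

Passing an instance of :class:`ProxyHandler` to :func:`build_opener` replaces the default one, which is why the environment is not consulted in either case.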

Keep a database of (realm, uri) -> (user, password) mappings.

Keep a database of (realm, uri) -> (user, password) mappings. A realm of None is considered a catch-all realm, which is searched if no other realm fits.
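The catch-all behaviour can be seen by comparing the two password manager classes. In this sketch the realm names, URL and credentials are made up, and the import fallback is an addition so the snippet also runs on Python 3.

```python
try:
    import urllib2
except ImportError:                     # Python 3 fallback (an addition)
    import urllib.request as urllib2

base = 'http://www.example.com/'        # hypothetical protected area
page = 'http://www.example.com/docs/'   # a URL below it

# HTTPPasswordMgr: the realm must match exactly.
strict = urllib2.HTTPPasswordMgr()
strict.add_password('PDQ Application', base, 'klem', 'secret')
strict_hit = strict.find_user_password('PDQ Application', page)
strict_miss = strict.find_user_password('Other Realm', page)

# HTTPPasswordMgrWithDefaultRealm: a realm of None matches any realm
# for which no more specific entry exists.
lenient = urllib2.HTTPPasswordMgrWithDefaultRealm()
lenient.add_password(None, base, 'klem', 'secret')
lenient_hit = lenient.find_user_password('Other Realm', page)
```

Here ``strict_miss`` is ``(None, None)``, while ``lenient_hit`` is ``('klem', 'secret')``.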

This is a mixin class that helps with HTTP authentication, both to the remote host and to a proxy. password_mgr, if given, should be something that is compatible with :class:`HTTPPasswordMgr`; refer to section :ref:`http-password-mgr` for information on the interface that must be supported.

Handle authentication with the remote host. password_mgr, if given, should be something that is compatible with :class:`HTTPPasswordMgr`; refer to section :ref:`http-password-mgr` for information on the interface that must be supported.

Handle authentication with the proxy. password_mgr, if given, should be something that is compatible with :class:`HTTPPasswordMgr`; refer to section :ref:`http-password-mgr` for information on the interface that must be supported.

This is a mixin class that helps with HTTP authentication, both to the remote host and to a proxy. password_mgr, if given, should be something that is compatible with :class:`HTTPPasswordMgr`; refer to section :ref:`http-password-mgr` for information on the interface that must be supported.

Handle authentication with the remote host. password_mgr, if given, should be something that is compatible with :class:`HTTPPasswordMgr`; refer to section :ref:`http-password-mgr` for information on the interface that must be supported.

Handle authentication with the proxy. password_mgr, if given, should be something that is compatible with :class:`HTTPPasswordMgr`; refer to section :ref:`http-password-mgr` for information on the interface that must be supported.
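Because every one of these handlers accepts a password manager, a single manager can serve several of them. The sketch below shares one manager between a basic and a digest handler; the realm, URL and credentials are placeholders, and no request is made.

```python
try:
    import urllib2
except ImportError:                     # Python 3 fallback (an addition)
    import urllib.request as urllib2

# One password database, shared by both authentication schemes.
manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
basic = urllib2.HTTPBasicAuthHandler(manager)
digest = urllib2.HTTPDigestAuthHandler(manager)

# add_password() on a handler stores the credentials in its manager, so
# they are also available to every other handler using that manager.
basic.add_password('PDQ Application', 'http://www.example.com/',
                   'klem', 'secret')
found = manager.find_user_password('PDQ Application',
                                   'http://www.example.com/private/')

# The opener will answer whichever challenge the server sends.
opener = urllib2.build_opener(basic, digest)
```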

A class to handle opening of HTTP URLs.

A class to handle opening of HTTPS URLs.

Open local files.

Open FTP URLs.

Open FTP URLs, keeping a cache of open FTP connections to minimize delays.

A catch-all class to handle unknown URLs.

Request Objects

The following methods describe all of :class:`Request`'s public interface, and so all must be overridden in subclasses.

OpenerDirector Objects

:class:`OpenerDirector` instances have the following methods:

OpenerDirector objects open URLs in three stages. The order in which the handler methods are called within each stage is determined by sorting the handler instances:

  1. Every handler with a method named like :samp:`{protocol}_request` has that method called to pre-process the request.

  2. Handlers with a method named like :samp:`{protocol}_open` are called to handle the request. This stage ends when a handler either returns a non-:const:`None` value (i.e. a response), or raises an exception (usually :exc:`URLError`). Exceptions are allowed to propagate.

    In fact, the above algorithm is first tried for methods named :meth:`default_open`. If all such methods return :const:`None`, the algorithm is repeated for methods named like :samp:`{protocol}_open`. If all such methods return :const:`None`, the algorithm is repeated for methods named :meth:`unknown_open`.

    Note that the implementation of these methods may involve calls of the parent :class:`OpenerDirector` instance's :meth:`~OpenerDirector.open` and :meth:`~OpenerDirector.error` methods.

  3. Every handler with a method named like :samp:`{protocol}_response` has that method called to post-process the response.
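The three stages can be traced with a toy handler for a made-up ``demo`` scheme. This is a sketch, not how the built-in handlers are written: the class names and the scheme are hypothetical, the "response" is a plain string, and a bare :class:`OpenerDirector` is used so that no default handlers interfere and no network access happens.

```python
try:
    import urllib2
except ImportError:                     # Python 3 fallback (an addition)
    import urllib.request as urllib2

class TraceHandler(urllib2.BaseHandler):
    """Hypothetical handler for a made-up 'demo' scheme."""
    def __init__(self):
        self.calls = []
    def demo_request(self, req):                # stage 1: pre-process
        self.calls.append('request')
        return req
    def demo_open(self, req):                   # stage 2: produce a response
        self.calls.append('open')
        return 'response for ' + req.get_full_url()
    def demo_response(self, req, response):     # stage 3: post-process
        self.calls.append('response')
        return response.upper()

class FallbackHandler(urllib2.BaseHandler):
    """Hypothetical catch-all, tried when no {protocol}_open method answers."""
    def unknown_open(self, req):
        return 'no handler for ' + req.get_full_url()

tracer = TraceHandler()
opener = urllib2.OpenerDirector()       # bare director: no default handlers
opener.add_handler(tracer)
opener.add_handler(FallbackHandler())

result = opener.open('demo://example/resource')
fallback_result = opener.open('other://example/resource')
```

After the first call, ``tracer.calls`` is ``['request', 'open', 'response']``. The second URL uses a scheme no handler claims, so only ``unknown_open`` runs.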

BaseHandler Objects

:class:`BaseHandler` objects provide a couple of methods that are directly useful, and others that are meant to be used by derived classes. These are intended for direct use:

The following members and methods should only be used by classes derived from :class:`BaseHandler`.

Note

The convention has been adopted that subclasses defining :meth:`protocol_request` or :meth:`protocol_response` methods are named :class:`\*Processor`; all others are named :class:`\*Handler`.
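Following that convention, a class that only pre-processes requests would be named as a processor. The class below is a hypothetical example, not part of the module; its method is called directly here so that the sketch needs no network access.

```python
try:
    import urllib2
except ImportError:                     # Python 3 fallback (an addition)
    import urllib.request as urllib2

class DefaultRefererProcessor(urllib2.BaseHandler):
    """Hypothetical processor: adds a Referer header when none is set."""
    def __init__(self, referer):
        self.referer = referer

    def http_request(self, req):
        if not req.has_header('Referer'):
            req.add_unredirected_header('Referer', self.referer)
        return req                      # processors must return the request

    https_request = http_request        # same treatment for HTTPS

processor = DefaultRefererProcessor('http://www.python.org/')
opener = urllib2.build_opener(processor)    # how it would normally be used

# Calling the method directly shows what the opener would do before sending.
plain = processor.http_request(urllib2.Request('http://www.example.com/'))
preset = urllib2.Request('http://www.example.com/',
                         headers={'Referer': 'http://www.example.org/'})
preset = processor.http_request(preset)
```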

HTTPRedirectHandler Objects

Note

Some HTTP redirections require action from this module's client code. If this is the case, :exc:`HTTPError` is raised. See RFC 2616 for details of the precise meanings of the various redirection codes.

HTTPCookieProcessor Objects

:class:`HTTPCookieProcessor` instances have one attribute:

ProxyHandler Objects

HTTPPasswordMgr Objects

These methods are available on :class:`HTTPPasswordMgr` and :class:`HTTPPasswordMgrWithDefaultRealm` objects.

AbstractBasicAuthHandler Objects

HTTPBasicAuthHandler Objects

ProxyBasicAuthHandler Objects

AbstractDigestAuthHandler Objects

HTTPDigestAuthHandler Objects

ProxyDigestAuthHandler Objects

HTTPHandler Objects

HTTPSHandler Objects

FileHandler Objects

FTPHandler Objects

CacheFTPHandler Objects

:class:`CacheFTPHandler` objects are :class:`FTPHandler` objects with the following additional methods:
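The two additional methods are :meth:`setTimeout`, which sets the timeout of cached connections in seconds, and :meth:`setMaxConns`, which sets the maximum number of cached connections. A minimal configuration sketch follows; the values are arbitrary and no FTP connection is opened.

```python
try:
    import urllib2
except ImportError:                     # Python 3 fallback (an addition)
    import urllib.request as urllib2

ftp_handler = urllib2.CacheFTPHandler()
ftp_handler.setTimeout(30)              # keep idle connections for 30 seconds
ftp_handler.setMaxConns(4)              # cache at most 4 connections

# Passing the instance replaces the default FTPHandler in the opener.
opener = urllib2.build_opener(ftp_handler)

# opener.handlers is an implementation detail, read only for illustration.
ftp_handlers = [h for h in opener.handlers
                if isinstance(h, urllib2.FTPHandler)]
```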

UnknownHandler Objects

HTTPErrorProcessor Objects

Examples

This example gets the python.org main page and displays the first 100 bytes of it:

>>> import urllib2
>>> f = urllib2.urlopen('http://www.python.org/')
>>> print f.read(100)
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<?xml-stylesheet href="./css/ht2html

This example sends a data stream to the stdin of a CGI script and reads the data it returns. Note that this example will only work when the Python installation supports SSL.

>>> import urllib2
>>> req = urllib2.Request(url='https://localhost/cgi-bin/test.cgi',
...                       data='This data is passed to stdin of the CGI')
>>> f = urllib2.urlopen(req)
>>> print f.read()
Got Data: "This data is passed to stdin of the CGI"

The code for the sample CGI used in the above example is:

#!/usr/bin/env python
import sys
data = sys.stdin.read()
print 'Content-type: text/plain\n\nGot Data: "%s"' % data

Use of Basic HTTP Authentication:

import urllib2
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='kadidd!ehopper')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
urllib2.urlopen('http://www.example.com/login.html')

:func:`build_opener` provides many handlers by default, including a :class:`ProxyHandler`. By default, :class:`ProxyHandler` uses the environment variables named <scheme>_proxy, where <scheme> is the URL scheme involved. For example, the :envvar:`http_proxy` environment variable is read to obtain the HTTP proxy's URL.

This example replaces the default :class:`ProxyHandler` with one that uses programmatically-supplied proxy URLs, and adds proxy authorization support with :class:`ProxyBasicAuthHandler`.

import urllib2
proxy_handler = urllib2.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib2.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')

opener = urllib2.build_opener(proxy_handler, proxy_auth_handler)
# This time, rather than install the OpenerDirector, we use it directly:
opener.open('http://www.example.com/login.html')

Adding HTTP headers:

Use the headers argument to the :class:`Request` constructor, or:

import urllib2
req = urllib2.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
r = urllib2.urlopen(req)

:class:`OpenerDirector` automatically adds a :mailheader:`User-Agent` header to every :class:`Request`. To change this:

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')

Also, remember that a few standard headers (:mailheader:`Content-Length`, :mailheader:`Content-Type` and :mailheader:`Host`) are added when the :class:`Request` is passed to :func:`urlopen` (or :meth:`OpenerDirector.open`).