:mod:`html.parser` --- Simple HTML and XHTML parser

Source code: :source:`Lib/html/parser.py`


This module defines a class, :class:`HTMLParser`, which serves as the basis for parsing text files formatted in HTML (HyperText Markup Language) and XHTML.

Create a parser instance. If *strict* is ``True`` (the default), invalid HTML results in :exc:`~html.parser.HTMLParseError` exceptions [1]. If *strict* is ``False``, the parser uses heuristics to make a best guess at the intention of any invalid HTML it encounters, similar to the way most browsers do.

An :class:`HTMLParser` instance is fed HTML data and calls handler methods when it encounters start tags, end tags, and text. The user is expected to subclass :class:`HTMLParser` and override these handler methods to implement the desired behavior.
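As a minimal sketch of this flow, the subclass below records each handler call in a list instead of printing it. The name ``EventRecorder`` and its ``events`` attribute are illustrative only and are not part of the module. The input is deliberately split in the middle of a tag to show that data can be fed in pieces:

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records handler calls instead of printing them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


recorder = EventRecorder()
# The start tag is split across two feed() calls; the parser buffers the
# incomplete tag until the rest of it arrives.
recorder.feed('<a hr')
recorder.feed('ef="x.html">link</a>')
recorder.close()

for event in recorder.events:
    print(event)
```

Text between tags may be delivered to ``handle_data`` in more than one call, so code that needs the complete text should concatenate the pieces.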

This parser does not check that end tags match start tags or call the end-tag handler for elements which are closed implicitly by closing an outer element.
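The following sketch illustrates this behavior with a hypothetical ``TagLogger`` subclass (not part of the module), constructed with default arguments. The ``<p>`` element is never closed explicitly, and the trailing ``</i>`` has no matching start tag:

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: logs start-tag and end-tag handler calls."""

    def __init__(self):
        super().__init__()
        self.log = []

    def handle_starttag(self, tag, attrs):
        self.log.append("start " + tag)

    def handle_endtag(self, tag):
        self.log.append("end " + tag)


logger = TagLogger()
# <p> is closed only implicitly by </div>; </i> matches no start tag.
logger.feed("<div><p>text</div></i>")
logger.close()

print(logger.log)
```

The log contains ``start div``, ``start p``, ``end div`` and ``end i``. No ``end p`` entry appears, and the unmatched ``</i>`` is reported like any other end tag. An application that needs balanced elements has to track open tags itself, for example with a stack.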

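The module also defines an exception, :exc:`~html.parser.HTMLParseError`, which the parser raises in strict mode when it encounters invalid HTML.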

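:class:`HTMLParser` instances provide a ``feed()`` method for supplying data, along with handler methods such as ``handle_starttag()``, ``handle_endtag()`` and ``handle_data()`` that subclasses override.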

Example HTML Parser Application

As a basic example, below is a simple HTML parser that uses the :class:`HTMLParser` class to print out start tags, end tags, and data as they are encountered:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
    def handle_endtag(self, tag):
        print("Encountered  an end tag:", tag)
    def handle_data(self, data):
        print("Encountered   some data:", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
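Handlers can also accumulate results rather than print them. The sketch below, in which the ``LinkCollector`` name and its ``links`` attribute are made up for illustration, uses the ``attrs`` argument of ``handle_starttag`` (a list of ``(name, value)`` pairs) to gather the ``href`` value of every ``a`` tag:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names are passed to the handler in lower case.
        if tag == "a":
            for name, value in attrs:
                # value is None for attributes written without a value.
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<p><a href="one.html">One</a> and '
               '<A HREF="two.html">Two</A></p>')
collector.close()

print(collector.links)
```

Both links are collected even though the second tag is written in upper case, because the parser lower-cases tag and attribute names before calling the handler.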

Footnotes

[1] For backward compatibility reasons, strict mode does not raise exceptions for all non-compliant HTML. That is, some invalid HTML is tolerated even in strict mode.