This program extracts text from a webpage. It can be used as a library or as a way to browse the web.

The library outputs a ContentBlock object, which is an iterable of str objects and nested ContentBlock objects. Each ContentBlock represents a block-level element in the HTML semantics; blocks without meaningful text are stripped from the output. Calling str on a ContentBlock returns a string representation of the object in which nested ContentBlock objects are separated by \n. For instance:

<div>Héllo <p>World</p></div>

is represented as this string:



This is tested with:
  • Python 2.7
  • html5lib 0.90
  • lxml 2.2
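
The ContentBlock structure described above (an iterable of str and nested ContentBlock objects, joined with \n by str) can be illustrated with a minimal sketch. This is not the library's actual class: the constructor signature, the has_text helper, and the choice to strip surrounding whitespace from each text item are assumptions made for illustration only.

```python
# Illustrative sketch only: NOT the library's real ContentBlock implementation.
# The constructor, has_text(), and whitespace stripping are assumptions.


class ContentBlock(object):
    """A block holding str items and nested ContentBlock items, in order."""

    def __init__(self, items=None):
        self.items = list(items or [])

    def __iter__(self):
        # Iterating yields the str and ContentBlock children in document order.
        return iter(self.items)

    def has_text(self):
        # A block is meaningful if any descendant holds non-whitespace text.
        for item in self.items:
            if isinstance(item, ContentBlock):
                if item.has_text():
                    return True
            elif item.strip():
                return True
        return False

    def __str__(self):
        # Join children with "\n", skipping blocks and strings without text.
        parts = []
        for item in self.items:
            if isinstance(item, ContentBlock):
                if item.has_text():
                    parts.append(str(item))
            elif item.strip():
                parts.append(item.strip())
        return "\n".join(parts)


# A block with one text item, one nested block, and one empty nested block.
block = ContentBlock(["Title", ContentBlock(["First paragraph"]), ContentBlock(["   "])])
print(str(block))
```

In this sketch the whitespace-only nested block is dropped when the string is built, which mirrors the stripping of blocks without meaningful text; how the real library treats whitespace inside text items may differ.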



