Kkes

This program extracts text from a web page. It can be used as a library or as a way to browse the web.

The library outputs a ContentBlock object, which is an iterable of str objects and nested ContentBlock objects. A ContentBlock represents a block-level element in the HTML semantics; blocks without meaningful text are stripped from the output. Calling str on a ContentBlock returns a string representation of the object, with nested ContentBlock objects separated by \n. For instance:

<div>Héllo <p>World</p></div>

is represented as this string:

"Héllo\nWorld"
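
The behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the library's actual code: the constructor signature, the children attribute and the whitespace stripping are assumptions made for the example.

```python
# -*- coding: utf-8 -*-
# Hypothetical stand-in for the library's ContentBlock; the real class may
# be built and structured differently.


class ContentBlock(object):
    def __init__(self, children):
        # children: str objects and nested ContentBlock objects (assumed).
        self.children = list(children)

    def __iter__(self):
        # A ContentBlock is an iterable of its str and ContentBlock children.
        return iter(self.children)

    def __str__(self):
        parts = []
        for child in self.children:
            # Assumption: surrounding whitespace is dropped and children
            # that end up empty are skipped.
            text = str(child).strip()
            if text:
                parts.append(text)
        # Children are separated by a newline.
        return "\n".join(parts)


# <div>Héllo <p>World</p></div> modelled by hand:
block = ContentBlock(["Héllo ", ContentBlock(["World"])])
text = str(block)
```

Here text holds "Héllo\nWorld", and iterating over block yields the string "Héllo " followed by the nested ContentBlock.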

Dependencies

Tested with:
  • python 2.7
  • html5lib 0.90
  • lxml 2.2

Licence

MPL/GPL/LGPL