Kkes

This program extracts text from a web page. It can be used as a library or as a way to browse the web.

The library outputs a ContentBlock object, which is an iterable of str objects and nested ContentBlock objects. Each ContentBlock represents a block-level element in the HTML semantics; blocks without meaningful text are stripped from the output. Calling str on a ContentBlock returns a string representation of the object, with nested ContentBlock objects separated by \n. For instance:

<div>Héllo <p>World</p></div>

is represented as this string:

"Héllo\nWorld"

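The string conversion described above can be illustrated with a minimal sketch. The ContentBlock class below is a hypothetical stand-in written for this README, not the library's actual implementation: it assumes a block is a list of str and ContentBlock items, that adjacent str items belong to the same line, and that each nested block starts a new line. It is written for Python 3, not the Python 2.7 listed under Dependencies.

```python
class ContentBlock(list):
    """Hypothetical stand-in: a list of str and nested ContentBlock items."""

    def __str__(self):
        lines = []   # finished lines of output
        inline = []  # consecutive str items not yet flushed to a line

        def flush():
            # Join pending inline text, trim it, and keep it only if non-empty.
            text = "".join(inline).strip()
            if text:
                lines.append(text)
            del inline[:]

        for item in self:
            if isinstance(item, ContentBlock):
                # A nested block ends the current line and starts its own.
                flush()
                nested = str(item)
                if nested:
                    lines.append(nested)
            else:
                inline.append(item)
        flush()
        return "\n".join(lines)


# Hand-built equivalent of <div>Héllo <p>World</p></div>
doc = ContentBlock(["Héllo ", ContentBlock(["World"])])
print(str(doc))
```

Here the outer block plays the role of the div and the inner block the role of the p, so the printed text has "Héllo" and "World" on separate lines. How the real library builds these objects from parsed HTML is not shown here.
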
Dependencies

This is tested with:
  • Python 2.7
  • html5lib 0.90
  • lxml 2.2

Licence

MPL/GPL/LGPL
