ShelobPy: Python Document Text Extractor

ShelobPy is a simple Python package that reads files in various formats and extracts text that can be searched.

Reads the following formats:
Word 97-2003 (doc files)
Word 2007 on (docx files)
PDF files
HTML files
Rich Text files
Works (wps files)
Open Office Text (odt files)
Plain Text files

It's rather hackish and is a port of Shelob, an earlier project I wrote in C that did the same thing. I moved to Python because the old project was a nightmare to install: it used various libraries that were either badly documented or changed their API with each minor revision.

As I only want to pull out terms for searching, I don't care if it misses some of the terms or picks up some noise.
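
As a rough illustration of that trade-off, here is a minimal sketch of a "good enough" term extractor for docx files. It is not ShelobPy's actual code or API: the function name docx_terms is hypothetical, and it relies only on the fact that a docx file is a zip archive whose body text lives in word/document.xml.

```python
import io
import re
import zipfile


def docx_terms(source):
    """Return a set of lowercase search terms found in a .docx file.

    `source` is a file path or a binary file-like object.
    Hypothetical helper for illustration; not part of ShelobPy.
    """
    # A .docx file is a zip archive; the body text is in word/document.xml.
    with zipfile.ZipFile(source) as archive:
        xml = archive.read("word/document.xml").decode("utf-8", errors="ignore")
    # Crude on purpose: swap every tag for a space instead of parsing the XML.
    text = re.sub(r"<[^>]+>", " ", xml)
    # Keep runs of two or more word characters as terms.
    return {term.lower() for term in re.findall(r"\w{2,}", text)}


if __name__ == "__main__":
    # Build a tiny in-memory archive that mimics the layout of a .docx file.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "word/document.xml",
            "<w:document><w:body><w:p><w:r><w:t>Hello Shelob</w:t></w:r></w:p></w:body></w:document>",
        )
    buffer.seek(0)
    print(sorted(docx_terms(buffer)))
```

Stripping tags with a regular expression leaves some noise behind (an escaped ampersand, for example, shows up as the term "amp"), which is acceptable when the output only feeds a search index.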

See the tests folder for an idea of how it works.

Dependencies:
pyPDF-1.3

Copyright (C) 2012 South Wales Business Systems Ltd

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
