ShelobPy: Python Document Text Extractor
ShelobPy is a simple Python package that reads files in various formats and extracts text that can be searched.
Reads the following formats:
- Word 97-2003 (doc files)
- Word 2007 on (docx files)
- PDF files
- HTML files
- Rich Text files
- Works (wps files)
- Open Office Text (odt files)
- Plain Text files
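As a rough illustration of how one entry point can serve several formats, the sketch below picks a reader from the file extension. It is not ShelobPy's actual code or API: the names `extract_text`, `read_plain`, `read_html` and `READERS` are made up for this example, and only plain text and HTML are covered because they need nothing beyond the standard library.

```python
import os
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping anything inside script or style tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def read_plain(path):
    # Undecodable bytes are replaced rather than raising: some noise is acceptable.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read()


def read_html(path):
    collector = _TextCollector()
    collector.feed(read_plain(path))
    collector.close()
    # Join the text nodes and collapse runs of whitespace.
    return " ".join(" ".join(collector.parts).split())


# Hypothetical registry mapping a lowercase extension to a reader function.
READERS = {".txt": read_plain, ".html": read_html, ".htm": read_html}


def extract_text(path):
    ext = os.path.splitext(path)[1].lower()
    try:
        reader = READERS[ext]
    except KeyError:
        raise ValueError("unsupported file type: %r" % ext)
    return reader(path)
```

Supporting another format in this sketch would mean writing one more reader function and adding its extension to `READERS`.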
It's rather hackish and is derived from Shelob, an earlier project I wrote in C that did the same thing. I moved to Python because the old code was a nightmare to install: it depended on various libraries that were either badly documented or changed their API with each minor revision.
As I only wanted to pull out terms for searching, I don't care if it misses some of the terms or picks up some noise.
See the tests folder for an idea of how it works.
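To show why a little noise or a few missed terms is tolerable, here is a minimal sketch of turning extracted text into a set of search terms. It is an assumption about how the output might be used, not part of ShelobPy; the function name `search_terms` and the `min_length` parameter are hypothetical.

```python
import re

# Runs of ASCII letters and digits; everything else is treated as a separator.
_WORD = re.compile(r"[a-z0-9]+")


def search_terms(text, min_length=3):
    """Return the distinct lowercase terms in text, dropping very short ones."""
    return {word for word in _WORD.findall(text.lower()) if len(word) >= min_length}
```

Because the result is a set of terms rather than a faithful copy of the document, stray fragments only add a few useless entries and a missed word only affects searches for that one term.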
Copyright (C) 2012 South Wales Business Systems Ltd
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.