Scanning, text recognition and archiving of paper documents...
with GUI clients, or from the command line if necessary.
The Paper Tiger code has a liberal MIT license.
It uses various other open-source programs.
- scanning documents into TIFF images using SANE
- OCR/text recognition using Tesseract
- storage of documents as PDF files (image file + OCR text), e.g. on a Samba share
- index of documents+notes+full text in Firebird database
- server component written in FreePascal, so no X Windows required.
- command line control on server
- CGI REST server component
- Viewer/scanner GUI written in Lazarus+FreePascal
- Initial support for WIA/TWAIN on Windows to support scanning from desktop
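The scan, OCR and PDF steps listed above can be sketched as a chain of
command-line calls. This is only an illustration of how the tools fit
together, not the contents of scanwrap.sh or hocrwrap.sh; the function name
and the file naming are hypothetical, and the commands are printed rather than
run so that nothing touches a scanner.

```shell
#!/bin/sh
# Hypothetical helper: print the commands that would turn one scanned page
# into a searchable PDF. $1 = base file name, $2 = OCR language (default eng).
print_page_pipeline() {
    base=$1
    lang=${2:-eng}
    # 1. Scan a page to TIFF with SANE's scanimage (writes the image to stdout).
    echo "scanimage --format=tiff > $base.tif"
    # 2. OCR with Tesseract; the trailing "hocr" names the config file that
    #    switches on hOCR output.
    echo "tesseract $base.tif $base -l $lang hocr"
    # 3. Combine image and hOCR text into a PDF with exactimage's hocr2pdf.
    #    Assumption: the hOCR output is named $base.html, as in older
    #    Tesseract 3 releases; later releases write $base.hocr instead.
    echo "hocr2pdf -i $base.tif -o $base.pdf < $base.html"
}

# Example: show the commands for a page called "page1".
print_page_pipeline page1
```

Printing the commands first makes it easy to check paths and language codes
before running them by hand.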
Further possible refinements:
- support for other databases (sqlite, PostgreSQL, MS SQL Server)
- using image cleanup tools such as scantailor and unpaper
- write .deb install package for easy installation on Debian servers
- batch import of images/pdfs
Architecture and development principles
- use other people's work if possible - the Unix way...
- if possible, build using modules:
e.g. allow use of multiple OCR engines etc
- store the OCR text in the PDF, and also store the TIFF image.
This enables external tools to work with the PDFs,
use the PDFs in other applications etc.
- save all OCR text in database or file
(e.g. a Lucene index) in order to allow fast search across all documents
- this means synchronizing PDF text with the full text archive may be required
- develop towards a single point of control:
tigerservercore, which may speak multiple protocols, e.g. via plugins
- however, use standard methods of storing data (e.g. full text search
components) and a normalized database schema, so that programs/tools that
don't speak the protocols mentioned above can get at the data easily
- these 2 principles clash; the code will need to stabilize before it is wise to
access e.g. the database directly.
Even then, breaking changes will not be avoided if avoiding them would
compromise e.g. cleanness of design
FPC 2.7.1/trunk is preferred for the server/CGI programs.
The fpweb shipped with FPC 2.6.2, at least, does not accept the DELETE method.
For the client program, Lazarus trunk has been used for development.
1. Compile hgversion.pas, e.g.:
2. Compile the program(s) you want
2.1 With Lazarus:
2.2 With FreePascal:
- Run hgversion first to update the version info
fpc -dCGI tigercgi.lpr
- prerequisites: Linux/*nix (virtual) machine. Windows support may come later.
- prerequisites: have sane installed and configured for your scanner. E.g.:
aptitude install sane-utils
- prerequisites: have tesseract installed and configured. E.g.:
aptitude install tesseract-ocr tesseract-ocr-eng #for English language support
Note: we need version 3 because of hOCR support needed for getting searchable
- prerequisites: have exactimage installed (for hocr2pdf), e.g.:
aptitude install exactimage
- Tesseract must/can then be configured to output hocr, e.g.:
check you have this file present (adjust config directory to your situation):
If not (again, adjust config file location to your situation):
cat >> /usr/local/share/tessdata/configs/hocr << "EOF_DOCUMENT"
- prerequisites: have pdftk installed (for concatenating pdfs), e.g.:
aptitude install pdftk
- nice to have: have scantailor installed (for aligning/cleaning up the tiff
images before OCR).
see installation notes below
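A quick way to see which of the prerequisites above are still missing is to
look for their executables on the PATH. The sketch below assumes the usual
command names (scanimage from sane-utils, tesseract, hocr2pdf from exactimage,
pdftk); the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print every command name from the argument list that
# cannot be found on the PATH. Prints nothing if all are present.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example (uncomment to run):
# missing_tools scanimage tesseract hocr2pdf pdftk
```

This only checks that the programs exist; it does not check the Tesseract
version or whether the hocr config file is present.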
Installing the command line server:
- copy hocrwrap.sh to server directory (e.g. /opt/tigerserver/)
- copy scanwrap.sh to server directory
- copy tigerserver to server directory
- go to the server directory and make files executable, e.g. (replace
directory with your own if necessary):
chmod u+rx hocrwrap.sh
chmod u+rx scanwrap.sh
chmod u+rx tigerserver
- copy tigerserver.ini.template to tigerserver.ini and edit settings to match
Test by running ./tigerserver --help
Installing the cgi application:
- prerequisites: apache2 or another HTTP server that supports cgi
aptitude install apache2
- copy tigercgi to cgi directory (e.g. /usr/lib/cgi-bin).
Make sure the user Apache runs under may read and execute the file (e.g.
chmod ugo+rx tigercgi)
- copy hocrwrap.sh to cgi directory (e.g. /usr/lib/cgi-bin/)
- copy scanwrap.sh to cgi directory
- copy tigerserver.ini.template to tigerserver.ini in the cgi directory and edit
settings to match your environment
- go to the cgi directory and make files executable for the apache/www user,
e.g. (replace directory with your own if necessary):
# replace user/groups below with correct user/group if needed, e.g. apache2
chown www-data:www-data hocrwrap.sh
chown www-data:www-data scanwrap.sh
chown www-data:www-data tigercgi
chown www-data:www-data tigerserver.ini
# make scripts executable:
chmod u+rx hocrwrap.sh
chmod u+rx scanwrap.sh
chmod u+rx tigercgi
chmod u+r tigerserver.ini
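After the chown/chmod steps above, the result can be checked with a small
script. This is a sketch with a hypothetical function name; note that the
tests apply to the user running the script, so to check what the web server
sees it would have to be run as that user (for example www-data).

```shell
#!/bin/sh
# Hypothetical helper: report files in the cgi directory ($1) that the
# current user cannot execute or read. Returns 0 if everything is fine.
check_cgi_files() {
    dir=$1
    status=0
    # The wrapper scripts and the CGI program must be executable.
    for f in hocrwrap.sh scanwrap.sh tigercgi; do
        if [ ! -x "$dir/$f" ]; then
            echo "not executable: $f"
            status=1
        fi
    done
    # The settings file only needs to be readable.
    if [ ! -r "$dir/tigerserver.ini" ]; then
        echo "not readable: tigerserver.ini"
        status=1
    fi
    return $status
}

# Example (uncomment and adjust the directory to your situation):
# check_cgi_files /usr/lib/cgi-bin
```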
Installing the client:
- prerequisites: *nix: imagemagick dev libraries installed: e.g.
aptitude install imagemagick
- prerequisites: Windows: imagemagick DLLs e.g. Q16 x86 or x64 (depending on
papertiger client bitness) version downloaded from
http://www.imagemagick.org/script/binary-releases.php in client directory or
- compilation without imagemagick is possible (see source code for compiler
define) but the program will be much slower
- copy tigerclient.ini.template to tigerclient.ini and edit settings to match
Building Tesseract 3
If tesseract 3 is not available for your platform, you will need to build it.
Preliminary notes for building Tesseract 3 on Debian squeeze
aptitude install build-essential leptonica libleptonica-dev libpng-dev \
libjpeg-dev libtiff-dev zlib1g-dev
# as root:
tar -zxvf tesseract-3.01.tar.gz
checkinstall #follow the prompts and type "y" to create documentation directory.
# Enter a brief description then press enter twice
#language/training data, e.g. for Dutch and English:
#todo: check dir
Building scantailor from source
Scantailor is being developed; we use the scantailor enhanced fork.
Notes for Debian below.
# get compilers and dependencies
aptitude install build-essential cmake libqt4-dev libjpeg-dev zlib1g-dev \
libpng-dev libtiff-dev libtiff5-alt-dev libboost-dev libxrender-dev \
#libtiff5-alt-dev for good measure; hope it improves tiff support
Get source from git repository:
git clone git://git.code.sf.net/p/scantailor/code scantailor
git checkout enhanced #check out branch called "enhanced"
su - #switch to root
cd /home/pascaldev/scantailor #or wherever the files are located
exit #out of root
Getting PDF viewers to open a certain page:
Adobe Acrobat Reader
acrobat.exe /A "page=<pageNo>"
could also use "nameddest=<named destination>"
sumatrapdf -reuse-instance -page <pageNo>
Scrolls the first indicated file to the indicated page.
Tells an already open SumatraPDF to load the indicated files. If there are
several running instances, behaviour is undefined.
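The viewer options above can be wrapped in a small helper that builds the
command line for a given page. This is a sketch only: the function name and
the viewer keywords are hypothetical, and putting the file name after the
options is an assumption, since the notes above list only the options.

```shell
#!/bin/sh
# Hypothetical helper: print the command line that opens a PDF at a page.
# $1 = viewer (acrobat or sumatra), $2 = page number, $3 = PDF file.
viewer_command() {
    viewer=$1
    page=$2
    file=$3
    case $viewer in
        acrobat)
            printf 'acrobat.exe /A "page=%s" "%s"\n' "$page" "$file" ;;
        sumatra)
            printf 'sumatrapdf -reuse-instance -page %s "%s"\n' "$page" "$file" ;;
        *)
            echo "unknown viewer: $viewer" >&2
            return 1 ;;
    esac
}

# Example: command line for opening page 3 of doc.pdf in SumatraPDF.
viewer_command sumatra 3 doc.pdf
```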
ImageMagick DLLs on Windows
The following dlls seem sufficient for converting TIFF images for the client -
I just copied all dlls:
In modules\coders (just copied all dlls)