The code expects the following toolkits to be installed:


The code depends on a few C programs in the toolbox, created by Arnold Meijster. These only need to be compiled in place, which can be done by entering the toolkit folder and running `make`:

```bash
cd toolkit
make
```

# Usage
The program is shipped with a trained recognizer, which is stored in `state.p`. So training is not necessary!

## Recognizing a text in a word file
To determine all characters and text for each word in a specific word file and accompanying page image, use:

./ image.jpg word.xml output.xml

That's it! It should also handle other image formats, such as PGM.

## Training on a new dataset

To train the recognizer on a new dataset, or test it on a part of the dataset, a few steps are necessary.

The paths to the dataset are coded in Here, the function `default_dataset()` specifies which datasets to load. The default implementation uses the datasets included in this repository:

```python
def default_dataset():
    return CombinedDataset([
        Dataset('data/pages/KNMP/', 'data/charannotations/KNMP/'),
        Dataset('data/pages/Stanford/', 'data/charannotations/Stanford/')
    ])
```

When you run, it will by default only test on 20% of the dataset. To force the recognizer to retrain, remove the `state.p` file. Running the recognizer without arguments will then cause it to retrain and test itself again.
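
The effect of removing `state.p` can be pictured with a small load-or-train sketch. This is not the repository's code: the helper name `load_or_train`, the `train_fn` callback, and the use of `pickle` as the storage format are all assumptions made for illustration (the `.p` extension merely suggests a pickle file).

```python
import os
import pickle

STATE_FILE = 'state.p'  # file name taken from this README; pickle format is an assumption


def load_or_train(train_fn, path=STATE_FILE):
    """Return the cached state if `path` exists, otherwise train and cache it.

    `train_fn` is a hypothetical zero-argument callable that returns a
    picklable recognizer state.
    """
    if os.path.exists(path):
        # A saved state is present, so training is skipped entirely.
        with open(path, 'rb') as f:
            return pickle.load(f)
    # No saved state: train from scratch and store the result for next time.
    state = train_fn()
    with open(path, 'wb') as f:
        pickle.dump(state, f)
    return state
```

With logic of this shape, deleting the state file is enough to trigger a fresh training run on the next invocation.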


The repository contains a number of tools:

  • is a test case for using the same window size for every instance of the same character.
  • is a test case for determining the bounding box of a character by using connected components.
  • is a test case which handles binarization using Otsu and opening, which is then used to mask the image partially.
  • tests whether certain n-grams occur more often than others, and whether it would be worthwhile to train on n-grams in addition to single characters.
  • compares the occurrences of f and s in the dataset.
  • plots the distribution of widths per character class.
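
To make the connected-components idea from the list above concrete, here is a minimal sketch of how bounding boxes can be derived from a binarized image. It is not the test case from this repository: the function name `component_bounding_boxes`, the list-of-rows 0/1 image representation, and the choice of 4-connectivity are assumptions for the example.

```python
from collections import deque


def component_bounding_boxes(image):
    """Return (top, left, bottom, right) for each 4-connected foreground blob.

    `image` is a list of rows holding 0 (background) or 1 (ink).
    All bounds are inclusive; boxes are listed in scan order.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill of one component, tracking its extent.
            top, left, bottom, right = y, x, y, x
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```

A character made of several disconnected strokes (for example an `i` with its dot) would yield more than one box under this scheme, so a real pipeline would likely need to merge nearby boxes afterwards.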

There are also a few interactive tools, implemented as micro webservices using

  • is a tool to explore the dataset
  • is a very basic program to convert certain s-shaped f's to actual f's.
  • allows you to inspect the files that can be dumped by the recognizer, which encode not only the final solution, but also all other candidates, and how they were classified. To create files that can be read by this service, set DUMP_PARTIALS to True in