Transcribo - a plain text rendering library written in pure Python

Mailing List: (currently not maintained)
Version: 0.7 alpha
Author: fhaxbox66 at googlemail.com
(c) 2009-2010 Dr. leo

1   What's new?

Version 0.7

This is a milestone release with many new features. Much of the code has been refactored.

• unified command line front end using argparse (a dependency under Python 2.6)
• new generic configuration system named yaconfig with cascading style sheets using PyYAML (new dependency)
  • supports multiple YAML files which are successively mixed into a tree of nested dictionaries
  • multiple inheritance from any node specified by absolute or local paths (relative paths not fully supported)
  • supports string interpolation similar to configparser from the stdlib (this feature is not used though)
• more rST features, including:
  • references and targets (not yet footnotes)
  • definition lists, literal blocks and transitions
  • use of the class directive to change hyphenation, wrapper, translator etc. on the fly
• readers (the components that read input files such as rST) are fully configurable through cascading style sheets in YAML format. In the case of rST, this means that Docutils' own configuration system is no longer visible to the Transcribo user. Note that Transcribo, when used with the rst reader, acts as a Docutils writer component.
• no longer depends on a Braille translator such as YABT
• hard page breaks improved; they can be used with the rST reader through style sheets, e.g. to break the page after the end of a section
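
The successive mixing of style sheets into a tree of nested dictionaries can be pictured with a few lines of Python. This is only a minimal sketch and not yaconfig's actual code: the function names merge and cascade and the style keys are assumptions made up for illustration, plain dictionaries stand in for parsed YAML files, and inheritance and string interpolation are left out.

```python
import copy


def merge(base, override):
    """Return a new dict with `override` mixed into `base`, recursing into nested dicts."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Both sides are dicts: merge them instead of replacing the subtree.
            result[key] = merge(result[key], value)
        else:
            # Scalars, lists and new keys simply override what was there.
            result[key] = copy.deepcopy(value)
    return result


def cascade(*sheets):
    """Mix the given style sheets in order; later sheets win."""
    merged = {}
    for sheet in sheets:
        merged = merge(merged, sheet)
    return merged


# Stand-ins for the contents of two YAML style sheets (hypothetical keys).
base = {"paragraph": {"wrapper": {"width": 60, "align": "left"}}}
align_block = {"paragraph": {"wrapper": {"align": "block"}}}

styles = cascade(base, align_block)
```

Here the second sheet only changes the alignment, while the width from the first sheet is kept.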

Version 0.6

• hard page breaks
• avoid widows and orphans when soft-breaking pages
• support for hyphenation (requires PyHyphen; see the installation instructions below)
• numerous bug fixes and improvements

Version 0.5.3

This is a minor bug fix release.

2   Introduction

The Transcribo project aims to develop modular, easy-to-use and powerful cross-platform software that converts various file formats into accurate plain text. What might seem a somewhat strange goal in the age of PDF and HTML turns out to be very useful, e.g. for output devices that can only handle plain text, such as Braille embossers. Indeed, Transcribo has been designed with the objective of printing documents in high-quality Braille. However, Transcribo should be useful in all contexts where text-based output in highly customizable layouts is needed.

Transcribo has been designed to separate the processing of the input file from the actual rendering algorithm. Hence, there are two layers: in the input layer, format-specific readers parse the input streams and feed them into the renderer, which forms the second layer.

More specifically, the input layer may contain readers specific to each supported input format. Readers do the following:

• parse the input file,
• derive from it the layout structure and
• use the renderer to
  • generate a proprietary tree representation of the document, and
  • traverse the tree, creating a line-by-line representation of the document.
• Thereafter, the renderer's paginator is called to insert white space as margins, page breaks, create headers and footers, resolve page references etc.
• Finally, the paginated line-by-line representation is assembled to a plain text file.
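
The sequence of steps can be sketched in a few lines of Python. This is a simplified illustration and not Transcribo's API: all function names are hypothetical, the tree representation is reduced to a flat list of paragraph blocks, and the paginator only cuts pages at a fixed height, without margins, headers, footers or page references.

```python
import re
import textwrap


def read_txt(text):
    """Reader: split plain text into paragraph blocks at blank lines."""
    blocks = []
    for chunk in re.split(r"\n\s*\n", text):
        words = chunk.split()
        if words:
            blocks.append(" ".join(words))
    return blocks


def render_lines(blocks, width):
    """Renderer: wrap each block; separate blocks by one empty line."""
    lines = []
    for index, block in enumerate(blocks):
        if index:
            lines.append("")
        lines.extend(textwrap.wrap(block, width))
    return lines


def paginate(lines, page_height):
    """Paginator: cut the line list into pages of at most page_height lines."""
    return [lines[i:i + page_height] for i in range(0, len(lines), page_height)]


def assemble(pages):
    """Join the pages with a form feed character as hard page break."""
    return "\f".join("\n".join(page) for page in pages)


def transcribe_text(text, width=40, page_height=25):
    """Run the whole pipeline: read, render, paginate, assemble."""
    return assemble(paginate(render_lines(read_txt(text), width), page_height))
```

Each stage only sees the output of the previous one, so a reader for another input format would replace read_txt without touching the rest.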

The renderer allows a specific translator and wrapper, including optional hyphenation, to be attached to each content block (paragraph, heading, reference etc.) in order to perform translations and achieve the required text layout. In combination with readers for mark-up languages, this feature allows the user to control the output at a high level of granularity.
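
To illustrate the idea of per-block translators and wrappers, here is a minimal sketch built on textwrap from the stdlib. The Block class and its attributes are assumptions made for this example and do not reflect Transcribo's real classes; the "translator" is just a callable from string to string, and hyphenation is omitted.

```python
import textwrap


class Block(object):
    """A content block carrying its own translator and wrapper (hypothetical)."""

    def __init__(self, text, translator=None, wrapper=None):
        self.text = text
        # Default translator leaves the text unchanged.
        self.translator = translator or (lambda s: s)
        # Default wrapper breaks lines at 40 characters.
        self.wrapper = wrapper or textwrap.TextWrapper(width=40)

    def render(self):
        """Translate first, then wrap the result into a list of lines."""
        return self.wrapper.wrap(self.translator(self.text))


# A heading that is upper-cased and a paragraph with a narrow, indented layout.
heading = Block("Introduction",
                translator=lambda s: s.upper(),
                wrapper=textwrap.TextWrapper(width=20))
paragraph = Block("Plain text can be laid out carefully.",
                  wrapper=textwrap.TextWrapper(width=16, initial_indent="  "))

heading_lines = heading.render()
paragraph_lines = paragraph.render()
```

Because every block holds its own pair, a heading can be translated and wrapped differently from the paragraph that follows it.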

Currently there are readers for reStructuredText and plain text. Additional readers for formats such as LaTeX, ODF, RTF, and XML formats such as DocBook and HTML would be useful.

3   Installation and usage

Transcribo is developed with Python 2.6. It should run on older versions, possibly with small changes. There are a few mandatory and optional dependencies:

• PyYAML
• argparse (already included in the stdlib of Python 2.7)
• If you want to have hyphenated output, you'll need PyHyphen
• If you want to use the translation features for Braille, you may wish to install one of the following Braille translators:
• Docutils, because Transcribo's rST reader is essentially a Docutils writer component. If you are happy with txt2txt, you can skip this.

Transcribo is a pure Python package. It is installed by unpacking the archive and typing something like the following at the shell prompt:

cd <package dir>
python setup.py install


The test/test.py script demonstrates how to use Transcribo programmatically. Use the transcribe.py script from the shell prompt to generate paginated plain text from rST or plain text documents. Type 'transcribe.py --help' to read an argparse-generated help text on the available commands. Examples:

# Generate block-aligned text with the en_US hyphenation dictionary. Requires PyHyphen!
transcribe infile.rst outfile.out --styles align-block hyphen_en_US
# Note that '--reader rst' is used by default. the 'base.yaml' style file is
# Generate paginated plain text from plain text. Each blank line
# is interpreted as a paragraph separator.
transcribe infile.txt outfile.out --reader txt
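
For readers unfamiliar with argparse, a command line of this shape can be declared as follows. This is a sketch that merely mirrors the examples above; it is not the parser shipped with Transcribo, and the restriction of --reader to 'rst' and 'txt' as well as the help texts are assumptions.

```python
import argparse


def build_parser():
    """Build a parser accepting the arguments used in the examples above."""
    parser = argparse.ArgumentParser(
        prog="transcribe",
        description="Generate paginated plain text (illustrative sketch).")
    parser.add_argument("infile", help="input document")
    parser.add_argument("outfile", help="plain text output file")
    # Assumption: only the two readers mentioned in this document exist.
    parser.add_argument("--reader", choices=["rst", "txt"], default="rst",
                        help="reader used to parse the input (default: rst)")
    # Zero or more style names, applied in the order given.
    parser.add_argument("--styles", nargs="*", default=[],
                        help="names of style sheets to mix in")
    return parser


args = build_parser().parse_args(
    ["infile.rst", "outfile.out", "--styles", "align-block", "hyphen_en_US"])
```

After parsing, args.reader holds the default 'rst' and args.styles holds the two style names in the order given.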