tiedemann

Jörg Tiedemann

ParCor

ParCor 1.0 is a parallel corpus of texts annotated for pronoun coreference, a reduced form of coreference in which pronouns are used as referring expressions. It consists of parallel English-German documents from two text genres: TED Talks (transcribed planned speech) and EU Bookshop publications (written text). All documents in the corpus have been manually annotated with the type and location of each pronoun and, where relevant, its antecedent.

pdf2xml

A tool for converting PDF files to XML

Uplug

A collection of tools for processing parallel corpora

subalign

Tools for converting and aligning (translated) movie subtitles

Blacklist Classifier

A simple, fast classifier that uses word blacklists to discriminate between closely related languages

Lingua-Align

A package for word and tree alignment