Sequor

Sequor is a sequence labeler based on Collins's (2002) perceptron. Sequor has a flexible feature template language and is meant mainly for NLP applications such as Named Entity labeling, Part of Speech tagging or syntactic chunking. It includes the SemiNER named entity recognizer, with pre-trained models for German and English (see Named Entity Recognition (SemiNER)).

Sequor is especially useful if your dataset has a large label set. In this case it is likely to run faster and use much less RAM than a sequence labeler based on Conditional Random Fields. In addition, Sequor provides options that let you control the size of the model and trade off speed against accuracy:

  • size of the beam
  • label dictionary
  • feature hashing

See https://bitbucket.org/gchrupala/sequor/wiki/Options for details.
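
To illustrate the feature hashing option listed above, here is a minimal Python sketch of the general hashing trick: feature strings are mapped directly to slots in a fixed-size weight vector, so no feature dictionary has to be kept in memory. This is not Sequor's implementation. The function name `feature_index`, the feature strings and the vector size are all illustrative assumptions.

```python
import zlib

def feature_index(feature: str, label: str, size: int) -> int:
    """Map a (feature, label) pair to a slot in a fixed-size weight vector."""
    key = f"{label}|{feature}".encode("utf-8")
    # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key) % size

SIZE = 2 ** 16          # hypothetical maximum size of the parameter vector
weights = [0.0] * SIZE  # memory use is fixed, whatever the number of distinct features

# Perceptron-style update for one hypothetical feature of one token.
i = feature_index("word=Berlin", "LOC", SIZE)
weights[i] += 1.0

# The same pair always lands in the same slot; different pairs may collide.
assert feature_index("word=Berlin", "LOC", SIZE) == i
```

The trade-off is that two different features can share a slot. A larger vector reduces such collisions at the cost of memory.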

Installation

The easiest way to compile and install sequor is to:

  1. Install the Haskell platform

  2. Run:

    cabal update
    cabal install sequor --prefix=`pwd`
    

Cabal should then download and install the necessary packages, placing the sequor binary in ./bin and the data files in ./share.

Usage

With Sequor you can learn a model from sequences manually annotated with labels, and then apply this model to new data in order to add labels. Sequor is meant to be used mainly with linguistic data, for example to learn Part of Speech tagging, syntactic chunking or Named Entity labeling. The command-line usage is:

Usage: sequor command [OPTION...] [ARG...]
train:    train model
train [OPTION...] TEMPLATE-FILE TRAIN-FILE MODEL-FILE
  --rate=NUM (0.01)         learning rate
  --beam=INT (10)           beam size
  --iter=INT (10)           number of iterations
  --min-count=INT (100)     minimum feature frequency for label dictionary
  --heldout=FILE            path to heldout data
  --hash                    use hashing instead of feature dictionary
  --hash-sample=INT (1000)  sample size to estimate number of features when hashing
  --hash-max-size=INT       maximum size of parameter vector when hashing

See https://bitbucket.org/gchrupala/sequor/wiki/Options for more details about the training options.

predict:  predict using model
predict MODEL-FILE

version:  print version
version

help:     print usage information
help

Data files should be in the UTF-8 encoding.
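
If you are unsure whether a data file is valid UTF-8, a quick check before training can save a confusing failure later. The following is a small standalone Python sketch, not part of Sequor; the function name `first_invalid_utf8_offset` is hypothetical.

```python
import sys
from pathlib import Path

def first_invalid_utf8_offset(path):
    """Return the byte offset of the first invalid UTF-8 sequence, or None if the file is valid."""
    data = Path(path).read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None

if __name__ == "__main__":
    # Check every file named on the command line.
    for name in sys.argv[1:]:
        offset = first_invalid_utf8_offset(name)
        if offset is None:
            print(f"{name}: valid UTF-8")
        else:
            print(f"{name}: invalid UTF-8 at byte {offset}")
```

Saved under a name of your choice, the script takes the data files to check as arguments.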

As an example, we can train a model on the data annotated with syntactic chunk labels in the data directory, and then use it to label the test set:

./bin/sequor train data/all.features data/train.conll model \
           --rate 0.1 --beam 10 --iter 5 --hash \
           --heldout data/devel.conll

./bin/sequor predict model < data/test.conll > data/test.labels

Feature template syntax

Sequor uses a mini language to specify which features to extract from data. For details see https://bitbucket.org/gchrupala/sequor/wiki/Templates

Named Entity Recognition (SemiNER)

Sequor includes the SemiNER named entity recognizer, with pre-trained models for German and English.

The German model is trained on the CoNLL 2003 data and recognizes the following labels:

  • PER - people
  • ORG - organizations
  • LOC - locations such as cities and countries
  • MISC - miscellaneous entities such as nationalities

The German model is described in [Chrupala_and_Klakow_2010].

The English model is trained on the BBN Wall Street Journal data and recognizes the following labels:

  • CARDINAL - cardinal number
  • DATE - calendar date
  • GPE:CITY - city
  • GPE:COUNTRY - country
  • GPE:STATE_PROVINCE - state or province
  • MONEY - currency
  • NORP:NATIONALITY - nationality
  • NORP:OTHER -
  • NORP:POLITICAL - political affiliation
  • ORDINAL - ordinal number
  • ORGANIZATION - organization
  • PERCENT - percentage
  • PERSON - people
  • QUANTITY - numerical quantity

See https://bitbucket.org/gchrupala/sequor/wiki/SemiNER for usage information.

Sequence perceptron

Compared to the commonly used Conditional Random Field model, the Sequence Perceptron algorithm is simpler and more efficient, and often reaches similar accuracy.

The sequence perceptron was introduced in [Collins_2002].
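
To make the idea concrete, here is a minimal Python sketch of a sequence perceptron: decode the best label sequence under the current weights, and if it differs from the gold sequence, add the gold features and subtract the predicted ones. It is a simplification under stated assumptions, not Sequor's implementation. It decodes with beam search rather than Viterbi, it omits the parameter averaging described in [Collins_2002], and the feature set, function names and toy data are all hypothetical.

```python
from collections import defaultdict

def features(words, i, prev_label, label):
    """Hypothetical feature set: current word with label, and label bigram."""
    return [f"w={words[i]}|y={label}", f"prev={prev_label}|y={label}"]

def score(weights, words, i, prev_label, label):
    return sum(weights[f] for f in features(words, i, prev_label, label))

def decode(weights, words, labels, beam=10):
    """Left-to-right beam search for a high-scoring label sequence."""
    hyps = [(0.0, [])]  # (score so far, labels so far)
    for i in range(len(words)):
        expanded = []
        for total, seq in hyps:
            prev = seq[-1] if seq else "<s>"
            for y in labels:
                expanded.append((total + score(weights, words, i, prev, y), seq + [y]))
        # Keep only the `beam` best partial hypotheses.
        expanded.sort(key=lambda h: h[0], reverse=True)
        hyps = expanded[:beam]
    return hyps[0][1]

def sequence_features(words, seq):
    """All features fired by a complete label sequence."""
    feats = []
    for i, y in enumerate(seq):
        prev = seq[i - 1] if i > 0 else "<s>"
        feats.extend(features(words, i, prev, y))
    return feats

def train(data, labels, iterations=10, rate=1.0, beam=10):
    weights = defaultdict(float)
    for _ in range(iterations):
        mistakes = 0
        for words, gold in data:
            pred = decode(weights, words, labels, beam)
            if pred != gold:
                mistakes += 1
                # Reward the features of the gold sequence, penalize the prediction.
                for f in sequence_features(words, gold):
                    weights[f] += rate
                for f in sequence_features(words, pred):
                    weights[f] -= rate
        if mistakes == 0:  # every training sequence is decoded correctly
            break
    return weights

# Toy data (hypothetical): two labeled sentences.
data = [
    (["John", "lives", "in", "Berlin"], ["PER", "O", "O", "LOC"]),
    (["Mary", "visited", "Paris"], ["PER", "O", "LOC"]),
]
labels = ["PER", "LOC", "O"]
weights = train(data, labels)
print(decode(weights, ["John", "visited", "Berlin"], labels))
```

A smaller beam makes decoding cheaper but can prune the best sequence, which is the speed versus accuracy trade-off behind the beam size option.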

[Collins_2002] Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP. http://www.clic.cs.columbia.edu/~mcollins/papers/tagperc.pdf
[Chrupala_and_Klakow_2010] Grzegorz Chrupała and Dietrich Klakow. 2010. A Named Entity Labeler for German: Exploiting Wikipedia and Distributional Clusters. LREC. http://grzegorz.chrupala.me/papers/lrec-2010.pdf