

SemiNER - Named Entity Labeler

SemiNER is a tool for Named Entity labeling. This release includes:

  • two models trained on the German CoNLL data with features extracted from a large unlabeled German corpus
  • a model trained on the BBN corpus.

Usage

The easiest way to get started is to build SemiNER by running the following command from the top-level sequor directory:

cabal install --prefix=`pwd`

This assumes that you have already installed the Haskell Platform (http://www.haskell.org/platform).

There are two pretrained German models: full (which uses all the features from the training data, including lemmas, POS tags and chunk tags) and raw (which uses only word features and cluster ID features). The raw model needs no additional preprocessing steps.

There is also a single English model, which likewise needs no additional preprocessing.

Run these commands from the top-level sequor directory. To label German text using the raw pretrained model:

bin/seminer de-raw < INPUT-FILE > OUTPUT-FILE

To label German text using the full pretrained model:

bin/seminer de-full < INPUT-FILE > OUTPUT-FILE

To label English text:

bin/seminer en < INPUT-FILE > OUTPUT-FILE
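
If you want to call the labeler from a script instead of the shell, the commands above can be wrapped roughly as follows. This is only a sketch: the function name run_seminer and the UTF-8 encoding are assumptions made for illustration, and the only things taken from this page are the bin/seminer path and the three model names.

```python
import subprocess

# Model names as listed on this page.
MODELS = {"de-raw", "de-full", "en"}


def run_seminer(model, text, binary="bin/seminer"):
    """Pipe `text` through `bin/seminer MODEL` and return its standard output.

    Hypothetical helper: it must be run from the top-level sequor directory
    (or be given another path via `binary`). UTF-8 is an assumption.
    """
    if model not in MODELS:
        raise ValueError(f"unknown model {model!r}; expected one of {sorted(MODELS)}")
    result = subprocess.run(
        [binary, model],
        input=text,            # equivalent to `< INPUT-FILE`
        capture_output=True,   # equivalent to `> OUTPUT-FILE`, kept in memory
        encoding="utf-8",
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

For example, run_seminer("de-raw", open("INPUT-FILE", encoding="utf-8").read()) should correspond to the first command above.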

Format

The input uses the CoNLL format: one token per line, with sentences separated by a blank line.

For prediction with the German raw model you just need the word forms:

Seit
1740
wurde
im
Steinheimer
Stadtwirtshaus
Apfelwein
ausgeschenkt
.
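
If your text is not yet in this layout, a small script can produce it. The sketch below assumes one sentence per input string and uses a deliberately naive regular-expression tokenizer; the function names are hypothetical, and this page does not say which tokenization the models expect, so treat the splitting rule as a placeholder.

```python
import re
from typing import Iterable, List


def tokenize(sentence: str) -> List[str]:
    # Naive rule (an assumption): a token is a run of word characters
    # or a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", sentence)


def to_one_token_per_line(sentences: Iterable[str]) -> str:
    # One token per line, with a blank line between sentences.
    blocks = ["\n".join(tokenize(s)) for s in sentences]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(to_one_token_per_line(
        ["Seit 1740 wurde im Steinheimer Stadtwirtshaus Apfelwein ausgeschenkt."]
    ), end="")
```

Run on the sentence shown, this reproduces the nine lines above.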

For the German full model you need to provide the word form, lemma, POS tag and chunk label for each token:

Seit seit APPR B-PC
1740 @card@ CARD B-NC
wurde werden VAFIN B-VC
im im APPRART B-PC
Steinheimer <unknown> NN B-NC
Stadtwirtshaus Stadtwirtshaus NN I-NC
Apfelwein Apfelwein NN I-NC
ausgeschenkt ausschenken VVPP B-VC
. . $. O

These annotations should be compatible with those in the German training data from CoNLL 2003 (http://www.cnts.ua.ac.be/conll2003/ner/), i.e. produced with TreeTagger.
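
Before running the full model it can help to check that every token line really has four columns. The following is a minimal sketch, assuming the columns are separated by whitespace as in the example above; the function name read_sentences is hypothetical and is not part of SemiNER.

```python
def read_sentences(text, columns=4):
    """Split CoNLL-style text into sentences of (word, lemma, POS, chunk) tuples.

    Assumes whitespace-separated columns; raises ValueError on a line with
    the wrong number of columns.
    """
    sentences, current = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if len(fields) != columns:
            raise ValueError(
                f"line {number}: expected {columns} columns, got {len(fields)}"
            )
        current.append(tuple(fields))
    if current:
        sentences.append(current)
    return sentences
```

Calling read_sentences(text, columns=1) applies the same check to input for the raw and English models, which take only the word form.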

For prediction with the English model you only need to provide the word forms.
