ttagger-nsd: Twitter tagger trained using "not-so-distant" supervision

Introduction

This package contains the POS and NER tagging models of the Lowlands Twitter tagger, trained with not-so-distant supervision [1]. The models are based on CRFsuite [3]. The POS tags are the universal POS tags proposed in [2]. If you use this package, please cite [1]:

@inproceedings{Plank:ea:2014:COLING,
    Author = {Barbara Plank and Dirk Hovy and Ryan McDonald and Anders Søgaard},
    Booktitle = {COLING},
    Title = {Adapting taggers to Twitter with not-so-distant supervision},
    Address = {Dublin, Ireland},
    Year = {2014}}

Installation

  • make sure you have a working crfsuite binary at tools/crfsuite-0.12/bin/crfsuite (the included binary is compiled for, and has been tested on, Linux)

  • set the variable LOWLANDS_TTAGGER_HOME to the directory where you unpacked ttagger-nsd: export LOWLANDS_TTAGGER_HOME=`pwd`

  • add crfutils.py to your PYTHONPATH: export PYTHONPATH=`pwd`/tools/:$PYTHONPATH

If you want to store these variables permanently, add them with the appropriate paths to your .bashrc file.
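
The setup can be sanity-checked before running the taggers. The following is a minimal Python sketch, not part of the package: the function name check_setup is hypothetical, and it relies only on the variable name and relative paths given in the installation steps above.

```python
import os


def check_setup(environ=None):
    """Return a list of human-readable problems with the ttagger-nsd setup.

    Hypothetical helper (not shipped with the package); it only checks the
    variable and paths named in the installation steps.
    """
    environ = os.environ if environ is None else environ
    problems = []

    home = environ.get("LOWLANDS_TTAGGER_HOME")
    if not home:
        return ["LOWLANDS_TTAGGER_HOME is not set"]
    if not os.path.isdir(home):
        return ["LOWLANDS_TTAGGER_HOME is not a directory: " + home]

    # The README expects the crfsuite binary at this relative location.
    crfsuite = os.path.join(home, "tools", "crfsuite-0.12", "bin", "crfsuite")
    if not os.path.isfile(crfsuite):
        problems.append("crfsuite binary not found: " + crfsuite)
    elif not os.access(crfsuite, os.X_OK):
        problems.append("crfsuite binary is not executable: " + crfsuite)

    # crfutils.py lives in tools/, so that directory must be on PYTHONPATH.
    tools_dir = os.path.realpath(os.path.join(home, "tools"))
    entries = [os.path.realpath(p)
               for p in environ.get("PYTHONPATH", "").split(os.pathsep) if p]
    if tools_dir not in entries:
        problems.append("tools/ directory is not on PYTHONPATH: " + tools_dir)

    return problems
```

Calling check_setup() with no arguments inspects the current environment; an empty list means the three installation steps appear to be in place.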

Usage

Run the POS tagger:

  ./runPOS.sh -t FILE 

where FILE contains one token per line, with sentences separated by a blank line. The output is written to a file called FILE.tagged, unless you specify the -s flag (write to stdout).

FILE can contain either one token per line, or a token and its gold tag per line (for testing purposes). In the former case, the output is the token and the predicted tag; in the latter case, the output is the token, the gold tag, and the predicted tag in the last column.
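
As an illustration of the file format, here is a minimal Python sketch that writes tokenized sentences as tagger input and reads the tagged output back. The function names are hypothetical and not part of the package. Two assumptions are made that the description above does not confirm: the files are UTF-8 encoded, and the output columns are separated by whitespace.

```python
def write_tokens(sentences, path):
    """Write tokenized sentences in the one-token-per-line input format."""
    with open(path, "w", encoding="utf-8") as out:  # UTF-8 is an assumption
        for sentence in sentences:
            for token in sentence:
                out.write(token + "\n")
            out.write("\n")  # a blank line ends each sentence


def read_tagged(path):
    """Read tagger output into a list of sentences, each a list of column tuples.

    Assumption: columns are separated by whitespace.
    """
    sentences, current = [], []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            columns = line.split()
            if not columns:  # blank line: sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            current.append(tuple(columns))
    if current:
        sentences.append(current)
    return sentences


def accuracy(sentences):
    """Token-level accuracy for three-column output (token, gold, predicted)."""
    rows = [row for sentence in sentences for row in sentence]
    if not rows:
        return 0.0
    if any(len(row) < 3 for row in rows):
        raise ValueError("accuracy needs token, gold tag and predicted tag")
    correct = sum(1 for row in rows if row[-2] == row[-1])
    return correct / len(rows)
```

The accuracy helper applies only when the input file contained gold tags, since it compares the last two columns of each line.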

Example:

  ./runPOS.sh -t data/pos/example-nogold.txt

For options, see:

  ./runPOS.sh -h
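
If the tagger is called from a Python program, a thin wrapper around the script may be convenient. The sketch below is an assumption-laden illustration, not part of the package: the function names are hypothetical, and it assumes that -s takes no argument and can follow -t FILE. Only the -t, -s and -h flags are documented here, so check ./runPOS.sh -h before relying on it.

```python
import os
import subprocess


def build_pos_command(input_file, to_stdout=False):
    """Build the argument list for ./runPOS.sh (hypothetical helper)."""
    command = ["./runPOS.sh", "-t", input_file]
    if to_stdout:
        # Assumption: -s is a plain switch that can follow "-t FILE".
        command.append("-s")
    return command


def run_pos_tagger(input_file, home=None):
    """Run the POS tagger from the package directory and return its stdout.

    input_file is resolved relative to the package directory unless it is
    an absolute path.
    """
    home = home or os.environ["LOWLANDS_TTAGGER_HOME"]
    result = subprocess.run(
        build_pos_command(input_file, to_stdout=True),
        cwd=home,                 # the script is invoked as ./runPOS.sh
        check=True,               # raise CalledProcessError on failure
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode stdout to str
    )
    return result.stdout
```

For example, run_pos_tagger("data/pos/example-nogold.txt") would run the example command shown above with -s added and return the tagged text as a string.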

Run the NER tagger:

  ./runNER.sh -t FILE

where FILE contains one token and its POS tag per line, tab-separated. The output is written to FILE.NER-tagged, unless you specify -s (write to stdout). Again, the file can optionally contain the gold tag in the last column (see data/ner/example.txt vs. data/ner/example-nogold.txt).
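
To illustrate the tab-separated input format, the following Python sketch turns lines of POS tagger output into NER input lines. It is a hypothetical helper, not part of the package, and rests on two assumptions: that the POS output columns are whitespace-separated with the predicted tag last, and that the NER tagger accepts the tags predicted by runPOS.sh. Compare the result with data/ner/example-nogold.txt before using it.

```python
def pos_output_to_ner_input(pos_lines):
    """Convert POS tagger output lines into NER input lines (token<TAB>POS tag)."""
    ner_lines = []
    for line in pos_lines:
        columns = line.split()
        if not columns:
            ner_lines.append("")  # keep the blank line between sentences
            continue
        if len(columns) < 2:
            raise ValueError("expected a token and a tag: %r" % line)
        # Assumption: the predicted POS tag is always the last column.
        token, predicted_tag = columns[0], columns[-1]
        ner_lines.append(token + "\t" + predicted_tag)
    return ner_lines
```

Writing the returned lines to a file, one per line, gives a file in the shape that ./runNER.sh -t expects when no gold tags are present.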

Example:

  ./runNER.sh -t data/ner/example-nogold.txt

and, with gold tags:

  ./runNER.sh -t data/ner/example.txt

Both taggers use the best models described in [1], that is, models trained on tweets by exploiting tag projections from URLs.

More specifically, the default models are:

  • for POS: DICT≺WEB model (iter=25, trained on WSJ+Gimpel)

  • for NER: DICT≺WEB model (iter=27, trained on CoNLL+Finin)

References

[1] Barbara Plank, Dirk Hovy, Ryan McDonald and Anders Søgaard. Adapting taggers to Twitter with not-so-distant supervision. In COLING, 2014.

[2] Slav Petrov, Dipanjan Das and Ryan McDonald. A universal part-of-speech tagset. In LREC, 2012.

[3] CRFsuite. http://www.chokkan.org/software/crfsuite/