
MoNoise Readme

MoNoise is a lexical normalization model for Twitter data (it can also be used for other domains). In short, its task is to convert:

new pix comming tomoroe

to:

new pictures coming tomorrow

MoNoise achieves state-of-the-art performance on several datasets and can normalize multiple sentences per second on a single thread.

A short abstract of the model:

This model generates candidates using the Aspell spell checker and a word embedding model trained on Twitter data. Features from the generation are then complemented with n-gram probabilities from canonical text and from the Twitter domain. A random forest classifier is used to rank the generated candidates. For more information, see the full paper (draft version): www.let.rug.nl/rob/doc/clin27.pdf

Requirements

  • A recent C++ compiler (C++11 or newer)
  • Input data (or use -m INteractive)
  • For training your own model: a lot of memory, since the random forest classifier is quite memory hungry. However, I included some options that save memory at the cost of slightly lower performance (see --badspeller and --known). Additionally, you could switch ranger to save memory, but it then becomes very slow.
  • Alternatively, you can download a pre-trained model here: www.let.rug.nl/rob/data/monoise
  • Running the system should work with 4 GB to 8 GB of RAM, depending on the size of the word embeddings/n-grams.

Languages

At the moment MoNoise has models for:

  • English
  • Dutch
  • Slovenian
  • Serbian
  • Croatian

The requirements for adding a new language are:

The steps are:

  • Generate an aspell word list: aspell --data-dir /mnt/D/normalization/croatian/aspell-hr-0.51-0/ --dump master --lang hr > ~/projects/monoise/hrData/aspell
  • If you get an error about iso- files, copy them from another language.
  • Clean your noisy and canonical data (remove markup); for Wikipedia:
python3 WikiExtractor.py -o extracted.sl ~/Downloads/slwiki-20170520-pages-articles-multistream.xml.bz2
cat extracted.sl/*/* | grep -v "^<" > wiki.sl
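
The grep -v "^<" step above drops every line that starts with "<", which removes the <doc ...> and </doc> wrapper lines that WikiExtractor writes around each article. As a minimal sketch, the same filter could be written in Python when further cleaning is needed; the function name strip_markup_lines is hypothetical and not part of MoNoise.

```python
from typing import Iterable, Iterator


def strip_markup_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that do not start with '<' (like grep -v "^<")."""
    for line in lines:
        if not line.startswith("<"):
            yield line


# Small in-memory example; real input would be the extracted wiki files.
sample = ['<doc id="1" title="x">\n', "Some text.\n", "</doc>\n"]
print(list(strip_markup_lines(sample)))  # prints ['Some text.\n']
```

Because the function is a generator, it can be applied to an open file object without loading the whole dump into memory.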

If you run into any problems or need any help, please do not hesitate to contact me (r.van.der.goot@rug.nl).

Compilation

Edit icmconf to select the right C++ compiler version if necessary, then:

> cd src
> icmbuild

If icmbuild is not available and you do not want to install it (sudo apt-get install icmake), compile directly with g++:

> cd src
> g++ --std=c++11 -Wall *cc */*cc -lpthread -L./headers -Wl,-rpath=./headers -laspell -o MoNoise

Run the system

Just run the binary to see the possible options:

> ./tmp/bin/binary
USAGE: ./monoise -m MODE -r MODELPATH [options]

Options:
  -b         --badspeller    Set bad spellers mode for aspell; gives higher
                             scores, but uses considerably more time and memory.

  -C         --caps          Consider capitals. Most corpora don't use capitals
                             in gold data, so by default the input is converted
                             to lowercase, and evaluation is done while ignoring
                             capitals.

  -c <arg>   --cands=<arg>   Specify the number of candidates outputted when
                             using RU or -p.

  -f <arg>   --feats=<arg>   Specify the feature groups to use. Should be the
                             same as the trained model! Expects a boolean
                             string, default: 111111111. See util/feats.txt for
                             possible features.

  -F <arg>   --feats2=<arg>  Specify the single features to use. Should be the
                             same as the trained model! Expects a boolean
                             string, default: 1111111111111111111. See
                             util/feats.txt for possible features.

  -g         --gold          Assume gold error detection. Cannot be used with
                             -m RUn, since it typically isn't available. This
                             should match during train/test!

  -h         --help          Print usage and exit.

  -i <arg>   --input=<arg>   Expects input in lexnorm (3 column) format: <word>
                             <spacefiller> <normalization>, when using TR, DE or
                             TE. For RU raw text is expected. Reads from stdin
                             if not used.

  -k <arg>   --known=<arg>   Normalize only to known words (1: only in train
                             corpus, 2: also in knowns).

  -l <arg>   --lang=<arg>    Specify the language code. MoNoise will then read
                             its configuration from config.<arg>, default is
                             "en"

  -L <arg>   --lookup=<arg>  Specify lookup file generated from another corpus.

  -m <arg>   --mode=<arg>    Where arg = TRain, TEst, DEv, RUn or INteractive
                             (Required); DEv is equal to TEst, but only uses
                             part of the corpus (based on the config file).

  -n <arg>   --nThreads=<arg>Number of threads used to train the classifier
                             (default=4).

  -o <arg>   --output=<arg>  File to write to, when TEsting it writes the
                             results, and when RUnning it writes the
                             normalization. Writes to stdout if not specified.

  -p <arg>   --parse=<arg>   Evaluate the parser. Argument should be the path to
                             the gold treebank.

  -r <arg>   --rf=<arg>      Path to the random forest classifier (required).

  -s <arg>   --seed=<arg>    Seed used for random forest classifier (default=5).

  -S         --syntactic     Do not normalize: n't ca 'm 're.

  -t         --tokenize      Enable rule based tokenization (probably only
                             usable with RU).

  -T <arg>   --trees=<arg>   Specify the number of trees used by random forest
                             classifier.

  -u         --unk           Consider only unknown words for normalization. The
                             list of known words can be specified in the config
                             file. Note that this should probably also match
                             during training and testing/running.

  -v         --verbose       Print debugging info. NF = Not Found, NN = Not
                             Normalized, WR = Wrong Ranking.

  -w         --weight        Extra weight given to original word, to tune the
                             precision/recall.

  -x         --xml           Read in XML format as often used for south-slavic
                             data (janes-norm).
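
The -i option above expects the three-column lexnorm format (<word> <spacefiller> <normalization>) for the TR, DE and TE modes. As a rough sketch of how such a file could be read, the Python below groups lines into sentences. It assumes tab-separated columns and an empty line between sentences; neither is stated in this README, so check your data first. The name read_lexnorm and the "-" filler in the sample are hypothetical.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (original word, gold normalization)


def read_lexnorm(lines: Iterable[str]) -> List[List[Token]]:
    """Group 3-column lines into sentences of (word, normalization) pairs.

    Assumptions (not confirmed by the README): columns are tab-separated
    and sentences are separated by empty lines. A non-empty line without
    exactly three columns raises ValueError.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # An empty line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        word, _filler, norm = line.split("\t")
        current.append((word, norm))
    if current:
        sentences.append(current)
    return sentences


sample = ["new\t-\tnew\n", "pix\t-\tpictures\n", "\n", "tomoroe\t-\ttomorrow\n"]
print(read_lexnorm(sample))
# prints [[('new', 'new'), ('pix', 'pictures')], [('tomoroe', 'tomorrow')]]
```

The middle column is ignored here, since the usage text describes it only as a space filler.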

Example run

If you simply want to run a pre-trained model on new data, use these commands:

> cd monoise/src
> icmbuild
> cd ../data
> curl www.let.rug.nl/rob/data/monoise/enData.tar.gz | tar xvz
> cd ../src
> echo "new pix comming tomoroe" | ./tmp/bin/binary -r ../data/enData/enModel -m RU

For another language the steps are roughly the same:

> cd ../data
> curl www.let.rug.nl/rob/data/monoise/nlData.tar.gz | tar xvz
> cd ../src
> echo "je vind da gwn lk" | ./tmp/bin/binary -m RU -r ../data/nlData/nlModel -l nl
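
If you prefer to call the binary from a script, the echo-and-pipe commands above can be wrapped in Python. This is only a sketch: it uses the -m, -r and -l options shown in the usage text and assumes the binary lives at ./tmp/bin/binary, as in the examples. The helper names build_command and normalize are hypothetical.

```python
import subprocess
from typing import List, Optional


def build_command(model: str, lang: Optional[str] = None,
                  binary: str = "./tmp/bin/binary") -> List[str]:
    """Build the argument list for RUn mode, using only -m, -r and -l."""
    cmd = [binary, "-m", "RU", "-r", model]
    if lang is not None:
        cmd += ["-l", lang]
    return cmd


def normalize(text: str, model: str, lang: Optional[str] = None) -> str:
    """Pipe raw text to the binary on stdin and return what it prints."""
    result = subprocess.run(build_command(model, lang), input=text,
                            capture_output=True, text=True, check=True)
    return result.stdout


print(build_command("../data/nlData/nlModel", "nl"))
```

From the src directory, normalize("je vind da gwn lk\n", "../data/nlData/nlModel", "nl") would then correspond to the Dutch example above. Passing the arguments as a list avoids shell quoting issues with the input text.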

Notes on running the parser

There are two ways to communicate with the parser. The first is through sockets: MoNoise automatically connects to 127.0.0.1:4447.

java -jar util/BerkeleyGraph.jar -gr enData/ewtwsj.gr -latticeWeight 2 -server 4447 &
./tmp/bin/binary -m TE -p treebank.ptb -i data.txt -r working/lexnorm -c 6 -a -u
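
Since the parser is started in the background, it may not be accepting connections yet when MoNoise starts. A small check like the sketch below can confirm that something is listening on 127.0.0.1:4447 first. The function name parser_is_listening is hypothetical. Note that the check opens and immediately closes a TCP connection, and this README does not say whether the parser server tolerates that, so treat it as a debugging aid only.

```python
import socket


def parser_is_listening(host: str = "127.0.0.1", port: int = 4447,
                        timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


if not parser_is_listening():
    print("Nothing is listening on 127.0.0.1:4447; is the parser running?")
```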

A simpler way might be to use a pipeline approach:

./tmp/bin/binary -m RU -i data -r working/lexnorm -c 6 -a -u | java -jar util/BerkeleyGraph.jar -gr enData/ewtwsj.gr -latticeWeight 2

Reference?

This model is described in detail in:

MoNoise: Modeling Noise Using a Modular Normalization System. (van der Goot & van Noord, 2017)

http://www.let.rug.nl/rob/doc/clin27.pdf

http://www.let.rug.nl/rob/doc/clin27.bib

The results of the paper can be reproduced by the following command:

> ./scripts/clin/all.sh

The parser is described in more detail in:

Rob van der Goot & Gertjan van Noord. 2017. Parser Adaptation for Social Media by Integrating Normalization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

http://www.let.rug.nl/rob/doc/acl17.bib

I made use of six other open source projects:

Contact

If you encounter any problems installing or running this software, or if you want to train the model using your own n-grams or word embeddings, don't hesitate to contact me: r.van.der.goot@rug.nl

Known Problems

On some setups the aspell library might be incompatible with your compiler. In this case I suggest first trying a different compiler (set in icmconf); otherwise the only option is to compile aspell yourself, using the patch (utils/aspell-0.60.6.1.patch.txt). If this gives difficulties, do not hesitate to contact me!

The parser is written in Java, and the normalization system communicates with it through sockets. If the parser stops working, kill the Java process manually.

If it still does not work, use -m RUn in combination with -c <num> to generate an n-best normalization output. Then parse the result:

./tmp/bin/binary -m RU -i data -r working/lexnorm -c 6 -a -u | java -jar util/BerkeleyGraph.jar -gr enData/ewtwsj.gr -latticeWeight 2