MoNoise Readme

MoNoise is a lexical normalization model for Twitter (but it could be used for other domains). In short, its task is to convert:

new pix comming tomoroe

into

new pictures coming tomorrow

MoNoise achieves state-of-the-art performance on several datasets, and can normalize multiple sentences per second on a single thread.

A short abstract of the model:

This model generates candidates using the Aspell spell checker and a word embedding model trained on Twitter data. Features from the generation step are then complemented with n-gram probabilities from canonical text and from the Twitter domain. A random forest classifier is used to rank the generated candidates. For more information, see the full paper:

You can also try the demo on:


Requirements

  • A recent C++ compiler (>=C++11)
  • Input data (or use -m INteractive)
  • For training your own model: a lot of memory; the random forest classifier is quite memory hungry. However, I did include some memory-saving options, which result in slightly lower performance (--badspeller and --known). Additionally, you could switch ranger (the random forest library) to save memory, but it then becomes very slow.
  • Instead, you could just download a pre-trained model here:
  • Running the system should work with 4-8 GB of RAM, depending on the size of the word embeddings/n-grams.
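For example, a training run with both memory-saving options enabled might look like this (sketch only; the file and directory names are placeholders, and the flags correspond to --badspeller and --known in the option list below):

```shell
> ./tmp/bin/binary -m TR -i train.txt -r my.model -d ../data/en -b -k 1
```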


At the moment MoNoise has models for:

  • English
  • Dutch
  • Slovenian
  • Serbian
  • Croatian
  • Spanish

The requirements for adding a new language are:

The steps are:

python3  -o ~/Downloads/slwiki-20170520-pages-articles-multistream.xml.bz2
cat */* | grep -v "^<" >
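The filtering step above can be illustrated on a tiny sample. This sketch assumes the dump was already extracted into per-document files containing `<doc>` markers (the layout produced by the common WikiExtractor tool; the exact extraction script is an assumption here):

```shell
# Create a tiny mock of the extracted-dump layout.
mkdir -p extracted/AA
cat > extracted/AA/wiki_00 <<'EOF'
<doc id="1" title="Example">
Plain sentence one.
Plain sentence two.
</doc>
EOF
# Keep only raw text lines (drop the <doc> markup) for n-gram training.
cat extracted/*/* | grep -v "^<" > wiki.txt
```

After this, wiki.txt contains only the plain sentences, ready for n-gram counting.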

If you run into any problems or need any help, please do not hesitate to contact me (


Installation

Edit icmconf to select the right C++ version if necessary, then:

> cd src
> icmbuild

If icmbuild is not available and you do not want to install it (sudo apt-get install icmake), compile directly:

> cd src
> g++ --std=c++11 -Wall *cc */*cc -lpthread -L./headers -Wl,-rpath=./headers -laspell -o MoNoise

Run the system

Just run the binary to see the possible options:

$ ./tmp/bin/binary
USAGE: ./monoise -m MODE -r MODELPATH -d DATADIR [options]

  -b         --badspeller    Set bad spellers mode for aspell; gives higher
                             scores, but uses considerably more time and memory.

  -c <arg>   --cands=<arg>   Specify the number of candidates output when
                             using RU or -p.

  -C         --caps          Consider capitals. Most corpora don't use capitals
                             in gold data, so by default the input is converted
                             to lowercase, and evaluation is done while ignoring
                             capitalization.
  -d <arg>   --dir=<arg>     Specify directory where the required data is
                             located. Needs: tweets.ngr.bin, wiki.ngr.bin,
                             aspell, aspell-model, w2v.bin and w2v.cache.bin.
                             See  for more information.

  -D <arg>   --dev=<arg>     Specify dev file (and test on it). Can only be
                             used with -m TR.

  -f <arg>   --feats=<arg>   Specify the feature groups to use. Should be the
                             same as for the trained model! Expects a boolean
                             string, default: 111111111. See util/feats.txt for
                             possible features.

  -F <arg>   --feats2=<arg>  Specify the single features to use. Should be the
                             same as for the trained model! Expects a boolean
                             string, default: 1111111111111111111. See
                             util/feats.txt for possible features.

  -g         --gold          Assume gold error detection. Cannot be used with
                             -m RUn, since gold error detection typically isn't
                             available there. This should match between
                             training and testing!

  -h         --help          Print usage and exit.

  -i <arg>   --input=<arg>   Expects input in lexnorm (3-column) format: <word>
                             <spacefiller> <normalization> when using TR, DE or
                             TE. For RU, raw text is expected. Reads from stdin
                             if not used.

  -k <arg>   --known=<arg>   Normalize only to known words (1: only words in
                             the train corpus, 2: also words in the known-words
                             list).

  -K <arg>   --kfold=<arg>   Number of folds to use when using -m KF.

  -m <arg>   --mode=<arg>    Where arg = TRain, TEst, RUn, INteractive, KFold or
                             DEmo (Required).

  -n <arg>   --nThreads=<arg> Number of threads used to train the classifier.

  -o <arg>   --output=<arg>  File to write to, when TEsting it writes the
                             results, and when RUnning it writes the
                             normalization. Writes to stdout if not specified.

  -p <arg>   --parse=<arg>   Evaluate the parser. Argument should be the path
                             to the gold treebank. Make sure the java parser is
                             running.

  -r <arg>   --rf=<arg>      Path to the random forest classifier (required).

  -s <arg>   --seed=<arg>    Seed used for random forest classifier (default=5).

  -S         --syntactic     Do not normalize: n't ca 'm 're.

  -t         --tokenize      Enable rule-based tokenization (probably only
                             usable with RU).

  -T <arg>   --trees=<arg>   Specify the number of trees used by the random
                             forest.

  -u         --unk           Consider only unknown words for normalization. The
                             list of known words can be specified in the config
                             file. Note that this should probably also match
                             during training and testing/running.

  -v         --verbose       Print debugging info. NF = Not Found, NN = Not
                             Normalized, WR = Wrong Ranking.

  -w <arg>   --weight=<arg>  Extra weight given to the original word, to tune
                             the precision/recall trade-off.
  -W         --wordline      Use tokenized input/output; one word per line and a
                             newline for a sentence split.
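As an illustration of the -W wordline format, raw text can be converted to one word per line with standard tools (a sketch; the file name is a placeholder):

```shell
# One word per line; an empty line marks a sentence boundary (-W format).
echo "new pix comming tomoroe" | tr ' ' '\n' > wordline.txt
printf '\n' >> wordline.txt   # blank line = sentence split
```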

Example run

If you simply want to run a pre-trained model on new data, use these commands:

> cd monoise/src
> icmbuild
> mkdir ../data
> cd ../data
> curl | tar xvz
> cd ../src
> echo "new pix comming tomoroe" | ./tmp/bin/binary -r ../data/en/en.model -m RU -d ../data/en -f 111101111111

For another language the steps are roughly the same:

> curl | tar xvz
> cd ../src
> echo "je vind da gwn lk" | ./tmp/bin/binary -m RU -r ../data/nlData/nlModel -d ../data/nl -f 111101111111

Notes on running the parser

There are two ways to communicate with the parser. The first is through sockets; MoNoise automatically connects to a parse server started as follows:

java -jar util/BerkeleyGraph.jar -gr enData/ -latticeWeight 2 -server 4447 &
./tmp/bin/binary -m TE -p treebank.ptb -i data.txt -r working/lexnorm -c 6 -a -u

A simpler way might be to use a pipeline approach:

./tmp/bin/binary -m RU -i data -r working/lexnorm -c 6 -a -u | java -jar util/BerkeleyGraph.jar -gr enData/ -latticeWeight 2


This model is described in detail in:

Rob van der Goot and Gertjan van Noord. 2017. MoNoise: Modeling Noise Using a Modular Normalization System. Computational Linguistics in the Netherlands Journal.

The results of the paper can be reproduced by the following command:

> ./scripts/clin/

The parser is described in more detail in:

Rob van der Goot & Gertjan van Noord. 2017. Parser Adaptation for Social Media by Integrating Normalization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

I made use of six other open source projects:


Did you encounter any problems installing or running this software, or do you want to train the model using your own n-grams/word embeddings? Don't hesitate to contact me:

Known Problems

On some setups the aspell library might be incompatible with your compiler. In this case I suggest first trying a different compiler (in icmconf); otherwise the only option is to compile aspell yourself, using the included patch (utils/aspell-):

tar -zxvf aspell-
cd aspell-
cp ~/projects/monoise/utils/aspell- .
patch -p1 < aspell-
cp .libs/ .libs/ .libs/ interfaces/cc/aspell.h ~/projects/monoise/src/aspgen/

The parser is written in Java; the normalization system communicates with it through sockets. Kill the java program manually if the parser stops responding.
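If the parser hangs, the leftover Java process can be stopped with, for example:

```shell
# Stop a stuck parser process (assuming it was started from
# util/BerkeleyGraph.jar as shown above).
pkill -f BerkeleyGraph.jar
```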

If it still does not work, use -m RUn in combination with -c <num> to generate n-best normalization output, then parse the result:

./tmp/bin/binary -m RU -i data -r working/lexnorm -c 6 -a -u | java -jar util/BerkeleyGraph.jar -gr enData/ -latticeWeight 2