MoNoise is a lexical normalization model for Twitter (but it could be used for other domains). In short, its task is to convert:
new pix comming tomoroe
into:
new pictures coming tomorrow
MoNoise achieves state-of-the-art performance on different datasets, and can normalize multiple sentences per second on a single thread.
A short abstract of the model:
This model generates candidates using the Aspell spell checker and a word embedding model trained on Twitter data. Features from the generation step are then complemented with n-gram probabilities from canonical text and from the Twitter domain. A random forest classifier ranks the generated candidates. For more information, see the full paper (draft version): www.let.rug.nl/rob/doc/clin27.pdf
- A recent c++ compiler (>=c++11)
- Input data (or use -m INteractive)
- For training your own model: a lot of memory, since the random forest classifier is quite memory hungry. However, I did include some memory-saving options, which result in slightly lower performance (--badspeller and --known). Additionally, you could configure ranger to save memory, but it then becomes very slow.
- Instead, you could just download a model here: www.let.rug.nl/rob/data/monoise
- Running the system should work with 4 GB to 8 GB of RAM, depending on the size of the word embeddings and n-grams.
At the moment MoNoise has models for:
The requirements for adding a new language are:
- Annotated training data
- Raw noisy data
- Clean data: I recommend a Wikipedia dump: https://dumps.wikimedia.org/backup-index.html
- Aspell dictionary: ftp://ftp.gnu.org/gnu/aspell/dict/0index.html
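The annotated training data is read in the three-column lexnorm format that the usage text for -i describes (<word> <spacefiller> <normalization>). Below is a minimal sketch of producing that layout from a noisy sentence and its normalization. It assumes tab-separated columns, "-" as the filler value, one token per line, equal token counts on both sides, and a blank line after each sentence. None of these details are confirmed here, so compare against an existing corpus first. The function name to_lexnorm is hypothetical.

```shell
# Hypothetical helper: pair each noisy token with its normalization.
# Assumptions (not confirmed by MoNoise): tab separators, "-" as the
# filler column, and a blank line closing each sentence.
to_lexnorm() {
    awk -v noisy="$1" -v norm="$2" 'BEGIN {
        # split on runs of whitespace
        n = split(noisy, a, " ")
        m = split(norm, b, " ")
        # a one-to-one alignment is required for this simple layout
        if (n != m) { print "token count mismatch" > "/dev/stderr"; exit 1 }
        for (i = 1; i <= n; i++) printf "%s\t-\t%s\n", a[i], b[i]
        print ""
    }'
}

to_lexnorm "new pix comming tomoroe" "new pictures coming tomorrow"
```

Sentences whose normalization merges or splits words do not fit this one-to-one sketch and would need to be aligned by hand.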
The steps are:
- Generate the aspell word list (here for Croatian): aspell --data-dir /mnt/D/normalization/croatian/aspell-hr-0.51-0/ --dump master --lang hr > ~/projects/monoise/hrData/aspell
- If you get an error about iso- files, copy them from another language.
- Clean your noisy and canonical data (remove markup).
- Build n-grams (https://bitbucket.org/robvanderg/utils) for both domains.
- Train a word2vec model on the noisy data.
- Cache the word2vec model (https://bitbucket.org/robvanderg/utils).
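The cleaning step in the list above can be sketched as a small filter. This assumes the markup consists of HTML/XML-style tags and a handful of character entities. Wiki markup from a Wikipedia dump needs a dedicated extractor, which this sketch does not attempt. The function name strip_markup and the file names in the comment are hypothetical.

```shell
# Hypothetical helper: remove tags, decode a few entities, tidy whitespace.
# Reads from stdin and writes to stdout, for example:
#   strip_markup < noisy.raw > noisy.clean
strip_markup() {
    sed -e 's/<[^>]*>//g' \
        -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&quot;/"/g' \
        -e 's/&amp;/\&/g' \
        -e 's/[[:space:]][[:space:]]*/ /g' \
        -e 's/^ //' -e 's/ $//' \
        -e '/^$/d'
}

printf '<p>new  <b>pix</b> &amp; more</p>\n\n' | strip_markup
```

The &amp; entity is decoded last so that text such as &amp;lt; is not decoded twice.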
If you run into any problems or need any help, please do not hesitate to contact me (email@example.com)
Edit icmconf to select the right C++ compiler if necessary, then:
> cd src
> icmbuild
If icmbuild is not available and you do not want to install it (sudo apt-get install icmake):
> cd src
> g++ --std=c++11 -Wall *cc */*cc -lpthread -L./headers -Wl,-rpath=./headers -laspell -o MoNoise
Run the system
Just run the binary to see the possible options:
p270396@vesta1:src$ ./tmp/bin/binary
USAGE: ./monoise -m MODE -r MODELPATH [options]

Options:
  -b        --badspeller
      Set bad spellers mode for aspell; gives higher scores, but uses considerably more time and memory.
  -C        --caps
      Consider capitals. Most corpora don't use capitals in gold data, so by default the input is converted to lowercase, and evaluation is done while ignoring capitals.
  -c <arg>  --cands=<arg>
      Specify the number of candidates outputted when using RU or -p.
  -f <arg>  --feats=<arg>
      Specify the feature groups to use. Should be the same as the trained model!. Expects a boolean string, default: 111111111. See util/feats.txt for possible features.
  -F <arg>  --feats2=<arg>
      Specify the single features to use. Should be the same as the trained model!. Expects a boolean string, default: 1111111111111111111. See util/feats.txt for possible features.
  -g        --gold
      Assume gold error detection. Can not be used with -m RUn, since it typically isnt available. This should match during train/test!
  -h        --help
      Print usage and exit.
  -i <arg>  --input=<arg>
      expects input in lexnorm (3 collumn) format: <word> <spacefiller> <normalization>, when using TR, DE or TE. For RU raw text is expected. Reads from stdin if not used.
  -k <arg>  --known=<arg>
      Normalize only to known words (1: only in train corpus, 2: also in knowns.
  -l <arg>  --lang=<arg>
      Specify the language code. MoNoise will then read its configuration from config.<arg>, default is "en"
  -L <arg>  --lookup=<arg>
      Specify lookup file generated from another corpus.
  -m <arg>  --mode=<arg>
      Where arg = TRain, TEst, DEv, RUn or INteractive (Required); DEv is equal to TEst, but only uses part of the corpus (based on the config file).
  -n <arg>  --nThreads=<arg>
      Number of threads used to train the classifier (default=4).
  -o <arg>  --output=<arg>
      File to write to, when TEsting it writes the results, and when RUnning it writes the normalization. Writes to stdout if not specified.
  -p <arg>  --parse=<arg>
      Evaluate the parser. Argument should be the path to the gold treebank.
  -r <arg>  --rf=<arg>
      Path to the random forest classifier (required).
  -s <arg>  --seed=<arg>
      Seed used for random forest classifier (default=5).
  -S        --syntactic
      Do not normalize: n't ca 'm 're.
  -t        --tokenize
      Enable rule based tokenization (probably only usable with RU.
  -T <arg>  --trees=<arg>
      Specify the number of trees used by random forest classifier.
  -u        --unk
      Consider only unknown words for normalization. The list of known words can be specified in the config file. Note that this should probably also match during training and testing/running.
  -v        --verbose
      Print debugging info. NF = Not Found, NN = Not Normalized, WR = Wrong Ranking.
  -w        --weight
      Extra weight given to original word, to tune the precision/recall.
  -x        --xml
      Read in XML format as often used for south-slavic data (janes-norm).
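The -f and -F options take a boolean string with one character per feature group or single feature. For ablation runs it can be handy to generate such a string with one position switched off. A minimal sketch follows. The helper name feat_mask is hypothetical, and which position corresponds to which feature is defined in util/feats.txt, not here.

```shell
# Hypothetical helper: print a string of $1 ones with position $2
# (1-based) set to zero. A position of 0 leaves every feature enabled.
feat_mask() {
    awk -v n="$1" -v off="$2" 'BEGIN {
        for (i = 1; i <= n; i++) printf "%s", (i == off ? "0" : "1")
        print ""
    }'
}

feat_mask 9 3   # prints 110111111
```

Because the usage text requires the features to match the trained model, the same string would have to be passed with -f both when training and when testing or running.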
If you simply want to run a pre-trained model on new data, use these commands:
> cd monoise/src
> icmbuild
> cd ../data
> curl www.let.rug.nl/rob/data/monoise/enData.tar.gz | tar xvz
> cd ../src
> echo "new pix comming tomoroe" | ./tmp/bin/binary -r ../data/enData/chenliLow.forest -m RU
For another language the steps are roughly the same:
> curl www.let.rug.nl/rob/data/monoise/nlData.tar.gz | tar xvz
> cd ../src
> echo "je vind da gwn lk" | ./tmp/bin/binary -m RU -r ../data/nlData/nlModelLow -l nl
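To normalize a whole file rather than a single echoed sentence, the -i and -o options from the usage text can take the place of the pipe. The sketch below only assembles and prints the command, so it can be inspected before it is run from the src directory. The function name monoise_run_cmd and the file names tweets.txt and tweets.norm are hypothetical. The binary path, model path and flags are the ones used in this README. Paths containing spaces are not handled.

```shell
# Hypothetical helper: print a MoNoise command line for RUn mode.
# $1 = model path, $2 = input file, $3 = output file,
# $4 = optional language code (the documented default is "en").
monoise_run_cmd() {
    model="$1"; input="$2"; output="$3"; lang="${4:-en}"
    printf './tmp/bin/binary -m RU -r %s -i %s -o %s -l %s\n' \
        "$model" "$input" "$output" "$lang"
}

monoise_run_cmd ../data/nlData/nlModelLow tweets.txt tweets.norm nl
```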
This model is described in detail in:
MoNoise: Modeling Noise Using a Modular Normalization System. (van der Goot & van Noord, 2017)
The results of the paper can be reproduced by the following command:
The parser is described in more detail in:
Rob van der Goot & Gertjan van Noord. 2017. Parser Adaptation for Social Media by Integrating Normalization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
I made use of six other open source projects:
- word2vec: https://code.google.com/archive/p/word2vec/
- aspell: http://aspell.net/
- ranger: https://github.com/imbs-hl/ranger
- the lean mean c++ option parser: http://optionparser.sourceforge.net/
- evalb: http://nlp.cs.nyu.edu/evalb/
- Berkeleyparser: https://github.com/slavpetrov/berkeleyparser
If you encounter any problems installing or running this software, or want to train the model using your own n-grams or word embeddings, don't hesitate to contact me: firstname.lastname@example.org
Known Problems
On some setups the aspell library might be incompatible with your compiler. In this case I suggest first changing the compiler (in icmconf); if that does not help, the only option is to compile aspell yourself, using the patch (utils/aspell-0.60.6.1.patch.txt). If this gives difficulties, do not hesitate to contact me!
The parser is written in Java; the normalization system communicates with it through sockets. Kill the Java program manually if the parser stops working.
If it still does not work, use -m RUn in combination with -c <num> to generate an n-best normalization output. Then parse the result:
./tmp/bin/binary -m RU -i data -r working/lexnorm -c 6 -a -u | java -jar util/BerkeleyGraph.jar -gr enData/ewtwsj.gr -latticeWeight 2