# MoNoise Readme #

MoNoise is a lexical normalization model for Twitter (but it could be used for other domains). In short, its task is to convert:

new pix comming tomoroe

to:

new pictures coming tomorrow

MoNoise achieves state-of-the-art performance on different datasets, and can normalize multiple sentences per second on a single thread.

A short abstract of the model:

This model generates candidates using the Aspell spell checker and a word embedding model trained on Twitter data. Features from the generation are then complemented with n-gram probabilities of canonical text and the Twitter domain. A random forest classifier is used to rank the generated candidates.

For more information see the full paper (draft version):

You can also try the demo on:

### Requirements ###

* A recent C++ compiler (>= C++11)
* Input data (or use -m INteractive)
* For training your own model: a lot of memory, since the random forest classifier is quite memory hungry. However, I did include some memory saving options, which result in slightly lower performance (--badspeller and --known). Additionally, you could switch ranger to save memory, but it becomes very slow.
* Instead, you could just download a model here:
* Running the system should work with 4 GB to 8 GB of RAM, depending on the size of the word embeddings/n-grams.
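The generate-then-rank idea from the abstract above can be illustrated with a toy Python sketch. This is not MoNoise's implementation: `difflib` stands in for Aspell, the hypothetical `RELATED` table stands in for the word embeddings, and a hand-written `score` function stands in for the trained random forest. The vocabulary and all names are made up for illustration.

```python
import difflib

# Hypothetical toy resources. MoNoise itself relies on Aspell, word embeddings
# and n-gram models; none of those are used here.
VOCAB = ["new", "pictures", "coming", "tomorrow", "pizza", "common"]
RELATED = {"pix": ["pictures"]}  # stand-in for embedding-based candidates


def generate(word):
    """Collect candidates: the word itself, close spellings, related words."""
    cands = {word}
    cands.update(difflib.get_close_matches(word, VOCAB, n=3, cutoff=0.6))
    cands.update(RELATED.get(word, []))
    return sorted(cands)


def score(word, cand):
    """Hand-written score standing in for a trained ranking classifier."""
    similarity = difflib.SequenceMatcher(None, word, cand).ratio()
    in_vocab = 1.0 if cand in VOCAB else 0.0
    related = 1.0 if cand in RELATED.get(word, []) else 0.0
    return similarity + in_vocab + related


def normalize(sentence):
    """Replace every token by its highest-scoring candidate."""
    out = []
    for word in sentence.split():
        out.append(max(generate(word), key=lambda c: score(word, c)))
    return " ".join(out)


if __name__ == "__main__":
    # prints: new pictures coming tomorrow
    print(normalize("new pix comming tomoroe"))
```

Keeping the original word among the candidates is what allows a ranker to decide that a token needs no normalization at all; in this toy version an in-vocabulary word simply outscores its alternatives.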
### Languages ###

At the moment MoNoise has models for:

* English
* Dutch
* Slovenian
* Serbian
* Croatian
* Spanish

The requirements for adding a new language are:

* Annotated training data
* Raw noisy data
* Clean data: I recommend a wikidump:
* Aspell dictionary: the steps are:
    * Get aspell model from:
    * unpack: tar -xjf
    * build: cd aspell6-en-2018.04.16-0/ && ./configure && make
    * generate dictionary:
        * aspell --dict-dir=./aspell6-en-2018.04.16-0/ --lang=en dump master | aspell -l en expand --dict-dir=./aspell6-en-2018.04.16-0/ > aspell.en
* Clean your noisy and canonical data (remove markup); for wikipedia:

```
#!bash
python3 -o ~/Downloads/slwiki-20170520-pages-articles-multistream.xml.bz2
cat*/* | grep -v "^<" >
```

* Build n-grams ( for both domains
* Train a word2vec model on noisy data
* Cache the word2vec model (

If you run into any problems or need any help, please do not hesitate to contact me (

### Compilation ###

Edit icmconf to set the right C++ compiler version if necessary, then:

```
#!bash
> cd src
> icmbuild
```

If icmbuild is not available and you do not want to install it (sudo apt-get install icmake):

```
#!bash
> cd src
> g++ --std=c++11 -Wall *cc */*cc -lpthread -L./headers -Wl,-rpath=./headers -laspell -o MoNoise
```

### Run the system ###

Just run the binary to see the possible options:

```
p270396@vesta1:src$ ./tmp/bin/binary
USAGE: ./monoise -m MODE -r MODELPATH -d DATADIR [options]

Options:
  -b        --badspeller      Set bad spellers mode for aspell; gives higher scores, but uses considerably more time and memory.
  -c <arg>  --cands=<arg>     Specify the number of candidates outputted when using RU or -p.
  -C        --caps            Consider capitals. Most corpora don't use capitals in gold data, so by default the input is converted to lowercase, and evaluation is done while ignoring capitals.
  -d <arg>  --dir=<arg>       Specify directory were required data is located. Needs: tweets.ngr.bin, wiki.ngr.bin, aspell, aspell-model w2v.bin and w2v.cache.bin. See for more information.
  -D <arg>  --dev=<arg>       Specify dev file (and test on it). Can only used with -m TR.
  -f <arg>  --feats=<arg>     Specify the feature groups to use. Should be the same as the trained model!. Expects a boolean string, default: 111111111. See util/feats.txt for possible features.
  -F <arg>  --feats2=<arg>    Specify the single features to use. Should be the same as the trained model!. Expects a boolean string, default: 1111111111111111111. See util/feats.txt for possible features.
  -g        --gold            Assume gold error detection. Can not be used with -m RUn, since it typically isnt available. This should match during train/test!
  -h        --help            Print usage and exit.
  -i <arg>  --input=<arg>     Expects input in lexnorm (3 collumn) format: <word> <spacefiller> <normalization>, when using TR, DE or TE. For RU raw text is expected. Reads from stdin if not used.
  -k <arg>  --known=<arg>     Normalize only to known words (1: only in train corpus, 2: also in knowns.
  -K <arg>  --kfold=<arg>     Number of folds to use, when using -m KF, default=10.
  -m <arg>  --mode=<arg>      Where arg = TRain, TEst, RUn, INteractive, KFold or DEmo (Required).
  -n <arg>  --nThreads=<arg>  Number of threads used to train the classifier (default=4).
  -o <arg>  --output=<arg>    File to write to, when TEsting it writes the results, and when RUnning it writes the normalization. Writes to stdout if not specified.
  -p <arg>  --parse=<arg>     Evaluate the parser. Argument should be the path to the gold treebank. Make sure the java parser runs first.
  -r <arg>  --rf=<arg>        Path to the random forest classifier (required).
  -s <arg>  --seed=<arg>      Seed used for random forest classifier (default=5).
  -S        --syntactic       Do not normalize: n't ca 'm 're.
  -t        --tokenize        Enable rule based tokenization (probably only usable with RU.
  -T <arg>  --trees=<arg>     Specify the number of trees used by random forest classifier.
  -u        --unk             Consider only unknown words for normalization. The list of known words can be specified in the config file.
                              Note that this should probably also match during training and testing/running.
  -v        --verbose         Print debugging info. NF = Not Found, NN = Not Normalized, WR = Wrong Ranking.
  -w <arg>  --weight<arg>     Extra weight given to original word, to tune the precision/recall.
  -W        --wordline        Use tokenized input/output; one word per line and a newline for a sentence split.
```

### Example run ###

If you simply want to run a pre-trained model on new data, use these commands:

```
> cd monoise/src
> icmbuild
> mkdir ../data
> cd ../data
> curl | tar xvz
> cd ../src
> echo "new pix comming tomoroe" | ./tmp/bin/binary -r ../data/en/en.model -m RU -d ../data/en
```

For another language the steps are roughly the same:

```
> curl | tar xvz
> cd ../src
> echo "je vind da gwn lk" | ./tmp/bin/binary -m RU -r ../data/nlData/nlModel -d ../data/nl
```

### Notes on running the parser ###

There are two ways to communicate with the parser. The first is through sockets: start the parser as a server, and MoNoise connects to it automatically:

```
#!bash
java -jar util/BerkeleyGraph.jar -gr enData/ -latticeWeight 2 -server 4447 &
./tmp/bin/binary -m TE -p treebank.ptb -i data.txt -r working/lexnorm -c 6 -a -u
```

A simpler way might be to use a pipeline approach:

```
#!bash
./tmp/bin/binary -m RU -i data -r working/lexnorm -c 6 -a -u | java -jar util/BerkeleyGraph.jar -gr enData/ -latticeWeight 2
```

### Reference? ###

This model is described in detail in:

MoNoise: Modeling Noise Using a Modular Normalization System. (van der Goot & van Noord, 2017)

The results of the paper can be reproduced by the following command:

```
> ./scripts/clin/
```

The parser is described in more detail in:

Rob van der Goot & Gertjan van Noord. 2017. Parser Adaptation for Social Media by Integrating Normalization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

I made use of six other open source projects:

* word2vec:
* aspell:
* ranger:
* the lean mean c++ option parser:
* evalb:
* Berkeleyparser:

### Contact ###

Did you encounter any problems with the installation/running of this software? Or do you want to train the model using your own n-grams/word embeddings? Don't hesitate to contact me:

### Known Problems ###

On some setups the aspell library might be incompatible with your compiler. In this case I suggest you first try changing the compiler (in icmconf); otherwise the only option is to compile aspell yourself, using the patch (utils/aspell- If this gives difficulties, do not hesitate to contact me!

The parser is written in Java; the normalization system communicates with it through sockets. Kill the java program manually if the parser does not work anymore. If it still does not work, use -m RUn in combination with -c <num> to generate an n-best normalization output. Then parse the result:

```
#!bash
./tmp/bin/binary -m RU -i data -r working/lexnorm -c 6 -a -u | java -jar util/BerkeleyGraph.jar -gr enData/ -latticeWeight 2
```
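
When the socket setup misbehaves, it can help to check whether anything is still listening on the parser port before killing or restarting the Java process. The following is a minimal Python sketch, not part of MoNoise. The port 4447 is taken from the `-server 4447` example command in this README; the host `localhost` and the function name `parser_is_listening` are assumptions for illustration.

```python
import socket


def parser_is_listening(host="localhost", port=4447, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened.

    The default port mirrors the `-server 4447` example; the host is an
    assumption. A successful connection only shows that some process is
    accepting connections, not that it is the parser.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    if parser_is_listening():
        print("something is listening on the parser port")
    else:
        print("nothing is listening on the parser port")
```

If nothing is listening, the server was probably never started or has already exited. If something is listening but parsing still fails, killing the Java process manually and restarting it, as described above, is the next step.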