16-12-2016: A new version that performs better and runs much faster now exists; it will be released soon!


First run, like:


The only data that is not directly downloadable is the Google Ngram corpus. This dataset is necessary to replicate the results.

Now you can train and test the model with the command:


This script depends on scipy, numpy, sklearn, matplotlib and gensim.
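
Before running, it can help to confirm that these packages are importable in the current environment. The snippet below is a minimal sketch of such a check and is not part of this repository; `REQUIRED` and `missing_packages` are hypothetical names introduced only for illustration.

```python
import importlib.util

# Import names of the dependencies listed above.
# Note: scikit-learn is imported under the name "sklearn".
REQUIRED = ["scipy", "numpy", "sklearn", "matplotlib", "gensim"]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

This only checks that each package can be located; it does not verify versions.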

If you use this normalization model, please cite:

    author  = {van der Goot, Rob},
    title   = {Normalizing Social Media Texts by Combining Word Embeddings and Edit Distances in a Random Forest Regressor},
    publisher = {Normalisation and Analysis of Social Media Texts (NormSoMe)},
    year = {2016}