This system was submitted to the 2017 SemEval task. It is built from the SimpleTextClassifier project code.
The primary difference is that two features are introduced in addition to the n-gram (sparse) vectors: 1. Dense vectors built from pre-trained Twitter word embeddings. 2. Generalized sparse vectors, i.e. vectors of word cluster numbers.
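The two extra features can be sketched as follows. This is only an illustration under stated assumptions: the notes above do not say how a tweet's dense vector is derived from its word vectors, so averaging is assumed here, and the function names, the toy embedding table and the toy cluster table are all hypothetical.

```python
def average_embedding(tokens, embeddings, dim):
    """Average the vectors of in-vocabulary tokens (assumed pooling scheme).

    Returns a zero vector of length `dim` when no token has an embedding.
    """
    total = [0.0] * dim
    found = 0
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue  # out-of-vocabulary token, ignored
        total = [a + b for a, b in zip(total, vector)]
        found += 1
    if found == 0:
        return total
    return [value / found for value in total]


def cluster_tokens(tokens, word_to_cluster):
    """Replace each token by its cluster id, dropping tokens with no cluster."""
    return [word_to_cluster[t] for t in tokens if t in word_to_cluster]


# Toy stand-ins for the pre-trained embeddings and the word clusters.
toy_embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
toy_clusters = {"good": "0110", "great": "0110", "movie": "1010"}

tweet = ["good", "movie", "zzz"]
dense = average_embedding(tweet, toy_embeddings, dim=2)
generalized = cluster_tokens(tweet, toy_clusters)
```

The cluster ids can then be counted like ordinary tokens to obtain the generalized sparse vector, so that words sharing a cluster (here "good" and "great") map to the same feature.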
As in SimpleTextClassifier, runcrossvalidation_3_class.py is used to find optimal values for the cost parameter and the class weights, and runclassifier_3_class.py is used to run prediction on the test set.
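The search for a cost parameter and class weights could look roughly like the sketch below. It is not the content of runcrossvalidation_3_class.py: the classifier (scikit-learn's LinearSVC), the macro-F1 scoring, the candidate values and the toy tweets are all assumptions made for illustration, and `find_best_params` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def find_best_params(texts, labels, costs, class_weights, n_folds=3):
    """Cross-validate every (cost, class-weight) pair and return the best one."""
    # Assumed model: word uni/bigram counts fed to a linear SVM.
    pipeline = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
    param_grid = {
        "linearsvc__C": costs,
        "linearsvc__class_weight": class_weights,
    }
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=folds)
    search.fit(texts, labels)
    return search.best_params_, search.best_score_


# Toy 3-class data, four tweets per class (illustrative only).
texts = [
    "love this great phone", "great day so happy",
    "happy with the great service", "love love this song",
    "hate this awful phone", "awful day so sad",
    "sad about the awful service", "hate hate this song",
    "the phone is on the table", "meeting at noon today",
    "service starts on monday", "song released today",
]
labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

best_params, best_score = find_best_params(
    texts,
    labels,
    costs=[0.1, 1.0],
    class_weights=[None, {"positive": 1, "negative": 2, "neutral": 1}],
)
```

The values found this way would then replace the hard-coded ones before running the classifier on the test set.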
Requirements: pandas, sklearn, nltk, numpy
How to run: I have placed some hard-coded values for the weights in this file so that it can be run quickly. For best results, please run runcrossvalidation_3_class.py on your own data and use the optimal values it identifies in this code.
The directory './english_training/' contains the SemEval annotated files. They are available here (at competition time): http://alt.qcri.org/semeval2017/task4/index.php?id=data-and-tools
For dense word embeddings, pretrained vectors are used. We used the embeddings available here (at competition time): http://www.fredericgodin.com/software/
The binary model and the extracted package contents should be placed in './word2vec_twitter_model/'. Please change the file path in the variable model_path in the script code to point to the right folder.
Cluster features are generated using the CMU Twitter clusters file (50mpaths2), which should be placed in the project folder. It is available here (at competition time): http://www.cs.cmu.edu/~ark/TweetNLP/clusters/50mpaths2
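A minimal sketch of reading the cluster file into a lookup table is shown below. It assumes each line is tab-separated with the cluster bit path first, the word second and a count third; please check this against the downloaded file. `load_clusters` is a hypothetical helper, and the sample lines are made up.

```python
import io


def load_clusters(lines):
    """Build a word -> cluster-path mapping from tab-separated lines."""
    word_to_cluster = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        cluster_path, word = parts[0], parts[1]
        word_to_cluster[word] = cluster_path
    return word_to_cluster


# In-memory sample standing in for the real file (made-up entries).
sample = io.StringIO("0000\tlol\t120\n0000\thaha\t95\n\n0110\tgreat\t300\n")
clusters = load_clusters(sample)

# With the real file in the project folder, the call would be:
# with open("50mpaths2", encoding="utf-8") as handle:
#     clusters = load_clusters(handle)
```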
Happy running. Citation details to be added soon...
Note: I will not be actively maintaining this code. Please email me at: firstname.lastname@example.org for questions.