This system was submitted to the SemEval-2017 task. It is built on the SimpleTextClassifier project code.

The primary difference is that two feature types are introduced in addition to the n-gram (sparse) vectors:
1. Dense vectors built from pre-trained Twitter word embeddings.
2. Generalized sparse vectors: vectors of word cluster IDs.
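A minimal sketch of how these feature types can be combined into one matrix, assuming a toy embedding table in place of the real pre-trained Twitter vectors (the `embeddings` dict and `dense_features` helper are illustrative, not the project's actual code):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy embedding table; the real system loads pre-trained Twitter word2vec vectors.
EMB_DIM = 4
embeddings = {
    "good": np.array([0.9, 0.1, 0.0, 0.2]),
    "bad": np.array([-0.8, 0.2, 0.1, 0.0]),
    "movie": np.array([0.1, 0.5, -0.3, 0.4]),
}

def dense_features(texts):
    """Average the embedding vectors of the known words in each text."""
    rows = []
    for text in texts:
        vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
        rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(EMB_DIM))
    return np.vstack(rows)

texts = ["good movie", "bad movie"]
ngram_vec = CountVectorizer(ngram_range=(1, 2))
X_sparse = ngram_vec.fit_transform(texts)      # sparse n-gram counts
X_dense = dense_features(texts)                # averaged embedding vectors
X = hstack([X_sparse, csr_matrix(X_dense)])    # combined feature matrix
```

The sparse n-gram block and the dense embedding block are concatenated column-wise, so a linear classifier sees both feature spaces at once.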

As in SimpleTextClassifier, a grid-search script is used to find optimal values for the cost parameter and the class weights, and a separate script performs the prediction on the test set.
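A minimal sketch of such a parameter search, assuming scikit-learn's `GridSearchCV` over a `LinearSVC` (the estimator, grid values, and toy data are assumptions, not the project's exact setup):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Toy imbalanced data standing in for the SemEval training set.
X, y = make_classification(n_samples=200, n_features=20,
                           weights=[0.7, 0.3], random_state=0)

# Search over the cost parameter C and candidate class weightings.
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "class_weight": [None, "balanced", {0: 1, 1: 2}],
}
search = GridSearchCV(LinearSVC(max_iter=5000), param_grid,
                      cv=3, scoring="f1_macro")
search.fit(X, y)

# The best parameters found here are what would be hard-coded for a quick run.
preds = search.best_estimator_.predict(X)
```

The best `C` and `class_weight` found this way correspond to the hard-coded values mentioned in the run instructions below.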

Requirements: pandas, scikit-learn, nltk, numpy

How to run: hard-coded values for the weights are included in this file so that it can be run quickly... For best results, please rerun the parameter search on your own data and plug the identified optimal values into this code.

The directory './english_training/' contains the SemEval annotated files. They were available here (at competition time):

For the dense features, pre-trained word embeddings are used. We used the embeddings available here (at competition time):

The binary model and the extracted package contents should be placed in './word2vec_twitter_model/'. Please change the file path in the variable `model_path` in the code below to point to the right folder.
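For illustration, here is a minimal, self-contained reader for word2vec *text*-format files (the actual Twitter model is distributed as a binary file and is loaded with its own reader or a library such as gensim; the tiny demo file written below is an assumption so the sketch can run on its own):

```python
import numpy as np
from pathlib import Path

# Assumed location; point this at the extracted Twitter model folder.
model_path = Path("./word2vec_twitter_model/tiny_demo.txt")

# Write a tiny word2vec-style text file so this sketch is self-contained.
model_path.parent.mkdir(exist_ok=True)
model_path.write_text("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")

def load_word2vec_text(path):
    """Minimal reader for word2vec text format: a header line with
    '<vocab_size> <dim>', then one 'word v1 v2 ...' line per word."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        _vocab_size, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:1 + dim], dtype=np.float32)
    return vectors

vectors = load_word2vec_text(model_path)
```

Each word then maps to a fixed-length float vector, which is what the dense feature extraction averages per tweet.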

Cluster features are generated using the CMU Twitter word clusters, available here (at competition time): Please place this file in the project folder.
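The CMU TweetNLP cluster file typically has tab-separated lines of the form `<binary cluster path>\t<word>\t<count>`. A minimal sketch of turning it into cluster-ID tokens, with an in-memory sample standing in for the real file (the sample contents and the `UNK` fallback are assumptions):

```python
import io

# Simulated excerpt of the cluster file: cluster path, word, count.
sample = io.StringIO(
    "1111\tgood\t1000\n"
    "1110\tbad\t900\n"
    "0101\tmovie\t800\n"
)

word2cluster = {}
for line in sample:
    cluster, word, _count = line.rstrip("\n").split("\t")
    word2cluster[word] = cluster

def cluster_tokens(text):
    """Replace each known word with its cluster ID, giving the
    'generalized' token sequence used for the sparse cluster features."""
    return [word2cluster.get(w, "UNK") for w in text.lower().split()]

tokens = cluster_tokens("good movie tonight")
```

The resulting cluster-ID sequence can then be vectorized exactly like ordinary n-gram tokens, which is what makes these "generalized" sparse vectors.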

Happy running. Citation details to be added soon...

Note: I will not be actively maintaining this code. For questions, please email me at: