
This system was submitted to SemEval-2017 Task 4. It is built on the code of the SimpleTextClassifier project.

The primary difference is that two features are introduced in addition to the n-gram (sparse) vectors: 1. Dense vectors built from pre-trained Twitter word embeddings. 2. Generalized sparse vectors, i.e. vectors of word cluster numbers.
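The three feature types can be combined into one matrix roughly as follows. This is a minimal sketch, not the code in this repository: the toy tweets, toy_embeddings, toy_clusters, and the helper names are all hypothetical, and averaging the word vectors of a tweet is an assumed way of building the dense vector.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy inputs; every value here is made up for illustration.
tweets = ["good phone love it", "bad phone hate it"]
toy_embeddings = {"good": np.array([1.0, 0.0]), "bad": np.array([0.0, 1.0])}
toy_clusters = {"good": "0110", "love": "0110", "bad": "1001", "hate": "1001"}

# 1. Sparse n-gram vectors (unigrams and bigrams).
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))
ngram_matrix = ngram_vectorizer.fit_transform(tweets)


def dense_vector(text, embeddings, dim):
    """Average the embeddings of the known words (assumed pooling strategy)."""
    vectors = [embeddings[w] for w in text.split() if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)


# 2. Dense vectors from the (toy) word embeddings.
dense_matrix = np.vstack([dense_vector(t, toy_embeddings, 2) for t in tweets])


def to_cluster_text(text, clusters):
    """Replace each known word by its cluster id and drop unknown words."""
    return " ".join(clusters[w] for w in text.split() if w in clusters)


# 3. Generalized sparse vectors: counts of cluster ids instead of words.
cluster_vectorizer = CountVectorizer(token_pattern=r"\S+")
cluster_matrix = cluster_vectorizer.fit_transform(
    [to_cluster_text(t, toy_clusters) for t in tweets]
)

# Stack the three blocks column-wise into a single feature matrix.
features = hstack([ngram_matrix, csr_matrix(dense_matrix), cluster_matrix]).tocsr()
print(features.shape)
```

Words that share a cluster ("good" and "love" above) land in the same column of the cluster block, which is what lets the classifier generalize beyond the exact n-grams seen in training.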

As in SimpleTextClassifier, runcrossvalidation_3_class.py is used to find optimal values for the cost parameter and the class weights, and runclassifier_3_class.py is used to perform the prediction on the test set.
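The kind of search that the cross-validation step performs can be illustrated with scikit-learn's GridSearchCV. This is only a sketch under assumptions: LinearSVC stands in for whatever classifier the scripts use, and the toy data, the parameter grid, and the macro-F1 scoring are hypothetical choices, not values taken from runcrossvalidation_3_class.py.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy 3-class data (hypothetical); replace with the annotated tweets.
texts = [
    "love this phone", "great day today", "so happy now", "best show ever",
    "hate this phone", "awful day today", "so sad now", "worst show ever",
    "the phone is here", "it is tuesday today", "going home now", "the show is on",
]
labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

# n-gram features followed by a linear SVM (assumed classifier).
pipeline = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())

# Candidate cost values and class weights; these grids are illustrative only.
param_grid = {
    "linearsvc__C": [0.1, 1, 10],
    "linearsvc__class_weight": [
        None,
        "balanced",
        {"negative": 2, "neutral": 1, "positive": 1},
    ],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=3),
)
search.fit(texts, labels)

# The best cost and class weights can then be copied into the classifier script.
print(search.best_params_)
```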

Requirements: pandas, sklearn, nltk, numpy

How to run: I have placed some hard-coded values for the weights in this file so that it can be run quickly. For best results, please run runcrossvalidation_3_class.py on your data and use the optimal values it identifies in this code.

The directory './english_training/' contains the SemEval annotated files. They are available here (at competition time): http://alt.qcri.org/semeval2017/task4/index.php?id=data-and-tools

For dense word embeddings, pretrained vectors are used. We used the embeddings available here (at competition time): http://www.fredericgodin.com/software/

The binary model and the extracted package contents should be placed in './word2vec_twitter_model/'. Please change the file path in the variable model_path in the code to point to the right folder.
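One way to set model_path and fail early when the model is missing is sketched below. The helper resolve_model_path and the file name in MODEL_FILENAME are hypothetical; use the actual name of the binary file from the downloaded package.

```python
import os

# Folder named in this README; the file name is a placeholder (assumption).
MODEL_FOLDER = "./word2vec_twitter_model/"
MODEL_FILENAME = "word2vec_twitter_model.bin"


def resolve_model_path(folder, filename):
    """Return the full path to the model file, or raise if it does not exist."""
    path = os.path.join(folder, filename)
    if not os.path.isfile(path):
        raise FileNotFoundError(
            "Embedding model not found at %r; check model_path." % path
        )
    return path


# Usage (raises FileNotFoundError until the model has been downloaded):
# model_path = resolve_model_path(MODEL_FOLDER, MODEL_FILENAME)
```

Checking the path up front gives a clear error message before any time is spent on feature extraction.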

Cluster features are generated using the CMU Twitter clusters available here (at competition time): http://www.cs.cmu.edu/~ark/TweetNLP/clusters/50mpaths2. Please place this file in the project folder.
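Reading the cluster file into a word-to-cluster lookup could look like the sketch below. It assumes each line of 50mpaths2 has the form cluster path, word, count, separated by tabs; the function names and the sample lines are hypothetical and are not taken from the project code.

```python
import io


def load_clusters(lines):
    """Build a word -> cluster id mapping from 'path<TAB>word<TAB>count' lines."""
    clusters = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:  # skip blank or malformed lines
            clusters[parts[1]] = parts[0]
    return clusters


def tweet_to_clusters(tokens, clusters):
    """Map tokens to cluster ids, dropping tokens that have no cluster."""
    return [clusters[t] for t in tokens if t in clusters]


# In-memory sample standing in for the real file (made-up entries).
sample = io.StringIO("0000\tlol\t120\n0000\tlmao\t80\n0101\tsad\t42\n")
clusters = load_clusters(sample)
cluster_tokens = tweet_to_clusters(["lol", "so", "sad"], clusters)
print(cluster_tokens)

# With the real file placed in the project folder:
# with open("50mpaths2", encoding="utf-8") as f:
#     clusters = load_clusters(f)
```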

Happy running. Citation details to be added soon...

Note: I will not be actively maintaining this code. Please email me at: abeed@upenn.edu for questions.