
README

What is this repository for?

This repository is for the Roadrunner "Efficient and Scalable Processing of Big Data" project. The results of this project were published at SENTIRE 2016.

  • SentimentAnalysis:

    -- A Python project responsible for the preprocessing and classification of tweets (scikit-learn, NLTK, GATE POS-tag model).

  • Storm:

    -- A Storm topology for real-time processing and classification of streaming tweets.

  • Web:

    -- A Django project to visualize the results of the streaming process.

How do I get set up?

  • Summary of set up: Download Apache Storm and extract it to a directory, then add that directory's bin folder to your PATH: export PATH="/path/to/apache-storm-0.9.7/bin:$PATH"

  • Configuration

    • Download Apache Storm
    • Extract the zip/tar to a folder and add its bin directory to your PATH
    • To run it in local mode:

      • Modify conf/storm.yaml:

        ``` yaml
        storm.zookeeper.servers:
          - "localhost"

        nimbus.host: "localhost"
        ```

      • Open Topology.java and switch the submission code to local mode:

        ``` java
        // The following line submits the topology to the configured cluster.
        // Do not use this while in local mode.
        /* StormSubmitter.submitTopology("TweetTopology", conf, builder.createTopology()); */

        // The following lines run Storm in local mode.
        final LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("TweetTopology", conf, builder.createTopology());

        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                cluster.killTopology("TweetTopology");
                cluster.shutdown();
            }
        });
        ```

      • In the multilang folder, modify Storm/twitter_storm/multilang/resources/TweetUtils/helpers/static_paths.py so that the paths reflect your absolute paths.

      • In the multilang folder, modify Storm/twitter_storm/multilang/resources/TweetUtils/database/init.py so that the credentials reflect your MySQL set-up*.

After that, just run Topology.java. You should be able to see the logs of the process in the console.

*Notes on this will come soon.

  • Dependencies

  • Database configuration

MySQL:

A MySQL instance must be running and accessible to the machine that will run the PreprocessingBolt. Configure the database credentials and host accordingly.

Also, in the SentimentAnalysis/FigurativeTextAnalysis/data/dumps folder, extract MySQL_final.sql.tar.gz and run the SQL script it contains. This creates the schema and fills the database with the respective data.
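To verify that the instance is reachable with the credentials you configured and that the dump was loaded, a minimal check along these lines can help (this assumes the pymysql package; the project itself may use a different MySQL driver, and all credentials below are placeholders):

``` python
# Quick connectivity check for the MySQL instance used by the PreprocessingBolt.
# Assumes the third-party package `pymysql`; all credentials below are placeholders.
import pymysql

connection = pymysql.connect(
    host="localhost",       # host running MySQL
    user="storm_user",      # placeholder user
    password="secret",      # placeholder password
    database="tweets_db",   # placeholder database created by MySQL_final.sql
)
try:
    with connection.cursor() as cursor:
        cursor.execute("SHOW TABLES")        # list the tables created by the dump
        for (table_name,) in cursor.fetchall():
            print(table_name)
finally:
    connection.close()
```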

MongoDB:

A MongoDB instance must be running and accessible to the machine that will run the MongoBolt (configure the MongoBolt with the proper credentials and host).
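A similarly minimal reachability check for MongoDB, assuming the pymongo package (the connection URI is a placeholder):

``` python
# Quick connectivity check for the MongoDB instance used by the MongoBolt.
# Assumes the third-party package `pymongo`; the URI below is a placeholder.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/", serverSelectionTimeoutMS=5000)
client.admin.command("ping")            # raises if the server is unreachable
print(client.list_database_names())     # confirm the bolt's database is visible
```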

Datasets

Since the goal is to classify tweets from the live stream, it seems reasonable to train the classifier on a dataset that contains as much variety as possible, for example tweets that contain irony.

The idea is to combine known datasets to construct a relatively balanced dataset:

- SemEval 2015 Task 11 (12,529 tweets – ironic/sarcastic/metaphoric and others – highly skewed towards negative). The whole dataset is used. The score scale was transformed from [-5, 5] to distinct values in [-1, 1] (see the sketch after this list).
- SemEval 2013 Task 2B (9,059 tweets – a general-purpose dataset reused in subsequent SemEval tasks). The whole dataset is used. “NeutralORIrrelevant” was considered neutral (0).
- Twitter Sentiment Classification using Distant Supervision (http://help.sentiment140.com/for-students):
    - 1.6M tweets annotated as positive or negative based on emoticons.*
      Since emoticons have been stripped from the dataset, they are restored in the tweets we use, in order to utilize the emoticon feature and compare results with and without it (this is in accordance with the preprocessing mentioned in their paper).
      The [0, 4] scale was converted to [-1, 1].
      Also, tweets scored as both positive and negative were removed because they could not be safely considered positive, negative or neutral.
    - Manual data: 408 hand-annotated positive, negative and neutral tweets that the authors used to test their method trained on the 1.6M dataset (the test covered the positive-negative classes only). The whole dataset is used.
- Towards Building Large-Scale Distributed Systems for Twitter Sentiment Analysis:
    - 30,000 tweets manually labeled as neutral (~26,300 unique)
    - 1,000 tweets manually labeled as positive, negative or neutral
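The scale conversions mentioned above (SemEval 2015 scores in [-5, 5] and the Sentiment140 scale [0, 4], both mapped to [-1, 1]) can be illustrated with a small helper. This is only a sketch of the mapping logic; the exact cut-off rules used by the project are an assumption here:

``` python
# Illustrative sketch of the label-scale conversions described above.
# The cut-off points are assumptions, not necessarily the project's exact rules.

def semeval_2015_label(score: float) -> int:
    """Map a SemEval 2015 Task 11 score in [-5, 5] to a distinct label in {-1, 0, 1}."""
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0

def sentiment140_label(score: int) -> int:
    """Map the Sentiment140 scale [0, 4] (0 = negative, 2 = neutral, 4 = positive) to {-1, 0, 1}."""
    if score == 0:
        return -1
    if score == 4:
        return 1
    return 0

print(semeval_2015_label(-3.2))  # -1
print(sentiment140_label(4))     #  1
```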

Preprocessing

Morphological Feature Extraction (a sketch of a few of these boolean flags follows the list):

__OH_SO__: Presence/absence of “Oh, so …”, which may indicate ironic/sarcastic text.
__DONT_YOU__: Presence/absence of “Don’t you …”, which may indicate ironic/sarcastic text.
__AS_GROUND_AS_VEHICLE__: Presence/absence of “As * as * …”, which may indicate ironic/sarcastic text (reference).
__CAPITAL__: Presence/absence of capitalized words with more than two letters.
Hashtag sentiment calculation: the last hashtag is weighted more when calculating the total hashtag polarity (the following features are mutually exclusive):
    __HT__: Boolean
    __HT_POS__: Boolean
    __HT_NEG__: Boolean
__LINK__: Presence/absence of URLs, which may indicate a neutral tweet (reference?).
__POS_SMILEY__: Presence/absence of common positive emoticons.
__NEG_SMILEY__: Presence/absence of common negative emoticons.
__NEGATION__: Presence/absence of negating words such as not, can’t etc.
__REFERENCE__: Presence/absence of user mentions like @user.
__questionmark__: Presence/absence of ?
__exclamation__: Presence/absence of !
__fullstop__: Presence/absence of more than two consecutive dots.
__RT__: Presence/absence of a retweet indication.
__LAUGH__: Presence/absence of common laughter indications such as haha, lol etc.
__punctuation_percentage__: The percentage of punctuation in a tweet (usage of a threshold TBD).
__hashtag_lexicon_sum__: Using the NRC Hashtag Lexicon, calculate the sum of the scores for all hashtags in a tweet (the final value is positive, negative or neutral).
__multiple_chars_in_a_row__: True if the tweet contains more than two consecutive identical characters, else False.
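As an illustration of how a few of these boolean flags could be computed, here is a hedged sketch using simple regular expressions; the patterns are assumptions and the project's actual extraction code may differ:

``` python
# Illustrative sketch of a few of the boolean morphological features above.
# The regular expressions are assumptions; the project's real patterns may differ.
import re

def extract_basic_features(tweet: str) -> dict:
    return {
        "__OH_SO__": bool(re.search(r"\boh,?\s+so\b", tweet, re.IGNORECASE)),
        "__DONT_YOU__": bool(re.search(r"\bdon'?t\s+you\b", tweet, re.IGNORECASE)),
        "__CAPITAL__": bool(re.search(r"\b[A-Z]{3,}\b", tweet)),      # all-caps word, >2 letters
        "__LINK__": bool(re.search(r"https?://\S+", tweet)),
        "__REFERENCE__": bool(re.search(r"@\w+", tweet)),
        "__questionmark__": "?" in tweet,
        "__exclamation__": "!" in tweet,
        "__fullstop__": bool(re.search(r"\.{3,}", tweet)),            # more than two consecutive dots
        "__RT__": bool(re.search(r"\bRT\b", tweet)),
        "__LAUGH__": bool(re.search(r"\b(a*ha(ha)+|lo+l)\b", tweet, re.IGNORECASE)),
        "__multiple_chars_in_a_row__": bool(re.search(r"(\w)\1{2,}", tweet)),
    }

print(extract_basic_features("Oh, so THAT went well... @user lol http://t.co/x"))
```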

Cleaning and normalization (a partial sketch of this pipeline follows the list):

  • Stripping non-printable characters
  • Remove RT indication
  • Remove common laughter indications (haha, lol etc.)
  • Remove URLs
  • Remove common emoticons
  • Remove @user – Hashtags are NOT removed
  • Reduce spaces between words to one
  • Remove non word characters (punctuation etc.)
  • Tokenize (using nltk)
  • Convert to lowercase
  • Reduce runs of more than two identical consecutive characters and spellcheck the respective words
  • Normalize hashtags by splitting them into singular words, using a word-cost algorithm and a singularization package.
  • Remove common stopwords (nltk and others)
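A partial, hedged sketch of this pipeline, assuming NLTK with the punkt and stopwords resources downloaded; the project's actual order, regular expressions and stopword lists may differ, and the emoticon removal, spellchecking and hashtag-splitting steps are omitted here:

``` python
# Illustrative sketch of the cleaning and normalization steps above.
# Assumes NLTK with the 'punkt' tokenizer and 'stopwords' corpus downloaded;
# the project's real pipeline may order or implement the steps differently.
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

def clean_tweet(text: str) -> list:
    text = re.sub(r"[^\x20-\x7E]", " ", text)           # crude strip of non-printable characters
    text = re.sub(r"\bRT\b", " ", text)                 # remove retweet indication
    text = re.sub(r"\b(a*ha(ha)+|lo+l)\b", " ", text, flags=re.IGNORECASE)  # laughter indications
    text = re.sub(r"https?://\S+", " ", text)           # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # remove @user mentions (hashtags are kept)
    text = re.sub(r"[^\w\s#]", " ", text)               # remove punctuation, keep '#'
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)         # reduce runs of >2 identical characters
    text = re.sub(r"\s+", " ", text).strip()            # collapse whitespace
    tokens = [t.lower() for t in word_tokenize(text)]   # tokenize, then lowercase
    return [t for t in tokens if t not in STOPWORDS]    # drop common stopwords

print(clean_tweet("RT @user Sooo happy!!! hahaha http://t.co/x #GreatDay"))
```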

Other features’ extraction (a sketch of the SentiWordNet score follows the list):

  • Part-of-speech tags ('postags'): Dictionary with words as keys and their part-of-speech as value (Twitter-specific GATE tagger used)
  • Part-of-speech tags combined with word position: Dictionary with word positions** in a tweet as keys and their part-of-speech as value (Twitter-specific GATE tagger used)
  • Total SentiWordNet score (swn_score): The sum of the SentiWordNet scores of all the words*
  • SentiWordNet score for each word (s_word-word_position): SentiWordNet score for each word in combination with the position of the word
  • Resnik similarity: Presence/Absence
  • is_metaphor: Boolean. A linear SVM classifier is trained with ~12,000 tweets: 3,242 collected from MetaphorMagnet, 3,207 from MetaphorMinute, and, as non-metaphoric examples, 3,225 positive and 3,225 negative tweets from the 1.6M dataset. The tweets from the 1.6M dataset were chosen so as not to overlap with the ones used in the main dataset (62,000 or 42,000). This classifier is used to determine whether a tweet contains a metaphor (True) or not (False).
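As a concrete illustration of the swn_score feature above, the total could be computed roughly as follows using NLTK's SentiWordNet interface; averaging over all of a word's synsets is an assumption here, not necessarily what the project does:

``` python
# Illustrative sketch of the total SentiWordNet score (swn_score) feature.
# Averaging over every synset of a word is an assumption; the project may instead
# pick the first synset or use POS-disambiguated lookups.
from nltk.corpus import sentiwordnet as swn  # requires nltk.download("sentiwordnet") and "wordnet"

def word_swn_score(word: str) -> float:
    synsets = list(swn.senti_synsets(word))
    if not synsets:
        return 0.0
    return sum(s.pos_score() - s.neg_score() for s in synsets) / len(synsets)

def tweet_swn_score(tokens: list) -> float:
    """Sum of the per-word SentiWordNet scores over all tokens of a tweet."""
    return sum(word_swn_score(t) for t in tokens)

print(tweet_swn_score(["great", "terrible", "day"]))
```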

Deployment

The set-up of the distributed, multi-node Storm cluster

We used 6 virtual machines with Ubuntu 14.04.2 LTS.

First, we had to set up a single-node ZooKeeper cluster for our Storm cluster.
It ran on a virtual machine with 4 CPUs, 6144MB RAM and a 10GB disk.

For the master node, in order to run Storm’s Nimbus daemon and the Storm UI, we set up a virtual machine with 4 CPUs, 6144MB RAM and a 10GB disk.

For the slave nodes, in order to run Storm’s Supervisor daemon, we set up 4 different virtual machines.

The first node (supervisor – preprocessing bolt) has 4 CPUs, 8192MB RAM and a 20GB disk. It is configured to run the spout’s and the preprocessing bolt’s tasks.

The second node (supervisor – post-processing bolt) has 2 CPUs, 4096MB RAM and a 10GB disk. It is configured to run the post-processing bolt’s tasks.

The third node (supervisor – classification bolt) has 2 CPUs, 6144MB RAM and a 20GB disk. It is configured to run the classification bolt’s tasks.

The fourth node (supervisor – Mongo bolt & statistics bolt) has 2 CPUs, 8192MB RAM and a 20GB disk. It is configured to run the Mongo bolt’s and statistics bolt’s tasks.

| VM                        | CPU Cores | RAM | Disk Size | Scheduled For                    | Final Parallelism Hint |
|---------------------------|-----------|-----|-----------|----------------------------------|------------------------|
| ZooKeeper                 | 4         | 6GB | 10GB      | -                                | -                      |
| Master node (Nimbus & UI) | 4         | 6GB | 10GB      | -                                | -                      |
| Slave node 1 (supervisor) | 4         | 8GB | 20GB      | Tweet Spout & Preprocessing Bolt | 12                     |
| Slave node 2 (supervisor) | 2         | 4GB | 10GB      | Post Processing Bolt             | 12                     |
| Slave node 3 (supervisor) | 2         | 6GB | 20GB      | Classification Bolt              | 12                     |
| Slave node 4 (supervisor) | 2         | 8GB | 20GB      | Mongo & Statistics Bolts         | 12                     |

Contribution guidelines

  • Writing tests
  • Code review
  • Other guidelines

Who do I talk to?