Commits

<jro...@gmail.com  committed 2498d79

  • Parent commits d0ae2f2


Files changed (1)

-= Getting Started with Updown =
-
-== About ==
-
+= About =
 Updown is a package written in Scala for performing semi-supervised polarity classification on tweets. It is the code behind the paper //Twitter Polarity Classification with Label Propagation over Lexical Links and the Follower Graph// by [[https://webspace.utexas.edu/mas5622/www/|Michael Speriosu]], Nikita Sudan, Sid Upadhyay, and [[http://www.jasonbaldridge.com/|Jason Baldridge]], from [[https://sites.google.com/site/emnlpworkshop2011unsupnlp/|The EMNLP 2011 Workshop on Unsupervised Learning in NLP]].
 
-== Setup ==
-
-=== Prerequisites ===
-
-You will need [[http://www.java.com/|Java]], [[http://www.scala-lang.org/|Scala]], and [[http://mercurial.selenic.com/|Mercurial]].
-
-=== Cloning the repository ===
-
-The following command will download a copy of the code and put it in a directory called updown within the current working directory:
-
-{{{
-$ hg clone https://speriosu@bitbucket.org/speriosu/updown updown
-}}}
-
-=== Setting the Environment Variables ===
-
-Set the environment variable UPDOWN_DIR to point to the updown directory where you downloaded the code above. Then add UPDOWN_DIR/bin to your PATH environment variable, so that the updown executable file in that directory can be run.
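Concretely, the two settings above can be sketched as follows (assuming the repository was cloned to ~/updown; point UPDOWN_DIR at wherever you actually put it):

```shell
# Hypothetical clone location; adjust to your actual updown directory.
export UPDOWN_DIR="$HOME/updown"
# Make the updown executable in $UPDOWN_DIR/bin runnable from anywhere.
export PATH="$UPDOWN_DIR/bin:$PATH"
```

Add these lines to your shell's startup file (e.g. ~/.bashrc) so they persist across sessions.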
-
-== Compiling the Code ==
-
-The following command will compile Updown so that it's ready to be run:
-
-{{{
-$ updown build clean update compile
-}}}
-
-This may take a few minutes. Ignore any warnings.
-
-== Unzipping the EmoMaxent model ==
-
-Run the following command from UPDOWN_DIR:
-
-{{{
-$ gzip -d models/maxent-eng.mxm.gz
-}}}
-
-== Preprocessing the Datasets ==
-
-=== Stanford Sentiment (STS) ===
-
-Run the following command from UPDOWN_DIR to preprocess Go et al.'s (2009) Stanford Sentiment dataset:
-
-{{{
-$ updown preproc-stanford data/stanford/orig/testdata.manual.2009.05.25 \
-    src/main/resources/eng/dictionary/stoplist.txt > data/stanford/stanford-features.txt
-}}}
-
-You should see the following output:
-{{{
-Preprocessed 183 tweets. Fraction positive: 0.59016395
-}}}
-
-=== Obama McCain Debate (OMD) ===
-
-Run the following command from UPDOWN_DIR to preprocess Shamma et al.'s (2009) Obama McCain Debate dataset:
-
-{{{
-$ updown preproc-shamma data/shamma/orig/debate08_sentiment_tweets.tsv \
-    src/main/resources/eng/dictionary/stoplist.txt > data/shamma/shamma-features.txt
-}}}
-
-You should see the following output:
-{{{
-Preprocessed 1898 tweets. Fraction positive: 0.37144363
-Average inter-annotator agreement: 0.837065339873538
-}}}
-
-=== Healthcare Reform (HCR) ===
-
-Run the following command to preprocess the train portion of our Healthcare Reform dataset (used only to train a supervised model for comparison with the semi-supervised models of interest):
-
-{{{
-$ updown preproc-hcr data/hcr/train/orig/hcr-train.csv src/main/resources/eng/dictionary/stoplist.txt > \
-    data/hcr/train/hcr-train-features.txt
-}}}
-
-You should see the following output:
-{{{
-Preprocessed 488 tweets. Fraction positive: 0.43237704
-}}}
-
-Run the following command to preprocess the development portion of our Healthcare Reform dataset:
-
-{{{
-$ updown preproc-hcr data/hcr/dev/orig/hcr-dev.csv src/main/resources/eng/dictionary/stoplist.txt > \
-    data/hcr/dev/hcr-dev-features.txt
-}}}
-
-You should see the following output:
-{{{
-Preprocessed 534 tweets. Fraction positive: 0.3220974
-}}}
-
-Run the following command to preprocess the test portion of our Healthcare Reform dataset:
-
-{{{
-$ updown preproc-hcr data/hcr/test/orig/hcr-test.csv src/main/resources/eng/dictionary/stoplist.txt > \
-    data/hcr/test/hcr-test-features.txt
-}}}
-
-You should see the following output:
-{{{
-Preprocessed 396 tweets. Fraction positive: 0.38636363
-}}}
-
-== Running the Experiments ==
-
-=== LexRatio Baseline ===
-
-To run LexRatio on the Stanford Sentiment dataset, use the following command from the UPDOWN_DIR directory:
-
-{{{
-$ updown lex-ratio -g data/stanford/stanford-features.txt -p \
-    src/main/resources/eng/lexicon/subjclueslen1polar.tff 
-}}}
-
-You should see the following output:
-{{{
-***** PER TWEET EVAL *****
-58 tweets were abstained on; assuming half (29.0) were correct.
-Accuracy: 0.72131145 (132.0/183)
-
-***** PER USER EVAL *****
-Number of users evaluated: 0 (min of 3 tweets per user)
-Mean squared error: NaN
-}}}
-
-Point the -g flag to other preprocessed feature files to run LexRatio on other datasets.
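For example, assuming the OMD preprocessing step above has been run, the same evaluation on that dataset might look like this (only the -g feature file changes; the polarity lexicon stays the same):

```shell
$ updown lex-ratio -g data/shamma/shamma-features.txt -p \
    src/main/resources/eng/lexicon/subjclueslen1polar.tff
```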
-
-=== EmoMaxent ===
-
-To load the maximum entropy model trained on about 2 million tweets with positive and negative emoticons in them and evaluate its per-tweet performance on the Stanford Sentiment dataset, run the following command:
-
-{{{
-$ updown per-tweet-eval -g data/stanford/stanford-features.txt -m models/maxent-eng.mxm
-}}}
-
-You should see the following output:
-{{{
-***** PER TWEET EVAL *****
-Accuracy: 0.8306011 (152.0/183)
-}}}
-
-To run per-user evaluation rather than per-tweet evaluation, use the following command:
-
-{{{
-$ updown per-user-eval -g data/stanford/stanford-features.txt -m models/maxent-eng.mxm
-}}}
-
-You should see the following output:
-{{{
-***** PER USER EVAL *****
-Number of users evaluated: 0 (min of 3 tweets per user)
-Mean squared error: NaN
-}}}
-
-Point the -g flag to other preprocessed feature files to run EmoMaxent on other datasets. (Per-user evaluation makes the most sense on the HCR datasets, where there are many users who have authored three or more tweets.)
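For instance, assuming the HCR-dev preprocessing step above has been run, per-user evaluation of EmoMaxent on that dataset would be invoked like so:

```shell
$ updown per-user-eval -g data/hcr/dev/hcr-dev-features.txt -m models/maxent-eng.mxm
```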
-
-=== Label Propagation (Modified Adsorption) ===
-
-To run label propagation using [[http://code.google.com/p/junto/|Junto]]'s implementation of Modified Adsorption on the Stanford Sentiment dataset, use the following command:
-
-{{{
-$ updown 8 junto -g data/stanford/stanford-features.txt -m models/maxent-eng.mxm -p \
-    src/main/resources/eng/lexicon/subjclueslen1polar.tff -f data/stanford/username-username-edges.txt -r \
-    src/main/resources/eng/model/ngramProbs.ser.gz 
-}}}
-
-(Note that this currently requires more than 4 gigabytes of memory (the '8' above requests 8 gigabytes) due to the way the unigram and bigram probabilities are stored. We plan to improve the space efficiency of this in the future. You can run the label propagation with less memory by omitting the -r flag and its argument, but the results will not be as good.)
-
-After some status output, you should see the following output:
-{{{
-***** PER TWEET EVAL *****
-Accuracy: 0.8469945 (155.0/183)
-
-***** PER USER EVAL *****
-Number of users evaluated: 0 (min of 3 tweets per user)
-Mean squared error: NaN
-}}}
-
-Point the -g flag to other preprocessed feature files and the -f flag to the corresponding username-username-edges.txt file to run the label propagation on other datasets.
-
-The optional -e flag can be used to tell the label propagation algorithm which edges and/or seeds to include, according to the following abbreviations:
-* 'n' stands for the edges between n-grams and the tweets that contain them
-* 'f' stands for the follower graph
-* 'm' stands for seeds based on EmoMaxent's predictions
-* 'o' stands for the OpinionFinder/MPQA seeds on some unigrams
-* 'e' stands for emoticon seeds
-
-By default, all five of these are included, i.e. adding "-e nfmoe" to the above command would not change the output. To run with just the follower graph and EmoMaxent's predictions, for example, add "-e fm" to the command line, like so:
-
-{{{
-$ updown 8 junto -g data/stanford/stanford-features.txt -m models/maxent-eng.mxm -p \
-    src/main/resources/eng/lexicon/subjclueslen1polar.tff -f data/stanford/username-username-edges.txt -r \
-    src/main/resources/eng/model/ngramProbs.ser.gz -e fm
-}}}
-
-You should see the following output:
-{{{
-***** PER TWEET EVAL *****
-Accuracy: 0.8306011 (152.0/183)
-
-***** PER USER EVAL *****
-Number of users evaluated: 0 (min of 3 tweets per user)
-Mean squared error: NaN
-}}}
-
-=== Per-Target Evaluation ===
-
-Tweets in the HCR datasets are annotated for target as well as sentiment. To extract the list of targets for one of the HCR datasets (necessary for per-target evaluation), add a third argument before the '>' in the HCR preprocessing command: a target output filename. For example, this will extract the targets from HCR-dev:
-
-{{{
-$ updown preproc-hcr data/hcr/dev/orig/hcr-dev.csv src/main/resources/eng/dictionary/stoplist.txt \
-    data/hcr/dev/hcr-dev-targets.txt > data/hcr/dev/hcr-dev-features.txt
-}}}
-
-Whenever running the above experiments on an HCR dataset for which targets have been extracted, you can point to the appropriate target file with the -t flag and see a breakdown of results per target. For example, this command will run per-target evaluation on HCR-dev after performing the default label propagation:
-
-{{{
-$ updown 8 junto -g data/hcr/dev/hcr-dev-features.txt -m models/maxent-eng.mxm -p \
-    src/main/resources/eng/lexicon/subjclueslen1polar.tff -f data/hcr/username-username-edges.txt -r \
-    src/main/resources/eng/model/ngramProbs.ser.gz -t data/hcr/dev/hcr-dev-targets.txt 
-}}}
-
-You should see the following output:
-{{{
-***** PER TWEET EVAL *****
-Accuracy: 0.6516854 (348.0/534)
-
-***** PER USER EVAL *****
-Number of users evaluated: 24 (min of 3 tweets per user)
-Mean squared error: 0.12439673091458808
-
-***** PER TARGET EVAL *****
-hcr: 0.6298932384341637 (281)
-gop: 0.5822784810126582 (79)
-other: 0.7446808510638298 (47)
-dems: 0.7567567567567568 (37)
-conservatives: 0.5714285714285714 (35)
-stupak: 0.75 (24)
-obama: 0.8235294117647058 (17)
-teaparty: 0.8181818181818182 (11)
-liberals: 0.3333333333333333 (3)
-}}}
-
+= [[Getting Started]] =
+[[Getting Started]] contains instructions on building the code, preprocessing data, and running the experiments from the paper.
 
 = Misc =
 * [[Dev Environment Setup]]