Corpora

This is a listing of corpora used in and around the Sentiment Analysis literature.

PangLee Movie Reviews

This dataset is released at people/pabo/movie-review-data/.

  • Pang and Lee 2002: They obtained an equal class distribution by randomly selecting 700 positive and 700 negative reviews, and performed three-fold cross-validation on this data. They report a random-choice baseline of 50% and three human baselines of 58%, 64%, and 69%. Their best result was 82.9%, using an SVM with unigram features.
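The unigram-SVM setup above can be sketched roughly as follows. This is a hypothetical illustration, not the paper's actual pipeline: the inline reviews and labels are toy placeholders, and scikit-learn's LinearSVC stands in for whatever SVM implementation Pang and Lee used.

```python
# Sketch of binary unigram features + linear SVM, scored with
# 3-fold cross-validation, in the spirit of the setup described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the 700+700 review corpus.
reviews = [
    "a gripping and wonderful film", "superb acting throughout",
    "an absolute delight to watch", "a tedious and dull plot",
    "a boring mess of a movie", "flat acting and a weak script",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

# binary=True records word presence rather than counts (unigram presence
# features, as in the paper's best-performing configuration).
model = make_pipeline(CountVectorizer(binary=True), LinearSVC())
scores = cross_val_score(model, reviews, labels, cv=3)
```

On the real corpus, `scores.mean()` would be the figure comparable to the reported 82.9%.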

PangLee Movie Reviews 2.0 (Polarity Dataset)

This dataset is released at

  • Pang and Lee 2004: 1000 positive and 1000 negative reviews. Using NB as a subjectivity detector (ExtractNB), they evaluated polarity accuracy using only the subjective sentences from each review. ExtractNB+NB achieved 86.4% (compared to an 82.8% Full+NB baseline), and ExtractNB+SVM gave 86.4% (compared to 87.15% for Full+SVM). Note that the ExtractNB text is usually only about 60% the size of the full review. They also compared the graph-cut subjectivity classifiers ExtractNB_PROX and ExtractSVM_PROX to plain NB or SVM classification using the paragraph as the unit of subjectivity (ExtractNB_PARA). The results: (ExtractNB_PROX+NB, 86.4%), (ExtractNB_PARA+NB, 85.2%), (ExtractSVM_PROX, 86.15%), (ExtractSVM_PARA, 85.45%).
  • Maas et al. 2011: They achieved 88.9%.
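The two-stage Extract+classify idea can be sketched as below. This is a toy illustration of the general pattern only: the sentences, labels, and the `extract_subjective` helper are all made up, and a plain NB sentence classifier stands in for the paper's subjectivity detector (the graph-cut variants additionally use proximity links between sentences, which this sketch omits).

```python
# Stage 1: train a sentence-level subjectivity detector (ExtractNB
# analogue). Stage 2: the polarity classifier would then see only the
# subjective sentences retained by stage 1.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training sentences for the subjectivity detector.
subj_sents = ["i loved every minute", "the plot felt hollow"]
obj_sents = ["the film runs two hours", "it was shot in toronto"]
subjectivity = make_pipeline(CountVectorizer(), MultinomialNB())
subjectivity.fit(subj_sents + obj_sents, [1, 1, 0, 0])

def extract_subjective(review_sentences):
    """Keep only the sentences the detector labels subjective."""
    return " ".join(s for s in review_sentences
                    if subjectivity.predict([s])[0] == 1)

review = ["the film runs two hours", "i loved every minute"]
condensed = extract_subjective(review)
```

The polarity model is then trained and evaluated on `condensed` rather than the full review text, which is why the extract is only ~60% the size of the original.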

Sentence Corpus 1.0 (Subjectivity Dataset)

This dataset is released at

  • Pang and Lee 2004: 5000 subjective and 5000 objective sentences. They evaluated SVM and NB, using this dataset to train subjectivity classifiers for later use. Nevertheless, they also performed 10-fold cross-validation on the subjectivity dataset itself, achieving 92% with NB and 90% with SVM.
  • Maas et al. 2011: They achieved 88.58%.

IMDB Movie Reviews

This dataset is released at

  • Maas et al. 2011: They evaluated in two modes: using 25k reviews rated on 1-10 stars, and using the same 25k labeled reviews mixed with 50k unlabeled reviews. In both cases, they discarded the 50 most frequent words, then used only the next 5000 most frequent words. They did not remove stopwords, stem, or strip non-alphanumeric tokens like "!" or ";)". They achieved 88.89%.
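The vocabulary rule described above (drop the top-K most frequent words, keep the next N) can be sketched as follows. The function name `build_vocab` and the toy corpus are made up for illustration; the parameters mirror the text's K=50, N=5000.

```python
# Rank words by corpus frequency, discard the most frequent drop_top
# words, and keep the next keep_next words as the vocabulary.
from collections import Counter

def build_vocab(docs, drop_top=50, keep_next=5000):
    counts = Counter(w for doc in docs for w in doc.split())
    ranked = [w for w, _ in counts.most_common()]
    return ranked[drop_top:drop_top + keep_next]

# Tiny demo with scaled-down parameters.
docs = [
    "the movie was great",
    "the movie was bad",
    "the acting felt flat",
]
vocab = build_vocab(docs, drop_top=1, keep_next=3)
print(vocab)  # → ['movie', 'was', 'great']
```

Dropping the head of the frequency distribution discards the highest-frequency words (which are mostly function words) without needing an explicit stopword list, consistent with the note that no stopword removal was done.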

LiMcCallum2006 Rexa

This is not really a dataset; Rexa is a research-paper search engine, and Li and McCallum tested PAM on a random sample of 4000 of its documents. They did not release the sample, nor did they report any quantitative results; they only published a topic graph derived from the data.


NIPS Abstracts

Li and McCallum used a collection of 1647 abstracts from NIPS. It does not appear that this dataset is released anywhere.

  • Li and McCallum 2006: In testing PAM, Li and McCallum evaluated several topic-modelling approaches for "best" topic construction. They used a 75%-25% train-test split, gauging quality by the Empirical Likelihood (EL) of the held-out test set under each trained model. PAM parameters: 50 super-topics and between 20 and 180 sub-topics. Results: PAM always outperforms LDA, peaking at 160 sub-topics; CTM outperforms PAM at small numbers of topics, 60 being its best; HDP automatically learns the number of topics and performs similarly to LDA. Li and McCallum also performed a human topic-quality test, and the subjects tended to prefer PAM topics.

20 Newsgroups

This dataset is released at

  • Li and McCallum 2006: Li and McCallum also performed a document classification task: 5-way classification using the comp subset of the 20 Newsgroups corpus, which contains 4836 documents. They split each of the five classes 75-25 into train and test. Using the EL measure described above, they found that PAM outperforms LDA with statistical significance.
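The per-class 75-25 split above can be sketched as follows. The five comp.* group names are the real classes of the 20 Newsgroups comp subset; everything else (the `split_per_class` helper, the dummy documents, the fixed seed) is a made-up illustration of the split logic, not Li and McCallum's code.

```python
# Split every class 75-25 into train and test independently, so the
# class distribution is preserved on both sides of the split.
import random

COMP_CLASSES = [
    "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware",
    "comp.windows.x",
]

def split_per_class(docs_by_class, train_frac=0.75, seed=0):
    rng = random.Random(seed)
    train, test = [], []
    for label, docs in docs_by_class.items():
        docs = docs[:]          # don't mutate the caller's lists
        rng.shuffle(docs)
        cut = int(len(docs) * train_frac)
        train += [(d, label) for d in docs[:cut]]
        test += [(d, label) for d in docs[cut:]]
    return train, test

# Dummy stand-in for the 4836 comp documents: 8 per class.
dummy = {c: [f"doc{i}" for i in range(8)] for c in COMP_CLASSES}
train, test = split_per_class(dummy)
print(len(train), len(test))  # → 30 10
```

With the real corpus, the model comparison would then be run on the held-out quarter of each class.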


bitterlemons

The bitterlemons corpus is a collection of articles about the Israeli-Palestinian conflict, annotated for perspective. It is released as a dataset here: