Meme Extractor

Hohyon Ryu (
School of Information, University of Texas at Austin

Using Hadoop Streaming, Meme Extractor generates memes from a text collection. The code is written in Python.

	* [] indicates a list

	Preprocessing:
		1_MapDupReduce: Near Duplicate Detection (Preprocessing, Optional)
			Input: HTML-Stripped Text Data
			Output: HTML-Stripped Deduplicated Text Data
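
The near-duplicate detection step can be illustrated with a small in-memory sketch. This is not the 1_MapDupReduce implementation, whose method is not described here; it assumes word-shingle Jaccard similarity, and the names shingles, jaccard, deduplicate, k, and threshold are hypothetical.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles of a text (k is an assumed parameter)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle; empty texts have none.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / float(len(a | b))


def deduplicate(records, threshold=0.8, k=4):
    """Keep the first document of each near-duplicate group.

    records is an iterable of (doc_id, text) pairs. The threshold is an
    assumption, not a value taken from Meme Extractor.
    """
    kept = []
    for doc_id, text in records:
        sig = shingles(text, k)
        # Keep the document only if it is dissimilar to everything kept so far.
        if all(jaccard(sig, other) < threshold for _, other in kept):
            kept.append((doc_id, sig))
    return [doc_id for doc_id, _ in kept]
```

This pairwise comparison is quadratic in the number of documents, so it only shows the idea on small samples; a Hadoop job would need to distribute the comparison.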

	Meme Extraction:
		2_seq_extraction: Sequence Extraction
			Input: HTML-Stripped Deduplicated Text Data
			Output: Common Phrase	[Doc_ID|source]
		3_rank_cps: Common Phrase Ranking
			Input: Common Phrase	[Doc_ID|source]
			Output: Date	[Common Phrase]
		4_cluster_cps: Common Phrase Clustering
			Input: Date	[Common Phrase]
			Output: Meme_ID	[Common Phrase]
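
As a rough illustration of the Sequence Extraction input and output shapes, the sketch below mimics a Hadoop Streaming mapper and reducer as plain Python generators. It is not the 2_seq_extraction code. It assumes that a common phrase is a word n-gram shared by at least min_docs documents and that list items are joined with commas; map_ngrams, reduce_phrases, n, and min_docs are hypothetical names.

```python
from itertools import groupby


def map_ngrams(lines, n=3):
    """Mapper sketch: emit 'phrase<TAB>doc_id|source' for every word n-gram."""
    for line in lines:
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) != 3:
            continue  # skip malformed records
        doc_id, source, text = fields
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            yield "%s\t%s|%s" % (" ".join(words[i:i + n]), doc_id, source)


def reduce_phrases(sorted_pairs, min_docs=2):
    """Reducer sketch: keep phrases seen in at least min_docs documents.

    Expects input sorted by phrase, as Hadoop Streaming delivers it.
    """
    keyed = (pair.rstrip("\n").split("\t", 1) for pair in sorted_pairs)
    for phrase, group in groupby(keyed, key=lambda kv: kv[0]):
        docs = sorted(set(value for _, value in group))
        if len(docs) >= min_docs:
            yield "%s\t%s" % (phrase, ",".join(docs))
```

Locally, the shuffle phase can be imitated by sorting the mapper output before passing it to the reducer, for example reduce_phrases(sorted(map_ngrams(lines))).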

	Data Preparation for Meme Browser:
		5_Data_for_MB: Data Preparation for Meme Browser (Optional)
			Input: All the outputs
			Output: memes.tar.gz (compressed file to be shipped to a web server)

Input Data:
	- HTML-Stripped Text Data
	- HTML Text Data  

Input Data Format (Tab Separated):
	Doc_ID	Source	Text
	BLOG08-20080116-122-0009443413	Blaine has been real sad lately, I wish I could make him happier. 
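
A minimal sketch of reading one record in this format, assuming exactly the three tab-separated fields named in the header line (Doc_ID, Source, Text). The function name parse_record is hypothetical and not part of Meme Extractor.

```python
def parse_record(line):
    """Split one tab-separated input line into (doc_id, source, text).

    Returns None when the line does not have all three fields.
    """
    # Split on the first two tabs only, so tabs inside the text are preserved.
    fields = line.rstrip("\n").split("\t", 2)
    if len(fields) != 3:
        return None
    doc_id, source, text = fields
    return doc_id, source, text.strip()
```

Lines with fewer than three fields are rejected rather than guessed at, so malformed input can be skipped or counted explicitly.
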
	Preprocessing (Optional):
		1. Upload data to HDFS
			$ hadoop fs -put undedup_stripped undedup_stripped
		2. cd to 1_MapDupReduce, and execute
			$ cd 1_MapDupReduce
			$ ./
		3. The output is in the following directory: dedup_stripped
	Meme Extraction:
		1. Upload data to HDFS
			$ hadoop fs -put dedup_stripped dedup_stripped
		2. cd to 2_seq_extraction, and execute
			$ cd 2_seq_extraction
			$ ./
			The output is in blog_memes_3.
		3. run 3_rank_cps by running
		4. run 4_cluster_cps by running
	Data Preparation for Meme Browser:		
		5. run 5_Data_for_MB by running