
Overview

Meme Extractor

Hohyon Ryu (hohyon@utexas.edu)
School of Information, University of Texas at Austin

Using Hadoop Streaming, Meme Extractor generates memes from a text collection. The code is written in Python.
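In the Hadoop Streaming model, each pipeline stage is a pair of Python scripts that read tab-separated records on stdin and emit key/value lines on stdout; Hadoop sorts the mapper's output by key before feeding it to the reducer. The following is a minimal word-count-style sketch of that model, not the project's actual code (all names here are illustrative):

```python
def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a Streaming mapper would."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t%d" % (word, 1)

def reducer(pairs):
    """Sum counts per key. Assumes input sorted by key, as Hadoop guarantees
    between the map and reduce phases."""
    current, total = None, 0
    for pair in pairs:
        key, count = pair.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = key, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)

# Hadoop Streaming effectively runs: cat input | mapper.py | sort | reducer.py
```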

Pipeline:
	* [] indicates a list value in the outputs below
	Deduplication:
		1_MapDupReduce: Near Duplicate Detection (Preprocessing, Optional)
			Input: HTML-Stripped Text Data
			Output: HTML-Stripped Deduplicated Text Data

	Meme Extraction:
		2_seq_extraction: Sequence Extraction
			Input: HTML-Stripped Deduplicated Text Data
			Output: Common Phrase	[Doc_ID|source]
		3_rank_cps: Common Phrase Ranking
			Input: Common Phrase	[Doc_ID|source]
			Output: Date	[Common Phrase]
		4_cluster_cps: Common Phrase Clustering
			Input: Date	[Common Phrase]
			Output: Meme_ID	[Common Phrase]

	Data Preparation for Meme Browser:
		5_Data_for_MB: Data Preparation for Meme Browser (Optional)
			Input: All the outputs
			Output: memes.tar.gz (compressed file to be shipped to a web server)
				cp-doc-source.list  
				daily_top_memes.list  
				docs.set  
				memes.list    
				top_docs.data

Input Data:
	- HTML-Stripped Text Data
	- HTML Text Data  

Input Data Format (Tab Separated):
	Doc_ID	Source	Text
	BLOG08-20080116-122-0009443413	apathy-fool.livejournal.com	Blaine has been real sad lately, I wish I could make him happier. 
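A record in this format splits on tabs into its three fields. A small parsing sketch (the helper name is an assumption, not part of the project):

```python
def parse_record(line):
    """Split one tab-separated input line into (doc_id, source, text).
    Split at most twice, since the text field may contain further tabs."""
    doc_id, source, text = line.rstrip("\n").split("\t", 2)
    return doc_id, source, text
```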
  
  
Synopsis:
	Deduplication:
		1. Upload data to HDFS
			$ hadoop fs -put undedup_stripped undedup_stripped
		2. cd to 1_MapDupReduce, and execute runall.sh
			$ cd 1_MapDupReduce
			$ ./runall.sh
		3. The output is in the following directory: dedup_stripped
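As a rough illustration of what near-duplicate detection in 1_MapDupReduce can look like, the sketch below compares documents by word-shingle overlap; the shingle size and similarity threshold are assumptions, not the project's actual parameters:

```python
def shingles(text, k=4):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / float(len(a | b))

def is_near_duplicate(text1, text2, threshold=0.8):
    """Flag a pair of texts as near-duplicates if their shingle sets
    overlap at or above the threshold."""
    return jaccard(shingles(text1), shingles(text2)) >= threshold
```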
		
	Meme Extraction:
		1. Upload data to HDFS
			$ hadoop fs -put dedup_stripped dedup_stripped
		2. cd to 2_seq_extraction, and execute runall.sh
			$ cd 2_seq_extraction
			$ ./runall.sh
			The output is in blog_memes_3.
		3. cd to 3_rank_cps, and execute runhadoop.sh
		4. cd to 4_cluster_cps, and execute runhadoop.sh
		
	Data Preparation for Meme Browser:		
		5. cd to 5_Data_for_MB, and execute run.sh