Zip Reader for Hadoop Streaming

This is a reader that returns (filename, line) key/value pairs for a zip file in Hadoop Streaming.

Note that currently only the first file in the zip is processed. If you want more, submit a pull request :)
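
To make the record shape concrete, here is a minimal Python sketch that mimics the behavior described above using only the standard `zipfile` module. It is not the project's Java code: the function names `first_entry_records` and `make_demo_zip` are hypothetical, and the UTF-8 decoding and "first entry in archive order" rule are assumptions.

```python
import io
import zipfile


def first_entry_records(zip_bytes):
    """Yield (filename, line) pairs for the first entry of a zip archive.

    Assumption: "first file" means the first entry in archive order,
    and the entry holds UTF-8 text.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        entries = archive.infolist()
        if not entries:
            return
        first = entries[0]
        with archive.open(first) as handle:
            # Decode the entry as text and emit one record per line.
            for raw in io.TextIOWrapper(handle, encoding="utf-8"):
                yield first.filename, raw.rstrip("\n")


def make_demo_zip():
    """Build a small in-memory archive with two text files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("a.txt", "one\ntwo\n")
        archive.writestr("b.txt", "ignored\n")
    return buffer.getvalue()
```

Calling `list(first_entry_records(make_demo_zip()))` produces records only for `a.txt`; the second entry is never read, which mirrors the limitation noted above.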

Usage

#!/bin/bash
# Unzip a file in HDFS

case $1 in
    -h | --help ) echo "usage: $(basename "$0") INDIR OUTDIR"; exit;;
esac

if [ $# -ne 2 ]; then
    "$0" -h
    exit 1
fi

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -libjars zipmapred-1.0-SNAPSHOT.jar \
    -mapper /bin/cat \
    -reducer /bin/cat \
    -inputformat com.mikitebeka.mapred.ZipInputFormat \
    -input "$1" -output "$2"
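
The script above uses /bin/cat as the mapper, so records pass through unchanged. As a rough sketch of a custom mapper, the Python below assumes each record reaches the mapper on stdin as the filename, a tab, then the line, which is how Hadoop Streaming normally passes key/value pairs from a non-default input format. The names `parse_record` and `map_records` are hypothetical, and the line-length output is only for illustration.

```python
import sys


def parse_record(record):
    """Split one streaming input line into (filename, line).

    Assumption: key and value are separated by the first tab.
    """
    filename, _, text = record.rstrip("\n").partition("\t")
    return filename, text


def map_records(records):
    """Yield one output line per record: filename, tab, line length."""
    for record in records:
        filename, text = parse_record(record)
        yield "%s\t%d\n" % (filename, len(text))
```

To run this as a mapper, save it to a file (say mapper.py, a hypothetical name), end the file with `sys.stdout.writelines(map_records(sys.stdin))`, and pass it with `-file mapper.py -mapper mapper.py` in place of `-mapper /bin/cat`.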

FAQ

A. It uses the old(?) MapReduce API and doesn't work with CDH4.
Q. Where does this project live?