Wiki

#Introduction to Demeter

What is Demeter?

###Overview

Demeter is a 100-node Hadoop cluster running Cloudera's Hadoop Distribution (CDH 5). It provides 3.6 petabytes of storage with 2400 total cores.

The cluster uses commodity machines that perform both computation and storage, compute tasks are run local to the data. This reduces the network overhead that is usually found in a traditional HPC cluster and lowers the storage costs. Also, Demeter provides many easy to program interfaces to access, manipulate and run computations over the data. Writing distributed code can be easy as writing SQL queries.

###Logistics

Demeter's filesystem is easy to browse through the web interface at [3]. Also, information about the datasets currently managed by Demeter is available at [4] the metadata browser that allows us to view the tables currently available on the structured datastore on Demeter.

Two easy to use interface to access data on Demeter are through SQL queries [5] and Search [6]. Impala allows us to access large data sets very quickly by using a distributed query engine. Search allows full-text or faceted search over indexed datasets.

What is Hadoop?

Hadoop is an open-source project containing a MapReduce job scheduler and a distributed filesystem (HDFS). Both are open-sourced versions of Google projects:

Hadoop MapReduce :: Google Map Reduce Paper
HDFS :: Google File System [http://research.google.com/archive/gfs.html]

Hadoop sets itself apart from HPC systems by integrating computing and storage. Instead of pulling data on to a computing cluster, compute tasks are sent to the data. Tasks are completed where the data lives.

Jobs are typically run using the Map Reduce paradigm with two phases: - Map: Filtering and transforming the data - Reduce: Aggregate records based on some key and aggregation function

Same idea is seen commonly in functional programming languages

x = xrange(10)
x2 = [ i^2 for i in x] # map phase, each element is transformed by square operation
total  = sum(x2) # reduce phase, all elements are combined into a single number

HDFS

HDFS is a distributed filesystem that serves as the backbone of Hadoop.

HDFS is easily scalable - more storage is just more machines. Also data is replicated across many machines so data is available even with disk or node failures.

Caveats:

All files are append only - no updates or edits.
All data is replicated, so if any machine fails no data is lost (but also larger storage overhead)

HDFS Details

How do I put files onto HDFS on Demeter from Minerva?

The fastest way to this is:

cat file.txt | ssh username@login.demeter.hpc.mssm.edu "hdfs dfs -put - /user/username/file.txt"

Query Frameworks

There are many frameworks available that make programming and accessing data on Hadoop more user-friendly. The underlying computation is converted to a map-reduce job, but many tools mask this and allow you to use more popular data manipulation operations (SQL)

Impala

The goal of Impala is to provide more interactive querying. The syntax mirrors that of Hive, however some of Hive's features not implemented yet.

Caveats

Impala is not launching MR jobs (also therefore not available in JobTracker, separate tracker at http://demeter.hpc.mssm.edu:25000)
No fault-tolerance, queries will fail

###Impala Examples

###Getting Started with ADAM

Hive

Hive is a data warehouse system for Hadoop to handle querying data on HDFS. Hive presents a SQL-like language (HiveQL) to query the data and translates this into MapReduce jobs. Most importantly you can use SQL skills without learning how to write MR program. HiveQL shares much of its syntax from MySQL.

The data is stored on HDFS and metadata is stored separately.

Hive DDL

Streaming MapReduce

If the task is more sophisticated than can be expressed in one of the query frameworks, there are are a handful of options:

Write raw MR code
Use a more flexible data processing framework (Apache Crunch)
Streaming MR

Streaming Map Reduce uses Hadoop to run any binaries (Bash, Python, etc.) as Map and Reduce tasks. Hadoop handles the underlying fault-tolerance, sorting and data management.

Process:

Write a Map task that reads from STDIN and outputs key, value pairs to STDOUT
Write a Reduce task that reads from STDIN (given 1 key and list of values) and produces key, value pairs to stdout
Call hadoop-streaming with map and reduce task set.

Basic command syntax for word count. --mapper and --reducer can be any executable.

hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc

Examples: