#Introduction to Demeter
What is Demeter?
###Overview
Demeter is a 100-node Hadoop cluster running Cloudera's Hadoop Distribution (CDH 5). It provides 3.6 petabytes of storage with 2400 total cores.
The cluster uses commodity machines that perform both computation and storage, so compute tasks run local to the data. This reduces the network overhead usually found in a traditional HPC cluster and lowers storage costs. Demeter also provides many easy-to-program interfaces to access, manipulate, and run computations over the data. Writing distributed code can be as easy as writing SQL queries.
###Logistics
Demeter's filesystem is easy to browse through the web interface at [3]. Information about the datasets currently managed by Demeter is available at [4], the metadata browser, which lists the tables currently available in the structured datastore on Demeter.
Two easy-to-use interfaces for accessing data on Demeter are SQL queries [5] and Search [6]. Impala, a distributed query engine, lets us query large datasets very quickly. Search allows full-text or faceted search over indexed datasets.
- 1 JobTracker
- 2 Hue
- 3 Filesystem Browser
- 4 Storage Metadata Browser
- 5 Query Interface
- 6 Search Interface
What is Hadoop?
Hadoop is an open-source project containing a MapReduce job scheduler and a distributed filesystem (HDFS). Both are open-sourced versions of Google projects:
- Hadoop MapReduce :: Google Map Reduce Paper
- HDFS :: Google File System [http://research.google.com/archive/gfs.html]
Hadoop sets itself apart from HPC systems by integrating computing and storage. Instead of pulling data onto a computing cluster, compute tasks are sent to the data, so tasks are completed where the data lives.
Jobs are typically run using the Map Reduce paradigm, which has two phases:
- Map: filter and transform the data
- Reduce: aggregate records based on some key and an aggregation function
The same idea is commonly seen in functional programming. In Python:

    x = range(10)
    x2 = [i ** 2 for i in x]  # map phase: each element is transformed by the square operation
    total = sum(x2)           # reduce phase: all elements are combined into a single number
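The Reduce phase described above aggregates records by key, which a plain sum does not show. The following is a minimal single-machine sketch of the keyed version; the sample words and the choice of key (first letter) are made up for illustration, and the explicit grouping step stands in for the shuffle that Hadoop performs between the two phases.

```python
from collections import defaultdict

# Hypothetical input records, used only for illustration.
records = ["apple", "banana", "avocado", "blueberry", "cherry"]

# Map phase: transform each record into a (key, value) pair.
# Here the key is the first letter and the value is the word length.
pairs = [(word[0], len(word)) for word in records]

# Shuffle: group values by key (Hadoop does this between the phases).
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce phase: aggregate each key's values with an aggregation function.
totals = {key: sum(values) for key, values in groups.items()}
print(totals)
```

On a cluster, the map and reduce steps would run in parallel on many machines, each working on the data stored locally.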
HDFS
HDFS is a distributed filesystem that serves as the backbone of Hadoop.
HDFS is easily scalable: more storage is just more machines. Data is also replicated across many machines, so it remains available even with disk or node failures.
Caveats:
- All files are append-only: existing contents cannot be updated or edited in place.
- All data is replicated, so no data is lost if any one machine fails, but replication also increases the storage overhead.
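To get a feel for the storage overhead of replication, here is a back-of-the-envelope sketch. It rests on two assumptions that this page does not confirm: that the 3.6 petabytes quoted in the overview is raw disk capacity, and that the replication factor is 3 (the HDFS default; Demeter's actual setting may differ).

```python
def usable_capacity_pb(raw_pb, replication_factor):
    """Approximate space left for unique data when every block is stored
    replication_factor times."""
    return raw_pb / replication_factor

raw_pb = 3.6      # figure from the overview, assumed here to be raw capacity
replication = 3   # assumption: HDFS default, not confirmed for Demeter

usable = usable_capacity_pb(raw_pb, replication)
print(f"about {usable:.1f} PB of unique data")
```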
How do I put files onto HDFS on Demeter from Minerva?
The fastest way to do this is:
cat file.txt | ssh username@login.demeter.hpc.mssm.edu "hdfs dfs -put - /user/username/file.txt"
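If you prefer to script the transfer, the same pipeline can be sketched in Python. The helper names `build_put_command` and `upload` are hypothetical; the sketch only relies on `ssh` being available and on `hdfs dfs -put -` reading the file contents from standard input, as in the command above.

```python
import shlex
import subprocess

def build_put_command(user, remote_path, host="login.demeter.hpc.mssm.edu"):
    """Build the ssh command that writes stdin to remote_path on HDFS."""
    remote = "hdfs dfs -put - " + shlex.quote(remote_path)
    return ["ssh", f"{user}@{host}", remote]

def upload(local_path, user, remote_path):
    """Stream a local file over ssh into HDFS (the role of `cat file.txt |`)."""
    with open(local_path, "rb") as src:
        subprocess.run(build_put_command(user, remote_path), stdin=src, check=True)

# Show the command that would be run, without executing it.
print(build_put_command("username", "/user/username/file.txt"))
```

Calling `upload("file.txt", "username", "/user/username/file.txt")` would then perform the copy.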
Query Frameworks
There are many frameworks available that make programming and accessing data on Hadoop more user-friendly. The underlying computation is converted to a map-reduce job, but many tools mask this and let you use more familiar data manipulation operations, such as SQL.
Impala
The goal of Impala is to provide more interactive querying. The syntax mirrors that of Hive; however, some of Hive's features are not implemented yet.
Caveats:
- Impala does not launch MR jobs, so its queries do not appear in the JobTracker; it has a separate tracker at http://demeter.hpc.mssm.edu:25000
- No fault tolerance: if a node fails while a query is running, the query fails
Hive
Hive is a data warehouse system for Hadoop that handles querying data on HDFS. Hive presents a SQL-like language (HiveQL) to query the data and translates the queries into MapReduce jobs. Most importantly, you can use your SQL skills without learning how to write an MR program. HiveQL shares much of its syntax with MySQL.
The data is stored on HDFS and metadata is stored separately.
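To illustrate the kind of query involved, the sketch below runs a simple aggregation locally. Python's built-in `sqlite3` module is used purely as a stand-in so the example is runnable without a cluster; it is not how Hive is accessed on Demeter. The `visits` table and its rows are made up. A `GROUP BY` with `SUM` is also valid HiveQL, and Hive would plan such a query as a map step (emit `page`, `hits`) followed by a reduce step (sum per `page`).

```python
import sqlite3

# Local in-memory database as a stand-in for a Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (page TEXT, hits INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("home", 3), ("docs", 5), ("home", 4)],
)

# Aggregate hits per page: conceptually map to (page, hits), reduce by sum.
query = "SELECT page, SUM(hits) AS total FROM visits GROUP BY page ORDER BY page"
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```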
Streaming MapReduce
If the task is more sophisticated than can be expressed in one of the query frameworks, there are a handful of options:
- Write raw MR code
- Use a more flexible data processing framework (Apache Crunch)
- Streaming MR
Streaming Map Reduce uses Hadoop to run any executables (Bash scripts, Python scripts, etc.) as Map and Reduce tasks. Hadoop handles the underlying fault tolerance, sorting, and data management.
Process:
- Write a Map task that reads from STDIN and outputs key, value pairs to STDOUT
- Write a Reduce task that reads key, value pairs from STDIN, sorted so that all values for a key arrive together, and outputs key, value pairs to STDOUT
- Call hadoop-streaming with the map and reduce tasks set.
Basic command syntax for a simple word count. -mapper and -reducer can be any executable.

    hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -input myInputDirs \
        -output myOutputDir \
        -mapper /bin/cat \
        -reducer /bin/wc
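For a per-word count, the mapper and reducer could be written in Python along the following lines. This is a sketch rather than a tested Demeter job: the function names and sample text are hypothetical, the tab-separated `key<TAB>value` line format is the streaming default, and `sorted()` stands in locally for the sort that Hadoop performs between the map and reduce tasks.

```python
from itertools import groupby

def mapper(lines):
    """Map task: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Reduce task: sum the counts for each word.
    Assumes the input lines are sorted by key, as Hadoop delivers them."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation of map -> sort -> reduce on made-up input.
text = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = sorted(mapper(text))
counts = list(reducer(mapped))
print(counts)
```

In a real streaming job, each function would live in its own executable script that iterates over `sys.stdin` and prints each result line, and those scripts would be passed to `-mapper` and `-reducer`.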
Examples:
Using Spark/HDFS on Demeter From Scala