EMPOWERING platform - Installation of the components
Hadoop installation
On EMPOWERING we use CDH 4.7 to develop new modules. We need to set up a pseudo-distributed cluster in order to test the data flow, i.e. the development machine must run the same architecture that we will use in the production environment.
In order to help developers, we provide a guide to install a minimal architecture that makes the whole thing work. This guide has been developed on a Virtual Machine running an Ubuntu Precise Pangolin (12.04.5) distribution. If you find this guide incomplete, you can read the |cloudera_doc_site| from Cloudera. Support will be limited on any other setup (Cloudera distribution or operating system), as we can't ensure the compatibility of the required components, and working in a VM should be enough to develop new modules (although it may be slow when processing a lot of information).
Prelude
If we are testing with a Virtual Machine, we can safely set the following values. In fact, we MUST set them to make everything work. If you are testing a Hadoop installation in any other scenario (without a VM), you should figure out which values need to be set and where.
- Set hostname
sudo hostname hadoop
- Add a new line to /etc/hosts with the inet addr reported by ifconfig (discard the 127.0.0.1 one)
ifconfig | grep "inet addr"
  inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
  inet addr:127.0.0.1  Mask:255.0.0.0
echo "10.0.2.15 hadoop hadoop.example.com" | sudo tee -a /etc/hosts
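To verify that the new entry works, a quick sanity check (the address below is the example one from this guide):

getent hosts hadoop
# 10.0.2.15       hadoop hadoop.example.com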
Java
We need a specific version of Java to make Hadoop work properly.
Download the .tar.gz file of the recommended version of the Oracle JDK from the |java_download_site|. To make it straightforward, use these commands to download v1.7.0_55 and install it
sudo mkdir /usr/java
cd /usr/java/
sudo wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/7u55-b13/jdk-7u55-linux-x64.tar.gz
sudo tar zxvf jdk-7u55-linux-x64.tar.gz
sudo rm jdk-7u55-linux-x64.tar.gz
Symbolically link the directory where the JDK is installed to /usr/java/default; for example:
sudo ln -s /usr/java/jdk1.7.0_55 /usr/java/default
In /root/.bashrc, set JAVA_HOME to the directory where the JDK is installed, like this:
echo "export JAVA_HOME=/usr/java/default" | sudo tee -a /root/.bashrc
CDH4
- Add cloudera repository
echo "deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib" | sudo tee -a /etc/apt/sources.list.d/cloudera.list echo "deb-src http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib" | sudo tee -a /etc/apt/sources.list.d/cloudera.listsudo apt-get install curl curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key| sudo apt-key add - sudo apt-get update
- Install and start zookeeper
sudo apt-get install zookeeper-server
sudo service zookeeper-server init
sudo service zookeeper-server start
- Install hadoop components (HDFS and MapReduce)
sudo apt-get install hadoop-0.20-mapreduce-tasktracker hadoop-0.20-mapreduce-jobtracker hadoop-hdfs-datanode hadoop-hdfs-namenode
Setup the namenode and datanode (be sure to change core-site.xml and hdfs-site.xml in /etc/hadoop/conf/; a sketch follows below)
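As a reference, a minimal single-node configuration sketch consistent with the hostname and data directories used in this guide; the exact values are assumptions for a pseudo-distributed setup, adapt them to your environment:

<!-- /etc/hadoop/conf/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop:8020</value>
  </property>
</configuration>

<!-- /etc/hadoop/conf/hdfs-site.xml -->
<configuration>
  <!-- single node: keep one copy of each block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///var/lib/hadoop-hdfs/cache/hdfs/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///var/lib/hadoop-hdfs/cache/hdfs/dfs/data</value>
  </property>
</configuration>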
- Set data folders for the namenode and datanode (take /var/lib/hadoop-hdfs as an example)
sudo mkdir -p /var/lib/hadoop-hdfs/cache/hdfs/dfs/name
sudo chown -R hdfs:hdfs /var/lib/hadoop-hdfs/cache/hdfs/dfs/name
sudo chmod 700 /var/lib/hadoop-hdfs/cache/hdfs/dfs/name

sudo mkdir -p /var/lib/hadoop-hdfs/cache/hdfs/dfs/data
sudo chown -R hdfs:hdfs /var/lib/hadoop-hdfs/cache/hdfs/dfs/data
sudo chmod 700 /var/lib/hadoop-hdfs/cache/hdfs/dfs/data
- Format NameNode (before starting for the first time)
sudo -u hdfs hadoop namenode -format
- Start namenode and datanodes
sudo service hadoop-hdfs-namenode start
sudo service hadoop-hdfs-datanode start
Setup the jobtracker and tasktracker (mapred-site.xml in /etc/hadoop/conf needs to be modified; a sketch follows below)
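Again as a sketch, a minimal MRv1 configuration matching the local folder created below; the jobtracker address is an assumption (8021 is the usual CDH default):

<!-- /etc/hadoop/conf/mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop:8021</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data/mapred/local</value>
  </property>
</configuration>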
- Create local folders and set permissions
sudo mkdir -p /data/mapred/local
sudo chown -R mapred:hadoop /data/mapred/local
- Create HDFS folders
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
sudo -u hdfs hadoop fs -mkdir -p /tmp/mapred/system
sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system
sudo -u hdfs hadoop fs -mkdir /user
sudo -u hdfs hadoop fs -chmod -R 1777 /user
- Start services
sudo service hadoop-0.20-mapreduce-tasktracker start
sudo service hadoop-0.20-mapreduce-jobtracker start
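At this point all the daemons should be running. A quick way to check is to list the Java processes (jps ships with the Oracle JDK installed earlier; the process names below are the expected ones, PIDs will differ):

sudo /usr/java/default/bin/jps
# 1234 NameNode
# 2345 DataNode
# 3456 JobTracker
# 4567 TaskTracker
# 5678 QuorumPeerMain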
Install HBase
- Install hbase-master and regionserver
sudo apt-get install hbase-master hbase-regionserver hbase hbase-thrift
- Change configuration (/etc/hbase/conf/hbase-site.xml), as sketched below
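A minimal hbase-site.xml sketch for this single-node setup (the values below are assumptions; hbase.rootdir must point at the HDFS namenode configured earlier):

<!-- /etc/hbase/conf/hbase-site.xml -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://hadoop:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>hadoop</value>
  </property>
</configuration>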
- Create HDFS folders
sudo -u hdfs hadoop fs -mkdir /hbase
sudo -u hdfs hadoop fs -chown hbase /hbase
- Start master and regionserver
sudo service hbase-master start
sudo service hbase-regionserver start
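To check that HBase is up, ask the shell for the cluster status; with a single regionserver it should report something like "1 servers, 0 dead":

echo "status" | hbase shell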
Install Hive
- Install needed packages
sudo apt-get install hive hive-metastore hive-server2
sudo apt-get install mysql-server libmysql-java
sudo service mysql start
sudo ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar
- Create initial database schema
mysql -u root -p

CREATE DATABASE metastore;
USE metastore;
SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql;
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'password';
REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'localhost';
GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'localhost';
FLUSH PRIVILEGES;
quit;
- Change hive-site.xml (/etc/hive/conf/hive-site.xml), as sketched below
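A sketch of the metastore-related properties, assuming the MySQL database and the 'hive' user created above (adapt passwords and hosts to your setup):

<!-- /etc/hive/conf/hive-site.xml -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>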
- Start hive-metastore
sudo service hive-metastore start
- Set Hive permissions on HDFS
sudo -u hdfs hadoop fs -mkdir -p /user/hive/warehouse
sudo -u hdfs hadoop fs -chmod -R 1777 /user/hive/warehouse
- Start hive-server2
sudo service hive-server2 start
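To test the server, beeline (shipped with Hive) can connect over JDBC; port 10000 is the hive-server2 default, and the credentials are the ones created for the metastore (both are assumptions if you changed them):

beeline -u jdbc:hive2://localhost:10000 -n hive -p password
# 0: jdbc:hive2://localhost:10000> show tables;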
Other
Empowering modules are both Python code and R scripts. We will try to give an installation guide for the most important dependencies.
RHipe
Our main R library to run jobs on the Hadoop cluster.
Install dependencies to build Rhipe as described in |rhipe_doc_site|.
Apache Ant, Apache Maven and R
sudo apt-get install ant
sudo apt-get install maven
sudo apt-get install r-base r-base-dev r-cran-rjava
Google protobuf and pkg-config
sudo apt-get install build-essential libtool pkg-config
sudo apt-get install git
cd /opt
sudo git clone https://github.com/google/protobuf.git
cd protobuf
sudo git checkout tags/v2.4.1
sudo ./autogen.sh
sudo ./configure
sudo make
sudo make check
sudo make install
sudo ldconfig   # refresh the shared library cache so the new libprotobuf is found
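If everything went well, protoc should now be available and report the version we checked out:

protoc --version
# libprotoc 2.4.1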
Install some RHipe dependencies
- On the R console (enter using 'sudo R' to be able to install packages)
install.packages(c('digest', 'stringr', 'brew'))
quit()
- Install roxygen2
sudo mkdir /opt/R-Packages
cd /opt/R-Packages
sudo wget https://cran.r-project.org/src/contrib/roxygen2_2.0.tar.gz
sudo R CMD INSTALL roxygen2_2.0.tar.gz
Get RHipe and install it
Build the Rhipe tarball for installation
cd /opt/
sudo git clone https://github.com/tesseradata/RHIPE.git
cd RHIPE/
sudo git checkout v0.74
sudo ant build-distro -Dhadoop.version=cdh4
If the build fails with a missing tools.jar dependency error, try installing Java 1.6
sudo apt-get install openjdk-6-jdk
If the build fails because of asm-3.1 problems, try downloading the jar manually and replacing the broken artifact in the local Maven repository
sudo wget "http://search.maven.org/remotecontent?filepath=asm/asm/3.1/asm-3.1.jar" -O asm-3.1.jar
sudo cp asm-3.1.jar /root/.m2/repository/asm/asm/3.1/asm-3.1.jar
sudo rm /root/.m2/repository/asm/asm/3.1/asm-3.1.pom
If the build fails on Rcmd BUILD with a message like the one below, build the package manually with the commands that follow:
r-build: [exec] /usr/lib/R/bin/Rcmd: 61: exec: BUILD: not found
sudo R CMD build package/R
sudo cp Rhipe_0.74.1.tar.gz ../R-Packages
Installation (if we had to build it manually)
cd /opt/R-Packages
sudo R CMD INSTALL Rhipe_0.74.1.tar.gz
Set environment variables
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF_DIR=/etc/hadoop/conf
Test if everything works fine
cd ~/
R

library(Rhipe)
rhinit()
# Results in
Rhipe: Using Rhipe.jar file
Initializing Rhipe v0.74.1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client-0.20/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client-0.20/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2015-11-11 17:21:29,321 WARN [main][NativeCodeLoader] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Initializing mapfile caches

> rhls('/')
  permission owner group      size          modtime   file
1 drwxr-xr-x hbase supergroup    0 2015-11-11 13:01 /hbase
2 drwxrwxrwt  hdfs supergroup    0 2015-11-11 12:54   /tmp
3 drwxr-xr-x  hdfs supergroup    0 2015-11-11 13:31  /user
4 drwxr-xr-x  hdfs supergroup    0 2015-11-11 12:52   /var
If it fails because it is unable to load the rJava package, try:
export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/amd64:$JAVA_HOME/jre/lib/amd64/server
Python and packages
Python is the main component responsible for executing the modules. We have some minimum requirements to make the tasks work as expected. To make package handling easy, install pip.
Installing pip
sudo apt-get install python-pip python-dev libsasl2-dev
Installing modules
Create a /tmp/requirements.txt file with the following content (the list has been shortened to leave out packages that are not needed to run the modules; let us know if you find an important dependency missing)
amqp==1.4.6
sasl==0.1.3
celery==3.1.13
happybase==0.8
mrjob==0.4
protobuf==2.5.0
snakebite==1.1.1
thrift==0.9.1
pyhs2==0.6.0
pymongo==2.7.2
Install dependencies by running
sudo pip install -r /tmp/requirements.txt
Checks and tests
happybase
In [1]: import happybase

In [2]: a = happybase.Connection()

In [3]: a.tables()
Out[3]: []
snakebite
In [5]: from snakebite import client

In [9]: a = client.Client('hadoop', 8020)

In [10]: for i in a.ls(['/']): print i
   ....:
{'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1447243264519L, 'length': 0L, 'blocksize': 0L, 'owner': u'hbase', 'path': '/hbase'}
{'group': u'supergroup', 'permission': 1023, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1447242845223L, 'length': 0L, 'blocksize': 0L, 'owner': u'hdfs', 'path': '/tmp'}
{'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1447245086688L, 'length': 0L, 'blocksize': 0L, 'owner': u'hdfs', 'path': '/user'}
{'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1447242767505L, 'length': 0L, 'blocksize': 0L, 'owner': u'hdfs', 'path': '/var'}
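pyhs2

pyhs2 can be tested the same way; this sketch assumes hive-server2 is running on the default port 10000 with the 'hive'/'password' credentials created during the Hive setup:

In [1]: import pyhs2

In [2]: with pyhs2.connect(host='hadoop', port=10000, authMechanism='PLAIN',
   ...:                    user='hive', password='password', database='default') as conn:
   ...:     with conn.cursor() as cur:
   ...:         cur.execute('show tables')
   ...:         for i in cur.fetch():
   ...:             print i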
Official documentation for the most important packages (search for the pages of the specific installed version if needed)
- snakebite: |snakebite_doc_site|
- happybase: |happybase_doc_site|
- mrjob: |mrjob_doc_site|
RabbitMQ
Use the commands below to install RabbitMQ. Celery will use it (and configure it).
echo "deb http://www.rabbitmq.com/debian/ testing main" | sudo tee -a /etc/apt/sources.list.d/rabbitmq.list wget http://www.rabbitmq.com/rabbitmq-signing-key-public.asc sudo apt-key add rabbitmq-signing-key-public.asc sudo apt-get update sudo apt-get install rabbitmq-server -y
MongoDB
We don't use the latest MongoDB release. For development, we use MongoDB 2.6.
Install it using these lines
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10
echo 'deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' | sudo tee /etc/apt/sources.list.d/mongodb.list
sudo apt-get update
sudo apt-get install mongodb-org=2.6.0 mongodb-org-server=2.6.0 mongodb-org-shell=2.6.0 mongodb-org-mongos=2.6.0 mongodb-org-tools=2.6.0
Once installed, we need to set up authentication (the code is not ready to work without it, and there is no backdoor for development).
Edit mongodb conf
sudo vi /etc/mongod.conf
Enable authentication by uncommenting the auth=true line, as shown below
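The relevant line in /etc/mongod.conf should end up looking like this:

auth = true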
Restart service
sudo service mongod restart
Add siteUserAdmin user
mongo

use admin
db.createUser(
  {
    user: "siteUserAdmin",
    pwd: "password",
    roles: [ { role: "userAdminAnyDatabase", db: "admin" } ]
  }
)
Create users
mongo --port 27017 -u siteUserAdmin -p password --authenticationDatabase admin

use admin
db.createUser(
  {
    user: "root",
    pwd: "password",
    roles: [ "root" ]
  }
)
mongo --port 27017 -u siteUserAdmin -p password --authenticationDatabase admin

use rest_service
db.createUser(
  {
    user: "root",
    pwd: "password",
    roles: [ { role: "readWrite", db: "rest_service" } ]
  }
)
Enable remote connections to mongodb
Open mongodb configuration with sudo permissions
sudo vi /etc/mongod.conf
Comment out the bind_ip line, as shown below
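After the change the line should look like this (when bind_ip is commented out, mongod listens on all interfaces):

# bind_ip = 127.0.0.1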
Restart the mongodb service
sudo service mongod restart