
EMPOWERING platform - Installation of the components


Hadoop installation

In EMPOWERING we use CDH 4.7 to develop new modules. We need to set up a pseudo-distributed cluster in order to test the data flow, i.e. the development machine needs the same architecture that we will use in the production environment.

In order to help developers, we provide a guide to install a minimal architecture that makes the whole thing work. This guide has been developed in a Virtual Machine with an Ubuntu Precise Pangolin (12.04.5) distribution. If you find this guide incomplete, you can read the |cloudera_doc_site| from Cloudera. Support will be limited for any other setup (Cloudera distribution or operating system), as we can't ensure the compatibility of the needed components, and working in a VM should be enough to develop new modules (although it could be slow when processing a lot of information).

Prelude

If we are testing with a Virtual Machine, we can safely apply the following settings. In fact, we MUST set them to make everything work. If you are testing a Hadoop installation in any other scenario (without a VM), you should figure out which values need to be set and where.

  1. Set hostname
sudo hostname hadoop
  2. Add a new line to /etc/hosts with the inet addr reported by ifconfig (discard the 127.0.0.1 entry)
ifconfig | grep "inet addr"
     inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
     inet addr:127.0.0.1  Mask:255.0.0.0

echo "10.0.2.15  hadoop hadoop.example.com" | sudo tee -a /etc/hosts

Java

We need a specific version of Java to make Hadoop work properly.

  • Download the .tar.gz file of the recommended version of the Oracle JDK from the |java_download_site|. To make it straightforward, use these commands to download v1.7.0_55 and install it:

    sudo mkdir /usr/java
    cd /usr/java/
    sudo wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie"  http://download.oracle.com/otn-pub/java/jdk/7u55-b13/jdk-7u55-linux-x64.tar.gz
    sudo tar zxvf jdk-7u55-linux-x64.tar.gz
    sudo rm jdk-7u55-linux-x64.tar.gz
    
  • Symbolically link the directory where the JDK is installed to /usr/java/default; for example:

    sudo ln -s /usr/java/jdk1.7.0_55 /usr/java/default
    
  • In /root/.bashrc, set JAVA_HOME to the directory where the JDK is installed, like this:

    echo "export JAVA_HOME=/usr/java/default" | sudo tee -a /root/.bashrc
    

CDH4

  1. Add the Cloudera repository
echo "deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib" | sudo tee -a /etc/apt/sources.list.d/cloudera.list
echo "deb-src http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib" | sudo tee -a /etc/apt/sources.list.d/cloudera.list
sudo apt-get install curl
curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
sudo apt-get update
  2. Install and start zookeeper
sudo apt-get install zookeeper-server
sudo service zookeeper-server init
sudo service zookeeper-server start
  3. Install hadoop components (HDFS and MapReduce)
sudo apt-get install hadoop-0.20-mapreduce-tasktracker hadoop-0.20-mapreduce-jobtracker hadoop-hdfs-datanode hadoop-hdfs-namenode
  4. Set up the namenode and datanode (be sure to change core-site.xml and hdfs-site.xml in /etc/hadoop/conf/; a minimal example is sketched after this step)

    1. Set the data folders for namenode and datanode (taking /var/lib/hadoop-hdfs as an example)
    sudo mkdir -p /var/lib/hadoop-hdfs/cache/hdfs/dfs/name
    sudo chown -R hdfs:hdfs /var/lib/hadoop-hdfs/cache/hdfs/dfs/name
    sudo chmod 700 /var/lib/hadoop-hdfs/cache/hdfs/dfs/name
    
    sudo mkdir -p /var/lib/hadoop-hdfs/cache/hdfs/dfs/data
    sudo chown -R hdfs:hdfs /var/lib/hadoop-hdfs/cache/hdfs/dfs/data
    sudo chmod 700 /var/lib/hadoop-hdfs/cache/hdfs/dfs/data
    
    2. Format the NameNode (before starting it for the first time)
    sudo -u hdfs hadoop namenode -format
    
    3. Start the namenode and datanode
    sudo service hadoop-hdfs-namenode start
    sudo service hadoop-hdfs-datanode start
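
As mentioned in step 4, core-site.xml and hdfs-site.xml have to be changed before formatting the NameNode. A minimal pseudo-distributed sketch, assuming the hadoop hostname from the Prelude and the data folders created above (it replaces the whole files, so adapt it if you keep other properties):

cat <<'EOF' | sudo tee /etc/hadoop/conf/core-site.xml
<?xml version="1.0"?>
<configuration>
  <!-- NameNode address; "hadoop" is the hostname set in the Prelude -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop:8020</value>
  </property>
</configuration>
EOF

cat <<'EOF' | sudo tee /etc/hadoop/conf/hdfs-site.xml
<?xml version="1.0"?>
<configuration>
  <!-- single node, so keep one replica per block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- folders created in the previous sub-step -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///var/lib/hadoop-hdfs/cache/hdfs/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///var/lib/hadoop-hdfs/cache/hdfs/dfs/data</value>
  </property>
</configuration>
EOF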
    
  5. Set up the jobtracker and tasktracker (mapred-site.xml in /etc/hadoop/conf needs to be modified; a minimal example is sketched after this step)

    1. Create local folders and set permissions
    sudo mkdir -p /data/mapred/local
    sudo chown -R mapred:hadoop /data/mapred/local
    
    2. Create HDFS folders
    sudo -u hdfs hadoop fs -mkdir /tmp
    sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
    sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
    sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
    sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
    sudo -u hdfs hadoop fs -mkdir -p /tmp/mapred/system
    sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system
    sudo -u hdfs hadoop fs -mkdir /user
    sudo -u hdfs hadoop fs -chmod -R 1777 /user
    
    3. Start services
    sudo service hadoop-0.20-mapreduce-tasktracker start
    sudo service hadoop-0.20-mapreduce-jobtracker start
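
Step 5 asks for mapred-site.xml to be modified before starting the MapReduce daemons. A minimal sketch, again assuming the hadoop hostname and the local folder created above:

cat <<'EOF' | sudo tee /etc/hadoop/conf/mapred-site.xml
<?xml version="1.0"?>
<configuration>
  <!-- JobTracker address (MRv1) -->
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop:8021</value>
  </property>
  <!-- local folder created in the first sub-step -->
  <property>
    <name>mapred.local.dir</name>
    <value>/data/mapred/local</value>
  </property>
</configuration>
EOF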
    
  6. Install HBase

    1. Install hbase-master and regionserver
    sudo apt-get install hbase-master hbase-regionserver hbase hbase-thrift
    
    2. Change the configuration (/etc/hbase/conf/hbase-site.xml); a minimal example is sketched after this step
    3. Create HDFS folders
    sudo -u hdfs hadoop fs -mkdir /hbase
    sudo -u hdfs hadoop fs -chown hbase /hbase
    
    4. Start the master and regionserver
    sudo service hbase-master start
    sudo service hbase-regionserver start
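
Sub-step 2 asks for hbase-site.xml to be changed before starting the daemons. A minimal pseudo-distributed sketch, assuming the hadoop hostname and the zookeeper-server installed earlier:

cat <<'EOF' | sudo tee /etc/hbase/conf/hbase-site.xml
<?xml version="1.0"?>
<configuration>
  <!-- store HBase data in the HDFS folder created above -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://hadoop:8020/hbase</value>
  </property>
  <!-- pseudo-distributed mode with the external zookeeper-server -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>hadoop</value>
  </property>
</configuration>
EOF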
    
  7. Install Hive

    1. Install needed packages
    sudo apt-get install hive hive-metastore hive-server2
    sudo apt-get install mysql-server libmysql-java
    sudo service mysql start
    sudo ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar
    
    2. Create the initial database schema
    mysql -u root -p
    
    CREATE DATABASE metastore;
    USE metastore;
    SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql;
    
    
    CREATE USER 'hive'@'localhost' IDENTIFIED BY 'password';
    REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'localhost';
    GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'localhost';
    FLUSH PRIVILEGES;
    quit;
    
    3. Change hive-site.xml (/etc/hive/conf/hive-site.xml); a minimal example is sketched after this step
    4. Start hive-metastore
    sudo service hive-metastore start
    
    5. Set Hive permissions on HDFS
    sudo -u hdfs hadoop fs -mkdir -p /user/hive/warehouse
    sudo -u hdfs hadoop fs -chmod -R 1777 /user/hive/warehouse
    
    6. Start hive-server2
    sudo service hive-server2 start
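
Sub-step 3 asks for hive-site.xml to be changed before starting the metastore. A minimal sketch pointing Hive at the MySQL metastore created above and at the local metastore service (adjust user and password to whatever you used in the SQL step):

cat <<'EOF' | sudo tee /etc/hive/conf/hive-site.xml
<?xml version="1.0"?>
<configuration>
  <!-- MySQL metastore database created in sub-step 2 -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
  </property>
  <!-- clients and hive-server2 talk to the local metastore service -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
EOF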
    

Other

EMPOWERING modules are both Python code and R scripts. We will try to give an installation guide for the most important dependencies.

RHipe

Our main R library to run jobs on the Hadoop cluster.

  1. Build and install RHipe as described in |rhipe_doc_site|, starting with its build dependencies.

    1. Apache Ant, Apache Maven and R

      sudo apt-get install ant
      sudo apt-get install maven
      sudo apt-get install r-base r-base-dev r-cran-rjava
      
    2. Google protobuf and pkg-config

    sudo apt-get install build-essential libtool pkg-config
    sudo apt-get install git
    cd /opt
    sudo git clone https://github.com/google/protobuf.git
    cd protobuf
    sudo git checkout tags/v2.4.1
    sudo ./autogen.sh
    sudo ./configure
    sudo make
    sudo make check
    sudo make install
    
    3. Install some RHipe dependencies

      1. On the R console (enter using 'sudo R' to be able to install packages)
      install.packages(c('digest', 'stringr', 'brew'))
      quit()
      
      2. Install roxygen2
      sudo mkdir /opt/R-Packages
      cd /opt/R-Packages
      sudo wget https://cran.r-project.org/src/contrib/roxygen2_2.0.tar.gz
      sudo R CMD INSTALL roxygen2_2.0.tar.gz
      
    4. Get RHipe and install it

      1. Build Rhipe tarball for installation

        cd /opt/
        sudo git clone https://github.com/tesseradata/RHIPE.git
        cd RHIPE/
        sudo git checkout v0.74
        sudo ant build-distro -Dhadoop.version=cdh4
        
        • If the build fails with a missing tools.jar dependency error, try installing Java 1.6

          sudo apt-get install openjdk-6-jdk
          
        • If the build fails because of asm-3.1 problems, try downloading it manually

          sudo wget -O asm-3.1.jar 'http://search.maven.org/remotecontent?filepath=asm/asm/3.1/asm-3.1.jar'
          sudo cp asm-3.1.jar /root/.m2/repository/asm/asm/3.1/asm-3.1.jar
          sudo rm /root/.m2/repository/asm/asm/3.1/asm-3.1.pom
          
        • If the build fails on Rcmd BUILD with a message like the one below, build and copy the package manually with the two commands that follow:

          r-build:
              [exec] /usr/lib/R/bin/Rcmd: 61: exec: BUILD: not found
          
          sudo R CMD build package/R
          sudo cp Rhipe_0.74.1.tar.gz ../R-Packages
          
      2. Install the tarball (if we had to build it manually)

        cd /opt/R-Packages
        sudo R CMD INSTALL Rhipe_0.74.1.tar.gz
        
    5. Set environment variables

      export HADOOP_HOME=/usr/lib/hadoop
      export HADOOP_CONF_DIR=/etc/hadoop/conf
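
      These exports only apply to the current shell; to make them permanent you can append them to the shell profile, for example root's .bashrc as was done for JAVA_HOME:

      echo "export HADOOP_HOME=/usr/lib/hadoop" | sudo tee -a /root/.bashrc
      echo "export HADOOP_CONF_DIR=/etc/hadoop/conf" | sudo tee -a /root/.bashrc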
      
    6. Test if everything works fine

      cd ~/
      R
      library(Rhipe)
      rhinit()
      
      # Results in
      Rhipe: Using Rhipe.jar file
      Initializing Rhipe v0.74.1
      SLF4J: Class path contains multiple SLF4J bindings.
      SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client-0.20/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client-0.20/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
      2015-11-11 17:21:29,321 WARN  [main][NativeCodeLoader] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      Initializing mapfile caches
      > rhls('/')
        permission owner      group size          modtime   file
      1 drwxr-xr-x hbase supergroup    0 2015-11-11 13:01 /hbase
      2 drwxrwxrwt  hdfs supergroup    0 2015-11-11 12:54   /tmp
      3 drwxr-xr-x  hdfs supergroup    0 2015-11-11 13:31  /user
      4 drwxr-xr-x  hdfs supergroup    0 2015-11-11 12:52   /var
      
      • If it fails because it is unable to load the rJava package, try:

        export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/amd64:$JAVA_HOME/jre/lib/amd64/server
        

Python and packages

Python is the main language responsible for executing the modules. There are some minimum requirements to make the tasks work as expected. To make it easy to handle packages, install pip.

  1. Installing pip

    sudo apt-get install python-pip python-dev libsasl2-dev
    
  2. Installing modules

    1. Make a /tmp/requirements.txt file with the following content (the list has been shortened to leave out packages that are not needed to run the modules; let us know if you find an important dependency missing)

      amqp==1.4.6
      sasl==0.1.3
      celery==3.1.13
      happybase==0.8
      mrjob==0.4
      protobuf==2.5.0
      snakebite==1.1.1
      thrift==0.9.1
      pyhs2==0.6.0
      pymongo==2.7.2
      
    2. Install dependencies by running

      sudo pip install -r /tmp/requirements.txt
      
    3. Checks and tests

      1. happybase

        In [1]: import happybase
        
        In [2]: a = happybase.Co
        happybase.Connection      happybase.ConnectionPool
        
        In [2]: a = happybase.Connection()
        
        In [3]: a.tables()
        Out[3]: []
        
      2. snakebite

        In [5]: from snakebite import client
        
        In [9]: a = client.Client('hadoop', 8020)
        
        In [10]: for i in a.ls(['/']):
            print i
           ....:
        {'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1447243264519L, 'length': 0L, 'blocksize': 0L, 'owner': u'hbase', 'path': '/hbase'}
        {'group': u'supergroup', 'permission': 1023, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1447242845223L, 'length': 0L, 'blocksize': 0L, 'owner': u'hdfs', 'path': '/tmp'}
        {'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1447245086688L, 'length': 0L, 'blocksize': 0L, 'owner': u'hdfs', 'path': '/user'}
        {'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1447242767505L, 'length': 0L, 'blocksize': 0L, 'owner': u'hdfs', 'path': '/var'}
        

RabbitMQ

Use these commands to install RabbitMQ. Celery will use it (and configure it):

echo "deb http://www.rabbitmq.com/debian/ testing main" | sudo tee -a /etc/apt/sources.list.d/rabbitmq.list
wget http://www.rabbitmq.com/rabbitmq-signing-key-public.asc
sudo apt-key add rabbitmq-signing-key-public.asc
sudo apt-get update
sudo apt-get install rabbitmq-server -y
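
To check that the broker is up and running before Celery starts using it, you can ask the server for its status:

sudo rabbitmqctl status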

MongoDB

We don't use the latest MongoDB release. For development, we use MongoDB 2.6.

  1. Install it using these lines

    sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10
    echo 'deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' | sudo tee /etc/apt/sources.list.d/mongodb.list
    sudo apt-get update
    sudo apt-get install mongodb-org=2.6.0 mongodb-org-server=2.6.0 mongodb-org-shell=2.6.0 mongodb-org-mongos=2.6.0 mongodb-org-tools=2.6.0
    
  2. Once installed, we need to set up authentication (the code is not ready to work without it and there is no backdoor for development)

    1. Edit mongodb conf

      sudo vi /etc/mongod.conf
      
    2. Enable authentication by uncommenting the auth=true line

    3. Restart service

      sudo service mongod restart
      
  3. Add siteUserAdmin user

    mongo
    use admin
    db.createUser(
      {
        user: "siteUserAdmin",
        pwd: "password",
        roles: [ { role: "userAdminAnyDatabase", db: "admin" } ]
      }
    )
    
  4. Create the database users

    mongo --port 27017 -u siteUserAdmin -p password --authenticationDatabase admin
    use admin
    db.createUser(
        {
          user: "root",
          pwd: "password",
          roles: [ "root" ]
        }
    )
    
    mongo --port 27017 -u siteUserAdmin -p password --authenticationDatabase admin
    use rest_service
    db.createUser(
      {
        user: "root",
        pwd: "password",
        roles: [ { role: "readWrite", db: "rest_service" } ]
      }
    )
    
  5. Enable remote connections to mongodb

    1. Open mongodb configuration with sudo permissions

      sudo vi /etc/mongod.conf
      
    2. Comment out the bind_ip line

    3. Restart the mongodb service (a sketch of all three steps follows)
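
    A minimal sketch of these three steps, assuming the stock /etc/mongod.conf still uses the old-style bind_ip = 127.0.0.1 line:

      sudo sed -i 's/^bind_ip/#bind_ip/' /etc/mongod.conf
      sudo service mongod restart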
