
EMPOWERING platform - Installation of the components


Hadoop installation

In EMPOWERING we use CDH 4.7 to develop new modules. We need to set up a pseudo-distributed cluster in order to test the data flow, i.e. the development machine needs the same architecture that we will use in the production environment.

In order to help developers, we provide a guide to install a minimal architecture that makes the whole thing work. This guide has been developed in a Virtual Machine with an Ubuntu Precise Pangolin (12.04.5) distribution. If you find this guide incomplete, you can read the |cloudera_doc_site| from Cloudera. Support will be limited for any other setup (Cloudera distribution or operating system), as we can't ensure the compatibility of the needed components, and working in a VM should be enough to develop new modules (although it could be slow when processing a lot of information).

Prelude

If we are testing with a Virtual Machine, we can safely apply the following settings. In fact, we MUST set them to make everything work. If you are testing a Hadoop installation in any other scenario (without a VM), you should figure out which values need to be set and where.

  1. Set hostname
sudo hostname hadoop
  2. Add a new line to /etc/hosts with the inet addr reported by ifconfig (discard the 127.0.0.1 entry)
ifconfig | grep "inet addr"
     inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
     inet addr:127.0.0.1  Mask:255.0.0.0

echo "10.0.2.15  hadoop hadoop.example.com" | sudo tee -a /etc/hosts

Java

We need a specific version of Java to make Hadoop work properly.

  • Download the .tar.gz file of the recommended version of the Oracle JDK from the |java_download_site|. To make it straightforward, use these commands to download v1.7.0_55 and install it:

    sudo mkdir /usr/java
    cd /usr/java/
    sudo wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie"  http://download.oracle.com/otn-pub/java/jdk/7u55-b13/jdk-7u55-linux-x64.tar.gz
    sudo tar zxvf jdk-7u55-linux-x64.tar.gz
    sudo rm jdk-7u55-linux-x64.tar.gz
    
  • Symbolically link the directory where the JDK is installed to /usr/java/default; for example:

    sudo ln -s /usr/java/jdk1.7.0_55 /usr/java/default
    
  • In /root/.bashrc, set JAVA_HOME to the directory where the JDK is installed, like this:

    echo "export JAVA_HOME=/usr/java/default" | sudo tee -a /root/.bashrc
    

CDH4

  1. Add the Cloudera repository
echo "deb [arch=amd64] http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib" | sudo tee -a /etc/apt/sources.list.d/cloudera.list
echo "deb-src http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh precise-cdh4 contrib" | sudo tee -a /etc/apt/sources.list.d/cloudera.list
sudo apt-get install curl
curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
sudo apt-get update
  2. Install and start zookeeper
sudo apt-get install zookeeper-server
sudo service zookeeper-server init
sudo service zookeeper-server start
  3. Install hadoop components (HDFS and MapReduce)
sudo apt-get install hadoop-0.20-mapreduce-tasktracker hadoop-0.20-mapreduce-jobtracker hadoop-hdfs-datanode hadoop-hdfs-namenode
  4. Set up the namenode and datanode (be sure to change core-site.xml and hdfs-site.xml in /etc/hadoop/conf/; a minimal example is sketched after this step)

    1. Set the data folders for namenode and datanode (taking /var/lib/hadoop-hdfs as an example)
    sudo mkdir -p /var/lib/hadoop-hdfs/cache/hdfs/dfs/name
    sudo chown -R hdfs:hdfs /var/lib/hadoop-hdfs/cache/hdfs/dfs/name
    sudo chmod 700 /var/lib/hadoop-hdfs/cache/hdfs/dfs/name
    
    sudo mkdir -p /var/lib/hadoop-hdfs/cache/hdfs/dfs/data
    sudo chown -R hdfs:hdfs /var/lib/hadoop-hdfs/cache/hdfs/dfs/data
    sudo chmod 700 /var/lib/hadoop-hdfs/cache/hdfs/dfs/data
    
    2. Format the NameNode (before starting it for the first time)
    sudo -u hdfs hadoop namenode -format
    
    3. Start the namenode and datanode
    sudo service hadoop-hdfs-namenode start
    sudo service hadoop-hdfs-datanode start
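
As mentioned in step 4, core-site.xml and hdfs-site.xml have to be changed before formatting the NameNode. A minimal pseudo-distributed sketch, assuming the hadoop hostname from the Prelude and the data folders created above (it replaces the whole files, so adapt it if you keep other properties):

cat <<'EOF' | sudo tee /etc/hadoop/conf/core-site.xml
<?xml version="1.0"?>
<configuration>
  <!-- NameNode address; "hadoop" is the hostname set in the Prelude -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop:8020</value>
  </property>
</configuration>
EOF

cat <<'EOF' | sudo tee /etc/hadoop/conf/hdfs-site.xml
<?xml version="1.0"?>
<configuration>
  <!-- single node, so keep one replica per block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- folders created in the previous sub-step -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///var/lib/hadoop-hdfs/cache/hdfs/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///var/lib/hadoop-hdfs/cache/hdfs/dfs/data</value>
  </property>
</configuration>
EOF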
    
  5. Set up the jobtracker and tasktracker (mapred-site.xml in /etc/hadoop/conf needs to be modified; a minimal example is sketched after this step)

    1. Create local folders and set permissions
    sudo mkdir -p /data/mapred/local
    sudo chown -R mapred:hadoop /data/mapred/local
    
    2. Create HDFS folders
    sudo -u hdfs hadoop fs -mkdir /tmp
    sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
    sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
    sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
    sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
    sudo -u hdfs hadoop fs -mkdir -p /tmp/mapred/system
    sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system
    sudo -u hdfs hadoop fs -mkdir /user
    sudo -u hdfs hadoop fs -chmod -R 1777 /user
    
    3. Start services
    sudo service hadoop-0.20-mapreduce-tasktracker start
    sudo service hadoop-0.20-mapreduce-jobtracker start
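
Step 5 asks for mapred-site.xml to be modified before starting the MapReduce daemons. A minimal sketch, again assuming the hadoop hostname and the local folder created above:

cat <<'EOF' | sudo tee /etc/hadoop/conf/mapred-site.xml
<?xml version="1.0"?>
<configuration>
  <!-- JobTracker address (MRv1) -->
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop:8021</value>
  </property>
  <!-- local folder created in the first sub-step -->
  <property>
    <name>mapred.local.dir</name>
    <value>/data/mapred/local</value>
  </property>
</configuration>
EOF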
    
  6. Install HBase

    1. Install hbase-master and regionserver
    sudo apt-get install hbase-master hbase-regionserver hbase hbase-thrift
    
    2. Change the configuration (/etc/hbase/conf/hbase-site.xml); a minimal example is sketched after this step
    3. Create HDFS folders
    sudo -u hdfs hadoop fs -mkdir /hbase
    sudo -u hdfs hadoop fs -chown hbase /hbase
    
    4. Start the master and regionserver
    sudo service hbase-master start
    sudo service hbase-regionserver start
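
Sub-step 2 asks for hbase-site.xml to be changed before starting the daemons. A minimal pseudo-distributed sketch, assuming the hadoop hostname and the zookeeper-server installed earlier:

cat <<'EOF' | sudo tee /etc/hbase/conf/hbase-site.xml
<?xml version="1.0"?>
<configuration>
  <!-- store HBase data in the HDFS folder created above -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://hadoop:8020/hbase</value>
  </property>
  <!-- pseudo-distributed mode with the external zookeeper-server -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>hadoop</value>
  </property>
</configuration>
EOF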
    
  7. Install Hive

    1. Install needed packages
    sudo apt-get install hive hive-metastore hive-server2
    sudo apt-get install mysql-server libmysql-java
    sudo service mysql start
    sudo ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar
    
    2. Create the initial database schema
    mysql -u root -p
    
    CREATE DATABASE metastore;
    USE metastore;
    SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql;
    
    
    CREATE USER 'hive'@'localhost' IDENTIFIED BY 'password';
    REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'localhost';
    GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'localhost';
    FLUSH PRIVILEGES;
    quit;
    
    3. Change hive-site.xml (/etc/hive/conf/hive-site.xml); a minimal example is sketched after this step
    4. Start hive-metastore
    sudo service hive-metastore start
    
    5. Set Hive permissions on HDFS
    sudo -u hdfs hadoop fs -mkdir -p /user/hive/warehouse
    sudo -u hdfs hadoop fs -chmod -R 1777 /user/hive/warehouse
    
    6. Start hive-server2
    sudo service hive-server2 start
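
Sub-step 3 asks for hive-site.xml to be changed before starting the metastore. A minimal sketch pointing Hive at the MySQL metastore created above and at the local metastore service (adjust user and password to whatever you used in the SQL step):

cat <<'EOF' | sudo tee /etc/hive/conf/hive-site.xml
<?xml version="1.0"?>
<configuration>
  <!-- MySQL metastore database created in sub-step 2 -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
  </property>
  <!-- clients and hive-server2 talk to the local metastore service -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
EOF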
    

Other

EMPOWERING modules are both Python code and R scripts. We will try to give an installation guide for the most important dependencies.

RHipe

Our main R library to run jobs on the Hadoop cluster.

  1. Build and install RHipe as described in |rhipe_doc_site|, starting with its build dependencies.

    1. Apache Ant, Apache Maven and R

      sudo apt-get install ant
      sudo apt-get install maven
      sudo apt-get install r-base r-base-dev r-cran-rjava
      
    2. Google protobuf and pkg-config

    sudo apt-get install build-essential libtool pkg-config
    sudo apt-get install git
    cd /opt
    sudo git clone https://github.com/google/protobuf.git
    cd protobuf
    sudo git checkout tags/v2.4.1
    sudo ./autogen.sh
    sudo ./configure
    sudo make
    sudo make check
    sudo make install
    
    3. Install some RHipe dependencies

      1. On the R console (enter using 'sudo R' to be able to install packages)
      install.packages(c('digest', 'stringr', 'brew'))
      quit()
      
      2. Install roxygen2
      sudo mkdir /opt/R-Packages
      cd /opt/R-Packages
      sudo wget https://cran.r-project.org/src/contrib/roxygen2_2.0.tar.gz
      sudo R CMD INSTALL roxygen2_2.0.tar.gz
      
    4. Get RHipe and install it

      1. Build Rhipe tarball for installation

        cd /opt/
        sudo git clone https://github.com/tesseradata/RHIPE.git
        cd RHIPE/
        sudo git checkout v0.74
        sudo ant build-distro -Dhadoop.version=cdh4
        
        • If the build fails with a missing tools.jar dependency error, try installing Java 1.6

          sudo apt-get install openjdk-6-jdk
          
        • If the build fails because of asm-3.1 problems, try downloading it manually

          sudo wget -O asm-3.1.jar 'http://search.maven.org/remotecontent?filepath=asm/asm/3.1/asm-3.1.jar'
          sudo cp asm-3.1.jar /root/.m2/repository/asm/asm/3.1/asm-3.1.jar
          sudo rm /root/.m2/repository/asm/asm/3.1/asm-3.1.pom
          
        • If the build fails on Rcmd BUILD with a message like the one below, build and copy the package manually with the two commands that follow:

          r-build:
              [exec] /usr/lib/R/bin/Rcmd: 61: exec: BUILD: not found
          
          sudo R CMD build package/R
          sudo cp Rhipe_0.74.1.tar.gz ../R-Packages
          
      2. Install the tarball (if we had to build it manually)

        cd /opt/R-Packages
        sudo R CMD INSTALL Rhipe_0.74.1.tar.gz
        
    5. Set environment variables

      export HADOOP_HOME=/usr/lib/hadoop
      export HADOOP_CONF_DIR=/etc/hadoop/conf
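
      These exports only apply to the current shell; to make them permanent you can append them to the shell profile, for example root's .bashrc as was done for JAVA_HOME:

      echo "export HADOOP_HOME=/usr/lib/hadoop" | sudo tee -a /root/.bashrc
      echo "export HADOOP_CONF_DIR=/etc/hadoop/conf" | sudo tee -a /root/.bashrc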
      
    6. Test if everything works fine

      cd ~/
      R
      library(Rhipe)
      rhinit()
      
      # Results in
      Rhipe: Using Rhipe.jar file
      Initializing Rhipe v0.74.1
      SLF4J: Class path contains multiple SLF4J bindings.
      SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client-0.20/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client-0.20/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client/slf4j-log4j12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
      2015-11-11 17:21:29,321 WARN  [main][NativeCodeLoader] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      Initializing mapfile caches
      > rhls('/')
        permission owner      group size          modtime   file
      1 drwxr-xr-x hbase supergroup    0 2015-11-11 13:01 /hbase
      2 drwxrwxrwt  hdfs supergroup    0 2015-11-11 12:54   /tmp
      3 drwxr-xr-x  hdfs supergroup    0 2015-11-11 13:31  /user
      4 drwxr-xr-x  hdfs supergroup    0 2015-11-11 12:52   /var
      
      • If it fails because it is unable to load the rJava package, try:

        export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/amd64:$JAVA_HOME/jre/lib/amd64/server
        

Python and packages

Python is the main language responsible for executing the modules. There are some minimum requirements to make the tasks work as expected. To make it easy to handle packages, install pip.

  1. Installing pip

    sudo apt-get install python-pip python-dev libsasl2-dev
    
  2. Installing modules

    1. Make a /tmp/requirements.txt file with the following content (the list has been shortened to leave out packages that are not needed to run the modules; let us know if you find an important dependency missing)

      amqp==1.4.6
      sasl==0.1.3
      celery==3.1.13
      happybase==0.8
      mrjob==0.4
      protobuf==2.5.0
      snakebite==1.1.1
      thrift==0.9.1
      pyhs2==0.6.0
      pymongo==2.7.2
      
    2. Install dependencies by running

      sudo pip install -r /tmp/requirements.txt
      
    3. Checks and tests

      1. happybase

        In [1]: import happybase
        
        In [2]: a = happybase.Co
        happybase.Connection      happybase.ConnectionPool
        
        In [2]: a = happybase.Connection()
        
        In [3]: a.tables()
        Out[3]: []
        
      2. snakebite

        In [5]: from snakebite import client
        
        In [9]: a = client.Client('hadoop', 8020)
        
        In [10]: for i in a.ls(['/']):
            print i
           ....:
        {'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1447243264519L, 'length': 0L, 'blocksize': 0L, 'owner': u'hbase', 'path': '/hbase'}
        {'group': u'supergroup', 'permission': 1023, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1447242845223L, 'length': 0L, 'blocksize': 0L, 'owner': u'hdfs', 'path': '/tmp'}
        {'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1447245086688L, 'length': 0L, 'blocksize': 0L, 'owner': u'hdfs', 'path': '/user'}
        {'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1447242767505L, 'length': 0L, 'blocksize': 0L, 'owner': u'hdfs', 'path': '/var'}
        

RabbitMQ

Use these commands to install RabbitMQ. Celery will use it (and configure it):

echo "deb http://www.rabbitmq.com/debian/ testing main" | sudo tee -a /etc/apt/sources.list.d/rabbitmq.list
wget http://www.rabbitmq.com/rabbitmq-signing-key-public.asc
sudo apt-key add rabbitmq-signing-key-public.asc
sudo apt-get update
sudo apt-get install rabbitmq-server -y
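
To check that the broker is up and running before Celery starts using it, you can ask the server for its status:

sudo rabbitmqctl status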

MongoDB

We don't use the latest MongoDB release. For development, we use MongoDB 2.6.

  1. Install it using these lines

    sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10
    echo 'deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' | sudo tee /etc/apt/sources.list.d/mongodb.list
    sudo apt-get update
    sudo apt-get install mongodb-org=2.6.0 mongodb-org-server=2.6.0 mongodb-org-shell=2.6.0 mongodb-org-mongos=2.6.0 mongodb-org-tools=2.6.0
    
  2. Once installed, we need to set up authentication (the code is not ready to work without it and there is no backdoor for development)

    1. Edit mongodb conf

      sudo vi /etc/mongod.conf
      
    2. Enable authentication by uncommenting the auth=true line

    3. Restart service

      sudo service mongod restart
      
  3. Add siteUserAdmin user

    mongo
    use admin
    db.createUser(
      {
        user: "siteUserAdmin",
        pwd: "password",
        roles: [ { role: "userAdminAnyDatabase", db: "admin" } ]
      }
    )
    
  4. Create the database users

    mongo --port 27017 -u siteUserAdmin -p password --authenticationDatabase admin
    use admin
    db.createUser(
        {
          user: "root",
          pwd: "password",
          roles: [ "root" ]
        }
    )
    
    mongo --port 27017 -u siteUserAdmin -p password --authenticationDatabase admin
    use rest_service
    db.createUser(
      {
        user: "root",
        pwd: "password",
        roles: [ { role: "readWrite", db: "rest_service" } ]
      }
    )
    
  5. Enable remote connections to mongodb

    1. Open mongodb configuration with sudo permissions

      sudo vi /etc/mongod.conf
      
    2. Comment out the bind_ip line

    3. Restart the mongodb service (a sketch of all three steps follows)
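
    A minimal sketch of these three steps, assuming the stock /etc/mongod.conf still uses the old-style bind_ip = 127.0.0.1 line:

      sudo sed -i 's/^bind_ip/#bind_ip/' /etc/mongod.conf
      sudo service mongod restart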
