Saturday, July 5, 2014

Installing a Multi-Node Apache Hadoop 2.3.0 Cluster on CentOS

1. Pre-planning
After getting frustrated trying to find a multi-node Hadoop 2.x.x installation tutorial on the internet, I decided to write this tutorial. I have tried my best to simplify it as much as I can. The DigitalOcean blog and the Apache website documentation helped me a lot to understand the procedures and concepts.
The distributed architecture of Hadoop has a master node, a couple of slave nodes, and an optional secondary master node. In this tutorial, I am going to set up one master node and two slave nodes.
Ensure that all slave nodes are identical (in terms of partitions, memory, CPU, etc.). This is not a requirement of the Hadoop installation, but it makes management and operations easier.
Make sure that the nodes can authenticate to each other over SSH using public key (passwordless) authentication.
The nodes will have the following names, which should be specified in /etc/hosts (or have resource records in DNS). Because of the heavy communication/heartbeat traffic among the nodes, it is recommended to use /etc/hosts instead of a DNS server. Alternatively, a DNS caching agent can be configured on all nodes to cache records and avoid repeated DNS queries.
192.168.0.22 slave02.greendata.org slave02
192.168.0.11 slave01.greendata.org slave01
192.168.0.222 master.greendata.org master
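For reference, the three entries can be appended on every node like this (a minimal sketch; it assumes the lines are not already present in /etc/hosts):
# cat >> /etc/hosts <<'EOF'
192.168.0.222 master.greendata.org master
192.168.0.11 slave01.greendata.org slave01
192.168.0.22 slave02.greendata.org slave02
EOF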
Generate public keys and copy them to all other nodes. How public key authentication works is beyond the scope of this installation; please refer to the basic concepts of cryptography and the RSA/DSA asymmetric algorithms.
In Master
#ssh-keygen -t rsa
#ssh-copy-id root@slave01.greendata.org
#ssh-copy-id root@slave02.greendata.org
In slave01
#ssh-keygen -t rsa
#ssh-copy-id root@master.greendata.org
#ssh-copy-id root@slave02.greendata.org
In slave02
#ssh-keygen -t rsa
#ssh-copy-id root@master.greendata.org
#ssh-copy-id root@slave01.greendata.org
Now all nodes can communicate with each other without a password, using the keys that were just generated and copied.
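A quick check that the key-based login works (no password prompt should appear and the slave hostname should be printed):
[root@master ~]# ssh slave01 "hostname"
[root@master ~]# ssh slave02 "hostname"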
Installing Java on all nodes
From the master node (the slaves are installed over SSH):
#yum install java-1.7.0-openjdk-devel
#ssh slave01 "yum install java-1.7.0-openjdk-devel"
#ssh slave02 "yum install java-1.7.0-openjdk-devel"
Check the Java version:
#java -version
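Optionally verify that the same version is installed on the slaves as well:
#ssh slave01 "java -version"
#ssh slave02 "java -version"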
2. Hadoop Installation and Configuration
Master Node
#curl -O http://www.carfab.com/apachesoftware/hadoop/common/hadoop-2.3.0/hadoop-2...
# tar -xvzf hadoop-2.3.0.tar.gz
#mv hadoop-2.3.0 /opt/hadoop
# vim ~root/.bashrc /* ADD HADOOP ENVIRONMENT VARIABLES
--------------------Entry of .bashrc file---------------------------------------------------
# User specific aliases and functions
alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk
export HADOOP_INSTALL=/opt/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
-----------------End of bashrc file----------------------------------------------------
[root@master ~]# source ~/.bashrc /* Exporting the environment variables
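A quick sanity check that the variables took effect and the Hadoop binaries are on the PATH:
[root@master ~]# echo $HADOOP_INSTALL /* should print /opt/hadoop
[root@master ~]# hadoop version /* should report Hadoop 2.3.0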
/*Initializing Configuration Files
[root@master hadoop]# vim /opt/hadoop/etc/hadoop/hadoop-env.sh
# The java implementation to use.
JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk
export JAVA_HOME=${JAVA_HOME}

Configuring the XML configuration files
[root@master hadoop]# vim /opt/hadoop/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
[root@master hadoop]# vim /opt/hadoop/etc/hadoop/yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
[root@master hadoop]#cp /opt/hadoop/etc/hadoop/mapred-site.xml.template /opt/hadoop/etc/hadoop/mapred-site.xml
[root@master hadoop]# vim /opt/hadoop/etc/hadoop/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Configuring /opt/hadoop/etc/hadoop/hdfs-site.xml. This file has to be configured on each host in the cluster. It specifies the directories that will be used for the NameNode and the DataNode data on that host.
[root@master hadoop]# mkdir -p /data/local/hadoop_store/hdfs/namenode /* In the master node
[root@master hadoop]# vim /opt/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value> <!-- Replication factor for files; ideally at least 3 for reliability, depending on how many DataNodes you have -->
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/data/local/hadoop_store/hdfs/namenode</value>
  </property>
</configuration>

For Slave Nodes:
Copy all the Hadoop files from the master to the slaves:
[root@master hadoop]#scp -r /opt/ha* slave01:/opt/
[root@master hadoop]#scp -r /opt/ha* slave02:/opt/
Create the directories to be used by the DataNodes:
[root@master hadoop]# ssh slave01 "mkdir -p /data/local/hadoop_store/hdfs/datanode"
[root@master hadoop]# ssh slave02 "mkdir -p /data/local/hadoop_store/hdfs/datanode"
Add the following property to the copied hdfs-site.xml on each slave (the value must match the DataNode directory created above):
[root@slave01 ~]# vim /opt/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/data/local/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>
[root@slave02 ~]# vim /opt/hadoop/etc/hadoop/hdfs-site.xml /* add the same dfs.datanode.data.dir entry as on slave01
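One file that is easy to miss: start-dfs.sh and start-yarn.sh (used below) read the slaves file on the master to decide where to start the DataNode and NodeManager daemons, so make sure it lists the two slaves (by default it contains only localhost):
[root@master hadoop]# vim /opt/hadoop/etc/hadoop/slaves
slave01
slave02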
[root@master hadoop]# hdfs namenode -format /* Formatting the HDFS file system; run once, on the master only
[root@master hadoop]# cat /data/local/hadoop_store/hdfs/namenode/current/VERSION /* Verifying the version
[root@master hadoop]# start-dfs.sh
[root@master hadoop]# start-yarn.sh
***************************
Verifying the Processes
***************************
[root@master hadoop]# jps
15217 NameNode
15379 SecondaryNameNode
15800 Jps
15535 ResourceManager
[root@master hadoop]# ssh slave01 "jps"
11887 NodeManager
11799 DataNode
11997 Jps
[root@master hadoop]# ssh slave02 "jps"
9415 Jps
9306 NodeManager
9219 DataNode
*******************************************
Verifying Through GUI
*******************************************
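The two web UIs to check are the NameNode UI and the ResourceManager UI (these are the Hadoop 2.x default ports); open them in a browser from a machine that can resolve the cluster hostnames:
http://master.greendata.org:50070 /* NameNode UI; live DataNodes are listed under the Datanodes tab
http://master.greendata.org:8088 /* ResourceManager UI; registered NodeManagers are listed under Nodes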


Troubleshooting Nodes:
Check the log files of each process.
Ex: [root@slave02 ~]# tail /opt/hadoop/logs/hadoop-root-datanode-slave02.greendata.org.log /* All log files are located in the logs folder inside the Hadoop installation directory
Make sure that iptables or SELinux is not blocking the communication between nodes. Typical errors you get are:
2014-07-05 01:22:16,719 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.0.222:9000. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-07-05 01:22:17,721 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.0.222:9000. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
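If you see retries like the above, a quick way to rule out the firewall and SELinux on CentOS 6 (assuming the iptables service is in use) is to disable them temporarily on all nodes; for testing only, in production open the required ports instead:
# service iptables stop
# chkconfig iptables off
# setenforce 0 /* switches SELinux to permissive until the next reboot
# sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config /* makes it persistent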

Note: I don't own the greendata.org public domain; it is only used for my internal configuration.
