If you are a Hadoop novice, I strongly suggest you begin by building a single node; you can learn that from this website. After you have finished building a single node, you can read my blog to learn how to run an N-node cluster on just one computer.
Abstract #
This blog introduces how to use one computer to build an N-node cluster. I suggest you use Ubuntu. You can also use Windows, but then you'd better install VirtualBox and run one desktop Ubuntu as your base server. In this blog, we will try two different ways to build a Hadoop cluster on one computer.
Introduction #
Before you start, you can download the required software from the Internet.
- JDK 8 (optional)
We can also install it with apt, but that may be slow in China, so you'd better download it from the website:
https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Choose "jdk-8u181-linux-x64.tar.gz" to download. You can also install it on your master machine later; you can read about that in this blog.
- Hadoop (2.8.5)
I chose the 2.8.5 release. Download the binary tarball (not the -src source package, since we will run its bin/ scripts directly):
https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
- Ubuntu Image
In this guide, we choose Ubuntu 16.04 Server to build the cluster. You can use the 163 mirror to speed up your download:
http://mirrors.163.com/ubuntu-releases/16.04.5/ubuntu-16.04.5-server-amd64.iso
- Virtualbox
We need virtualbox to create our cluster. It is easy to install VirtualBox on Ubuntu; you can read this article to install virtualbox-5.2.
- Docker
We will also try to use Docker to build our cluster; it is easy to install on Ubuntu. The install tutorial is https://docs.docker.com/install/linux/docker-ce/ubuntu/#uninstall-old-versions
Clusters On VirtualBox #
From now on I assume you are working on an Ubuntu 16.04 desktop OS. Let's begin.
First, let's set up a master. After we install the required software on the master, we can use the virtualbox clone function to easily build the slaves.
Build Ubuntu VMs #
- Create a new machine named master
- Choose 2 GB RAM and a VDI disk
Then start this machine and load the ISO file you downloaded. Pay attention to install the SSH server (or you can install it with apt after installing the OS).
Before we start installing Hadoop and the Java SDK, let me explain the network requirements.
For our cluster to run, the master and slaves must be able to reach each other over the network. If we have many physical computers this is simple: they just need public IPs, or private IPs on the same LAN. But if we only have one computer, how can the master and slaves each get an independent IP?
This is why we install virtualbox: it provides independent machines inside one computer. Moreover, it provides a simulated NIC for each machine, so each one can have its own private IP on the LAN.
So the key to our cluster is the bridge: we need to choose the bridged adapter mode so that the master and slaves sit on the same LAN. Pay attention to choose your real NIC. On Ubuntu, just run ifconfig and find the interface that has a line like inet addr:192.168.1.12; usually it is eth0.
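In the VirtualBox GUI this is set per VM under Settings -> Network. If you prefer the command line, the same change can be made with vboxmanage (a sketch, assuming your VM is named master and your real NIC is eth0; the VM must be powered off when you run modifyvm):
# attach the first virtual NIC in bridged mode to the host's real NIC
vboxmanage modifyvm master --nic1 bridged --bridgeadapter1 eth0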
When you have finished the OS installation, you can log in and start setting up the Hadoop cluster.
Step 1. Configure Static IP
In your virtual machine, the IP address can change after a reboot, because Ubuntu uses DHCP to obtain an IP from the gateway. We need to make sure our master and slaves have fixed IPs so their connection stays stable.
To do this, first make sure your installation is OK: try ping baidu.com to check whether you are connected to the Internet. Then we need to know our gateway address. Run route in a shell and you will see a table; in the Gateway column you will find one or more addresses like 192.168.0.1, which is your gateway. Now open the network configuration file:
cat /etc/network/interfaces
You will see something like this:
auto eth0
iface eth0 inet dhcp
Here eth0 is your NIC (yours may be different), and dhcp means the IP is obtained from DHCP. Now we need to change it to static:
auto eth0
iface eth0 inet static
address 192.168.0.105
netmask 255.255.255.0
gateway 192.168.0.1
PS: make sure you change eth0 and the gateway IP to your own values. The address must be in the same subnet as the gateway under the netmask; e.g., you can't set your IP address to 10.1.1.1 if your gateway is 192.168.0.1. The easiest way is to keep the address DHCP gave you and just change the last number. If you still can't connect to the Internet, try adding the line dns-nameservers 8.8.8.8.
ifdown eth0
ifup eth0
Now run the commands above in your VM (replace eth0 with your NIC name). If you run ifconfig again, you will see that the IP address has changed to 192.168.0.105!
Step 2. Add Hostname alias
Because Hadoop uses hostnames to identify nodes, we now add hostname-IP pairs to make the connections smooth.
Just edit /etc/hosts and add the three lines below:
192.168.0.105 master
192.168.0.104 slave1
192.168.0.103 slave2
Step 3. Enable SSH Root Login
Because Hadoop needs to log in as root over SSH, we have to allow root login on Ubuntu. Open /etc/ssh/sshd_config, change the line PermitRootLogin prohibit-password to PermitRootLogin yes, then run service ssh restart.
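If you prefer a one-liner, the same edit can be done with sed (a sketch, assuming the stock Ubuntu 16.04 sshd_config wording shown above):
sudo sed -i 's/^PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
sudo service ssh restart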
You also need to use sudo to set a password for root:
sudo passwd root
Now check that you can log in as root:
ssh [email protected]
Step 4. Set Hadoop Env
First, we need to install the JDK for Hadoop. Go back to your host computer and use scp to upload the JDK to the VM. You can add the lines below to /etc/hosts on your host machine.
192.168.0.105 master
192.168.0.104 slave1
192.168.0.103 slave2
Then you can easily upload your JDK and Hadoop to the VM (unpack the tar.gz files first):
scp -r /path/your/jdk root@master:/usr/lib/jvm/java-8-oracle
scp -r /path/your/hadoop root@master:/usr/local/hadoop
PS: you can also install Java 8 with apt.
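For example, here is a sketch using OpenJDK 8 from the Ubuntu repositories. Note that it installs under /usr/lib/jvm/java-8-openjdk-amd64, so if you go this way, use that path instead of /usr/lib/jvm/java-8-oracle in the JAVA_HOME setting below.
sudo apt-get update
# installs to /usr/lib/jvm/java-8-openjdk-amd64
sudo apt-get install -y openjdk-8-jdk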
Now we have the JDK and Hadoop installed in our VM. Let's go back to the VM and configure Hadoop.
- Set JDK Home
Edit the hadoop-env.sh file (in /usr/local/hadoop/etc/hadoop/) and add export JAVA_HOME=/usr/lib/jvm/java-8-oracle to tell Hadoop where the JDK lives.
- Set Core IP
We need a boss to manage all the workers, so edit core-site.xml (in /usr/local/hadoop/etc/hadoop/) and add a property inside configuration:
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000/</value>
</property>
Each node will send its heartbeat to master:9000.
- Set HDFS replication and file dirs
The foundation of Hadoop is HDFS. Edit hdfs-site.xml and add three properties:
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///root/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///root/hdfs/datanode</value>
</property>
dfs.replication sets how many copies HDFS keeps of each block. dfs.namenode.name.dir and dfs.datanode.data.dir are optional; if you don't set them, the data is stored under /tmp (and deleted on reboot).
- Set Yarn for MapReduce
In Hadoop 2 we use Yarn to manage MapReduce. Run cp mapred-site.xml.template mapred-site.xml and then add a property to set Yarn as the MapReduce framework:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
We also need to tell Yarn which node is the master of the cluster, and enable the MapReduce shuffle service so that MapReduce jobs can run. Edit yarn-site.xml and add two properties:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
yarn.nodemanager.aux-services enables shuffle, and yarn.resourcemanager.hostname sets the ResourceManager hostname.
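As an optional convenience (not required by this blog, which mostly uses full paths), you can export HADOOP_HOME and add the Hadoop scripts to your PATH so that hadoop and hdfs work from any directory. A minimal sketch, assuming the /usr/local/hadoop and /usr/lib/jvm/java-8-oracle paths used above; append these lines to /root/.bashrc and log in again:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin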
Now the basic Hadoop settings are complete, and we can try running Hadoop on the master:
cd /usr/local/hadoop/
bin/hadoop namenode -format
sbin/start-dfs.sh
This formats the namenode and starts the DFS daemons. Now run jps and you will see the NameNode and SecondaryNameNode processes running.
Next we start Yarn to bring up the MapReduce framework.
sbin/start-yarn.sh
Now rerun jps and you will see the ResourceManager running. You can also try netstat -tuplen | grep 8088 to see port 8088 listening; the ResourceManager opens several other TCP ports too, such as 8030, 8031, and 8033. Port 8088 serves the cluster management website: open http://master:8088 to see the cluster status. For now you will see no nodes in the cluster, because we have not started any slave yet.
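If you prefer to check from the terminal, the ResourceManager also exposes its status over the standard YARN REST API (shown here as a sketch):
curl http://master:8088/ws/v1/cluster/info
curl http://master:8088/ws/v1/cluster/nodes    # lists the slaves once they join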
Congratulations, our master is running. For now we have to type a password when starting the daemons; after all the slaves are built, we will use an SSH key to log in automatically.
Now let’s build our slaves.
Using the virtualbox clone function, we clone master into a new VM named slave1.
Because everything was cloned into slave1, we need to shut down master, go into slave1, and change its hostname and static IP to turn it into a slave.
The first thing to do is rename the machine: edit /etc/hostname and change it to slave1. Then set slave1's static IP the same way as above, just replacing the address with 192.168.0.104. Then reboot and start master and slave1 at the same time.
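Here is a minimal sketch of the two files to change on slave1, assuming the same NIC name, netmask, and gateway we used for the master:
# /etc/hostname
slave1
# /etc/network/interfaces
auto eth0
iface eth0 inet static
address 192.168.0.104
netmask 255.255.255.0
gateway 192.168.0.1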
Now let's go back to master to register slave1: in the master VM, edit the /usr/local/hadoop/etc/hadoop/slaves file and add one line
slave1
and make sure you have added the slaves' hostname aliases in the master VM.
Then we try to start our cluster:
cd /usr/local/hadoop
bin/hadoop namenode -format
sbin/start-dfs.sh && sbin/start-yarn.sh
After running these commands, check http://master:8088 and you will find the master has one slave online named slave1.
PS: Now you can generate an SSH key for logging in to the slaves: just run ssh-keygen -t rsa && ssh-copy-id slave1, and you won't need to type your password to start the cluster.
Now we have a one-slave cluster. If you want more nodes, just repeat the procedure above for each extra slave.
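For example, a minimal sketch for adding slave2, assuming the slave2 hostname and the 192.168.0.103 address from the /etc/hosts entries above: clone master again, adjust the new VM as we did for slave1, then register it on master.
# on slave2: /etc/hostname
slave2
# on slave2: /etc/network/interfaces, change only the address line
address 192.168.0.103
# on master: /usr/local/hadoop/etc/hadoop/slaves
slave1
slave2
# on master: push the SSH key and restart the cluster
ssh-copy-id slave2
cd /usr/local/hadoop
sbin/stop-yarn.sh && sbin/stop-dfs.sh
sbin/start-dfs.sh && sbin/start-yarn.sh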
After you have built your N-node cluster, you can run the commands below to check whether Hadoop is working.
# create input files
mkdir input
echo "Hello Docker" >input/file2.txt
echo "Hello Hadoop" >input/file1.txt
# create input directory on HDFS
hadoop fs -mkdir -p input
# put input files to HDFS
hdfs dfs -put ./input/* input
# run wordcount
cd /usr/local/hadoop/bin
hadoop jar ../share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-*-sources.jar org.apache.hadoop.examples.WordCount input output
# print the input files
echo -e "\ninput file1.txt:"
hdfs dfs -cat input/file1.txt
echo -e "\ninput file2.txt:"
hdfs dfs -cat input/file2.txt
# print the output of wordcount
echo -e "\nwordcount output:"
hdfs dfs -cat output/part-r-00000
PS: By the way, if you want to keep this cluster running for a long time, you can use vboxmanage to manage the VMs. Simply run vboxmanage startvm master --type headless to start the master in the background (replace master with another VM name to start the others too).
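A few more standard vboxmanage subcommands that are handy for a headless setup (a sketch; adjust the VM names to yours):
# list VMs that are currently running
vboxmanage list runningvms
# start a slave in the background as well
vboxmanage startvm slave1 --type headless
# send an ACPI shutdown signal to a running VM
vboxmanage controlvm master acpipowerbutton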
Conclusion #
The difficult part of building a cluster on virtualbox is understanding how the master and slaves connect to each other. If you set up the network correctly, running the cluster is easy. But virtualbox has its limits: the bridged network stays inside the host's LAN, so it is hard to grow the cluster beyond a single environment. Next we will show how to build the cluster with Docker, where we can even run it on a swarm cluster in a real environment.
Clusters On Docker #
Building the cluster is much easier with Docker, because Docker provides an easy network bridge on a single computer or across a swarm cluster.
We use the kiwenlau/hadoop:1.0 image for our test (its Hadoop version is 2.7). Just run:
sudo docker pull kiwenlau/hadoop:1.0
After a few minutes we have the Hadoop image. Now we create our private LAN with the command below. (If you want to run a swarm cluster across many computers, just change bridge to overlay; a sketch follows the command. Powerful, isn't it?)
sudo docker network create --driver=bridge hadoop
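Here is a sketch of the overlay variant mentioned above, assuming your machines have already been joined into a Docker swarm; the --attachable flag lets plain docker run containers join the network:
# on the swarm manager (run the printed docker swarm join command on the other machines)
sudo docker swarm init
sudo docker network create --driver=overlay --attachable hadoop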
Now let's start our master server:
sudo docker run -itd \
--net=hadoop \
-p 50070:50070 \
-p 8088:8088 \
--name hadoop-master \
--hostname hadoop-master \
kiwenlau/hadoop:1.0 &> /dev/null
In this command we set the master hostname to hadoop-master, and we don't need to edit /etc/hosts as we did with virtualbox: Docker handles name resolution for us.
Now we start our slaves
sudo docker run -itd \
--net=hadoop \
--name hadoop-slave1 \
--hostname hadoop-slave1 \
kiwenlau/hadoop:1.0 &> /dev/null
sudo docker run -itd \
--net=hadoop \
--name hadoop-slave2 \
--hostname hadoop-slave2 \
kiwenlau/hadoop:1.0 &> /dev/null
After doing that, all the setup is finished. Now just run sudo docker exec -it hadoop-master bash to get into the master, and then start the cluster with bash start-hadoop.sh.
Your cluster will be up in a few minutes; open http://127.0.0.1:8088/ to see it running happily.
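To double-check from the command line, you can ask HDFS for a report inside the master container (a sketch, assuming the Hadoop binaries are on PATH inside the kiwenlau image):
sudo docker exec -it hadoop-master bash
# inside the container:
hdfs dfsadmin -report    # should list hadoop-slave1 and hadoop-slave2 as live datanodes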
Conclusion #
Having introduced two ways to build a Hadoop cluster, you will find it is easy once you know how the pieces work together. In short, we rather prefer using Docker to run Hadoop clusters: we can add more Hadoop slaves with just one command. Meanwhile, we can use a bridge or overlay network to build a better-isolated, safer Hadoop cluster.