If you are a Hadoop novice, I strongly suggest you begin by building a single node; you can learn that from this website. After you have finished building a single node, you can read my blog to learn how to run an N-node cluster on just one computer.
Abstract #
This blog introduces how to use one computer to build an N-node cluster. I suggest you use Ubuntu. You can also use Windows, but then you'd better install VirtualBox and run one desktop Ubuntu as your base server. In this blog, we will try two different ways to build a Hadoop cluster on one computer.
Introduction #
Before you start, you can download the required software from the Internet.
- JDK 8 (optional)
We can also install it with apt, but that may be slow in China, so you'd better download it from the website:
https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
Choose "jdk-8u181-linux-x64.tar.gz" to download. You can also install it on your master machine later; you can read about that in this blog.
- Hadoop (2.8.5)
I chose the 2.8.5 release. Download the binary tarball (not the -src source package, since we will run its bin/ scripts directly):
https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
- Ubuntu Image
In this guide, we choose Ubuntu 16.04 Server to build the cluster. You can use the 163 mirror to speed up your download:
http://mirrors.163.com/ubuntu-releases/16.04.5/ubuntu-16.04.5-server-amd64.iso
- Virtualbox
We need virtualbox to create our cluster. It is easy to install VirtualBox on Ubuntu; you can read this article to install virtualbox-5.2.
- Docker
We will also try to use Docker to build our cluster; it is easy to install on Ubuntu. The install tutorial is https://docs.docker.com/install/linux/docker-ce/ubuntu/#uninstall-old-versions
Clusters On VirtualBox #
From now on I assume you are working on an Ubuntu 16.04 desktop OS. Let's begin.
First, let's set up a master. After we install the required software on the master, we can use the virtualbox clone function to easily build the slaves.
Build Ubuntu VMs #
- Create a new machine named master
- Choose 2 GB RAM and a VDI disk
Then start this machine and load the ISO file you downloaded. Pay attention to install the SSH server (or you can install it with apt after installing the OS).
Before we start installing Hadoop and the Java SDK, let me explain the network requirements.
For our cluster to run, the master and slaves must be able to reach each other over the network. If we have many physical computers this is simple: they just need public IPs, or private IPs on the same LAN. But if we only have one computer, how can the master and slaves each get an independent IP?
This is why we install virtualbox: it provides independent machines inside one computer. Moreover, it provides a simulated NIC for each machine, so each one can have its own private IP on the LAN.
So the key to our cluster is the bridge: we need to choose the bridged adapter mode so that the master and slaves sit on the same LAN. Pay attention to choose your real NIC. On Ubuntu, just run ifconfig and find the interface that has a line like inet addr:192.168.1.12; usually it is eth0.
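In the VirtualBox GUI this is set per VM under Settings -> Network. If you prefer the command line, the same change can be made with vboxmanage (a sketch, assuming your VM is named master and your real NIC is eth0; the VM must be powered off when you run modifyvm):
# attach the first virtual NIC in bridged mode to the host's real NIC
vboxmanage modifyvm master --nic1 bridged --bridgeadapter1 eth0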
When you have finished the OS installation, you can log in and start setting up the Hadoop cluster.
Step 1. Configure Static IP
In your virtual machine, the IP address can change after a reboot, because Ubuntu uses DHCP to obtain an IP from the gateway. We need to make sure our master and slaves have fixed IPs so their connection stays stable.
To do this, first make sure your installation is OK: try ping baidu.com to check whether you are connected to the Internet. Then we need to know our gateway address. Run route in a shell and you will see a table; in the Gateway column you will find one or more addresses like 192.168.0.1, which is your gateway. Now open the network configuration file:
cat /etc/network/interfaces
You will see something like this:
auto eth0
iface eth0 inet dhcp
Here eth0 is your NIC (yours may be different), and dhcp means the IP is obtained from DHCP. Now we need to change it to static:
auto eth0
iface eth0 inet static
address 192.168.0.105
netmask 255.255.255.0
gateway 192.168.0.1
PS: make sure you change eth0 and the gateway IP to your own values. The address must be in the same subnet as the gateway under the netmask; e.g., you can't set your IP address to 10.1.1.1 if your gateway is 192.168.0.1. The easiest way is to keep the address DHCP gave you and just change the last number. If you still can't connect to the Internet, try adding the line dns-nameservers 8.8.8.8.
ifdown eth0
ifup eth0
Now run the commands above in your VM (replace eth0 with your NIC name). If you run ifconfig again, you will see that the IP address has changed to 192.168.0.105!
Step 2. Add Hostname alias
Because Hadoop uses hostnames to identify nodes, we now add hostname-IP pairs to make the connections smooth.
Just edit /etc/hosts and add the three lines below:
192.168.0.105 master
192.168.0.104 slave1
192.168.0.103 slave2
Step 3. Enable SSH Root Login
Because Hadoop needs to log in as root over SSH, we have to allow root login on Ubuntu. Open /etc/ssh/sshd_config, change the line PermitRootLogin prohibit-password to PermitRootLogin yes, then run service ssh restart.
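If you prefer a one-liner, the same edit can be done with sed (a sketch, assuming the stock Ubuntu 16.04 sshd_config wording shown above):
sudo sed -i 's/^PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
sudo service ssh restart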
You also need to use sudo to set a password for root:
sudo passwd root
Now check that you can log in as root:
ssh [email protected]
Step 4. Set Hadoop Env
First, we need to install the JDK for Hadoop. Go back to your host computer and use scp to upload the JDK to the VM. You can add the lines below to /etc/hosts on your host machine.
192.168.0.105 master
192.168.0.104 slave1
192.168.0.103 slave2
Then you can easily upload your JDK and Hadoop to the VM (unpack the tar.gz files first):
scp -r /path/your/jdk root@master:/usr/lib/jvm/java-8-oracle
scp -r /path/your/hadoop root@master:/usr/local/hadoop
PS: you can also install Java 8 with apt.
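For example, here is a sketch using OpenJDK 8 from the Ubuntu repositories. Note that it installs under /usr/lib/jvm/java-8-openjdk-amd64, so if you go this way, use that path instead of /usr/lib/jvm/java-8-oracle in the JAVA_HOME setting below.
sudo apt-get update
# installs to /usr/lib/jvm/java-8-openjdk-amd64
sudo apt-get install -y openjdk-8-jdk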
Now we have the JDK and Hadoop installed in our VM. Let's go back to the VM and configure Hadoop.
- Set JDK Home
Edit the hadoop-env.sh file (in /usr/local/hadoop/etc/hadoop/) and add export JAVA_HOME=/usr/lib/jvm/java-8-oracle to tell Hadoop where the JDK lives.
- Set Core IP
We need a boss to manage all the workers, so edit core-site.xml (in /usr/local/hadoop/etc/hadoop/) and add a property inside configuration:
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000/</value>
</property>
Each node will send its heartbeat to master:9000.
- Set HDFS replication and file dirs
The foundation of Hadoop is HDFS. Edit hdfs-site.xml and add three properties:
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///root/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///root/hdfs/datanode</value>
</property>
dfs.replication sets how many copies HDFS keeps of each block. dfs.namenode.name.dir and dfs.datanode.data.dir are optional; if you don't set them, the data is stored under /tmp (and deleted on reboot).
- Set Yarn for MapReduce
In Hadoop 2 we use Yarn to manage MapReduce. Run cp mapred-site.xml.template mapred-site.xml and then add a property to set Yarn as the MapReduce framework:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
We also need to tell Yarn which node is the master of the cluster, and enable the MapReduce shuffle service so that MapReduce jobs can run. Edit yarn-site.xml and add two properties:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
yarn.nodemanager.aux-services enables shuffle, and yarn.resourcemanager.hostname sets the ResourceManager hostname.
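As an optional convenience (not required by this blog, which mostly uses full paths), you can export HADOOP_HOME and add the Hadoop scripts to your PATH so that hadoop and hdfs work from any directory. A minimal sketch, assuming the /usr/local/hadoop and /usr/lib/jvm/java-8-oracle paths used above; append these lines to /root/.bashrc and log in again:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin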
Now the basic Hadoop settings are complete, and we can try running Hadoop on the master:
cd /usr/local/hadoop/
bin/hadoop namenode -format
sbin/start-dfs.sh
This formats the namenode and starts the DFS daemons. Now run jps and you will see the NameNode and SecondaryNameNode processes running.
Next we start Yarn to bring up the MapReduce framework.
sbin/start-yarn.sh
Now rerun jps and you will see the ResourceManager running. You can also try netstat -tuplen | grep 8088 to see port 8088 listening; the ResourceManager opens several other TCP ports too, such as 8030, 8031, and 8033. Port 8088 serves the cluster management website: open http://master:8088 to see the cluster status. For now you will see no nodes in the cluster, because we have not started any slave yet.
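If you prefer to check from the terminal, the ResourceManager also exposes its status over the standard YARN REST API (shown here as a sketch):
curl http://master:8088/ws/v1/cluster/info
curl http://master:8088/ws/v1/cluster/nodes    # lists the slaves once they join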
Congratulations, our master is running. For now we have to type a password when starting the daemons; after all the slaves are built, we will use an SSH key to log in automatically.
Now let’s build our slaves.
Using the virtualbox clone function, we clone master into a new VM named slave1.
Because everything was cloned into slave1, we need to shut down master, go into slave1, and change its hostname and static IP to turn it into a slave.
The first thing to do is rename the machine: edit /etc/hostname and change it to slave1. Then set slave1's static IP the same way as above, just replacing the address with 192.168.0.104. Then reboot and start master and slave1 at the same time.
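Here is a minimal sketch of the two files to change on slave1, assuming the same NIC name, netmask, and gateway we used for the master:
# /etc/hostname
slave1
# /etc/network/interfaces
auto eth0
iface eth0 inet static
address 192.168.0.104
netmask 255.255.255.0
gateway 192.168.0.1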
Now let's go back to master to register slave1: in the master VM, edit the /usr/local/hadoop/etc/hadoop/slaves file and add one line
slave1
and make sure you have added the slaves' hostname aliases in the master VM.
Then we try to start our cluster:
cd /usr/local/hadoop
bin/hadoop namenode -format
sbin/start-dfs.sh && sbin/start-yarn.sh
After running these commands, check http://master:8088 and you will find the master has one slave online named slave1.
PS: Now you can generate an SSH key for logging in to the slaves: just run ssh-keygen -t rsa && ssh-copy-id slave1, and you won't need to type your password to start the cluster.
Now we have a one-slave cluster. If you want more nodes, just repeat the procedure above for each extra slave.
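For example, a minimal sketch for adding slave2, assuming the slave2 hostname and the 192.168.0.103 address from the /etc/hosts entries above: clone master again, adjust the new VM as we did for slave1, then register it on master.
# on slave2: /etc/hostname
slave2
# on slave2: /etc/network/interfaces, change only the address line
address 192.168.0.103
# on master: /usr/local/hadoop/etc/hadoop/slaves
slave1
slave2
# on master: push the SSH key and restart the cluster
ssh-copy-id slave2
cd /usr/local/hadoop
sbin/stop-yarn.sh && sbin/stop-dfs.sh
sbin/start-dfs.sh && sbin/start-yarn.sh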
After you have built your N-node cluster, you can run the commands below to check whether Hadoop is working.
# create input files
mkdir input
echo "Hello Docker" >input/file2.txt
echo "Hello Hadoop" >input/file1.txt
# create input directory on HDFS
hadoop fs -mkdir -p input
# put input files to HDFS
hdfs dfs -put ./input/* input
# run wordcount
cd /usr/local/hadoop/bin
hadoop jar ../share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-*-sources.jar org.apache.hadoop.examples.WordCount input output
# print the input files
echo -e "\ninput file1.txt:"
hdfs dfs -cat input/file1.txt
echo -e "\ninput file2.txt:"
hdfs dfs -cat input/file2.txt
# print the output of wordcount
echo -e "\nwordcount output:"
hdfs dfs -cat output/part-r-00000
PS: By the way, if you want to keep this cluster running for a long time, you can use vboxmanage to manage the VMs. Simply run vboxmanage startvm master --type headless to start the master in the background (replace master with another VM name to start the others too).
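A few more standard vboxmanage subcommands that are handy for a headless setup (a sketch; adjust the VM names to yours):
# list VMs that are currently running
vboxmanage list runningvms
# start a slave in the background as well
vboxmanage startvm slave1 --type headless
# send an ACPI shutdown signal to a running VM
vboxmanage controlvm master acpipowerbutton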
Conclusion #
The difficult part of building a cluster on virtualbox is understanding how the master and slaves connect to each other. If you set up the network correctly, running the cluster is easy. But virtualbox has its limits: the bridged network stays inside the host's LAN, so it is hard to grow the cluster beyond a single environment. Next we will show how to build the cluster with Docker, where we can even run it on a swarm cluster in a real environment.
Clusters On Docker #
Building the cluster is much easier with Docker, because Docker provides an easy network bridge on a single computer or across a swarm cluster.
We use the kiwenlau/hadoop:1.0 image for our test (its Hadoop version is 2.7). Just run:
sudo docker pull kiwenlau/hadoop:1.0
After a few minutes we have the Hadoop image. Now we create our private LAN with the command below. (If you want to run a swarm cluster across many computers, just change bridge to overlay; a sketch follows the command. Powerful, isn't it?)
sudo docker network create --driver=bridge hadoop
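Here is a sketch of the overlay variant mentioned above, assuming your machines have already been joined into a Docker swarm; the --attachable flag lets plain docker run containers join the network:
# on the swarm manager (run the printed docker swarm join command on the other machines)
sudo docker swarm init
sudo docker network create --driver=overlay --attachable hadoop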
Now let's start our master server:
sudo docker run -itd \
--net=hadoop \
-p 50070:50070 \
-p 8088:8088 \
--name hadoop-master \
--hostname hadoop-master \
kiwenlau/hadoop:1.0 &> /dev/null
In this command we set the master hostname to hadoop-master, and we don't need to edit /etc/hosts as we did with virtualbox: Docker handles name resolution for us.
Now we start our slaves
sudo docker run -itd \
--net=hadoop \
--name hadoop-slave1 \
--hostname hadoop-slave1 \
kiwenlau/hadoop:1.0 &> /dev/null
sudo docker run -itd \
--net=hadoop \
--name hadoop-slave2 \
--hostname hadoop-slave2 \
kiwenlau/hadoop:1.0 &> /dev/null
After doing that, all the setup is finished. Now just run sudo docker exec -it hadoop-master bash to get into the master, and then start the cluster with bash start-hadoop.sh.
Your cluster will be up in a few minutes; open http://127.0.0.1:8088/ to see it running happily.
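To double-check from the command line, you can ask HDFS for a report inside the master container (a sketch, assuming the Hadoop binaries are on PATH inside the kiwenlau image):
sudo docker exec -it hadoop-master bash
# inside the container:
hdfs dfsadmin -report    # should list hadoop-slave1 and hadoop-slave2 as live datanodes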
Conclusion #
Having introduced two ways to build a Hadoop cluster, you will find it is easy once you know how the pieces work together. In short, we rather prefer using Docker to run Hadoop clusters: we can add more Hadoop slaves with just one command. Meanwhile, we can use a bridge or overlay network to build a better-isolated, safer Hadoop cluster.