Build a Hadoop Cluster on One Computer


If you are a Hadoop novice, I strongly suggest you start by building a single node; you can learn that from this website. After you have finished building a single node, you can read this blog to learn how to run an N-node cluster on just one computer.

Abstract

This blog introduces how to use one computer to build an N-node cluster. I suggest you use Ubuntu. You can also use Windows, but then you should install VirtualBox and run a desktop Ubuntu as your base server. In this blog we will try two different ways to build a Hadoop cluster on one computer.

Introduction

Before you start, download the required software from the Internet.

  • JDK 8 (optional)

We can also install it with the apt tool, but that may be slow in China, so it is better to download it from the website.

https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Choose “jdk-8u181-linux-x64.tar.gz” to download. You can also install it on your master machine later; you can read how in this blog.

  • Hadoop (2.8.5)

I chose the 2.8.5 release, which you can download from the mirror below. Note that this link points to the source tarball; for this tutorial you want the pre-built binary tarball, hadoop-2.8.5.tar.gz, from the same mirror directory.

https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.8.5/hadoop-2.8.5-src.tar.gz

  • Ubuntu image

For this walkthrough we use Ubuntu 16.04 Server to build the cluster. You can use the 163 mirror to speed up the download.

http://mirrors.163.com/ubuntu-releases/16.04.5/ubuntu-16.04.5-server-amd64.iso

  • VirtualBox

We need VirtualBox to create our cluster VMs. It is easy to install VirtualBox on Ubuntu; you can read this article to install virtualbox-5.2.

  • Docker

We will also try building our cluster with Docker, which is easy to install on Ubuntu. The install tutorial is at https://docs.docker.com/install/linux/docker-ce/ubuntu/#uninstall-old-versions

Clusters On VirtualBox

I will assume you are working on an Ubuntu 16.04 desktop OS. Now let's begin our trip.

First, let's set up a master. After we install the required software on the master, we can use VirtualBox's clone function to build the slaves easily.

Build the Ubuntu VMs

  • Create a new machine named master

  • Choose 2 GB RAM and a VDI disk

Then start the VM and load the ISO file you downloaded. Make sure you install the SSH server (or you can install it with apt after the OS is installed).
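
If you skipped the SSH server during installation, you can add it afterwards with apt; a quick sketch (assuming the standard Ubuntu package name, openssh-server):

sudo apt-get update
sudo apt-get install -y openssh-server   # provides the sshd service Hadoop will need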

Before we start installing Hadoop and the Java SDK, let me say something about the network requirements.

For our cluster to run, the master and slaves need a network that connects them. With several physical computers this is simple: they just need public IPs, or private IPs on the same LAN. But if everything runs on one computer, how do we give the master and slaves independent IPs?

This is why we install VirtualBox: it gives us independent computers inside a single physical one, and it can provide a virtual NIC for each of them. With that, each VM can have its own private IP on the LAN.

So the key to running our cluster is the bridged network.

We choose the bridged adapter so that the master and slaves sit on the same LAN. Make sure you select your real NIC: on Ubuntu, run ifconfig and look for the interface with a line like inet addr:192.168.1.12; it is usually eth0.
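
If you prefer the command line to the GUI, the adapter can also be switched with vboxmanage. This is only a sketch: the VM name (master) matches this post, but the host NIC name (eth0) is an assumption you should replace with your own.

# the VM must be powered off before you change its network adapter
vboxmanage modifyvm master --nic1 bridged --bridgeadapter1 eth0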

When the OS installation is finished, you can log in and start installing the Hadoop cluster.

Step 1. Configure Static IP

In a virtual machine the IP address can change on reboot, because Ubuntu uses DHCP to get its IP from the gateway. We need to make sure our master and slaves have fixed IPs so their connections never break.

To do this, first make sure your installation is OK: try ping baidu.com to check whether you are connected to the Internet. Then find your gateway address: run route in a shell, look at the Gateway column of the table, and you will find one or more addresses like 192.168.0.1; that is your gateway. Now open the network settings.

	cat /etc/network/interfaces	  

You will see something like this:

auto eth0  
iface eth0 inet dhcp  

eth0 is your NIC (yours may be different), and DHCP is used to obtain the IP. Now we change it to static.

auto eth0  
iface eth0 inet static  
address 192.168.0.105  
netmask 255.255.255.0  
gateway 192.168.0.1  

PS: make sure you change eth0 and the gateway IP to your own values. The address must lie in the gateway's subnet as defined by the netmask; for example, you cannot set your address to 10.1.1.1 if your gateway is 192.168.0.1. The easiest way is to keep the format DHCP gave you and just change the last number. If you still cannot reach the Internet, try adding the line dns-nameservers 8.8.8.8.

	ifdown eth0  
	ifup  eth0  

Now run the commands above in your VM (replace eth0 with your NIC name). If you run ifconfig again, you will see that the IP address has changed to 192.168.0.105.

Step 2. Add Hostname Aliases

Because Hadoop uses hostnames to identify its nodes, we add hostname-IP pairs to make the connections go smoothly.

Just edit /etc/hosts and add the three lines below

192.168.0.105 master  
192.168.0.104 slave1  
192.168.0.103 slave2  

Step 3. Enable Root SSH Login

Because Hadoop needs to log in as root over SSH, we must allow root login on Ubuntu. Open /etc/ssh/sshd_config, change the line PermitRootLogin prohibit-password to PermitRootLogin yes, then run service ssh restart.
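
If you prefer doing this non-interactively, here is a minimal sketch (assuming the stock Ubuntu 16.04 sshd_config, where the PermitRootLogin line is not commented out):

# allow root login over SSH, then restart the SSH daemon
sudo sed -i 's/^PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
sudo service ssh restart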

You also need to use sudo to set a password for root:

sudo passwd root  

Now check that you can log in as root:

ssh root@192.168.0.105

Step 4. Set Hadoop Env

First, we need to install the JDK for Hadoop. Go back to your host computer and use scp to upload the JDK to the VM. You can add the lines below to /etc/hosts on your host machine.

192.168.0.105 master  
192.168.0.104 slave1  
192.168.0.103 slave2  

Then you can easily upload your JDK and Hadoop to your VM (you need to unpack the tar.gz files first):

scp -r /path/your/jdk root@master:/usr/lib/jvm/java-8-oracle

scp -r /path/your/hadoop root@master:/usr/local/hadoop

PS: you can also install Java 8 with apt.

Now the JDK and Hadoop are installed in our VM. Let's go back to the VM and initialize Hadoop.

  • Set JDK Home

Edit hadoop-env.sh (in /usr/local/hadoop/etc/hadoop/) and add export JAVA_HOME=/usr/lib/jvm/java-8-oracle to tell Hadoop where the JDK lives.
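
For example, you can append that line from the shell instead of opening an editor (a sketch using the paths from this post):

# point Hadoop at the JDK we uploaded earlier
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-oracle' >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh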

  • Set Core IP

We need a boss to manage all the workers. Edit core-site.xml (in /usr/local/hadoop/etc/hadoop/) and add a property inside configuration:

<property>  
    <name>fs.defaultFS</name>  
    <value>hdfs://master:9000/</value>  
</property>  

Each node will send its heartbeat to master:9000.

  • Set HDFS Replication and Data Directories

HDFS is the foundation of Hadoop. Edit hdfs-site.xml and add three properties:

<property>
    <name>dfs.replication</name>
    <value>2</value>
</property>

<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///root/hdfs/namenode</value>
</property>

<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///root/hdfs/datanode</value>
</property>

dfs.replication is the number of copies HDFS keeps of each block; dfs.namenode.name.dir and dfs.datanode.data.dir are optional. If you do not set them, data is stored under /tmp (and deleted on reboot).
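
Hadoop will normally create these directories itself, but creating them up front on every node is a cheap way to catch path or permission problems early; a small sketch:

# run as root on the master and on each slave
mkdir -p /root/hdfs/namenode /root/hdfs/datanode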

  • Set Yarn for MapReduce

In Hadoop 2 we use YARN to manage MapReduce. Run cp mapred-site.xml.template mapred-site.xml and then add a property to set YARN as our MapReduce framework:

<property>  
    <name>mapreduce.framework.name</name>  
    <value>yarn</value>  
</property>  

We also need to tell YARN which node is the master of the cluster, and enable the MapReduce shuffle service so that MapReduce jobs can run. Edit yarn-site.xml and add two properties:

<property>  
    <name>yarn.nodemanager.aux-services</name>  
    <value>mapreduce_shuffle</value>  
</property>  
 
<property>  
    <name>yarn.resourcemanager.hostname</name>  
    <value>master</value>  
</property>  

yarn.nodemanager.aux-services enables shuffle, and yarn.resourcemanager.hostname sets the ResourceManager hostname.


That completes the basic Hadoop settings. Now we can try running Hadoop on the master:

cd /usr/local/hadoop/  
bin/hadoop namenode -format  
sbin/start-dfs.sh  

This formats the NameNode and starts the HDFS daemons. Now run jps, and you will see that the NameNode and SecondaryNameNode processes have started.

Now let's start YARN to bring up the MapReduce framework:

sbin/start-yarn.sh  

Now rerun jps and you will see the ResourceManager running. You can also try netstat -tuplen | grep 8088 to confirm the port is listening; the ResourceManager opens several other TCP ports as well, such as 8031 and 8033. Port 8088 serves the cluster management website: open http://master:8088 to see the cluster status. For now you will see no nodes in the cluster, because we have not started a slave yet.

Congratulations, our master is running. For now we have to type a password during startup; after all the slaves are built, we will use SSH keys for automatic login.

Now let’s build our slaves.

Using VirtualBox's clone function, clone master into a new VM named slave1.

Because the clone copies everything to slave1, we need to shut down master, go into slave1, and change its hostname and static IP to turn it into a slave.

The first thing to do is rename the VM: edit /etc/hostname and change it to slave1. Then set slave1's static IP as we did above, just replacing the address with 192.168.0.104. Then reboot, and start master and slave1 at the same time.
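
Inside slave1, the two changes boil down to something like the sketch below. The paths match the earlier steps, and the sed pattern assumes your interfaces file looks exactly like the one we wrote for the master.

# give the clone its own hostname
echo slave1 | sudo tee /etc/hostname
# give it its own static IP (192.168.0.105 was the master's address)
sudo sed -i 's/address 192.168.0.105/address 192.168.0.104/' /etc/network/interfaces
sudo reboot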

Now let's go back to master to bring up slave1: in the master VM, edit the /usr/local/hadoop/etc/hadoop/slaves file and add one line

slave1  

Also make sure you have added the slaves' hostname aliases in the master VM.
Then we try to start our cluster:

cd /usr/local/hadoop  
bin/hadoop namenode -format  
sbin/start-dfs.sh && sbin/start-yarn.sh  

After running these commands, check http://master:8088 and you will find that the master now has one slave online, named slave1.

PS: now you can generate an SSH key for logging in to the slaves: just run ssh-keygen -t rsa && ssh-copy-id slave1, and you will no longer need to type a password to start the cluster.

Now we have a cluster with one slave node. If you want more, add more slaves by repeating the procedure above.
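
For example, a second slave could be cloned from the command line and then adjusted the same way; a sketch, where the VM names and the 192.168.0.103 address follow the /etc/hosts table above:

# clone the powered-off master into a new VM called slave2
vboxmanage clonevm master --name slave2 --register
# then, inside slave2: set the hostname to slave2 and the static IP to 192.168.0.103,
# and add a "slave2" line to /usr/local/hadoop/etc/hadoop/slaves on the master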

After you have built your N-node cluster, you can run the commands below to check whether Hadoop is working:

# create input files  
  
mkdir input  
echo "Hello Docker" >input/file2.txt  
echo "Hello Hadoop" >input/file1.txt  

# create input directory on HDFS  
hadoop fs -mkdir -p input  

# put input files to HDFS  
hdfs dfs -put ./input/* input  
  
  
# run wordcount   
cd /usr/local/hadoop/bin  
hadoop jar ../share/hadoop/mapreduce/sources/hadoop-mapreduce-examples-*-sources.jar org.apache.hadoop.examples.WordCount input output  

# print the input files  
echo -e "\ninput file1.txt:"  
hdfs dfs -cat input/file1.txt  

echo -e "\ninput file2.txt:"  
hdfs dfs -cat input/file2.txt  

# print the output of wordcount  
echo -e "\nwordcount output:"  
hdfs dfs -cat output/part-r-00000  

PS: by the way, if you want to keep this cluster running for a long time, try using vboxmanage to manage the VMs. You can simply run vboxmanage startvm master --type headless to start master in the background (replace master with another VM name to start the others).
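
A few vboxmanage commands cover the day-to-day management; this is just a sketch using the VM names from this post:

vboxmanage startvm master --type headless    # boot a VM without a GUI window
vboxmanage startvm slave1 --type headless
vboxmanage list runningvms                   # see which VMs are up
vboxmanage controlvm slave1 acpipowerbutton  # ask a VM to shut down cleanly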

Conclusion

The difficulty of building a cluster on VirtualBox is understanding how the master and slaves connect to each other. If you set up the network correctly, running the cluster is easy. But the VirtualBox bridge has its limitations when you want to grow the cluster beyond a single host. So next we will show how to build the cluster with Docker, which lets us run it on a swarm cluster in a real environment.

Clusters On Docker

Building the cluster is much easier with Docker, because Docker provides simple networking, whether as a bridge on a single computer or as an overlay on a swarm cluster.

We use the kiwenlau/hadoop:1.0 image for our test (its Hadoop version is 2.7). Just run:

sudo docker pull kiwenlau/hadoop:1.0  

After a few minutes we have the Hadoop image. Now we create our private network with the command below (if you want to run a swarm cluster across many computers, just change bridge to overlay; powerful, isn't it?):

sudo docker network create --driver=bridge hadoop  
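
You can check that the network exists, and later see which containers have joined it (a quick sketch):

sudo docker network ls                # the new "hadoop" bridge should be listed
sudo docker network inspect hadoop    # shows the subnet and, later, the attached containers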

Now let's start our master container:

sudo docker run -itd \  
				--net=hadoop \  
				-p 50070:50070 \  
				-p 8088:8088 \  
				--name hadoop-master \  
				--hostname hadoop-master \  
				kiwenlau/hadoop:1.0 &> /dev/null  

In this command we set the master's hostname to hadoop-master, and we do not need to edit /etc/hosts as we did with VirtualBox; Docker does that for us.

Now we start our slaves:

sudo docker run -itd \  
                --net=hadoop \  
                --name hadoop-slave1 \  
                --hostname hadoop-slave1 \  
                kiwenlau/hadoop:1.0 &> /dev/null  
				  
sudo docker run -itd \  
                --net=hadoop \  
                --name hadoop-slave2 \  
                --hostname hadoop-slave2 \  
                kiwenlau/hadoop:1.0 &> /dev/null  

After that, all the software setup is finished. Just run sudo docker exec -it hadoop-master bash to enter the master, then start the cluster with bash start-hadoop.sh.

Now you can enjoy your cluster within a few minutes: open http://127.0.0.1:8088/ to see it running happily.
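
If you want to double-check from the shell rather than the web UI, you can look at the HDFS report from inside the master container. This is a sketch; it assumes the image puts the Hadoop binaries on the PATH of an interactive shell.

sudo docker exec -it hadoop-master bash
# inside the container: the live datanodes should include hadoop-slave1 and hadoop-slave2
hdfs dfsadmin -report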

Conclusion

After walking through two ways to build a Hadoop cluster, you will find it is easy once you know how the pieces work together. On the whole, we prefer running the Hadoop cluster with Docker: we can add more Hadoop slaves with a single command, and we can use a bridge or overlay network to build a better isolated, safer cluster.

References

https://github.com/kiwenlau/hadoop-cluster-docker
