Install Hadoop on a single-node Vagrant machine


Vagrant is a nice tool for developers, especially those who love to play with new technologies. I recently installed Hadoop in a local Vagrant machine. The process was not smooth, and I got stuck several times. The objective of this post is to help people who want to explore Hadoop using Vagrant. For Windows users in particular, Vagrant can be a magnificent choice for learning Hadoop.

1. Prerequisite:

  • Latest version of Vagrant
  • Git
  • PuTTY

2. Prepare single node Vagrant:
Create a directory on the Windows machine, then clone the GitHub repository below into it using Git Bash:

cmd> git clone https://github.com/khayer117/hadoop-in-vagrant.git

I have created a Vagrant configuration file with the necessary settings to install Hadoop on the guest machine. Hadoop will be installed as a single node (I will write another post for multi-node). Below is the Vagrant node configuration:

  • Ubuntu version: Server 14.04
  • Vagrant box name: ubuntu/trusty64
  • Java: Oracle Java 8
  • CPU: 2
  • Memory: 1024 MB
  • Private IP: 192.168.33.50
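The settings above map to a Vagrantfile roughly like the following. This is only a sketch of what the repository's file contains; the machine name "hnname" comes from the commands later in this post, and the exact provider options in the real file may differ:

```ruby
# Sketch of a Vagrantfile matching the node settings listed above.
Vagrant.configure("2") do |config|
  config.vm.define "hnname" do |node|
    node.vm.box = "ubuntu/trusty64"          # Ubuntu Server 14.04
    node.vm.hostname = "hnname"
    node.vm.network "private_network", ip: "192.168.33.50"
    node.vm.provider "virtualbox" do |vb|    # assumes the VirtualBox provider
      vb.cpus = 2
      vb.memory = 1024
    end
  end
end
```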

Now it is time to bring the Vagrant machine up. Go to the project directory, open a Windows command prompt there, and run the command below:

cmd> vagrant up hnname

As we are using Vagrant on a Windows machine, we will have to connect to the Vagrant guest using an SSH client. I prefer PuTTY. The Vagrant command below displays the SSH connection information:

cmd> vagrant ssh hnname

3. Configure the Ubuntu SSH server:
An SSH server is already pre-installed in the Ubuntu guest. We have to configure it for Hadoop because Hadoop manages its distributed nodes over SSH. Here we create an SSH key; as we are preparing a local development environment, we can leave the passphrase blank.

$ ssh-keygen
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
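It is worth confirming the key works before moving on, because Hadoop's start scripts rely on passwordless SSH. A quick check like this helps; BatchMode=yes makes ssh fail instead of prompting for a password:

```shell
# Try a passwordless SSH login to localhost; prints "ssh-ok" on
# success and "ssh-failed" if a password would still be required.
ssh -o BatchMode=yes -o StrictHostKeyChecking=no localhost true \
  && echo ssh-ok || echo ssh-failed
```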

4. Downloading Hadoop:
I used hadoop-2.7.2.tar.gz for this article. Hadoop will be installed in the /usr/local/hadoop folder, but this is optional; Hadoop is fine installed in another location. Hadoop will run under the default "vagrant" user, although a dedicated user for Hadoop is recommended.

$ wget http://apache.mirrors.pair.com/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
$ tar -zxvf hadoop-2.7.2.tar.gz
$ sudo cp -r hadoop-2.7.2 /usr/local/hadoop

5. Configure Ubuntu bash:
The Java home directory and the Hadoop base path have to be set in the Ubuntu bash profile. Below are the steps to modify the file:

# open .bashrc in the vi editor (no sudo needed; it is the vagrant user's own file)
$ vi $HOME/.bashrc

# append below line in bash file
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

# save and exit vi (Esc, then :wq!)
# reload the shell so the new variables take effect
$ exec bash
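A quick sanity check that the new variables took effect; the values printed should be the ones exported above:

```shell
# Print the two variables the rest of the tutorial depends on;
# both should be non-empty after reloading the shell.
echo "HADOOP_HOME=$HADOOP_HOME"
echo "JAVA_HOME=$JAVA_HOME"
```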

6. Disable IPv6:
Hadoop does not support IPv6, and in the Vagrant Ubuntu box IPv6 is enabled by default, so IPv6 support has to be disabled by following the instructions below.

6.1 Modify Hadoop Env setting:

# Edit hadoop env file
$ sudo vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh

# modify the HADOOP_OPTS value
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

# Save and exit vi


6.2 Modify the Ubuntu network settings to disable IPv6

# Modify sysctl.conf
$ sudo vi /etc/sysctl.conf

# Add below IPv6 configuration
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

# Save and exit vi

# Reload sysctl.conf configuration
$ sudo sysctl -p

# check the IPv6 status; a value of 1 means IPv6 is disabled
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

7. Ubuntu hosts (/etc/hosts) file entry:
Hadoop does not require any special entry in the hosts file, but for better understanding I have added a working copy of my hosts file below:

127.0.0.1	hnname	hnname
127.0.0.1 localhost
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

8. Configure the Hadoop site settings:
This is the part where I had difficulty due to Vagrant. I tried configuring with the host name "hnname" and with "localhost", but unfortunately I did not succeed in bringing the Hadoop components up properly with those host names. So I configured the sites using the default IP "0.0.0.0".

Hadoop stores its configuration files under the /usr/local/hadoop/etc/hadoop directory. Open each configuration file with vi, add the corresponding settings, then save the file.

8.1 core-site.xml file:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:10001</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
  </property>
</configuration>

8.2 mapred-site.xml file:

# Create mapred-site.xml from the default template
$ sudo cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

# Edit mapred-site.xml file
$ sudo vi /usr/local/hadoop/etc/hadoop/mapred-site.xml

# add below setting
<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>0.0.0.0:10002</value>
  </property>
</configuration>

8.3 hdfs-site.xml file:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/hdfs</value>
  </property>
</configuration>

9. Create a temporary directory for Hadoop (the path must match the hadoop.tmp.dir value set in core-site.xml):

$ sudo mkdir /usr/local/hadoop/tmp
$ sudo chown vagrant /usr/local/hadoop/tmp

# Set folder permissions
$ sudo chmod 750 /usr/local/hadoop/tmp

10. Create a data folder for the data node (the path must match dfs.datanode.data.dir in hdfs-site.xml):

$ sudo mkdir /usr/local/hadoop/hdfs
$ sudo chown vagrant /usr/local/hadoop/hdfs
$ sudo chmod 750 /usr/local/hadoop/hdfs

Configuration is done here. Now it is time to test Hadoop. Every process should start smoothly; check the logs (/usr/local/hadoop/logs) if any unexpected issue comes up.

11. Starting services:

# Format the name node. This is required only once, and it will clean up the HDFS data folder.
$ hdfs namenode -format

#Start hdfs and yarn services.
$ /usr/local/hadoop/sbin/start-dfs.sh
$ /usr/local/hadoop/sbin/start-yarn.sh

# If everything is OK, the Java processes below will be running
$ jps

# dfs process
2034 DataNode
2263 SecondaryNameNode
1887 NameNode

# yarn process
2441 ResourceManager
2586 NodeManager

12. Test from a browser:
Hadoop has a basic web UI to view and track activities. After starting all the Hadoop processes, the Hadoop sites can be viewed from the Windows machine's browser:
http://192.168.33.50:50070

If the site does not display in the browser, test it using telnet from Ubuntu. If telnet connects successfully, then the Windows machine's browser should also be able to reach the Hadoop web UI.

$ telnet 192.168.33.50 50070

Please note that 192.168.33.50 is the private IP of the Vagrant machine, which is configured in the Vagrant config file.
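If telnet is not installed in the guest, curl works just as well for this check. The two-second timeout is only there to fail fast; a code of 000 means no connection could be made:

```shell
# Print the HTTP status code the NameNode web UI returns; "000"
# followed by "unreachable" means the connection failed.
curl -s --connect-timeout 2 -o /dev/null -w '%{http_code}\n' \
  http://192.168.33.50:50070 || echo unreachable
```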

The HDFS NameNode web site (the DataNode UI listens on port 50075 by default):
http://192.168.33.50:50070

Cluster YARN (ResourceManager) site:

http://192.168.33.50:8088

Important ports:

http://blog.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/

13. Running a job:

Let's create a word count job in Hadoop to count the words in a source text file. A sample source file will be downloaded from textfiles.com. The job executes the MapReduce examples jar that is already in the installation path.
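Before running the job on the cluster, it can help to see what WordCount actually computes. The same result can be approximated for a small file with plain shell tools (an illustration only, using a made-up two-line file, not the ast-list.txt download):

```shell
# Emulate WordCount locally: split the text on whitespace, then
# count the occurrences of each word.
printf 'hadoop counts words\nhadoop counts\n' > /tmp/wc-sample.txt
tr -s ' \t' '\n' < /tmp/wc-sample.txt | sort | uniq -c | sort -rn
```

The real job does the same thing, but splits the work across map tasks and merges the per-word counts in the reduce phase.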

$ cd /usr/local/hadoop

# make a directory for sample data and download test data from textfiles.com
$ mkdir -p sampledata/science
$ cd sampledata/science
$ wget http://www.textfiles.com/science/ast-list.txt

# Create a directory in dfs and put sample data
$ hdfs dfs -mkdir /project01
$ hdfs dfs -put /usr/local/hadoop/sampledata /project01

# view the dfs data list. You can also browse the data from the name node web site (http://192.168.33.50:50070/explorer.html).
$ hdfs dfs -ls /

# execute example job for word count.
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /project01/sampledata/science /project01/sampledata/science/output

# Display the output.
$ hdfs dfs -cat /project01/sampledata/science/output/part-r-00000

# stopping services
$ /usr/local/hadoop/sbin/stop-dfs.sh
$ /usr/local/hadoop/sbin/stop-yarn.sh

A working copy of the Hadoop and Ubuntu configuration files has been added to the git repository's "config-files" folder. Hadoop is a highly configurable distributed system; the settings described above are the minimum needed to run Hadoop as a single-node cluster. On production servers, Hadoop is generally installed as a multi-node cluster.
