Installing Hadoop using Ambari Server

What Is This Manual About

This manual describes how to install a virtual machine with Hadoop (HDP 2.2).

Hadoop can be installed in several ways, but the one described here causes the fewest problems.

Installation of Linux

Prerequisites

Download the following:

  • VirtualBox and the VM VirtualBox Extension Pack (from the VirtualBox downloads page)
  • the CentOS 6.6 installation image CentOS-6.6-x86_64-bin-DVD1.iso (from a CentOS mirror)

Linux Installation

Install VirtualBox and the VM VirtualBox Extension Pack (see the downloads above).

In VirtualBox create a new virtual machine: set the type to "Linux" and the version to "Red Hat (64-bit)", give it 8192 MB of memory and a 500 GB disk. Leave the rest of the parameters at their defaults.

Open the Settings dialog for this VM and set up the following:

  • System / Processor: set the number of processors to 2 or more
  • Storage / Controller IDE: set the path for DVD to point to CentOS-6.6-x86_64-bin-DVD1.iso
  • Network: configure two adapters: one bridged adapter and one host-only adapter

Run the VM and install CentOS. During installation do the following:

  • when it asks which components to include, select "Software Development Workstation":

  • when it asks for the root password, set it to "hadoop"
  • when it asks you to create a user, name it "hduser" with password "hduser"
  • leave all the other parameters as they are; don't change anything.

After the VM is installed, log on as root.

Note: starting from this point we will always log on as root; unless the opposite is specified, remember that you always have to log on as root.

Install the VirtualBox Guest Additions (select the menu item "Devices / Insert Guest Additions CD image", then choose "Yes" when it offers to open the autorun prompt from the inserted CD) and reboot again.

Log on as root.

We suggest selecting the menu item "Devices / Shared Clipboard / Bidirectional" for future convenience.

Network Settings

Enable Networks

You may notice that after the installation both network adapters are disconnected:

Click on both network interfaces (the labels "System eth0" and "System eth1") to enable them.

Inside the VM run terminal (Applications / System Tools / Terminal) and run the command

ifconfig

It will print output similar to this:

It shows several network adapters; the first two will be the bridged one (which has an IP address in the same range as your host machine) and the host-only one (in my case, 192.168.56.102).

Run the command

ping 192.168.56.1

(that should be the IP address of the host-only virtual adapter on your host machine), and then run

ping google.com

Both ping commands should work fine.

Note: from here on we will use gedit (Applications / Accessories / gedit text editor) as the text editor for configuration files. If you prefer another editor, feel free to use it.

Open the file /etc/sysconfig/network-scripts/ifcfg-eth0 in a text editor and fix the line

ONBOOT=no

to be

ONBOOT=yes

Do the same with the file /etc/sysconfig/network-scripts/ifcfg-eth1.
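If you prefer doing this from the terminal, the same change can be made for both adapters with a single sed command (a sketch; adjust the file names if your interfaces are named differently):

sed -i 's/^ONBOOT=no/ONBOOT=yes/' /etc/sysconfig/network-scripts/ifcfg-eth0 /etc/sysconfig/network-scripts/ifcfg-eth1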

Disabling IP Version 6

Open /etc/sysctl.conf in a text editor and add the following two lines to the end of the file:

net.ipv6.conf.all.disable_ipv6 = 1

net.ipv6.conf.default.disable_ipv6 = 1


Then run the following commands in the terminal:

sysctl -w net.ipv6.conf.all.disable_ipv6=1

sysctl -w net.ipv6.conf.default.disable_ipv6=1
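To verify that the change took effect, you can read the values back; both should print 1:

cat /proc/sys/net/ipv6/conf/all/disable_ipv6

cat /proc/sys/net/ipv6/conf/default/disable_ipv6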

Disabling Firewall

Run the following commands in the terminal:

service iptables save

service iptables stop

chkconfig iptables off

service ip6tables save

service ip6tables stop

chkconfig ip6tables off

service libvirtd stop

chkconfig libvirtd off
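To double-check that the firewall will stay off after a reboot, you can inspect the runlevels; every runlevel should show "off":

chkconfig --list iptables

chkconfig --list ip6tables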

Disable the THP

Run the following commands in the terminal:

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag

Open the file /etc/rc.local in gedit and add the following lines to the end:

if test -f /sys/kernel/mm/redhat_transparent_hugepage/enabled; then
    echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
fi

if test -f /sys/kernel/mm/redhat_transparent_hugepage/defrag; then
    echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
fi
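To verify the THP setting, read the files back; the active value is shown in brackets and should be [never]:

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled

cat /sys/kernel/mm/redhat_transparent_hugepage/defrag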

For more information about THP see here:

http://answers.splunk.com/answers/188875/how-do-i-disable-transparent-huge-pages-thp-and-co.html

Enable NTPD

Run the following commands in the terminal:

service ntpd start

chkconfig ntpd on
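To confirm that the daemon is actually synchronizing, you can query its peers; an asterisk marks the currently selected time source:

ntpq -p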

Fixing Hostname

I am on a corporate network, and a DNS name is assigned to my VM automatically. If I run the ifconfig and nslookup commands, I see this:

 

It seems everything is fine, but it is still preferable to fix a few more configuration files.

Open the file /etc/hosts in a text editor and make the contents to be like this:

127.0.0.1 yourhostname

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4

(replace yourhostname with the real host name; if you don't see any hostname when running the nslookup command, choose any host name that you are going to use).

In my case the /etc/hosts file looks like this:


Open the file /etc/sysconfig/network in a text editor and make the contents to be like this:

NETWORKING=yes

HOSTNAME=yourhostname

In my case the file looks like this:

 
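To apply the new name without waiting for the reboot in the next step, you can also set it for the running session (yourhostname again stands for the name you chose); running hostname with no arguments then prints the current name:

hostname yourhostname

hostname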

Final Network Check

Reboot your VM and log on as root. Open a terminal window. You should see this:

[root@yourhostname ~]

Run the following commands in the terminal:

service iptables status

service ip6tables status

they should both report that the firewall is not running.

Run the command

ping google.com

it should show that you can reach the Internet.

Adding Repositories

About Repositories

Installation of software in CentOS is done using yum. It looks like this:

yum install somepackage

yum looks up "somepackage" in the known repositories (web sites); if it finds the package, it downloads it and performs the installation.

There is a configuration directory /etc/yum.repos.d/ that contains multiple repo files: configurations of the known repositories where yum searches for packages. We will add some more repositories there.
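For reference, a repo file is a small INI-style text file. A minimal sketch looks like this (the repository name and URL are made up for illustration, not a real repository):

[myrepo]
name=My Example Repository
baseurl=http://example.com/repo/centos6/
enabled=1
gpgcheck=0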

Adding Ambari Repository

In the terminal run the commands

cd /etc/yum.repos.d/

wget http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.0.0/ambari.repo

Now you have a file called "ambari.repo" in the directory /etc/yum.repos.d.

Note: the instructions at https://cwiki.apache.org/confluence/display/AMBARI/Install+Ambari+2.0.0+from+Public+Repositories describe exactly where to get this file.

Now run the command

yum repolist

it should list several repositories:

Install the Ambari Server

Run the commands

yum -y install ambari-server

ambari-server setup

During the setup process it will ask you to choose a JDK; select "Oracle JDK 1.7". For the rest of the options simply press Enter to accept the defaults. It should print "Ambari Server 'setup' completed successfully." at the end.
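Side note: if you ever need to repeat the setup non-interactively (for example, from a script), Ambari also has a silent mode that accepts all the defaults instead of prompting:

ambari-server setup -s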

Run the command

ambari-server start
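You can check that the server process actually came up before opening the browser:

ambari-server status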

Run a browser and open http://yourhostname:8080 (replace yourhostname with the real full hostname that you set up in the previous steps), then log on as user "admin" with password "admin". You should see the following picture:

If you see this, it means that Ambari Server is installed, and you’re ready to install Hadoop components now.

Install Hadoop Components Using Ambari

Click the "Launch Install Wizard" button on the initial page of the Ambari website.

In the "Name your cluster" field enter something like hadoopcluster (choose your own name):

Press “Next”.

Select the "HDP 2.2" stack:

Press “Next”.

On the next step we will have to set up the hosts to be included in the cluster, and the private key.

As for the hosts, everything is simple: we need to set up just this one host: yourhostname

As for the key, this is a little bit tricky, since not everyone knows what it is. Here is a little theory (feel free to skip this paragraph). One way to log on over SSH is to supply a login and a password. Another way for machine X to recognize that user A is trying to log on is this:

  • Machine X knows the public key of user A
  • When logging on, user A sends machine X a message encrypted with the private key
  • Machine X tries to decrypt the message using the public key (taken from the ".ssh" folder in the home directory of user A). If it succeeds, it knows that this really is user A.

Therefore, Ambari Server needs the SSH private key to be able to log on to all machines in the cluster and run commands at those machines.

So, let us generate a key pair. Open a terminal and run the following commands:

cd /root/.ssh

ssh-keygen

Press Enter several times to accept the defaults. Then run the command

ls

it will show you that there are two files in the .ssh directory, id_rsa and id_rsa.pub:

The first one contains the private key, the second one contains the public key for user root.

Now run the command

cat /root/.ssh/id_rsa.pub > /root/.ssh/authorized_keys

it will write the contents of root's file id_rsa.pub into the file authorized_keys.

Therefore, if a user connects to this machine over SSH and presents the correct private key, the machine will look it up in authorized_keys and will know that this is root.
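Depending on your umask, it may also be worth tightening the permissions on this file, because sshd refuses to use an authorized_keys file that is group- or world-writable:

chmod 600 /root/.ssh/authorized_keys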

To test this run

ssh yourhostname

(replace yourhostname with real host name)

the command will prompt you for confirmation, and then show the prompt [root@yourhostname ~], which means we've established an SSH connection to the host yourhostname (actually, we were already on the same host, but that doesn't matter).

Press Ctrl+D; you will disconnect from the host yourhostname and get the message "Connection to yourhostname closed".

Now run the command:

cat id_rsa

It will dump the file contents to the terminal. Select the region from the "BEGIN" marker to the "END" marker (inclusive) and copy it to the clipboard:

Switch to Firefox and paste the key into the web page:

 

Press “Register and Confirm”.

You will see the following screen:

After it finishes, it will give you the following result:

In the following step, leave all the services selected:

Press “Next”. If you get this warning, simply ignore it by pressing “Proceed”:

On the following step

Leave everything at the defaults and press "Next".

On this step also press “Next”.

This form warns us that some services require attention. Click on "Hive" and set its database password to "hive":

Scroll the page up and click on "Oozie", and do the same thing: set the password to "oozie". Do the same for Knox: set the password to "knox".

Press “Next”.

It will show you a warning, but do not take it into account; simply press "Proceed Anyway". We will disable many of those services later; for now we simply want to have them installed.

Press the "Deploy" button on the next page.

You will see the total progress

This will last for a while (an hour or so).

When this process finishes, all components are installed.

Running Hadoop

If you open the Ambari site (http://localhost:8080), you will see this:

If you reboot your machine, you will have to start all the services again.

Configuring Hadoop Components

Hive

First of all, I would like to try Hive. Run the command

hive

and here is what you get:

If instead you run the commands

su - hive

hive

then everything works fine.

The reason is obvious:

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=WRITE, inode="/user":hdfs:hdfs:drwxr-xr-x

The permissions on the HDFS "/user" directory are initially wrong.

Run the following:

su -

hdfs dfs -ls /

it will show you the following permissions:

On the "user" folder the owner "hdfs" has rwx permissions, but the group "hdfs" has only r-x! Let us add write permission for the group "hdfs" and add root to the "hdfs" group. First, still as root, run

usermod -a -G hdfs root

The chmod itself has to be done as the user hdfs (the owner of the directory), so next we set that user's password:

passwd hdfs

it will prompt you for a password; enter hdfs twice. Then run

su - hdfs

it will switch you to that user. Then run

hdfs dfs -chmod -R 775 /user

hdfs dfs -ls /

And here is the result:

Now switch back to the user root and run hive:

su -

hive

And here you are:

Configuring HDFS for other users

For other users (who will be working with the cluster later) it is necessary to do the following:

hdfs dfs -mkdir /user/Ihor_Bobak

hdfs dfs -chown Ihor_Bobak:Ihor_Bobak /user/Ihor_Bobak

Reason: every user needs a home directory in HDFS.
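If you have several users to set up, a small loop saves typing (run it as the user hdfs; alice and bob are placeholder names, replace them with your real user names):

# create an HDFS home directory for each listed user
for u in alice bob; do
  hdfs dfs -mkdir -p /user/$u
  hdfs dfs -chown $u:$u /user/$u
done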

HDFS Browser

There is a nice tool which significantly speeds up your file operations with HDFS.

https://drive.google.com/file/d/0B3DMXMfcPWF3TmlTV19vSHM4YUE/view?usp=sharing

Create a directory devtools in the root of the file system, and put this jar file into its muCommander subdirectory:

Then create a text file mucommander.sh with the following contents:

java -jar /devtools/muCommander/mucommander-hadoop-qfs.jar --hadoop /usr/hdp/current/hadoop-client

using any text editor, and run

chmod 777 mucommander.sh

to make the script executable:

And then you can make a link on the panel:

As soon as you run muCommander, you will see this:

Press "Connect to server", enter your hostname, and press Connect. In the left panel you will see the HDFS directory listing, where you can copy and move files using F5, F6, etc.

 

