In this article, we set up a Hadoop cluster on Azure using virtual machines running Linux. More specifically, we use Hortonworks' HDP 2.1 distribution for Linux (Hortonworks also provides HDP distributions for the Windows platform). Furthermore, we install Hadoop with Ambari, an Apache project that provides an intuitive UI for provisioning, managing and monitoring a Hadoop cluster.

Contents

1 Introduction
2 Step-by-Step: Build the Infrastructure
3 Install a Hadoop Distribution

Step-by-Step: Install a Hadoop Distribution

  1. Install Ambari Server
  2. Install Hadoop

Now that we have set up the infrastructure for a Hadoop cluster in Azure, it is time to get our hands dirty and install the actual Hadoop distribution.

1. Install Ambari Server

We start off by installing the Ambari server, which allows for a “graphical” way of installing and deploying Hadoop.

a. Set Up Bits
b. Set Up Ambari
c. Start Ambari

a. Set Up Bits

Log onto your master node (in this case oldkHDPm) as root. This node will serve as the main installation host. Download the Ambari repository file. Since we use CentOS 6 as our platform, access the repository as follows:

wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.5.1/ambari.repo

Next, copy the file to /etc/yum.repos.d:

cp ambari.repo /etc/yum.repos.d

More information can be found here in the Hortonworks documentation.

[Screenshot: downloading the Ambari repo file and copying it to /etc/yum.repos.d]

You can confirm that the repository is configured by running yum repolist. You then obtain a list of repo IDs and repo names, as marked in blue below. The command may vary depending on the platform (see here for more information).

[Screenshot: yum repolist output with the Ambari repo ID and name highlighted]

Now, we can install the Ambari bits by running yum install ambari-server.

[Screenshots: yum install ambari-server output]
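To recap, here is the complete sequence for setting up the bits as one copy-pasteable block (the -y flag simply skips yum's confirmation prompt):

wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.5.1/ambari.repo
cp ambari.repo /etc/yum.repos.d
yum repolist
yum install -y ambari-server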

 

b. Set Up Ambari

Now that the Ambari server is installed, let us set it up. Run ambari-server setup.

[Screenshot: ambari-server setup prompts]

Here, we do not customise the user account for the ambari-server daemon, since we have already changed the root password. We also accept the default settings. More information can be found here in the Hortonworks documentation.
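If you would rather skip the interactive prompts altogether, Ambari also offers a silent setup that accepts all defaults; a convenience rather than a requirement (check ambari-server setup --help for the options your version supports):

ambari-server setup -s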

c. Start Ambari

The Ambari server is set up and installed – ready to be started:

ambari-server start

To have a look at the Ambari server processes, type in:

ps -ef | grep ambari

[Screenshot: ps output showing the Ambari server processes]

If more than one Ambari process is running, kill the extra process as follows:

[Screenshot: killing the duplicate Ambari process]
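A minimal sketch of that clean-up, assuming the duplicate process has PID 1234 (a hypothetical value; take the actual PID from the ps output above):

ps -ef | grep ambari   # identify the duplicate process and note its PID
kill 1234              # terminate it (kill -9 1234 if it refuses to die)
ambari-server status   # verify that a single Ambari server instance remains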

2. Install Hadoop

Now we are ready to install the Hadoop distribution, i.e. HDP 2.1, using Ambari. We will follow along the Hortonworks documentation (here).

Log into your DNS server and open an internet browser pointing to

http://{ambari-server}:8080

In this case: http://oldkHDPm.oldkHDP.oliviak.com:8080. Log in with Ambari's default credentials, i.e. user admin and password admin.

[Screenshot: Ambari welcome page]

Name your cluster (see Hortonworks documentation), e.g. oldkHDPcluster:

[Screenshot: naming the cluster]

Select your desired stack (see Hortonworks docs). We choose the latest at the time of writing, i.e. HDP 2.1:

[Screenshot: stack selection, HDP 2.1]

The next window specifies the install options. Before we go into it, we take a little detour: how to copy the SSH private key onto the DNS server.

Detour: How to Copy the SSH Private Key to the Local Machine

For that purpose, we install WinSCP, which enables secure file transfer between a local and a remote computer. Once installed, log in to the master node with its credentials (i.e. oldkHDP.cloudapp.net, port 22):

[Screenshots: WinSCP login to the master node]

Use the WinSCP client to download the private SSH key (i.e. id_rsa) of the master node into the DNS server:

[Screenshots: downloading id_rsa with WinSCP]
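Alternatively, if the PuTTY tools are installed on the DNS server, pscp achieves the same from the command line; a sketch, assuming the key pair lives under /root/.ssh on the master node:

pscp -P 22 root@oldkHDP.cloudapp.net:/root/.ssh/id_rsa .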

Once downloaded into the “local” machine, i.e. our DNS server, we can browse for it in the “Install Options” window:

[Screenshot: browsing for the private key in the “Install Options” window]

Additionally, type in all the target hosts of your Hadoop cluster. In this case, it includes the master node and the three worker nodes:

oldkHDPm.oldkHDP.oliviak.com
oldkHDPw[1-3].oldkHDP.oliviak.com

[Screenshot: entering the target hosts]

When registering and confirming, you will be prompted with another window containing the host name pattern expressions:

[Screenshot: host name pattern expressions]
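The pattern oldkHDPw[1-3].oldkHDP.oliviak.com simply expands to the three worker nodes, so the full list of target hosts reads:

oldkHDPm.oldkHDP.oliviak.com
oldkHDPw1.oldkHDP.oliviak.com
oldkHDPw2.oldkHDP.oliviak.com
oldkHDPw3.oldkHDP.oliviak.com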

Success – the hosts are confirmed. Have a look at the Hortonworks documentation for more information.

[Screenshot: hosts confirmed]

You may or may not receive some warnings as shown in the yellowish area:

[Screenshots: host check warnings]

It turns out that the ntpd services are not running but are required to be. You could run the HostCleanup Python script on each host…

[Screenshot: HostCleanup Python script]

…or get the ntpd services running manually. Note that chkconfig ntpd on alone only enables the service at the next boot, so run both

chkconfig ntpd on
service ntpd start

on each host, i.e. the master and all three worker nodes:

[Screenshot: enabling ntpd on each host]

To check the status of the ntpd services, run

service ntpd status
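Since Ambari requires passwordless root SSH from the installation host to all cluster hosts anyway, you can script this step from the master node instead of logging into each machine; a minimal sketch using the hostnames of this walkthrough:

for h in oldkHDPm oldkHDPw1 oldkHDPw2 oldkHDPw3; do
  ssh root@$h.oldkHDP.oliviak.com "chkconfig ntpd on && service ntpd start && service ntpd status"
done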

Back in the browser on the DNS server, rerun the checks:

[Screenshots: rerunning the host checks, now passing]

Next, you can choose the services you wish to install on your Hadoop cluster (see HDP documentation).

[Screenshot: choosing services]

Next, select the hosts on which certain master components should run (see HDP doc). In this case, we choose to assign the master components of the Hive Server and the Oozie Server to the master node.

[Screenshot: assigning master components]

With the Ambari wizard, slave components (i.e. DataNodes, NodeManagers and RegionServers) can be appropriately assigned to certain hosts in the next window (see HDP doc).

[Screenshot: assigning slave and client components]

Now you can manage the configuration settings for the Hadoop components along the tabs:

[Screenshot: Customize Services tabs]

For instance, under HDFS we change the directories from

[Screenshot: HDFS directory settings, before]

to the following:

[Screenshot: HDFS directory settings, after]
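For illustration only: if the Azure data disk from the infrastructure part were mounted at /mnt/datadrive (a hypothetical mount point; substitute your own), the relevant HDFS properties would point there, for instance:

dfs.namenode.name.dir = /mnt/datadrive/hadoop/hdfs/namenode
dfs.datanode.data.dir = /mnt/datadrive/hadoop/hdfs/data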

For the remaining tabs marked with warnings, credentials are required, such as Nagios,

[Screenshots: Nagios credentials]

Hive

[Screenshot: Hive credentials]

…and Oozie:

[Screenshot: Oozie credentials]

More information on customising the services related to your Hadoop cluster can be found here.

Finally, before deploying the Hadoop cluster you obtain the usual summary of configuration settings:

[Screenshot: review of the configuration settings]

It contains the following information:

Admin Name : admin
Cluster Name : oldkHDPcluster
Total Hosts : 4 (4 new)
Repositories

Services

HDFS

NameNode : oldkHDPm.oldkHDP.oliviak.com
SecondaryNameNode : oldkHDPw1.oldkHDP.oliviak.com
DataNodes : 3 hosts

YARN + MapReduce2

NodeManager : 3 hosts
ResourceManager : oldkHDPw1.oldkHDP.oliviak.com
History Server : oldkHDPw1.oldkHDP.oliviak.com
App Timeline Server : oldkHDPw1.oldkHDP.oliviak.com

Tez

Clients : 1 host

Nagios

Server : oldkHDPm.oldkHDP.oliviak.com
Administrator : nagiosadmin / (<your-email-address>@blabla.com)

Ganglia

Server : oldkHDPm.oldkHDP.oliviak.com

Hive + HCatalog

Hive Metastore : oldkHDPm.oldkHDP.oliviak.com
Database : MySQL (New Database)

HBase

Master : oldkHDPm.oldkHDP.oliviak.com
RegionServers : 3 hosts

Pig

Clients : 1 host

Sqoop

Clients : 1 host

Oozie

Server : oldkHDPm.oldkHDP.oliviak.com
Database : Derby (New Derby Database)

Zookeeper

Servers : 3 hosts

Falcon

Server : oldkHDPw1.oldkHDP.oliviak.com

Storm

Nimbus : oldkHDPm.oldkHDP.oliviak.com
Storm REST API Server : oldkHDPm.oldkHDP.oliviak.com
Storm UI Server : oldkHDPm.oldkHDP.oliviak.com
DRPC Server : oldkHDPm.oldkHDP.oliviak.com
Supervisor : 3 hosts

And away you deploy:

[Screenshots: cluster deployment progress]

Finally, you obtain a summary of your Hadoop installation efforts:

[Screenshot: installation summary]

Done!

You have the Ambari GUI nicely displayed in front of you:

[Screenshot: Ambari dashboard]

When hovering over the tiles, you obtain more information, such as the network usage:

[Screenshots: network usage details]

or the cluster load:

[Screenshot: cluster load details]
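Beyond the dashboard, a quick smoke test from the command line confirms that HDFS is up and all DataNodes have reported in; a sketch, run on the master node as root (the hdfs service user is created by Ambari during the installation):

su - hdfs -c "hadoop fs -ls /"        # list the HDFS root directory
su - hdfs -c "hdfs dfsadmin -report"  # show live DataNodes and capacity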