Olivia's Blog

All on Big Data and Windows Azure

Hadoop on Linux on Azure – Step-by-Step: Install Hadoop (3)


In this article, we set up a Hadoop cluster on Azure using virtual machines running Linux. More specifically, we use the HDP 2.1 on Linux distribution by Hortonworks, which also provides HDP distributions for the Windows platform. Furthermore, we install Hadoop with Ambari, an Apache project that provides an intuitive UI for provisioning, managing and monitoring a Hadoop cluster.


1 Introduction
2 Step-by-Step: Build the Infrastructure
3 Install a Hadoop Distribution

Step-by-Step: Install a Hadoop Distribution

  1. Install Ambari Server
  2. Install Hadoop

Now that we have set up the infrastructure for a Hadoop cluster in Azure, it is time to get our hands dirty with installing the actual Hadoop distribution.

1. Install Ambari Server

We start by installing the Ambari server, which offers a “graphical” way of installing and deploying Hadoop.

a. Set Up Bits
b. Set Up Ambari
c. Start Ambari

a. Set Up Bits

Log onto your master node (in this case oldkHDPm) as root. This node will serve as the main installation host. Download the Ambari repository file. Since we use CentOS 6 as our platform, access the repository as follows:

wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.5.1/ambari.repo

Next, copy the file into your yum.repos.d directory:

cp ambari.repo /etc/yum.repos.d
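
The download and copy steps above can also be wrapped into one small helper that stops at the first failure. This is only a minimal sketch: the function name install_ambari_repo and its two parameters are made up for illustration, and the URL in the example call is the one quoted above, which may have moved since.

```shell
#!/bin/sh
# Hypothetical helper: fetch a .repo file and place it in a yum repo directory.
# $1 = URL of the repo file, $2 = target directory (normally /etc/yum.repos.d)
install_ambari_repo() {
    repo_url=$1
    repo_dir=$2
    repo_file=$(basename "$repo_url")

    # Download into the current directory; give up if the download fails.
    wget -O "$repo_file" "$repo_url" || return 1

    # Copy the file into the repo directory and confirm it arrived non-empty.
    cp "$repo_file" "$repo_dir"/ || return 1
    [ -s "$repo_dir/$repo_file" ]
}

# Example call (run as root on the master node):
# install_ambari_repo http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.5.1/ambari.repo /etc/yum.repos.d
```

If the download fails, for instance because the host name cannot be resolved, the function returns a non-zero status before anything is copied.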

More information can be found here in the Hortonworks documentation.

8 ambari 1 setup (1)

You can confirm that the repository is configured by running yum repolist. You then obtain a list of repo IDs and repo names, as marked in blue below. The command may vary depending on the platform (see here for more information).
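
If you prefer a scripted check over reading the list by eye, the repolist output can be filtered for the repo ID. The following is a rough sketch: repo_is_listed is a hypothetical helper, and the ID prefix "ambari" in the example call is an assumption, so use whatever ID your own repolist output shows.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a repo id appears in `yum repolist` output.
# Reads the repolist output on stdin; $1 = repo id (or its prefix) to look for.
repo_is_listed() {
    # The first column of each repolist row is the repo id.
    awk -v id="$1" 'index($1, id) == 1 { found = 1 } END { exit !found }'
}

# Example call on the master node ("ambari" is an assumed id prefix):
# yum repolist | repo_is_listed ambari && echo "Ambari repo configured"
```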

8 ambari 1 setup (2)

Now, we can install the Ambari bits by running yum install ambari-server.

8 ambari 1 setup (3)

8 ambari 1 setup (4)

8 ambari 1 setup (5)


b. Set Up Ambari

Now that the Ambari server is installed, let us set it up. Run ambari-server setup.

8 ambari 2 setup

Here, we do not customise the user account for the ambari-server daemon since we have already changed the root password. For the remaining prompts, we accept the default settings. More information can be found here in the Hortonworks documentation.

c. Start Ambari

The Ambari server is set up and installed – ready to be started:

ambari-server start

To have a look at the Ambari server processes, type in:

ps -ef | grep ambari

8 ambari 3 start (1)

If more than one Ambari process is running, kill the other process as follows:

8 ambari 3 start (2)
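
The process check above can be reduced to a list of PIDs, which makes it easier to spot a stray process. This is a sketch only: ambari_pids is a hypothetical helper, and since the exact command from the screenshot is not reproduced in the text, a plain kill on the unwanted PID is assumed.

```shell
#!/bin/sh
# Hypothetical helper: print the PIDs of Ambari processes from `ps -ef` output.
# Reads `ps -ef` output on stdin; column 2 of that output is the PID.
ambari_pids() {
    # Skip the grep/awk commands themselves, which also mention "ambari".
    awk '/ambari/ && !/grep|awk/ { print $2 }'
}

# Example: list the PIDs, then stop a stray one by hand.
# ps -ef | ambari_pids
# kill <pid-of-the-extra-process>
```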

2. Install Hadoop

Now we are ready to install the Hadoop distribution, i.e. HDP 2.1, using Ambari. Again, we follow the Hortonworks documentation (here).

Log into your DNS server and open a web browser pointing to port 8080 on the master node, where the Ambari server is running.

In this case: http://oldkHDPm.oldkHDP.oliviak.com:8080
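
If the page does not load, it can help to test the address from one of the Linux nodes before suspecting the browser. A minimal sketch, assuming curl is installed on that node; ambari_url and ambari_is_up are hypothetical helper names, and port 8080 is the one used above.

```shell
#!/bin/sh
# Hypothetical helper: build the Ambari web UI address for a given host.
# Port 8080 is the one used in this article.
ambari_url() {
    printf 'http://%s:8080\n' "$1"
}

# Hypothetical helper: succeed if the Ambari UI answers over HTTP.
ambari_is_up() {
    # -s: quiet, -f: fail on HTTP errors, -o /dev/null: discard the page body
    curl -sf -o /dev/null "$(ambari_url "$1")"
}

# Example:
# ambari_is_up oldkHDPm.oldkHDP.oliviak.com && echo "Ambari UI reachable"
```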

9 hdp 1

Name your cluster (see Hortonworks documentation), e.g. oldkHDPcluster:

9 hdp 2

Select your desired stack (see Hortonworks docs). We choose the latest for the time being, i.e. HDP 2.1:

9 hdp 3

The next window specifies the install options. Before we go into it, we take a little detour: how to copy the SSH private key onto the DNS server.

Detour: How to Copy the SSH Private Key to the Local Machine

For that purpose, we install WinSCP, which enables secure file transfer between a local and a remote computer. Once installed, log in to the master node with its credentials (i.e. oldkHDP.cloudapp.net, port 22):

9 hdp 4 (1)

9 hdp 4 (2)

Use the WinSCP client to download the private SSH key (i.e. id_rsa) of the master node into the DNS server:

9 hdp 4 (3)

9 hdp 4 (4)

Once it is downloaded onto the “local” machine, i.e. our DNS server, we can browse for it in the “Install Options” window:

9 hdp 4 (5)

Additionally, type in all the target hosts of your Hadoop cluster. In this case, it includes the master node and the three worker nodes:
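
The host list for the text box can be generated rather than typed. The sketch below assumes a naming scheme of one master called <prefix>m and workers called <prefix>w1, <prefix>w2 and so on; only oldkHDPm and oldkHDPw1 appear in this article, so the other worker names are an assumption, and cluster_hosts is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print one FQDN per line for a master and N workers.
# $1 = name prefix, $2 = DNS domain, $3 = number of workers
cluster_hosts() {
    prefix=$1
    domain=$2
    workers=$3

    # The master node carries the suffix "m".
    printf '%sm.%s\n' "$prefix" "$domain"

    # The workers carry the suffixes w1, w2, ... (assumed naming scheme).
    i=1
    while [ "$i" -le "$workers" ]; do
        printf '%sw%d.%s\n' "$prefix" "$i" "$domain"
        i=$((i + 1))
    done
}

# Example: four lines to paste into the target hosts box.
# cluster_hosts oldkHDP oldkHDP.oliviak.com 3
```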


9 hdp 4 (6)

When registering and confirming, you will be prompted with another window containing the host name pattern expressions:

9 hdp 4 (7)

Success – the hosts are confirmed. Have a look at the Hortonworks documentation for more information.

9 hdp 5 (1)

You may or may not receive some warnings as shown in the yellowish area:

9 hdp 5 (2)

9 hdp 5 (4)

It turns out that the ntpd services are not running but are required to be. You could run the HostCleanup Python script on each host…

9 hdp 5 (5)

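…or manually get the ntpd services to run. Note that chkconfig only enables the service at boot; if it is not running yet, start it as well. Run

chkconfig ntpd on
service ntpd start

on each host, i.e. the master and all three worker nodes: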

9 hdp 5 (7)

To check the status of the ntpd services, run

service ntpd status
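
Rather than logging into every node in turn, the ntpd commands can be sent to all hosts from the master in one loop. This is a sketch under a few assumptions: root can reach every host over SSH, fix_ntpd_on_hosts and the RUNNER variable are hypothetical names, and the service is started explicitly because chkconfig alone only registers it for the next boot (one reader reports having to start it by hand).

```shell
#!/bin/sh
# RUNNER is the command used to reach each host; ssh by default.
RUNNER=${RUNNER:-ssh}

# Hypothetical helper: enable ntpd at boot, start it now, and show its status
# on every host given as an argument. Stops at the first host that fails.
fix_ntpd_on_hosts() {
    for host in "$@"; do
        $RUNNER "root@$host" 'chkconfig ntpd on && service ntpd start; service ntpd status' || return 1
    done
}

# Example (host names follow the pattern used in this article):
# fix_ntpd_on_hosts oldkHDPm.oldkHDP.oliviak.com oldkHDPw1.oldkHDP.oliviak.com
```

Setting RUNNER=echo before the call prints the commands instead of running them, which is a cheap way to do a dry run.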

Back in the browser on the DNS server, rerun the checks:

9 hdp 5 (6)

9 hdp 5 (8)

Next, you can choose the services you wish to install on your Hadoop cluster (see HDP documentation).

9 hdp 6

Next, select the hosts on which certain master components should run (see HDP doc). In this case, I choose to assign the master components of the Hive Server and the Oozie Server to the master node.

9 hdp 7

With the Ambari wizard, slave components (i.e. DataNodes, NodeManagers and RegionServers) can be appropriately assigned to certain hosts in the next window (see HDP doc).

9 hdp 8

Now you can manage the configuration settings for the Hadoop components along the tabs:

9 hdp 9 0

For instance, under HDFS we change the directories from

9 hdp 9 2 hdfs (1)

to the following:

9 hdp 9 2 hdfs (2)

For the remaining tabs marked with warnings, credentials are required, such as for Nagios,

9 hdp 9 1 nagios (1)

9 hdp 9 1 nagios (2)

…Hive,

9 hdp 9 3 hive

…and Oozie:

9 hdp 9 4 oozie

More information on customising the services related to your Hadoop cluster can be found here.

Finally, before deploying the Hadoop cluster you obtain the usual summary of configuration settings:

9 hdp 10

It contains the following information:

Admin Name admin
Cluster Name oldkHDPcluster
Total Hosts 4 (4 new)



HDFS

NameNode : oldkHDPm.oldkHDP.oliviak.com
SecondaryNameNode : oldkHDPw1.oldkHDP.oliviak.com
DataNodes : 3 Hosts

YARN + MapReduce2

NodeManager : 3 hosts
ResourceManager : oldkHDPw1.oldkHDP.oliviak.com
History Server : oldkHDPw1.oldkHDP.oliviak.com
App Timeline Server : oldkHDPw1.oldkHDP.oliviak.com


Clients : 1 host


Nagios

Server : oldkHDPm.oldkHDP.oliviak.com
Administrator : nagiosadmin / (<your-email-address>@blabla.com)


Server : oldkHDPm.oldkHDP.oliviak.com

Hive + HCatalog

Hive Metastore : oldkHDPm.oldkHDP.oliviak.com
Database : MySQL (New Database)


HBase

Master : oldkHDPm.oldkHDP.oliviak.com
RegionServers : 3 hosts


Clients : 1 host


Clients : 1 host


Oozie

Server : oldkHDPm.oldkHDP.oliviak.com
Database : Derby (New Derby Database)


Servers : 3 hosts


Server : oldkHDPw1.oldkHDP.oliviak.com


Storm

Nimbus : oldkHDPm.oldkHDP.oliviak.com
Storm REST API Server : oldkHDPm.oldkHDP.oliviak.com
Storm UI Server : oldkHDPm.oldkHDP.oliviak.com
DRPC Server : oldkHDPm.oldkHDP.oliviak.com
Supervisor : 3 Hosts

And away you deploy:

9 hdp 11 (1)

9 hdp 11 (2)

9 hdp 11 (3)

9 hdp 11 (4)

Finally, you obtain a summary of your Hadoop installation efforts:

9 hdp 11 (5)


You have the Ambari GUI nicely displayed in front of you:

9 hdp 12 (1)

While hovering over the tiles, you obtain more information, such as on the network usage:

9 hdp 12 (4)

9 hdp 12 (3)

or the cluster load:

9 hdp 12 (2)

  • Thank you ;-)
    Great Post!

  • The wget does not work for me, what could be the reason
    [root@aak-m1 ~]# wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.5.1/ambari.repo
    --2015-01-20 14:17:02-- http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.5.1/ambari.repo
    Resolving public-repo-1.hortonworks.com... failed: Temporary failure in name resolution.
    wget: unable to resolve host address “public-repo-1.hortonworks.com”

  • internet connection not working in Master and other Slaves. Any setting changes to be done. This stops from downloading repo any download command. Please suggest

  • Amol, the reason for it not to work is that that link is broken, you can search in hortonworks, for ambari and there is a link in there that has the new working link.

  • S. i tried to make ntpd work.. for some reason it wasnt running on my virtual machines, and the command given in here didnt seem to start it. so i had to use /etc/init.d/ntpd start instead

  • Thankyou

  • Thank you so much ! 4 Node cluster is up and running in no time. Cheers!
