In this article, we set up a Hadoop cluster on Azure using virtual machines running Linux. More specifically, we use HDP 2.1 for Linux, a distribution by Hortonworks, which also provides HDP distributions for the Windows platform. We install Hadoop with Ambari, an Apache project that provides an intuitive UI for provisioning, managing and monitoring a Hadoop cluster.
1 Introduction
2 Step-by-Step: Build the Infrastructure
3 Install a Hadoop Distribution
Now that we have set up the infrastructure for a Hadoop cluster in Azure, it is time to get our hands dirty with installing the actual Hadoop distribution.
We start by installing an Ambari server, which provides a “graphical” way of installing and deploying Hadoop.
a. Set Up Bits
b. Set Up Ambari
c. Start Ambari
Log onto your master node (in this case oldkHDPm) as root. This node will serve as the main installation host. Download the Ambari repository file. Since we use CentOS 6 as our platform, access the repository as follows:
Then copy the file to your yum.repos.d directory:
cp ambari.repo /etc/yum.repos.d
More information can be found here in the Hortonworks documentation.
You can confirm that the repository is configured by running yum repolist. The output lists the repo IDs and repo names, which should include the Ambari repository. The command may vary depending on the platform (see here for more information).
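The repository check can also be scripted. The following is a minimal sketch, not part of the Hortonworks instructions: the function name has_ambari_repo is hypothetical, and it assumes that the repo id or repo name contains the string "ambari", which may differ on your platform.

```shell
#!/bin/sh
# Succeeds if the repository listing read from stdin mentions Ambari.
# Assumption: the repo id or repo name contains "ambari" (any case).
has_ambari_repo() {
    grep -qi 'ambari'
}

# Usage on the master node (needs yum, so it is left commented out here):
# yum repolist | has_ambari_repo && echo "Ambari repository is configured"
```

The function only reads from stdin, so it can be tried against saved yum repolist output before running it on the node itself.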
Now, we can install the Ambari bits by running yum install ambari-server.
Now that the Ambari server is installed, let us set it up. Run ambari-server setup.
Here, we do not customise the user account for the ambari-server daemon since we have already changed the root password. We also accept the remaining default settings. More information can be found here in the Hortonworks documentation.
The Ambari server is set up and installed – ready to be started:
To have a look at the Ambari server processes, type in:
ps -ef | grep ambari
If more than one Ambari process is running, kill the extra process as follows:
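To see at a glance which process IDs belong to Ambari, the ps output can be filtered. This is a sketch under assumptions: the helper name ambari_pids is hypothetical, and it simply treats every line containing "ambari" as an Ambari process. Which of the listed PIDs to terminate (for example with kill followed by the PID) remains your decision.

```shell
#!/bin/sh
# Prints the PID (second column of `ps -ef` output) of every line that
# mentions "ambari". The [a] bracket keeps the awk process from matching
# its own command line.
ambari_pids() {
    awk '/[a]mbari/ { print $2 }'
}

# Usage on the master node:
# ps -ef | ambari_pids
```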
Now we are ready to install the Hadoop distribution, i.e. HDP 2.1, using Ambari. Again, we follow the Hortonworks documentation (here).
Log into your DNS server and open a web browser pointing to
In this case: http://oldkHDPm.oldkHDP.oliviak.com:8080
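The address is simply the fully qualified name of the Ambari host plus port 8080. A small sketch for building it, and optionally probing it, is shown below. The function name ambari_url is hypothetical, and the commented reachability check assumes that curl is available on the machine you run it from.

```shell
#!/bin/sh
# Builds the Ambari web UI address from the server's FQDN.
# The port defaults to 8080, the one used in this article.
ambari_url() {
    host=$1
    port=${2:-8080}
    printf 'http://%s:%s\n' "$host" "$port"
}

# Optional reachability check (assumes curl is installed):
# curl -s -o /dev/null -w '%{http_code}\n' "$(ambari_url oldkHDPm.oldkHDP.oliviak.com)"
```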
Name your cluster (see Hortonworks documentation), e.g. oldkHDPcluster:
Select your desired stack (see Hortonworks docs). We choose the latest for the time being, i.e. HDP 2.1:
The next window specifies the install options. Before we go into it, we take a little detour, namely how to copy the SSH private key onto the DNS server.
For that purpose, we install WinSCP, which enables secure file transfer between a local and a remote computer. Once installed, log in to the master node with its credentials (i.e. oldkHDP.cloudapp.net, port 22):
Use the WinSCP client to download the private SSH key (i.e. id_rsa) of the master node onto the DNS server:
Once downloaded onto the “local” machine, i.e. our DNS server, we can browse for it in the “Install Options” window:
Additionally, type in all the target hosts of your Hadoop cluster. In this case, it includes the master node and the three worker nodes:
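The list of target hosts can be generated rather than typed by hand. The sketch below rests on an assumption about naming: the master is called <prefix>m and the workers <prefix>w1 to <prefix>wN, mirroring oldkHDPm and oldkHDPw1 from this article. The function name cluster_hosts is hypothetical, so adjust it to your own host names.

```shell
#!/bin/sh
# Prints one FQDN per line: the master first, then the workers.
# Assumption: hosts are named <prefix>m and <prefix>w1..<prefix>wN.
cluster_hosts() {
    prefix=$1
    domain=$2
    workers=$3
    echo "${prefix}m.${domain}"
    i=1
    while [ "$i" -le "$workers" ]; do
        echo "${prefix}w${i}.${domain}"
        i=$((i + 1))
    done
}

# One master and three workers, as in this article:
cluster_hosts oldkHDP oldkHDP.oliviak.com 3
```

The output can be pasted directly into the target hosts field, one host per line.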
When registering and confirming, you will be prompted with another window containing the host name pattern expressions:
Success – the hosts are confirmed. Have a look at the Hortonworks documentation for more information.
You may or may not receive some warnings as shown in the yellowish area:
It turns out that the ntpd services are not running but are required to be. You could run the HostCleanup Python script on each host…
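…or manually enable the ntpd service by running
chkconfig ntpd on
on each host, i.e. the master and all three worker nodes. Note that chkconfig only registers the service to start at boot; to start it right away, also run service ntpd start: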
To check the status of the ntpd services, run
service ntpd status
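Repeating these commands on four hosts is tedious, so they can be generated per host. The following is a sketch under assumptions: it presumes that root can reach every node over SSH, and the function name ntpd_fix_commands is hypothetical. On CentOS 6, chkconfig ntpd on enables the service at boot, while service ntpd start starts it immediately. The function only prints the commands and executes nothing itself.

```shell
#!/bin/sh
# Prints, for every host passed as an argument, the ssh command that
# enables ntpd at boot, starts it, and reports its status.
# Nothing is executed here; review the output first.
ntpd_fix_commands() {
    for host in "$@"; do
        echo "ssh root@${host} 'chkconfig ntpd on && service ntpd start && service ntpd status'"
    done
}

# Usage (prints one line per host):
# ntpd_fix_commands oldkHDPm.oldkHDP.oliviak.com oldkHDPw1.oldkHDP.oliviak.com
```

Once the printed commands look right, they can be piped to sh to run them.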
Back in the browser on the DNS server, rerun the checks:
Next, you can choose the services you wish to install on your Hadoop Cluster (see HDP documentation).
Next, select the hosts on which certain master components should run (see HDP doc). In this case, I choose to assign the master components of the Hive Server and the Oozie Server to the master node.
With the Ambari wizard, slave components (i.e. DataNodes, NodeManagers and RegionServers) can be appropriately assigned to certain hosts in the next window (see HDP doc).
Now you can manage the configuration settings for the Hadoop components across the tabs:
For instance, under HDFS we change the directories from
to the following:
The remaining tabs marked with warnings, such as Nagios, require credentials.
More information on customising the services related to your Hadoop cluster can be found here.
Finally, before deploying the Hadoop cluster you obtain the usual summary of configuration settings:
It contains the following information:
NameNode : oldkHDPm.oldkHDP.oliviak.com
SecondaryNameNode : oldkHDPw1.oldkHDP.oliviak.com
DataNodes : 3 hosts

NodeManager : 3 hosts
ResourceManager : oldkHDPw1.oldkHDP.oliviak.com
History Server : oldkHDPw1.oldkHDP.oliviak.com
App Timeline Server : oldkHDPw1.oldkHDP.oliviak.com

Clients : 1 host

Server : oldkHDPm.oldkHDP.oliviak.com
Administrator : nagiosadmin / (<your-email-address>@blabla.com)

Server : oldkHDPm.oldkHDP.oliviak.com

Hive Metastore : oldkHDPm.oldkHDP.oliviak.com
Database : MySQL (New Database)

Master : oldkHDPm.oldkHDP.oliviak.com
RegionServers : 3 hosts

Server : oldkHDPm.oldkHDP.oliviak.com
Database : Derby (New Derby Database)

Servers : 3 hosts

Server : oldkHDPw1.oldkHDP.oliviak.com

Nimbus : oldkHDPm.oldkHDP.oliviak.com
Storm REST API Server : oldkHDPm.oldkHDP.oliviak.com
Storm UI Server : oldkHDPm.oldkHDP.oliviak.com
DRPC Server : oldkHDPm.oldkHDP.oliviak.com
Supervisor : 3 hosts
And away you deploy:
Finally, you obtain a summary of your Hadoop installation efforts:
You have the Ambari GUI nicely displayed in front of you:
While hovering over the tiles, you obtain more information, such as on the network usage:
or the cluster load: