In this blog series (Hadoop on Linux on Azure), we set up a Hadoop cluster on Azure using virtual machines running Linux. More specifically, we use Hortonworks' HDP 2.1 distribution for Linux (Hortonworks also provides HDP distributions for the Windows platform). Furthermore, we install Hadoop with Ambari, an Apache project that provides an intuitive UI for provisioning, managing and monitoring a Hadoop cluster.
1. Introduction
2. Step-by-Step: Build the Infrastructure
3. Step-by-Step: Install a Hadoop Distribution
In this article, we set up the infrastructure and configure the virtual machines to enable the installation of a Hadoop cluster via an Ambari Server.
We heavily base our step-by-step guide on Benjamin’s great article How to install Hadoop on Windows Azure Linux virtual machines and Hortonworks’ documentation Hortonworks Data Platform – Automated Install with Ambari.
The infrastructure is set up as follows:
What you will learn in this post (amongst others):
The article is structured as follows:
First, we will create a virtual network in which the nodes for the Hadoop cluster will be created. In the Azure management portal, simply click on New and custom-create a virtual network:
Choose a name for your virtual network of your liking. The datacenter location best suited for your Hadoop cluster is one near you. Here you can find an overview of all available Azure regions.
Later on, we will create a DNS server for resolving domain names for the nodes, which will also act as a desktop environment:
Let’s add some subnets: the first one for the DNS server,
In the Azure documentation you can find more information on configuring a cloud-only virtual network in the management portal.
Each virtual machine requires a storage account. You could use an automatically generated storage account when creating a VM, but this usually ends up with some ugly-looking name that no one can remember. Hence, take the proactive route and create a storage account beforehand:
We choose the locally redundant replication option to minimise costs.
Note that the location chosen here must be the same as the one from the virtual network created beforehand.
More information on how to create a storage account can be found here in the Azure documentation.
In this section, we will create the DNS server for our Hadoop cluster. For simplicity, we use a VM based on Windows Server, but a Linux-based VM can of course also be used as a DNS server.
We will first create an ordinary Windows Server VM, and then assign the DNS role to it in a second step.
Let’s get started with creating our first VM in the virtual network. The short summary is as follows:
In longer version: click on New in the bottom left corner of the Azure management portal:
Choose your image, in this case Windows Server 2012 R2 Datacenter:
Now you specify the virtual machine’s name, e.g. oldkHDPdns, and its credentials:
We create a new cloud service (named <cloud-service> i.e. oldkHDP), and use the previously created virtual network to specify the first subnet. Likewise, we use the previously created storage account:
Once created, you obtain a nice overview of all important info on our DNS VM, such as the associated disk, its DNS name, internal IP address and much more:
So far we have created a general VM. To make it a DNS server, we will have to remotely connect to the virtual machine and add the DNS role. With Windows Server 2012 R2, this is easily done from the Server Manager:
This will take you through a nice and easy wizard to add the necessary roles and features for it to be a DNS server:
Voila, the VM can now be called a DNS Server!
Let’s now define zones in the DNS Server via the DNS Manager. You can reach it as usual from the ubiquitous Server Manager:
We add a new zone for the virtual network oldkHDP, in typical Windows style – a wizard! In short, you configure the following:
As a result, you will find the newly created forward lookup zone with two files contained:
The same is done to create the corresponding reverse lookup zone, which translates IP addresses into DNS names:
You may want to turn off enhanced security for the Internet Explorer to simplify things later on:
Here we want to create a custom Linux image that we will then use for creating the master node and the three worker nodes comprising our Hadoop cluster. Why the hassle? Alternatively, we could use a standard Linux image, but we would then have to repeat certain configurations on each node. Instead, we make the configurations once and create all nodes based on that image.
In more detail, let's create a new virtual machine from the Azure management portal and choose the OpenLogic image (CentOS). Alternatively, one can also use an image based on Ubuntu or SUSE.
You can provide any name of your liking. Note that this VM will later be used as a custom image for the nodes of the Hadoop cluster, so a name containing image or template might be suitable. For authentication we will provide a password.
We use the cloud service we have created with the DNS server beforehand. Similarly, select the storage account created in 2. Create Storage Account.
And now we have created a Linux VM. More information on how to create a virtual machine running Linux on Azure can be found here.
To remotely connect to our Linux VM, I recommend using PuTTY, a free SSH and telnet client for Windows. Enter the SSH details marked above as the host name to connect to the Linux VM. To make life easier later, save the session as well.
And then log in with the credentials that you have provided when creating the virtual machine:
For more information on logging on to a Linux VM, have a look at the Azure documentation.
The documentation provided by Hortonworks offers a very nice step-by-step guide. Disable SELinux so that the Ambari server can be set up later on; see here in the Hortonworks documentation.
The hosts file on each host within our to-be-created cluster needs to include the IP address and FQDN of each host, as described in the Hortonworks documentation. Instead of doing it later on every single host, make these changes to the hosts file now, in the Linux template that serves as the foundation for each node later on.
Our cluster will consist of 1 DNS server, 1 master node and 3 worker nodes. In the hosts file, you specify the IP address, the FQDN and the short name, corresponding to the primary zone you have created in the DNS (2.3.2):
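As a sketch, the resulting /etc/hosts might look like the following. The IP addresses and the domain suffix are purely illustrative: use the internal IPs Azure assigned within your subnets and the zone name you created on the DNS server. The host names for the master and the remaining workers follow the naming pattern used in this post (oldkHDPdns, oldkHDPw1) and are assumptions here:

```
# /etc/hosts (illustrative values)
10.0.0.4    oldkHDPdns.oldkhdp.local    oldkHDPdns
10.0.1.4    oldkHDPm.oldkhdp.local      oldkHDPm
10.0.2.4    oldkHDPw1.oldkhdp.local     oldkHDPw1
10.0.2.5    oldkHDPw2.oldkhdp.local     oldkHDPw2
10.0.2.6    oldkHDPw3.oldkhdp.local     oldkHDPw3
```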
Since we have CentOS installed, disable PackageKit as follows:
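Concretely, per the Hortonworks install guide for CentOS 6, this means setting enabled=0 in the PackageKit yum plugin configuration (the path shown is the standard one; adjust if your system differs):

```
# /etc/yum/pluginconf.d/refresh-packagekit.conf
[main]
enabled=0
```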
See here in the Hortonworks documentation.
In Azure we can capture the state of this virtual machine into an image, from which we can deploy several virtual machine instances, i.e. master and worker nodes, later on. First, undo the provisioning customisation by running the Azure agent's deprovision command

sudo waagent -deprovision

and shut down the VM by running

shutdown -h now
In the Azure management portal, click on Capture at the bottom.
More information on capturing a Linux virtual machine can be found here in the Azure documentation.
Now we can create the nodes for our Hadoop cluster using the template we have created in 2.4 Capture Linux Image.
Start with creating a master node in the Azure management portal. In short:
As described beforehand when connecting to our Linux image, we can check the internal IP address and SSH details on the dashboard of the newly created master node in the Azure Management portal, and connect via PuTTY:
Once logged in, we can check the internal IP address one more time running sudo ifconfig.
Back in the Azure management portal, attach an empty disk to the master node of size 100 GB:
Shortly after, we see the empty disk listed on the dashboard of the master node:
The same applies to the three worker nodes, with the difference that they will be in Subnet-3. We could do this via the Azure management portal as before in 5.1 Create master node, but since we would have to repeat the whole process three times, this practically calls for PowerShell!
It essentially creates three VMs, each with the following information:
In the PowerShell script, an empty disk is attached to each VM, followed by the command updating the VM itself.
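The script itself is not reproduced here; a minimal sketch using the classic (Service Management) Azure PowerShell cmdlets of that era might look like the following. The image, cloud service, subnet, user name and instance size are assumptions and must be replaced with your own values:

```powershell
# Sketch only: classic Azure Service Management cmdlets; all names are illustrative.
$imageName   = "oldkHDP-template"   # the Linux image captured in 2.4
$serviceName = "oldkHDP"            # the existing cloud service (already in the virtual network)

foreach ($i in 1..3) {
    $name = "oldkHDPw$i"

    # Build the VM configuration from the captured image and place it in Subnet-3
    $vm = New-AzureVMConfig -Name $name -InstanceSize "Large" -ImageName $imageName |
          Add-AzureProvisioningConfig -Linux -LinuxUser "azureuser" -Password "<password>" |
          Set-AzureSubnet -SubnetNames "Subnet-3"
    New-AzureVM -ServiceName $serviceName -VMs $vm

    # Attach an empty 100 GB data disk, then update the VM itself
    Get-AzureVM -ServiceName $serviceName -Name $name |
        Add-AzureDataDisk -CreateNew -DiskSizeInGB 100 -DiskLabel "data" -LUN 0 |
        Update-AzureVM
}
```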
Now that the master node and the three worker nodes have been created, the DNS server needs to be configured. In other words, the nodes have to be inserted manually as hosts into the zone we created in 3.2 Add DNS Role. Add a new host for each node with the same information provided in 4.3 Edit Hosts File.
So adding the master node as a host looks as follows:
Eventually, you have inserted four additional hosts:
Next, edit the hosts file found in C:\Windows\System32\drivers\etc by inserting the new information as follows:
To check that the connections have been established correctly, you can ping all nodes from the DNS server…
…and from the master node:
Here, we configure each node to enable passwordless SSH and to prepare for the installation of the Ambari server later on.
Why change the root password? Later on we want to install the Ambari server using the root account so that no password will be required. For that, change the root password on each host using the command sudo passwd root. Start with the master node:
As the Hortonworks documentation specifies, the desired network configuration needs to be set on each host, i.e. specifying the FQDN for each host.
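On CentOS, this means setting HOSTNAME in /etc/sysconfig/network to the host's fully qualified domain name. A sketch for the master node; the FQDN shown is an assumption and must match the zone configured on your DNS server:

```
# /etc/sysconfig/network (illustrative FQDN)
NETWORKING=yes
HOSTNAME=oldkHDPm.oldkhdp.local
```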
The setup of the Ambari server requires certain ports to be open and available. The easiest option is to disable iptables temporarily by running sudo service iptables stop (to keep it disabled across reboots, additionally run sudo chkconfig iptables off):
See here in the Hortonworks documentation.
Now repeat these three steps (1. change the root password, 2. edit the network configuration file, 3. configure iptables) on each remaining node, i.e. the three worker nodes. Connect to each worker node using the SSH details given on the Azure management portal.
Worker node 1:
Change root password for worker node 1:
Edit the network configuration file in worker node 1:
Configure iptables in worker node 1:
Worker node 2:
Worker node 3:
Have a look at the Hortonworks documentation in here.
Start off with generating public and private SSH keys on the master node (i.e. Ambari Server host), by running
ssh-keygen -t rsa -P ""
Use default values.
Copy the SSH public key to the root account on all target hosts, i.e. worker nodes, by running
ssh-copy-id -i ~/.ssh/id_rsa.pub root@oldkHDPw1
Here, root's password is required, which is the reason for changing it beforehand. Check whether you can connect to the specified host by running ssh root@oldkHDPw1.
Once connected, set permissions on the .ssh directory and authorized_keys:
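A minimal sketch of the permission commands; sshd refuses key-based authentication if these files are more permissive than shown:

```shell
# The .ssh directory must be accessible only by its owner (700),
# and authorized_keys must not be readable or writable by others (600).
mkdir -p ~/.ssh
chmod 700 ~/.ssh
touch ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```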
Repeat for all worker nodes and the master node itself:
Now, if you connect to the other nodes from the master node via SSH, you should not be asked for a password. Check it!
Each node in our to-be Hadoop cluster has an empty data disk attached to it, but each disk is still offline. Hence, some initialisation is required, i.e. creating a new partition, a file system and a mount directory, before the data disk is ready for use. Extensive information on how to attach a data disk to a Linux VM can be found here in the Azure documentation. We start with the master node:
List all partitions on the master node by running fdisk -l. We obtain information on the empty attached disk, marked in blue:
Running the command grep SCSI /var/log/messages helps you find the identifier of the last data disk added (also marked in blue):
Now, we will create a new partition on the empty attached disk by first running fdisk /dev/sdc, and then typing in the following commands:
and accept default values:
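Spelled out, the interactive fdisk dialog is typically the following keystroke sequence (a sketch, assuming /dev/sdc is the device identified above):

```
n    <- new partition
p    <- primary partition
1    <- partition number 1
     <- press Enter to accept the default first sector
     <- press Enter to accept the default last sector
w    <- write the partition table and exit
```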
Now, create the file system on the new partition, in this case ext3, by running mkfs.ext3 /dev/sdc1.
Following that, we need a directory to mount the new file system. Hence, run mkdir /mnt/datadrive to create the directory /mnt/datadrive. To mount the drive, run
mount /dev/sdc1 /mnt/datadrive
such that the data disk is ready to use as /mnt/datadrive.
Add the new drive to /etc/fstab:
/dev/sdc1 /mnt/datadrive ext3 defaults 1 2
You repeat the same for all three worker nodes:
Now, the infrastructure is all set up and ready for the Hadoop installation!
Big thanks also go to Benjamin Guinebertière and Hans Laubisch for their support.
Great setup instructions! I just got the infrastructure set up using two worker nodes instead of three to save a little money with Azure. It took me about 6 hours to carefully walk through every step, since it's been 12 years since I last touched Linux.
I'm starting on the Hadoop installation now. I think going this route will save me a lot of money compared to using HDInsight, which seems outrageously priced for a micro-sized startup company! I'm on a shoestring budget here. :(
HDInsight works great, but it has its own issues (you have to access everything through Azure PowerShell, and it's not so Linux-friendly). It works well for a high-end configuration, but going through self-provisioned VMs is the best bet for low-cost operations.
I completed the infrastructure part and was ready to install HDFS. I decided to shut down the worker and master nodes to reduce usage (the DNS VM was left on). The next day, a new public IP was assigned and nothing works: ping, ssh, sftp. The DNS VM doesn't even recognize itself via ping localhost.
There must be a better way to persist the DNS settings.
I'm thinking Azure is not ready for prime time.
Thanks, great post. I have an observation and a help request regarding Section 3.1 of this document.
In your screenshot you see "Virtual Network Subnet" and you assign Subnet 1.
When I follow the exact instructions, I do not see "Virtual Network Subnet"; it does not give me the option to assign Subnet 1.
Can you help ?
Please ignore or delete my previous comment. Only when you select the virtual network you created does the virtual network subnet drop-down appear for you to pick.