Olivia's Blog

All on Big Data and Windows Azure

Hadoop on Linux on Azure – Step-by-Step: Build the Infrastructure (2)

Hadoop on Linux on Azure – Step-by-Step: Build the Infrastructure (2)

  • Comments 12
  • Likes

In this blog series (Hadoop on Linux on Azure), we set up a Hadoop cluster on Azure using virtual machines running Linux. More specifically, we use the HDP 2.1 on Linux distribution by Hortonworks that also provides the HDP distributions for the Windows platform. Furthermore, we install Hadoop with Ambari, an Apache project that provides an intuitive UI for provisioning, managing and monitoring a Hadoop cluster.

Contents

1 Introduction
2 Step-by-Step: Build the Infrastructure
3 Step-by-Step: Install a Hadoop Distribution

Step-by-Step: Build the Infrastructure

In this article, we set up the infrastructure and configure the virtual machines to enable the installation of a Hadoop cluster via an Ambari Server.

We heavily base our step-by-step guide on Benjamin’s great article How to install Hadoop on Windows Azure Linux virtual machines and Hortonworks’ documentation Hortonworks Data Platform – Automated Install with Ambari.

The infrastructure is set up as follows:

0 architecture 2

What you will learn in this post (amongst others):

  • Create a DNS Server in Azure IaaS
  • Linux on Azure:
    • Create a Linux VM
    • Log on to a Linux VM
    • Capture a Linux VM
    • Authenticate using SSH Keys
    • Set up passwordless SSH
    • Attach Disks
  • Create several virtual machines using PowerShell

The article is structured as follows:

  1. Create Virtual Network
  2. Create Storage Account
  3. Create DNS Server
  4. Create Linux Image
  5. Create Master and Worker Nodes
  6. Configure DNS Server
  7. Configure all nodes (root password, network, iptables, passwordless SSH, format empty disk)

 

1. Create Virtual Network

First, we will create a virtual Network in which the nodes for the Hadoop cluster will be created. In the Azure management portal, you simply click on New and custom create a virtual Network:

1 vn (1)

Choose a name for your virtual network of your liking. The datacenter location best suited for your Hadoop cluster is one nearby yours. Here you find an overview of all available Azure regions.

1 vn (2)

Later on, we will create a DNS server for distributing domain names for the nodes as well as acting as a desktop environment:

DNS Server IP Address
<DNS_Server_name> 10.124.1.4

1 vn (3)

Let’s add some subnets: the first one for the DNS server,

Subnet Starting IP CIDR (Address Count)
Subnet-1 (or DNS-Subnet) 10.124.1.0 /24
Subnet-2 (or Master-Subnet) 10.124.2.0 /24
Subnet-3 (or Worker-Subnet) 10.124.3.0 /24

1 vn (4)

In summary:

Name <virtual-network-name> e.g. oldkHDP
Location e.g. North Europe
DNS Server <dns-server> e.g. oldkHDPdns
IP address for DNS server 10.124.1.4
Subnet-1 10.124.1.4 /24
Subnet-2 10.124.2.4 /24
Subnet-3 10.124.3.4 /24

1 vn (5)

In the Azure documentation you can find more information on configuring a cloud-only virtual network in the management portal.

2. Create Storage Account

Each virtual machine requires a storage account. Either you can use an automatically generated storage account when creating a VM, but this usually ends up being some ugly looking name that noone can remember. Hence, take the proactive route and create a storage account beforehand:

2 storage (2)

We choose the locally redundant replication option to minimise costs.

In short:

URL <storage-account>.core.windows.net e.g. oldkhdpstrg
Location North Europe
Replication Locally Redundant

Note that the location chosen here must be the same as the one from the virtual network created beforehand.

Some more Information on how to create a storage account can be found here in the Azure documentation.

3. Create DNS Server

In this paragraph, we will create the DNS Server for our Hadoop cluster. For simplicity, we use a VM based on Windows Server, but a Linux-based VM can obviously be also used for the purpose of a DNS Server.

We first will create an ordinary Windows Server VM, and then will assign the DNS role to it in the second step.

  1. Create Windows Server VM
  2. Add DNS Role
  3. Create New DNS Zone

 

3.1 Create Windows Server VM

Let’s get started with creating our first VM in the virtual network. The short summary is as follows:

Image Windows Server 2012 R2 Datacenter
Virtual Machine Name <dns-server> e.g. oldkHDPdns
Size A1 (Small)
User name <dns-user> e.g. dnsadmin
Cloud Service DNS name [Create new] <cloud-service> e.g. oldkHDP
Region/Affinity Group/Virtual Network <virtual-network> (i.e. oldkHDP from 1. Create Virtual Network)
Virtual Network Subnets Subnet-1
Storage Account <storage-account> (i.e. oldkhdpstrg from 2. Create Storage Account)
Availability Set None

In longer version: click on New in the bottom left corner of the Azure management portal:

3 dns (0)

You choose your Image, i.e. Windows Server 2012 R2 Datacenter:

3 dns (0a)

Now you specify the virtual machine’s name, e.g. oldkHDPdns, and its credentials:

3 dns (1)

We create a new cloud service (named <cloud-service> i.e. oldkHDP), and use the previously created virtual network to specify the first subnet. Likewise, we use the previously created storage account:

3 dns (2)

Once created, you obtain a nice overview of all important info on our DNS VM, such as the associated disk, its DNS name, internal IP address and much more:

3 dns (4)

3.2 Add DNS Role

So far we have created a general VM. To make it a DNS server, we will have to remotely connect to the virtual machine and add the DNS role. With Windows Server 2012 R2, this is easily done from the Server Manager:

3 dns (5)

This will take you through a nice and easy wizard to add the necessary roles and features for it to be a DNS server:

3 dns (6)

3 dns (7)

3 dns (8)

3 dns (9)

3 dns (10)

3 dns (11)

3 dns (12)

3 dns (13)

Voila, the VM can now be called a DNS Server!

3.3 Create New DNS Zone

Let’s now define zones in the DNS Server via the DNS Manager. You can reach it as usual from the ubiquitous Server Manager:

3 dns (14)

3 dns (15)

We add a new zone for the virtual network oldkHDP, in typical Windows style – a wizard! In short, you configure the following:

Type of zone Primary
Type of lookup zone Forward lookup zone
Zone name <zone-name> e.g. oldkHDP.oliviak.com
Zone file [Create new] e.g. oldkHDP.oliviak.com.dns
Dynamic update Do not allow dynamic updates

3 dns (16)

3 dns (17)

3 dns (18)

3 dns (19)

3 dns (20)

3 dns (21)

As a result, you will find the newly created forward lookup zone with two files contained:

3 dns (22)

Same will be done for creating the corresponding reverse lookup zone to translate IP addresses into DNS names:

3 dns (23)

In short:

Type of zone Primary zone
IPv4 or IPv6 IPv4 reverse lookup zone
Reverse lookup zone name 124.10.in-addr.arpa
Zone file [Create new] 124.10.in-addr.arpa
Dynamic Update Do not allow dynamic updates

3 dns (24)

3 dns (25)

3 dns (26)

3 dns (27)

3 dns (28)

3 dns (29)

You may want to turn off enhanced security for the Internet Explorer to simplify things later on:

3 dns (30)

 

4. Create Linux Image

  1. Create Linux VM
  2. Disable SELinux
  3. Edit hosts file
  4. Disable PackageKit
  5. Capture Image

Here we want to create a custom Linux image that we will then use for creating the master node and the three worker nodes comprising our Hadoop cluster. Why the hassle? Well, alternatively we could use a normal Linux image but then will have to repeat certain configurations for each node. Now instead we are making the configurations once and create all nodes based on that.

4.1 Create Linux VM

In short:

Image OpenLogic
Virtual Machine Name <Linux-Image> e.g. oldkHDPimage
Size A1 (small)
New user name <user-name> e.g. olivia
Authentication Password
Cloud Service <cloud-service> (i.e. the one created in 3.1 Create WS VM: oldkHDP)
Virtual Network Subnet Subnet-3
Storage Account <storage-account> (i.e. the one created in 2. Create Storage Account: oldkhdpstrg)
Availability Set None

More elaborate, let’s create a new virtual machine from the Azure management portal and choose the OpenLogic Image. Alternatively, one can also use an Image based on Ubuntu or Suse.

4 linuximg (00)

You can provide any name of your liking. Note that this VM will later be used as a custom image for the nodes of the Hadoop cluster, so a name with image or template might be suitable. For authentication we will provide a Password.

4 linuximg (0)

We use the cloud service we have created with the DNS server beforehand. Similarly, select the storage account created in 2. Create Storage Account.

4 linuximg (1)

And now we have created a Linux VM. More Information on how to create a virtual machine running Linux on Azure can be found here.

4 linuximg (2)

To remotely connect to our Linux VM, I recommend using PuTTY – a free SSH and telnet client for Windows. Type in the SSH Details marked above into the host name to connect with the Linux VM. To make life easier later, save the session as well.

4 linuximg (3)

4 linuximg (4)

And then log in with the credentials that you have provided when creating the virtual machine:

4 linuximg (5)

For more Information on logging on to a Linux VM, have a look at the Azure documentation.

4.2 Disable SELinux

The documentation provided by Hortonworks provides a very nice step-by-step guide. Disable SELinux to setup the Ambari server later on, see here in the Hortonworks documentation.

4 linuximg (6)

4 linuximg (7)

4.3. Edit Hosts file

The hosts file on each host within our to-be-created cluster needs to include the IP address and FQDN of each host, as written in the Hortonworks documentation. As opposed to doing it later on every single host, do these changes to the hosts file now in the Linux template that serves as the foundation for each node later on. In the Hortonworks docu

4 linuximg (9)

Our cluster will consist of 1 DNS server, 1 master node and 3 worder nodes. In the hosts file, you specify the IP address, the FQDN and the shortname, corresponding to the primary zone you have created in the DNS (2.3.2):

IP Address FQDN (fully qualified domain name) Shortname
10.124.1.4 oldkHDPdns.oldkHDP.oliviak.com oldkHDPdns
10.124.2.4 oldkHDPm.oldkHDP.oliviak.com oldkHDPm
10.124.3.4 oldkHDPw1.oldkHDP.oliviak.com oldkHDPw1
10.124.3.5 oldkHDPw2.oldkHDP.oliviak.com oldkHDPw2
10.124.3.6 oldkHDPw3.oldkHDP.oliviak.com oldkHDPw3

4 linuximg (8)

4.4. Disable PackageKit

Since we have CentOS installed, disable PackageKit as follows:

4 linuximg (11)

4 linuximg (10)

In the Hortonworks documentation, refer to here.

4.5. Capture Image

In Azure we can capture the state of this virtual machine into an image to deploy several virtual machine instances, i.e. master and worker nodes, later on. First, undo provisioning customisation by running

waagent –deprovision

and shut down the VM running

shutdown –h now

4 linuximg (14)

In the Azure management portal, click on Capture in the bottom.

4 linuximg (15)

4 linuximg (16)

4 linuximg (17)

More information on capturing a Linux virtual machine can be found here in the Azure documentation.

5. Create Master and Worker Nodes

  1. Create Master Node and Attach Empty Disk
  2. Create Worker Nodes

Now we can create the nodes for our Hadoop cluster using the template we have created in 2.4 Capture Linux Image.

5.1. Create Master Node and Attach Empty Disk

Start with creating a master node in the Azure management portal. In short:

Image Image created in 4. Create Linux Image e.g. oldkHDPimage
Name <master-node>, e.g. oldkHDPm
Size A2 (medium)
User name e.g. olivia
Authentication Provide a password
Cloud Service Cloud Service created in 3.1 Create Windows Server VM, e.g. oldkHDP
Virtual Network Subnet Subnet-2

5 master (0)

5 master (1)

5 master (2)

5 master (3)

As described beforehand when connecting to our Linux image, we can check the internal IP address and SSH details on the dashboard of the newly created master node in the Azure Management portal, and connect via PuTTY:

5 master (4)

5 master (5)

Once logged in, we can check the internal IP address one more time running sudo ifconfig.

5 master (6)

Back in the Azure management portal, attach an empty disk to the master node of size 100 GB:

5 master (7)

5 master (8)

Shortly after, we see the empty disk listed on the dashboard of the master node:

5 master (9)

 

5.2. Create Worker Nodes

The same applies to the three worker nodes, with the difference that they will be in Subnet-3. We can do this via the Azure management portal as done before in 5.1 Create master node, but since we would repeat the whole process three times, this literally calls out for PowerShell!

001
002
003
004
005
006
007
008
009
010
011
012
013
014
015
016
017
018
019
020
021
022
023
024
025
026
027
028
029

# Create three VMs i.e. worker nodes
for($i=1; $i -le 3; $i++
)
{
   
Write-Host "creating oldkHDP${i}OS"

    # SSH port is unique to each VM
    $sshPort = 52200 + $i

    #create a new VM Config
    # Specify: VM image, size, name, availability set, label and disk label. Use Subnet-3
    $newVM =
 `
       
New-AzureVMConfig -ImageName oldkHDPimage -InstanceSize Medium -Name "oldkHDPw$i"
 `
           
-AvailabilitySetName "oldkHDPwAS" -DiskLabel "oldkHDPw${i}os"
 `
           
-HostCaching ReadWrite -Label "oldkHDPw$i" |
        Add-AzureProvisioningConfig -Linux -LinuxUser "olivia" -Password $adminPassword |
        Add-AzureEndpoint -LocalPort 22 -Name "SSH$i" -Protocol tcp -PublicPort $sshPort |
        Set-AzureSubnet 'Subnet-3'
   
   
#create the VM
    Write-Host "creating VM oldkHDP${i}"
    New-AzureVM -ServiceName 'oldkHDP' -VMs $newVM -VNetName 'oldkHDP'
   
   
#attach empty disk
    write-host "attaching empty disk"
    Get-AzureVM -ServiceName 'oldkHDP' -Name "oldkHDPw$i"
 `
       
| Add-AzureDataDisk -CreateNew -DiskLabel "oldkHDPw${i}data1" -DiskSizeInGB 100 -LUN 0
 `
       
| Update-AzureVM

}

It essentially creates three VMs, each with following Information:

Image oldkHDPimage
Size Medium (A2)
Name oldkHDPw{i} (i=1-3)
Availability set oldkHDPwAS
User olivia
SSH 5220{i} and 22
Cloud Service oldkHDP
Virtual Network oldkHDP

In the PowerShell script, an empty disk is attached to each VM, followed by the command updating the VM itself.

6. Configure DNS Server 

Now that the master node and the three worker nodes have been created, the DNS server needs to be configured. In other words, the nodes have to be inserted manually as hosts in the zone we have created in 3. 2 Add DNS Role. Add a new host for each node with the same information provided in 4.3 Edit Hosts File.

6 dns (1)

So adding the master node as a host looks as follows:

6 dns (3)

6 dns (4)

Eventually, you have inserted four additional hosts:

6 dns (5)

Following, edit the hosts file to be found in C:\Windows\System32\drivers\etc by inserting the new Information as follows:

6 dns (6)

6 dns (7)

Checking if the Connections have been established correctly, you can ping all nodes from the DNS Server…

6 dns (8)

…and from the master node:

6 dns (9)

 

7. Configure all nodes

  1. Hadoop-specific and little configurations
    1. Change root password
    2. Edit network configuration file
    3. Configure iptables
    4. Repeat for all nodes
  2. Set Up Password-less SSH
  3. Format empty attached disks
    1. Master node
    2. Worker nodes

 

7.1 Hadoop-specific and little configurations

Here, we configure each node regarding aspects to either enable passwordless SSH or the installation of the Ambari Server later on.

7.1.1 Change root password

Why change the root password? Later on we want to install the Ambari server using the root account so that no password will be required. For that, change the root password in each host using the commend sudo passwd root. Start with the master node:

7 1 config_m (1)

7.1.2 Edit network configuration file

As the Hortonworks documentation specifies, the desired network configuration for each host needs to be set, i.e. specifying the FQDN for each host.

7 1 config_m (3)

7 1 config_m (2)

7.1.3 Configure iptables

The setup of the Ambari Server requires certain ports to be open and available. Easiest option is to disable iptables temporarily:

7 1 config_m (4)

See here in the Hortonworks documentation.

7.1.4 Repeat for all nodes

Now repeat these three steps (1. change root password, 2. edit the network configuration file, 3. configure iptables) for each node, i.e. the three worker nodes. Connect to each worker node given the SSH details on the Azure management portal.

Worker node 1:

7 2 config_w (1)

Change root password for worker node 1:

7 2 config_w (2)

Edit network configuration file in worker node 2:

7 2 config_w (4)

7 2 config_w (3)

Configure iptables in worker node 1:

7 2 config_w (5)

Worker node 2:

7 2 config_w (6)

Worker node 3:

7 2 config_w (7)

7.2. Set Up Password-less SSH

Have a look at the Hortonworks documentation in here.

Start off with generating public and private SSH keys on the master node (i.e. Ambari Server host), by running

ssh-keygen –t rsa –P “”

Use default values.

7 3 ssh (1)

Copy the SSH public key to the root account on all target hosts, i.e. worker nodes, by running

ssh-copy-id –i ~/.ssh/id_rsa.pub root@oldkHDPw1

Here, the root’s password is required, which is the reason for changing it beforehand. Check if you can connect to specified host by running ssh root@oldkHDPw1.

Once connected, set permissions on the .ssh directory and authorized_keys:

7 3 ssh (3)

Repeat for all worker nodes and the master node itself:

7 3 ssh (4)

Now, if you connect to the other nodes from the master node via SSH, you shall not be asked for a Password. Check it!

7 3 ssh (5)

7 3 ssh (6)

 

7.3. Format empty attached disks (for all nodes)

  1. Master node
  2. Worker nodes

Each node contained in our to-be-Hadoop-cluster as an empty data disk attached to it, but each is still offline. Hence, some initialisation is required, i.e. creating a new partition, a file system and a directory, for the data disk to be ready for use. Extensive information on how to attach a data disk to a Linux VM can be found here in the Azure documentation. We start with the master node:

7.3.1 Master node

List all partitions in the master node running fdisk –l. We obtain Information on the empty attached disk marked in blue:

7 4 format_m (1)

Running the command grep SCSI /var/log/messages helps you find the identifier of the last data disk added (also marked in blue):

7 4 format_m (2)

Now, we will create a new device on the empty attached disk, by first running fdisk /dev/sdc, and then typing in the following commands:

Command What is happening?
p See details about the disk to be partitioned.
n Create a new partition.
p Make the partition the primary partition
1 Make it the first partition.

and accept default values:

7 4 format_m (3)

Command What is happening?
p See details about the disk that is being partitioned.
w Write settings for the disk.

7 4 format_m (4)

Now, create the file system on the new partition, in this case an ext2/3 filesystem running mkfs.ext3 /dev/sdc1.

Following that, we need a directory to mount the new file system. Hence, run mkdir /mnt/datadrive to create the directory /mnt/datadrive. To mount the drive, run

mount /dev/sdc1 /mnt/datadrive

such that the data disk is ready to use as /mnt/datadrive.

7 4 format_m (7)

Add the new drive to /etc/fstab:

/dev/sdc1 /mnt/datadrive ext3 defaults 1 2

7 4 format_m (6)

7.3.2. Format empty attached disk: Worker Nodes

You repeat the same for all three worker nodes:

7 4 format_w (1)

7 4 format_w (2)

7 4 format_w (3)

7 4 format_w (4)

Now, the infrastructure is all set up and ready for the Hadoop installation!

Big thanks also go to Benjamin Guinebertière and Hans Laubisch for their support.

Comments
  • Thanks a lot.i have to studied lot of hadoop information.so,this information is excellent.
    http://www.hadooptrainingchennai.co.in">Hadoop Training in Chennai

  • awesome information. Thank you

  • You are the best. Just took long to reach you.

  • Great post!

  • Great Post...!!

  • Great setup instructions! I just now got the infrastructure setup using two worker nodes instead of three to save a little money with azure. Took me about 6 hours to carefully walk through every step since its been 12 years since I've last touched linux. I'm starting on the hadoop installation now. I think going this route will save me a lot of money than using HDInsight which seems outrageously priced for a micro sized startup company! I'm on a shoe string budget here. :(

    Thanks again!

  • HDInsight works great - but it has its own issues (you have to access everything through azure power shell and not so linux friendly) - but works great for a high-end configuration. Going through self-provisioned VMs is the best bet for low cost operations.

    GK (Gopalakrishna)
    http://gk.palem.in/

  • I completed the infrastructure part and was ready to install HDFS. I decided to shutdown the worker and master nodes to reduce usage (DNS VM was left on). The next day, new public IP is assigned and nothing works....ping, ssh, sftp. DNS VM doesn't even recognize itself via ping localhost.

    There must be a better way to persist the DNS settings.

    I'm thinking Azure is not ready for prime time.


  • Thanks great post. I have an observation and help request in Section 3.1 of this document

    In your screenshot you see : Virtual Network Subnet and you assign Subnet 1.
    When i follow the exact instructions, I do not see "Virtual Network Subnet" , it does not give me this option to assign subnet 1.

    Can you help ?

  • Please ignore or delete my previous comment, only when u select the virtual network you created it drops down the virtual network subnet for you to pick...

Your comment has been posted.   Close
Thank you, your comment requires moderation so it may take a while to appear.   Close
Leave a Comment