In this blog series, we set up a Hadoop cluster on Azure using virtual machines running Linux. More specifically, we use the HDP 2.1 on Linux distribution by Hortonworks that also provides the HDP distributions for the Windows platform. Furthermore, we install Hadoop with Ambari, an Apache project that provides an intuitive UI for provisioning, managing and monitoring a Hadoop cluster.
1 Introduction2 Step-by-Step: Build the Infrastructure3 Step-by-Step: Install a Hadoop Distribution
While HDInsight is the Platform as a Service (PaaS) option for building and running a Hadoop cluster in Microsoft Azure, this article specifies its IaaS (Infrastructure as a Service) counterpart. With the IaaS option you have more flexibility in the choice of Hadoop distributions, Hadoop components and platform (e.g. Linux), amongst others.
This blog series elaborates the install of Hortonworks’ Hadoop distribution for Linux, HDP 2.1 for Linux. Alternatives for commercial Hadoop distributions on Linux include Cloudera (CDH) and MapR. Moreover, we will use CentOS as the Linux platform. In the end, we will have a four-node Hadoop cluster: one master node (also called NameNode) and three worker nodes (also called DataNode):
We heavily base our step-by-step guide on Benjamin’s great article How to install Hadoop on Windows Azure Linux virtual machines and Hortonworks’ documentation Hortonworks Data Platform – Automated Install with Ambari.
Before installing a Hadoop distribution though, the required environment needs to be prepared. Thus, the next article walks through the infrastructure setup for such a cluster on Microsoft Azure.
One of the nicer blogs that I have read and followed closely. I had to do exactly what the author did - install HDP on a cluster of Linux machines (VMs) on Azure. Thanks to Olivia, it turned out to be reasonable. The number of steps to get the cluster
up and running is more than I expected, so writing automated scripts will be useful.