In the blog series Mahout for Dummies, we explore options for using Mahout on HDInsight:
1. What is Mahout?
2. Step-by-Step: Mahout with HDInsight Interactive Style
3. Step-by-Step: Mahout with HDInsight PowerShell Style
In this episode of the series Mahout for Dummies, we deal with Mahout on HDInsight in a PowerShell manner. Ultimately, we go through the Random Forest scenario detailed in the previous post.
Here, we upload to Azure Blob storage all the data necessary to build a random forest model and then to test it; more specifically, both training and test data will be uploaded. Note that the storage account details (e.g. container name and storage context) must already be known.
Since Mahout is not installed on any HDInsight cluster by default (and hence not supported by Microsoft), the Mahout jar files must also be uploaded to blob storage.
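A minimal sketch of the upload, using the classic Azure PowerShell cmdlets of the time; the storage account, container, and file names below are placeholders, not the ones from the original post:

```powershell
# Placeholders: substitute your own storage account, container and local paths
$storageAccount = "mystorageaccount"
$containerName  = "mycontainer"
$key     = Get-AzureStorageKey $storageAccount | ForEach-Object { $_.Primary }
$context = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $key

# Training and test data go to user/hdp/
Set-AzureStorageBlobContent -File "C:\Data\KDDTrain+.arff" -Container $containerName -Blob "user/hdp/KDDTrain+.arff" -Context $context
Set-AzureStorageBlobContent -File "C:\Data\KDDTest+.arff"  -Container $containerName -Blob "user/hdp/KDDTest+.arff"  -Context $context

# The Mahout job jar, since Mahout is not preinstalled on the cluster
Set-AzureStorageBlobContent -File "C:\mahout\mahout-examples-0.9-job.jar" -Container $containerName -Blob "mahout/mahout-examples-0.9-job.jar" -Context $context
```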
We create a simple HDInsight cluster, just like in the Azure PowerShell Series: Simple HDInsight. Alternatively, you could create one with additional functionality; see Azure PowerShell Series: Custom Create HDInsight.
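A sketch of such a simple cluster creation, with the classic cmdlets; cluster name, location, size and storage details are placeholders:

```powershell
# All names below are illustrative placeholders
$storageAccount = "mystorageaccount"
$key = Get-AzureStorageKey $storageAccount | ForEach-Object { $_.Primary }
New-AzureHDInsightCluster -Name "mymahoutcluster" `
    -Location "North Europe" `
    -DefaultStorageAccountName "$storageAccount.blob.core.windows.net" `
    -DefaultStorageAccountKey $key `
    -DefaultStorageContainerName "mycontainer" `
    -ClusterSizeInNodes 4 `
    -Credential (Get-Credential)   # prompts for the cluster admin credentials
```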
In the Azure Explorer, you observe some libraries being uploaded, such as mapred, hive, etc.
Just like in the previous post Step-by-Step: Mahout with HDInsight Interactive Style, both the training and test data need to be located in the directory user/hdp/.
The typical command for invoking Mahout from the Hadoop Command Line via RDP connection looks as follows:
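For example, the Describe step of the random forest workflow has this shape; the jar path, file names and attribute descriptor below follow the usual Mahout examples and are illustrative:

```sh
hadoop jar C:\apps\dist\mahout-0.9\mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.tool.Describe -p /user/hdp/KDDTrain+.arff -f /user/hdp/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
```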
Thus, it is an ordinary command that runs the program contained in the specified JAR file: org.apache.mahout.classifier.df.tool.Describe is the name of the class being invoked, followed by mandatory and optional arguments. Translated into PowerShell:
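In general terms, such a Hadoop jar invocation maps onto a MapReduce job definition; a sketch of the generic pattern, where the angle-bracket names are placeholders:

```powershell
New-AzureHDInsightMapReduceJobDefinition -JarFile <jarFile> -ClassName <className> -Arguments <arg1>, <arg2>, ...
```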
In the case above, this translates into the following PowerShell command:
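A sketch of that command, assuming the Mahout jar was uploaded to a mahout/ folder in the default container and reusing the illustrative file names from above:

```powershell
# File names and jar location are placeholders
$mahoutJob = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "wasb:///mahout/mahout-examples-0.9-job.jar" `
    -ClassName "org.apache.mahout.classifier.df.tool.Describe" `
    -Arguments "-p", "/user/hdp/KDDTrain+.arff", "-f", "/user/hdp/KDDTrain+.info", "-d", "N 3 C 2 N C 4 N C 8 N 2 C 19 N L"
```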
or a little more elaborate:
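For instance, with the pieces factored into variables for reuse (file names remain illustrative):

```powershell
# Same job definition, split into variables
$jarFile   = "wasb:///mahout/mahout-examples-0.9-job.jar"
$className = "org.apache.mahout.classifier.df.tool.Describe"
$jobArgs   = "-p", "/user/hdp/KDDTrain+.arff", "-f", "/user/hdp/KDDTrain+.info", "-d", "N 3 C 2 N C 4 N C 8 N 2 C 19 N L"
$mahoutJob = New-AzureHDInsightMapReduceJobDefinition -JarFile $jarFile -ClassName $className -Arguments $jobArgs
```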
Note that the PowerShell cmdlets so far have only defined the job, not yet triggered it. The Hadoop job is started by the following command:
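A sketch, assuming the job definition is held in a variable such as $mahoutJob and the cluster is called mymahoutcluster (both placeholders):

```powershell
$mahoutJobRun = Start-AzureHDInsightJob -Cluster "mymahoutcluster" -JobDefinition $mahoutJob
```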
To automatically wait for the HDInsight job to complete, you can insert the following:
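A sketch, assuming the started job was captured in a variable such as $mahoutJobRun:

```powershell
Wait-AzureHDInsightJob -Job $mahoutJobRun -WaitTimeoutInSeconds 3600
```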
This gives the HDInsight job an hour (i.e. 3600 seconds) to complete. You can print any error output as follows:
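For instance (cluster name is a placeholder):

```powershell
# Fetch the job's standard error stream, e.g. to diagnose failures
Get-AzureHDInsightJobOutput -Cluster "mymahoutcluster" -JobId $mahoutJobRun.JobId -StandardError
```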
In the previous section, we elaborated on how to construct a Mahout Job as a PowerShell command. Here, we go through an example using the Random Forest, just like in the previous post Step-by-Step: Mahout with HDInsight Interactive Style – Scenario Random Forest.
As a reminder, the command we used to build a forest in Interactive Style is the following:
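It follows the usual shape of Mahout's BuildForest examples; the jar path, file names and parameter values below are illustrative, not copied from the previous post:

```sh
hadoop jar C:\apps\dist\mahout-0.9\mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -d /user/hdp/KDDTrain+.arff -ds /user/hdp/KDDTrain+.info -sl 5 -p -t 100 -o /user/hdp/nsl-forest
```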
Thus, the “translated” PowerShell command is
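A sketch of that translation, with illustrative file names and a placeholder cluster name:

```powershell
# BuildForest as an HDInsight MapReduce job: -sl selects attributes per node,
# -p requests the partial implementation, -t sets the number of trees
$rfJob = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "wasb:///mahout/mahout-examples-0.9-job.jar" `
    -ClassName "org.apache.mahout.classifier.df.mapreduce.BuildForest" `
    -Arguments "-d", "/user/hdp/KDDTrain+.arff", "-ds", "/user/hdp/KDDTrain+.info", "-sl", "5", "-p", "-t", "100", "-o", "/user/hdp/nsl-forest"
$rfJobRun = Start-AzureHDInsightJob -Cluster "mymahoutcluster" -JobDefinition $rfJob
Wait-AzureHDInsightJob -Job $rfJobRun -WaitTimeoutInSeconds 3600
```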
The output in PowerShell should look like this:
The “converted” PowerShell command of the classifying command proposed in Interactive Style is as follows:
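A sketch of the classification step, again with illustrative names; TestForest applies the model built above to the test data:

```powershell
# -i test data, -ds dataset descriptor, -m model path, -a analyze results, -mr run as MapReduce
$testJob = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "wasb:///mahout/mahout-examples-0.9-job.jar" `
    -ClassName "org.apache.mahout.classifier.df.mapreduce.TestForest" `
    -Arguments "-i", "/user/hdp/KDDTest+.arff", "-ds", "/user/hdp/KDDTrain+.info", "-m", "/user/hdp/nsl-forest", "-a", "-mr", "-o", "/user/hdp/predictions"
$testJobRun = Start-AzureHDInsightJob -Cluster "mymahoutcluster" -JobDefinition $testJob
Wait-AzureHDInsightJob -Job $testJobRun -WaitTimeoutInSeconds 3600
```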
Note that the output shown above has the same format and very similar results to those of the previous post, when done in interactive style.
Cleaning up involves removing the HDInsight cluster, but also removing temporary directories. While the PowerShell command for deleting a single file is pretty straightforward, i.e.
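A sketch, where $context is the storage context already known from the upload step and the blob name is a placeholder:

```powershell
Remove-AzureStorageBlob -Container "mycontainer" -Blob "user/hdp/KDDTrain+.info" -Context $context
```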
deleting a whole folder structure requires a loop in which every file with the specified path prefix is removed.
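One way to sketch such a loop, with placeholder container and prefix names:

```powershell
# Remove every blob whose name starts with the given prefix, i.e. the whole "folder"
Get-AzureStorageBlob -Container "mycontainer" -Prefix "user/hdp/nsl-forest" -Context $context |
    ForEach-Object { Remove-AzureStorageBlob -Container "mycontainer" -Blob $_.Name -Context $context }
```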
As we saw in the first part of our Mahout for Dummies series, there are many algorithms included in the Mahout library other than the random forest.
In the blog by the Big Data Support team at Microsoft, there is a good post demonstrating the use of the RecommenderJob class on an HDInsight cluster using PowerShell. The source code of the RecommenderJob class can be found on GitHub.
In this scenario, we are given two data files: one containing the user IDs and the other the degrees of preference of users towards given items:
In ItemID.txt, the first column indicates the user ID, the second the item ID, and the final one the degree of preference. Thus, ItemID.txt can be expressed in a more intuitive format as a matrix, where the rows indicate the users and the columns the item IDs. The values inside the matrix display the degrees of preference, as given in the third column of ItemID.txt.
Here is the assembled PowerShell script for running RecommenderJob, as in the Big Data Support blog.
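A sketch of its core, under the same assumptions as before (placeholder cluster, jar and file names; RecommenderJob lives in the mahout-core job jar):

```powershell
# RecommenderJob reads the <user,item,preference> triples and writes
# recommendations for the users listed in the users file
$recJob = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "wasb:///mahout/mahout-core-0.9-job.jar" `
    -ClassName "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob" `
    -Arguments "-s", "SIMILARITY_COOCCURRENCE", "--input", "/user/hdp/ItemID.txt", "--output", "/user/hdp/recommendations", "--usersFile", "/user/hdp/users.txt"
$recJobRun = Start-AzureHDInsightJob -Cluster "mymahoutcluster" -JobDefinition $recJob
Wait-AzureHDInsightJob -Job $recJobRun -WaitTimeoutInSeconds 3600
```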
The output files can be seen in the Azure Blob Storage Explorer as usual:
The output file itself gives information on which items could be of interest to which users, and how strongly, i.e. <user_id> [<item_id>:<degree-of-preference/interest>, …].
In such a way, we can insert the recommendations with their scores in the matrix from above:
In this blog post, we went through two scenarios applying Mahout on HDInsight in PowerShell style: random forest and recommender. These scenarios are nicely wrapped around the usual suspects: uploading data, creating the HDInsight cluster and cleaning up afterwards.
Many thanks go to Alexei Khalyako and Bill Carroll for their support on Mahouting on HDInsight!