In the blog series Mahout for Dummies, we explore options for using Mahout on HDInsight.

Contents

1 What is Mahout?
2 Step-by-Step: Mahout with HDInsight Interactive Style
3 Step-by-Step: Mahout with HDInsight PowerShell Style

 

Step-by-Step: Mahout with HDInsight PowerShell Style

In this episode of the series Mahout for Dummies, we deal with Mahout on HDInsight in a PowerShell manner. Ultimately, we go through the Random Forest scenario detailed in the previous post.

  1. Upload Data
  2. Create HDInsight Cluster
  3. Mahout: general PowerShell command
  4. Scenario: Random Forest
    1. Build forest
    2. Classify test data
  5. Clean up
  6. Scenario: Recommender Job
  7. Wrapping up…

1. Upload Data

Here, we upload to Azure Blob Storage all the data necessary to build a random forest model and then test it; more specifically, the training and the test data. Note that information on the storage account (e.g. container name and storage context) must already be known.
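If these are not set in your current PowerShell session yet, they can be derived just as in the full script in section 6 (a minimal sketch with placeholder names to fill in):

# Storage account details (placeholders)
$storageAccount = "<StorageAccountName>"
$containerName = "<StorageContainerName>"

# Retrieve the primary storage key and build a storage context
$storageKey = Get-AzureStorageKey $storageAccount | %{ $_.Primary }
$storageContext = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $storageKey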

## 1. File Paths
# Data stored locally
$localTrain = "C:\<TrainingDataPath>\KDDTrain+.arff"
$localTest = "C:\<TestDataPath>\KDDTest+.arff"

# Data to be stored in Azure Blob Storage
$blobTrain = "testdata/KDDTrain+.arff"
$blobTest = "testdata/KDDTest+.arff"

## 2. Upload files from local to Azure Blob Storage
Set-AzureStorageBlobContent -File $localTrain -Container $containerName `
    -Blob $blobTrain -Context $storageContext
Set-AzureStorageBlobContent -File $localTest -Container $containerName `
    -Blob $blobTest -Context $storageContext


Since Mahout is not installed on HDInsight clusters by default (and hence not supported by Microsoft), the Mahout jar files must also be uploaded to the blob storage.

## 1. File Paths
# Mahout jar files stored locally
$localMahoutJar = "C:\<PathToMahoutDistribution>\mahout-core-0.9-job.jar"
$localMahoutEx = "C:\<PathToMahoutDistribution>\mahout-examples-0.9-job.jar"

# Mahout jar files to be stored in Azure Blob Storage
$blobMahoutJar = "mahout/mahout-core-0.9-job.jar"
$blobMahoutEx = "mahout/mahout-examples-0.9-job.jar"

## 2. Upload files from local to Azure Blob Storage
Set-AzureStorageBlobContent -File $localMahoutJar -Container $containerName `
    -Blob $blobMahoutJar -Context $storageContext
Set-AzureStorageBlobContent -File $localMahoutEx -Container $containerName `
    -Blob $blobMahoutEx -Context $storageContext

 


2. Create HDInsight Cluster

We create a simple HDInsight cluster, just like in the Azure PowerShell Series: Simple HDInsight. Alternatively, you could create one with additional functionality; see Azure PowerShell Series: Custom Create HDInsight.

# Input
$clusterName = "<HDInsightClusterName>"
$clusterCreds = Get-Credential
$numNodes = 4

# Simple create
New-AzureHDInsightCluster -Name $clusterName -Subscription $subID `
    -Location $location -DefaultStorageAccountName $storageAccount `
    -DefaultStorageAccountKey $storageKey `
    -DefaultStorageContainerName $containerName -Credential $clusterCreds `
    -ClusterSizeInNodes $numNodes -Version 2.1
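Once the cmdlet returns, you can verify the new cluster directly from PowerShell (a quick sanity check, assuming the same Azure module as above):

# Query the newly provisioned cluster
Get-AzureHDInsightCluster -Name $clusterName -Subscription $subID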

In the Azure Explorer, you observe some libraries being uploaded, such as mapred, hive, etc.

[Screenshot: Azure Explorer showing the uploaded libraries]

Just like in the previous post Step-by-Step: Mahout with HDInsight Interactive Style, both the training and test data need to be located in the directory user/hdp/.

$blobHDPtrain = "user/hdp/testdata/KDDTrain+.arff"
$blobHDPtest = "user/hdp/testdata/KDDTest+.arff"
Set-AzureStorageBlobContent -File $localTrain -Container $containerName `
    -Blob $blobHDPtrain -Context $storageContext
Set-AzureStorageBlobContent -File $localTest -Container $containerName `
    -Blob $blobHDPtest -Context $storageContext

 

3. Mahout: General PowerShell Command

The typical command for invoking Mahout from the Hadoop Command Line via an RDP connection looks as follows:

hadoop jar C:\apps\dist\mahout-0.9\mahout-core-0.9-job.jar
org.apache.mahout.classifier.df.tools.Describe
-p wasb:///user/hdp/testdata/KDDTrain+.arff ...

Thus, it is an ordinary command running the program contained in the specified JAR file. org.apache.mahout.classifier.df.tools.Describe is the class name being invoked, followed by mandatory and optional arguments. Translated into PowerShell:

$mahoutJob = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "<PathToMahoutJAR>/mahout-core-0.9-job.jar" `
    -ClassName "<ClassName>" `
    -Arguments "-p wasb:///user/hdp/testdata/KDDTrain+.arff …"

In the case above, this translates into the following PowerShell command:

$mahoutJob = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "wasb://$containerName@$storageAccount.blob.core.windows.net/$blobMahoutJar" `
    -ClassName "org.apache.mahout.classifier.df.tools.Describe" `
    -Arguments "-p wasb:///user/hdp/$blobTrain -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L"

or, a little more elaborately:

$mahoutJob = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "wasb://$containerName@$storageAccount.blob.core.windows.net/$blobMahoutJar" `
    -ClassName "org.apache.mahout.classifier.df.tools.Describe"

# path to training data
$mahoutJob.Arguments.Add("-p")
$mahoutJob.Arguments.Add("wasb:///user/hdp/$blobTrain")

# path to generated descriptor file
$mahoutJob.Arguments.Add("-f")
$mahoutJob.Arguments.Add("wasb:///user/hdp/testdata/KDDTrain+.info")

# attributes of the given training data
# (N = numerical, C = categorical, L = label;
# a number repeats the following attribute type)
$mahoutJob.Arguments.Add("-d")
$mahoutJob.Arguments.Add("N")
$mahoutJob.Arguments.Add("3")
$mahoutJob.Arguments.Add("C")
$mahoutJob.Arguments.Add("2")
$mahoutJob.Arguments.Add("N")
$mahoutJob.Arguments.Add("C")
$mahoutJob.Arguments.Add("4")
$mahoutJob.Arguments.Add("N")
$mahoutJob.Arguments.Add("C")
$mahoutJob.Arguments.Add("8")
$mahoutJob.Arguments.Add("N")
$mahoutJob.Arguments.Add("2")
$mahoutJob.Arguments.Add("C")
$mahoutJob.Arguments.Add("19")
$mahoutJob.Arguments.Add("N")
$mahoutJob.Arguments.Add("L")

Note that the PowerShell cmdlets have so far only defined the job but not yet triggered it. The Hadoop job is started by the following command:

$mahoutJobProcessing = Start-AzureHDInsightJob -Cluster $clusterName `
    -JobDefinition $mahoutJob -Credential $clusterCreds

To automatically wait for the HDInsight job to complete, you can insert the following:

Wait-AzureHDInsightJob -Job $mahoutJobProcessing -WaitTimeoutInSeconds 3600

This gives the HDInsight job an hour (i.e. 3600 seconds) to complete. You can print out any error output as follows:

Get-AzureHDInsightJobOutput -Cluster $clusterName -Subscription $subID `
    -JobId $mahoutJobProcessing.JobId -StandardError
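Putting these pieces together, a common pattern is to wait for the job and only fetch the error stream if it did not complete. This is a sketch; it assumes the job object returned by Wait-AzureHDInsightJob exposes its final state as a string such as "Completed":

# Wait for the job and check its final state (assumed to be "Completed" on success)
$finishedJob = Wait-AzureHDInsightJob -Job $mahoutJobProcessing -WaitTimeoutInSeconds 3600
if ($finishedJob.State -ne "Completed") {
    Get-AzureHDInsightJobOutput -Cluster $clusterName -Subscription $subID `
        -JobId $finishedJob.JobId -StandardError
}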

 

4. Scenario: Random Forest

In the previous section, we elaborated on how to construct a Mahout Job as a PowerShell command. Here, we go through an example using the Random Forest, just like in the previous post Step-by-Step: Mahout with HDInsight Interactive Style – Scenario Random Forest.

4.1. Build Forest

As a reminder, the command we used to build a forest in Interactive Style is the following:

hadoop jar C:\apps\dist\mahout-0.9\mahout-examples-0.9-job.jar
org.apache.mahout.classifier.df.mapreduce.BuildForest
-Dmapred.max.split.size=1874231
-d wasb:///user/hdp/testdata/KDDTrain+.arff
-ds wasb:///user/hdp/testdata/KDDTrain+.info
-sl 5 -p -t 100 -o nsl-forest

Thus, the “translated” PowerShell command is

## build forest
$mahoutForest = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "wasb://$containerName@$storageAccount.blob.core.windows.net/$blobMahoutEx" `
    -ClassName "org.apache.mahout.classifier.df.mapreduce.BuildForest"

# maximum data size per node
$mahoutForest.Arguments.Add("-Dmapred.max.split.size=1874231")
# data path
$mahoutForest.Arguments.Add("-d")
$mahoutForest.Arguments.Add("wasb:///user/hdp/testdata/KDDTrain+.arff")
# dataset path
$mahoutForest.Arguments.Add("-ds")
$mahoutForest.Arguments.Add("wasb:///user/hdp/testdata/KDDTrain+.info")
# number of variables randomly selected at each node
$mahoutForest.Arguments.Add("-sl")
$mahoutForest.Arguments.Add("5")
# flag for partial implementation
$mahoutForest.Arguments.Add("-p")
# number of trees
$mahoutForest.Arguments.Add("-t")
$mahoutForest.Arguments.Add("100")
# output path for generated forest
$mahoutForest.Arguments.Add("-o")
$mahoutForest.Arguments.Add("nsl-forest")

# start job
$mahoutForestProcessing = Start-AzureHDInsightJob -Cluster $clusterName `
    -JobDefinition $mahoutForest

# wait for job
Wait-AzureHDInsightJob -Subscription $subID -Job $mahoutForestProcessing `
    -WaitTimeoutInSeconds 3600

# print out error if any
Get-AzureHDInsightJobOutput -Cluster $clusterName -Subscription $subID `
    -JobId $mahoutForestProcessing.JobId -StandardError

The output in PowerShell should look like this:

[Screenshot: BuildForest job output in PowerShell]

 

4.2. Classify Test Data

The classification command proposed in Interactive Style, “converted” to PowerShell, is as follows:

$mahoutClassify = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "wasb://$containerName@$storageAccount.blob.core.windows.net/$blobMahoutEx" `
    -ClassName "org.apache.mahout.classifier.df.mapreduce.TestForest"

$mahoutClassify.Arguments.Add("-i")
$mahoutClassify.Arguments.Add("wasb:///user/hdp/testdata/KDDTest+.arff")
$mahoutClassify.Arguments.Add("-ds")
$mahoutClassify.Arguments.Add("wasb:///user/hdp/testdata/KDDTrain+.info")
$mahoutClassify.Arguments.Add("-m")
$mahoutClassify.Arguments.Add("wasb:///user/hdp/nsl-forest")
$mahoutClassify.Arguments.Add("-a")
$mahoutClassify.Arguments.Add("-mr")
$mahoutClassify.Arguments.Add("-o")
$mahoutClassify.Arguments.Add("predictions")

$mahoutClassifyJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $mahoutClassify
Wait-AzureHDInsightJob -Job $mahoutClassifyJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $mahoutClassifyJob.JobId -StandardError

[Screenshots: TestForest classification output in PowerShell]

Note that the output shown above has the same format and very similar results as in the previous post, where the job was run interactively.
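If you would rather inspect the predictions locally than in a blob explorer, you can download them with the same storage cmdlets used for uploading. This is a minimal sketch; the prefix user/hdp/predictions and the local destination are assumptions to be adjusted to your own container layout:

# List and download all prediction files (hypothetical prefix and destination)
$predictionBlobs = Get-AzureStorageBlob -Container $containerName -Context $storageContext -Prefix "user/hdp/predictions"
foreach ($blob in $predictionBlobs){
    Get-AzureStorageBlobContent -Blob $blob.Name -Container $containerName `
        -Destination "C:\<LocalOutputPath>" -Context $storageContext
}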

5. Clean Up

Cleaning up involves removing the HDInsight cluster, but also removing temporary directories. While the PowerShell command for deleting a single file is pretty straightforward, i.e.

Remove-AzureStorageBlob -Container $containerName -Context $storageContext -Blob $file

deleting a folder structure comprises a loop in which every file with a specified path prefix is removed.

## a. Remove temp directory
$blobPrefix = "user/hdp/temp"
$tempFiles = Get-AzureStorageBlob -Container $containerName -Context $storageContext -Prefix $blobPrefix

Write-Host "Removing temp directory"
foreach ($item in $tempFiles){
    $tmpFile = $item.Name
    Write-Host "Deleting $tmpFile"
    Remove-AzureStorageBlob -Container $containerName -Context $storageContext -Blob $tmpFile
}

## b. Delete HDInsight cluster
Remove-AzureHDInsightCluster -Name $clusterName

6. Scenario: Recommender Job

As we saw in the first part of our Mahout for Dummies series, there are many algorithms included in the Mahout library other than the random forest.

In the blog by the Big Data Support team at Microsoft, there is a good post demonstrating the use of the RecommenderJob class on an HDInsight Cluster using PowerShell that you can read here. The source code of the RecommenderJob class can be looked up here on GitHub.

In this scenario, we are given two data files: one containing user IDs and the other comprising the degrees of preference of users towards given items:

[Screenshot: the two input files, users.txt and ItemID.txt]

In ItemID.txt, the first column indicates the user ID, the second the item ID and the final one denotes the degree of preference. Thus, ItemID.txt can be expressed in the more intuitive format of a matrix, where the rows indicate the users and the columns denote the item IDs. The values inside the matrix display the degrees of preference, as given in the third column of ItemID.txt.

[Screenshot: ItemID.txt represented as a user-item preference matrix]
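To make the file format concrete, a few hypothetical lines of ItemID.txt might look as follows (user ID, item ID and degree of preference per line, comma-separated, which is the input format RecommenderJob expects; all values invented for illustration):

1,101,5.0
1,102,3.0
2,101,2.0
2,103,4.5
3,102,4.0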

Here is the complete PowerShell script for running RecommenderJob, as in the Big Data Support blog.

##########################################################################################
# Mahout with HDInsight: RecommenderJob (Collaborative Filtering)
#
# Check out Microsoft's Big Data Support blog
# http://blogs.msdn.com/b/bigdatasupport/archive/2014/02/19/mahout-with-hdinsight.aspx
#
# Source code in GitHub:
# https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java


##########################################################################################
# 0. Azure Account Details

Add-AzureAccount
$subName = "<AzureSubscriptionName>"
Select-AzureSubscription $subName

# Azure account details automatically set
$subID = Get-AzureSubscription -Current | %{ $_.SubscriptionId }


##########################################################################################
## 1. Input information

## a. storage account
$storageAccount = "<StorageAccountName>"
$containerName = "<StorageContainerName>"
$location = "<DatacenterLocation>" # e.g. North Europe

# if storage account not created yet
#New-AzureStorageAccount -StorageAccountName $storageAccount -Location $location
#Set-AzureStorageAccount -StorageAccountName $storageAccount -GeoReplicationEnabled $false

# Variables automatically set for you
$storageKey = Get-AzureStorageKey $storageAccount | %{ $_.Primary }
$storageContext = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $storageKey
$fullStorage = "${storageAccount}.blob.core.windows.net"

# if container not created yet
New-AzureStorageContainer -Name $containerName -Context $storageContext


## b. HDInsight Cluster
$clusterName = "<HDInsightClusterName>"
$clusterCreds = Get-Credential -Message "New admin account to be created for your HDInsight cluster"
# best: user name = admin
$numNodes = 4


## c. Data
# Data stored locally
$localFolder = "C:\<localFilesPath>"
$localItems = "$localFolder\ItemID.txt"
$localUsers = "$localFolder\users.txt"
$localMahoutJar = "C:\<PathToMahoutDistribution>\mahout-core-0.9-job.jar"

# Data to be stored in Azure Blob Storage
$blobMahoutJar = "mahout/mahout-core-0.9-job.jar"
$blobFolder = "testdata"
$blobItems = "$blobFolder/ItemID.txt"
$blobUsers = "$blobFolder/users.txt"


##########################################################################################
# 2. Upload files from local to Azure Blob Storage

# Mahout jar
Write-Host "Copying Mahout JAR into Blob Storage" -BackgroundColor Green
Set-AzureStorageBlobContent -File $localMahoutJar -Container $containerName -Blob $blobMahoutJar -Context $storageContext

# data for RecommenderJob
Write-Host "Copying necessary data into Blob Storage" -BackgroundColor Green
Set-AzureStorageBlobContent -File $localItems -Container $containerName -Blob $blobItems -Context $storageContext
Set-AzureStorageBlobContent -File $localUsers -Container $containerName -Blob $blobUsers -Context $storageContext


##########################################################################################
# 3. Create HDInsight Cluster

# Simple create
New-AzureHDInsightCluster -Name $clusterName -Subscription $subID -Location $location `
    -DefaultStorageAccountName $storageAccount -DefaultStorageAccountKey $storageKey `
    -DefaultStorageContainerName $containerName -Credential $clusterCreds -ClusterSizeInNodes $numNodes `
    -Version 2.1


##########################################################################################
# 4. Mahout

# Mahout Job defining the appropriate JAR file and the class name
$mahoutJob = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "wasb:///$blobMahoutJar" `
    -ClassName "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob"

# Similarity class name.
# Alternative similarity classes: loglikelihood, tanimoto coefficient,
# city block, cosine, pearson correlation, euclidean distance
$mahoutJob.Arguments.Add("-s")
$mahoutJob.Arguments.Add("SIMILARITY_COOCCURRENCE")

# Input path to file with preference data
$mahoutJob.Arguments.Add("-i")
$mahoutJob.Arguments.Add("wasb:///$blobItems")

# path to file containing user IDs for which recommendations will be computed
$mahoutJob.Arguments.Add("--usersFile")
$mahoutJob.Arguments.Add("wasb:///$blobUsers")

# path for recommender output
$mahoutJob.Arguments.Add("--output")
$mahoutJob.Arguments.Add("wasb:///$blobFolder/output")

# Starting job
$mahoutJobProcessing = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $mahoutJob -Debug

# Waiting for job completion
Wait-AzureHDInsightJob -Job $mahoutJobProcessing -WaitTimeoutInSeconds 3600 -Debug

# Getting error if any
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $mahoutJobProcessing.JobId -StandardError


##########################################################################################
# 5. Clean up, i.e. remove temp directory

## a. Remove temp directory
$blobPrefix = "user/hdp/temp"
$tempFiles = Get-AzureStorageBlob -Container $containerName -Context $storageContext -Prefix $blobPrefix

Write-Host "Removing temp directory"
foreach ($item in $tempFiles){
    $tmpFile = $item.Name
    Write-Host "Deleting $tmpFile"
    Remove-AzureStorageBlob -Container $containerName -Context $storageContext -Blob $tmpFile
}

## b. Delete HDInsight cluster
Remove-AzureHDInsightCluster -Name $clusterName

 

The output files can be seen in the Azure Blob Storage Explorer as usual:

[Screenshot: output files in the Azure Blob Storage Explorer]

The output file itself gives information on which items are likely to be of interest to which users, i.e. in the format <user_id> [<item_id>:<degree-of-preference>,…].
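A single (hypothetical) line might thus read

3 [102:4.5,104:3.2]

meaning that user 3 is predicted to prefer item 102 with a score of 4.5 and item 104 with a score of 3.2 (the IDs and scores here are invented for illustration).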

[Screenshot: RecommenderJob output file]

In this way, we can insert the recommendations with their scores into the matrix from above:

[Screenshot: user-item matrix filled in with the recommendation scores]

 

7. Wrapping Up…

In this blog post, we went through two scenarios applying Mahout on HDInsight in PowerShell style: random forest and recommender. These scenarios are nicely wrapped around the usual suspects: uploading data, creating the HDInsight cluster and cleaning up afterwards.

Many thanks go to Alexei Khalyako and Bill Carroll for their support on Mahouting on HDInsight!