MEA Center of Expertise

We are a 120+ technology enthusiasts helping Microsoft customers around Middle-East & Africa region. We bridge Microsoft tools & technologies to their businesses.

Connecting R to HDinsight through HIVE

Connecting R to HDinsight through HIVE

  • Comments 2
  • Likes

With the powerful big data platform that Microsoft provides through Azure HDinsight, and with the wide range of data scientists and statisticians utilizes R, this Post is who to bring the best of both and connect R to HDinsight through Hive connector. so that you can analyze hive tables in R where they resides on the Azure HDinsight cluster so let’s see the steps

    1. Download Microsoft Hive ODBC driver from here
    2. Install the Microsoft Hive ODBC driver use either the x86 or the x64 ( take care of the version to use the same with R)
    3. Configure your DSN in the ODBC Data Sources
    1. go to Control Panel > Administrative Tools > ODBC Data Sources (64-bit)
    2. open System DSN click add
    3. choose Microsoft Hive ODBC Driver and click Finishimage
    4. enter the fields
    1. Data Source Name: the data source name we’ll use in R so name it anything i’ll call it now HiveOnAzure
    2. Description: write your desription
    3. Host: get it from your Azure Manage site [yourclustername]
    4. port: leave it 443
    5. Database: leave it “default”
    6. Hive server type: use Hive Server 2
    7. Mechanism: Windows Azure HDinsight (it automatically configures the port and Database above)
    8. HTTP Path: leave it blank
    9. username: your username that you entered while creating the cluster
    10. password: your password that you entered while creating the cluster
    11. then test the connectivity you should receive a connection successful established message
    12.  image
  1. Now after establishing the ODBC driver connectivity to Azure we’ll shift to R
  • Open RStudio (make sure to use the same x64 or x86 version as you’ve configured in the ODBC drivers)
    1. install the RODBC package

 > install.packages("RODBC")
 > library(RODBC)
      1. create the ODBC connection in R
 > myconn <- odbcConnect("HiveOnAzure",uid="[YOUR_USERNAME_HERE]",pwd="[YOU_PASSWORD_HERE]" )
      1. run your HiveQL Query and return the data into a data frame
 > alldata <- sqlFetch(myconn,"Select * from hivesampletable")
      1. inspect the retrieved data
     > head(alldata,10)

    Now you’ve successfully connected your R to the Hive on HDisnight on Azure to pass your HiveQL Queries and start doing the analysis you want to create. 

    • Hey thanks for the instructions!! However when i try to run hiveql query "alldata <- sqlFetch(myconn,"Select * from hivesampletable")", i get an error
      Error in odbcTableExists(channel, sqtable) :
      ‘Select * from hivesampletable’: table not found on channel

      would you know if im doing something wrong or is it with the settings or something? Please help!!

    • hey i did not figure out the issue however i figured out a work around. instead of sqlFetch keyword we can use sqlquery which will the results. i guess this will still work!! but the question now is why sqlquery works but not the sqlfetch?

    Your comment has been posted.   Close
    Thank you, your comment requires moderation so it may take a while to appear.   Close
    Leave a Comment