Insufficient data from Andrew Fryer

The place where I page to when my brain is full up of stuff about the Microsoft platform

July, 2012

  • Microsoft, and Hadoop for Big Data

    Despite common misconceptions, Microsoft now has extensive interoperability with open source technologies. For example, you can run a PHP application on Azure, get support from us to run Red Hat, SUSE or CentOS on Hyper-V, and manage your applications from System Center. So extending this approach to the world of big data with Hadoop is a logical step, given the pervasiveness of Hadoop in this space.

    Hopefully you’re reading this because you have some idea of what big data is. If not, it is basically data that is an order of magnitude bigger than you can store, that changes very quickly, and that is typically made up of different kinds of data that you can’t handle with the technologies you already have: for example web logs, tweets, photos and sounds. Traditionally we have discarded this information as having little or no value compared with the investment needed to process it, especially as it is often not clear what value it contains. For this reason big data has been filed in the too-difficult drawer, unless you are a megacorp or a government.

    However, after some research by Google, an approach to attacking this problem called map reduce was born. Map is where the structure for the data is declared, for example pulling out the actual tweet from a Twitter message, along with the hashtags and other useful fields such as whether this is a retweet. The reduce phase then pulls out meaning from these structures, such as digrams (the key two-word phrases), sentiment and so on.
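    To make the two phases a little more concrete, here is a minimal sketch in plain Python. It follows the description above rather than the Hadoop API: the tweet layout (a dict with a "text" field), the function names and the sample tweets are all made up for illustration, and a real job would emit key/value pairs for the framework to group.

```python
from collections import Counter
from itertools import chain


def map_tweet(raw):
    """Map step: declare the structure of one raw tweet (hypothetical layout)."""
    text = raw["text"]
    words = text.split()
    return {
        "text": text,
        "hashtags": [w.lower() for w in words if w.startswith("#")],
        "is_retweet": text.startswith("RT "),
    }


def digrams(text):
    """Return the two-word phrases in a tweet, ignoring hashtags."""
    words = [w.lower() for w in text.split() if not w.startswith("#")]
    return list(zip(words, words[1:]))


def reduce_digrams(mapped):
    """Reduce step: count digrams across all original (non-retweet) tweets."""
    counts = Counter()
    for record in mapped:
        if not record["is_retweet"]:
            counts.update(digrams(record["text"]))
    return counts


# Made-up sample data, for illustration only.
tweets = [
    {"text": "big data is fun #hadoop"},
    {"text": "RT big data is fun #hadoop"},
    {"text": "big data on Azure #Hadoop #azure"},
]

mapped = [map_tweet(t) for t in tweets]
digram_counts = reduce_digrams(mapped)
hashtag_counts = Counter(chain.from_iterable(r["hashtags"] for r in mapped))
```

    The point of the split is that map_tweet only ever looks at one tweet at a time, so it can be run anywhere, while the reduce step only needs the small structured records that the map step produced.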

    Hadoop uses map reduce, but the key to its power is that it applies these map reduce concepts on large clusters of servers by getting each node to run the functions locally against data held in its own file system, HDFS, thus taking the code to the data to minimise IO and network traffic. There are lots of tools in the Hadoop armoury built on top of this, notably Hive, which presents HDFS as a data warehouse that you can run SQL-like queries (HiveQL) against, and the Pig (Pig Latin) language, where you load data and work with your functions.
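    The idea of taking the code to the data can be sketched in a few lines of Python. This is only a simulation on one machine, not Hadoop or HDFS: the node names, the log lines and the run_locally and merge helpers are invented for the example. Each "node" counts the URLs in its own block, and only the small partial counts are combined at the end.

```python
from collections import Counter
from functools import reduce

# Hypothetical cluster: each "node" holds its own block of web log lines.
blocks = {
    "node1": ["GET /home", "GET /about", "GET /home"],
    "node2": ["GET /home", "POST /login"],
    "node3": ["POST /login", "GET /about"],
}


def run_locally(lines):
    """Runs where the data lives: pull the URL out of each line and count it."""
    return Counter(line.split()[1] for line in lines)


def merge(left, right):
    """Combine two partial results; only these small counters cross the network."""
    return left + right


# Each node processes its own block, then the partial counts are merged.
partials = [run_locally(lines) for lines in blocks.values()]
totals = reduce(merge, partials, Counter())
```

    In a real cluster the raw log lines never leave the node that stores them, which is why this approach copes with volumes of data that would swamp the network if you tried to ship them all to one server.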

    What Microsoft are developing, in conjunction with Hortonworks, a leading Hadoop developer, is integration with Hadoop to make it more enterprise friendly:

    • an ODBC driver to connect to Hive
    • an add-in for Excel to query Hive
    • the ability to run Hadoop as a service on Windows Server
    • the ability to run Hadoop on Azure, so you can create clusters as and when you need them and use Azure’s massive connectivity to the internet to pull data in there rather than choke the bandwidth to your own data centre
    • F# programming for Hadoop. F# is a functional programming language that data scientists understand in the same way as I learned Fortran in my distant past as an engineering student.

    At the time of writing these tools are still in development and admission to Hadoop on Azure is “by invitation” only. However, I wanted to write this up now after a talk I gave a couple of weeks ago at the Cloud World Forum.


    Looking at the deck in isolation doesn’t really help, as I don’t tend to use PowerPoint to repeat what I am saying!

    23 March 2013: This post has been superseded by my post on HDInsight, as that is the new name of the tools, which have now been released to public beta.

  • Dell and Hyper-V for the smaller business

    Over the last month I have been on tour with Dell showing what Hyper-V can do for small/medium businesses,  and later this week I’ll be with them in Falmouth.  The argument they put forward for Hyper-V is really simple:

    1. You want to use a new server for virtualisation, and now that modern servers can run ten or more virtual machines you’ll probably want to buy an OEM license of Windows DataCenter edition, as you are then licensed to run as many virtual machines on that server as it can take, each of which can run the DataCenter edition.
    2. In the past you would then have bought some virtualisation software to run those virtual machines at additional cost. However, that DataCenter license also covers you to run Hyper-V as that virtualisation software at no additional cost.

    You might argue that Hyper-V isn’t as good as the other stuff you can buy, and that’s OK with me as long as you can prove that, for the scenario you have in mind, you are getting what you are paying for, be that performance, security, manageability etc.

    As far as performance goes, I think that getting an application like SQL Server or Exchange to run in a virtual machine at about 90% of the speed of the physical server the virtual machine runs on is an acceptable loss, and is competitive with other hypervisors. You’ll want to test this yourself, but remember to compare like with like: for example, your compute, network and storage settings should be the same.
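    If you do run your own tests, the arithmetic is simple enough to sketch. The helper below is hypothetical, and the timings are made up purely to show the calculation: it expresses the virtual machine’s speed as a fraction of the physical server’s for the same workload, and checks it against the 90% figure above.

```python
def relative_speed(physical_seconds, virtual_seconds):
    """Speed of the VM relative to the physical server (1.0 means identical).

    Both timings must be for the same workload with the same compute,
    network and storage settings, otherwise the comparison means little.
    """
    if physical_seconds <= 0 or virtual_seconds <= 0:
        raise ValueError("timings must be positive")
    return physical_seconds / virtual_seconds


# Made-up timings for illustration: the same job takes 90s on the
# physical server and 100s inside the virtual machine.
speed = relative_speed(physical_seconds=90.0, virtual_seconds=100.0)
acceptable = speed >= 0.9  # the "about 90%" threshold discussed above
```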

    You might wonder if Hyper-V is secure. Anecdotal evidence suggests that it is as secure as anything else, because if it wasn’t you’d be able to reply to this post with the evidence from a competitor’s website or blog. For best practice on securing Hyper-V please refer to this earlier post of mine.

    However, in the manageability space Hyper-V by itself runs out of road once you end up with more than X virtual machines, where X will depend on your infrastructure, the size of the IT team etc. If you have more than a hundred virtual machines, you’ll need to be very well organised or use additional software. Microsoft have a suite of tools called System Center (currently System Center 2012), which also has a DataCenter edition, licensed per physical server, allowing you to manage however many virtual machines you have on there. More importantly, it’s designed to manage your applications: by this I mean deploying them, monitoring them etc., rather than just looking at the health of the virtual machine they are running in.

    I don’t see this lack of manageability as a problem for smaller businesses as many of them don’t have that many virtual machines and your organisation might well be OK just using Hyper-V and the tools that Dell provide with their servers and EqualLogic SANs.

    Many things change with Windows Server 2012, and while the big headlines have been about massive improvements in scale for the next version of Hyper-V, that’s not really relevant for the smaller business. What matters more is things like multi-server management, with specific tools in the new Server Manager to monitor and even update servers in a group, and PowerShell 3, which has extensive support for managing all aspects of your servers from one place.

    That’s not to say there’s nothing new in Hyper-V for the smaller business, and my top 3 would be:

    • Running replica virtual machines at different sites over slow networks and being able to fail over to the replica as required.
    • The ability to live migrate virtual machines without shared storage (aka “Shared Nothing”).
    • NIC teaming is now there in the operating system, allowing you to manage all your network adapters, and the capacity they provide, from inside Windows as needed.

    So before you pay out for tools for virtualisation or management, see what you get included in Windows Server, and whether the return you get from additional software, be that a different hypervisor or management tools, is justified in your business with your IT team.

    Finally, the release of Windows Server 2012 is at the end of August, so if you are planning a server procurement now you may well find it ships with it. To be ready for that, rather than just downgrading to an earlier version, have a look at the Windows Server 2012 content on the Microsoft Virtual Academy and/or download the Release Candidate.