Despite common misconceptions, Microsoft now has extensive interoperability with open source technologies. For example, you can run a PHP application on Azure, get support from us to run Red Hat, SUSE or CentOS on Hyper-V, and manage your applications from System Center. So extending this approach to the world of big data with Hadoop is a logical step, given the pervasiveness of Hadoop in this space.
Hopefully you’re reading this because you have some idea of what big data is. If not, it is basically data that is an order of magnitude bigger than you can store, that changes very quickly, and that is typically made up of different kinds of data you can’t handle with the technologies you already have: for example web logs, tweets, photos and sounds. Traditionally we have discarded this information as having little or no value compared with the investment needed to process it, especially as it is often not clear what value it contains. For this reason big data has been filed in the too-difficult drawer, unless you are a megacorp or a government.
However, after some research by Google, an approach to attacking this problem called MapReduce was born. Map is where the structure for the data is declared, for example pulling out the actual tweet from a Twitter message, along with the hashtags and other useful fields such as whether this is a retweet. The Reduce phase then pulls out meaning from these structures, such as digrams (the key two-word phrases), sentiment and so on.
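To make the two phases concrete, here is a minimal sketch in plain Python of a map step that declares a structure for each tweet and a reduce step that summarises those structures. It runs in a single process and does not use Hadoop at all; the record layout, the "RT " retweet check and the function names map_tweet and reduce_hashtags are all assumptions made up for illustration, not a real Twitter or Hadoop API.

```python
import re
from collections import Counter

# Hypothetical pattern for a hashtag: '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")

def map_tweet(raw):
    """Map step: pull a simple structure out of one raw tweet record."""
    text = raw["text"]
    return {
        "text": text,
        "hashtags": [tag.lower() for tag in HASHTAG.findall(text)],
        # Assumption: a retweet is marked by a leading "RT ".
        "is_retweet": text.startswith("RT "),
    }

def reduce_hashtags(mapped):
    """Reduce step: summarise the structured records as hashtag counts."""
    counts = Counter()
    for record in mapped:
        counts.update(record["hashtags"])
    return counts

# Made-up sample data, purely for illustration.
tweets = [
    {"text": "Trying out #Hadoop on #Azure"},
    {"text": "RT @someone: #hadoop is everywhere"},
    {"text": "No tags here"},
]

mapped = [map_tweet(t) for t in tweets]
hashtag_counts = reduce_hashtags(mapped)
```

Counting hashtags stands in for the more interesting reduce work mentioned above, such as finding two-word phrases or scoring sentiment, which would follow the same shape.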
Hadoop uses MapReduce, but the key to its power is that it applies the concept across large clusters of servers by getting each node to run the functions locally on data held in its own file system, HDFS, thus taking the code to the data to minimise IO and network traffic. There are lots of tools in the Hadoop armoury built on top of this, notably Hive, which presents HDFS as a data warehouse that you can run SQL-like queries (HiveQL) against, and the Pig language (Pig Latin), where you load data and work with your functions.
In the classic word count example, a Map function defines what a word is in a string of characters and the Reduce function then counts the words. Obviously this is a bit sledgehammer/nut, but hopefully you get the idea. The clever bit is that each node has part of the data and the algorithm to process it, and reports back with its answers to a controlling node when it’s done, a bit like High Performance Computing and the SQL Server Parallel Data Warehouse.
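The word count idea, including the part where each node works on its own slice of the data and a controlling node gathers the answers, can be sketched in a few lines of Python. This is only a single-process simulation under stated assumptions: the "nodes" are just lists in memory, a word is assumed to be a run of letters or apostrophes, and the names map_words, reduce_counts and run_node are hypothetical rather than anything from the Hadoop API.

```python
import re
from collections import Counter

def map_words(line):
    """Map: define what a word is and emit a (word, 1) pair for each one."""
    return [(word, 1) for word in re.findall(r"[a-z']+", line.lower())]

def reduce_counts(pairs):
    """Reduce: add up the 1s emitted for each word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

def run_node(lines):
    """The work one node does locally on its own part of the data."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return reduce_counts(pairs)

# Each inner list stands in for the data held on one node (made-up sample).
partitions = [
    ["the quick brown fox", "the lazy dog"],
    ["the dog barks"],
]

# Every node reports its partial answer back...
partial_results = [run_node(lines) for lines in partitions]

# ...and the controlling node merges them into the final counts.
total = sum(partial_results, Counter())
```

Only the small partial counts travel between nodes here, never the raw lines, which is the point of taking the code to the data.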
So where does Microsoft fit into this?
The answer is HDInsight, which is now in public beta. This is a toolkit developed in conjunction with Hortonworks to add integration to Hadoop and make it more enterprise friendly.
Big data is definitely happening; for example, there was even a special meeting on it at the last G8 because it is such a significant technology. However, it cannot be solved in one formulaic way by one technology; rather it’s an approach and, in the case of Microsoft, a set of rich tools to consolidate, store, analyse and consume. The point is to integrate big data into your business intelligence project using familiar tools, the only rocket science being the MapReduce bit, and that is the specialism of a data scientist. Some of their work is published by academics, so you might find the algorithm you need is already out there; for example, the map function to interpret a tweet and pull out the bits you need is on Twitter.
However, research is going on all the time to crack such problems as earthquake prediction, emotion recognition from photographs, medical research and so on. If you are interested in that sort of thing, then you might want to go along to the Big Data Hackathon on 13/14 April in Haymarket, London, and see what other like-minded individuals can do with this stuff.