Despite common misconceptions, Microsoft now has extensive interoperability with open source technologies. For example, you can run a PHP application on Azure, get support from us to run Red Hat, SUSE or CentOS on Hyper-V, and manage your applications from System Center. So extending this approach to the world of big data with Hadoop is a logical step, given the pervasiveness of Hadoop in this space.
Hopefully you’re reading this because you have some idea of what big data is. If not, it is basically an order of magnitude bigger than you can store, it changes very quickly, and it is typically made up of different kinds of data that you can’t handle with the technologies you already have: for example web logs, tweets, photos and sounds. Traditionally we have discarded this information as having little or no value compared with the investment needed to process it, especially as it is often not clear what value it contains. For this reason big data has been filed in the too-difficult drawer, unless you are a megacorp or a government.
However, after some research by Google, an approach to attacking this problem called MapReduce was born. Map is where the structure for the data is declared, for example pulling out the actual tweet from a Twitter message, along with the hashtags and other useful fields such as whether this is a retweet. The Reduce phase then pulls out meaning from these structures, such as digrams (the key two-word phrases), sentiment, and so on.
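To make the tweet example concrete, here is a minimal Python sketch of the two phases. It runs in a single process rather than on a cluster, and the dictionary keys, the function names and the "RT " test for a retweet are all assumptions made for illustration, not Twitter's real message format.

```python
import re
from collections import Counter

def map_tweet(raw):
    """Map step: impose a structure on one raw tweet (hypothetical dict layout)."""
    text = raw["text"]
    return {
        "text": text,
        "hashtags": re.findall(r"#(\w+)", text),
        # Assumption: a retweet is recognised by a leading "RT " marker.
        "is_retweet": text.startswith("RT "),
    }

def reduce_digrams(records):
    """Reduce step: count two-word phrases across all mapped tweets."""
    counts = Counter()
    for record in records:
        words = re.findall(r"[a-z']+", record["text"].lower())
        # Pair each word with the word that follows it.
        counts.update(zip(words, words[1:]))
    return counts

tweets = [
    {"text": "RT big data is here #hadoop"},
    {"text": "big data on Azure #bigdata #hadoop"},
]
mapped = [map_tweet(t) for t in tweets]
digrams = reduce_digrams(mapped)
```

Sentiment scoring would be another reduce function over the same mapped records.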
Hadoop uses MapReduce, but the key to its power is that it applies the concept on large clusters of servers by getting each node to run the functions locally on data held in its own file system, HDFS, thus taking the code to the data to minimise IO and network traffic. There are lots of tools in the Hadoop armoury built on top of this, notably Hive, which presents HDFS as a data warehouse that you can run SQL against, and the Pig (Pig Latin) language, where you load data and work on it with your functions.
The classic example is a word count: a Map function defines what a word is in a string of characters and the Reduce function then counts the words. Obviously this is a bit sledgehammer/nut, but hopefully you get the idea. The clever bit is that each node has part of the data and the algorithm to process it, and then reports back with its answers to a controlling node when it’s done, a bit like High Performance Computing and the SQL Server Parallel Data Warehouse.
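A word count along those lines might look like the following Python sketch. It is only an illustration of the idea: everything runs in one process, whereas Hadoop would hand each node its own share of the lines, and the function names and the definition of a word (runs of letters and apostrophes) are assumptions.

```python
import re
from collections import defaultdict

def map_words(line):
    """Map step: define what a word is and emit a (word, 1) pair for each one."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield (word, 1)

def reduce_counts(pairs):
    """Reduce step: add up the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

def word_count(lines):
    # On a cluster each node would map its own lines and report back;
    # here the pairs are simply streamed into a single reducer.
    pairs = (pair for line in lines for pair in map_words(line))
    return reduce_counts(pairs)

counts = word_count(["big data is big", "Hadoop handles big data"])
```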
So where does Microsoft fit into this?
The answer is HDInsight, which is now in public beta. This is a toolkit developed in conjunction with Hortonworks to add integration to Hadoop and make it more enterprise friendly.
Big Data is definitely happening; for example, there was even a special meeting on it at the last G8 meeting, as it is such a significant technology. However, it cannot be solved in one formulaic way by one technology. Rather, it’s an approach and, in the case of Microsoft, a set of rich tools to consolidate, store, analyse and consume data. The point is to integrate Big Data into your business intelligence project using familiar tools, the only rocket science being the map reduce bit, and that is the specialism of a data scientist. Some of their work is published by academics, so you might find the algorithm you need is already out there; for example, the map function to interpret a tweet and pull out the bits you need is on Twitter.
However, research is going on all the time to crack such problems as earthquake prediction, emotion recognition from photographs, medical research and so on. If you are interested in that sort of thing then you might want to go along to the Big Data Hackathon on 13/14th April in Haymarket, London, and see what other like-minded individuals can do with this stuff.
In my last post and screencast I showed how Dynamic Access Control (DAC) works: the business of matching a user’s claims to the properties of a file (a Resource Property in DAC). The problem then becomes how to tag files correctly so that DAC works. You shouldn’t necessarily be doing this yourself; it’s the users’ data and you are just the curator of that data. The users aren’t going to have the time or inclination to do it either, even if they are working in a compliance or regulated environment. However, they might be able to give you some rules which you could apply to the files, and this is what File Classification does.
File Classification is part of the File Server Resource Manager (FSRM) role service and is new for Windows Server 2012; before this, FSRM was just there to allow only certain file types to be saved, or to set quotas restricting how much users could store on your servers. The secret sauce is then to link the resource property you set using the classification rule to a Central Access Rule in DAC.
Hopefully this screencast shows how easy this is to do.
Things to note:
As per my previous post you’ll need your domain functional level to be Windows Server 2012.
You’ll need the FSRM role service on your file servers and these also need to be running Windows Server 2012.
The PowerShell to add the role service is:
Add-WindowsFeature -Name FS-Resource-Manager
You’ll also need a copy of Windows Server 2012 Evaluation Edition to try this out.
I used a simple expression, “Top Secret”, in my screencast, but you can write regular expressions to look for things like credit card details and NI numbers, and appropriately protect those documents automatically using this technique.
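As a rough idea of what such expressions could look like, here is a small Python sketch that tests text against three patterns. The patterns are deliberately simplified assumptions (16-digit card numbers in groups of four, and an NI number as two letters, six digits and a final letter A to D), the rule names are made up, and the syntax would need checking against what the FSRM content classifier accepts before being used in a real classification rule.

```python
import re

# Hypothetical rule names mapped to simplified, illustrative patterns.
RULES = {
    "Top Secret": re.compile(r"top secret", re.IGNORECASE),
    "Card Number": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "NI Number": re.compile(r"\b[A-Z]{2} ?\d{2} ?\d{2} ?\d{2} ?[A-D]\b"),
}

def classify(text):
    """Return the sorted names of every rule whose pattern appears in the text."""
    return sorted(name for name, pattern in RULES.items() if pattern.search(text))

labels = classify("Invoice paid with 4111 1111 1111 1111 by QQ 12 34 56 C")
```

Patterns this loose will produce false positives, so in practice you would tighten them (for example with a checksum test on card numbers) before relying on them to protect documents.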
File Classification in a production environment would typically run as a scheduled job, so to be clear this does not magically happen on the fly as users save documents onto your file servers.
Managing users’ access to the right files is a pain on any OS; the best that’s going to happen is that no one complains about not having access to a file while none of your sensitive company data gets into the wrong hands. In a traditional hierarchical business life was pretty easy: you had a group called finance and a folder with their finance documents in, you set up permissions from one to the other, and you were done. However, in a virtual-teaming, outsourcing, home-working organisation all sorts of rules are needed to keep third parties at arm’s length from confidential data and to allow users to have different roles on different teams. Also, very few of us are good at filing; for example, how many of us properly tag our holiday photos so that we can track down our friends in all the photos we have?
Windows Server 2012 has several components in it to make this work, but key to this is Dynamic Access Control (DAC), which itself plugs into Active Directory (AD), Group Policy and File Server Resource Manager (FSRM). The Dynamic in DAC refers to the fact that whenever a user tries to access a file their claim to do so is evaluated at the time of access. There are several parts to DAC to make this work, and in my screencast you can see this in action.
However there’s a lot going on here and so I also wanted to describe the moving parts of DAC in more detail.
Claim Types are the things we know about our users and the devices they are using, based on querying what’s in AD. For example, in my demo I have defined a claim for the Country a user is in.
Resource Properties are the things I know about what the user is trying to access, such as a file. For example, I could set up a tag of Country and tag each file with one or more countries.
Resource Property Lists are optional groups of Resource Properties that you want to keep together for a purpose, so each is a subset of the Global Resource Property List that is there by default in DAC.
Central Access Rules allow you to define how to evaluate a claim against a Resource Property and assign permissions off the back of this. At the top of the rule dialog you are asked which resources (Target Resources) the rule will apply to; in my demo I have set this up so that my rules are only applied to objects that already have the resource properties I am interested in set.
Further down the dialog under Current Permissions I can then set the rule that I want to enforce. Here I have said the device the user is on must be running Windows 8 Enterprise to get full permissions to the resource. For this to work AD must know the computer I am on, and in Windows Server 2012 AD this property is actually only set if I am on a Windows 8 or a Windows Server 2012 machine. So I can’t get in from an older Windows machine or if my machine is not domain joined.
I also have a rule (User-Country-Department) which says that the user’s country and department must match the country and department of the resource being accessed. This is great: I don’t have to create groups for each user or folders to categorise departments and fiddle with ACLs. This one rule makes that work, and provided the users’ data in AD is kept up to date and files are tagged correctly, that’s all I have to do.
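The logic of the User-Country-Department rule can be sketched in a few lines of Python. This is only a model of the matching idea, not how Windows evaluates a Central Access Rule internally; the dictionary keys and the function name are assumptions for illustration.

```python
def user_country_department_rule(user_claims, resource_properties):
    """Allow access only when the user's country and department match the resource.

    A resource may be tagged with one or more countries but a single department.
    Missing claims never match, so an untagged user is refused.
    """
    country = user_claims.get("country")
    department = user_claims.get("department")
    return (
        country is not None
        and department is not None
        and country in resource_properties.get("countries", [])
        and department == resource_properties.get("department")
    )

# Hypothetical resource properties and user claims for illustration.
report = {"countries": ["UK", "France"], "department": "Finance"}
alice = {"country": "UK", "department": "Finance"}
bob = {"country": "Germany", "department": "Finance"}

alice_allowed = user_country_department_rule(alice, report)
bob_allowed = user_country_department_rule(bob, report)
```

Because the check runs against whatever the claims and properties are at the moment of access, moving a user to another department in AD changes the outcome without anyone touching an ACL.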
Central Access Policies combine several rules into a single policy. In my case I have a Central Access Policy called Default, and this references my two rules.
This is now a policy object that can be applied like any group policy. So if I look at Group Policy, you can see a policy called DA-FileServer-Policy that is filtered to only apply to Server1.
If I edit that and expand Computer Configuration -> Windows Settings -> Security Settings -> File System -> Central Access Policy, you can see where I have referenced my Default policy.
DAC requires the AD functional level to be at Windows Server 2012.
This can work in concert with traditional ACLs, but remember that the principle of least privilege applies, so if there’s an explicit deny somewhere in DAC or in an ACL that is what will win.
You’ll want to test your scenarios and there are two tools here to help:
You can set proposed permissions in a Central Access Rule as well as actually set permissions.
For a particular folder or file you can go into Properties -> Security tab -> Advanced to evaluate security. You can see what policy is applied and what is granting or blocking users’ access to objects.
You can also see there’s a Classification tab from which you can see and set (depending on permissions) the resource properties for that file/folder.
I will cover how to classify files automatically, rather than relying on manually tagging them, in my next post. In the meantime, if you want to try this you’ll need a copy of Windows Server 2012 Evaluation Edition, which you can use to make a domain controller.