Insufficient data from Andrew Fryer

The place where I page to when my brain is full up of stuff about the Microsoft platform

  • Machine Learning for the Business Intelligence developer

    Amy (the other half of the data duo in our team) and I have been gate-crashing the Data Culture series and other events recently to see who’s interested in Azure Machine Learning (MAML). It turns out that data scientists are pretty comfortable using their own tools and scripts, be that R or Python, and while they are interested in MAML they don’t see the need for it.  However, Business Intelligence (BI) developers seem far more excited about how this new Azure service can be quickly evaluated and deployed against their existing data without them needing the deep maths knowledge a data scientist has.  That’s not to say that a basic knowledge of maths and stats isn’t needed, or that Azure ML makes data scientists redundant; rather it means that interesting ideas can be tested and evaluated if you have some idea of what you are doing.

    However MAML has only just gone into general availability, and while the service is powerful, fast and reliable, the full documentation around it is still a work in progress.  To fix that, Amy and I have come up with a tutorial to help get you started.  It’s based around my earlier posts on predicting whether flights are delayed or not – in technical parlance a two-class classification problem.  This lab guide is on OneDrive and makes use of existing sample data sets in MAML, and it takes you through the whole process right up to publishing – note you’ll still need to sign up for MAML to do this.
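For anyone new to the terminology, a two-class classification problem can be sketched in a few lines outside MAML as well – here with scikit-learn on made-up data (the features and the decision rule are invented for illustration; the lab itself uses MAML’s built-in flight and weather samples):

```python
# A minimal two-class classification sketch: predict delayed (1) vs on time (0).
# The data here is synthetic, not the lab's flight data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                # two stand-in feature columns
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # 1 = "delayed", 0 = "on time"

model = LogisticRegression().fit(X, y)
print(model.predict([[2.0, 1.0]]))           # a point deep in the "delayed" region
```

The trained model returns a class label for each new row, which is exactly what the published MAML experiment does for a flight.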

    Amy has also posted two experiments at various stages of the lab into the MAML Gallery..


    So that you can start at the point where things get interesting – i.e. you have got the data, cleaned it up, and are ready to do actual machine learning in ML Studio – or see how to publish an experiment without doing all the rest of the lab.

    Note: you’ll still have to sign up to Azure to get started with MAML, but you can get a trial for a month or use the Azure hours included with your MSDN subscription.

  • Azure Machine Learning – General Availability

    Today Microsoft Azure Machine Learning (MAML) goes from preview to general availability, and it’s also undergone quite a few changes on the way.

    Before I get into what’s new (and there is quite a lot), I should say that I do occasionally get feedback that Azure services are evolving too quickly for everyone to keep up.   I agree this is a challenge, but this is where the industry is going, and it means that new features requested by you or dreamt up by product teams can be quickly put out there and refined based on feedback – a bit like an experiment in MAML.

    So what’s new now it’s officially in production and why does this matter?

    When I talk about MAML at the events I go to, I call out the three things that make it a bit different from other services and tools that do machine learning: it’s collaborative, it has some seriously good algorithms in it that you can grab and go, and it allows you to take your hypothesis from initial concept through to production.  In this post I’ll deal with the first of those – collaboration.

    You might notice that from today, if you go to the ML Studio landing page, there is a new preview option in the menu bar at the top – the Gallery – and if you follow this link you’ll get something like this..


    The Gallery is the place where we can share our work with each other.  Until now we could only do this by inviting each other into our workspaces, which is still fine for team work but not so good if you want to share your finished work.  In the screenshot above you can see that as well as the ML team sharing some new samples, there are already contributions from those of us who have been on the preview.

    The home page has also had a shake-up as more and more training and resource materials come online.  I would also call out the Contribute option, where you can get help and also feed back on what you think needs to be in MAML.  I can’t say for certain that your idea will get through, but I can guarantee that what you write will be reviewed by the product team.

    To start sharing our own experiments with each other we need to describe them to make them searchable and discoverable, so you’ll notice that if focus is set to the project in ML Studio I can do exactly that..


    Here I can add a 140-character summary and a longer description. Once I have successfully run my experiment, the Publish to Gallery option in the toolbar at the bottom is enabled, allowing me to share my work. Also notice the Create Scoring Experiment option here, which I’ll get to in a later post.

    The other data developer evangelist on the UK team I work in is Amy Nicholson, and she has posted up an experiment we use during the instructor-led lab on MAML that we have developed..


    If you open this you’ll realise that it doesn’t do any machine learning at all.  This is because it’s the start point of the lab, following a quick demo we do of how to load and transform data in MAML – just showing how to use ML Studio to do the kinds of things we have always done with data.   It’s great for us to have this experiment in the Gallery, as until now we have had to invite you into our workspace, which meant giving us the details of your Azure accounts.  Now you can just open this, save a copy and you are ready to do our lab.  If you do actually want to do this lab then we’ll be at the following events..

    • the Data Culture series, which is at various locations across the UK
    • the UK venues of SQL Saturday
    • SQLBits at the ExCeL Centre, London, 4-7th March (the community day on the 7th is free; the other days are paid for)
    • other random events that’ll be in the MSDN UK & TechNet UK newsletters

    So please come along to one of these with your curiosity and your ideas. However, if you want to sit at home and work your way through our lab then please e-mail me and I’ll send you the guide.

  • TechDays Online 2015 – it gets better



    I have been out in Seattle for the last couple of weeks as a roving reporter for our upcoming TechDays Online 2015, and I have managed to get footage of some of the engineers who write the stuff we all use.  I have also been having coffee with some of them to make sure we can share the very latest thinking during the event, whether that’s on enterprise mobility on 3rd Feb, what we are doing with service management and automation on the 4th, or what Windows 10 means for developers on the 5th.

    Of course it’s no substitute for a ticket to Build (which has sold out anyway) or Ignite, but if you hate flying and like home comforts then there is no better way to stay up to date than to tune in next week.

    Also be aware there is a small army of experts, both MVPs and our field engineers, online throughout the day, so if you have questions and comments we will get answers right back to you on the side chat.

    If that’s not enough then we are also giving away all sorts of stuff, but hopefully I don’t need to mention that as it’s all about the technology, isn’t it? (BTW I bought my own Surface Pro 3 while I was out there, as I am not allowed to enter!)

    So hopefully we’ll be able to virtually catch up, as there should be something over the three days that will catch your eye.


    Neon sign in downtown Seattle – @deepfat 2015

  • Don’t be a turkey


    It’s not great being a turkey, pheasant or grouse at Christmas, but my Christmas sketch this year isn’t too worried, as he’s a protected and valued capercaillie.  In the same way, the cloud-first world might not seem like a great time to be an IT professional, but there are certain species who are similarly valued and for whom 2015 is pretty exciting.

    So as we all take time out to enjoy time with our friends and families, don’t panic about 2015.  Infrastructure guys are still needed, and many of the skills we have acquired are still valued too – like VPNs, certificates, Active Directory, and of course PowerShell – and I for one won’t be retiring to do art full time just yet.

    Happy Christmas, and hopefully we’ll meet up at an IT Camp, Future Decoded, or a user group somewhere next year.


  • IT Camp - Extend your Data Centre with Azure

    I have been on a journey over the last year that some of you may already be embarked on, and that is to transition my skills to use cloud services, specifically Azure.  For me it’s like going to America: the language is similar, the culture less so, and everything over there seems to be much bigger, such as the cars and buildings.

    In Azure we can still use the skills we have (PowerShell is amazing!) to run up virtual machines, and we understand what they are. Our knowledge of gateways and VPNs is also useful, but then there are new things like Web Roles and SQL Azure databases, which actually are similar to the web and database services we provide to the business in our own data centres.

    So how do we get our heads around this stuff? It all depends on how you learn: some of us like to read the manual, some like to be shown, and the curious amongst us just jump in. Actually we all use all three of these approaches, but in any situation the amount of risk and uncertainty will influence which we use.

    When it comes to Azure you could sit back and watch the excellent content on MVA, but it’s all in American and isn’t interactive unless you catch a Jump Start like the one in November on Azure.  There are also various lab guides lying around, but there’s no immediate help if you get stuck.

    So your best bet might be to start your journey to Azure by coming along to an IT Camp which is a mix of theory and instructor led practical work.

    Specifically, we’ll get you started by showing you how to create a group of Azure VMs and stand up a load-balanced public-facing web site.  This is easy to do, but just as hard to maintain as it would be if the VMs were in your basement, so in the next stage we move the web sites to Azure Web Roles and the backend database to SQL Azure. These Azure services mean that all you worry about is the code/database, not the OS or application layer.

    Having done this, we can then explore other parts of the Azure fabric, such as how storage and networking work, so you can connect and integrate your on-premises applications and extend them to Azure.  We also look at operations, like monitoring your data centre from Azure and how to leverage Azure Automation to hand off routine tasks.

    We have lots of optional content which we cover to a greater degree depending on demand, such as running Linux VMs on Azure and how to deploy web sites based on non-Microsoft languages. Azure can also be used for disaster recovery, so we can cover that if it’s of interest as well.

    Finally, we can discuss questions like how much it will cost, how to move stuff, and the fear, uncertainty and doubt (FUD) around the cloud, as answers to these are very dependent on what business you are in and your existing investments in infrastructure.

    So if you are interested in Azure and want to get your hands dirty you need to do three things:

    1. Sign up for an IT Camp 1

    2. Turn up on the day

    3. Bring along an open mind and a laptop/tablet to do the labs on

    1 Stop press 13 Jan 2015 - Open Registrations for our IT Camps are here but you may also get an invitation to closed camps directly from your account manager

  • IT Camps – what are they good for?


    One of the most rewarding yet most difficult parts of my job over the last three years has been designing and delivering something called an IT Camp.  So I thought it might be useful to explain what they are, in case an invite lands in your mailbox.

    What IT Camps are:

    Technical. They’re designed for technical people who like to learn by doing, so we allocate three hours plus of lab time out of the day, and it’s essential you bring a device (or even two – one for the lab guide).

    Focused on a key technology. Camps have a theme, currently either Enterprise Device Management or Extend Your Data Centre with Azure.  In both cases we want to get you to a place, but how we get there and where we focus is up to you.

    Education rather than training.  We show the art of the possible, but we do cut corners that you wouldn’t in production, e.g. self-signed certs, and our PowerShell misses out the error trapping so it’s easy to see what it’s trying to do.  We do share best practice, but we also encourage debate about what that is, as you all come from very different organisations and what works in retail might not be appropriate in education.

    What IT camps are not:

    Death by PowerPoint. We do use decks to show architecture and setup instructions, but that’s about it.  We do supply you with a pile of decks to explain content offline, and you can continue your labs after the day as we’ll make the guides available.

    A lecture.  While events like Future Decoded provide deep technical content, there is little opportunity to ask questions or have any kind of discussion.  This does mean that IT Camps can appear more unstructured than a session at something like TechEd, as we will deviate into areas based on what you are interested in on the day, so no two camps are ever quite the same.  We also make mistakes and things don’t always work, because we are doing this live – and actually experience has taught us that small hiccups reinforce learning, so expect surprises!

    A sales or marketing pitch. Evangelists don't get bonuses for generating leads; we are there to show the art of the possible, and often that's about making the best use of the Microsoft solutions you already have, like Hyper-V or the deployment tools we give away.

    The reason that Ed and I put a lot of effort into camps is the direct help and advice we can share and the feedback we get from you.  Feedback enables us to continually improve camps – a recent example being that we now give you much more time for labs than we used to.  We also get ideas for posts and other content delivered by the TechNet team, like my Lab Ops series on using PowerShell in a sandbox.

    So hopefully we’ll have the pleasure of your company at a camp sometime next year.

    Stop press 13 Jan 2015 - Open Registrations for our IT Camps are here but you may also get an invitation to closed camps directly from your account manager.

  • How to train your MAML–Redux

    Since I wrote this series we have taken on two computer science graduates in the evangelism team in the UK, Amy Nicholson and Bianca Furtuna. Both studied machine learning as part of their courses, so I introduced them to this series on Azure ML to get them up to speed.  They have taken the series apart and put it back together again as a full-day training course, which we have been testing at several events recently.  They also spotted some improvements, so I thought I would share these.

    Amount of Training Data:

    In the part of the experiment where I split the data into training and test data, I split out 20% of the data for training.  However, what I should have done is use 80% of the data for training instead: usually in machine learning scenarios like this, the majority of the data is given to train the model and the remaining 20% is used to validate it, which allows the model to really refine the patterns it finds in the data and predict with more accuracy.  It’s also important to have a good balance of data, and in this case we need a mix of flights that are late and on time.
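Outside MAML, the same idea – an 80/20 split that keeps the mix of late and on-time flights balanced in both halves – would look something like this with scikit-learn's `train_test_split` (a sketch on synthetic data, not the lab's dataset):

```python
# 80% of rows train the model, 20% validate it; stratify=y keeps the
# late / on-time class proportions the same in both halves.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))          # stand-in feature columns
y = rng.integers(0, 2, size=1000)       # 1 = delayed, 0 = on time

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=42)

print(len(X_train), len(X_test))        # prints: 800 200
```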

    Excluding Month and DayofMonth:

    I removed the time-related columns used for joining the flight and weather data.  Amy wondered if the time of year was significant and ran some tests with the Boosted Decision Tree algorithm:

    Keep Month, DayofMonth vs. Remove Month, DayofMonth (result screenshots compared side by side)
    However, to be fair, there isn’t enough data in the sample to really test this, as there’s less than a year’s worth of data.  Time of day might be important as well.  The point here is to experiment and test, iterating to a good result; knowledge of the business data and maths, combined with some scientific rigour, is needed for that.
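Amy's keep-versus-remove test is the kind of thing you can also script directly: train the same boosted-tree algorithm twice, once with and once without the candidate columns, and compare accuracy. A sketch on synthetic data (the `month` and `weather` features and the delay rule are invented, not her actual experiment):

```python
# Compare a gradient-boosted model's accuracy with and without a feature,
# mirroring the keep/remove Month and DayofMonth test. Synthetic data:
# delays here depend on weather only, so month should not help.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
month = rng.integers(1, 13, size=600)
weather = rng.normal(size=600)
y = (weather + rng.normal(scale=0.5, size=600) > 0).astype(int)

def accuracy(features):
    X = np.column_stack(features)
    Xtr, Xte, ytr, yte = train_test_split(X, y, train_size=0.8, random_state=1)
    return GradientBoostingClassifier(random_state=1).fit(Xtr, ytr).score(Xte, yte)

print(accuracy([weather, month]), accuracy([weather]))
```

If the accuracies are close, the extra column isn't pulling its weight – the same conclusion you'd draw from comparing the two MAML runs.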

    Quantize Module Use:

    In my experiment I explained and used quantization – the process of putting a continuous variable like temperature into several buckets or ranges – as this is done in the Azure ML sample experiment on flight delay.  Amy decided to test whether this was improving the accuracy of the model by removing it..
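Quantization itself is easy to reproduce outside MAML – pandas' `cut` puts a continuous variable like temperature into a handful of labelled ranges (the bucket edges and values below are invented for illustration):

```python
# Bucket a continuous temperature column into four labelled ranges,
# the same idea as the Quantize module in the flight-delay sample.
import pandas as pd

temps = pd.Series([-3.0, 4.5, 12.0, 19.5, 31.0])
buckets = pd.cut(temps, bins=[-10, 0, 10, 20, 40],
                 labels=["freezing", "cold", "mild", "hot"])
print(list(buckets))   # prints: ['freezing', 'cold', 'mild', 'mild', 'hot']
```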

    Keep Month, DayofMonth + No Quantize Module vs. Remove Month, DayofMonth + Quantize Module (result screenshots compared side by side)
    So it turns out it’s better not to do that for this scenario.  Amy then wondered whether the other two-class algorithms in ML worked better with quantised data..

    Without Quantize Module vs. With Quantize Module, for each of Boosted Decision Tree, Logistic Regression and Support Vector Machine (result screenshots for each)
    So it’s not helping here at all.  Amy did see a slight reduction in processing time using quantisation, and you might argue that a small drop in accuracy could be acceptable to really speed up processing, but here the improvement wasn’t big enough to justify it.


    I learnt three things from this feedback, as well as the specific errors above:

    1.  Expertise. It is worth talking to and learning from a real data scientist and sharing your work with them. Azure ML makes this easy by allowing me to invite others into my workspace.

    2.  Do good science.  Don’t slavishly follow other people’s work; be curious but sceptical, testing your experiments rigorously.

    3. Document your work.  If you want to share your work you'll need to explain your hypothesis and workings like any good scientist. Azure ML does allow comments, but it doesn’t have a way of storing other metadata about an experiment the way a solution in Visual Studio does.

  • What kind of data person are you?

    I have been working with data in one form or another since 1991, but I wouldn’t describe myself as a DBA, data scientist or a developer – so what am I, and am I alone?

    I spent ten of those years doing business intelligence, and most of that was working with databases, designing systems, data marts and analytical models, as well as getting into search and unstructured data.  What these had in common was my desire to provide the business I was in with a better view of the decisions they were making.  However, what I was creating in each case was a space in which business users could explore that data to come up with their own answers, rather than me doing that, as the questions changed quickly and needed a deeper insight into the business than I had.

    More recently I have been getting back into working with data, and what’s interesting is that there seem to be a lot of other data-orientated people like me who are exploring the many new data tools that exist, both on the desktop and, more interestingly, in the cloud.  Some have deeper technical knowledge while others are really close to the business.  In all cases they are curious but cautious, in that they will try, test and evaluate rather than ignoring a new opportunity or taking someone else’s word for it.

    I would classify us (or should I say cluster us) as data scientists; however, that term seems to be reserved for those who have a deeper mathematical/statistical background, so what collective term should I use to describe us?  In the team I work in we call ourselves data developers, and I kind of like this, as the word development implies that we all do some sort of coding (such as R, Python, SQL, MDX and F#) and are using some sort of workbench like the SQL Server Data Tools in Visual Studio.  It also suggests that we are developing the data into something in the process: in machine learning that something would be a prediction, in BI it might be an analytical model, and in real-time analysis it might be a trigger to fire an event when a condition is met.

    There are other groups around this field as well. I have already mentioned the data scientist, and it might be that their role is to oversee the models we build and sign them off for production, much as an actuary would do in an insurance company.  Other data-related roles are the visualisation experts, who, like user experience designers on an application team, put the gloss and usability on data to ensure that its true meaning is self-evident.  Then there are curator/librarian types who understand how disparate data sets fit together.  Perhaps one area that will emerge, and might get you thinking, is the ethicist: the laws on data privacy are far behind what is possible these days, so the question is not “is this illegal?” but “is it ethical?”. Just because someone didn’t untick a box on an online form, and just because someone can now mash that preference data with their social media data exhaust and a bunch of commercially bought third-party data so that a data scientist can infer some really intimate details about their life, doesn’t mean we should.

    Anyway, back to data developers.  I think there are a lot of us out there, most with some sort of background in business intelligence, data management and so on, and while we may have done some maths in the past it’s not the day job.   My conversations at recent events seem to support this, and so our plan is to try to help you transition into this new world with in-person and online content, which the MSDN UK team have curated onto a dedicated data developers portal.

    What we need from you is feedback on what we are doing, to ensure you are getting the content and events you need – so it’s just like any other science project.

  • Lab Ops – Working with Azure VMs

    A couple of months ago I had a very interesting chat with Andy, a data centre admin at CP Ltd, about using PowerShell to manage Azure, as Andy is partially sighted and his team mate Lewis is blind (for more on this please read Andy’s post on the TechNet UK blog).  I wanted to go into some of the PowerShell in a little more detail so that you can be as good an administrator on Azure as these guys are.  For this post I am going to assume you know how to work with Azure and are familiar with concepts like storage, cloud services and networking, though you will get an idea if you follow this post!

    Firstly, to get working with PowerShell on Azure we need to get hold of the PowerShell for Azure cmdlets – and remember to check back regularly, as they change as Azure changes.

    Before we can run any PowerShell against our subscription we need to set up some sort of trust, otherwise anyone could create services against it.  The simplest way to do this is with

    Get-AzurePublishSettingsFile

    This will launch the Azure management portal and ask us to sign in. The command then saves a file to our local machine, which we can consume like this..

    Import-AzurePublishSettingsFile -PublishSettingsFile "C:\AzureManagement\some filename.publishsettings"

    However, the problem with this approach is that you have access to the whole subscription, which is fine for demos and labs. In production you'll have some sort of Active Directory in place and you'll connect to that with:

    $userName = "your account name"
    $securePassword = ConvertTo-SecureString -String "your account password" -AsPlainText -Force

    $Azurecred = New-Object System.Management.Automation.PSCredential($userName, $securePassword)

    Add-AzureAccount -Credential $Azurecred

    Now we can run any of the PowerShell for Azure commands against our subscription but before we can do too much with Azure VMs we will need a storage account to store them..

    $StorageAccountName = "lowercase with no spaces storage account name"

    $AzureLocation = "West Europe"

    New-AzureStorageAccount -StorageAccountName $StorageAccountName -Location $AzureLocation

    where -Location specifies the data centre you want the storage account to reside in, e.g. West Europe, and Get-AzureLocation will list all the data centres you can choose from. Now we have a storage account, we need to declare it as the default location for our VMs..

    $SubscriptionName = (Get-AzureSubscription).SubscriptionName
    Set-AzureSubscription -SubscriptionName $SubscriptionName -CurrentStorageAccountName $StorageAccountName

    If you are familiar with VMs in Azure you’ll know that by default each VM gets its own wrapper, or cloud service, but in this demo I want to put three VMs into the same cloud service, which we can create with..

    $AzureServiceName = "This needs to be unique on Azure"

    New-AzureService -ServiceName $AzureServiceName -Location $AzureLocation -Description "Lab Ops cloud service"

    Before we can create any VMs we also need a network in place, and it turns out the PowerShell support for this in Azure is pretty weak – all we can do is set up a network using an XML file of the form

    <NetworkConfiguration xmlns:xsd="" xmlns:xsi="" xmlns="">
        <DnsServer name="AzureDNS" IPAddress="" />
        <VirtualNetworkSite name="My-VNet" Location="West Europe">
            <Subnet name="My-Subnet" />
        </VirtualNetworkSite>
    </NetworkConfiguration>

    You can hack this around, save it as something like VNet.xml, and then apply it to your subscription with

    $AzureVNet = Set-AzureVNetConfig -ConfigurationPath " path to VNet xml file"

    For more on how to hack this file with PowerShell rather than editing it by hand, have a look at Rik Hepworth’s (Azure MVP) blog.

    Now to create those VMs we have more choices – we could use a template VHD of our own, but for now we will just use the gallery images, just as we can in the Azure management portal.  To do this we need to interrogate the gallery to find the right image with something like this..

    $AzureImage = Get-AzureVMImage | Where-Object ImageFamily -eq "Windows Server 2012 R2 Datacenter" | Sort-Object PublishedDate -Descending | Select-Object -First 1

    which will get the most recent gallery image for Windows Server 2012 R2 Datacenter edition. I can then consume this in a script to create a VM:

    $AdminUser = "deepfat"
    $adminPassword = "Passw0rd!"

    New-AzureVMConfig -Name $VMName -InstanceSize Medium -ImageName $AzureImage.ImageName | `
            Add-AzureProvisioningConfig -Windows -AdminUsername $AdminUser -Password $adminPassword | `
            Set-AzureSubnet 'My-Subnet' | `
            New-AzureVM -ServiceName $AzureServiceName -Location $AzureLocation -VNetName 'My-VNet'

    Note that if you want to embed these snippets in a script you’ll need to get clever and introduce some wait loops to allow the VMs to spin up.

    By default, when you create a VM a couple of endpoints will be created: one for RDP and one for PowerShell.  In reality you wouldn’t necessarily want to do this, as you may have a site-to-site VPN in place, in which case they are redundant, or you might just do this on one VM to manage the rest, or use Azure Automation.  We need to query for these ports because, in a cloud service, each VM has the same DNS entry but a different random port:

    $VM = Get-AzureVM -ServiceName $AzureServiceName -Name $VMName

    $VMPort = (Get-AzureEndpoint -Name PowerShell -VM $VM).port

    In Andy’s post he published a self-signed certificate to his cloud service, which is needed to enable a secure remote PowerShell session to the VM.  However, if we are just trying this in a lab then we can use the certificate that Azure automatically creates when a cloud service is created, as this is also trusted by the VMs in that cloud service by default.  We can then pull this down and trust it on our local machine with

    (Get-AzureCertificate -ServiceName $AzureServiceName -ThumbprintAlgorithm SHA1).Data | Out-File "${env:PUBLIC}\CloudService.cer"
    Import-Certificate -FilePath  "${env:PUBLIC}\CloudService.cer" -CertStoreLocation Cert:\LocalMachine\AuthRoot

    Now we have all the settings and permissions we need to set up a remote PowerShell session to our VM..

    $VMCred = new-object -typename System.Management.Automation.PSCredential -argumentlist $AdminUser, (ConvertTo-SecureString $adminPassword -AsPlainText -Force  )

    $VMSession = New-PSSession -ComputerName ($AzureServiceName + "") -Port $VMPort -Credential $VMCred -UseSSL -Name VMSession

    With this session we can now add roles and features, turn on firewall rules, and so on, like this:

    Invoke-Command -Session $VMSession -ScriptBlock {
        Get-WindowsFeature Web-Server | Add-WindowsFeature -IncludeAllSubFeature}

    If we want to work on a SQL Server VM (there’s a bunch of gallery images on Azure with different editions of SQL Server on them), then it might be useful to enable SQL Server mixed-mode authentication, in which case we need to pass parameters into the session.  The simplest way to do this is by using the param() setting inside the script block with an -ArgumentList switch at the end (remembering to keep the parameters in the same order)..

    Invoke-Command -Session $VMSession -ScriptBlock { param($VMCred, $VMName)
        #set SQL Server to mixed mode and restart the service in the process
        Get-SqlInstance -MachineName $VMName -Credential $VMCred -AutomaticallyAcceptUntrustedCertificates | Set-SqlAuthenticationMode -Mode Mixed -Credential $VMCred -ForceServiceRestart -SqlCredential $VMCred
    } -ArgumentList $VMCred, $VMName

    as this allows us to reuse the parameters we are already working with in the remote session, and enhances readability.

    So that’s a quick introduction to some of the stuff that the more enlightened IT professionals like Andy are using to make their lives easier.  Actually, a lot of the stuff in this post works in your own data centre (like the section at the end on setting up SQL Server), so using Azure really is just an extension of what you are used to.

    Be warned – stuff keeps changing on Azure.
    For example, a lot of older examples use affinity groups in Azure to co-locate VMs, but this is on the way out, so I deliberately didn’t reference it here.  My advice is to be wary of older posts and to follow the Azure blog, particularly if what you are trying to do is still in preview.
  • How to train your MAML – Publishing your work

    Before I get into my next post, you may have noticed that ML Studio has changed compared with a few weeks ago – the main page previews experiments differently, and the visualisation of data has also changed..


    One other problem hit me: some parts of my demo experiment didn’t work or ran really slowly – for example the Sweep Parameters module seems to be really slow now.  I did ask about this, and apparently it hasn’t been configured to run as a parallel process.  So bear in mind that at the moment MAML is still in preview, so it’s going to change and stuff might break.  What’s important is that we feed back to the team, so we all get the tool we need.

    What I want to show in this post is how we can publish the work we have done so far, and because of the time it takes to run a Sweep Parameters module at the moment, I am going to suggest we swap that module out for the Train Model module, as my experiment takes 4 minutes to run with train compared to 5 hours with sweep!  As well as doing this we can now remove the less accurate algorithm (two-class logistic regression) and just leave this..


    such that the Train Model module is set to train against the column ArrDel15 (whether or not the flight is delayed).

    Now we’re ready to publish our experiment and share it with the world but what does that mean exactly?

    MAML allows us to expose a trained experiment as a web service, so that a transaction can be sent to it for analysis and a score returned to the calling application, web site, etc.  In our case we could send an individual flight and the relevant weather data, and get back the likelihood that the flight will be delayed.  While this may not be a unique feature, the ease with which it can be achieved makes the use of MAML even more compelling, so let’s see how it works..
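Once published, calling the service from code is just an HTTPS POST with the input row in JSON. A hedged Python sketch of what that looks like – the URL, API key and input column names below are all placeholders, so copy the real ones from your own service's API help page:

```python
# Build a scoring request for a published Azure ML web service.
# The endpoint URL, API key, and feature columns are placeholders,
# not a real service; fill them in from your own API help page.
import json
import urllib.request

url = "https://ussouthcentral.services.azureml.net/workspaces/<id>/services/<id>/score"
api_key = "<your API key>"

body = json.dumps({"Id": "score00001",
                   "Instance": {"FeatureVector": {"Month": "10",
                                                  "DayofMonth": "7",
                                                  "Carrier": "AA"},
                                "GlobalParameters": {}}}).encode("utf-8")

req = urllib.request.Request(url, data=body, headers={
    "Content-Type": "application/json",
    "Authorization": "Bearer " + api_key})

# response = urllib.request.urlopen(req)   # uncomment with real credentials
# print(response.read())                   # the returned score for this flight
```

The API help page in ML Studio shows the exact request body and response format for your service, plus ready-made sample code in C#, Python and R.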

    The first thing we need to do is define the data inputs and outputs, by right-clicking on the input port and selecting Select as Publish Input, and then on the output port selecting Select as Publish Output. This declares what data needs to be passed in and what data we’ll get out of the model when it’s published.  We just need to rerun the experiment and we’ll see that the option to publish a web service at the bottom of the screen is now enabled..


    If we select this option our experiment is published as a draft and we get this screen..


    If we click on the test option on  the right we can see how this works as we’ll be presented with the input data we need to fill out in order to get a flight delay prediction back..


    We can also see how to use the service, either in batch or transaction by transaction, by clicking on the appropriate API help page link..


    which gives the link and the format of the query we need to pass to our published service as well as links to sample code in C#, Python and R.

    However our web service is not yet live in production; it’s just a draft.  To put it into production we need to go back to the dashboard for the model and select the Mark as Ready for Production link.  We can then add descriptions to the input fields and set the Ready for Production flag to Yes..


    where I have properly described what the Carrier ID is and then clicked Save to confirm what I have done.

    Having done all of that, we go back to the dashboard view where we’ll now get the option to deploy to production. If we do that we’ll be taken back to the ML workspace dashboard in the Azure portal, where we can create a production web service..


    There is a reason for all of these steps: to separate out who does the publishing from the data scientists in a larger team. Once it’s live, note that we can see if there are updates available – for example the model might be refined and marked for publishing again as it is continually improved by the team who create and train experiments. 

    Now that our web service is in production we get a dashboard showing how it’s being used, very similar to other services in Azure like virtual machines and SQL databases..


    So that’s a quick run through of an end-to-end experiment in Azure ML which I hope has aroused your curiosity. You might be wondering how much this stuff costs, and that’s another recent change to the service: there is a free version that lets you work with data sets of up to 10 GB on a single-node server and publish a staging web API just as I have done here. Above that is a standard edition which does incur costs, which are detailed here.  Essentially you pay for every hour you are running experiments, and then when you publish you pay per prediction hour and per thousand transactions (currently these are discounted while the service is in preview, at 48p per prediction hour and 12p per thousand predictions). 
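    As a back-of-the-envelope illustration of those preview prices (the usage figures below are made up, not from any real service):

```python
# Preview prices quoted above: 48p per prediction hour and
# 12p per thousand predictions, both in GBP.
PRICE_PER_PREDICTION_HOUR = 0.48
PRICE_PER_1000_PREDICTIONS = 0.12

def monthly_cost(prediction_hours, predictions):
    """Rough monthly bill for a published scoring service."""
    return (prediction_hours * PRICE_PER_PREDICTION_HOUR
            + predictions / 1000 * PRICE_PER_1000_PREDICTIONS)

# e.g. a service live 24x7 (~730 hours) serving 500,000 predictions a month
cost = monthly_cost(730, 500000)
```

So a continuously running service handling half a million predictions a month would cost on the order of a few hundred pounds at these rates.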

    Couple those prices with the ease of use of the service and I think that MAML would take off even if it existed in isolation.  However, other data services on Azure have recently been announced – near real-time analytics, Storm on HDInsight, Cloudera and Data Factory – which I would assert makes Azure the best place to derive insight from data, which by definition is data science.    

  • Future Decoded – with Data Science

    If you have been following my posts on Microsoft’s approach to data science with things like machine learning, you’ll realise that it’s capable of making predictions. However, to be honest, it’s not going to tell you what devices and career you’ll have in five years’ time.  To do that you need expert humans, not expert systems, and we have gathered some of the best in their field at Future Decoded at the ExCeL Centre on 12th November..


    Future Decoded is much more than a data-science-orientated event, but there is a special data track within it in the afternoon, preceded by a morning of inspirational talks. Of particular interest to anyone in the data science space will be Dr Brian Cox, Sir Nigel Shadbolt and Michael Taylor (IT director of Lotus F1), as they have unique perspectives on how to apply and use data to advance science, the internet and business.

    The data track in the afternoon is not just for data scientists; it’s for anyone in a data-rich role, e.g. traditional roles like BI and reporting – in fact anyone who can see what they do in this diagram..


    This is a big area to cover so what we have done is collect a diverse group of experts with a different perspective on how to derive value from data, based on their experience of implementing some or all of these techniques:

    • Data Science inside Microsoft Research – Kenji Takeda, Microsoft Research
    • Changing the game – an agnostic hybrid approach to big data integration – James Rowland-Jones, MVP
    • Transformation and Analysis with Power BI – Chris Webb, MVP
    • Azure Machine Learning without a PhD – Jen Stirrup, MVP

    So hopefully you can spare the time and expense to come along to Future Decoded  and get your head around what this means for you and your career. 

  • How to train your MAML

    In this fourth post in my Azure Machine Learning series we are actually going to do the training itself.  I have tidied up my experiment from last time to get rid of the modules that export data to SQL Azure, as that has now served its purpose ..


    Before we get stuck into training there’s still a bit more tidying up to do:

    • We can get rid of the year, month and day columns as they were only needed to join our datasets and don’t have any relevance in predicting flight delay (actually they might, but I am keeping things simple here).
    • We can put the weather readings into discrete ranges rather than leaving them as continuous variables.  For example, temperature could be grouped into ranges: less than –10, –10 to 0, 0 to 10, 10 to 20, 20 to 40, and 40 plus.  This is called quantization, in the same way that the quantum in quantum physics recognises that particles like electrons have discrete amounts of energy.  There’s a special Quantize module to help us do this in ML Studio.
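    In plain code, quantization with custom bin edges amounts to mapping each continuous value to the index of the bin it falls in. A minimal sketch using the temperature groups just mentioned (the bin edges are the example values from the text, not anything MAML mandates):

```python
import bisect

# Custom bin edges for temperature, matching the groups above:
# <-10, -10..0, 0..10, 10..20, 20..40, 40+
EDGES = [-10, 0, 10, 20, 40]

def quantize(value, edges=EDGES):
    """Return the index of the bin the value falls into (0..len(edges))."""
    return bisect.bisect_right(edges, value)

# Each temperature is replaced by its bin number, giving a small set
# of discrete values instead of a continuous range.
bins = [quantize(t) for t in (-15, -3, 7, 12, 35, 50)]
```

The Quantize module can also pick the edges for you automatically, which is what we’ll do next.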

    So let’s do those two steps in ML Studio. First add in another Project Columns module under the Join to remove the time-based columns..


    Now add in a Quantize module underneath that.  Quantization can work in several ways depending on the binning and quantile normalization options we select in this module.  We can design our own bins into which each value falls by setting the bin edges (like the temperature groups I just mentioned).  If we decide to let the Quantize module set the bin edges automatically then we can choose the algorithm it will use and the quantile normalization method, and we can set how many bins we want.  For our initial training we’ll select 10 bins for all of the numerical data we are going to base our predictions on, and we’ll overwrite the columns with the new values..


    If we now visualize the left-hand output of the Quantize module we can see that there are now 10 unique values for each of our numerical columns..


    Now we can actually get on with the training itself, although in the real world we might have to revisit some of our earlier decisions to improve the output.  The process of training is much like learning simple arithmetic: we are introduced to a new symbol, say +, and given examples of how it works.  To confirm we’ve got it, we use it ourselves and compare our results against some more examples.  In machine learning we do the initial training by carving out a sample from our data set.  In MAML we can use the Split module to do this; in my last post we used it just to get the data for one airport, but now we are going to use it to take a random sample. 

    To create our training set we drag a Split module to the bottom of the design surface, connect its input to the left-hand output of the Quantize module and set its properties as follows to give a 20% random sample we can use for training..
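    What the random split does under the hood can be sketched in a few lines of Python – shuffle the rows with a fixed seed (so the split is repeatable) and cut off the requested fraction. This is an illustration of the idea, not MAML’s actual implementation:

```python
import random

def split(rows, fraction=0.2, seed=42):
    """Randomly carve a training sample out of a data set, like the
    Split module: left output = the fraction, right output = the rest."""
    rng = random.Random(seed)      # fixed seed => repeatable split
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

# 20% of 1000 rows go to the training (left) output.
train, rest = split(list(range(1000)))
```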


    The question now is which algorithm to use to make our flight delay prediction. This is a huge topic in its own right and requires some prior knowledge of statistics to make sense of what these algorithms do.  You can also elect to ignore all the good work Microsoft have done in providing some really powerful algorithms used in Xbox and Bing, and bring your own algorithm written in the open source R language. In MAML there are three types of built-in algorithm – classification, clustering and regression – so what exactly do those terms mean?

    • Classification is where we want to determine which group something belongs to – in our case we want to determine whether a flight is in the delayed group or the on-time group.
    • Clustering is similar, but this time we don’t know in advance what the groups are.  For example we might take a collection of purchases from a shopping web site, and MAML would work out what things were bought with what other things to provide recommendations when a customer places something in their basket. 
    • Regression is the process of fitting points to a curve or line (hence linear regression), so here we are looking at predicting a continuous variable, say house price, based on attributes of a property such as age, area, distance to the nearest station, etc.    
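    To make the regression idea concrete, here is a minimal least-squares straight-line fit in pure Python on made-up house-price data (the numbers are invented for illustration):

```python
# Minimal least-squares linear regression: find the slope and intercept
# of the best-fit line through a set of (x, y) points.
def linear_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx   # slope, intercept

# e.g. house price (in thousands) against floor area (square metres)
areas = [50, 70, 90, 110]
prices = [150, 190, 230, 270]
slope, intercept = linear_fit(areas, prices)
```

Once fitted, the line predicts a price for any area – a continuous output, which is exactly what separates regression from classification.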

    So if we expand the Machine Learning | Initialize Model | Classification section on the left of ML Studio we can see some curiously named modules like Multiclass Decision Jungle which we can use..



    • Multiclass Decision Forest – creates a multiclass classification model using a decision forest
    • Multiclass Decision Jungle – creates a multiclass classification model using a decision jungle
    • Multiclass Logistic Regression – creates a multiclass logistic regression classification model
    • Multiclass Neural Network – creates a multiclass classifier using a neural network algorithm
    • One-vs-All Multiclass – creates a one-vs-all classification model
    • Two-Class Averaged Perceptron – creates an averaged perceptron binary classification model
    • Two-Class Bayes Point Machine – creates a Bayes point machine binary classification model
    • Two-Class Boosted Decision Tree – creates a binary classifier using a boosted decision tree algorithm
    • Two-Class Decision Forest – creates a two-class classification model using a decision forest
    • Two-Class Decision Jungle – creates a two-class classification model using a decision jungle
    • Two-Class Logistic Regression – creates a two-class logistic regression model
    • Two-Class Neural Network – creates a binary classifier using a neural network algorithm
    • Two-Class Support Vector Machine – creates a support vector machine binary classification model

    At the time of writing, the help on MSDN about what these do is pretty opaque to mere mortals, and to add to the fun many of these have parameters that can be tweaked as well.  At this point it’s worth recapping what data we have and what we are trying to do. In our case there is probably some sort of relationship between some of our columns – for example visibility, humidity, wind speed and temperature – and our two outcomes are that the flight is delayed or it isn’t.  So let’s jump in, as we can’t really break anything, and take Microsoft’s advice that Two-Class Boosted Decision Tree is really effective at predicting one of two outcomes, especially when the data is sort of related.

    The other thing we can do while training our model is work out which of the columns are actually informing the decision-making process.  Eliminating non-significant columns both improves computation time (which we are paying for!) and improves accuracy by eliminating spurious noise.  In our experiment it might be that the non-weather-related attributes are a factor, such as the carrier (airline) and the originating and destination airports.  Rather than guessing or randomly eliminating columns with yet another Project Columns module, we can get MAML to do the work for us by using the Sweep Parameters module. Let’s see this in action – drag the Sweep Parameters module onto the design surface. Note it has three inputs, which we connect as follows:

    • the algorithm we want to use, which is the Two-Class Boosted Decision Tree module
    • the training data set, which is the 20% split from the Split module – so the left-hand output
    • the data used to evaluate whether the sweep is working, which is simply the rest of the data from the Split module.

    So how does the evaluation bit work? Simply by selecting the column that has the answer we are looking for – whether or not a flight is delayed, which is the ArrDel15 column.  We can leave the rest of the sweep parameters alone for now and rerun the model.
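    The core idea of a parameter sweep is simple enough to sketch in a few lines: try each candidate value of a model parameter, score each trained model on the evaluation data, and keep the best. Everything below is a toy illustration of that loop, not MAML’s internals – the "model" is just a departure-delay threshold:

```python
# A toy version of what the Sweep Parameters module does: try each
# candidate parameter value, score on evaluation data, keep the best.
def sweep(candidates, train_fn, score_fn, train_data, eval_data):
    best_param, best_score = None, -1.0
    for p in candidates:
        model = train_fn(train_data, p)
        score = score_fn(model, eval_data)
        if score > best_score:
            best_param, best_score = p, score
    return best_param, best_score

# Toy "model": flag a flight as delayed when DepDelay exceeds a threshold.
def train(data, threshold):
    return threshold            # nothing to learn in this toy model

def accuracy(threshold, data):
    hits = sum((dep > threshold) == late for dep, late in data)
    return hits / len(data)

data = [(0, False), (5, False), (20, True), (45, True)]  # (DepDelay, late?)
best, score = sweep([0, 10, 15, 30], train, accuracy, data, data)
```

The real module sweeps over the Boosted Decision Tree’s own parameters (tree count, depth and so on) in just this spirit.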


    Each of the outputs from the sweep looks very different from what we have in our data set, and although accuracy (at an average of 0.91) is in one of the columns, it’s difficult for a data science newbie like me to work out how good the predictions are. Fortunately we can get a better insight into what’s going on by using the Score Model and Evaluate Model modules.  To do this we connect the Score Model module to the left output of the Sweep Parameters module; underneath this we can then drag on the Evaluate Model module, connect the output of the Score Model module to its left-hand input and rerun the experiment (neither of these has any settings).


    If we now visualize the output of the Evaluate Model module we get this graph..


    What is this telling us? The key numbers are the figures just under the graph, where we can see that 302,343 late flights were correctly predicted as late, but 98,221 flights were identified as being late that were actually on time; 1,466,087 flights were correctly predicted as being on time, but 61,483 were predicted to be on time that were actually late.  The ROC graph is also an indication of accuracy: the greater the area between the blue curve and the straight diagonal line, the better, so the closer the blue curve is to an upside-down L shape the better the model.
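    Those four cells form a confusion matrix, and the usual summary numbers can be derived from them by hand. A small sketch (the figures passed in are illustrative, in the same ballpark as the screenshot rather than exact):

```python
# Derive summary metrics from the four confusion-matrix cells
# reported by the Evaluate Model module.
def metrics(tp, fp, tn, fn):
    return {
        "accuracy":  (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),  # of flights flagged late, how many were
        "recall":    tp / (tp + fn),  # of actually late flights, how many we caught
    }

# Illustrative numbers only, not the exact figures from the screenshot.
m = metrics(tp=302343, fp=98221, tn=1466087, fn=61483)
```

Accuracy alone can be misleading when one class dominates (most flights are on time), which is why precision and recall are worth computing too.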

    Could we improve on this score, and how might we do that?  We could tweak the parameters for the Two-Class Boosted Decision Tree and what parameters we sweep for, or we could see if there is a better algorithm.  For this I would use the technique I use to find a good wine: blind taste two wines and pick my favourite, hang on to that, then compare it with the next wine, and continue until I have my favourite of the group.  In MAML we can do this by adding another branch to the second input of the Evaluate Model module and comparing how they do.

    By way of example we can drag on the Two-Class Logistic Regression module, copy and paste the Score Model and Sweep Parameters modules, and connect these up as shown below..


    and if we visualize the Evaluate Model module we now get two ROC curves, and by highlighting them we can see the detailed results, where the red curve is the right-hand input – our Two-Class Logistic Regression module


    We can see that the Two-Class Boosted Decision Tree module slightly outperforms Two-Class Logistic Regression, so we should stick with that or compare it with something else.

    So not only is MAML really easy to set up, it also provides sophisticated tools to help you evolve towards the right algorithm to meet your business need – and we still haven’t had to enter any code or get too deep into maths and stats.  What now? Well, we could output the data to something like Power BI to show the results, but we might also want to use the experiment we have made to predict flights on a departure-by-departure basis in some sort of web site, and that’s what we’ll look at in my next post in this series.

  • How to train your MAML – Looking at the data in SQL Azure

    In my last post we saw how to clean, transform and join datasets. I also mentioned I had trouble doing the join at all, and even now it’s not quite right, so how can we look at the data and find out what’s going on?  The visualisation option only shows a few rows of the 2 million in the dataset and there’s not really anything in ML Studio to help.  However, we can export the data in a number of ways: to Azure storage, either as a table or a blob, or directly into SQL Azure.  The latter seems more useful as we need to investigate what’s going on when the data is joined, but a little bit of pre-work is required as ML Studio won’t create the tables for us – we’ll have to do that ourselves.  We also don’t need all the data, just a consistent set to work on, so let’s start by understanding the Split module in ML Studio, which we’ll need later on to train our experiment.

    Search for the Split module in ML Studio and drag it onto the design surface, connect it to the last Apply Math Operation module in the flight delay data set chain, and set the split properties as shown..


    What the split does is send the filtered data that meets our criteria to result dataset1 (on the left) and the data that doesn’t to the result set on the right (result dataset2). As we’ll see later, normally we would use this to split out some of the data at random, but in this case we are using a relative expression – \"DestAirportID" > 11432 & <11434 (I did try \"DestAirportID" = 11433 but I got an error!) – to give us just the flights arriving at one airport. If we run the model now and visualise the data in the left output of the Split module we’ll see just one value (11433).
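    In plain code the Split module in this mode is just routing each row by a predicate – matching rows to the left output, everything else to the right. A quick sketch of the idea (the row data is invented):

```python
# What the relative expression "DestAirportID > 11432 & < 11434" does,
# written out: matching rows go left, the rest go right.
def split_by_expression(rows, predicate):
    left = [r for r in rows if predicate(r)]
    right = [r for r in rows if not predicate(r)]
    return left, right

rows = [{"DestAirportID": i} for i in (11432, 11433, 11433, 11434)]
left, right = split_by_expression(
    rows, lambda r: 11432 < r["DestAirportID"] < 11434)
```

Only the rows with DestAirportID 11433 satisfy the expression, which is exactly the one-airport subset we want.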

    Note that when we rerun the experiment, only the bits that have changed (and whatever they affect) are actually run – in this case only the split itself is being run; the rest is being read from cache (in the storage account).

    Now we need somewhere to put the data which in this case will be SQL Azure.  As with MAML everything can be done from the browser.

    Go to the Azure management portal, select the SQL Databases option and, at the bottom of the screen, click on the plus to create a new database with quick create. Mine’s in North Europe, it’s the Basic tier with no replication and is 1 GB (you won’t need that much)..


    You will also be prompted for an administrator ID and password. Once the database is created we need to be able to access it remotely, so we need to open up the firewall, much as we would do if the database was on any remote server in our data centre.  To do this, in SQL Azure click on the configure option ..


    You will see your current IP address here, which you can then use to make a rule (I have hidden mine in the screenshot above). Now we can go back to the dashboard and click on the hyperlink to design or run queries against our new database from the SQL management portal (you’ll be asked to log in first).  Now we can add in a table for our flight delay data, as ML Studio won’t do that for us.  It needs to have the right data types, so rather than you creating it manually, here is the query you can run to create it.

    USE ML

    CREATE TABLE [dbo].[FlightDelay] (
        [ID]              INT          IDENTITY (1, 1) NOT NULL,
        [Month]           INT          NOT NULL,
        [DayofMonth]      INT          NOT NULL,
        [Carrier]         NVARCHAR (5) NOT NULL,
        [OriginAirportID] INT          NOT NULL,
        [DestAirportID]   INT          NULL,
        [DepDelay]        INT          NULL,
        [CRSArrTime]      INT          NULL,
        [ArrDel15]        INT          NULL
    );

    CREATE CLUSTERED INDEX [IX_FlightDelay_0]
        ON [dbo].[FlightDelay]([ID] ASC);

    Notice that there is a separate ID column with the identity type so that we have a primary key for each row.

    Now we can see how to export data to SQL Azure from ML Studio.  Drag the Data Writer module onto the design surface and connect it to the left-hand output of the Split module..


    Set the module properties as follows;

    • Data destination: SQL Azure
    • Database server name: your database server name
    • Database name: ML (in my case)
    • Server account name and password: to match the admin account and password you set up for your database
    • Accept any server certificate: checked
    • Comma-separated list of columns to be saved: Month,DayofMonth,Carrier,OriginAirportID,DestAirportID,CRSArrTime,DepDelay,ArrDel15
    • Data table name: FlightDelay
    • Comma-separated list of datatable columns: Month,DayofMonth,Carrier,OriginAirportID,DestAirportID,CRSArrTime,DepDelay,ArrDel15
    • Number of rows to be written per SQL Azure operation: 50 (the default)

    Note the column names above are case sensitive and the number of input and output columns must be the same.  Also be aware that if you run the experiment again you’ll add more rows to the SQL tables each time, so remember to empty the tables before a run –

    truncate table FlightDelay; truncate table AirportWeather

    If we run the experiment now we will populate the FlightDelay table in SQL Azure, remembering to truncate it first before each rerun (I couldn’t see how to make ML Studio do this for us).  Once that’s working OK we can then repeat the exercise for the weather data:

    • In the SQL Azure management portal create a new table AirportWeather

     USE [ML]
    CREATE TABLE [dbo].[AirportWeather] (
        [ID]               INT IDENTITY (1, 1) NOT NULL,
        [AirportID]        INT NULL,
        [Month]            INT NULL,
        [Day]              INT NULL,
        [Time]             INT NULL,
        [TimeZone]         INT NULL,
        [Visibility]       INT NULL,
        [DryBulbCelsius]   INT NULL,
        [WetBulbCelsius]   INT NULL,
        [DewPointCelsius]  INT NULL,
        [RelativeHumidity] INT NULL,
        [WindSpeed]        INT NULL
    );

    CREATE CLUSTERED INDEX [IX_AirportWeather_0]
        ON [dbo].[AirportWeather]([ID] ASC);

    • Copy and paste the existing Split module, connect it to the last Apply Math Operation module in the weather data chain, and change the relative expression to  \"AirportID" > 11432 & <11434
    • Copy and paste the existing Data Writer module and connect it to the new Split module for the weather data. Change the two settings for the columns to be used to AirportID,Month,Day,Time,TimeZone,Visibility,DryBulbCelsius,WetBulbCelsius,DewPointCelsius,RelativeHumidity,WindSpeed


    Now we’ll leave ML Studio and look at what we have in the SQL Azure portal. Click on new query and paste the following in..

    Select F.Month, F.DayofMonth, F.CRSArrTime, F.DestAirportID, F.DepDelay, F.OriginAirportID, A.DewPointCelsius, A.DryBulbCelsius, A.RelativeHumidity, A.Visibility, A.WetBulbCelsius, A.WindSpeed
    from FlightDelay F
    inner join AirportWeather A
        on F.DestAirportID = A.AirportID  and F.Month = A.Month  and F.DayofMonth = A.Day
        and F.CRSArrTime = A.Time
    order by
    F.Month, F.DayofMonth, F.CRSArrTime

    and notice that we get back 61,231 rows, compared to the row count of 62,211 for the FlightDelay table, which means we are losing 980 rows of data.  887 of these are down to the fact that there are no rows in the weather data for hour 0 (midnight) even though there are in the flight data (midnight in the weather data shows as 24). So something is wrong with the mechanism for matching the flight arrival time with the nearest timed weather reading.  This was not a deliberate error on my part – I based all this on the sample in ML Studio – but it does show two things:

    • Always check your findings – in this case the wrong weather data was being used, which would affect the accuracy of our predictions
    • Don’t blindly trust someone else’s work!
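    The midnight mismatch is easy to reproduce in miniature. This sketch uses an in-memory SQLite database (standing in for SQL Azure, with the tables cut down to just the join columns) to show how an inner join silently drops the hour-0 flights, and how a left anti-join finds them:

```python
import sqlite3

# Tiny reconstruction of the row-loss problem: flights use hour 0 for
# midnight while the weather data uses 24, so an inner join on the time
# column silently drops those flights.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE FlightDelay (DestAirportID INT, ArrHour INT);
    CREATE TABLE AirportWeather (AirportID INT, Hour INT);
    INSERT INTO FlightDelay VALUES (11433, 0), (11433, 9);
    INSERT INTO AirportWeather VALUES (11433, 24), (11433, 9);
""")
matched = con.execute("""
    SELECT COUNT(*) FROM FlightDelay F
    JOIN AirportWeather A
      ON F.DestAirportID = A.AirportID AND F.ArrHour = A.Hour
""").fetchone()[0]
# Find the dropped rows with an anti-join (LEFT JOIN ... IS NULL):
lost = con.execute("""
    SELECT COUNT(*) FROM FlightDelay F
    LEFT JOIN AirportWeather A
      ON F.DestAirportID = A.AirportID AND F.ArrHour = A.Hour
    WHERE A.AirportID IS NULL
""").fetchone()[0]
```

The same anti-join pattern, run against the real tables in SQL Azure, is what pins down exactly which flights are losing their weather readings.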

    Anyway, in this case it’s easy to fix: all we need to do is subtract one hour from the weather reading time by adding in another Apply Math Operation module as shown below..


    If we run our query in SQL Azure again we get back 62,159 rows, which means we are still missing a few rows of weather data and so certain flights are being dropped when we do the join.  If we were using this for BI then we would need to fix this, but what we need for machine learning is clean data, and now we have made this fix we have a good set of data on which to make predictions – and that’s what we will start to look at in my next post.

  • How to train your MAML – Refining the data

    In my last post we looked at how to load data into Microsoft Azure Machine Learning using the browser-based ML Studio.  We also started to look at the data around predicting delayed flights and identified some problems with it, and this post is all about getting the data into the right shape to ensure that the predictive algorithms in MAML have the best chance of giving us the right answer.   

    Our approach is threefold:

    • To discard the data we don’t need – either columns that aren’t relevant or that are derived from other data – and to discard rows where data is missing for the columns (features, in machine learning speak)
    • To tag the features correctly as being numbers or strings and as categorical or not.  Categorical in this context means that the value puts the row in a group rather than being continuous: AirportID is categorical as it puts a row into a group of rows for the same airport (AirportID 1 has nothing to do with ID 3 or 4), whereas temperature is a continuous variable whose numbers do represent points on a line.
    • To join the flight delay dataset to the weather dataset on the airport and the date/time. In my last post I mentioned that we could join the weather data in twice – once to the departure airport and once to the arrival airport – and indeed the sample flight delay prediction experiment does exactly this, but I think a simpler approach is to model the arrival delay on the fact that some flights have a delayed departure time, which may or may not be influenced by the weather at the departure airport.

    Let’s get started..

    Open ML Studio, create a new experiment, give it a suitable name and drag the Flight Delays and Weather datasets onto the design surface so it looks like this ..


    Clean the data

    As before, we can right-click on the circle at the bottom of the dataset and select Visualize to see what we are working with – for example here’s the weather data.


    What is odd here is that the data is not properly typed: some of the numeric data is in columns marked as string, such as the temperature columns in the weather dataset.  I spent ages trying to work out how to fix this, and the answer turns out to be to use the Convert to Dataset module, which does this automatically.  So our first step is to drag two of them onto the design surface and connect one to each of our datasets..


    If we run our model (Run is at the bottom of the screen) we can then visualize the output of the Convert to Dataset steps, and now our data is correctly identified as being numeric etc.

    The next step is to get rid of any unwanted columns, and this is simply a case of using the Project Columns module (to find it just use the search at the top of the modules list).  You can either start with a full list of columns and remove what you don’t need, or start with an empty list and add what you do need. So let’s drag it onto the design surface and then drag a line from the Flight Delays Data to it.  It’ll have a red X against it as it’s not yet configured, and we can do this from Select Columns in the task pane


    Here I have selected all columns and then excluded Year, Cancelled, ArrDelay, DepDel15, and CRSDepTime.  At this point we can check that what we get is what we wanted by clicking the Run button at the bottom of the screen.  

    Note: it’s only when we run stuff in ML Studio that we are charged for computation time; the rest of the time we are just charged for the storage we are using (for our own datasets and experiments)

    As before at each stage we can visualize the data that’s produced by right clicking on its output node..


    Here we can see that we have one column, DepDelay, that has missing values, so the next thing we need to do is get rid of those. We can use the Missing Values Scrubber module for this, so search for it, drag it onto the design surface and drag a connector from the output of the Project Columns module to it.  We then need to set its properties to say how to deal with the missing values.  As we have such a lot of clean data we can simply ignore any rows with missing values, by setting the top option to remove the entire row..
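    What “remove entire row” amounts to is simple to express in code: drop any row where one of the selected columns has no value. A minimal sketch with invented rows:

```python
# What the Missing Values Scrubber's "remove entire row" option does:
# keep only rows where every selected column has a value.
def scrub(rows, columns):
    return [r for r in rows if all(r.get(c) is not None for c in columns)]

rows = [
    {"DepDelay": 5,    "ArrDel15": 0},
    {"DepDelay": None, "ArrDel15": 1},   # missing DepDelay -> dropped
    {"DepDelay": 30,   "ArrDel15": 1},
]
clean = scrub(rows, ["DepDelay", "ArrDel15"])
```

Dropping rows outright is fine here because missing values are a tiny fraction of two million rows; with scarcer data you would impute instead, which the scrubber also supports.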


    We can now run the experiment again to check we have no more missing values.

    Now we need to do some of this again for the weather dataset. We add in another Project Columns module to select the columns we need – this time I am starting with an empty list and specifying which columns to include..


    and the Missing Values Scrubber module again, set to remove the entire row…


    Tag the Features

    Now we need to change the metadata on some of the columns to ensure ML Studio handles them properly. Here I cheated, which shows you another feature of ML Studio.   Remember that some of the numbers in our data are codes rather than continuous values – for example the airport codes and the airline code. We need to tell ML Studio that these are categorical by using the Metadata Editor module. To do this we are going to cheat by simply copying that module from another experiment.  Open another browser window, go into the ML Studio home page and navigate to the flight delay prediction sample.  Find the Metadata Editor module in there, copy it to the clipboard, then go back to the browser with our experiment and paste it in; you should see that this module is set to make Carrier, OriginAirportID and DestAirportID categorical...


    Join the datasets

    Now we have two sets of clean data, we need to join them.  They both have an airport ID, month and day, and the flight delay dataset has an arrival time to the nearest minute.  However, the weather data is taken at 56 minutes past the hour every hour and is in local time with a separate time zone column. So what we need to do is round the flight arrival time and the weather reading time to whole hours as follows:

    For the flight delay arrival time

    1. Divide the arrival time by 100

    2. Round down the arrival time to the nearest hour

    For the weather data time

    3. Divide the weather reading time by 100 to give the local time in hours

    4. Round up to the nearest hour
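    The four steps above can be written out in a couple of lines each, which makes it easier to see why they produce matching join keys. Times here are in HHMM form (e.g. 1756 means 17:56), so dividing by 100 and rounding gives whole hours:

```python
import math

# Steps 1-2: flight arrival time, divide by 100 and round DOWN.
def flight_hour(crs_arr_time):
    return math.floor(crs_arr_time / 100)

# Steps 3-4: weather reading time, divide by 100 and round UP
# (readings are at 56 past, so 16:56 belongs to hour 17).
def weather_hour(reading_time):
    return math.ceil(reading_time / 100)

# A 17:30 arrival and the 16:56 weather reading both become hour 17,
# so the Join module can match them.
fh, wh = flight_hour(1730), weather_hour(1656)
```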

    So how do we do that in ML Studio? The answer is one step at a time, making repeated use of the Apply Math Operation module.  Help is pretty non-existent for most of these modules at the time of writing, so experimentation is the name of the game – and I hope I have done that for you. We’ll place four copies of the Apply Math Operation module on the design surface, one for each step above (so two linked to the weather dataset and two to the flight delay dataset) ..


    Notice the comments I have added to each module (right-click and select Add comment); here are the settings for each step..

    Step 1


    Note the output mode of Inplace, which means that the value is overwritten and we get all the other columns in the output as well – make sure this is set for each of the four steps.

    Step 2


    Step 3


    Step 4


    Now we can use the Join module (again, just search for Join and drag it onto the design surface) to connect our datasets together.  Not surprisingly this module has two inputs and one output, and we’ll see several modules with multiple inputs and outputs in future.  Connect the last module in each of our dataset chains into the Join and set the properties for the join as shown..


    so on the left (the flight data) we have Month, DayofMonth, CRSArrTime and DestAirportID, and on the right (the weather data) we have Month, Day, Time and AirportID.
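    If you are used to SQL or pandas, what the Join module is doing is an inner join on those key pairs.  As a sketch only (hypothetical rows, not part of the experiment):

```python
import pandas as pd

# One hypothetical row per dataset, keyed the same way as the Join module
flights = pd.DataFrame({
    "Month": [4], "DayofMonth": [19], "CRSArrTime": [17],
    "DestAirportID": [12478], "ArrDel15": [0],
})
weather = pd.DataFrame({
    "Month": [4], "Day": [19], "Time": [17],
    "AirportID": [12478], "DryBulbCelsius": [11.0],
})

# Inner join on the four key columns from each side
joined = flights.merge(
    weather,
    left_on=["Month", "DayofMonth", "CRSArrTime", "DestAirportID"],
    right_on=["Month", "Day", "Time", "AirportID"],
)
print(len(joined))  # one matching row
```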

    I have to be honest, it took a while to get here and initially I got zero rows back.  Even now it’s not quite perfect, as I have slightly more rows than I started with, which I have tracked down to the odd hour in the weather data having two readings.  Finding that kind of data problem is beyond what you can do in ML studio in the preview, so in my next post I’ll show you your options for examining this data outside of ML studio.
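    As a taste of that, one way to spot the problem is a simple group-and-count over the join keys.  This sketch uses pandas on hypothetical readings:

```python
import pandas as pd

# Hypothetical weather readings: hour 17 appears twice for the same airport/day
weather = pd.DataFrame({
    "AirportID": [12478, 12478, 12478],
    "Month":     [4, 4, 4],
    "Day":       [19, 19, 19],
    "Time":      [17, 17, 18],
})

# Count readings per airport/day/hour; anything over 1 duplicates join rows
counts = weather.groupby(["AirportID", "Month", "Day", "Time"]).size()
duplicates = counts[counts > 1]
print(duplicates)
```

    Each duplicated hour matches the same flight twice in the join, which is exactly how the extra rows appear.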

  • How to train your MAML – Importing data

    In my last post I split the process of using Microsoft Azure Machine Learning (MAML) down to four steps:

    • Import the data
    • Refine the data
    • Build a model
    • Put the model into production.

    Now I want to go deeper into each of these steps so that you can start to explore and evaluate how this might be useful for your organisation.  To get started you’ll need an Azure subscription; you can use the free trial, your organisation’s Azure subscription or the one that comes with MSDN.  You then need to sign up for MAML as it’s still in preview (note that using MAML does incur charges, but if you have MSDN or a trial these are capped so you don’t run up a bill without knowing it)..

    You’ll now have an extra icon for MAML in your Azure Management Portal, and from here you’ll need to create an ML Workspace to store your Experiments (models).  Here I have one already called HAL, but I can create another if I need to by clicking on New at the bottom of the page, selecting Machine Learning and then following the Quick Create wizard..


    Notice that as well as declaring the name and owner I also have to specify a storage account where my experiments will reside, and at the moment this service is only available in Microsoft’s South Central US data centre.  Now I have somewhere to work I can launch ML Studio from the link on the Dashboard..


    This is simply another browser based app which works just fine in modern browsers like Internet Explorer and Chrome..


    There’s a lot of help on the home page, from tutorials and samples to a complete list of functions and tasks in the tool.  However this seems to be aimed at experienced data scientists who are already familiar with the concepts of machine learning.  I am not one of those, but I think this stuff is really interesting, so if this is all new to you too then I hope my journey through it will be useful.  I won’t be offended if you break off now and check these resources because you went to University and not to Art College like me!

    In my example we are going to look at predicting flight delays in the US based on one of the included datasets.  There is an example experiment for this, but there isn’t an underlying explanation of how to build up a model like it, so I am going to try and do that for you. The New option at the bottom of the ML studio screen allows you to create a new experiment, and if you click on this you are presented with the actual ML studio design environment..


    ML studio works much like Visio or SQL Server Integration Services: you just drag and drop the boxes you want on the design surface and connect them up. But what do we need to get started?

    MAML needs data and there are two ways we can import it - either by performing a data read operation from some source or by using a dataset. At this point you’ll realise there are lots of options in ML Studio, so the search box is a quick way of getting to the right thing if you know it’s there.  If we type reader into the search box we can drag the Reader module onto the design surface to see what it does..


    The Reader module comes up with a red x as it’s not configured, and to do that there is a list of properties on the right hand side of the screen.  For example if the data we want to use is in Azure blob storage then we can enter the path and credentials to load that in.  There are also options for an http feed, SQL Azure and Azure Table Storage, as well as HiveQuery (to access Hadoop and HDInsight) and PowerQuery. The PowerQuery option is a bit misleading, as it’s actually a way of getting OData, of which a Power Query feed is one example. Having looked at this we’ll delete it and work with one of the sample datasets.

    Expand the data sources option on the left and you’ll see a long list of samples, from IMDB film titles to flight delays and astronomy data. If I drag the Flight Delays Data dataset onto the design surface I can then examine it by right clicking on the output node at the bottom of it and selecting Visualize..


    this is essential as we need to know what we are dealing with and ML Studio gives us some basic stats on what we have..


    MAML is fussy about its diet and here are a few basic rules:

    • We just need the data that is relevant to making a prediction.  For example all the rows have the same value for Year (2013) so we can exclude that column.
    • There shouldn't be any missing values in the data we are going to use to make a prediction, and 27444 rows of this 2719418 row dataset have missing departure values, so we will want to exclude those.
    • No feature should be dependent on another feature, much as in good normalisation techniques for database design.  DepDelay and DepDel15 are related in that if DepDelay is greater than 15 minutes then DepDel15 = 1.  The question is which one is the best at predicting the arrival delay, specifically ArrDel15, which is whether or not the flight is more than 15 minutes late.
    • Each column (feature in data science speak) should be of the same order of magnitude.
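    The first three rules can be sketched in a few lines of pandas (hypothetical values; in the experiment itself this is all done with ML Studio modules):

```python
import pandas as pd

# A hypothetical slice of the flight delay data
df = pd.DataFrame({
    "Year":     [2013, 2013, 2013],     # constant, so no predictive value
    "DepDelay": [20.0, None, 5.0],      # one row has a missing departure value
})

df = df.drop(columns=["Year"])          # rule 1: keep only relevant columns
df = df.dropna(subset=["DepDelay"])     # rule 2: exclude rows with missing values

# Rule 3: DepDel15 is fully determined by DepDelay, so keep only one of the pair
df["DepDel15"] = (df["DepDelay"] > 15).astype(int)
print(df)
```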

    However even after cleaning this up there is also some key data missing to answer our question “why are flights delayed?”. It might be problems associated with the time of day or day of the week, the carrier, or difficulties at the departing or arriving airport, but what about the weather, which isn’t in our dataset?  Fortunately there is another dataset we can use for this – the appropriately named Weather dataset.  If we examine this in the same way we can see that it is for the same time period and has a feature for airport, so it will be easy to join to our flight delay dataset. The bad news is that most of the data we want to work with is of type string (like the temperatures) and there is redundancy in it as well, so we’ll have some clearing up to do before we can use it.

    Thinking about my flying experiences, it occurred to me that we might need to work the weather dataset in twice to get the conditions at both the departing and the arriving airport. Then I realised that any delay at the departing airport might be dependent on the weather there, and we already have data for the departure delay (DepDelay), so all we would need to do is join in the weather at the arriving airport, which we’ll look at in the next post in this series where we prepare the data based on what we know about it.

    Now we know more about our data we can start to clean and refine it, and I’ll get stuck into that in my next post. Just one thing before I go – we can’t save our experiment yet as we haven’t got any modules on there to do any processing, so don’t panic, we’ll get to that next.

  • Learning Machine Learning using a Jet Engine

    In this post I want to try and explain what machine learning is and put it into context with what we used to do when analysing and manipulating data.  When I started doing all this the hot phrase was Business Intelligence (BI), but this was preceded by EIS (Executive Information Systems) and DSS (Decision Support Systems).  All of these were to a large extent looking backwards, like driving a car using just the rear view mirror.  There was some work being done on trying to look ahead (predictive analysis) and this was typically achieved by applying data mining techniques like regression (fitting points to a line or curve).

    Nowadays the hot topic is Machine Learning (ML), so is that just a new marketing phrase or is this something that’s actually new?  Data mining and ML have a lot in common: complex algorithms that have to be trained to build a model that can then be used against some set of data to derive something that wasn’t there before. This might be a numeric (continuous) value or a discrete value like the name of a group or a simple yes/no.  What I think makes ML different is that it’s there for a specific purpose, where data mining is more like a fishing exercise to discover hidden relationships, one such use being predictive analytics.  So data mining could be considered a subset of ML, and one use of ML is to make predictions.
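    That distinction between continuous and discrete outputs can be shown in a few lines of Python (the numbers here are made up): fitting points to a line gives a continuous prediction, and putting a threshold on top of it gives a discrete yes/no.

```python
import numpy as np

# Made-up observations: fit points to a line (regression, as in data mining)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
slope, intercept = np.polyfit(x, y, 1)

predicted = slope * 5.0 + intercept   # a continuous value
is_high = predicted > 8.0             # a discrete yes/no derived from it
print(predicted, is_high)
```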

    If I look at how Microsoft has implemented ML in Azure (which I will refer to as MAML), then there is a lot of processes around data acquisition before training and publishing occur.  In this regard we might relate this to a modern commercial jet engine.


    • Suck the data in to the ML workbench..


    This can be done either from a raw source or via HDInsight (Hadoop on Azure), which means MAML can work with big data.  Note that in the jet engine diagram much of the air bypasses the engine; similarly, if we are using big data in ML then we may ignore large chunks of it as not being relevant to what we are doing.  A good example is Twitter – most tweets aren’t relevant because they don’t mention my organisation.

    • Squeeze the data.


    If we haven’t sourced the data from something like HDInsight we need to clean it up, to ensure a high quality output and to understand its structure – type, cardinality etc.

    • Bang.


    In a jet engine we introduce fuel and get a controlled explosion; in ML we apply our analysis to get results and provide momentum for the organisation.  Specifically this is where we build and refine our analytical model by taking an algorithm, such as one used in data mining, and training it against a sample set of data. MAML has specific features to support this: a special Split task for carving out data for model training and an Evaluate task to tell you how well your model is performing by visualising the output against an ideal (an upside down L curve).
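    Those Split and Evaluate tasks map onto ideas you can try anywhere.  As an illustration only (this sketch uses scikit-learn on synthetic data, not MAML itself): hold some rows back, train on the rest, then score the model, where an area under the curve near 1.0 means the curve hugs that upside down L.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data: the label is almost entirely driven by the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# "Split": carve out 30% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Train, then "Evaluate" against the held-back data
model = LogisticRegression().fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC: {auc:.2f}")
```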

    • Blow. 


    Having established that your model works you’ll want to unleash it on the world and share it, which is analogous to the thrust or work produced by a jet engine.  In MAML we can expose the model as a web service or a dataset, which can be used in any number of ways to drive a decision making process.
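    A published model is then just a REST endpoint.  The sketch below is a hypothetical Python client – the URL, API key and payload shape are all placeholders, as the real ones come from the API help page generated for your service:

```python
import json
import urllib.request

def build_payload(row):
    """Wrap one input row in a JSON envelope (shape is illustrative only)."""
    return json.dumps({"Inputs": {"input1": [row]}, "GlobalParameters": {}})

def score(url, api_key, row):
    """POST one row to the scoring service and return the parsed response."""
    req = urllib.request.Request(
        url,
        build_payload(row).encode("utf-8"),
        {"Content-Type": "application/json",
         "Authorization": "Bearer " + api_key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Hypothetical call - one transaction per request:
# score("https://example.net/score", "your-api-key",
#       {"Month": 4, "DayofMonth": 19, "DepDelay": 20})
```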

    Jet engines don’t exist in isolation; typically they are attached to aircraft and it is the aircraft that controls where the engine goes.  In ML we are going to have a project to set our direction, and ML might only be a small part of this, in the way that the product recommendation engine on a website like Amazon is only a small part of the customer experience.  So as with all the varying names for projects that use data to drive businesses forward, we need to be aware that the analytics bit is a small part of the whole; we need to be aware of data quality, customer culture and all of the baggage that is essential to a successful implementation.  There is also one extra complication that comes from the data mining world: it’s not always possible to see how the result was arrived at, so we have to build trust in the results, particularly if they contradict our intuition and experience.

    So the good news for us experienced BI professionals is that we can take many of the lessons we have learnt and apply them to ML, and with MAML we don’t need to know too much about the data mining algorithms unless we want to.

  • Adventure Works!

    My title for this post is a pun on the Adventure Works databases and samples that have been in SQL Server since I can remember.  There were also some data mining examples (as referenced in this old post), but these have not really moved on since 2011 when I last wrote about them, so you might be forgiven for thinking that data mining is dead as far as Microsoft is concerned.

    However since that time two big things have happened: the hyper-scale of cloud and the rise of social media as a business tool, not just as a bit of fun to share strange pictures and meaningless chat.  Coupled together this is big data: masses of largely useless data, being produced at a rate faster than it can be downloaded, in a variety of semi- and unstructured formats – so Volume, Velocity and Variety. Hidden in this data are nuggets of gold, such as brand sentiment, how users navigate our websites and what they are looking at, and patterns that we can’t immediately recognise. Up until now processing and analysing big data has really only been possible for large corporates and governments, as they have the resources and expertise to do this. However as well as storing big data, the cloud can also be used to make big data analysis available to anyone who has the sense of adventure to give it a try – all that’s needed is access to the data and an understanding of how to mine the information.  However the understanding bit of the equation is still a problem: this expertise, aka data science, is the bottleneck, and a quick search on your favourite jobs board for jobs in this area will confirm it.

    So what is Microsoft doing about this?

    What they have always done – simplify it, commoditise it, and integrate it.  If I go back to SQL Server 2000 we had DTS to load and transform data from any source and Analysis Services to slice and dice it from Excel, and then we got Reporting Services in 2002, all in one product.  In 2014 we have a complete set of tools to mash, hack and slice data into submission from any source, but these tools are no longer in SQL Server; they are in the cloud, specifically Azure and Office 365.  So what are the tools?

    • HDInsight, which is Hadoop running as a service in Azure, where you can build a cluster as large as you need and feed it data with all the industry standard tools you are used to (Mahout and Pig for example).
    • Microsoft Azure Machine Learning (MAML), which can take data from anywhere, including HDInsight, and do the kind of predictive analytics that data mining promised but without the need to be a data scientist yourself.  This is because ML Studio has a raft of the best publicly available algorithms and is also very easy to use from IE or Chrome – actually it reminds me a bit of SQL Server Integration Services, which is no bad thing.


    Once you have trained your experiment (as they are called) you can expose it as a web service which can then be consumed on a transaction by transaction basis to score credit, advise on purchase decisions etc. within your own websites.

    • Office 365 provides the presentation layer on the tools above with access to HDInsight data and machine learning datasets from the Power BI suite of tools.

    In order to play with any of these there are two other things you’ll need – an MSDN subscription to get free hours on Azure to try this out, and a copy of Office 2013 for the Power BI stuff. You’ll also want to watch the Microsoft Virtual Academy for advice and guidance, although at the time of writing there aren’t any courses on MAML as it’s so new.

    Finally a word of warning before you start on your own adventure – these tools can all encode a certain amount of business logic, so it’s important to understand the end to end changes you have made in building your models from source to target, and to consider where and when to use which tool.  For example Power Pivot can itself do quite a lot of data analysis, but in a big data world it is best used as a front end for HDInsight or a machine learning experiment.

    I will be going deeper into this in subsequent posts as this stuff is not only jolly interesting it’s also a huge career opportunity for anyone who loves mucking around with data.

  • VDI–Shameless plug


    I have spent the last six months writing my first book, Getting Started with Windows VDI, for Packt Publishing, basically because I wanted to create a one stop repository of the key elements you need to do this. The basic idea I had was to use only the Microsoft stack and to explore the wider aspects of VDI in the process:

    • Building the VDI template using the Microsoft Deployment Toolkit (MDT) where this is normally used to create images that are deployed to physical desktops.
    • Use of the Microsoft Desktop Optimization Pack (MDOP) to virtualize applications (App-V) and user settings (UE-V) so that users can get the resources they need whether they are using VDI or not.
    • How to enable high availability for each of the component parts of VDI.  For example more than one broker can be used by setting up a shared SQL Server database to store metadata about the deployment, such as who is logged into which desktop and which RemoteApp applications they can use.
    • Using WSUS to update the deployment by integrating it with MDT to refresh the virtual desktops and also what needs to be done to patch the role services and hosts.

    I have tried to be as honest as possible, as I would rather maintain my integrity and stop you wasting time trying to get stuff to work which is not optimal.  A good example is the big debate of Terminal Services (which I refer to as Session Virtualization) vs a collection of virtual machines each providing a virtual desktop, which is what most people mean by VDI.  Session Virtualization is still much more efficient than using individual VMs to provide virtual desktops, by a factor of 10 or 20, so you won’t find me defending VDI. Rather I have tried to explain when to use what, and that using a blend of both with a common infrastructure is the best approach, as this is very easy to do with Remote Desktop Services in Windows Server 2012R2.

    I also wanted to make it a practical guide, but you’ll need a fairly beefy box (i7, 16GB RAM) and ideally an SSD to run all the VMs if you want to try it.  I have included as much PowerShell as possible, as you’ll want to automate as much of this as you can. I have also included a basic guide to licensing VDI, as this is one of the main barriers to adoption.

    So I hope you get as much out of it as I did in writing it, and just one thing to note – if you buy the paper edition, the code samples you need are here, as some of my PowerShell scripts are too long to be inline in the book and I am old enough to remember how tedious it was copying lines of code out of a magazine to get a game to run!

  • Lab Ops – PowerShell 4 and PKI

    In a Remote Desktop Services lab setup involving a gateway to allow connections to our virtual desktops over the internet, we are going to need an SSL certificate to secure traffic between the remote devices and the gateway.  Typically in the past we would have gone into a server running IIS and made a self-signed certificate in IIS Manager, and while this still works in IIS 8.5 (the one that ships with Windows Server 2012R2) there is a better way, and that is PowerShell 4, which is also included in Server 2012R2.  Why is it better? More control: specifically we can add alternate names to correspond to all the DNS entries we might need in something like RDS.  We also get all the other benefits of using PowerShell, like repeatability, and like all new PowerShell commands it’s easy to understand what it’s doing..

    To make a certificate:

    # The two DNS names below are examples - substitute the names your deployment needs
    New-SelfSignedCertificate `
            -DnsName "rds-web.contoso.com", "remote.contoso.com" `
            -CertStoreLocation cert:\LocalMachine\My

    where the new certificate is going to be stored on the local machine in the “My” store and it has two alternate names. By the way, not everyone knows this, but the ` character is the line continuation character and means I can format code so that it’s easier for you to read.

    The new certificate is not much use where it is, so we can also export it, and if we want the private key we’ll need to password protect it. First we need to find it, and to do that we can simply browse the certificates just as we might for any other kind of file on a hard drive, except that for certificates we are navigating the cert: drive not the C: drive.  Get-ChildItem is the proper PowerShell command for navigating folders of any kind, but it also has aliases (dir, ls, gci etc.).  I mention this because in some IT departments it is considered best practice not to use aliases but the root PowerShell commands, to provide clarity – perhaps in the same way as we don’t use the word ain’t in an English class!

    # The subject below is an example - match it to the certificate you created
    $Cert = Get-ChildItem -Path cert:\localmachine\My | `
            where Subject -EQ "CN=rds-web.contoso.com"

    Now we can set up a password and export it:

    $pwd = ConvertTo-SecureString -String "Passw0rd!" -Force -AsPlainText
    Export-PfxCertificate `
            -Cert $Cert `
            -FilePath '\\rds-DC\c$\RDS-Web self signed certificate.pfx' `
            -Password $pwd

    We could then import this into our VDI deployment with PowerShell as well, but there is a simple UI in the deployment properties in Server Manager.


    where we can check our certificates are installed and recognised as part of the deployment.  Of course there is proper PowerShell for this..

    # The connection broker FQDN below is an example - use your own broker's name
    Set-RDCertificate `
        -Role RDGateway `
        -ConnectionBroker rds-broker.contoso.com `
        -ImportPath 'C:\RDS-Web self signed certificate.pfx' `
        -Password $pwd  # the same password as above

    which not only puts the certificate where it needs to go but also makes the RD Broker aware of it.

    Creating certificates is also going to be useful if we want to manage any evals, demos and labs we might want to work with in Azure, but that is a story for another post, not least because I need to spend more time in the cloud myself!

    In the meantime do have a look at the PowerShell content on MVA, and if you aren’t running Server 2012R2 you can install PowerShell 4 on older operating systems back to Windows Server 2008 R2/Windows 7 from here

  • Microsoft VDA and Windows Embedded – A game changer for VDI?

    If you are following me on Twitter then you may have noticed that a few of my friends are reviewing a book I am writing on VDI for Packt.  I wanted to write it as a one stop shop on all you need to know about implementing Microsoft based VDI using the various Remote Desktop Services server roles in Windows Server 2012R2, with Windows 8.1 as the virtual desktop OS. You would be forgiven for thinking that the hardest part of this book would be the last chapter, Licensing, which I wanted to include precisely because it is widely touted as being hard and because you do need to know this stuff to implement it.

    In the course of my research and my day job I have been doing some work with 10ZiG, and in fact it was as a result of some events we did together that I thought Microsoft VDI needed to be written up properly and in one place. Prior to this I thought that thin clients were pretty much of a muchness, and I certainly didn’t appreciate that they came in different types or how far they have come on since I joined Microsoft. So I have asked Kevin Greenway from 10ZiG to explain the world of the thin client and how your choice might affect the licensing..

    The ever growing debate over which flavour of Thin Client is the right choice in relation to cost, performance, user experience, peripheral support, management, etc, etc, continues to fill blogs and review sites. Recent news from Microsoft on changes to Virtual Desktop Access (VDA) licensing to Windows Embedded throws a really interesting perspective on using a Windows Embedded based Thin Client for Virtual Desktop Infrastructure (VDI).

    VDA, Thin Clients and PCs

    As of today, if you want to use a Thin Client – which includes ‘Zero’, ‘Linux’ or ‘Windows Embedded’ based Thin Clients – in conjunction with Virtual Desktop Infrastructure (VDI), you are also required to use Virtual Desktop Access (VDA) licensing. VDA licensing is charged at $100 per year for each device using this license. Couple that with the initial acquisition cost of any of the above types of Thin Client, all of which require VDA licensing, and the costs suddenly start to rack up and can exceed those of a traditional PC.

    The VDA licensing does not apply if you use Microsoft Software Assurance (SA), which also includes some desktop virtualisation rights. Most commonly if a Windows PC is used as your endpoint for connecting to VDI and you have SA, this negates the need for VDA licensing. The downside to this is that by using a Windows PC as an endpoint, this doesn’t leverage any of the true benefits of VDI, most specifically the system administrator is now administering two desktops instead of one. There are options to alleviate this somewhat with products, such as Windows Thin PC*, but what if you want to realise the benefits of Thin Clients? These include Manageability, Ease of Deployment, Small Footprint, Power Savings, and Lifespan due to no moving parts.

    *This only applies to repurposing windows OS, not OS agnostic repurposing solutions.

    This is where the recent news from Microsoft comes into play.

    VDA, Software Assurance and Windows Embedded Thin Clients

    If you use Microsoft Software Assurance and you have a Windows Embedded Thin Client (Windows Embedded 7/Windows Embedded 8), then you can now move this into your SA, thus negating the need for VDA licensing. This means the total acquisition cost is potentially lower than that of equivalent PCs, and even of Linux based Thin Clients or Zero Clients, due to the annual charge for VDA.

    Windows Embedded, Linux or Zero

    As mentioned, there is always the ongoing debate over which flavour of Thin Client is right. The biggest arguments against Windows Embedded based Thin Clients have traditionally been cost, user experience (we’ll cover this one in a moment) and manageability. However the biggest argument for the use of Windows Embedded Thin Clients is that typically the majority of VDI vendors, including Microsoft, Citrix and VMware, will focus their efforts primarily on the compatibility, support and feature set available within the Windows version of their clients for connecting to VDI.

    You typically see a greater feature set and an earlier adoption of new features within the relevant Windows clients than you will against, say, Linux, Android or iOS equivalents. Additionally, the ever growing list of peripherals required either to work locally within the Thin Client or to be redirected into the VDI session will typically be more pain free from a Windows based Thin Client than from an equivalent Linux or Zero Client.

    More recently we’ve also started to see the emergence of more and more client based plugins to redirect Real Time Transport Protocols (RTP/RTVP), such as the Microsoft Lync 2013 VDI plugin, which is only available as a Windows plugin. We also have the various media offloading capabilities such as those within the RemoteFX, Citrix HDX and VMware PCoIP stacks, which often still result in a greater tick list of features within these stacks from a Windows client than their Linux or Zero counterparts.

    Let’s shift back to the topic of user experience. Whilst adopting VDI, one of the easiest ways to get user buy-in is ensuring that their user experience, including access and entry, remains the same as it has been with a PC. You want to ensure ultimately that the user only has access to their VDI desktop, not the local operating system. This way there is no retraining involved, or risk that the user can access/break the local operating system. The end goal is that their access device serves simply as a launch pad to their VDI desktop, whilst also delivering the performance that they achieved from their local PC.

    This is where Windows Embedded 8 really comes into great effect. Windows Embedded 8 includes some really great features including HORM (Hibernate Once, Resume Many) for shrinking boot speeds, as well as Unified Write Filter (UWF) for protecting Disk Volumes/Folders/Registry. Embedded Lockdown Manager is also available for Windows Embedded 8 and includes several lockdown features. Examples of features include Keyboard Filter, Gesture Filter, USB Filter and most prolifically the Shell Launcher. The Shell Launcher can be configured to provide the user with a launch pad to their desired VDI session and restricts access to the underlying local operating system.

    There are several management tools available for managing Windows Embedded 8, including Microsoft based and 3rd Party tools such as those provided by 10ZiG Technology within Windows Embedded 8 based Thin Clients (5818V/6818V) and the Enterprise Free 10ZiG Manager.

    [added 12/05/14 by Andrew] the official Microsoft guidance on licensing Windows Embedded 8 can be found here.


    To conclude, this is exciting news from Microsoft: the addition of Windows Embedded into Software Assurance (SA) negates the need for a Virtual Desktop Access (VDA) license, thus dramatically reducing the overall acquisition cost per Thin Client device against a PC. Additionally the adoption of new and exciting features into Windows Embedded 8 heralds a potential shift in the VDI/Thin Client space towards Windows Embedded 8.

    Couple all of this with HORM, UWF, ELM and client offload techniques powered by modern multi-core CPU Thin Clients, such as those from 10ZiG Technology (5818V/6818V), and we are now equipped with what is fast becoming a hybrid of traditional Thin Client and PC technology in one.

    It’s an interesting space to watch in the ever evolving world of VDI and Thin Client Technology.

    For more on VDI there’s some great content on the Microsoft Virtual Academy.

  • Lab Ops Redux setting up Hyper-V in a hurry

    I am at Manchester United as I write this and one of the delegates wanted to quickly try out Hyper-V in Windows Server 2012R2.  I thought I had an all up post on that but it turns out I don’t, so Nicholas Agbaji this is just for you!

    You’ll need a laptop/desktop to work on running Windows 7/Windows 2008R2 or later that’s capable of running Hyper-V, with the BIOS set up for virtualization.

    • You’ll need to download a copy of Windows Server 2012R2 and a special PowerShell script Convert-WindowsImage.ps1 from the TechNet Gallery.
    • Run the PowerShell script as follows. 

    .\Convert-WindowsImage.ps1 -SourcePath <Path to your iso> -Size 50GB -VHDFormat VHD -VHD "C:\WS2012R2.VHD" -Edition "ServerStandardEval"

    Note: If you are running on Windows 8 or Windows Server 2012 or later you can use the newer VHDX format for virtual hard disks

    • We now have a VHD with a sysprepped clean copy of Windows Server 2012R2, and Windows 7/2008R2 and later allow us to boot from a VHD just like the one we just made.
    • To boot from VHD we need to mount the VHD. In Windows 8/2012 you can simply click on a VHD to mount it, however in Windows 7/2008R2 then we’ll need to open an elevated command prompt and do this manually:
      • diskpart
      • select vdisk file="<path to your VHD>"
      • attach vdisk
    • We’ll now get an additional drive, say drive G:, and now we’ll need to edit the boot database from an elevated command prompt and add a new entry to register the VHD:
      • bcdboot g:\windows
    • We also need to edit the BCD to enable Hyper-V in our VHD with
      • bcdedit /set "{default}" hypervisorlaunchtype auto
    • Optionally you could describe your new boot entry with
      • bcdedit /set "{default}" description "Windows Server 2012R2 Lab Ops"
    • Reboot your server/laptop and you’ll have an extra boot option to boot into Windows Server.
    • The final step is to add the Hyper-V role, either from Server Manager or with PowerShell..
      • Add-WindowsFeature Hyper-V -IncludeManagementTools

    Once you have this VHD set up you can boot into your OS, back up the VHD you made and possibly reuse it on another machine.  So good luck Nicholas and thanks for spending the day with us!

  • Lab Ops 19 – Software Defined Networking - where’s my stuff

    Well done if you got through my last post; I bolted a lot of stuff together and it all looks a bit confusing, particularly if you are moving from Hyper-V to System Center Virtual Machine Manager 2012 (VMM).  For a start, where are my network settings? When I was testing my rig I found I couldn’t swap the network connectivity on a test VM from an internal virtual switch to my shiny new VM Network, and when I looked at the error it said I couldn’t change the MAC address of a running VM.  What was going on here was that Hyper-V had assigned a static MAC address at random to my VM, and by assigning it to my new VMM infrastructure VMM wanted to assign a new one.  This got me wondering about MAC address spoofing, which can be important for legacy applications; in Hyper-V you can enable this from the advanced features of a VM’s network adapter. In VMM, MAC address spoofing is part of a Virtual Port Profile, which also allows us to set the other advanced networking features in Hyper-V like bandwidth management and guest NIC teaming, as well as router and DHCP guard..
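
    As a side note, the per-VM equivalent in plain Hyper-V PowerShell looks something like this (the VM name is an example of mine, not from the lab):

    ```powershell
    # Enable MAC address spoofing on a VM's network adapter (Hyper-V, not VMM)
    Set-VMNetworkAdapter -VMName "LegacyApp01" -MacAddressSpoofing On

    # A static MAC address can be set the same way, but only while the VM is off
    Set-VMNetworkAdapter -VMName "LegacyApp01" -StaticMacAddress "00155D010A0B"
    ```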

    Virtual Port Profiles are just logical groupings of these settings which you can assign to a logical switch, and there are off-the-shelf ones like Guest Dynamic IP and the Host Management profile we used last time.  This might seem an unnecessary extra step, but now that we live in a world of NIC teaming we need to identify the different types of traffic flowing down a fat connection.  We can also see Uplink Port Profiles in this list..

    such as the RDS-FabricUplink I created in my last post, which allows us to connect a logical network and specify how that port gets teamed.  A logical switch has one or more uplink port profiles connected to it, and has virtual ports that specify one or more port classifications to describe what traffic the switch can handle.  At this point we can assign the switch to as many hosts as we want and each one will inherit all these properties:

    • The logical switch appears as a Virtual Switch in Hyper-V and is bound to a NIC or team on that host.  When we do that we can see it’s the Uplink Port Profile that is passed to the NIC
    • Underneath that we have one or more Virtual Network adapters which are associated to a VM Network and a Virtual Port Profile.

    When we attach a VM to a VM Network, many of the properties are now greyed out (like those MAC address settings).

    Anyway, I am now ready to create a couple of VM Networks for Dev and Test on which I can put identical VMs without them seeing each other, while still allowing them to span hosts..


    To do this I need to create the two VM Networks and integrate them into the networking fabric by associating them with the logical network (RDS-FabricNet) I created in my last post, and here’s the PowerShell:

    Import-Module VirtualMachineManager
    $logicalNetwork = Get-SCLogicalNetwork -Name "RDS-FabricNet"

    # 2 x VM Networks, each with the same subnet – supply a CIDR range for -Subnet
    $vmNetwork1 = New-SCVMNetwork -Name "DevVMNet" -LogicalNetwork $logicalNetwork -IsolationType "WindowsNetworkVirtualization" -Description "Developer Virtual Network" -CAIPAddressPoolType "IPV4" -PAIPAddressPoolType "IPV4"
    $subnet1 = New-SCSubnetVLan -Subnet ""
    New-SCVMSubnet -Name "DevVMNet Subnet 10" -VMNetwork $vmNetwork1 -SubnetVLan $subnet1
    $vmNetwork2 = New-SCVMNetwork -Name "ProductionVMNet" -LogicalNetwork $logicalNetwork -IsolationType "WindowsNetworkVirtualization" -Description "Production Virtual Network" -CAIPAddressPoolType "IPV4" -PAIPAddressPoolType "IPV4"
    $subnet2 = New-SCSubnetVLan -Subnet ""
    New-SCVMSubnet -Name "ProductionVMNet Subnet 10" -VMNetwork $vmNetwork2 -SubnetVLan $subnet2

    At this point you might be wondering what to do about DNS and AD in this situation, as we would normally assign fixed IP addresses to these.  The answer is to start these VMs first; they’ll then get the lowest addresses by default, where x.x.x.1 is reserved on the subnet for the switch. This is similar to Azure, except that Azure hands out x.x.x.4 as the lowest address because there are three reserved addresses on a subnet. 
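
    If you would rather control addressing yourself, VMM can also manage a static IP pool on a VM subnet. A rough sketch (the pool name and the address range are my own illustrations, not from this lab):

    ```powershell
    # Create a static IP pool on the Dev subnet so DC/DNS VMs get predictable addresses
    $vmSubnet = Get-SCVMSubnet -Name "DevVMNet Subnet 10"
    New-SCStaticIPAddressPool -Name "DevVMNet Pool" -VMSubnet $vmSubnet `
        -Subnet "10.0.10.0/24" -IPAddressRangeStart "10.0.10.10" -IPAddressRangeEnd "10.0.10.100"
    ```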

    Anyway, the other thing we’ll want to do is specify the new traffic that will be carried on our virtual switch by these VM Networks, and to do that we’ll add in another port profile.

    $portClassification = Get-SCPortClassification -Name "Guest Dynamic IP"
    $nativeProfile = Get-SCVirtualNetworkAdapterNativePortProfile -Name "Guest Dynamic IP"
    $logicalSwitch = Get-SCLogicalSwitch -Name "RDS_FabricSwitch"
    New-SCVirtualNetworkAdapterPortProfileSet -Name "Guest Dynamic IP" -PortClassification $portClassification -LogicalSwitch $logicalSwitch -RunAsynchronously -VirtualNetworkAdapterNativePortProfile $nativeProfile

    Our design now looks like this..

    We can then assign these VM Networks to new or existing VMs, and they will be available on any hosts we manage in VMM provided we connect those hosts to our virtual switch. To do that we need a process to create VMs, and we need somewhere to put them, so next up: Storage in VMM.

  • Lab Ops 18 Getting started with Software Defined Networking

    In my last post I finished up with my host under the management of Virtual Machine Manager (VMM), and that was about it. As with Hyper-V, I can’t really use VMM until I have my fabric configured, and after adding in hosts the first thing we need to do is look at networking.  To recap, my current setup now looks like this


    where RDS-Switch is an internal virtual switch, and RDS-DC is my DC and DHCP server with one scope. VMM has a dynamic IP address and is also hosting SQL Server for its own database.

    If I go to VMM, then to VMs & Services | All Hosts | Contoso | Orange (which is my Dell Orange laptop), I can right-click and select View Networking. If I look at the Host Networks all I can see are the physical NICs; if I look at VM Networks all I see is my VMs but no networks; and the Network Topology screen is entirely blank. So what’s going on?  Basically, things are very different in VMM than they are in Hyper-V, and frankly we don’t want to use Hyper-V for managing our networks any more than a VMware expert would configure networks on individual ESXi hosts.  In our case we use VMM to manage virtual switches centrally, where in VMware distributed switches are controlled in vCenter.  So my plan is to use VMM to create a network topology that reflects what my VMs above are for: managing my datacentre.  Later on I’ll add in more networking which will enable me to isolate my services and applications from this, in the same way that cloud providers like Azure hide the underlying infrastructure from customers.

    If we look at the Fabric in VMM and expand Networking we have eight different types of objects we can create, and there is a ninth, VM Networks, that shows up under VMs and Services. So where to begin?

    Your starter for ten (or nine in this case) is the TechNet guide Configuring Networking in VMM, and once you dig into that you realise that VMM wants to control not just the switching but IP management as well.  At the core of all of this are the Logical Network and Virtual Networks, which are just containers for various properties including sites and IP pools.   I am going to start simple: as I only have one host, I’ll just create the first object we need, a Logical Network that has one connected network.    For now I am going to ignore the sub-options to get us started. 

    Logical network properties

    note this is a screen grab from the end of the process

    I can’t create a Logical Network without a Network Site, which has a specific subnet and optionally a VLAN set..

    network site

    The small print
    As per usual in this series I am going to share the PowerShell to do this.  VMM is very good at allowing you to see the equivalent script when doing something; however I have modified what VMM spits out to make it easier to read, while ensuring it still works.  This is a good thing as it’s easy to cut and paste from this post and get exactly the same results, and you can see how the variables are passed and related to each other. 
    Note the raw PowerShell from VMM often runs everything as a job, which is a useful trick it has, and in all cases all our work gets logged in VMM whether we use the UI or the VMM cmdlets

    Note the segments in this post all use the same variables, so you will need to run them in the order shown

    The equivalent PowerShell is: 

    $logicalNetwork = New-SCLogicalNetwork -Name "RDS-FabricNet" -LogicalNetworkDefinitionIsolation $false -EnableNetworkVirtualization $true -UseGRE $true -IsPVLAN $false
    $allSubnetVlan = New-SCSubnetVLan -Subnet "" -VLanID 0
    $allHostGroups = Get-SCVMHostGroup -Name "All Hosts"
    $logicalNetworkDefinition = New-SCLogicalNetworkDefinition -Name "RDS-FabricNet_Site192" -LogicalNetwork $logicalNetwork -VMHostGroup $allHostGroups -SubnetVLan $allSubnetVlan -RunAsynchronously

    Note that in the code above a network site is referred to as a Logical Network Definition.

    VMs are connected to virtual machine networks, and we had the option to create one of these when we created the logical network with the same name. In this case that would have been fine for what I am doing here, as my two VMs are actually there to manage the hosts in much the same way as a VMware appliance does.   So I am going to create a virtual network that is directly connected to the logical one..

    VM Network

    $vmNetwork = New-SCVMNetwork -Name "RDS-FabricVNet" -LogicalNetwork $logicalNetwork -IsolationType "NoIsolation"

    However, this and the logical network and site are just containers in which we put our settings as points of management.  We now need to create an uplink port profile from Fabric | Networking | Port Profiles.  This needs to be an Uplink Port Profile, and when we select that option we can describe how the underlying NIC can be teamed directly from here rather than doing it in Server Manager on each host.  We then simply select our network site (RDS-FabricNet_Site192) and we are done..

    port profile network config

    The one line of PowerShell for this is..
    $nativeProfile = New-SCNativeUplinkPortProfile -Name "RDS-FabricUplink" -Description "" -LogicalNetworkDefinition $logicalNetworkDefinition -EnableNetworkVirtualization $false -LBFOLoadBalancingAlgorithm "HostDefault" -LBFOTeamMode "SwitchIndependent" -RunAsynchronously

    The next piece of the puzzle is to create a Logical Switch.  This is a logical container that emulates a top-of-rack switch in a real server room.  It can have a number of virtual ports, but unlike VMware these are not there to limit port numbers; they are there to manage traffic through the use of port classifications.  We’ll need at least one of these and I am going for Host Management for the port classification, as that is what all of this is for..

    Logical switch link to uplink port profile

    The PowerShell is:

    $virtualSwitchExtensions = Get-SCVirtualSwitchExtension -Name "Microsoft Windows Filtering Platform"
    $logicalSwitch = New-SCLogicalSwitch -Name "RDS_FabricSwitch" -Description "" -EnableSriov $false -SwitchUplinkMode "NoTeam" -VirtualSwitchExtensions $virtualSwitchExtensions
    $uplinkPortProfileSet = New-SCUplinkPortProfileSet -Name "RDS-FabricUplink-Set" -LogicalSwitch $logicalSwitch -RunAsynchronously -NativeUplinkPortProfile $nativeProfile

    We should also create a virtual port with the Host Management port classification:

    Logical switch virtual port

    The PowerShell is..

    $portClassification = Get-SCPortClassification -Name "Host management"
    $nativeProfile = Get-SCVirtualNetworkAdapterNativePortProfile -Name "Host management"
    New-SCVirtualNetworkAdapterPortProfileSet -Name "Host management" -PortClassification $portClassification -LogicalSwitch $logicalSwitch -RunAsynchronously -VirtualNetworkAdapterNativePortProfile $nativeProfile

    We can now apply this logical switch to our hosts by going to a host’s properties, navigating to Hardware | Virtual Switches and adding a New Virtual Switch | New Logical Switch.  Immediately our RDS-FabricSwitch is selected and we can see that our adapter (physical NIC) is connected to the uplink we have created through this switch.

    Host Virtual switch properties

    However, that is just like using virtual switches in Hyper-V Manager; what we also need to do is add in a Virtual Network Adapter, as in the diagram above.  This picks up the VM Network we already created (RDS-FabricVNet). Notice I can have all kinds of fun with IP addresses here..

    host virtual switch virtual network adapter

    BTW, I should have set the Port Profile to the only option available, Host Management, in the above screenshot. If I look at Hardware | Network Adapters I can also see my logical network and site..

    host hardware logical network

    The equivalent PowerShell to connect the logical switch virtual network adapter to the host is ..

    # My host is called Orange
    $vmHost = Get-SCVMHost -ComputerName Orange

    # Note: you’ll need to at least change the Get-SCVMHostNetworkAdapter -Name to reflect the NIC in your host
    $networkAdapter = Get-SCVMHostNetworkAdapter -Name "Broadcom NetXtreme 57xx Gigabit Controller" -VMHost $vmHost
    Set-SCVMHostNetworkAdapter -VMHostNetworkAdapter $networkAdapter -UplinkPortProfileSet $uplinkPortProfileSet
    New-SCVirtualNetwork -VMHost $vmHost -VMHostNetworkAdapters $networkAdapter -LogicalSwitch $logicalSwitch -CreateManagementAdapter -ManagementAdapterName "RDS-FabricVNetAdapter" -ManagementAdapterVMNetwork $vmNetwork -ManagementAdapterPortClassification $portClassification

    Now we can finally see what on earth we have been doing as this Logical switch we have created is now visible in Hyper-V Manager..

    hyper-v virtual switch

    and if we look at its extensions we can see a new Microsoft VMM DHCPv4 Server Switch Extension in here, which allows VMM to control all the virtual switches.

    The tricky part now is to add the VMs to the virtual network. This isn’t tricky because it’s hard; it’s tricky because if VMM and the DC lose sight of each other, or VMM can’t see the host, then we are in trouble, as we can’t easily change these settings in the UI or PowerShell. We’ll also need to flush DNS, as DHCP will kick in and change things as well.  However, what we are doing here is moving VMs that are essentially part of the datacentre fabric.  Other VMs would not be affected like this; indeed that’s the point: we should be able to move VMs across hosts and datacentres without affecting their connectivity.  

    Here is a diagram of what we have created to contrast with what was above.


    This has been quite a lot of work to achieve very little, but we now have the foundations in place to quickly add in more networks and more hosts, and to isolate and manage our networking without needing to use VLANs.  However, if there are already VLANs in existence then all of this will work just fine as well (for more on that check this post from the VMM engineering team).

    Now I have a basic network for VMM and the hosts, I need to do some of this again for the VMs that will serve out applications.  Until next time, have a go at this stuff, read the material on TechNet, and create a lab like this to get your head around it.

  • Lab Ops part 17 - Getting Started with Virtual Machine Manager

    In my last post I talked a lot about the way all of System Center works; now I want to look at one part of it, Virtual Machine Manager (VMM), as VMs in a modern datacentre are at the core of the services we provide to our users. In the good old days we managed real servers that had real disks connected to real networks, and of course they still exist, but consumers of datacentre resources, whether that’s other parts of the IT department or business units, will only see virtual machines with virtual disks connected to virtual networks.  So administration in VMM is all about translating all the real stuff (or fabric, as it’s referred to) into virtual resources. So before we do anything in VMM we need to configure the fabric: specifically our physical hosts, networking and storage.

    There are good TechNet labs and MVA courses to learn this stuff, but I think it’s still good to do some of this on your own server, so you can come back to it again whenever you want, especially if you are serious about getting certified.  So what I am going to do in the next few posts is explain how to use the minimum of kit to try some of this at home.  I generally use laptops which are sort of portable, typically with 500GB plus of SSD and at least 16GB of RAM; half of those resources should be enough.

    I am going to assume you have got an initial setup in place:

    • Copy the above VHDX to the root of a suitable disk on your host (say E:\) and mount it (so it now shows up as a drive, for example X:)
    • From an elevated prompt type BCDBoot X:\windows
    • Type BCDEdit /set "{default}" hypervisorlaunchtype auto
    • Type BCDEdit /set "{default}" description "Hyper-V Rocks!"
    • Reboot the host and select the top boot option, which should say Hyper-V Rocks!
    • From PowerShell type Add-WindowsFeature Hyper-V -IncludeManagementTools -Restart
    • The machine should restart and you are good to go.
    • VMM running already, either by downloading a complete VMM evaluation VHD or installing from the VMM 2012R2 ISO yourself.
    • A domain controller as per my earlier post on this, with the host and the VMM VM belonging to this domain.

    The first thing you want to do in VMM is configure Run As Accounts.  One of the reasons my PowerShell scripts in this series are not production ready is that they have my domain admin password littered all over them, which is not good.  VMM allows us to create accounts used to authenticate against anything we might need to, which could be a domain account, a local account on a VM (be that Windows or Linux), access to a switch, or a SAN.  So let’s start by adding in domain admin. We can do this from Settings | Security | Run As Accounts or with the VMM PowerShell cmdlets, and now all I have to do is open PowerShell from inside VMM or use the PowerShell ISE; either way there’s no more of that mucking about importing modules to do this..

    #New Run As Account in VMM
    $credential = Get-Credential
    $runAsAccount = New-SCRunAsAccount -Credential $credential -Name "Domain Admin" -Description "" -JobGroup "cb839483-39eb-45e0-9bc9-7f482488b2d1"

    Note this will pop up the credential screen for me to complete (contoso\administrator is what I put in). The JobGroup at the end puts this activity into a group that ends up in the VMM job history, so even if we use PowerShell in VMM our work is recorded, which is a good thing.  We can get at that job history with Get-SCJob | Out-GridView

    We’ll probably want to do the same for a local account, so just Administrator and a password, so that any VMs we create will have the same local admin account.
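
    For example, a minimal sketch of that local account (the account name and description are mine):

    ```powershell
    # Local admin Run As Account, mirroring the domain one above;
    # Get-Credential will pop up the same credential screen
    $credential = Get-Credential
    New-SCRunAsAccount -Credential $credential -Name "Local Admin" -Description "Local administrator for new VMs"
    ```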

    Now we can consume that domain login account to add and manage our host.  In VMM we would do this from Fabric | Servers | All Hosts, and before we add in a server we can create host groups to manage lots of hosts (I have one already called Contoso). To add a host, right-click on All Hosts or a host group and select Add Hyper-V Hosts or Clusters; the equivalent PowerShell is..

    <# Note: the raw PowerShell from VMM includes a load of IDs, but this will work as long as we don’t use duplicate names in accounts etc. #>

    $runAsAccount = Get-SCRunAsAccount -Name "Domain Admin"

    $hostGroup = Get-SCVMHostGroup -Name "Contoso"
    Add-SCVMHost -ComputerName "" -RunAsynchronously -VMHostGroup $hostGroup -Credential $runAsAccount

    In the background the VMM agent has been installed on our host and the host has been associated with this instance of VMM.  You should now see the host in VMM along with all the VMs that are on it, so not bad for a line of PowerShell! We can also see the properties of our host by right-clicking on it, and of special interest are the virtual switches on the host; if you have used my script without modifying it you’ll see a switch called RDS-Switch on the host.  We can also see the local storage attached to our host here.
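
    A few lines of PowerShell give a quick sanity check of what VMM now sees on the host (the host name matches my lab; yours will differ):

    ```powershell
    $vmHost = Get-SCVMHost -ComputerName "Orange"
    $vmHost | Select-Object Name, OverallState, TotalMemory            # host health and resources
    Get-SCVirtualNetwork -VMHost $vmHost                               # should list RDS-Switch
    Get-SCVirtualMachine -VMHost $vmHost | Select-Object Name, Status  # the VMs VMM discovered
    ```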

    So now we have a basic VMM environment we can play with: a host, a DC and VMM itself. So what do we need to do next? If this was vCenter we would probably want to set up our virtual switches and port groups, so let’s look at the slightly scary but powerful world of virtual networks next.

  • Virtual Machine Manager 2012R2 Templates cubed

    Up until now in my Lab Ops series I have been using bits of PowerShell to create VMs in a known state; however this requires a certain amount of editing, and I would have to write a lot more code to automate it and properly log what is going on. Plus I would probably want a database to store variables in and track those logs, and some sort of reporting to see what I have done.  That’s pretty much how Virtual Machine Manager (VMM) works anyway, so rather than recreate it in PowerShell I’ll just use the tool instead.  VMM not only manages VMs, it also manages the network and storage used by those VMs. However, before we get to that we need to create some VMs to play with, and before we can do that we need to understand how templates are used.  It’s actually similar to what I have been doing already: using a sysprepped copy of the OS and then configuring that for a particular function (file server, web server etc.) and building a VM around it.  It’s possible just to use the Windows Server 2012R2 evaluation ISO and get straight on and build a template, and from there a VM.  However, VMM also has the concept of profiles, which are sort of templates used to build the templates themselves. There are profiles for the Application, the Capability (which hypervisor to use), the Guest OS and the Hardware. Only the hardware profile will look familiar if you have been using Hyper-V Manager, as this has all the VM settings in it. The idea behind profiles is that when you create a VM template you can simply select a profile rather than filling in all the settings on a given VM template tab, and in so doing you are setting up an inheritance to that profile. However, the settings in Application and Guest OS profiles are only relevant when creating a Service Template. So what are those, and why all this complexity when I can create a VM in a few minutes with a bit of PowerShell?

    For me, Service Templates are the key to what VMM is all about. If you are a VMware expert they are a more sophisticated version of Resource Groups, and before you comment below please bear with me.  A service template completely describes a service and each of its tiers as one entity..


    a sample two-tier service

    If I take something like SharePoint, there are the Web Front Ends (WFE), the service itself (the middle tier), and back-end databases which should be on some sort of SQL Server cluster.  The Service Template allows us to define maximum and minimum limits for each tier in this service and to declare upgrade domains, which enable you to update the service while it is running by only taking parts of it offline at a time.  The upper and lower VM limits on each tier enable you to scale the service up and down based on demand or a service request, by using other parts of System Center.    There might well be a way to do this sort of thing with Resource Groups and PowerCLI in vCenter, but then there are those application and hardware profiles I mentioned earlier.  They mean that I can actually deploy a fully working SharePoint environment from a template, including a working SQL Server guest cluster where the shared storage for that cluster is on a shared VHDX file.
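
    As a rough sketch of how those tier limits surface in the VMM cmdlets (the template and tier names here are hypothetical, and a real service template needs far more than this):

    ```powershell
    # Outline of a service template with a scalable web tier
    $svcTemplate = New-SCServiceTemplate -Name "SharePointFarm" -Release "1.0"
    $vmTemplate  = Get-SCVMTemplate -Name "WS2012R2-WFE"   # an existing VM template
    New-SCComputerTierTemplate -ServiceTemplate $svcTemplate -VMTemplate $vmTemplate `
        -Name "Web Front End" -InstanceMinimumCount 2 -InstanceMaximumCount 5
    ```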

    Services created by these templates can then be assigned to clouds, which are nothing more than logical groupings of compute, networking and storage splashed across a given set of hosts, switches and storage providers, and assigned to a given group of users who have delegated control of those resources within set limits. 

    So templates might seem to be piled on top of one another here, but you don’t have to use all of this capability if you don’t want to. However, if you do have a datacentre (the internal Microsoft definition of a datacentre is more than 60,000 physical servers) then this power is there if you need it.

    If you haven’t a spare server and a VMM lab setup then you can just jump into the relevant TechNet Lab and see how this works.