Insufficient data from Andrew Fryer

The place where I page to when my brain is full up of stuff about the Microsoft platform

Insufficient data from Andrew Fryer

  • Azure SQL Data Warehouse Don’t do this..

    I am always one of the first to try new stuff when it comes out and so it was great for Microsoft to give me job doing this some 99 months ago.  One of the latest things I signed up for was the Azure SQL Data Warehouse (ADW), and then setup a demo ADW so I could test it against Power BI.  The good news is that had an internal cross charged Azure subscription to run this in, the bad news was that last month this cost me $1,000+..


    Fortunately, we are allowed to make mistakes at work the trick is not to make the same one twice!!

    In this case the mistake was that I made was to leave this running and at this point it’s important to understand that while an ADW server looks a bit like an ordinary Azure DB server it’s actually a very different thing (apart from the cost that is).  In Azure DB Server you can quickly scale up and down a server but you can’t completely shut down the server and leave your database parked there for later. 

    However not only can you pause ADW, you can do this safe in the knowledge that your data will still be there for later when you want to resume your work.  I should also mention that of course ADW scales massively more than ADB.  In the screenshot below I have two databases running on my DeepFatDW Azure database server


    where TechNetDW is a data warehouse and DeepFat is an “ordinary” database which cannot be paused. ( but can be scaled up and down to a certain extent).

    I can pause and restart this ADW and I could have done this in PowerShell, for example


    –ResourceGroupName "ResourceGroup11"`

    –ServerName "DeepFatDW" `

    –DatabaseName "TechNetDW"

    Plus I can also dial up/down the compute power against the database, without affecting the data itself, so I could have a truly massive DW with only minimal compute against it until I realise I need more power 


    -DatabaseName "TechNetDW"`

    -ServerName "" `

    -ServiceObjective "DW100"

    and I could go further and embed this in Azure Automation to schedule a scale down or pause overnight.

    So ADW is a great example of how services should work in the cloud, where scale down to save costs is as important as scale up, while data is protected, and you are in control. 

  • Data, no use at all without a catalog


    About a decade a go I went on a Microsoft BI boot camp which was also  attended by Microsoft staff specifically Matthew Stephen and Rob Gray who inspired me to work towards joining Microsoft myself.  Two years later I got in and haven’t looked back and I still track Matt on LinkedIn, so I was very interested in his article on Data Lakes.  I won’t repeat that here other than to summarise it briefly for those that don’t like to click on hyperlinks.

    Data lakes are large heterogeneous repositories of data the idea being that essentially you can quickly store anything without worrying about it’s storage.  Contrast this with a carefully curated data warehouse which has been specifically designed to be a high quality consistent store of business critical information.  Of course this means that Gb per Gb  data /.nbvcx much more expensive to build and maintain but this is because every bit of data in them is directly useful and applicable to the business.

    On the other hand a data lake will contain vast amounts of less obviously useful data for example while the data warehouse might contain all the transactions from the online store, the data lake would store all of the logs from the site running the store.  The former is immediately valuable in showing sales trends the later seems to be largely useless.  However if storage is nearly free (and it largely is if data is stored in the cloud) we can afford to keep everything on the off chance it might be useful.  But we couldn’t justify processing the log data into our warehouse as this would be expensive in terms of compute and the human effort to make this work.  Not only that we aren’t sure what we need from this other kind of data or what it will be used for. Instead we just catalogue it and hang on to it until we realise we need it.

    Let’s take an example – the company on-line store.  The actual sales transactions get written into the data warehouse, but we decide it might be useful to store the web logs from the site as well, so they get thrown into the data lake, typically in a folder/container with a file for each time slice/ web server and the process is just a simple file copy from the web server.  A couple of weeks later there seems to be something odd happening with the site and the cyber security firm we engage ask to see our logs and so we can provide them access to our data lake (we archive off the logs form the actual site to keep the servers nice and lean).  They use their tools to analyse this and advise on a mitigation strategy.  A few months later analysis of the data warehouse shows that the number of new customers being acquired is dropping despite discounts and promotions being in place.  The suspicion is that the culprit might be the new customer sign up experience and a design agency is called in to conduct click stream analysis over the stored web logs to confirm this.

    A couple of points come out of this.

    • The Data Warehouse is still needed for the reporting & analytics that we have always done.
    • The use of the logs wasn’t known when the data was stored and this data might not have been used at all. so trying to second guess how to store it at the time of arrival would be at best difficult and could be pointless.
    • What is important is to catalogue what is stored in the lake and its structure as well as how it relates to other data be that in the data lake or in the data warehouse. 

    So in my opinion the term data lake is not a great description of what it does (how about a data museum?) and without understanding what is in there it is not a lot of use. So a key companion to the Azure Data Lake is the new  Azure Data Catalog


    This new freemium service does is a bit like Master Data Services in SQL Server, but because it’s on line on Azure  it’s accessible and available by default. However it also leverages and requires  Azure Active Directory 1 so that catalogs are protected while still enabling teams inside an organisation to collaborate.

    If this is used as intended then not only do the data professionals record what is going into the data lake AND the data warehouse, everyone shares their knowledge about how data is used and how to join the dots to do specific analytics.

    So I would suggest two things start having a look at the principles of master data management in general and think about how this might play out in your organisation by having a look at the Azure Data Catalog.

    1. This means that if you have an MSDN subscription linked to a personal account you may not be able to sign up to try this service however you can setup Azure Active Directory inside your subscription and then make a domain account a member of your subscription (a topic for another post).

  • Cortana - Remembering the Future

     If you have ever watched the Star Trek First Contact film the crew allude to the fact that the star ship Enterprise is named after the first space shuttle, when actually it’s the other way around the first space shuttle was so named because over 2 million star trek fans lobbied NASA to call it Enterprise.

    A different kind confusion seems to surround another icon of science fiction Cortana. Originally a fictional AI  that was embedded in the Halo franchise of XBox games, Cortana is the name and inspiration of your digital assistant on phones and in Windows 10.  That sort of makes sense as she was a very smart advisor to the Master Chief in those games, and now she can help us with our daily lives, if we want her to.

    However the real point of confusion seems to be the announcement of the Cortana Analytics Suite, specifically what is it and how does that relate the assistant on Windows 10 devices and phones?

    From a services perspective Cortana Analytics Suite represents the collection of data tools to do analytics that exist in Microsoft’s cloud services, Azure and Office 365. So it includes Power BI , Machine Learning, Data Lake, Data Catalog and Data Factory, amongst others as well as the digital assistant, Cortana..


    This makes it a superset of all of these capabilities much in the same way as Office 365 covers traditional tools like Excel and Word as well as newer ones like Sway.  However today this is just a collective term in that you can’t buy Cortana per se, so is this just the marketing folks trying to create impact and build on what is a cool brand?

    Not really, as that set of services can be used by any developer to create intelligence in application that also mimic Cortana in that game.  For example you could develop applications to monitor your health and your team’s health through wearables like the Microsoft Band, recommend courses of action based on behaviour to make your web sites appear more personal and relevant to predict what action to take before systems fail and if allowed to carry out that preventive maintenance,

    Is Cortana Analytics Suite relevant?  I have already done a series of posts on these services and how they can be wired up to create solutions around predictive analytics, big data, near real time analysis and “traditional BI” – I just didn’t know at the time this would all be joined up in this way. What this means for me is that I am betting my future on Cortana.

  • Azure Data Services for (near) Real Time Scenarios

    Light takes about 8 and half minutes to get from the surface sun to the earth and it can actually take a photon 20,000 years to get form the core of the sun to the surface.  On earth that means that if the sun stops shining we wont know about it until eight and a half minutes after and begs the question what is real time as in real time analytics?  So IO tend to stick to the term near real time to be precise because if we are to do any kind of analysis we’ll need a small amount of time for that no matter what technology we use. 

    In my diagram for this scenario I have identified two candidate sources for this sort of data – IoT and Feeds.


    IoT, the internet of things is the crazy work of connecting devices and humans to the net and gathering telemetry from them, so things like wearables, weather stations, or even wildlife.  These things are typically low powered devices often prototyped using maker technology like Raspberry Pi, Aduino and many others, these then send feeds to a service but the question is how to ingest that data and what to do with it? There could be hundreds to millions of these devices and some of them can generate a lot of data as they send detailed state readings possibly at several times a second and it’s scale we need to factor in when deciding what to use.

    In the diagram above I have shown two Azure services in the orchestration box in pink, Service Bus and Event Hubs they are complimentary and different.  For smaller scale scenarios Azure Service Bus can be used by itself and it was original designed to allow asynchronous communications between service tiers e.g. An Azure Web role talking to a worker role (web application and logic application as they are now called). 

    When more scale is required one or more event hubs can be used behind the service broker to handles millions of events a second. so where before a service bus has  queues and topics it can now have event hubs behind it as well. However where multiple queues and topics compete to process the next message, event hubs are partitioned with each hub having it’s own sequence of events. The event hubs are partitioned by a partition key that can be defined for example if a sensor send in temperature and pressure reading you might have an event hub for each portioned by a reading type key.  The event hub pricing gives you an idea of how powerful they are with each hub handling up to 1,000/events a second with a data ingress of 1Mb/sec.

    So event hubs can bring and cache events at great scale but what to do with them ? we could write them as is to storage but typically we’ll want to aggregate and process the data in some way first and this is where Stream Analytics comes into play.  This is another Azure Service that uses a SQL like language to analyse data over time and send the result of to another process or write it to one of the many storage formats on Azure. 

    Comparing this to the other orchestration service in Azure, Data Factory can be likened to how we might analyse a group of cars by make model and colour.  If we used a tool like Data Factory or any other ETL (Extract Transform and Load) solution we would go to a car park where the cars were and count them every day possibly noting which ones had left and arrived since the last count. If we used Stream Analytics we would stand on a bridge over a motorway and count the number of cars that went under the bridge in a given time. 

    The other things to know about stream analytics is what it can consume and what it can output. It can consume data from Event Hubs and Azure Blob Storage and it can also output to both of these and SQOL Azure databases.  This means we can recurse and reuse data as it arrives to do ever more sophisticated analysis.

    I have also included HDInsight in the diagram as we can configure Apache Storm on top of HDInsight be selecting this as an option when we create an HDInsight cluster..

    An example of the quick-create cluster form in the portal

    This allows open source orientated organisations to use Storm as a service and thus use the tools and techniques they are familiar with to do near time analysis at scale as well.

    In both cases the (near) real time dashboard in Power BI allow these feeds to be visualised and in this video you’ll see some of that as used by Microsoft’s Digital Crimes Unit as they continually monitor threats and work with law enforcement agents to mitigate them.

    To conclude while there is a lot of interest in IoT what really matters is the data coming off of those devices and how we can use it for good, for example by feeding more people through analysis of the environment, making the internet safer, and to improve chronic health problems with wearables.   

  • Azure Data Services for Big Data

    Big data is one of those over used and little understood terms that occasionally crop up in IT. Big data has also been extensively over hyped and yet it’s a pretty simple concept that does have huge potential.     A good way to describe it is the three V’s , Volume, Variability and Velocity:

    • Volume speaks for itself – often big data is simply to big to store in any but the biggest data centres – for example few of us could store all of twitter locally.
    • Velocity reflects the rate of change of data and in this case it’s network capacity that limits your ability to keep up with data as it changes
    • Variability is about the many formats that data now exists in from images, and video to the many semi and unstructured formats from logs, and the Internet of Things (IoT).   This affects how we store data – in the past we might have spent time to write data into databases but this forces us to make assumptions about what questions we will ask of that data.  However with big data it’s not always clear what use will be made of the data as it is stored and so in many solutions data is stored in its raw form for later processing.

    These three Vs mean that big data puts stress on  the compute storage and networking capacity we have and because its use is uncertain it can be difficult to justify any infrastructure investments for this.  This is the main reason that cloud services like Azure make sense in this space.  Cloud storage is very cheap, and as we’ll see compute can be on demand and very scalable so you just pay for what you need when you need it.

    Enough of the theory how does that play out in Azure?


    Working backwards – we need to have somewhere to store all of this data and at the moment we currently have three choices for big data :

    • Azure Blob storage.  This is nothing more than the default storage for azure  - for example if you create a VM this is where the virtual hard disks are stored.  objects get stored in containers for which permissions can be set much like any operating system except that these can be accessed over DNS and must therefore must have a valid DNS name e.g.
    • Azure Table storage. This is a storage mechanism based on key value pairs, where the key is a unique identifier and the value can be pretty much anything you want and can vary between records in the same table  The above rules about DNS also apply to table storage.
    • Azure DocumentDB  This is a NoSQL schema free data base like storage platform based on JSON data.  While there is no concept of transaction you can write SQL like queries and the results come back as JSON (there’s a sandbox here where you can try this yourself.
    • Azure Data Lake. This is a new Hadoop File System (HDFS) compatible filing system that was announced at Build 2015, which will supersede Azure Blob Storage as the default location for big data on Azure.  However it’s only private preview at the time of writing so I don’t have more detail I can share.

    Having found somewhere on Azure to store data, we need a mechanism to process it and the industry standard for this is Hadoop.  On Azure this is referred to as HDInsight and while this is essentially a straight port of Hadoop on to the Windows or Linux designed to rub on Azure it has a number of benefits:

    • Data can be quickly accessed from Azure Blob Storage by referring to its location with a wasb:// prefix and then the DNS entry to the folder where the files to be used exist. Note this approach allows us to consume all the files at once in a folder but assumes those files all have the same structure.  This makes it ideal for accessing slices of data from logs or feeds which often get written out from a service in chunks to the same folder.
    • You only spin up a cluster when you need it and pay only for the time it exists. What’s clever about this is that you’ll typically process a chunk of data into some other format say into a database, which will persist after the cluster is destroyed 
    • Clusters can be configure programmatically from and part of this process can be to add in additional libraries like Giraffe, or Storm and to address and use the storage locations needed for a process.
    • You aren’t restricted to using HDInsight -  you can if you want spin up Hadoop virtual machines with tooling from Horton Works or Cloudera already included (look for these in the VM Depot I have mentioned in previous posts). However HDInsight manes that you don’t have to worry about the underlying operating system or any of the detailed configuration need to build a cluster form a collection of VMs.

    Creating cluster in HDInsight is still a manual process, and while we could create it and leaving it running this can get needlessly expensive  if we aren’t using it all the time so many organisations will want to create a cluster run a job against it and then delete it.  Moreover we might want to do that in line as part of a regular data processing run and we can do this in HD Insight in one several ways for example:

    • An HDInsight cluster can be configured and controlled using PowerShell, and it’s also possible to execute jobs and queries against the cluster.  This PowerShell can be part of a bigger script that performs other tasks in Azure and in your local data centre.  PowerShell  scripts can be adapted and run from Azure Automation which has a scheduling and logging service.
    • Make use of Azure Data Factory (ADF) which can do everything that is possible with PowerShell but also shows how processes are connected and has a rich logging environment which can be accessed just form a browser.  As I mentioned in my last post ADF can also be used to control batch processing in Azure ML (MAML) as well as connecting to storage and SQL Server and Azure SQL databases which might be the final resting home of some analysis done in HD Insight.

    If you want to find out more about any of these services the Microsoft Virtual Academy now has a more structured view of the many courses on there including a special portal  for Data Professionals..


    and drilling into Big Data will show the key course to get you started..


    Hopefully that was useful and in my next post in this series we’ll show how various Azure services can be used for near real time data analysis.

  • Azure Data Services for Predictive Analytics

    In my last post I hopefully explained that business intelligence works very well on Azure and with new services like the Azure Data Warehouse and Power BI (Not strictly part of Azure but a Microsoft  online service).  However while BI is good at showing some trends and how a limited number of factors affect that trend it can really be used as part of process to automatically make decision of the back of some new data. That is the realm of predictive analytics and there are a few routes you can use if you want to do this in the cloud..


    The first is to simply use Azure Machine Learning (MAML) which does this as a service (which I have covered in earlier posts).  MAML can directly read data in from a number of sources including OData , a Hive query (from Hadoop or HD Insight as it’s called in Azure.

    The other approach is to make use of the extensive gallery of VMs in Azure some of which are configured for machine learning . For example  H2O is an open source VM which provides support for R over Hadoop.  H2O..


    which means you just select it enter your Azure subscription details and the VM will spin up complete with tutorials..

    Having made your prediction what do you do with it? In MAML the resulting trained model can be published and then accessed on a transaction by transaction or in batch from a rest api. In transaction mode you’ll just put the call to MAML in lie in your app and so to a large extent MAML itself is invisible.  Batch mode might be used as part of the load of a data warehouse such as customer segmentation or to help complete partial data and in that case then it might be good to make use of  the Azure Data Factory (ADF).

    In this case we might want also want to do pre-processing and cleansing using the right tools for that, like HD Insight and just do the machine learning in MAML.  ADF can orchestrate all of this including provisioning an HD Insight cluster on demand and calling MAML. Not only that it does a great job of  logging what has been processed and any issues all from your browser.

    The other scenario where ADF could make sense when doing predictive analytics is to have a process (Factory) to retrain a MAML model based on newer data in a controlled way.  For example if you have a model to accept/reject loan applications the model needs to be audited but may need to change to reflect more recent data, so a controlled update is needed rather than just continually changing it as new data arrives (which could also make use of ADF).

    Finally in the diagram above I have also shown Power BI as being in the mix as this is the de facto service online service form Microsoft although it’s not really part of Azure it’s Office 365, and has now been unbundled form SharePoint as on-line service.

  • Azure Data Services for Business Intelligence

    A little known fact the biggest workload on Azure VM’s (aka IaaS) is .. SQL Server. This might seem odd give that there is a SQL Azure service where you can run SQL Server database without worrying about SQL Server itself.  So what’s going on here?

    There are a few possible options:

    • These VMs might be running older versions of SQL Server as a way of offloading legacy apps to the cloud
    • SQL Server is much more than a database engine and these VM’s might be running the other components that come with its such as Integration Service , Reporting Services ,Analysis Services, Master Data Management or Stream Insight.
    • The databases the VM is running might not be compatible with SQL Azure
    • SQL Azure can seem expensive and this is because it is essentially SQL Server in an Always On mode: Transactions are written to two back up servers in the same Azure data centre and all three server are replicated to the other European data centre.

    What’s interesting for me is that whichever scenario I think of it does mean that data and databases are being stored on Azure in some shape or form which is a good thing provided that it is properly protected in the way that any SQL Server instance that’s connected to them internet. However what I an interested in in this post is the use of Azure to do “traditional” business intelligence (BI).  By this I mean data marts / warehouses reporting and analytics.   Note If this data is simply persisted virtual hard disks attached to VMs then ultimately that ends up in azure blob storage.

    If we accept concerns about where data is homed then BI s actually quite a good fit for cloud if the visualization tools are close to where the data is…


    Note that since my last post two new data services have been announced for Azure – Data Lake and Data Warehouse.  The Data Warehouse is a good example of how the cloud changes as it will allow the compute resources used to run queries to be scaled to meet demand while the storage aspect will be relatively cheap.  the Data Lake is essentially a place to drop large amounts of data that you might want to use in future but aren’t sure what that use might be when you throw the data in the lake.

    Wherever the data persists for a data mart/warehouse, we’ll need to have a process in place to update and refresh it.  Back in my day this might have been SQL Server Integration Services (SSIS), Informatica etc. so how does the cloud change this?  The simplest approach might be to move this processing to a VM in the cloud the upside being that we can be clever about when batch processing occurs (like the overnight warehouse refresh) and only pay for what we use.  However we may still need to think about designing in redundancy to ensure our processing survives any failure. There is another way and that is to use the equivalent service in Azure, Data Factory. At the time of writing this is still very new and works in a very different way to SSIS etc. in that it is primarily about workflow and talks to sources and targets in their native languages and doesn’t have too many type of connectors.

    With data visualization there are a couple of choices in the cloud either Power BI or spinning up your preferred tools in Azure RemoteApp.  RemoteApp is essentially remote desktop services in the cloud using either a pre-built image with Office on it or supplying your own image with what ever tools you prefer to both build or consume analytical models. RemoteApp might also solve the problem of providing BI on the move away form the office as there are free RemoteApp clients for Android IOS and OS-X.  I like this solution as data only persists in the cloud under control and doesn’t leak out onto local devices which also means it is fast as it’s only RDP traffic going between device and cloud.

    Finally please be aware that so far I have just mentioned Microsoft’s BI tools here, but all of this could apply to Tableau over SQL Server, Oracle based BI, IBM Cognos and so on as any of these will also be fine in Azure assuming the vendors have sorted out how these tools are licensed when running in the cloud.

  • Azure Data Services–what are they?

    This might be a recurring theme but worth mentioning again -  Azure continues to change and expand at an alarming rate. On the one hand this is simply scale, for example we now have access to bigger compute resources like the G class virtual machines, ever cheaper storage, and higher and higher limits on what can be done.  However it’s the changes to the the range of services available in Azure that is more important and  nowhere is this more obvious than in the world of data..


    In the diagram above you can see what I mean (note: Power BI is in here which is not technically Azure, but is a cloud service like Azure).  I could write a series of posts on what these all are but actually there is good content on what each of them does and tutorials on how to use them.  What is harder to find and what I want to discuss is how these fit together to support certain scenarios that simply weren’t possible or economically viable in a pre-cloud world.  So I have grouped these services by function:

    Orchestration. The Event Hub and Data Factory services allow us to ingest data into Azure and then forward it to the other services in the diagram.  However they differ in that Event Hub is designed for consuming data streams from Internet of Things or RSS Feeds where Data Factory is a batch mode solution for large data sets.  

    Compute.  The raw data we have needs to be transformed and analysed to get value from it.  The most obvious way to do this is to spin up a VM with whichever tools we are most comfortable. I might use SQL Server or R studio where others might be more comfortable with Oracle or one of the many big data solutions in VM Depot  (a collection of 90 odd open source templates for Azure)..


    However Azure also has specific services for processing data such as HD Insight for big data, Machine Learning for predictive analytics and Stream Analytics for near real time scenarios.  I could also add SQL Azure to this list but I have listed it under storage as code is usually executed against a database from another application or service.

    Storage.  There are also several ways to store data in Azure blob and table storage, SQL Azure and the newer documentDB ( like MongoDB but as a service).  Note while data can be stored in virtual hard disks inside VMs they are all stored in Azure blob storage anyway.

    While I hope it’s useful to have a view of Azure services like this it still doesn’t help in understanding how to connect them up to do useful work. So if we imagine these are stations on the London Underground we get to those station by travelling on a particular line (hopefully not going round and round on the circle line!) . So in the next few posts I am going to take you on a journey through these services which I hope will expand your mind to the opportunities of working with data in Azure.

  • Machine Learning for the Business Intelligence developer

    Amy (the other half of the data duo in our team) and I have been gate crashing the Data Culture series and other events recently to see who’s interested in Azure Machine Learning (MAML) . It turns out that data scientists are pretty comfortable using their own tools and scripts be that in R and Python and while they are interested in MAML they don’t see then need for it.  However Business Intelligence (BI) developers seem to be far more excited about how this new Azure service can be quickly evaluated and deployed against their existing data without them needing the deep maths knowledge a data scientist has.  That’s not to say  a basic knowledge of maths and stats is needed or that Azure ML makes data scientists redundant rather it means that interesting ideas can be tested and evaluated if you have some idea of what you are doing. 

    However MAML  has only just gone into general availability and while the service is powerful fast and reliable, the full documentation around it is still a work in progress.  To fix that Amy and I have come up with a tutorial to help get you started.  It based around my earlier posts on predicting whether flights are delayed or not – in technical parlance a two class classification problem.  This lab guide is on One Drive and makes use of existing sample data sets in MAML, and it takes you through the whole process right up to publishing, note you’ll still need to sign up for MAML to do this.

    Amy has also posted two experiments at various stages of the lab into the MAML Gallery..


    So that you can start at the point where things get interesting i.e. you have got the data cleaned it up and are ready to do actual machine learning in ML studio, or see how to publish an experiment without doing all the rest of the lab.

    Note: you’ll still have to sign up to Azure to get started with MAML , but you can get a trial for a month or use your Azure hours included with your MSDN subscription.

  • Azure Machine Learning – General Availability

    Today Microsoft Azure Machine Learning (MAML)goes from preview to general availability and it’s also undergone quite a few changes on the way.

    Before I get into what’s new (and there is quite a lot) I do occasionally get feedback that Azure services are evolving too quickly for everyone to keep up.   I do agree this is a challenge but this is where the industry is going and it means that new features requested by you or dreamt up by product teams can be quickly put out there and refined based on feedback – so a bit like an experiment in MAML.

    So what’s new now it’s officially in production and why does this matter?

    When I talk about MAML at the events I go to I call out the three things that make it a bit different from other services and tools that do machine learning: It’s collaborative, it has some seriously good algorithms in it that you can grab and go , and it allows you to take your hypothesis from initial concept through to production.  In this post I’ll deal with the first of those – Collaboration.

    You might notice that from today, if you go to the ML Studio landing page there is a new preview option in the menu bar at the top, the gallery, and if you follow this link you’ll get something like this..


    The gallery is the place we can share our work with each other, where until now we could only do this by inviting each other into our workspaces, which is still fine for team work but is not so good if you want to share your finished work.  In the screen shot above you can see that as well as the ML team sharing some new samples, there are already contributions from those of us who have been on the preview for this. 

    The home page has also had a shakeup as more and more training and resource materials are coming on line.  I would also call out the contribute option where you can get help and also feedback on what you think needs to be in MAML.  I can’t say for certain that your idea will get through but I can guarantee that what you write will be reviewed by the product team.

    To start sharing our own experiments with each other we need to describe them to make them searchable and discoverable so you’ll notice that if focus is set to the project in ML studio I can do exactly that..


    Here I can add in a 140 character summary and a longer description. Once I have successfully run my experiment, the Publish to gallery option in the toolbar at the bottom is enabled, allowing me to share my work. Also notice the create scoring experiment option here which I’ll get to in a later post.

    The other data developer evangelist on the UK team I work in is Amy Nicholson, and she has  posted up an experiment we use during the instructor led lab on MAML that we have developed..


    If you open this you’ll realize that it doesn’t do any machine learning at all.  This is because this is the start point of the lab following a quick demo we do of how to load and transform data in MAML, which is just showing how to use ML studio to do the kinds of things we have always done with data.   It’s great for us to have this experiment in the gallery as until now we have had to invite you into our workspace which meant giving us the details of your azure accounts.  Now w you can just open this, save a copy and you are ready to do our lab.  If you do actually want to do this lab then we’ll be at the following events..

    • the Data Culture series which is at various location across the UK
    • the UK venues of SQL Saturday
    • SQLBits at the Excel Centre London 4-7th March (the community day is on 7th and free the other days are paid for).
    • other random events that’ll be on the MSDN UK  & TechNet UK newsletters

    so please come along to one of these with your curiosity and your ideas. However If you want to sit at home and work your way through our lab then please e-mail me and I’ll send you the guide.

  • TechDays on Line 2015–it gets better



    I have been out in Seattle for the last couple of weeks as a roving reporter for our upcoming TechDays Online 2015 and I have managed to get footage of some of the engineers who write the stuff we all use.  I have also been having coffee with some of them to make sure we can share the very latest thinking during the event whether that’s on enterprise mobility, on 3rd Feb, what we are doing with service management and automation on the 4th, and what Windows 10 means for developers on the 5th. 

    Of course it’s no substitute for a ticket to Build (which has sold out anyway) or Ignite, but if you hate flying and look home comforts then there is no better way to stay up to date. than to tune in next week. 

    Also be aware there is small army of experts both MVPs and our filed engineers online throughout the day so if you have questions and comments we will be get you answers right back to you on the side chat.

    If that’s not enough then we are also giving away all sorts of stuff, but hopefully I don’t need to mention that as it’s all about the technology isn’t it (btw I bought my own Surface Pro 3 while I was out there as I  am not allowed to enter!)

    so hopefully we’ll be able to virtually catch up, as there should be something over the three days that will catch your eye.


    Neon sign in downtown seattle – @deepfat 2015

  • Don’t be a turkey

    capercallie 006

    It’s not great being a turkey, pheasant or grouse  at Christmas, but my Christmas sketch this year isn’t too worried as he’s a protected and valued Capercaillie.  In the same  same way the cloud first world might not seem like a great time to be an IT Professional but there are certain species who are similarly valued and for whom 2015 is pretty exciting. 

    So as we all take time out to enjoy time with our friends and families, don’t panic about 2015.  Infrastructure guys are still needed and many of the skills we have acquired are still valued too like VPN’s , certificates, Active Directory, and of course PowerShell and I for one won’t be retiring to do art full time just yet.

    Happy Christmas and hopefully we’ll meet up  at an IT Camp , Future decoded, or user group somewhere next year


  • IT Camp - Extend your Data Centre with Azure

    I have been on a journey over the last year that some of you may already be embarked on and that is to transition my skills to use cloud services and specifically Azure.  For me it’s like going to America, the language is similar, the culture less so and everything over seems to be much bigger such as the cars and buildings. 

    In Azure we can still use the skills we have (PowerShell is amazing!) to run up virtual machines and we understand what they are. Our knowledge of gateways and VPNs is also useful but then there are new things like Web Roles and SQL Azure databases which actually are similar to the web and database services we provide to the business in our own data centres.

    So how do we get your head around this stuff? it all depends on how you learn, some of us like to read the manual , some like to be shown and the curious amongst just jump in. Actually we all use all three of these approaches but in any situation the amount of risk and uncertainty will influence which we use. 

    When it comes to Azure you could sit back and watch the excellent content on MVA, but it’s all in American and isn’t interactive unless you catch a jumpstart like the one in November on Azure.  There are also various lab guides lying around but there’s no immediate help if you get stuck.

    So your best bet might be to start your journey to Azure by coming along to an IT Camp which is a mix of theory and instructor led practical work.

    Specifically we’ll get you started by showing you how to create a group of Azure VMs and stand up a load balanced public facing web site.  This is easy to do but just as hard to maintain as it would be if the VMs were in your basement so in the next stage we move the web sites to Azure web roles and the backend database to SQL Azure. These Azure services mean that all you worry about is the code /database not the OS or application layer.

    Having done this we can then explore other parts of the Azure fabric such as how storage and networking works so you can connect and integrate  your on premises applications and extend those to Azure.  We also look at operations like monitoring your data centre form Azure and how to leverage Azure automation to hand off routine tasks. 

    We have lots of optional content which we cover to a greater degree depending on demand such as running Linux VMs on Azure and how to deploy web sites based on non Microsoft languages. Azure can also be used for disaster recovery so we can cover that if it’s of interest as well.

    Finally we can discuss questions like how much will it cost, how do I move stuff and the fear, uncertainty and doubt (FUD) of the cloud as answers on this are very dependant on  what business you are in and your existing investments in infrastructure.

    So if you are interested in Azure and want to get your hands dirty you need to do three things

    1. sign up for an IT camp 1

    2. Turn up on the day

    3. Bring along an open mind and a laptop/tablet to do the labs on

    1 Stop press 13 Jan 2015 - Open Registrations for our IT Camps are here but you may also get an invitation to closed camps directly from your account manager

  • IT Camps–What are they good for.


    One of the most rewarding yet most difficult part of my job over the last three years has been involved with designing and delivering something called an IT camp.  So I thought it might be useful to explain what they are in case an invite lands in your mailbox.

    What IT Camps are:

    Technical. it’s designed for technical people who like to learn by doing so we allocate 3 hours plus of lab time out of the day so it’s essential you bring a device (or even two – one for the lab guide).

    Focused on a key technology. Camps have a theme , currently either  Enterprise Device Management, and Extend Your Data Centre with Azure.  In both cases we want to get you to a place but how we get there and where we focus is up to you.

    Education rather than training.  We show the art of the possible but we do cut corners that you wouldn’t in production e.g. self signed certs and our PowerShell misses out the error trapping so it’s easy to see what it’s trying to do.  We do share best practice but we also encourage debate about what that is as you are all form very different organisations and what works in retail might not be appropriate in education.

    What IT camps are not:

    Death by PowerPoint. We do use decks to show architecture and setup instructions but that’s about it.  We do supply you with a pile of decks to explain content off line, and you can continue your labs after the day as we’ll make the guides available.

    a Lecture.  While events like Future Decoded provide deep technical content there is little opportunity to ask questions, or have any kind of discussion.  This does mean that IT camps can appear more unstructured than a session at something like TechEd, as we will deviate into areas based on what you are interested in on the day and so no two camps are ever quite the same.  We also make mistakes and things don’t always work because we are doing this live and actually experience has taught us that small hiccups actually reinforce learning so expect surprises!

    A Sales or marketing pitch. Evangelists don't get bonuses for generating leads or , we are there to show the art of the possible and often that's about making the best use of the Microsoft solutions you already have like, Hyper-V or the deployment tools we give away.

    The reason that Ed and I put a lot of effort into camps is about the direct help and advice we can share and the feedback we get from you.  Feedback enables us to continually improve camps a recent example being that we now give you much more time for labs than we used to.  We also get ideas for posts and other content delivered by the TechNet team like my Lab Ops series on using PowerShell in a sandbox.

    So hopefully we’ll have the pleasure of your company at a camp sometime next year.

    Stop press 13 Jan 2015 - Open Registrations for our IT Camps are here but you may also get an invitation to closed camps directly from your account manager.

  • How to train your MAML–Redux

    Since I wrote this series we have taken on two computer science graduates in the evangelism team in the UK, Amy Nicholson and Bianca Furtuna. Both studied machine learning as part of their courses so I introduced them to this series on Azure ML to get them up to speed.  They have taken this series apart and put it back together again as a full day training course which we have been testing at several events recently.  They also spotted some improvements and so I thought I would share these.

    Amount of Training Data:

    In the part of the experiment when I split the data into training and test data, I split out 20% of the data for training.  However what I should have done is used  80% of the data training instead as usually for continuous variable type machine leaning scenarios the majority of data is given to train the model and then 20% is used to validate the model, this allows the model to really refine the patterns it understands in the data and predict with more accuracy.  It’s also important to have a good balance of data as well and this case we need a mix of flights that are late and on time.

    Excluding Month and DayofMonth:

    I removed the time related columns used for the joining the Flight and Weather Data,  Amy wondered if the time of year was significant and ran some tests with the Boosted Decision Tree algorithm:

    Keep Month, DayofMonth

    Remove Month, DayofMonth




    However to be fair to me there isn’t enough data in the sample to really test this as there’s less than a years worth of data.  However Tine of day might be important as well.  The point is here is to experiment and test to iterate to a good result, and knowledge of the business data and maths combined with some scientific rigour is  needed for that.

    Quantize Module Use:

    I explained and used quantization in my experiment the process of putting a continuous variable like temperature into several buckets of ranges as this is in the Azure ML sample experiment on Flight Delay.  Amy decided to test whether this was improving the accuracy of the model  by removing it..

    Keep Month, DayofMonth + No Quantize Module

    Remove Month, DayofMonth + Quantize Module




    so it turns out it’s better not to do that for this scenario.  Amy then wondered whether other two class algorithms in ML worked better with quantised data..

    Without Quantize Module

    With Quantize Module

    Boosted Decision Tree



    Logistic Regression



    Support Vector Machine



    So it’s not helping here at all.  Amy did see a slight reduction in processing time using quantisation and you might argue that a small drop in accuracy might be acceptable to really speed up processing but the improvement


    I learnt three things from this feedback as well as the specific errors above:

    1.  Expertise. It is worth talking to and learning from a real data scientist and sharing your work with them. Azure ML makes this easy by allowing me to invite others into my workspace.

    2.  Do good science.  Don’t slavishly follow other peoples work, be curious but sceptical by testing your experiments rigorously

    3. Document your work.  If you want to share your work you'll need to explain your hypothesis and working like any good scientist. While Azure ML does allow comments but doesn’t have a way of storing other metadata about an experiment the way a solution in Visual Studio does.

  • What kind of data person are you?

    andrewI have been working with data in one form or another since 1991, but I wouldn’t describe myself as a DBA, data scientist or a developer so what am I, and am I alone?

    I spent ten of those years doing business intelligence and most of that was working with databases, designing systems data marts, and analytical models, as well as getting into search and unstructured data.  What these had in common was my desire to provide the business I was in with a better view on the decisions they were making.  However what I was creating in each case was a space in which business users could explore that data to come up with their own answers rather than me doing that as the questions changed quickly and needed a deeper insight into the business than I had

    More recently I have been getting back into working with data again and what’s interesting about that is that there seem to be a lot of other data orientated people like me who are exploring the many new data tools that exist both on the desktop and more interestingly in the cloud.  Some have deeper technical knowledge while others are really close to the business.  In all cases they are curious but cautious in that they will try test and evaluate rather than ignoring a new opportunity or taking someone else’s word for it.

    I would classify us (or should I say cluster us) as data scientists however that term seems to be reserved for those who have a deeper mathematical / statistical background so what should collective term should I use to describe us?  In the team I work in we call ourselves data developers and I kind of like this as the word development implies that we do all do some sort of coding (such as R, Python, SQL, MDX and F#) and it also means we are using some sort of work bench like the SQL Server data tools in Visual Studio.  It also suggests that we are developing the data into something in the process, for example in Machine Learning that something would be a prediction. in BI it might be an analytical model and in real time analysis that might be a trigger to fire an event when a condition is met.

    There are other groups around this field as well. I have already mentioned the data scientist and it might be that their role might be to oversee the models we build and sign them off for production much as an actuary would do in an insurance company.  Other data related roles are the visualisation experts who like user experience designers on an application team put the gloss and usability on data to ensure that its true meaning is self evident.  Then there are curators/librarian types who understand how disparate data sets fit together.  Perhaps one are that will emerge that might get you thinking is an ethicist - as the laws on data privacy are far behind what is possible these days so the question is not “is this illegal?” but “is it ethical?”. Just because someone didn’t uncheck a box on an online form, and just because someone can now mash that preference data with their social media data exhaust and a  bunch of commercially bought 3rd party data that enables a data scientist to infer some really intimate details about someone’s life, doesn’t mean we should.

    Anyway back to data developers.  I think there’s a lot of us out there and that we have had some sort of background in business intelligence, data management etc. and while we may have done some maths in the past it’s not the day job.   My conversations at recent events seem to support this and so our plan is to try and help you transition you into this new world with in person and online content which the MSDN UK team have curated onto a dedicated data developers portal

    What we need from you is feedback on what we are doing to ensure you are getting the content and events you need so it’s just like any  other science project.

  • Lab Ops – Working with Azure VMs

    A couple of months ago I had a very interesting chat with Andy a Data Centre admin at CP Ltd about using PowerShell to manage Azure as Andy is partially sighted, and his team mate Lewis is blind (For more on this please read Andy’s post on the TechNet UK blog) .  I wanted to go into some of the PowerShell in a little more detail so that you can be a good as administrator on Azure as these guys are.  For this post I am going to assume you know how to work with Azure and are familiar with concepts like storage, cloud services and networking, though you will get an idea if you follow this post!

    Firstly to get working with PowerShell on Azure we need to get hold of PowerShell for Azure and remember to check back regularly as they change as Azure changes.

    Before we can run any PowerShell against our subscriptions we need to setup some sort of trust otherwise anyone can create services against our subscription.  The simplest way to do this is with


    this will launch the Azure management portal and ask us to sign in. This command will then save a file to our local machine which we can then consume the file like this..

    Import-AzurePublishSettingsFile -PublishSettingsFile "C:\AzureManagement\some filename.publishsettings"

    However the problem with this approach is that you have access to the whole subscription which is fine for demos and labs. In production you’ll have some sort of Active Directory in place and you'll connect to that with:

    $userName = "your account  name" $securePassword = ConvertTo-SecureString -String "your account password" -AsPlainText –Force

    $Azurecred = New-Object System.Management.Automation.PSCredential($userName, $securePassword)

    Add-AzureAccount -Credential $Azurecred

    Now we can run any of the PowerShell for Azure commands against our subscription but before we can do too much with Azure VMs we will need a storage account to store them..

    $StorageAccountName = “lowercase with no spaces storage account name”

    $AzureLocation = “West Europe”

    New-AzureStorageAccount –StorageAccountName $StorageAccountName –Location $AzureLocation 

    where –Location specifies the data centre you want the storage account to reside in e.g. West Europe and get-AzureLocation will give you all the data centres you can choose. Now we have a storage account we need to declare that as the default location for our VMs ..

    $SubscriptionName = (Get-AzureSubscription).SubscriptionName
    Set-AzureSubscription -SubscriptionName $SubscriptionName -CurrentStorageAccountName $AzureStorageAccountName

    If you are familiar with VMs in Azure you’ll know that by default each VM get’s its own wrapper or cloud service but in this demo I want to put three of these VMs into the same cloud service which we can create with..

    $AzureServiceName = “This needs to be unique on”

    New-AzureService -ServiceName $AzureServiceName –Location $AzureLocation -Description "Lab Ops cloud service"

    Before we can create any VMs we need to also have a network in place and it turns out the PowerShell port for this in Azure is pretty weak all we can do is setup a network using an xml file in the form of

    <NetworkConfiguration xmlns:xsd="" xmlns:xsi="" xmlns="">
            <DnsServer name="AzureDNS" IPAddress="" />
          <VirtualNetworkSite name="My-VNet" Location="West Europe">
              <Subnet name="My-Subnet">

    you can hack this around and then save it as something like VNnet.xml and then apply it to your subscription with.

    $AzureVNet = Set-AzureVNetConfig -ConfigurationPath " path to VNet xml file"

    for more on how to hack this file with PowerShell rather than editing it then have a look at Rik Hepworth’s (azure MVP)  blog  -

    Now to create those VMs we have more choices -  we could use a template VHD of our own but for now we will  just use the gallery images just as we can in the Azure Management portal.  To do this we need to interrogate the gallery to find the right image with something like this..

    $AzureImage = Get-AzureVMimage | where imagefamily -eq "Windows Server 2012 R2 datacenter" | sort-object PublishedDate -Descending  | select-object -first 1

    which will get the most recent gallery image for Windows Server 2012R2 datacenter edition. I can then consume this in a script to create a VM

    $AdminUser = "deepfat"
    $adminPassword = "Passw0rd!"

    New-AzureVMConfig -Name $VMName -InstanceSize Medium -ImageName $AzureImage.ImageName | `
            Add-AzureProvisioningConfig –Windows -AdminUsername $AdminUser –Password $AdminPassword | `
            Set-AzureSubnet 'Deepfat-Prod' |`
            New-AzureVM –ServiceName $AzureServiceName –Location $AzureLocation -VNetName $AzureVNet

    Note that if you want to embed these snippets  in a script you’ll need to get clever and introduce some wait loops to allow the VMs to spin up.

    By default when you create a VM a couple of endpoints will be created one for RDP and one for PowerShell.  In reality you wouldn’t necessarily want to do this as you may have site to site VPN in which case this is redundant or you might just do this on one VM to manage the rest or use Azure Automation.  We need to query for these ports as in a cloud service each VM will have the same DNS entry but with different random ports:

    $VM = Get-AzureVM -ServiceName $AzureServiceName -Name $VMName

    $VMPort = (Get-AzureEndpoint -Name PowerShell -VM $VM).port

    In Andy’s post he published a self signed certificate to his cloud service which is needed to enable a secure remote PowerShell session to the VM.  However if we are just trying this in a lab then we can use the certificate that Azure automatically creates when a cloud service is created as this is also trusted by the VMs in that cloud service by default.  We can then pull this down and trust it on our local machine with

    (Get-AzureCertificate -ServiceName $AzureServiceName -ThumbprintAlgorithm SHA1).Data | Out-File "${env:PUBLIC}\CloudService.cer"
    Import-Certificate -FilePath  "${env:PUBLIC}\CloudService.cer" -CertStoreLocation Cert:\LocalMachine\AuthRoot

    Now we have all the setting and permissions we need to setup a remote PowerShell session to our VM..

    $VMCred = new-object -typename System.Management.Automation.PSCredential -argumentlist $AdminUser, (ConvertTo-SecureString $adminPassword -AsPlainText -Force  )

    $VMSession = New-PSSession -ComputerName ($AzureServiceName + "") -Port $WFE1Port -Credential $VMCred -UseSSL -Name WFE1Session

    with this session we can now add in roles and features, turn on firewall rules and so on like this

    Invoke-Command -Session $WFE1Session -ScriptBlock {
        Get-WindowsFeature Web-server | Add-WindowsFeature –IncludeAllSubFeature}

    If we want to work on a SQL server VM (there’s a bunch of gallery items on Azure with different editions of SQL Server on) then it might be useful to enable SQL Server mixed mode authentication in which case we need to pass parameters into the session and the simplest way to do this is by using the param() setting inside the script block with and –ArgumentList switch at the end (remembering to keep the parameters in the same order..

    Invoke-Command -Session $VMSession -ScriptBlock { param($VMCred, $VMName)
        #set SQL Server to mixed mode and restart the service in the process
        Get-SqlInstance  -machinename $VMName -credential $VMCred -AutomaticallyAcceptUntrustedCertificates |  Set-SqlAuthenticationMode  -Mode Mixed -Credential $VMCred -ForceServiceRestart -SqlCredential $VMCred
    -ArgumentList $VMCred, $VMName

    as this allows us to reuse the parameters we are already working with in the remote session and enhances readability.

    So that’s a quick introduction to some of the stuff that the more enlightened IT Professionals like Andy are using to make their lives easier and actually a lot of the stuff in this post works in your own data centre (like the stuff at the end to setup SQL Server) so using Azure really is just an extension of what you are used to.

    Be warned -  stuff keeps changing on Azure.
    For example a lot of older examples use Affinity Groups in azure to co-locate VMs but this is on the way out so I deliberately didn’t reference that here.  My advice is to be wary of older posts and follow the Azure blog particularly if what you are trying to do is still in preview
  • How to train your maml – Publishing your work

    Before I get into my next post you may have noticed that ML has changed compared with a few weeks ago -the main page previews experiments differently, and the visualization of data has also changed..


    One other problem hit me and that was some of the parts of my demo experiment didn’t work or ran really slowly – for example the Sweep Parameter module seems to be really slow now.  I did ask about this and apparently it’s not been configured to run as a parallel process.  So bear in mind that at the moment MAML is still in preview so it’s going to change and stuff might break.  What’s important is that we feed back to the team so we all get the tool we need.

    What I want to show in this post is how we can publish the work we have done so far, and because of the time it takes to do a sweep parameters module at the moment I am going to suggest we swap that module out for the train module as my experiment takes 4 mins to run with train compared to 5 hours with sweep!  as well as doing this we can now remove the less accurate algorithm (two class logistic regression) and just leave this


    such that the train module is set to train against the column Arrdel15 (whether or not the flight is delayed) .

    Now we’re ready to publish our experiment and share it with the world but what does that mean exactly?

    MAML allows us to expose a trained experiment as  a web service so that a transaction can be sent to it for analysis and a score returned to the calling application / web site etc. In our case we could then send an individual flight and relevant weather data and then get back the likelihood that the flight will be delayed.  While this may not be a unique feature the ease with which it can be achieved does make the use of MAML even more compelling so lets see how it works..

    The first thing we need to do is to define the data  inputs and outputs, by right clicking on the right and input and selecting Select as Publish Input and then on the output port Select as Publishing Output. This declares what data needs to be past in and what data we’ll get out of the model when it’s published.  We just need to rerun the experiment again and now we’ll see that that option to publish a web service at the bottom of the screen is now enabled..


    If we select this option our experiment is published as a draft and we get this screen..


    If we click on the test option on  the right we can see how this works as we’ll be presented with the input data we need to fill out in order to get a flight delay prediction back..


    We can also see how to use the service either in batch or transaction by transaction by clicking on the appropriate API Help page link..


    which gives the link and the format of the query we need to pass to our published service as well as links to sample code in C#, Python and R.

    However our web service is not yet live in production, it’s just a draft.  To put it into production we need to go back to the dashboard for the model and select the Mark as Ready for Production link.  We can then add descriptions to the input fields  and set the Ready for Production flag to Yes..


    where I have properly described what the Carrier ID is and then clicked Save to confirm what I have done.

    Having done all of that we now need to go back to the dashboard view and we’ll now get the option to deploy to Production and if we do that we’ll be taken back to the ML Workspace dashboard in the Azure portal where we can create a Production Web Service..


    There is a reason for all of these steps and that is to separate out who does the publishing form the data science guys in a larger team. Once it’s live note that we can see if there are updates available.  For example the model might be refined and then marked for publishing again as it is continually refined by the team who create and train experiments. 

    Now that our web site is in production we get a dashboard to show how it’s being used that’s very similar to other services in Azure like virtual machines and SQL databases..


    So that’s a quick run through of an end to end experiment in Azure ML which I hope has aroused your curiosity, however you might be wondering how much this stuff costs and that’s another recent change to the service there is a free standard version that lets you work with data sets of up to 10Gb on a single node server and publish a staging web api just I have done here. Above that is a standard edition which does incur costs which are detailed here.  Essentially you pay for every hour you are running experiments and then when you publish you pay per prediction hour and per thousand transactions (currently these are discounted while the service is in preview and are 48p per prediction hour and 12p per thousand predictions). 

    Couple those prices with the ease of use of the service and I think that MAML will take off even if it existed in isolation.  However there are other data services on Azure that have been recently announced like Near Real Time analytics, Storm on HD Insight, Cloudera and Data Factory which I would assert means that Azure is the best place to derive insight from data which by definition is data science.    

  • Future Decoded – with Data Science

    If you have been following my posts on Microsoft’s approach to data science with things like Machine Learning you’ll realise that it’s capable of making predictions. However to be honest it’s not going to tell you what devices and career you’ll have in five years time.  To do that you need expert humans not expert systems and we have gathered some of the best in their field at Future Decoded at the Excel Centre on 12th November..


    Future Decoded is much more than a data science orientated event but there is a special data track within it in the afternoon which is preceded by a morning of inspirational talks in the morning, and of particular interest to any one in the data science space would be Dr Brian Cox, Sir Martin Shadbolt and Michael Taylor (IT director of Lotus F1)  as they have unique perspectives on how to apply and use data to advance science the internet and business.

    The data track in the afternoon is not just for data scientists, it’s for anyone in a data rich e.g. traditional roles like BI and reporting, in fact anyone who can see what they do in this diagram..


    This is a big area to cover so what we have done is collect a diverse group of experts with a different perspective on how to derive value from data, based on their experience of implementing some or all of these techniques:

    image Data Science inside Microsoft Research
    Kenji Takeda,  Microsoft Research 
    JRJ Changing the game  - an agnostic hybrid approach to big data integration
    James Rowland-Jones MVP
    Chris headshot cropped Transformation and Analysis with Power BI
    Chris Webb MVP
    image Azure Machine Learning without a PhD.
    Jen Stirrup MVP

    So hopefully you can spare the time and expense to come along to Future Decoded  and get your head around what this means for you and your career. 

  • How to train your MAML

    In this fourth post in my Azure Machine Learning series we are actually going to do the training itself.  I have tidied up my experiment from last time to get rid of the modules to export data to SQL Azure as that has now served it’s purpose ..


    Before we get stuck into training there’s still a bit more tidying up to do:

    • We can get rid of the year month and day columns as they were only needed to join our datasets and don’t have any relevance in predicting flight delay (actually they might do but I am keeping things simple here).
    • We can put the weather reading data into discrete ranges rather than leaving them as continuous variables.  For example temperature could be aggregated into groups say less than –10,  –10 to 0, 0-10 , 10 , 20-40 and  40 plus.  This is called quantization in the same way that the  quantum in quantum physics recognises that particles like electrons have discrete amounts of energy.  There’s a special quantization module to help us do this in ML studio.

    So lets do those tow steps in ML studio. First add in another Project Columns module under the join to remove the time based columns..


    Now add in the a Quantize module underneath that.  Quantization can work in several ways depending on the binning and quantile normalization options we select in this module.  We can design our own bins into which each value falls by setting the bin edges (like the temperature groups I just mentioned).  If we decide to let the Quantize model set the bin edges automatically then we can choose the algorithm it will use,  the quantile normalization method, and we can set how many bins we want.  For our initial training we’ll select 10 bins for all of the numerical data we are going to use to base our predictions on and we’ll overwrite the columns with the new values..


    If we now Visualize the left hand output of the quantize function we can see that there are now 10 unique values for each of our numerical columns..


    Now we can  actually get on with the training itself, although in the real world we may have to revisit some of the earlier decisions we made to improve the output.  The process of training is much like learning simple arithmetic we are introduced to a new symbol say + and given example of how it works.  To confirm we’ve got it we use it ourselves and compare our results against some more examples.  In machine learning we do the initial training by carving out a sample from our data set.  In MAML we can use the split module to do this which we used this in my last post to just to get us the data from one airport we are now going to use the split function to get a random sample. 

    To do create our training set we drag a Split module to the bottom of the design surface and connect its input to the left hand output of the Quantization module and set its properties as follows to give a 20% random sample we can use for training..


    The question now is which algorithm to use to make our flight delay prediction? This is a huge topic in its own right and also requires prior knowledge of statistics to make sense of what some of these do.  Also you can elect to ignore all the good work Microsoft have done in providing some really powerful algorithms used in Xbox and Bing and bring your own algorithm written in open source R language. In MAML there are three types of built in algorithms, Classification, Clustering & Regression so what exactly do those terms mean?

    • Classification is where we want to determine which group something belongs to – in our case we want to determine whether a flight is in the delayed group r the on time group.
    • Clustering is similar to that but this time we don’t know what the groups are.  For example we might take a collection of purchases from a shopping web site and MAML would work out what things were bought with what other things to provide recommendations when a customer places something in their basket. 
    • Regression is the process of fitting points to curve or line (hence linear regression) so here we are looking at predicting a continuous variable say house price based on attribute of a property such as age area, nearest station distance etc.    

    So if we expand out the  Machine Learning | Evaluate | Classification object on the left of ML studio we can see some curiously named modules like Multiclass Decision Jungle  which we can use..



    Multiclass Decision Forest

    Create a multiclass classification model using a decision forest

    Multiclass Decision Jungle

    Create a multiclass classification model using a decision jungle

    Multiclass Logistic Regression

    Create a multiclass logistic regression classification model

    Multiclass Neural Network

    Create a multiclass classifier using a neural network algorithm

    One-vs-All Multiclass

    Create an one-vs-all classification model

    Two-Class Averaged Perceptron

    Creates an Averaged Perceptron binary classification model

    Two-Class Bayes Point Machine

    Create a Bayes Point Machine binary classification model

    Two-Class Boosted Decision Tree

    Create a binary classifier using a boosted decision tree algorithm

    Two-Class Decision Forest

    Create a two-class classification model using a decision forest

    Two-Class Decision Jungle

    Create a two-class classification model using a decision jungle

    Two-Class Logistic Regression

    Create a two-class logistic regression model

    Two-Class Neural Network

    Create a binary classifier using a neural network algorithm

    Two-Class Support Vector Machine

    Creates a Support Vector Machine binary classification model

    At the time of writing the help on MSDN on what these do is pretty opaque to mere mortals and to add to the fun many of these have parameters that can be tweaked as well.  At this point it’s worth recapping what data we have and what we are trying to do - In our case there is probably some sort of relationship between some of our columns for example  visibility humidity wind speed and temperature and our two outcomes are that the flight is delayed or it isn’t.  So let’s jump in as we can’t really break anything and take Microsoft’s advice that Two-Class Boosted Decision Tree is really effective in predicting one of two out comes especially when the data is sorted of related.

    The other thing we can do while training our model is to work out which of columns are actually informing the decision making process.  Eliminating non significant columns both improves computation time (which we are paying for!) and also improves accuracy by eliminating spurious noise.  In our experiment it might be that the  the non-weather related attributes a factor such as  the carrier (airline), and the originating and departing airports.  Rather than guess or randomly eliminating columns with yet another Project columns module we can get MAML to do the work for us by using the Sweep Parameters module. Let’s see this in action –drag the Sweep parameters module onto the design surface. Note it has three inputs which we connect as follows:

    • the algorithm we want to use which is a Two-Class Boosted Decision Tree module
    • the trained data set this is the 20% split from the split module – so the left hand output.
    • the data to be used to evaluate the sweep is working which is simply the rest of the data form the split module.

    So how does the evaluation but work – simply by selecting the column to be used which has the answer we are looking for , whether or not  a flight is delayed which is the ArrDel15 column.  We can simply leave the rest of the sweep parameters alone for now and rerun the model.


    Each of the outputs from the sweep look very different from what we have in our data set, and although accuracy (at an average of 0.91) is in one of the columns it’s difficult for a data science newbie like me to work out how good the predictions are is. Fortunately we can get a better insight into what’s going on by using the Score and Evaluate modules.  To do this we connect the Score module the left output of the sweep parameters module, underneath this we can then drag on the Evaluate module and connect the output of the score module to the left hand input and rerun the experiment( neither of these have any settings )


    If we now look visualize the output of the Evaluate module we get this graph..


    What is this telling us? The key numbers are the false positive figures just under the graph where we can see that there were 302,343 correctly predicted on late flights, but 98,221 were identified as being late but were on time and that 1,466,0871 flights were correctly predicted as being on time but 61,483 were predicted to be on time but were late.  The ROC graph is also an indication of accuracy the greater the area between the blue line and a straight line is an indication of accuracy as well so the closer the blue line is to un upside down L shape the better it is.

    Could we improve on this score and how might we do that?  We could tweak the parameters for the Two-Class Boosted Decision Tree and what parameters we sweep for or we could see if there is a better algorithm.  For this I would use the technique I use to evaluate a good wine which is to try blind taste two wines and select my favourite hang on to that and then  compare that with the next wine and continue until I have my favourite of a given group.  In MAML we can do this by adding another branch tot eh second input of the second input evaluate module and compare how they do.

    By way of an example we can drag on the Two-Class Logistic Regression module,  copying and pasting the Score Model and Sweep Parameters modules  and connecting up these as shown below..


    and if we visualize the evaluate module we now get two ROC curves and by highlighting them we can see the detailed results  where the red curve is the right hand input so our Two-Class Logistic Regression module


    We can see that the Two-Class Boosted Decision Tree module is slightly outperforming Two-Class Logistic Regression and we should stick with that or compare it to something else.

    So not only is MAML really easy to setup but provides sophisticated tools to help you to evolve to the right algorithm to meet your business need and we still haven’t had to enter any code or get too deep into maths & stats.  What now – well we could output the data to something like Power BI to show the results but we might want also want to use the experiment we have made to predict flights on a departure by departure basis in some sort of web site and that’s what we’ll look at in my next post in this series.

  • How to train your MAML – Looking at the data in SQL Azure

    In my last post we saw how to clean, transform and join datasets. I also mentioned I had trouble doing the join at all and even now it’s not quite right so how can we look at the data and find out what’s going on.  The visualisation option only shows a few rows of the 2 million in the dataset and there’s not really anything in ML Studio to help.  However we can export the data in a number of ways, to Azure storage either as a table or blob or directly into SQL Azure.  The latter seems more useful as we need to investigate what’s going on when the data is joined, however a little bit of pre-work is required as ML studio won’t create the tables for us – we’ll have to do that ourselves.  We also don’t need all the data just a consistent set to work on so let’s start by understanding the split module in ML studio, which we’ll need later on to train our experiment.

    Search for the Split module in MLStudio and drag it onto the design surface, connect it the last Apply Math Operation in the flight delay data set process and set the split properties as shown..


    What the split does is to send the filtered data that meets our criteria to the result dataset1 (on the left) and the data that doesn’t to the result set on the right (result dataset 2). as well see later normally we would use this to randomly split out some of the data at random but in this case we are using a relative expression - \"DestAirportID" > 11432 & <11434 ( I did try \"DestAirportID" = 11433 but I got an error!) to just give us the flights arriving at one airport. If we run the model now we can visualise the data in the left output the split module we’ll just one value (1433).

    Note when we rerun experiment only the bits that have changed and what they affect are actually run – in this case only the split itself is being run the rest is being read form cache (in the  storage account)

    Now we need somewhere to put the data which in this case will be SQL Azure.  As with MAML everything can be done from the browser.

    Go to the Azure management portal and select the SQL Databases option and at the bottom of the screen click on the plus to create a new database with quick create. Mines in northern Europe, it’s basic with no replication and  is 1Gb (you won’t need that much)..


    You will also be prompted for an administrator ID and password. Once the database is created we now need to ensure it remotely and so we need to open up the firewall much as we would do if the database was on any remote server in our data centre.  To this in SQL Azure  click on the configure option ..


    You will see your current ip address here which you can then use to make a rule (I have hidden mine in the screen shot above) . Now we can go back to the dashboard and click on the hyperlink to design or run queries against our new database from the SQL management portal (you’ll be asked to login first).  Now we can add in a table for our flight delay data as ML studio won’t do that for us.  We need it to have the right data types and rather than you doing it manually here is the query you can run to create it.

    USE ML

    CREATE TABLE [dbo].[FlightDelay] (
        [ID]              INT          IDENTITY (1, 1) NOT NULL,
        [Month]           INT          NOT NULL,
        [DayofMonth]      INT          NOT NULL,
        [Carrier]         NVARCHAR (5) NOT NULL,
        [OriginAirportID] INT          NOT NULL,
        [DestAirportID]   INT          NULL,
        [DepDelay]        INT          NULL,
        [CRSArrTime]      INT          NULL,
        [ArrDel15]        INT          NULL,

        ON [dbo].[FlightDelay]([ID] ASC);

    Notice that there is a separate ID column with the identity type so that we have a primary key for each row.

    Now we can see how to export data to SQL Azure from ML studio.  Drag the Data Writer model onto the design surface and connect it to the left hand output of the split module..


    Set the module properties as follows;

    • Data destination:  SQL Azure
    • Database server name: your database server name
    • Database name ML (in my case)
    • Server account name and password to match the admin account and password you setup for your database.
    • Check accept any server certificate
    • Coma separated list of columns to be saved:   Month,DayofMonth,Carrier,OriginAirportID,DestAirportID,CRSArrTime,DepDelay,ArrDel15
    • Data table name Flight Delay
    • Comma separated list of datatable columns: Month,DayofMonth,Carrier,OriginAirportID,DestAirportID,CRSArrTime,DepDelay,ArrDel15
    • Number of rows to be written per SQL Azure operation: 50 (the default)

    Note the columns names above are case sensitive and the number of columns input and output must be the same.  Also be aware if you run the experiment again you’ll add more rows to he SQL tables each time so remember to empty the table before a run –

    truncate table FlightDelay; truncate table AirportWeather

    If we run the experiment now we will populate the FlightDelay table in SQL azure and each rerun will truncate the table and repopulate this (I couldn’t see how to override this).  Once that’s working OK we can then repeat the exercise for the weather data:

    • In the SQL Azure management portal create a new table AirportWeather

     USE [ML]
    CREATE TABLE [dbo].[AirportWeather] (
        [ID]               INT IDENTITY (1, 1) NOT NULL,
        [AirportID]        INT NULL,
        [Month]            INT NULL,
        [Day]              INT NULL,
        [Time]             INT NULL,
        [TimeZone]         INT NULL,
        [Visibility]       INT NULL,
        [DryBulbCelsius]   INT NULL,
        [WetBulbCelsius]   INT NULL,
        [DewPointCelsius]  INT NULL,
        [RelativeHumidity] INT NULL,
        [WindSpeed]        INT NULL

    CREATE CLUSTERED INDEX [IX_AirportWeather_0]
        ON [dbo].[AirportWeather]([ID] ASC);

    • Copy and past the existing Split module and connect it to the last Output Math Operation for weather data process and change the relative expression to  \"AirportID" > 11432 & <11434
    • Copy and paste the existing write module and connect it to the new Split module for the weather data. Change the two setting for the columns to be used to AirportID,Month,Day,Time,TimeZone,Visibility,DryBulbCelsius,WetBulbCelsius,DewPointCelsius,RelativeHumidity,WindSpeed


    Now we’ll leave ML Studio and look at what we have in the SQL Azure portal. Click on new query and paste following in..

    Select F.Month, F.DayofMonth, F.CRSArrTime, F.DestAirportID, F.CRSArrTime, F.DepDelay, F.OriginAirportID, A.DewPointCelsius, A.DryBulbCelsius, A.RelativeHumidity, A.Visibility, A.WetBulbCelsius, A.WindSpeed
    from FlightDelay F
    inner join AirportWeather A
        on F.DestAirportID = A.AirportID  and F.Month = A.Month  and F.DayofMonth = A.Day
        and F.CRSArrTime = A.Time
    order by
    F.Month, F.DayofMonth, F.CRSArrTime

    and notice that we get back 61,231 rows compared to the row count of 62,211 for the FlightDelay table which means we are losing 980 rows of data.  887 rows of this are down to the fact that there are no rows in the weather data for 0  (midnight)  but there are in the flight data (midnight in the weather data shows as 24). So something is wrong with the mechanism for matching the flight arrival time with the nearest timed weather reading.  This was not a deliberate error on my part I just based this all on the example in ML studio but it does show two things:

    • Always check your findings – in this case the wrong weather data was being used which will affect the accuracy of our predictions
    • Don’t blindly trust someone elses work!

    Anyway in this case it’s easy to fix. all we need to do is to subtract one hour from the weather reading data by adding in another Apply Maths Operation Module as show below..


    if we run our query in SQL Azure again we will get back 62159 rows which means we are missing a few rows of weather data and so certain flights will are being dropped when we do the join.  If we were to use this in BI then we would need to fix this but what we need for machine learning is clean data and now we have made this fix we have a good set of data on which to make predictions and that's what we will start to look at in my next post.

  • How to Train your MAML–Refining the data

    In my last post we looked at how to load data into Microsoft Azure Machine Learning using the browser based ML Studio.  We also started to look at the data around predicting delayed flights and identified some problems with it and this post is all about getting the data into the right shape to ensure that the predictive algorithms in MAML have the best chance of giving us the right answer.   

    Our approach is three fold

    • To discard the data we don’t need, either columns that aren’t relevant or are derived from other data and to discard rows where there is missing data for the columns (features in machine learning speak)
    • To tag the features correctly as being numbers or strings and whether they are categorical or not.  Categorical in this context means that the value puts them in a group rather than being continuous so AirportID is categorical as it puts a row into a group of rows for the same airport where temperature is a continuous variable and the numbers do represent point on a line (where AirportID 1 is nothing to do with ID 3 or 4).
    • To join the flight delay dataset to the weather data set on the Airport and the date/time. In my last post I mentioned that we could either join the weather data in twice once to the departure airport and once to the arriving airport and indeed the sample experiment on flight delay prediction does exactly this but I think a simpler approach is to just model the arrival delay on the fact that some flight have a delayed departure time which may or  may not be influenced by the weather at the departure airport.

    Let’s get started..

    Open ML Studio, create a new experiment , give it a suitable name and drag the flight delays and the Weather datasets onto the design surface so it looks like this ..


    Clean the data

    As before we can right click on the circle at the bottom of the data set and select visualize data to see what we are working with- for example here’s the weather data.


    What’s is odd here is that the data is not properly typed in that some of the numeric data is in a column marked string such as  the weather data set temperature columns.  I spent ages trying to work out how to fix this and the answer turns out to be to use the Convert to Data set module which automatically does this.  So our first step is to drag tow of them onto the design service and connect them to each of our data sets..


    If we run our model (run is at the bottom of the screen) we can then visualize the output of the convert to dataset steps and now our data is correctly identified as being numeric etc.

    The next step is to get rid of any unwanted columns and this is simply a case of using the project columns module (to find it just use the search at the top of the modules list).  You can either start with a full list of columns and remove what you don’t need or start with an empty list and add in what you do need. So lets drag it onto to the design surface and then drag a line from the Flight Delays Data to it.  It’ll have a red X against it as it’s not configured and we can do this from the select columns on the task pane


    Here I have selected all columns and then excluded Year , Cancelled, ArrDelay, DepDelay15, and CRSDeptime.  At this point we can check to see that what we get is what we wanted by clicking the run button at the bottom of the screen.  

    Note It’s only when we run stuff in ML Studio that we are being charged for computation time using this service, the rest of the time we are just charged for the storage we are using (for our own data sets and experiments)

    As before at each stage we can visualize the data that’s produced by right clicking on its output node..


    Here we can see that we have one column Depdelay that has missing values so the next thing we need to do is to get rid of that and we can use the Missing Values Scrubber module for this so search for that and drag it on to the design service and drag a connector from the output of the project columns module to it.  We then need to set its properties to set how to deal with the missing values.  As we have such a lot of clean data we can simply ignore any rows with missing values by setting the top option to remove entire row..


    We can now run the experiment again to check we have no more missing values.

    Now we need to do some of this again for the weather dataset. We can then add in another project column module to select the columns we need – this time I am starting with an empty list and specifying which columns to add..


    and the data scrubber module again set to remove the entire row…


    Tag the Features

    Now we need to change the metadata about some of the columns to ensure ML studio handles them properly. Here I cheated which shows you another feature of ML studio.   Remember that some of the number in our data are codes rather than being a continuous number for example the airport codes and the airline code. We need to tell MLStudio that these are categorical  by using the Metadata Editor module. To this we are going to cheat and by simply copying that module form another experiment.  Open another browser window and go into the ML Studio home page and navigate to the Flight Delay sample prediction.  Find the Metadata Editor module on their and paste it to the clipboard and then go back into the browser with our experiment and paste it in, and you should see that this module is set to make Carrier, OriginalAirportID and DepAirPortID categorical...


    Join the datasets

    Now we have to sets of clean data we need to join them.  They both have an airport ID, month and day and the flight delay data set has an arrival time to the nearest minute.  However  the weather data is taken at 56 minutes part the hour every hour and is in local time with a separate time zone column. So what we need to do is round up the flight arrival time to the nearest hour and do the same for the weather data as follows:

    For the flight delay arrival time

    1. Divide the arrival time by 100

    2. Round down the arrival time to the nearest hour

    For the weather data time

    3. Divide the weather reading time by 100 to give the local time in hours

    4. round up to the nearest hour

    So how do we do that in ML studio? The answer is one step at a time making repeated use of the Apply Math Operation  module.  Help is pretty non existent for most of these modules at the time of writing so experimentation is the name of the game, and I hope I have done that for you. We’ll place 4 copies of the Maths Operation module on the design surface one for each step above (so two linked to the weather dataset and two to the flight delay set) ..


    Notice the comments I have added to each module (right click and select add comment) and here’s the setting for each step..

    Step 1


    note the output mode of inplace which means that the value is overwritten and  we get all the other columns in the output as well, so make sure this is set for each of the four steps.

    Step 2


    Step 3


    Step 4


    Now we can use the Join module (again just search for Join and drag it onto the design surface) to connect our data sets together.  Not surprisingly this module has two inputs and one output and we’ll see several modules with multiple inputs and outputs in future.  Connect the last module in each of our data set chains into the join and set the properties for the join as shown..


    so on the left (flight data ) we have Month,DayofMonth,CRSArrTime,DestAirportID and on the right (the weather data) we have Month,Day,Time,AirportID.

    I have to be honest it took a while to get here and initially I got zero rows back.  Even now it’s not quite perfect as I have got slightly more rows than I started with which I have tracked down to having the odd hour in the weather data that has two readings.  Finding that kind of data problem is beyond what you can do in ML studio in the preview so in my next post I’ll show you your options for examining this data outside of ML studio.

  • How to train your MAML – Importing data

    In my last post I split the process of using Microsoft Azure Machine Learning (MAML) down to four steps:

    • Import the data
    • Refine the data
    • Build a model
    • Put the model into production.

    Now I want to go deeper into each of these steps so that you can start to explore and evaluate how this might be useful for your organisation.  To get started you’ll need an Azure subscription; you can use the free trial, your organisation’s Azure subscription or the one that you have with MSDN.  You then need to sign up for MAML as it’s still in preview (note using MAML does incur charges but if you have MSDN or a trial these are capped so you don’t run up a bill without knowing it)..

    You’ll now have an extra icon for MAML in your Azure Management Portal and the from here you’ll need to create an ML Workspace to store your Experiments (models).  Here I have one already called HAL but I can create another if I need to by clicking on New at the bottom of the page, and selecting Machine Learning and clicking on it and then following the Quick Create wizard..


    Notice that as well as declaring the name and owner I also have to specify a storage account where my experiments will reside and at the moment this service is only available in Microsoft’s South Central US data centre.  Now I have somewhere to work I can launch ML Studio from the link on the Dashboard..


    This is simply another browser based app which works just fine in modern browsers like Internet Explorer and Chrome..


    There’s a lot of help on the home page from tutorials and sample to a complete list of functions and tasks in the tool.  However this seems to be aimed at experienced data scientists who are already familiar with the concepts of machine learning.   I am not one of those but I think this stuff is really interesting so if this all new to you too then I hope my journey through this will be useful but  I won’t be offended if you break off now and check these resources because you went to University and not to Art College like me!

    In my example we are going to look at predicting flight delays in the US based on one of the included data sets.  There is an example experiment for this but there isn’t an underlying explanation on how to build up a model like this so I am going to try and do that for you. The New option on the bottom of this ML studio screen allows you to create a new experiment and if you click on this you are presented with the actual ML studio design environment..


    ML studio works much like Visio or SQL Server Integration Services, you just drag and drop the boxes you want on the design surface and connect them up but what do we need to get started?

    MAML needs data and there are two places we can import this -  either by performing a data read operation from some source or creating a data set or. At this point you’ll realise there’s lots of options in ML Studio and so the search option is a quick way of getting to the right thing if you know it’s there.  If we type reader into the search box we can drag that onto the design surface to see what it does..


    The Reader module comes up with a red x as it’s not configured, and to do that there a list of properties on the right hand side of the screen.  For example if the data we want to use is in Azure blob storage then we can enter the path and credentials to load that in.  There are also options for http feed , SQL Azure,  Azure Table Storage as well as HiveQuery (to access Hadoop and HDInsight)  and PowerQuery. PowerQuery is a bit misleading as it’s actually a way of getting OData and one example of that is PowerQuery.  Having looked at this we’ll delete it and work with one of the sample data sets.

    Expand the data sources option on the left you’ll see a long list of samples from IMDB film titles to flight delays and astronomy data. If I drag the Flight Delays Data dataset onto the design surface  I can then examine it by right clicking on the output node at the bottom of it, right click and select Visualize..


    this is essential as we need to know what we are dealing with and ML Studio gives us some basic stats on what we have..


    MAML is fussy about it’s diet and heres’ a few basic rules:

    • we just need the data we that are relevant to making a prediction.  For example all the rows have the same values for year (2013) so we can exclude that.
    • There shouldn't be any missing values in the data we are going to use to make a prediction and 27444 rows of this 2719418 row data set have missing departure values so we will want to exclude those.
    • No feature should be dependant on another feature much as in good normalisation techniques for data base design.  DepDelay and DepDel15 are related in that if DepDelay is greater then 15 minutes then DepDelay = 1.  The question is which one is the best at predicting the Arrival Delay, specifically ArrDel15 which is whether or not the flight is more than 15 minutes late.
    • Each column (feature in data science speak) should be of the same order of magnitude.

    However eve after cleaning this up there is also some key missing data to answer our question “why are flights delayed?” It might be problems associated with the time of day or the week , the carrier our difficulties at the departing or arriving airport, but what about the weather which isn’t in our data set?  Fortunately there is another data set we can use for this – the appropriately named Weather dataset.  If we examine this in the same way we can see that it is for the same time period and has a feature for airport so it's be easy to join to our flight delay dataset. The bad news is that most of the data we want to work with is of type string (like the temperatures) and there redundancy ion it as well so we’ll have some clearing up to do before we can use it. 

    Thinking about my flying experiences it occurred to me that we might need to work the weather dataset in twice to get the conditions at both the departing and the arriving airport. Then I realised that any delays at the departing airport might be dependant on the weather and we already have data for the departure delay (DepDelay) so all we would need to do is to join it which we’ll look at in the next post in this series where we prepare the data. based on what we know about it.

    Now we know more about our data we can start to clean it and refine it and I’ll get stuck into that in my next post but just one thing before I go – we can’t save our experiment yet as we haven’t got any modules on there to do any processing so don’t panic we’ll get to that next.

  • Learning Machine Learning using a Jet Engine

    In this post I want to try and explain what machine learning is and put into context with what we used to do when analysing and manipulating data.  When I started doing all this the hot phrase was Business Intelligence (BI),  but this was preceded by EIS (Executive Information Systems) and DSS (Decision Support Systems).  All of these were to a larger extent looking backwards like driving a car using just the rear view mirror.  There was some work being done on trying to do look ahead (predictive analysis) and this was typically achieved by applying data mining techniques like regression (fitting points to a line or curve).

    Nowadays the hot topic is Machine Learning (ML) so is that just a new marketing phrase or is this something that’s actually new?  Data mining and ML both have a lot in common; complex algorithms that have to be trained to build a model that can then be used against some set of data to derive something that wasn’t there before. This might be a numeric (continuous value) or a discreet value like the name of a group or a simple yes/no.  What I think makes ML different is that it’s there for a specific purpose where data mining is more like a fishing exercise to discover hidden relationships where one such use is predictive analytics.  So data mining could be considered to be a sub set of ML and one use of ML is to make predictions.

    If I look at how Microsoft has implemented ML in Azure (which I will refer to as MAML), then there is a lot of processes around data acquisition before training and publishing occur.  In this regard we might relate this to a modern commercial jet engine.


    • Suck the data in to the ML workbench..


    This can be done either from a raw source or via HD Insight (Hadoop on Azure) which means MAML can work with big data.  Note that in the jet  engine diagram much of the air bypasses the engine if we are using big data in ML then we may ignore large chunks of that as not being relevant to what we are doing.  A good example is Twitter – most tweets aren’t relevant because they don’t mention my organisation.

    • Squeeze the data.


    If we haven’t sourced the data from something like HD Insight we need to clean it up to ensure  a high quality output and we understand it’s structure -  type, cardinality etc.

    • Bang.


    In a jet engine we introduce fuel and get a controlled explosion in ML we apply our analysis to get results and provide momentum for organisation.  Specifically this is where we build and refine our analytical model by taking an algorithm such as one used in data mining and training it against a sample set of data. MAML has specific features to support this a special split task for carving out data for model training and an evaluation task to tell you how well your model is performing by visualising the output against an ideal (upside down L curve).

    • Blow. 


    Having established that your model works you’ll want to unleash it on the world and share it which is analogous to the thrust or work produce by a jet engine.  In MAML we can expose the model as a web service or a data set which can be used in any number of ways to drive a decision making process. 

    Jet engines don’t exist in isolation typically these are attached to aircraft and it is the aircraft that controls where the engine goes.  In ML we are going to have a project to set our direction and ML might only be a small part of this in the way that the product recommendation engine on a web site like Amazon is only a small part of the customer experience.  So as with all the varying names for projects that use data to drive businesses forward we need to be aware that the analytics bit is a small part of the whole; we need to be aware of data quality, customer culture and all of the baggage that are essential to a successful implementation.  There is also one extra complication that comes from the data mining world and that is that it’s not always possible to see how the result was derived at so we have to build trust in the results particularly if they contradict our intuition and experience. 

    So that the good news for us experienced BI professionals is that we can apply many of the lessons we have learnt and apply them to ML and with MAML we don’t need to know too much about the data mining algorithms unless we want to.

  • Adventure Works!

    My title for this post is a pun on the Adventure Works databases, and samples that have been in SQL Server since I can remember.  There were also some data mining examples ( as referenced in  this old post) but this has not really moved on since 2011 when I last wrote about it so you might be forgiven for thinking that data mining is dead as far as Microsoft is concerned.

    However since that time two big things have happened;  the hyper-scale of cloud and the rise of social media as a business tool not just as a bit of fun to share strange pictures and meaningless chat.  Coupled together this is big data;   masses of largely useless data, being produced at a rate faster than can be downloaded in a variety of semi and unstructured formats – so Volume , Velocity and Variety.  Hidden in this data are nuggets of gold such as brand sentiment , how users navigate our web sites and what they are looking at, and patterns that we can’t immediately recognise. Up until now processing and analysing big data has really only been possible for large corporates and governments as they have the resources and expertise to do this. However as well as storing big data the cloud can also be used to make this big data analysis available to anyone who has the sense of adventure to give it a try -  all that’s needed is access to the data and an understanding of how to mine the information.  However the understanding bit of the equation is still a problem and this expertise aka data science is the bottleneck and a quick search on your favourite jobs board for jobs in this area will confirm this. 

    So what is Microsoft doing about this?

    What they have always done – simplify it , commoditise it, and integrate it.  If I go back to SQL Server 2000 we had DTS to load and transform data from any source and analysis services to slice and dice it from Excel and then we got reporting services in 2002 all in one product.  In 2014 we have a complete set of tools to mash, hack and slice data into submission from any source, but these tools are no longer in SQL Server they are in the cloud specifically Azure and in Office 365.   So what are the tools?

    • HDInsight which is Hadoop running as a service in Azure  where you can build a  cluster as large as you need and feed it data with all the industry standard tools you are used to (Mahout and Pig for example).
    • Microsoft Azure Machine Learning (MAML) can take data from anywhere including HDInsight and do the kind of predictive analytics that data mining promised but without the need to be a data scientist yourself.  This is because the MAML studio has a raft of the best algorithms that are publicly available and is also very easy to use from IE or Chrome – actually it reminds a bit of SQL Server Integration Services which is no bad thing.    


    Once you have trained your experiment (as they called) you can expose this as a web service which can then be consumed on a transaction by transaction basis to score credit, advise on purchase decisions etc. within your own web sites. 

    • Office 365 provides the presentation layer on the tools above with access to HDInsight data and machine learning datasets from the Power BI suite of tools.

    In order to play with any of these there’s two other tools you’ll need - An MSDN subscription to get free hours on Azure to try this out and to get a copy of Office 2013 for the Power BI stuff. You’ll also want to watch the Microsoft Virtual Academy for advice and guidance although at the time of writing there aren’t any courses on MAML as it’s so new.

    Finally a word of warning before you start on your own adventure  - these tools can all encode a certain amount of business logic and so it’s important to understand the end to end changes you have made in building your models from source to target and to consider where and when to use which tool.  For example Power Pivot can itself do quite a of data analysis but is best used in a big data world as a front end for HDInsight or machine learning experiment. 

    I will be going deeper into this in subsequent posts as this stuff is not only jolly interesting it’s also a huge career opportunity for anyone who loves mucking around with data.