Insufficient data from Andrew Fryer

The place where I page to when my brain is full up of stuff about the Microsoft platform

August, 2014

  • How to train your MAML – Importing data

    In my last post I broke the process of using Microsoft Azure Machine Learning (MAML) down into four steps:

    • Import the data
    • Refine the data
    • Build a model
    • Put the model into production.

    Now I want to go deeper into each of these steps so that you can start to explore and evaluate how this might be useful for your organisation.  To get started you’ll need an Azure subscription; you can use the free trial, your organisation’s Azure subscription or the one that you have with MSDN.  You then need to sign up for MAML as it’s still in preview (note that using MAML does incur charges, but if you have MSDN or a trial these are capped so you don’t run up a bill without knowing it).

    You’ll now have an extra icon for MAML in your Azure Management Portal, and from here you’ll need to create an ML Workspace to store your Experiments (models).  Here I have one already called HAL, but I can create another if I need to by clicking on New at the bottom of the page, selecting Machine Learning and then following the Quick Create wizard.


    Notice that as well as declaring the name and owner I also have to specify a storage account where my experiments will reside, and at the moment this service is only available in Microsoft’s South Central US data centre.  Now that I have somewhere to work I can launch ML Studio from the link on the Dashboard.


    This is simply another browser-based app which works just fine in modern browsers like Internet Explorer and Chrome.


    There’s a lot of help on the home page, from tutorials and samples to a complete list of functions and tasks in the tool.  However this seems to be aimed at experienced data scientists who are already familiar with the concepts of machine learning.  I am not one of those but I think this stuff is really interesting, so if this is all new to you too then I hope my journey through this will be useful, but I won’t be offended if you break off now and check these resources because you went to University and not to Art College like me!

    In my example we are going to look at predicting flight delays in the US based on one of the included data sets.  There is an example experiment for this but there isn’t an underlying explanation of how to build up a model like this, so I am going to try and do that for you. The New option at the bottom of this ML Studio screen allows you to create a new experiment, and if you click on this you are presented with the actual ML Studio design environment.


    ML Studio works much like Visio or SQL Server Integration Services: you just drag and drop the boxes you want onto the design surface and connect them up.  But what do we need to get started?

    MAML needs data and there are two ways we can import this - either by performing a data read operation from some source or by creating a data set. At this point you’ll realise there are lots of options in ML Studio, so the search option is a quick way of getting to the right thing if you know it’s there.  If we type reader into the search box we can drag that onto the design surface to see what it does.


    The Reader module comes up with a red x as it’s not configured, and to do that there is a list of properties on the right hand side of the screen.  For example if the data we want to use is in Azure blob storage then we can enter the path and credentials to load that in.  There are also options for an http feed, SQL Azure and Azure Table Storage, as well as HiveQuery (to access Hadoop and HDInsight) and PowerQuery. The PowerQuery name is a bit misleading as this option is actually a way of getting OData, and PowerQuery is just one example of that.  Having looked at this we’ll delete it and work with one of the sample data sets.

    Expand the data sources option on the left and you’ll see a long list of samples, from IMDB film titles to flight delays and astronomy data. If I drag the Flight Delays Data dataset onto the design surface I can then examine it by right clicking on the output node at the bottom of it and selecting Visualize.


    This is essential as we need to know what we are dealing with, and ML Studio gives us some basic stats on what we have.


    MAML is fussy about its diet and here are a few basic rules:

    • We just need the data that is relevant to making a prediction.  For example all the rows have the same value for year (2013) so we can exclude that.
    • There shouldn't be any missing values in the data we are going to use to make a prediction, and 27444 rows of this 2719418 row data set have missing departure values so we will want to exclude those.
    • No feature should be dependent on another feature, much as in good normalisation techniques for database design.  DepDelay and DepDel15 are related in that if DepDelay is greater than 15 minutes then DepDel15 = 1.  The question is which one is the best at predicting the arrival delay, specifically ArrDel15, which is whether or not the flight is more than 15 minutes late.
    • Each column (feature in data science speak) should be of the same order of magnitude.
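
    To make these rules a bit more concrete, here is a minimal sketch in plain Python of what the clean-up amounts to. It is not how ML Studio does it internally; the rows, the Carrier column and the helper function names are made up for illustration, and only Year, DepDelay, DepDel15 and ArrDel15 come from the dataset described above.

```python
# Tiny made-up sample; only Year, DepDelay, DepDel15 and ArrDel15 are
# column names taken from the post. Everything else is hypothetical.
rows = [
    {"Year": 2013, "Carrier": "AA", "DepDelay": 20.0, "DepDel15": 1, "ArrDel15": 1},
    {"Year": 2013, "Carrier": "DL", "DepDelay": -3.0, "DepDel15": 0, "ArrDel15": 0},
    {"Year": 2013, "Carrier": "UA", "DepDelay": None, "DepDel15": None, "ArrDel15": 0},
    {"Year": 2013, "Carrier": "AA", "DepDelay": 45.0, "DepDel15": 1, "ArrDel15": 1},
]


def constant_columns(rows):
    """Rule 1: a column with the same value in every row tells us nothing."""
    return [c for c in rows[0] if len({r[c] for r in rows}) == 1]


def drop_columns(rows, columns):
    """Return copies of the rows without the named columns."""
    return [{k: v for k, v in r.items() if k not in columns} for r in rows]


def drop_missing(rows):
    """Rule 2: keep only rows where every value is present."""
    return [r for r in rows if all(v is not None for v in r.values())]


def min_max_scale(rows, column):
    """Rule 4: squeeze a numeric column into the range 0 to 1."""
    values = [r[column] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero for a constant column
    return [{**r, column: (r[column] - lo) / span} for r in rows]


useless = constant_columns(rows)
cleaned = drop_missing(drop_columns(rows, useless))

# Rule 3: DepDel15 is derived from DepDelay (using the "greater than 15"
# reading given in the post), so the two features are redundant.
redundant = all(r["DepDel15"] == int(r["DepDelay"] > 15) for r in cleaned)

scaled = min_max_scale(cleaned, "DepDelay")
print(useless, len(cleaned), redundant)
```

    Min-max scaling is just one simple way of getting columns onto the same order of magnitude; I picked it here because it is easy to read, not because it is what MAML uses.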

    However even after cleaning this up there is also some key missing data to answer our question “why are flights delayed?” It might be problems associated with the time of day or the week, the carrier or difficulties at the departing or arriving airport, but what about the weather, which isn’t in our data set?  Fortunately there is another data set we can use for this – the appropriately named Weather dataset.  If we examine this in the same way we can see that it is for the same time period and has a feature for airport, so it should be easy to join to our flight delay dataset. The bad news is that most of the data we want to work with is of type string (like the temperatures) and there is redundancy in it as well, so we’ll have some clearing up to do before we can use it.
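
    As a rough idea of what joining the two datasets involves, here is a small sketch in plain Python. All of the column names, the sample rows and the "M" marker for a missing reading are assumptions of mine for illustration; the real datasets have their own schema, which we haven’t looked at in detail yet.

```python
# Hypothetical rows and column names, purely for illustration.
flights = [
    {"OriginAirport": "SEA", "Day": 1, "Hour": 9, "DepDelay": 20.0},
    {"OriginAirport": "JFK", "Day": 1, "Hour": 9, "DepDelay": -3.0},
    {"OriginAirport": "SEA", "Day": 2, "Hour": 14, "DepDelay": 5.0},
]
weather = [
    # Temperatures arrive as strings, as noted in the post.
    {"Airport": "SEA", "Day": 1, "Hour": 9, "Temperature": "4.5"},
    {"Airport": "JFK", "Day": 1, "Hour": 9, "Temperature": "M"},
]


def to_float(text):
    """Convert a string reading to a number, or None if it isn't one."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def join_weather(flights, weather):
    """Inner join: keep flights that have a weather row for the same
    airport, day and hour, and attach the numeric temperature."""
    by_key = {(w["Airport"], w["Day"], w["Hour"]): w for w in weather}
    joined = []
    for f in flights:
        w = by_key.get((f["OriginAirport"], f["Day"], f["Hour"]))
        if w is None:
            continue  # no matching weather reading, so drop the flight
        joined.append({**f, "Temperature": to_float(w["Temperature"])})
    return joined


joined = join_weather(flights, weather)
for row in joined:
    print(row)
```

    Note that a reading which can’t be converted ends up as a missing value, which then has to be dealt with under the same no-missing-values rule as before.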

    Thinking about my flying experiences it occurred to me that we might need to work the weather dataset in twice to get the conditions at both the departing and the arriving airport. Then I realised that any delays at the departing airport might be dependent on the weather and we already have data for the departure delay (DepDelay), so all we would need to do is to join it once, which we’ll look at in the next post in this series where we prepare the data based on what we know about it.

    Now that we know more about our data we can start to clean it and refine it, and I’ll get stuck into that in my next post.  Just one thing before I go – we can’t save our experiment yet as we haven’t got any modules on there to do any processing, so don’t panic, we’ll get to that next.

  • Learning Machine Learning using a Jet Engine

    In this post I want to try and explain what machine learning is and put it into context with what we used to do when analysing and manipulating data.  When I started doing all this the hot phrase was Business Intelligence (BI), but this was preceded by EIS (Executive Information Systems) and DSS (Decision Support Systems).  All of these were to a large extent looking backwards, like driving a car using just the rear view mirror.  There was some work being done on trying to look ahead (predictive analysis) and this was typically achieved by applying data mining techniques like regression (fitting a line or curve to points).

    Nowadays the hot topic is Machine Learning (ML), so is that just a new marketing phrase or is this something that’s actually new?  Data mining and ML have a lot in common: complex algorithms that have to be trained to build a model that can then be used against some set of data to derive something that wasn’t there before. This might be a numeric (continuous) value or a discrete value like the name of a group or a simple yes/no.  What I think makes ML different is that it’s there for a specific purpose, where data mining is more like a fishing exercise to discover hidden relationships, and one such use is predictive analytics.  So data mining could be considered to be a subset of ML, and one use of ML is to make predictions.

    If I look at how Microsoft has implemented ML in Azure (which I will refer to as MAML), then there are a lot of processes around data acquisition before training and publishing occur.  In this regard we might relate this to a modern commercial jet engine.


    • Suck the data in to the ML workbench.


    This can be done either from a raw source or via HDInsight (Hadoop on Azure), which means MAML can work with big data.  Note that in the jet engine diagram much of the air bypasses the engine; if we are using big data in ML then we may ignore large chunks of that as not being relevant to what we are doing.  A good example is Twitter – most tweets aren’t relevant because they don’t mention my organisation.

    • Squeeze the data.


    If we haven’t sourced the data from something like HDInsight we need to clean it up to ensure a high quality output, and make sure we understand its structure - type, cardinality etc.

    • Bang.


    In a jet engine we introduce fuel and get a controlled explosion; in ML we apply our analysis to get results and provide momentum for the organisation.  Specifically this is where we build and refine our analytical model by taking an algorithm, such as one used in data mining, and training it against a sample set of data. MAML has specific features to support this: a special split task for carving out data for model training, and an evaluation task to tell you how well your model is performing by visualising the output against an ideal (an upside down L curve).
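
    For anyone who wants to see the idea behind the split and evaluation tasks, here is a minimal sketch in plain Python. It is my own illustration, not MAML’s implementation: I am assuming the upside down L curve is a ROC curve (true positive rate plotted against false positive rate), and the function names, the 70/30 split and the scores are all made up.

```python
import random


def split(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows and carve off a fraction for training."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs, walking down the
    scores from highest to lowest. Assumes both classes are present and,
    for simplicity, that no two scores are tied."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under(points):
    """Trapezium rule; 1.0 is the ideal upside down L, 0.5 is a coin toss."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


train, test = split(list(range(10)))

# Made-up labels (1 = delayed) and model scores for six flights.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
auc = area_under(roc_points(labels, scores))

# A model that ranks every delayed flight above every on-time one.
perfect = area_under(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
print(len(train), len(test), round(auc, 3), perfect)
```

    The closer the curve hugs the top left corner, the closer the area gets to 1, which is why the ideal looks like an upside down L.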

    • Blow. 


    Having established that your model works you’ll want to unleash it on the world and share it, which is analogous to the thrust or work produced by a jet engine.  In MAML we can expose the model as a web service or a data set which can be used in any number of ways to drive a decision making process.

    Jet engines don’t exist in isolation; typically these are attached to aircraft and it is the aircraft that controls where the engine goes.  In ML we are going to have a project to set our direction and ML might only be a small part of this, in the way that the product recommendation engine on a web site like Amazon is only a small part of the customer experience.  So as with all the varying names for projects that use data to drive businesses forward we need to be aware that the analytics bit is a small part of the whole; we need to be aware of data quality, customer culture and all of the baggage that is essential to a successful implementation.  There is also one extra complication that comes from the data mining world and that is that it’s not always possible to see how the result was arrived at, so we have to build trust in the results, particularly if they contradict our intuition and experience.

    So the good news for us experienced BI professionals is that we can apply many of the lessons we have learnt to ML, and with MAML we don’t need to know too much about the data mining algorithms unless we want to.

  • Adventure Works!

    My title for this post is a pun on the Adventure Works databases and samples that have been in SQL Server since I can remember.  There were also some data mining examples (as referenced in this old post) but this has not really moved on since 2011 when I last wrote about it, so you might be forgiven for thinking that data mining is dead as far as Microsoft is concerned.

    However since that time two big things have happened: the hyper-scale of cloud and the rise of social media as a business tool, not just as a bit of fun to share strange pictures and meaningless chat.  Coupled together this is big data: masses of largely useless data, being produced at a rate faster than can be downloaded, in a variety of semi-structured and unstructured formats – so Volume, Velocity and Variety.  Hidden in this data are nuggets of gold such as brand sentiment, how users navigate our web sites and what they are looking at, and patterns that we can’t immediately recognise. Up until now processing and analysing big data has really only been possible for large corporates and governments as they have the resources and expertise to do this. However as well as storing big data the cloud can also be used to make this big data analysis available to anyone who has the sense of adventure to give it a try - all that’s needed is access to the data and an understanding of how to mine the information.  The understanding bit of the equation is still a problem, though: this expertise, aka data science, is the bottleneck, and a quick search on your favourite jobs board for jobs in this area will confirm this.

    So what is Microsoft doing about this?

    What they have always done – simplify it, commoditise it, and integrate it.  If I go back to SQL Server 2000 we had DTS to load and transform data from any source and Analysis Services to slice and dice it from Excel, and then we got Reporting Services in 2004, all in one product.  In 2014 we have a complete set of tools to mash, hack and slice data into submission from any source, but these tools are no longer in SQL Server; they are in the cloud, specifically in Azure and in Office 365.  So what are the tools?

    • HDInsight, which is Hadoop running as a service in Azure, where you can build a cluster as large as you need and feed it data with all the industry standard tools you are used to (Mahout and Pig for example).
    • Microsoft Azure Machine Learning (MAML), which can take data from anywhere including HDInsight and do the kind of predictive analytics that data mining promised but without the need to be a data scientist yourself.  This is because the MAML studio has a raft of the best algorithms that are publicly available and is also very easy to use from IE or Chrome – actually it reminds me a bit of SQL Server Integration Services, which is no bad thing.


    Once you have trained your experiment (as they are called) you can expose this as a web service which can then be consumed on a transaction by transaction basis to score credit, advise on purchase decisions etc. within your own web sites.

    • Office 365 provides the presentation layer on the tools above with access to HDInsight data and machine learning datasets from the Power BI suite of tools.
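
    Going back to the web service you get from a trained experiment, the calling side might look something like this sketch in plain Python. I must stress that the URL, the key and the shape of the JSON payload are placeholders I have made up; the real request format is whatever MAML publishes for your service, so treat this as an illustration of the transaction by transaction idea only.

```python
import json
import urllib.request


def build_scoring_request(url, api_key, features):
    """Build (but do not send) a POST request carrying one record to score.
    The payload layout and the bearer-token header are assumptions."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


# Placeholder endpoint and key; neither exists.
request = build_scoring_request(
    "https://example.invalid/score",
    "not-a-real-key",
    {"Carrier": "AA", "DepDelay": 20.0},
)
print(request.get_method(), request.full_url)

# Sending would be: urllib.request.urlopen(request)
# It is left out here because the endpoint above is made up.
```

    The point is simply that each transaction becomes one small request, so the model can sit behind your own web site without the site knowing anything about how it was trained.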

    In order to play with any of these there are two other things you’ll need - an MSDN subscription to get free hours on Azure to try this out, and a copy of Office 2013 for the Power BI stuff. You’ll also want to watch the Microsoft Virtual Academy for advice and guidance, although at the time of writing there aren’t any courses on MAML as it’s so new.

    Finally a word of warning before you start on your own adventure - these tools can all encode a certain amount of business logic, so it’s important to understand the end to end changes you have made in building your models from source to target and to consider where and when to use which tool.  For example Power Pivot can itself do quite a lot of data analysis but is best used in a big data world as a front end for HDInsight or a machine learning experiment.

    I will be going deeper into this in subsequent posts as this stuff is not only jolly interesting it’s also a huge career opportunity for anyone who loves mucking around with data.