• Data Science Perspectives: Q&A with Microsoft Data Scientists Val Fontama and Wee Hyong Tok

    You can’t read the tech press without seeing news of exciting advancements or opportunities in data science and advanced analytics. We sat down with two of our own Microsoft Data Scientists to learn more about their role in the field, some of the real-world successes they’ve seen, and get their perspective on today’s opportunities in these evolving areas of data analytics.

    If you want to learn more about predictive analytics in the cloud or hear more from Val and Wee Hyong, check out their new book, Predictive Analytics with Microsoft Azure Machine Learning: Build and Deploy Actionable Solutions in Minutes.

    First, tell us about your roles at Microsoft.

     [Val] Principal Data Scientist in the Data and Decision Sciences Group (DDSG) at Microsoft

     [Wee Hyong] Senior Program Manager, Azure Data Factory team at Microsoft

     And how did you get here? What’s your background in data science?

    [Val] I started in data science over 20 years ago when I did a PhD in Artificial Intelligence. I used Artificial Neural Networks to solve challenging engineering problems, such as the measurement of fluid velocities and heat transfer. After my PhD, I applied data mining in the environmental science and credit industry: I did a year’s post-doctoral fellowship before joining Equifax as a New Technology Consultant in their London office. There, I pioneered the application of data mining to risk assessment and marketing in the consumer credit industry. I hand coded over ten machine learning algorithms, including neural networks, genetic algorithms, and Bayesian belief networks in C++ and applied them to fraud detection, predicting risk of default, and customer segmentation.    

    [Wee Hyong] I’ve worked on database systems for over 10 years, from academia to industry. I joined Microsoft after I completed my PhD in Data Streaming Systems. When I started, I worked on shaping the SSIS server from concept to release in SQL Server 2012. I was passionate about data science well before joining Microsoft: I wrote code to integrate association rule mining into a relational database management system, which allows users to combine association rule mining queries with SQL queries. I was also a SQL Server Most Valuable Professional (MVP); in that role I ran data mining boot camps for IT professionals in Southeast Asia and showed how to transform raw data into insights using the data mining capabilities in Analysis Services.

    What are the common challenges you see with people, companies, or other organizations who are building out their data science skills and practices?

    [Val] The first challenge is finding the right talent. Many of the executives we talk to are keen to form their own data science teams but may not know where to start. First, they are not clear what skills to hire – should they hire PhDs in math, statistics, computer science or other? Should the data scientist also have strong programming skills? If so, in what programming languages? What domain knowledge is required? We have learned that data science is a team sport, because it spans so many disciplines including math, statistics, computer science, etc. Hence it is hard to find all the requisite skills in a single person. So you need to hire people with complementary skills across these disciplines to build a complete team.

    The next challenge arises once there is a data science team in place – what’s the best way to organize this team? Should the team be centralized or decentralized? Where should it sit relative to the BI team? Should data scientists be part of the BI team or separate? In our experience at Microsoft, we recommend having a hybrid model with a centralized team of data scientists, plus additional data scientists embedded in the business units. Through the embedded data scientists, the team can build good domain knowledge in specific lines of business. In addition, the central team allows them to share knowledge and best practices easily. Our experience also shows that it is better to have the data science team separate from the BI team. The BI team can focus on descriptive and diagnostic analysis, while the data science team focuses on predictive and prescriptive analysis. Together they will span the full continuum of analytics.

    The last major challenge I often hear about is the actual practice of deploying models in production. Once a model is built, it takes time and effort to deploy it in production. Today many organizations rewrite the models to run on their production environments. We’ve found success using Azure Machine Learning, as it simplifies this process significantly and allows you to deploy models to run as web services that can be invoked from any device.

    [Wee Hyong] I also hear about challenges in identifying tools and resources to help build these data science skills. There are a significant number of online and printed resources that cover a wide spectrum of data science topics, from the theoretical foundations of machine learning to its practical applications. One of the challenges is navigating this sea of resources and selecting the right ones to help people begin.

    Another challenge I have seen often is identifying and figuring out the right set of tools that can be used to model the predictive analytics scenario. Once they have figured out the right set of tools to use, it is equally important for people/companies to be able to easily operationalize the predictive analytics solutions that they have built to create new value for their organization.

    What is your favorite data science success story?

    [Val] My two favorite projects are the predictive analytics projects for ThyssenKrupp and Pier 1 Imports. I’ll speak today about the Pier 1 project. Last spring my team worked with Pier 1 Imports and their partner, MAX451, to improve cross-selling and upselling with predictive analytics. We built models that predict the next logical product category once a customer makes a purchase. Based on Azure Machine Learning, this solution will lead to a much better experience for Pier 1 customers.

    [Wee Hyong] One of my favorite data science success stories is how OSIsoft collaborated with the Carnegie Mellon University (CMU) Center for Building Performance and Diagnostics to build an end-to-end solution that addresses several predictive analytics scenarios. With predictive analytics, they were able to solve many of their business challenges, ranging from predicting energy consumption in different buildings to fault detection. The team was able to effectively operationalize the machine learning models built using Azure Machine Learning, which led to better energy utilization in the buildings at CMU.

    What advice would you give to developers looking to grow their data science skills?
    [Val] I would highly recommend learning multiple subjects: statistics, machine learning, and data visualization. Statistics is a critical skill for data scientists that offers a good grounding in correct data analysis and interpretation. With good statistical skills we learn best practices that help us avoid pitfalls and wrong interpretation of data. This is critical because it is too easy to unwittingly draw the wrong conclusions from data. Statistics provides the tools to avoid this. Machine learning is a critical data science skill that offers great techniques and algorithms for data pre-processing and modeling. And last, data visualization is a very important way to share the results of analysis. A good picture is worth a thousand words – the right chart can help to translate the results of complex modeling into your stakeholder’s language. So it is an important skill for a budding data scientist.

    [Wee Hyong] Be obsessed with data, and acquire a good understanding of the problems that can be solved by the different algorithms in the data science toolbox. It is a good exercise to jumpstart by modeling a business problem in your organization where predictive analytics can help to create value. You might not get it right in the first try, but it’s OK. Keep iterating and figuring out how you can improve the quality of the model. Over time, you will see that these early experiences help build up your data science skills.

    Besides your own book, what else are you reading to help sharpen your data science skills?

    [Val] I am reading the following books:

    • Data Mining and Business Analytics with R by Johannes Ledolter
    • Data Mining: Practical Machine Learning Tools and Techniques, Third Edition (The Morgan Kaufmann Series in Data Management Systems) by Ian H. Witten, Eibe Frank, and Mark A. Hall
    • Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie or Die by Eric Siegel

    [Wee Hyong] I am reading the following books:

    • Super Crunchers: Why Thinking-By-Numbers Is the New Way to Be Smart by Ian Ayres
    • Competing on Analytics: The New Science of Winning by Thomas H. Davenport and Jeanne G. Harris.

    Any closing thoughts?

    [Val]  One of the things we share in the book is that, despite the current hype, data science is not new. In fact, the term data science has been around since 1960. That said, I believe we have many lessons and best practices to learn from other quantitative analytics professions, such as actuarial science. These include the value of peer reviews, the role of domain knowledge, etc. More on this later.

    [Wee Hyong] One of the reasons that motivated us to write the book is we wanted to contribute back to the data science community, and have a good, concise data science resource that can help fellow data scientists get started with Azure Machine Learning. We hope you find it helpful. 

  • Results are Beautiful: 4 Best Practices for Big Data in Healthcare

    When you put big data to work, results can be beautiful. Especially when those results are as impactful as saving lives. Here are four best practice examples of how big data is being used in healthcare to improve, and often save, lives.

    Aerocrine improves asthma care with near-real-time data

    Millions of asthma sufferers worldwide depend on Aerocrine monitoring devices to diagnose and treat their disease effectively. But those devices are sensitive to small changes in ambient environment. That’s why Aerocrine is using a cloud analytics solution to boost reliability. Read more.

    Virginia Tech advances DNA sequencing with cloud big data solution

    DNA sequencing analysis is a form of life sciences research that has the potential to lead to a wide range of medical and pharmaceutical breakthroughs. However, this type of analysis requires supercomputing resources and Big Data storage that many researchers lack. Working through a grant provided by the National Science Foundation in partnership with Microsoft, a team of computer scientists at Virginia Tech addressed this challenge by developing an on-demand, cloud-computing model using the Windows Azure HDInsight Service. By moving to an on-demand cloud computing model, researchers will now have easier, more cost-effective access to DNA sequencing tools and resources, which could lead to even faster, more exciting advancements in medical research. Read more.

    The Grameen Foundation expands global humanitarian efforts with cloud BI

    Global nonprofit Grameen Foundation is dedicated to helping as many impoverished people as possible, which means continually improving the way Grameen works. To do so, it needed an ongoing sense of its programs’ performance. Grameen and Microsoft brought people and technology together to create a BI solution that helps program managers and financial staff: glean insights in minutes, not hours; expand services to more people; and make the best use of the foundation’s funding. Read more.

    Ascribe transforms healthcare with faster access to information

    Ascribe, a leading provider of IT solutions for the healthcare industry, wanted to help clinicians identify trends and improve services by supplying faster access to information. However, exploding volumes of structured and unstructured data hindered insight. To solve the problem, Ascribe designed a hybrid-cloud solution with built-in business intelligence (BI) tools based on Microsoft SQL Server 2012 and Windows Azure. Now, clinicians can respond faster with self-service BI tools. Read more.

    Learn more about Microsoft’s big data solutions

  • Relational Data Warehouse + Big Data Analytics: Analytics Platform System (APS) Appliance Update 3

    This blog post was authored by: Matt Usher, Senior PM on the Microsoft Analytics Platform System (APS) team

    Microsoft is happy to announce the release of the Analytics Platform System (APS) Appliance Update (AU) 3. APS is Microsoft’s big-data-in-a-box appliance for serving the needs of relational data warehouses at massive scale. With this release, the APS appliance supports new scenarios for using Power BI modeling, visualization, and collaboration tools over on-premises data sets. In addition, this release extends PolyBase so that customers can use the HDFS infrastructure in Hadoop for ORC files and directory modeling, making it easier to integrate non-relational data into their data insights.

    The AU3 release includes:

    • PolyBase recursive Directory Traversal and ORC file format support
    • Integrated Data Management Gateway enables queries from Power BI to on-premises APS
    • TSQL compatibility improvements to reduce migration friction from SQL Server SMP
    • Replatformed to Windows Server 2012 R2 and SQL Server 2014

    PolyBase Directory Traversal and ORC File Support

    PolyBase is an integrated technology that allows customers to apply the TSQL skills they have already developed to querying and managing data in Hadoop platforms. With the AU3 release, the APS team has augmented this technology with the ability to define an external table that targets a directory structure as a whole. This new ability unlocks a new set of scenarios in which customers can use their existing investments in Hadoop as well as APS to gain greater insight into all of the data collected within their data systems. In addition, AU3 introduces full support for the Optimized Row Columnar (ORC) file format, a common storage mechanism for files within Hadoop.

    As an example of this new capability, consider a customer that uses an APS appliance to host inventory and Point of Sale (POS) data while storing the web logs from their ecommerce site in a Hadoop path structure. With AU3, the customer can keep the logs in Hadoop in a layout that is easy to construct, such as year/month/date/server/log, for simple storage and recovery within Hadoop. That layout can then be exposed as a single table to analysts and data scientists for insights.

    In this example, let’s assume that each of the Serverxx folders contains the log file for that server on that particular day. In order to surface the entire structure, we can construct an external table using the following definition:

    CREATE EXTERNAL TABLE [dbo].[WebLogs]
    (
    	[Date] DATETIME NULL,
    	[Uri] NVARCHAR(256) NULL,
    	[Server] NVARCHAR(256) NULL,
    	[Referrer] NVARCHAR(256) NULL
    )
    WITH
    (
    	LOCATION='//Logs/',
    	DATA_SOURCE = Azure_DS,
    	FILE_FORMAT = LogFileFormat,
    	REJECT_TYPE = VALUE,
    	REJECT_VALUE = 100
    );
    

    By pointing LOCATION at the //Logs/ folder, the external table will pull data from all folders and files within the directory structure. In this case, a simple select returns only the first 5 entries, ordered by [Date], regardless of which log file contains the data:

    SELECT TOP 5
    	*
    FROM
    	[dbo].[WebLogs]
    ORDER BY
    	[Date]
    

    The results are:

    Note: PolyBase, like Hadoop, will not return results from hidden folders or any file that begins with an underscore (_) or period (.).
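
    The WebLogs definition above refers to a data source (Azure_DS) and a file format (LogFileFormat) whose definitions are not shown in this post. As a rough sketch only, assuming APS AU3 accepts the PolyBase DDL documented for SQL Server, they might be created along these lines. The name node address, the field terminator, and the LogFileFormatOrc name are hypothetical placeholders, not values from the appliance described here:

    ```sql
    -- Hypothetical data source; replace the address with your own Hadoop name node.
    CREATE EXTERNAL DATA SOURCE Azure_DS
    WITH
    (
        TYPE = HADOOP,
        LOCATION = 'hdfs://namenode.example.com:8020'
    );

    -- Hypothetical delimited-text format for the web logs (the delimiter is assumed).
    CREATE EXTERNAL FILE FORMAT LogFileFormat
    WITH
    (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
    );

    -- Hypothetical alternative for logs stored as ORC files.
    CREATE EXTERNAL FILE FORMAT LogFileFormatOrc
    WITH
    (
        FORMAT_TYPE = ORC
    );
    ```

    If the logs were stored as ORC, the external table would name the ORC format in its FILE_FORMAT option in place of the delimited-text one.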

    Integrated Data Management Gateway

    With the integration of the Microsoft Data Management Gateway into APS, customers now have a scale-out compute gateway for Azure cloud services to more effectively query sophisticated sets of on-premises data.  Power BI users can leverage PolyBase in APS to perform more complicated mash-ups of results from on-premises unstructured data sets in Hadoop distributions. By exposing the data from the APS Appliance as an OData feed, Power BI is able to easily and quickly consume the data for display to end users.

    For more details, please look for an upcoming blog post on the Integrated Data Management Gateway.

    TSQL Compatibility improvements

    The AU3 release incorporates a set of TSQL improvements targeted at richer language support to improve the types of queries and procedures that can be written for APS. For AU3, the primary focus was on implementing full error handling within TSQL to allow customers to port existing applications to APS with minimal code change and to introduce full error handling to existing APS customers. Released in AU3 are the following keywords and constructs for handling errors:

    In addition to the error handling components, the AU3 release also includes support for the XACT_STATE scalar function, which indicates the current transaction state of a user request.
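
    To illustrate how error handling and XACT_STATE typically fit together, here is a minimal sketch written in standard SQL Server TSQL. It assumes that APS AU3 supports TRY/CATCH blocks and THROW in the same form as SQL Server, which this post does not confirm, and the dbo.Inventory table and its columns are hypothetical:

    ```sql
    BEGIN TRY
        BEGIN TRANSACTION;

        -- Hypothetical table and columns, used only for illustration.
        UPDATE dbo.Inventory
        SET Quantity = Quantity - 1
        WHERE ProductId = 42;

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        -- XACT_STATE() returns 1 for an active committable transaction,
        -- -1 for an active uncommittable one, and 0 when none is open.
        IF XACT_STATE() <> 0
            ROLLBACK TRANSACTION;

        -- Re-raise the original error to the caller.
        THROW;
    END CATCH;
    ```

    Checking XACT_STATE before rolling back avoids issuing a ROLLBACK when the error occurred before the transaction was opened or after it was committed.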

    Replatformed to Windows Server 2012 R2 and SQL Server 2014

    The AU3 release also marks the upgrade of the core fabric of the APS appliance to Windows Server 2012 R2 and SQL Server 2014. With the upgrade to the latest versions of Microsoft’s flagship server operating system and core relational database engine, the APS appliance takes advantage of the improved networking, storage, and query execution components of these products. For example, the APS appliance now uses a virtualized Active Directory infrastructure, which helps to reduce cost and increase domain reliability within the appliance and helps make APS the price/performance leader in the big data appliance space.

    APS on the Web

    To learn more about the Microsoft Analytics Platform System, please visit us on the web at http://www.microsoft.com/aps

  • Six Benefits to Planning for SQL Server 2005 and Windows Server 2003 End of Support Now

    As the end of 2014 nears, now is the perfect time to review IT infrastructure plans for the coming year.  If you haven’t made supportability a key initiative for 2015, there are some important dates that you should know about:

    After the end of extended support, security updates will no longer be available for these products. Staying ahead of these support dates will help you achieve regulatory compliance and mitigate potential future security risks. That means SQL Server 2005 users, especially those running databases on Windows Server 2003, should make upgrading the data platform an IT priority.

    Security isn’t the only reason to think about upgrading. Here are six benefits to upgrading and migrating your SQL Server 2005 databases before the end of extended support:

    1. Maintain compliance – It will become harder to prove compliance with the latest regulations such as the upcoming PCI DSS 3.0. Protect your data and stay on top of regulatory compliance and internal security audits by running an upgraded version of SQL Server.
    2. Achieve breakthrough performance – Per industry benchmarks, SQL Server 2014 delivers 13x performance gains relative to SQL Server 2005 and 5.5x performance gains over SQL Server 2008.  Customers using SQL Server 2014 can further accelerate mission critical applications with up to 30x transaction performance gains with our new in-memory OLTP engine and accelerate queries up to 100x with our in-memory columnstore. 
    3. Virtualize and consolidate with Windows Server – Scale up on-premises or scale-out via private cloud with Windows Server 2012 R2. Reduce costs by consolidating more database workloads on fewer servers, and increase agility using the same virtualization platform on-premises and in the cloud.
    4. Reduce TCO and increase availability with Microsoft Azure – Azure Virtual Machines can help you reduce the total cost of ownership of deployment, management, and maintenance of your enterprise database applications. And, it’s easier than ever to upgrade your applications and achieve high availability in the cloud using pre-configured templates in Azure.
    5. Use our easy on-ramp to cloud for web applications – The new preview of Microsoft Azure SQL Database announced last week has enhanced compatibility with SQL Server that makes it easier than ever to migrate from SQL Server 2005 to Microsoft Azure SQL Database. Microsoft’s enterprise-strength cloud brings global scale and near zero maintenance to database-as-a-service, and enables you to scale out your application on demand.
    6. Get more from your data platform investments – Upgrading and migrating your databases doesn’t have to be painful or expensive. A Forrester Total Economic Impact™ of Microsoft SQL Server study found a payback period of just 9.5 months for moving to SQL Server 2012 or 2014.

    Here are some additional resources to help with your upgrade or migration: