Structured or unstructured – what to store? How about all of it?

I spoke at the O’Reilly Strata Conference last year and am back again this year to share thoughts about big data. It’s been a remarkable year for Microsoft’s data platform team. While we’ve had a long history in the data space, last year was noteworthy for several reasons.

[Image: Dave Campbell keynotes at Strata Santa Clara 2013]

  1. We announced our commitment to the Hadoop community and we partnered with Hortonworks to provide enterprises with an Apache Hadoop-compatible implementation on Windows.
  2. We released new versions of our business intelligence (BI) tools, providing business users access to powerful analytics through familiar tools including Excel 2013, Power View and PowerPivot. The columnar in-memory analytics engine in PowerPivot, available in Excel, is the same technology that powers the SQL Server Analysis Services Tabular model. We’ve made it very easy in SQL Server 2012 to migrate PowerPivot models into Analysis Services, so you can take solutions and insights developed in Excel and scale them up to server-based deployments which can serve many users.
  3. And tomorrow you can start ordering SQL Server 2012 Parallel Data Warehouse (PDW), the next generation of our parallel processing data warehousing appliance. The latest version of PDW includes a technology we call PolyBase. PolyBase makes it very easy to meld Apache Hadoop with structured relational data using a standard T-SQL query, enabling large businesses to do efficient information production at scale with the existing skills and knowledge they already have in-house.

While that may read like a list of features, here’s why I think they are noteworthy. A recent Microsoft study revealed that 38% of respondents’ current data stores contain unstructured data and 53% rated increased amounts of unstructured data to analyze as extremely important. This trend is only increasing as businesses realize their unstructured data holds the key to new value not accessible in their existing structured data. In my personal experience, over the last two years, many businesses have shifted from thinking of big data as a challenge to perceiving it as an opportunity.

I’m often asked, “Where is the ultimate value in big data and how do I tap into it?” There are two key measures in my mind: 1) time to insight, and 2) return on accessible data. These measures are, in turn, enabled through a process I call information production.

Information production is the process of converting data or information from one domain into another. Consider the following example. Assume you have an ambulance fleet equipped with GPS units that collect telemetry. Information production techniques allow you to convert the raw GPS telemetry – a sequence of records containing {Timestamp, Latitude, Longitude} elements – into an incident response time. The magic of information production is that it takes data which is difficult to deal with in traditional information systems, such as raw GPS telemetry, and transforms it into information (incident ID and response time) which is both more structured and more business relevant. Once we’ve produced the incident response time, we can logically join it with models which predict patient outcome as a function of response time.
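To make the ambulance example concrete, here is a minimal Python sketch of that conversion. It is an illustration under stated assumptions, not how any particular product does it: the `GpsFix` and `Incident` types, the dispatch time and scene coordinates on each incident, and the rule that "arrival" means the first fix within a chosen radius of the scene are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt
from typing import Iterable, Optional


@dataclass(frozen=True)
class GpsFix:
    """One raw telemetry record: {Timestamp, Latitude, Longitude}."""
    timestamp: datetime
    latitude: float
    longitude: float


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; these fields are assumptions."""
    incident_id: str
    dispatched_at: datetime
    scene_latitude: float
    scene_longitude: float


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    earth_radius_m = 6_371_000.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))


def response_time(
    incident: Incident,
    fixes: Iterable[GpsFix],
    arrival_radius_m: float = 50.0,  # assumed threshold, tune for real data
) -> Optional[timedelta]:
    """Return dispatch-to-arrival time, or None if the unit never arrived.

    Arrival is taken to be the first fix at or after dispatch that lies
    within arrival_radius_m of the scene.
    """
    for fix in sorted(fixes, key=lambda f: f.timestamp):
        if fix.timestamp < incident.dispatched_at:
            continue  # ignore telemetry recorded before dispatch
        distance = haversine_m(
            fix.latitude, fix.longitude,
            incident.scene_latitude, incident.scene_longitude,
        )
        if distance <= arrival_radius_m:
            return fix.timestamp - incident.dispatched_at
    return None
```

The output is a small structured record (an incident ID paired with a duration) in place of a long stream of coordinates, which is the kind of data a traditional system can join and aggregate easily. The arrival radius is the main judgment call: too small and GPS noise hides real arrivals, too large and a drive-by counts as one.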

Great information production tools allow you to reduce the time to insight. They allow you to get from a hunch to validation very quickly. In fact, there is an emerging class of information production tools which stimulate hypothesis generation by finding correlations in diverse data sets which may hold the key to new value.

Valuable answers require logically joining different data sets – something every database person is familiar with. In traditional databases the “accessible” data is constrained to data which is contained within the database. This data has been normalized, cleaned, and indexed so it can be used to efficiently answer a fixed set of questions over that data domain.

Big data and information production enable a much larger definition of accessible data though. Going back to the ambulance example, where would you get the patient outcome model to determine how many lives you could save by reducing response time? By using accessible demographic and population data, you could determine how many heart attack victims’ lives could be saved by moving or adding ambulances.
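As a rough sketch of the "logical join" between produced response times and an outcome model, the Python below compares two deployment scenarios. Everything here is hypothetical: the `OutcomeModel` type, the function names, and especially `toy_model`, whose coefficients are invented placeholders with no clinical basis. A real analysis would plug in a model derived from actual outcome and demographic data.

```python
from typing import Callable, Iterable

# Assumed interface: maps a response time in minutes to a survival probability.
OutcomeModel = Callable[[float], float]


def expected_survivors(response_minutes: Iterable[float], model: OutcomeModel) -> float:
    """Sum the modeled survival probability over a set of incidents."""
    return sum(model(minutes) for minutes in response_minutes)


def estimated_lives_saved(
    current_minutes: Iterable[float],
    proposed_minutes: Iterable[float],
    model: OutcomeModel,
) -> float:
    """Difference in expected survivors between a proposed and current deployment."""
    return expected_survivors(proposed_minutes, model) - expected_survivors(current_minutes, model)


def toy_model(minutes: float) -> float:
    """Placeholder only: invented numbers, NOT a clinical model."""
    return max(0.0, 1.0 - 0.1 * minutes)


if __name__ == "__main__":
    current = [8.0, 10.0, 6.0]   # made-up response times for the same incidents
    proposed = [6.0, 8.0, 4.0]   # made-up times after relocating an ambulance
    print(estimated_lives_saved(current, proposed, toy_model))
```

Passing the model in as a function keeps the join explicit: the response times come from information production over telemetry, while the outcome model comes from a different data domain entirely.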

We will know big data has made it big when it makes everyday experiences better. In fact, one of the things I spoke about in my Strata talk today is how our Halo 4 team is using our HDInsight and big data tools to create a better gaming experience. By using a preview of the Windows Azure HDInsight Service to do Hadoop-based analysis on their unstructured data, the team gained invaluable insight into usage patterns and as a result had the agility to make changes to improve the overall gaming experience.

Information production is a key part of our vision and product offerings and enables you to achieve fast time to insight and greater return on accessible data. Our BI tools, like PowerPivot and Power View, are geared to making it easier for a certain class of users to reduce their time to insight and then to be able to effectively share those insights with others. Data Explorer, which we released in preview yesterday, makes it easy to find, transform, and join information to both ease information production and increase the range of accessible data. HDInsight Service, our Apache Hadoop based service, available in preview on Windows Azure, is being used by our Halo 4 team and external customers to realize new value from their unstructured data. SQL Server 2012 PDW with its PolyBase technology enables extremely large scale information production for the largest business needs.

For all of us involved in big data, and me personally, it is an incredibly exciting time. The next 5-10 years are going to be breathtaking.

You can find out more about our big data solutions by visiting www.microsoft.com/bigdata. For those interested in the Halo 4 team’s cloud-based big data solution, the case study is now available online.

Dave Campbell
Technical Fellow, STB

This post is part of the Busting Big Data Adoption Myth blog series.