Official News from Microsoft’s Information Platform
Do you have a data warehouse? Do you wish your queries would run faster? If your answers are yes, check out the new columnstore index (aka Project “Apollo”) in SQL Server Code Name “Denali” today!
Why use a column store?
SQL Server’s traditional indexes, clustered and nonclustered, are based on the B-tree. B-trees are great for finding data that match a predicate on the primary key. They’re also reasonably fast when you need to scan all the data in a table. So why use a column store? There are two main reasons:
1. Compression. Most general-purpose relational database management systems, including SQL Server, store data in row-wise fashion. This organization is sometimes called a row store. Both heaps and B-trees are row stores because they store the values from each column in a given row contiguously. When you want to find all the values associated with a row, having the data stored together on one page is very efficient. Storing data by rows is less well suited to compression, however. Most compression algorithms exploit the similarities within a group of values, and the values from different columns usually are not very similar. When data is stored row-wise, relatively few rows fit on a page, so the opportunities to exploit similarity among values are limited. A column store organizes data in column-wise fashion: data from a single column are stored contiguously. Usually there is repetition and similarity among values within a column, and the column store organization allows compression algorithms to exploit that similarity.
2. Fetching only needed columns. When data is stored column-wise, each column can be accessed independently of the other columns. If a query touches only a subset of the columns in a table, IO is reduced. Data warehouse fact tables are often wide as well as long. Typical queries touch only 10 – 15% of the columns. That means a column store can reduce IO by 85 – 90%, a huge speedup in systems that are often IO bound, meaning the query speed is limited by the speed at which needed data can be transferred from disk into memory.
It’s clear that cold start queries, when all the data must be fetched from disk, will benefit from compression and eliminating unneeded columns. Warm start queries benefit too, because more of your working set fits in memory. At some point, however, eliminating IO moves the bottleneck to the CPU. We’ve added huge value here too, by introducing a new query execution paradigm called batch mode processing. When the query uses at least one columnstore index, batch mode processing can speed up joins, aggregations, and filtering. During batch mode processing, columnar data is organized in vectors, and sets of data are processed a batch at a time instead of a row at a time, using highly efficient algorithms designed to take advantage of modern hardware. The query optimizer takes care of choosing when to use batch mode processing and when to use traditional row mode query processing.
Why not use a column store for everything?
While it’s possible to build a system that stores all data in columnar format, row stores still have advantages in some situations. A B-tree is a very efficient data structure for looking up or modifying a single row of data. So if your workload entails many single row lookups and many updates and deletes, which is common for OLTP workloads, you will probably continue to use row store technology. Data warehouse workloads typically scan, aggregate, and join large amounts of data. In those scenarios, column stores really shine.
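As a rough illustration of the kind of data warehouse query described above, consider the sketch below. The table and column names (fact_sales, dim_date, sales_amount, date_key, calendar_year) are hypothetical and not part of any shipped sample. The query scans a large fact table, joins it to a dimension, filters, and aggregates, while touching only two of the fact table's columns.

```sql
-- Hypothetical star-schema query: fact_sales and dim_date are made-up names.
-- Only date_key and sales_amount are read from the (wide) fact table, so a
-- columnstore index on fact_sales lets the engine skip every other column.
SELECT d.calendar_year,
       SUM(f.sales_amount) AS total_sales
FROM fact_sales AS f
JOIN dim_date AS d
    ON f.date_key = d.date_key
WHERE d.calendar_year >= 2010
GROUP BY d.calendar_year
ORDER BY d.calendar_year;
```

A single-row lookup or update by key, by contrast, is the kind of statement where a B-tree index remains the better fit.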
SQL Server now provides you with a choice. You can build columnstore indexes on your big data warehouse tables and get the benefits of column store technology and batch mode processing without giving up the benefits of traditional row store technology when a B-tree is the right tool for the job.
Try it out: Build a columnstore index
Columnstore indexes are available in CTP 3 of SQL Server Code Name “Denali.” You can create a columnstore index on your table by using a slight variation on existing syntax for creating indexes. To create an index named mycolumnstoreindex on a table named mytable with three columns, named col1, col2, and col3, use the following syntax:
CREATE NONCLUSTERED COLUMNSTORE INDEX mycolumnstoreindex ON mytable (col1, col2, col3);
To avoid typing the names of all the columns in the table, you can use the Object Explorer in Management Studio to create the index.
Typically you will want to put all the columns in your table into the columnstore index. It does not matter in what order you list the columns, because a columnstore index does not have a key like a B-tree index does. Internally, the data will be re-ordered automatically to get the best compression. Be sure to populate the table with data before you create the columnstore index. Once you create the columnstore index, you cannot directly add, delete, or modify data in the table. Instead, you can either:
· Disable or drop the columnstore index. You will then be able to update the table and then rebuild the columnstore index.
· Use partition switching. If your table is partitioned, you can put new data into a staging table, build a columnstore index on the staging table, and switch the staging table into an empty partition of your main table. Similarly, you could modify existing data by switching a partition from the main table into a staging table, disabling the columnstore index on the staging table, performing your updates, rebuilding the columnstore index, and switching the partition back into the main table.
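The two options above might look roughly like the following T-SQL. Treat it as a sketch under stated assumptions, not a tested script. The names mystagingtable and mystagingcolumnstoreindex, the partition number 5, and the sample UPDATE are hypothetical. The partition switch also assumes that the staging table has the same schema as mytable, sits on the same filegroup as the target partition, and carries a CHECK constraint matching that partition's boundary values.

```sql
-- Option 1: disable the columnstore index, modify the data, then rebuild.
ALTER INDEX mycolumnstoreindex ON mytable DISABLE;

-- With the index disabled, the table accepts changes again.
-- Hypothetical update; assumes col1 and col2 are numeric.
UPDATE mytable SET col2 = col2 + 1 WHERE col1 = 42;

-- Rebuilding re-enables the index over the updated data.
ALTER INDEX mycolumnstoreindex ON mytable REBUILD;

-- Option 2: load new data through partition switching.
-- mystagingtable (hypothetical) has already been populated with the new rows.
CREATE NONCLUSTERED COLUMNSTORE INDEX mystagingcolumnstoreindex
    ON mystagingtable (col1, col2, col3);

-- Partition 5 (hypothetical) of mytable must be empty before the switch.
ALTER TABLE mystagingtable SWITCH TO mytable PARTITION 5;
```

To modify existing data with the second option, the switch runs in the other direction first (ALTER TABLE mytable SWITCH PARTITION 5 TO mystagingtable, with the staging table empty and indexed to match), followed by the disable, update, and rebuild steps from the first option on the staging table, and then the switch back.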
For more information about using columnstore indexes, check out the MSDN article Columnstore Indexes and our new SQL Server Columnstore Index FAQ on the TechNet wiki. If you haven’t already, be sure to download SQL Server Code Name “Denali” CTP3 and begin testing today!
Senior Program Manager
SQL Server Database Engine Team