Columnstore Indexes in SQL Server Denali - more speed, less maintenance

Extending my previous post on compression, here's a new compression and storage concept that SQL Server's next release, codenamed Denali, will introduce. SQL Server Denali's CTP3 adds columnar indexes as a new storage format for secondary indexes. Let's have a closer look at them.

The benefits

The benefits of columnar storage are quite substantial

  • Massive query performance improvement, thanks to lower space consumption (less pages to read, higher probability for caching) and more differentiated loading of data (only columns that are actually queried will get loaded, not entire records). Examples I've seen have improved query performance by several hundred times, consuming a single digit percentage of the original query time without a columnar index.
  • Much lower maintenance costs: One columnar index can potentially replace dozens or hundreds of traditional nonclustered indexes that have so far been necessary for query covering. That single columnar index improves any query's performance, regardless of the actual columns that are queried.

In their current implementation in CTP3, columnar indexes address data warehousing workloads and are read-only.

What challenges get addressed by columnar indexes?

Traditionally, SQL Server stores data record-wise: All the columns that an index (clustered or nonclustered) consists of, are stored on the same disk pages. A disk page is filled record by record, with every record containing every column that has been defined for a table (there are exceptions for variable-length and BLOB data fields), I'll simplify a bit for this illustration). When a column for a certain record has to be read from disk or memory, an entire page is read, implying that all other columns for the requested record will be read too. When reading many a fields across many records, a lot of unneeded data gets read, generating large disk IO numbers.

So far, we've overcome this challenge when tuning query performance by creating specific covering indexes for often used queries. A nonclustered index covering a certain query only contains the columns that this specific query will need. Using less space than the base table, this covering index requires a smaller number of disk IOs, thus speeding up query performance.

When tuning a data warehouse, typically hundreds of covering indexes are created, addressing many different query patterns that have been discovered by monitoring the ad hoc query traffic. Maintaining these indexes is a substantial effort. If maintenance is neglected, the overall query performance can degrade drastically.

When using SQL Server's Enterprise Edition (since SQL Server 2008), query performance can further be improved by compressing all indexes. As database compression reduces space consumption on average by 50%, only half as many pages have to be accessed, further reducing disk IO.

We have of course been waiting for much bigger query performance improvements. They will now be available at last:

Columnar Index storage format in SQL Server Denali

Columnar indexes improve query performance based on two concepts:

  • Data is stored column-wise. A disk page consists of data from only one table column, across all records. When a certain column has to be read, only data from that column is accessed (from all records that reside on the same disk page, the minimum amount of data that gets accessed is still one 8k disk page). This has the same effect as a covering index for a certain query. But without the hassle of finding out which columns have to be added to a covering index and the whole maintenance story behind it. A single columnar index containing all columns of a table that can be queried improves performance for virtually every conceivable ad hoc query against that table. Nonclustered indexes will no longer be needed, at least not for query covering.
  • The second performance improving concept is the much higher compression rate of the pages that a columnar index creates. Due to the high similarity of the data that resides on a page (originating from only one column and therefore having the same data type) and thanks to the Vertipaq technology that is implemented here, data gets compressed at an estimated 80% on average. Vertipaq had its first implementation in PowerPivot and has been integrated into the SQL Server engine now.
  • A third factor is the improved algorithm that processes columnar indexes in a much more efficient way than traditional indexes. Data is processed in batches of columns, with better parallelization and reduced CPU usage.

Columnstore indexes perform the better the larger tables are. They aim at 10s of terabytes. With their implementation in SQL Server Parallel Data Warehouse they support data warehouses with 100s of terabytes and generate sub-second query response times for them.

Columnar indexes address data warehousing workloads in their first version. They aren't suited for tables used in OLTP workloads, at least not yet. Other than that, they behave like any other index, are a regular part of the database and of any SQL Server backup and are managed through SQL Server Management Studio as well as via regular T-SQL statements.

Some examples (simplified)

First, what we're generally used to now:


Traditional covering index: The query 'SELECT Green, Blue FROM Table' generates 5 IOs. A query also referencing column 'Red' would have to read the base table, as the index doesn't cover this column.

 

Traditional covering index using compression: The Query 'SELECT Green, Blue FROM Table' generates 3 IOs, 60% of the uncompressed version.

Introducing the new columnar index:


Columnar Index: The Query 'SELECT Green, Blue FROM Table' generates 2 IOs and doesn't have to be aware of the query structure, i.e. every other query benefits from this same columnar index too.

Features and limitations in version 1 (Denali RTM)

Columnstore indexes in CTP3 support this functionality:

  • Columnar storage of a table's data in a secondary index
  • High compression rate thanks to Vertipaq technology
  • Acceleration of inner joins (outer joins will be supported later)
  • Common business data types: int, real, char/varchar, money, datetime, decimal <= 18 digits (data types > 8 bytes will be supported later)
  • Index partitioning (aligned with partition table)

Some of the current limitations:

  • No replacement for the clustering index (i.e. the base table can't be stored in columnar format)
  • Columnar indexes and the tables that they are indexing are read-only (I'll describe some workarounds below)

Let's now look at some of the T-SQL syntax and maintenance methods that come with this new feature:

Creating a columnstore index

Columnstore indexes are created with an extended CREATE INDEX statement:

CREATE NONCLUSTERED COLUMNSTORE INDEX myColStoreIdx ON SampleTable(FirstColumn, SecondColumn, ThirdColumn, ...)

SQL Server's optimizer handles a columnstore index like any other index: It decides whether to use it or not based on its cost calculation.

In CTP3 columnstore indexes don't support data types that use more than 8 bytes (with the exception of char/varchar): Therefore fields of type UNIQUEIDENTIFIER, DECIMAL(>18), DATETIMEOFFSET etc. can't be part of a columnstore index.

Updating tables that have a columnstore index

Since tables that have a columnstore index can't be updated, some workarounds have to be used when those tables still need to be updated:

  • Dropping the index, adding/updating data and then recreating the columnar index is the most obvious way. When using columnar indexes in data warehouses, such steps will probably be part of any regular ETL process anyway.
  • Table partitioning can be used for adding new data to a table that has columnar indexes: You first create a non-indexed staging table and load data into that one. After creating a new columnar index on the staging table, it can be switched into the existing primary table.
  • Another way for adding data is using a UNION ALL view that joins the read-only primary table with a non-indexed staging table. By adding an Instead Of trigger for inserts, data gets inserted in the writable table. Instead of the primary table the view gets queried. It reads from both the indexed primary table and the non-indexed remainder in the staging table. The staging table can periodically receive a new columnstore index, too, and then be switched into the primary table.

Check it out yourself

SQL Server Denali CTP3, which is available to the public since July 12 2011, supports columnar indexes. It can be downloaded at https://www.microsoft.com/sqlserver.

As always with a CTP: The functionality described in this article is still under development and might change until RTM.