Here are some notes on "SQL Server 2008 Statistics" I took while attending an advanced class on SQL Server taught by Kimberly Tripp  (http://sqlskills.com/AboutKimberlyLTripp.asp).

Please note that, although these notes were taken during the class, I might have added some of my own (mis)interpretation :-). Always check your facts on Books Online (I try to provide links when applicable). As with anything else you get from a blog, never use any of this in production before you thoroughly validate it a test environment. Please keep in mind that some of those will be hard to follow without the right pre-requisite knowledge and the right context. Reading the post from top to bottom will help.


Understanding the job of the cost estimator

  • Table Scan (always an option)
  • Useful Indexes (look at WHERE, JOIN, GROUP BY, ORDER BY, covering)
  • Other options?

Statistics

  • Statistics = density information about a table kept by SQL
  • Help SQL "estimate" how many rows a certain query will return
  • Stored as a BLOB, but you can query the data using "DBCC SHOW_STATISTICS"

Not always right 

  • Statistics cannot be right for every case, or else it would be actual data
  • By definition, it is not all the data, it's a sample.
  • How frequently do you update? Depends on how the data is changing.
  • Statistically, certain types of data are consistent over time.
  • Unless that are some atypical events related to your data.

Could be statistically correct after update 

  • Look at the city column for an employee table for a company based in Seattle
  • Think of the frequency for Seattle, Redmond (suburb) and Spokane (hours away)
  • This would probably not change much (statistically) quickly
  • High number of rows with low selectivity - this is harder.

Looking at stats 

  • Things to look in your statistics (DBCC SHOW_STATISTICS):
  • How old are your statistics?
  • How many rows sampled in the statistics (out of the total rows)?
  • How many steps in the statistics (up to 201 steps in SQL Server 2008)?
  • Steps = number of rows in your histogram.
  • What is the density of that key?
  • Density * # of Rows = average rows returned for a given value
  • You also get the average length of the column.
  • For every step: actual value (high for step), total rows, count of unique rows.
  • Data could be sampled (not based on every row, but just a subset).

AUTO 

  • Generating statistics is usually not so expensive, it's done with sampling
  • The query optimizer will actually create statistics on the fly, as queries come in.
  • Future queries will benefit from having created that.
  • If you don't have the stats, SQL will not be sure of the usefulness of an index.
  • AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS are ON by default.

Stats with no index? 

  • Statistics do make sense in cases when you don't have indexes on a column
  • For instance, for helping with deciding how to do sorts
  • For instance, in a JOIN, to figure out which side has a smaller set of rows.

Let SQL do it

  • Let SQL Server do its thing with stats
  • You may start with a query taking a long time, than it improves "by itself"
  • This could be the results of statistic being autogenerated and helping you
  • Careful: On Read-only database, SQL cannot create statistics

Pre-create statistics 

  • You can also pre-create the statistics
  • You could pre-create for every column of every table, but you probably shouldn't
  • It probably helps generate them for every column of your non-clustered indexes
  • You can do that using "sp_createstats 'indexonly', 'fullscan'"

Leave it on 

Stats get old 

  • After they are created, statistics will get old over time, as data gets updated
  • You can use sp_updatestats to query
  • You can also query sys.stats and sys.indexes
  • You can use DBCC SHOW_STATISTICS (with HISTOGRAM?)

Auto updates 

  • Statistics are automatically updated if AUTO_UPDATE_STATISTICS are ON (default)
  • They get updated when about 20% of the data changes
  • In SQL 7, tracks row mod, immediately updates
  • In SQL 2000, tracks row mod, updates when needed
  • In SQL 2005, tracks column mod, updates when needed
  • In SQL 2008, tracks column mod, updates when needed. New: Filtered stats.

Async updates 

  • Even if stats need updating, you can use old stats if new ones are not ready
  • To do that, use ALTER DATABASE dbname AUTO_UPDATE_STATISTICS_ASYNC ON
  • Use with caution

Sampling 

  • Sampling is usually OK, but could create issues if data is not evenly distributed
  • You might miss something significant if your sampling does not give you enough
  • You can disable autoupdate and update statistics manually using a full scan
  • To do that "UPDATE STATISTICS... WITH FULL SCAN".

Very large sets 

  • For very large data sets, even a full scan (no sampling) might not be enough.
  • Think a few billion sales records, looking at statistics for customer_number.
  • In that case, you have 1000 large customers and 1 million small ones.
  • With only 201 steps, you just can't get good statistics for both large and small.
  • For those, consider partitioning into multiple tables to get better stats.
  • In SQL 2008, consider filtered statistics (for instance, WHERE customer_type=Large).
  • Automatically created for filtered indexes. Can be created manually.

Conclusion

  • Statistics will only help
  • However, it is always better to have the right indexes
  • As usual, you can't optimize for every single query
  • Pick your battles.

For details, see "Statistics Used by the Query Optimizer in Microsoft SQL Server 2005" at:
http://technet.microsoft.com/en-us/library/cc966419.aspx