Statistics - What, why and how (Part 1)
Published On: February 23, 2015
Categories: Statistics 1
I've had the opportunity to present my session on statistics at a few SQL Saturdays so far. I've found that there are more than a few people who are either unaware of or unsure about how statistics are used by the query optimizer, so I thought I'd write a few blog posts explaining what statistics are and how they're used.
If you've ever looked at an execution plan, you've seen statistics at work. They may or may not be good stats, but they're there.
The value for estimated rows is based on the statistics provided to the Query Optimizer prior to runtime. That row count (or cardinality) helps the query optimizer decide both which query operators to use and which plan is 'best' overall. If statistics don't exist, the Query Optimizer makes guesses based on the total row count of the table and has to make assumptions about the distribution of values. As you can imagine, that doesn't always work out well. In the case of the query above, there were statistics, but they weren't accurate.
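As a quick way to see this for yourself, you can compare estimated and actual row counts in a plan. This is a minimal sketch assuming the AdventureWorks sample database and its Person.Person table; substitute your own table and predicate:

```sql
-- Assumes the AdventureWorks sample database; substitute your own table.
-- In SSMS, turn on "Include Actual Execution Plan" (Ctrl+M), then run:
SELECT BusinessEntityID, FirstName, LastName
FROM Person.Person
WHERE LastName = N'Smith';

-- Hover over the seek/scan operator in the plan and compare
-- "Estimated Number of Rows" with "Actual Number of Rows".
-- A large gap between the two suggests missing or inaccurate statistics.
```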
So what are statistics anyway?
Statistics contain information about the distribution of values within a column. If I were querying an employee table by last name, statistics would indicate how many distinct last names existed and the count for each of those last names.
Now, depending on the number of distinct values, that count might be derived, but more on that in part 2. This information helps the Query Optimizer (QO from now on) make educated decisions on how best to retrieve data. One of the most obvious decisions is whether to use an index scan or an index seek. I use this as an analogy - if someone gave me a book and asked me to circle each instance of the word 'and', it might be easiest to go through every page in the book (index scan). Now if they gave me the same book and asked me to circle each instance of the word 'giraffe', I'd probably go to the index and see whether it existed in that book (index seek). There are other decisions that are based on statistics - whether the QO should consider parallel plans, and which tables should be the outer/inner table in a nested loops join, to name a few.
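To see the book analogy in action, try two predicates with very different selectivity. This sketch again assumes AdventureWorks with a nonclustered index on Person.Person(LastName); with your own data, pick one very common value and one rare (or nonexistent) one:

```sql
-- Assumes AdventureWorks with a nonclustered index on Person.Person(LastName).

-- A common pattern: the stats tell the QO that many rows qualify,
-- so reading a larger range (or scanning) may be the cheaper choice.
SELECT LastName FROM Person.Person WHERE LastName LIKE N'S%';

-- A rare (here, nonexistent) value: the stats tell the QO that almost
-- no rows qualify, so an index seek wins.
SELECT LastName FROM Person.Person WHERE LastName = N'Giraffe';
```

Compare the two execution plans - the estimated row counts driving the seek-versus-scan choice come straight from the statistics on the LastName column.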
How are statistics created?
There are a few different ways that statistics can be created.
- When an index is created - Whenever you create an index, a statistics object is also created on that index's key columns (with the histogram built on the leading column). That happens automatically, and the statistic will have the same name as the index.
- AUTO_CREATE_STATISTICS - This is a database-level setting. Enabling it will cause the query optimizer to create statistics on columns referenced in predicates that don't already have statistics. Statistics that are created automatically will be named _WA_SYS_... So, if I changed the query above to look for FirstName instead of LastName and there were no statistics on the FirstName column, statistics would be created on that column prior to runtime. Typically this runs very quickly, but like anything, test and verify.
- CREATE STATISTICS - This can be used to explicitly create statistics on columns. This also provides the ability to create filtered statistics and multicolumn statistics.
- sp_createstats - This stored procedure will automatically create statistics on any columns that aren't already the leading column of an existing statistics object. It's probably best described as the sledgehammer of statistics creation. The indexonly parameter, however, will limit that to columns referenced in an index that aren't the first column of the index (this does not include included columns). You can make the argument that all of the columns in an index may be queried at some point, so it makes sense to have statistics available, especially in reporting or data warehouse environments.
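The creation paths above can be sketched in T-SQL. The table, index, and statistics names here are hypothetical placeholders (a made-up dbo.Employee table); substitute your own objects:

```sql
-- 1. Index creation: a statistic with the same name as the index is
--    created automatically on the index key columns.
CREATE NONCLUSTERED INDEX IX_Employee_LastName
    ON dbo.Employee (LastName);

-- 2. AUTO_CREATE_STATISTICS: database-level setting (on by default).
--    'YourDatabase' is a placeholder name.
ALTER DATABASE YourDatabase SET AUTO_CREATE_STATISTICS ON;

-- 3. CREATE STATISTICS: explicit creation, including multicolumn
--    and filtered statistics.
CREATE STATISTICS st_Employee_LastFirst
    ON dbo.Employee (LastName, FirstName);
CREATE STATISTICS st_Employee_RecentHires
    ON dbo.Employee (HireDate)
    WHERE HireDate >= '2014-01-01';

-- 4. sp_createstats: the sledgehammer. With 'indexonly', limited to
--    non-leading index columns.
EXEC sys.sp_createstats @indexonly = 'indexonly';
```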
In my next post, I'll talk about how to view stats and the statistics object (how the Query Optimizer finds those estimated row counts).
Lori (@loriedwards) has been with SentryOne since 2013, and has recently transitioned from her role as Senior Solutions Engineer to Training Manager. She is responsible for all learning strategy and architecture, including building, executing, measuring, and evaluating training for SentryOne. Lori is also currently serving as a Director at Large for PASS. Lori shares tips and tidbits for DBAs picked up over years of working as a DBA, as well as stories about her job here. Whether you are a new DBA who wants to learn or someone interested in learning about what it is like to work for SentryOne, be sure to check out Lori’s blog.