Statistics - What, why and how (Part 3)
In my last post, we covered the output of
DBCC SHOW_STATISTICS and saw how the the estimated rows are determined for a operator. That cardinality estimate (based on statistics) along with the operator cost model allows the query optimizer to come up with an operator cost. The total of these operator costs becomes the plan cost. As a result, if statistics are incorrect or out of date, operator costs may be off and, as a result, plan costs can be inaccurate.
There are several decisions that can be affected by cardinality such as:
- Index seek vs. index scan
- Clustered index scan vs. index seek + key lookup
- Serial or parallel plans
- Join algorithms (including inner/outer table selection)
...to name a few.
Let's look at how statistics are used to generate effective plans. Using the AdventureWorks2012 sample database, we'll query the
SalesOrderDetail table using the
ProductID column in the predicate. As you can see, there is an index on
ProductID, so there will be associated statistics on that column.
The queries we'll use will be the same except for the
ProductID that we're trying to find.
Now the only thing different about these queries is the value we're looking for in the predicate, but as you can see using SQL Sentry Plan Explorer, two different plans were generated. You can click on the image to enlarge it:
We can quickly see why those two different plans were generated by looking at the estimated row counts below.
Since both of these queries were pulling back a number of columns and there were no covering indexes, the query optimizer knew that there were two possibilities: Either a clustered index scan, or an index seek on
IX_SalesOrderDetail_ProductID plus a key lookup to get the rest of the column values. Key lookups can be fairly expensive, so you'll typically only see those when there are very few rows to return.
On the query looking for
ProductID = 870, the query optimizer could look at the histogram for that column and see that over 4,600 rows would be returned. That many key lookups would be too expensive, so it went with the clustered index scan. With the query looking for
ProductID = 897, the optimizer thought that only 75 rows would be returned and so decided to take the index seek + key lookup route.
This brings up another question. For the query with
ProductID = 897, the optimizer expected 75 rows, but only 2 were actually returned. Since the estimated vs. actual rows on the first query were correct, why is this one off? Let's go look at the histogram for
ProductID histogram, the value 870 is a
RANGE_HI_KEY, so the optimizer could look at the
EQ_ROWS column to get the exact count. The value 897 is not a
RANGE_HI_KEY, so the query optimizer used the value in the
AVG_RANGE_ROWS column to estimate the number of rows returned.
In the next (and final?) post for this series, we'll go over statistics maintenance.
Note: To get to the first post in the series - a quick explanation of statistics, you can go here
Lori (@loriedwards) has been with SentryOne since 2013, and has recently transitioned from her role as Senior Solutions Engineer to Training Manager. She is responsible for all learning strategy and architecture, including building, executing, measuring, and evaluating training for SentryOne. Lori is also currently serving as a Director at Large for PASS. Lori shares tips and tidbits for DBAs picked up over years of working as a DBA, as well as stories about her job here. Whether you are a new DBA who wants to learn or someone interested in learning about what it is like to work for SentryOne, be sure to check out Lori’s blog.