Statistics - What, why and how (Part 2)
In my last post, I wrote about what statistics are and how they're created. In this post, I'll be covering where the Query Optimizer (QO from now on) gets the estimated row counts that show up in an execution plan.
In an execution plan, all of the operators have an associated cost.
Part of determining that cost can be cardinality - or the number of rows that are expected to by returned by that operation.
That number, as I'm sure that you can guess, comes from statistics, if they exist. As I mentioned in my last post, if there aren't statistics on a column, then the QO has to guess and that rarely turns out well.
Using the DBCC SHOW_STATISTICS command, we can actually see the data that the QO is using in its decision making. DBCC SHOW_STATISTICS returns something called a statistics object. You'll pass in a table or indexed view name along with an index, column or statistic name.
DBCC SHOW_STATISTICS(N'Person.Person', N'IX_Person_LastName_FirstName_MiddleName');
There are three parts to the statistics object, so let's go through them.
The header provides meta data about this particular statistic:
- Name: If the statistic was created as a result of index creation, the name will match the index name. If it was user created, it will be the name provided. If the statistic was auto-created (as a result of having AUTO_CREATE_STATISTICS enabled), the name will be _WA_SYS__ - so something like _WA_Sys_00000004_21B6055D.
- Updated: Pretty straightforward, but good to know - this is the date that this set of statistics was last updated.
- Rows and Rows Sampled: Statistics, like indexes, can be built or rebuilt either with a full scan or a sample. These two fields will let you know which was used.
- Steps: This lets you know the number of steps used in the histogram (see below). Think of steps as buckets that can hold a range of data.
- Density: While density is important, the density shown in the header is for backwards compatibility for versions earlier than SQL Server 2008. The density that you'll want to see is covered in the next section.
- Average key length/String Index/Filter Expression/Unfiltered Rows: These columns provide more information about the statistics - storage needs, data type and if it's filtered. Like indexes, statistics can also be filtered.
The second section of the statistics object is called the density vector. Density, in regards to statistics, has to do with the number of unique values in the column. In other words, if you ran a SELECT DISTINCT (columnname) - the number of unique records returned would give you the density. Density is represented by a number greater than zero (0) and less than or equal to one (1). The density value is determined by dividing one by the number of distinct values (1/n). A density of one would indicate that there is one unique value.
While the histogram (explained below) provides cardinality estimates for distinct values, the density can be used to make an educated guess when the values aren't known at runtime or don't exist in the histogram. The OPTIMIZE FOR UNKNOWN query hint actually uses the All Density value multiplied by the number of rows to estimate the number of rows returned without knowing the parameter value at runtime.
Statistics that are created on indexes are created on the first column of the index. That being said, the density vector will display the estimated density for the combination of fields. For example, a statistic on column A where the index consists of A,B,C would show the density for A, for A, B and for A, B, C. One thing to keep in mind is that there is an algorithm that determines those densities - so it may or may not return valid results. This algorithm changed for SQL Server 2014 - for more information, read the white paper available here.
The histogram returned by DBCC SHOW_STATISTICS is a table structure that shows the values that exist within a column. For this type of histogram, there is maximum of 200 steps, so if there are more than 200 distinct values, they are grouped together in an ordered data range. For example, if a column had numeric values between 1 and 500, the first step might contain value 1, the second 2-8, the third 9-13, etc.
There are five columns in a histogram:
- RANGE_HI_KEY: This is the highest value in the current step.
- RANGE_ROWS: This contains the number of rows for values greater than the previous HI_KEY and the current HI_KEY.
- EQ_ROWS: The number of rows that match the HI_KEY
- DISTINCT_RANGE_ROWS: The number of distinct values between the previous HI_KEY and the current HI_KEY
- AVG_RANGE_ROWS: The average number of rows per value between the previous HI_KEY and the current HI_KEY. This equates to the RANGE_ROWS/DISTINCT_RANGE_ROWS.
In the example above, if we were to query for LastName = 'Alvarez', since Alvarez is a HI_KEY value, the QO would look at the EQ_ROWS column and see that 99 rows would be returned.
If we were to query for LastName = 'Alvaro', that value is between Alvarez and Anand. Looking at the RANGE_ROWS for Anand, we can see that there are 2 rows between the previous HI_KEY (Alvarez) and the current HI_KEY (Anand). We can also see that there are 2 DISTINCT_RANGE_ROWS. 2/2 provides an AVG_RANGE_ROW value of 1, so the QO will return estimated rows equal to 1.
Hopefully this provides some insight into how the QO determines how many rows to expect. In the next post, I'll talk a little more about the impact of these cardinality estimations.