Indexing Strategies for SQL Server Performance
One of the easiest ways to increase query performance in SQL Server is to make sure it can quickly access the requested data as efficiently as possible. In SQL Server, using one or more indexes can be exactly the fix you need. In fact, indexes are so important, SQL Server can warn you when it figures out there’s an index missing that would benefit a query. This high-level post will explain what indexes are, why they’re so important, and a bit of both the art and science of various indexing strategies.
What Are Indexes?
An index is a way of organizing data. SQL Server supports a variety of index types (see here for details), but this post will consider only the two most common ones, which are useful in different ways and for a wide number of workloads: clustered and nonclustered indexes.
A table without a clustered index is called a heap, where the data rows in the table are unordered. If there are no indexes on the heap, finding a particular data value in the table requires reading all the data rows in the table (called a table scan). This is very inefficient and becomes more so the larger the table grows.
A clustered index on a table arranges all the data rows in the table into a sorted order and places a navigational “tree” with the organized data, so it’s easily navigated. The table is no longer a heap; it’s a clustered table. The order is defined by the clustered index key, which is comprised of one or more columns from the table. The structure of a clustered index is known as a B-tree, and this basic data structure allows a specific data row to be located (called a “seek”) based on the clustered index key value, without having to scan the whole table.
A good example of a clustered index is a table with the details of a company’s employees, where the table has a clustered index using the Employee ID as the key. All the rows in the table are stored in the clustered index in order of Employee ID, so finding the details of a particular employee using their Employee ID is very efficient.
A clustered index only allows efficient location of data rows based on the clustered index key. If you need to be able to find data rows quickly using a different key value, then one or more additional indexes must be created, otherwise a table scan is required. For nonclustered indexes, each index row contains the nonclustered index key value and a locator for the corresponding data row (this is the data row’s physical location for a heap or the data row’s clustered index key for a clustered index).
Continuing the Employee table example, if someone wants to find the details of a particular employee and only knows the employee’s name, a nonclustered index could be created with a composite key of the LastName, FirstName, and MiddleInitial table columns. This would allow the Employee ID for an employee to be found, and then retrieve all the employee’s details from the corresponding data row in the clustered index.
Why Are Indexes So Important?
As you’ve no doubt gathered, the primary use of indexes is to allow the efficient retrieval of data from a table without having to perform a table scan. By limiting the amount of data that has to be accessed and then processed, there are a lot of benefits to overall workload performance:
- Minimal amount of data has to be read from disk. This prevents undue pressure on the I/O subsystem from many queries reading inefficient or larger amounts of data and helps prevent “churn” in the buffer pool (the in-memory cache of data file pages) by not forcing data already in memory to be dropped from memory to make space for data be read from disk. In some cases, no data will have to be read from disk, if the required data is already in memory.
- Minimal amount of data has to take up space in the buffer pool. This means more of the “working set” of the workload can be held in memory, further reducing the need for physical reads.
- Any reduction in the amount of physical reads a query must perform will lead to a drop in execution time.
- Any reduction in the amount of data that flows through the query plan will lead to a drop in execution time.
In addition to indexes, other things can help produce the benefits above, including:
- Using proper join conditions
- Using search arguments to further narrow the data required
- Avoiding coding practices that force a table scan to be used, such as in advertently causing implicit conversions
- Making sure statistics are maintained correctly, so the query optimizer can choose the best processing strategies and indexes
- Considering the execution method of a query where a cached plan has been used, resulting in parameter sensitivity problems
But these are all topics for future posts!
The Art and Science of Indexing
There are two parts to index tuning a workload—there’s both an art and a science. The science is that for any query there is always a perfect index, but the art is realizing the index may not be in the best interests of the overall database or server workload, and figuring out what the best overall solution is for your server takes analyzing the server’s workload and priorities.
Clustered index key choice is more of a science than an art, and is a whole discussion by itself, but we usually say a clustered index key should have multiple properties (in no particular order):
- The clustered index key is the data row locator included in every index row in every nonclustered index. This means the narrower it is, the less space it will take up overall, which will help with data size.
- Fixed-width. A clustered index key should be narrow but also use a fixed-width data type. When a variable-width data type is used, then the data row and all nonclustered index rows will incur additional overhead.
- If the clustered index key isn’t unique, then a special, hidden “uniquifier” column is added to the clustered index key for all non-unique data rows, making the clustered index key up to four bytes longer for those rows.
- If a clustered index key value changes, the data row must be deleted and reinserted internally, and all nonclustered index records containing that data row locator must be updated.
- Ever-increasing. This property helps to prevent index fragmentation from occurring in the clustered index.
- Non-nullable. The clustered index key should be unique by definition (see #3 above), so it implies it cannot allow NULL values. In some SQL Server versions and in some structures, a nullable column would incur more overhead than a non-nullable column. Ideally, none of the columns that make up the clustered index key would allow NULL values.
As a generalization and because you can only have one clustered index, usually nonclustered indexes (and multiple of them) help queries run more efficiently.
The science of constructing the best nonclustered index for a query involves:
- Understanding the search arguments being used and the type of query (as there are different indexing strategies, for instance, when search arguments use AND or OR clauses, when aggregates are involved, and for different join types). The search arguments are basically which table columns are necessary to identify the required data rows. These will likely be part of the nonclustered index keys.
- Understanding the “selectivity” of the data in each of these key columns. This will dictate the order of the columns in the index key, with the most selective predicates leading the key definition.
- Understanding the SELECT list for the query. Any of these columns may be candidates for being included in the index as non-key columns to avoid the query having to go to the data row to retrieve them (also known as “covering” a query).
And there’s also SQL Server’s missing indexes functionality that will recommend the best index for a query (it focuses on just the science of “query tuning” but not the art of “server tuning”).
The art then becomes taking the index and figuring out whether and how it can be consolidated with other existing or also recommended indexes, so the table doesn’t become over-indexed.
As a simple example, let’s say a table has ten int columns named col1 through col10.
The first query to index is “SELECT col2, col3 FROM table WHERE col6 = value.” A nonclustered index on col6 would avoid a table scan, but would require the query to go to the data row to get the values for col2 and col3. A more efficient nonclustered index would have col6 as the key and include col2 and col3 as non-key columns. This is called a covering index, because the index row has all the columns necessary for the index and removes the need to use the clustered index as well to get the additionally requested columns.
The second query to index is “SELECT col4 FROM table WHERE col6 = value.” The science tells us a nonclustered index on col6 that includes col4 is likely the best index for the query. But then there are two nonclustered indexes keyed on col6, each including different non-key columns. This is where the art comes in, as the best index for the overall workload is likely a single nonclustered index on col6 that includes col2, col3, and col4. Now you have one index with more uses and fewer overall indexes on the table.
And the art can continue through multiple iterations.
Let’s say a third query is created: “SELECT col4, col5 from table where col6 = value AND col2 = value.” The science may say the best nonclustered index is on (col6, col2) if col6 is more selective than col2, and including col4 and col5 as non-ley columns. The art then has us look at consolidation and end up with a single nonclustered index on (col6, col2) that includes col3, col4, and col5. This satisfies all three queries with a single nonclustered index instead of three, so it takes up less space overall at the expense of being less efficient for each query than the individual “perfect” nonclustered indexes would be. However, there’s an added benefit to this consolidation: the fewer nonclustered indexes there are, there less amount of index maintenance needs to be done when a data row is inserted, deleted, or updated.
Obviously, there’s point where you may over-consolidate as well, and experience in indexing design helps hone your art, so you’re not under-indexing, over-indexing, or over-consolidating.
There’s a lot more to the art and science of designing an indexing strategy than can be covered in a post such as this, but hopefully you now understand why having a good indexing strategy is so important. A deeper primer on indexing is Kimberly L. Tripp’s seven-hour Pluralsight course SQL Server: Indexing for Performance, and if you want programmatic help with your indexing strategy, the SolarWinds® Plan Explorer tool (a standalone free tool or part of SolarWinds SQL Sentry) has comprehensive Index Analysis capabilities.
Paul S. Randal is the CEO of SQLskills.com, which he runs with his wife Kimberly L. Tripp. Both Paul and Kimberly are widely-known and respected experts in the SQL Server world, and both are long-time SQL Server MVPs. Paul was a Contributing Editor for TechNet Magazine, where he wrote the bi-monthly SQL Q&A column and feature articles. He also had #1 top-rated workshops and sessions at the PASS Summit and TechEd. Paul is active in the SQL Server community, from user groups to online forums to helping out on Twitter (@PaulRandal – check out the #sqlhelp tag). His popular and widely-referenced blog can be found at https://www.sqlskills.com/blogs/paul/ and he can be reached at firstname.lastname@example.org.