There are many approaches to troubleshooting, tuning, and optimizing SQL Server® performance.
But which approach is optimal? This whitepaper will show you. The good news is SQL Server is a mature product with a long history of community-oriented staff and independent experts who happily share their best practices when troubleshooting, tuning, and optimizing SQL Server performance. This whitepaper will serve as a useful and effective template for performance troubleshooting you can use whenever SQL Server performance is called into question.
The optimal approach to SQL Server performance troubleshooting is a tried-and-true methodology utilizing wait statistics, performance monitor queues, and other relevant performance metrics.
Since its inception, SQL Server has provided another set of metrics, broadly known as queues, to show the internal behavior of SQL Server. (Queues were your only method of performance troubleshooting before SQL Server 2005.) The word “queues” in this context is an aggregate term encompassing server-side traces, Profiler traces, Performance Monitor (PerfMon) object counters, WMI counters, and other relevant metrics such as those revealed by the DBCC SQLPERF command.
In a nutshell, the optimal approach to performance troubleshooting is a tried-and-true methodology blending all actionable sources of system information, including waits, queues, and other non-SQL Server performance information.
SQL Server performance troubleshooting using a waits-only approach is problematic. It’s a lot like what cereal advertisers (especially cereals composed mostly of sugar) always say in their advertisements: “Part of a complete breakfast!” But another way to interpret the inferred meaning of this statement is “incomplete or unhealthy by itself.” And the same is true of SQL Server performance troubleshooting when you use too much of a metric, only one metric, or a single set of instrumentation. Sooner or later, you’ll expend a great deal of time and energy by going down a dead-end path, hit a wall, or follow a false lead—to your great peril.
Here’s the good news: wait stats analysis provides an excellent return on time invested in performance troubleshooting because it allows for easy visibility into the areas where SQL Server spends time waiting to do work. It’s especially useful when used in combination with other tools and metrics. In other words, it’s “part of a complete troubleshooting breakfast!”
As implied by the term “wait statistics,” these are a type of system metric designed to tell us in what ways user and system processes are forced to wait. SQL Server wait stats are, at their highest conceptual level, grouped into two broad categories: signal waits and resource waits. A signal wait is accumulated by processes running on SQL Server and waiting for a CPU to become available (so-called because the process has “signaled” it’s ready for time on the CPU). A resource wait is accumulated by processes running on SQL Server and waiting for a specific resource to become available, such as waiting for the release of a lock on a specific record.
The promise of wait stats is to quickly identify where the most pressing bottlenecks are building within SQL Server.
Resource waits are further broken down into hundreds of specific categories (the exact number varies by version of SQL Server). These categories, called wait types, enable you to determine a more specific bottleneck on a SQL Server—such as schema locking issues, plan cache memory pressures, or backup I/O problems—instead of just generic locking, memory, or I/O issues you might identify through other methods. The promise of wait stat analysis is you now have an excellent way to quickly determine where the strongest pressures are building up on specific resources in SQL Server.
In this whitepaper, we’ll walk through the optimal performance troubleshooting process for a variety of scenarios and illustrate the strengths and weaknesses of using a waits-only approach. As Microsoft says in their definitive whitepaper on the subject:
“Performance tuning of applications and solutions has been around for many years. The performance of SQL Server database applications should be evaluated from several different perspectives. Each perspective tells a different section of the complete performance story. Together they paint a detailed performance picture of the whole and also cross-validate observations in each specific perspective.”
Like many memes evangelized on the internet, waits have undergone a somewhat startling transformation in recent years. What was introduced as an additional arrow in your quiver for performance troubleshooting has been subverted into the only method used by many practitioners. This is short-sighted. Though a waits-only approach can be useful for assessing overall SQL Server performance, it shouldn’t be used for deep root cause analysis and troubleshooting. When a waits-only approach is used for troubleshooting, it can be inaccurate and misleading, which can result in wasted time and serious errors.
With SolarWinds® SQL Sentry®, you can effectively monitor, diagnose, and optimize your Microsoft® database environment.
Compare these three reasons for failure to the approaches we’ve personally witnessed, introduced in the first paragraph of this whitepaper.
In those examples, a Windows system admin using only Windows Event Logs and Task Manager has no visibility inside the processes in SQL Server. A .NET developer using only client-side traces and debugging switches can see neither the deep internals of SQL Server nor the Windows Server itself. Moreover, the business user—who only knows their interaction with the application is slow—has no clue at all what’s happening anywhere else in the organization’s infrastructure; often, they don’t even know what’s happening on their desktop workstation.
Using a waits-only approach can lead to all three major problems of performance troubleshooting! The flaw isn’t that you might use wait statistics to troubleshoot a SQL Server performance issue; the flaw is you’re using only wait statistics without including other important and actionable sources of performance data.
When performance troubleshooting SQL Server, you must collect appropriate and actionable information. You’ll need information found both inside and outside of SQL Server itself. Additionally, you’ll need overall contextual and historical information to determine what’s truly normal or abnormal behavior for your server.
Wait stats values, shown by sys.dm_os_wait_stats and other dynamic management views (DMVs), accumulate over time without distinction for granular, process-level identification of the root cause.
To use a metaphor for the appropriateness of waits-only analysis, high wait stat values in many cases are the symptom of an illness but aren’t the actual diagnosis for the illness. A doctor wouldn’t diagnose a patient with a high fever (i.e., the wait stat). Instead, they would note the patient has a high fever, though it may be caused by a wide variety of potential illnesses, and then proceed with more tests and investigation to find the root cause.
Waits-only analysis can reveal the symptoms of a SQL Server performance problem, but it rarely points directly to the root cause of the performance problem.
Here’s a real-world example of how waits-only troubleshooting can miss the mark on providing enough information and truly actionable information:
A small, regional bank who’s a customer of ours had a small IT team who vastly preferred to use Microsoft Access® for all their home-grown applications. Microsoft Access front-end applications will, by default, use a technique to retrieve data from a SQL Server back end. In this process, the application asks for each row it needs from the SQL Server, waits for an acknowledgement, and then asks for the next row, rather than asking for them all in a single cached batch. This painful process is sometimes referred to as RBAR (row-by-agonizing-row).
As a result, the SQL Server back end begins to accumulate a large amount of ASYNC_NETWORK_IO waits as it sends one row to the client, waits for an acknowledgment from the client, and sends the next. The IT team was alarmed to see ASYNC_NETWORK_IO waits go higher with each new branch office they migrated to their new application. Taking the wait stat at face value without assessing any other performance metrics, they began to assume the SQL Server simply needed a faster NIC or, at worst, they needed some improved networking infrastructure. When they analyzed the NIC traffic load by looking at PerfMon counters, they saw the SQL Server and the network itself had a vast amount of headroom still untapped. This was simply a case of bad application design—they didn’t disable the default row-by-row behavior, which at first glance appeared to be a problem with the SQL Server.
This whitepaper presents the steps to follow to achieve an optimal performance troubleshooting experience on SQL Server.
Consider yourself warned. This is just one of a multitude of examples where you might face significant misdiagnoses using a waits-only approach to SQL Server performance troubleshooting.
In the sections that follow, we discuss a variety of important and common shortcomings to consider in waits-only performance troubleshooting. At the same time, we’ll present the step-by-step process to follow to achieve an optimal performance troubleshooting experience.
You just returned to your desk from taking a quick lunch break. As you take your seat, you see you have several messages waiting for you on the corporate voicemail system. You can also see there are some email messages with the urgent marker glaring you in the face. And when you focus, you hear the footfalls of someone rapidly approaching. The program manager for one of the business units pops their head around your cubical wall and says, “What’s wrong with your database? The users are driving me crazy with complaints!”
When performance troubleshooting SQL Server, you want as much actionable information as quickly as possible. “Actionable” is the key word.
Though this may not have happened to you (yet), most DBAs in medium or large IT organizations have had this experience many times. This sort of complaint is so common it’s routine for DBAs interacting with their business customers. The customer can’t reach the database, so apparently, the database is at fault! DBAs are so used to being the first people in IT blamed for an application problem they often develop coping mechanisms to shield themselves from what might turn out to be someone else’s technical issue.
Experienced DBAs are so used to getting the blame for failures elsewhere in the corporate infrastructure they begin to figure out ways to “prove” the problem is caused by a separate silo of the IT organization.
However, any experienced DBA will point out there are a multitude of other places within the corporate infrastructure where the user might encounter a problem—the database could merely be one of many symptoms of the problem. Consequently, your first order of business in performance troubleshooting is to determine whether it’s even your job to be the person troubleshooting the problem!
Moreover, since so many problems originate outside of SQL Server, wait stats have no chance of detecting these issues.
Such common non-SQL Server issues might include the following:
These kinds of blame games are common. They’re so common many DBAs begin to anticipate them and, after enough negative experiences, build out contingencies for failures elsewhere in the corporate infrastructure.
Networking or OS configuration problems might frequently appear to the user as if their favorite database application is actually at fault, especially since they may not use corporate infrastructure for any other purpose. And in some cases, DBAs are the first people users call—not because they suspect it’s a database problem but because the DBAs themselves are the most responsive and knowledgeable about troubleshooting issues like this.
Refer below to Figure 1, where you can see an example of the SolarWinds SQL Sentry Performance Analysis Dashboard. This dashboard has integrated correlations of SQL Server wait statistics with all the other pertinent PerfMon queues.
Keep in mind all values on the left side of the dashboard are for Windows Server—with built-in overlays showing the specific resource consumption of SQL Server—and those on the right side are strictly for SQL Server. Highlighting a spike in wait stats on the middle right pane of the dashboard automatically highlights all the other charts during the same time period so you can correlate SQL Server wait stats with all the other activity occurring on the instance of SQL Server and on the Windows Server. This makes it easy to assess whether a problem is entirely contained within SQL Server, Windows Server, or a combination of the two.
For example, if you use a waits-only approach, you can see in Figure 1 your server has an unexpected spike in network-related waits. Is this a problem? Maybe, but you can only make a correct determination by comparing the wait statistics values to the actual PerfMon counter objects, which track network utilization.
Figure 1: Correlating wait statistics to PerfMon queues.
Looking at the data in Figure 1, we can see there’s indeed a spike in network waits (the light-colored portion highlighted in the waits chart on the right side of the screen). However, if you follow the red arrows from the SQL Server waits on the right side to the network PerfMon counter values on the upper-left portion of the dashboard, you can see the percentage-based value for the two NICs is still less than 1% busy. In other words, the NICs aren’t busy at all!
One interesting aspect of wait statistics is even on quiet servers, you’ll still accumulate wait statistics of various types. Some waits will accumulate a minute of wait time every minute by design—this is because some waits just represent wait time and don’t indicate a performance problem.
Furthermore, a SQL Server performing only a light workload will also accumulate wait statistics. If you just look at wait stats, you might think your top wait stat needs to be fixed pronto! But no, there will always be a top wait. Assuming the top wait always needs to be “dealt with” can be inappropriately turned into a time-consuming endeavor for inexperienced DBAs and system administrators. Imagine you look at a lightly used SQL Server hosting a SharePoint® installation. Its top wait stats relate to LCK (locking) and FTS (full-text search). Does this mean we need to get busy tuning this server? Well, not really—it’s not very busy, after all.
Wait statistics must be assessed in the context of overall server performance. Even a SQL Server with plenty of spare computing capacity will exhibit a wide variety of wait statistics.
In other words, SQL Server instances will always have wait stats in a ranked list from biggest wait stat to smallest. But just because a wait stat is a server’s biggest doesn’t mean it needs attention or remediation. For example, on SQL Server 2014, you can safely ignore a wide variety of wait stats because they measure the natural and internal management of the platform. A few examples of wait stats you can safely ignore include the following:
The examples above are by no means an exhaustive list. Refer to Glenn Berry’s SQL Server Diagnostic Information Queries for an example query with a representative list of wait stats you can safely ignore. Glenn does a great job of updating his diagnostic queries with each new release and cumulative update for both SQL Server and Azure® SQL Database.
Contributed by Paul Randal and reprinted with permission from SQLPerformance.com
Paul Randal (b|t) is the most respected and authoritative independent technical expert in the SQL Server world. I first met Paul in the summer of 2005 during a visit to the Microsoft campus. I was serving as the president of the Professional Association for SQL Server (PASS.org), and Paul was serving as the lead program manager for the SQL Server storage engine team. Paul’s writings and videos are must-have resources for any serious student of Microsoft SQL Server. In an ironic twist, Paul is married to Kimberly Tripp, another of the industry’s top experts and the instructor of my first SQL Server class way back at the release of SQL Server version 6.0. All the content of Surprise #3, with the exception of CXPACKET Waits, was written by Paul.
Performance troubleshooting is an art and a science. The art comes from experience (and learning from others’ experiences) and the science comes from well-known guidelines about what to do in which scenarios.
Or at least this is what I like to think and teach.
In reality, many DBAs and developers practice what I call “knee-jerk performance troubleshooting.” This commonly happens when a performance problem has reached the critical stage with, for instance, queries timing out, processes running slowly or failing, users disgruntled, and management wanting answers and action fast!
Knee-jerk performance troubleshooting happens when the practitioner quickly concludes the most prevalent symptom is the root cause of the problem.
The “knee-jerk” comes from doing some superficial analysis of the problem, jumping to the conclusion (really, it’s grasping at a straw) the most prevalent symptom must be the root cause, and trying to address this to little or no avail, often using misguided or downright incorrect advice found online. This leads to a lot of frustration and wasted time. It also often leads to wasted money when the organization decides to try to throw hardware at the problem by upgrading the server and/or the I/O subsystem, only to find the problem is still there or reappears quickly.
Wait statistics analysis is one of the areas where it’s easiest to knee-jerk, and I’m going to talk about a few of the common wait types and the mistakes people make around them. There isn’t enough scope to go into great depth about what to do in each case, but I’ll give you enough information to point you in the right direction.
Most people assume if locking waits are the most prevalent, then some kind of blocking problem must be the issue. Often it is—the lack of a suitable nonclustered index may be causing a table scan in REPEATABLE_READ or SERIALIZABLE isolation levels escalating to an S table lock. (And as a quick hint, if you think you never use SERIALIZABLE, you do if you use distributed transactions—everything is converted to SERIALIZABLE under the covers, which can lead to unexpected blocking and deadlocks.)
However, blocking is often being caused by something else. Under the default READ_COMMITTED isolation level, locks covering changes are held until the transaction commits and will block reads and other updates to the same rows. If anything prevents a transaction from committing, then blocking could show up.
For instance, if the database is synchronously mirrored, then the transaction can’t commit and release its locks until the log records have been sent across to the mirror and written to the mirror’s log drive. If the network is severely congested or there’s massive I/O contention on the mirror, this could seriously delay the mirroring operation and cause the transaction to take much longer to commit. This would look like blocking, but the root cause is a resource contention, which has to do with mirroring.
For locking waits, you might see an obvious cause by looking at the query plan or the locked resources, such as a table lock or an inappropriate transaction isolation level. If you can’t see an obvious cause, follow the block chain IDs to the head of the blocking chain by querying the blocking_session_id column of the sys.dm_exe_requests DMV. The highest-level block chain ID is the root cause of the issue.
With SolarWinds SQL Sentry, you can effectively monitor, diagnose, and optimize your entire database environment
The name of this one causes lots of confusion. What word do you focus on? NETWORK. The cause of this wait type usually has nothing to do with the network. It should really be called WAITING_FOR_APP_ACKNOWLEDGEMENT or something similar, as this is precisely what’s happening: SQL Server has sent some data to a client application and is waiting for the client to acknowledge it has consumed the data.
One of my favorite demos to do when teaching about wait statistics is to run a query returning a large result set in Management Studio and watch the server rack up ASYNC_NETWORK_IO waits. There’s apparently no network involved—it’s just SSMS taking a long time to reply to SQL Server. It’s doing what is known as RBAR (row-by-agonizing-row), where only one row at a time is pulled from the results and processed instead of caching all the results, immediately replying to SQL Server, and proceeding to process the cached rows.
This is the primary cause of ASYNC_NETWORK_IO waits—poor application design. Occasionally, it’s the network, but this is rare in my experience.
The common knee-jerk reaction here is to equate this wait type with linked servers. However, this wait time became more common to see when SQL Server 2005 shipped because 2005 contained a raft of new DMVs, and DMVs mostly use OLEDB under the covers. Before looking for linked server problems, I’d check whether a monitoring tool is running DMVs constantly on the server.
If you do have linked servers, continue troubleshooting by going to the linked server and looking at the wait statistics there to see what the most prevalent issue is. Then, continue the same analysis.
One other thing capable of causing OLEDB waits is DBCC CHECKDB (and related commands). It uses an OLEDB rowset to communicate information between its Query Processor and Storage Engine subsystems.
When SOS_SCHEDULER_YIELD is the most prevalent on a server, it’s common to see sustained, high CPU usage. The knee-jerk reaction here is the server must be under CPU pressure or a spinlock is the problem.
We need a bit of background here to understand these two reactions.
Thread scheduling in SQL Server is managed by SQL Server itself, not by Windows (i.e., it’s non-preemptive). The SQL OS portion of the Storage Engine provides scheduling functionality as threads transition from running on a Processor (where the thread state is RUNNING) to being on the Waiter List waiting for a resource to become available (state is SUSPENDED) to being on the Runnable Queue once the resource becomes available (state is RUNNABLE) waiting to get to the top of the queue and back onto the Processor again (back to state being RUNNING). I’ve capitalized Processor, Waiter List, and Runnable Queue to identify them as parts of a scheduler.
Whenever a thread needs a resource it can’t immediately acquire, it becomes suspended. Next, waits on the Waiter List are told (signaled) their resource is available. The time spent on the Waiter List is the resource wait time and the time spent on the Runnable Queue is the signal wait time. Together, they combine to be the overall wait time. SQL OS keeps track of the wait time and the signal wait time, so we have to do some math on the output from sys.dm_os_wait_stats to derive the resource wait time (see Paul's script here).
The Waiter List is unordered (any thread on it can be signaled at any time and move to the Runnable Queue), and the Runnable Queue is first-in, first-out (FIFO) almost 100% of the time. The only exception to the Runnable Queue being FIFO is when multiple Resource Governor workload groups have been configured in the same resource pool and have different priorities relative to each other. I’ve never seen this used successfully in production, so I won’t discuss it further.
There’s another reason why a thread may need to move off the Processor—it exhausts its quantum. The thread quantum in SQL OS is fixed at four milliseconds. The thread itself is responsible for determining its quantum has been exhausted (by calling helper routines in SQL OS) and voluntarily giving up the processor (known as yielding). When this occurs, the thread moves directly to the bottom of the Runnable Queue, as there’s nothing for it to wait for. SQL OS must register a wait type for this transition of the Processor, though, and registers SOS_SCHEDULER_YIELD.
This behavior is often mistaken for CPU pressure, but it’s not—it’s just sustained CPU usage. CPU pressure, and recognizing it, is a whole different topic for a future post. As far as this article is concerned, as long as the average signal wait time is low (0 – 0.1 – 0.2 ms), it’s a pretty safe bet CPU pressure isn’t an issue.
A spinlock is a low-level synchronization primitive used to provide thread-safe access to data structures in SQL Server that are extremely hot (very volatile and accessed and changed incredibly frequently by multiple threads). Examples of such structures are the buffer free list in each portion of the buffer pool and the proportional-fill weightings array for the data files in a filegroup.
When a thread needs to acquire a spinlock, it looks to see if the spinlock is free, and if so, it immediately acquires it (using an interlocked assembly-language primitive like “test bit clear and set”). If the spinlock can’t be acquired, the thread immediately tries to acquire it again and again and again for up to a thousand iterations. It does this until it backs off and goes to the bottom of the Runnable Queue, registering an SOS_SCHEDULER_YIELD wait.
There are two main causes for SOS_SCHEDULER_YIELD: a thread exhausting its scheduling quantum and a thread failing to acquire a spinlock. Heavily recurring instances of either cause can lead to SOS_SCHEDULER_YIELD being the most common wait, along with high CPU usage.
Nearly 100% of the time, the first case is what’s happening. It’s extremely rare for a spinlock to be the cause of high CPU and high SOS_SCHEDULER_YIELD. The only way to prove a spinlock is or isn’t the cause is to capture SQL Server call stacks when the wait type occurs using Extended Events and debug symbols from Microsoft. (Paul has a blog post describing and showing how to perform this investigation, and he also has a great whitepaper about spinlocks and spinlock investigations worth reading if you’re interested in this depth of internals.)
In the case of quantum exhaustion, it isn’t the root cause. It’s a further symptom. Now, we need to consider why a thread may be exhausting its quantum repeatedly.
A thread can only exhaust its quantum when it can continue processing SQL Server code for four milliseconds without needing a resource another thread owns—no waiting for locks, page latches, data file pages to be read from disk, memory allocations, file growths, logging, or the myriad other resources a thread might need.
The most common piece of code where quantum exhaustion can occur and rack up large amounts of SOS_SCHEDULER_YIELD waits is scanning an index/table where all the necessary data file’s pages are in memory and there’s no contention for access to those pages. This is what I encourage you to look for in query plans when you see SOS_SCHEDULER_YIELD as the top wait type—large and/or repeated index/table scans.
This doesn’t mean large scans are wrong, as the most efficient way to process your workload may be through a scan. However, if the SOS_SCHEDULER_YIELD waits are new and unusual and are caused by large scans, you should investigate why the query plans are using scans. Maybe someone dropped a critical nonclustered index or statistics are out-of-date, so an incorrect query plan was chosen. Maybe an unusual parameter value was passed to a stored procedure and the query plan called for a scan or a code change occurred without supporting index additions.
Just as with other wait types, understanding precisely what SOS_SCHEDULER_YIELD means is key to understanding how to troubleshoot it and knowing whether the behavior is expected because of the workload being processed.
When you have a performance problem, take the time to understand the data you have and perform further investigations to help narrow down the root cause of the problem. Don’t just grasp at whatever seems to be the top wait statistic and follow the first piece of advice you come across online (unless it’s from a well-known and reputable source), or you likely won’t solve your problem—and you may even make it worse.
Continuing the theme of wait stats capable of inducing a knee-jerk troubleshooting reaction from the less experienced, I’ll begin by saying there are indeed multiple kinds of PAGEIOLATCH waits, and I’ve signified them with the XX at the end. The most common examples are as follows:
Of these, the most common type—by far—is PAGEIOLATCH_SH.
When this wait type is the most prevalent on a server, the knee-jerk reaction is to conclude the I/O subsystem must have a problem, so this is where investigations should be focused.
The first thing to do is to compare the PAGEIOLATCH_SH wait count and duration against your baseline. If the volume of waits is more or less the same but the duration of each read wait has become much longer, then I’d be concerned about an I/O subsystem problem like the following:
In my experience, the pattern often goes like this: the number of PAGEIOLATCH_SH waits has increased substantially from the baseline (normal) amount and the wait duration has also increased (i.e., the time for a read I/O has increased) because the large number of reads overloads the I/O subsystem. This isn't an I/O subsystem problem—this is SQL Server driving more I/Os than it should. The focus now needs to switch to SQL Server to identify the cause of the extra I/Os.
SQL Server has two types of reads: logical I/Os and physical I/Os. When the Access Methods portion of the Storage Engine needs to access a page, it asks the buffer pool for a pointer to the page in memory (called a logical I/O), and the buffer pool checks through its metadata to see if the page is already in memory.
If the page is in memory, the buffer pool gives the Access Methods the pointer and the I/O remains a logical I/O. If the page isn’t in memory, the buffer pool issues a “real” I/O (called a physical I/O) and the thread has to wait for it to complete, incurring a PAGEIOLATCH_XX wait. Once the I/O completes and the pointer is available, the thread is notified and can continue running.
In an ideal world, your entire workload would fit in memory, so once the buffer pool has “warmed up” and holds all the workload, no more reads are required, only writes of updated data. It’s not an ideal world, though, and most of you don’t have this luxury, so some reads are inevitable. As long as the number of reads stays around your baseline amount, there’s no problem.
When a large number of reads are required suddenly and unexpectedly, this is a sign there’s a significant change in either the workload, the amount of buffer pool memory available for storing in-memory copies of pages, or both.
Here are some possible causes (not an exhaustive list):
One pattern suggesting a table/clustered index scan is the cause is also seeing a large number of CXPACKET waits along with the PAGEIOLATCH_SH waits. This is a typical pattern indicating large, parallel table/clustered index scans occurring.
In all cases, you could look at what query plan is causing the PAGEIOLATCH_SH waits using the sys.dm_os_waiting_tasks and other DMVs, and you can get the Transact-SQL code to do this in my blog post located at SQLskills.com. On the other hand, SQL Sentry can help you identify the culprit without getting your hands dirty.
In a simple (obviously contrived) example, let's assume I’m on a client system using SQL Sentry and see a spike in I/O waits in the SQL Sentry dashboard view, as shown in Figure 2.
Figure 2: Spotting a spike in I/O waits in the SQL Sentry dashboard.
I decide to investigate by right-clicking a selected time interval around the time of the spike and jumping over to the Top SQL view, which is going to show me the most expensive queries that have executed, as shown in Figure 3.
Figure 3: Highlighting a time range and navigating to Top SQL.
In this view, I can see which long-running or high I/O queries were running at the time the spike occurred and drill into their query plans. In this case, as shown in Figure 4, there’s just one long-running query, which ran for nearly a minute:
Figure 4: Viewing a long-running query in SQL Sentry Top SQL.
If I look at the plan in the SQL Sentry client or open it in SolarWinds Plan Explorer®, I immediately see multiple problems. The number of reads required to return seven rows seems far too high, the delta between estimated and actual rows is large, and the plan shows an index scan occurring where I would have expected a seek, as shown in Figure 5.
Figure 5: Seeing implicit conversion warnings in the query plan
The cause of all this is highlighted in the warning on the SELECT operator: it's an implicit conversion!
Implicit conversions are an insidious problem caused by either a mismatch between the search predicate data type and the data type of the column being searched or a calculation being performed on the table column rather than the search predicate. In either case, SQL Server cannot use an index seek on the table column and must use a scan instead.
This can crop up in seemingly innocent code, and a common example is using a date calculation. If you have a table storing the age of customers and you want to perform a calculation to see how many are 21 years old or older today, you might write code like this:
With this code, the calculation is on the table column, so an index seek can’t be used. This results in an unseekable expression (technically known as a non-SARGable expression) and a table/clustered index scan. This can be solved by moving the calculation to the other side of the operator:
Don’t fall into the trap of thinking excessive PAGEIOLATCH_XX waits are caused by the I/O subsystem. In my experience, they're usually caused by something to do with SQL Server, and this is where I’d start troubleshooting.
Perhaps the most misunderstood of all wait stats is the CXPACKET wait. The term refers to “Class eXchange Packet”—a “packet” of rows exchanged between two parallel threads, consisting of a producer thread and a consumer thread, within a single process. To make a long story short, CXPACKET waits are a naturally occurring wait stat on any SQL Server instance utilizing parallelized queries. Whenever you see a high CXPACKET wait stat, think parallelism. Not only is CXPACKET a completely natural wait stat, it also may be a good indicator, since parallelized queries often outperform their serialized brethren.
In an example described by Adam Machanic, we can see a rather simple looking query producing a high number of CXPACKET waits in Figure 6.
Figure 6: An example query from Adam Machanic reliably producing CXPACKET waits.
Many DBAs would instinctively react with a knee-jerk response to high CXPACKET waits by disabling or throttling parallelism, either by tinkering with the MAXDOP (i.e., maximum degrees of parallelism) and/or Cost Threshold for Parallelism settings. Instead, you should first consider the overall performance of the server.
As more and more servers these days ship with a large number of CPU cores, SQL Server seeks to distribute an ever-increasing amount of its workload in parallel across multiple CPUs simultaneously. There’s a simple calculation to keep in mind:
SQL Server + Multiple CPUs = Parallelism = CXPACKET Waits
Parallelism, when you think about it in light of the calculation shown above, is neither good nor bad. It’s simply normal operational behavior.
Consider, for example, all parallelized queries contain more than one thread. There’s a coordinator thread handling all of the gather and partition stream activities. It will always accumulate CXPACKET waits while performing this activity. Depending on the number of CPUs the query is spread across, additional CXPACKET waits may accumulate, especially if one or more threads execute their portion of the workload quickly while other threads (perhaps lumbering along on a different, slower I/O subsystem) are taking a few seconds longer to complete. In this case, all the waiting threads will also accumulate CXPACKET waits, even though the exact reason they’re waiting is a disparity in I/O subsystem speeds.
So what’s the appropriate response to a server with high CXPACKET waits? Like in the earlier section illustrated by Figure 2, there’s no better way to know if there’s a problem than to see wait stats for a specific period of time and then compare those specific wait stats against what’s normal for that time.
By comparing against the baseline value in SQL Sentry for wait stats, you can quickly see whether the overall wait stat values are abnormal. Then, after determining there’s an issue, you can perform deeper analysis to determine the other issues in play making the workload different from normal.
Upon further research, you may find one or more of the following root conditions manifesting themselves as higher CXPACKET waits:
A common problem for alerting systems is they don’t provide holistic performance information across the entire Microsoft stack where the SQL Server instance is installed.
Paul Randal and the team at SQLskills.com have several useful resources related to wait statistics:
Brent Ozar has a fun and concise write-up on wait stats and performance troubleshooting in an article entitled "The Ozar Family Tradition of Performance Monitoring."
For more in-depth explanations of specific wait types, use these references:
A couple of free and useful tools you should consider adding to your tool kit include the following:
This whitepaper discussed the most common problems facing SQL Server database teams who focus their performance troubleshooting activities exclusively on wait statistics. By focusing only on wait statistics, DBAs and performance engineers may waste valuable time chasing down misguided paths and following red herrings. The most common issues, which we discussed in detail, include the following:
SolarWinds SQL Sentry is designed from the ground up to handle these issues quickly, directly, and easily. SQL Sentry also makes it easy to scale out performance troubleshooting activities throughout your IT enterprise by enabling you to create alerts to foresee problems and create your own baselines so the system knows what’s “normal” for your workload. SQL Sentry also provides vivid visualizations to enable quick drill-down troubleshooting and root cause detection, diagnostics, and resolution.