Troubleshooting SQL Server Part 4: Correlating Wait Statistics Against Other Metrics
In the popular white paper entitled Troubleshooting SQL Server Wait Statistics, I take you through three scenarios in which a wait stats-only approach is insufficient or even harmful to detecting, diagnosing, and resolving performance problems on an instance of SQL Server. The white paper identifies FIVE key steps to effective performance troubleshooting on Microsoft SQL Server (below), with one follow-up step to basically learn from past problems.
SQL Server Troubleshooting Checklist
In the previous three blog posts in this series, I discussed:
- Ways we can take shortcuts in the troubleshooting process, few that there are. Step 1 on the checklist above.
- Determining whether the problem was inside of SQL Server (or not) and caused by a SQL Server error (or not). That's step 2 and 3 on the checklist above.
- Assessing the ranking of overall wait statistics to narrow down the performance bottlenecks occurring in our SQL Server. Step 4 on the checklist above.
Today, we're going to talk how to go about Step 5 on the checklist above - Correlate wait statistics against other system metrics.
Using PerfMon to Correlate and Corroborate Wait Stat Findings
PerfMon queues, if you will cast your mind back to the introduction early in this white paper, are instrumented through PerfMon object counters and pre-date SQL Server 2005 when wait statistics were first introduced. There are a variety of tools and methods for retrieving SQL Server PerfMon object counters, including common GUI tools like Windows Performance Monitor and Resource Monitor. Those tools are particularly useful for object counters that are strictly related to the OS and hardware. However, SQL Server also exposes a very useful DMV, sys.dm_os_performance_counters, showing the present state of all PerfMon counters related to SQL Server.
We might conclude, by looking only at the wait statistics information displayed in the last article in this series, that this SQL Server is greatly hindered by blocking. But that conclusion, as with many others based strictly on waits, is often misguided because other system metrics and indicators have not corroborated the initial observation. A natural next step to assessing the activities of the individual sessions might be to see if there is excessive locking overall, which can be determined with a quick query of sys.dm_os_performance_counters, as shown in Figure 1 below:
Figure 1: Correlating Wait Stats with Performance Counters
As shown in the example above, querying sys.dm_os_performance_counters enables us to target specific locking metrics, such as the Average Wait Time. Of all of the types (i.e. instance_names) of locks going on at the time, only the average key wait time was increasing. Put another way, all of those locks which looked so dramatic a moment ago were basically waiting for a shared or exclusive lock on a key lookup value. When looking at the table, I discovered that it did not have a clustered index. What might have turned into a long running Transact-SQL refactoring project to improve the SQL transaction in this application can instead be resolved by putting a clustered index on the NEW_ORDERS table.
A small handful of useful PerfMon counters, among many others not mentioned in this white paper, include the following:
|:Access Methods||Forwarded Records/sec||<10 per 100 br/s||Tables with records traversed by a forwarded record pointer. Should be < 10 per 100 batch requests/sec.|
|:Access Methods||Page Splits/sec||<20 per 100 br/s||Number of 8k pages that filled and split into two new pages. Should be <20 per 100 batch requests/sec.|
|:Databases||Log Growths/sec;Percent Log used||< 1 and <80%, respectively||Do not let transaction log growth happen randomly!|
|:SQL Statistics||Batch Requests/sec||No firm number without benchmarking, but > 1000 is a busy system.|
|:SQL Statistics||Compilations/sec;Recompilations/sec||Compilations should be <10% of batch requests/sec; Recompilations should be <10% of compilations/sec|
|:Locks||Deadlocks/sec||< 1||Nbr of lock requests that caused a deadlock.|
Some of the counters listed in the table above are important because you would miss these issues entirely if you used a waits-only approach. For example, the forwarded records/sec indicator should be a very low number – the lower, the better. If, on the other hand, you have a lot of forwarded records/sec happening, this is a pretty good indication of poor database design - because it specifically tells us that we have heavy I/O operations against tables that are heaps. (Perhaps the table really should have a clustered index?) Conversely, in situations where we have a lot of forwarded records, we might not ever see evidence of that in the wait stats. It is possible that these forwarded records might surface in some other way; maybe contributing to various I/O waits or other I/O-related counters. But by simply examining this single PerfMon counter, you can tell right away whether you have tables that need to be redesigned.
Another metric to mention specifically from the table above, which you will not catch with wait stats, is lock deadlocks/sec. Keep in mind that there are three things that naturally terminate a transaction in SQL server: COMMITs, ROLLBACKs, and deadlocks. Consequently, it is important to know if there are a lot of deadlocks happening on a system and if they are being handled properly. If there are, again, this is probably an indication that there are some opportunities for code optimization.
Correlating PerfMon with Trace (SQL Profiler)
This section shows you how to use PerfMon and Profiler jointly to find the root cause of poor performance in a SQL Server application. We do this, after using wait statistics analysis to find the generalize performance issue, by correlating PerfMon’s resource consumption information with Profiler's specific event information.
What vexes you?
I’ve always enjoyed my grandparents’ use of old-timey words and phrases. When my great-uncle wanted to know what was troubling me, he’d always ask “What vexes you?”
When it comes to database problems, Gartner Research has already answered that question for us. Their research shows that about 80 percent of database performance problems comes from poorly or improperly coded SQL statements, 15 percent from database design, and 5 percent from poor or inadequate hardware configurations.
Contrary to Gartner’s wisdom, lots of times the first remedies applied are exactly the opposite of what research shows us is the source of the problem. We’ve all heard product managers who, when encountering a performance problem with their applications, declare, “Let’s fill up all those RAM slots or another CPU!” But if you’ve been around for a while, you know that a poorly coded SQL Server application can suck up all the hardware you give it, and still want more. When adding hardware doesn’t work, the product manager usually says, “Let’s try to throw a few more indexes in there. That ought to fix it!” But more indexes can’t help a query or transaction that doesn’t utilize the indexes. By implementing a few simple processes to check your code using PerfMon and Profiler, you can help ensure that all the SQL and T-SQL code used in your application performs optimally.
Getting started with PerfMon and Profiler
To get the most useful information about your SQL Server’s behavior using native tools, you need to combine the use of both PerfMon and Profiler (server-side traces are an even better alternative to Profiler). That’s because of the fundamental nature of each tool.
PerfMon is great at showing the overall resource consumption of most every area of Windows and of SQL Server. Profiler and traces, on the other hand, shows you specific events as they occur (or fire) inside the SQL Server. By combining the output of both tools, we get a great view of overall resource consumption and the individual SQL statements or processes causing that resource consumption.
When resource consumption gets unacceptably high, say CPU or the I/O subsystem, we can tell exactly what statement, batch, or job caused it. That way, we know exactly where to focus our attention when rewriting or tuning a particular SQL Server activity. There are definitely limitations and shortcomings to this approach, such as the fact that these tools are really too overhead-intensive to run around the clock. But when it comes to native tools, they're your best option for truly deep diagnostics.
Best Practices using PerfMon
Methods for invoking and using PerfMon are well documented in the Microsoft help system, so I won’t cover all of the basics of PerfMon here. Just in case you’ve never seen or used PerfMon before, Figure 2 (below) shows you what it looks like in action. However, keep some best practices in mind:
Figure 2: Windows Performance Monitor in Action
- Always start with the Big Four PerfMon categories - Memory, CPU, I/O, and Network.
- Don’t wing it when figuring out which PerfMon counters to use with SQL Server. Use the accumulated decades of experience of other SQL Server experts (like myself, Jimmy May, Brent Ozar, and the SQL Server CSS team) to guide you in choosing from the literally thousands of PerfMon counters that apply to SQL Server.
- To reduce overhead, monitor only the counters that you need. The more you monitor, the more the overhead. When in doubt, start monitoring the Big Four and then add in other PerfMon counters during subsequent troubleshooting scenarios.
- If needed, reduce PerfMon overhead by lengthening the polling frequency of your logging activity from 15 seconds to 30 seconds (or more) when the system is under strain.
- The longer the polling interval, the smaller the log file. You can also cap the size of the log file, if disk space is scarce, by setting the file type to circular file and giving it a specific file size. When the log file reaches its maximum size, it continues to record PerfMon information, overwriting data at the start of the circular file.
- Unless you’re really good at SELECT queries, resist the temptation of logging PerfMon data directly to a SQL Server table rather than to a flat file. The flat file technique is simple and straightforward. The flat file is also the only way to correlate PerfMon and Profiler performance information. (However, go ahead and log directly to a database table when you want to do extensive analysis, data manipulation, or automation using SQL).
- Unless there’s an emergency, don’t use PerfMon directly on the computer you intend to monitor. Instead, run PerfMon on your personal workstation and monitor the target SQL Server over the wire.
- PerfMon stops recording data when you log off. Instead of logging off, minimize PerfMon and lock the workstation. Alternately, you can run PerfMon as a service or set it to run at a scheduled time.
- Use the System Monitor view to see what is happening in real time, but remember that logging is an activity separate from the visual charting activity, which you must stop and start yourself.
Best practices using Profiler
SQL Profiler is also well documented in the SQL Server Books Online. It is based on recording an entry every time a specific event fires, SQL Profiler tracing is useful for determining the granular, specific operations of a SQL Server database over a specific chronological period. (The dark side of Profiler is that it has fairly high overhead, which increases as the overall workload of the monitored SQL Server increases). Figure 3 shows an instance Profiler that is running a trace in the background, while the user is about to start a new trace.
Figure 3: SQL Profiler when Starting a New Trace
Many of the best practices that applied to PerfMon also apply to Profiler. For example, the less you monitor with Profiler, the less overhead it will consume on the monitored server. Therefore, to further the example, you’ll probably want to create filters on your Profiler trace to restrict tracing to the production databases that you’re concerned about (plus tempdb). Here are a few SQL Profiler best practices to consider:
- Reuse the existing Profiler templates to quickly get a trace up and running, choosing the one that sounds closest to your particular needs. For example, the Standard template is a good place to start, but you might use the TSQL-Locks template if wait statistics lead you to think locking is the main issue.
- Reduce overhead by filtering your trace events. For example, you can track only those events fired by a specific application, user, or within a specific databases. If you filter by database, be sure to capture tempdb as well as any production database you monitor.
- Create your own traces by modifying the templates (or by starting with a blank trace template) and selecting the events you want to trace. However, always capture the start time and end time of the events. You won't be able to correlate the results with PerfMon without the start time and end time.
- When analyzing Profiler trace data, search the results by using the Find command to quickly locate the command you’re searching for or group by Client-ProcessID, DatabaseName, or ApplicationName to quickly find what you’re looking for.
- Reduce overhead on the monitored server by recording the Profiler log file on a workstation or server other than the one you’re monitoring. The trace file, by default, will be stored in the Windows temp directory.
- Record Profiler data to a SQL Server table only after you are experienced with the tool. Recording to a flat file is so simple and direct that you don’t need the advanced techniques like those used for most troubleshooting or performance monitoring.
- You can invoke Profiler’s functionality through a server-side trace, which is a stored procedure-based method of doing the same thing that Profiler does. Although you can certainly learn all of the server-side trace stored procedures yourself, I find it’s easiest to configure a trace using the SQL Profiler GUI and then selecting File ==> Export ==> Script Trace Definite ==> For SQL Server 20xx. Voila! You’ve got all the stored procedures (including filters and special configurations) scripted for you.
A correlated view of performance
So far, I’ve briefly described how to invoke SQL Profiler and Windows PerfMon, and presented a handful of best practices so that you can monitor SQL Server performance most effectively. So how do you go about pulling both sets of data, the granular and event-driven data of Profiler along with the overall resource consumption data of PerfMon, into a single view?
Here are simple step-by-step instructions for bringing the two data sets together into one cohesive view:
- Start Windows PerfMon and record to a file.
- Start SQL Profiler and record to a file.
- When you’re ready for analysis, halt the recording to both files.
- Open a new session of SQL Profiler and select File ==> Open ==> Trace File ==> my_profiler_file.trc.
- In SQL Profiler, select File ==> Import Performance Data ==> my_perfmon_ file.blg.
- Select the PerfMon counters that you want to correlate against Profiler trace events. (The system may warn you if you specify a boatload of PerfMon counters.)
Figure 4: Correlated PerfMon and Profiler Information
At this point, you should have a magical overlay showing the SQL Server Profiler event data in the top pane and the Windows PerfMon resource consumption data in the middle pane as shown in Figure 4 above. And because the data for both PerfMon and Profiler are correlated, if you click on a Profiler entry on the top pane, the red marker bar in the PerfMon window jumps to the point in time for that Profiler entry. Conversely, if you click on a particular point on the PerfMon chart (say, just before disk I/O activity begins to crest upwards), the Profiler event in the top pane is highlighted. Finally, the exact SQL or T-SQL statement issued by that Profiler event is shown in the bottom pane.
This three-part correlation of data provides a powerful means of monitoring performance and troubleshooting problems: the middle-pane graph of PerfMon-derived resource consumption enables you to see one (or more) bottlenecks; the top-pane table of Profiler-derived events enables you to see the exact batch (or batches) that precipitated the bottleneck; and the bottom pane shows you the exact SQL or T-SQL code that needs to be examined, tuned, or replaced.
Although Microsoft has continued to implement new and innovative means of troubleshooting problems and monitoring performance, only SQL Profiler and Windows Performance Monitor are available on every version of SQL Server since SQL Server 7.0. That means when you learn to use these tools for performance troubleshooting, you'll always be able to use them. (As a side note, SQL Profiler is partially deprecated for the relational engine. But I think it'll be around for many years in the future).
To get the maximum value from the two tools, be sure to follow these simple best practices, such as monitoring a problem SQL Server from a different workstation. When you’ve collected logs from both PerfMon and Profiler, you can load both log files into SQL Profiler where you can review the correlated information from both data sets. Application performance in a Microsoft SQL Server environment is directly related to the efficiency of the SQL statements involved.
By reviewing the bottlenecks in SQL Server resources as revealed by PerfMon, and correlating those bottlenecks with Profiler data, you can see exactly which SQL statements have hindered performance. Such statements are obvious tuning targets.
Of course, if you've gotten the idea that this is hard work with lots of limitations, caveats, and shortcomings - YOU ARE RIGHT. And that's part of the reason why the SQL Sentry Performance Advisor product is so popular, because it makes all of this hard work I just describe really easy. If you'd like a demo of Performance Advisor, don't hesitate to let me know. I'll be delighted to get you a free trial license and give you a full demo.
Do you have any added questions? Do you want more detail on a particular topic? If so, then get in touch with me by posting a comment here! Many thanks,