SQL Server Monitoring - Getting It Right
One of the best aspects of my role at SQL Sentry is the daily contact I have with our amazing customers. For every opportunity I have to impart a bit of knowledge about our products or about SQL Server, I experience five opportunities to learn something about a customer's business and their particular needs vis-a-vis monitoring, alerting, and performance troubleshooting.
With that ongoing experience firmly in mind, I have been somewhat reluctant to suggest any universal truths regarding essential aspects of SQL Server monitoring. Perhaps equally intimidating is having read several excellent pieces by other authors on the topic, particularly Mike Fal's (b|t) Hierarchy of Monitoring Needs and Adam Machanic's (b|t) Ten Commandments of SQL Server Monitoring. What more is there to say on the topic?
Interestingly, the second installment in Adam's A Month of Activity Monitoring series actually answers my previous question by changing the perspective from that of a production DBA keeping watch over his or her servers to a tool developer providing a solution. The design principles Adam describes are essential reading for anyone thinking about building a home-grown monitoring solution. They also provide an excellent reality check for folks using someone else's solution. Four design principals that particularly resonate with me are:
- Show Only Interesting (Relevant) Data
- Impact the Server as Little as Possible; Return Data as Quickly as Possible
- Show as Much Data as Possible Without Going Overboard
- Provide a Flexible and Configurable Experience
While I encourage you to read Adam's posts for more detail about these principles, I also urge you to consider them while using tools like SQL Sentry Performance Advisor (PA) and Event Manager (EM). These four design principles (and a few others) are the rockbed of our approach to SQL Server monitoring. Let me provide examples of each principle guiding design decisions in PA and EM:
Show Only Interesting (Relevant) Data
This design principle can engender a lot of discussion, because there is legitimate disagreement on what data is interesting and what data is relevant. Experts often have starkly different views relative to certain metrics, and quite frankly, this is a moving target as both SQL Server internals and the way we deploy and use it changes. Perhaps that is why we are discussing principles rather than rules.
Nonetheless, a tool developer is forced to make choices about what data to collect and how to display it. Mastery of a tool requires the user to understand the developers' choices. The choices we have made are presented throughout SQL Sentry PA and EM, but here are two examples that illustrate our thinking:
SQL Server Wait Stats: Performance Advisor discards non-performance impacting wait types, then groups the rest into classes and categories of "like" waits while still exposing specific wait types and wait times. In Sample Mode, PA also differentiates between Signal (CPU) and Resource waits, and displays the CPU % of Total Waits.
In Top SQL, PA captures and displays only those queries that exceed a minimum duration threshold you specify. In most cases, that's what is interesting and relevant to troubleshooting performance issues - queries that run long. Long-running queries are usually the most resource intensive, and often cause response-time killing long-blocking events. In the relatively rare situations where this is not the case, PA provides user-controllable settings and additional data (query stats and procedure stats, query plans, index usage) to root out tougher query-related issues.
Impact the Server as Little as Possible
Ensuring a low overhead out-of-the-box experience is a driving design principle for our entire suite. A monitoring solution that imposes a significant load on a monitored server becomes part of the problem instead of a solution. Our Overhead Study lets our customers know exactly what kind of load to expect from our software.
Return Data as Quickly as Possible
The SQL Sentry repository has been designed to store the data we collect as efficiently as possible. That helps ensure data collected from hundreds of monitored instances is stored quickly and can be retrieved by the client quickly. Our roll-up and purge process serves the same purpose for aging data.
Show as Much Data as Possible Without Going Overboard
Admittedly, what constitutes "going overboard" is a judgement call, but when combined with design principle 1 above, a reasonable balance can be found. Performance Advisor's tabbed interface strikes that balance by segmenting the presentation of data into logical contexts.
The Disk Activity, Disk Space, Top SQL, Query Plans, Blocking SQL, and Deadlocks tabs each provide a detailed view of a particular aspect of a SQL Server instance's behavior synchronized across tabs by time. Comparing those details against the Dashboard's performance metrics and Event Manager's Calendar provides DBAs with the relevant information needed to troubleshoot the vast majority of performance issues.
As another example of Performance Advisor's adherence to this design principle, the Dashboard displays all the essential information needed to provide a single view of key resource utilization and performance metrics. The point-in-time snapshot of sample mode displays more detail (and more metrics) than the trend oriented history mode.
Similarly, each PA grid starts with a default number of columns that display the essential data necessary to facilitate performance analysis and troubleshooting. The Column Chooser provides users with the ability to add and remove columns to suit individual preferences and situational needs. Sorting, filtering, and grouping facilitate presenting the maximum amount of digestible data.
Additionally, this design principle is implemented consistently in the alerting message system. Regardless of the delivery mechanism (email, actions log, dashboard alerts, etc.), messages generated by alerts provide all the information necessary to allow DBAs to understand clearly the source and purpose of an alert:
- What Condition was met?
- What Threshold was crossed, and what was the measured value?
- What Resource(s) were involved?
- When did it happen and how long did it last?
- What action should be taken (optionally available when using Notes)?
Provide a Flexible and Configurable Experience
This design principle is met in part by some of the items discussed in item 3 above. Data collection granularity and frequency, data retention, alerting thresholds and frequency, user permissions, a host of user preferences, deployment architecture and much, much more can be configured to meet most everyone's preferences and requirements. Yet, even more is available.
Conditions, Actions and Settings can be configured for sites, groups, computers, SQL Server instances, databases, indexes, jobs, tasks, queries, and more. Data can be viewed from the rich Windows client, through Reporting Services, or in a web browser. The truly ambitious can mine the SQL Sentry database directly.
Whether you use our products, SQL Server native tools, or build your own solution, I hope this article will spark some thought and discussion about your approach to SQL Server monitoring. Leave a comment below and let me know what you think!
Justin (@jh_randall) leads our Professional Services Organization with the perspective of nearly 30 years of IT experience. Professional Services provides customers with mentoring, consulting and training services focused on configuring and tuning SentryOne software, and SQL Server performance troubleshooting and optimization. Justin's blog focuses on helping our users become highly valued DBAs who use our software to prevent and solve real-world SQL Server problems while making positive contributions to their organization's success.