SentryOne Tips & Tricks: Common Alert Tuning Examples

Patrick Kelley

Published On: February 5, 2020

Categories: Alerting, Tips and Tricks

I created this post to build on the topics discussed in my previous blog in the Tips & Tricks series, “SQL Server Alert Tuning Basics with SentryOne.” We will use the previously discussed alerting features to cover common use cases I run into when working with customers during alert tuning sessions. Alert tuning is always a balancing act: on the one hand, you do not want to be spammed; on the other, you do not want to miss important events. The use cases discussed here are a good starting point on the journey of alert tuning, but they should also serve as examples of how to handle other situations. Please note, if you ever have issues setting up alerting, reach out to your SentryOne Customer Success Manager or SentryOne Support.

Disabling Alerts for a Subset of SQL Server Agent Jobs 

In many environments, I find that there are subsets of jobs for which alerts are not meaningful, or for which being alerted on every single failure is more distracting than helpful. So, let’s go over some ways we can treat these subsets of jobs differently.

Let’s say we have a handful of jobs whose failures we do not care to be alerted on. You could drill down to each job in the Navigator pane > select the job > disable the Condition(s) in question at that level.

Disabling job failure alert at a job level

However, going job by job is not a scalable option if we are dealing with a larger subset of jobs. This is where Object Groups can be helpful. We can create an Object Group and put all the jobs in question into this group.

View of Object Groups

Now that the jobs in question are in the Object Group, we can add and disable the Job Failure Condition at the Object Group level: select the Object Group in the Navigator pane > open the Conditions pane > add the SQL Server Agent Job: Failure Condition > select the Behavior column > Disabled. As discussed in the previous post, Object Groups live outside the Topology. As such, you don’t technically “override” at the Object Group level. Instead, you explicitly define what you would like at the Object Group level, and this will be honored before going through the Topology hierarchy. Whatever is not explicitly set at the Object Group level will continue to honor the hierarchy as logically expected.

Applying and disabling a condition at an Object Group level

Divergently Alert on Similar Agent Jobs 

Object Groups are great for something like this. However, they are static by nature: as you add new jobs to the environment that you wish to be treated the same way, they must be put into the Object Group manually before they honor those settings. Because of this, creating and modifying additional versions of a condition at the desired level can sometimes be more beneficial, as it can be inherently dynamic. This is especially true whenever you can leverage a job property, such as category or job name, to broadly treat those jobs differently.

As an example, let's look at transaction log backup jobs. Often, users do not want to hear about every single transaction log backup failure. These can happen intermittently for many reasons and are usually not an issue unless multiple failures occur consecutively. However, this is also not a good use case for broadly disabling job failure alerts. A better option is to implement a Ruleset that adds a buffer that must be met before the alert fires. If our transaction log backups run every five minutes, it may be better to alert only when there have been three failures within 20 minutes.

To do this we will set up two SQL Server Agent Job: Failure Conditions: one that ignores transaction log backups, and another that only alerts on transaction log backups and applies a Ruleset. This is possible because even though these appear to be the same condition and have the same action, each row in the Conditions pane is its own unique and independent object (condition aliasing to come). 
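Conceptually, the Ruleset buffer behaves like a sliding window over failure timestamps. This is a minimal Python sketch of the idea behind “alert only after three failures within 20 minutes” (illustrative only, not SentryOne code; the class name and defaults are assumptions):

```python
from collections import deque
from datetime import datetime, timedelta

class FailureBuffer:
    """Illustrative sliding-window buffer: fire an alert only once
    `threshold` failures have occurred within `window`."""

    def __init__(self, threshold=3, window=timedelta(minutes=20)):
        self.threshold = threshold
        self.window = window
        self.failures = deque()  # timestamps of recent failures

    def record_failure(self, when):
        """Record a job failure; return True if the alert should fire."""
        self.failures.append(when)
        # Discard failures that have aged out of the window.
        while self.failures and when - self.failures[0] > self.window:
            self.failures.popleft()
        return len(self.failures) >= self.threshold

# Log backups failing every five minutes: only the third failure alerts.
buf = FailureBuffer()
start = datetime(2020, 2, 5, 8, 0)
results = [buf.record_failure(start + timedelta(minutes=5 * i)) for i in range(3)]
```

A single intermittent failure never fires, while a sustained run of failures does, which is exactly the buffering behavior the Ruleset provides.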

Showcasing two versions of the same condition/action, each with a different filter

In the screenshot above, you can see two versions of the same condition, with different filters. Now, if we leave it here, we effectively have two versions of the same alert, but still get alerted on any job failure. We now need to apply a Ruleset to the version specifically filtered for transaction log backups to enable a buffer to avoid alerting on any single failure.  

Select the version filtered for log backups > select Rulesets > New. Again, our log backups run every five minutes, and we only want to be alerted when there have been at least three failures within 20 minutes.  

Example of a Ruleset

Now we have not only reduced general noise for this condition, but when it triggers for a log backup, we know it is a meaningful alert that should be investigated quickly.

Incidentally, this same method is a good way to implement alert escalation. 

SQL Server Agent Job Runtime Max Conditions 

SentryOne monitors the average runtime for any given object, including jobs. These averages are used to alert on long-running jobs through the SQL Server Agent Job: Runtime Threshold Max Condition. The default behavior is to alert if any job runs for 250% of its average. Overall, percent-based alerting is a great way to be alerted on jobs running longer than they normally do. That said, there will be scenarios or outliers where the default settings are not ideal. For one, it could be that 250% (or 2.5 times the average) is too high, and you would like to see it closer to 150%-200%. Additionally, alerts triggered by percent of average may not work well for objects that have inconsistent runtimes. So, what can you do?

First, let’s talk about the threshold settings for the SQL Server Agent Job: Runtime Threshold Max Condition. The settings for this can be found in the Settings Pane > SQL Server > SQL Server Agent Job. 

SQL Server Agent Job Runtime settings

As you can see in the screenshot above, with the explicit threshold zeroed out, the 250% threshold is being honored. You could navigate to and select any specific job and adjust these settings at individual job levels, but this does not scale. This is another good use case for Object Groups. Once the jobs in question have been added to an Object Group, we can navigate to the “SQL Server Agent Job” settings and specify an explicit threshold that makes more sense for that group of jobs. 
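The fallback behavior can be pictured as a simple rule: a non-zero explicit threshold wins, and a zeroed-out explicit threshold defers to the percent-of-average check. A minimal Python sketch (illustrative only, not SentryOne's implementation; the function name and parameters are assumptions):

```python
def runtime_exceeds_threshold(runtime_s, average_s, explicit_s=0, percent=250):
    """Illustrative version of the Runtime Threshold Max logic.

    An explicit threshold (in seconds) is honored when set; with the
    explicit threshold zeroed out, fall back to a percent of the job's
    average runtime (default 250%, i.e., 2.5x the average).
    """
    if explicit_s > 0:
        return runtime_s > explicit_s
    return runtime_s > average_s * (percent / 100.0)
```

For a job averaging 50 seconds, the default fires past 125 seconds; setting `percent=150` for an Object Group of sensitive jobs would fire past 75 seconds instead.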

SQL Server: Top SQL Duration Thresholds 

By default, the duration threshold is explicitly an hour for Top SQL events. For some environments, this is just fine. For others, not so much. What if a query that normally takes five minutes ends up running for 20 minutes? You would not be notified of this with the default. In some environments, it makes sense to switch the default setting to leverage percent-based alerting, giving you more insight into whether queries are running longer than they normally do by leveraging their averages. This can be done through the Settings pane > SQL Server > Top SQL > zero out the explicit threshold.

Top SQL Duration Settings

As with percent-based alerting for Agent jobs, you will likely have some outliers where this does not work well due to sporadic durations. While a query is not as easy to adjust at an object level as a job, you do have options for adjusting thresholds at the query level. To do so, find an occurrence of the query in question via Top SQL > right-click > Jump To > Runtime Stats.

Jumping to Runtime stats from Top SQL

This will open the Runtime Stats for the query in question, where you have another visualization for execution durations over time. In the grid, you will also see an option to explicitly override Max Duration settings at the query level. You can check the box for the explicit threshold and define the exact duration that you would want to be notified at for the query in question.  

Overriding duration thresholds at the query level in Top SQL Runtime stats

SQL Server Blocking Conditions 

By default, SentryOne will collect any blocking over 15 seconds. On the alerting side, the default blocking condition is “SQL Server: Blocking SQL.” The default behavior of this condition is to alert as soon as we collect any blocking at the 15-second threshold. However, I often find many users prefer not to be alerted at the 15-second collection threshold, but at some secondary threshold, such as one minute. One way to address this is to use the non-default condition “SQL Server: Blocking SQL: Duration Threshold Max.” This condition has its own threshold setting, just like Agent jobs and Top SQL.

Blocking SQL Duration threshold for alerting

If you replace the default “SQL Server: Blocking SQL” Condition with the “SQL Server: Blocking SQL: Duration Threshold Max” Condition, then we would continue to collect blocking at 15+ seconds but not alert on it until the 1-minute threshold has been exceeded.
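The distinction between the collection threshold and a separate alerting threshold can be sketched like this (illustrative Python, not product code; the function name and defaults are assumptions):

```python
def blocking_actions(duration_s, collect_at_s=15, alert_at_s=60):
    """Illustrative split between the 15-second collection threshold and
    a separate Duration Threshold Max for alerting (e.g., 60 seconds).

    Returns (collected, alerted) for a blocking event of the given
    duration in seconds.
    """
    collected = duration_s >= collect_at_s  # always captured for analysis
    alerted = duration_s >= alert_at_s      # only noisy past the max threshold
    return collected, alerted
```

A 30-second block is still collected for later review in the blocking views, but no one gets paged until blocking crosses the one-minute mark.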

Another option is to have both but use the Condition Settings for the “SQL Server: Blocking SQL” Condition to filter for a specific situation where you do need to know about blocking at the collection threshold. This could be based on text data, application, database, etc.

Condition Setting Example

SQL Server Deadlocks 

A common issue I see is users getting spammed by deadlock events that are created by third-party applications (*cough* SharePoint). In these cases, there is rarely anything you can do on your side to resolve the issue, and it does not always make sense to get hundreds of emails a day about an event you have no control over. One option is to simply filter those events out through Condition Settings.

However, some users do not want to be spammed in this situation but are also apprehensive about filtering these events out of alerting completely. To address this, we set up two SQL Server: Deadlock Conditions: one that filters out the application in question, and one that filters only for the application in question.

Condition Settings for two different Deadlock Conditions

Then we set a Ruleset for the version configured to alert only on the application in question. The Ruleset applies logic to alert only if there were X number of deadlocks within Y amount of time for just the filtered application. This allows you to still be alerted when deadlocking begins to occur in volume, without all the noise of intermittent deadlocking.

Ruleset Example

Quick Tricks

Reminder Emails 

The default Ruleset for most conditions is Notify Every Time. The behavior of these conditions is to alert you immediately, but a state change is required to re-alert you. Take the “SQL Server: Offline” Condition as an example. By default, SentryOne will alert you once we see a SQL Server instance is no longer up and running. However, with the “Notify Every Time” Ruleset, you would not be alerted again until we evaluated that the instance was back online and then went offline again. To ensure you continue to be alerted at some cadence as long as a particular state is active, you can use a Time Based Ruleset. Set ‘Process Actions After’ to zero if you wish to be alerted immediately. (Note: for environments with intermittent network issues, we recommend 30 seconds to eliminate false positives for Offline alerts.) Set the subsequent actions to your desired cadence. Setting “for up to” to zero will cause the cadence to continue indefinitely until the condition in question is no longer true or active.

Time Based Rules configured to alert immediately and every hour, for as long as the condition is true, and for up to a day
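The notification schedule such a Time Based Ruleset produces can be sketched as follows (an illustrative Python function, not SentryOne code; names and defaults are assumptions matching the screenshot's configuration):

```python
from datetime import timedelta

def reminder_times(first_after=timedelta(0),
                   every=timedelta(hours=1),
                   for_up_to=timedelta(days=1)):
    """Illustrative Time Based Ruleset schedule: first notification after
    `first_after`, repeating at interval `every`, until `for_up_to` has
    elapsed. (Setting "for up to" to zero in the product would instead
    repeat indefinitely while the condition stays active.)"""
    times, t = [], first_after
    while t <= for_up_to:
        times.append(t)
        t += every
    return times

# Immediate alert, then hourly reminders for up to a day: 25 notifications.
schedule = reminder_times()
```

As long as the instance stays offline, you keep hearing about it on this cadence rather than receiving a single alert at the initial state change.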

On Call Groups 

Ever been awakened by your phone blowing up due to issues occurring even when you are not on call? You can make that a thing of the past by configuring an On-Call Contact Group in SentryOne. Look at Scott Fallen’s blog post, “Alerting on a Call Schedule with SQL Sentry,” for more details.

Automated Response 

Automation is increasingly becoming a requirement rather than a buzzword or wishful thinking. This is no different in the context of SQL Server monitoring. If you have a scenario where a fix, an update, or even just logging additional data can be deployed automatically, SentryOne can help. Check out Richard Douglas’s blog post, “A Million Little Things: Remediation,” for some of the automated response actions SentryOne offers.

Conclusion 

Alert tuning is a critical step in transitioning from a reactive stance to a proactive stance in monitoring your environment. Hopefully, through reading this, you have identified some ways to tune alerts in your system. While this blog post focuses on event-based alerts, keep an eye out for my next blog post, where I will discuss tuning Advisory Conditions.  

 

Patrick is a Customer Success Engineering Manager and helps to provide customers with support services, troubleshooting, and defect resolution. Patrick also does some side work with SentryOne’s Professional Services team, which provides training and consulting. Patrick’s blog focus is on helping SQL Sentry users get the most out of our products. His hope is to ensure our users are as knowledgeable and comfortable as possible with SQL Sentry products, so they have as much ammo as possible to solve real-world SQL Server problems.
