SentryOne Tips & Tricks: Common Alert Tuning Examples
I created this post to build off the topics discussed in my previous blog in the Tips & Tricks series, “SQL Server Alert Tuning Basics with SentryOne.” We will use the previously discussed alerting features to cover common use cases I run into when working with customers during alert tuning sessions. Alert tuning is always a balancing act. On the one hand, you do not want to be spammed; on the other, you do not want to miss important events. The use cases discussed here are a good starting point on the journey of alert tuning but should also serve as examples of how to handle other situations. Please note, if you ever have issues setting up alerting, reach out to your SentryOne Customer Success Manager or SentryOne Support.
Disabling Alerts for a Subset of SQL Server Agent Jobs
In many environments, I find that there are subsets of jobs that are not meaningful to be alerted on, or that being alerted on every single failure is more distracting than helpful. So, let’s go over some ways we can treat these subsets of jobs differently.
Let’s say we have a handful of jobs that we do not care about being alerted on when they fail. You could drill down to each job in the Navigator pane > Select > Disable the Condition(s) in question at that level.
Disabling job failure alert at a job level
However, per job is not a scalable option if we are dealing with a larger subset of jobs. So, this is where Object Groups can be helpful. We can create an Object Group and put all the Jobs in question into this group.
View of Object Groups
Now that the jobs in question are in the Object Group, we can add and disable the Job Failure Condition at the Object Group level by selecting the Object Group in the Navigator Pane > open the Conditions Pane > add the “SQL Server Agent Job: Failure”Condition > select the Behavior column > Disabled. As discussed in the previous post, Object Groups live outside the Topology. As such, you don’t technically “override” at the Object Group level. Instead, you explicitly define what you like at the Object Group level, and this will be honored before going through the Topology hierarchy. Whatever is not explicitly set at an Object Group level will continue to honor the hierarchy as logically expected.
Applying and disabling a condition at an Object Group level
Divergently Alert on Similar Agent Jobs
Now Object Groups are great for something like this. However, they are static by nature, meaning as you add new jobs to the environment and you wish them to be treated the same way, they must manually be put into the Object Group to begin honoring those settings. Because of this, sometimes creating and modifying additional versions of a condition at a desired level can be more beneficial, as it can inherently be dynamic. This is especially true whenever you can leverage a job property, such as category or job name, to broadly treat those jobs differently.
As an example of this, let's look at transaction log backup jobs. Often, users do not want to hear about every single transaction log backup failure. These can happen intermittently for many reasons and is usually not an issue unless multiple failures occur consecutively. However, this is also not a good use case for broadly disabling job failure alerts. A better option is to implement a Ruleset that adds a buffer that must be met before the alert fires. If our transaction log backups run every five minutes, it may be better only to alert when there have been three failures within 20 minutes.
To do this we will set up two “SQL Server Agent Job: Failure” Conditions: one that ignores transaction log backups, and another that only alerts on transaction log backups and applies a Ruleset. This is possible because even though these appear to be the same condition and have the same action, each row in the Conditions pane is its own unique and independent object (condition aliasing to come).
Showcasing two versions of the same condition/action, each with a different filter
In the screenshot above, you can see two versions of the same condition, with different filters. Now, if we leave it here, we effectively have two versions of the same alert, but still get alerted on any job failure. We now need to apply a Ruleset to the version specifically filtered for transaction log backups to enable a buffer to avoid alerting on any single failure.
Select the version filtered for log backups > select Rulesets > New. Again, our log backups run every five minutes, and we only want to be alerted when there have been at least three failures within 20 minutes.
Example of a Ruleset
Now we have not only reduced general noise for this condition but when it triggers for a log backup, we know it is a meaningful alert that should be investigated quickly.
Incidentally, this same method is a good way to implement alert escalation.
SQL Server Agent Job Runtime Max Conditions
SentryOne monitors the average runtime for any given object, including Jobs. These averages are used to alert on long-running jobs through the “SQL Server Agent Job: Runtime Threshold Max” Condition. Default behavior is to alert if any job runs for 250% of its average. Overall percent-based alerting is a great way to be alerted on Jobs running longer than they normally do. That said, there will be scenarios or outliers where default settings are not ideal. For one, it could be that 250% (or 2.5 times the average) is too high, and you would like to see it closer to 150%-200%. Additionally, alerts triggered by percent of average may not work well for objects that have inconsistent runtime. So, what can you do?
First, let’s talk about the threshold settings for the “SQL Server Agent Job: Runtime Threshold Max” Condition. The settings for this can be found in the Settings Pane > SQL Server > SQL Server Agent Job.
SQL Server Agent Job Runtime settings
As you can see in the screenshot above, with the explicit threshold zeroed out, the 250% threshold is being honored. You could navigate to and select any specific job and adjust these settings at individual job levels, but this does not scale. This is another good use case for Object Groups. Once the jobs in question have been added to an Object Group, we can navigate to the “SQL Server Agent Job” settings and specify an explicit threshold that makes more sense for that group of jobs.
SQL Server: Top SQL Duration Thresholds
By default, the duration threshold is explicitly an hour for Top SQL events. For some environments, this is just fine. For others, not so much. What if a query that normally takes five minutes ends up running for 20 minutes? You would not be notified of this with the default. In some environments, it makes sense to switch the default setting to leverage percent-based alerting, giving you more insight into whether queries are running longer than they normally do by leveraging their averages. This can be done through the Settings Pane > SQL Server > Top SQL > zero out explicit threshold.
Top SQL Duration Settings
Just as explained by using percent-based alerting for Agent jobs, you will likely have some outliers where this does not work well for alerting due to sporadic durations. While not as easy to adjust at an object level as a job, you do have some options for adjusting thresholds at the query level. To do so, find an occurrence of the query in question via Top SQL > Right-Click > Jump To > Runtime Stats.
Jumping to Runtime stats from Top SQL
This will open the Runtime Stats for the query in question, where you have another visualization for execution durations over time. In the grid, you will also see an option to explicitly override Max Duration settings at the query level. You can check the box for the explicit threshold and define the exact duration that you would want to be notified at for the query in question.
Overriding duration thresholds at the query level in Top SQL Runtime stats
SQL Server Blocking Conditions
By default, SentryOne will collect any blocking over 15 seconds. On the alerting side, the default blocking condition is “SQL Server: Blocking SQL.” The default behavior of this condition is to alert as soon as we collect any blocking at the 15-second threshold. However, I often find many users prefer not to get alerted at the collection threshold of 15 seconds, but some secondary threshold like a minute. One way to address this is to use the non-default condition "SQL Server: Blocking SQL: Duration Threshold Max.” This condition has its own threshold setting, just like Agent jobs and Top SQL.
Blocking SQL Duration threshold for alerting
If you replace the default “SQL Server: Blocking SQL” Condition for the “SQL Server: Blocking SQL: Duration Threshold Max” Condition, then we would continue to collect blocking at 15+ seconds, but not alert on it until the 1-minute threshold has been exceeded.
Another option is to have both but use the Condition Settings for the “SQL Server: Blocking SQL" Condition to filter for a specific situation where you do need to know about blocking at the collection threshold. This could be based on text data, application, database, etc.
Condition Setting Example
SQL Server Deadlocks
A common issue I see is users getting spammed by deadlock events that are created by third-party applications (*cough* SharePoint). In these cases, there is rarely anything that can be done on your side of things to resolve the issue. In this case, it does not always make sense to get hundreds of emails a day about an event you have no control over. One option is to simply filter those events out through Conditions Settings.
However, some users do not want to be spammed in this situation, but they are also apprehensive to filter them from alerting completely. To address this, we set up two “SQL Server: Deadlock” Conditions. Have one filter out the application in question, and then one filter only for the application in question.
Condition Settings for two different Deadlock Conditions
Then we set a Ruleset for the version configured to only alert on the application in question. Implementing the Ruleset applied logic to only alert if there were X number of deadlocks within Y amount of time for just the filtered application. This allowed them to still be alerted in the event where deadlocking would begin to occur in volume, without all the noise of intermittent deadlocking.
The default Ruleset for most conditions is Notify Every Time. The behavior of these conditions is to alert you immediately, but it requires a state change to re-alert you. For example, the “SQL Server: Offline” Condition. By default, SentryOne will alert you once we see a SQL Server instance is no longer up and running. However, with the “Notify Every Time” Ruleset, you would not be alerted to this again until we evaluated that it was back online and then went offline again. To ensure you continue to get alerted at some cadence as long as a particular state is active, you can use a Time Based Ruleset. You set ‘Process Actions After’ to zero if you wish to be alerted immediately. (Note: for environments with intermittent network issues, we recommend 30 seconds to eliminate false positives for Offline alerts.) Set the Subsequent actions to your desired cadence. Setting “for up to” to zero will cause the cadence to continue indefinitely until the condition in question is no longer true or active.
Time Based Rules configured to alert immediately and every hour, for as long as the condition is true, and for up to a day
On Call Groups
Ever been awakened by your phone blowing up due to issues occurring even when you are not on call? You can make that a thing of the past by configuring an On-Call Contact Group in SentryOne. Look at Scott Fallen’s blog post, “Alerting on a Call Schedule with SQL Sentry,” for more details.
Automation is more and more becoming a requirement rather than a buzz word or wishful thinking. This is no different in the context of SQL Server monitoring. If you have a scenario where a fix, update, or even just logging additional data can be automatically deployed, SentryOne can help. Check out Richard Douglas’s blog post, “A Million Little Things: Remediation,” on some of the automated response actions SentryOne offers.
Alert tuning is a critical step in transitioning from a reactive stance to a proactive stance in monitoring your environment. Hopefully, through reading this, you have identified some ways to tune alerts in your system.
While this blog post focuses on event-based alerts, keep an eye out for my next blog post, where I will discuss tuning Advisory Conditions.
Patrick is a Customer Success Engineering Manager and helps to provide customers with support services, troubleshooting, and defect resolution. Patrick also does some side work with SentryOne’s Professional Services team, who provide training and consulting. Patrick’s blog focus is on helping SQL Sentry users get the most out of our products. His hope is to ensure our users are as knowledgeable and comfortable as possible with SQL Sentry products, so they have as much ammo as possible to solve real world SQL Server problems.