SentryOne Tips & Tricks: Tuning Advisory Conditions
How do you find that perfect balance? Start with a data-driven approach.
In this blog, we’ll discuss tuning Advisory Conditions. You may find that the default SentryOne Advisory Conditions are sensitive, but this is by design. By setting the defaults a bit toward the sensitive side, these alerts build up data that you can use to drive your alert-tuning strategy. Let’s go through where that data is and how to use it.
As we begin, there are three main questions to consider.
- For the occurrence being triggered, how long is it active?
- When are these occurrences triggering?
- What are the values ultimately triggering these occurrences?
Where to Start
The Environmental Health Overview on the Start Page is a good place to start, as it can quickly show you what conditions have high counts. Set the sample range to the max 24 hours and sort the results by Count descending.
Using the Environmental Health Score to identify Advisory Conditions with high counts
This will surface the heaviest noise generators, giving you a place to start.
In larger environments, it may be better to query the SentryOne database for a broader overview. The queries below will help with this. Note that there are two groups: one for Advisory Conditions sending emails, and another for Advisory Conditions set to Send to Alerting Channels (the bulk of what shows up on the Environmental Health Score). Additionally, I recommend narrowing the time range to just the last 30-90 days. The default alert retention is a year, so you want to make sure you’re looking at recent data, especially as you go through additional rounds of tuning.
Global Advisory Condition counts – Send to Alerting Channels/Health Score:
/* Global Advisory Condition counts – Send to Alerting Channels/Health Score */
SELECT dcd.Name AS Condition,
       COUNT(*) AS TheCount
FROM DynamicConditionDefinition dcd
JOIN AlertingChannelLog acl
    ON dcd.ID = acl.DynamicConditionID
--WHERE acl.NormalizedStartTimeUtc > '2019-03-25 18:05:40.807'
--  AND ObjectName LIKE '%<ServerName>%'
--  AND dcd.Name LIKE '%<ConditionName>%'
GROUP BY dcd.Name
ORDER BY TheCount DESC;
Global Advisory Condition counts – Send Email:
/* Global Advisory Condition counts – Send Email */
SELECT dcd.Name AS Condition,
       COUNT(*) AS TheCount
FROM DynamicConditionDefinition dcd
JOIN ObjectConditionActionHistory ocah
    ON dcd.ConditionID = ocah.ConditionTypeID
WHERE ActionTypeName = 'Send Email'
--  AND EventStartTime > '2018-05-21 10:49:06.137'
--  AND ObjectName LIKE '%<ServerName>%'
--  AND dcd.Name LIKE '%<ConditionName>%'
GROUP BY dcd.Name
ORDER BY TheCount DESC;
Advisory Condition counts by target – Send to Alerting Channels/Health Score:
/* Advisory Condition counts by target – Send to Alerting Channels/Health Score */
SELECT ObjectName AS Target,
       dcd.Name AS Condition,
       COUNT(*) AS TheCount
FROM DynamicConditionDefinition dcd
JOIN AlertingChannelLog acl
    ON dcd.ID = acl.DynamicConditionID
--WHERE acl.NormalizedStartTimeUtc > '2019-03-25 18:05:40.807'
--  AND ObjectName LIKE '%<ServerName>%'
--  AND dcd.Name LIKE '%<ConditionName>%'
GROUP BY ObjectName, dcd.Name
ORDER BY TheCount DESC;
Advisory Condition counts by target – Send Email:
/* Advisory Condition counts by target – Send Email */
SELECT ObjectName AS Target,
       dcd.Name AS Condition,
       COUNT(*) AS TheCount
FROM DynamicConditionDefinition dcd
JOIN ObjectConditionActionHistory ocah
    ON dcd.ConditionID = ocah.ConditionTypeID
WHERE ActionTypeName = 'Send Email'
--  AND EventStartTime > '2018-05-21 10:49:06.137'
--  AND ObjectName LIKE '%<ServerName>%'
--  AND dcd.Name LIKE '%<ConditionName>%'
GROUP BY ObjectName, dcd.Name
ORDER BY TheCount DESC;
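Rather than editing the commented date literals each time you run these, you can make the recency filter relative. Here is a sketch of the alerting-channel query with a rolling 90-day window, assuming the same AlertingChannelLog schema used above:

```sql
/* Global Advisory Condition counts over a rolling 90-day window */
SELECT dcd.Name AS Condition,
       COUNT(*) AS TheCount
FROM DynamicConditionDefinition dcd
JOIN AlertingChannelLog acl
    ON dcd.ID = acl.DynamicConditionID
WHERE acl.NormalizedStartTimeUtc > DATEADD(DAY, -90, GETUTCDATE())
GROUP BY dcd.Name
ORDER BY TheCount DESC;
```

DATEADD against GETUTCDATE() keeps the query aligned with the UTC timestamps in the log table, so you can rerun it between tuning rounds without touching the filter.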
How to Get to the Data
It's important to tune at the global level before getting too inundated at the more granular site, group, and target levels. This ensures that any new targets have a tuned alerting system applied to them by default. It will also give you better data to work with as you move deeper into the monitoring hierarchy.
Once you've found a condition to start with, open the Global Advisory Conditions List under the All Targets node in the Navigator pane.
Global Conditions List
Once there, you can filter the conditions list by typing into the Name column. Then right-click the condition > Show Event Logs.
Filtering for and opening the Event Logs for a specific Advisory Condition in the Conditions List
Now you should see the historical occurrences for the condition in question.
What to Do with the Data
As I mentioned earlier, three data points I like to look at are how long the occurrences being triggered are active for, when these occurrences are triggering, and what values are triggering these occurrences.
I like to start by analyzing the duration spectrum of the occurrences. This way, you're less likely to obscure issues, because you can adjust how long a condition must be active rather than fiddling with the evaluation threshold.
Start by sorting the Duration column in ascending order. Then slowly scroll down the result set and look at the distribution of durations. The default duration threshold for most Advisory Conditions is around 1 minute. Below, you can see the result set's duration values shifting from 3 to 4 minutes. Looking at the scroll bar, you will see that we are about halfway through the result set. If I increase the duration threshold from 1 minute to 4 minutes, I will effectively cut the overall volume, and the potential noise, for this condition in half.
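If you would rather summarize the duration spectrum in T-SQL than scroll through the grid, a bucketing query along these lines can help. Note that the end-time column name used here (NormalizedEndTimeUtc) is an assumption on my part; verify the actual column name in your AlertingChannelLog table and substitute it before running this:

```sql
/* Duration distribution (in minutes) for one Advisory Condition.
   NormalizedEndTimeUtc is an assumed column name; check your schema. */
SELECT DATEDIFF(MINUTE, acl.NormalizedStartTimeUtc, acl.NormalizedEndTimeUtc) AS DurationMinutes,
       COUNT(*) AS TheCount
FROM DynamicConditionDefinition dcd
JOIN AlertingChannelLog acl
    ON dcd.ID = acl.DynamicConditionID
WHERE dcd.Name = '<ConditionName>'
GROUP BY DATEDIFF(MINUTE, acl.NormalizedStartTimeUtc, acl.NormalizedEndTimeUtc)
ORDER BY DurationMinutes;
```

The output shows how many occurrences fall into each whole-minute bucket, which makes the halfway point of the distribution easy to spot without scrolling.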
Viewing the duration spectrum of a specific Advisory Condition in the Event Logs
Next, I like to scroll to the very bottom to see what my upper bound is. These occurrences tend to be tied to maintenance periods and/or periods of extremely heavy contention. As you go through multiple iterations of this process, it may be helpful to further filter by Start Time, which excludes events from before previous tuning efforts. Do this by hovering over the Start Time column and selecting the funnel icon. You can do the same for the Server Name column when you are ready to review per target.
Filtering by Start Time in the Event Logs
If you ever need to reset the filter, simply close it out in the bottom left. Remove the check in the same area to keep the filter but disable it temporarily.
Removing a filter in the Events Log
When Are the Occurrences?
After tuning by duration, I move on to looking at the time of day. Sort the result set by Start Time to see if there's a particular pattern for when the Advisory Condition in question is triggering. Maintenance periods are common culprits here. There may not be an obvious pattern during the first iteration of alert tuning. This is another reason I like to start with duration, as it reduces the dataset, making a pattern easier to identify.
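To check for a time-of-day pattern without eyeballing the sorted grid, you can bucket occurrences by hour. Keep in mind that NormalizedStartTimeUtc is in UTC, so offset for your local time zone before comparing against your maintenance schedule:

```sql
/* Occurrences per hour of day (UTC) for one Advisory Condition */
SELECT DATEPART(HOUR, acl.NormalizedStartTimeUtc) AS HourUtc,
       COUNT(*) AS TheCount
FROM DynamicConditionDefinition dcd
JOIN AlertingChannelLog acl
    ON dcd.ID = acl.DynamicConditionID
WHERE dcd.Name = '<ConditionName>'
  AND acl.NormalizedStartTimeUtc > DATEADD(DAY, -90, GETUTCDATE())
GROUP BY DATEPART(HOUR, acl.NormalizedStartTimeUtc)
ORDER BY HourUtc;
```

A spike in one or two adjacent hour buckets is a strong hint that a maintenance job, rather than genuine contention, is driving the alerts.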
If you do identify a pattern and/or a time range that you do not benefit from being alerted on, then a Maintenance Window at the condition level can stop you from getting emails during those time periods.
To configure a Maintenance Window, navigate to the Conditions pane > single-click All Targets in the Navigator pane to set the Conditions pane to the Global scope (or desired level of topology) > select Advisory Conditions in the Conditions pane > select the condition in question > Rulesets tab. Then select ‘New’ under the Window section to create a time frame in which you do not want to be alerted. Finally, make sure you set Window Behavior to “Don’t perform actions that occur within the window.”
Applying a Window to an Advisory Condition
Rulesets and Windows are not yet supported for the “Send to Alerting Channels” action. If you're not getting these options, that could be why. Make sure to choose a sample period that makes sense when using the Environmental Health Score. I’m personally a fan of the 3-hour view, which gives you a good range that is also narrow enough to ensure relevance.
Sample Range options for the Environmental Health Score
What Values Are Triggering Occurrences?
Now that you’ve gone through a cycle or two of tuning based on when and how long a condition has been active, you should have a better result set to work with to identify whether now is the time to increase evaluation thresholds.
When looking at the Event Logs, you can click through each row of occurrences and see the evaluation information update at the bottom. As you click each row, make sure you're clicking the value under the Start Time column. If you select an occurrence by clicking its End Time value instead, you'll see the last evaluation result, the one that was no longer true for the condition. That could be a bit misleading.
Event Logs, showcasing selecting rows under the Start Time column
Finally, go through several occurrences to look for a possible pattern or a potential new baseline.
We discussed a general process for tuning out noise from Advisory Conditions. Doing so will protect your email inbox and make the data shown in the Environmental Health Score and on the Dashboard via “Send to Alerting Channels” more meaningful. It is also important to review monitoring settings regularly. As issues come up, reflect on what alerts you received (or alerts you didn't receive but should have) and add or modify conditions accordingly. This will ensure efficient and actionable alerting across your SQL Server estate.
Patrick is a Customer Success Engineering Manager and helps provide customers with support services, troubleshooting, and defect resolution. Patrick also does some side work with SentryOne’s Professional Services team, which provides training and consulting. Patrick’s blog focus is on helping SQL Sentry users get the most out of our products. His hope is to ensure our users are as knowledgeable and comfortable as possible with SQL Sentry products, so they have as much ammo as possible to solve real-world SQL Server problems.