Storage Forecasting and Machine Learning With SQL Sentry

To paraphrase British statistician George Box, making perfectly accurate predictions is impossible, but we can strive for predictions that are illuminating and useful. Today, I’ll be reviewing some of the conceptual details that went into the development of our new Storage Forecasting capability. The intent is to show you the work that went into making these models useful, and why the way they are created will allow you to trust their results.

To provide some context, SQL Sentry aims to deliver predictive insights alongside the monitoring and historical visibility customers have come to expect from our products. There are two operating modes for the Storage Forecasting functionality in the SQL Sentry client: Standard and Advanced. The forecast features the user interacts with are identical between the two modes, but the way the forecasts are generated varies depending on which mode is active. Steve goes into detail about the technical requirements for Advanced mode; Standard mode is enabled for all installations.

We'll discuss some of the reasons we created two distinct forecast generation processes later in this post, but a significant difference between the modes is the use of Microsoft’s Machine Learning Services (ML Services), available in SQL Server 2016 and later. Advanced mode takes advantage of ML Services’ ability to execute external R-language machine learning scripts on data within a SQL Server database, which in this case is the SQL Sentry repository.
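To make this concrete, here’s a minimal sketch of what an ML Services call looks like: T-SQL passes a query result to an embedded R script, which returns its forecast as a result set. The table and column names (dbo.DiskSpaceHistory, FreeSpaceGB, SampleTime) are hypothetical stand-ins rather than the actual SQL Sentry repository schema, and the one-line ets() model is purely illustrative, not the shipping forecast code.

    -- Hypothetical sketch: invoke an R forecast from T-SQL via ML Services.
    -- dbo.DiskSpaceHistory, FreeSpaceGB, and SampleTime are illustrative names.
    EXECUTE sp_execute_external_script
        @language = N'R',
        @script = N'
            library(forecast)                          # assumes the package is installed
            fit <- ets(ts(InputDataSet$FreeSpaceGB))   # simple exponential smoothing model
            fc  <- forecast(fit, h = 30)               # predict 30 samples ahead
            OutputDataSet <- data.frame(StepAhead = 1:30,
                                        PredictedFreeGB = as.numeric(fc$mean))',
        @input_data_1 = N'SELECT FreeSpaceGB FROM dbo.DiskSpaceHistory ORDER BY SampleTime'
    WITH RESULT SETS ((StepAhead INT, PredictedFreeGB FLOAT));

InputDataSet and OutputDataSet are the default data frame names ML Services uses to pass data into and out of the R session, which is what lets the script operate directly on repository data without exporting it.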

The SQL Sentry Storage Forecasting Panel

The Disk Space Problem

There is an almost endless array of ways to predict the space that will be available on a disk partition at a specified time in the future. Some methods might be extremely complicated and depend on modeling the inputs and outputs of the individual databases and processes on the drive. Others might be very simple and use only an average of past activity. When we started evaluating this problem, we identified a few key characteristics of storage behavior that shaped how complex our models needed to be.

  • Storage usage history can be highly variable and doesn’t behave in a manner that is what I’ll call “statistically friendly.”
  • There is a distinction between two types of events that affect the available space on a disk: general use versus space-freeing.
  • There are transient usage events that don’t contribute meaningfully to long-term usage patterns.
  • There’s great diversity in storage use cases, as not all disks experience the same usage patterns, run the same applications, or host the same types of data.
  • There’s potential for improving storage forecasts over time if the model can notice changing conditions, discount its past mistakes, and build on its past successes.

By addressing each of these problems, we built a tool that is robust to internal and external pressures and variability, and that can learn from its errors and successes. Let’s talk about each observation individually, and about the element of the tool developed to account for it.

Statistical Properties of Disk Space

First, the rate of consumption on most disks isn’t steady, incremental, and linear. We move files onto disks, and then we move them off. Sudden issues cause log files to balloon. We add new functionality that abruptly changes usage rates, and then turn that functionality off. Market forces and seasonality can cause extreme variability in transaction processing. We can’t possibly hope to anticipate and predict every type of real-world event that might affect disk usage.

A model that explicitly accounted for every one of these influences would be so complex that it wouldn’t be useful to anyone. At the same time, all this variance means we can’t simply fall back on a very simple trend model. In statistical terms, disk space data is non-stationary: it doesn’t have a consistent mean or variance over time, nor are the errors of a linear forecast distributed in a statistically convenient way. Our path through this morass lies in selecting a set of multiple models that don’t rely exclusively on linear forecasting approaches but instead can take patterns of variance into account when generating predictions.

This isn’t to say that a linear forecast isn’t at all useful. The Standard forecast mode (available to all users) in Storage Forecasting performs in just that way, with the added benefit of some of the other features described below. But if you desire a particularly illuminating forecast, non-linear methods are required.
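To illustrate the difference between the two approaches, here’s a small, self-contained R sketch using simulated data and the open source forecast package; it isn’t our production code, just a demonstration of why an adaptive model can beat a single straight line when the consumption rate shifts.

    library(forecast)

    # Simulated daily free-space history (GB): a noisy downward drift
    set.seed(42)
    free_gb <- 500 - cumsum(rnorm(120, mean = 1.5, sd = 4))

    # Linear trend, roughly the Standard-style approach: one line fit to all history
    days <- seq_along(free_gb)
    linear_fit <- lm(free_gb ~ days)
    linear_fc  <- predict(linear_fit, newdata = data.frame(days = 121:150))

    # Exponential smoothing re-estimates its level and slope as conditions change
    ets_fc <- forecast(ets(ts(free_gb)), h = 30)

    # First forecasted day on which each model predicts exhaustion (NA if none)
    which(linear_fc <= 0)[1]
    which(as.numeric(ets_fc$mean) <= 0)[1]

Because the smoothing model keeps updating its estimates, a recent change in consumption rate pulls its forecast quickly, whereas the straight line averages that change against the entire history.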

General Use vs. Space-Freeing

Transactions populate database data files and logs. The size of a SQL data file auto-grows with a specific pattern. Users incrementally create and grow files that consume space. These events fall under the umbrella of normal usage patterns.

An administrator allocates additional storage on a cloud resource, compresses and archives temporary files, or empties the Recycle Bin. Each of these administrative actions is functionally distinct from the normal usage described above.

Therefore, we needed to create an event model that recognizes these sorts of events as fundamentally different and then eliminates them from the data used to generate a forecast. You don’t want your forecasting tool to account for your impending response to storage exhaustion; you need it to let you know that an action is necessary.
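As an illustration of the idea, and emphatically not the actual event model shipped in the product, the following R sketch removes large, sudden gains in free space so that only organic consumption drives the trend; the 20 GB threshold is an arbitrary assumption.

    # Hypothetical sketch: screen space-freeing events out of a daily
    # free-space series (GB). The threshold is an arbitrary assumption.
    adjust_for_space_freeing <- function(free_gb, threshold_gb = 20) {
      delta <- c(0, diff(free_gb))
      # A large positive jump in *free* space means something was freed
      freed <- ifelse(delta > threshold_gb, delta, 0)
      # Subtract the cumulative freed space so each jump drops out of the trend
      free_gb - cumsum(freed)
    }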

Space-Freeing Activity

Transient Events

In the same way that active, space-freeing events should be excluded from the data used to generate a forecast, we also don’t want to include transient events such as the one-time placement and subsequent removal of a large file on the disk. If there’s a sudden and significant change in the available space that’s immediately reversed, we don’t want that event to influence the prediction of when you’ll run out of disk space. As described above, our event model detects these sudden changes, along with their reversals, and then excludes them from the data used to generate the forecast.
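Expressed in the same sketch form as above (again, an illustration rather than the production event model), the rule is: find a sudden drop that’s offset by a comparable gain within a few samples, and bridge over the span.

    # Hypothetical sketch: bridge over transient drops in free space that
    # reverse within a few samples. The threshold and span are assumptions.
    remove_transients <- function(free_gb, threshold_gb = 20, max_span = 3) {
      delta <- c(0, diff(free_gb))
      for (i in which(delta < -threshold_gb)) {
        window <- i:min(i + max_span, length(delta))
        offset <- which(delta[window] > threshold_gb)
        if (length(offset) > 0) {
          end <- window[offset[1]]
          # Hold the pre-event level until the space is given back
          free_gb[i:(end - 1)] <- free_gb[i - 1]
        }
      }
      free_gb
    }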

Transient Event

Ensemble Modeling

To this point, each of the observations made about disk forecasting has resulted in a feature that is present in both Standard mode and the Advanced mode of Storage Forecasting. From this point on, however, the features discussed are unique to the forecasts generated in Advanced mode.

Different types of models pay attention to different features of the source data. The result is that each type of model tends to make certain kinds of errors. For example, if we’re trying to predict traffic density on a highway, we might have two models: one uses the traffic on nearby feeder streets over the past 10 minutes, and another uses historical traffic patterns for that time of day. Each model is biased toward different kinds of predictions, and each will tend to be accurate in different situations.

In general, the best predictive approaches combine different forecasts, average out the biases and errors, and move the ensemble prediction in a direction that more closely models the real results. This is how we built the Storage Forecasting tool. Several forecasts of different classes are generated, and the best performing amongst them are combined to produce an ensemble forecast that will provide the best estimate of your resource availability.
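Here’s a hedged sketch of the general technique. The candidate models (exponential smoothing, ARIMA, and a linear trend) and the inverse-error weighting rule are illustrative choices built on the open source forecast package, not SQL Sentry’s actual model set or combination logic.

    library(forecast)

    ensemble_forecast <- function(y, h = 30, holdout = 14) {
      train <- head(y, -holdout)
      test  <- tail(y, holdout)
      fits <- list(
        ets    = function(x, n) as.numeric(forecast(ets(ts(x)), h = n)$mean),
        arima  = function(x, n) as.numeric(forecast(auto.arima(ts(x)), h = n)$mean),
        linear = function(x, n) {
          t <- seq_along(x)
          predict(lm(x ~ t), newdata = data.frame(t = length(x) + 1:n))
        }
      )
      # Score each candidate on a recent holdout window
      mae <- sapply(fits, function(f) mean(abs(f(train, holdout) - test)))
      w   <- (1 / mae) / sum(1 / mae)   # better-performing models get a larger vote
      # Refit on the full history and blend the point forecasts
      preds <- sapply(fits, function(f) f(y, h))
      as.numeric(preds %*% w)
    }

Weighting by inverse holdout error is only one reasonable rule; the property that matters is that models that have recently been wrong contribute less to the blend.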

Machine Learning

The final, and most exciting, feature present in our Storage Forecasting tool is its ability to adapt and learn. As mentioned previously, the forecast that is shown to the user is an ensemble that’s composed of a set of predictive models of different types. Each of these model types is biased to perform well in different circumstances. Some are very sensitive to recent variation or long-term trends, while others are better at picking up periodic patterns in the data. For a specified disk, any one of these models might be more effective at producing a useful forecast, and it isn’t necessarily clear ahead of time which will be most effective. As a result, we monitor the results of the individual components of the ensemble and adjust the way we combine them to give a greater voice to those forecasts that are doing a better job.
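In spirit (and only in spirit; this is a toy update rule, not SQL Sentry’s algorithm), that feedback loop can be as simple as keeping an exponentially weighted error score per model and renormalizing:

    # Toy sketch of adaptive reweighting: decay old evidence, fold in the
    # latest errors, and renormalize inverse-error weights.
    update_ensemble <- function(scores, new_errors, decay = 0.9) {
      scores <- decay * scores + (1 - decay) * abs(new_errors)
      list(scores = scores, weights = (1 / scores) / sum(1 / scores))
    }

    # Example: the linear component has been missing badly, so it loses its vote
    state <- update_ensemble(scores = c(ets = 2, arima = 3, linear = 9),
                             new_errors = c(1.5, 2.8, 12))
    round(state$weights, 2)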

Over time, the algorithm learns which classes of forecasts are most effective for that specific disk and serves the user results that reflect its better-performing components. This is the essence of machine learning in general: observe repeated outcomes on a given task and tune behavior to perform that task better. We’re proud to bring this sort of methodology to storage analysis.

Conclusion

We’re incredibly excited to add the new Storage Forecasting feature to the rich disk monitoring toolset already present in the Disk Space tab, Reporting, and Advisory Conditions in SQL Sentry. We’re already industry leaders in providing best-in-class analytical views into the complicated world of the Microsoft technology stack. This new feature is our first to bring modern predictive modeling, powered by adaptive machine learning, to the Microsoft data professional. We can’t wait to learn how SQL Sentry users feel about this new capability, and we look forward to hearing how the tool is providing value.
