Testing data and data-centric applications is a vital step for organizations that use their data to drive their business. This article covers verifying data in production and how that process differs from the testing performed during development. Verifying data in production is a critical step for organizations that rely on their data for decision making.
This step is performed after data integration processes are executed or after any changes to the data of the system. This verification includes several important aspects, but they can be summed up in the three R’s: Is the data reconciled? Is it related? And is it reasonable?
The reconciliation of data is performed by comparing counts, totals, and balances among sources. For example, if the daily sales transactions in your source system total $100,000, then the daily sales in your data warehouse should also be $100,000.
The relationship of data is verified by examining the categorization of the data and how it relates to other data. Continuing the example above, your data warehouse total sales might match the source system, but if 80% of the sales are recorded against an “unknown” product, the data isn’t related in a useful way for decision making.
The reasonableness of data is determined by looking at trends, history, and tolerances. If historically your data warehouse daily sales total within +/- 10% of $100,000 (a range of $90,000 to $110,000), seeing daily sales of $200,000 would trigger some additional investigation to ensure the results were valid.
In the sections below, the different types of verification will be discussed in more detail. These three types of verification are often implemented as queries against the data stores and applications in a data-centric system.
There can be a large number of these queries to execute in a production system to verify all the data. These might not all be queries in the traditional relational or SQL sense, as you will often find the need to interact with systems that have non-SQL interfaces or use web APIs.
In all likelihood, you'll want to automate the process of interacting with these heterogeneous systems, running the queries and comparing the results, whether you use a homegrown system or implement a commercially available package like SentryOne Test. Key features to look for (or implement, if you are creating your own) are the ability to store the test results and the ability to access and query the breadth of data sources that you use.
The supporting infrastructure needed to perform data verification effectively will be discussed further later in this whitepaper. First, however, we will discuss the differences between testing in development and verifying data in production.
Related content: Testing Data and Data-Centric Applications | Testing Data-Centric Code in Development
There are a number of similarities between testing in development and verifying data in production. Both scenarios can leverage the same techniques and the same tools. However, there is a significant difference—and that is in your focus.
Goal for testing in development: Ensuring the code is working successfully.
Goal for verifying data in production: Verifying the manipulated data is valid.
When you're testing in development, the goal is to make sure that the code is working successfully, so you typically control the inputs to the processes by using known test data. This approach gives you the ability to verify that the process produces the expected results. So, in this case, the focus of the testing is on the code and the logic it implements.
When you're verifying data in production, however, you don’t control the inputs. You're now dealing with real, production data.
So the focus shifts from verifying the code to verifying the data that the processes produce.
However, since the data volumes in production systems can be quite large, and the data is expected to be transformed as it moves between systems, you don’t usually focus production data verification on a line-by-line reconciliation.
The performance impact from doing row-by-row validations in a production system usually outweighs the benefits. Rather, you would focus initially on testing aggregates and roll-ups of information. If these show that the data is incorrect, then you would go to a more detailed level of data to identify the underlying issues that caused the incorrect data.
As mentioned above, there are many aspects to consider when verifying data. We like to sum these up as the 3 R’s: Is the data reconciled? Is it related? And is it reasonable?
The 3 R’s are interconnected and build on each other. Reconciliation is a fundamental verification step and should be your first goal to implement. It provides a baseline of confidence in the data, as you will know that the totals and key values in each system match.
Verifying data relationships is the next logical step and adds a significant amount of value because it ensures that data is related in a way that's useful for the business. Finally, you'll want to verify that your data is reasonable. Implementing this depends on having appropriate reconciliation and relationship tests in place, as you can’t verify that data is reasonable if it can’t be reconciled.
The ability to find and be alerted to exceptional scenarios by reasonableness testing can be invaluable, particularly since this type of verification can catch problems that you haven’t even considered.
Reconciliation of data involves comparing the sources of your data to the targets, and verifying that the data matches. This is rarely as simple as just comparing rows of data from the source and the target systems.
Data structures are usually different between source operational systems and reporting data structures, and the data is often cleaned, transformed, and otherwise modified when it is moved from repository to repository. Reporting applications can add their own layers of transformation, including new calculations and filtering, that have to be accounted for in the reconciliation.
The primary measures of reconciliation are totals, counts, and balances. You can calculate these easily by running aggregate queries against the data store. For example, if you are verifying a SQL Server repository, you might use a query like the one sketched below to get the count of orders and the total sales.
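A minimal sketch of such a query, assuming a hypothetical Sales.Orders table with OrderID and TotalDue columns (the schema names are illustrative, not from any particular system):

-- Count of orders and total sales from a source repository.
-- Sales.Orders, OrderID, and TotalDue are hypothetical names.
SELECT
    COUNT(OrderID) AS OrderCount,
    SUM(TotalDue)  AS TotalSales
FROM Sales.Orders;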
A similar query can be executed against the data warehouse to retrieve the same numbers, and the results can be compared to validate that they match. Any differences should trigger additional research to discover the source of the mismatch. These queries might also need to be filtered for specific time ranges.
It usually isn’t necessary, nor wise for performance, to process your entire history for the totals each time you run your data verification. Rather, you will likely want to filter the results to data that has been modified or moved since you last ran the verification.
Another common approach is to run the verification for a specific time window—for example, records modified in the last 24 hours or the last 7 days.
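As a sketch, the reconciliation query above could be limited to recently changed rows by filtering on a hypothetical LastModifiedDate column:

-- Only reconcile rows touched in the last 24 hours.
SELECT
    COUNT(OrderID) AS OrderCount,
    SUM(TotalDue)  AS TotalSales
FROM Sales.Orders
WHERE LastModifiedDate >= DATEADD(DAY, -1, GETDATE());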
When verifying the second "R," that data is related, you are checking the relationships in the data. For example, you expect sales data to relate to a number of other pieces of information. The record of a sale could relate to a product, a customer, a date or a number of other entities. If those relationships are missing or invalid, the data might lose a significant amount of value.
One scenario where you can encounter unrelated data is when you work with non-relational data stores, like file-based systems or NoSQL stores. These types of data stores don't enforce referential integrity, so there is nothing to guarantee that records are properly related.
If you work primarily with relational databases, you might not think you have much to worry about here. However, it’s not uncommon for relational databases to be missing the referential integrity checks that they need to properly enforce relationships. In addition, many data warehouses intentionally don’t enforce referential integrity for performance reasons.
Even when you are working with a relational database where referential integrity is enforced, you might still need to verify that your data is related. For example, in data warehouses it is common to have an “Unknown” record in each dimension, and fact records that don’t match an existing dimension member might be assigned to these unknown records.
These records are valid in that they pass referential integrity checks, and a certain volume of “unknown” relationships might be acceptable for your business scenarios. However, all organizations have limits to how many unknown relationships they can have in their data before it becomes unusable for decision making.
For example, when processing product sales, it might be acceptable to have 5% of sales map to an “unknown” customer location (geography), because that information either wasn’t provided from the customer or the provided information was invalid (the provided region and the postal code don’t match so the geography can’t be determined accurately).
However, if the volume of “unknown” geography relationships for sales grows to 10%, that might be unacceptable to your business users for reporting accuracy. This tolerance will be different for other details—most business users would be unhappy about 5% of their sales relating to “unknown” products.
To verify that data is related, you'll be testing counts and totals, but they'll be subdivided by the relationships. In the case that you are checking a specific relationship, like an “unknown” record of a dimension, you might use a query that filters results to just that subset of data. This allows your data verification tests to do specific checking for certain relationships and report any unexpected variances.
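For instance, a sketch of such a filtered check, assuming hypothetical FactSales and DimGeography tables and an “Unknown” member in the geography dimension, might compute the share of sales mapped to that member:

-- Share of sales assigned to the "Unknown" geography member.
-- Table, column, and member names are illustrative.
SELECT
    SUM(CASE WHEN g.GeographyName = 'Unknown' THEN f.SalesAmount ELSE 0 END) AS UnknownSales,
    SUM(f.SalesAmount) AS TotalSales,
    100.0 * SUM(CASE WHEN g.GeographyName = 'Unknown' THEN f.SalesAmount ELSE 0 END)
          / NULLIF(SUM(f.SalesAmount), 0) AS UnknownPercent
FROM dbo.FactSales AS f
JOIN dbo.DimGeography AS g
    ON f.GeographyKey = g.GeographyKey;

A test could then flag the result if UnknownPercent exceeds the tolerance agreed with the business (5% in the earlier example).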
Another technique for validating relationships can be to look at specific groupings and the distribution across groups. This can help you to quickly identify outliers and unusual relationships.
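A grouping query along these lines (again using illustrative FactSales, DimProduct, and DimProductCategory names) might look like this:

-- Sales counts and totals broken out by product category.
SELECT
    c.CategoryName,
    COUNT(*)           AS SaleCount,
    SUM(f.SalesAmount) AS CategorySales
FROM dbo.FactSales AS f
JOIN dbo.DimProduct AS p
    ON f.ProductKey = p.ProductKey
JOIN dbo.DimProductCategory AS c
    ON p.CategoryKey = c.CategoryKey
GROUP BY c.CategoryName
ORDER BY CategorySales DESC;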
For example, a query like the grouping sketch above would let you easily compare categories to see whether sales were abnormally high in any particular category, which could indicate that the categories were being related incorrectly. By using a second query (sketched below) that leverages the results of the categorized sales query, you can test for anomalies in the distribution.
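One way to sketch that second query is to compute each category's share of the total, which a test could then compare against historical shares (the schema names remain illustrative):

-- Percentage of total sales contributed by each category.
WITH CategorySales AS (
    SELECT
        c.CategoryName,
        SUM(f.SalesAmount) AS CategorySales
    FROM dbo.FactSales AS f
    JOIN dbo.DimProduct AS p
        ON f.ProductKey = p.ProductKey
    JOIN dbo.DimProductCategory AS c
        ON p.CategoryKey = c.CategoryKey
    GROUP BY c.CategoryName
)
SELECT
    CategoryName,
    CategorySales,
    100.0 * CategorySales / SUM(CategorySales) OVER () AS PercentOfTotal
FROM CategorySales
ORDER BY PercentOfTotal DESC;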
These values can also be useful in the next step, checking the reasonableness of the data. This is another place where being able to persist the values and results from the verification tests is beneficial.
The 3rd "R" is that data is reasonable. This verification is a bit more subjective and will require you to work with your business users to determine which data scenarios are expected and normal, and which data scenarios are abnormal and require someone to be alerted.
It’s very helpful to know the trends for key metrics in your historical data and to have that available for discussions with business users. Generally, you can consider data reasonable if a business user would look at it and not find it surprising.
A simple example of reasonableness is looking at daily sales. Imagine your organization typically sells 10,000 widgets a month. After loading the current day’s data into your data warehouse, you find that the data shows that you’ve exceeded 10,000 widgets for the month and it’s only the 4th day of the month.
While that’s certainly exciting data (it’s a record sales month!), you might not consider it reasonable that sales have skyrocketed that quickly. It’s certainly possible that the data is legitimate, but you would want to be alerted and do some research to confirm those numbers, particularly before telling your CEO to buy a new house.
While that might be an extreme example, the same reasonableness criteria can be applied in other scenarios. If you pay salespeople commission on your product sales, there’s likely a specific range that you expect the commissions to fall into when you look at as a percentage of the sale. Doing a reasonableness verification on orders to ensure that commissions fall into the appropriate range can be very beneficial.
Simple reasonableness verification can be done by getting expected values for key metrics from the business and querying to validate that the actual values fall within an acceptable range. For example, if you wanted to verify that your daily sales were within 10% of an expected value of $5,000, you could use a query like the one sketched below.
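A minimal sketch, assuming a hypothetical FactSales table with an OrderDate column (the expected value and tolerance would come from the business):

-- Reasonableness check: daily sales within +/- 10% of the expected value.
DECLARE @ExpectedSales money        = 5000.00;
DECLARE @Tolerance     decimal(3,2) = 0.10;

SELECT
    SUM(SalesAmount) AS DailySales,
    CASE WHEN SUM(SalesAmount) BETWEEN @ExpectedSales * (1 - @Tolerance)
                                   AND @ExpectedSales * (1 + @Tolerance)
         THEN 'PASS' ELSE 'FAIL'
    END AS ReasonablenessResult
FROM dbo.FactSales
WHERE OrderDate = CAST(GETDATE() AS date);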
As currently implemented, this test is useful but doesn’t deal well with the normal variances that most organizations experience.
Perhaps Fridays tend to be your busiest days or there are seasonal factors that lead to significant differences in the data for different time periods.
This type of test gets much more powerful when you combine it with historical results and expectations. If your data verification framework enables you to store test results, you can implement verification tests that compare the current results to historical data and trends.
You can also leverage other sources of historical data to implement trend comparisons. With the proper setup, these tests can automatically adjust to changing business scenarios and trends.
One way to leverage historical information is to use standard deviation. In the query batch sketched below, the historical test results for total sales are used to determine a baseline standard deviation value.
The difference between the current day’s sales and the previous day’s sales is then calculated and compared to the standard deviation. If it exceeds the calculated standard deviation, an alert would be raised.
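A sketch of that batch, assuming the framework stores prior results in a hypothetical dbo.TestResultHistory table (MetricName, RunDate, and TotalSales are illustrative columns):

-- Baseline: standard deviation of historical daily sales totals.
DECLARE @StdDevSales   float,
        @CurrentSales  money,
        @PreviousSales money;

SELECT @StdDevSales = STDEV(TotalSales)
FROM dbo.TestResultHistory
WHERE MetricName = 'DailyTotalSales';

-- Current day's total from the warehouse (illustrative schema).
SELECT @CurrentSales = SUM(SalesAmount)
FROM dbo.FactSales
WHERE OrderDate = CAST(GETDATE() AS date);

-- Most recently recorded daily total.
SELECT TOP (1) @PreviousSales = TotalSales
FROM dbo.TestResultHistory
WHERE MetricName = 'DailyTotalSales'
ORDER BY RunDate DESC;

-- Raise an alert when the day-over-day change exceeds the baseline deviation.
SELECT CASE WHEN ABS(@CurrentSales - @PreviousSales) > @StdDevSales
            THEN 'ALERT' ELSE 'OK'
       END AS ReasonablenessResult;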
Data-centric systems often contain multiple data stores and applications. There can be numerous separate processes that move or transform the data. So how do you go about verifying that it’s happening correctly? At what points should you apply the three R’s?
One approach is step-by-step validation. In this case, you look at the major components of your data infrastructure. Within the infrastructure, you identify each transition point for your data—where does it move from one system to another, or where is it transformed? You would then implement testing for each of these points.
Testing for reconciled data works very well in the step-by-step approach and it's highly recommended that you do reconciliation verification at each transition point. This helps you quickly identify spots where some of the data might have been missed or not transformed correctly.
Checking for related and reasonable data is usually not as critical for each step, although you might want to implement these types of verification testing at specific transition points. Step-by-step validation is particularly useful for identifying where in the overall data infrastructure a problem was introduced.
However, step-by-step validation is targeted toward technical users who have a good understanding of the complete system. This approach might not work as well for business users, who are often less concerned with where a problem occurred than with the end result. Operations people, however, do find the finer-grained approach beneficial.
End-to-end verification focuses on comparing the starting values in your source systems (one end) with the final values in your reporting data structures, reports, and dashboards (the other end).
The goal of this validation is to look at the macro level and major endpoints of the system, without considering the individual transition points. Endpoints are usually the places where the data crosses the boundaries of your data-centric systems. Endpoints can be originating (where data enters your data-centric system) or terminating (where data flows outside the control of your data-centric system). In some cases, an endpoint might function as both.
Some data-centric systems might have many endpoints. For example, systems that include point-of-sale applications might have hundreds or even thousands of endpoints. In this case, you have to identify a reasonable place to start the testing. If there is an easy way to replicate the tests over multiple endpoints, that might be an option. If not, you might consider working a little further up the flow of data to a point where the systems are combined into a common store.
Out of the three R’s, end-to-end tests usually focus on reconciliation and reasonableness. You will typically want to reconcile the data at your endpoints, and verifying the reasonableness of the data in your endpoints is critical to having confidence in the data. Relationships can be validated in these verifications too, but more commonly step-by-step verification is used to validate relationships.
End-to-end verification works very well as a way of reporting system status for business users, who are generally more interested in the high-level picture. Often, the business user really wants to know: “Can I run my reports this morning and have confidence in the results?” An end-to-end test that reconciles the source system totals to the values in the data warehouse can fill this need nicely.
End-to-end tests are also useful as quick smoke checks. Because end-to-end verification usually has fewer test cases and focuses on high-level aggregates, they can be run more quickly than the step-by-step verifications. Many organizations benefit from having both—they run the end-to-end verifications to determine whether there are any major issues, and then run the step-by-step validations to identify specific problem areas.
For an idea of how you might apply data verification to a real-life system, consider the following example. You are working on a data infrastructure that contains the following systems:
An OLTP order processing system
An OLTP customer relationship management (CRM) system
A star-schema data warehouse (DW), which includes a staging area (Stage)
An OLAP analysis repository
A series of reports and dashboards that report on information from the data warehouse and OLAP repository
The transition points where data is moving or being transformed in this system are listed below, with details on example verifications for each.
The first transition point is Order Processing to Stage. Very little transformation is done at this point; the primary goal is to move the data to the staging area for additional processing.
To validate accurate data movement, you would implement reconciliation that compared the key metrics from the order processing system to the staging area. This would include total orders, total sales amounts, and current inventory levels. The queries for retrieving this information would be very similar since there aren’t significant changes to the data structure in this step.
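As a sketch, the framework might run two near-identical queries, one against the source and one against staging, and compare the results (the database and table names here are illustrative):

-- Source side: order processing system.
SELECT COUNT(OrderID) AS OrderCount, SUM(OrderTotal) AS TotalSales
FROM OrderProcessing.dbo.Orders
WHERE OrderDate = CAST(GETDATE() AS date);

-- Target side: staging area, with a nearly identical structure at this point.
SELECT COUNT(OrderID) AS OrderCount, SUM(OrderTotal) AS TotalSales
FROM Stage.dbo.Orders
WHERE OrderDate = CAST(GETDATE() AS date);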
Relationship and reasonableness verifications are optional at this step. In many scenarios, it would be beneficial to implement some reasonableness verification that looks at the count and total of the data transferred to determine whether it matches the historical trends.
If the data movement is done incrementally, the verifications should be filtered so that they look at the current increment. For example, if the sales information for the previous day is being moved to the staging area, you would want to ensure that your queries were filtered to include only that day’s information. That helps narrow the scope of any identified data mismatches.
Much like Order Processing to Stage, the CRM to Stage transition is focused on the movement of data, not transformation. The validation logic is very similar, but you would focus on customer counts, total amounts for quotes, and so on.
Stage to DW is primarily a transformation step. As part of this transition, the data is restructured to fit the data warehouse data, existing values are summarized, and new values are created. Data that is determined to be invalid would also be excluded at this step.
You would implement all three of the R’s at this transition point. Reconciliation verification would confirm that the totals and counts loaded into the data warehouse match the staging area once any data excluded as invalid is accounted for.
Relationship verification would be applied to validate that categorization of the data hadn’t been impacted as facts were related to dimensions. Finally, reasonableness verification would compare the loaded values to ensure that they matched the historical trends and that there weren’t unusual data scenarios encountered during the data transformation processing.
Validation of this step involves a good understanding of the business requirements. To implement the validation, you need to understand the data scenarios that might be encountered, as well as how they are expected to be handled. The validation would still focus on the key metrics identified in the previous steps (total orders, sales amounts, customer counts), but the queries might be more complex, as the data has changed during the transition.
For example, you might have sales that are filtered out for missing or invalid data during processing. Validation would ensure that all totals from the original staging tables are compared to totals that include both the data inserted into the data warehouse, as well as any data that has been excluded as invalid.
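One way to sketch this, assuming invalid rows are written to a hypothetical RejectedOrders table during the load:

-- Staging total for the load window (illustrative schema).
SELECT SUM(OrderTotal) AS StagingTotal
FROM Stage.dbo.Orders
WHERE OrderDate = CAST(GETDATE() AS date);

-- Warehouse total plus anything excluded as invalid; the two results should match.
SELECT
    (SELECT ISNULL(SUM(SalesAmount), 0)
     FROM dbo.FactSales
     WHERE OrderDate = CAST(GETDATE() AS date))
  + (SELECT ISNULL(SUM(OrderTotal), 0)
     FROM dbo.RejectedOrders
     WHERE OrderDate = CAST(GETDATE() AS date)) AS WarehousePlusExcludedTotal;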
The DW to OLAP transition point is primarily focused on changing the storage of the data rather than modifying it. Typically, the data is retrieved from a relational store and written to a multidimensional store with different physical storage properties. As part of the OLAP storage, additional calculations might be provided.
Because the data isn’t being changed and business rules don’t typically come into play at this step, the verification for this transition point is focused on reconciliation and relationships. The total amounts and counts should not change, and the data categorization should remain the same.
Your verification tests should focus on ensuring that the totals match, that the associated facts and dimensions are still aligned, and that any new calculations implemented in the OLAP store return correct values.
The DW to Reports transition point enables the visualization of the data for end users. Normally, the data isn’t heavily transformed in this step, but it's often aggregated and summarized differently for reporting purposes. Many times, additional calculations are implemented in the reporting tool.
As you implement verification for this transition point, your primary concern should be reconciling that the totals displayed on the reports match the totals from the data warehouse. Relationships are not typically modified in the reporting step, and the reasonableness of the data should be verified in a previous step.
If the reports are simple and straightforward, with direct pass-through to the underlying storage, verification tests might not be necessary at this step. However, new calculations and transformations for display purposes are frequently implemented in the reporting layer. In these cases, reconciliation verification is important to ensure that the reporting layer isn’t presenting an invalid display of the source data.
The verification for the OLAP to Reports transition point matches the DW to Reports transition point; the only difference is the data source for the reports.
Endpoints are the places in your data-centric system where data is introduced from outside, or where the data flows outside the control of your data-centric system.
Order processing is an originating endpoint. Data flows into this system from users or other systems interacting with the order processing system. Endpoint verification at this point would consist of reconciliation steps as well as reasonableness verification.
For reconciliation, your primary goal for this endpoint is to gather the totals, counts, and balances that will be used for reconciliation with other endpoints. These values should be stored so that you have a history of the values. The reasonableness verification should be based on these historical values. For example, you might capture the total order amounts and item counts for reconciliation. Your reasonableness tests would compare those totals to the historical trends, and if the variance was greater than 15%, an alert would be triggered.
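A sketch of such a check, assuming the captured totals are kept in a hypothetical dbo.TestResultHistory table and using an illustrative 90-day window:

DECLARE @HistoricalAvg money, @TodayTotal money;

-- Historical baseline from previously captured endpoint totals.
SELECT @HistoricalAvg = AVG(TotalSales)
FROM dbo.TestResultHistory
WHERE MetricName = 'OrderProcessingDailySales'
  AND RunDate >= DATEADD(DAY, -90, GETDATE());

-- Today's total from the order processing endpoint (illustrative schema).
SELECT @TodayTotal = SUM(OrderTotal)
FROM OrderProcessing.dbo.Orders
WHERE OrderDate = CAST(GETDATE() AS date);

-- Alert when the variance from the historical trend exceeds 15%.
SELECT CASE WHEN ABS(@TodayTotal - @HistoricalAvg) > 0.15 * @HistoricalAvg
            THEN 'ALERT' ELSE 'OK'
       END AS ReasonablenessResult;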
The CRM, like order processing, is an originating endpoint. The same approach applies: capture the customer counts and quote amounts for reconciliation with downstream systems, and compare them to historical trends for reasonableness.
The data warehouse is considered a terminating endpoint, as some users report directly from the warehouse, and other applications and systems use it as a source of data. For verification, your primary concerns are reconciliation and reasonableness.
The reconciliation verification should compare the amounts and counts with the originating endpoints. In this case, that would mean comparing the data warehouse total sales amount and item counts with the values captured from the order processing system, and the customer counts and quote amounts with the CRM system.
This is another point where you will want to persist the values from the reconciliation verification so that you can test the reasonableness of the data.
Even though the data was tested for reasonableness at the source, and the data is reconciled, the processes for loading the warehouse can be complicated. The reasonableness check, in this case, verifies that the data matches trends and that unusual scenarios (like a single salesperson making all of the sales for the day) are caught.
Since the OLAP system is another point where data is consumed by other systems and users, it's also considered a terminating endpoint. In this case, you would focus on reconciling to the CRM and order processing systems, using the same metrics as described above for the data warehouse.
Reports and dashboards are often the terminating endpoints of the data-centric system.
For this endpoint, reconciliation to source systems is the most important verification step. Particularly since this might be the only aspect of your data-centric system that end users interact with, it's important to make sure that the values on the report reconcile to the source systems and to the data warehouse or OLAP store, as appropriate.
As mentioned earlier, doing complete data verification can involve interacting with a large number of systems and running a large number of queries against them. Whether you choose to build your own or use a commercial solution like SentryOne Test, there are some key features you should look for:
Automated execution of queries: As mentioned, there can be a large number of queries to execute, and if you want to test regularly (which is certainly recommended), manual execution will get very time-consuming, very quickly.
Support for comparison of results across disparate systems: Your test framework should handle the need to compare data across different systems that might treat the same nominal data type differently. For example, SQL Server, Oracle, and JSON all use slightly different representations for date values. Your test framework should be able to account for this and still make accurate comparisons.
Alerting: When tests fail, you will want someone to be notified. Make sure your test framework supports this.
Flexible query result checking: Once you’ve run the query, you need a way to verify that it returned the expected data. This should be part of the automated execution so that you get a pass/fail message for each test case. Bear in mind that some queries might return single values and some may return tables of information. Your test framework should handle either.
History: Storing a history of test results is beneficial for reporting purposes, as well as seeing trends in test results. It’s also vital to doing effective reasonableness verification, so this is a must-have feature.
Support for multiple types of data stores: Very few organizations work with a single application interface or data storage technology. Your test framework should support all the tools you work with and be flexible to handle future ones. Ideally, it will support an extensibility model so that new data providers can be added easily.
Dashboards/reports: In addition to alerting, some users will want the ability to see current verification status of their data-centric systems, and to drill into details about test failures and issues. Ideally, this should be web-based or in a format that is easy to share with your users.
Having a proper framework in place means that you can focus on implementing the business logic for your tests, rather than worrying about how to run a query. While you can do some level of data-centric testing without a framework, it is strongly recommended.
This article and the related content, Testing Data and Data-Centric Applications and Testing Data-Centric Code in Development, have focused on data-centric testing both from a development standpoint and for production data verification. I've presented some of the benefits of adopting a data-centric view of testing in your organization and provided some examples of how to implement data-centric testing at both the development and production levels.
With this introduction to data-centric testing, you can evaluate your approach, tools, and techniques for dealing with the data-driven nature of organizations today. Data is a vital part of the business and is only becoming more important.
At SentryOne, we believe that it’s necessary to go beyond saying that your data is good—you should be able to prove it. If you can show, through repeatable tests and verification, that your data is reconciled, related, and reasonable, you will have confidence in your work and give your users confidence to make the decisions they need to make. And at the end of the day, everyone working on your data-centric systems will be happier and more productive.