Testing data and data-centric applications is a vital step for organizations that are using their data to drive their business. This article covers verifying data in production and how this process is different from testing performed in the development process. Verifying data in production is a critical step for organizations that rely on their data for decision making.
This step is performed after data integration processes are executed or after any changes to the data of the system. This verification includes several important aspects, but they can be summed up in the three R’s: Is the data reconciled? Is it related? And is it reasonable?
The reconciliation of data is performed by comparing counts, totals, and balances amongsources. For example, if the daily sales transactions in your source system total $100,000, then the daily sales in your data warehouse should also be $100,000.
The relationship of data is verified by examining the categorization of the data and how it relates to other data. Continuing the example above, your data warehouse total sales might match the source system, but if 80% of the sales are recorded against an “unknown” product, the data isn’t related in a useful way for decision making.
The reasonableness of data is determined by looking at trends, history, and tolerances. If historically your data warehouse daily sales total within +/- 10% of $100,000 (a range of $90,000 to $110,000), seeing daily sales of $200,000 would trigger some additional investigation to ensure the results were valid.
In the sections below, the different types of verification will be discussed in more detail. These three types of verification are often implemented as queries against the data stores and applications in a data-centric system.
There can be a large number of these queries to execute in a production system to verify all the data. These might not all be queries in the traditional relational or SQL sense, as you will often find the need to interact with systems that have non-SQL interfaces or use web APIs.
In all likelihood, you'll want to automate the process of interacting with these heterogeneous systems, running the queries and comparing the results, whether you use a homegrown system or implement a commercially available package like SentryOne Test. Key features to look for (or implement, if you are creating your own) are the ability to store the test results and the ability to access and query the breadth of data sources that you use.
The supporting infrastructure needed to perform data verification effectively will be discussed further later in this whitepaper. First, however, we will discuss the differences between testing in development and verifying data in production.
There are a number of similarities between testing in development and verifying data in production. Both scenarios can leverage the same techniques and the same tools. However, there is a significant difference—and that is in your focus.
Goal for testing in development: Ensuring the code is working successfully.
Goal for verifying data in production: Verifying the manipulated data is valid.
As mentioned above, when you are considering data verification, there are many considerations. We like to sum these up as the 3 R’s: Is the data reconciled? Is it related? And is it reasonable?
The 3 R’s are interconnected and build on each other. Reconciliation is a fundamental verification step and should be your first goal to implement. It provides a baseline of confidence in the data, as you will know that the totals and key values in each system match.
Verifying data relationships is the next logical step and adds a significant amount of value because it ensures that data is related in a way that's useful for the business. Finally, you'll want to verify that your data is reasonable. Implementing this depends on having appropriate reconciliation and relationship tests in place, as you can’t verify that data is reasonable if it can’t be reconciled.
The ability to find and be alerted to exceptional scenarios by reasonableness testing can be invaluable, particularly since this type of verification can catch problems that you haven’t even considered.
Reconciliation of data involves comparing the sources of your data to the targets, and verifying that the data matches. This is rarely as simple as just comparing rows of data from the source and the target systems.
When verifying the second "R," that data is related, you are checking the relationships in the data. For example, you expect sales data to relate to a number of other pieces of information. The record of a sale could relate to a product, a customer, a date or a number of other entities. If those relationships are missing or invalid, the data might lose a significant amount of value.
One scenario where you can encounter the potential for unrelated data can be found if you work with non-relational data stores, like file-based systems or NoSQL stores. These types of data stores don't enforce referential integrity, so there is nothing to guarantee that records are properly related.
If you work primarily with relational databases, you might not think you have much to worry about here. However, it’s not uncommon for relational databases to be missing the referential integrity checks that they need to properly enforce relationships. In addition, many data warehouses intentionally don’t enforce referential integrity for performance reasons.
It's not uncommon for relational databases to be missing the referential integrity of checks needed to enforce relationships.
Even when you are working with a relational database where referential integrity is enforced, you might need to verify that your data is related. For example, in data warehouses, it is common to have an “Unknown” record in dimensions and fact records that don’t match existing dimensions might be assigned to these unknown records.
These records are valid in that they pass referential integrity checks, and a certain volume of “unknown” relationships might be acceptable for your business scenarios. However, all organizations have limits to how many unknown relationships they can have in their data before it becomes unusable for decision making.
For example, when processing product sales, it might be acceptable to have 5% of sales map to an “unknown” customer location (geography), because that information either wasn’t provided from the customer or the provided information was invalid (the provided region and the postal code don’t match so the geography can’t be determined accurately).
The 3rd "R" is that data is reasonable. This verification is a bit more subjective and will require you to work with your business users to determine which data scenarios are expected and normal, and which data scenarios are abnormal and require someone to be alerted.
It’s very helpful to know the trends for key metrics in your historical data and to have that available for discussions with business users. Generally, you can consider data reasonable if a business user would look at it and not find it surprising.
Reasonable verification will require you to work with your business users to determine which data scenarios are normal.
A simple example of reasonableness is looking at daily sales. Imagine your organization typically sells 10,000 widgets a month. After loading the current day’s data into your data warehouse, you find that the data shows that you’ve exceeded 10,000 widgets for the month and it’s only the 4th day of the month.
While that’s certainly exciting data (it’s a record sales month!), you might not consider it reasonable that sales have skyrocketed that quickly. It’s certainly possible that the data is legitimate, but you would want to be alerted and do some research to confirm those numbers, particularly before telling your CEO to buy a new house.
While that might be an extreme example, the same reasonableness criteria can be applied in other scenarios. If you pay salespeople commission on your product sales, there’s likely a specific range that you expect the commissions to fall into when you look at as a percentage of the sale. Doing a reasonableness verification on orders to ensure that commissions fall into the appropriate range can be very beneficial.
Simple reasonableness verification can be done by getting expected values for key metrics from the business and querying to validate that they fall within an acceptable range. For example, if you wanted to verify that your daily sales were within 10% of an expected value of $5,000, you could use the following query.
Data-centric systems often contain multiple data stores and applications. There can be numerous separate processes that move or transform the data. So how do you go about verifying that it’s happening correctly? At what points should you apply the three
One approach is step-by-step validation. In this case, you look at the major components of your data infrastructure. Within the infrastructure, you identify each transition point for your data—where does it move from one system to another, or where is it transformed? You would then implement testing for each of these points.
Testing for reconciled data works very well in the step-by-step approach and it's highly recommended that you do reconciliation verification at each transition point. This helps you quickly identify spots where some of the data might have been missed or not transformed correctly.
Checking for related and reasonable data is usually not as critical for each step, although you might want to implement these types of verification testing at specific transition points. Step-by-step validation is particularly useful for identifying where in the overall data infrastructure a problem was introduced.
For an idea of how you might apply data verification to a real-life system, consider the following example. You are working on a data infrastructure that contains the following systems:
An OLTP order processing system
An OLTP customer relationship management (CRM) system
A star-schema data warehouse (DW), which includes a staging area (Stage)
An OLAP analysis repository
A series of reports and dashboards that report on information from the data warehouse and OLAP repository
Much like Order Processing to Stage, this transition is focused on the movement of data, not transformation. The validation logic is very similar, but you would focus on customer counts, total amounts for quotes, etc.
Stage to DW is primarily a transformation step. As part of this transition, the data is restructured to fit the data warehouse data, existing values are summarized, and new values are created. Data that is determined to be invalid would also be excluded at this step.
You would implement all of the three
Relationship verification would be applied to validate that categorization of the data hadn’t been impacted as facts were related to dimensions. Finally, reasonableness verification would compare the loaded values to ensure that they matched the historical trends and that there weren’t unusual data scenarios encountered during the data transformation processing.
Validation of this step involves a good understanding of the business requirements. To implement the validation, you need to understand the data scenarios that might be encountered, as well as how they are expected to be handled. The validation would still focus on the key metrics identified in the previous steps (total orders, sales amounts, customer counts), but the queries might be more complex, as the data has changed during the transition.
This transition point enables the visualization of the data for end users. Normally, the data isn’t heavily transformed in this step, but it's often aggregated and summarized differently for reporting purposes. Many times there are additional calculations implemented in the reporting tool.
As you implement verification for this transition point, your primary concern should be reconciling that the totals displayed on the reports match the totals from the data warehouse. Relationships are not typically modified in the reporting step, and the reasonableness of the data should be verified in a previous step.
If the reports are simple and straightforward, with direct pass-through to the underlying storage, verification tests might not be necessary at this step. However, new calculations and transformations for display purposes are frequently implemented in the reporting layer. In these cases, reconciliation verification is important to ensure that the reporting layer isn’t presenting an invalid display of the source data.
Endpoints are the places in your data-centric system where data is introduced from outside, or where the data
Order processing is an originating endpoint. Data flows into this system from users or other systems interacting with the order processing system. Endpoint verification at this point would consist of reconciliation steps as well as reasonableness verification.
For reconciliation, your primary goal for this endpoint is to gather the totals, counts and balances that will be used for reconciliation with other endpoints. These values should be stored so that you have a history of the values. The reasonableness verification should be based on these historical values. For example, you might capture the total order amounts and item counts for reconciliation. Your reasonableness tests would compare those totals to the historical trends, and if the variance was greater than 15 %, an alert would be triggered.
As mentioned earlier, doing complete data verification can involve interacting with a large number of systems and running a large number of queries against them. Whether you choose to build your own or use a commercial solution like SentryOne Test, there are some key features you should look for:
Automated execution of queries: As mentioned, there can be a large number of queries to execute, and if you want to test regularly (which is certainly recommended), manual execution will get very time-consuming, very quickly.
Support for comparison of results across disparate systems: Your test framework should handle the need to compare data across different systems that might treat the same nominal data type differently. For example, SQL Server, Oracle, and JSON all use slightly different representations for date values. Your test framework should be able to account for this and still make accurate comparisons.
Alerting: When tests fail, you will want someone to be notified. Make sure your test framework supports this.
Flexible query result checking: Once you’ve run the query, you need a way to verify that it returned the expected data. This should be part of the automated execution so that you get a pass/fail message for each test case. Bear in mind that some queries might return single values and some may return tables of information. Your test framework should handle either.