Data is critical to organizations today. Businesses depend on accurate data to determine whether they are meeting objectives, make decisions about new products and offerings, and evaluate the success of current initiatives. Governments use data to determine which programs are successful and which are not. And non-profits use data to evaluate the impact they're making.
There are countless examples of data being used to support critical processes today. However, most of the energy and effort in testing in IT focuses on testing the application functionality that creates or uses the data, not on verifying the end result—the data itself. Often, data-centric processes—such as data integration, extract, transform, and load (ETL) processes, and analytic
Why has there been a historical focus on application testing? For many years, most organizations have concentrated on testing application logic because that’s where the interest was. People were focused on developing new and better applications. They wanted to be able to develop these applications quickly, iterate on them rapidly, and build new ones when the business drivers changed.
This emphasis on applications has required flexible and powerful testing frameworks. After all, it's very difficult to make rapid changes to an application without having a solid set of test cases that can validate that the changes you just made are actually working.
Testing the data in data-centric applications is an often overlooked part of the project.
It was often thought that the application would be the only thing working with the data, so if the application was “correct,” then the data must be correct as well. In practical terms, though, most data today is used and manipulated by multiple systems. Now you have to verify all the applications that might have access to the data, ensure that these applications interact with the data correctly, and confirm that there are no issues with cross-interactions. The problem is even more complex in today’s self-service world because new applications that use your data can be added at any time, often without you being aware.
Another reason that data-centric testing hasn’t been a focus is the perception that testing application logic is “easy,” while testing data is “hard.”
Developers in many cases don’t like testing data, because it involves outside dependencies, above and beyond their code. Many testing approaches advocate isolating the code under test—for most
However, this approach can be a drawback when testing data-centric applications, because the tests often verify only the application logic. They don’t validate how the application works with real data.
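A minimal sketch of that gap, using Python's built-in sqlite3 module. The `total_revenue` function, the `orders` table, and the NULL amount are all hypothetical and chosen only for illustration. The same logic passes an isolated check built from hand-written input, then fails once it reads rows that look like real data.

```python
import sqlite3


def total_revenue(rows):
    """Application logic under test: sum the amount field of each order row."""
    return sum(amount for _order_id, amount in rows)


def isolated_check():
    # Hand-built input, as an isolated unit test would typically use.
    return total_revenue([(1, 100), (2, 250)])


def real_data_check():
    # Hypothetical table standing in for real data; one amount is NULL.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (order_id INTEGER, amount INTEGER);
        INSERT INTO orders VALUES (1, 100), (2, NULL), (3, 250);
    """)
    rows = conn.execute("SELECT order_id, amount FROM orders").fetchall()
    conn.close()
    try:
        return total_revenue(rows)
    except TypeError as exc:
        # NULL arrives as None, which the isolated input never contained.
        return f"failed on real data: {exc}"
```

The isolated check says nothing about how the code behaves when the data contains values the developer did not anticipate, which is the case a data-centric test is meant to catch.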
Organizations are realizing that the real value is in the data they collect and manage—the applications that work with the data are subject to constant change and replacement. In many cases, the data produced from the applications is more valuable than the application itself. So, while we continue to need to test application logic, we also need to test data. This is particularly true in the following cases:
The data is business-critical or a differentiator for the organization
The data is interacted with from multiple applications or systems
The data is part of a data-centric application or workflow (for example, data integration between systems, ETL, or a data warehouse)
This document presents a methodology for testing data-centric applications and data. Not every piece of the methodology needs to be adopted to realize benefits from it. Any improvement to the testing has tangible results in reducing the number of defects in your data, as well as providing a reason for the developers and consumers of a system to feel confident in the results that it provides.
There are two main areas that this methodology covers—doing data-centric testing during development, and doing data verification for production or during system testing. Many of the same testing techniques can be used in both areas. However, the focus is a little different. Data-centric testing in development focuses on the testing necessary to make sure your data-centric applications produce the correct results. Data verification testing is focused on making sure that the systems that interact with the data produce consistent, verifiable results every day (or even more frequently).
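As one possible illustration of data verification, the sketch below reconciles a row count and a column total between two tables using Python's sqlite3 module. The table names (`staging_orders`, `warehouse_orders`), the `amount` column, and the `reconcile` helper are assumptions made for this example, not part of any particular tool.

```python
import sqlite3


def reconcile(conn, source_table, target_table, amount_column):
    """Return the measures (row count, column total) that differ between two tables."""

    def profile(table):
        # Identifiers are interpolated into the SQL, so pass only trusted names.
        row = conn.execute(
            f"SELECT COUNT(*), COALESCE(SUM({amount_column}), 0) FROM {table}"
        ).fetchone()
        return {"rows": row[0], "total": row[1]}

    source, target = profile(source_table), profile(target_table)
    # Keep only the measures where source and target disagree.
    return {key: (source[key], target[key]) for key in source if source[key] != target[key]}


def build_demo_database():
    # Hypothetical data: the warehouse table is missing one order.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE staging_orders (order_id INTEGER, amount INTEGER);
        CREATE TABLE warehouse_orders (order_id INTEGER, amount INTEGER);
        INSERT INTO staging_orders VALUES (1, 100), (2, 250), (3, 75);
        INSERT INTO warehouse_orders VALUES (1, 100), (2, 250);
    """)
    return conn
```

An empty result means the two tables agree on both measures. A check like this can be scheduled to run after each load, which matches the "every day (or even more frequently)" cadence described above.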
The major benefit of testing your data and data-centric applications is confidence in your data. One of the more common reasons for business intelligence initiatives to fail is that the users lack confidence in the results. By testing and verifying both the processes and the data that you are using, you can give the consumers of the data the confidence they need to make business decisions.
Another benefit arises if your organization makes use of self-service BI. According to Gartner, less than 10% of self-service BI initiatives will be monitored for consistency. That can create major issues for both the accuracy of the reporting and adherence to regulatory requirements.
Testing data-centric applications also leads to overall cost improvements. The earlier in the development cycle that defects are discovered, the easier and less costly it is to correct them. By incorporating robust testing into the development process, the maintenance and update costs can be greatly reduced. True, it does require a little more time upfront to create the tests, but the payoff is significant.
One of the biggest challenges with testing data-centric applications is that you are interacting with data. To test it well, you need a set of data that addresses the test scenarios. Depending on the goal of the test, you might need a small, static set of data that represents some specific expected data details, or you might need a much larger set of test data that represents your production data.
Managing these data sets can be challenging, as the creation of good test data can be time-consuming. Simply taking a copy of the production data for testing purposes is not an option for many organizations because of privacy concerns and regulations.
Related to managing the test data is the problem of keeping the data and the tests synchronized. As database schemas are updated with new columns, tables, and so on, the test data sets and the tests themselves need to be updated to reflect the current state.
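One way to notice when the schema and the test data have drifted apart is a small check that compares the columns a test data set was built for against the columns the table currently has. The sketch below uses SQLite's `PRAGMA table_info`. The `customers` table, the `EXPECTED_COLUMNS` set, and the `schema_drift` helper are hypothetical names used only for illustration.

```python
import sqlite3

# Columns the (hypothetical) test data set was originally built for.
EXPECTED_COLUMNS = {"customer_id", "name", "email"}


def schema_drift(conn, table, expected_columns):
    """Report columns that the test data expects but the table lacks, and vice versa."""
    # PRAGMA table_info returns one row per column; index 1 holds the column name.
    actual = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    return {
        "missing": sorted(expected_columns - actual),
        "unexpected": sorted(actual - expected_columns),
    }


def build_demo_database():
    # The table has gained a loyalty_tier column since the test data was created.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers "
        "(customer_id INTEGER, name TEXT, email TEXT, loyalty_tier TEXT)"
    )
    return conn
```

Running a check like this before the rest of the suite turns silent drift into an explicit signal that the test data and tests need attention.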
Another major challenge with data-centric testing is that the tools haven’t progressed at the same rate as the application tools. It’s difficult to automate data testing, and even with tools that support it, you might find yourself pulling various technologies together with duct tape in order to assemble a working solution.
Another challenge is the time it takes to create the tests. Often, testing is the first area to suffer when projects fall behind, and it can be easy to think that taking time from testing to complete other parts of a project will be okay. However, this often creates a downward spiral: The parts of the project that aren’t being tested create larger numbers of defects and rework, which can take more time away from testing, which just repeats the cycle. In addition, data testing, in particular, is time-consuming—managing the test data, as mentioned above, can require a lot of effort.
As mentioned in the previous section, the tools available for data-centric testing are, for the most part, lacking in several noticeable ways.
One, most tools are targeted to a particular tool or
However, you have to use a different tool, and learn a different skill set, in order to test SQL Server Reporting Services reports. This lack of technology coverage adds to the complexity of producing a full testing solution.
To some degree, you can work around this by pulling multiple tools together, and scripting the interactions between them. However, not all tools support the automation necessary for that approach, and it doesn’t reduce the need to have and maintain multiple tools and the skill set necessary to use them.
Any of the “x” Unit frameworks can make a good foundation for performing data-centric testing. However, you will need to spend some time developing an additional layer of functionality to make interacting with the database and other data-focused applications easier. In addition, this layer will ensure consistency in how the testing is performed.
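The sketch below shows what such a layer might look like on top of Python's `unittest`, one member of the xUnit family, with an in-memory SQLite database standing in for the system under test. The `DataTestCase` base class, its `load_rows`, `assertQueryEqual`, and `assertNoRows` helpers, and the `sales` table are all hypothetical; a real layer would target your own database and conventions.

```python
import sqlite3
import unittest


class DataTestCase(unittest.TestCase):
    """Hypothetical base class that adds data-focused helpers to unittest."""

    def setUp(self):
        # Each test gets its own in-memory database, closed automatically afterwards.
        self.conn = sqlite3.connect(":memory:")
        self.addCleanup(self.conn.close)

    def load_rows(self, table, columns, rows):
        # Create the table and load a small, static test data set into it.
        self.conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
        placeholders = ", ".join("?" for _ in columns)
        self.conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)

    def assertQueryEqual(self, sql, expected_rows):
        # Compare result sets without depending on row order.
        actual = self.conn.execute(sql).fetchall()
        self.assertEqual(sorted(actual), sorted(expected_rows))

    def assertNoRows(self, sql):
        # Useful for "there should be no bad records" style checks.
        self.assertEqual(self.conn.execute(sql).fetchall(), [])


class SalesSummaryTest(DataTestCase):
    def setUp(self):
        super().setUp()
        self.load_rows(
            "sales", ["region", "amount"], [("east", 10), ("east", 15), ("west", 7)]
        )

    def test_totals_by_region(self):
        self.assertQueryEqual(
            "SELECT region, SUM(amount) FROM sales GROUP BY region",
            [("east", 25), ("west", 7)],
        )

    def test_no_negative_amounts(self):
        self.assertNoRows("SELECT * FROM sales WHERE amount < 0")
```

Saved as a module, the tests can be run with `python -m unittest`. Because every test goes through the same helpers, data is loaded and compared the same way throughout the suite.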
You should also consider the people who will be developing and executing the tests when selecting tools. If your team is familiar with .NET or general programming languages, there is a broader array of choices. On the other hand, if your team doesn't spend a lot of time using .NET, then you will want to use tools that provide a friendly interface for the creation of the tests.
As you are looking for tools to drive your data-centric testing initiatives, keep the following criteria in mind:
The methodology discussed here can be implemented using a variety of tools. However, you will find that some tools are better suited to it than others. The samples shown in the related article Testing Data-Centric Code in Development use SentryOne Test, which is a tool developed with the methodology in mind, so it fits very well. But the approaches discussed here can be implemented with other tools and a bit of ingenuity; they might just require more work to set up and use.
There can be a wide variety of people involved in testing, but the data-centric testing process focuses on a few key roles. These roles don't have to be carried out by different people, but each role has a specific focus in the testing process.
Involve these roles in your testing strategy:
In some organizations, it’s felt that developers shouldn’t be involved in the testing process. Instead, they should just focus on producing code and let the Quality Assurance (QA) group handle testing. This is a good way to produce lots of code that nobody has tested. Developers are integral to the testing process because they are the only ones who know what code they have written. At a minimum, they need to work with the testers to ensure that everyone has a clear understanding of the requirements and the implementation so that the tests can accurately exercise the system.
In many organizations, particularly those adopting test-driven development (discussed further in the next article), there is a trend towards developers actually creating their own tests. An additional benefit is that developers are often the best equipped to create automated tests well. In data-centric testing, it is often necessary to have a developer who really understands the data participate in the test creation, or at a minimum, educate the testing team on working with the data. If you are really focused on improving your data-centric testing, you are likely to have at least a portion of your developers’ time spent on testing.
Developers would still be primarily involved in development testing for functionality, at the unit and system testing level. These will be defined in a later section of this series of articles. Data verification is typically not in their area of responsibility.
This is a more specialized role in organizations that focus on having extremely thorough automated test coverage. These are testers who are focused on testing and
Quality Assurance encapsulates traditional testing in many organizations. Often the people in QA focus primarily on “black box” testing – that is, they don’t know the internals of the system, but rather what goes in and what should come out for the application. Particularly when it comes to data-centric applications, they make
Adopting a testing approach for data-centric applications tends to change this role more significantly than the other roles. The focus for your QA resources becomes a) understanding the data requirements of the application, b) developing automated test scripts for that data, and c) testing the bigger interactions of the data-centric application or system under test.
The QA role is usually responsible for testing the system functionality at a macro level, rather than smaller units of code. They should be involved in testing at the system level, as well as performance and load testing. In addition, the QA role is heavily involved in data verification testing, which will be defined in a later section in this series.
Data-centric testing is a critical factor in today’s data-driven world. The quality, accuracy, and reliability of the data your organization works from is not something that can be left up to
For more details about applying the principles of data-centric testing, check out the related article Testing Data-Centric Code in Development. This article focuses on how you can adopt data-centric testing as part of your development processes, along with the various types of testing that you can consider as part of developing new and enhanced functionality and data. You'll also see how to apply data verification testing to data throughout your organization, which can increase your confidence in the data you work with every day.