Testing data and data-centric applications is a vital step for organizations that are using their data to drive business decisions. This article explains what data-centric testing is, and provides an overview of a methodology that can be used to implement data-centric testing in your organization.
Data-centric testing requires flexible and powerful testing frameworks. After all, it's very difficult to make rapid changes to an application without having a solid set of test cases that can validate that the changes you just made are actually working.
You can use several different types of testing with data-centric applications during development. For the most part, these testing types line up with traditional application testing approaches, but there are some differences to accommodate the data focus.
When you think of code, you might picture a monitor full of C++, C#, Java, or another programming language. However, code can be any set of instructions to a computer or an application that produces an output.
So what types of computer instructions does this include? It certainly includes traditional application code, but it also includes database code and instructions to other, specialized applications, like data integration tools and business analysis and reporting tools. When you create an SSIS package, for example, you are writing a set of instructions for the data integration engine to execute.
The Data Definition Language (DDL) that you use with databases is also a set of instructions to the database engine. The DDL details the tables, views, foreign keys, and other objects that should be created, altered, or deleted in the database.
SQL Server Analysis Services (SSAS) uses a similar language, XML for Analysis (XMLA), which supports creating, modifying, and deleting objects on an Analysis Services server. In addition, Multidimensional Expressions (MDX) and Data Analysis Expressions (DAX) allow for querying and shaping results from Analysis Services.
These examples illustrate that most of the components that we create for data-centric applications would be considered code. And since they are code, they should be tested. The rest of this article will discuss how to do this.
Several types of testing are conducted on traditional applications. Most of these are directly applicable to testing data-centric code as well, though you might find it necessary to tweak the approaches a bit to make them work well.
Unit tests focus on small units of work (logical groupings of code) in the system under test, and they check assumptions about the behavior of that code. Generally, unit tests are implemented by the programmer during the code development process. These are tests that you would run against the code you have just completed to ensure it works as expected. Once completed, you would keep the unit tests to form the backbone for regression testing and to act as a verifiable check on whether the code performs as expected.
Automated unit testing is standard practice in application development. In application development unit testing, efforts are made to isolate the unit of work being tested from any outside dependencies, including the database or the file system. This is challenging for data-centric applications, and you'll find that it can create additional work to abstract away the external systems the code interacts with. In some cases, the tools you use for data-centric applications don’t support this level of isolation. SSIS, for example, is very difficult to unit test in a fully isolated manner, as some components require a connection to a database in order to function.
Rather than getting too wrapped up in debates about whether data-centric testing truly meets the definition of a unit test, we prefer to take a practical approach and work with what we have. If you like, you can refer to unit tests that interact with outside dependencies as micro-integration tests.
When creating unit tests, you should control the inputs to the code. The tests should verify that the output of executing the code delivers the results you expected. In some cases, the same unit tests might be driven through a variety of inputs, so that the same unit of code can be tested with many different inputs. This verifies that the code produces the correct results for all the tested inputs.
These types of tests are often referred to as data-driven unit tests.
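As an illustration, here is a minimal sketch of a data-driven unit test written with pytest and pyodbc. The scalar function dbo.CalculateDiscount, the table-free setup, and the connection string are assumptions made for the example; substitute the code unit and database you are actually testing.

```python
# A minimal sketch of a data-driven unit test using pytest and pyodbc.
# dbo.CalculateDiscount and the connection string are hypothetical placeholders.
import pyodbc
import pytest

CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=TestDB;Trusted_Connection=yes;"

@pytest.fixture
def db():
    # Each test gets its own connection and cleans up after itself.
    conn = pyodbc.connect(CONN_STR)
    yield conn
    conn.close()

# One unit of code, many inputs: each tuple is (customer_tier, order_total, expected_discount).
@pytest.mark.parametrize("tier, order_total, expected", [
    ("Standard", 100.00, 0.00),
    ("Silver",   100.00, 5.00),
    ("Gold",     100.00, 10.00),
])
def test_calculate_discount(db, tier, order_total, expected):
    # Act: invoke the code unit with controlled inputs.
    row = db.cursor().execute(
        "SELECT dbo.CalculateDiscount(?, ?)", tier, order_total
    ).fetchone()
    # Assert: the output matches the expected result for this input.
    assert float(row[0]) == pytest.approx(expected)
```

Each parametrized case runs as its own test, so you can run a single input in isolation and get fast feedback on just that scenario.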
Unit tests should also be isolated from each other. You should create unit tests so that any unit test is atomic and can be run independently of other unit tests. This isolation means that the tests can be run in smaller subsets easily, even down to a single unit test, and that you don't have to run them in any particular order. This approach does require that each test sets up the appropriate preconditions for the test, creating any necessary data prior to the test execution and cleaning it up afterward.
Unit tests should ideally be fast to execute. The longer the unit tests take to run, the less likely you are to execute them. Since much of the benefit of unit tests comes from running them frequently, you should ensure that your tests run as fast as possible. You can accomplish this by making sure your unit tests are done against small sections of code and that they do not cover too much of your application at once. If the tests are created properly, you can also run them singly or as a smaller subset to get faster feedback on the section of code you are testing. Another key point is that data-centric unit tests should focus on small sets of data. The point is to exercise the functionality of the code unit, not to performance-test it.
Often, the reason that developers object to including external dependencies in their unit tests relates to performance. Typically, external resources like databases or file systems are orders of magnitude slower than the same operation carried out in memory.
However, you can work around this in many cases by following the guidance in the preceding paragraph. In addition, running subsets of the unit tests when you are testing interactively, and the full test suite when doing a full integration, can lead to a better experience with resource-constrained tests.
Integration tests generally span multiple units of work and verify that larger portions of the system work together correctly. This may involve interacting with multiple subsystems, for example, verifying that a report can correctly retrieve information from a database, perform calculations on the result, and then display it to the user. Integration testing further ensures that code or modules developed by one developer work properly with code from another developer and don't have an unintended impact on other parts of the system.
As noted above, testing of data-centric applications often falls into the category of integration testing, because it can be difficult to isolate the application being tested from the underlying data.
In many cases, isolating the application from the data can actually hamper the effectiveness of the tests, as the data is central to the requirements for the application. Rather than getting too concerned about what type of testing is being performed, at SentryOne we prefer to take a pragmatic approach and focus on creating the tests that best verify the system under test.
Integration tests generally take a broader view of the system than unit tests do, and they typically take longer to run.
Creating integration tests for data-centric applications is much like creating unit tests, in that you generally have to set things up for the test, invoke the part of the system under test, and then assert that the new state of the system matches an expected result. However, this type of test focuses on larger sets of functionality. To create an integration test, you would define a usage scenario for the application, the expected end state, and test data that supports the scenario.
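The sketch below shows what such an integration test might look like, following the Arrange, Act, Assert pattern. The load procedure dbo.LoadDailySales, the staging and warehouse tables, and the connection string are assumptions for illustration only.

```python
# A sketch of an integration test: set up known data, invoke the load process,
# and assert that the end state matches the defined scenario.
import pyodbc

CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=TestDW;Trusted_Connection=yes;"

def test_daily_sales_load():
    conn = pyodbc.connect(CONN_STR, autocommit=True)
    cur = conn.cursor()
    try:
        # Arrange: put the system into a known state with a small, well-known data set.
        cur.execute("TRUNCATE TABLE staging.DailySales")
        cur.execute("TRUNCATE TABLE dw.FactSales")
        cur.execute(
            "INSERT INTO staging.DailySales (OrderId, CustomerId, Amount) "
            "VALUES (1, 10, 25.00), (2, 11, 75.00)"
        )

        # Act: invoke the portion of the system under test.
        cur.execute("EXEC dbo.LoadDailySales")

        # Assert: the new state of the system matches the expected result.
        row_count = cur.execute("SELECT COUNT(*) FROM dw.FactSales").fetchone()[0]
        total = cur.execute("SELECT SUM(Amount) FROM dw.FactSales").fetchone()[0]
        assert row_count == 2
        assert float(total) == 100.00
    finally:
        conn.close()
```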
Just as can be done with unit tests, integration testing can be automated. In many cases, the same framework or harness that is used for unit testing can also be leveraged for integration testing, as the general structure of the tests is very similar. The primary difference is in the granularity of what is being tested, and how hard you try to isolate the code being tested from other systems.
Using a framework also enables you to assemble integration tests into suites that can be run together and to include your integration tests as part of the build process. You will find that automating your integration tests provides an immense amount of value, and it is required to take advantage of integration tests for regression testing.
Test Driven Development (TDD) is a practice in which unit tests and code are written in conjunction with each other. As a developer, you would write small, incremental tests, then write the code to satisfy those tests and ensure they pass.
You start by creating a test that implements a specific test case. This test will fail initially, so you write the code necessary to make it pass. Then you refactor the code until it is clean and elegant while keeping the test passing. You then repeat the process for the next piece of functionality until the code delivers the expected results.
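A compact sketch of one trip through that cycle is shown below. The requirement (standardizing customer names by collapsing whitespace) and the function name are invented purely to illustrate the red/green/refactor rhythm.

```python
# A minimal sketch of the TDD cycle for a hypothetical requirement:
# "trim and collapse whitespace when standardizing customer names."

# Step 1 (red): write the test first; it fails because standardize_name doesn't exist yet.
def test_standardize_name_collapses_whitespace():
    assert standardize_name("  Jane   Doe ") == "Jane Doe"

# Step 2 (green): write just enough code to make the test pass.
def standardize_name(raw: str) -> str:
    return " ".join(raw.split())

# Step 3 (refactor): clean up the implementation while keeping the test green,
# then repeat the cycle for the next piece of functionality.
```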
This approach has a number of benefits. First, because you're creating tests in conjunction with the code you're writing, test coverage of the code is much higher. Second, it keeps your efforts focused on implementing the code that meets the requirements. Third, and perhaps most important, it increases your confidence as a developer.
When you develop using a TDD approach, you always know where your code stands. Because you are working in small increments, you are never very far away from a system that passes all the tests.
If all tests are passing, then you know all implemented code is working as designed. If you make modifications, you will get immediate feedback on whether the change has impacted other functionality in the system. This makes it much easier to make updates and refactor code.
System testing tests the system as a whole. It generally focuses on validating that the system meets the overall requirements for the solution.
For data-centric applications, system testing might need to take on some additional steps to truly validate the system. For example, it becomes much more important to validate the underlying data in the system when dealing with data-centric applications.
Since individually reviewing each row of a table in a database isn't practical, you'll need to apply tools to this problem. Good tools are capable of comparing expected data with the actual data, ideally against either a comparable, known-good data set or a set of control totals.
Control totals are things like a customer count, the total amount of sales for the month of December, or some other aggregated value that gives you confidence that if the aggregate matches, the underlying details are likely to match as well.
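The sketch below shows what a control-total check might look like. The table names, date range, and connection strings are assumptions for illustration; the point is that matching aggregates give you confidence in the underlying detail without comparing every row.

```python
# A sketch of a control-total comparison between the source system and the warehouse.
import pyodbc

SOURCE = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=src;DATABASE=Sales;Trusted_Connection=yes;"
TARGET = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dw;DATABASE=SalesDW;Trusted_Connection=yes;"

def test_december_sales_control_total():
    src_total = pyodbc.connect(SOURCE).cursor().execute(
        "SELECT SUM(Amount) FROM dbo.Orders "
        "WHERE OrderDate >= '2023-12-01' AND OrderDate < '2024-01-01'"
    ).fetchone()[0]
    dw_total = pyodbc.connect(TARGET).cursor().execute(
        "SELECT SUM(SalesAmount) FROM dw.FactSales "
        "WHERE OrderDateKey BETWEEN 20231201 AND 20231231"
    ).fetchone()[0]
    # If the aggregates match, the underlying details are likely to match as well.
    assert float(src_total) == float(dw_total)
```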
Take note! Be careful with tools that only allow you to do a row-by-row, column-by-column comparison. Often, when dealing with changes to data-centric applications, updates to the system require modifications to the data structures. When that happens, it can break the functionality of many data comparison tools.
Rather, you should look for tools that support a tabular comparison and give you the ability to compare aggregated values.
Regression testing is testing done to validate that new changes to a system have not adversely affected existing functionality. In basic terms, this is something that most of us have seen when we have fixed one problem, only to see something that we thought was unrelated suddenly stop working in another part of the system.
Regression testing is all about finding unintended consequences. It also is used to ensure that corrected issues do not resurface in later versions of the system by continuing to validate those fixes for subsequent versions.
Regression testing is a problem spot for many organizations because it doesn't involve testing what has changed; it involves testing everything that has not changed.
People are not very good at anticipating the side effects of their changes. In addition, when testing is done manually, it's easy to test less thoroughly in areas of the system where people don't expect to find issues.
By creating automated unit and integration tests, you will get regression testing without having to do any additional work. You can establish a baseline of functionality testing that can be easily re-executed as needed.
So for subsequent changes, you can continue executing the same tests that you have already created, verifying that nothing unexpected has changed in the system.
This approach does mean that you will need to make sure any new changes to the system are also covered by automated tests, particularly any defects that are corrected. Once you have a test that validates that a particular defect is fixed, you can have confidence that if it shows up again, you will catch it during testing, rather than in production.
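As a sketch of what such a defect-pinning test might look like, consider a hypothetical bug where NULL middle names produced badly formatted full names. The function name and scenario are invented for illustration; the point is that the test stays in the suite after the fix.

```python
# A sketch of a regression test that pins a corrected defect in place.
import pyodbc

CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=TestDB;Trusted_Connection=yes;"

def test_full_name_handles_null_middle_name():
    # This input previously produced 'John  Smith' (double space); the fix must not regress.
    row = pyodbc.connect(CONN_STR).cursor().execute(
        "SELECT dbo.FormatFullName('John', NULL, 'Smith')"
    ).fetchone()
    assert row[0] == "John Smith"
```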
The use of automated tests for regression testing is incredibly valuable, particularly if the system you are working on experiences a lot of change. Automated tests also mean that your investment grows more valuable with every change: consider the time saved over manually re-executing tests, or the cost of a functional regression slipping through testing unnoticed.
Load and performance testing is testing to determine whether the system handles operations at the expected volume of the production system in an acceptable time frame. Load and performance testing generally assumes that the functionality is correct, and focuses primarily on time frames and volume.
The practical data-centric testing described in this article doesn’t focus specifically on load and performance testing, as existing approaches for this type of testing work well for data-centric applications. However, it can be helpful to use a test framework that allows you to easily time operations that are being performed.
One item to note is that load/performance tests should not be combined with tests that verify functionality.
Developer-focused tests need to run quickly so that the developer doesn’t spend time waiting on them to complete. Load tests in particular, and most performance tests, require a large volume of operations, so the tests tend to take more time.
Long-running load tests don't mesh well with quick functionality tests. You'll find it much easier to manage if you keep a clear separation between these types of testing.
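One way to keep that separation is to tag performance checks so they only run in a dedicated suite. The sketch below assumes a hypothetical stored procedure and a 30-second threshold; the custom pytest marker would be registered in your pytest configuration.

```python
# A sketch of a performance check kept separate from the fast functional tests.
import time
import pyodbc
import pytest

CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=TestDW;Trusted_Connection=yes;"

@pytest.mark.performance  # excluded from the quick, developer-focused runs
def test_nightly_aggregation_runtime():
    conn = pyodbc.connect(CONN_STR, autocommit=True)
    start = time.monotonic()
    conn.cursor().execute("EXEC dbo.RebuildSalesAggregates")
    elapsed = time.monotonic() - start
    # Assert only the time frame here; functional correctness is covered elsewhere.
    assert elapsed < 30.0
```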
This section describes the types of functionality you should consider testing in data-centric applications, along with guidance on how to automate test cases that verify that functionality.
We can't document every possibility in this article, so the discussion stays at a fairly generic, pseudo-code level, using the common Arrange, Act, Assert pattern. Whichever test framework you choose should be capable of handling the requirements of the scenarios below.
SentryOne Test was developed with this practical, data-centric testing approach in mind, so it supports these scenarios. Other testing frameworks can be used as well, though some might require additional work.
When working with SQL databases, you will want to test the structure, the data, and the various ways that the data can be manipulated. This primarily includes the tables and views in the database.
From a testing standpoint, you want to verify that the object exists and that it contains the correct columns with the correct data types. You might also want to verify that calculated columns and check constraints are set up properly.
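A minimal sketch of a structural test is shown below, using INFORMATION_SCHEMA to check that a table exists and that key columns have the expected types. The table dbo.Customer and its columns are assumptions for illustration.

```python
# A sketch of a structural (schema) test against SQL Server metadata views.
import pyodbc

CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=TestDB;Trusted_Connection=yes;"

def test_customer_table_structure():
    cur = pyodbc.connect(CONN_STR).cursor()
    rows = cur.execute(
        """
        SELECT COLUMN_NAME, DATA_TYPE
        FROM INFORMATION_SCHEMA.COLUMNS
        WHERE TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'Customer'
        """
    ).fetchall()
    columns = {name: data_type for name, data_type in rows}

    # Assert the object exists (it has columns) and that the columns have the right types.
    assert columns, "dbo.Customer does not exist or has no columns"
    assert columns["CustomerId"] == "int"
    assert columns["CustomerName"] == "nvarchar"
    assert columns["CreatedDate"] == "datetime2"
```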
You will also want to test stored procedures, triggers, and functions that are used in the database. Verifying that these are properly implemented requires that you set up the necessary prerequisites and inputs, and that you verify the outcome of executing the routine.
As an example of testing a stored procedure, consider a procedure whose requirement is to evaluate customer records looking for possible duplicates based on names and addresses.
It requires that there are customer records to evaluate, that a threshold for similarity is provided as input, and that a list of customers that exceed the threshold be provided as output.
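A sketch of that test follows. The procedure name dbo.FindPossibleDuplicates, its @Threshold parameter, and the table layout are assumptions standing in for your own routine.

```python
# A sketch of a stored procedure test: arrange customer data, execute the
# procedure with a threshold, and assert the returned duplicate candidates.
import pyodbc

CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=TestDB;Trusted_Connection=yes;"

def test_find_possible_duplicates():
    conn = pyodbc.connect(CONN_STR, autocommit=True)
    cur = conn.cursor()
    # Arrange: customer records to evaluate, including one near-duplicate pair.
    cur.execute("TRUNCATE TABLE dbo.Customer")
    cur.execute(
        "INSERT INTO dbo.Customer (CustomerId, CustomerName, Address) VALUES "
        "(1, 'John Smith', '100 Main St'), "
        "(2, 'Jon Smith',  '100 Main Street'), "
        "(3, 'Mary Jones', '42 Elm Ave')"
    )
    # Act: execute the procedure with a similarity threshold as input.
    duplicates = cur.execute(
        "EXEC dbo.FindPossibleDuplicates @Threshold = ?", 0.8
    ).fetchall()
    # Assert: only the customers that exceed the threshold are returned.
    returned_ids = {row.CustomerId for row in duplicates}
    assert returned_ids == {1, 2}
```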
SSIS can be among the most heavily used components of the BI stack, and thus requires rigorous testing. An area for special focus is the Data Flow Task, which handles the bulk of the work in most packages. Two types of tests for the data flow are fairly standard: validating the number of rows loaded and comparing the loaded data against the source.
Further, your requirements might also dictate performance levels that a package must achieve. These should also be validated, for example by verifying that the package executes within a predefined run time. As noted earlier in the performance and load testing section, such checks should be split into separate tests if they involve load or volume testing, so they can be isolated as necessary.
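Here is a sketch of what an SSIS package test against a small test data set might look like. It assumes a file-deployed package run with the dtexec command-line utility (which must be on the PATH); the package path, table names, and run-time threshold are illustrative.

```python
# A sketch of an SSIS package test: run the package with dtexec, then validate
# the row count against the source and check the run time.
import subprocess
import time
import pyodbc

SOURCE = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=src;DATABASE=Sales;Trusted_Connection=yes;"
TARGET = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dw;DATABASE=SalesDW;Trusted_Connection=yes;"

def test_load_sales_package():
    # Act: execute the package and capture how long it takes.
    start = time.monotonic()
    result = subprocess.run(["dtexec", "/F", r"C:\Packages\LoadSales.dtsx"])
    elapsed = time.monotonic() - start

    # Assert: the package succeeded within the agreed run time
    # (this is not a load test; it runs against a small test data set).
    assert result.returncode == 0
    assert elapsed < 300

    # Assert: the number of rows loaded matches the source.
    src_count = pyodbc.connect(SOURCE).cursor().execute(
        "SELECT COUNT(*) FROM dbo.Orders").fetchone()[0]
    dw_count = pyodbc.connect(TARGET).cursor().execute(
        "SELECT COUNT(*) FROM dw.FactSales").fetchone()[0]
    assert src_count == dw_count
```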
Additionally, the Execute SQL Task often needs validation as it has the ability to modify data or to return data that can have an impact on the other tasks in the package.
One common test covers an Execute SQL Task that runs a query and returns a value that is placed in a variable.
In addition to returning data, many Execute SQL Tasks also perform commands against the target system. These can be tested by querying the target database. For example, a test can verify that an Execute SQL Task truncates a target table.
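A short sketch of that truncation check is shown below; the package path and staging table are assumptions, and the package containing the Execute SQL Task is again run via dtexec.

```python
# A sketch of verifying that an Execute SQL Task truncated its target table.
import subprocess
import pyodbc

TARGET = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dw;DATABASE=SalesDW;Trusted_Connection=yes;"

def test_staging_table_is_truncated():
    conn = pyodbc.connect(TARGET, autocommit=True)
    cur = conn.cursor()
    # Arrange: make sure there is something to truncate.
    cur.execute("INSERT INTO staging.DailySales (OrderId, CustomerId, Amount) VALUES (99, 1, 1.00)")
    # Act: run the package containing the Execute SQL Task.
    assert subprocess.run(["dtexec", "/F", r"C:\Packages\PrepareStaging.dtsx"]).returncode == 0
    # Assert: the task left the target table empty.
    assert cur.execute("SELECT COUNT(*) FROM staging.DailySales").fetchone()[0] == 0
```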
SQL Server Analysis Services (SSAS) has three main areas that you will want to test: dimensions, measures, and calculated members. With dimensions, there are a few different aspects to consider. First, you should ensure that expected members exist in the database. Next, you should validate the count of those members. Finally, you should test the accuracy of any calculated members of the dimension. Testing a product dimension in this way is a typical example scenario.
The second area of focus for SQL Server Analysis Services (SSAS) testing revolves around measures. These tests tend to be straightforward, comparing the aggregated values from the cube against similar aggregations from the source systems.
Finally, it's important that you test the accuracy of the cube’s calculated measures. This could involve calculating the expected value by hand, then hard coding it into the test.
Alternatively, you might want to recreate the calculation in a query against source data. Either way, the basic workflow for validating calculations is the same.
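The sketch below illustrates measure and calculated-measure checks. Querying a cube from Python normally requires a bridge such as ADOMD.NET or the XMLA endpoint, so run_mdx() here is a hypothetical helper that executes an MDX query and returns a single value; the cube, measure, and table names are borrowed from the AdventureWorks samples purely for illustration.

```python
# A sketch of SSAS measure and calculated-measure tests, focused on the
# comparison logic rather than the connectivity details.
import pyodbc

SOURCE = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=src;DATABASE=Sales;Trusted_Connection=yes;"

def run_mdx(mdx: str) -> float:
    # Hypothetical helper: bridge to your SSAS instance (ADOMD.NET, XMLA, etc.).
    raise NotImplementedError("Connect this to your Analysis Services instance.")

def test_internet_sales_measure_matches_source():
    # Compare the cube's aggregate against the same aggregation from the source system.
    cube_total = run_mdx("SELECT [Measures].[Internet Sales Amount] ON 0 FROM [Adventure Works]")
    source_total = float(pyodbc.connect(SOURCE).cursor().execute(
        "SELECT SUM(SalesAmount) FROM dbo.InternetSales").fetchone()[0])
    assert abs(cube_total - source_total) < 0.01

def test_gross_profit_margin_calculated_measure():
    # Validate a calculated measure against a value recalculated from source data
    # (or against a hand-calculated, hard-coded expectation).
    cube_margin = run_mdx("SELECT [Measures].[Gross Profit Margin] ON 0 FROM [Adventure Works]")
    expected = float(pyodbc.connect(SOURCE).cursor().execute(
        "SELECT (SUM(SalesAmount) - SUM(TotalProductCost)) / SUM(SalesAmount) "
        "FROM dbo.InternetSales"
    ).fetchone()[0])
    assert abs(cube_margin - expected) < 0.0001
```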
Testing related to Reporting Services falls into two areas: 1) validation of values and 2) ensuring reports execute successfully within a predefined time period. As part of the test, you'll need to execute the report.
After executing the report, you would retrieve a value from the report, and then compare that to a known value, whether it is manually set or calculated from the source data. Most commonly, the grand totals of the report are used for this purpose.
Retrieving values from SSRS reports can be complicated. You'll often find it easiest to export the report to an XML format; once you have it in XML, you can use XPath queries to locate specific values in the report. For a sales report, for example, the steps would be to export the report, locate the grand total with an XPath query, and compare it to the expected value. Many reports also have performance requirements associated with them.
You can validate that the report runs in the expected time frame by capturing the runtime for the report, and comparing that to an expected value. Again, as noted in the performance and load testing section, these tests should be split into separate tests if they are performing load or volume testing, so they can be isolated as necessary.
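A sketch of both checks is shown below, using SSRS URL access with rs:Format=XML to render the report. The report URL, the element name used to find the grand total (which depends on how the report's textboxes are named), the expected values, and the omitted authentication details are all assumptions specific to your environment.

```python
# A sketch of an SSRS test: export the report to XML, locate the grand total
# with an XPath query, and check the run time against the agreed threshold.
import time
import xml.etree.ElementTree as ET
import requests

REPORT_URL = (
    "http://reports.example.com/ReportServer"
    "?/Sales/MonthlySalesSummary&rs:Format=XML&Month=2023-12"
)

def test_monthly_sales_report():
    # Act: execute the report and capture the run time.
    start = time.monotonic()
    response = requests.get(REPORT_URL, timeout=120)
    elapsed = time.monotonic() - start
    assert response.status_code == 200
    assert elapsed < 60  # the report's agreed performance requirement

    # Assert: the grand total in the rendered XML matches the expected value.
    # The element name (and any XML namespace) comes from the report definition.
    root = ET.fromstring(response.content)
    total_element = root.find(".//GrandTotal")
    assert total_element is not None
    assert float(total_element.text) == 1250000.00
```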
One question that often comes up when testing data-centric applications is how to manage test data. A typical data-centric solution involves several data stores, from the source systems that feed it to the destination data warehouse or data mart.
Each of these data stores is a potential area where you will need to create or load test data. In addition, some data integration processes deal with text files or other non-relational sources—you will also need test data that represents these inputs. If your data-centric application is a data mart or warehouse, you might be wondering whether you need test data for the warehouse itself—after all, you can just load the data from the source system as needed.
While this approach is an option, it’s recommended that you have separate test data for each data store/input source you deal with. The reason for this is that it's difficult to create effective unit tests if you rely on processes unrelated to the unit of code under test to set up the test data. It’s very common that these upstream processes can be modified in a way that breaks downstream unit tests, so keeping them isolated is a better approach.
Test data generally falls into two categories: 1) specific data sets to validate known scenarios and 2) large volumes of data that represent a broad sample of the types of data the solution might possibly encounter. The specific, well-known data sets are most often used for unit and integration testing, and generally represent a small number of rows that are designed to exercise any conditional paths in the solution.
These data sets cannot be randomly generated, as they need to provide specific values to make sure the test conditions are met. By the same token, extracting this data from production data generally doesn’t result in a data set that exercises all the conditional paths. Often, these data sets have to be created by hand. It’s very useful to leverage a test framework that has support for creating and managing these targeted data sets.
Large-volume data sets are used for load and stress testing. They are also useful for sanity checks on the solution—verifying that out-of-range data is handled or that a sample of production data can be processed successfully. Random data generators and extracting data sets from production are valuable approaches to creating these types of data sets.
If you do extract production data for test purposes, there will often be a requirement to mask or strip certain pieces of information. In these cases, it can be helpful to combine random data generation with the extract process, to fill in any gaps in the extracted data.
Regardless of the type of test data, look for test frameworks that include or can be modified to include the management of the test data sets. You will find creating tests much easier if you have simple ways to load test data, reset the data sources, and persist the test data outside of the database itself.
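As one simple approach, the sketch below keeps targeted data sets as CSV files in source control and provides a helper that resets a table and loads the data before a test runs. The file layout, table names, and columns are illustrative assumptions.

```python
# A sketch of simple test data management: hand-crafted data sets live as CSV
# files alongside the tests, and a helper resets and loads them on demand.
import csv
import pyodbc

CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=TestDB;Trusted_Connection=yes;"

def load_test_data(cursor, table: str, csv_path: str) -> None:
    """Reset the target table and load a hand-crafted data set from a CSV file."""
    cursor.execute(f"TRUNCATE TABLE {table}")
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        columns = next(reader)  # header row names the target columns
        placeholders = ", ".join("?" for _ in columns)
        insert = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
        for row in reader:
            cursor.execute(insert, *row)

def test_customer_scoring_scenario():
    conn = pyodbc.connect(CONN_STR, autocommit=True)
    cur = conn.cursor()
    # Arrange: load the well-known scenario data set checked in with the test.
    load_test_data(cur, "dbo.Customer", "testdata/customer_scoring_scenario.csv")
    # Act / Assert: exercise the code under test against that known data.
    count = cur.execute("SELECT COUNT(*) FROM dbo.Customer").fetchone()[0]
    assert count > 0
```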
Continuous Integration/Continuous Delivery (CI/CD) is a development practice in which any changes checked into the team's source control system are immediately compiled, analyzed, and tested, so that developers get immediate feedback on the state of the solution. Continuous delivery builds on this with a set of processes that enables teams to build solutions in short iterations while keeping the solution in a releasable state at all times.
This approach can lead to a much faster time to value and enables users to see and respond to changes in the solution much more quickly. CI/CD is quite popular in traditional application environments, but there are some challenges in implementing it for data-centric applications.
One of the challenges is that in order to ensure the solution is in a releasable state at all times, you need to be able to test any changes quickly and efficiently. If extensive manual testing is required, this process becomes very difficult. This is the reason that many people feel that automated testing of a significant portion of the solution is mandatory to truly implement continuous delivery.
Once you have an automated suite of tests, it becomes much easier to have your data-centric applications participate in CI/CD processes. Since the tests are automated, they can be added to integration and build processes.
Most automated testing frameworks support this capability, at a minimum by using a command-line tool. Some feature the option for directly integrating the test execution and result evaluation into the build tools.
You don’t have to do CI/CD to use the data-centric testing approach. However, you'll find it much easier to implement CI/CD if you follow the approach of automating the tests for your data-centric applications. CI/CD can add significant flexibility to your team’s ability to deliver useful business results in a timely manner and should be something that you evaluate when looking at improvements in your delivery process.
In this article, we covered what you need to test during development of your data-centric applications and which types of testing you should consider. We also looked at several examples of test cases that you can implement in your automated testing framework of choice.
The next article in this series, "Verifying Data in Production," discusses the importance of testing your data and data-centric applications even after development has finished and the solution is in production. It also covers several techniques that you'll find useful for this type of testing.