T-SQL UNION Operator: A How-To Guide
Paul S. Randal
Published On: March 22, 2022
Categories: T SQL, Best Practices, SQL Server 0
What exactly is a UNION operator? A UNION operation is defined as concatenating the results of two or more queries into a single result set.
I remember a client struggling to bring data from several tables into a single result set using a complex OR query. I asked whether there was any overlap in the data in the various tables, and he said the most perfect five words, "Oh, no, they're mutually exclusive."
You may be wondering why I described his response as "perfect." Well, having mutually exclusive data meant there was no possibility of duplicates for him to be concerned about, and a simple UNION ALL with a straightforward and efficient query plan would work.
However, UNION ALL may not be the best option in some cases, which is why it's crucial to understand your data and how query operators work under the covers, so your query has the most optimal execution plan. In this post, I discuss the UNION and UNION ALL set operators in detail, including use cases for each and common performance considerations.
What Is a UNION Query? General Syntax Example
Here's a simple T-SQL UNION example:
There are a couple of restrictions for using a UNION:
- All queries in the UNION must return the same number of columns.
- The columns for each result set must have compatible data types. For example, combining an int column and a bigint column is fine, but attempting to combine a uniqueidentifier column with a decimal column would result in an error. Also, the order you define the column in your SELECT statement matters as the results are concatenated ordinally.
The queries in the above UNION statement are simple SELECT statements to show the overall syntax. However, the queries being concatenated together can be quite complex, including things like subqueries, aggregates, GROUP BY, and HAVING clauses. ORDER BY is a special case for the UNION clause.
UNION vs. Union ALL
When combining results using UNION, it’s important to understand how duplicate values should be treated. A UNION will remove any duplicate values returned from concatenating various result sets. The following query returns a unique list of numbers, from 1 to 10. While the values 1 and 7 are included in both result sets, the duplicates are filtered out before the data is returned, as shown below:
If duplicate rows aren’t a concern, or you know there are no duplicates, consider using the T-SQL UNION ALL instead of UNION, as UNION ALL doesn’t attempt to filter out any duplicate rows. The following query is the same as the previous UNION query but uses UNION ALL instead of UNION:
You can see the results are returned in the order in which they’re defined in the row constructor, and duplicates aren’t removed. The reason I mention the order in which the rows are returned here compared to the UNION example is a UNION must perform a sort to remove duplicate rows. Result sets from a UNION statement are often ordered when returned, but you cannot depend on this. The only way to guarantee a result set is ordered in a specific manner is to include the ORDER BY clause as part of the statement.
The Performance Expense of Sorting Data With UNION vs. UNION ALL
One of the biggest performance problems I've seen with using the UNION clause is when an expensive sort operation must be performed. So, as a rule, only use a UNION if you need to remove duplicate values or use a UNION ALL.
When a sort operation occurs, a request is made for some memory to perform the operation, known as a memory grant. Memory grants have an initial requirement for how much memory is required to begin the sort operation, and there is additional memory needed to perform the operation entirely in memory. Performing a sort in memory is much faster than spilling the data to tempdb and performing the sort operation there. However, sometimes cardinality estimations are incorrect, and SQL Server guesses it can fit all rows in memory, but it cannot, leading to an expensive sort spill to tempdb.
The following is a warning you might see in an execution plan for a sort operation if it has to spill to tempdb:
Using ORDER BY With a UNION Operation
By definition, any result set is unordered, but there’s often a need to order the results from a UNION. To order results from a UNION, use an ORDER BY clause placed after the last statement of the UNION. Consider the following query:
The ORDER BY clause is always evaluated last in a SQL statement. In the above example, the two queries are combined, and then ordering occurs. However, it’s important to understand the column name/alias you must reference in the ORDER BY clause.
The column names returned in a UNION operation are always based on the result set of the first SELECT statement, so you must define column aliases in the first SELECT statement. You must also refer to these column names if you use an ORDER BY clause. Here’s an example of this:
What Is the Difference Between UNION and JOIN?
The UNION clause combines the results of two or more queries. The data from the queries don’t need to be related as long as each query has the same number of columns the data types are compatible
A JOIN operation combines rows from different tables where there’s a relationship between those tables. Depending on the type of JOIN, the rows returned may be the same between the tables being joined (e.g. INNER JOIN), or one table may be preserved. If there are matching rows in other tables, they’ll be included (e.g. OUTER JOIN).
UNION vs. OR
One practical application of the UNION operator is to combine similar queries from the same set of tables together to form a new result set. It's been my experience to see application developers use the pattern of having similar queries with equality comparisons in the predicates because it's easier to write or generate through a tool and seems more efficient than one single query including an OR comparison in the predicate. A simple example of this would be as follows, with the associated query plan:
You'll notice in the plan the SalesOrderDetail table is referenced twice; one scan for the lookup of the ProductID value 712 and another scan for the ProductID value of 870. To UNION these, SQL Server must run the same scan operation twice. Using the output of SET STATISTICS IO ON, you can see 2492 pages from the buffer pool were accessed to return the results:
A more efficient way to refactor the UNION statement is to simply use the IN clause (a shorthand way of writing an OR clause), so the work of filtering out the ProductID values occurs in a single scan of the table, as shown below:
You can see from the execution plan the table is now only scanned once, and the number of pages read from the buffer pool is cut in half:
It's worth noting there’s a common misconception about using the OR operator in a WHERE clause. Using OR doesn’t always require scanning all the data in a table to return the rows being searched. Factors such as column cardinality, the number of values being searched, and the columns returned in the SELECT list are considered. If there were a covering nonclustered index on the ProductID column (which included SalesOrderID and UnitPrice), you could use a single seek operation instead of scanning the table.
T-SQL Recursive Common Table Expressions
A Common Table Expression (CTE) is a type of temporary named result set to allow you to define a query once and then refer to the query by name in subsequent statements in the same batch. CTEs include the ability to do recursive queries, which is a way for a query to refer to itself until a base case is met. The structure of a recursive CTE is defined by an anchor query and a recursive query. The recursive portion of the query references the name of the CTE and will recurse until its base condition is met.
The critical thing to take away here is a recursive CTE requires a UNION ALL statement to combine the results from the anchor query and all subsequent recursive queries into a final result set.
CTEs are great for hierarchical queries, such as a company organization structure. Here's an example:
UNION ALL in Partitioned Views
While the SQL Server partitioning function can be great from a data manageability perspective, it’s sometimes advantageous to design your own partitioning strategy for tables. Designing a partition gives you more flexibility for data movement and allows you to index each table as needed. You can also take advantage of the statistics for individual tables (vs. a single statistic object for a partitioned table in SQL Server).
Consider the following table definition:
CREATE TABLE SalesQ12021
The important concept here is the constraint on the SaleDate column – this feature will ensure only records for a given quarter of 2021 can be inserted into the table. The real power of this design comes when there are multiple tables for certain date ranges, each with its own date constraints.
Consider the following view using the UNION ALL operator to produce a dataset combining the data from these four tables together:
Notice only the SalesQ42021 table is being referenced. Since each table has constraints in place, the query optimizer knows the query only needs to access a single table to execute this query.
Best Practices for Using UNION and UNION ALL
The UNION and UNION ALL T-SQL query operators are useful for concatenating result sets from two or more SELECT queries. There’s a lot of flexibility with these statements, but you must ensure each SELECT statement returns the same number and compatibility of the columns.
Like in my "perfect" example I opened the post with, try to use the UNION ALL statement when possible, as there can be much overhead involved in sorting the data to return a distinct list for UNION. Knowing your data and operator options can help you achieve the most efficient query execution plan.
Maintaining and improving query performance can also be crucial to the overall health of your SQL Server environment. With SolarWinds® SQL Sentry, you can use the integrated Environment Health Overview (EHO) dashboard for an at-a-glance view designed to allow you to prioritize troubleshooting events causing the most pressing performance issues.
Paul S. Randal is the CEO of SQLskills.com, which he runs with his wife Kimberly L. Tripp. Both Paul and Kimberly are widely-known and respected experts in the SQL Server world, and both are long-time SQL Server MVPs. Paul was a Contributing Editor for TechNet Magazine, where he wrote the bi-monthly SQL Q&A column and feature articles. He also had #1 top-rated workshops and sessions at the PASS Summit and TechEd. Paul is active in the SQL Server community, from user groups to online forums to helping out on Twitter (@PaulRandal – check out the #sqlhelp tag). His popular and widely-referenced blog can be found at https://www.sqlskills.com/blogs/paul/ and he can be reached at email@example.com.