T-SQL UNION Operator: A How-To Guide
What exactly is a UNION
operator? A UNION
operation is defined as concatenating the results of two or more queries into a single result set.
I remember a client struggling to bring data from several tables into a single result set using a complex OR
query. I asked whether there was any overlap in the data in the various tables, and he said the most perfect five words, "Oh, no, they're mutually exclusive."
You may be wondering why I described his response as "perfect." Well, having mutually exclusive data meant there was no possibility of duplicates for him to be concerned about, and a simple UNION ALL
with a straightforward and efficient query plan would work.
However, UNION ALL
may not be the best option in some cases, which is why it's crucial to understand your data and how query operators work under the covers, so your query has the most optimal execution plan. In this post, I discuss the UNION
and UNION ALL
set operators in detail, including use cases for each and common performance considerations.
What Is a UNION
Query? General Syntax Example
Here's a simple T-SQL UNION
example:
SELECT FirstName, LastName, AddressLine1, City, State, Zip FROM Customers UNION SELECT FName, LName, Address, CityName, StateName, ZipCode FROM Employees;
There are a couple of restrictions for using a UNION
:
- All queries in the
UNION
must return the same number of columns. - The columns for each result set must have compatible data types. For example, combining an int column and a bigint column is fine, but attempting to combine a uniqueidentifier column with a decimal column would result in an error. Also, the order you define the column in your
SELECT
statement matters as the results are concatenated ordinally.
The queries in the above UNION
statement are simple SELECT
statements to show the overall syntax. However, the queries being concatenated together can be quite complex, including things like subqueries, aggregates, GROUP BY, and HAVING clauses. ORDER BY is a special case for the UNION
clause.
UNION
vs. UNION ALL
>
When combining results using UNION
, it’s important to understand how duplicate values should be treated. A UNION
will remove any duplicate values returned from concatenating various result sets. The following query returns a unique list of numbers, from 1 to 10. While the values 1 and 7 are included in both result sets, the duplicates are filtered out before the data is returned, as shown below:
SELECT Number FROM (VALUES (1), (3), (5), (7), (8), (9)) AS Set1 (Number) UNION SELECT Number FROM (VALUES (1), (2), (4), (6), (7), (10)) AS Set2 (Number);
If duplicate rows aren’t a concern, or you know there are no duplicates, consider using the T-SQL UNION ALL
instead of UNION
, as UNION ALL
doesn’t attempt to filter out any duplicate rows. The following query is the same as the previous UNION
query but uses UNION ALL
instead of UNION
:
SELECT Number FROM (VALUES (1), (3), (5), (7), (8), (9)) AS Set1 (Number) UNION ALL SELECT Number FROM (VALUES (1), (2), (4), (6), (7), (10)) AS Set2 (Number);
You can see the results are returned in the order in which they’re defined in the row constructor, and duplicates aren’t removed. The reason I mention the order in which the rows are returned here compared to the UNION
example is a UNION
must perform a sort to remove duplicate rows. Result sets from a UNION
statement are often ordered when returned, but you cannot depend on this. The only way to guarantee a result set is ordered in a specific manner is to include the ORDER BY clause as part of the statement.
The Performance Expense of Sorting Data With UNION
vs. UNION ALL
One of the biggest performance problems I've seen with using the UNION
clause is when an expensive sort operation must be performed. So, as a rule, only use a UNION
if you need to remove duplicate values or use a UNION ALL
.
When a sort operation occurs, a request is made for some memory to perform the operation, known as a memory grant. Memory grants have an initial requirement for how much memory is required to begin the sort operation, and there is additional memory needed to perform the operation entirely in memory. Performing a sort in memory is much faster than spilling the data to tempdb and performing the sort operation there. However, sometimes cardinality estimations are incorrect, and SQL Server guesses it can fit all rows in memory, but it cannot, leading to an expensive sort spill to tempdb.
The following is a warning you might see in an execution plan for a sort operation if it has to spill to tempdb:
Using ORDER BY With a UNION
Operation
By definition, any result set is unordered, but there’s often a need to order the results from a UNION
. To order results from a UNION
, use an ORDER BY clause placed after the last statement of the UNION
. Consider the following query:
SELECT AddressLine1, City FROM Person.Address WHERE City = 'Redmond' UNION SELECT a.AddressLine1, City FROM Sales.SalesOrderHeader h JOIN Person.Address a ON h.BillToAddressID = a.AddressID JOIN Person.StateProvince sp ON a.StateProvinceID = sp.StateProvinceID WHERE sp.name = 'Washington AND City = 'Redmond' ORDER BY AddressLine1 ASC;
The ORDER BY clause is always evaluated last in a SQL statement. In the above example, the two queries are combined, and then ordering occurs. However, it’s important to understand the column name/alias you must reference in the ORDER BY clause.
Column Aliases
The column names returned in a UNION
operation are always based on the result set of the first SELECT
statement, so you must define column aliases in the first SELECT
statement. You must also refer to these column names if you use an ORDER BY clause. Here’s an example of this:
SELECT pp.LastName AS LName, pp.FirstName AS FName, 'SalesPerson' AS OfficalJobTitle FROM Person.Person AS pp JOIN Sales.SalesPerson AS e ON e.BusinessEntityID = pp.BusinessEntityID UNION ALL SELECT pp.LastName, pp.FirstName, e.JobTitle FROM Person.Person AS pp JOIN HumanResources.Employee AS e ON e.BusinessEntityID = pp.BusinessEntityID ORDER BY LName ASC;
What Is the Difference Between UNION
and JOIN
?
The UNION
clause combines the results of two or more queries. The data from the queries don’t need to be related as long as each query has the same number of columns the data types are compatible
A JOIN
operation combines rows from different tables where there’s a relationship between those tables. Depending on the type of JOIN
, the rows returned may be the same between the tables being joined (e.g. INNER JOIN
), or one table may be preserved. If there are matching rows in other tables, they’ll be included (e.g. OUTER JOIN
).
UNION
vs. OR
One practical application of the UNION
operator is to combine similar queries from the same set of tables together to form a new result set. It's been my experience to see application developers use the pattern of having similar queries with equality comparisons in the predicates because it's easier to write or generate through a tool and seems more efficient than one single query including an OR
comparison in the predicate. A simple example of this would be as follows, with the associated query plan:
SELECT ProductID, SalesOrderID, UnitPrice FROM Sales.SalesOrderDetail WHERE ProductID = 712 UNION SELECT ProductID, SalesOrderID, UnitPrice FROM Sales.SalesOrderDetail WHERE ProductID = 870;
You'll notice in the plan the SalesOrderDetail
table is referenced twice; one scan for the lookup of the ProductID
value 712
and another scan for the ProductID
value of 870
. To UNION
these, SQL Server must run the same scan operation twice. Using the output of SET STATISTICS IO ON
, you can see 2492 pages from the buffer pool were accessed to return the results:
A more efficient way to refactor the UNION
statement is to simply use the IN clause (a shorthand way of writing an OR
clause), so the work of filtering out the ProductID
values occurs in a single scan of the table, as shown below:
SELECT ProductID, SalesOrderID, UnitPrice FROM Sales.SalesOrderDetail WHERE ProductID IN (712, 870);
You can see from the execution plan the table is now only scanned once, and the number of pages read from the buffer pool is cut in half:
It's worth noting there’s a common misconception about using the OR
operator in a WHERE
clause. Using OR
doesn’t always require scanning all the data in a table to return the rows being searched. Factors such as column cardinality, the number of values being searched, and the columns returned in the SELECT
list are considered. If there were a covering nonclustered index on the ProductID
column (which included SalesOrderID
and UnitPrice
), you could use a single seek operation instead of scanning the table.
T-SQL Recursive Common Table Expressions
A Common Table Expression (CTE) is a type of temporary named result set to allow you to define a query once and then refer to the query by name in subsequent statements in the same batch. CTEs include the ability to do recursive queries, which is a way for a query to refer to itself until a base case is met. The structure of a recursive CTE is defined by an anchor query and a recursive query. The recursive portion of the query references the name of the CTE and will recurse until its base condition is met.
The critical thing to take away here is a recursive CTE requires a UNION ALL
statement to combine the results from the anchor query and all subsequent recursive queries into a final result set.
CTEs are great for hierarchical queries, such as a company organization structure. Here's an example:
;WITH DirectReportsCTE ( ManagerID, ManagerName, EmployeeID, EmployeeName, Title, Level ) AS ( --base/anchor case SELECT e.ManagerID, CAST (NULL AS VARCHAR (60)), e.EmployeeID, e.FullName, e.Title, 0 AS Level FROM dbo.CompanyEmployees AS e WHERE ManagerID IS NULL UNION ALL --recurse SELECT e.ManagerID, CAST (d.EmployeeName AS VARCHAR (60)), e.EmployeeID, e.FullName, e.Title, Level + 1 FROM dbo.SQLSkillsEmployees AS e INNER JOIN DirectReportsCTE AS d ON e.ManagerID = d.EmployeeID ) SELECT ManagerID, ManagerName, EmployeeID, EmployeeName, Title, Level FROM DirectReportsCTE;
UNION ALL
in Partitioned Views
While the SQL Server partitioning function can be great from a data manageability perspective, it’s sometimes advantageous to design your own partitioning strategy for tables. Designing a partition gives you more flexibility for data movement and allows you to index each table as needed. You can also take advantage of the statistics for individual tables (vs. a single statistic object for a partitioned table in SQL Server).
Consider the following table definition:
CREATE TABLE SalesQ12021 ( SalesOrderID int NOT NULL, SalesOrderDetailID int NOT NULL, OrderQty smallint NOT NULL, ProductID int NOT NULL, UnitPrice money NOT NULL, UnitPriceDiscount money NOT NULL, SaleDate datetime NOT NULL CHECK (SaleDate >= '1/1/2021' AND SaleDate < '4/1/2021'), CONSTRAINT pk_SalesQ12021 PRIMARY KEY CLUSTERED (SalesOrderID, SalesOrderDetailID) );
The important concept here is the constraint on the SaleDate column – this feature will ensure only records for a given quarter of 2021 can be inserted into the table. The real power of this design comes when there are multiple tables for certain date ranges, each with its own date constraints.
Consider the following view using the UNION ALL
operator to produce a dataset combining the data from these four tables together:
CREATE VIEW vwSales AS SELECT SalesOrderID, SalesOrderDetailID, OrderQty, ProductID, UnitPrice, UnitPriceDiscount, SaleDate FROM SalesQ12021 UNION ALL SELECT SalesOrderID, SalesOrderDetailID, OrderQty, ProductID, UnitPrice, UnitPriceDiscount, SaleDate FROM SalesQ22021 UNION ALL SELECT SalesOrderID, SalesOrderDetailID, OrderQty, ProductID, UnitPrice, UnitPriceDiscount, SaleDate FROM SalesQ32021 UNION ALL SELECT SalesOrderID, SalesOrderDetailID, OrderQty, ProductID, UnitPrice, UnitPriceDiscount, SaleDate FROM SalesQ42021; A query against this view with a filter on the SaleDate column would produce the following execution plan: SELECT COUNT (*) AS RowCount FROM vwSales WHERE SaleDate >= '10/11/2021' AND SaleDate < '10/12/2021';
Notice only the SalesQ42021 table is being referenced. Since each table has constraints in place, the query optimizer knows the query only needs to access a single table to execute this query.
Best Practices for Using UNION
and UNION ALL
The UNION
and UNION ALL
T-SQL query operators are useful for concatenating result sets from two or more SELECT
queries. There’s a lot of flexibility with these statements, but you must ensure each SELECT
statement returns the same number and compatibility of the columns.
Like in my "perfect" example I opened the post with, try to use the UNION ALL
statement when possible, as there can be much overhead involved in sorting the data to return a distinct list for UNION
. Knowing your data and operator options can help you achieve the most efficient query execution plan.
Maintaining and improving query performance can also be crucial to the overall health of your SQL Server environment. With SolarWinds SQL Sentry, you can use the integrated Environment Health Overview (EHO) dashboard for an at-a-glance view designed to allow you to prioritize troubleshooting events causing the most pressing performance issues.