SQL Server T-SQL SELECT DISTINCT Clause Tutorial
The DISTINCT clause is an optional part of the SELECT statement and returns a set of unique records based on the columns included in the SELECT list.
When it comes to optimizing SQL Server performance, using the DISTINCT clause in queries where it isn’t needed is one of the most common and easiest performance problems to fix. Years ago, I worked with a customer where it was a programming standard to include DISTINCT for every query because of a poorly designed data model. Within a couple of days of consulting, we were able to add constraints to the data model and modify the existing queries on the customer’s tables to increase performance by 20%. So, while using DISTINCT to return a unique list of rows can sometimes be necessary, please be aware it comes at a cost to performance.
This post will discuss the DISTINCT clause in detail, including its use cases, edge cases to be aware of, and common performance considerations. Most of the examples use the AdventureWorks sample database.
General Syntax
The general syntax for DISTINCT is the following:
SELECT DISTINCT ProductID, CarrierTrackingNumber FROM Sales.SalesOrderDetail;
The DISTINCT clause may only be used with a SELECT statement and must follow the SELECT keyword. Any column names or derived columns following the DISTINCT keyword will be consolidated to return a unique set of rows across those fields. Behind the scenes, the deduplication of rows generally happens in one of two ways:
- Introducing a sort operator to sort data and remove duplicates
- SQL Server may introduce an aggregation into the query to remove the duplicate records
The execution plan below is for the simple query above, and you can see SQL Server chose to use a Hash Match Aggregation.
DISTINCT and Collations
Notably, when comparing string data types, duplicate values result from the collation used. Collations in SQL Server provide sorting rules, case, and accent sensitivity properties for your data and can be specified at the server level, database level, column level, and expression level. You can read more about collations in the Microsoft documentation here.
Consider the following example where I enter my last name twice into a table:
CREATE TABLE LastNames
(
LastName VARCHAR(255)
);
INSERT INTO LastNames
VALUES ('Randal')
, ('randal');
My SQL Server installation uses the SQL_Latin1_General_CP1_CI_AS collation, which is case-insensitive. When I run the following query against the table I created, a single record is returned.
SELECT DISTINCT LastName FROM LastNames;
However, if I adjust the query and force a case-sensitive expression-level collation, both records are returned. You should be especially aware of this when dealing with case-sensitive collations introducing results you don’t expect.
SELECT DISTINCT
LastName COLLATE Latin1_General_CS_AS
FROM LastNames;
GROUP BY vs. DISTINCT
GROUP BY provides similar functionality to the DISTINCT operator in that it can return a unique list of rows in a result set. However, GROUP BY also allows you to perform aggregate functions (e.g., SUM, COUNT, AVG) on the groups of rows returned by the statement.
The question of, “Should I use GROUP BY or DISTINCT syntax in this query?” is one of the most common questions I’ve seen developers ask. As you may guess, the answer is “It depends!”
The main consideration is if aggregations are included in the query. If so, using GROUP BY is the correct answer. In fact, in most cases, GROUP BY will provide the same functionality when you need to remove duplicates.
One issue I still see is using the SELECT DISTINCT statement where it’s not needed. This can be resolved by simply understanding (and trusting) the data model a bit more to know if a particular query might produce duplicate values (and whether this is an issue or not).
Adding a SELECT DISTINCT clause in a query often introduces an aggregation or sort behind the scenes. This overhead can be too much if the query is operating on a large data set because of the order of operations SQL Server uses (listed below) when it comes to using DISTINCT vs. GROUP BY:
- FROM
- ON
- JOIN
- WHERE
- GROUP BY
- WITH CUBE or WITH ROLLUP
- HAVING
- SELECT
- DISTINCT
- ORDER BY
- TOP
Notice how GROUP BY is applied before the DISTINCT clause. The DISTINCT clause is almost the last clause of a SELECT statement to be processed.
My good friend Aaron Bertrand has an excellent article on comparing and contrasting the intricacies of the GROUP BY and DISTINCT statements here.
Using DISTINCT With ORDER BY
It’s often useful to order the results of a query. For a normal SELECT statement, you can include columns in the ORDER BY that aren’t part of the SELECT list, but this isn’t the case when the SELECT statement includes the DISTINCT clause.
The following query returns the OrderDate and ShipDate columns from Sales.SalesOrderDetail and then orders the result set based on the TerritoryID column. This is a perfectly valid SQL statement:
SELECT OrderDate, ShipDate FROM Sales.SalesOrderHeader ORDER BY TerritoryID;
If I change the above query to include only a DISTINCT list of values, then the query fails due to the ORDER BY clause containing a column not in the SELECT list:
SELECT DISTINCT OrderDate, ShipDate FROM Sales.SalesOrderHeader ORDER BY TerritoryID;
The way to get around this is to include the TerritoryID as part of the SELECT column list:
SELECT DISTINCT
OrderDate,
ShipDate,
TerritoryID
FROM Sales.SalesOrderHeader
ORDER BY TerritoryID;
Using DISTINCT in Aggregate Functions
You can also include the DISTINCT clause in aggregate functions to perform aggregation on a distinct list of values. The most commonly used aggregate function with DISTINCT is the T-SQL COUNT function.
In the following query, I return a list of all ProductID values and CarrierTrackingNumber values from the Sales.SalesOrderDetail table. Notice how the count for the ProductID column is significantly higher than the count of the CarrierTrackingNumber values. This is because the CarrierTrackingNumber column contains NULL values while the ProductID column does not allow NULL values. NULL values aren’t considered in aggregate functions.
SELECT COUNT (ProductID) AS AllProducts, COUNT (CarrierTrackingNumber) AS AllCTN FROM Sales.SalesOrderDetail;
After enabling SET STATISTICS IO ON and getting the execution plan, we can see the query used a clustered index scan and read 1,246 pages:
Now, I’ll introduce the DISTINCT clause inside of a COUNT aggregation on the ProductID column:
SELECT
COUNT (ProductID) AS AllProducts,
COUNT (CarrierTrackingNumber) AS AllCTN,
COUNT (DISTINCT ProductID) AS DistinctProducts
FROM Sales.SalesOrderDetail;
From the returned results, we see there are 266 distinct values in the ProductID column:
Using DISTINCT inside the COUNT function is a helpful way to determine how unique the distribution of values in a column is. However, like everything else regarding computing, there is a trade-off as this aggregation is not cost-free.
Although all the data in the table is already being scanned, SQL Server still needs to reaccess a data structure to perform the DISTINCT aggregation. In this case, because a non-clustered index exists on the ProductID column, SQL Server can scan that index to return the distinct values for the column. Below we see the extra scan of the non-clustered index in the execution plan:
We also see the extra pages read from the buffer pool for the second index scan:
So, what happens if I include a COUNT DISTINCT operation on a column without an existing non-clustered index? In this case, an extra table scan must occur to perform the DISTINCT aggregation. Below I include a COUNT DISTINCT on the CarrierTrackingNumber column, and the column returns 3,806 distinct values:
SELECT
COUNT (ProductID) AS AllProducts,
COUNT (CarrierTrackingNumber) AS AllCTN ,
COUNT (DISTINCT ProductID) AS DistinctProducts,
COUNT (DISTINCT CarrierTrackingNumber) AS DistinctCTN
FROM Sales.SalesOrderDetail;
And the resulting execution plan includes an extra clustered index scan:
This is a crucial point to take away. Just because your query only references a single table doesn’t mean your query will, at maximum, read all of the data in the table a single time.
The IN Clause
The IN clause allows for the specification of multiple search values in a WHERE clause. It’s an alternative to writing multiple OR statements in the predicate. Sub-queries can also be used as a source for the IN clause. This sub-query must return a single column, and the list of values returned will always return a DISTINCT list of values. Notice in the execution plan for the query below, a Stream Aggregate is performed to return a unique list of ProductID
values from the sub-query:
SELECT
*
FROM Production.Product
WHERE ProductID IN
(
SELECT
ProductID
FROM Sales.SalesOrderDetail
);
If I alter the query a bit to include a GROUP BY instead of the simple inner query above, we can see the resulting execution plan is the same:
SELECT
*
FROM Production.Product
WHERE ProductID IN
(
SELECT
ProductID
FROM Sales.SalesOrderDetail
GROUP BY ProductID
);
So, what happens if the inner query returns a list of already unique values? In such a case, SQL Server is smart enough not to perform an aggregation to return a DISTINCT list.
To show this, let’s get a unique list of ProductID values from the SalesOrderDetail table and put them in a temp table. Notice how I include a NULL value in this table (more on why I’m doing this in a moment):
SELECT DISTINCT
ProductID
INTO #DistinctProducts
FROM Sales.SalesOrderDetail
UNION ALL
SELECT
NULL;
Next, I’ll create a UNIQUE clustered index on the table. This will guarantee no duplicate values can exist in the table. Also, UNIQUE indexes allow for a single NULL value in SQL Server:
CREATE UNIQUE CLUSTERED INDEX idx_DistinctProducts ON #DistinctProducts (ProductID);
Now, when I execute the query and use the temp table containing unique values, we can see from the execution plan how the aggregation to return a DISTINCT list of values is no longer needed.
SELECT
*
FROM Production.Product
WHERE ProductID IN
(
SELECT
ProductID
FROM #DistinctProducts
);
Things start to get interesting if I modify the above query a bit to use a NOT IN clause instead of an IN clause:
SELECT
*
FROM Production.Product
WHERE ProductID NOT IN
(
SELECT
ProductID
FROM #DistinctProducts
);
The query returns no results. Recall from the INSERT statement above where I included a single NULL value into the ProductID column of the #DistinctProducts temp table. When an inequality comparison is made against a NULL value, the result is unknown (NULL). When SET ANSI_NULLS is set to ON, which is the default, any NOT IN comparison against a sub-query with a NULL value will return a NULL result set. I’ve seen many developers run into problems with this and not fully understand why no results were returned.
Luckily, the workaround is relatively easy – just include a predicate to filter out any NULL values:
SELECT
*
FROM Production.Product
WHERE ProductID NOT IN
(
SELECT
ProductID
FROM #DistinctProducts
WHERE ProductID IS NOT NULL
);
When to Use DISTINCT in SQL
The DISTINCT clause is part of the SELECT statement and returns a unique set of rows for a query. As seen from the DISTINCT examples above, there can be much overhead in the aggregation or sorting process to return the unique list of rows in the result set.
So, when a unique data set is needed, consider using GROUP BY as an alternative because of its flexibility for performing aggregate functions. Also, only consider using DISTINCT when necessary due to the overhead to remove duplicate rows from the result set.
To monitor SQL Server queries and optimize performance more easily, SolarWinds® SQL Sentry can provide you with the ability to get to the root of query problems using Plan Explorer. Plan Explorer is a built-in tool designed to score algorithms to help you determine the best index to support a given query, view recommended indexes, and more.