#BackToBasics: Common Table Expressions (CTEs)
At the end of last year, I accepted a challenge for 2016: to publish one post per month that makes fewer assumptions about the reader's knowledge. In the first post, which came up rather quickly I might add, I'm going to talk about a concept in SQL Server introduced in SQL Server 2005: the Common Table Expression (CTE).
What is it?
A CTE is probably best described as a temporary inline view - in spite of its official name, it is not a table, and it is not stored (like a #temp table or @table variable). It operates more like a derived table or subquery, and can only be used for the duration of a single SELECT, UPDATE, INSERT, or DELETE statement (though it can be referenced multiple times within that statement). It can also be used to drive a MERGE statement, but I'm not a big fan of those.
Paul White (@SQL_Kiwi) once said that a better name might have been Common View Expression (CVE); I think Temporary View or Inline View would have been more on point as well.
People don't generally know to call it a CTE because the label of CTE isn't in the syntax - it simply appears like this, using the keyword
;WITH some_alias_name AS (SELECT ...
This is not the only use for
WITH in Transact-SQL - you may also have seen it used for table hints (e.g.
WITH (NOLOCK)), database setting options (e.g.
WITH ROLLBACK IMMEDIATE), index options (e.g.
WITH (DROP_EXISTING = ON)), backup options (e.g.
WITH RECOVERY) and others.
Why is it used?
A CTE can serve many purposes, some of which could be swapped with derived table or subquery variants.
Eliminate redundant expressions
Sometimes you have a complex expression that you reference multiple times, such as:
SELECT h = MAX(DATEADD(DAY, 30, column_name)), l = MIN(DATEADD(DAY, 30, column_name)), a = AVG(DATEADD(DAY, 30, column_name)) FROM dbo.some_table;
A CTE can simplify this by only having to run the
DATEADD expression once:
;WITH cte AS ( SELECT d = DATEADD(DAY, 30, column_name) FROM dbo.some_table ) SELECT h = MAX(d), l = MIN(d), a = AVG(d) FROM cte;
Of course, as I suggested, this could also be written using a nested subquery:
SELECT h = MAX(d), l = MIN(d), a = AVG(d) FROM ( SELECT d = DATEADD(DAY, 30, column_name) FROM dbo.some_table ) AS nsq;
Which of these forms you use is subjective (and will perform the same), but I have found myself leaning toward CTEs because, unlike nested subqueries, they can be referenced multiple times without being repeated. For example:
;WITH cte AS ( SELECT d = DATEADD(DAY, 30, column_name) FROM dbo.some_table ) SELECT MAX(d) FROM cte UNION ALL SELECT MIN(d) FROM cte;
This doesn't mean that
dbo.some_table is only touched once, though. More on that in a moment.
A couple of other things I want to point out here:
- You can stack / nest / cascade CTEs, by separating them with a comma:
;WITH cte1 AS (SELECT x = 5), cte2 AS (SELECT x = x + 1 FROM cte1), cte3 AS (SELECT y = x + 4 FROM cte2) SELECT y FROM cte3;
- You can provide the column names alongside the CTE name, rather than relying on aliases assigned within the CTE:
;WITH cte(col1, col2) AS (SELECT 5, 'foo') SELECT col1, col2 FROM cte;
- You cannot use a CTE in multiple subsequent queries. As mentioned above, this is not like a #temp table; people often want to do something like this, but at the point of the second query, the CTE no longer exists:
;WITH cte AS (SELECT x = 5) SELECT x FROM cte; SELECT x FROM cte;
Prior to SQL Server 2012, CTEs were often used to mimic the functionality that eventually arrived with
LAG/LEAD, where you could reference the "previous" or "next" row. This would mean using some kind of windowing function, like
ROW_NUMBER(), and joining the CTE to itself in order to find the previous or next row. As an example, let's say I have a simple table like this, with a sequence of events:
CREATE TABLE #log ( EventType nvarchar(32), EventTime datetime ); INSERT #log(EventType, EventTime) VALUES(N'start', '20160106 12:00:00'), (N'middle', '20160106 12:20:00'), (N'almost done', '20160106 12:30:00'), (N'end', '20160106 12:45:00');
Now I need a query to show me how long each event took. I could write a CTE like the following, which joins to itself to find the previous row based on the
;WITH cte AS ( SELECT EventType, EventTime, rn = ROW_NUMBER() OVER (ORDER BY EventTime) FROM #log ) SELECT x.EventType, x.EventTime, duration = DATEDIFF(MINUTE, x.EventTime, y.EventTime) FROM cte AS x -- initial row LEFT OUTER JOIN cte AS y -- "next" row ON x.rn = y.rn - 1;
We use an outer join here so that we can still see the row that represents the end, even though that row doesn't technically have a duration:
EventType EventTime duration ----------- ----------------------- -------- start 2016-01-06 12:00:00.000 20 middle 2016-01-06 12:20:00.000 10 almost done 2016-01-06 12:30:00.000 15 end 2016-01-06 12:45:00.000 NULL
In SQL Server 2012 and up, we could express this query a little simpler and without a CTE, and get the exact same results:
SELECT EventType, EventTime, duration = DATEDIFF(MINUTE, EventTime, LEAD(EventTime, 1) OVER (ORDER BY EventTime)) FROM #log;
In this case, there is a reason to use
LEAD rather than a CTE, and I already alluded to it above: there is a performance difference because the
LEAD variant only references the underlying table once. Here are the plans (click either to enlarge):
Paging is a long-established process of allowing a user to view a large result set, n rows at a time. The most prolific example is Google or Bing, which typically show 10 search results per page. Again, prior to SQL Server 2012, paging wasn't as simple as
OFFSET/FETCH alone does not necessarily provide any benefits aside from syntactically. If we look at the 2005 variation (lifted directly from Robert Cary's SSC article, SQL Server 2005 Paging – The Holy Grail):
DECLARE @startRow INT ; SET @startrow = 50 ;WITH cols AS ( SELECT table_name, column_name, ROW_NUMBER() OVER(ORDER BY table_name, column_name) AS seq, ROW_NUMBER() OVER(ORDER BY table_name DESC, column_name desc) AS totrows FROM [INFORMATION_SCHEMA].columns ) SELECT table_name, column_name, totrows + seq -1 as TotRows FROM cols WHERE seq BETWEEN @startRow AND @startRow + 49 ORDER BY seq
This uses a CTE, and two
ROW_NUMBER() functions used to traverse the table in each direction. A translation of that query to use the new syntax, without a CTE, would be:
SELECT table_name, column_name, COUNT(*) OVER() AS totrows FROM INFORMATION_SCHEMA.COLUMNS ORDER BY table_name, column_name OFFSET @startRow-1 ROWS FETCH NEXT 50 ROWS ONLY;
This looks a lot simpler, but as it turns out (at least in this case), while the plans definitely have different shapes, they are roughly the same in terms of costs - the old style has fewer reads, but more sort operations, and these seem to offset in terms of duration (I'd like to validate that further on larger result sets, but you can download these plans here to examine them yourself in SQL Sentry Plan Explorer).
If you are performing paging in your application, you should check out my article over on SQLPerformance.com, Pagination with OFFSET / FETCH : A better way, where I promote this hybrid form (which uses a CTE to retrieve just the key values you're after, than joins to the main table to retrieve the other columns):
;WITH pg AS ( SELECT [key_column] FROM dbo.[some_table] ORDER BY [some_column_or_columns] OFFSET @PageSize * (@PageNumber - 1) ROWS FETCH NEXT @PageSize ROWS ONLY ) SELECT t.[bunch_of_columns] FROM dbo.[some_table] AS t INNER JOIN pg ON t.[key_column] = pg.[key_column] ORDER BY [some_column_or_columns];
This form will work best when you are retrieving a lot of columns, and your table has both a narrow clustering key and an index that supports the desired
ORDER BY. This isn't quite the case with the example Robert chose (the
INFORMATION_SCHEMA views are, well, not anything like real user tables).
Another place where CTEs can help to simplify a query - mostly by eliminating a self-join or dumping values to a #temp or other table - is removing duplicates. Let's say I have a simple table where I forgot to define a primary key or unique constraint, and now I need to clean it up:
CREATE TABLE #people ( name nvarchar(32) NOT NULL ); INSERT #people(name) VALUES(N'Aaron'),(N'Bob'),(N'Aaron');
A CTE can make this very easy, by simply applying a
ROW_NUMBER() grouped by the name, and deleting any rows where that row number is greater than 1:
;WITH cte AS ( SELECT name, rn = ROW_NUMBER() OVER (PARTITION BY name ORDER BY name) FROM #people ) DELETE cte WHERE rn > 1;
Note that this will delete arbitrary duplicate rows - if you have other criteria that you want to use in order to determine which rows to keep, you can apply tie-breakers to the
ORDER BY. Let's say we have a slightly different table:
CREATE TABLE #people2 ( name nvarchar(32) NOT NULL, StartDate DATE NOT NULL ); INSERT #people2(name, StartDate) VALUES(N'Aaron', '20101025'), (N'Bob', '20120101'), (N'Aaron', '20150601');
If I want to keep the oldest (or "first") row, I can use
StartDate as the ordering column:
;WITH x(name, rn) AS ( SELECT name, rn = ROW_NUMBER() OVER (PARTITION BY name ORDER BY StartDate) FROM #people2 ) DELETE x WHERE rn > 1;
If I want to keep the newest (or "last") row, I can simply reverse the ordering:
;WITH x(name, rn) AS ( SELECT name, rn = ROW_NUMBER() OVER (PARTITION BY name ORDER BY StartDate DESCENDING) FROM #people2 ) DELETE x WHERE rn > 1;
And then there are situations where you might have two Aarons starting on the same day, in which case you can add another column to help break ties (or just accept that an arbitrary row might be deleted).
Updating with an aggregate
I am largely against storing redundant information in a table, such as a calculated value indicating the cheapest or most expensive order a customer has placed. But a lot of people seem to want to do this. Let's say we have an orders table:
CREATE TABLE #orders ( CustomerID INT NOT NULL, OrderTotal DECIMAL(10,2), BiggestOrder DECIMAL(10,2), SmallestOrder DECIMAL(10,2) ); INSERT #orders(CustomerID, OrderTotal) VALUES(1, 32.05), (2, 12.04), (1, 55.72), (2, 65.15), (1, 44.65), (2, 11.42);
Now we want to apply the biggest and smallest order for a customer to all of that customer's rows. The first type of attempt I see many people try goes something like this:
UPDATE #orders SET BiggestOrder = MAX(OrderTotal), SmallestOrder = MIN(OrderTotal) GROUP BY CustomerID;
Seems like a logical approach, but this doesn't work due to a syntax error on the
GROUP BY. And even if you remove it, you get a more fundamental error:
Msg 157, Level 15, State 1 An aggregate may not appear in the set list of an UPDATE statement.
An aggregate may not appear in the set list of an UPDATE statement.
Next, they try this (using windowed aggregates, which don't require outer grouping):
UPDATE #orders SET BiggestOrder = MAX(OrderTotal) OVER (PARTITION BY CustomerID), SmallestOrder = MIN(OrderTotal) OVER (PARTITION BY CustomerID);
But alas, this also yields an error message:
Msg 4108, Level 15, State 1 Windowed functions can only appear in the SELECT or ORDER BY clauses.
Windowed functions can only appear in the SELECT or ORDER BY clauses.
One way around this is to run a correlated subquery for each aggregate:
UPDATE o SET BiggestOrder = (SELECT MAX(OrderTotal) FROM #orders AS bo WHERE bo.CustomerID = o.CustomerID), SmallestOrder = (SELECT MIN(OrderTotal) FROM #orders AS so WHERE so.CustomerID = o.CustomerID) FROM #orders AS o;
Another way to express this (with a lower runtime cost, especially as the number of aggregates increases) would be to access the aggregates using a derived table:
UPDATE o SET BiggestOrder = agg.MaxOrder, SmallestOrder = agg.MinOrder FROM #orders AS o INNER JOIN ( SELECT CustomerID, MaxOrder = MAX(OrderTotal), MinOrder = MIN(OrderTotal) FROM #orders GROUP BY CustomerID ) AS agg ON o.CustomerID = agg.CustomerID
Or a CTE:
;WITH cte AS ( SELECT CustomerID, BiggestOrder, MaxOrder = MAX(OrderTotal) OVER (PARTITION BY CustomerID), SmallestOrder, MinOrder = MIN(OrderTotal) OVER (PARTITION BY CustomerID) FROM #orders ) UPDATE cte SET BiggestOrder = MaxOrder, SmallestOrder = MinOrder;
In this case, the CTE and derived table produce nearly identical plans, but the CTE is much more clearly written (though I agree this is subjective). I won't clog up the post with screen shots and other evidence, but again will make the .QueryAnalysis file available for you to download here.
CTEs can also be useful for recursion, where a CTE references itself one or more times to produce output. The two most common use cases are:
- Generating a set - people often use a recursive CTE to define a set of numbers or a date range. For example, here is a simple CTE that generates a set of numbers from 1 to 100:
;WITH x AS ( SELECT y = 1 UNION ALL SELECT y + 1 FROM x WHERE y < 100 ) SELECT y FROM x;
If you try to go over 100, though, you will get this error:
Msg 530, Level 16, State 1 The statement terminated. The maximum recursion 100 has been exhausted before statement completion.
The statement terminated. The maximum recursion 100 has been exhausted before statement completion.
In order to exceed 100 iterations, you'll need to use
;WITH x AS ( SELECT y = 1 UNION ALL SELECT y + 1 FROM x WHERE y < 200 ) SELECT y FROM x OPTION (MAXRECURSION 200);
This will work up until 32,767, after which you will receive this error:
Msg 310, Level 15, State 1 The value 32768 specified for the MAXRECURSION option exceeds the allowed maximum of 32767.
(To exceed even that limit, you can use
(MAXRECURSION 0), though whether that's a good idea or not...)
More importantly, it is my experience that recursive CTEs do not scale well when deriving large sets; often having a pre-calculated Numbers or Calendar table is much better. I've proven this in multiple previous articles:
- Hierarchy - this I believe was the original driver behind CTEs; to be able to map out an assembly process (a bill of materials), to produce an org chart (find all of an employee's managers, their manager's managers, those managers' managers, etc), and other similar tasks where you would need to know how many levels could potentially exist in order to use traditional joins or unions (and these would perform poor at scale in any case). I will perhaps take a look at hierarchical queries in a more advanced post, outside of the scope of this series; in the meantime, TechNet has some examples for you.
Why the leading semi-colon?
This is probably the most common question I receive personally about CTEs (aside from "What's a CTE? I've never heard of it..."). In just about every CTE I've ever published in blog posts, tips, or answers out on Stack Exchange, you'll see me begin my code with a semi-colon, like this:
;WITH cte_name AS (SELECT ...
The reason for this is simple - because of the overloading of the
WITH keyword, there are several scenarios where copying and pasting a CTE into existing code will break. For example:
SELECT columns FROM dbo.some_table -- existing code WITH cte_name AS (SELECT ... -- pasted
This would fail with the following error:
Msg 336, Level 15, State 1 Incorrect syntax near 'cte_name'. If this is intended to be a common table expression, you need to explicitly terminate the previous statement with a semi-colon.
After getting this complaint at least a dozen times, I quickly made it a habit to always use a leading semi-colon on every CTE I post, much to the chagrin of a few of my colleagues, who think it is perverse to use a semi-colon to begin a statement, rather than to end one.
In most cases, this is okay; you can have as many semi-colons between two statements as you like. This is legal, for example, even if it's not something I would write (or want to read):
;;;;;;;; SELECT 1; ;;; ;;;;; ;; ; ;; ; ;;; SELECT 2;;;;;;
However, there is at least one case where it's not okay, and that is when your CTE is used at the beginning of an inline module, such as a view or inline table-valued function:
CREATE VIEW dbo.view_name AS ;WITH cte_name AS (SELECT ...
This yields the following error:
Msg 102, Level 15, State 1, Procedure view_name, Line 3 Incorrect syntax near ';'.
Beware the Sugar
In cases where you use a CTE to avoid redundancy, it is important to note that this is mostly just syntactic sugar. As I showed a couple of times above, just because you only typed the CTE once, does not mean it is going to be evaluated only once. Even in very simple cases, you can see that a CTE that accesses a table once, but is referenced multiple times, ends up evaluating the CTE (and hence accessing the underlying table) multiple times. Lets say we have a simple table like this:
CREATE TABLE dbo.floob(blat INT PRIMARY KEY); INSERT dbo.floob(blat) SELECT object_id FROM sys.all_objects;
And then we have a simple CTE query, which only references the table once:
;WITH n AS ( SELECT TOP (1) blat FROM dbo.floob ORDER BY blat ) SELECT n.blat, n2.blat FROM n CROSS JOIN n AS n2;
If we look at the results, it might *seem* like there is no reason that the inner query could have been executed multiple times (your results may be slightly different depending on your build of SQL Server):
blat blat ----------- ----------- -1372343552 -1372343552
But if we look at the plan, it becomes more clear:
If we change the
ORDER BY to something that can change with each execution (making the result non-deterministic), such as
;WITH n AS ( SELECT TOP (1) blat FROM dbo.floob ORDER BY NEWID() ) SELECT n.blat, n2.blat FROM n CROSS JOIN n AS n2;
It is much more obvious from the outset that the query was executed twice, since we end up with two completely different
TOP (1) values:
blat blat ----------- ----------- -858063793 -321212601
And there is a much more expensive plan to go with it:
So please be careful, as every reference to a CTE will most likely represent an additional evaluation of the inner query.
My one wish
Personally, aside from the thing being named correctly and deterministic inner query results to only be evaluated once, I would have liked to have seen explicit, distinguishable CTE syntax, such as:
WITH_CTE cte_name AS (SELECT ...
However, while that would avoid some of the confusion about how a CTE works, and would eliminate the leading semi-colon nonsense, it would have violated the ANSI standard - which defines a CTE structure using
WITH. Let's pretend to ignore the fact that all of SQL Server's other implementations of
WITH are not from the standard, because it's safe to assume that they're not going to fix all of that syntax to make CTEs friendlier. Theoretically, they could add more specific syntax as a synonym that the parser could understand, giving us a better option while not destroying backward compatibility, but I wouldn't hold my breath.
This was an interesting exercise, and was a lot harder than I thought it would be. But it was refreshing to walk through many examples of CTEs starting with a very early baseline of knowledge. I'm looking forward to publishing more entries in the series, and if you have any suggestions for topics, please comment below.