Keeping DRY in SQL Server with Scalar User Defined Functions – Part 1

Adrian Hills

Published On: February 20, 2020

Categories: SQL Server, Engineering 0

No one likes to keep repeating themselves. As software engineers, we're often encouraged to follow the DRY principle: Don't Repeat Yourself. We look to refactor repeated blocks of code out into reusable units that encapsulate that logic, and then call that from everywhere it's needed.

This offers a number of benefits. If we need to change that logic in the future, we now have just one place to update and everywhere it's used will automatically benefit. Only needing to make the change in one place not only saves development time, but reduces risk (with duplication, we could miss a place in code to update as well as increasing the surface area for the number of places we need to go in and change). So, it's no wonder that DRY principles form a staple part of code reviews.

If you're interested in how we ensure quality here at SentryOne with code reviews, check out Cameron Presley's blog post on the subject

But Beware! 

When it comes to the world of SQL Server database development, that instinct to not repeat yourself could end up biting you if you're not careful.  

User defined functions (UDFs) were first introduced in SQL Server 2000 and are a natural concept to non-database developersa unit of code that encapsulates some logic, and returns some output.  There are three core types of user defined functions in SQL Server: 

  • ScalarReturns a single value 
  • Multi-Statement Table Valued—Returns a tabular result-set of data, where the body of the function can contain multiple statements 
  • Inline Table Valued—Returns a tabular result-set of data, where the body of function contains a single SELECT statement that directly returns the result-set 

In this post, we'll going to focus on the first of those three core typesthe scalar UDF.  

Let's run through what might seem like an obvious approach to encapsulate some logic in SQL Server using the WorldWideImporters sample OLTP database.  

Important: this applies when running the database at a compatibility level below 150 (SQL Server 2019), as will become clearer in the conclusion. The examples here were run on SQL Server 2019 with the database set to compatibility level SQL Server 2017 (140).

Scenario 

Take the following query that calculates the total value of a particular order:

SELECT SUM(Quantity * UnitPrice) AS OrderTotal  
FROM Sales.OrderLines ol  
WHERE ol.OrderID = 1 

You might decide to encapsulate this into a scalar UDF:

CREATE FUNCTION Sales.CalculateOrderLineTotal ( 
@OrderId INTEGER 
) 
RETURNS DECIMAL(18,2) 
AS 
BEGIN 
DECLARE @Total DECIMAL(18,2) 
 
SELECT @Total = SUM(Quantity * UnitPrice) FROM Sales.OrderLines WHERE OrderId = @OrderId 
 
RETURN @Total 
END 

And then call it like this:

SELECT Sales.CalculateOrderLineTotal(1)

We can now reuse this function anywhere we need to retrieve the value of an order. Seems useful, right?  

There Be Dragons 

Suppose you have the following query which is used as part of an export process to extract all orders from the system:

SELECT o.OrderId, o.OrderDate 
FROM Sales.Orders o 
ORDER BY o.OrderDate ASC

You have a requirement to change it to also include the value of each order. Greatyou have a function for that! So, you slip in a call to the scalar UDF like so: 

SELECT o.OrderId, o.OrderDate, Sales.CalculateOrderLineTotal(o.OrderId) AS OrderTotal 
FROM Sales.Orders o 
ORDER BY o.OrderDate ASC

Job done….? 

Well, no. You soon hear back that the export process is performing poorly after this change. Let's take a look at the impact that change has had on the resources needed to fulfil the query and return the 73,595 results.  

Using Plan Explorer, we get the following execution details:

  (A) Original Query (without order total) (B) Query with scalar UDF
Duration (ms) 51 112,913
CPU 108 104,608
Reads 725 12,564,915

Ouch. There's that bite I was saying about. In fact, less of a bite and more of a three-course meal. Huge increases in execution time, CPU utilization and reads. You might ask, "What if that's just the cost of getting that extra Order Total value returned with each row?" Let's checkhere is the same query, except we've pulled the query from the scalar UDF out as a correlated subquery into the main query:

SELECT o.OrderId, o.OrderDate,  
(SELECT SUM(Quantity * UnitPrice) FROM Sales.OrderLines ol WHERE ol.OrderId = o.OrderId) AS OrderTotal 
FROM Sales.Orders o 
ORDER BY o.OrderDate

Again running through Plan Explorer to grab the stats, this time we see (C): 

  (A) Original Query (without order total) (B) Query with scalar UDF (C) Query with correlated subquery
Duration (ms) 51 112,913 153
CPU 108 104,608 144
Reads 725 12,564,915 1,093

What a difference! Slightly higher than the original query (A), but that's to be expected given we've introduced a lookup on another table to return the extra data.  

But, they're doing the same thing to get the Order Total aren't they? Let's check the execution plans.

Scalar UDF approach [B]: 

Image1_ScalarUDF_Query_PreCompat150

Plan Explorer shows us we have some warnings. If we hover over the Compute Scalar operation, we see the following warning about the use of the UDF:

Image2_ScalarUDF_Query_ComputeScalar

Inlined correlated subquery approach [C]: 

Image3_ScalarUDF_Query_InlineSubquery

With the scalar UDF approach, there is no mention of the Sales.OrderLines table despite the fact we are now querying that table. Instead, what happens inside the scalar function is hidden behind the single "Compute Scalar" operationas shown above, we can see something is of concern via the warnings. No matter what you have going on inside the function, there's no detail exposed in the execution planjust a whole lot of cost. Whereas with the inlined subquery, we have full visibility of exactly what it's doing to get the extra data from Sales.OrderLines. 

Scalar functions can often cause poor performance, just like in this case, for a number of reasons. The scalar function is called iteratively, once per row in the result-set. In this case, that's 73,595 executions, resulting in the poor performance we're seeing here. During the query optimization process, SQL Server does not apply any costings to the scalar function call, whereas clearly we're seeing it has a huge cost. 

Conclusion 

As conscientious software engineers, it can be easy to apply the programming best practises, like DRY, in the database layer and assume it's the best solution. But we should always check the performance of queries to verify that we're taking an approach that is acceptable in terms of performance and resource utilization.  

In my next blog post, I'll cover enhancements that were introduced in SQL Server 2019 (compatibility level 150) that enable scalar UDFs to be inlined automatically into the calling query, and follow this exercise through to see how this plays out. 

 

Adrian (@AdaTheDev) is a Lead Development Engineer at SentryOne, focusing on the development of the Data DevOps SaaS product portfolio - namely SentryOne Document and SentryOne Test. Based in the UK, he joined SentryOne from Pragmatic Works in 2018, with 18 years experience in software development that includes a background in developing SaaS platforms in the hospitality and digital marketing industries. Adrian is a fan of good (bad) puns, dad jokes, and proper British biscuits.


Comments

New call-to-action