T-SQL Tuesday: SQL Server Assumptions


This month's T-SQL Tuesday is being hosted by Dev Nambi (@DevNambi), who I first met on SQLCruise a few years ago. For this installment, Dev wants us to write about assumptions that we encounter when working with SQL Server, and boy, do I come across a lot of those. Rather than pick just one, I thought I would lightly treat a handful of SQL Server assumptions that I believe are widespread and could be damaging in certain scenarios.

1. CASE will always short circuit

By "short circuit" I mean that SQL Server will stop processing the branches of a CASE expression as soon as it hits a match. So the assumption is that this construct will never fail, because SQL Server would never reach the ELSE condition:

DECLARE @i INT = 1;
SELECT CASE WHEN @i = 1 THEN 0 ELSE 1/0 END;

Since its use is rampant, and it "always" works, I don't really blame people for making this assumption; even the documentation stated as much for a long time. However, there are a few cases (seriously, no pun intended!) where this short circuiting breaks down. One such scenario is where you introduce aggregates to the expression:

DECLARE @i INT = 1;
SELECT CASE WHEN @i = 1 THEN 0 ELSE MIN(1/0) END;

Suddenly, we get the following error:

Msg 8134, Level 16, State 1, Line 2
Divide by zero error encountered.

I mentioned this specific scenario in an answer on dba.stackexchange.com, noting that this is not a bug, just that the documentation needed re-wording (which happened, as explained in Connect items #690017 and #691535).

The truth is, in most cases, you can rely on CASE to short circuit. In fact, I do so all the time, to prevent SQL Server from interpreting my statements in an illogical order (for example, when filters should eliminate rows from impossible conversions - Erland Sommarskog has a great example, and one of the posted workarounds uses a CASE expression). Here's my own example that is tough to solve without a CASE expression:

CREATE TABLE dbo.t(d VARCHAR(20));
INSERT dbo.t(d) VALUES('20140708'),('Last Friday');
 
-- fails: SELECT d FROM dbo.t WHERE MONTH(d) = 7;
-- fails: SELECT d FROM dbo.t WHERE ISDATE(d) = 1 AND MONTH(d) = 7;
 
-- succeeds:
SELECT d FROM dbo.t WHERE CASE WHEN ISDATE(d) = 1 THEN MONTH(d) END = 7;

At the end of the day, you just need to be aware of the scenarios where this assumption becomes brittle; this situation with aggregates isn't the only one. I wrote about this and several other CASE behaviors here: Dirty Secrets of the CASE Expression.

2. A seek is better than a scan, always

Generally, I agree: a scan is usually more work than a seek. But in typical and predictable DBA fashion, I'm going to affix a disclaimer there: It Depends.

The issue is that, in some cases, a scan is the most efficient way to retrieve the data. Depending on your index structure and the columns in the query (both those used for output and those used to filter), a seek - in particular, a range scan in disguise - may actually lead to a lot of costly lookups to retrieve the columns the index doesn't cover. This can be easy to fix, but only if you have the flexibility to change or add indexes.
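
Here is a minimal sketch of that scenario (the table, column, and index names are hypothetical): the narrow index supports a seek on CustomerID, but every matching row costs a key lookup to fetch Notes, so once enough rows match, a clustered index scan is often the cheaper plan.

CREATE TABLE dbo.Orders
(
  OrderID    INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
  CustomerID INT NOT NULL,
  Notes      VARCHAR(500)
);
 
-- narrow index: supports a seek on CustomerID, but doesn't cover Notes
CREATE INDEX ix_cust ON dbo.Orders(CustomerID);
 
-- this "seek" is really a range scan, plus a key lookup per matching row:
SELECT CustomerID, Notes
  FROM dbo.Orders
  WHERE CustomerID BETWEEN 1 AND 5000;
 
-- if you have the flexibility to change indexes, covering removes the lookups:
CREATE INDEX ix_cust_covering ON dbo.Orders(CustomerID) INCLUDE(Notes);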

And for small tables or indexes (think filtered indexes here too), performing a seek simply isn't going to offer any performance benefits over just scanning the whole table or index, and may not even be worth the effort of performing exhaustive searches for an "optimal" execution plan.

Rob Farley (@rob_farley) blogged about this in another installment of T-SQL Tuesday earlier this year. He provides a lot of details, but I think the salient point he made was this one:

"Some Seeks are better than some Scans, and some Scans are better than some Seeks."

The takeaway: Don't expend too much effort trying to convert all scans to seeks. You may not be able to make the query any better; you may actually end up making it worse. Focus on the scans that you can prove are actually causing performance problems - are they scans because of a missing index, or an index that doesn't cover, or an implicit conversion, or a function against a predicate column? Is the query actually exhibiting poor performance, or did you just spot a scan in an execution plan and thought it must be a problem?

3. A query without "order by" will always order by <x>

I've seen different fill-in-the-blank values here, including order of insert, the clustered index column(s), the identity column, and even the index I hinted. These assumptions are usually based on a limited set of observations, and go something like this:

"It's always worked that way for me, so it will always work that way."

[Photo: Which marble hit the floor first? - http://www.flickr.com/photos/rebeccabarray/7846952542]

The truth of the matter is that a table, by definition, is an unordered set of rows; you may as well think about it as a bag of marbles. If your query does not include an ORDER BY, you should not rely on any specific ordering, even if it is what you "always" observe. Here's why: If you don't specify an order, you are essentially telling SQL Server, "I don't care about order." So, SQL Server will feel free to return the results in the most efficient manner possible. The result may coincide with your assumption, but it may not.

I see a lot of people assume that if they issue TOP (1) against a table, and don't specify an order, that they will get the first row they inserted. Going back to the marble analogy, imagine I dropped a bag of marbles on the floor while you weren't looking, and asked you to tell me which marble hit the floor first. Without some other indicator, you'd just pick an arbitrary marble off the floor. If I wasn't paying attention, I wouldn't know if you were right, either. The optimizer works much the same way: without an explicit ordering, it is free to choose whatever TOP (1) it feels like. And unless you have some column that indicates order of insert (like an identity column, or a datetime column populated with the current time by default), you couldn't tell SQL Server how to get that row, and can't be sure it got the right row, either.

Many factors can affect the optimizer's choice here, and it can change its mind in direct or collateral response to various events, too. You may get a different plan today than the one you got yesterday, and you may get a different plan than the one your co-worker gets today. This could be caused by changes to the query, changes to the data / statistics / indexes, session settings, isolation level, allocation order or merry-go-round scans, partitioning, parallelism, turning on or off a trace flag, sp_configure changes, clearing the plan cache, restarting the server, failing over, applying a hotfix/cumulative update/service pack, upgrading to a new major version, and on and on...

Here is a quick and simple example where adding an index can change the result:

CREATE TABLE dbo.People
(
  PersonID INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, 
  Name SYSNAME
);
INSERT dbo.People(Name) VALUES('Zeke'),('Bob'),('Frank');
 
SELECT Name FROM dbo.People; -- clustered index scan, ordered [Zeke, Bob, Frank]
 
CREATE UNIQUE INDEX pn ON dbo.People(Name);
 
SELECT Name FROM dbo.People; -- non-clustered index scan, ordered [Bob, Frank, Zeke]

I've seen multiple scenarios like this, first-hand, where the application is released, everything is coming out in the order of the identity column, users are happy, and then suddenly the plan goes parallel or someone adds an index or the query gets recompiled for some reason. At this point, users are not happy. The developers want some way to restore the old behavior without touching the query; but every time, the only feasible solution is the obvious one: add ORDER BY to the query (and sometimes add a column to make the ORDER BY logical and deterministic).

An abbreviated list of scenarios where this can affect you unless you add an ORDER BY to the outermost query:

  • A view or inline table-valued function with ORDER BY - the TOP 100 PERCENT that even makes this syntax possible is now optimized away, as described here (see the sketch after this list).
  • An "ordered" INSERT (or SELECT INTO) for a temp table, table variable, or return table in a multi-statement table-valued function.
  • An UPDATE that matches multiple rows, and updates one row in one scenario, and a different row in another.
  • ORDER BY or window functions inside a CTE, subquery or derived table (usually in combination with a meaningless TOP).
  • Using TOP (1) or SET ROWCOUNT to determine the "first" or "last" row inserted.
  • Even in the case of a hinted index, the index will be used (if possible, else an error in most cases), but just because the index is used doesn't mean the results will be returned sorted by the key(s) in that index. You still need an ORDER BY to ensure sorting by the index key(s).
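
Here is a minimal sketch of the first scenario in that list, reusing dbo.People from above (the view name is hypothetical):

CREATE VIEW dbo.PeopleSorted
AS
  SELECT TOP 100 PERCENT PersonID, Name
    FROM dbo.People
    ORDER BY Name;
GO
 
-- the TOP 100 PERCENT ... ORDER BY is optimized away; no order is guaranteed:
SELECT PersonID, Name FROM dbo.PeopleSorted;
 
-- only an ORDER BY on the outermost query guarantees the ordering:
SELECT PersonID, Name FROM dbo.PeopleSorted ORDER BY Name;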

I have talked about a few of these scenarios here and here, as have Craig Freedman and Conor Cunningham. These are all old posts; if anything, SQL Server's optimizer has advanced in ways that only mean there are now as many or more reasons to stop expecting a reliable order without specifying ORDER BY.

4. SQL Server's Defaults Are Great

During the SQL Server 2000 and 2005 heyday, the product developed a reputation as a self-tuning, set-it-and-forget-it database platform. I forget the exact wording now, but I remember it being marketed as something so simple to tune that you could say goodbye to all of your DBAs.

That hasn't exactly panned out.

And not because SQL Server is terrible at self-tuning. Nor because it has become a much more complex platform. I will concede that not all SQL Server-based applications needed a DBA in the first place. But SQL Server does not ship out of the box with scalability in mind across all aspects, and you can tell from several of the default settings. Some examples:

  • File locations
    Everything defaults to the system drive (C:\), no matter how many drives you have or how fast they are: system databases, tempdb, error logs, the location for new data and log files, and program files like Management Studio, SQLCMD and Profiler. Why couldn't SQL Server setup encourage you to put the more important things on other drives, if they're available? (Bonus points for detecting RAID levels, SSDs, and differentiating between SAN and direct-attached storage.) Since it can't, please don't assume that these pre-populated file locations are optimal for your installation, and consider distributing them from the start, since changing them later can be a pain.
  • New database file sizes and autogrow
    The model database dictates the size and autogrow rates of new databases (ones you create, not ones you restore or attach). While the actual size depends on version and edition (3 MB data and 1 MB log for SQL Server 2014 Enterprise Edition), it is unlikely that the defaults are optimal for any database you're creating in this decade. How quickly will you outgrow a measly 3 MB? And the autogrow rates (1 MB for data, 10% for log) leave a lot to be desired as well. Even with instant file initialization enabled, growing a data file from 2 MB or 3 MB to accommodate your data - 1 MB at a time - is going to be a very painful process. And growing the log file 10% at a time leads to longer and longer growth events, especially since log files can't benefit from instant file initialization, and can also lead to too many virtual log files (VLFs) and heavy VLF fragmentation. Please consider altering the model database so that the initial sizes are reasonable and the growth rates are absolute sizes (no percentages), even if you are on SSDs or other flash-based storage and don't think these issues could ever become critical.
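    A sketch of what that might look like (modeldev and modellog are the default logical file names; the sizes below are placeholders, not recommendations):

    ALTER DATABASE model MODIFY FILE (NAME = modeldev, SIZE = 256MB, FILEGROWTH = 256MB);
    ALTER DATABASE model MODIFY FILE (NAME = modellog, SIZE = 128MB, FILEGROWTH = 128MB);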
  • Recovery model
    Like file sizes and autogrow, the default recovery model of new databases is determined by the recovery model of the model database, and this can also vary by version and edition (Express Edition, for example, defaults to simple, while Enterprise Edition defaults to full). There is no right or wrong here; this is another case of It Depends™. In many systems, you are likely to want a mix of full and simple recovery databases; in those cases, it may make sense to set model to whichever recovery option will be used for the majority of new databases, and change the rest to the option that makes more sense for them. Some people set up policies using Policy-Based Management to identify databases that are set up in the "wrong" recovery model; others use auditing or DDL triggers to just keep an eye on new databases with a plan to review them manually. In any case, this is another area where you might not want to trust SQL Server's defaults for all databases.
  • tempdb
    The tempdb database is literally the dumping ground for all kinds of system and user activities in SQL Server; not surprisingly, it is often a major performance bottleneck. The defaults for this database are a single 8 MB data file set to 10% autogrow, and a 1 MB log file also set to 10% autogrow. As I've explained above, files this small are just begging for frequent and disruptive autogrow events, and those percentage-based autogrow events have the potential to take longer and longer as time goes on. And since tempdb is re-created to its initial size on restart or failover, the process just starts all over again. In addition to changing these sizes and growth rates to something more logical, other changes you can make from the defaults are to create multiple data files (see this post by Paul Randal (@PaulRandal) to help determine how many), all with the same size, and consider trace flag 1117. You can also consider implementing trace flag 1118, but Paul has some advice about that, too.
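    A sketch of those changes (tempdev and templog are the default logical file names; the sizes, growth increments, file count, and path are placeholders for your own environment):

    ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 1024MB, FILEGROWTH = 256MB);
    ALTER DATABASE tempdb MODIFY FILE (NAME = templog, SIZE = 512MB, FILEGROWTH = 256MB);

    -- add equally-sized data files, repeating per Paul's guidance on file count:
    ALTER DATABASE tempdb ADD FILE
      (NAME = tempdev2, FILENAME = 'T:\tempdb\tempdev2.ndf', SIZE = 1024MB, FILEGROWTH = 256MB);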
  • Backup compression and logging
    As Jeremiah Peschka (@peschkaj) explains in this blog post, backup compression is all benefit at very little cost, and I really would only hesitate to turn this on in the case where a workload is completely CPU-bound (which is still quite rare, even in the face of SSDs, PCIe storage, and In-Memory tables). So consider setting this as the default for all databases:

    USE master;
    EXEC sp_configure 'backup compression default', 1;
    RECONFIGURE WITH OVERRIDE;

    While you're at it, enable trace flag 3226 - by default, SQL Server logs every successful backup message to the error log, which also pollutes SQL Server Agent history. Hopefully you have some kind of alerting mechanism in place for backups that fail, so success messages are just noise.
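
    One way to try it on a running instance (add -T3226 as a startup parameter if you want it to survive restarts):

    DBCC TRACEON(3226, -1);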

  • Optimize for ad hoc workloads
    This is a server-wide setting that prevents plan cache bloat by requiring that a query plan is requested twice before caching a full version of the plan. Since SQL Server 2008 was released, I have recommended this setting on dozens and dozens of servers, and I have yet to come across a workload that didn't like it. Kimberly Tripp (@KimberlyLTripp) offers much more information and a ton of links to other resources about this setting. Not convinced? Try it out on your test/QA servers:

    USE master;
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'optimize for ad hoc workloads', 1;
    RECONFIGURE WITH OVERRIDE;

  • Server memory
    The default for max server memory is an astronomical 2 petabytes; effectively this means there is no cap. Since SQL Server is optimized to use memory aggressively, this means that the server can consume as much memory as it wants, only releasing it when getting pressure from the OS to do so. This often leads to complaints about memory leaks and people wanting SQL Server to stop hogging all of their memory; when in fact the server is doing exactly what they've implicitly told it to do. This gets more complicated when you have multiple instances, or other applications running on the server; in either case you don't want to leave this setting at the default. See this blog post by Jonathan Kehayias (@SQLPoolBoy) for guidance on how much memory to reserve for the operating system and other applications and services.
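    Once you've settled on a number, the change itself is simple (the 12288 below is a placeholder, not a recommendation):

    USE master;
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max server memory (MB)', 12288;
    RECONFIGURE WITH OVERRIDE;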
  • Trace flags
    SQL Server ships with no trace flags enabled. I've mentioned a few trace flags you should consider turning on; Pieter Vanhove (@Pieter_vanhove) has a good starter set to consider as well.

Note that not all of these default settings and behaviors are bad in all scenarios; I merely wanted to bring them up, since the typical assumption is that the defaults can't possibly be bad in any scenario. Do you change any other defaults on most or all SQL Server instances? I'd love to hear about others I see less frequently.

5. If Microsoft does it, it must be a good idea!

Along the same lines as the defaults, Microsoft does not advocate best practices very well. They should use the tagline, "Do as I say, not as I do." In some cases, it might be better as, "Don't do as I say or as I do." Remember being able to set the sa password to blank? They are better these days at security best practices, but still lacking in other areas. Even today, when I look in Books Online, I often see things that make me cringe.

[Image: SQL Server setup allowing a blank sa password]
Remember when you could do this?

Now, I'm not saying that I envy the documentation writers, or that I want their thankless jobs. There is a ton of content there, and it must be mind-numbing to even think about the prospect of making it "perfect." Microsoft has a pretty good track record of telling us how the product works, but doesn't do so well in the areas of practical usage, guidance, or code samples that promote sound techniques. Even system stored procedures and maintenance plans often fail to promote best practices. Not to mention products like SharePoint and Entity Framework that commit way too many schema and query sins for me to even start compiling a meaningful list.

Some of the most egregious things that newer database developers learn from (and for a lot of these, I'm going to pick on one specific topic with multiple violations):

  • SELECT *
    While some people seem to think SELECT * is okay until it becomes a performance problem, most of us probably agree that it is not a best practice to use this syntax in production code for various reasons. Yet there it is, in all its glory, in the BOL topic above (and many other topics as well).
  • Default cursors
    Microsoft seems to be a big fan of default cursors, in spite of their heavy-handed behaviors. In most cases, you'll want to specify LOCAL FAST_FORWARD for forward-only, read-only cursors (see the sketch after this list), or avoid cursors in the first place. But cursors with no options specified can be found in Books Online, AdventureWorks In-Memory OLTP sample procedures, and by far the most infamous: sp_msforeachdb. While it is an undocumented and unsupported procedure, its use is quite prevalent; unfortunately, its default cursor can cause serious problems, including silently skipping databases without the user's knowledge. I've complained about this in Connect #264677 and #752629, but it will not be fixed. I've written a replacement without this flaw and with several other features; I've blogged about it here and here.
  • No semi-colons
    The topic above actually does use semi-colons to terminate statements, but a lot of topics don't, never mind Microsoft's own stored procedures. The problem is that "Not using a statement terminator for Transact-SQL statements" has been deprecated since SQL Server 2008; while it's unlikely to actually be implemented in our lifetime, I've blogged about why we should all make a conscious effort to use statement terminators more consistently.
  • Varchar without length
    The above topic doesn't do this, but several others do (including this one). The main problem with declaring varchar without length is that in some contexts it defaults to 1 character, and in others to 30; in some cases it can lead to silent truncation - meaning you can have data loss and not even know it. (I talk more about this here, and the sketch after this list shows the 1-character default in action.) Similar ambiguity affects other types that accept a length, precision, or scale, like char, nchar, nvarchar, numeric, and decimal. If it can take a length, precision or scale, please always specify it explicitly.
  • Date query sins
    I often wonder where people get ideas to store dates as strings, use ambiguous regional date formats, use BETWEEN and lazy shorthand like GETDATE() + 1, and then I see gems like this in Microsoft's own documentation (yes, this appears in the topic above too):

    AND h.OrderDate BETWEEN '1/1/06' AND '2/1/06'
    AND d.DueDate BETWEEN CAST ('20060101' AS datetime) + 14 
        AND CAST ('20060201' AS datetime) + 14;

  What don't I like about this? Plenty of things:
    • BETWEEN
      You should avoid BETWEEN for queries involving date/time data types, since rounding can take values like 23:59:59.999 and round them up to the next day, and what the end of the range includes is actually confusing to a lot of people (often I see people believe that the above will include all data from February 1st - or is that January 2nd? - not just the data stamped at midnight). The sketch after this list shows an open-ended range instead. I've blogged about the issues with BETWEEN in more detail, and I am not a big fan of EOMONTH() for similar reasons.
    • M/D/Y
      I alluded to it previously, but you should avoid regional and ambiguous formats like M/D/Y, since both humans and SQL Server can mistake what you mean. The only truly safe formats for string literals in SQL Server are YYYYMMDD and YYYY-MM-DDTHH:nn:ss[.ms...] (and yes that T is important). I've blogged about this here.
    • Shorthand
      The shorthand syntax of adding a numeric value to a DATETIME should be abandoned, since it doesn't work with the new data types. I bet there is a lot of code out there that is on the verge of breaking the moment someone decides to switch to DATE or DATETIME2. I would rather type out an explicit DATEADD, since clarity and forward compatibility are more important to me than saving a few characters of code or a few milliseconds of typing.
  • No schema prefix
    The above topic does not do this, but others do, including this one. To be fair, this topic also explains why you should always specify the schema when creating or referencing objects, so this is another case of "do as I say, not as I do." For more details on why you should always specify the schema, see this blog post.
  • Alias types
    Also known as "user-defined data types," there are several of these in the AdventureWorks sample database, and they seem to be used simply to save a little bit of typing during table and parameter declarations. I'll never understand why you want to specify a column as ColumnName dbo.Phone when dbo.Phone simply maps to NVARCHAR(25). (Also not sure why a phone number would ever need to support Unicode characters, but that's probably a different issue.) Much worse is the effort required to change an alias type once it's been used in tables, procedures, and functions - see this blog post for all the gory details.
  • Old-style joins
    While the ancient inner join syntax (FROM t1, t2 WHERE t1.x = t2.x) has not been deprecated like the outer join syntax (*=/=*) has, it's not the clearest either; there is an example of this in the topic mentioned above as well. It discourages the separation of join criteria from filter criteria, and makes it much more likely to accidentally derive the Cartesian product. I've even seen multiple scenarios where this has been "fixed" by slapping a DISTINCT on the SELECT list. This is certainly a stylistic preference and a lot more subjective than many of the things I tend to complain about, but in case you're not convinced, I've written a lot more about it.
  • Added September 5, 2014: Best practices documents
    These things age, and not everything stands up to the test of time. Denny Cherry (@mrdenny) recently wrote a post debunking a lot of things stated in this rather old (2008) blog post put out by MSDN.
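
To tie a few of these together, here is a minimal sketch (the dbo.SalesOrders table and its columns are hypothetical) showing a cursor with explicit options, the varchar-without-length trap, an open-ended date range with unambiguous literals, and explicit DATEADD instead of shorthand:

-- a cursor with explicit, lightweight options instead of the defaults:
DECLARE @name SYSNAME;
DECLARE c CURSOR LOCAL FAST_FORWARD FOR
  SELECT name FROM sys.databases;
OPEN c;
FETCH NEXT FROM c INTO @name;
WHILE @@FETCH_STATUS = 0
BEGIN
  PRINT @name;
  FETCH NEXT FROM c INTO @name;
END
CLOSE c;
DEALLOCATE c;
 
-- varchar without length defaults to 1 character here - silent truncation:
DECLARE @v VARCHAR = 'silently truncated';
SELECT @v; -- returns just 's'
 
-- an open-ended range with unambiguous literals (all of January 2006):
SELECT OrderID
  FROM dbo.SalesOrders
  WHERE OrderDate >= '20060101'
    AND OrderDate <  '20060201';
 
-- explicit DATEADD instead of GETDATE() + 1, which breaks with DATE/DATETIME2:
SELECT DATEADD(DAY, 1, GETDATE());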

I don't mean to crap all over Microsoft's documentation and code samples (never mind production code); all I'm saying is that, just because you see them do something, does not mean it is a good idea to do it yourself.

Conclusion

A lot of people make a lot of assumptions about SQL Server, myself included. I constantly learn things about our favorite platform, and occasionally have to change my opinion about a pre-conceived notion I had. Hopefully you're not making any of the assumptions I've outlined above; and if you are, I hope I've persuaded you to think twice about them.
