Best Practices: Properly Referencing Columns

Several years ago now, I wrote about people not bothering to specify the schema when creating or referencing objects. While I could argue that always specifying the schema makes the code easier to understand and more self-documenting, the major concern there was that leaving the schema out could lead to plan cache bloat (and I describe this scenario in a much more recent post, too, and point out another potential side effect in Bad Habits: Another Case for Semi-colons and Schema Prefix). The only justification I ever hear for omitting the schema is productivity - but I'm not buying it. Even if you only use the dbo schema today, you should still protect yourself from future issues, even at the cost of a few keystrokes and precious milliseconds of coding time.

I feel the same way about referencing columns, but for slightly different reasons. No, leaving out a column prefix will not lead to plan cache bloat - unless you have two users writing near-identical queries that only differ by that detail. But not bothering to indicate which table a column came from can lead to frustrating errors and debugging, and even wrong results that can be difficult to notice.

Query Progression

Many queries evolve over time - they start out simple, but later they need to expose more columns, pull in additional tables, or offer more advanced filtering. Let's take a very simple case; you have a script that gets you all of the tables in the current database that are replicated:

SELECT name FROM sys.tables WHERE is_replicated = 1;

Now you realize that you need the schema name too, so you add a join to sys.schemas, but now you have to prefix the name columns anyway, to avoid an ambiguous column error:

SELECT [schema] = s.name, [table] = t.name
  FROM sys.tables AS t
  INNER JOIN sys.schemas AS s
  ON t.[schema_id] = s.[schema_id]
  WHERE is_replicated = 1;

Next, you discover that you actually need to know which columns are replicated, so you just add a join to sys.columns:

SELECT [schema] = s.name, [table] = t.name, [column] = c.name
  FROM sys.tables AS t
  INNER JOIN sys.schemas AS s
  ON t.[schema_id] = s.[schema_id]
  INNER JOIN sys.columns AS c
  ON t.[object_id] = c.[object_id]
  WHERE is_replicated = 1;

But this, too, leads to an ambiguous column error...

Msg 209, Level 16, State 1
Ambiguous column name 'is_replicated'.

Ambiguous column name 'is_replicated'.

...since is_replicated exists in both sys.tables and sys.columns. So you've done extra troubleshooting here that could have been avoided if you had properly qualified all of the columns during initial development. I can assure you this gets much more tedious as the query gets more complex and/or as the number of unqualified references increases.

Remember, too, that referencing is important everywhere in the query, because introducing an ambiguous reference could bite you anywhere. Using the proper reference in the SELECT list is worthless if the ORDER BY expression doesn't include the same reference.

Getting Help

Another scenario where not qualifying columns can create extra work is when seeking help from others. I often see queries like this posted on Stack Overflow or dba.stackexchange:

SELECT this, that, the_other_thing, the_count = COUNT(something_else)
  FROM dbo.table1 
  LEFT OUTER JOIN dbo.table2
  ON columnA = columnB
  WHERE filterX > 10 AND countY < 5
  GROUP BY this, that, the_other_thing;

If you ask people to help you debug this query - maybe offer advice on indexes, or figure out why it is returning too many or too few rows - you will inevitably be drawn into a discussion asking which table each column comes from. This information is very important and useful, and people who are not already familiar with your schema won't have any bits of that knowledge. Some implicit details may be obvious and logical to you but that does not make them explicit for anyone else.

Of course this can bite you even when you're not out asking the community for help. If you pass this query along to co-workers for a code review, they may be familiar enough with the schema to deduce which columns come from where, but they may not. And even you might be reviewing your own code a year down the road and might now have to go reverse engineer the schema to figure everything out. Being diligent about column prefixes - even when during initial development it is completely obvious to you which columns come from which table - can save you time in the long run.

Encountering Unintuitive Behavior

Another area where not prefixing columns can cause an issue is when you combine a minor typo with behavior that is, admittedly, downright unintuitive. Here's a self-contained example of what I'm talking about:

CREATE TABLE #Orders(ID INT);
 
-- orders 1 and 2 exist
INSERT #Orders(ID) VALUES(1),(2);
CREATE TABLE #OrderDetails(OrderID INT);
 
-- in order details, only order #3 is represented:
INSERT #OrderDetails(OrderID) VALUES(3);
 
-- logically speaking, this query shouldn't compile, 
-- but it does, and gives unexpected results:
SELECT ID FROM #Orders WHERE ID IN (SELECT ID FROM #OrderDetails);
-- note wrong column name -----------------^^
 
DROP TABLE #Orders, #OrderDetails;

Results:

ID
—-
1
2

Obligatory SNL Reference

Orders 1 and 2 are returned, even though neither of them exist in #OrderDetails. What is happening here is that SQL Server searches #OrderDetails for a column named ID, and when it doesn't find it in the inner scope, it checks for it in the outer scope. It finds it there, so it assumes that is the ID you meant. This is a pretty simplistic example, but I hope it is clear that this can happen in more complex cases, where in your development or unit test scenarios it can actually seem like the code is working fine.

You can prevent this by properly qualifying each column - annoying, perhaps, when a query only references a single table, but still worth it. With this slightly different query:

CREATE TABLE #Orders(ID INT);
 
-- orders 1 and 2 exist
INSERT #Orders(ID) VALUES(1),(2);
CREATE TABLE #OrderDetails(OrderID INT);
 
-- in order details, only order #3 is represented:
INSERT #OrderDetails(OrderID) VALUES(3);
 
SELECT o.ID FROM #Orders AS o WHERE ID IN (SELECT od.ID FROM #OrderDetails AS od);
 
DROP TABLE #Orders, #OrderDetails;

We get the error message we should expect, and can fix the code before it ships:

Msg 207, Level 16, State 1, Line 10
Invalid column name 'ID'.

In fact, by using the alias as a column prefix, IntelliSense probably would have prevented us from ever typing ID in the inner scope in the first place.

This issue has been brought up multiple times (here are two examples: SO #5076906 and SO #26457775), and is often labeled as a bug. The behavior is definitely by design, not to make for confusing semantics, but rather to conform to the ANSI standard. More importantly, the behavior will not be "fixed"; see KB #298674 as well as Connect #302281, #735178, #772612, and #265772. The solution is to use proper table aliases and column prefixes everywhere. (Well, technically, you don't have to use table aliases, but they sure can make code easier to read, especially when dealing with long entity names and three- or four-part references.)

Conclusion

I do hope to have convinced some of you that it is worth investing the time up front to ensure your tables and columns are properly aliased and qualified. I'm not a big fan of extra typing for nothing, but I'm even less of a fan of avoiding extra typing to save a second here or there, or slightly reduce the character count of my code. Especially when there are potential downsides that haven't always been immediately apparent - but should be now.