An Introduction to SQL Server Logical Joins
Paul S. Randal
Published On: June 20, 2022
Categories: T SQL, Best Practices, SQL Server 0
The JOIN syntax is an optional part of the FROM statement and is used to specify how to return data from different tables (or views). Generally, when we think of joining tables together, we think of returning data from two tables with an equality match (INNER JOIN) between the sources. However, the JOIN keyword also has options for returning all the data from one table and only the records matching from the other table (OUTER JOIN). This tutorial covers when and why to use certain types of joins and some common pitfalls developers encounter when writing SQL Server JOIN statements. Many of the examples below use the AdventureWorks database.
There are a few things to note about this tutorial. Firstly, while it’s possible to JOIN multiple tables for data modifications (INSERT/UPDATE/DELETE), for simplicity’s sake, I’ll stick to SELECT statements. Secondly, I’ll focus on using tables and views as sources. While you can use the APPLY operator to perform similar functionality as JOIN statements involving table-valued user-defined functions, I’ll leave that explanation for another SQL tutorial.
Lastly, it’s worth mentioning although SQL Server allows you to write highly complex JOIN statements involving many tables, it doesn’t mean you should. As I’ll show later, the more complex the SQL statement is, the bigger the chance the query optimizer won’t produce a good execution plan. I’ve helped so many customers over the years improve the throughput on their SQL Server installations by simplifying their JOIN statements. Simpler is better when it comes to JOINs.
The general syntax of a SQL Server JOIN statement is:
The T-SQL INNER JOIN statement returns rows from two sources where there’s an equality match between the inputs. The INNER keyword is optional; if you omit it and only use JOIN, an INNER JOIN is used.
Note that T-SQL is an interpreted language. The syntax describes to SQL Server what results you want returned but does not tell SQL Server HOW to return the data. It’s up to the query optimizer to determine the best JOIN order. So, the order in which you write your INNER JOINs generally doesn’t matter as SQL Server will rewrite them as necessary. There are rare fringe cases where this doesn’t hold true, but, again, this topic has enough merit to be its own SQL Server guide.
The following example returns matching rows from the Sales.SalesOrderHeader table and the Sales.SalesOrderDetail table where values in the SalesOrderID column between the two tables match:
Notice the example uses table aliases to cut down on the amount of T-SQL needed to reference the tables. Because there is a common column between the tables (SalesOrderID), the code must clarify which table is referred to in the JOIN statement. There must be at least one column comparison when joining tables in the FROM clause, but there can be additional columns to join on as needed.
The following is an example of an “older” style INNER JOIN, where the JOIN clause is omitted, and the join conditions are in the WHERE clause.
While it’s a somewhat dated practice, this is still perfectly valid and executes the same as a JOIN written as an INNER JOIN. You must be careful when using INNER JOIN syntax, as it will throw an error if you don’t include a comparison operator as part of the JOIN clause. With the old-style syntax above, you can easily forget to include the JOIN criteria in the WHERE clause, which will result in a Cartesian product (CROSS JOIN).
Through my years of consulting, overly complex SQL statements have been one of the biggest performance issues I’ve seen. As the number of INNER JOINs increases, the join order possibilities increase dramatically, giving the query optimizer many more join paths to evaluate. SQL Server can only ever join two inputs at a time (regardless of the number of tables in the JOIN statement), so determining the best order in which to join tables can be a costly process. The optimizer has a clever way to handle such situations to ensure timely query execution: it sets a time limit on how much optimization it will actually go through. Once it hits the limit, it will execute the best plan it has rendered so far.
Consider the following query, which involves six tables INNER JOINed together:
When I execute the query and look at the execution plan, I see the plan has a ‘Reason For Early Termination of Statement Optimization’ value of ‘Time Out.’
This means during the query optimization phase, there were so many plan options the time limit for optimization was hit, and the “best” plan of the explored plan alternatives was chosen as the execution plan. This happens quite often. If you have complex T-SQL statements and aren’t getting the execution plan you’d like, it may be time to refactor those T-SQL statements, so they involve fewer joins or break them into separate statements.
A T-SQL OUTER JOIN is where one table is preserved, and the other returns rows if values are matched with the preserved table. By “preserved,” I mean all rows from that table are eligible to be returned in the query, while the OUTER table will only return rows if there’s a match within the preserved table.
There are two typical OUTER JOINs:
- The LEFT OUTER JOIN
- The RIGHT OUTER JOIN
There’s also a FULL OUTER JOIN which combines the functionality of both the LEFT and RIGHT OUTER JOIN. Note the OUTER keyword is optional for OUTER joins–using LEFT JOIN is the same as LEFT OUTER JOIN for syntax purposes.
LEFT OUTER JOIN
A LEFT OUTER JOIN is where the left table (the first table specified) is preserved while the table on the other side of the JOIN returns values if there’s a match and otherwise returns NULL values when there’s no match. This is one of those cases where HOW you write the SQL statement does matter.
Consider the following example for creating two temporary tables that will store different numeric values:
The following query returns everything from #Numbers1 and only those values from #Numbers2 that match:
Notice the #Numbers1 table is to the LEFT of the JOIN operator, so it’s preserved. If there’s NOT a match in the #Numbers2 table, a NULL value is returned.
One beneficial aspect of using OUTER JOINs is it easily allows values to be returned from one table where there’s NOT a match in another table. To do this, simply add criteria to the WHERE clause to look for NULL values in the OUTER table. Here’s an example:
One logical mistake many developers make is adding a predicate on the OUTER table in the WHERE clause. When you try to filter rows out on the OUTER table in the WHERE clause, it essentially turns the JOIN into an INNER JOIN because NULL values are filtered out due to the T-SQL order of operations. The following is an example of incorrectly filtering rows in the OUTER table:
If the query needs to first apply criteria to the OUTER table in the query, then you’ll need to apply the filter in the JOIN statement. The query below has the same filter as above but in the JOIN statement instead:
This approach filters out those records in the #Numbers2 table while the OUTER JOIN statement occurs. This gives more NULL values returned from the #Numbers2 table while preserving all the records from the #Numbers1 table.
RIGHT OUTER JOIN
For all intents and purposes, a RIGHT JOIN is the same as a LEFT JOIN, with the right table being preserved and the only difference being the order in which the statement is written. The query below is identical to the first LEFT OUTER JOIN example but written as a RIGHT OUTER JOIN instead. The only difference between the two is the table order in the FROM clause.
FULL OUTER JOIN
A T-SQL FULL OUTER JOIN combines the functionality of the LEFT and RIGHT OUTER JOINs. Use this type of JOIN when you want to return (preserve) rows from both tables. When there is no match between the two tables, NULL values are returned. The following query returns rows from the #Numbers1 and #Numbers2 table:
When there’s a match between the tables, both values will be returned. When there’s not a match, the value from the table where the value exists is returned, and a NULL value is returned from the other table.
A T-SQL CROSS JOIN multiplies all rows between the tables in the join to generate a Cartesian product. This type of JOIN is typically only suitable for generating test data quickly or taking a table with a single row and combining it with all the rows in the second table. Notice there are no comparison criteria between the tables listed in the example below. The example T-SQL query below uses the spt_values tables in the master database and the ROW_NUMBER windowing function to quickly generate 10,000 monotonically increasing values.
When to Use Joins in SQL
Joining related tables is a fundamental operation in the relational database world, and understanding when to use the different types of JOIN is a critical skill for any database developer. Use INNER JOIN when you want only to return those rows where there’s a match between tables. Use LEFT or RIGHT OUTER JOIN when you want to preserve a single table and only return those rows from the OUTER table if there’s a match. Use a FULL OUTER JOIN when you want to combine the functionality of the LEFT and RIGHT OUTER JOIN to return values from both tables. You can read this blog post to dive deeper into the key differences between CROSS JOIN and INNER JOIN types.
Make sure you’re mindful when constructing JOIN clauses and don’t overcomplicate them. Including too many tables in a JOIN will likely lead to less-than-ideal performance due to the complexity involved for the query optimizer to generate a good execution plan.
SolarWinds® SQL Sentry is built to more effectively monitor query processor activity by analyzing what’s being processed, how long it takes, and what resources are being used. With the Top SQL feature, use the included Plan Explorer query analysis tool to collect and analyze query plans used to execute and the statistics used to generate the query plans for more informed optimization and tuning. You can download SQL Sentry with the integrated Plan Explorer tool free for 14 days.
Paul S. Randal is the CEO of SQLskills.com, which he runs with his wife Kimberly L. Tripp. Both Paul and Kimberly are widely-known and respected experts in the SQL Server world, and both are long-time SQL Server MVPs. Paul was a Contributing Editor for TechNet Magazine, where he wrote the bi-monthly SQL Q&A column and feature articles. He also had #1 top-rated workshops and sessions at the PASS Summit and TechEd. Paul is active in the SQL Server community, from user groups to online forums to helping out on Twitter (@PaulRandal – check out the #sqlhelp tag). His popular and widely-referenced blog can be found at https://www.sqlskills.com/blogs/paul/ and he can be reached at email@example.com.