Pin the Blame on the Query
Published On: February 9, 2016
Categories: Cloud, SQL Server, Visual Studio
Last week, the online version of Visual Studio had a significant outage (about five hours). In this apology post, Microsoft's Brian Harry puts a significant portion of the blame on the SQL Server team, saying:
"Clearly the SQL server team has some work to do to further tune the new cardinality estimator and they are all over that."
(And he goes into more specific detail in A bit more on the Feb 3 and 4 incidents.)
So, yes, a new cardinality estimator is generally going to improve a lot of queries, but there are inevitable regressions, too. This query, in particular, is not exactly "switch the cardinality estimator from underneath me friendly":
```sql
SELECT TOP 1
       @collidingAccountName = ids.AccountName,
       @preExistingDomain    = i.Domain,
       @preExistingTfidGuid  = i.Id
FROM #identities ids
INNER LOOP JOIN tbl_Identity i WITH
     (INDEX=IX_tbl_Identity_AccountName_TypeId, FORCESEEK, UPDLOCK, HOLDLOCK)
    ON  i.PartitionId = @partitionId
    AND i.AccountName = ids.AccountName
    AND i.Domain      = ids.Domain
    AND i.Sid        <> ids.Sid
WHERE i.PartitionId = @partitionId
  AND ids.TypeId IN (5, 6)
  AND i.TypeId   IN (5, 6)
OPTION (OPTIMIZE FOR (@partitionId UNKNOWN));
```
Let's count the things in this single query that absolutely must be regression tested after an upgrade, service pack, cumulative update, or compatibility level change:
- TOP without ORDER BY (against a #temp table, no less), if you're expecting predictable sorting
- LOOP JOIN hint
- explicit INDEX hint
- FORCESEEK hint
- UPDLOCK and HOLDLOCK hints
- OPTIMIZE FOR UNKNOWN
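For contrast, here is a sketch (using the object names from the original query) of what a hint-free version might look like. This is not the fix the team shipped; it only illustrates the point that most of those hints shape the plan, while UPDLOCK/HOLDLOCK arguably serve correctness and might need to stay:

```sql
-- Sketch only: plan-shaping hints (LOOP JOIN, INDEX, FORCESEEK,
-- OPTIMIZE FOR UNKNOWN) removed so the optimizer is free to choose
-- a plan under either estimator; the locking hints are retained on
-- the assumption they protect a subsequent update. Regression-test
-- any such change before adopting it.
SELECT TOP 1
       @collidingAccountName = ids.AccountName,
       @preExistingDomain    = i.Domain,
       @preExistingTfidGuid  = i.Id
FROM #identities ids
INNER JOIN tbl_Identity i WITH (UPDLOCK, HOLDLOCK)
    ON  i.PartitionId = @partitionId
    AND i.AccountName = ids.AccountName
    AND i.Domain      = ids.Domain
    AND i.Sid        <> ids.Sid
WHERE i.PartitionId = @partitionId
  AND ids.TypeId IN (5, 6)
  AND i.TypeId   IN (5, 6);
```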
These are all things that may have been necessary under the old estimator, but they are likely just tying the optimizer's hands under the new one. This is a query that could have, and should have, been tested in their dev/staging/QA environments under the new cardinality estimator long before they flipped the switch in production, and it could have gone through a series of tests in which different combinations of those hints and options were removed. For this, the team can only blame itself. Michael Campbell (@AngryPets) tweeted:
"Every time I use a hint I think of ways NOT to - simply because you never know of long-term impacts like this case."
Hear, hear. Query hints should be used as a last resort (you'll hear this repeatedly from experts like Paul White). They used six hints in this single query, if you include OPTIMIZE FOR UNKNOWN.
And for mitigation to have taken 30 engineers a total of five hours is also something that could have, and should have, been handled better - especially since their fix was to add a SEVENTH hint. Since they knew the problem was a result of switching to the new cardinality estimator, and they identified the problematic query immediately, they could have pinned the query to the old estimator using QUERYTRACEON - which I suspect would have been quicker than whatever train of thought went into adding the MAX_MEMORY_GRANT hint that "solved" the issue. (A joke about how many VSTS engineers it takes to change a compatibility level is surely in the works.) And I suspect this query will stay like this, with all of those volatile items above in place, until the next cardinality estimator bump.
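Pinning a single query to the legacy estimator looks something like this (a sketch; trace flag 9481 is the documented flag that reverts a query to the pre-2014 cardinality estimation model, and QUERYTRACEON requires elevated permissions unless applied via a plan guide):

```sql
-- Sketch: keep the new CE database-wide, but have just this one
-- query optimized under the legacy (pre-2014) cardinality estimator.
SELECT TOP 1
       @collidingAccountName = ids.AccountName
FROM #identities ids
INNER JOIN tbl_Identity i
    ON i.AccountName = ids.AccountName
OPTION (QUERYTRACEON 9481, OPTIMIZE FOR (@partitionId UNKNOWN));
```

This scopes the rollback to the one regressed query instead of changing the compatibility level for the whole database.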
I hope they have learned from this, and I do applaud their transparency. But several have suggested that the team take a forced break to read Joe Sack's white paper, "Optimizing Your Query Plans with the SQL Server 2014 Cardinality Estimator." I don't think that's such a bad idea.
And I do sincerely hope they fix those object names - why a table needs a tbl prefix is beyond me.
Aaron (@AaronBertrand) is a Data Platform MVP with industry experience dating back to Classic ASP and SQL Server 6.5. He is editor-in-chief of the performance-related blog, SQLPerformance.com. Aaron's blog focuses on T-SQL bad habits and best practices, as well as coverage of updates and new features in Plan Explorer, SentryOne, and SQL Server.