SQL Server Troubleshooting Horror Stories and Lessons Learned
If you spend any time with veteran SQL Server DBAs, you’ll hear troubleshooting horror stories that will make you cringe. I recently moderated a two-part SentryOne webinar, “SQL Server Troubleshooting Horror Stories and Lessons Learned,” which included a panel of renowned industry thought leaders from SQLskills: Paul Randal, Kimberly Tripp, Jonathan Kehayias, Erin Stellato, and Tim Radney.
In Part 1 of the webinar series, we shared our hard-won experiences and covered a variety of troubleshooting topics such as:
- Handling database corruption
- Parameter sensitivity in T-SQL code
- Wait statistics analysis and NUMA imbalance
- Scary "infinite recompile" error messages
- Updating servers without a testing process
In Part 2 of the series, we opened it up for questions from attendees in an Ask Me Anything (AMA) session.
If you missed the live webinars, click the links below to view them on-demand.
- Part 1—SQL Server Troubleshooting Horror Stories and Lessons Learned
- Part 2—Ask Me Anything (AMA) session with the panelists
During both webinars, we received great questions from attendees. We’ve compiled them below—along with the answers from the panelists. Enjoy!
Q: Does the CHECKDB need to run for VLDB on a primary or secondary replica? For a VLDB that is terabytes in size, will it take forever and what is its performance impact?
Paul: For information on this, check out Brent Ozar’s blog on “Where to Run DBCC on Always On Availability Groups.” I’d also recommend this resource as well.
Q: How do you restore a corrupted database boot page?
Paul: This SQLskills blog, “Disaster recovery 101: fixing a broken boot page,” can help you with that.
Q: I have several partitions in a filegroup. Can I restore only certain partitions?
Kimberly: Restores are always at the page-level, file-level, filegroup-level, or database-level. So, if you design/architect for partitioning AND recovery, you can do a piecemeal restore and even restore PART of your partitioned table (because it's on different filegroups).
The idea would be to restore the critical core components (primary, required tables like products/customers, and the critical part of your large, partitioned table—like 2020 and 2019, for example). You could bring that online and then later gradually restore 2018, 2017, 2016, etc. But, again, only if 2019, 2018, etc. are on different filegroups. The restore would be filegroup-based.
For more information, check out this resource.
Q: Is forcing hint of using indexes a good approach in queries? Is the use of Query Store helpful to use optimal plan?
Kimberly: Forcing is always a last resort. I always try other options first. But, if you force, make sure you go back and test! You can read this blog for additional best practices.
Q: I have several tables in a filegroup. How do I make a table read-only? I cannot make a filegroup read-only. Is there any alternative?
Kimberly: This isn’t possible except through permissions, I suppose.
Q: Could you elaborate on query hash and how to find plans that have been changed recently?
Erin: Yes, for information on this, please read my blogs, “Examining the Performance Impact of an Adhoc Workload,” and “Why You Need Query Store, Part II: Plan Forcing.”
Q: Should range query columns be a clustered index?
Kimberly: Range queries are best served by a nonclustered COVERING index. However, a clustered could be used, but might not be the most effective method.
Q: Does the topic discussion of schedulers and NUMA nodes for the CPU-bound workload cause any kind of thread starving?
Jonathan: I cover this topic in my blog, “Performance Problems with SQL Server 2020 Enterprise Edition Under CAL Licensing.” Check it out for more information.
Q: Jonathan mentioned changing table variables to in-memory tables. Can you talk more about that solution?
Jonathan: Yes, in fact Microsoft has a great resource, “Memory-Optimized Table Variables,” that speaks to this.
Q: Can Jonathan share any blog or case studies for NUMA configuration related issues?
Q: Which is better, option unknown or option recompile?
Kimberly: To help answer this, please check out my "Building High Performance Procedures" blog post.
Q: With AI in Optimizer, do you think performance tuning and indexing will become redundant in the next five years with self-tuning databases?
Tim: No. AI can help with execution plans, machine learning and algorithms can help with index creation, but neither are likely to be able to replace actual code tuning and changing logic within stored procedures and dynamic SQL. Plus, in five years, how many organizations will still be on SQL Server 2012, 2014, and 2016, all prior to any 2017 with the first attempt at automatic plan correction. Things are advancing well but making perf tuning and indexing redundant in five years—not likely at all.
Jonathan: Nope, not at all—I say that because people and ORMs write odd coding patterns that optimize in interesting ways, and I don't see AI being able to make the same logic tree decisions that creative code rewriting can do for getting the optimizer better information about reduced data sets. Early aggregations are possible when you understand the data set, business requirements, and what the end result "should" be, which is not something the optimizer can actually predict or know until it has run the query. Let's just say that I am not concerned about having a lack of work in performance tuning any time soon.
Q: In SQL Server 2019, can a task scheduled on particular CPU change to another CPU in the middle of query execution?
Jonathan: A runnable task can change schedulers in the same NUMA node if it is waiting and another scheduler in the node is not busy or actively running a task.
Q: When thread starvation happens, is it advisable to increase number of threads during peak time?
Jonathan: No, thread starvation just increases the queue of tasks at the scheduler and makes everything else wait longer. You need to troubleshoot the cause of the thread starvation issue and get to the root of the problem. Increasing threads usually doesn't fix the problem at all, it just takes a few seconds—or minutes—longer to tip over.
Q: Most DBAs are more technical and lack the domain expertise they are working in. How can this be resolved?
Tim: With time, they should gain experience, plus they should always be learning. Get a Pluralsight subscription and start watching courses. Attend webinars and other online or in-person training.
Kevin: We have several resources that delve into this topic, like "Futureproof Your Data Professional Career," "Who Wants to be a DBA: Roles & Responsibilities," and our panel discussions on "Top DBA Mistakes.
Q: What would be a good resource for learning Xquery better? I haven't found good sources for Xquery in SQL Server.
Q: I have a server that when running my Agent job—which runs sp_cycle_errorlog—it reboots the server. I have been going crazy trying to figure out what is causing it. Any ideas? It runs at 8PM once a week.
Jonathan: I’ve seen odd behaviors if it tries to cycle the log at midnight, so you might try offsetting the schedule to a different time. But I haven't seen it reboot the machine trying to cycle the errorlog before.
If it doesn't run the server stays up and doesn't reboot? If that is the case, I would open a Microsoft Support case for that issue. It shouldn't do that.
Q: How about the database rebuild/reorg index process—does it take a snapshot and chew up log space?
Jonathan: Index reorganize and rebuild do not use a snapshot, but they will consume log space.
Q: Is there any way to reduce the blocking caused by the DbccFilesCompact stage of DBCC ShrinkFile?
Paul: Don't run shrink! Why are you shrinking files? Also, no you can't.
Q: The reason that we are shrinking files is that we recently implemented a data lifecycle policy and associated data archiving processes. Our main OLTP database was over 8TB. We now have multiple TBs of free space and are using expensive Azure disks.
Jonathan: So it's a one-time shrink? That's acceptable—but no way to work around any blocking.
The only workaround is to run the shrink at a low usage period or maintenance window, kill it when activity increases, and run it again over multiple days or even weeks until it finishes.
Q: We purged 1 TB of old data on a database. Is there a quick way to release all that free space in the database? Shrink file takes too long and is causing blocking.
Jonathan: Shrink is "resume-able" meaning that the work done isn't lost if you stop it and then resume running it again later. You might have to shrink in low usage times over multiple days and stop the process when activity increases again to get it shrunk down completely.
Kevin: If possible, create a partition. Move all the data you wish to purge into the partition. Drop the partition. This is significantly faster than DELETE (logging all the deleted rows) and TRUNCATE (logging all the truncated pages), since the drop partition action consumes only the one transaction log action.
Q: There’s a new feature of SQL Server 2019 using Accelerated Database Recovery (ADR) that when a Tlog is getting smaller, the data file will grow. My question is, do you see any risk on this in case of corruption on the data file and if this ADR page was corrupted, will that corrupt the whole database to start up after restarting the instance?
Paul: Yes, I see risk—but no worse risk than for regular transaction log corruption.
Q: Why is SQL Server 2019 better for a lot of ad-hoc load in relation to using QueryStore? (We're on SQL Server 2016, and I love QueryStore.)
Tim: Here is a blog post from Erin about some of the Query Store changes for SQL Server 2019. Basically, you have options now to prevent QS from consuming too many resources with 2019.
Q: Is there any specific suggestion or documentation available that explains how to increase SQL Server performance on Azure VMs? Especially for disk read/writes over 6TB databases.
Tim: One of the first things I would check is that you have a VM size that allows for the disk throughput for your workload. Take a look at this article and reference this script for collecting throughput usage.
Q: We are running SQL Server 2017 in a VMware 6.5 cluster (hosts are 2x24 core, 1.5T RAM(768/NUMA). VMs are 20-24 cores, and we are seeing high CPU_RDY waits in VM.
Jonathan: One other note: If you are basing this off vCenter, you have to do math based on the resolution level of the chart to get the percentage CPU Ready wait, not the summation value. Anything over 5% per vCPU would be performance impacting, anything over 10% per vCPU will generally cause the mouse on an RDP session to be jumpy and not smooth. Check out more information in this resource.
Q: As a follow-on to the previous question, upon further examination, we discovered ESX is collapsing the VMs into the SAME NUMA. This feels like its related to soft-NUMA, but we don't (yet) have confirmation of this. (And yes, SOS_SCHEDULER_YIELDS are a sign this is happening. And no, we aren’t oversubscribed.)
Jonathan: What is the pCPU to vCPU ratio for the host? High CPU RDY waits are host oversubscription and you can confirm that with %CSTP as well in ESXTOP. Reducing vCPU count, moving servers off the host, putting big VMs with big VMs and moving little VMs off the same host can help but if the host is oversubscribed it's not due to Soft NUMA. Soft NUMA is SQLOS-specific and is only enabled by SQL Server 2016 when there are more than 8 physical schedulers in a single hardware node.
Under vNUMA, this could happen with your configuration because a hardware NUMA node on the host is 24 cores and 768GB RAM, and if a VM has less than 24 vCPU and less than 768GB RAM, it can be vNUMA optimized to a single node. You can control this by changing the configuration of the VM for sockets and cores, or by also configuring the Virtual Numa Controls for the VM. Check out this resource for further information.
In your case, you might want to try and configure numa.vcpu.maxPerVirtualNode = 12 so that the 24 vCPU VMs are split across the hardware nodes evenly into two virtual nodes. "It Depends..." and I can't say that this will be faster, but I can say that it won't reduce the %RDY or %CSTP waits for the VM because those are host oversubscription related and you have to reduce the vCPU to pCPU subscription ratio.
Since you aren't oversubscribed, then %RDY won't be significant for the VM guests. Ready Time only accumulates when a vCPU is ready to execute and must wait to be scheduled on a pCPU. Read this resource from the VMware Knowledge Base.
The value above represents the percentage of time that the virtual machine is ready to execute commands, but has not yet been scheduled for CPU time due to contention with other virtual machines.
Compare against the Max-Limited, %MLMTD value. This represents the amount of time that the virtual machine was ready to execute, but has not been scheduled for CPU time because the VMkernel deliberately constrained it. For more information, see the Managing Resource Pools section of the vSphere Monitoring and Performance Guide or Resource Management Guide.
If the virtual machine is unresponsive or very slow and %MLMTD is low, it may indicate that the ESX host has limited CPU time to schedule for this virtual machine.
The value above represents the percentage of time that the virtual machine is ready to execute commands, but that it is waiting for the availability of multiple CPUs as the virtual machine is configured to use multiple vCPUs.
If the virtual machine is unresponsive and %CSTP is proportionally high compared to %RUN, it may indicate that the ESX host has limited CPU resources, simultaneously co-schedule all vCPUs in this virtual machine.
Review the usage of virtual machines running with multiple vCPUs on this host. For example, a virtual machine with four vCPUs may need to schedule 4 pCPUs to do an operation. If there are multiple virtual machines configured in this way, it may lead to CPU contention and resource starvation.
Without having the ESXTOP information for the Guest VMs on the host, I can't make further suggestions unfortunately.
Q: Is database encryption a feature in SQL Server 2019 Standard edition?
Jonathan: Always Encrypted is supported on Standard Edition. You can compare editions and features here.
Q: Is there a checklist or class for best practices of SQL config under ESX?
Tim: There is an 83-page PDF that VMware has put out about virtualization of SQL Server on ESX. There is some really good tips in that document.
Q: Why does QueryStore (SQL2016) sometimes ignore forced-plan?
Jonathan: There are many reasons, but you can determine the exact reason using the XEvent query_store_plan_forcing_failed.
Q: Is it worth it to put the MAXDOP setting other than 0 if CPU is less than 8?
Jonathan: It depends on the workload, concurrency of requests, and whether query costs even allow for parallelism. Generally speaking, I would suggest starting at 1/2 the number of schedulers if 8 or less, just so that a single query can't use all the schedulers for a parallel query execution.
Tim: No, not ideal. MAXDOP 1 prevents parallelism. Look at cost threshold for parallelism to prevent trivial queries from being executed parallel. CXPACKET waits are normal, you just don't want very long wait times.
I’d like to extend a special thanks to the SQLskills team for participating in this webinar, “SQL Server Troubleshooting Horror Stories and Lessons Learned.” Here are some additional resources that you will likely find useful. Happy reading!
Kevin (@kekline) serves as Principal Program Manager at SentryOne. He is a founder and former president of PASS and the author of popular IT books like SQL in a Nutshell. Kevin is a renowned database expert, software industry veteran, Microsoft SQL Server MVP, and long-time blogger at SentryOne. As a noted leader in the SQL Server community, Kevin blogs about Microsoft Data Platform features and best practices, SQL Server trends, and professional development for data professionals.