Your Disaster Recovery Strategy Is Useless If Your SQL Server Backups Haven’t Been Tested
It has been drilled into every DBA's head to backup, backup, backup. We understand that it is not a matter of if we need to restore a database but when and how quickly. However, it has not been as heavily stressed that we need to test our SQL Server backups to ensure that they can be restored and to test how long it takes to restore them. Sadly, I learned firsthand just how important it is to test my backups.
A Failed Disaster Recovery Exercise
I was tapped to attend one of our bi-annual Disaster Recovery (DR) exercises that took place in a recovery center more than 1,000 miles away. These exercises required us to use most of our IT staff, as well as most of our developers, for long hours. As a Junior DBA at the time, I drew the short straw to work the overnight shift on day one. I was tasked with restoring our databases. No problem, right? I thought I would follow the runbook and I would be good to go. Boy was I wrong.
I started the restore on the third database in the runbook. That database was a dependency of most, if not all, of our enterprise applications. I estimated that by using the fully accurate method of two and a half times the backup time, the restore would take about three to four hours. I moved on to do another task while I waited.
An hour into the restore, I saw red all over my SQL Server Management Studio (SSMS) window. The database restore had failed. Like any good IT professional, I rebooted the SQL Server and the backup server and tried again. Once again, an hour went by and the restore had failed. I then copied the backup file locally to the instance and tried one more time. It failed again. The only option I had was to skip this database and move on. This database was used by most applications in our company, which meant this restore failure had ruined our entire DR exercise. Our organization had paid for us to prove that we could recover from a major disaster, and we had failed.
During the DR exercise debriefing, the DBA team was asked, "How could this have happened?" Our answer was that it was probably because we had never tested the backups. Senior leadership was not thrilled with our answer and made it clear to us how much money and time had been wasted with a failed test. They demanded that we test the backups.
Learning Backup Testing Best Practices
The task of testing the backups fell on me, the rookie. Not knowing any better, I would restore a database by manually writing out the T-SQL to do the restore on an old pizza box server. I would start the restores on Monday and I would receive an email when they were done. I felt surprisingly good about my process. But the problem with my approach was twofold—I was the only one doing the testing, and I only tested the databases once a week.
Our standard backup stance was to take weekly full backups of the database, nightly differentials, and logs every 15 minutes. This was not a problem, except that I would never test all the logs or all the differentials. I realized that wouldn’t be a problem until a developer truncated a table. I received the ticket to restore the database on a Friday afternoon and the full backup was taken on Sunday overnight. We had a differential taken every night. Our retention policy, which was flawed, was to remove the previous day's differential to clear space for the backup server. I knew the time that it took to restore, so I could give that information back to the development team and let them know how long their application would be down. The manual restores of the full database started and completed successfully; however, the differential was another story. My first thought was to start restoring the previous night’s differential, but I did not have one from which to restore. I was forced to manually restore each log.
I am not a math whiz, but that is 1 TRILLION log files to roll. I wrote the T-SQL to roll the logs. Since I was successful, it took a long while to get these logs applied. The Recovery Time Objective (RTO) was two hours, and I missed that time. Senior leadership then asked why it took so long. I explained the problem with the corrupted differential, and I elaborated that I had to manually restore those ~1 trillion log files. Their follow up question was, "Why couldn't this be automated?"
Automate the Testing Process
After manually restoring multiple databases multiple times, I began to search for a way to restore databases and to test with some sort of automation. I fired up my favorite search engine and found very little in the way of auto restores for SQL Server. But then Paul Brewer developed a stored procedure, sp_RestoreGene (you can download it here), and released it for free in December 2012. I discovered that this stored procedure reads the backup tables in MSDB and builds the scripts for you to run what you can run. It was the perfect start to my challenge. This stored procedure should be used with every DBA’s database. I developed a job that would loop through each database using Aaron Bertrand's sp_foreachdb that I found via Brent Ozar. This was the best way to for me to start this process.
After several iterations of the process, I realized I needed to not rely on the original instance staying available. I developed a plan to centralize the backups to our centralized reporting server. I was forced to make a few assumptions that we would not have duplicate database names in my environment. I was able to use sp_RestoreGene with minimal alterations.
Now that the backups were centralized, I was able to utilize the code that I used on the stored procedure running off the central server. I would use this server as the location for centralized database recovery data. I stored the start time of the restore, the total size of the backups, the finish time of the restore, and any error codes if we had any issues. This process proved to be helpful for a challenge I was given with Disaster Avoidance.
I worked for an organization that challenged the DBA team with Disaster Avoidance. We were tasked with using SQL Server log shipping to help the organization achieve the ability to move away from disasters, as well as to prove that we could move datacenters on a moment's notice. That part of the challenge was easy compared to the "clean up" work that came after the move to our alternate datacenter.
Failing over with SQL Server log shipping requires that you break your DR stance. You must re-establish the secondary datacenter after the move. Using SP_RestoreGene and the centralized data allowed us to restore the secondary datacenter much quicker than if we had used the manual process. We were also able to give leadership an exact timeline to bring our enterprise back to our DR posture. My DBA team could be confident in our ability to restore our databases any time we needed to.
Spend the Time to Test Your Backups
Over the course of my career I have restored hundreds of databases. Far too often, I experienced issues with the restores. Many of the problems were related to bad, missing, or corrupted backups. Until I had a process that ensured that I could restore my databases, I had to cross my fingers and hope. Setting up an automated backup testing process has proven to be extremely helpful and ensures we can meet the DR expectations that are set by our organization.
Shand is a 15 year veteran of IT. Over that time, he has worked with data in mainframe and distributed systems. His passion is disaster recovery, disaster avoidance and business continuity. Planning, testing, and implementing disaster plans, for MS SQL Server has become his passion. Having lived though several disasters in his career, both manmade and acts of God, Shand is prepared to take on whatever can be thrown at him from a disaster perspective.