Tag: Backups

T-SQL Tuesday #19–What A Disaster

Allen Kinsel (blog|twitter) is running this month's T-SQL Tuesday and wanted to know about preparing for or recovering from a disaster. I thought this might be a good opportunity to tell a little story of how a disaster sucked up around three weeks of my life a couple of years ago.

 

It was a normal day. I was quietly working through a small staging release when my director asked me to come into his office. Instantly I started wondering, “What did I do?”

Not being able to come up with anything egregious, I settled the stomach gurgles and went wandering in. When I walked in and was offered a seat I got nervous again; that was not a good sign with this director. Now I was really curious as to what could be up.

 

What was up?

“Are you working on anything big right now?” were the words that kicked off something that changed my views on a lot of things.

“Nothing that can’t be put on hold, what’s up?”

“Did you hear about the <redacted> outage?”

“Sure, everyone here has heard about that, it’s a seriously messed up situation.”

“How do you feel about being a part of the solution?”

My long-nurtured DBA sense of responsibility kicked in at this point and I heard myself saying “sure.”

“Great, I’ll shoot an email off to the admin, you’ll be on a flight to <redacted> first thing in the morning.”

 

I quickly rearranged all of my plans for the next few days and resisted giving myself a facepalm until I was well clear of the office.

 

Up, up and away

The next morning I hopped on a plane, and by mid-afternoon I was at my destination. When I walked in the door it was all hands to the pump: people were rushing around like crazy and the smell of desperation was in the air.

I was brought into a room and the situation explained to me…

A couple of days prior there had been an attempt at a microcode upgrade on a SAN. The upgrade failed, crashed the SAN and corrupted all of the data. No databases could be attached or started, no files were accessible, there was no filesystem, nothing. It was bad.

I asked at what point the decision was going to be made to scratch it and go to a backup (figuring that if there had been a DR site for this, it would already have been in use).

 

Backups? We don’t need no stinking backups

Yup, you guessed it, there were no backups. I asked when the last backup had been taken; someone said they thought one had been taken nine months before, but they couldn't be sure, and they didn't know of anyone who could get into that datacenter to check.

The reason for no backups? Apparently the backups were taking too long to run, so the decision was made to just turn off the backup process.

I found out later that no effort had been made to tune the backup process or to attempt alternative backup methods.

I find it unfathomable that there were no backups on a critical system that supported > 1 million users and had extremely high visibility. You would have thought that DR would have been priority #1, but that was not the case.

 

Sleepless nights

While there was little that could be done with the dead storage, there was a lot of work around a mitigation strategy: what could be done to restore service while other work went on in an attempt to recover some amount of data.

A new SAN was brought up, database installations were performed, a change management process was put in place (one had not existed before), and there was a lot of discussion around getting backups working immediately.

I didn't leave the building for the first 36 hours. Thankfully the company brought in three square meals a day for everyone who was there, to make sure we at least got fed. People were sleeping on the floor in offices just to try and get a couple of hours' sleep so they could remain functional.

Restoration of service was a slow arduous process as great care had to be taken with the order of enabling certain components.

Slowly things got back to normal: hourly calls with the VP dropped to every four hours, and I was able to sleep in a bed, get some rest, and get a change of clothes (at one point I told the VP I was running out of clothes and asked how much longer I was going to be there; his response was that I should probably go and expense some underwear).

 

Getting back some data

A little over two weeks after everything went kaboom we started getting word of some data recovery. A third-party company had been brought in and had been performing a block-by-block recovery of the storage from the bad SAN. They were not able to pull files or anything that simple; they were only able to recover data at the block level. With a great deal of effort they managed to recover 90% of the data, which then had to somehow be validated and reconciled with the data now in the system.

Scripts abounded, levels of confidence in the data had to be decided upon, and the risks of restoring that data ascertained. That to me was a very scary concept. I'm glad that the decision on that one was made well above my pay grade.

After about three weeks I was able to go home. My work (along with that of everyone else who was sent down to work on the recovery attempt) was acknowledged in a company meeting a couple of months later.

 

Takeaways

This is a real-world example of a disaster. If there had been a backup, a great many good people would not have been stuck away from home for three weeks.

It gave me a much greater appreciation for what can happen in a disaster. Don't get caught out: make sure your backups are good and that you have a strong business continuity strategy.
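If you're wondering where to start, here is a minimal T-SQL sketch (the database name and backup path are made-up placeholders): take a full backup with a checksum, then verify the file you just wrote.

-- Minimal sketch: full backup with a checksum, then verify the backup file.
-- [YourDatabase] and the X:\Backups path are placeholders for your environment.
BACKUP DATABASE [YourDatabase]
TO DISK = N'X:\Backups\YourDatabase_Full.bak'
WITH CHECKSUM, INIT, STATS = 10;

-- Confirm the backup file is readable and its checksums are valid.
RESTORE VERIFYONLY
FROM DISK = N'X:\Backups\YourDatabase_Full.bak'
WITH CHECKSUM;

RESTORE VERIFYONLY is no substitute for actually test restoring your backups somewhere, but it's a cheap first check that the file isn't junk.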

Guest Post On Hey, Scripting Guy! Blog

Recently I was contacted by Aaron Nelson (blog|twitter), who provided me with an awesome opportunity to write a guest blog post for the Hey, Scripting Guy! Blog. Naturally I jumped at the chance, and the post went live this week as part of a week of SQL-related posts in honor of SQLRally, which is coming up next week in Orlando.

 

My post was on using PowerShell to report on SQL Server backup status. Go have a read and let me know what you think. With the rest of the week dedicated to SQL and PowerShell, don't forget to keep checking the Hey, Scripting Guy! Blog.

Stop Logging Successful Backups

It's great that SQL Server writes to the Event Log and the SQL Server error log every time a backup completes. You get to see lots of entries saying that database master was backed up, then that database model was backed up, then that msdb was backed up, etc…

 

Is it really that useful?

Well, at least it can be useful to have in there. The thing is, there are better ways to find out when your databases were last backed up, such as using PowerShell or querying msdb (there's a rough sketch of the msdb approach just below).

The downside is that it quickly fills your logs with information that you don't really need. It gets even worse if you perform frequent transaction log backups. If you have three databases that are having their logs dumped to disk every 20 minutes, all of a sudden your SQL Server error log is next to useless. All you are seeing is backup success messages and not much else, and it becomes all too easy to miss the important things that you really need to pay attention to.
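Just as a rough illustration of the msdb approach (a sketch I'm throwing in here, not the query from the guest post), something along these lines shows the most recent backup of each type for every database:

-- Sketch: latest backup finish time per database and backup type,
-- pulled from the msdb backup history tables.
SELECT  d.name AS database_name,
        b.type AS backup_type,   -- D = full, I = differential, L = log
        MAX(b.backup_finish_date) AS last_backup_finish
FROM    sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
    ON b.database_name = d.name
GROUP BY d.name, b.type
ORDER BY d.name, b.type;

A database that only comes back with a NULL row has no backup history recorded at all, which should set off alarm bells.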

 

There is hope

Yes, it’s true! You can stop SQL from dumping that information into the log.

By enabling trace flag 3226 you kill these messages dead. Worry not: any failures will still be written to the logs, but all those pointless success notifications will vanish. All of a sudden your logs will be cleaner and meaner, important things will stand out, and your scroll wheel can take a break.

 

How to enable T3226

Open up SQL Server Configuration Manager and bring up the properties for the SQL Server service. Under the Advanced tab you'll see the startup parameters. Just tack ;-T3226 on to the end, apply the change, and restart SQL Server.
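If you want to see the effect without waiting for a restart, the flag can also be switched on globally from a query window. It won't survive a service restart, which is why the startup parameter is still the way to go.

-- Turn on trace flag 3226 globally for the running instance
-- (not persistent; the -T3226 startup parameter covers restarts).
DBCC TRACEON (3226, -1);

-- Check that the flag is now active.
DBCC TRACESTATUS (3226, -1);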

 

The steps are further outlined at http://msdn.microsoft.com/en-us/library/ms345416.aspx and http://msdn.microsoft.com/en-us/library/ms190737.aspx.

 

Go ahead, clean up those log files…you know you want to.