T-SQL Tuesday #19–What A Disaster

Allen Kinsel (blog|twitter) is running this months T-SQL Tuesday and wanted to know about preparing or recovering from a disaster. I thought this might be a good opportunity to tell a little story of how a disaster sucked up around three weeks of my life a couple of years ago.

 

It was a normal day, I was sitting and quietly going through a small staging release when my director asked me to come into his office. Instantly I started wondering “what did I do?”

Not being able to come up with anything egregious I settled down the stomach gurgles and went wandering in. That I walked in and was offered a seat I got nervous again, this was not a good sign with this director. Now I was really curious as to what could be up.

 

What was up?

“Are you working on anything big right now?” were the words that kicked off something that changed my views on a lot of things.

“Nothing that can’t be put on hold, what’s up?”

“Did you hear about the <redacted> outage?”

“Sure, everyone here has heard about that, it’s a seriously messed up situation.”

“How do you feel about being a part of the solution?”

My long nurtured DBA sense of responsibility kicked in at this point and I heard myself saying “sure”

“Great, I’ll shoot an email off to the admin, you’ll be on a flight to <redacted> first thing in the morning.”

 

I quickly rearranged all of my plans for the next few days and resisted giving myself a facepalm until I was well clear of the office.

 

Up, up and away

The next morning I hopped on a plane and by mid-afternoon was at my place of destination. When I walked in the door it was all hands to the pump where people were rushing around like crazy and the smell of desperation was in the air.

I was brought into a room and the situation explained to me…

A couple of days prior there had been an attempt at a microcode upgrade on a SAN. The upgrade failed, crashed the SAN and corrupted all of the data. No databases could be attached or started, no files were accessible, there was no filesystem, nothing. It was bad.

I asked at what point the decision was going to be made to scratch it and go to a backup (figuring if there had been a DR site for this it would have been in place already).

 

Backups? We don’t need no stinking backups

Yup, you guessed it, there were no backups. I asked when the last time a backup was taken, someone stated that they thought a backup had been taken 9 months before, but they couldn’t be sure and they didn’t know of anyone that could get into that datacenter and see.

The reason for no backup? Apparently it was taking too long to perform the backup and so the decision was made to just turn off the backup process.

I came to find out later no effort had been made to tune the backup process or to attempt alternative backup methods.

I find it unfathomable that there were no backups on a critical system that supported > 1 million users and had extremely high visibility. You would have thought that DR would have been priority #1, but that was not the case.

 

Sleepless nights

While there was little that could be done with the dead storage there was a lot of work around what could be done as a mitigation strategy. What could be done to restore service while other work was done in an attempt to recover some amount of data.

A new SAN was brought up, database installations performed, a change management process was put in place (one had not existed before) and there was a lot of discussion around getting backups working immediately.

I didn’t move from the building the first 36 hours. Thankfully the company brought in three square meals a day for everyone that was there to ensure that we at least got fed. People were sleeping on the floor in offices just to try and get a couple of hours sleep so they could remain functional.

Restoration of service was a slow arduous process as great care had to be taken with the order of enabling certain components.

Slowly things got back to normal, hourly calls with the VP dropped to every four hours. I was able to sleep in a bed and get some rest and a change of clothes (at one point I told the VP I was running out of clothes and asked how much longer I was going to be there, his response was that I should probably go and expense some underwear).

 

Getting back some data

A little over two weeks after everything went kaboom we started getting word of some data recovery. A third party company had been brought in and had been performing a block by block recovery of the storage from the bad SAN. They were not able to pull files or anything that simple, they were just able to pull access data on blocks. With a great deal of effort they managed to recover 90% of the data, which then had to somehow get validated and reconciled with the data now in the system.

Scripts abounded, levels of confidence in the data had to be decided upon and the risks ascertained for data restoration. That to me was a very scary concept. I’m glad that the decision on that one was made well above my pay grade.

After about three weeks I was able to go home. My work (and everyone else who was sent down to work on the recovery attempt) was acknowledged in a company meeting a couple of months later.

 

Takeaways

This is a real world example of a disaster. If there had been a backup a great many good people would not have been stuck away from home for three weeks.

It gave me a much greater appreciation for what can happen in a disaster. Don’t get caught out, make sure your backups are good and you have a strong business continuity strategy.

One thought on “T-SQL Tuesday #19–What A Disaster”

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s