Tag: Backups

Automated Backup Tuning

Tuning your backups is a wonderful thing. You could easily reduce the time it takes to backup your databases by 50% of more just by changing a few settings, and yet so few people do it. I asked myself the question why and came up with two answers.

  1. People do not know how to tune their backups.
  2. It takes too long to run through the tuning process.

How to tune your backups

I’m not going to go over methods for tuning your backups in this post. There are people who have done a far better job at both explaining the adjustments you can make and that have created videos to help you understand and perform the process yourself.

My biggest concern was directed at the level of effort required to test all the possible permutations of files, maxtransfersize and buffercount values, after all, who has time to figure all of that out and then capture the data to look at the relative performance characteristics of each one?

I decided that the best way to do this was to create a nice little test harness which would run through all those tests without manual intervention, and then figure out a way to get the output from all of those backups into some kind of meaningful graph for that instant visual on performance differences.

NobodyGotTimeForThat
No manual backup tuning for this DBA

The backup testing script

Step one in the automated tuning is a SQL script I created which accepts a few input variables:

  • @DatabaseName – name of the database you want to use for testing
  • @MaxBackupFiles – maximum number of files you want to write to at a time
  • @BackupFolder – destination folder for the backups
  • @WithCompression – whether or not to use SQL Server backup compression
  • @CreateJob – Whether or not to create a SQL Agent job to run the tests
  • @JobLogFileName – file path and name for a log for the job
  • @OutputToScreen – outputs the backup commands to the screen

When executed the script is designed to created SQL Agent job which you can then execute at a time of your choosing. The job will run through a series of backups for a database (at minimum 40) and capture the information in the job log file.

Be warned, performing this many backups test, with so many settings, can and will impact your server performance at some point. Be aware of this. Do not cause issues for your users when performing the tests, they will not like you for doing so.

Visualizing the results

Once the backup testing job is complete a job log file is left behind. This log file contains all of the information for each of the backups performed including the:

  • number of files backed up to
  • maxtranfersize setting used
  • buffercount setting used
  • speed of the backup in MB/sec
  • time the backup took to complete in seconds

Getting that information usable is done using a PowerShell script. A blog post by Jonathan Kehayis described his method for parsing SQLIO output to excel charts. Based on this script, the update version parses the backup log file that has been created, imports the information into excel and creates a couple of charts.

These charts show the speed (Figure 1) and backup time (Figure 2) for the default backup options, the absolute fastest backup to NUL, and the all of the other options that were tested.

Backup+Speed
Figure 1. Backup speed results for 100GB database backed up across 10Gb network to CIFS share

Backup+Time
Figure 2. Backup time results for 100GB database backed up across 10Gb network to CIFS share

The graphs provide a nice visual representation of the backup performance which helps to quickly identify that there are improvements that can be made.

The default SQL backup runs at just over 200MB/sec and takes about 235 seconds to complete, but we know that we can perform that same backup to a NUL device in about 75 seconds at a rate of over 1200MB/sec.

Just looking through the rest of the data it is clear to see that a lot of the values used make a negligible change to the performance of the backups, so we can work to make an informed decision on what the best file, maxtransfersize and buffercount settings should be.

Hopefully this is something that you can find useful. Please give it a try and send me feedback on your testing efforts.

Downloads: Backup Test Harness.sqlParse Backup Perf Tests.ps1

Generic Database Restores

The other day I found myself in the position of needing to restore multiple databases to a server. In this case all the log files would be on one drive and all the data files on another. Rather than sit there and write restore scripts for ~20 databases I figured it would be quicker to create a quick wrapper that I could just pass the filename into a then let nature take its course.

This is very basic, no restore performance tuning enhancements for example, however it does just the job that I need it to.

If I was smart I would have made this a stored procedure and then iterated through the list of backups in the directory using powershell and then executed the proc for each one. Maybe next time.

Anyhow, view the script below or download Database Restores Generic.sql

 

SET NOCOUNT ON

 

DECLARE @File NVARCHAR(1000) = N'MyDatabase.bak'

DECLARE @Path NVARCHAR(1000) = N'D:SQLBackups'

DECLARE @MDFPath NVARCHAR(1000) = 'E:MSSQLData'

DECLARE @LDFPath NVARCHAR(1000) = 'L:MSSQLLogs'

 

DECLARE @FullLoc NVARCHAR(2000)= @Path + @File

 

DECLARE @DatabaseName NVARCHAR(128)

DECLARE @RestoreMDF NVARCHAR(2000)

DECLARE @RestoreLDF NVARCHAR(2000)

DECLARE @RestoreCommandFull NVARCHAR(4000)

DECLARE @sqlexec NVARCHAR(4000) 

 

DECLARE @MDFID INT

DECLARE @LDFID INT

DECLARE    @MDFName NVARCHAR(128)

DECLARE @LDFName NVARCHAR(128)

 

 

DECLARE @RestoreFiles TABLE

    (

      ID INT IDENTITY(1,1),

      LogicalName NVARCHAR(128) , PhysicalName NVARCHAR(260) , [Type] CHAR(1) , FileGroupName NVARCHAR(128) , [size] NUMERIC(20, 0) ,

      MAXSIZE NUMERIC(20, 0) , FileID BIGINT , CreateLSN NUMERIC(25, 0) , DropLSN NUMERIC(25, 0) , UniqueID UNIQUEIDENTIFIER ,

      ReadOnlyLSN NUMERIC(25, 0) , ReadWriteLSN NUMERIC(25, 0) , BackupSizeInBytes BIGINT , SourceBlockSize INT , FileGroupID INT ,

      LogGroupGUID UNIQUEIDENTIFIER , DifferentialBaseLSN NUMERIC(25, 0) , DifferentialBaseGUID UNIQUEIDENTIFIER ,

      IsReadOnly BIT , IsPresent BIT , TDEThumbprint VARBINARY(32)

    )

 

DECLARE @RestoreHeader TABLE

    (

        BackupName NVARCHAR(128), BackupDescription NVARCHAR(255), BackupType SMALLINT, ExpirationDate DATETIME,

        Compressed CHAR(1), POSITION SMALLINT, DeviceType TINYINT, UserName NVARCHAR(128), ServerName NVARCHAR(128),

        DatabaseName NVARCHAR(128), DatabaseVersion INT, DatabaseCreationDate DATETIME, BackupSize NUMERIC(20,0),

        FirstLSN NUMERIC(25,0), LastLSN NUMERIC(25,0), CheckpointLSN NUMERIC(25,0), DatabaseBackupLSN NUMERIC(25,0),

        BackupStartDate DATETIME, BackupFinishDate DATETIME, SortORder SMALLINT, [CodePage] SMALLINT, UnicodeLocaleId INT,

        UnicodeComparisonStyle INT, CompatabilityLevel TINYINT, SoftwareVendorId INT, SoftwareVersionMajor INT,

        SoftwareVersionMinor INT, SoftwareVersionBuild INT, MachineName NVARCHAR(128), Flags INT, BindingID UNIQUEIDENTIFIER,

        RecoveryForkID UNIQUEIDENTIFIER, [Collation] NVARCHAR(128), FamilyGUID UNIQUEIDENTIFIER, HasBulkLoggedData BIT,

        IsSnapshot BIT, IsReadOnly BIT, IsSingleUser BIT, HasBackupChecksums BIT, IsDamaged BIT, BeginsLogChain BIT,

        HasIncompleteMetaData BIT, IsForceOffline BIT, IsCopyOnly BIT, FirstRecoveryForkID UNIQUEIDENTIFIER,

        ForkPointLSN NUMERIC(25,0), RecoveryModel NVARCHAR(60), DifferentialBaseLSN NUMERIC(25,0), 

        DifferentialBaseGUID UNIQUEIDENTIFIER, BackupTypeDescription NVARCHAR(60), BackupSetGUID UNIQUEIDENTIFIER,

        CompressedBackupSize BIGINT

        )

 

SELECT @sqlexec = 'RESTORE FILELISTONLY FROM DISK = ''' + @FullLoc + ''''

INSERT INTO @RestoreFiles EXEC (@sqlexec)

 

 

SELECT @sqlexec = 'RESTORE HEADERONLY FROM DISK = ''' + @FullLoc + ''''

INSERT INTO @RestoreHeader EXEC (@sqlexec)

 

SELECT @DatabaseName = DatabaseName FROM @RestoreHeader

 

 

SELECT @MDFID = MIN(ID) FROM @RestoreFiles WHERE [Type] != 'L'

WHILE @MDFID IS NOT NULL

BEGIN

 

 

    IF @MDFID = 1

        BEGIN

            SELECT @RestoreMDF = 'WITH MOVE ' + CHAR(39) + LogicalName + CHAR(39) + ' TO ' + CHAR(39) + @MDFPath +  REVERSE(LEFT(REVERSE(PhysicalName), CHARINDEX('', REVERSE(PhysicalName))-1)) + CHAR(39) + CHAR(13) FROM @RestoreFiles WHERE ID = @MDFID

            END

    ELSE

        BEGIN

            SELECT @RestoreMDF = @RestoreMDF + ', MOVE ' + CHAR(39) + LogicalName + CHAR(39) + ' TO ' + CHAR(39) + @MDFPath +  REVERSE(LEFT(REVERSE(PhysicalName), CHARINDEX('', REVERSE(PhysicalName))-1)) + CHAR(39) FROM @RestoreFiles WHERE ID = @MDFID

        END

 

SELECT @MDFID = MIN(ID) FROM @RestoreFiles WHERE ID > @MDFID AND [Type] != 'L' 

END

 

 

SELECT @LDFID = MIN(ID) FROM @RestoreFiles WHERE [Type] = 'L'

WHILE @LDFID IS NOT NULL

BEGIN

    SELECT @RestoreLDF = ISNULL(@RestoreLDF,'') + ', MOVE ' + CHAR(39) +LogicalName + CHAR(39) + ' TO ' + CHAR(39) + @LDFPath +  REVERSE(LEFT(REVERSE(PhysicalName), CHARINDEX('', REVERSE(PhysicalName))-1)) + CHAR(39) FROM @RestoreFiles WHERE ID = @LDFID

 

SELECT @LDFID = MIN(ID) FROM @RestoreFiles WHERE ID > @LDFID AND [Type] = 'L' 

END

 

 

 

SELECT @RestoreCommandFull = 'RESTORE DATABASE ' + QUOTENAME(@DatabaseName) + ' ' + CHAR(13) + 'FROM DISK = ''' + @FullLoc + '''' + CHAR(13)

SELECT @RestoreCommandFull = @RestoreCommandFull + @RestoreMDF + @RestoreLDF + CHAR(13) + ', NORECOVERY, STATS=20;'

--PRINT @RestoreCommandFull

 

EXEC (@RestoreCommandFull)

 

Refreshing Development Databases

Been a busy week at work and I’ve had to perform multiple dumps of production databases back to a development server. This is something that would normally be pretty quick and easy however a couple of things impacted that and led to having to make some choices about how to perform the task.

Here are the relevant details that I need to use to decide how to move the backup files:

  1. The servers are in two different datacenters and the WAN link between them is not very fast
  2. The backups are stored on a CIFS share that is non-routable for anything other than machines on that VLAN (for security reasons)
  3. There is only a single machine in the datacenter that has the ability to reach through the firewall to the development server
  4. I had to get the data across the WAN as fast as possible, however did not want to have to babysit what would be a several hour process (it should restarted in the case of any issues)
  5. All the hosts involved are running Windows 2008R2 SP1
  6. I did not want to install any extra software on any of the servers
  7. This is not an ongoing process and so I wanted something quick. No need to try and over-engineer the situation

 

How to get at the backup

My first thought was to use the improved TCP stack in Windows 2008R2, log on to the development server and pull the data from the CIFS share. Unfortunately this was out as the development machine does not have access to the subnet in order to pull the data.

Next was to perform a backup directly across the WAN to the development machine. This would not work due to firewall restrictions preventing communication between the machines (and this would also not be network optimized)

The only option left was to use the single machine in the datacenter (let’s call this the jump server), on the relevant VLAN, which had access through the firewall to push the backup to the development box.

 

How to optimize the copy and have it restartable

Opening a terminal server session to the jump server, opening an explorer window to the source and destination, then dragging the file over would copy the file. There have also been reported incidents where using the drag and drop method has caused memory problems (here’s an example from a forum thread http://social.technet.microsoft.com/Forums/en-US/winservergen/thread/3f8a80fd-914b-4fe7-8c93-b06787b03662/ ). This also does not address the concern with automatic restarts for the copy operation.

Opening a terminal server session to the jump server and running copy/xcopy would do the trick. There are no memory issues that I’m aware of with this method. We do have the problem whereby the copy process would not restart in the event of a problem. This method also does not do the best job of managing copies over a network. So this was out.

Opening a terminal server session to the jump server and using robocopy to copy the file made the most sense. Using the additional /r and /w switches you can have the copy process retry as many times as needed and have a defined interval between those retries. Robocopy also has a level of network optimization built in. As a bonus it also provides ongoing information as to the copy completion percentage and details statistics when complete. Robocopy is built in to Windows 2008R2 and so also meets the requirement of not wanting to install any more software.

 

Final copy solution

In the end I found myself opening up a terminal server remote desktop session to the jump server which had both a route to the backup file and access through the firewall to the development server. From there I used Robocopy to perform the file copy over the WAN, providing the best performance as well as automatic restarts from any failures.

 

I’m curious, would you have handled this differently?

Quickly Build Transaction Log Restore Commands With PowerShell

I’m in the middle of a server migration and so I’m setting up log shipping to get the databases consistent between the two servers prior to cutting over.

The backup copy over the WAN was taking a long time and so I had a lot of transaction log files that I needed to apply. I could have let log shipping take care of performing the log restores, but I found that a secondary process had created log backups also, just with different names. This lack of consistency was causing me a problem so I needed a quick way to build a restore command of all the files in order.

This was a quick win for PowerShell. I was able to quickly grab a list of the files in date order and pipe those to a string and build the restore command. This I was able to copy into SSMS and perform the restores.

$filelist = dir LogShippingPrimaryp$backup | where-object {$_.Extension -eq '.trn'} | Sort-Object {$_.CreationTime} | select Name

 

foreach ($file in $filelist)

{

$restorefile = $file.Name

 

$text = "restore log MyDatabase from disk = 'P:ForRestore$restorefile' with norecovery;"

Write-Output $text

}

 

If you wanted to take this a step further you could actually use invoke-sqlcmd to execute the $text string and perform the restore (don’t even contemplate doing that if you aren’t running at least SQL 2008 R2). I always prefer to do that within SSMS though.

While I’ve used this script for a migration task it also works really well if you need to perform a point in time restore for your database from a full backup and some transaction log backups. Can you imagine how long it would take you to build out the restore string to recover a database which had transaction log backups every 15 minutes for the last 20 hours? That’s 80 log files. Want to do that manually? This is a much nicer way to go.

Optimizing Database Restores

You can Google/Bing/AltaVista and get a lot of results for speeding up your backups. Those results will include using third-party tools, performing file and filegroup backups, differential backups and the like. Some of the more technical blogs will talk about adjusting some of the backup parameters to improve throughput.

There’s really not much written (that I’ve found, please add comments with links to posts you’ve found) that talks about tuning database restores.

I started messing with some restore parameters last week and found that it’s possible to make some dramatic performance improvements.

 

How this started

We were performing a data center migration at work. As a part of this migration we were going from SQL 2000 to SQL 2008R2 for some of our databases.

The database, we’ll call MyDB, is 620GB in size. There are 5 data files and a single log file. Current backups are performed using Snap technology, which wasn’t going to help us with performing a migration as we needed to get the data over and setup logshipping, so we needed to perform a regular backup.

Third party tools were not an option as this SQL instance was running SP3a and we weren’t able to find anything that would work below SP4.

A full backup was taken to local disk, it took a little over 9 hours (server and disk limitations prevented any work to try and improve that).

The backup file was transferred to a USB drive, taken to the new datacenter and copied over to a CIFS share where it would be available for performing restores.

 

Hardware/software setup

It’s going to help to have an idea of the configuration that’s being used here.

Hardware:

  • Server: HP DL585 G7 (clustered)
  • CPU: 4 x AMD 6176SE (2294Mhz, 12 cores) for a total of 48 cores
  • RAM: 128GB
  • Storage: Netapp SAN with WAFL filesystem – Fiber connected with 2 x 4Gb HBAs, MPIO configured for least queue depth
  • Network: 10Gb/s (HP NC375i) – 2 connections, configured for failover only, no load balancing
  • One LUN defined (as mount points) for each file in the database to be restored

 

Software:

  • SQL 2008R2 CU7 Enterprise Edition
  • Max server memory: 100GB
  • Affinity masks not set
  • Max degree of parallelism: 6
  • 4 x tempdb files

 

 

Basic restore

The initial restore was performed with default options, no optimization was performed.

RESTORE DATABASE MyDB FROM DISK = 'CIFSDeviceBackupSQLBackupsProductionForRestore.bak' 

WITH MOVE 'MyDB' TO 'L:MyDB_DataMSSQLDataMyDB_Data.mdf'

    , MOVE 'MyDB2' TO 'L:MyDB_Data2MSSQLDataMyDB_Data2.ndf'

    , MOVE 'MyDB3' TO 'L:MyDB_Data3MSSQLDataMyDB_Data3.ndf'

    , MOVE 'MyDB4' TO 'L:MyDB_Data4MSSQLDataMyDB_Data4.ndf'

    , MOVE 'MyDB5' TO 'L:MyDB_Data5MSSQLDataMyDB_Data5.ndf'

    , MOVE 'MyDB_Log' TO 'L:MyDB_LogsMSSQLDataMyDB_Log.ldf'

    , STATS = 5

    , NORECOVERY

 

Restore time was 2:48.

 

Optimization techniques

I knew that adjusting the MaxTransferSize and BufferCount settings could provide significant improvements in backup performance (go read a fantastic SQLCat article on this) and figured that the same would be true for performing restores.

Before I could start changing values I needed to know what values were being used with the basic restore. To do this I enabled trace flags 3213 (trace SQL activity during a backup/restore) & 3605 (output trace activity to the SQL error log).

DBCC TRACEON (3213, -1)

DBCC TRACEON (3605, -1)

 

With the flags enabled I performed another restore which provided the following information:

image

So by default we’re running with a BufferCount of 6 and MaxTransferSize of 1,024.

 

The MaxTransferSize value is the largest unit of transfer to be used between SQL Server and the backup media. This is specified in bytes and can range in value from 65536 (64 KB) to 4194304 (4 MB).

The BufferCount specifies the number of IO buffers to be used for the restore operation. Any positive number can be used for this, however you have to be careful with larger numbers as you could potentially cause out of memory errors.

 

Changing the MaxTransferSize

Not having any obvious hardware restrictions I decided that as the restore was being performed across the network adjusting the MaxTransferSize might provide the biggest benefit.

Feeling confident I just bumped it to the max and performed the restore.

RESTORE DATABASE MyDB FROM DISK = 'CIFSDeviceBackupSQLBackupsProductionForRestore.bak' 

WITH MOVE 'MyDB' TO 'L:MyDB_DataMSSQLDataMyDB_Data.mdf'

    , MOVE 'MyDB2' TO 'L:MyDB_Data2MSSQLDataMyDB_Data2.ndf'

    , MOVE 'MyDB3' TO 'L:MyDB_Data3MSSQLDataMyDB_Data3.ndf'

    , MOVE 'MyDB4' TO 'L:MyDB_Data4MSSQLDataMyDB_Data4.ndf'

    , MOVE 'MyDB5' TO 'L:MyDB_Data5MSSQLDataMyDB_Data5.ndf'

    , MOVE 'MyDB_Log' TO 'L:MyDB_LogsMSSQLDataMyDB_Log.ldf'

    , STATS = 5

    , NORECOVERY

    , MAXTRANSFERSIZE = 4194304 

 

This time the restore took 45 minutes, a reduction of 2 hours over the default.

 

Changing the BufferCount

Having already reduced the restore time by over 70% I wanted to see if I could get a little bit more performance from it. So I made an adjustment to the BufferCount value and set it to 75.

RESTORE DATABASE MyDB FROM DISK = 'CIFSDeviceBackupSQLBackupsProductionForRestore.bak' 

WITH MOVE 'MyDB' TO 'L:MyDB_DataMSSQLDataMyDB_Data.mdf'

    , MOVE 'MyDB2' TO 'L:MyDB_Data2MSSQLDataMyDB_Data2.ndf'

    , MOVE 'MyDB3' TO 'L:MyDB_Data3MSSQLDataMyDB_Data3.ndf'

    , MOVE 'MyDB4' TO 'L:MyDB_Data4MSSQLDataMyDB_Data4.ndf'

    , MOVE 'MyDB5' TO 'L:MyDB_Data5MSSQLDataMyDB_Data5.ndf'

    , MOVE 'MyDB_Log' TO 'L:MyDB_LogsMSSQLDataMyDB_Log.ldf'

    , STATS = 5

    , NORECOVERY

    , MAXTRANSFERSIZE = 4194304 

    , BUFFERCOUNT = 75

 

This restore took a little less time, 38 minutes, shaving another 7 minutes off the already shortened time. At this point I was down about 78% over the regular restore command.

I tried pushing the BufferCount to 100 to see if I could trim things further but found no appreciable difference and so left it at 75.

 

Comparing backup types

As a part of the move log shipping was to be setup to offload reporting for users and so I needed to perform a restore of the database from the new SQL 2008R2 server to another with exactly the same specifications.

The native SQL 2000 backup file was 520GB in size, with no optimizations it restored in 2:48, and optimized restored in just 38 minutes (not bad for a network restore).

The compressed SQL 2008R2 backup file was just 121GB in size. An optimized restore of this database took just 23 minutes. That’s an 86% saving over the unoptimized SQL 2000 based restore.

 

Summary

There are times when you are not able to perform partial restores, or you need to restore entire databases to a new environment in as short a time as possible. Don’t always accept the defaults that SQL Server will give you in these situations. There are things that you can do to improve the performance of your restores (and your backups). It’s worth your spending some time now working in your environment and optimizing your restores.

A little time playing with these things now could be the difference between your database being down for 20 minutes as opposed to 3 hours.

How much is 2 1/2 hours worth to your business?

 

 

Such fun.

Grabbing The Newest File From Subdirectories Using PowerShell

Every once in a while I come up with a need for something a little out of the ordinary, in this instance I was moving backups from one machine to another. Robocopy is the obvious tool of choice to do this. Trouble was that the backups were not all within a single directory, rather they were in multiple subdirectories. Robocopy will of course handle this with the /S switch. What Robocopy can’t handle is the fact that I want only the most recent file from each one of those subdirectories, not all of them (in this case I just wanted to move the most recent differential backup from one location to another).

I figured I could sit down and query msdb for this information and dump that out. I mean it works, it’s functional and does exactly what I would need it to do. Where’s the fun in that though, really? Add to that it would only handle SQL backups, what if I had a need to do the same thing later on  for other types of files? The TSQL query wouldn’t work for me then.

 

PowerShell to the rescue

Seriously, I think that I’m going to get a cape for PowerShell that it can wear around the place as it’s that damn super (although I am not sure that I want to see it’s underwear outside its pants).

In this example I’m going to be working from C:Temp on my local machine.

Within C:Temp there are two folders and a subfolder:

Each of these folders contains a couple of files:

 

To grab the most recent file from a particular folder is a quick one liner:

dir c:tempsubfolder2 | sort-object {$_.LastWriteTime} -Descending | select-object -First 1

 

That is the basis for the rest of the script. Essentially everything else just handles recursing through the subdirectories to grab this information:

cls

$Path = 'C:Temp' #Root path to look for files

$DestinationPath = 'RemoteD$' #Remote destination for file copy

 

#Grab a recursive list of all subfolders

$SubFolders = dir $Path -Recurse | Where-Object {$_.PSIsContainer} | ForEach-Object -Process {$_.FullName}

 

#Iterate through the list of subfolders and grab the first file in each

ForEach ($Folder in $SubFolders)

    {

    $FullFileName = dir $Folder | Where-Object {!$_.PSIsContainer} | Sort-Object {$_.LastWriteTime} -Descending | Select-Object -First 1 

    

    #For every file grab it's location and output the robocopy command ready for use

    ForEach ($File in $FullFileName)

        {

        $FilePath = $File.DirectoryName

        $FileName = $File.Name

        Write-Output "robocopy $FilePath $DestinationPath $FileName /R:6 /W:30 /Z"

        }

    }

Running this gives the results:

 

Quick and easy, does just what it says on the box. Feel free to download CopyNewestFileFromSubDir.ps1 and give it a try. Let me know what enhancements you put around it (here’s a quick one, to limit the type of files evaluated change !$_.PSIsContainer to $_.Extension –eq “.bak” )

T-SQL Tuesday #19–What A Disaster

Allen Kinsel (blog|twitter) is running this months T-SQL Tuesday and wanted to know about preparing or recovering from a disaster. I thought this might be a good opportunity to tell a little story of how a disaster sucked up around three weeks of my life a couple of years ago.

 

It was a normal day, I was sitting and quietly going through a small staging release when my director asked me to come into his office. Instantly I started wondering “what did I do?”

Not being able to come up with anything egregious I settled down the stomach gurgles and went wandering in. That I walked in and was offered a seat I got nervous again, this was not a good sign with this director. Now I was really curious as to what could be up.

 

What was up?

“Are you working on anything big right now?” were the words that kicked off something that changed my views on a lot of things.

“Nothing that can’t be put on hold, what’s up?”

“Did you hear about the <redacted> outage?”

“Sure, everyone here has heard about that, it’s a seriously messed up situation.”

“How do you feel about being a part of the solution?”

My long nurtured DBA sense of responsibility kicked in at this point and I heard myself saying “sure”

“Great, I’ll shoot an email off to the admin, you’ll be on a flight to <redacted> first thing in the morning.”

 

I quickly rearranged all of my plans for the next few days and resisted giving myself a facepalm until I was well clear of the office.

 

Up, up and away

The next morning I hopped on a plane and by mid-afternoon was at my place of destination. When I walked in the door it was all hands to the pump where people were rushing around like crazy and the smell of desperation was in the air.

I was brought into a room and the situation explained to me…

A couple of days prior there had been an attempt at a microcode upgrade on a SAN. The upgrade failed, crashed the SAN and corrupted all of the data. No databases could be attached or started, no files were accessible, there was no filesystem, nothing. It was bad.

I asked at what point the decision was going to be made to scratch it and go to a backup (figuring if there had been a DR site for this it would have been in place already).

 

Backups? We don’t need no stinking backups

Yup, you guessed it, there were no backups. I asked when the last time a backup was taken, someone stated that they thought a backup had been taken 9 months before, but they couldn’t be sure and they didn’t know of anyone that could get into that datacenter and see.

The reason for no backup? Apparently it was taking too long to perform the backup and so the decision was made to just turn off the backup process.

I came to find out later no effort had been made to tune the backup process or to attempt alternative backup methods.

I find it unfathomable that there were no backups on a critical system that supported > 1 million users and had extremely high visibility. You would have thought that DR would have been priority #1, but that was not the case.

 

Sleepless nights

While there was little that could be done with the dead storage there was a lot of work around what could be done as a mitigation strategy. What could be done to restore service while other work was done in an attempt to recover some amount of data.

A new SAN was brought up, database installations performed, a change management process was put in place (one had not existed before) and there was a lot of discussion around getting backups working immediately.

I didn’t move from the building the first 36 hours. Thankfully the company brought in three square meals a day for everyone that was there to ensure that we at least got fed. People were sleeping on the floor in offices just to try and get a couple of hours sleep so they could remain functional.

Restoration of service was a slow arduous process as great care had to be taken with the order of enabling certain components.

Slowly things got back to normal, hourly calls with the VP dropped to every four hours. I was able to sleep in a bed and get some rest and a change of clothes (at one point I told the VP I was running out of clothes and asked how much longer I was going to be there, his response was that I should probably go and expense some underwear).

 

Getting back some data

A little over two weeks after everything went kaboom we started getting word of some data recovery. A third party company had been brought in and had been performing a block by block recovery of the storage from the bad SAN. They were not able to pull files or anything that simple, they were just able to pull access data on blocks. With a great deal of effort they managed to recover 90% of the data, which then had to somehow get validated and reconciled with the data now in the system.

Scripts abounded, levels of confidence in the data had to be decided upon and the risks ascertained for data restoration. That to me was a very scary concept. I’m glad that the decision on that one was made well above my pay grade.

After about three weeks I was able to go home. My work (and everyone else who was sent down to work on the recovery attempt) was acknowledged in a company meeting a couple of months later.

 

Takeaways

This is a real world example of a disaster. If there had been a backup a great many good people would not have been stuck away from home for three weeks.

It gave me a much greater appreciation for what can happen in a disaster. Don’t get caught out, make sure your backups are good and you have a strong business continuity strategy.