Over the last 10 years plus of being a DBA I’ve performed LUN manipulation tasks on dozens of occasions with no errors or problems. Other than adding new disks to Windows 2003 clusters I’ve never had to take SQL Server offline to perform these tasks either.
A simple request
I needed a new drive to be added to a server I had just taken over in order to perform backups of some large databases. This is something that the SAN engineers and Windows support teams have a lot of experience doing and so we weren’t concerned that there might be any problems.
The uh-oh moment
I was running a couple of scripts on the SQL instance hosted on that server when all of a sudden I had a corruption errors show up. The dreaded error 823 sat there in red staring me in the face. I had just taken over this machine, there were no backups and right as we were trying to add a LUN to be able to perform those things went boom.
My stomach hit the floor.
Remembering the words of Douglas Adams
The last thing you want to do as a DBA when there’s a problem is panic. Panic I did not. I took a deep breath and walked down to speak with the SAN and Windows admins to find out what might have happened. They reported no problems, showed that the new LUN was on there and a new drive created ready for me to start performing backups.
My nerves calmed somewhat I went back to my desk and started trying to figure out what the problem was. The error logs were showing that the data files did not exist. Very strange as after connecting to a remote desktop session on the server I was able to see the drive and files without any issue even though SQL could not.
I started digging through the Windows event logs to see if anything stood out. Right at the time that the errors started happening a single error showed in the system event log:
Event Type: Error
Event Source: PlugPlayManager
Event Category: None
Event ID: 12
Description: The device ‘LUN SCSI Disk Device’ (SCSIDisk&Ven&Prod_LUN&Rev4&23432&3&2) disappeared from the system without first being prepared for removal.
The error stated that the LUN hosting the data files lost connectivity and came back online almost immediately.
With SQL not able to see the files or drive I had no choice but to restart the instance in the hope that things would come back online. With bated breath I restarted the service and waited to see what would happen. One by one the databases started coming back online and my pulse slowed a little. Once all the databases showed as recovered without error I kicked off a DBCC CHECKDB on each one so that I could feel comfortable that everything was good and happy. The DBCC execution took 7 hours and everything came back clean. Happy with this knowledge I kicked off a backup and went to bed.
So what happened?
Unlike any of the SQL instances I have worked with before this one was running on a VMware ESX virtual machine and the SQL LUNs, rather than being a part of the VM itself, were raw mapped drives. A raw mapped drive is handled using raw device mapping which creates a symbolic link from within the VMware file system to external LUN (at least that’s what my admin tells me and I believe him).
What we found was a quirk that when VMware performed a bus scan to find the new LUN and create the mapping it very briefly dropped connectivity to the existing raw mapped drive. This is what caused Windows to lose the drive and SQL to lose access to the files. The drive was down for about 3 seconds, more than enough time for SQL to recognize a problem.
From now on I will be very careful with any kind of LUN changes on VMware machines with raw mapped drives. I’ll ensure that we schedule an outage and take SQL offline while any adjustments are made just in case something like this crops up again.