There’s a lot of talk about High Availability. I know I’m a huge fan of it, in particular clustering (though I know it’s not for all situations, and the changes in SQL Server 2012 with AlwaysOn Availability Groups may mean that traditional clustering gets used less and less). There are of course other HA solutions out there, like Log Shipping, Mirroring and SAN replication technologies (sorry folks, I disagree on transactional replication as an HA concept unless you are talking about downstream reporting infrastructure behind load balancers).
People constantly push to achieve the magical five nines availability number. What does that mean?
Five nines means that your SQL Server must be available 99.999% of the time. This means you can have only 5.26 minutes of downtime a year.
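The arithmetic behind that number is worth making explicit. Here’s a quick sketch (assuming a 365-day year) of the downtime budget at each level of nines:

```python
# Allowed downtime per year at a given availability target.
# Assumes a 365-day year (525,600 minutes).

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} minutes/year")
```

Five nines works out to 5.26 minutes a year; three nines gives you 8.76 hours.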
So let’s say that you are running clustering and mirroring; that you have load balancing with peer-to-peer replication; that you are running SAN replication between two datacenters and have log shipping set up, ready to go live at the flick of a switch. You have all your hardware ducks in a row and are ready for any emergency that may crop up.
I commend you for all of that.
Now let’s say that your manager is going to bonus you on the downtime that you have. You are the DBA, a single link in the chain, and your entire bonus is going to be based on the number of nines that you can get over the year.
Time to sit down with your manager and ask some questions:
- Does this five nines bonus plan include maintenance time?
- How are you going to measure uptime?
- What tools are going to be used to measure it?
- What are the tolerances on those tools?
- What about upstream items?
Let’s take a look at these one at a time.
Is maintenance time included?
There may be bugs in the database code for the application that would require the app being offline while changes are made. You may want to patch SQL Server with the latest service pack in order to maintain supportability. You may want to test failover and DR. If none of these things are excluded, then chances are you’ll have trouble making three nines (8.76 hours of downtime in a year) and no chance at all of making five nines.
How will uptime be measured?
Is uptime calculated based upon SQL Server being up and running? Or maybe on there being successful connections? How about certain queries returning results? Or a website returning a page within a certain time period?
So let’s say you are running a query. What if that query gets blocked or deadlocked? SQL is up and processing other transactions but that single query is having a problem.
Successful connections? Great, but that doesn’t mean the database they’re going to use is available. Same issue with checking whether SQL Server is available. Heck, are you measuring server uptime or database uptime here?
Website returning a page? What if there’s an issue with the web server? Or a networking problem between there and the SQL database?
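To see how much the definition matters, here’s a minimal sketch scoring the same hypothetical hour under each definition of “up.” All the component names and outage windows below are invented purely for illustration:

```python
# One invented hour, sampled once a minute. The SQL Server service never
# stops, but a database restore, some blocking, and a web outage each
# knock out progressively more of the stack.

samples = []
for minute in range(60):
    db_ok = not (10 <= minute < 14)              # 4-minute restore window
    query_ok = db_ok and not (30 <= minute < 32) # plus 2 minutes of blocking
    page_ok = query_ok and not (45 <= minute < 50)  # plus a 5-minute web issue
    samples.append({
        "service_running": True,  # sqlservr.exe was up the whole time
        "db_online": db_ok,
        "query_ok": query_ok,
        "page_ok": page_ok,
    })

def availability(key: str) -> float:
    """Percentage of samples where the chosen definition of 'up' held."""
    up = sum(1 for s in samples if s[key])
    return 100.0 * up / len(samples)

for key in ("service_running", "db_online", "query_ok", "page_ok"):
    print(f"{key}: {availability(key):.2f}%")
```

Same hour, four different availability numbers: 100% if you only watch the service, down to roughly 82% if you measure from the web page. Which one goes on your bonus sheet?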
What tool will you use?
Are you going to buy something off the shelf? Then who’s going to spend the time configuring it for your particular environment? Isn’t that time better spent on solidifying things, on performance analysis, on query tuning?
You’re going to make your own tool? Great, how are you going to do that, what technologies? How is the information going to be reported upwards? Who has the last say on possible exceptions? You have the time to write this yourself? Awesome, I’m glad that you do.
What is your tool tolerance?
Tolerance? Yup. Let’s say that you’ve figured out all the other stuff and have agreements in place for everything and got a tool that will do the job. How does that tool work?
Let’s say it runs a query and fails. It says that the database is down. When will it next attempt to query the database? In 15 minutes? If that’s the case then any bad poll from your tool just put you outside of that five nines criteria and your bonus has been dinged.
Now let’s say there was an actual problem that started 30 seconds after the last poll, but you resolved it before the next one. Everything was down for 14 minutes, yet your tool didn’t capture any of it.
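Those two failure modes of a coarse poll interval can be sketched like this (the outage windows are invented for illustration):

```python
# Sketch of how a 15-minute poll interval distorts measured downtime.
# Outages are (start, end) pairs in minutes; a poll "fails" if it lands
# inside an outage, and each failed poll charges a full interval.

POLL_INTERVAL = 15
DAY = 24 * 60

def measured_downtime(outages, interval=POLL_INTERVAL, horizon=DAY):
    """Downtime the tool reports over one day of polling."""
    failed = 0
    for t in range(0, horizon, interval):
        if any(start <= t < end for start, end in outages):
            failed += 1
    return failed * interval

def actual_downtime(outages):
    """Downtime that really happened."""
    return sum(end - start for start, end in outages)

# A real 14-minute outage that falls entirely between two polls:
missed = [(31, 45)]
print(measured_downtime(missed), actual_downtime(missed))  # tool sees nothing

# A 2-minute blip that happens to span a poll:
blip = [(44, 46)]
print(measured_downtime(blip), actual_downtime(blip))  # tool charges 15 minutes
```

A 14-minute outage measures as zero, while a 2-minute blip measures as 15 minutes, triple your entire five nines budget for the year.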
Some serious flaws there.
What about upstream items?
Are you the Windows, network, firewall, and datacenter admin? Are you responsible for generating the electricity that comes into the building? How about the HVAC system, is that yours? Are you checking the diesel levels in your backup generators weekly?
Each one of these upstream items can and will affect people’s ability to access your SQL Server and its databases. Unless you are responsible for them all, how can you be held accountable for perceived uptime on SQL Server?
SQL can be up, able to process data, and all the databases available, but without a functioning network nobody would know that, and your monitoring wouldn’t pick it up.
All of these things add up to not being able to accurately measure the mythical five nines uptime creature. If we can’t get those kinds of accurate measurements, and I can’t own those upstream processes, then why should I be held accountable for that number?
Here’s a better idea. Let’s look at the processes; let’s look at whether or not things are available from a user’s perspective. Let’s gauge customer interactions, build-outs, deployments. Let’s track tickets and issues. Let’s talk about the things we need to do to make things better. Let’s not go pulling a number out of the ether; it does nobody any favors.