You Can’t Meet Your RPO/RTO With AlwaysOn

“That title may have caught your attention. AlwaysOn is the future of HA/DR for SQL Server, and has been since the release of SQL 2012.  

AlwaysOn is actually a marketing term which covers Failover Cluster Instances (FCIs) and Availability Groups (AGs). Allan Hirt (@sqlha | blog) is a strong proponent of ensuring that people understand what this actually means. So much so that he even ranted about it a little

I’ve used FCIs for years, going back to the active/passive clustering days of old, and I’ve used Availability Groups in the last few months. They are both great, and both have limitations: FCIs with their shared storage and AGs with some network and quorum oddities. 

Both of them will do a fine job for you if you have the time, patience, and in the case of AGs, money to get them up and running. They still will not allow you to meet your RPO/RTO though. 

Critical to your business and your users is your up time, and that’s where the Recovery Time Objective (RTO) and Recovery Point Objective (RPO)  come into play. They reflect amount of time it will take to get your services back up and running, as well as the level of data loss that you are willing to accept.

 

 

Where FCI/AG win

The key problem with FCI/AG is that they do everything that they can to ensure that transactions are kept as up to date as possible. With FCI you move an entire instance over to another node, everything committed goes with it. With AGs the log records are shipped to the secondaries and applied in a synchronous or asynchronous fashion. The asynchronous setting is designed to get transactions there as soon as possible, and great for longer distances or where low commit times are ultra-critical. Both of these solutions solve two problems…a hardware issue or a software issue.

What does that mean? If your server goes down, then you can failover and lose next to nothing and be back up and running quickly. If Windows goes out to lunch on one of the machines then you can failover and keep ticking along.

So where do they fall down? 

 

 

What FCI/AG cannot do

Let’s say there’s a code release and a table accidentally has an update run against it with no where clause. All of a sudden you have a table in a 500GB database which contains 3 million rows and all of the information is wrong. Your users cannot use the application, your help desk is getting call after call and you are stuck. 

Your only option here is to restore your backup and roll up your transaction logs to the point right before the update happened. You’ve done tests on this and know that it will take 120 minutes to get back to that point. Now you have a 2 hour outage and users are screaming, the CIO is at your desk wondering how this happened, and demanding you get the database back up sooner.

FCIs and AGs are not going to help you in this situation. That update is already committed and so failing over the instance won’t help. The transaction logs were hardened immediately on your synchronous partner and applied within 5 seconds on your asynchronous target.

So how has AlwaysOn helped you in this situation? It hasn’t. And while you can sit there cussing out Microsoft for pushing this solution that has this massive failing it’s not going to solve your problem. That’s why you need something more than AlwaysOn. 

 

 

You can pry Log Shipping from my cold dead hands

 

 “Log Shipping?” I hear you ask, “but that’s so old.”

It sure is. It’s old, it’s clunky, and it is perfect for the scenario I just mentioned.  

You can configure log shipping to delay writing transaction logs to remote servers. Let’s say you delay logs for 1 hour. That accidental mass update was performed, you realize that you are in trouble. You quickly apply the logs on the secondary to the point in time before the update, bring the database online and repoint your clients. You are back up again in 5 minutes. It’s a momentary issue. Sure, you have an outage, but that outage lasts a fraction of the time. Your help desk is not inundated with calls, your users aren’t left out in the cold for hours. 

There’s nothing to say that you have to delay applying those logs for an hour. It could be 2 hours, or even 24. It really all depends on how you want to handle things. 

Sure, you have to do manual failover, and you don’t have the ability for automatic page level restores from one of the synchronous AG secondaries, but you have a level of data resiliency that AlwaysOn does not provide you. 

 

So while AlwaysOn technologies are great, and you should absolutely use them to enhance HA/DR in your business, but you have to be aware of their limitations, and be sure to use other parts of SQL Server to ensure that you can keep your business running. 

One thought on “You Can’t Meet Your RPO/RTO With AlwaysOn”

  1. Why not pause AlwaysON and then do the code release? If the data is wrong, force a failover from secondary to primary. Use it as rollback…….That will wipe out all the ‘bad’ transactions and data. The business are back online quickly and the developers can go back to the drawing board. A lot quicker than restoring from backup…

    Like

Leave a comment