FusionIO and the Go Faster Button

Nobody has ever said that FusionIO cards were slow (because they aren’t). Especially if you compare their performance to regular spinning media, or high performance SANs. After all, no SAN will allow you to measure storage write latency in microseconds.

Anyone that has had their database reside on a FusionIO card has had nothing but good things to say about them. The only thing is that a lot of people are probably not making the most out of the cards that they have. FusionIO cards actually have an artificial performance restriction put on them, and can go even faster than they do now.

Continue reading “FusionIO and the Go Faster Button”

Improving Performance When Querying Multiple DMVs

A couple of days ago I posted a stored procedure (sp_GetAGInformation) which queried multiple DMVs to pull together a bunch of AvailabilityGroup information. If you took a look at the code you would see that it used a couple of CTEs (Common Table Expressions).

CTEs are a great way to do recursive work, and they can also greatly simplify reading code. A CTE without recursion is really nothing more than a subquery that is nicely wrapped.

For example:


Basic CTE

Is the same thing as:


Basic Subquery

This can easily lead you down the path towards poor performance. It is quite easy to define a CTE once and use it multiple times, not realizing that every time you use the CTE then you are performing that subquery, meaning is has to be evaluated and executed.

For smaller queries this is not usually a problem, but for larger queries and more importantly when working with DMVs this can be a serious performance problem.

Continue reading “Improving Performance When Querying Multiple DMVs”

Rolling Upgrades With Availability Groups – A Warning

One of the great options provided by Availability Groups, in SQL Server 2012 Enterprise Edition and newer, is the ability to perform rolling upgrades to new Service Packs or Cumulative Updates.

The basic idea is that you apply the update to one of the AG secondary servers and then perform a failover of SQL to that server which then does the necessary things on the user databases to bring them up to the level of the update. The big advantage to this is that it minimizes the outage required to get the SP/CU applied, so that you are down for a few seconds instead of 40 minutes.


This is works really well for your regular user databases, however there is a problem when applying a CU or SP to a secondary server where a Integration Services (typically called SSISDB) is a member of an Availability Group. If you attempt to apply the CU/SP then it can fail and the SSISDB be left in an offline state. 

In order to apply the CU/SP you would first have to remove SSISDB from the Availability Group and recover it on each server you want to patch. Once you have completed patching all the servers you can add SSISDB back to the AG. But for that period of time you will be at risk, so get through and patch a couple of the machines and get the AG working for those as soon as possible.

Interestingly this does not apply for all CU/SP releases. Some do not make changes to SSISDB and this isn’t required. You can only find this out by patching, so be sure to get it going in your test environments first.

You Can’t Meet Your RPO/RTO With AlwaysOn

“That title may have caught your attention. AlwaysOn is the future of HA/DR for SQL Server, and has been since the release of SQL 2012.  

AlwaysOn is actually a marketing term which covers Failover Cluster Instances (FCIs) and Availability Groups (AGs). Allan Hirt (@sqlha | blog) is a strong proponent of ensuring that people understand what this actually means. So much so that he even ranted about it a little

I’ve used FCIs for years, going back to the active/passive clustering days of old, and I’ve used Availability Groups in the last few months. They are both great, and both have limitations: FCIs with their shared storage and AGs with some network and quorum oddities. 

Both of them will do a fine job for you if you have the time, patience, and in the case of AGs, money to get them up and running. They still will not allow you to meet your RPO/RTO though. 

Critical to your business and your users is your up time, and that’s where the Recovery Time Objective (RTO) and Recovery Point Objective (RPO)  come into play. They reflect amount of time it will take to get your services back up and running, as well as the level of data loss that you are willing to accept.



Where FCI/AG win

The key problem with FCI/AG is that they do everything that they can to ensure that transactions are kept as up to date as possible. With FCI you move an entire instance over to another node, everything committed goes with it. With AGs the log records are shipped to the secondaries and applied in a synchronous or asynchronous fashion. The asynchronous setting is designed to get transactions there as soon as possible, and great for longer distances or where low commit times are ultra-critical. Both of these solutions solve two problems…a hardware issue or a software issue.

What does that mean? If your server goes down, then you can failover and lose next to nothing and be back up and running quickly. If Windows goes out to lunch on one of the machines then you can failover and keep ticking along.

So where do they fall down? 



What FCI/AG cannot do

Let’s say there’s a code release and a table accidentally has an update run against it with no where clause. All of a sudden you have a table in a 500GB database which contains 3 million rows and all of the information is wrong. Your users cannot use the application, your help desk is getting call after call and you are stuck. 

Your only option here is to restore your backup and roll up your transaction logs to the point right before the update happened. You’ve done tests on this and know that it will take 120 minutes to get back to that point. Now you have a 2 hour outage and users are screaming, the CIO is at your desk wondering how this happened, and demanding you get the database back up sooner.

FCIs and AGs are not going to help you in this situation. That update is already committed and so failing over the instance won’t help. The transaction logs were hardened immediately on your synchronous partner and applied within 5 seconds on your asynchronous target.

So how has AlwaysOn helped you in this situation? It hasn’t. And while you can sit there cussing out Microsoft for pushing this solution that has this massive failing it’s not going to solve your problem. That’s why you need something more than AlwaysOn. 



You can pry Log Shipping from my cold dead hands


 “Log Shipping?” I hear you ask, “but that’s so old.”

It sure is. It’s old, it’s clunky, and it is perfect for the scenario I just mentioned.  

You can configure log shipping to delay writing transaction logs to remote servers. Let’s say you delay logs for 1 hour. That accidental mass update was performed, you realize that you are in trouble. You quickly apply the logs on the secondary to the point in time before the update, bring the database online and repoint your clients. You are back up again in 5 minutes. It’s a momentary issue. Sure, you have an outage, but that outage lasts a fraction of the time. Your help desk is not inundated with calls, your users aren’t left out in the cold for hours. 

There’s nothing to say that you have to delay applying those logs for an hour. It could be 2 hours, or even 24. It really all depends on how you want to handle things. 

Sure, you have to do manual failover, and you don’t have the ability for automatic page level restores from one of the synchronous AG secondaries, but you have a level of data resiliency that AlwaysOn does not provide you. 


So while AlwaysOn technologies are great, and you should absolutely use them to enhance HA/DR in your business, but you have to be aware of their limitations, and be sure to use other parts of SQL Server to ensure that you can keep your business running. 

Developing To Specifications

I’m a DBA. As a class of people you will find that DBAs have a tendency to rant a little about developers. I would certainly be someone that you would find in that category. The trouble is that most of the time I don’t think that it is the developers fault, it is just a case of shooting the messenger.

As an example let’s look at a new database release that was being asked.

The biggest issue was that the code was provided so late in the cycle that the issues found (and there were a great many of them) could not be fixed. Why could they not be fixed? Because an arbitrary release data had been provided to the customer and this data could not slip for any reason whatsoever. Artificial deadlines, one of the worst things that devs and DBAs have to deal with.

The developers agreed to get the code fixes added to their backlog and to get it into a sprint for fixing in the next month. So after much discussion and a firm commitment we decided to move ahead with the release.

My next question to the dev team was “how are you going to get the data out? You have a few procs here for loading data into tables, but nothing for being able to consume that data afterwards.”

The response was a little stunning:

The only requirement to get done by is to have the data written to a database. After this deploy, we are going to create a way to get the data out

Outstanding. Way to develop to requirements.

In this instance I cannot really place blame on the dev team, they are just doing what is asked of them. I think the bigger problem is sitting with the folks who are gathering the business requirements and translating those, along with timelines, up to the business.

I think that it might be time for DBAs to stop pointing fingers at devs and start holding accountable those who are causing these problems, namely the Business Analysts and Project Managers.

Who’s with me on this?

Automated Backup Tuning

Tuning your backups is a wonderful thing. You could easily reduce the time it takes to backup your databases by 50% of more just by changing a few settings, and yet so few people do it. I asked myself the question why and came up with two answers.

  1. People do not know how to tune their backups.
  2. It takes too long to run through the tuning process.

How to tune your backups

I’m not going to go over methods for tuning your backups in this post. There are people who have done a far better job at both explaining the adjustments you can make and that have created videos to help you understand and perform the process yourself.

My biggest concern was directed at the level of effort required to test all the possible permutations of files, maxtransfersize and buffercount values, after all, who has time to figure all of that out and then capture the data to look at the relative performance characteristics of each one?

I decided that the best way to do this was to create a nice little test harness which would run through all those tests without manual intervention, and then figure out a way to get the output from all of those backups into some kind of meaningful graph for that instant visual on performance differences.

No manual backup tuning for this DBA

The backup testing script

Step one in the automated tuning is a SQL script I created which accepts a few input variables:

  • @DatabaseName – name of the database you want to use for testing
  • @MaxBackupFiles – maximum number of files you want to write to at a time
  • @BackupFolder – destination folder for the backups
  • @WithCompression – whether or not to use SQL Server backup compression
  • @CreateJob – Whether or not to create a SQL Agent job to run the tests
  • @JobLogFileName – file path and name for a log for the job
  • @OutputToScreen – outputs the backup commands to the screen

When executed the script is designed to created SQL Agent job which you can then execute at a time of your choosing. The job will run through a series of backups for a database (at minimum 40) and capture the information in the job log file.

Be warned, performing this many backups test, with so many settings, can and will impact your server performance at some point. Be aware of this. Do not cause issues for your users when performing the tests, they will not like you for doing so.

Visualizing the results

Once the backup testing job is complete a job log file is left behind. This log file contains all of the information for each of the backups performed including the:

  • number of files backed up to
  • maxtranfersize setting used
  • buffercount setting used
  • speed of the backup in MB/sec
  • time the backup took to complete in seconds

Getting that information usable is done using a PowerShell script. A blog post by Jonathan Kehayis described his method for parsing SQLIO output to excel charts. Based on this script, the update version parses the backup log file that has been created, imports the information into excel and creates a couple of charts.

These charts show the speed (Figure 1) and backup time (Figure 2) for the default backup options, the absolute fastest backup to NUL, and the all of the other options that were tested.

Figure 1. Backup speed results for 100GB database backed up across 10Gb network to CIFS share

Figure 2. Backup time results for 100GB database backed up across 10Gb network to CIFS share

The graphs provide a nice visual representation of the backup performance which helps to quickly identify that there are improvements that can be made.

The default SQL backup runs at just over 200MB/sec and takes about 235 seconds to complete, but we know that we can perform that same backup to a NUL device in about 75 seconds at a rate of over 1200MB/sec.

Just looking through the rest of the data it is clear to see that a lot of the values used make a negligible change to the performance of the backups, so we can work to make an informed decision on what the best file, maxtransfersize and buffercount settings should be.

Hopefully this is something that you can find useful. Please give it a try and send me feedback on your testing efforts.

Downloads: Backup Test Harness.sqlParse Backup Perf Tests.ps1