Category: Rant

High Availability–It’s Not All About The Hardware/Software

As I write this I’m sitting my my house working with the last hour of juice left in my laptop. We had a major Snow + Seattle = Bad Newssnowstorm here in the Seattle area which was followed by a big ice storm and some heavy winds, all of which have coincided to knock out power to over 200,000 homes. I’ve been without power for over 75 hours now and it’s given me a lot of time to think about multiple facets of High Availability (HA) and Disaster Recovery (DR).

HA/DR are critical items for all companies. HA in various degrees, but most certainly DR and the ability to provide some kind of business continuity model. Different companies, large and small, have different requirements around HA/DR. Some could easily remain down for a couple of days with no lingering effects, others cannot sustain downtime on their most critical systems of more than a couple of minutes. Based upon those requirements (and any budget constraints) a company could implement HA/DR using Failover Clustering, Mirroring, Log Shipping, SAN Replication, Peer-to-Peer Replication, VMotion or some other technology.

 

 

Where HA/DR often falls down

This is all good and well but what about the people that you need in order to get things up and running, the folks that you need in order to provide you with ongoing support in the event that a situation arises where you have to go to your DR site, or provide support for that critical system where a mirror has failed?

In the last 80 hours I have been called by the NOC at work stating that they have an issue. I don’t have power at home so my internet access is off. I don’t have sufficient power in my phone to tether it (don’t tell the wireless company I can do that) and it would take 4 hours sitting in the car to get sufficient power to allow me to do that. I do have a mobile broadband USB device, however I’m not getting coverage in my house right now from that carrier and I have been unable to get out of my road. In other words in the whole HA/DR side of things I’m a single point of failure. Luckily I have a colleague with power who I was able to point the NOC to in order to provide resolution to the issue.

My colleague and I live less than 5 miles apart, we are just luckily on different circuits and substations for the power company. If neither of us had been able to dial in then the problem would not have been resolved.

 

If you truly want to implement HA/DR you need to go further than just looking at your hardware infrastructure or the way that your software is setup. It means more than your DNS configuration and your Active Directory setup. There’s more to it than setting a Failover Partner in your application connection string. It’s about the people that you need to support these things!

 

 

Artificial limitations

I’ve been a DBA for about 12 years now. I’ve not been hands on with a server since doing datacenter implementation work about 5 years ago. Other than in very small shops it’s rare that a DBA touches a server, most don’t even have access to the datacenter in which their database servers reside. There has been a huge move towards setup in Colocation Datacenters (CoLo’s) over the last few years, meaning that more often than not the datacenter where you servers reside are probably not even in the same physical building that you are. There is also the current move to the Cloud initiative that’s gaining some traction. Given all of this information I ask you this:

Why are DBAs so often tied to a physical location for work?

 

This is a serious question. There are so few telecommuting options for DBAs out there. You can look on the job boards and see virtually none. There’s a lot of relocation opportunities if you want to move to another state, but it doesn’t make a lot of sense. If you are going to geographically disperse your hardware due to HA/DR needs why would you not do that same for the teams that have to support it?

So many companies point blank refuse to even entertain the idea of telecommuters from another state (heck the vast majority won’t even contemplate allowing IT support staff to work from home a couple of days a week).

The same companies that happily outsource all development processes to an unknown entity in India or China refuse to entertain the idea of allowing Windows/Unix Admins or DBAs to telecommute.

 

I feel that it is somewhat ironic that the same people that expect you to come to the office and be tied to a physical location also have you be on call so that you can provide remote support when something goes wrong.

 

 

Moving beyond the office

Forcing IT support staff to come to a physical location 5 days a week makes little to no sense to me. I can understand the desire to have folks in the same physical location but as your organization grows it really needs to find a way to get past that mentality. There are various options such as LiveMeeting or GoToMeeting that can provide a telepresence in meetings for people that work remotely. Instant Messaging can be used to constantly keep people in touch. Services such as Yammer can be used to quickly and easily provide collaborative thought outside of meetings.

Looking at my current situation, if my colleague and I had both been without power we would have had no way to provide support. Our servers sit in a CoLo that we don’t have access to so even if we could have gotten there, we would have had no hands on way to work with the servers.

Now imagine we had a third person on the team that lived and worked in another state (or that the second person was living in another state and telecommuted in). Even with power down for me (or the entire state) we would still have someone available to provide support (unless by extremely unusual circumstances they were also down). With a remote DR site they could have effected whatever steps were necessary to bring that online and check that things were working ok.

Let’s look at this from another angle. I’m in the Pacific Time Zone and get to work usually 7:00-7:30am and leave 4:30-5:00pm. If we had someone working in the Eastern Time Zone and they worked the same hours we would have coverage (without emergencies handled by being on call) from 4am PST – 4:30pm PST. That’s an extra 3 hours of standard DBA coverage. Imagine how much more work could get done. How overnight problems could be identified and resolved earlier?

We live in a global society and the company that I work for does business internationally. Would it not make a lot more sense to provide a wider margin of coverage? Think about it, it would also improve the quality of life for anyone that had to be on call. For me the worst time to get called out is about 4am as I’ll never get back to sleep. If there was someone already in the office on the east coast I wouldn’t need to get called out. The same for them with evening plans. They could go out to dinner or a movie knowing that they wouldn’t have to worry about missing a call.

 

 

TL;DR

It’s the 21st century, there are many options to prevent the need for anyone to be physically tied to an office location (certainly amongst IT staff). In order to provide higher levels of support to companies they really need to embrace telecommuting options beyond providing emergency support.

 

 

Late Note: After 80 hours PSE were able to restore power to my neighborhood. Throughout the outage PSE did a fantastic job of handling things. Ultimately over 375,000 were without power at one time or another. PSE provided constant information wherever possible via their Twitter feed and website. They brought in crews from around the country to assist in the restoration efforts. All in all I think they handled things really well. I have no complaints with how they handled any of this. (I just wish I’d been closer to the top of the restore list)

DBA as a Commodity

If you’ve been working around the IT industry for a few years then you have probably heard mention of commodity servers. These are machines that run a standardized hardware spec, and are easily interchangeable with others. The price point around the machines is that it is typically cheaper to replace than it is to repair them, so once the warranty runs out you just throw the old one away and get a new one.

I was thinking about this the other day, and how it really goes beyond servers, which is where I thought about Database Administrators as a Commodity (or DBAaaC).

When first considering DBAaaC as an idea it really does not seem to make much sense. After all DBAs have to do all kinds of different work, and many have very different sets of skills that are applicable to a wide variety of situations.

If you are thinking that then the odds are that you are a DBA, or have worked closely with one over the years. This is not something that the vast majority of people, in particular IT management type folks seem to think.

What makes me believe this to be the case?

DBA Job Requirements

I’ve been in the DBA business for about 17 years now (why yes, that does make me feel old, thank you for asking). In this time I have had a few jobs, interviewed for many, and been an interviewer for even more. I have seen more generic DBA job desscriptions than I care to remember.

Generic Job Description
Typical job description

There are hundreds of jobs descriptions for SQL Server DBA jobs that look almost exactly like the one above, do a search yourself and see what commonalities seem to exist between roles. Think back to the last job description that you read, was it something similar to this?

Would it surprise you to learn that the above job description was actually copied from the job description of a company that was looking for an Oracle DBA?

Certifications

How many jobs have you seen looking for someone holding an MCDBA? That’s a certification that existed through SQL Server 2000 and was replaced by the MCITP with 2005, and then with the MCSE Data Platform in the last couple of years. It would seem that with management team that these things are easily interchangeable, after all there really is not difference between running SQL Server 2000 running in a cluster, and SQL Server 2016 with a multi-subnet Availability Group running on a cluster that is not joined to a domain, using DMVs to help you identify performance bottlenecks.

What about jobs that are looking for an MVP? That’s an award given by Microsoft for community contributions. It is not a barometer of technical knowledge, and yet I have seen several DBA jobs looking for people who are MVPs.

Over the years I have seen many jobs that also will only consider people that have bachelor degrees, because apparently having a bachelor degree means that you will be good at your job. I do not have a degree (of any kind), and know several excellent DBAs that do not either, but every one of us would be excluded from these jobs just because we did not take four years of college (and we’ll ignore for now how that plays into systemic racism in the US).

Experience

Here are the general guidelines for DBAs that I’ve seen over the years:

  • Junior DBA – < 3 years of experience
  • DBA – 3-6 years of experience
  • Senior DBA – > 6 years of experience

These are usually baked right into the job descriptions, and it is how companies seem to manage peoples careers and job progressions (not to mention candidates looking for job seem to think that their skills put them in line for a job thanks to years working with SQL Server, not with actual knowledge).

This completely ignores whether a person looking for a DBA role has 5 years or varied experience, or 5 years doing the same thing day in and day out (checking if jobs ran, and answering business questions).

All the Technologies

Your generic job description will frequently be looking for someone with experience (if not expertise) with

  • SSIS
  • SSRS
  • SSAS/OLAP
  • DTS (yes really)
  • AlwaysOn (even though it’s not a thing)
  • MSCS (even though it was replaced by WSFC almost a decade ago)
  • VLDB (many places believe > 50GB = VLDB)
  • VMWare
  • Azure
  • TCP/IP
  • Windows 2003/2008/2008R2/2012/2012R2
  • TSQL
  • Maintenance Plans
  • Replication
  • and many more…

There tend to be long lists, like the above, for job roles. It is really just a list of buzzwords that the hiring manager (or the recruiting team) have put together so that they can plug them into their software to automatically weed out candidates that don’t have those words in their resumes.

The above examples are all ways to quickly recognize if a company is likely to treat anyone coming in as a DBAaaC.

What would be better?

A More Reasonable Job Description

Given the very generic information that is listed for almost all DBA positions what would make a role stand out? Usually it is something that is a little more crafted, and provides and understanding of the world in which the DBA works. For example a job that listed the following things:

  • Required 3+ years with Availability Groups
  • Understanding of DMVs and how to use them to troubleshoot server performance
  • Can tune queries to improve performance and lower server impact
  • Ability to manage SQL Server backups and restores to meet companies documented RPO/RTO
  • Experience managing maintenance for tables larger than 100GB in size
  • Knowledge of how Windows and Active Directory settings can impact SQL Server performance
  • Can work with development teams on best practice guidelines for new code, and provide performance enhancements when problems occur
  • Can troubleshoot SSIS packages
  • Preferred current MCSE certification

This by no means would be a perfect list, however it does provide some good information, and would let a potential candidate know that management here at least has a clue as to what is required to perform in the role, and provides a basis for a conversation.

Working as a DBA

Once you are in a company how can you recognize whether or not you are being treated as a DBAaaC?

Think about potential new hires that the company are looking to make. Are they using generic job descriptions?

Do promotions seem to happen thanks to longevity in the role, rather than be based upon some more hard based criteria around the work that you do? Is there a planned career path for you to move to the next position?

Do you perform the same work day in, and day out? Are you solely focused on backups, or indexing, or checking on failed jobs?

Is there a way that you can receive ongoing training to help you get additional knowledge? (even if this means just paying for some online videos, or giving you a section of time every month to spend on skills development)

Does your manager treat you like a truly valued member of the team, or are you just there ensuring that things are set to run smoothly?

If any of the above ring true to you then the chances are that you are seen as a commodity, and you can be thrown away at any point, and replaced by someone else (at least management believes that to be the case).

A Word of Advice to Management

There is every chance that you believe that the people that work for you are interchangeable, no matter the role (I’ve seen this over the years with developers, database, Windows, and Unix admins). People really aren’t, except at the lowest skill levels (and frequently even not then).

People are not machines, do not think that you can just throw the old one away and pick a new one up. Consider the ongoing learning that is required for your people to have skills that remain current, and help you move ahead. Certainly, at times you will have people leave, or you may have to let people go, but don’t treat your entire team as if they could be replaced at a moments notice.

If you think that all DBAs are the same, spend some time looking on Stack Exchange at some of the questions asked, and consider what questions on there your DBAs could answer (and what questions they may post). You’ll quickly see that DBAs are all different, they have different areas of focus, and each can give you different things. They are far from being a commodity.

Who Needs a DBA?

I was perusing my RSS Feeds this morning, and came across an article by Steve Jones (b|t) about Changing Times. In this post Steve talks about Microsoft not having any DBAs in their cloud team, and that it is all managed by DevOps folks. This message came from a talk that was given at SQLBits and summarized by Victoria Holt.

MS devops facts

Wow, this looks amazing, and really seems to ring the death knell for the DBA…or does it?

Let’s consider what we are looking at here, and for the purposes of this, I am going to focus on Azure SQL Database (or whatever we’re calling it this week).

There are 1.7 million production databases that do not have DBAs. Well, that’s flat out wrong, there might not be DBAs at Microsoft, but you can be pretty sure that there are DBAs out there managing a portion of those. I guess we should really focus on the Microsoft side of things.

So there are 1.7 million databases, what is Microsoft’s role in these databases? Other than keeping up the infrastructure that the databases are based upon not much really. They have no posted performance metrics to which they must aspire. They are not going to be in the business of ensuring that your backups are taken, your HA requirements are based upon you setting up the services and regions correctly, and they don’t care how fast or slow your query is. There is an SLA for uptime, and don’t worry, if it gets below 99% you get a 25% discount.

Given that there are no requirements to perform any work that a typical DBA would be required to do why would they need a DBA at all? They wouldn’t. It is this kind of information that makes people who work as DBAs question their existence and future, and frankly, there is no need to do that.

As a DBA you will find that the cloud will become a significant part of your future (if it is not already), and your role will shift, but there will always be a place for the on-premises products (for as long as it is made), and that means DBAs to manage it. Heck if you want to see what the world without DBAs looks like just skim through dba.stackexchange sometime.

I feel that a lot of messaging coming out of Microsoft puts a not insignificant level of fear into some people, as they attempt to set an example where the only people that really matter are developers, and developers sitting in a DevOps role. Is Microsoft going to change their tune around any of this? Nope, I don’t see that happening at any point. One thing you could question is, given that there are all these Azure databases that don’t have DBAs why is it that Microsoft has any DBAs working for them at all? Surely DevOps people would be all that they need.

You Can’t Meet Your RPO/RTO With AlwaysOn

“That title may have caught your attention. AlwaysOn is the future of HA/DR for SQL Server, and has been since the release of SQL 2012.  

AlwaysOn is actually a marketing term which covers Failover Cluster Instances (FCIs) and Availability Groups (AGs). Allan Hirt (@sqlha | blog) is a strong proponent of ensuring that people understand what this actually means. So much so that he even ranted about it a little

I’ve used FCIs for years, going back to the active/passive clustering days of old, and I’ve used Availability Groups in the last few months. They are both great, and both have limitations: FCIs with their shared storage and AGs with some network and quorum oddities. 

Both of them will do a fine job for you if you have the time, patience, and in the case of AGs, money to get them up and running. They still will not allow you to meet your RPO/RTO though. 

Critical to your business and your users is your up time, and that’s where the Recovery Time Objective (RTO) and Recovery Point Objective (RPO)  come into play. They reflect amount of time it will take to get your services back up and running, as well as the level of data loss that you are willing to accept.

 

 

Where FCI/AG win

The key problem with FCI/AG is that they do everything that they can to ensure that transactions are kept as up to date as possible. With FCI you move an entire instance over to another node, everything committed goes with it. With AGs the log records are shipped to the secondaries and applied in a synchronous or asynchronous fashion. The asynchronous setting is designed to get transactions there as soon as possible, and great for longer distances or where low commit times are ultra-critical. Both of these solutions solve two problems…a hardware issue or a software issue.

What does that mean? If your server goes down, then you can failover and lose next to nothing and be back up and running quickly. If Windows goes out to lunch on one of the machines then you can failover and keep ticking along.

So where do they fall down? 

 

 

What FCI/AG cannot do

Let’s say there’s a code release and a table accidentally has an update run against it with no where clause. All of a sudden you have a table in a 500GB database which contains 3 million rows and all of the information is wrong. Your users cannot use the application, your help desk is getting call after call and you are stuck. 

Your only option here is to restore your backup and roll up your transaction logs to the point right before the update happened. You’ve done tests on this and know that it will take 120 minutes to get back to that point. Now you have a 2 hour outage and users are screaming, the CIO is at your desk wondering how this happened, and demanding you get the database back up sooner.

FCIs and AGs are not going to help you in this situation. That update is already committed and so failing over the instance won’t help. The transaction logs were hardened immediately on your synchronous partner and applied within 5 seconds on your asynchronous target.

So how has AlwaysOn helped you in this situation? It hasn’t. And while you can sit there cussing out Microsoft for pushing this solution that has this massive failing it’s not going to solve your problem. That’s why you need something more than AlwaysOn. 

 

 

You can pry Log Shipping from my cold dead hands

 

 “Log Shipping?” I hear you ask, “but that’s so old.”

It sure is. It’s old, it’s clunky, and it is perfect for the scenario I just mentioned.  

You can configure log shipping to delay writing transaction logs to remote servers. Let’s say you delay logs for 1 hour. That accidental mass update was performed, you realize that you are in trouble. You quickly apply the logs on the secondary to the point in time before the update, bring the database online and repoint your clients. You are back up again in 5 minutes. It’s a momentary issue. Sure, you have an outage, but that outage lasts a fraction of the time. Your help desk is not inundated with calls, your users aren’t left out in the cold for hours. 

There’s nothing to say that you have to delay applying those logs for an hour. It could be 2 hours, or even 24. It really all depends on how you want to handle things. 

Sure, you have to do manual failover, and you don’t have the ability for automatic page level restores from one of the synchronous AG secondaries, but you have a level of data resiliency that AlwaysOn does not provide you. 

 

So while AlwaysOn technologies are great, and you should absolutely use them to enhance HA/DR in your business, but you have to be aware of their limitations, and be sure to use other parts of SQL Server to ensure that you can keep your business running. 

In-Memory OLTP – Shiny, But Useful?

SQL 2014 introduces In-Memory OLTP to the product (although I prefer the name Hekaton).  It is a significant leap forward so far as the technology goes, and can lead to significant performance improvements when utilized, but is it going to be a technology for the masses? Maybe eventually, but right now I see limited use cases for it.

 

Limitations on compiled tables

First up there are limitations around what you can do with the In-Memory tables. There’s no option to perform a truncate or merge into one. You also cannot perform a cross-database join with one, or access it from a CLR module using context connections (see Accessing Memory-Optimized Tables Using Interpreted Transact-SQL).

Interested in using contained databases for portability? Forget using an In-Memory table then, just not going to happen. You can’t put one on a particular filegroup or partition either, or use computed columns. Like to use IDENTITY? Not any more. DRI? Nope, not supported, nor are clustered primary keys.

That being said you can do pretty much anything else with these tables, just so long as you don’t want to use a LOB data type, datetimeoffset, or potentially have a row overflow (see Supported Data Types). Then again, were that the case you shouldn’t be doing those kinds of things in your OLTP system. We’ve never seen anyone do that, right?

You could also have problems with transactions in these tables, and so you have to be very careful with setting retry logic

You might want to be careful when creating your tables too, because once they are there you can’t alter them, you actually have to drop and recreate it.

 

Limitations on indexes

Unique indexes are a no go, so are filtered ones. Want to drop or alter an index? Well that’s a table drop and recreate right there. Those null columns you have? Well you can’t have those in an index. Maybe it’s time to start listening to Joe Celko and getting rid of those pesky nullable columns.

 

Limitations on compiled procedures

On the bright side your developers will no longer be able to use cursors, so it’s not all doom and gloom. Set based items are a little more challenging though, you cannot do a multi-row insert using INSERT…VALUES, but that’s pretty minor, and just requires more typing.

You won’t be getting too recursive, CTEs aren’t supported, nor are subqueries. 

DELETE and UPDATE from aren’t allowed, nor are CASE statements. Forget about using UDFs, or the OUTPUT clause to get information back.

Want to get data from more than one table? UNION won’t get you there, nor will INTERSECT or EXCEPT. OUTER JOINS and APPLY aren’t supported either.

Filtering your results? Don’t expect to use ( OR IN (..)) as a construct or NOT, and you better be sure what you are looking for because LIKE isn’t allowed either. 

By no means is this a complete list, there are lots more limitations. 

 

So many limitations, such little time

In-Memory is fast, that’s been demonstrated, but there are so many restrictions around it’s usage that for most companies it is not going to be something that is even worth looking at. The use cases for it are probably too extreme.  Yes, it will get used, and used to great effect by some, but I think that we are looking at less than 1% of the install base could actually see true use cases and be able to adjust their development methods and code base to be able to support it (at least in this version).

So while this is an exciting breakthrough in performance you’ll have to excuse me while I keep working on tuning my code and indexes for performance, and setting up my AGs for DR.

 

 

 

For a full list of Unsupported Transact-SQL Constructs for In-Memory OLTP visit Technet

 

Developing To Specifications

I’m a DBA. As a class of people you will find that DBAs have a tendency to rant a little about developers. I would certainly be someone that you would find in that category. The trouble is that most of the time I don’t think that it is the developers fault, it is just a case of shooting the messenger.

As an example let’s look at a new database release that was being asked.

The biggest issue was that the code was provided so late in the cycle that the issues found (and there were a great many of them) could not be fixed. Why could they not be fixed? Because an arbitrary release data had been provided to the customer and this data could not slip for any reason whatsoever. Artificial deadlines, one of the worst things that devs and DBAs have to deal with.

The developers agreed to get the code fixes added to their backlog and to get it into a sprint for fixing in the next month. So after much discussion and a firm commitment we decided to move ahead with the release.

My next question to the dev team was “how are you going to get the data out? You have a few procs here for loading data into tables, but nothing for being able to consume that data afterwards.”

The response was a little stunning:

The only requirement to get done by is to have the data written to a database. After this deploy, we are going to create a way to get the data out

Outstanding. Way to develop to requirements.

In this instance I cannot really place blame on the dev team, they are just doing what is asked of them. I think the bigger problem is sitting with the folks who are gathering the business requirements and translating those, along with timelines, up to the business.

I think that it might be time for DBAs to stop pointing fingers at devs and start holding accountable those who are causing these problems, namely the Business Analysts and Project Managers.

Who’s with me on this?