Post number four in the Apathy series covers on-call, something which has been the bane of many a DBA and sysadmin over the years. More recently this has become a cause of consternation for developers, especially those that have entered the DevOps world.
If you don’t know what being on-call is like, well lucky you. There are a lot of folks out there that envy you. Likewise if you are a part of a large on-call rotation, or just the third layer of support for things. Even so, being on-call might be something that leads you down the road to not liking your job any longer.
Symptoms of reaching on-call burnout
Once you get into an on-call rotation you find that your life changes a great deal. Time that was once your own may not be yours any longer, and those nice 8 hour nights of sleep that you enjoy become a thing of the past. Here are a few things that you need to keep an eye out for that will have a long lasting impact:
- A short on-call rotation that leads you to being the point person every two to three weeks
- Standard tasks that need to be completed as a part of being on-call
- Out of hours standard tasks that need to be completed by the on-call person
- Frequent call outs
- Random calls for issues completely unrelated to your responsibilities
- Your friends want to know when you’ll be able to hang out
- You did not get to see Star Wars: The Force Awakens for three weeks as you had to be available just in case
- Putting your phone to your ear as you stand up and walk away from the dining table is a normal thing
- You consider getting a waterproof phone so that you can take a shower
- There is a spare room made up for you so that you don’t bother your partner climbing back into bed at 3am
- You haven’t been to the local user group for several months as there isn’t a solid Wi-Fi connection there
- Choosing a cellular service provider is all about whether or not you can tether on LTE while taking the bus home
- You start to think that sleeping is for wimps as you’ve gone days without it
- The backpack you carry is really heavy as it has a built in battery just so that you have enough juice just in case
- The dogs have got lost in the lawn that you meant to cut four weekends ago
- When your phone rings you actually want it to be a telemarketer
I could continue, but you get the point. If you don’t, take a look at the symptom list from The Overworked Predicament, as there is a lot of overlap. The fact is that being on-call can take over your life and prevent you from doing any of the things that you want to do.
Attacking the on-call exhaustion
Getting through the issues associated with being on-call can be extremely difficult, simply by virtue of the fact that much of what happens is beyond your control. For the moment let’s ignore those things and take a look at things that you can control (or might be able to change).
Drop the commute
Unless you are one of the lucky few that gets to work from home then you have a commute to work, be it short or long. In fact a 2013 survey showed that the average American worker has a 25 minute each way commute. That’s almost an hour a day, and 8% of workers spend that on a one way trip, just to get to the office.
Just think what that extra hour, or two, could do for you. That’s more time to spend asleep, to see your loved ones, or you could even get some exercise in.
I’m not advocating that you go to your management folks and say that you are going to work from home from now on. Just ask if they’ll give you the option to work from home those weeks when you are the on-call person, or to not have to physically come to the office when you’ve been called out that night.
Some managers will balk at this idea and state that you must be in the office in order that you can do your job. I have always found this to be an interesting, and hugely flawed argument.
The workplace fallacy
you cannot do your job when you are not in the office, as such you must be here every day
At the same time management says
you must be available at all times when you are the on-call person, so that you may respond immediately, wherever you are
Which one of those arguments is true?
If you can only do your job in the office then logically you must either live there and not leave when you are on call, or they must employ sufficient members of staff in the same role as you so that there is coverage in the office 24 x 7. If this is the case, then great, you no longer have to be on call. I call that a win.
On the other side of the argument, if you have to be available constantly, and ready to respond immediately when the phone rings, and you are actually allowed to go home (I think management decries labor laws a lot of the time), then you must have the ability and the capability to work from home. If you have this ability then there is no hard and fast reason why you should be forced to go into the office after a rough night of being woken up every 2 hours.
Should your manager come up with some reason that you cannot actually work from home, then it might be worth politely explaining that it will take a lot longer to respond to being called out when you have to commute into the office to fix the problem. After all, you can’t work from home, right?
Eliminating standard tasks
Every night, at 10pm, you have to log in and kick off an ETL process. You have to do this because sometimes it fails, and when it does you have to be right there to get it started again so that it will complete by morning.
Or how about, every Saturday, at 11pm, we have 3 hours of downtime for weekly maintenance. This downtime will include rebuilding indexes, updating statistics, and ensuring that we aren’t processing any transactions so that Janice and Fred in the warehouse can get an accurate inventory count.
Solidify your processes
If the nightly ETL process fails 20% of the time then there are a couple of things that you can do.
First things first, figure out why it keeps failing. If it is something under your control then take care of it. For example, there is sometimes another process still running when the ETL job kicks off, and those two running together cause tempdb to grow, and it fills the drive, causing everything to fail. In this case add additional capacity to that drive, or split tempdb files over multiple drives, or even set a hard stop on the first process so that the more vital ETL job has the resources that it needs.
Now the ETL process only fails 5% of the time. Now it fails because a remote endpoint doesn’t respond quickly enough, and so you have to restart things manually. Well, is it worth your while babysitting this for the one time in twenty that it doesn’t work? How about setting up an alert for that job so that if it does fail it will send you a text message? Most cellular providers have email to SMS gateways that will allow you to send plain text emails out that will get processed and delivered as text messages on your phone. You can set up a really obnoxious ring tone for when that message comes in, so that you’ll wake up and deal with the error.
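As a rough sketch of that alert, here is how a failed job might build a short plain text email aimed at an email to SMS gateway. This uses only the Python standard library. The gateway address, SMTP host, and sender address are made-up placeholders, and your provider's gateway format will differ.

```python
import smtplib
from email.message import EmailMessage

# Hypothetical placeholders: swap in your provider's gateway and your own mail relay.
SMS_GATEWAY_ADDRESS = "5551234567@sms.example.com"
SMTP_HOST = "smtp.example.com"


def build_job_alert(job_name: str, error: str, to_addr: str = SMS_GATEWAY_ADDRESS) -> EmailMessage:
    """Build a short plain text email suitable for an email to SMS gateway."""
    msg = EmailMessage()
    msg["From"] = "etl-alerts@example.com"  # hypothetical sender
    msg["To"] = to_addr
    msg["Subject"] = f"FAILED: {job_name}"
    # Keep the body short, since a single text message holds roughly 160 characters.
    msg.set_content(f"{job_name} failed: {error}"[:160])
    return msg


def send_job_alert(msg: EmailMessage, host: str = SMTP_HOST) -> None:
    """Hand the message to an SMTP relay. Needs a reachable mail server."""
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)
```

Calling `send_job_alert(build_job_alert("nightly_etl", "remote endpoint timed out"))` from the job's failure step is the whole integration. The obnoxious ring tone is left as an exercise.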
Let’s take this a step further though. If this ETL process fails 5% of the time because of a remote endpoint problem, but usually succeeds on the second attempt, then have the job retry that data pull before reporting failure and waking you up with that foghorn notification.
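A minimal sketch of that retry-before-alert idea, in Python. The `job` and `alert` callables are stand-ins for whatever starts your data pull and whatever wakes you up, and the two attempts and 60 second pause are assumptions to tune, not recommendations.

```python
import time
from typing import Callable


def run_with_retry(
    job: Callable[[], None],
    alert: Callable[[str], None],
    attempts: int = 2,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run job up to `attempts` times and call alert only if every attempt fails."""
    last_error = ""
    for attempt in range(1, attempts + 1):
        try:
            job()
            return True
        except Exception as exc:  # in real use, catch only the errors worth retrying
            last_error = f"attempt {attempt}/{attempts}: {exc}"
            if attempt < attempts:
                sleep(delay_seconds)  # give the remote endpoint a moment to recover
    alert(last_error)
    return False
```

The point of the design is that the alert fires once, after the final attempt, so a blip on the remote endpoint never reaches your phone.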
Now your actual failure rate is down to less than 1%. Given that this job will now possibly fail about once a year, do you really need to sit there and watch it? Nah, go to bed, get some rest. You’ll know soon enough if there is a problem to be resolved.
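The "about once a year" figure is easy to sanity check. The sketch below assumes the job runs once a night and that each attempt fails independently with the same 5% probability. That independence is an assumption, since an endpoint that is down for an hour will fail both attempts.

```python
def expected_failures_per_year(per_attempt_failure: float, attempts: int, runs_per_year: int = 365) -> float:
    """Expected number of nights on which every attempt fails, assuming independent attempts."""
    return (per_attempt_failure ** attempts) * runs_per_year


# One attempt at 5%: roughly 18 bad nights a year.
print(round(expected_failures_per_year(0.05, 1), 2))
# Two attempts at 5% each: a 0.25% nightly failure rate, a little under one bad night a year.
print(round(expected_failures_per_year(0.05, 2), 2))
```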
This ETL process is just a basic example of the kinds of things that get folks sitting at a machine night after night. There are lots of other examples, like you are watching the backups, or checking that the reindex process completed without any errors. You can take yourself out of the equation for each one of these with some planning and work.
Automate your maintenance
Rather than kicking back and watching something great on Netflix you are going through and running index maintenance. Watching those indexes slowly rebuild themselves, while stuck on a conference bridge with two other bored people, and usually at least one person who is three sheets to the wind.
After the indexes have been rebuilt you sit and wait for Janice to come on the bridge and tell you that the warehouse count has been completed, and the site can come back up so that you can start taking orders again.
What is going wrong here?
Unless you are the master of all that you survey, the chances are that you do not manage the company website, and that you do not have the permissions required to take the site offline, and then bring it back online again. If you cannot perform this action, then why are you required?
All of the work that you just did, waiting while indexes rebuilt, could have been scheduled for when the site was down. In fact, rather than sitting and waiting for the downtime, there is a strong chance that you could have had a lot of that work automated and executing on a nightly basis. So why are you there watching how things go? It doesn’t make a great deal of sense to me, does it to you?
But what if something goes wrong? Well, then someone will pick up the phone and call you, and then you can jump online and figure out whatever the issue is. There’s really no requirement that you sit there and totally ruin your evening for no purpose whatsoever.
Stop manually releasing code
One of the more frequent reasons I hear for people being stuck online is to manually sit there and perform database code releases.
I will readily admit, for many years I was the person that wanted that level of tight control over the code going in and how it was going in. I wanted to be sure that I was in control of hitting F5 or CTRL-E (I’m not picky) and that the scripts got run exactly how I wanted. This wore on me a great deal over time, and in the last couple of years I have learned to let go, and embrace continuous integration (CI).
The article I’ve linked references several tools that you can use to get CI up and running. These are not really required, and you can, with a little work up front, get a version of CI running yourself just by using free source control software, and writing your code in such a way that you will always end up in a known state.
Quick and easy? No.
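To give a flavor of what "always end up in a known state" can mean, here is a small sketch of rerunnable schema changes. It is not the tooling from the linked article. It uses Python's built-in `sqlite3` module purely as a stand-in database, and the `schema_version` table and the two example changes are hypothetical.

```python
import sqlite3

# Hypothetical changes, keyed by version number. Each one is applied at most once.
MIGRATIONS = {
    1: "CREATE TABLE orders (id INTEGER PRIMARY KEY, placed_at TEXT NOT NULL)",
    2: "ALTER TABLE orders ADD COLUMN status TEXT NOT NULL DEFAULT 'new'",
}


def migrate(conn: sqlite3.Connection) -> int:
    """Apply any changes not yet recorded and return the resulting schema version."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)")
    current = conn.execute("SELECT COALESCE(MAX(version), 0) FROM schema_version").fetchone()[0]
    for version in sorted(v for v in MIGRATIONS if v > current):
        conn.execute(MIGRATIONS[version])
        with conn:  # commit the version record once the change has been applied
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
        current = version
    return current
```

Running `migrate` against a fresh database, a half-upgraded one, or a fully upgraded one ends in the same place, which is what lets a release run without someone hovering over F5.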
Find some “me” time
This is certainly something that sounds easier said than done, but finding a little bit of down time can do wonders for your mental state during those on-call weeks.
Make the most of lunch
Whether you are in the office, or working from home, make yourself take a lunch break. During this break, step away from your desk. Be gone from it for at least 30 minutes. Ideally take a short walk outside, otherwise, go read a book or magazine. No matter which you do, no screen time. Do not take your phone out and tinker with it, or break out the tablet to surf the web. The closest you should come to screen time is reading on your e-book reader that uses an electronic paper display.
The point here is to get away from the work that is going on, and do a little bit of recharging. Jumping right back to a regular computer screen just puts your head right back in the same place that you were trying to escape from. Give yourself the opportunity to get away from that stuff, even if it is just for a few minutes. You’ll be surprised at the difference that it makes.
I cannot step away for lunch though, I am on call, and something might break
I’ve heard that argument a lot over the years. Heck, I’ve made that argument a lot over the years. I got over this by realizing that I had to eat (well my wife told me, and I listen to her, she’s way smarter than I am). If you don’t eat you’ll go hungry. Going hungry does nothing for your mood, it just makes you hangry and that just makes the whole work situation feel that much worse. Now if I’m going to eat I don’t want to be eating whatever the work snack machine decides to spit out, I would want something nourishing. This means going to the break room and enjoying the salad that I made the previous night, or heading out to get something from a local establishment. All of this is time away from your desk and all the things that have been bugging you.
There is always a chance that something will break while you are away, but unless you are a team of one, there will be someone else there to cover you for those few minutes. Or, if the worst comes to the worst, someone calls and you hotfoot it back to your computer to fix the problem.
Get some exercise
What’s so great about exercise? Well, other than helping (along with a properly balanced diet) to make you healthier, exercise also grants you the gift of endorphins. These little neurotransmitters interact with brain receptors to help you feel good. Short term effects include a feeling of positivity about life and your situation. Longer term effects include a reduction in stress and lower anxiety.
You don’t have to go out running for two hours, or join one of those expensive gyms with the membership that you can never cancel. The odds are that near where you live, or work, there is a gym that will hook you up with a membership for under $20 a month. Alternatively, look for a local YMCA, which could provide you a lot of benefits, besides just the gym. Do not forget to look through your work perks and benefits package, they might have a deal for discounted, or even free, gym membership that you could take advantage of.
Just thirty minutes of exercise a couple of days a week could make you feel like a new person. You can even take your phone along with you, and most gyms have Wi-Fi so that you can get connected quickly should you need to do some work in an emergency.
The key here is to find the exercise that works for you. Not everything works for everyone. I enjoy running, my wife teaches dance fitness, and I have friends that love to cycle 40 minutes to work and back. Find the thing that is right for you.
Before you go jumping into something, be sure to check with your doctor that you are ready to begin some kind of exercise routine. The point of this is to help your overall health, we don’t want to inadvertently make you worse.
See your friends
Being stuck in the on-call situation can easily lead to feeling disconnected from your social group. They could still have the chance to go out, have fun, see things, do things. I’m sure that they are disappointed that you can’t go on that 14 mile hike this weekend, but it’s not going to stop them.
To get past this see if your friends want to hit up a local restaurant for dinner, or have a couple of them over to hang out for a bit. They will provide you with a welcome distraction from the worry about the phone ringing, and will help you keep that connection. Connections like these will help to keep you grounded.
Take time out for family
If you have a partner and/or kids, be sure to spend time with them while you are stuck in the 10th Circle of Hell (because that’s what on-call is). This is for them as much as it is for you.
Your kids probably enjoy you tucking them into bed at night (or if they are teenagers then freaking out because you are in their room). Wherever possible don’t let little things like that get lost due to work. Even if you are stuck dealing with a really nasty problem take 5 minutes, excuse yourself from the conference bridge for a “bio break” and say goodnight to them. Or even just go and see them and check on how their day was.
Your partner will be a little more understanding than the kids about what you are dealing with, but they shouldn’t be ignored either. Never neglect aspects of your relationships because of work. Just a two minute conversation checking in on them, and how they are doing will help them feel that they matter (because they do!) and it will help you keep in mind that there’s more important things in this world than Craig who is causing issues because he still hasn’t figured out the difference in using >= <= and between.
Time in lieu
You’ve been on-call for the last two weeks. It’s been many nights of hell, and you are salaried, so you aren’t going to be seeing any big fat checks coming your way from overtime. This is a pretty bad situation. You really need a break, but you don’t want to waste your vacation time because you are really excited about that trip to Alliance, Nebraska you have coming up in a few weeks.
Go talk to your boss, and tell them how tired and worn out you are from being woken up 9 of the last 14 nights. Explain how you worked 70 hours last week and really just need a day to recharge your batteries. Then ask if you can take a day off in lieu of all that time worked.
The worst that can happen here? They tell you no. If they tell you yes, well then you’ve just got a day back where you can sleep, catch a movie, feed the dogs, take a nap, get to the store, snooze a little, and then rest up before going to bed that night, content in the knowledge that you aren’t on-call that week and should be able to sleep the night through.
Eliminating random calls
If something is truly broken I’ve never really minded having to get up in the middle of the night to deal with it. Things go wrong sometimes, it happens. The calls I absolutely hate though are the ones where nothing is actually broken, or that whatever issue there is has absolutely nothing to do with your area of responsibility.
First level response
You may think that you are the first level of response, but unless you are getting a direct notification from your monitoring software, or from an alert that you or your team created, then you aren’t.
Frequently the first level responder is going to be the person sitting in your Network Operations Center (NOC). This NOC tech is going to be the person that first gets the alarm, or sees the alert, and then makes the decision on who they are going to call.
The key to helping the NOC tech call the right person is going to be documentation.
Documentation sucks right? Yes, yes it does. I’ve never liked creating it, even though I’ve had to do it at various levels for years. Having it though is a key component in stopping those random calls.
Let’s look at a very short piece of documentation that will prevent a 2am call out by the NOC:
- Job Name – Flange modulator returns notification
- Execution time – Nightly 2am
- On failure – Email DBA team for rerun early in business day, follow up with phone call if confirmation of rerun not received by 9am
This is an example of a very basic runbook entry. The runbook could be as simple as printed pieces of paper stored in a binder, or you could make it a living document stored in an online repository such as a wiki. To take it to another level, you have the alerting software include a link to the relevant wiki article in its notification. This way when the NOC receives that alert they can just click a link and instantly see what they need (or do not need) to do next.
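If the runbook lives somewhere a script can read it, the same entry can drive the decision directly. The sketch below is one hypothetical way to model it in Python. The field names and the `on_failure` helper are invented for illustration, and it compares times of day only, which is fine for a job that runs at 2am.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class RunbookEntry:
    job_name: str
    execution_time: time
    notify_team: str
    call_deadline: time  # phone only if no rerun confirmation by this time of day


FLANGE_JOB = RunbookEntry(
    job_name="Flange modulator returns notification",
    execution_time=time(2, 0),
    notify_team="DBA team",
    call_deadline=time(9, 0),
)


def on_failure(entry: RunbookEntry, now: time, rerun_confirmed: bool) -> str:
    """Return the action the NOC tech should take for a failed job."""
    if rerun_confirmed:
        return "no action"
    if now < entry.call_deadline:
        return f"email {entry.notify_team}"
    return f"phone {entry.notify_team}"
```

At 2:05am `on_failure(FLANGE_JOB, time(2, 5), False)` comes back as an email, not a phone call, and nobody gets woken up.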
Job notifications aside, the runbook should also include things like
- Server ownership details – never get called for a server that’s not your responsibility
- Application ownership – get the application subject matter expert (SME) on the phone first
- Application triage steps – don’t even wake the app owner, go through this list of troubleshooting steps first to help identify the root cause and then contact the relevant person
- Escalation steps – who’s next in line just in case the battery died in Charlene’s phone and she didn’t know it
- What to do if someone is firing torpedoes at your thermal exhaust port
Fill that runbook with as much detail and information as you think is required, and then keep it up to date as things change (because nothing ever stays the same).
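Ownership and escalation details lend themselves to the same treatment. A minimal sketch, assuming the runbook data is available as simple lookups. The server names, team names, and everyone on the list other than Charlene are made up.

```python
from typing import FrozenSet, Optional

# Hypothetical runbook data.
SERVER_OWNERS = {"sql-prod-01": "DBA team", "web-prod-01": "Web team"}
ESCALATION = {
    "DBA team": ["Charlene", "Second-line DBA", "DBA manager"],
    "Web team": ["Web on-call", "Web manager"],
}


def who_to_call(server: str, unreachable: FrozenSet[str] = frozenset()) -> Optional[str]:
    """Return the first reachable person in the owning team's escalation list."""
    team = SERVER_OWNERS.get(server)
    if team is None:
        return None  # unknown server: the runbook needs updating, not a random phone call
    for person in ESCALATION.get(team, []):
        if person not in unreachable:
            return person
    return None
```

So `who_to_call("sql-prod-01", frozenset({"Charlene"}))` skips past the dead phone battery and returns the next name, and a web server never pages a DBA.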
There should not be a requirement for you to always have to get up in the middle of the night because something went wrong, or because there is some kind of maintenance process that needs to be accomplished. Setting up tools for use by your first level responding folks can go a long way to slowing down how quickly you come to hate that song you set as your ringtone.
There will always be unknown items that you cannot plan for, and so have to deal with, but things that crop up with a relative level of frequency could be handled by other methods.
As an example, at my current gig there is a requirement for failing over Availability Groups (AGs) so that Windows patching can take place. This is one of those things that it doesn’t make sense to get a DBA out of bed for, so I spent some time coming up with some tooling for the NOC techs. They could launch a tool, see the config of all of our AGs, and then use the tool to fail any AG over from a primary replica to a secondary replica. Writing the tool and getting it working correctly, without granting an undue level of permissions to the NOC techs, took a while to do. Once it was complete, however, there was no need for any DBA to be online to perform that task.
Similarly, for automated releases we had to grant a team the ability to enable and disable a user that was used by the release process to push out code. We store encrypted credentials for this user in the automation tool. So we came up with another process, using Windows Auth, that would allow certain users, with no access to any data or user databases, to enable the higher privileged user. And to ensure that we met security standards we logged who enabled the user, had it auto-disable after a set period of time, and fired off alerts should it not be disabled within that threshold.
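The shape of that safeguard (log who enabled it, auto-disable after a window, alert if the window was overrun) can be sketched independently of any database. The Python below is an illustration of the pattern only, not the process described above. The class and method names are hypothetical, and a real version would issue the actual enable and disable commands where this one only flips a field.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class TimeBoxedAccount:
    name: str
    max_enabled: timedelta
    enabled_at: Optional[datetime] = None
    audit_log: List[Tuple[datetime, str, str]] = field(default_factory=list)

    def enable(self, requested_by: str, now: datetime) -> None:
        self.enabled_at = now
        self.audit_log.append((now, requested_by, "enabled"))

    def disable(self, requested_by: str, now: datetime) -> None:
        self.enabled_at = None
        self.audit_log.append((now, requested_by, "disabled"))

    def sweep(self, now: datetime) -> bool:
        """Auto-disable once the window has passed. True means it had to step in, so alert."""
        if self.enabled_at is not None and now - self.enabled_at >= self.max_enabled:
            self.disable("auto-disable", now)
            return True
        return False
```

A scheduled job calling `sweep` every few minutes, and raising an alert whenever it returns `True`, covers both the auto-disable and the "someone forgot" notification.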
Tools make life easier for everyone. Spending the time and effort to carefully craft things that your organization could use, and that would really benefit you, and your quality of life, is time that is really well spent.
Stop being on-call
Wise words these. I can hear you now “wow, why didn’t I think of this?”
I know, it’s not like you can just turn around and tell your boss that you are no longer going to be in the on-call rotation, but there are other options that might work out in the longer term.
I’m not saying that you need to be the CEO to get yourself out of on-call, but moving up the ladder a little can help.
Moving into management, for example. Most managers are not on-call. They might be an escalation point, but that is a great deal better than being the person to get the calls night after night after night. The burnout is a lot less, that’s for sure.
What if, like me, the thought of management is abhorrent? Then look to move up within your given technical discipline. The lower the rung you are on in that ladder, the higher the chance that you’ll be in that rotation. As a DBA you are far more likely to end up being that first person called when something goes wrong. As a principal DBA you will be considered an SME, and as such are far more likely to move out of that first level, and up into an escalation tier that only gets called when things have truly gone bad (double edged sword this one).
Change your role
Loving the DevOps role, but finding yourself exhausted? How about making the move back to just being a developer? Sure, as a developer you might still end up being on-call, but the call out rate for a dev is a lot lower than that for an ops person. Adjust your skills and work towards moving into that dev team that does all the really cool stuff and wows the crowds at your company’s hackathon (really, has an ops team EVER won anything at a hackathon?)
You might not want to leave that DevOps realm, but if your sanity is at stake then shouldn’t it at least be worth considering?
Change your job
This is a bit of a final resort, but you could always go out and look for a job elsewhere that would allow you to do all the things that you enjoy, without the pain that is dealing with on-call. That is something that only you can decide upon, but before doing that it is worth considering, and trying, some of the above options to see if they can mitigate some of the hurt you are feeling.
Being a part of the on-call rotation sucks. There is no smiling and making out like it is some kind of a blessing that other people are not aware of, it’s just a rotten experience.
The best way to deal with it is to try and find ways to minimize the pain that it will cause you. The easiest way to do this is to attack the reasons why you get called and eliminate them. That will not be a quick or simple task, but any effort you put in that direction will repay you tenfold in an improvement in your quality of life down the road.
P.S. I wrote this while on-call.