Twitter
Tuesday
Feb142012

The Loneliness of the On Call DBA

There are lots of great things about being a SQL Server DBA. You get to play with lots of cool software; get the chance to spend someone else’s money on hardware; constantly be in a position to learn new things; and be a member of an awesome community.

With the good always has to come a little bit of bad. I’ve been a production DBA for over 10 years now. That means for over 10 years I’ve been on call at various times. Way back when it used to be a case of carrying a pager everywhere and that pager get swapped between team members. Now the NOC has my cell number and I can be called at any point. I also get emails on my phone day and night. Work loves it when you’re in constant contact (no really, they do).

The toughest thing about being a DBA has to be those long late nights when you are dealing with a production issue. The feeling of loneliness is pretty intense when the house is quiet and you are sitting in front of a monitor, it’s bright light burning into your tired bleary eyes and making them burn. The only noises are the faint sounds of sleeping from another part of the house, the clicking of your keyboard as you type in the database restore commands and the low humming from the refrigerator.

This weekend I had to deal with a long brutal outage. I ended up working through the night and most of the next day. It is what I get paid for however; so while I didn’t exactly have a smile on my face I did walk away at the end with the satisfaction of knowing that I had done a good job. I was able to get everything back online as it should be and that the business functioning at 100% again.

When I first started working with SQL Server as a DBA I thought it was just a job, I didn’t know it would be a way of life. Truly, being a DBA is a job that you live. If you’re a DBA, live it well my friends.

Thursday
Feb092012

Logging Logons (or is that logins?)

Some say logon some say login. I tend to use them both (frequently within the same sentence). SQL Server really doesn’t help things by having you create a logon trigger that can do things with logins (see where I’m going here?).

Either way, we’re performing a server migration and needed to find out all the logins and clients that are connecting to the SQL instance. I also wanted to capture this data to a table to make it easy to report against (eliminating turning on successful logins). The ideal solution for me was to use a logon trigger, and as the instance in question is running SQL 2008 this worked out great.

I created a login, table of holding, table of rollup holding and granted the relevant permissions so that I could capture the data. Then I created the actual login logon trigger itself which captured the logon login, host name, application name and datetime information and logged the data into the table.

I added a procedure and scheduled job to rollup the data nightly so as to prevent the table from getting too large (which would be quite easy to do on a big server).

There are no primary keys and only a single clustered index. It’s all very light. If you are looking to capture logon/login data then this is a good starting point. Code below…

 

USE master 
go
--create the login which will be used to log the data
CREATE LOGIN LoginLogging WITH PASSWORD = 'somepassword123!@#' ;
GRANT VIEW SERVER STATE TO LoginLogging ;
 
 
--change this to the database of your choosing
USE DBAdmin
go
--create the user matching the login
CREATE USER LogonLogging FOR LOGIN LogonLogging ;
 
--create the table that will hold the data
CREATE TABLE LogonList
    (
      LoginName NVARCHAR(256) ,
      HostName NVARCHAR(256) ,
      ApplicationName NVARCHAR(256) ,
      LoginDate DATETIME2
    ) ;
CREATE INDEX CI ON dbo.LogonList(LoginDate)
 
CREATE TABLE LogonListDailyRollup
    (
      LoginName NVARCHAR(256) ,
      HostName NVARCHAR(256) ,
      ApplicationName NVARCHAR(256) ,
      LoginDate DATE,
      LoginCount INT
    ) ;
    
    
--allow the login to insert data to the table    
GRANT INSERT ON dbo.LogonList TO LogonLogging ;    
 
--proc to rollup the data daily to keep space low and ease reporting
CREATE PROCEDURE RollupLogonData
 
AS
 
SET NOCOUNT ON
 
BEGIN TRY
    BEGIN TRAN
 
    INSERT  INTO dbo.LogonListDailyRollup
            ( LoginName ,
              HostName ,
              ApplicationName ,
              LoginDate ,
              LoginCount
            )
            SELECT  LoginName ,
                    HostName ,
                    ApplicationName ,
                    CONVERT(DATE, LoginDate) AS LoginDate ,
                    COUNT(*) AS LoginCount
            FROM    dbo.LogonList
            WHERE   LoginDate < CONVERT(DATE, GETDATE())
            GROUP BY LoginName ,
                    HostName ,
                    ApplicationName ,
                    CONVERT(DATE, LoginDate);
 
 
    DELETE  FROM dbo.LogonList
    WHERE   LoginDate < CONVERT(DATE, GETDATE());
 
END TRY
BEGIN CATCH
    SELECT  ERROR_MESSAGE()
    ROLLBACK TRAN
END CATCH
COMMIT TRAN
 
GO
 
 
 
--db context back to master for creating the login trigger
USE master 
go
 
CREATE TRIGGER LogLogons ON ALL SERVER
    WITH EXECUTE AS 'LogonLogging'
    FOR LOGON
AS
    BEGIN
        INSERT  INTO DBAdmin.dbo.LogonList
                ( LoginName ,
                  HostName ,
                  ApplicationName ,
                  LoginDate 
                )
                SELECT  ORIGINAL_LOGIN() ,
                        HOST_NAME() ,
                        APP_NAME() ,
                        GETDATE()
                WHERE   ORIGINAL_LOGIN() NOT LIKE N'NT SERVICE%'
    END ;
 
GO
Wednesday
Feb082012

Mass Updating Job Step Retry Attempts

Sometimes I find that the SQL Server team make curious choices as regards default operations. When setting up replication by default the log reader and distribution agent job steps will retry 2,147,483,647 times before failing. Doing the math on the retry interval of 1 minute that means that the job wouldn’t actually hit a failed status for over 4000 years. While you can setup alerts around replication a lot of folks don’t do that and you could easily be in a situation where replication has been down for so long that you have to reinitialize the subscription and perform a snapshot to get things running again.

Having a great many distribution jobs running on my server I wanted to perform a mass update. It could have gone through and updated each job step but that would have taken me a couple of days. I needed a better way. Digging through the SQL Server Agent stored procedures I found that sp_update_jobstep would do the work for me. I just needed to pass in the relevant parameters and then restart the jobs so that the change took effect.

As such I ended up writing the following script. Just adjust the @retries variable to the value that you want to work with (in this instance it would be an hours worth of retries before the job failed). This script is limited to the Distribution job step but could be easily modified to manage the log reader also.

Running this script took two minutes and it updated 60 distribution agent jobs. I’m getting ready to kick it off on the rest of my servers.

SET NOCOUNT ON
      
DECLARE @retries INT = 60 --Set to the value that you want for retry attempts
 
DECLARE @JobIdUpd UNIQUEIDENTIFIER ,
    @JobStepUpd INT,
    @JobStatus INT,
    @JobName SYSNAME,
    @Message NVARCHAR(256)
 
 
--Create a temp table which will hold the results from xp_sqlagent_enum_jobs
--Note xp_sqlagent_enum_jobs is an undocumented stored procedure, it may not be there soon
CREATE TABLE #enumjobs
    (
      job_id UNIQUEIDENTIFIER NOT NULL ,
      last_run_date INT NOT NULL ,
      last_run_time INT NOT NULL ,
      next_run_date INT NOT NULL ,
      next_run_time INT NOT NULL ,
      next_run_schedule_id INT NOT NULL ,
      requested_to_run INT NOT NULL ,
      request_source INT NOT NULL ,
      request_source_id SYSNAME NULL ,
      running INT NOT NULL , 
      current_step INT NOT NULL ,
      current_retry_attempt INT NOT NULL ,
      job_state INT NOT NULL
    ) 
 
 
   
DECLARE UpdateJobsCrsr CURSOR
FOR
SELECT  sjs.job_id ,
        sjs.step_id ,
        sj.name
FROM    msdb.dbo.sysjobsteps sjs
        INNER JOIN msdb.dbo.sysjobs sj ON sjs.job_id = sj.job_id
WHERE   subsystem = 'Distribution'
        AND retry_attempts > 60
 
OPEN UpdateJobsCrsr ;
 
FETCH NEXT FROM UpdateJobsCrsr
INTO @JobIdUpd, @JobStepUpd,@JobName
WHILE ( @@FETCH_STATUS <> -1 ) 
    BEGIN
   
--Update the job step so that the number of retry attempts is set to the @retries variable                
EXEC msdb.dbo.sp_update_jobstep @job_id = @jobidUpd, @step_id = @jobstepUpd, @retry_attempts = @retries  
SELECT @Message = N'Job ' + @JobName + ' updated.'
RAISERROR (@Message, 0, 1) WITH nowait 
 
--We have to restart the job in order for this change to take effect
--Check to see if the job is running, if so stop and restart it, otherwise leave it alone
--Note, you'll need to be a sysadmin for this to work (or change up some stuff)
    TRUNCATE TABLE #enumjobs;
    INSERT INTO #enumjobs
                EXEC master.dbo.xp_sqlagent_enum_jobs 1, 'sa', @JobIdUpd;
                
SELECT @JobStatus = job_state FROM #enumjobs
 
IF @JobStatus <> 4 --This is the job idle state
    BEGIN
--Stop the job
        EXEC msdb.dbo.sp_stop_job @job_id = @JobIdUpd
--Wait a couple of seconds for the job to stop
        WAITFOR DELAY '00:00:03'        
--Start the job up again
        EXEC msdb.dbo.sp_start_job @job_id = @JobIdUpd    
    END
ELSE 
    BEGIN
        SELECT @Message = N'Job ' + @JobName + ' was not running, please start manually if required.'              
        RAISERROR (@Message, 0, 1) WITH NOWAIT
    END
                
        FETCH NEXT FROM UpdateJobsCrsr
            INTO @JobIdUpd, @JobStepUpd, @JobName
    END
 
 
CLOSE UpdateJobsCrsr ;
GO
DEALLOCATE UpdateJobsCrsr ;
GO
 
--Cleanup
DROP TABLE #enumjobs   
Tuesday
Jan312012

PASS Summit 2012–Register Now And Save

It’s the end of January and the price of attendance to the PASS Summit 2012 is about to go up.

Today you can register for $1,095 (that’s 50% off the full price). Tomorrow that price goes up. Get in now and save some bucks. The price only goes up by $100, but with you having the option of pre-purchasing the Summit DVD set for $125 when you register. That means by registering today you can get all the sessions delivered to you for only $25. Now THAT is a bargain.

Sunday
Jan222012

High Availability–It’s Not All About The Hardware/Software

As I write this I’m sitting my my house working with the last hour of juice left in my laptop. We had a major Snow + Seattle = Bad Newssnowstorm here in the Seattle area which was followed by a big ice storm and some heavy winds, all of which have coincided to knock out power to over 200,000 homes. I’ve been without power for over 75 hours now and it’s given me a lot of time to think about multiple facets of High Availability (HA) and Disaster Recovery (DR).

HA/DR are critical items for all companies. HA in various degrees, but most certainly DR and the ability to provide some kind of business continuity model. Different companies, large and small, have different requirements around HA/DR. Some could easily remain down for a couple of days with no lingering effects, others cannot sustain downtime on their most critical systems of more than a couple of minutes. Based upon those requirements (and any budget constraints) a company could implement HA/DR using Failover Clustering, Mirroring, Log Shipping, SAN Replication, Peer-to-Peer Replication, VMotion or some other technology.

 

 

Where HA/DR often falls down

This is all good and well but what about the people that you need in order to get things up and running, the folks that you need in order to provide you with ongoing support in the event that a situation arises where you have to go to your DR site, or provide support for that critical system where a mirror has failed?

In the last 80 hours I have been called by the NOC at work stating that they have an issue. I don’t have power at home so my internet access is off. I don’t have sufficient power in my phone to tether it (don’t tell the wireless company I can do that) and it would take 4 hours sitting in the car to get sufficient power to allow me to do that. I do have a mobile broadband USB device, however I’m not getting coverage in my house right now from that carrier and I have been unable to get out of my road. In other words in the whole HA/DR side of things I’m a single point of failure. Luckily I have a colleague with power who I was able to point the NOC to in order to provide resolution to the issue.

My colleague and I live less than 5 miles apart, we are just luckily on different circuits and substations for the power company. If neither of us had been able to dial in then the problem would not have been resolved.

 

If you truly want to implement HA/DR you need to go further than just looking at your hardware infrastructure or the way that your software is setup. It means more than your DNS configuration and your Active Directory setup. There’s more to it than setting a Failover Partner in your application connection string. It’s about the people that you need to support these things!

 

 

Artificial limitations

I’ve been a DBA for about 12 years now. I’ve not been hands on with a server since doing datacenter implementation work about 5 years ago. Other than in very small shops it’s rare that a DBA touches a server, most don’t even have access to the datacenter in which their database servers reside. There has been a huge move towards setup in Colocation Datacenters (CoLo’s) over the last few years, meaning that more often than not the datacenter where you servers reside are probably not even in the same physical building that you are. There is also the current move to the Cloud initiative that’s gaining some traction. Given all of this information I ask you this:

Why are DBAs so often tied to a physical location for work?

 

This is a serious question. There are so few telecommuting options for DBAs out there. You can look on the job boards and see virtually none. There’s a lot of relocation opportunities if you want to move to another state, but it doesn’t make a lot of sense. If you are going to geographically disperse your hardware due to HA/DR needs why would you not do that same for the teams that have to support it?

So many companies point blank refuse to even entertain the idea of telecommuters from another state (heck the vast majority won’t even contemplate allowing IT support staff to work from home a couple of days a week).

The same companies that happily outsource all development processes to an unknown entity in India or China refuse to entertain the idea of allowing Windows/Unix Admins or DBAs to telecommute.

 

I feel that it is somewhat ironic that the same people that expect you to come to the office and be tied to a physical location also have you be on call so that you can provide remote support when something goes wrong.

 

 

Moving beyond the office

Forcing IT support staff to come to a physical location 5 days a week makes little to no sense to me. I can understand the desire to have folks in the same physical location but as your organization grows it really needs to find a way to get past that mentality. There are various options such as LiveMeeting or GoToMeeting that can provide a telepresence in meetings for people that work remotely. Instant Messaging can be used to constantly keep people in touch. Services such as Yammer can be used to quickly and easily provide collaborative thought outside of meetings.

Looking at my current situation, if my colleague and I had both been without power we would have had no way to provide support. Our servers sit in a CoLo that we don’t have access to so even if we could have gotten there, we would have had no hands on way to work with the servers.

Now imagine we had a third person on the team that lived and worked in another state (or that the second person was living in another state and telecommuted in). Even with power down for me (or the entire state) we would still have someone available to provide support (unless by extremely unusual circumstances they were also down). With a remote DR site they could have effected whatever steps were necessary to bring that online and check that things were working ok.

Let’s look at this from another angle. I’m in the Pacific Time Zone and get to work usually 7:00-7:30am and leave 4:30-5:00pm. If we had someone working in the Eastern Time Zone and they worked the same hours we would have coverage (without emergencies handled by being on call) from 4am PST – 4:30pm PST. That’s an extra 3 hours of standard DBA coverage. Imagine how much more work could get done. How overnight problems could be identified and resolved earlier?

We live in a global society and the company that I work for does business internationally. Would it not make a lot more sense to provide a wider margin of coverage? Think about it, it would also improve the quality of life for anyone that had to be on call. For me the worst time to get called out is about 4am as I’ll never get back to sleep. If there was someone already in the office on the east coast I wouldn’t need to get called out. The same for them with evening plans. They could go out to dinner or a movie knowing that they wouldn’t have to worry about missing a call.

 

 

TL;DR

It’s the 21st century, there are many options to prevent the need for anyone to be physically tied to an office location (certainly amongst IT staff). In order to provide higher levels of support to companies they really need to embrace telecommuting options beyond providing emergency support.

 

 

Late Note: After 80 hours PSE were able to restore power to my neighborhood. Throughout the outage PSE did a fantastic job of handling things. Ultimately over 375,000 were without power at one time or another. PSE provided constant information wherever possible via their Twitter feed and website. They brought in crews from around the country to assist in the restoration efforts. All in all I think they handled things really well. I have no complaints with how they handled any of this. (I just wish I’d been closer to the top of the restore list)