Sometimes I find that the SQL Server team make curious choices as regards default operations. When setting up replication by default the log reader and distribution agent job steps will retry 2,147,483,647 times before failing. Doing the math on the retry interval of 1 minute that means that the job wouldn’t actually hit a failed status for over 4000 years. While you can setup alerts around replication a lot of folks don’t do that and you could easily be in a situation where replication has been down for so long that you have to reinitialize the subscription and perform a snapshot to get things running again.
Having a great many distribution jobs running on my server I wanted to perform a mass update. It could have gone through and updated each job step but that would have taken me a couple of days. I needed a better way. Digging through the SQL Server Agent stored procedures I found that sp_update_jobstep would do the work for me. I just needed to pass in the relevant parameters and then restart the jobs so that the change took effect.
As such I ended up writing the following script. Just adjust the @retries variable to the value that you want to work with (in this instance it would be an hours worth of retries before the job failed). This script is limited to the Distribution job step but could be easily modified to manage the log reader also.
Running this script took two minutes and it updated 60 distribution agent jobs. I’m getting ready to kick it off on the rest of my servers.
SET NOCOUNT ON
DECLARE @retries INT = 60 --Set to the value that you want for retry attempts
DECLARE @JobIdUpd UNIQUEIDENTIFIER ,
@JobStepUpd INT,
@JobStatus INT,
@JobName SYSNAME,
@Message NVARCHAR(256)
--Create a temp table which will hold the results from xp_sqlagent_enum_jobs
--Note xp_sqlagent_enum_jobs is an undocumented stored procedure, it may not be there soon
CREATE TABLE #enumjobs
(
job_id UNIQUEIDENTIFIER NOT NULL ,
last_run_date INT NOT NULL ,
last_run_time INT NOT NULL ,
next_run_date INT NOT NULL ,
next_run_time INT NOT NULL ,
next_run_schedule_id INT NOT NULL ,
requested_to_run INT NOT NULL ,
request_source INT NOT NULL ,
request_source_id SYSNAME NULL ,
running INT NOT NULL ,
current_step INT NOT NULL ,
current_retry_attempt INT NOT NULL ,
job_state INT NOT NULL
)
DECLARE UpdateJobsCrsr CURSOR
FOR
SELECT sjs.job_id ,
sjs.step_id ,
sj.name
FROM msdb.dbo.sysjobsteps sjs
INNER JOIN msdb.dbo.sysjobs sj ON sjs.job_id = sj.job_id
WHERE subsystem = 'Distribution'
AND retry_attempts > 60
OPEN UpdateJobsCrsr ;
FETCH NEXT FROM UpdateJobsCrsr
INTO @JobIdUpd, @JobStepUpd,@JobName
WHILE ( @@FETCH_STATUS <> -1 )
BEGIN
--Update the job step so that the number of retry attempts is set to the @retries variable
EXEC msdb.dbo.sp_update_jobstep @job_id = @jobidUpd, @step_id = @jobstepUpd, @retry_attempts = @retries
SELECT @Message = N'Job ' + @JobName + ' updated.'
RAISERROR (@Message, 0, 1) WITH nowait
--We have to restart the job in order for this change to take effect
--Check to see if the job is running, if so stop and restart it, otherwise leave it alone
--Note, you'll need to be a sysadmin for this to work (or change up some stuff)
TRUNCATE TABLE #enumjobs;
INSERT INTO #enumjobs
EXEC master.dbo.xp_sqlagent_enum_jobs 1, 'sa', @JobIdUpd;
SELECT @JobStatus = job_state FROM #enumjobs
IF @JobStatus <> 4 --This is the job idle state
BEGIN
--Stop the job
EXEC msdb.dbo.sp_stop_job @job_id = @JobIdUpd
--Wait a couple of seconds for the job to stop
WAITFOR DELAY '00:00:03'
--Start the job up again
EXEC msdb.dbo.sp_start_job @job_id = @JobIdUpd
END
ELSE
BEGIN
SELECT @Message = N'Job ' + @JobName + ' was not running, please start manually if required.'
RAISERROR (@Message, 0, 1) WITH NOWAIT
END
FETCH NEXT FROM UpdateJobsCrsr
INTO @JobIdUpd, @JobStepUpd, @JobName
END
CLOSE UpdateJobsCrsr ;
GO
DEALLOCATE UpdateJobsCrsr ;
GO
--Cleanup
DROP TABLE #enumjobs