Tag: SQL 2012

Gathering AG Information – sp_GetAGInformation

The release of SQL 2012 brought about SQL Server AlwaysOn Availability Groups (AGs) as a new way to manage HA for databases.

With AGs came a whole lot of new DMVs to give you information. They also provided a nice dashboard which gives a view into the status of a particular AG
AG+Dashboard

AG Dashboard

This can be quite useful, however it is missing a great deal of information, that as a DBA, I would find useful, like read routing and listener configurations. On top of that the dashboard only provides information on one of the AGs at a time. If you have more than one AG then you have to open up an entirely new dashboard.

This just wasn’t working out for me, and so I wrote a stored procedure (sp_GetAGInformation) to provide me with the configuration information for all the AGs running on a server.

When executed it provides:

  • Availability Group Name
  • Listener Name (if exists)
  • Primary Replica Name
  • Automatic Failover Partner (if exists)
  • Sync Secondary Replicas (if any)
  • Async Secondary Replicas (if any)
  • Read Routing Replicas (if any, in routing order)
  • List of Databases in Availability Group

sp_GetAGInformation+Results
Results of executing sp_GetAGInformation

As you can quickly see in the above example the AGAdvWrks AG has a listener, an auto-failover partner and two servers in the read routing order. It also contains two databases. AGTestAG doesn’t have any sync secondaries, or a listener, and only contains a single database.

If you have several AGs running in your environment this can be a real time saver. What’s also great is to pull this data centrally and report against it.

For example, right now I have a PowerShell process that queries every server, pulls the data back to a central location and reports on any changes in the configuration (if a servers gets pulled out for some reason, or a database added or removed from an AG). This can be an a real timesaver, in particular when you need to connect to a primary, but aren’t sure which server it is (given that neither SQLPS nor SSMS support multisubnet failover connection settings).

One of the limitations is that the data can only be obtained from the primary in an AG as certain sets of the data only reside there, and the read routing configuration can be (and should be) set differently on each server.

Give sp_GetAGInformation as try and let me know what you think. Any ideas for improvements are warmly welcomed.

Traffic Flow With Read-Intent Routing

One of the big advantages to using SQL Server Availability Groups is the ability to automatically push read traffic over to a secondary server. This is particularly useful for larger queries that would take a few seconds to run and consume large amounts of resources. It’s not something recommended for short, fast queries, just because the additional latency of connecting to the secondary could slow down the overall response time for the query.

The Microsoft documentation on setting up Read-Only Routing in SQL AGs is pretty solid and explains how to get this up and running.

 

Firewall and traffic routing

In secure environments there is usually a firewall that resides between the front end web, application or mid-tier servers and the backend database server. This firewall would block all traffic to the backend except for specific ports to specific IP addresses. This is one of the defense in depth items that helps to keep your databases secure.

When using a firewall in conjunction with SQL Server Availability Groups (AGs) it is common to just open up the firewall to the AG Listener. That way there is a single IP open for all the database servers that reside in the AG and any machine that is not acting as the AG primary is not available through the firewall (reducing attack vectors again, a nice side effect).

Given this you might well expect that when routing traffic off to a readable secondary in the AG that it would follow the flow of:

download (5)

Here the client (either directly or through a web, app, or mid-tier) performs an action that does a read query against the AG Listener. The expected traffic flow would be (from what we would see IP address wise, the AG Listener would actually connect to the primary, in this case SQL1):

Client – AG Listener – Readable Secondary – AG Listener – Client
so
Client – SQLAG01 – SQL2 – SQLAG01 – Client

This way the primary server (in this case SQL1) would arbitrate all the traffic for the query that comes in. In fact read routing does not function this way.

In order to perform the expected task of reducing the load on the primary the primary actually tells the client to redirect to the secondary server, and so the process goes:

download (6)

The correct communication is

Client – AG Listener – Secondary – AG Listener – Client – Secondary – Client
or
Client – SQLAGL01 – SQL2 – SQLAGL01 – Client – SQL2 – Client

When the client request comes in SQL has to check that the readable secondary is available to accept the query (otherwise it will go to the next server in the routing list, which is why you should always have the primary as the last server in the routing list, just in case every other server is out of service).

This means the query will take a little longer to execute as the arbitration and network changes will take additional milliseconds to complete (why it is not ideal for small, fast selects).

 

Where does the firewall come in?

Using a firewall and only opening up the IP of the Listener is the best way to handle security, but if you want to use readable secondary server and read-intent routing that’s not going to work. Due to the way that the traffic is routed you would need to open up the firewall to each individual server and port that would be a secondary.

So in our above example the firewall would need to be opened to SQLAGL01, SQL1 & SQL2 in order to support client requests. If those rules aren’t opened then you’re client traffic will be blocked and you’ll get the dreaded “Named Pipes Provider: Error 40” error, which isn’t much of a help.

 

Testing your read-intent connections

A really useful way of testing your read-intent connections is to use a quick PowerShell script from your front end server (if running Windows) prior to putting it into rotation. Download Check-ReadRouting.PS1 and enter the AG Listener name, or IP Address and the name of a database in the AG. If things are working correctly it will return the name of the primary and first server in your read-only routing list.

If you get a timeout then you have either not set the read-intent URL correctly for your secondary, or you are having firewall issues connecting, and so should investigate further.

 

Read-routing can be really powerful and useful, you just have to be careful of the gotchas in getting it working correctly.

 

Querying Change Tracking Tables Against a Secondary AG Replica

If you aren’t familiar with Change Tracking I would recommend heading out and reading Kendra Little’s Change Tracking master post which contains a wealth of information. 

I’ve been using CT for a while now and it does what it says on the box. The performance can be painful when querying tables that have changed a lot (the changetable function performs a huge amount of aggregation and seems to be optimized for 1000 rows of data). One of the things that I’ve always wanted to do is perform loads into an ODS from a DR site. 

I use AvailabilityGroups to ensure that a near real-time copy of the data is kept in another data center in another part of the country. I’ve tried a couple of times to query the change information from one of the secondary replicas, but sadly it’s not supported and so I would get the error

Msg 22117, Level 16, State 1, Line 1
For databases that are members of a secondary availability replica, change tracking is not supported. Run change tracking queries on the databases in the primary availability replica.

 

Yesterday I was messing around with database snapshots and was really happy to discover that it is possible to use the changetable function against a snapshot and not receive any errors. This will only work against readable secondary replicas (as the database needs to be online in order to be able to take the snapshot).

This is also the case with log shipped copies of databases. If the database is in standby then you can access the changetable function directly, or do so off a snapshot.

 

It doesn’t seem like this is a big deal, but if you like to load data into an ODS or Data Warehouse server and it’s not located in the same location as your AG primary, then this is huge as you can asynchronously write data over a WAN and then do all your data loads local to the ODS. This is far more efficient (and your network admin will like you a lot more) than pulling the data over in a large chunk nightly and saturating a part of the network.

Just another way that you can make your DR system work for you.

Rolling Upgrades With Availability Groups – A Warning

One of the great options provided by Availability Groups, in SQL Server 2012 Enterprise Edition and newer, is the ability to perform rolling upgrades to new Service Packs or Cumulative Updates.

The basic idea is that you apply the update to one of the AG secondary servers and then perform a failover of SQL to that server which then does the necessary things on the user databases to bring them up to the level of the update. The big advantage to this is that it minimizes the outage required to get the SP/CU applied, so that you are down for a few seconds instead of 40 minutes.

 

This is works really well for your regular user databases, however there is a problem when applying a CU or SP to a secondary server where a Integration Services (typically called SSISDB) is a member of an Availability Group. If you attempt to apply the CU/SP then it can fail and the SSISDB be left in an offline state. 

In order to apply the CU/SP you would first have to remove SSISDB from the Availability Group and recover it on each server you want to patch. Once you have completed patching all the servers you can add SSISDB back to the AG. But for that period of time you will be at risk, so get through and patch a couple of the machines and get the AG working for those as soon as possible.

Interestingly this does not apply for all CU/SP releases. Some do not make changes to SSISDB and this isn’t required. You can only find this out by patching, so be sure to get it going in your test environments first.

SQL Server 2012 Data Loss Bug

Microsoft have just posted a hotfix for a very serious bug in SQL Server 2012

This bug could potentially cause you to lose data during online index rebuilds.

 

At this time I would not recommend downloading the hotfix (wait for a CU which will address the problem). In the meantime follow the posted workaround which is to perform online index rebuilds with a MAXDOP of 1. For example:

 

ALTER INDEX <idxname> ON <objectname> REBUILD WITH (ONLINE=ON, MAXDOP=1);

 

Change Tracking Cleanup – Going Beyond Automatic

As I’ve previously mentioned, the automatic cleanup process of the syscommittab table limits you to being able delete 14.4 million rows a day. This is a problem in a highly transactional system with volatile tables.

Where the automated cleanup process let’s you down there is an undocumented stored procedure which will allow you to perform a manual cleanup of syscommittab in order to prevent it from growing out of control.

The procedure sys.sp_flush_commit_table_on_demand accepts a BIGINT for the number of rows that you want to delete from syscommittab. 

When the procedure executes it will delete as many rows as you pass in, doing so in batches of 10,000. It will not delete any rows that are newer than the minimum cleanup date, so you do not have to worry about deleting too much data and running into a problem with no longer being able to use your Change Tracking tables.

While the sp_flush_commit_table_on_demand has a deadlock priority of low I would recommend temporarily disabling the automatic cleanup process when running it, just to prevent possible locking or blocking issues with new rows trying to be inserted. It also makes the most sense to run this during a quieter time on your SQL Instance, just to reduce the impact further. 

You can temporarily disable Change Tracking cleanup by running


ALTER DATABASE CCTest SET CHANGE_TRACKING (AUTO_CLEANUP = OFF);

When you are done with the manual cleanup you can re-enable by running


ALTER DATABASE CCTest SET CHANGE_TRACKING (AUTO_CLEANUP = ON, CHANGE_RETENTION = 24 HOURS);

Note, when re-enabling Change Tracking I’ve included the CHANGE_RETENTION information. If you do not include this then the default of 72 hours will be used. This may be too high or too low, depending on your needs, so be careful.

 

In summation, for a busy system watch the size of the syscommittab table and if necessary run sys.sp_flush_commit_table_on_demand if the automated cleanup process cannot keep up.

Change Tracking Cleanup – Example

In my last post I wrote about a limitation of Change Tracking and in particular a limitation in the cleanup process for the syscommittab table.

I thought that a quick example of seeing this in action might be useful in allowing you to check for the issue yourself.

 

First things first, we setup a database, enable CT, create and load up a table with rows.

 


CREATE DATABASE CTTest;
ALTER DATABASE CTTest SET RECOVERY SIMPLE;
ALTER DATABASE CTTest SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 5 MINUTES, AUTO_CLEANUP = ON);
USE CTTest
GO
CREATE TABLE CTTable (C1 INT NOT NULL
,
C2 UNIQUEIDENTIFIER NOT NULL
,
C3 UNIQUEIDENTIFIER NOT NULL
,
CONSTRAINT PK_CTTable PRIMARY KEY CLUSTERED (C1)
);
GO
ALTER TABLE CTTable ENABLE CHANGE_TRACKING;
GO
CREATE TABLE CTLoader (C1 INT NOT NULL
,
C2 UNIQUEIDENTIFIER NOT NULL
,
C3 UNIQUEIDENTIFIER NOT NULL
,
CONSTRAINT PK_CTLoader PRIMARY KEY CLUSTERED (C1)
);
SET NOCOUNT ON;
DECLARE @i INT = 1
WHILE @i < 2000000
BEGIN
BEGIN TRAN
INSERT INTO
CTTable
SELECT @i, NEWID(), NEWID();
COMMIT TRAN
SELECT
@i += 1
END

With the table in place we create an XEvent session to capture when the cleanup runs for syscommittab.

CREATE EVENT SESSION [ChangeTrackingCleanup] ON SERVER
ADD EVENT sqlserver.syscommittab_cleanup
(
ACTION(
sqlserver.database_id
, sqlserver.sql_text
, sqlserver.session_id
, sqlserver.username
, sqlserver.client_hostname
, sqlos.task_time
, sqlserver.tsql_stack
, sqlserver.database_name
, package0.collect_system_time
)
WHERE (database_id=6)
)
ADD TARGET package0.event_file(SET filename=N'C:Program FilesMicrosoft SQL ServerMSSQL11.MSSQLSERVERMSSQLLogChangeTrackingCleanup.xel')
WITH (EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS)
GO
ALTER EVENT SESSION [ChangeTrackingCleanup] ON SERVER STATE = START
GO

 

With the session running we just need to wait a few minutes and pull in the XEvent data to see what’s been going on with the cleanup (HT for getting the XEM information to Jon Kehayias post)

DECLARE @path NVARCHAR(260), @mdpath NVARCHAR(260)
-- Get the log file name and substitute * wildcard in
SELECT
@path = LEFT(column_value, LEN(column_value)-CHARINDEX('.', REVERSE(column_value)))
+
'*'
+ RIGHT(column_value, CHARINDEX('.', REVERSE(column_value))-1)
FROM sys.dm_xe_sessions s
JOIN sys.dm_xe_session_object_columns soc
ON s.address = soc.event_session_address
WHERE s.name = 'ChangeTrackingCleanup'
AND soc.OBJECT_NAME = 'event_file'
AND soc.column_name = 'filename'
-- Get the metadata file name and substitute * wildcard in
SELECT
@mdpath = LEFT(column_value, LEN(column_value)-CHARINDEX('.', REVERSE(column_value)))
+
'*'
+ RIGHT(column_value, CHARINDEX('.', REVERSE(column_value))-1)
FROM sys.dm_xe_sessions s
JOIN sys.dm_xe_session_object_columns soc
ON s.address = soc.event_session_address
WHERE s.name = 'ChangeTrackingCleanup'
AND soc.OBJECT_NAME = 'event_file'
AND soc.column_name = ' metadatafile'
SELECT
row_number() OVER (
ORDER BY event_data.value('(event/@timestamp)[1]', 'datetimeoffset')) AS RowNum
, n.value('(data[@name="database_id"]/value)[1]', 'int') AS DatabaseId
, n.value('(data[@name="rows_deleted"]/value)[1]', 'bigint') AS CommitRowsDeleted
, CONVERT(datetime2(0), DATEADD(hour, -7, n.value('(data[@name="cleaned_upto_date"]/value)[1]', 'datetimeoffset'))) AS CleanedUpToDate
, event_data.value('(event/@name)[1]', 'nvarchar(200)') AS EventType
, CONVERT(datetime2(0), DATEADD(hour, -7, event_data.value('(event/@timestamp)[1]', 'datetimeoffset'))) AS EventTime
FROM (
SELECT
CAST(event_data AS XML) AS event_data
FROM
sys.fn_xe_file_target_read_file(@path, @mdpath, NULL, NULL)
WHERE
OBJECT_NAME = 'syscommittab_cleanup'
) AS tab
CROSS APPLY event_data.nodes('event') AS q(n)

syscommittab+cleanup

Running this shows that the cleanup process runs about every 62 seconds and only deletes 10,000 rows. Given that, there’s no way to be able to keep syscommittab under control if you run more than 14 million transactions through change tracked tables on a daily basis.