Tag: AvailabilityGroups

AlwaysOn, Availability Groups, Uncategorized February 7, 2025

The Curious Case of the Vanishing AG IP

Before you read any further, be sure to go and read Kendra Little’s fantastic post on How to Survive Opening A Microsoft Support Ticket for SQL Server or Azure SQL. There are details I won’t go into in this post, but I just wanted to note that the issues Kendra raises are not ones that only she experiences.

The Basics

Availability Groups (AGs) are a great way to handle business continuity. When they work, they work great. When they don’t work…well the documentation and tooling are rather lacking to help you get through.

One of the things with AGs is that you can have servers on different subnets. This is useful if you want to have AGs span multiple data centers or want to perform cage migrations, or, in the case of Azure VMs, you want to have an AG that can automatically fail over and isn’t at the mercy of an Azure load balancer timeout.

Failing over an AG to a different subnet works well, but it does require configuring your Windows Server Failover Cluster (WSFC) resource so that all of the IPs associated with the AG listener are registered in DNS (the RegisterAllProvidersIp setting, which is now the default). If they aren’t and you fail over to a different subnet, you’re then at the mercy of your DNS TTL for the amount of time it will take for clients to connect to SQL on that new subnet (or you connect to each machine and flush its DNS cache).

With RegisterAllProvidersIp enabled, every IP address associated with the AG listener is presented to the client. If the client connection string also includes MultiSubnetFailover=True (with supported client libraries), the client will try all the presented addresses and connect to the one that responds.

Adding a new IP to a listener for an AG is done using an ALTER AVAILABILITY GROUP MODIFY LISTENER command. This will normally update DNS appropriately, but there are manual steps you can take to ensure it is done. Why an extra step? Because if the IP is not registered and you fail over to the new subnet, you end up with the TTL problem I mentioned earlier. A way around this is to create a new A record in DNS that points to the new IP. That way, when you fail over, that IP is already in DNS and you don’t run into issues. You can create the A record either through the DNS console or through PowerShell.
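As a rough sketch of that command, and of checking which IPs SQL Server itself has recorded for the listener, something like the following could be used. The AG name, listener name, IP address, and subnet mask are made-up placeholders, not values from the environment described here.

```sql
-- Hypothetical names and addresses; substitute your own.
-- Run on the primary replica.
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY LISTENER N'MyAGListener'
    (ADD IP (N'10.20.30.40', N'255.255.255.0'));
GO

-- List every IP that SQL Server has recorded for each listener, with its state.
SELECT  agl.dns_name,
        aglip.ip_address,
        aglip.ip_subnet_mask,
        aglip.state_desc
FROM    sys.availability_group_listeners AS agl
JOIN    sys.availability_group_listener_ip_addresses AS aglip
        ON aglip.listener_id = agl.listener_id
ORDER BY agl.dns_name, aglip.ip_address;
```

This only shows the SQL Server side of things; what DNS actually returns still has to be checked separately (with nslookup, for instance).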

The Issue

A requirement came in for a new server in a new subnet in an existing AG. A simple process and one that’s been done dozens of times in the past with a well worked SOP. In this instance, the new server was added to the WSFC, logins all added, databases restored, the instance added to the AG, and the listener modified to include it.

After getting the DNS team to manually add the new A record, an nslookup confirmed the IP appeared in the DNS record.

Great stuff! Ready to move into service then.

Except, prior to adding traffic to the new instance, the nslookup ran again and somehow the new IP had vanished from the A record. The DNS team stated they hadn’t removed it. Logs showed that none of the DBAs had executed anything to remove the IP, and yet it had vanished.

The DNS team added it back once more. Everyone validated it.

The next day it was gone once more.

The DNS admin did some looking and it showed that the AG listener computer account deleted it. Odd. So, I went digging through the SQL Server logs. Nothing there. Then I dumped the cluster logs and went digging through those.

There were some entries in the logs indicating that the WSFC was checking the IPs were valid, but only showed the IPs that already existed, and not the new one added. Looking back, this process ran every 24 hours.

As I was unfamiliar with this behavior, it was time to open a ticket.

The Investigation

After explaining the issue multiple times, collecting many sets of logs, and answering the same question a large number of times, we received a couple of “things to try” that included turning off RegisterAllProvidersIp (which would have caused an outage in a failover to a box on a different subnet) and removing permissions from DNS for the AG listener (which would mean we couldn’t add new IPs using T-SQL or PowerShell).

After several false starts over weeks, recreating the A records over and over again (I truly feel bad for the DNS admin who just kept his PowerShell script to hand and just hit enter once a day), we got to someone who moved beyond reading some random web pages and gave us the first piece of useful information.

The Fix

The short version is that after adding an IP to an AG listener, you have to restart the AG network name for it to actually pick up the change. You can do this either by failing over the AG to any other replica or by taking the network name resource offline and back online using the Failover Cluster Manager GUI or PowerShell.

When a new IP is added to the AG listener, it’s added to the static config, but that config is not read in until the network name resource is restarted. In this case, we hadn’t performed a failover or restarted that resource, and so the WSFC used the cached record to validate the IPs with DNS. When it noticed that DNS had an extra IP that wasn’t in the cached configuration, it removed it.
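If you go the failover route, a planned manual failover can be done from T-SQL. This is only a sketch: MyAG is a placeholder name, the statement has to be run on the secondary you are failing over to, and it is worth confirming that the databases there are synchronized before you do.

```sql
-- Run on the secondary replica that is about to become primary.
-- 1. Confirm the local databases report SYNCHRONIZED, so the failover loses no data.
SELECT  DB_NAME(drs.database_id) AS database_name,
        drs.synchronization_state_desc
FROM    sys.dm_hadr_database_replica_states AS drs
WHERE   drs.is_local = 1;

-- 2. Planned manual failover (MyAG is a hypothetical AG name).
ALTER AVAILABILITY GROUP [MyAG] FAILOVER;
```

The other option, taking the network name resource offline and online, is done on the cluster side rather than in T-SQL.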

After adding the IP once more, and performing an after-hours restart of the computer network name, the new A record remained through the next DNS check in the WSFC. After leaving it a couple more days to be sure that it wasn’t going to vanish again, the new server was added fully into service.

I asked for a link to documentation on this facet of AGs and WSFCs. Apparently there is none. So, this is just a warning note for those of you maybe adding IPs in extra subnets – restart your resources to ensure your change is picked up.

AlwaysOn, Availability Groups November 9, 2016

Availability Groups Issue With 2016 CU2

SQL Server 2016 has been out for a few months now, with Cumulative Update 2 coming out in late September. Yesterday I was running into issues with deploying CU2 to one of my environments.

Typically, when running Availability Groups (AGs) you patch all of the secondary replicas, and then fail over to one of those which will then upgrade the user databases (SSISDB caveat not included). In this case after applying CU2 to one of the secondary replicas it was no longer able to communicate properly with the primary, and so was not synchronizing.

Looking in the SQL Server error log showed the following error:

An error occurred in a Service Broker/Database Mirroring transport connection endpoint, Error: 8474, State: 11. (Near endpoint role: Target, far endpoint address: '')

This was indicative of an issue with the AG HADR endpoint (yes, still called a database mirroring connection).

Figuring that the connection issue was with the newly patched secondary, I queried for the connection error on the secondary.

SELECT r.replica_server_name ,
 r.endpoint_url ,
 rs.connected_state_desc ,
 rs.last_connect_error_description ,
 rs.last_connect_error_number ,
 rs.last_connect_error_timestamp
 FROM sys.dm_hadr_availability_replica_states rs
 JOIN sys.availability_replicas r ON rs.replica_id = r.replica_id
 WHERE rs.is_local = 1;

The relevant column here is last_connect_error_description:

An error occurred while receiving data: ‘10054(An existing connection was forcibly closed by the remote host.)’.

Having checked the error log and knowing that there were no login errors I knew that there was something else going on.

The servers in question were not fresh builds of SQL Server 2016, rather they had been upgraded from earlier versions, with the AGs being upgraded along the way. Older versions of SQL Server used the RC4 encryption algorithm on the endpoints, and so I was curious as to whether that had been changed as a part of any of the upgrade processes.

SELECT name ,
 type_desc ,
 state_desc ,
 role_desc ,
 is_encryption_enabled ,
 connection_auth_desc ,
 encryption_algorithm_desc
 FROM sys.database_mirroring_endpoints;

The relevant column is encryption_algorithm_desc:

RC4

I thought there was a chance that this was the issue, and wanted to change it, but in a non-breaking way (as there was another secondary replica that was using this algorithm).

Fortunately, the SQL Server team provided the ability to use more than one algorithm on an endpoint (or even none). By altering the endpoint I could specify a newer AES algorithm, with a fallback to RC4. All it required was an alter statement to be executed on each of the replicas.

ALTER ENDPOINT [Hadr_endpoint]
 STATE=STARTED
 AS TCP (LISTENER_PORT = 5022, LISTENER_IP = ALL)
 FOR DATA_MIRRORING (ROLE = ALL, AUTHENTICATION = WINDOWS NEGOTIATE
 , ENCRYPTION = REQUIRED ALGORITHM AES RC4)
 GO

As soon as this command was executed the AG picked up, the databases started synchronizing once more, and things were back to a happy state.

As a recommendation, check your endpoint encryption algorithms prior to applying any cumulative updates, or service packs to SQL Server, and ensure that they are current (use AES primarily). You also have the option to turn off encryption, but I wouldn’t recommend it.
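A quick pre-patch check along those lines might look like the sketch below. The endpoint name Hadr_endpoint matches the one used above but may differ in your environment, and the AES-only change is an assumption about where you would want to end up once every replica can negotiate AES.

```sql
-- Find mirroring/AG endpoints that still allow RC4 (alone or as a fallback).
SELECT  name,
        encryption_algorithm_desc
FROM    sys.database_mirroring_endpoints
WHERE   encryption_algorithm_desc LIKE '%RC4%';

-- Once all replicas accept AES, the RC4 fallback can be dropped.
-- Run on each replica; adjust the endpoint name if yours differs.
ALTER ENDPOINT [Hadr_endpoint]
    FOR DATABASE_MIRRORING (ENCRYPTION = REQUIRED ALGORITHM AES);
```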

TL;DR

If you are running an older RC4 encryption algorithm on your Availability Group or Database Mirroring endpoints you may lose connectivity when applying 2016 cumulative updates. Update to the newer AES algorithm to prevent this.

AlwaysOn, SQL May 18, 2015

Availability Groups & Reindexing

I’ve been working with AGs for the last year and have a couple of things as regards indexing that I thought would be good to share:
Continue reading “Availability Groups & Reindexing” →

SQL October 29, 2014

Improving Performance When Querying Multiple DMVs

A couple of days ago I posted a stored procedure (sp_GetAGInformation) which queried multiple DMVs to pull together a bunch of AvailabilityGroup information. If you took a look at the code you would see that it used a couple of CTEs (Common Table Expressions).

CTEs are a great way to do recursive work, and they can also greatly simplify reading code. A CTE without recursion is really nothing more than a subquery that is nicely wrapped.

For example:

Basic CTE

Is the same thing as:

Basic Subquery
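As a minimal illustration of that equivalence (using sys.databases purely as a stand-in, not the query from the original screenshots), a non-recursive CTE and its subquery form look like this:

```sql
-- Non-recursive CTE...
WITH OnlineDatabases AS
(
    SELECT name, database_id
    FROM   sys.databases
    WHERE  state_desc = 'ONLINE'
)
SELECT name, database_id
FROM   OnlineDatabases;

-- ...and the same query written as a subquery (derived table).
SELECT name, database_id
FROM   (
           SELECT name, database_id
           FROM   sys.databases
           WHERE  state_desc = 'ONLINE'
       ) AS OnlineDatabases;
```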

This can easily lead you down the path towards poor performance. It is quite easy to define a CTE once and use it multiple times, not realizing that every time you reference the CTE you are performing that subquery again, meaning it has to be evaluated and executed each time.

For smaller queries this is not usually a problem, but for larger queries and more importantly when working with DMVs this can be a serious performance problem.
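To make that concrete, here is a hedged sketch (not taken from sp_GetAGInformation) of a CTE over the AG DMVs that is referenced twice, followed by one common alternative: loading the DMV results into a temp table once and reusing that. Whether the temp table wins depends on the query, so test both.

```sql
-- The CTE is referenced twice, so the DMV join inside it can be evaluated twice.
WITH ReplicaInfo AS
(
    SELECT  ar.group_id, ar.replica_server_name, ars.role_desc
    FROM    sys.availability_replicas AS ar
    JOIN    sys.dm_hadr_availability_replica_states AS ars
            ON ars.replica_id = ar.replica_id
)
SELECT  p.replica_server_name AS primary_replica,
        s.replica_server_name AS secondary_replica
FROM    ReplicaInfo AS p
JOIN    ReplicaInfo AS s
        ON s.group_id = p.group_id
WHERE   p.role_desc = 'PRIMARY'
  AND   s.role_desc = 'SECONDARY';

-- Alternative: query the DMVs once, then reuse the stored rows.
SELECT  ar.group_id, ar.replica_server_name, ars.role_desc
INTO    #ReplicaInfo
FROM    sys.availability_replicas AS ar
JOIN    sys.dm_hadr_availability_replica_states AS ars
        ON ars.replica_id = ar.replica_id;

SELECT  p.replica_server_name AS primary_replica,
        s.replica_server_name AS secondary_replica
FROM    #ReplicaInfo AS p
JOIN    #ReplicaInfo AS s
        ON s.group_id = p.group_id
WHERE   p.role_desc = 'PRIMARY'
  AND   s.role_desc = 'SECONDARY';

DROP TABLE #ReplicaInfo;
```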

Continue reading “Improving Performance When Querying Multiple DMVs” →

SQL October 28, 2014

sp_GetAGInformation – Updated

Yesterday I posted sp_GetAGInformation, a stored procedure for gathering information about configuration of Availability Groups. Based on feedback I’ve added an additional column to indicate the health state of the AG.

Download the updated version for this additional information.

AlwaysOn, SQL, SQL 2012 October 27, 2014

Gathering AG Information – sp_GetAGInformation

The release of SQL 2012 brought about SQL Server AlwaysOn Availability Groups (AGs) as a new way to manage HA for databases.

With AGs came a whole lot of new DMVs to give you information. They also provided a nice dashboard which gives a view into the status of a particular AG.

AG Dashboard

This can be quite useful; however, it is missing a great deal of information that, as a DBA, I would find useful, like read routing and listener configurations. On top of that, the dashboard only provides information on one of the AGs at a time. If you have more than one AG then you have to open up an entirely new dashboard.

This just wasn’t working out for me, and so I wrote a stored procedure (sp_GetAGInformation) to provide me with the configuration information for all the AGs running on a server.

When executed it provides:

Availability Group Name
Listener Name (if exists)
Primary Replica Name
Automatic Failover Partner (if exists)
Sync Secondary Replicas (if any)
Async Secondary Replicas (if any)
Read Routing Replicas (if any, in routing order)
List of Databases in Availability Group

Results of executing sp_GetAGInformation

As you can quickly see in the above example the AGAdvWrks AG has a listener, an auto-failover partner and two servers in the read routing order. It also contains two databases. AGTestAG doesn’t have any sync secondaries, or a listener, and only contains a single database.
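The procedure itself is not reproduced here, but as a minimal sketch of the kind of DMV query that sits behind the first few of those columns (an illustration only, not the procedure's actual code), you could start with something like:

```sql
-- One row per AG (or per listener, if an AG has more than one).
SELECT  ag.name             AS availability_group_name,
        agl.dns_name        AS listener_name,        -- NULL when the AG has no listener
        ags.primary_replica AS primary_replica_name
FROM    sys.availability_groups AS ag
JOIN    sys.dm_hadr_availability_group_states AS ags
        ON ags.group_id = ag.group_id
LEFT JOIN sys.availability_group_listeners AS agl
        ON agl.group_id = ag.group_id
ORDER BY ag.name;
```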

If you have several AGs running in your environment this can be a real time saver. What’s also great is to pull this data centrally and report against it.

For example, right now I have a PowerShell process that queries every server, pulls the data back to a central location and reports on any changes in the configuration (if a server gets pulled out for some reason, or a database is added to or removed from an AG). This can be a real time saver, in particular when you need to connect to a primary, but aren’t sure which server it is (given that neither SQLPS nor SSMS support multisubnet failover connection settings).

One of the limitations is that the data can only be obtained from the primary in an AG as certain sets of the data only reside there, and the read routing configuration can be (and should be) set differently on each server.

Give sp_GetAGInformation a try and let me know what you think. Any ideas for improvements are warmly welcomed.

AlwaysOn, SQL, SQL 2012 September 15, 2014

Traffic Flow With Read-Intent Routing

One of the big advantages to using SQL Server Availability Groups is the ability to automatically push read traffic over to a secondary server. This is particularly useful for larger queries that would take a few seconds to run and consume large amounts of resources. It’s not something recommended for short, fast queries, just because the additional latency of connecting to the secondary could slow down the overall response time for the query.

The Microsoft documentation on setting up Read-Only Routing in SQL AGs is pretty solid and explains how to get this up and running.

Firewall and traffic routing

In secure environments there is usually a firewall that resides between the front end web, application or mid-tier servers and the backend database server. This firewall would block all traffic to the backend except for specific ports to specific IP addresses. This is one of the defense in depth items that helps to keep your databases secure.

When using a firewall in conjunction with SQL Server Availability Groups (AGs) it is common to just open up the firewall to the AG Listener. That way there is a single IP open for all the database servers that reside in the AG and any machine that is not acting as the AG primary is not available through the firewall (reducing attack vectors again, a nice side effect).

Given this you might well expect that when routing traffic off to a readable secondary in the AG that it would follow the flow of:

Here the client (either directly or through a web, app, or mid-tier) performs an action that does a read query against the AG Listener. The expected traffic flow would be (from what we would see IP address wise, the AG Listener would actually connect to the primary, in this case SQL1):

Client – AG Listener – Readable Secondary – AG Listener – Client
so
Client – SQLAGL01 – SQL2 – SQLAGL01 – Client

This way the primary server (in this case SQL1) would arbitrate all the traffic for the query that comes in. In fact read routing does not function this way.

In order to perform the expected task of reducing the load on the primary the primary actually tells the client to redirect to the secondary server, and so the process goes:

The correct communication is

Client – AG Listener – Secondary – AG Listener – Client – Secondary – Client
or
Client – SQLAGL01 – SQL2 – SQLAGL01 – Client – SQL2 – Client

When the client request comes in SQL has to check that the readable secondary is available to accept the query (otherwise it will go to the next server in the routing list, which is why you should always have the primary as the last server in the routing list, just in case every other server is out of service).
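For reference, a routing setup with the primary as the last entry could be configured roughly as follows. SQL1 and SQL2 are the servers from the example above; the AG name, domain, and port in the routing URLs are assumptions for illustration.

```sql
-- Allow read-intent connections on the secondary and tell the AG how to reach it.
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQL2'
    WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = READ_ONLY));

ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQL2'
    WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'TCP://SQL2.example.com:1433'));

-- SQL1 needs a routing URL too, since it is listed as the last resort below.
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQL1'
    WITH (SECONDARY_ROLE (READ_ONLY_ROUTING_URL = N'TCP://SQL1.example.com:1433'));

-- Routing list used while SQL1 is primary: try SQL2 first, fall back to SQL1 itself.
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQL1'
    WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (N'SQL2', N'SQL1')));
```

A matching routing list would normally be set for SQL2 as well, for when it holds the primary role.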

This means the query will take a little longer to execute, as the arbitration and network changes will take additional milliseconds to complete (which is why it is not ideal for small, fast selects).

Where does the firewall come in?

Using a firewall and only opening up the IP of the Listener is the best way to handle security, but if you want to use a readable secondary server and read-intent routing that’s not going to work. Due to the way that the traffic is routed you would need to open up the firewall to each individual server and port that would be a secondary.

So in our above example the firewall would need to be opened to SQLAGL01, SQL1 & SQL2 in order to support client requests. If those rules aren’t opened then your client traffic will be blocked and you’ll get the dreaded “Named Pipes Provider: Error 40” error, which isn’t much of a help.
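One way to work out which servers and ports need firewall rules is to read the routing lists and URLs back out of the catalog views. A sketch, run on any replica in the AG:

```sql
-- For each possible primary, list where read-intent traffic gets redirected
-- and the URL (host and port) the client will be told to connect to.
SELECT  pr.replica_server_name AS when_primary_is,
        rl.routing_priority,
        rr.replica_server_name AS routes_to,
        rr.read_only_routing_url
FROM    sys.availability_read_only_routing_lists AS rl
JOIN    sys.availability_replicas AS pr
        ON pr.replica_id = rl.replica_id
JOIN    sys.availability_replicas AS rr
        ON rr.replica_id = rl.read_only_replica_id
ORDER BY pr.replica_server_name, rl.routing_priority;
```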

Testing your read-intent connections

A really useful way of testing your read-intent connections is to use a quick PowerShell script from your front end server (if running Windows) prior to putting it into rotation. Download Check-ReadRouting.PS1 and enter the AG Listener name, or IP Address and the name of a database in the AG. If things are working correctly it will return the name of the primary and first server in your read-only routing list.

If you get a timeout then you have either not set the read-intent URL correctly for your secondary, or you are having firewall issues connecting, and so should investigate further.
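If you would rather check by hand, a simple query works too. This is not the Check-ReadRouting.PS1 script, just a sketch: connect through the listener with read intent (for example sqlcmd with -K ReadOnly, or ApplicationIntent=ReadOnly in the connection string), specify a database that is in the AG, and run:

```sql
-- Shows which server actually answered and whether the database is writable there.
SELECT  @@SERVERNAME AS connected_server,
        DATABASEPROPERTYEX(DB_NAME(), 'Updateability') AS updateability;
```

A connection that was routed to a readable secondary should report that secondary's name rather than the primary's.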

Read routing can be really powerful and useful; you just have to be careful of the gotchas in getting it working correctly.

AlwaysOn, SQL, SQL 2012 August 13, 2014

Querying Change Tracking Tables Against a Secondary AG Replica

If you aren’t familiar with Change Tracking I would recommend heading out and reading Kendra Little’s Change Tracking master post which contains a wealth of information.

I’ve been using CT for a while now and it does what it says on the box. The performance can be painful when querying tables that have changed a lot (the changetable function performs a huge amount of aggregation and seems to be optimized for 1000 rows of data). One of the things that I’ve always wanted to do is perform loads into an ODS from a DR site.

I use Availability Groups to ensure that a near real-time copy of the data is kept in another data center in another part of the country. I’ve tried a couple of times to query the change information from one of the secondary replicas, but sadly it’s not supported and so I would get the error:

Msg 22117, Level 16, State 1, Line 1
For databases that are members of a secondary availability replica, change tracking is not supported. Run change tracking queries on the databases in the primary availability replica.

Yesterday I was messing around with database snapshots and was really happy to discover that it is possible to use the changetable function against a snapshot and not receive any errors. This will only work against readable secondary replicas (as the database needs to be online in order to be able to take the snapshot).

This is also the case with log shipped copies of databases. If the database is in standby then you can access the changetable function directly, or do so off a snapshot.
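A minimal sketch of the snapshot approach is below. The database, logical file name, snapshot path, table, and key column are all hypothetical, and in a real load @last_sync_version would come from wherever you store the version of your previous load.

```sql
-- On the readable secondary: create a snapshot of the change-tracked database.
-- NAME must match the logical data file name of the source database.
CREATE DATABASE SalesDB_CT_Snapshot
    ON (NAME = N'SalesDB_Data',
        FILENAME = N'D:\Snapshots\SalesDB_CT_Snapshot.ss')
    AS SNAPSHOT OF SalesDB;
GO

USE SalesDB_CT_Snapshot;
GO

-- Query change tracking against the snapshot instead of the replica database.
DECLARE @last_sync_version bigint = 0;   -- placeholder starting version

SELECT  ct.OrderID,                      -- primary key column of the tracked table
        ct.SYS_CHANGE_OPERATION,
        ct.SYS_CHANGE_VERSION
FROM    CHANGETABLE(CHANGES dbo.Orders, @last_sync_version) AS ct;
```

The snapshot is a point-in-time view, so it needs to be dropped and recreated for each load.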

It doesn’t seem like this is a big deal, but if you like to load data into an ODS or Data Warehouse server and it’s not located in the same location as your AG primary, then this is huge as you can asynchronously write data over a WAN and then do all your data loads local to the ODS. This is far more efficient (and your network admin will like you a lot more) than pulling the data over in a large chunk nightly and saturating a part of the network.

Just another way that you can make your DR system work for you.

AlwaysOn, Installation, SQL 2012 August 4, 2014

Rolling Upgrades With Availability Groups – A Warning

One of the great options provided by Availability Groups, in SQL Server 2012 Enterprise Edition and newer, is the ability to perform rolling upgrades to new Service Packs or Cumulative Updates.

The basic idea is that you apply the update to one of the AG secondary servers and then perform a failover of SQL to that server which then does the necessary things on the user databases to bring them up to the level of the update. The big advantage to this is that it minimizes the outage required to get the SP/CU applied, so that you are down for a few seconds instead of 40 minutes.

This works really well for your regular user databases; however, there is a problem when applying a CU or SP to a secondary server where an Integration Services catalog database (typically called SSISDB) is a member of an Availability Group. If you attempt to apply the CU/SP then it can fail and SSISDB can be left in an offline state.

In order to apply the CU/SP you would first have to remove SSISDB from the Availability Group and recover it on each server you want to patch. Once you have completed patching all the servers you can add SSISDB back to the AG. But for that period of time you will be at risk, so get through and patch a couple of the machines and get the AG working for those as soon as possible.
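In T-SQL terms the sequence looks roughly like this. MyAG is a placeholder name, and the re-add step assumes the secondary copies have been put back into a restoring state from backups of the primary (or that you are using automatic seeding on versions that support it).

```sql
-- 1. On the primary: take SSISDB out of the AG.
ALTER AVAILABILITY GROUP [MyAG] REMOVE DATABASE [SSISDB];

-- 2. On each secondary: bring the local copy online so the patch can upgrade it.
RESTORE DATABASE [SSISDB] WITH RECOVERY;

-- 3. Patch all of the servers.

-- 4. On the primary: put SSISDB back into the AG.
ALTER AVAILABILITY GROUP [MyAG] ADD DATABASE [SSISDB];

-- 5. On each secondary, once its copy has been restored WITH NORECOVERY
--    from primary backups: join it back to the AG.
ALTER DATABASE [SSISDB] SET HADR AVAILABILITY GROUP = [MyAG];
```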

Interestingly this does not apply for all CU/SP releases. Some do not make changes to SSISDB and this isn’t required. You can only find this out by patching, so be sure to get it going in your test environments first.

AlwaysOn, Rant, SQL 2012 November 5, 2013

You Can’t Meet Your RPO/RTO With AlwaysOn

That title may have caught your attention. AlwaysOn is the future of HA/DR for SQL Server, and has been since the release of SQL 2012.

AlwaysOn is actually a marketing term which covers Failover Cluster Instances (FCIs) and Availability Groups (AGs). Allan Hirt (@sqlha | blog) is a strong proponent of ensuring that people understand what this actually means. So much so that he even ranted about it a little.

I’ve used FCIs for years, going back to the active/passive clustering days of old, and I’ve used Availability Groups in the last few months. They are both great, and both have limitations: FCIs with their shared storage and AGs with some network and quorum oddities.

Both of them will do a fine job for you if you have the time, patience, and in the case of AGs, money to get them up and running. They still will not allow you to meet your RPO/RTO though.

Critical to your business and your users is your up time, and that’s where the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) come into play. They reflect the amount of time it will take to get your services back up and running, as well as the level of data loss that you are willing to accept.

Where FCI/AG win

The key problem with FCI/AG is that they do everything that they can to ensure that transactions are kept as up to date as possible. With FCI you move an entire instance over to another node, everything committed goes with it. With AGs the log records are shipped to the secondaries and applied in a synchronous or asynchronous fashion. The asynchronous setting is designed to get transactions there as soon as possible, and great for longer distances or where low commit times are ultra-critical. Both of these solutions solve two problems…a hardware issue or a software issue.

What does that mean? If your server goes down, then you can failover and lose next to nothing and be back up and running quickly. If Windows goes out to lunch on one of the machines then you can failover and keep ticking along.

So where do they fall down?

What FCI/AG cannot do

Let’s say there’s a code release and a table accidentally has an update run against it with no WHERE clause. All of a sudden you have a table in a 500GB database which contains 3 million rows and all of the information is wrong. Your users cannot use the application, your help desk is getting call after call and you are stuck.

Your only option here is to restore your backup and roll up your transaction logs to the point right before the update happened. You’ve done tests on this and know that it will take 120 minutes to get back to that point. Now you have a 2 hour outage and users are screaming, the CIO is at your desk wondering how this happened, and demanding you get the database back up sooner.

FCIs and AGs are not going to help you in this situation. That update is already committed and so failing over the instance won’t help. The transaction logs were hardened immediately on your synchronous partner and applied within 5 seconds on your asynchronous target.

So how has AlwaysOn helped you in this situation? It hasn’t. And while you can sit there cussing out Microsoft for pushing this solution that has this massive failing it’s not going to solve your problem. That’s why you need something more than AlwaysOn.

You can pry Log Shipping from my cold dead hands

“Log Shipping?” I hear you ask, “but that’s so old.”

It sure is. It’s old, it’s clunky, and it is perfect for the scenario I just mentioned.

You can configure log shipping to delay applying transaction logs on remote servers. Let’s say you delay logs for 1 hour. When that accidental mass update is performed and you realize that you are in trouble, you quickly apply the logs on the secondary to the point in time before the update, bring the database online and repoint your clients. You are back up again in 5 minutes. It’s a momentary issue. Sure, you have an outage, but that outage lasts a fraction of the time. Your help desk is not inundated with calls, your users aren’t left out in the cold for hours.
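The recovery step on the delayed secondary could look something like the sketch below. The database name, backup path, and timestamp are invented for illustration; any earlier log backups that have not yet been applied need to be restored in sequence first, and the log shipping restore job should be disabled so it does not get in the way.

```sql
-- Hypothetical database name, file path and timestamp.
-- Roll the delayed secondary forward to just before the bad update.
RESTORE LOG [SalesDB]
    FROM DISK = N'\\backupshare\logs\SalesDB_20131105_1030.trn'
    WITH STOPAT = N'2013-11-05T10:29:00', NORECOVERY;

-- Bring the database online so clients can be repointed to it.
RESTORE DATABASE [SalesDB] WITH RECOVERY;
```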

There’s nothing to say that you have to delay applying those logs for an hour. It could be 2 hours, or even 24. It really all depends on how you want to handle things.

Sure, you have to do manual failover, and you don’t have the ability for automatic page level restores from one of the synchronous AG secondaries, but you have a level of data resiliency that AlwaysOn does not provide you.

So while AlwaysOn technologies are great, and you should absolutely use them to enhance HA/DR in your business, you have to be aware of their limitations, and be sure to use other parts of SQL Server to ensure that you can keep your business running.