
The Curious Case of the Vanishing AG IP

Before you read any further, be sure to go and read Kendra Little’s fantastic post on How to Survive Opening A Microsoft Support Ticket for SQL Server or Azure SQL. There are details I won’t go into in this post, but I just wanted to note that the issues Kendra raises are not ones that only she experiences.

The Basics

Availability Groups (AGs) are a great way to handle business continuity. When they work, they work great. When they don’t work…well, the documentation and tooling are rather lacking when it comes to helping you get through.

One of the things you can do with AGs is have servers on different subnets. This is useful if you want an AG to span multiple data centers, need to perform cage migrations, or, in the case of Azure VMs, want an AG that can fail over automatically and isn’t at the mercy of an Azure load balancer timeout.

Failing over an AG to a different subnet works well, but it does require configuring your Windows Server Failover Cluster (WSFC) resource so that all of the IPs associated with the AG listener are registered (known as RegisterAllProvidersIp, which is now the default). If they aren’t and you fail over to a different subnet, you’re then at the mercy of your DNS TTL for the amount of time it takes for clients to connect to SQL on the new subnet (unless you connect to each machine and flush its DNS cache).

With RegisterAllProvidersIp enabled, every IP address associated with the AG listener is registered in DNS. If the client connection string also includes MultiSubnetFailover=True (with supported client libraries), the client will try all of the presented addresses and connect to the one that responds.

Adding a new IP to an AG listener is done using an ALTER AVAILABILITY GROUP MODIFY LISTENER command. This will normally update DNS appropriately, but there are manual steps you can take to ensure it is done. Why the extra step? Because if the IP is not registered and you fail over to the new subnet, you end up with the TTL problem I mentioned earlier. A way around this is to create a new A record in DNS that points to the new IP. That way, when you fail over, that IP is already in DNS and you don’t run into issues. You can create the A record either through the DNS console or through PowerShell.
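As a rough sketch, adding the new subnet’s IP to an existing listener looks something like this (the AG name, listener name, IP address, and subnet mask are all made up for illustration):

ALTER AVAILABILITY GROUP [MyAg]
MODIFY LISTENER 'MyAgListener'
(ADD IP (('10.20.30.40', '255.255.255.0')));
GO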

The Issue

A requirement came in to add a new server, in a new subnet, to an existing AG. A simple process, and one that’s been done dozens of times in the past with a well-practiced SOP. In this instance, the new server was added to the WSFC, the logins were all added, the databases restored, the instance added to the AG, and the listener modified to include the new IP.

After getting the DNS team to manually add the new A record, an nslookup confirmed the IP appeared in the DNS record.

Great stuff! Ready to move into service then.

Except, prior to sending traffic to the new instance, the nslookup was run again and somehow the new IP had vanished from the record. The DNS team stated they hadn’t removed it. Logs showed that none of the DBAs had executed anything to remove the IP, and yet it was gone.

The DNS team added it back once more. Everyone validated it.

The next day it was gone once more.

The DNS admin did some digging, and it turned out the AG listener’s computer account had deleted it. Odd. So, I went looking through the SQL Server logs. Nothing there. Then I dumped the cluster logs and went digging through those.

There were entries in the logs indicating that the WSFC was checking that the listener IPs were valid, but they only showed the IPs that already existed, not the newly added one. Looking back through the logs, this check ran every 24 hours.

Being unfamiliar with this behavior, it was time to open a ticket.

The Investigation

After explaining the issue multiple times, collecting many sets of logs, and answering the same questions a large number of times, we received a couple of “things to try”. These included turning off RegisterAllProvidersIp (which would have caused an outage on a failover to a box on a different subnet) and removing the AG listener’s DNS permissions (which would mean we couldn’t add new IPs using TSQL or PowerShell).

After several false starts over weeks, recreating the A record over and over again (I truly feel bad for the DNS admin, who just kept his PowerShell script to hand and hit enter once a day), we got to someone who moved beyond reading some random web pages and gave us the first piece of useful information.

The Fix

The short version is that after adding an IP to an AG listener, you have to restart the AG network name resource for it to actually pick up the change. You can do this either by failing over the AG to any other replica, or by taking the network name resource offline and bringing it back online using the Failover Cluster Manager GUI or PowerShell.
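The failover route can be done straight from TSQL. A minimal sketch, assuming a synchronous-commit replica and a made-up AG name (run it on the replica you want to become the new primary):

ALTER AVAILABILITY GROUP [MyAg] FAILOVER;
GO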

When a new IP is added to the AG listener, it’s added to the resource’s static configuration, but that configuration isn’t read in until the network name is restarted. In this case, we hadn’t performed a failover or restarted that resource, so the WSFC used its cached configuration to validate the IPs against DNS. When it noticed that DNS had an extra IP that wasn’t in the cached configuration, it removed it.

After adding the IP once more and performing an after-hours restart of the network name resource, the new A record survived the next DNS check by the WSFC. After leaving it a couple more days to be sure it wasn’t going to vanish again, the new server was brought fully into service.

I asked for a link to documentation on this facet of AGs and WSFCs. Apparently there is none. So, consider this a warning note for those of you adding IPs in extra subnets – restart your resources to ensure your change is picked up.

The Deception of the Log Reuse Wait Type

One day you’re happily working away and get interrupted by an alert that says your transaction logs are getting full, or maybe growing. Ever the vigilant person, you log on, crank open SSMS or ADS, and take a look at sys.databases to check out the log_reuse_wait_desc.
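Something along these lines, with the database name being whatever the alert fired for:

SELECT name,
       log_reuse_wait_desc
FROM sys.databases
WHERE name = N'MyDatabase';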

AVAILABILITY_REPLICA, the catalog view proudly advises. Availability Groups – the lifesaver of your organization and the bane of your nightmares.

You crack open a SQL script and run it against the primary to see which replica is running behind. Strangely, all the replicas appear to be up to date. The send queue is hovering around 60. The redo queue is no greater than a thousand on any of the asynchronous replicas.
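As a sketch of the kind of script involved (the column choice here is mine, not the exact script), this is where those send and redo queue numbers come from when run on the primary:

SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc,
       drs.log_send_queue_size,
       drs.redo_queue_size
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar ON drs.replica_id = ar.replica_id;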

Unsure why this is happening, you use the tried and trusted method of restarting the hadr endpoint to see if that kicks things into motion. Nothing is resolved. The log is getting fuller and fuller, and you really don’t want that thing to grow.
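For reference, that endpoint restart is just a stop and start of the mirroring endpoint; the name below is the default, yours may differ:

ALTER ENDPOINT [Hadr_endpoint] STATE = STOPPED;
ALTER ENDPOINT [Hadr_endpoint] STATE = STARTED;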

What do you do next?

You may go around to the secondary replicas and remove them from the AG, one at a time, to see if that fixes the problem. Alternatively, you go ahead and check the running sessions on each of the replicas to see if there’s something going on you haven’t considered.
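A quick check along those lines (a sketch of my own, not the exact query used) is to look for any long-running backup requests on each replica:

SELECT r.session_id,
       r.command,
       r.status,
       r.percent_complete,
       r.start_time
FROM sys.dm_exec_requests AS r
WHERE r.command LIKE 'BACKUP%';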

Your query discovers that the AG replica in your DR site is still taking a backup from overnight. It should have completed hours ago, but for some reason it is still running. Wondering if it is related, you kill the backup.

With the backup dead, suddenly the transaction log on your primary clears and you can get back to your interrupted nap… I mean hard work.


This is an example of a choice in messaging within SQL Server that makes it more difficult to find a solution.

In this case, the problem of the transaction log not truncating would ideally have been reported as ACTIVE_BACKUP_OR_RESTORE; after all, it was a backup that was blocking log truncation. But the folks who wrote the product felt AVAILABILITY_REPLICA was more appropriate, because the backup was running on a different replica than the primary, and that was the root location of the truncation issue.

So, be aware, messaging may not always accurately reflect problems that are occurring.


Bug – Getting Read-Routing Data Using SMO

I was futzing around with SMO recently, and wanted to grab information around the read-routing list in an Availability Group. Things went fine for 2012 and 2014, but when it came to SQL Server 2016 I ran into a problem.

The read-routing property in SMO is a string collection (reference https://msdn.microsoft.com/en-us/library/microsoft.sqlserver.management.smo.availabilityreplica.readonlyroutinglist.aspx), which is fine for 2012/2014, but SQL Server 2016 introduced load-balanced read-routing. When used, this gives you the option of having multiple read-only replicas handle traffic, load-balanced across them (it only uses a basic round-robin algorithm, but that’s a lot better than the single-replica option in the earlier versions). To correctly know how read-routing is configured, you need to know which replicas are in a load-balanced group. SMO does not provide the ability to get at this information (nor does it give you a way to configure read-routing in a load-balanced scenario), as it is a basic collection with no further property information around it.
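For context, this is roughly what a load-balanced routing list looks like in TSQL (the server and AG names are made up); the nested parentheses define the load-balanced group:

ALTER AVAILABILITY GROUP [MyAg]
MODIFY REPLICA ON N'Server1'
WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = (('Server2', 'Server3'), 'Server4')));
GO

A flat string collection can tell you that Server2, Server3, and Server4 are in the list, but not that the first two form a load-balanced group.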

I’ve created a Connect item (Gathering Read Routing information using SMO is inaccurate in SQL 2016) to try to get MS to look into this and maybe provide a fix; please upvote if you can.

If you want to try this for yourself you can use the C# below; try it against the lower versions, and then against a load-balanced 2016 configuration. While it’s still possible to get this data from TSQL, I find it annoying that the server management objects that are specifically written to deal with this stuff don’t do the job.


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.SqlServer.Management.Smo;
using Microsoft.SqlServer.Management.Common;

namespace SmoTesting
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Enter the servername");
            string connectServer = Console.ReadLine();
            Console.WriteLine("Enter the AG name");
            string agName = Console.ReadLine();

            Server srv = new Server();
            try
            {
                srv = new Server(connectServer);
                srv.ConnectionContext.StatementTimeout = 60; // timeout after 60 seconds running the query

                foreach (AvailabilityGroup ag in srv.AvailabilityGroups)
                {
                    if (ag.Name == agName)
                    {
                        // show which replica currently holds the primary role
                        Console.WriteLine(ag.PrimaryReplicaServerName.ToString());
                        foreach (AvailabilityReplica ar in ag.AvailabilityReplicas)
                        {
                            if (ar.Name == connectServer)
                            {
                                // dump the read-only routing list for the replica we connected to
                                foreach (Object obj in ar.ReadonlyRoutingList)
                                {
                                    Console.WriteLine(" {0}", obj);
                                }
                            }
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.InnerException.ToString());
            }
            finally
            {
                srv.ConnectionContext.Disconnect();
            }
            Console.WriteLine("press a key");
            Console.Read();
        }
    }
}

Availability Groups Issue With 2016 CU2

SQL Server 2016 has been out for a few months now, with Cumulative Update 2 coming out in late September. Yesterday I was running into issues with deploying CU2 to one of my environments.

Typically, when running Availability Groups (AGs) you patch all of the secondary replicas first and then fail over to one of them, which upgrades the user databases (SSISDB caveat not included). In this case, after applying CU2 to one of the secondary replicas, it was no longer able to communicate properly with the primary, and so was not synchronizing.

Looking in the SQL Server error log showed the following error:

An error occurred in a Service Broker/Database Mirroring transport connection endpoint, Error: 8474, State: 11. (Near endpoint role: Target, far endpoint address: '')

This was indicative of an issue with the AG HADR endpoint (yes, still called a database mirroring connection).

Figuring that the connection issue was with the newly patched secondary, I queried for the connection error on the secondary.

SELECT r.replica_server_name,
       r.endpoint_url,
       rs.connected_state_desc,
       rs.last_connect_error_description,
       rs.last_connect_error_number,
       rs.last_connect_error_timestamp
FROM sys.dm_hadr_availability_replica_states rs
JOIN sys.availability_replicas r ON rs.replica_id = r.replica_id
WHERE rs.is_local = 1;

The relevant column here is last_connect_error_description:

An error occurred while receiving data: '10054(An existing connection was forcibly closed by the remote host.)'.

Having checked the error log and knowing that there were no login errors, I knew there was something else going on.

The servers in question were not fresh builds of SQL Server 2016, rather they had been upgraded from earlier versions, with the AGs being upgraded along the way. Older versions of SQL Server used the RC4 encryption algorithm on the endpoints, and so I was curious as to whether that had been changed as a part of any of the upgrade processes.

SELECT name,
       type_desc,
       state_desc,
       role_desc,
       is_encryption_enabled,
       connection_auth_desc,
       encryption_algorithm_desc
FROM sys.database_mirroring_endpoints;

The relevant column is encryption_algorithm_desc:

RC4

I thought there was a chance that this was the issue, and wanted to change it, but in a non-breaking way (as there was another secondary replica that was using this algorithm).

Fortunately, the SQL Server team provided the ability to use more than one algorithm on an endpoint (or even none). By altering the endpoint I could specify a newer AES algorithm, with a fallback to RC4. All it required was an ALTER statement executed on each of the replicas.

ALTER ENDPOINT [Hadr_endpoint]
    STATE = STARTED
    AS TCP (LISTENER_PORT = 5022, LISTENER_IP = ALL)
    FOR DATABASE_MIRRORING (ROLE = ALL, AUTHENTICATION = WINDOWS NEGOTIATE,
        ENCRYPTION = REQUIRED ALGORITHM AES RC4);
GO

As soon as this command was executed, the AG picked up, the databases started synchronizing once more, and things were back to a happy state.

As a recommendation, check your endpoint encryption algorithms prior to applying any cumulative updates or service packs to SQL Server, and ensure that they are current (use AES primarily). You also have the option to turn off encryption, but I wouldn’t recommend it.

TL;DR

If you are running an older RC4 encryption algorithm on your Availability Group or Database Mirroring endpoints you may lose connectivity when applying 2016 cumulative updates. Update to the newer AES algorithm to prevent this.



BUG with Availability Groups and sys.master_files

I recently came across a bug with SQL Server and Availability Groups whereby catalog view data is incorrectly reported on all secondary replicas.

This bug has the potential for putting the availability of your environment at risk as reporting around capacity could be calculated incorrectly.
