There are many ways a Windows cluster can get into trouble, but this post presents a specific one that I recently ran into (as in “inadvertently provoked”). You may have a perfectly healthy cluster and suddenly one of the nodes is gone. You haven’t made any change to the cluster, but somehow the node won’t start its cluster service and looks offline to the rest of the cluster, despite not having any communication issues.
Initial state
AGCLU01 with 4 nodes (one of them being AG04) and AGCLU02 with 2 nodes.
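For reference, the membership of each cluster can be checked with the FailoverClusters PowerShell module; a quick sketch, using the cluster and node names from this scenario:

# List the member nodes and their state for both clusters
Get-ClusterNode -Cluster AGCLU01 | Format-Table Name, State
Get-ClusterNode -Cluster AGCLU02 | Format-Table Name, State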
The mess up
For some reason (the box crashed and is unrecoverable, the requirements for the cluster have changed, or it was just a proof of concept), a node of the AGCLU01 cluster (AG04) is no longer available: either broken beyond repair, permanently shut down, or properly decommissioned. Because the node was lost unexpectedly, or was decommissioned before being evicted from the cluster, AGCLU01 ends up with only 3 of its 4 nodes online.
Meanwhile, a new AG04 machine is built with the same IP as the old one, since we have rules that assign an IP address to a box based on its name, for ease of identification. The requirements for our clusters have changed, and now each of them only needs 3 nodes, so this new AG04 is added to the cluster AGCLU02.
Later, we find out AGCLU01 still has a ghost entry for a node named AG04 that no longer exists, so we decide to evict it from its old cluster, AGCLU01.
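The eviction can be done from Failover Cluster Manager or with PowerShell; roughly, this is the command that triggers everything that follows (a sketch; -Force only skips the confirmation prompt):

# Evict the ghost AG04 entry from AGCLU01
Remove-ClusterNode -Cluster AGCLU01 -Name AG04 -Force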
The node will remain “Processing” the eviction order for a while: don’t expect it to complete any time soon (I waited several minutes before giving up and just hitting refresh).
So we’ve got our AGCLU01 cluster all nice and clean with its 3 nodes. Now let’s take a look at AGCLU02 and its 3 nodes.
What’s happened to AG04? The box is up and running, so let’s check the cluster services.
The first reaction
The cluster service is disabled, but that is not a big deal. Surely we can fix it by just starting it manually…
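In PowerShell terms, that first attempt looks something like this (the failover cluster service name is ClusSvc):

# Check the state of the cluster service and try to bring it back by hand
Get-Service ClusSvc | Select-Object Name, Status, StartType
Set-Service ClusSvc -StartupType Automatic
Start-Service ClusSvc   # and this is where it refuses to start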
What can the system log tell us about that?
Filtering by the FailoverClustering source, the following errors can be found in AG04’s System log at the time of its eviction from AGCLU01:
Event ID: 4621 Task Category: Cluster Evict/Destroy Cleanup Message: This node was successfully removed from the cluster
Event ID: 4615 Task Category: Cluster Evict/Destroy Cleanup Message: Disabling the cluster service during cluster node cleanup, has failed. The error code was '1115'. You may be unable to create or join a cluster with this machine until cleanup has been successfully completed. For manual cleanup, execute the 'Clear-ClusterNode' PowerShell cmdlet on this machine.
Event ID: 4629 Task Category: Cluster Evict/Destroy Cleanup Message: During node cleanup, the local user account that is managed by the cluster was not deleted. The error code was '2226'. Open Local Users and Groups (lusrmgr.msc) to delete the account.
Event ID: 4627 Task Category: Cluster Evict/Destroy Cleanup Message: Deletion of clustered tasks during node cleanup failed. The error code was '3'. Use Windows Task Scheduler to delete any remaining clustered tasks.
Event ID: 4622 Task Category: Cluster Evict/Destroy Cleanup Message: The Cluster service encountered an error during node cleanup. You may be unable to create or join a cluster with this machine until cleanup has been successfully completed. Use the 'Clear-ClusterNode' PowerShell cmdlet on this node.
Followed by the same error message repeated every 15 seconds:
Event ID: 1090 Task Category: Startup/Shutdown Message: The Cluster service cannot be started. An attempt to read configuration data from the Windows registry failed with error '2'. Please use the Failover Cluster Management snap-in to ensure that this machine is a member of a cluster. If you intend to add this machine to an existing cluster use the Add Node Wizard. Alternatively, if this machine has been configured as a member of a cluster, it will be necessary to restore the missing configuration data that is necessary for the Cluster Service to identify that it is a member of a cluster. Perform a System State Restore of this machine in order to restore the configuration data.
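For the record, these entries can be pulled straight from the System log with Get-WinEvent; a minimal sketch, filtering by the failover clustering provider:

# Pull the FailoverClustering entries from AG04's System log
Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'Microsoft-Windows-FailoverClustering' } |
    Select-Object TimeCreated, Id, Message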
What’s going on in the registry?
Let’s see what a “healthy” registry looks like in a cluster node, compared to our AG04.
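A quick way to compare them without opening regedit is to check whether the Cluster hive exists at all (a sketch using PowerShell’s registry provider):

# On a healthy node this returns True; on our broken AG04 it returns False
Test-Path 'HKLM:\Cluster'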
That’s it: the “Cluster” hive is missing from the registry. It was removed when the node was evicted from AGCLU01. Even though we only meant to remove the ghost node from AGCLU01, the command was sent over the network to the new AG04 node, which received the order to remove all the cluster information it might retain.
Why did the cluster mistake the new AG04 for the old one?
To figure out why this was happening, I reproduced the following scenarios:
- Old DNS (AG04) with old IP (AG04’s).
- Old DNS (AG04) with a new IP.
- New DNS (AG07) with old IP (AG04’s), with old DNS (AG04) still active and pointing to the old IP (AG04’s).
and only the “Old name, old IP” combination caused this particular issue.
Although I couldn’t identify exactly how the cluster checks both the DNS name and the IP address, it appears the cluster sends the eviction order across the network, and it reaches a machine with the same name and the same IP. This is good enough for most cases, but unfortunately it doesn’t verify that the machine receiving the order to clean its cluster configuration is actually a member of the cluster sending out the order.
How do I fix my cluster now?
The first reaction would be to add the server back to the AGCLU02 cluster, but we can’t add a server to a cluster it is already a member of.
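The attempt is the straightforward one, and it fails because AGCLU02 still lists AG04 as one of its nodes (a sketch):

# This fails: AGCLU02 believes AG04 is already a member
Add-ClusterNode -Cluster AGCLU02 -Name AG04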
Well, maybe it can be added back to the first cluster it belonged to, AGCLU01
No, it cannot. Let’s try cleaning the node’s cluster configuration by running
Clear-ClusterNode
No luck: still getting the same error when trying to add it to AGCLU02
But what of AGCLU01?
Now I can add AG04 to the cluster AGCLU01, but not to the cluster it should belong to now, AGCLU02, which still retains configuration and registry entries identifying this node as one of its members. But since I really want to get that AG04 node into AGCLU02, I’ll evict it from AGCLU01 so it can be added back again.
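The whole bounce, roughly, as PowerShell (a sketch of the sequence just described):

# Add the node back to its original cluster, which recreates its cluster configuration
Add-ClusterNode -Cluster AGCLU01 -Name AG04
# Then evict it properly, so the cleanup runs against a node that really is a member
Remove-ClusterNode -Cluster AGCLU01 -Name AG04 -Force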
Now let’s try and add AG04 back to AGCLU02
And we are back in business
How to avoid this in the first place?
First of all, always destroy your clusters cleanly: an offline node should be evicted from the cluster only when the machine is unrecoverable.
But if you must evict an offline node, make sure the DNS record of the node to be evicted is no longer in use and, if it still exists, that it does not point to a valid IP address assigned to a member node of an existing cluster.
And if the evicted offline node is brought back online, clean its cluster configuration, if only to keep it free of leftover components and to avoid error messages in the System log.
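As a small pre-eviction sanity check, something along these lines can help confirm the name and IP are not being reused (a sketch; Resolve-DnsName and Test-Connection are standard cmdlets, and AG04 is this scenario’s node name):

# Does the name still resolve, and does anything answer on that address?
Resolve-DnsName AG04 -ErrorAction SilentlyContinue
Test-Connection AG04 -Count 2 -ErrorAction SilentlyContinue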
Notes on this test
This test was performed on machines running Windows Server 2019, based on a real-world issue that occurred on machines running Windows Server 2016.