There are many ways a Windows cluster can get into trouble, but this post presents a specific one that I recently ran into (as in “inadvertently provoked”). You may have a perfectly healthy cluster and suddenly one of the nodes is gone. You haven’t made any change to the cluster, but somehow the node won’t start its cluster service and looks offline to the rest of the cluster, despite not having any communication issues.
Initial state
AGCLU01 with 4 nodes (one of them being AG04) and AGCLU02 with 2 nodes.
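For reference, the membership of each cluster can be checked with the FailoverClusters PowerShell module; a quick sketch, using the cluster and node names from this scenario:

# List the member nodes and their state for both clusters
Get-ClusterNode -Cluster AGCLU01 | Format-Table Name, State
Get-ClusterNode -Cluster AGCLU02 | Format-Table Name, State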
The mess up
For some reason (the box crashed and is unrecoverable, the requirements for the cluster have changed, or it was just a proof of concept), a node of the AGCLU01 cluster (AG04) is no longer available: either broken beyond repair, permanently shut down, or properly decommissioned. Because the node was lost unexpectedly, or was decommissioned before being evicted from the cluster, AGCLU01 ends up with only 3 of its 4 nodes online.
Meanwhile, a new AG04 machine is built with the same IP as the old one, since we have rules that assign an IP address to a box based on its name, for ease of identification. The requirements for our clusters have changed, and now each of them only needs 3 nodes, so this new AG04 is added to the cluster AGCLU02.
Later, we find out AGCLU01 still has a ghost entry for a node named AG04 that no longer exists, so we decide to evict it from its old cluster, AGCLU01.
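The eviction can be done from Failover Cluster Manager or with PowerShell; roughly, this is the command that triggers everything that follows (a sketch; -Force only skips the confirmation prompt):

# Evict the ghost AG04 entry from AGCLU01
Remove-ClusterNode -Cluster AGCLU01 -Name AG04 -Force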
The node will remain “Processing” the eviction order for a while: don’t expect it to complete any time soon (I waited several minutes before giving up and just hitting refresh).
So we’ve got our AGCLU01 cluster all nice and clean with its 3 nodes. Now let’s take a look at AGCLU02 and its 3 nodes.
What’s happened to AG04? The box is up and running, so let’s check the cluster services.
The first reaction
The cluster service is disabled, but that is not a big deal. Surely we can fix it by just starting it manually…
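In PowerShell terms, that first attempt looks something like this (the failover cluster service name is ClusSvc):

# Check the state of the cluster service and try to bring it back by hand
Get-Service ClusSvc | Select-Object Name, Status, StartType
Set-Service ClusSvc -StartupType Automatic
Start-Service ClusSvc   # and this is where it refuses to start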
What can the system log tell us about that?
Filtering by the FailoverClustering source, the following errors can be found in AG04’s System log at the time of its eviction from AGCLU01:
Event ID: 4621 Task Category: Cluster Evict/Destroy Cleanup Message: This node was successfully removed from the cluster
Event ID: 4615 Task Category: Cluster Evict/Destroy Cleanup Message: Disabling the cluster service during cluster node cleanup, has failed. The error code was '1115'. You may be unable to create or join a cluster with this machine until cleanup has been successfully completed. For manual cleanup, execute the 'Clear-ClusterNode' PowerShell cmdlet on this machine.
Event ID: 4629 Task Category: Cluster Evict/Destroy Cleanup Message: During node cleanup, the local user account that is managed by the cluster was not deleted. The error code was '2226'. Open Local Users and Groups (lusrmgr.msc) to delete the account.
Event ID: 4627 Task Category: Cluster Evict/Destroy Cleanup Message: Deletion of clustered tasks during node cleanup failed. The error code was '3'. Use Windows Task Scheduler to delete any remaining clustered tasks.
Event ID: 4622 Task Category: Cluster Evict/Destroy Cleanup Message: The Cluster service encountered an error during node cleanup. You may be unable to create or join a cluster with this machine until cleanup has been successfully completed. Use the 'Clear-ClusterNode' PowerShell cmdlet on this node.
Followed by the same error message repeated every 15 seconds:
Event ID: 1090 Task Category: Startup/Shutdown Message: The Cluster service cannot be started. An attempt to read configuration data from the Windows registry failed with error '2'. Please use the Failover Cluster Management snap-in to ensure that this machine is a member of a cluster. If you intend to add this machine to an existing cluster use the Add Node Wizard. Alternatively, if this machine has been configured as a member of a cluster, it will be necessary to restore the missing configuration data that is necessary for the Cluster Service to identify that it is a member of a cluster. Perform a System State Restore of this machine in order to restore the configuration data.
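For the record, these entries can be pulled straight from the System log with Get-WinEvent; a minimal sketch, filtering by the failover clustering provider:

# Pull the FailoverClustering entries from AG04's System log
Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'Microsoft-Windows-FailoverClustering' } |
    Select-Object TimeCreated, Id, Message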
What’s going on in the registry?
Let’s see what a “healthy” registry looks like in a cluster node, compared to our AG04.
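A quick way to compare them without opening regedit is to check whether the Cluster hive exists at all (a sketch using PowerShell’s registry provider):

# On a healthy node this returns True; on our broken AG04 it returns False
Test-Path 'HKLM:\Cluster'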
That’s it: the “Cluster” hive is missing from the registry. It was removed when the node was evicted from AGCLU01. Even though we only meant to remove the ghost node from AGCLU01, the command was sent over the network to the new AG04 node, which received the order to remove all the cluster information it might retain.
Why did the cluster mistake the new AG04 for the old one?
To figure out why this was happening, I reproduced the following scenarios:
- Old DNS (AG04) with old IP (AG04’s).
- Old DNS (AG04) with a new IP.
- New DNS (AG07) with old IP (AG04’s), with old DNS (AG04) still active and pointing to the old IP (AG04’s).
and only the “Old name, old IP” combination caused this particular issue.
Although I couldn’t identify exactly how the cluster checks both the DNS name and the IP address, it appears the cluster sends the eviction order across the network, and it reaches a machine with the same name and the same IP. This is good enough for most cases, but unfortunately it doesn’t verify that the machine receiving the order to clean its cluster configuration is actually a member of the cluster sending out the order.
How do I fix my cluster now?
The first reaction would be to add the server back to the AGCLU02 cluster, but we can’t add a server to a cluster it is already a member of.
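The attempt is the straightforward one, and it fails because AGCLU02 still lists AG04 as one of its nodes (a sketch):

# This fails: AGCLU02 believes AG04 is already a member
Add-ClusterNode -Cluster AGCLU02 -Name AG04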
Well, maybe it can be added back to the first cluster it belonged to, AGCLU01
No, it cannot. Let’s try cleaning the node’s cluster configuration by running
Clear-ClusterNode
No luck: still getting the same error when trying to add it to AGCLU02
But what of AGCLU01?
Now I can add AG04 to the cluster AGCLU01, but not to the cluster it should belong to now, AGCLU02, which still retains configuration and registry entries identifying this node as one of its members. But since I really want to get that AG04 node into AGCLU02, I’ll evict it from AGCLU01 so it can be added back again.
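The whole bounce, roughly, as PowerShell (a sketch of the sequence just described):

# Add the node back to its original cluster, which recreates its cluster configuration
Add-ClusterNode -Cluster AGCLU01 -Name AG04
# Then evict it properly, so the cleanup runs against a node that really is a member
Remove-ClusterNode -Cluster AGCLU01 -Name AG04 -Force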
Now let’s try and add AG04 back to AGCLU02
And we are back in business
How to avoid this in the first place?
First of all, always destroy your clusters cleanly: an offline node should be evicted from the cluster only when the machine is unrecoverable.
But if you must evict an offline node, make sure the DNS record of the node to be evicted is no longer in use and, if it still exists, that it does not point to a valid IP address assigned to a member node of an existing cluster.
And if the evicted offline node is brought back online, clean its cluster configuration, if only to keep it free of leftover components and to avoid error messages in the System log.
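As a small pre-eviction sanity check, something along these lines can help confirm the name and IP are not being reused (a sketch; Resolve-DnsName and Test-Connection are standard cmdlets, and AG04 is this scenario’s node name):

# Does the name still resolve, and does anything answer on that address?
Resolve-DnsName AG04 -ErrorAction SilentlyContinue
Test-Connection AG04 -Count 2 -ErrorAction SilentlyContinue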
Notes on this test
This test was performed on machines running Windows Server 2019, based on a real-world issue that occurred on machines running Windows Server 2016.