Monday, August 7, 2017

BizTalk 2010 Failover Cluster Issue

Introduction

Failover clustering of BizTalk services is important for processes in which only a single BizTalk server may pick up a message from a source system, particularly for transports that have no locking mechanism of their own, such as FTP, MSMQ and POP3. To achieve this, you need to cluster your BizTalk receive host. If the receive host is not clustered, several BizTalk servers could pick up the same message at the same time, resulting in duplicate messages. With a clustered host, only the active node runs the host instances: on failover, the host instances on the active node stop and those on the other node start.


Problem

A while back we experienced a problem on one of our BizTalk servers (T03A) that required the operating system to be re-installed. This was fine because our secondary test server (T03B) was still running, so it was safe to re-install T03A. The server was re-installed, all software was loaded, all patches were successfully applied and the server was re-joined to the cluster. Once BizTalk was installed, configured and re-joined to the BizTalk group, we started seeing strange behavior on T03A.

We started verifying the failover clustering between T03A and T03B to ensure that failover worked as expected. When checking T03B, everything looked good: the clustered host instances carried the (Active) label.


However, when we forced T03A to become the active node, I noticed that there were no (Active) labels, even though T03A was the active server.

I double-checked the failover cluster settings, but they all appeared correct. I then searched the internet, but could not find anything that matched the problem. Lastly, I checked the configuration in WMI using PowerShell.

I opened a PowerShell session with administrative rights and ran the following command on both T03A and T03B:

Get-WmiObject MSBTS_HostInstance -Namespace "root\MicrosoftBizTalkServer" | Where-Object { $_.ClusterInstanceType -in (1,2,3) } | Format-Table HostName,HostType,ClusterInstanceType,ConfigurationState,RunningServer,ServiceState -AutoSize

T03A

T03B

The issue is immediately visible: the RunningServer value on T03A is in lower case, whereas on T03B it is all upper case.
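The same check can be scripted so that the mismatch does not have to be spotted by eye. The following is only a minimal sketch, assuming that node names are expected in upper case (as they were in our environment); the function names Test-RunningServerCase and Get-MismatchedHostInstance are made up for this example.

```powershell
# Hypothetical helper: returns $true when the name is already all upper case.
# -ceq is the case-sensitive equality operator (-eq ignores case).
function Test-RunningServerCase {
    param([string]$RunningServer)
    return ($RunningServer -ceq $RunningServer.ToUpperInvariant())
}

# Hypothetical helper: lists clustered host instances whose RunningServer
# value is not upper case. It queries the BizTalk WMI namespace, so it only
# works when run on a BizTalk server.
function Get-MismatchedHostInstance {
    Get-WmiObject MSBTS_HostInstance -Namespace "root\MicrosoftBizTalkServer" |
        Where-Object { $_.ClusterInstanceType -in (1,2,3) } |
        Where-Object { -not (Test-RunningServerCase $_.RunningServer) } |
        Select-Object HostName, RunningServer
}
```

Running Get-MismatchedHostInstance on each node should return nothing when every RunningServer value is upper case, and one row per affected host instance otherwise.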

Note:
The numeric values in the output are explained in the tables below.

ClusterInstanceType (https://msdn.microsoft.com/en-us/library/aa578062.aspx)

Value  ClusterInstanceType
0      UnClusteredInstance
1      ClusteredInstance
2      ClusteredVirtualInstance

ConfigurationState (https://msdn.microsoft.com/en-us/library/aa560498.aspx)

Value  ConfigurationState
1      Installed
2      Installation failed
3      Uninstallation failed
4      Update failed
5      Not installed


Value  ServiceState
1      Stopped
2      Start pending
3      Stop pending
4      Running
5      Continue pending
6      Pause pending
7      Paused
8      Unknown
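To avoid looking these numbers up by hand, the value lists above can be turned into lookup tables. This is a sketch only: Convert-CodeToName and Get-ReadableHostInstance are hypothetical helper names, and any code that is not in the lists is reported as unknown rather than guessed.

```powershell
# Lookup tables built from the value lists above.
$ClusterInstanceTypeNames = @{
    0 = 'UnClusteredInstance'; 1 = 'ClusteredInstance'; 2 = 'ClusteredVirtualInstance'
}
$ConfigurationStateNames = @{
    1 = 'Installed'; 2 = 'Installation failed'; 3 = 'Uninstallation failed'
    4 = 'Update failed'; 5 = 'Not installed'
}
$ServiceStateNames = @{
    1 = 'Stopped'; 2 = 'Start pending'; 3 = 'Stop pending'; 4 = 'Running'
    5 = 'Continue pending'; 6 = 'Pause pending'; 7 = 'Paused'; 8 = 'Unknown'
}

# Hypothetical helper: translates a numeric code, falling back to a marker
# for codes that are not listed in the table.
function Convert-CodeToName {
    param([hashtable]$Map, [int]$Code)
    if ($Map.ContainsKey($Code)) { return $Map[$Code] }
    return "Unknown ($Code)"
}

# Hypothetical helper: the same WMI query as before, with readable columns.
# Only works on a BizTalk server.
function Get-ReadableHostInstance {
    Get-WmiObject MSBTS_HostInstance -Namespace "root\MicrosoftBizTalkServer" |
        Select-Object HostName, RunningServer,
            @{ Name = 'ClusterInstanceType'; Expression = { Convert-CodeToName $ClusterInstanceTypeNames $_.ClusterInstanceType } },
            @{ Name = 'ConfigurationState';  Expression = { Convert-CodeToName $ConfigurationStateNames $_.ConfigurationState } },
            @{ Name = 'ServiceState';        Expression = { Convert-CodeToName $ServiceStateNames $_.ServiceState } }
}
```

Piping Get-ReadableHostInstance to Format-Table -AutoSize gives the same table as the original command, with names instead of numbers.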

Resolving the issue

To resolve the issue, you need to first evict the troublesome node from the cluster and then re-add it to the cluster.

Note:
As a rule, you should not evict a node from a cluster, because evicting it can make the original problem worse and cause more harm than necessary. The few exceptions are:
  • Replacing a node with different hardware.
  • Reinstalling the operating system.
  • Permanently removing a node from a cluster.
  • Renaming a node of a cluster. 
Copied from this blog post: https://blogs.technet.microsoft.com/askcore/2010/03/03/when-should-i-evict-a-cluster-node/

Removing the node from the cluster

In our case, we had re-installed the operating system and also needed to rename node T03A from lower case to upper case, so it was safe to evict T03A from the cluster.

1. Open the Failover Cluster Manager and connect to your BizTalk cluster.
2. Expand the BizTalk Cluster group.
3. Click on Nodes.
4. Right-click the troublesome node, select "More Actions..." and then click "Evict".
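The eviction can also be done from PowerShell with the Remove-ClusterNode cmdlet from the FailoverClusters module. The wrapper below is a sketch, not something we ran as part of this fix; the function name Remove-NodeFromCluster is hypothetical, and the cluster and node names are whatever applies in your environment.

```powershell
# Hypothetical wrapper around Remove-ClusterNode (FailoverClusters module).
# On PowerShell 2.0, run "Import-Module FailoverClusters" first.
function Remove-NodeFromCluster {
    param(
        [Parameter(Mandatory = $true)][string]$ClusterName,
        [Parameter(Mandatory = $true)][string]$NodeName
    )
    # Evicts the node. The cmdlet asks for confirmation unless -Force is added.
    Remove-ClusterNode -Cluster $ClusterName -Name $NodeName
}

# Example call (placeholder cluster name):
# Remove-NodeFromCluster -ClusterName "{CLUSTER GROUP NAME}" -NodeName "T03A"
```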


Adding the cluster node

Our infrastructure team initially used the Failover Cluster Manager interface to add the T03A server back to the cluster. When a server is added through the Failover Cluster Manager, the node name sometimes appears in upper case and sometimes in lower case. To force the node to be added with exactly the name you want, use the command line or PowerShell instead. See http://blog.workinghardinit.work/2012/04/26/failover-cluster-node-names-in-upper-lower-case-in-window-2012-with-cluster-exe-powershell-gui/.

Open a command prompt with administrative privileges and then type the following command:

cluster.exe /cluster:{CLUSTER GROUP NAME} /add /node:{SERVERNAME IN UPPERCASE}
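A PowerShell equivalent uses the Add-ClusterNode cmdlet from the FailoverClusters module. The sketch below assumes, as the linked post describes, that the node name is registered in the case it is passed in; Add-UpperCaseClusterNode is a hypothetical wrapper name, and it upper-cases the name before the call so that a typo in casing cannot slip through.

```powershell
# Hypothetical wrapper around Add-ClusterNode (FailoverClusters module).
# On PowerShell 2.0, run "Import-Module FailoverClusters" first.
function Add-UpperCaseClusterNode {
    param(
        [Parameter(Mandatory = $true)][string]$ClusterName,
        [Parameter(Mandatory = $true)][string]$NodeName
    )
    # Pass the node name in upper case, mirroring the cluster.exe command above.
    Add-ClusterNode -Cluster $ClusterName -Name ($NodeName.ToUpperInvariant())
}

# Example call (placeholder cluster name):
# Add-UpperCaseClusterNode -ClusterName "{CLUSTER GROUP NAME}" -NodeName "t03a"
```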

Validate the cluster

Once the server has been re-joined to the cluster, you need to validate the cluster. See this article for more information on validating a cluster: https://technet.microsoft.com/en-us/library/cc732035(v=ws.10).aspx

1. In the Failover Cluster Manager, connect to your BizTalk cluster.
2. Right-click the BizTalk cluster group name and then click "Validate This Cluster...".
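Validation can also be started from PowerShell with the Test-Cluster cmdlet, and Get-ClusterNode shows the node names as the cluster has registered them, which makes it easy to confirm the casing. This is a sketch under the same assumptions as before (FailoverClusters module available); Test-BizTalkCluster is a hypothetical function name.

```powershell
# Hypothetical helper: validates the cluster and then lists its nodes.
# On PowerShell 2.0, run "Import-Module FailoverClusters" first.
function Test-BizTalkCluster {
    param([Parameter(Mandatory = $true)][string]$ClusterName)

    # Runs the validation tests; the cmdlet outputs the validation report file.
    Test-Cluster -Cluster $ClusterName

    # Lists the node names and states so the casing can be checked.
    Get-ClusterNode -Cluster $ClusterName | Select-Object Name, State
}

# Example call (placeholder cluster name):
# Test-BizTalkCluster -ClusterName "{CLUSTER GROUP NAME}"
```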


Once validation passes, connect to the server that had the cluster issue and open the BizTalk Server Administration Console. The server name now shows the (Active) label.



Conclusion

Installing and configuring your BizTalk servers can be a tedious process, especially when performed manually. The effort and the risk of mistakes, such as an inconsistently cased node name, can be reduced by automating the steps with scripts, for example in PowerShell. Scripts make the steps repeatable and allow them to run without user intervention.

Hopefully this post will help someone else out there having a similar cluster issue. 
