• Exchange 2010 Recommended DAG Update

    While working with a customer to resolve networking issues with their DAG last week, I asked whether they had installed the recommended update for their DAG.  They were not aware of the Exchange product group’s recommendation, so I thought I’d bubble this back up again:

     

    The post is on the Exchange team blog site and mentions the following update:

     

    KB2550886 - A transient communication failure causes a Windows Server 2008 R2 failover cluster to stop working

    This hotfix is strongly recommended for all database availability groups that are stretched across multiple datacentres. For DAGs that are not stretched across multiple datacentres, this hotfix is good to have as well. The article describes a race condition and cluster database deadlock issue that can occur when a Windows failover cluster encounters a transient communication failure. There is a race condition within the reconnection logic of cluster nodes that manifests itself when the cluster has communication failures. When this occurs, it will cause the cluster database to hang, resulting in quorum loss in the failover cluster.

    As described on TechNet, a database availability group (DAG) relies on specific cluster functionality, including the cluster database. In order for a DAG to be able to operate and provide high availability, the cluster and the cluster database must also be operating properly.

    Microsoft has encountered scenarios in which a transient network failure occurs (a failure of network communications for about 60 seconds) and, as a result, the entire cluster is deadlocked and all databases within the DAG are dismounted. It is not easy to determine which cluster node is actually deadlocked, so if a failover cluster deadlocks as a result of the reconnect logic race, the only available course of action is to restart all members of the cluster to resolve the deadlock condition.

    The problem typically manifests itself in the form of cluster quorum loss due to an asymmetric communication failure (when two nodes cannot communicate with each other but can still communicate with other nodes). If there are delays among other nodes in the receiving of cluster regroup messages from the cluster’s Global Update Manager (GUM), regroup messages can end up being received in unexpected order. When that happens, the cluster loses quorum instead of invoking the expected behaviour, which is to remove one of the nodes that experienced the initial communication failure from the cluster.

    Generally, this bug manifests when there is asymmetric latency (for example, where half of the DAG members have latency of 1 ms, while the other half of the DAG members have 30 ms latency) for two cluster nodes that discover a broken connection between the pair. If the first node detects a connection loss well before the second node, a race condition can occur:

    • The first node will initiate a reconnect of the stream between the two nodes. This will cause the second node to add the new stream to its data.
    • Adding the new stream tears down the old stream and sets its failure handler to ignore. In the failure case, the old stream is the failed stream that has not been detected yet.
    • When the connection break is detected on the second node, the second node will initiate a reconnect sequence of its own. If the connection break is detected in the proper race window, the failed stream's failure handler will be set to ignore, and the reconnect process will not initiate a reconnect. It will, however, issue a pause for the send queue, which stops messages from being sent between the nodes. When the messages are stopped, this prevents GUM from operating correctly and forces a cluster restart.
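The sequence above can be sketched as a small, deterministic toy model. This is only an illustration of the ordering problem described in the bullets, not the cluster service's actual code; every name in it (Stream, Node, simulate, and the fields) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    name: str
    ignore_failures: bool = False  # True once a newer stream has replaced this one


@dataclass
class Node:
    name: str
    stream: Stream
    send_queue_paused: bool = False
    reconnects_started: int = 0

    def accept_new_stream(self, new_stream: Stream) -> None:
        """The peer reconnected: retire the old stream and ignore its failures."""
        self.stream.ignore_failures = True
        self.stream = new_stream

    def on_break_detected(self, broken: Stream) -> None:
        """This node notices that a stream is broken."""
        self.send_queue_paused = True       # the pause is issued in every case
        if broken.ignore_failures:
            return                          # handler is ignored: nothing resumes the queue
        self.reconnects_started += 1
        self.send_queue_paused = False      # a normal reconnect resumes sending


def simulate(first_node_reconnects_first: bool) -> Node:
    """Return the second node's state after it detects the broken stream."""
    old = Stream("old")
    second = Node("node2", stream=old)
    if first_node_reconnects_first:
        # The first node saw the break earlier and has already reconnected.
        second.accept_new_stream(Stream("new"))
    second.on_break_detected(old)
    return second


if __name__ == "__main__":
    for early in (False, True):
        node = simulate(early)
        print(early, node.send_queue_paused, node.reconnects_started)
```

In this model the send queue stays paused only when the first node's reconnect lands before the second node detects the break, which is the race window the bullets describe.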

    If this issue does occur, the consequences for a DAG are severe: the cluster loses quorum and all databases in the DAG are dismounted until every member is restarted. As a result, we recommend that you deploy this hotfix to all of your Mailbox servers that are members of a DAG, especially if the DAG is stretched across datacentres. This hotfix can also benefit environments running Exchange 2007 Single Copy Clusters and Cluster Continuous Replication environments.

    In addition to fixing the issue described above, KB2550886 also includes other important Windows Server 2008 R2 hotfixes that are also recommended for DAGs:

     

    Cheers,

    Rhoderick

  • Windows Service Log On As Inventory

    At a recent engagement a customer wanted to quickly scan multiple servers to determine which services were set to log on with non-standard accounts. Built-in accounts such as Network Service and Local Service were OK, but which services were using an Active Directory account, for example?  They also wanted to target specific portions of AD, so logic was added to search a collection of computers starting from a given OU.

    We took the opportunity to quickly knock up a PowerShell script that combines the Windows Server 2008 R2 AD cmdlets with WMI to show which services on multiple computers were using specific credentials.  Please find the script attached to this blog post.

    While the Get-Service cmdlet can query services on remote machines using the -ComputerName parameter, it cannot return the Log On As information for a service.  WMI can, and by using the Get-WmiObject cmdlet it was simple to query for the desired logon information.
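The filtering step can be sketched in Python as follows. This is not the script from the post. It assumes the service records have already been collected (for example, exported from the WMI Win32_Service class) as dictionaries with the SystemName, Name and StartName properties. The function name, the list of built-in accounts and the sample data are illustrative assumptions.

```python
# Built-in accounts, normalised to lower case with spaces removed.
# This list is an assumption; adjust it to whatever counts as "standard" for you.
BUILT_IN_ACCOUNTS = {
    "localsystem",
    "ntauthority\\system",
    "ntauthority\\localservice",
    "ntauthority\\networkservice",
}


def non_standard_logons(services):
    """Return (computer, service, account) for services not using a built-in account."""
    found = []
    for svc in services:
        account = (svc.get("StartName") or "").strip()
        if not account:
            continue  # some entries report no logon account at all
        normalised = account.lower().replace(" ", "")
        if normalised not in BUILT_IN_ACCOUNTS:
            found.append((svc["SystemName"], svc["Name"], account))
    return found


# Hypothetical sample data for illustration only.
sample = [
    {"SystemName": "SRV1", "Name": "Spooler", "StartName": "LocalSystem"},
    {"SystemName": "SRV1", "Name": "Dnscache", "StartName": "NT AUTHORITY\\NetworkService"},
    {"SystemName": "SRV2", "Name": "BackupAgent", "StartName": "TAILSPINTOYS\\svc-backup"},
    {"SystemName": "SRV2", "Name": "SomeDriver", "StartName": None},
]

if __name__ == "__main__":
    for computer, service, account in non_standard_logons(sample):
        print(computer, service, account)
```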

    Note that the OU path is set in the script, and unless you work for TailspinToys.com you will have to edit the OU path to reflect the correct structure.

    Update 15-7-2014: The script was previously stored on the blog, but since a recent blog upgrade has blocked adding/editing attached files the script has been moved to the TechNet gallery:

    Service Log On As Inventory Script

     

    Cheers,

    Rhoderick

  • Fine Grained Control When Registering Multiple IP Addresses On a Network Card

    Edit: 24-1-2013:  A second article using PowerShell 3.0 is here

    Edit: 30-1-2013:  A third article using advanced PowerShell 3.0 is here.

    Edit: 28-8-2013:  A similar issue with the setting being removed is present in Windows Server 2012.  An article with a workaround is here

     

     

    The previous behaviour in Windows was to register all IP addresses that were entered on the network card’s property sheet into DNS if the “Register this connection’s address in DNS” option was selected (which is the default).

    Register In DNS Option on Network Card

     

    For servers with a single NIC which has one IP bound to it this works great as we can dynamically register changes in IP addressing into DNS and all in the world is good.  What happens though when you start to complicate matters and have additional IPs and additional NICs?

    In the two NIC scenario, it is easy to set one NIC to register into DNS and then clear the register in DNS option for the second NIC.  That allows for the IP on the first NIC to be registered and the IP on the second NIC will not.  This would be a common scenario for a server that had multiple interfaces where one would be used for a management/backup purpose and end users should not be able to resolve the server’s name to the management IP as their traffic would not be allowed to route to that interface.  That’s fine but what about the scenario of a single NIC with multiple IPs bound to it? An example would be a web server with multiple IPs for different web sites.

    Previously, if you did not want the server to register all of its IPs into DNS, then the register in DNS option would have to be disabled and the administrator would have to manually maintain the DNS registration information in the DNS zone.  If this was not done then all the IPs that were bound to the server would be registered in DNS and clients potentially would be returned an incorrect IP.

    Windows 2008 and 2008 R2 now have the option to selectively register IPs into DNS.  This capability was first released as an update for Windows 2008 and 2008 R2.  After you install this hotfix, you can assign IP addresses that will not be registered for outgoing traffic on the DNS servers by using a new flag of the Netsh command. This new flag is the skipassource flag.

    For example, the following command creates an IPv4 address that is not registered for outgoing traffic on the DNS servers:

    Netsh Int IPv4 Add Address <Interface Name> <IP Address> SkipAsSource=True

     

    "Interface Name" is the name of the interface for which you want to add a new IP address.

    "IP Address" is the IP address you want to add to this interface.

     

    For Example:

    Netsh Int IPv4 Add Address Team-1 172.16.5.10 SkipAsSource=True
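When a server carries many additional addresses, such as the web server example earlier, it can help to generate the commands rather than type each one. The following Python sketch only builds the command text in the form shown above and does not run anything. The function name is hypothetical, and quoting the interface name is an assumption made so that names containing spaces stay in one piece.

```python
import ipaddress


def build_add_address_command(interface_name, ip_address, skip_as_source=True):
    """Build the netsh command line that adds one IPv4 address to an interface."""
    ipaddress.IPv4Address(ip_address)  # raises ValueError if the address is malformed
    flag = "true" if skip_as_source else "false"
    return (
        f'netsh int ipv4 add address "{interface_name}" '
        f"{ip_address} skipassource={flag}"
    )


if __name__ == "__main__":
    # Hypothetical list of extra addresses for the Team-1 interface.
    for address in ["172.16.5.10", "172.16.5.11", "172.16.5.12"]:
        print(build_add_address_command("Team-1", address))
```

Review the generated lines before pasting them into an elevated command prompt.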

     

     

    How can I see what IPs have this flag set?  To list the IPv4 addresses that have the skipassource flag set to true, run the following command:

    Netsh int ipv4 show ipaddresses level=verbose
     

    Note the “Skip As Source” entries in the output below:

    Skip As Source

     

    That’s all pretty neat, but if you are wondering what your interface name is, check the GUI or run the following Netsh command to show the interfaces:

    Netsh Interface Show Interface

     

    Netsh Interface Name

    Note the Interface Name column on the right-hand side, which corresponds to the names shown in the GUI:

    Windows Network Connections

     

    Note that once you have configured the above, if you then go to the regular GUI and make changes there, the SkipAsSource flag is overwritten unless you have installed the update to correct this known issue.

    Consider the following scenario:

    • You have a computer that is running Windows 7 or Windows Server 2008 R2.
    • You install hotfix 2386184 (http://support.microsoft.com/kb/2386184/) on the computer to enable the skipassource flag of the netsh command.
    • You assign many IP addresses to a network adapter on the computer by using the netsh command together with the skipassource flag.
    • You update some IP settings for the network adapter in the Network and Sharing Centre graphical user interface (GUI). For example, you edit the subnet mask of an IP address that has the skipassource flag set to true.

    The issue occurs because the GUI does not recognize the skipassource flag, and the GUI uses an incorrect method to handle changes of IP settings. When IP settings are changed, the GUI deletes all the old IP addresses from the old list and then adds new IP addresses to the new list. Because the GUI does not know the skipassource flag, the GUI does not copy the flag when IP addresses are added to the list. Therefore, the skipassource flag is cleared.
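The delete-and-re-add behaviour can be illustrated with a short Python model. It is a simplified sketch of the logic described in the paragraph above, not the GUI's actual implementation. The function name and the dictionary fields are hypothetical.

```python
def gui_apply_changes(addresses, new_masks):
    """Model the GUI: rebuild the address list, copying only the fields it knows."""
    rebuilt = []
    for entry in addresses:
        rebuilt.append({
            "ip": entry["ip"],
            "mask": new_masks.get(entry["ip"], entry["mask"]),
            # The flag is not copied, so every re-added address gets the default.
            "skip_as_source": False,
        })
    return rebuilt


# Hypothetical starting state: one primary address and one flagged address.
before = [
    {"ip": "172.16.5.9", "mask": "255.255.255.0", "skip_as_source": False},
    {"ip": "172.16.5.10", "mask": "255.255.255.0", "skip_as_source": True},
]

if __name__ == "__main__":
    # Editing the subnet mask of the flagged address clears the flag.
    after = gui_apply_changes(before, {"172.16.5.10": "255.255.0.0"})
    for entry in after:
        print(entry)
```

In the model, the flag is lost on every address the GUI rewrites, not only on the one that was edited, because the whole list is rebuilt.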

     

    Cheers,

    Rhoderick