This week I accepted a case escalation out of Microsoft EMEA which involved the Quality of Experience monitoring and reporting component of Office Communications Server 2007. Knowing absolutely nothing about QoE, I downloaded the OCS 2007 Quality of Experience Monitoring Server package from Microsoft and added it to my existing lab environment. After an interesting (and challenging) installation experience, I cracked open the case notes that were logged by previous engineers.
The following is a sample of one of the error messages:
Event Type: Error
Event Source: OCS Mediation Server
Event ID: 25022
Computer: OCSMedSvr01
Description: The Quality-Metric server cannot be contacted. The Quality metric reports are not sent to the server.
Exception: Microsoft.Rtc.Signaling.OperationTimeoutException: This operation has timed out.
   at Microsoft.Rtc.Signaling.SipAsyncResult.ThrowIfFailed()
   at Microsoft.Rtc.Signaling.Helper.EndAsyncOperation[T](Object owner, IAsyncResult asyncResult)
   at Microsoft.Rtc.Signaling.RealTimeEndpoint.EndSendMessage(IAsyncResult asyncResult)
   at Microsoft.RTC.MediationServerCore.QosReport.SendMessageCallback(IAsyncResult result)
Cause: Either the Quality-Metric server is not running or unreachable.
Resolution: Verify that Quality-Metric server is reachable from the computer running Office Communications Server.
After establishing associations between the Quality of Service server and the OCS Pool and Mediation servers, the Mediation server will send a quality metric report to the QoS server upon the completion of each call. The call quality details are contained in an XML blob sent via a SERVICE message over secure SIP (TCP port 5061). Jitter, packet loss, network utilization, and audio quality are some of the metrics that are captured and reported for each call. Upon receiving the SERVICE request, the QoS server will respond to the Mediation server with 202 Accepted. If the QoE report fails to reach the QoS server, or if the response is never received by the Mediation server, there is no attempt to resend. QoE reporting is very much a fire-and-forget type of process.
While the error shown above is indicative of a network problem between the Mediation server and the QoS server, it can be confusing to troubleshoot. So, where do you begin?
OCSLogger and Snooper, of course!
The first troubleshooting step is to install the OCS 2007 Resource Kit tools on the Mediation server(s) in your environment. The Resource Kit tools can be downloaded from the Microsoft Download Center. Once installed, open the OCS 2007 management console and drill down on Mediation Servers. Right-click on your Mediation server and choose Logging Tool > New Debug Session. This will open the options for the OCSLogger tool.
Enable logging for the following components, leaving everything else configured with the default options:
Click Start Logging, then let it run for about an hour (or as long as it takes to reproduce the error as shown above). Once you see Event ID 25022 appear in the event log of your Mediation server, click Stop Logging. Click Analyze Log Files > Analyze to launch Snooper.
At the top of the Snooper window, enter the word SERVICE in the search blank, then click on the green magnifying glass. This will highlight all QoE reports sent from the Mediation server. Each SERVICE message will contain a quality metric report sent from the Mediation server, and each SERVICE message should be followed by 202 Accepted. Any break in this sequence will result in Event ID 25022, as shown below:
From the highlighted area in the screenshot above, we can tell that the Mediation server sent the QoE report at the conclusion of this call. However, the QoS server never responded with 202 Accepted to the Mediation server. If the Mediation server does not receive an acknowledgement for the receipt of the QoE report within approximately 30 seconds, the reporting event will time out and Event ID 25022 will be logged by the Mediation server.
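If you would rather not eyeball every SERVICE message in Snooper, the same check can be sketched in a few lines of Python. This is only an illustration: it does not read OCSLogger files, and the SipMessage record and find_unanswered_reports function are names I made up for the example. It assumes you have already pulled the timestamp, Call-ID, and CSeq number out of each message, and it uses the approximate 30 second timeout mentioned above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SipMessage:
    """Hypothetical, simplified view of one SIP message from a trace."""
    timestamp: float               # seconds since the start of the trace
    call_id: str                   # Call-ID header value
    cseq: int                      # CSeq sequence number
    method: Optional[str] = None   # set on requests, e.g. "SERVICE"
    status: Optional[int] = None   # set on responses, e.g. 202


def find_unanswered_reports(messages: List[SipMessage],
                            timeout: float = 30.0) -> List[SipMessage]:
    """Return SERVICE requests with no 202 Accepted within the timeout."""
    # Index every 202 response by the transaction it answers.
    accepted = {(m.call_id, m.cseq): m.timestamp
                for m in messages if m.status == 202}
    unanswered = []
    for m in messages:
        if m.method != "SERVICE":
            continue
        reply_time = accepted.get((m.call_id, m.cseq))
        # Missing reply, or a reply that arrived after the timeout window.
        if reply_time is None or reply_time - m.timestamp > timeout:
            unanswered.append(m)
    return unanswered
```

Each request this returns should line up with an Event ID 25022 entry logged roughly 30 seconds later.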
So what’s next? Network Monitor!
On the Mediation server, download and install Microsoft Network Monitor from the Microsoft Download Center. Once installed, launch Network Monitor and capture network traffic for about an hour or until Event ID 25022 appears in your application log. In chatty environments, network captures can grow quite large in a very short period of time. This is especially true of OCS Mediation servers, where voice streams sent over RTP/UDP can account for a significant percentage of network traffic in a large capture file. You may want to consider using a capture filter to cut out much of the extraneous noise on the wire:
Capture Filter: ipv4.address==10.32.10.45 && tcp
After an hour has passed or as soon as Event ID 25022 appears in your application log, stop and save the network trace file. Then, click on the Display Filter tab and enter and apply the following display filter:
Display Filter: tcp.flags.reset==0x1 && tcp.port==5061
This is where troubleshooting this issue can become a bit confusing. See, the Mediation server and the QoS server never talk directly to each other, so you will never see a network failure occur directly between those two IP addresses. Instead, both the Mediation server and the QoS server communicate via the SIP protocol – and all SIP communication in OCS flows through the Front End servers in the OCS Pool.
And what device is responsible for handling network traffic sent to an OCS Pool hosting multiple Front End servers? You got it … a network load balancer!
Now let’s look at the network traffic captured by Network Monitor:
Here you can see a number of hard network resets (note only the reset flag ‘R’ is present), most of which originate from 10.32.12.141 – the virtual IP address of the load balancer sitting in front of the OCS Pool. These network resets occur at frequent intervals throughout the day and are responsible for the failed submission and/or acknowledgement of QoE reports between the Mediation and QoS servers.
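To put a number on which host is sending the resets, the tally can be sketched in Python. This is a rough illustration rather than a Network Monitor feature: the Segment record and count_hard_resets function are hypothetical names, and the sketch assumes you have exported the source IP, ports, and TCP flags byte for each frame by some other means. A "hard" reset here means the RST bit (0x04) is the only flag set, as in the trace above.

```python
from collections import Counter
from typing import Iterable, NamedTuple

TCP_RST = 0x04  # RST bit in the TCP flags byte


class Segment(NamedTuple):
    """Hypothetical, simplified view of one captured TCP segment."""
    src_ip: str
    src_port: int
    dst_port: int
    flags: int  # TCP flags byte


def count_hard_resets(segments: Iterable[Segment], port: int = 5061) -> Counter:
    """Count RST-only segments on the given port, grouped by source IP."""
    counts: Counter = Counter()
    for seg in segments:
        on_port = port in (seg.src_port, seg.dst_port)
        # Equality (not a bit test) keeps only segments where RST is alone.
        if on_port and seg.flags == TCP_RST:
            counts[seg.src_ip] += 1
    return counts
```

Calling count_hard_resets(segments).most_common(1) on the exported frames would then name the top sender, which in this case was the load balancer's virtual IP.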
If you are experiencing this problem, check the TCP idle timeout window for cleaning up stale connections in the configuration of your load balancer. Many load balancers are configured with a very small TCP idle timeout window (e.g. less than 5 minutes), which can cause semi-active TCP connections (like those carrying SIP traffic) to be inadvertently garbage collected.
In LCS/OCS environments, load balancers should be configured with a 20-minute TCP idle timeout window as a best practice. Making this slight configuration change on your load balancer will often resolve this and many other quirky connectivity issues in your OCS environment.
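To see why the timeout window matters, here is a deliberately simplified Python sketch. The reports_at_risk function is a hypothetical name, and the model assumes that the only traffic on the connection is the traffic at the listed times and that the load balancer drops a connection as soon as it has been idle for the full timeout. Real load balancers and SIP keep-alives will behave differently, so treat it as an illustration of the idea only.

```python
from typing import List


def reports_at_risk(send_times: List[float], idle_timeout: float) -> List[float]:
    """Return send times (in seconds) that follow an idle gap >= the timeout."""
    at_risk = []
    # Compare each send time with the one before it on the same connection.
    for previous, current in zip(send_times, send_times[1:]):
        if current - previous >= idle_timeout:
            at_risk.append(current)
    return at_risk


# Made-up example: traffic at 0, 2, 9, 10 and 25 minutes into the trace.
sends = [0.0, 120.0, 540.0, 600.0, 1500.0]
short_window = reports_at_risk(sends, idle_timeout=5 * 60)
long_window = reports_at_risk(sends, idle_timeout=20 * 60)
```

With the made-up times above, the 5-minute window flags two sends that would land on a connection the load balancer has already cleaned up, while the 20-minute window flags none.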
Hope this helps!