This article highlights the ways in which Microsoft Lync Server 2010 Monitoring/Archiving server improves data transmission using Microsoft Message Queuing (MSMQ). 

Authors: Weiming Shen, Xu Liu, and Clark Chen

Publication date: May 2011

Product version: Lync Server 2010

The Microsoft Lync Server 2010 Monitoring/Archiving server offers several improvements that enhance the reliability of data transmission using Microsoft Message Queuing (MSMQ). In Microsoft Office Communications Server 2007 R2, messages were occasionally dropped if they could not be delivered or processed due to network issues or delivery expiration timeouts. With Lync Server, we retain these messages and attempt to deliver them after the blocking issues have been resolved. In addition, we provide a manageable way to raise warning events before messages might be dropped (for example, because MSMQ is running out of disk space). As a result, administrators have the opportunity to resolve the issue and avoid data loss in such cases.

The following illustration shows how the Monitoring/Archiving server uses MSMQ to interact with the Lync Server Front End servers:

Figure 1

 

In Office Communications Server 2007 R2, messages for Archiving, Call Detail Recording (CDR), and Quality of Experience (QoE) were delivered in a straightforward fashion. Each message was stamped with an expiration time (the TIME_TO_REACH_QUEUE setting) before being delivered to MSMQ. If the message could not be delivered to the Archiving or Monitoring server before the expiration time, the message was dropped by the MSMQ service and never written to the database. In addition, messages received by the target queue were marked with a second expiration time (the TIME_TO_BE_RECEIVED setting). If the message was not retrieved from the target queue before that expiration time, that message would be dropped, even though it had been delivered to the target queue before the TIME_TO_REACH_QUEUE setting expired. These expiration times were required because Front End servers had limited resources, and we wanted an automated way to purge messages when something went wrong without interfering with the messaging process.
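The two expiration checks described above can be sketched in a few lines of Python. This is only an illustration of the logic, not code from Office Communications Server: the Message class, the message_fate function, and the outcome strings are hypothetical names, and the timeout values used in the usage note are placeholders. The sketch assumes that both timers start when the message is sent, which is how MSMQ measures time-to-reach-queue and time-to-be-received.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Message:
    """Hypothetical stand-in for an Archiving, CDR, or QoE message."""
    sent_at: datetime
    time_to_reach_queue: timedelta  # budget for delivery to the target queue
    time_to_be_received: timedelta  # budget for retrieval from the target queue


def message_fate(msg: Message,
                 reached_queue_at: Optional[datetime],
                 retrieved_at: Optional[datetime]) -> str:
    """Return what happens to a message under the two expiration timers.

    Assumption: both deadlines are measured from the time the message
    was sent. None means the step never happened.
    """
    reach_deadline = msg.sent_at + msg.time_to_reach_queue
    receive_deadline = msg.sent_at + msg.time_to_be_received

    # First timer: the message must arrive in the target queue in time.
    if reached_queue_at is None or reached_queue_at > reach_deadline:
        return "dropped_in_transit"

    # Second timer: the message must also be retrieved in time, even if
    # it arrived in the target queue before the first deadline.
    if retrieved_at is None or retrieved_at > receive_deadline:
        return "dropped_in_queue"

    return "processed"
```

For example, with illustrative timeouts of 3 hours and 4 hours, a message that reaches the target queue after 1 hour but is not retrieved until 5 hours after it was sent yields "dropped_in_queue", which is the second failure case described above.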

With this approach, there were several scenarios in which processing could fail and messages would be dropped:

  • With wide area networks (such as those used in a branch office scenario), the network connection would occasionally go down. If that connection was not restored before the expiration time, messages were dropped. (And administrators were not notified that messages were dropped until those messages had actually been deleted and could no longer be recovered.) Even if the network connection was restored quickly, there might still be data loss if the accumulated messages could not be processed before timing out.
  • Services running on the Monitoring/Archiving server or on the Front End server might temporarily go down. If the services were not restarted in time, messages that reached the expiration timeout were dropped.
  • Due to the increasing size of the workload, the processing services might not be able to process all the received messages in time. What often happened with Office Communications Server 2007 R2 was that the initial deployment provided the Monitoring/Archiving server with enough capacity to process messages in a timely manner. Over time, however, more users began to use the system, and more Front End servers were deployed to handle these additional users. Unfortunately, in Office Communications Server 2007 R2 it was not easy to scale the Monitoring/Archiving server to handle the additional users and the additional Front End servers. If the services on the Monitoring/Archiving servers were overwhelmed by the number of messages, messages in the target queue might expire before they could be processed. Again, administrators would not be notified of the problem until after the messages had already been dropped.

These issues can also lead to a problem with the Critical mode feature supported in Office Communications Server 2007 R2, the optional feature that shuts down the messaging system if messages cannot be archived. In Office Communications Server 2007 R2, it might take several hours before the system realizes that messages are accumulating in the target queue and expiring before they can be written to the database. That means that several hours' worth of data might be lost before the system shuts itself down. That could be a serious problem for organizations that are required by law to keep a copy of all electronic communications.

These issues are primarily due to limitations in data transmission reliability in Office Communications Server 2007 R2 monitoring and archiving. Because of that, we have introduced several major improvements in Lync Server 2010:

  • The heartbeat message. The heartbeat message was introduced as a way to detect MSMQ problems before messages are dropped. Heartbeat messages are delivered along with each data message that MSMQ delivers to the target queue. A heartbeat message contains a minimal amount of information and is thus much smaller than a data message. When a heartbeat message is processed, the sending agent receives an ACK (acknowledgment message), which tells the system that the heartbeat message and its accompanying data message have been processed. If problems occur with either delivery or processing, the sending agent will receive an error NACK (negative acknowledgment), either because delivery could not be completed or because the heartbeat message timed out. (Heartbeat messages are still given expiration timeouts; data messages are not.)
  • Front End MSMQ quota monitoring. By default, MSMQ is given 1 gigabyte of disk space on each Front End server. (You can use Server Manager to increase the disk space allocated to MSMQ.) Quota monitoring in Lync Server 2010 helps ensure that administrators receive an alert before data loss occurs due to MSMQ running out of disk space. If MSMQ has reached 95% of its disk quota, a warning is written to the event log, and administrators will have an opportunity to increase the quota size or free up disk space on the affected computer. After the initial warning has been issued, an information event will be written to the event log any time quota usage exceeds 90%.
  • Queue health monitoring. Queue health monitoring enables administrators to know the current status of the connection between a Front End server and the Monitoring/Archiving server, as well as the status of the MSMQ service and processing service.
    Queue health is based on the ACK/NACK responses to heartbeat messages. A timeout NACK on a target queue indicates a problem delivering messages to the target queue because the network is down, the MSMQ service is down, or the MSMQ TCP port is blocked. A timeout NACK on the processing service indicates that there is a problem with the message processing service on the Monitoring/Archiving server.
    Note that other issues, such as message encryption corruption, are reported in the event log as error events.
  • Message resending mechanism. Messages that cannot be delivered or processed in a timely fashion are no longer dropped. Instead, those messages are retained in the dead letter queue and will be automatically re-sent by the MSMQ service.
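The quota warnings and the heartbeat-based health check in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Lync Server implementation: QuotaMonitor, queue_health, and the result strings are hypothetical names. The 95% and 90% figures come from the text. The rule that the warning is re-armed once usage falls back to 90% or below is an assumption made so that the sketch is complete.

```python
from typing import Optional


class QuotaMonitor:
    """Hypothetical sketch of the two-threshold MSMQ quota check."""

    def __init__(self, high_pct: float = 95.0, low_pct: float = 90.0) -> None:
        self.high_pct = high_pct
        self.low_pct = low_pct
        self.warned = False  # True once the initial warning has been raised

    def check(self, used_bytes: int, quota_bytes: int) -> Optional[str]:
        """Return the kind of event to log for this check, or None."""
        usage_pct = 100.0 * used_bytes / quota_bytes
        if not self.warned:
            if usage_pct >= self.high_pct:
                self.warned = True
                return "warning"  # initial low disk space warning
            return None
        if usage_pct > self.low_pct:
            return "information"  # follow-up event after the warning
        # Assumption: usage at or below the low threshold re-arms the warning.
        self.warned = False
        return None


# Hypothetical labels for the heartbeat outcomes described in the text.
HEALTH_BY_HEARTBEAT_RESULT = {
    "ack": "healthy",
    "nack_target_queue_timeout":
        "delivery problem: network down, MSMQ service down, "
        "or MSMQ TCP port blocked",
    "nack_processing_timeout":
        "processing service problem on the Monitoring/Archiving server",
}


def queue_health(heartbeat_result: str) -> str:
    """Map a heartbeat outcome to the queue health it indicates."""
    return HEALTH_BY_HEARTBEAT_RESULT.get(
        heartbeat_result, "unknown: check the event log")
```

With the default 1 gigabyte quota, a check at 96% usage returns "warning", a later check at 92% returns "information", and a check at 50% returns nothing and, under the assumption above, resets the monitor.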

The following diagram illustrates the improvements made to archiving and CDR in Lync Server 2010. Note that these enhancements do not necessarily apply to QoE; that's because QoE messages are typically not considered mission critical and, as a result, can be dropped in certain scenarios.

Figure 2

 

The following table shows the set of new features supported by each monitoring component:

Feature                   CDR   Archiving   QoE
Heartbeat message         Yes   Yes         Yes
Queue health monitoring   Yes   Yes         Yes
Quota protection          Yes   Yes         Yes
Admin queue               Yes   Yes         Yes
Dead letter queue         Yes   Yes         No
Critical mode             No    Yes         No

Many of these features can be customized by modifying registry values found under HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\RtcSrv\Parameters. These configurable values are listed in the following table:

Registry Value          Default Value   Description
QuotaHighThreshold      95%             Amount of quota space that must be used before the initial low disk space warning is sent.
QuotaLowThreshold       90%             Amount of quota space that must be used before subsequent low disk space warnings are sent.
HeartbeatInterval       1 minute        Amount of time between system heartbeat checks.
CheckDLQInterval        10 seconds      Amount of time the system waits before checking the dead letter queue.
CheckQuotaInterval      15 seconds      Amount of time between quota disk space checks.
HeartbeatToReachQueue   10 minutes      Amount of time allowed for a heartbeat message to reach the target queue.
HeartbeatToBeReceived   60 minutes      Amount of time allowed for a heartbeat message to be received and processed.
DataToReachQueue        3 hours         Amount of time allowed for a data message to reach the target queue.
DataToBeReceived        4 hours         Amount of time allowed for a data message to be received and processed.

Any changes made to these registry values will not take effect until the Front End service is restarted. Be careful when making these changes, as an improper configuration could lead to data loss.
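Because an improper configuration could lead to data loss, it can help to sanity-check a planned set of values before applying them. The following Python sketch records the defaults from the table and flags two combinations that look inconsistent. It does not read or write the registry, and it makes several assumptions: the find_problems function is hypothetical, the two consistency rules are our own reading of the table rather than documented Lync Server checks, and the sketch says nothing about the data types or units in which the registry values are actually stored.

```python
from datetime import timedelta
from typing import Dict, List

# Default values taken from the table above.
DEFAULTS: Dict[str, object] = {
    "QuotaHighThreshold": 95,  # percent
    "QuotaLowThreshold": 90,   # percent
    "HeartbeatInterval": timedelta(minutes=1),
    "CheckDLQInterval": timedelta(seconds=10),
    "CheckQuotaInterval": timedelta(seconds=15),
    "HeartbeatToReachQueue": timedelta(minutes=10),
    "HeartbeatToBeReceived": timedelta(minutes=60),
    "DataToReachQueue": timedelta(hours=3),
    "DataToBeReceived": timedelta(hours=4),
}


def find_problems(overrides: Dict[str, object]) -> List[str]:
    """Return descriptions of planned settings that look inconsistent.

    Both rules are assumptions made for this sketch:
    1. The low quota threshold should sit below the high threshold.
    2. A reach-queue timeout should not exceed the matching
       be-received timeout.
    """
    settings = {**DEFAULTS, **overrides}
    problems: List[str] = []

    low = settings["QuotaLowThreshold"]
    high = settings["QuotaHighThreshold"]
    if not 0 < low < high <= 100:
        problems.append("QuotaLowThreshold must be below QuotaHighThreshold")

    for kind in ("Heartbeat", "Data"):
        reach = settings[kind + "ToReachQueue"]
        received = settings[kind + "ToBeReceived"]
        if reach > received:
            problems.append(
                kind + "ToReachQueue exceeds " + kind + "ToBeReceived")

    return problems
```

Calling find_problems({}) on the unmodified defaults returns an empty list, while find_problems({"QuotaLowThreshold": 97}) reports that the low threshold is no longer below the high threshold.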
