Multi-point Control Units for Web Conferencing and Audio & Video

How do IT Admins determine whether these servers are staying healthy? As I explained in my previous blog, the real trick in implementing a comprehensive monitoring plan is telling proactively when MCU server health is starting to decay. MCUs can be collocated on OCS 2007 Standard Edition servers. Most large Enterprise Edition deployments have several dedicated Audio/Video MCUs and Data Conferencing MCUs, and may also run IMMCU services on multiple Frontend servers, which I’ve already covered in the post for IM and Presence.

Author: Scott Osborn

Publication date: July 2008

Product version: Office Communications Server 2007 R2

What new information am I covering here?

In addition to listing a good mix of Performance Monitor counters recommended by the product team, I’m covering new ground by identifying certain Microsoft Operations Manager thresholds to watch as MCU health degrades, so you know exactly when to take remedial action. Below is a set of perf counters for the A/V and Web Conferencing MCU server roles, some with thresholds that, if reached, should trigger action by an Administrator. The resource utilization, user load and server health counters are directly applicable to Web Conferencing and A/V MCU functionality. As I said in Part 1, IT Admins need to run resource utilization and user load baseline tests first to determine what is “normal” for their specific deployments. Once baseline numbers are known for each server role, they can add the applicable health monitoring counters to the overall monitoring scheme and proceed from there.

“Snooping” on your MCUs can be very helpful to enhance a complete strategic monitoring plan.

I was surprised to find out that you can run the Snooper tool from the OCS 2007 Resource Kit to produce a diagnostic report on MCUs, determine the state of server health and identify diagnostic events. Among other useful things, Snooper can be used for error analysis. Information can be retrieved about all MCUs in the deployment and a complete diagnostic overview can be obtained. The MCU Health report can be particularly useful because, in addition to showing ID, media type, URL and heartbeat status, it shows server statistics that can be used to determine MCU load from the number of assigned conferences per MCU and the number of connected participants. Snooper’s a great tool, but it’s always a good idea to first review the event logs and the MMC overviews for all MCUs for clues before starting an in-depth troubleshooting investigation with tools like Snooper.

Figure 1: Screenshot of Snooper UI
From the ‘Reports’ menu, select ‘Conferencing and Presence Reports’, and then from the ‘Report’ drop-down list (as in Figure 1) select ‘MCU Health’.

 

Recommended baseline counters to test and monitor resource utilization:

Processor; % Processor Time (_Total) [should operate at less than 80% during peak load] (run on each MCU)

Network Interface; Bytes Total/sec ([your NIC]) [should operate at less than 80% capacity of the NIC] (run on each MCU)

*Memory; Pages/sec (---) (run on each MCU)

*Process; % Processor Time (DataMCUSvc)

*Process; % Processor Time (AVMCUSvc)

*Process; Private Bytes (DataMCUSvc) ([peak])

*Process; Private Bytes (AVMCUSvc) ([peak])

· Physical Disk counters are not applicable to MCU functionality

· Pages/sec indicates total “pressure” on the server’s available memory

· No documented baseline rules for individual process or memory utilization

· Network Interface example: a 100 Mbit/sec NIC has a capacity of 12.5 MBytes/sec, so Bytes Total/sec should stay below 80% x 12.5 MBytes/sec = 10 MBytes/sec
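
The NIC arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name and the `max_utilization_pct` parameter are my own, and the only facts taken from this post are the 80% guideline and the 100 Mbit/sec example.

```python
def nic_threshold_bytes_per_sec(link_mbit_per_sec, max_utilization_pct=80):
    """Return the Bytes Total/sec ceiling for a NIC of the given link speed.

    link_mbit_per_sec:   nominal link speed in megabits per second
    max_utilization_pct: utilization guideline (80% in this post)
    """
    bits_per_sec = link_mbit_per_sec * 1_000_000
    capacity_bytes_per_sec = bits_per_sec / 8   # 8 bits per byte
    return capacity_bytes_per_sec * max_utilization_pct / 100


if __name__ == "__main__":
    for speed in (100, 1000):
        limit = nic_threshold_bytes_per_sec(speed)
        print(f"{speed} Mbit/sec NIC: keep Bytes Total/sec below {limit:,.0f}")
```

For a Gigabit NIC the same rule works out to 100 MBytes/sec, so the threshold you alert on should be recalculated for each NIC speed in the deployment.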

 

Recommended baseline counters to test and monitor user load:

Audio/Video and Web Conferencing MCU: (monitor on each MCU)

AVMCU – 00 - Operations; AVMCU – 000 – Number of Conferences ----

AVMCU – 00 - Operations; AVMCU – 001 – Number of Users ----

LC:DATAMCU – 00 – DataMCU Conferences; DATAMCU – 000 – Conferences ----

LC:DATAMCU – 00 – DataMCU Conferences; DATAMCU – 002 – Connected Users ----

LC:SipEps – 01 – SipEps Transactions; SipEps – 002 – Incoming Transactions Processed ----

LC:SipEps – 01 – SipEps Transactions; SipEps – 003 – Incoming Transactions Processed/sec ----

 

Recommended counters to monitor for server health:

Audio/Video Conferencing MCU: (monitor on the AV MCU)

MEDIA – 00 - Operations; MEDIA – 000 – Global Health ----

MEDIA – 00 - Operations; MEDIA – 001 – TCP disconnects because remote out of sync ----

MEDIA – 00 - Operations; MEDIA – 002 – Relay allocation failures ----

MEDIA – 00 - Operations; MEDIA – 003 – Number of packets dropped by Secure RTP/sec ----

MEDIA – 01 - Planning; MEDIA – 003 – Number of conferences with NORMAL health ----

MEDIA – 01 - Planning; MEDIA – 004 – Number of conferences with OVERLOADED health ----

MEDIA – 01 - Planning; MEDIA – 005 – Number of packets dropped in flow control ----

MEDIA – 01 - Planning; MEDIA – 006 – Number of failed end to end connectivity checks ----

MEDIA – 02 - Informational; MEDIA – 006 – Average time spent in processing audio packets ----

MEDIA – 02 - Informational; MEDIA – 009 – Conference process rate ----

AVMCU – 04 – MCU Health and Performance; AVMCU – 003 – Thread Pool Health State ----

AVMCU – 04 – MCU Health and Performance; AVMCU – 005 – MCU Health State ----

Web Conferencing MCU: (monitor on the Data MCU)

LC:DATAMCU – 02 – MCU Health and Performance; DATAMCU – 002 – Thread Pool Load ----

LC:DATAMCU – 02 – MCU Health and Performance; DATAMCU – 003 – Thread Pool Health State ----

LC:DATAMCU – 02 – MCU Health and Performance; DATAMCU – 005 – MCU Health State ----

LC:DATAMCU – 02 – MCU Health and Performance; DATAMCU – 006 – MCU Draining State ----

Peers/HTTPS Transport/Focus Factory/Focus: (monitor on the Frontend servers)

LC:SIP – 01 - Peers; SIP - 024 – Flow-controlled Connections Dropped (_Total)

LC:SIP – 01 - Peers; SIP - 025 – Average Flow-Control Delay (_Total)

LC:USrv– 20 – Https Transport; USrv – 002 – Number of failed connection attempts ----

LC:USrv– 20 – Https Transport; USrv – 003 – Number of failed connection attempts / Sec ----

LC:USrv– 20 – Https Transport; USrv – 015 – Number of outgoing requests that timed out ----

LC:USrv– 20 – Https Transport; USrv – 016 – Number of outgoing requests that timed out / Sec ----

LC:USrv– 22 – Conference Focus Factory; USrv – 000 – Add Conference requests ----

LC:USrv– 22 – Conference Focus Factory; USrv – 007 – Add Conference requests succeeded ----

LC:USrv– 23 – Conference Control; USrv – 018 – Local C3P success responses ----

LC:USrv– 23 – Conference Control; USrv – 019 – Local C3P pending responses ----

LC:USrv– 25 – Conference Mcu Allocator; USrv – 009 – Factory Unreachable Failures ----

LC:USrv– 25 – Conference Mcu Allocator; USrv – 010 – Factory Calls Timed-Out ----

LC:USrv– 25 – Conference Mcu Allocator; USrv – 016 – Create Conference Mcu Unreachable Failures ----

LC:USrv– 25 – Conference Mcu Allocator; USrv – 017 – Create Conference Requests Timed-Out ----

 

OCS 2007 MOM Pack thresholds from the documentation:

AVMCU - 000 - Number of Conferences [t1] (Warning) (Threshold) (The number of active conferences on the A/V Conferencing Server)

Numeric Threshold Rule triggered when the sampled value is greater than 5001

Causes: The number of active conferences has far exceeded the expected usage and new conferences cannot be created.
Resolutions: If this high number of active conferences persists then the service should be restarted and logging enabled to identify if the rate of conference creation is in line with expected usage.

AVMCU - 004 - Total Picture Freeze/Fast Update Request Sent (Sample)

Numeric Threshold Rule triggered when the sampled value is greater than 1

The current health of the MCU. 0 = Normal. 1 = Loaded. 2 = Full. 3 = Unavailable.

Causes: MCU is overloaded.
Resolutions: This could happen if too many conferences are assigned to this MCU.

(The sample interval for all performance counters listed above is 15 minutes)
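
To make the health values above easier to read in a script, here is a small Python sketch. The 0 to 3 mapping comes from the description above. I am assuming that the “greater than 1” rule is the one that applies to this health value, and the names `McuHealth` and `describe_health` are purely illustrative.

```python
from enum import IntEnum


class McuHealth(IntEnum):
    """MCU health values as listed in the MOM Pack description."""
    NORMAL = 0
    LOADED = 1
    FULL = 2
    UNAVAILABLE = 3


ALERT_ABOVE = 1  # assumption: the "greater than 1" rule applies to this value


def describe_health(sample):
    """Turn a sampled health value into a readable status string."""
    try:
        state = McuHealth(sample)
    except ValueError:
        return f"unknown health value {sample}"
    flag = "ALERT" if sample > ALERT_ABOVE else "ok"
    return f"{state.name.title()} ({flag})"


if __name__ == "__main__":
    for value in (0, 1, 2, 3):
        print(value, describe_health(value))
```

Under that assumption, Normal and Loaded stay quiet while Full and Unavailable raise an alert.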

DATAMCU - 041 - Session queues state (Warning) (Threshold) (The state of the session queues)

Numeric Threshold Rule triggered when the sampled value is greater than 2

Causes: Data MCU is overloaded.
Resolutions: This should be a temporary condition. If this condition persists, please provision more Data MCU machines to handle the load.

DATAMCU - 041 - Session queues state (Sample) (The state of the session queues)

Numeric Threshold Rule triggered when the sampled value is greater than 1

Causes: MCU is overloaded.
Resolutions: This could happen if too many conferences are assigned to this MCU.

(The sample interval for all performance counters listed above is 15 minutes)

USrv - 004 - Outstanding C3P transactions (Sample) (Per-second rate of CCCP requests sent to MCU that timed out)

Numeric Threshold Rule triggered when the change in value over 2 samples is greater than 100

Causes: This can happen if the Server and/or one or more MCU(s) in the Pool are overloaded. This can also happen due to Load Balancer and Network connectivity issues.
Resolutions: This might be a temporary condition. If the problem persists, please ensure that hardware and software requirements of the Pool meet the usage characteristics and that the network is functioning correctly.

USrv - 004 - Notifications in processing (Sample) (The average time [in milliseconds] taken to complete a MCU factory call)

Numeric Threshold Rule triggered when the sampled value is greater than 5000

Causes: The Mcu factory might be busy and may not respond immediately.
Resolutions: This might be a temporary condition. If the problem persists please ensure that the hardware and software requirements meet the user usage characteristics.

USrv - 011 - Factory Call Latency (msec) (Error) (Threshold) (The average time [in milliseconds] taken to complete a MCU factory call)

Causes: The Mcu factory might be busy and may not respond immediately.
Resolutions: This might be a temporary condition. If the problem persists please ensure that the hardware and software requirements meet the user usage characteristics.

USrv - 011 - Factory Call Latency (msec) (Sample) (The average time [in milliseconds] taken to complete a create conference call)

Numeric Threshold Rule triggered when the sampled value is greater than 5000

Causes: The Mcu or Backend might be busy and may not respond immediately.
Resolutions: This might be a temporary condition. If the problem persists please ensure that the hardware and software requirements meet the user usage characteristics.

USrv - 013 - Average Outgoing Queue Delay (ms) (Sample) (Number of C3P transactions currently in processing)

Numeric Threshold Rule triggered when the change in value over 2 samples is greater than 1000

Causes: This can typically happen if the Server and/or one or more MCU(s) in the Pool are overloaded.
Resolutions: This might be a temporary condition. If the problem persists, please ensure that hardware and software requirements of the Pool meet the usage characteristics.

USrv - 019 - Create Conference Latency (msec) (Error) (Threshold) (The average time [in milliseconds] taken to complete a create conference call)

Causes: The Mcu or Backend might be busy and may not respond immediately.
Resolutions: This might be a temporary condition. If the problem persists please ensure that the hardware and software requirements meet the user usage characteristics.

USrv - 019 - Create Conference Latency (msec) (Sample) (The average time [in milliseconds] taken to complete a full Mcu allocation request)

Numeric Threshold Rule triggered when the sampled value is greater than 10000

Causes: The Mcu factory or Mcu or Backend might be busy and may not respond immediately.
Resolutions: This might be a temporary condition. If the problem persists please ensure that the hardware and software requirements meet the user usage characteristics.

USrv - 021 - Allocation Latency (msec) (Error) (Threshold) (The average time [in milliseconds] taken to complete a full Mcu allocation request)

Causes: The Mcu factory or Mcu or Backend might be busy and may not respond immediately.
Resolutions: This might be a temporary condition. If the problem persists please ensure that the hardware and software requirements meet the user usage characteristics.

USrv - 029 - Transactions Timed-Out / sec (Warning) (Threshold) (Per-second rate of requests sent to MCU that timed out)

Causes: This can happen if the Server and/or one or more MCU(s) in the Pool are overloaded. This can also happen due to Load Balancer and Network connectivity issues.
Resolutions: This might be a temporary condition. If the problem persists, please ensure that hardware and software requirements of the Pool meet the usage characteristics and that the network is functioning correctly.

(The sample interval for all performance counters listed above is 15 minutes)
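
The rules above come in two flavors: “sampled value is greater than N” and “change over 2 samples is greater than N”. Here is a rough Python sketch of how you might evaluate both against your own collected samples. It is not how MOM implements these rules; the function names are hypothetical, and I am assuming that “change over 2 samples” means the increase between the two most recent consecutive samples.

```python
def value_rule_fires(samples, threshold):
    """True when the most recent sample is greater than the threshold."""
    return len(samples) >= 1 and samples[-1] > threshold


def delta_rule_fires(samples, threshold):
    """True when the increase between the last two samples exceeds the threshold.

    Assumption: the rule compares the two most recent consecutive samples.
    """
    if len(samples) < 2:
        return False
    return samples[-1] - samples[-2] > threshold


if __name__ == "__main__":
    # Example values only; thresholds 5000 and 100 are taken from the list above.
    latency_msec = [1200, 1800, 5200]
    outstanding = [10, 40, 190]
    print("latency rule fires:", value_rule_fires(latency_msec, 5000))
    print("delta rule fires:", delta_rule_fires(outstanding, 100))
```

With a 15 minute sample interval, a delta rule only sees two points 15 minutes apart, so short spikes between samples will not show up in either check.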

MCU Health is monitored internally by the Pool itself so unhealthy or overloaded MCUs will not be used.

In OCS 2007 the ‘MCU Factory’ component running on the Frontends is responsible for monitoring MCU “health status” and supplying the best available MCU during conference creation, whether for audio/video conferencing or web conferencing. When an MCU service starts up, it begins sending “health notifications” every 15 seconds to the ‘MCU Factory’ to advertise whether it can take on new conferences. The ‘MCU Factory’ thus keeps a dynamic list of available MCUs for each modality (A/V, Data Conferencing) and chooses between them when conferences are created.

When a request comes in, the selection of an MCU is based partly on the overall health of the MCU (e.g. Normal = healthy; Loaded = marginal; Unavailable = maximum reached or server down). Selection is not based solely on health, however: randomness is introduced into the selection algorithm to minimize the risk that a single MCU is repeatedly selected to host most of the conferences.
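
To picture the idea of health-aware selection with some randomness, here is a toy Python sketch. The actual ‘MCU Factory’ algorithm is not documented in this post, so everything beyond the 15 second notification interval is an assumption on my part: the staleness window, the weights, and the names `McuEntry` and `pick_mcu` are made up for illustration.

```python
import random
from dataclasses import dataclass

HEARTBEAT_INTERVAL_SEC = 15                    # from this post
STALE_AFTER_SEC = 3 * HEARTBEAT_INTERVAL_SEC   # assumption: 3 missed notifications
WEIGHTS = {"Normal": 3, "Loaded": 1}           # assumption: prefer healthy MCUs


@dataclass
class McuEntry:
    url: str
    modality: str          # e.g. "av" or "data"
    health: str            # "Normal", "Loaded" or "Unavailable"
    last_heartbeat: float  # seconds, same clock as `now`


def pick_mcu(entries, modality, now, rng=random):
    """Pick an MCU for a new conference, or None if none is usable.

    Candidates must match the modality, have a recent health notification
    and report a health state that still accepts conferences. The final
    choice is random, weighted toward healthier MCUs.
    """
    candidates = [
        e for e in entries
        if e.modality == modality
        and e.health in WEIGHTS
        and now - e.last_heartbeat <= STALE_AFTER_SEC
    ]
    if not candidates:
        return None
    weights = [WEIGHTS[e.health] for e in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    pool = [
        McuEntry("mcu-a", "av", "Normal", last_heartbeat=100.0),
        McuEntry("mcu-b", "av", "Loaded", last_heartbeat=95.0),
        McuEntry("mcu-c", "av", "Unavailable", last_heartbeat=100.0),
    ]
    chosen = pick_mcu(pool, "av", now=105.0)
    print("selected:", chosen.url if chosen else None)
```

The weighted random choice is just one way to keep a single healthy MCU from hosting most of the conferences while still favoring it over a loaded one.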

TechNet resources and whitepapers with more information on MCUs:


For an in-depth resource on Office Communications Server 2007, including detailed troubleshooting tips, refer to the Office Communications Server 2007 Resource Kit, especially Chapter 13: “Monitoring,” available from MS Press at: http://www.microsoft.com/MSPress/books/10482.aspx.

Stu prepared the content for this post prior to transferring to Unify2
