I was helping a colleague who was dealing with a Remote desktop connection problem where Windows 8 client was failing to access a certain Windows 2008 R2 SP1 Terminal server. We first collected some logs while reproducing the problem to better understand why the RDP connection was failing:
1) Please start TCPIP/Winsock ETL tracing on the client and on the Windows 2008 R2 TS from an elevated command prompt:
netsh trace start provider=Microsoft-Windows-TCPIP provider=Microsoft-Windows-Winsock-AFD tracefile=Terminalserver_tcpip_winsock.etl maxsize=500 capture=yes
Note: This command generates a lot of activity so we have to do the testing as fast as possible.
2) Please immediately reproduce the problem from the client by trying RDP connection. Right after the problem is reproduced please immediately stop netsh on the Windows 2008 R2 server first with the below command:
netsh trace stop
ANALYSIS:
========
After the logs were collected, I followed the below steps to further troubleshoot the issue:
=> The problem RDP session in the RDP server side ETL trace including network activity trace:
(ip.addr eq 10.0.107.58 and ip.addr eq 10.7.3.187) and (tcp.port eq 59193 and tcp.port eq 3389)
No. Time Delta Source Destination Protocol Length Info
5781 2013-01-06 09:52:15.407 0.000 10.0.107.58 10.7.3.187 TCP 66 59193 > 3389 [SYN] Seq=3976196062 Win=8192 Len=0 MSS=1380 WS=256
5782 2013-01-06 09:52:15.407 0.000 10.7.3.187 10.0.107.58 TCP 62 3389 > 59193 [SYN, ACK] Seq=3028123794 Ack=3976196063 Win=8192 Len=0 MSS=1460 WS=256
5783 2013-01-06 09:52:15.447 0.039 10.0.107.58 10.7.3.187 TCP 60 59193 > 3389 [ACK] Seq=3976196063 Ack=3028123795 Win=66048 Len=0
5784 2013-01-06 09:52:15.448 0.000 10.0.107.58 10.7.3.187 TPKT 73 CR TPDU src-ref: 0x0000 dst-ref: 0x0000[Unreassembled Packet]
5785 2013-01-06 09:52:15.451 0.003 10.7.3.187 10.0.107.58 TCP 54 3389 > 59193 [ACK] Seq=3028123795 Ack=3976196082 Win=66048 Len=0
5786 2013-01-06 09:52:15.458 0.007 10.7.3.187 10.0.107.58 TPKT 73 CC TPDU src-ref: 0x1234 dst-ref: 0x0000[Unreassembled Packet]
5787 2013-01-06 09:52:15.520 0.062 10.0.107.58 10.7.3.187 TCP 60 59193 > 3389 [ACK] Seq=3976196082 Ack=3028123814 Win=66048 Len=0
5958 2013-01-06 09:52:22.632 7.111 10.0.107.58 10.7.3.187 TPKT 218 Continuation
5959 2013-01-06 09:52:22.632 0.000 10.7.3.187 10.0.107.58 TPKT 912 Continuation
5971 2013-01-06 09:52:22.678 0.045 10.0.107.58 10.7.3.187 TPKT 380 Continuation
5972 2013-01-06 09:52:22.696 0.018 10.7.3.187 10.0.107.58 TPKT 113 Continuation
5975 2013-01-06 09:52:22.752 0.055 10.0.107.58 10.7.3.187 TPKT 235 Continuation
5976 2013-01-06 09:52:22.753 0.000 10.7.3.187 10.0.107.58 TPKT 1211 Continuation
5977 2013-01-06 09:52:22.809 0.055 10.0.107.58 10.7.3.187 TPKT 1434 Continuation
5978 2013-01-06 09:52:22.809 0.000 10.7.3.187 10.0.107.58 TCP 54 3389 > 59193 [ACK] Seq=3028125888 Ack=3976198133 Win=65792 Len=0
5979 2013-01-06 09:52:22.809 0.000 10.0.107.58 10.7.3.187 TPKT 263 Continuation
5980 2013-01-06 09:52:22.810 0.001 10.7.3.187 10.0.107.58 TPKT 315 Continuation
...
8129 2013-01-06 09:52:49.090 0.094 10.0.107.58 10.7.3.187 TCP 60 59193 > 3389 [ACK] Seq=3976208206 Ack=3028242245 Win=64768 Len=0
8154 2013-01-06 09:52:49.526 0.436 10.7.3.187 10.0.107.58 TPKT 139 Continuation
8164 2013-01-06 09:52:49.620 0.093 10.0.107.58 10.7.3.187 TCP 60 59193 > 3389 [ACK] Seq=3976208206 Ack=3028242330 Win=64768 Len=0
8170 2013-01-06 09:52:49.673 0.052 10.0.107.58 10.7.3.187 TPKT 123 Continuation
8180 2013-01-06 09:52:49.776 0.102 10.7.3.187 10.0.107.58 TPKT 683 Continuation
8186 2013-01-06 09:52:49.874 0.097 10.0.107.58 10.7.3.187 TCP 60 59193 > 3389 [ACK] Seq=3976208275 Ack=3028242959 Win=66048 Len=0
8254 2013-01-06 09:52:52.475 2.600 10.7.3.187 10.0.107.58 TPKT 91 Continuation
8257 2013-01-06 09:52:52.567 0.092 10.0.107.58 10.7.3.187 TCP 60 59193 > 3389 [ACK] Seq=3976208275 Ack=3028242996 Win=66048 Len=0
10717 2013-01-06 09:53:40.765 48.198 10.7.3.187 10.0.107.58 TCP 54 3389 > 59193 [RST, ACK] Seq=3028242996 Ack=3976208275 Win=0 Len=0
The RDP server resets the existing RDP session. Interestingly all following RDP session attempts are immediately RESET:
I decided to check why TCPIP driver was reseting all incoming connection attempts from the ETL trace:
61714 8:53:53 AM 1/6/2013 111.1142353 Idle (0) 10.0.107.58 10.7.3.187 TCP TCP:Flags=......S., SrcPort=59210, DstPort=MS WBT Server(3389), PayloadLen=0, Seq=2885139468, Ack=0, Win=8192 ( Negotiating scale factor 0x8 ) = 8192
61715 8:53:53 AM 1/6/2013 111.1142441 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCPIP: NBL 0x0000000039EE1B40 fell off the receive fast path, Reason: Session state is not compatible. Protocol = TCP, Family = IPV4, Number of NBLs = 1 (0x1). SourceAddress = 10.0.107.58 Null. DestAddress = 10.7.3.187 Null.
61716 8:53:53 AM 1/6/2013 111.1142562 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCPIP: Transport (Protocol TCP, AddressFamily = IPV4) dropped 1 (0x1) packet(s) with Source = 10.7.3.187:3389, Destination = 10.0.107.58:59210. Reason = Transport endpoint was not found.
61717 8:53:53 AM 1/6/2013 111.1142853 Idle (0) 10.7.3.187 10.0.107.58 TCP TCP: [Bad CheckSum]Flags=...A.R.., SrcPort=MS WBT Server(3389), DstPort=59210, PayloadLen=0, Seq=0, Ack=2885139469, Win=0
that means there’s no process listening on TCP 3389. When we check the netstat output collected from RDP server, we really don’t see a process listening on that port:
TCP 0.0.0.0:80 0.0.0.0:0 LISTENING 4 InHost
TCP 0.0.0.0:135 0.0.0.0:0 LISTENING 920 InHost
TCP 0.0.0.0:443 0.0.0.0:0 LISTENING 4 InHost
TCP 0.0.0.0:445 0.0.0.0:0 LISTENING 4 InHost
TCP 0.0.0.0:4279 0.0.0.0:0 LISTENING 5796 InHost
TCP 0.0.0.0:5279 0.0.0.0:0 LISTENING 5796 InHost
TCP 0.0.0.0:8530 0.0.0.0:0 LISTENING 4 InHost
TCP 0.0.0.0:8531 0.0.0.0:0 LISTENING 4 InHost
TCP 0.0.0.0:47001 0.0.0.0:0 LISTENING 4 InHost
TCP 0.0.0.0:49152 0.0.0.0:0 LISTENING 636 InHost
TCP 0.0.0.0:49153 0.0.0.0:0 LISTENING 232 InHost
TCP 0.0.0.0:49154 0.0.0.0:0 LISTENING 240 InHost
TCP 0.0.0.0:49155 0.0.0.0:0 LISTENING 740 InHost
TCP 0.0.0.0:49156 0.0.0.0:0 LISTENING 724 InHost
TCP 0.0.0.0:49161 0.0.0.0:0 LISTENING 3048 InHost
TCP 10.7.3.187:80 10.193.10.7:49684 ESTABLISHED 4 InHost
=> Also at the exact time which matches the time of initial RDP connection being reset, we see the following event on RDP server which explains why the current connection is reset and the new ones are reset as well:
Error 1/6/2013 8:54:40 AM Service Control Manager 7034 None
The Remote Desktop Services service terminated unexpectedly. It has done this 3 time(s).
My colleague kept troubleshooting the problem from Terminal server perspective and they found out that http://support.microsoft.com/kb/2621440/en-us was re-published due to an initial problem with the security update. After reinstalling the security update, the issue was resolved.
Hope this helps you when dealing with similar problems.
Thanks,
Murat
Recently I was involved in a network connectivity performance issue and wanted to share the root cause we found. You can find below an example communication of how the slow performance occurs:
10.15.28.40: A Windows application server
10.95.2.49: A UNIX application server
3562 2012-12-12 18:19:53.258784 0.000000 10.15.28.40 10.95.2.40 TCP 377 51122 > 7890 [PSH, ACK] Seq=2654579738 Ack=1586267028 Win=253 Len=323
3563 2012-12-12 18:19:53.359881 0.101097 10.95.2.40 10.15.28.40 TCP 60 7890 > 51122 [ACK] Seq=1586267028 Ack=2654580061 Win=49640 Len=0
3564 2012-12-12 18:19:53.359939 0.000058 10.95.2.40 10.15.28.40 TCP 60 7890 > 51122 [PSH, ACK] Seq=1586267028 Ack=2654580061 Win=49640 Len=6
3566 2012-12-12 18:19:53.550528 0.190589 10.15.28.40 10.95.2.40 TCP 54 51122 > 7890 [ACK] Seq=2654580061 Ack=1586267034 Win=253 Len=0
3567 2012-12-12 18:19:53.551151 0.000623 10.95.2.40 10.15.28.40 TCP 330 7890 > 51122 [PSH, ACK] Seq=1586267034 Ack=2654580061 Win=49640 Len=276
3585 2012-12-12 18:19:53.753770 0.202619 10.15.28.40 10.95.2.40 TCP 54 51122 > 7890 [ACK] Seq=2654580061 Ack=1586267310 Win=252 Len=0
At frame #3562, Windows Application server sends a request to the UNIX application server
At frame #3563, UNIX server sends an ACK back to Windows server (probably the current TCP delayed ACK timer on UNIX server side is around 100 msec)
At frame #3564, shortly after the delayed ACK, UNIX applications first response packet is received (6 bytes - probably the application sends the application layer protocol header and payload seperately) that it delayed until it receives an ACK from the other party.
At frame #3566, TCP delayed ACK timer on Windows application server expires (which is around ~200 msec) and then Windows sends a delayed ACK back to UNIX server. The reason delayed ACK timer expires on Windows server is that it receives a TCP segment from UNIX server but it doesn’t receive any other segments.
At frame #3567, UNIX application server sends the rest of the application layer data after the ACK is received
The reason Windows waits for ~200 msec at frame #3566 is TCP delayed ACK mechanism. The reason UNIX application layer server doesn’t send the 6 bytes or 276 bytes payloads until the previous TCP segment (6 bytes) it sent was ACKed is Nagle algorithm.
As you can see above, TCP delayed ACK + Nagle + application behavior bring some ~100 msec + ~200 msec = ~300 msec average delay in every transaction. Considering that thousands of transactions take place for the duration of the session between the two application servers cause a considerable amount of delay.
The solution here was to turn off TCP delayed ACK on Windows server side since it was the easiest one to implement in customer’s scenario. Another option could be turning Nagle algorithm off on the UNIX side application server which serves the application running on Windows.
The following Wikipedia article also summarizes the Nagle algorithm:
This algorithm interacts badly with TCP delayed acknowledgments, a feature introduced into TCP at roughly the same time in the early 1980s, but by a different group. With both algorithms enabled, applications that do two successive writes to a TCP connection, followed by a read that will not be fulfilled until after the data from the second write has reached the destination, experience a constant delay of up to 500 milliseconds, the "ACK delay". For this reason, TCP implementations usually provide applications with an interface to disable the Nagle algorithm. This is typically called the TCP_NODELAY option.
If possible an application should avoid consecutive small writes in the first place, so that Nagle's algorithm will not be triggered. The application should keep from sending small single writes and buffer up application writes then send (or with the help of writev() call).
What we were observing was exactly the same mentioned above. You can also find more information on how to turn off TCP Delayed ACK on Windows servers by changing the TcpAckFrequency registry key:
http://support.microsoft.com/kb/823764 Slow performance occurs when you copy data to a TCP server by using a Windows Sockets API program
Hope this helps
Today I’ll be talking about a problem where the external Lync 2010 clients fail to sign in through Lync 2010 Edge server intermittently and rebooting the Edge server helps resolve the issue for sometime until the problem reappears. There could be multiple problems that might prevent an external user from signing in successfully like authentication issues, service problems, communication problems so on and so forth. In this instance it was a pure communications problem. Let me explain how I troubleshooted this issue now...
First of all we collected a number of logs from relevant systems to better understand at which point the sign-in process was failing. Taking corrective actions without logs is generally shooting in the dark so the best way to deal with a problem is to cover the problem with appropriate logs. Here is the action plan that was used to collect the logs when the problem occurred:
Action plan:
=========
1) Please wait until the issue occurs and then run the following commands on respective systems once the issue occurs:
=> On Edge server:
- Please start a new network trace
- Please start SIPStack & S4 logging
- Please run the following command from an elevated command prompt:
netsh trace start provider=Microsoft-Windows-TCPIP provider=Microsoft-Windows-Winsock-AFD tracefile=Edge_tcpip_winsock.etl maxsize=500 capture=yes
=> On Frontend server:
netsh trace start provider=Microsoft-Windows-TCPIP provider=Microsoft-Windows-Winsock-AFD tracefile=FE_tcpip_winsock.etl maxsize=500 capture=yes
=> On the internal client:
- Please enable Lync tracing from Lync client
=> On the external client:
3) Now please reproduce the problem with signing in with a user from the external client
4) Now please stop logging on all endpoints. (network traces, Lync client side logging, Lync server side logging). You can also run the following command to stop netsh tracing on Edge and Frontend servers:
Troubleshooting:
==============
After the logs were collected as per the action plan, I analyzed them starting from external client side:
Note: Please note that all IP addresses, server names, user names etc are replaced for privacy purposes.
External client IP: 10.1.1.1
Access Edge IP: 172.16.1.1
Edge internal IP: 192.168.1.1
FE server IP: 192.168.3.10
1) Remote client to Edge server activity
- Remote client has SIP connectivity to Edge server but those sessions are closed shortly after the session is established
Example:
ip.addr==10.1.1.1 and tcp.port==62973
4298 2013-01-08 09:18:53.832 0.000 10.1.1.1 172.16.1.1 TCP 66 62973 > 443 [SYN] Seq=2046472569 Win=8192 Len=0 MSS=1380 WS=4 SACK_PERM=1
4299 2013-01-08 09:18:53.832 0.000 172.16.1.1 10.1.1.1 TCP 62 443 > 62973 [SYN, ACK] Seq=4248548130 Ack=2046472570 Win=8192 Len=0 MSS=1460 SACK_PERM=1
4313 2013-01-08 09:18:54.441 0.609 10.1.1.1 172.16.1.1 TCP 60 62973 > 443 [ACK] Seq=2046472570 Ack=4248548131 Win=16560 Len=0
4316 2013-01-08 09:18:56.020 1.578 10.1.1.1 172.16.1.1 TLSv1 181 Client Hello
4317 2013-01-08 09:18:56.037 0.016 172.16.1.1 10.1.1.1 TLSv1 1434 Server Hello, Certificate[Unreassembled Packet]
4318 2013-01-08 09:18:56.037 0.000 172.16.1.1 10.1.1.1 TLSv1 1434 Ignored Unknown Record
4319 2013-01-08 09:18:56.417 0.380 10.1.1.1 172.16.1.1 TCP 60 62973 > 443 [ACK] Seq=2046472697 Ack=4248550891 Win=16560 Len=0
4320 2013-01-08 09:18:56.417 0.000 172.16.1.1 10.1.1.1 TLSv1 1434 Ignored Unknown Record
4321 2013-01-08 09:18:56.417 0.000 172.16.1.1 10.1.1.1 TLSv1 560 Ignored Unknown Record
4322 2013-01-08 09:18:56.787 0.370 10.1.1.1 172.16.1.1 TCP 60 62973 > 443 [ACK] Seq=2046472697 Ack=4248552777 Win=16560 Len=0
4323 2013-01-08 09:18:56.791 0.004 10.1.1.1 172.16.1.1 TLSv1 380 Client Key Exchange, Change Cipher Spec, Encrypted Handshake Message
4324 2013-01-08 09:18:56.796 0.004 172.16.1.1 10.1.1.1 TLSv1 113 Change Cipher Spec, Encrypted Handshake Message
4327 2013-01-08 09:18:57.193 0.397 10.1.1.1 172.16.1.1 TLSv1 379 Application Data
4328 2013-01-08 09:18:57.196 0.003 172.16.1.1 10.1.1.1 TLSv1 475 Application Data
4333 2013-01-08 09:18:57.599 0.402 10.1.1.1 172.16.1.1 TLSv1 891 Application Data
4337 2013-01-08 09:18:57.802 0.203 172.16.1.1 10.1.1.1 TCP 54 443 > 62973 [ACK] Seq=4248553257 Ack=2046474185 Win=64860 Len=0
4506 2013-01-08 09:19:18.612 20.809 172.16.1.1 10.1.1.1 TLSv1 507 Application Data
4509 2013-01-08 09:19:19.094 0.482 10.1.1.1 172.16.1.1 TCP 60 62973 > 443 [FIN, ACK] Seq=2046474185 Ack=4248553710 Win=15627 Len=0
4510 2013-01-08 09:19:19.094 0.000 172.16.1.1 10.1.1.1 TCP 54 443 > 62973 [ACK] Seq=4248553710 Ack=2046474186 Win=64860 Len=0
4511 2013-01-08 09:19:19.094 0.000 172.16.1.1 10.1.1.1 TCP 54 443 > 62973 [RST, ACK] Seq=4248553710 Ack=2046474186 Win=0 Len=0
Approximately 21 seconds later Edge server returns the response to Client (SIP server timeout) and that the connection is closed. I’ll explain from where that 21 seconds come from later.
=> Looking at the same activity from Edge server SIPStack log:
TL_INFO(TF_PROTOCOL) [0]05E0.0AAC::01/08/2013-09:18:57.600.02e733e9 (SIPStack,SIPAdminLog::TraceProtocolRecord:SIPAdminLog.cpp(125))$$begin_record
Trace-Correlation-Id: 3490147474
Instance-Id: 000AE0DB
Direction: incoming;source="external edge";destination="internal edge"
Peer: 10.1.1.1:62973
Message-Type: request
Start-Line: REGISTER sip:contoso.com SIP/2.0
From: <sip:username@contoso.com>;tag=eee093290f;epid=927c603558
To: <sip:username@contoso.com>
CSeq: 1 REGISTER
Call-ID: 20d82282a7d94f4fb66a0a6df0c0377f
Via: SIP/2.0/TLS 192.168.2.98:62973
Max-Forwards: 70
Contact: <sip:192.168.2.98:62973;transport=tls;ms-opaque=6b23bb065a>;methods="INVITE, MESSAGE, INFO, OPTIONS, BYE, CANCEL, NOTIFY, ACK, REFER, BENOTIFY";proxy=replace;+sip.instance="<urn:uuid:7C426EA7-E80D-53A0-95E6-ED07D8D0CD1E>"
User-Agent: UCCAPI/4.0.7577.4356 OC/4.0.7577.4356 (Microsoft Lync 2010)
Supported: gruu-10, adhoclist, msrtc-event-categories
Supported: ms-forking
Supported: ms-cluster-failover
Supported: ms-userservices-state-notification
ms-keep-alive: UAC;hop-hop=yes
Event: registration
Content-Length: 0
Message-Body:
TL_ERROR(TF_DIAG) [0]05E0.0AAC::01/08/2013-09:19:18.612.02e7707a (SIPStack,SIPAdminLog::TraceDiagRecord:SIPAdminLog.cpp(143))$$begin_record
LogType: diagnostic
Severity: error
Text: Message was not sent because the connection was closed
SIP-Start-Line: REGISTER sip:contoso.com SIP/2.0
SIP-Call-ID: 20d82282a7d94f4fb66a0a6df0c0377f
SIP-CSeq: 1 REGISTER
Peer: fepool1.contoso.com:5061
and the response back to client:
TL_INFO(TF_PROTOCOL) [0]05E0.0AAC::01/08/2013-09:19:18.613.02e7734c (SIPStack,SIPAdminLog::TraceProtocolRecord:SIPAdminLog.cpp(125))$$begin_record
Instance-Id: 000AE0E9
Direction: outgoing;source="local";destination="external edge"
Message-Type: response
Start-Line: SIP/2.0 504 Server time-out
To: <sip:username@contoso.com>;tag=586991893A5520DA53F83BE6C4EBF4F9
ms-user-logon-data: RemoteUser
Via: SIP/2.0/TLS 192.168.2.98:62973;received=10.1.1.1;ms-received-port=62973;ms-received-cid=2E6400
Server: RTC/4.0
2) Communication between the Edge and Frontend servers:
=> Edge server side trace:
- Edge server try to connect to Frontend server at TCP port 5061 but it fails to do so after three attempts:
106631 11:18:57 AM 1/8/2013 713.5078058 RTCSrv.exe (1504) 192.168.1.1 192.168.3.10 TCP TCP:Flags=......S., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090326, Ack=0, Win=8192 ( ) = 8192
106984 11:19:00 AM 1/8/2013 716.5111523 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:[SynReTransmit #106631]Flags=......S., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090326, Ack=0, Win=8192 ( ) = 8192
107347 11:19:06 AM 1/8/2013 722.5093069 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:[SynReTransmit #106631]Flags=......S., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090326, Ack=0, Win=8192 ( ) = 8192
=> Now let’s see the same activity from TCPIP/AFD drivers perspective:
- RTCSrv process on Edge server creates a stream type socket and binds to it (TCPIP driver assigns a local port 50375) and then requests for a connect to 192.168.3.10:5061
106612 11:18:57 AM 1/8/2013 713.4871675 RTCSrv.exe (1504) Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:socket: 0 (0x0): Process 0x0000000007AE3060 (0x00000000000005E0), Endpoint 0x0000000007A05010, Family 2 (0x2), Type SOCK_STREAM, Protocol 6 (0x6), Seq 1006 (0x3EE), Status Success
106613 11:18:57 AM 1/8/2013 713.4872298 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: endpoint (Family=IPV4 PID=1504 (0x5E0)) created.
106614 11:18:57 AM 1/8/2013 713.4872311 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: endpoint 0x00000000082D19B0 (Family=IPV4, PID=1504 (0x5E0)) created with status = Success.
106615 11:18:57 AM 1/8/2013 713.4872401 RTCSrv.exe (1504) Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:socket: 1 (0x1): Process 0x0000000007AE3060 (0x00000000000005E0), Endpoint 0x0000000007A05010, Family 0 (0x0), Type Unknown value: 0, Protocol 0 (0x0), Seq 1013 (0x3F5), Status Success
106617 11:18:57 AM 1/8/2013 713.5075474 RTCSrv.exe (1504) 192.168.1.1 Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:bind: 0 (0x0): Process 0x0000000007AE3060, Endpoint 0x0000000007A05010, Address 192.168.1.1:0, Seq 7010 (0x1B62), Status Success
106618 11:18:57 AM 1/8/2013 713.5075751 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: endpoint/connection 0x00000000082D19B0 acquired port number 50357 (0xC4B5).
106619 11:18:57 AM 1/8/2013 713.5075798 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: endpoint (sockaddr=192.168.1.1:50357) bound.
106620 11:18:57 AM 1/8/2013 713.5075837 RTCSrv.exe (1504) 192.168.1.1 Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:bind: 1 (0x1): Process 0x0000000007AE3060, Endpoint 0x0000000007A05010, Address 192.168.1.1:50357, Seq 7022 (0x1B6E), Status Success
106621 11:18:57 AM 1/8/2013 713.5076201 RTCSrv.exe (1504) Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:Socket option: 4 (0x4): Process 0x0000000007AE3060, Endpoint 0x0000000007A05010, Option FIONBIO, Value 1 (0x1), Seq 11002 (0x2AFA), Status Success
106622 11:18:57 AM 1/8/2013 713.5076435 RTCSrv.exe (1504) 192.168.3.10 Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:connect: 0 (0x0): Process 0x0000000007AE3060, Endpoint 0x0000000007A05010, Address 192.168.3.10:5061, Seq 5023 (0x139F), Status Success
106623 11:18:57 AM 1/8/2013 713.5076564 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 transition from ClosedState to SynSentState, SndNxt = 0 (0x0).
106624 11:18:57 AM 1/8/2013 713.5076832 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 attempted to acquire weak reference on port number 50357 (0xC4B5) inherited from endpoint 0x00000000082D19B0. Successful = TRUE.
106625 11:18:57 AM 1/8/2013 713.5076863 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Tcb 0x000000000829C860 (local=192.168.1.1:50357 remote=192.168.3.10:5061) requested to connect.
106626 11:18:57 AM 1/8/2013 713.5077111 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Inspect Connect has been completed on Tcb 0x000000000829C860 with status = Success.
106627 11:18:57 AM 1/8/2013 713.5077296 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Tcb 0x000000000829C860 is going to output SYN with ISN = 2232090326 (0x850AFED6), RcvWnd = 8192 (0x2000), RcvWndScale = 0 (0x0).
106628 11:18:57 AM 1/8/2013 713.5077335 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 (local=192.168.1.1:50357 remote=192.168.3.10:5061) connect proceeding.
106629 11:18:57 AM 1/8/2013 713.5077421 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 SRTT measurement started (seq = 2232090326 (0x850AFED6), tick = 233642658 (0xDED1AA2)).
106630 11:18:57 AM 1/8/2013 713.5077433 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Connection 0x000000000829C860 RetransmitTimer timer started. Scheduled to expire in 3000 (0xBB8) ms.
=> As a result of the upper layer activity, the first SYN sent to the wire:
- After 3 seconds, the second SYN is sent because the retransmit timer has expired:
106979 11:19:00 AM 1/8/2013 716.5110434 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Connection 0x000000000829C860 RetransmitTimer timer has expired.
106980 11:19:00 AM 1/8/2013 716.5110467 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860: Send Retransmit round with SndUna = 2232090326 (0x850AFED6), Round = 1 (0x1), SRTT = 0 (0x0), RTO = 300 (0x12C).
106981 11:19:00 AM 1/8/2013 716.5110498 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 (local=192.168.1.1:50357 remote=192.168.3.10:5061) retransmitting connect attempt, RexmitCount = 1 (0x1).
106982 11:19:00 AM 1/8/2013 716.5110733 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860: SRTT measurement cancelled.
106983 11:19:00 AM 1/8/2013 716.5110861 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Connection 0x000000000829C860 RetransmitTimer timer started. Scheduled to expire in 6000 (0x1770) ms.
=> The second SYN sent to the wire:
- After 6 seconds, the third SYN is sent because the retransmit timer has expired:
107342 11:19:06 AM 1/8/2013 722.5091976 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Connection 0x000000000829C860 RetransmitTimer timer has expired.
107343 11:19:06 AM 1/8/2013 722.5092010 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860: Send Retransmit round with SndUna = 2232090326 (0x850AFED6), Round = 2 (0x2), SRTT = 0 (0x0), RTO = 300 (0x12C).
107344 11:19:06 AM 1/8/2013 722.5092043 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 (local=192.168.1.1:50357 remote=192.168.3.10:5061) retransmitting connect attempt, RexmitCount = 2 (0x2).
107345 11:19:06 AM 1/8/2013 722.5092289 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860: SRTT measurement cancelled.
107346 11:19:06 AM 1/8/2013 722.5092418 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Connection 0x000000000829C860 RetransmitTimer timer started. Scheduled to expire in 12000 (0x2EE0) ms.
=> The third SYN sent to the wire:
- After 12 seconds, TCPIP driver doesn’t send anymore TCP SYNs and terminates the connection attempt and the socket is closed.
107780 11:19:18 AM 1/8/2013 734.5061235 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Connection 0x000000000829C860 RetransmitTimer timer has expired.
107781 11:19:18 AM 1/8/2013 734.5061285 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 (local=192.168.1.1:50357 remote=192.168.3.10:5061) terminating: retransmission timeout expired.
107782 11:19:18 AM 1/8/2013 734.5061305 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 (local=192.168.1.1:50357 remote=192.168.3.10:5061) shutdown initiated (0xC00000B5 - STATUS_IO_TIMEOUT). PID = 1504 (0x5E0).
107783 11:19:18 AM 1/8/2013 734.5061347 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 transition from SynSentState to ClosedState, SndNxt = 2232090327 (0x850AFED7).
107784 11:19:18 AM 1/8/2013 734.5061433 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000829C860 (local=192.168.1.1:50357 remote=192.168.3.10:5061) connect attempt failed with status = 0xC00000B5 - STATUS_IO_TIMEOUT.
107785 11:19:18 AM 1/8/2013 734.5062093 System (4) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: endpoint/connection 0x000000000829C860 released port number 50357 (0xC4B5). WeakReference = TRUE.
107786 11:19:18 AM 1/8/2013 734.5062386 RTCSrv.exe (1504) Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:connect: 1 (0x1): Process 0x0000000007AE3060, Endpoint 0x0000000007A05010, Seq 5024 (0x13A0), Status 0xC00000B5 - STATUS_IO_TIMEOUT
107787 11:19:18 AM 1/8/2013 734.5065252 RTCSrv.exe (1504) Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:socket cleanup: 0 (0x0): Process 0x0000000007AE3060, Endpoint 0x0000000007A05010, Seq 2002 (0x7D2), Status Success
107788 11:19:18 AM 1/8/2013 734.5065297 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: endpoint (sockaddr=192.168.1.1:50357) closed.
107789 11:19:18 AM 1/8/2013 734.5065339 RTCSrv.exe (1504) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: endpoint/connection 0x00000000082D19B0 released port number 50357 (0xC4B5). WeakReference = FALSE.
107790 11:19:18 AM 1/8/2013 734.5065425 RTCSrv.exe (1504) Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:socket cleanup: 1 (0x1): Process 0x0000000007AE3060, Endpoint 0x0000000007A05010, Seq 2003 (0x7D3), Status Success
107791 11:19:18 AM 1/8/2013 734.5065490 RTCSrv.exe (1504) Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:closesocket: 0 (0x0): Process 0x0000000007AE3060, Endpoint 0x0000000007A05010, Seq 2000 (0x7D0), Status Success
107792 11:19:18 AM 1/8/2013 734.5065498 RTCSrv.exe (1504) Wscore_MicrosoftWindowsWinsockAFD Wscore_MicrosoftWindowsWinsockAFD:closesocket: 1 (0x1): Process 0x0000000007AE3060, Endpoint 0x0000000007A05010, Seq 2001 (0x7D1), Status Success
- So the connection times out after 21 seconds. This also explains why the external client to Edge server connection is terminated after 21 seconds. Since Edge server cannot access Frontend server, it returns a SIP 504 and then the client closes the session.
- Also the Edge server SIPSTack ETL trace confirms that behavior:
TL_ERROR(TF_CONNECTION) [0]05E0.0AAC::01/08/2013-09:19:18.611.02e77050 (SIPStack,SIPAdminLog::TraceConnectionRecord:SIPAdminLog.cpp(160))
$$begin_record
LogType: connection
Text: Failed to complete outbound connection
Peer-IP: 192.168.3.10:5061
Peer-FQDN: fepool01.contoso.com
Connection-ID: 0x2E6603
Transport: TLS
Result-Code: 0x8007274c WSAETIMEDOUT
Data: fqdn=" fepool01.contoso.com ";peer-type="InternalServer";winsock-code="10060"
$$end_record
- Also a telnet test confirms the same behavior:
Edge:
C:\Users\Administrator>telnet 192.168.3.10 5061
Connecting To 192.168.3.10...Could not open connection to the host, on port 5061: Connect failed
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
=> Frontend server side trace:
- Frontend server see the incoming connection requests like below:
606227 11:19:07 AM 1/8/2013 727.4974204 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:Flags=......S., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090326, Ack=2948578651, Win=8192 ( ) = 8192
606232 11:19:07 AM 1/8/2013 727.4975478 Idle (0) 192.168.3.10 192.168.1.1 TCP TCP: [Bad CheckSum]Flags=...A..S., SrcPort=5061, DstPort=50357, PayloadLen=0, Seq=154470907, Ack=2232090327, Win=8192 ( Scale factor not supported ) = 8192
606233 11:19:07 AM 1/8/2013 727.4980222 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:Flags=...A.R.., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090327, Ack=154470908, Win=8192
612948 11:19:10 AM 1/8/2013 730.5001959 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:Flags=......S., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090326, Ack=2948578651, Win=8192 ( ) = 8192
612953 11:19:10 AM 1/8/2013 730.5003515 Idle (0) 192.168.3.10 192.168.1.1 TCP TCP: [Bad CheckSum]Flags=...A..S., SrcPort=5061, DstPort=50357, PayloadLen=0, Seq=155953667, Ack=2232090327, Win=8192 ( Scale factor not supported ) = 8192
612967 11:19:10 AM 1/8/2013 730.5007007 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:Flags=...A.R.., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090327, Ack=155953668, Win=8192
623999 11:19:16 AM 1/8/2013 736.4983502 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:Flags=......S., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090326, Ack=2948578651, Win=8192 ( ) = 8192
624004 11:19:16 AM 1/8/2013 736.4985345 Idle (0) 192.168.3.10 192.168.1.1 TCP TCP: [Bad CheckSum]Flags=...A..S., SrcPort=5061, DstPort=50357, PayloadLen=0, Seq=157301184, Ack=2232090327, Win=8192 ( Scale factor not supported ) = 8192
624005 11:19:16 AM 1/8/2013 736.4987452 Idle (0) 192.168.1.1 192.168.3.10 TCP TCP:Flags=...A.R.., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090327, Ack=157301185, Win=8192
- It gets the TCP SYN from Edge, sends a TCP SYN ACK back to it but very shortly afterwards it gets a TCP RST from Frontend. In reality that TCP RST is not sent by the Edge server as we have seen above. Most likely that TCP RST comes from a network device running in between. The sender MAC address for that TCP RST is the following (but it doesn’t necessarily mean it’s that device sending the RST:)
Frame: Number = 606233, Captured Frame Length = 161, MediaType = NetEvent
+ NetEvent:
+ MicrosoftWindowsNDISPacketCapture: Packet Fragment (60 (0x3C) bytes)
- Ethernet: Etype = Internet IP (IPv4),DestinationAddress:[ 00-50-56-33-22-11],SourceAddress:[ 00-0F-8F-11-22-33]
+ DestinationAddress: VMWare, Inc. 8C7B52 [00-50-56-33-22-11]
+ SourceAddress: Cisco Systems 112233 [00-0F-8F-11-22-33]
EthernetType: Internet IP (IPv4), 2048(0x800)
UnknownData: Binary Large Object (6 Bytes)
+ Ipv4: Src = 192.168.1.1, Dest = 192.168.3.10, Next Protocol = TCP, Packet ID = 50541, Total IP Length = 40
+ Tcp: Flags=...A.R.., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090327, Ack=154470908, Win=8192
- Now let’s check the above activity at TCPIP/AFD driver level:
=> The first TCP SYN received from the network:
=> TCPIP driver sends a TCP SYN ACK:
612949 11:19:10 AM 1/8/2013 730.5002398 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCPIP: NBL 0x000000000CCBE7F0 fell off the receive fast path, Reason: Session state is not compatible. Protocol = TCP, Family = IPV4, Number of NBLs = 1 (0x1). SourceAddress = 192.168.1.1 Null. DestAddress = 192.168.3.10 Null.
612950 11:19:10 AM 1/8/2013 730.5002968 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x0000000011D4DCF0 transition from ListenState to SynRcvdState, SndNxt = 0 (0x0).
612951 11:19:10 AM 1/8/2013 730.5003286 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x0000000011D4DCF0 SRTT measurement started (seq = 155953667 (0x94BAA03), tick = 1684327497 (0x6464CC49)).
612952 11:19:10 AM 1/8/2013 730.5003309 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Connection 0x0000000011D4DCF0 RetransmitTimer timer started. Scheduled to expire in 3000 (0xBB8) ms.
612953 11:19:10 AM 1/8/2013 730.5003515 Idle (0) 192.168.3.10 192.168.1.1TCP TCP: [Bad CheckSum]Flags=...A..S., SrcPort=5061, DstPort=50357, PayloadLen=0, Seq=155953667, Ack=2232090327, Win=8192 ( Scale factor not supported ) = 8192
- It gets a TCP RST from Edge server side right after this: (this wasn’t sent by the Edge server by the way)
612967 11:19:10 AM 1/8/2013 730.5007007 Idle (0) 192.168.1.1192.168.3.10 TCP TCP:Flags=...A.R.., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090327, Ack=155953668, Win=8192
612968 11:19:10 AM 1/8/2013 730.5007382 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x0000000011D4DCF0 (local=192.168.3.10:5061 remote=192.168.1.1:50357) connection terminated: received RST.
612969 11:19:10 AM 1/8/2013 730.5007404 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x0000000011D4DCF0 (local=192.168.3.10:5061 remote=192.168.1.1:50357) shutdown initiated (0xC000020D - STATUS_CONNECTION_RESET). PID = 2640 (0xA50).
612970 11:19:10 AM 1/8/2013 730.5007432 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x0000000011D4DCF0 transition from SynRcvdState to ClosedState, SndNxt = 155953668 (0x94BAA04).
- The same action is repeated two more times:
623999 11:19:16 AM 1/8/2013 736.4983502 Idle (0) 192.168.1.1192.168.3.10 TCP TCP:Flags=......S., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090326, Ack=2948578651, Win=8192 ( ) = 8192
624000 11:19:16 AM 1/8/2013 736.4983890 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCPIP: NBL 0x000000000CCE1E20 fell off the receive fast path, Reason: Session state is not compatible. Protocol = TCP, Family = IPV4, Number of NBLs = 1 (0x1). SourceAddress = 192.168.1.1 Null. DestAddress = 192.168.3.10 Null.
624001 11:19:16 AM 1/8/2013 736.4984630 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000EDC7960 transition from ListenState to SynRcvdState, SndNxt = 0 (0x0).
624002 11:19:16 AM 1/8/2013 736.4985060 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000EDC7960 SRTT measurement started (seq = 157301184 (0x96039C0), tick = 1684328097 (0x6464CEA1)).
624003 11:19:16 AM 1/8/2013 736.4985125 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: Connection 0x000000000EDC7960 RetransmitTimer timer started. Scheduled to expire in 3000 (0xBB8) ms.
624004 11:19:16 AM 1/8/2013 736.4985345 Idle (0) 192.168.3.10 192.168.1.1TCP TCP: [Bad CheckSum]Flags=...A..S., SrcPort=5061, DstPort=50357, PayloadLen=0, Seq=157301184, Ack=2232090327, Win=8192 ( Scale factor not supported ) = 8192
624005 11:19:16 AM 1/8/2013 736.4987452 Idle (0) 192.168.1.1192.168.3.10 TCP TCP:Flags=...A.R.., SrcPort=50357, DstPort=5061, PayloadLen=0, Seq=2232090327, Ack=157301185, Win=8192
624006 11:19:16 AM 1/8/2013 736.4987656 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000EDC7960 (local=192.168.3.10:5061 remote=192.168.1.1:50357) connection terminated: received RST.
624007 11:19:16 AM 1/8/2013 736.4987684 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000EDC7960 (local=192.168.3.10:5061 remote=192.168.1.1:50357) shutdown initiated (0xC000020D - STATUS_CONNECTION_RESET). PID = 2640 (0xA50).
624008 11:19:16 AM 1/8/2013 736.4987706 Idle (0) TCPIP_MicrosoftWindowsTCPIP TCPIP_MicrosoftWindowsTCPIP:TCP: connection 0x000000000EDC7960 transition from SynRcvdState to ClosedState, SndNxt = 157301185 (0x96039C1).
SUMMARY:
The behavior we see above points to a problem somewhere between at or below the NIC layer on Frontend server and at or below the NIC layer on Edge server including any network devices sitting in the path in between the two servers like LAN switches, firewalls etc.
As such, I have made the following recommendations to our customer
a) Involving their networking team for further investigations on the network devices running in between the Edge and frontend servers
b) Making some improvements on the endpoints (edge/frontend servers) from vmware perspective since both servers are virtualized on vmware ESX server(s)
=> The NIC could be configured as a VMXNET3 driver in the vmware config:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1001805
Choosing a network adapter for your virtual machine
=> Vmware recommends turning STP protocol off on the switchport to which VMware ESX server is connected:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003804
STP may cause temporary loss of network connectivity when a failover or failback event occurs
=> We can check if the current physical NIC is on the compatibility list of Vmware:
http://partnerweb.vmware.com/comp_guide2/search.php?
(Please choose your vmware server version and network adapter brand/model from the above link)
=> Vmware releases patches on releated components. For example the following is one of the latest releases for ESXi 4.0. It is best to check if your Vmware server binaries are recent.
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2011768
=> Also it’s suggested to run virtual machine version 7 for the guest:
http://www.vm-help.com/esx40i/upgrade_virtual_hardware/virtual_hardware_upgrade.php
Hope this helps you in troubleshooting similar problems...
I would like to talk about another Lync 2010 problem where desktop sharing was failing between two federated contacts. Network traces and Lync ETL traces were collected while reproducing the problem. Here’s the troubleshooting steps that I followed:
Note: Please note that usernames, IP addresses, server names etc are all replaced for privacy purposes.
Client IP: 10.98.2.1 (this is the client that initiates desktop sharing)
FE server IP: 10.10.10.1
Edge server internal IP: 10.3.0.34
Edge server external IP: 10.3.0.40
Edge server external IP (NATed): 2.2.2.2
Remote Edge server IP: 1.1.1.1
1) I started with Lync client side ETL trace. It negotiates the application sharing media port with the help of Edge server:
=> Application sharing request sent by the internal client:
12/10/2012|10:43:06.703 24F8:2648 INFO :: Sending Packet - 10.10.10.1:5061 (From Local Address: 10.98.2.1:51647) 2520 bytes:
12/10/2012|10:43:06.703 24F8:2648 INFO :: INVITE sip:user2@fabrikam.com SIP/2.0
Via: SIP/2.0/TLS 10.98.2.1:51647
From: <sip:user1@contoso.com>;tag=2cd64ff7dc;epid=6750f01948
To: <sip:user2@fabrikam.com>
Call-ID: a622d7b9f6464be3a455e6d98ec700b3
CSeq: 1 INVITE
Contact: <sip:user1@contoso.com;opaque=user:epid:lxH-F4F2KFaOdejdehenrlAAA;gruu>
Supported: ms-dialog-route-set-update
Supported: timer
Supported: histinfo
Supported: ms-safe-transfer
Supported: ms-sender
Supported: ms-early-media
Supported: ms-conf-invite
Ms-Conversation-ID: Ac3Wuri73DV0XRhGRBSO/PHZU2QhhQ==
Allow: INVITE, BYE, ACK, CANCEL, INFO, UPDATE, REFER, NOTIFY, BENOTIFY, OPTIONS
ms-subnet: 10.98.22.0
ms-endpoint-location-data: NetworkScope;ms-media-location-type=Intranet
v=0
o=- 0 0 IN IP4 2.2.2.2
s=session
c=IN IP4 2.2.2.2
b=CT:99980
t=0 0
m=applicationsharing 58138 TCP/RTP/AVP 127
a=ice-ufrag:PFb1
a=candidate:1 1 TCP-PASS 2120613887 10.98.2.1 21916 typ host
a=candidate:1 2 TCP-PASS 2120613374 10.98.2.1 21916 typ host
a=candidate:2 1 TCP-ACT 2121006591 10.98.2.1 18574 typ host
a=candidate:2 2 TCP-ACT 2121006078 10.98.2.1 18574 typ host
a=candidate:3 1 TCP-PASS 6556159 2.2.2.2 58138 typ relay raddr 10.98.2.1 rport 13211
a=candidate:3 2 TCP-PASS 6556158 2.2.2.2 58138 typ relay raddr 10.98.2.1 rport 13211
=> Final response:
12/10/2012|10:43:10.418 24F8:2648 INFO :: Data Received - 10.10.10.1:5061 (To Local Address: 10.98.2.1:51647) 4311 bytes:
12/10/2012|10:43:10.418 24F8:2648 INFO :: SIP/2.0 200 OK
Authentication-Info: TLS-DSK qop="auth", opaque="2957C393", srand="ACCCF975", snum="68", rspauth="8dad7b71a068c9ec073868375fcd97955d8dd4bf", targetname="fepool01.contoso.com", realm="SIP Communications Service", version=4
Via: SIP/2.0/TLS 10.98.2.1:51647;ms-received-port=51647;ms-received-cid=657300
From: "user1 @ contoso.com"<sip:user1@contoso.com>;tag=2cd64ff7dc;epid=6750f01948
To: <sip:user2@fabrikam.com>;epid=c23c88d032;tag=fef0c3e438
Contact: <sip:user2@fabrikam.com;opaque=user:epid:wyPrDAU4lVq4GhmnrWqEFgAA;gruu>
Session-Expires: 720;refresher=uac
ms-client-diagnostics: 51007;reason="Callee media connectivity diagnosis info";CalleeMediaDebug="application-sharing:ICEWarn=0x0,LocalSite=10.166.24.59:8541,LocalMR=1.1.1.1:50704,RemoteSite=10.98.2.1:21916,RemoteMR=2.2.2.2:58138,PortRange=50040:50059,LocalMRTCPPort=50704,RemoteMRTCPPort=58138,LocalLocation=1,RemoteLocation=2,FederationType=1"
ms-endpoint-location-data: NetworkScope;ms-media-location-type=Internet
o=- 0 0 IN IP4 1.1.1.1
c=IN IP4 1.1.1.1
m=applicationsharing 50704 TCP/RTP/SAVP 127
a=ice-ufrag:G/Yw
a=ice-pwd:EwhnzgwKgaoRuI2T2gKRRqYz
a=candidate:1 1 TCP-PASS 2120613887 10.26.10.50 50058 typ host
a=candidate:1 2 TCP-PASS 2120613374 10.26.10.50 50058 typ host
So we expect the media session to flow between 1.1.1.1:50704 <---> 2.2.2.2:58138 (between the two Edge servers: Edge server @ contoso.com and Edge server @ fabrikam.com)
On the other hand, the internal client from where the user (user1@contoso.com) signed in will be accessing the Edge server @ contoso.com at TCP 443 (Mediarelay service listens on internal interface of the Edge server as well)
In summary, the communication should be as follows:
Flow1:
Internal client <--- > Edge server @ consoto.com internal interface:443
Flow2:
Edge server external interface (contoso.com) <---> Edge server external interface (fabrikam.com) (1.1.1.1:50704 <---> 2.2.2.2:58138)
a) When I check the trace collected on Edge server, I see that the client has a successful session with the internal interface of the Edge server @ TCP 443:
344 2012-12-10 10:43:49.130 0.000 10.98.2.1 10.3.0.34 TCP 66 13211 > 443 [SYN] Seq=2226089608 Win=8192 Len=0 MSS=1460 WS=256 SACK_PERM=1
345 2012-12-10 10:43:49.130 0.000 10.3.0.34 10.98.2.1 TCP 66 443 > 13211 [SYN, ACK] Seq=3794450141 Ack=2226089609 Win=8192 Len=0 MSS=1460 WS=256 SACK_PERM=1
346 2012-12-10 10:43:49.131 0.000 10.98.2.1 10.3.0.34 TCP 60 13211 > 443 [ACK] Seq=2226089609 Ack=3794450142 Win=65536 Len=0
347 2012-12-10 10:43:49.181 0.049 10.98.2.1 10.3.0.34 TLSv1 104 Client Hello
348 2012-12-10 10:43:49.181 0.000 10.3.0.34 10.98.2.1 TLSv1 137 Server Hello, Server Hello Done
349 2012-12-10 10:43:49.234 0.053 10.98.2.1 10.3.0.34 TLSv1 178 Ignored Unknown Record
350 2012-12-10 10:43:49.234 0.000 10.3.0.34 10.98.2.1 TLSv1 189 Ignored Unknown Record
354 2012-12-10 10:43:49.286 0.051 10.98.2.1 10.3.0.34 TLSv1 240 Ignored Unknown Record
355 2012-12-10 10:43:49.286 0.000 10.3.0.34 10.98.2.1 TLSv1 178 Ignored Unknown Record
358 2012-12-10 10:43:49.480 0.194 10.98.2.1 10.3.0.34 TCP 60 13211 > 443 [ACK] Seq=2226089969 Ack=3794450484 Win=65280 Len=0
420 2012-12-10 10:43:53.067 3.586 10.98.2.1 10.3.0.34 TLSv1 328 Ignored Unknown Record
b) But the Edge server @ contoso.com fails to connect to Edge server @ fabrikam.com at the negotiated port:
Note: Since this trace is collected on Edge server, we see the private IP address (10.3.0.40) assigned to external interface of the AV Edge service on Edge server.
1299 2012-12-10 10:43:59.456 0.000 10.3.0.40 1.1.1.1 TCP 66 58138 > 50704 [SYN] Seq=3074395695 Win=8192 Len=0 MSS=1460 WS=256 SACK_PERM=1
1300 2012-12-10 10:43:59.499 0.042 1.1.1.1 10.3.0.40 TCP 60 50704 > 58138 [RST, ACK] Seq=0 Ack=3074395696 Win=0 Len=0
1353 2012-12-10 10:44:00.002 0.503 10.3.0.40 1.1.1.1 TCP 66 58138 > 50704 [SYN] Seq=3074395695 Win=8192 Len=0 MSS=1460 WS=256 SACK_PERM=1
1356 2012-12-10 10:44:00.044 0.042 1.1.1.1 10.3.0.40 TCP 60 50704 > 58138 [RST, ACK] Seq=0 Ack=3074395696 Win=0 Len=0
1360 2012-12-10 10:44:00.548 0.503 10.3.0.40 1.1.1.1 TCP 62 58138 > 50704 [SYN] Seq=3074395695 Win=8192 Len=0 MSS=1460 SACK_PERM=1
1361 2012-12-10 10:44:00.590 0.042 1.1.1.1 10.3.0.40 TCP 60 50704 > 58138 [RST, ACK] Seq=0 Ack=3074395696 Win=0 Len=0
it gets a TCP RST to its connection attempts (like frame #1300, frame #1356, frame #1361). So the media session (application sharing) couldn’t be established.
We have to allow Edge server so that it could establish outbound TCP connections within this range for media communications:
http://technet.microsoft.com/en-us/library/gg425891(v=ocs.14).aspx Reference Architecture 1: Port Summary for Single Consolidated Edge
Summary:
After further investigation by our customer’s networking team, it was found out that the checkpoint firewall was not allowing outbound TCP connections in 50000-59999 port range and hence application sharing was failing.
Hope this helps you in resolving similar problems...
In this post, I want to talk about how I troubleshooted a direct access connectivity problem hoping that it might also help you in troubleshooting similar problems.
The issue was that the IPHTTPS client cannot connect to UAG DA server. We first checked static logs collected from a problem client (GPO output, firewall outputs, etc) and all seemed to be fine:
- Required services were running on the client (Windows Firewall/IKE and AuthIP/IP Helper)
- Firewall was chosing the correct firewall (Public profile):
-------------------------------------
netsh advfirewall show currentprofile
Public Profile Settings:
----------------------------------------------------------------------
State ON
Firewall Policy BlockInbound,AllowOutbound
LocalFirewallRules N/A (GPO-store only)
LocalConSecRules N/A (GPO-store only)
InboundUserNotification Enable
RemoteManagement Disable
UnicastResponseToMulticast Enable
Logging:
LogAllowedConnections Disable
LogDroppedConnections Disable
FileName C:\Windows\System32\LogFiles\Firewall\pfirewall.log
MaxFileSize 4096
Ok.
- All categories including connection security was owned by Windows Firewall:
Categories:
BootTimeRuleCategory Windows Firewall
FirewallRuleCategory Windows Firewall
StealthRuleCategory Windows Firewall
ConSecRuleRuleCategory Windows Firewall
- gpresult output collected confirmed that the client was getting the required policy which is UAG DirectAccess: Clients (UAGServer-FQDN)
Then I asked our customer to implement the below action plan to collect dynamic data while reproducing the problem
Action plan for log collection:
=========================
The outline of the action plan is below:
- Start WFP/DirectAccess/network traces on the DA client
- Start WFP/DirectAccess/network traces on the UAG DA server
- Stop and restart the relevant services (like IKE/ip helper) to trigger the tunnel establishment and see things from scratch
- Once we observe that the client cannot access any internal resources we’ll stop logs on the DA client and UAG DA server
1) Please install Network Monitor 3.4 on a problem DA client and on the UAG DA server. You can download the tool at the following link:
http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=983b941d-06cb-4658-b7f6-3088333d062f
2) Please disable and stop “IKE and AuthIP IPSec Keying Modules” service with the below commands on the DA client:
Note: All commands need to be run from an elevated command prompt.
a) We first disable the service:
sc config ikeext start= disabled
(Note: There must be a SPACE between “=” and “disabled”, otherwise the command doesn’t work)
b) We stop the service:
net stop ikeext
3) Now we start the network capture on the DA client and on the UAG server: (from an elevated command prompt)
=> On the DA client:
nmcap /network * /capture /file DAClient.chn:100M
=> On the UAG DA server:
nmcap /network * /capture /file DAServer.chn:100M
Note: You have to run the nmcap command from an elevated command prompt on Windows Vista/Windows 7/Windows 2008/Windows 2008 R2
Note: nmcap will create a new capture file once the first one if full (200 MB) and so on. So please make sure that you have enough free disk space on the related drive.
Note: If you get the error message "'nmcap' is not recognized as an internal or external command, operable program or batch file", you may need to run the nmcap command inside "%ProgramFiles%\Microsoft Network Monitor 3" folder
Note: Network trace could be stopped any time by pressing Ctrl + C
4) Let’s start tracing for various components at a different elevated command prompt on the DA client and on the UAG DA server:
netsh wfp capture start
netsh trace start scenario=directaccess provider=microsoft-windows-iphlpsvc level=5 capture=yes tracefile=darepro.etl maxsize=350
netsh trace start provider=Microsoft-Windows-DNS-Client provider=Microsoft-Windows-Iphlpsvc-Trace provider=Microsoft-Windows-WFP provider=Microsoft-Windows-NCSI provider=Microsoft-Windows-NlaSvc provider=Microsoft-Windows-NetworkProfile provider=Microsoft-Windows-WinINet provider=Microsoft-Windows-WinHttp provider=Microsoft-Windows-WebIO provider="Microsoft-Windows-Windows Firewall With Advanced Security" tracefile=c:\logs\UAGDAServer.etl maxsize=400
5) Now we’re going to reproduce the problem from the DA client side.We’re going to stop iphlpsvc on the DA client: (the service which implements the IPv6 transition technologies)
net stop iphlpsvc
6) Just before we start re-enabling the relevant services on the DA client, we run the following ping on the DA client: (with 22 bytes ping)
ping -l 22 -n 2 IP-address-of-default-gateway-configured-on-the-client
Note: Please replace IP-address-of-default-gateway-configured-on-the-client with the real default gateway address configured on the client
7) Now please run the following commands in the given order on the DA client:
net start iphlpsvc
sc config ikeext start= auto
net start ikeext
(Note: There must be a SPACE between “=” and “disabled” in "sc config" command, otherwise the command doesn’t work)
8) Let’s please wait for 30 seconds or so and then try the following commands on the DA client:
ping -l 33 an-internal-server1
ping -l 44 an-internal-server2
net view \\an-internal-server1
net view \\an-internal-server2
As an additional test, you can try to reach an internal web server/sharepoint portal etc if any exists from Internet explorer on the DA client (like accessing http://internalserver3/portal)
NOTE: Please replace the “an-internal-server” with real ones
NOTE: Please write down the full commands that you run
NOTE: Please take a screenshot of each access attempts
9) After you get errors for the above commands, please run the following commands (all from an elevated command prompt) on the DA client and on the UAG DA server to stop the logs running:
a) To stop WFP tracing:
netsh wfp capture stop
b) To stop Directaccess tracing:
c) To stop Network trace on the UAG DA server:
Please hit CTRL+C at the command prompt where nmcap tool runs
a) To mark the end of problem reproduce (with 55 bytes ping)
ping -l 55 -n 2 IP-address-of-default-gateway-configured-on-the-client
b) To stop WFP tracing:
c) To stop Directaccess tracing:
d) To stop Network trace on the DA client:
ANALYSIS 1:
==========
After analyzing the logs collected as per the above action plan, I got the following findings:
=> IPHTTPS interafce was not connected with the error number 0x800b0109
Interface IPHTTPSInterface (Group Policy) Parameters
------------------------------------------------------------
Role : client
URL : https://UAGDAServer-FQDN:443/IPHTTPS
Last Error Code : 0x800b0109
Interface Status : failed to connect to the IPHTTPS server. Waiting to reconnect
err 0x800b0109
# for hex 0x800b0109 / decimal -2146762487 :
CERT_E_UNTRUSTEDROOT winerror.h
# A certificate chain processed, but terminated in a root
# certificate which is not trusted by the trust provider.
# 1 matches found for "0x800b0109"
=> IPHTTPS client doesn’t accept the certificate provided because it doesn’t trust the issuer
=> When I checked the network activity, I saw the following communications:
2084 15:13:43.441664 463.087187 0.002359 192.168.99.5 192.168.99.100 DNS Standard query A UAGDAServer-FQDN
2085 15:13:43.442251 463.087774 0.000587 192.168.99.100 192.168.99.5 DNS Standard query response A 1.1.1.2
Now client tries to establish the TLS tunnel:
No. Time Time since Delta Source Destination Protocol Identification S-Port D-Port Sequence TCP Len TCP Ack TCP Win Info
2086 15:13:43.443150 463.088673 0.000000 192.168.99.5 1.1.1.2 TCP 0x58ba (22714) 57134 443 1683802426 0 8192 57134 > 443 [SYN] Seq=1683802426 Win=8192 [TCP CHECKSUM INCORRECT] Len=0 MSS=1460 WS=256 SACK_PERM=1
2087 15:13:43.479192 463.124715 0.036042 1.1.1.2 192.168.99.5 TCP 0x61f4 (25076) 443 57134 536292141 0 1683802427 8192 443 > 57134 [SYN, ACK] Seq=536292141 Ack=1683802427 Win=8192 Len=0 MSS=1460 WS=256 SACK_PERM=1
2088 15:13:43.479356 463.124879 0.000164 192.168.99.5 1.1.1.2 TCP 0x58bb (22715) 57134 443 1683802427 0 536292142 65536 57134 > 443 [ACK] Seq=1683802427 Ack=536292142 Win=65536 [TCP CHECKSUM INCORRECT] Len=0
2089 15:13:43.491000 463.136523 0.011644 192.168.99.5 1.1.1.2 TLSv1 0x58bc (22716) 57134 443 1683802427 119 536292142 65536 Client Hello
2092 15:13:43.530437 463.175960 0.039437 1.1.1.2 192.168.99.5 TLSv1 0x61f5 (25077) 443 57134 536292142 853 1683802546 65536 Server Hello, Certificate, Server Key Exchange, Server Hello Done
2093 15:13:43.569422 463.214945 0.038985 192.168.99.5 1.1.1.2 TLSv1 0x58be (22718) 57134 443 1683802546 134 536292995 64768 Client Key Exchange, Change Cipher Spec, Encrypted Handshake Message
Certificate (id-at-commonName=Default Web Site) => Returned certificate is not the one expected by the client
2094 15:13:43.608227 463.253750 0.038805 1.1.1.2 192.168.99.5 TLSv1 0x61f6 (25078) 443 57134 536292995 59 1683802680 65536 Change Cipher Spec, Encrypted Handshake Message
2095 15:13:43.608869 463.254392 0.000642 192.168.99.5 1.1.1.2 TCP 0x58bf (22719) 57134 443 1683802680 0 536293054 64768 57134 > 443 [FIN, ACK] Seq=1683802680 Ack=536293054 Win=64768 [TCP CHECKSUM INCORRECT] Len=0
2096 15:13:43.645152 463.290675 0.036283 1.1.1.2 192.168.99.5 TCP 0x61f7 (25079) 443 57134 536293054 0 1683802681 0 443 > 57134 [RST, ACK] Seq=536293054 Ack=1683802681 Win=0 Len=0
=> On the other hand when checking the certificate bindings on the UAG server side, I saw that UAG DA server IPHTTPS server was bound to 1.1.1.1:443 with the appropriate certificate
c:\> netsh http show sslcert
IP:port : 1.1.1.1:443
Certificate Hash : 8691087835614a5f09b5f11c403d595c461ff603
Application ID : {5d8e2743-ef20-4d38-8751-7e400f200e65}
Certificate Store Name : MY
Verify Client Certificate Revocation : Enabled
Verify Revocation Using Cached Client Certificate Only : Disabled
Usage Check : Enabled
Revocation Freshness Time : 0
URL Retrieval Timeout : 0
Ctl Identifier :
Ctl Store Name :
DS Mapper Usage : Enabled
Negotiate Client Certificate : Disabled
c:\> certutil -store -v my
================ Certificate 2 ================
Serial Number: 5485200900020000015c
Issuer: CN=CA-name, DC=company, DC=com
NotBefore: 18/04/2012 16:49
NotAfter: 18/04/2014 16:49
Subject: CN=UAGDAServer-FQDN
Non-root Certificate
Template: WebServer3, Web Server 3
Cert Hash(sha1): 86 91 08 78 35 61 4a 5f 09 b5 f1 1c 40 3d 59 5c 46 1f f6 03
Key Container = d53d6468acfdb333ca5698ae67f3549e_6dd37665-5fa6-46ec-89de-8857879f2500
Simple container name: le-WebServer3-afc38199-0bb5-4cb9-b043-50851171183d
Provider = Microsoft Base Cryptographic Provider v1.0
Encryption test passed
The certificate seems to be the correct one but since UAGDAServer-FQDN is not mapped to 1.1.1.1 (it’s mapped to 1.1.1.2), another certificate is returned to the client in the TLS sessions and IPHTTPS connection fails as well. That was also evident in the network trace:
I informed my customer about the problem and they changed the DNS settings. But afterwards my customer informed me that IPHTTPS connectivity problem was still in place but this time with a different error number. We re-run the previous action plan and collected logs one more time:
ANALYSIS 2:
This time the problem was a bit different. As per the log outputs from the client and UAG DA server, there was a packet drop issue between the client and server, client’s IPHTTPS traffic wasn’t arriving at UAG DA server so the IPHTTPS tunnel establishment was failing:
c:\> netsh interface httpstunnel show interface
Last Error Code : 0x274c
err 0x274c
# for hex 0x274c / decimal 10060 :
WSAETIMEDOUT winerror.h
# A connection attempt failed because the connected party did
# not properly respond after a period of time, or established
# connection failed because connected host has failed to
# respond.
WSAETIMEDOUT winsock2.h
# 2 matches found for "0x274c"
=> And we can really see that the DA client could not connect to TCP 443:
No. Time Time since Delta Source Destination Protocol Info
2657 14:43:33.170638 199.176755 197.498302 192.168.99.7 1.1.1.1 TCP 54007 > 443 [SYN] Seq=2446300399 Win=8192 Len=0 MSS=1460 WS=256 SACK_PERM=1
2693 14:43:33.563827 199.569944 0.393189 192.168.99.7 1.1.1.1 TCP 54008 > 443 [SYN] Seq=4024014335 Win=8192 Len=0 MSS=1460 SACK_PERM=1
2736 14:43:36.583165 202.589282 0.258635 192.168.99.7 1.1.1.1 TCP 54008 > 443 [SYN] Seq=4024014335 Win=8192 Len=0 MSS=1460 SACK_PERM=1
2839 14:43:42.577128 208.583245 0.259486 192.168.99.7 1.1.1.1 TCP 54008 > 443 [SYN] Seq=4024014335 Win=8192 Len=0 MSS=1460 SACK_PERM=1
3482 14:44:24.615646 250.621763 18.289976 192.168.99.7 1.1.1.1 TCP 54023 > 443 [SYN] Seq=3215921282 Win=8192 Len=0 MSS=1460 WS=256 SACK_PERM=1
3516 14:44:27.621705 253.627822 3.006059 192.168.99.7 1.1.1.1 TCP 54023 > 443 [SYN] Seq=3215921282 Win=8192 Len=0 MSS=1460 WS=256 SACK_PERM=1
3585 14:44:33.627033 259.633150 6.005328 192.168.99.7 1.1.1.1 TCP 54023 > 443 [SYN] Seq=3215921282 Win=8192 Len=0 MSS=1460 SACK_PERM=1
4432 14:45:36.642904 322.649021 4.472205 192.168.99.7 1.1.1.1 TCP 54035 > 443 [SYN] Seq=55761659 Win=8192 Len=0 MSS=1460 WS=256 SACK_PERM=1
4472 14:45:39.638130 325.644247 2.995226 192.168.99.7 1.1.1.1 TCP 54035 > 443 [SYN] Seq=55761659 Win=8192 Len=0 MSS=1460 WS=256 SACK_PERM=1
4539 14:45:45.653327 331.659444 6.015197 192.168.99.7 1.1.1.1 TCP 54035 > 443 [SYN] Seq=55761659 Win=8192 Len=0 MSS=1460 SACK_PERM=1
6202 14:48:00.662720 466.668837 41.395992 192.168.99.7 1.1.1.1 TCP 54064 > 443 [SYN] Seq=2998968172 Win=8192 Len=0 MSS=1460 WS=256 SACK_PERM=1
6239 14:48:03.672009 469.678126 3.009289 192.168.99.7 1.1.1.1 TCP 54064 > 443 [SYN] Seq=2998968172 Win=8192 Len=0 MSS=1460 WS=256 SACK_PERM=1
6309 14:48:09.673882 475.679999 6.001873 192.168.99.7 1.1.1.1 TCP 54064 > 443 [SYN] Seq=2998968172 Win=8192 Len=0 MSS=1460 SACK_PERM=1
=> Also none of the above requests seen on the client side trace were present on the UAG DA server side trace which meant that UAG DA server never saw them
My customer further checked the connectivity with their service provider and it was found out that due to an incorrect routing configuration traffic sent to 1.1.1.1:443 was really dropped/routed somewhere else and hence the IPHTTPS tunnel wasn’t established. After fixing the issue with the router configuration, the problem was resolved and IPHTTPS tunnel was successfully established.
I was collaborating with a colleague of mine on a problem where SCCM client push installation was failing. They suspected network connectivity problems and collected simultaneous network traces from SCCM server and from a problem client machine and involved me in for further analysis.
When I check the SCCM server and client side traces, I saw that SCCM server was successfully accessing the client through TCP port 135
=> SCCM server side trace:
- TCP three way handshake between SCCM server and client:
5851 14:42:47 05.09.2012 34.0337296 10.0.9.149 CLIENTNAME.company.com TCP TCP: [Bad CheckSum]Flags=......S., SrcPort=51763, DstPort=DCE endpoint resolution(135), PayloadLen=0, Seq=2250995253, Ack=0, Win=8192 ( Negotiating scale factor 0x8 ) = 8192 {TCP:861, IPv4:843}
5852 14:42:47 05.09.2012 34.0364843 CLIENTNAME.company.com 10.0.9.149 TCP TCP:Flags=...A..S., SrcPort=DCE endpoint resolution(135), DstPort=51763, PayloadLen=0, Seq=1315818582, Ack=2250995254, Win=65535 ( Negotiated scale factor 0x0 ) = 65535 {TCP:861, IPv4:843}
5853 14:42:47 05.09.2012 34.0365076 10.0.9.149 CLIENTNAME.company.com TCP TCP: [Bad CheckSum]Flags=...A...., SrcPort=51763, DstPort=DCE endpoint resolution(135), PayloadLen=0, Seq=2250995254, Ack=1315818583, Win=258 (scale factor 0x8) = 66048 {TCP:861, IPv4:843}
- SCCM server binds to SCMActivator and activates WMI component:
5877 14:42:47 05.09.2012 34.0610846 10.0.9.149 CLIENTNAME.company.com MSRPC MSRPC:c/o Bind: IRemoteSCMActivator(DCOM) UUID{000001A0-0000-0000-C000-000000000046} Call=0x3 Assoc Grp=0xBB15 Xmit=0x16D0 Recv=0x16D0 {MSRPC:865, TCP:861, IPv4:843}
5880 14:42:47 05.09.2012 34.0642128 CLIENTNAME.company.com 10.0.9.149 TCP TCP:Flags=...A...., SrcPort=DCE endpoint resolution(135), DstPort=51763, PayloadLen=0, Seq=1315818583, Ack=2250996747, Win=65535 (scale factor 0x0) = 65535 {TCP:861, IPv4:843}
5882 14:42:47 05.09.2012 34.0748352 CLIENTNAME.company.com 10.0.9.149 MSRPC MSRPC:c/o Bind Ack: Call=0x3 Assoc Grp=0xBB15 Xmit=0x16D0 Recv=0x16D0 {MSRPC:865, TCP:861, IPv4:843}
5883 14:42:47 05.09.2012 34.0750212 10.0.9.149 CLIENTNAME.company.com MSRPC MSRPC:c/o Alter Cont: IRemoteSCMActivator(DCOM) UUID{000001A0-0000-0000-C000-000000000046} Call=0x3 {MSRPC:865, TCP:861, IPv4:843}
5884 14:42:47 05.09.2012 34.0785470 CLIENTNAME.company.com 10.0.9.149 MSRPC MSRPC:c/o Alter Cont Resp: Call=0x3 Assoc Grp=0xBB15 Xmit=0x16D0 Recv=0x16D0 {MSRPC:865, TCP:861, IPv4:843}
5885 14:42:47 05.09.2012 34.0786863 10.0.9.149 CLIENTNAME.company.com DCOM DCOM:RemoteCreateInstance Request, DCOM Version=5.7 Causality Id={FEEE1975-B61E-42EB-B500-939EA5EE4B2A} {MSRPC:865, TCP:861, IPv4:843}
Frame: Number = 5885, Captured Frame Length = 923, MediaType = ETHERNET
+ Ethernet: Etype = Internet IP (IPv4),DestinationAddress:[00-22-90-E3-B7-80],SourceAddress:[00-22-64-08-91-A6]
+ Ipv4: Src = 10.0.9.149, Dest = 10.102.0.230, Next Protocol = TCP, Packet ID = 639, Total IP Length = 909
+ Tcp: [Bad CheckSum]Flags=...AP..., SrcPort=51763, DstPort=DCE endpoint resolution(135), PayloadLen=869, Seq=2250996924 - 2250997793, Ack=1315818870, Win=257 (scale factor 0x8) = 65792
+ Msrpc: c/o Request: IRemoteSCMActivator(DCOM) {000001A0-0000-0000-C000-000000000046} Call=0x3 Opnum=0x4 Context=0x1 Hint=0x318
- DCOM: RemoteCreateInstance Request, DCOM Version=5.7 Causality Id={FEEE1975-B61E-42EB-B500-939EA5EE4B2A}
+ HeaderReq: DCOM Version=5.7 Causality Id={FEEE1975-B61E-42EB-B500-939EA5EE4B2A}
+ AggregationInterface: NULL
- ActivationProperties: OBJREFCUSTOM - {000001A2-0000-0000-C000-000000000046}
+ MInterfacePointerPtr: Pointer To 0x00020000
- Interface: OBJREFCUSTOM - {000001A2-0000-0000-C000-000000000046}
+ Size: 744 Elements
InterfaceSize: 744 (0x2E8)
Signature: 1464812877 (0x574F454D)
Flags: OBJREFCUSTOM - Represents a custom marshaled object reference
MarshaledInterfaceIID: {000001A2-0000-0000-C000-000000000046}
- Custom:
ClassId: {00000338-0000-0000-C000-000000000046}
ExtensionSize: 0 (0x0)
ObjectReferenceSize: 704 (0x2C0)
- ActivationProperties:
TotalSize: 688 (0x2B0)
Reserved: 0 (0x0)
+ CustomHeader:
- Properties: 6 Property Structures
+ Special:
- Instantiation:
+ Header:
InstantiatedObjectClsId: {8BC3F05E-D86B-11D0-A075-00C04FB68820} => This is WMI
ClassContext: 20 (0x14)
ActivationFlags: 2 (0x2)
FlagsSurrogate: 0 (0x0)
- Server responds with success and provides the endpoint information for WMI service:
5886 14:42:47 05.09.2012 34.0848992 CLIENTNAME.company.com 10.0.9.149 DCOM DCOM:RemoteCreateInstance Response, ORPCFLOCAL - Local call to this computer {MSRPC:865, TCP:861, IPv4:843}
- ScmReply:
+ Ptr: Pointer To NULL
+ RemoteReplyPtr: Pointer To 0x00106E98
- RemoteReply:
ObjectExporterId: 13300677357152346811 (0xB8957F961925A2BB)
+ OxidBindingsPtr: Pointer To 0x00102FF0
IRemUnknownInterfacePointerId: {0000B400-0580-0000-9A5E-C2357038B9DF}
AuthenticationHint: 4 (0x4)
+ Version: DCOM Version=5.7
- OxidBindings:
+ Size: 378 Elements
- Bindings:
WNumEntries: 378 (0x17A)
WSecurityOffsets: 263 (0x107)
- StringBindings:
TowerId: 15 (0xF)
NetworkAddress: \\\\CLIENTNAME[\\PIPE\\atsvc]
NetworkAddress: \\\\CLIENTNAME[\\PIPE\\wkssvc]
NetworkAddress: \\\\CLIENTNAME[\\pipe\\keysvc]
NetworkAddress: \\\\CLIENTNAME[\\PIPE\\srvsvc]
NetworkAddress: \\\\CLIENTNAME[\\pipe\\trkwks]
NetworkAddress: \\\\CLIENTNAME[\\PIPE\\W32TIME]
NetworkAddress: \\\\CLIENTNAME[\\PIPE\\ROUTER]
TowerId: 7 (0x7)
NetworkAddress: CLIENTNAME[1431]
NetworkAddress: 10.102.0.230[1431]
Terminator1: 0 (0x0)
+ SecurityBindings:
Terminator2: 0 (0x0)
- Since WMI listens on TCP 1431, SCCM server tries to connect to that endpoint to access WMI subsystem:
8980 14:43:08 05.09.2012 55.1014127 10.0.9.149 CLIENTNAME.company.com TCP TCP: [Bad CheckSum]Flags=......S., SrcPort=51785, DstPort=1431, PayloadLen=0, Seq=1764982397, Ack=0, Win=8192 ( Negotiating scale factor 0x8 ) = 8192 {TCP:1203, IPv4:843}
9390 14:43:11 05.09.2012 58.1101896 10.0.9.149 CLIENTNAME.company.com TCP TCP:[SynReTransmit #8980] [Bad CheckSum]Flags=......S., SrcPort=51785, DstPort=1431, PayloadLen=0, Seq=1764982397, Ack=0, Win=8192 ( Negotiating scale factor 0x8 ) = 8192 {TCP:1203, IPv4:843}
11236 14:43:17 05.09.2012 64.1163158 10.0.9.149 CLIENTNAME.company.com TCP TCP:[SynReTransmit #8980] [Bad CheckSum]Flags=......S., SrcPort=51785, DstPort=1431, PayloadLen=0, Seq=1764982397, Ack=0, Win=8192 ( Negotiating scale factor 0x8 ) = 8192 {TCP:1203, IPv4:843}
- But this TCP session request fails because SCCM server doesn’t get a response to TCP SYN requests.
- When we check the client side network trace, we cannot see any of those TCP SYNs sent by the SCCM server.
This is most of the time a hardware router/firewall filtering problem. After our customer made the necessary configuration changes in the firewall, SCCM client push installation started working properly.
Since WMI is assigned a random TCP port from dynamic RPC port range at every startup, network/firewall administrators need to allow that range as well in addition to allowing TCP 135 activity towards the clients. One other alternative in this instance could be fixing the TCPIP port than WMI subsytem obtains at each startup. You can see the below article for more information on this:
http://support.microsoft.com/kb/897571 FIX: A DCOM static TCP endpoint is ignored when you configure the endpoint for WMI on a Windows Server 2003-based computer
In this post, I would like to talk about a TMG safe search problem which I dealt with recently. The problem was that the TMG safesearch feature wasn’t working properly even though the all the requirements were met. We decided to collect dynamic logs while reproducing the problem (TMG data packager + client side logs). Then I started analyzing the logs:
- The Safesearch related rule was already enabled:
SafeSearch Enabled
Action Allow
Applies Always true
Default false
Description The SafeSearch rule allows access to the "Search Engines" URL category, and filters out adult content from search results of supported search engines.
ID
Logging Enabled
Source Port Limits Disabled
Type Access Rule
From
Network Internal, Included, Array scope
Applies to all content
Protocols Specified Protocols
HTTP, Included, Array scope
HTTPS, Included, Array scope
UrlCategory Search Engines, Included, Enterprise Scope
Users All Users, Included, Array scope
- It already included "All users"
- TMG server version was the latest (SP2 + rollup2)
- Safesearch XML file was in place with the correct settings:
<Configuration>
<provider domainPattern=".google." safeSearchSuffix="&safe=strict">
<searchQuery pattern="/search?"/>
<searchQuery pattern="/images?"/>
</provider>
<provider domainPattern=".yahoo.com" safeSearchSuffix="&vm=r">
<searchQuery pattern="/search;"/>
<searchQuery pattern="/search/images?"/>
<searchQuery pattern="/search/images;"/>
<searchQuery pattern="/search/video?"/>
<searchQuery pattern="/search/video;"/>
<provider domainPattern="www.bing.com" safeSearchSuffix="&adlt=strict">
</Configuration>
- It's the same as the one provided in ISA blog:
http://blogs.technet.com/b/isablog/archive/2010/09/21/new-in-forefront-tmg-update-1-safesearch-enforcement.aspx
<provider domainPattern="www.bing.com" safeSearchSuffix="&adlt=strict" >
<searchQuery pattern="/search?" />
=> Even though all configuration was correct, TMG server was still not forwarding the correct search URL towards Bing (which would help Bing filter search results)
- We see the search requests sent by the client on the client side network trace. Some examples:
2857 12:46:22.524738 50.167033 0.000000 10.96.96.6 10.110.0.121 HTTP GET http://www.bing.com/search?q=xxx&qs=n&form=QBLH&pq=xxx&sc=0-0&sp=-1&sk= HTTP/1.1
3642 12:46:44.659253 72.301548 0.054843 10.96.96.6 10.110.0.121 HTTP GET http://www.bing.com/images/search?q=xxx&FORM=HDRSC2 HTTP/1.1
3832 12:46:48.339288 75.981583 3.680035 10.96.96.6 10.110.0.121 HTTP GET http://www.bing.com/search?q=xxx&FORM=HDRSC1 HTTP/1.1
=> Relevant session in TMG ETL Logs: (collected as a result of TMG data packager)
[0]197c.1cc8 10/03/2012-10:46:48.780 [1a2c7b91 1a2c7d03] [WP proxyext...] Info: WPPISAPUBLIC:Context property:WPPISAPUBLIC:HTTP URL = http://www.bing.com/search?q=xxx&FORM=HDRSC1
=> Safesearch filter decides that there's no need for safesearch analysis:
[0]197c.1cc8 10/03/2012-10:46:48.781 [1a2c7b91 1a2c7d03] [HTTPFLT...] Entering CHttpFilterSafeSearchEnforcer::IsSafeSearchAnalysisRequired
[0]197c.1cc8 10/03/2012-10:46:48.781 [1a2c7b91 1a2c7d03] [HTTPFLT...] Info:SafeSearch analysis is not required, exiting
=> Network trace collected on the external interface of TMG server:
- We can see that the request is being sent as is without any modifications towards the Bing server:
256368 06:46:49.2700830 875.6370830 10.110.7.54 212.252.126.59 HTTP HTTP:Request, GET /search, Query:q=xxx&FORM=HDRSC1 {HTTP:3723, TCP:3721, IPv4:3651}
256370 06:46:49.3080860 875.6750860 212.252.126.59 10.110.7.54 TCP TCP:Flags=...A...., SrcPort=HTTP(80), DstPort=58199, PayloadLen=0, Seq=3768470858, Ack=3665294676, Win=25515 (scale factor 0x0) = 25515 {TCP:3721, IPv4:3651}
256372 06:46:49.3600890 875.7270890 212.252.126.59 10.110.7.54 HTTP HTTP:Response, HTTP/1.1, Status: Ok, URL: /search {HTTP:3723, TCP:3721, IPv4:3651}
http://www.bing.com/search?q=sex&FORM=HDRSC1
=> Normally it should have been sent something like below so that adult content wouldn't have been returned by Bing:
http://www.bing.com/search?q=sex&FORM=HDRSC1&adlt=strict
At this point everything seemed to be correctly configured but for some reason the safesearch filter wasn’t kicking in. After some more research and with the help of an escalation engineer from TMG team, we found out that in order for safesearch filter to kick in, the request itself also should be matching the Safesearch rule which is automatically created when Safesearch is enabled.
In my customer scenario, problem was that the access request sent by the user was hitting an enterprise level access rule and hence the safesearch rule wasn’t hit and the filter wasn’t activated as a result. After my customer re-arranged the enterprise level rules so that search engine related requests don’t hit any enterprise level access rule but hits the array level “Safesearch” rule created automatically, the problem was resolved.
I would like to go through an integration problem between Lync phone edition devices and Exchange 2010 that I worked on a while ago. Since the integration wasn’t working properly, users couldn’t access call logs, recorded voice mails, calendar information etc from their desktop phones (HP4120).
To understand the problem in more details, we asked our customer to collect Exchange side configuration information, phone edition logs from a problem device, network trace from Lync FE server etc.
Note: You can find full details on how to collect and analyze CELog data in the following Microsoft whitepaper:
http://www.microsoft.com/en-us/download/details.aspx?id=15668 Understanding and Troubleshooting Microsoft Exchange Server Integration
===============
Note: IP addresses, server names, URLs etc are replaced for privacy purposes
1) Lync phone edition device succeeds to obtain autodiscovery information:
0:01:43.203.610 : Raw data 211 (char), UCD_LOG_INFO: 10/19/2012|14:50:27 Aries: 10/19/2012|14:50:27.214 4EC0006:5250012 INFO :: NAutoDiscover::DnsAutodiscoverTask::ParseSoapResponse: InternalEwsUrl is https://casarray.contoso.com/EWS/Exchange.asmx
0:01:43.204.593 : Raw data 212 (char), UCD_LOG_INFO: 10/19/2012|14:50:27 Aries: 10/19/2012|14:50:27.215 4EC0006:5250012 INFO :: NAutoDiscover::DnsAutodiscoverTask::ParseSoapResponse: ExternalEwsUrl is https://autodiscover.contoso.com/EWS/Exchange.asmx
2) But the Lync phone edition device fails to access the EWS site with the following error (the same error is seen in all device logs) so the integration error occurs:
0:01:43.840.224 : Raw data 206 (char), UCD_LOG_ERROR: 10/19/2012|14:50:27 Aries: 10/19/2012|14:50:27.850 4EC0006:5250012 ERROR :: WebServices::CSoapTransport::ExecuteSoapOperation: Insecure server. errorCode=12045, status=014C0220, hr=80f10043
0:01:43.840.617 : Raw data 196 (char), UCD_LOG_ERROR: 10/19/2012|14:50:27 Aries: 10/19/2012|14:50:27.851 4EC0006:5250012 ERROR :: WebServices::CSoapTransport::SendSoapRequest: ExecuteSoapOperation failed. status=014C0220, hr=80f10043
0:01:43.840.963 : Raw data 225 (char), UCD_LOG_INFO: 10/19/2012|14:50:27 Aries: 10/19/2012|14:50:27.851 4EC0006:5250012 INFO :: WebServices::WebRequestImpl::ExecuteCommon: after SendSoapRequest. hr=0x80f10043, url=https://casarray.contoso.com/EWS/Exchange.asmx
0:01:43.841.216 : Raw data 191 (char), UCD_LOG_INFO: 10/19/2012|14:50:27 Aries: 10/19/2012|14:50:27.851 4EC0006:5250012 INFO :: WebServices::CSoapTransport::GetHttpHeaderInfoFromHandle: hSoapHandle=014C0220, headerInfo=013E2708
0:01:43.841.548 : Raw data 197 (char), UCD_LOG_INFO: 10/19/2012|14:50:27 Aries: 10/19/2012|14:50:27.852 4EC0006:5250012 INFO :: WebServices::CSoapTransport::GetCredentialsInfoFromHandle: hSoapHandle=014C0220, credentialsInfo=013E2670
0:01:43.842.135 : Raw data 165 (char), UCD_LOG_INFO: 10/19/2012|14:50:27 Aries: 10/19/2012|14:50:27.852 4EC0006:5250012 INFO :: WebServices::CSoapTransportStatus::~CSoapTransportStatus: status=014C0220
0:01:43.842.492 : Raw data 182 (char), UCD_LOG_ERROR: 10/19/2012|14:50:27 Aries: 10/19/2012|14:50:27.853 4EC0006:5250012 ERROR :: WebServices::WebRequestImpl::ExecuteCommon: Could not execute SOAP request. hr=0x80f10043
err 014C0220
# as an HRESULT: Severity: SUCCESS (0), Facility: 0x14c, Code 0x220
# for hex 0x220 / decimal 544 :
SE_AUDITID_IPSEC_AUTH_FAIL_CERT_TRUST msaudite.h
# IKE security association establishment failed because peer
# could not authenticate.
# The certificate trust could not be established.%n
# Peer Identity: %n%1%n
err 0x80f10043
# (user-to-user)
MIDIERR_NOTREADY mmsystem.h
NMERR_INVALID_PACKET_LENGTH netmon.h
ERROR_BAD_NET_NAME winerror.h
# The network name cannot be found.
LDAP_NOT_ALLOWED_ON_RDN winldap.h
So the problem looks like a certificate issue.
=> When I check the certificate assigned to CAS array, I see the following:
AccessRules : {System.Security.AccessControl.CryptoKeyAccessRule, System.Security.AccessControl.CryptoKeyAccessRule}
CertificateDomains : {casarray.contoso.com, cas01.contoso.com}
HasPrivateKey : True
IsSelfSigned : False
Issuer : CN=Issuing-CA for contoso, DC=contoso, DC=com
NotAfter : 8/8/2014 3:36:18 PM
NotBefore : 8/8/2012 3:36:18 PM
PublicKeySize : 1024
RootCAType : Enterprise
SerialNumber : 362763723200000000524
Services : IMAP, POP, IIS
Status : Valid
Subject : CN=casarray.contoso.com
Thumbprint : EF32873628362BDA8326108875301F38504AB
Issuer is Issuing-CA for contoso, DC=contoso, DC=com
=> On the other hand, the CA that issues the Lync frontend certificate is the following: (this is taken from Lync server side network trace)
+ Tcp: Flags=...A...., SrcPort=5061, DstPort=50855, PayloadLen=1460, Seq=2076240595 - 2076242055, Ack=1109389286, Win=256 (scale factor 0x8) = 65536
TLSSSLData: Transport Layer Security (TLS) Payload Data
- TLS: TLS Rec Layer-1 HandShake: Server Hello. Certificate.
- TlsRecordLayer: TLS Rec Layer-1 HandShake:
ContentType: HandShake:
+ Version: TLS 1.0
Length: 4380 (0x111C)
- SSLHandshake: SSL HandShake Certificate(0x0B)
HandShakeType: ServerHello(0x02)
Length: 77 (0x4D)
+ ServerHello: 0x1
HandShakeType: Certificate(0x0B)
Length: 3423 (0xD5F)
- Cert: 0x1
CertLength: 3420 (0xD5C)
- Certificates:
CertificateLength: 1574 (0x626)
+ Signature: Sha1WithRSAEncryption (1.2.840.113549.1.1.5)
+ Issuer: Issuing CA for contoso,com
+ Validity: From: 09/10/12 10:05:05 UTC To: 09/10/14 10:05:05 UTC
+ Subject: casarray.contoso.com
+ SubjectPublicKeyInfo: RsaEncryption (1.2.840.113549.1.1.1)
+ Tag3:
+ Extensions:
+ SignatureAlgorithm: Sha1WithRSAEncryption (1.2.840.113549.1.1.5)
+ Signature:
Certificates:
=> So the issuers of certificates used by Lync server itself and casarray.contoso.com servers are different CAs:
a) CA that issues Lync FE certificate:
Issuing CA for contoso,com
b) CA that issues CAS array certificate:
Issuing-CA for contoso, DC=contoso, DC=com
Apparently two different CAs with similar names issued Lync FE and Exchange CAS array certificates.
Under normal circumstances, if those two CAs are enterprise CAs, they automatically publish their own CA certificates to AD so the clients can download and use them while verifying the certificate chain. But after some internal research I found out that phone edition devices only trust the same CA that assigned the certificate to Lync FE pool to which phone edition device signs in. (except public certificates listed in http://technet.microsoft.com/en-us/library/gg398270(OCS.14).aspx)
RESULTS:
After issuing a new certificate to CAS array from the same enterprise CA (Issuing CA for contoso,com) that issues Lync FE certificate as well, the integration problem was resolved.
Hope this helps you when dealing with similar problems...
In this blog post, I’ll be talking about a response group problem where the response group members phones keep ringing even if the user calling from pstn side hangs up the phone. I would like to talk about how I troubleshooted this issue:
We first collected ETL traces and network traces from the mediation server to better understand the problem (actually mediation server and frontend server were running on the same machine). Some more details about the response group:
Response group number: 8900
Response group members: user1@contoso.com / user2@contoso.com / user3@contoso.com / user4@contoso.com / user5@contoso.com
In the S4 ETL trace, we can see that once the call is made to response group number, response group application starts ringing each response group member with 15 seconds intervals while it keeps playing music on hold to the caller:
Since none of the response group members answer, response group application goes through each member in a loop fashion. (actually the exact behavior here is identified by response group charateristics like workflow, routing method etc)
Interestingly, even the user who initiated the call to response group hangs up the phone, the rtp traffic from voip gateway keeps coming towards the mediation server:
98312 2012-11-23 08:59:05.383 0.015 172.16.1.7 172.17.1.100RTP 214 PT=ITU-T G.711 PCMA, SSRC=0x3FF62C47, Seq=19584, Time=964130050
98313 2012-11-23 08:59:05.388 0.004 172.17.1.100172.16.1.7 RTP 214 PT=ITU-T G.711 PCMA, SSRC=0x6ED4D5BE, Seq=33285, Time=1945495369
98320 2012-11-23 08:59:05.403 0.015 172.16.1.7 172.17.1.100RTP 214 PT=ITU-T G.711 PCMA, SSRC=0x3FF62C47, Seq=19585, Time=964130210
98322 2012-11-23 08:59:05.408 0.004 172.17.1.100172.16.1.7 RTP 214 PT=ITU-T G.711 PCMA, SSRC=0x6ED4D5BE, Seq=33286, Time=1945495529
98323 2012-11-23 08:59:05.423 0.015 172.16.1.7 172.17.1.100RTP 214 PT=ITU-T G.711 PCMA, SSRC=0x3FF62C47, Seq=19586, Time=964130370
98324 2012-11-23 08:59:05.428 0.004 172.17.1.100172.16.1.7 RTP 214 PT=ITU-T G.711 PCMA, SSRC=0x6ED4D5BE, Seq=33287, Time=1945495689
98326 2012-11-23 08:59:05.443 0.015 172.16.1.7 172.17.1.100RTP 214 PT=ITU-T G.711 PCMA, SSRC=0x3FF62C47, Seq=19587, Time=964130530
So we should have got a SIP BYE from VoIP gateway side when the remote caller hung up the phone...
This looks like a VoIP gateway side problem. We then made some internal testing to verify the expected behavior. Here are the results from the internal testing:
Expected behavior:
Note: The test response group has one member only so the response group application keeps ringing the same user
- The remote party calls the Response group number
- Response group answers the call and then it starts ringing the response group member (joe@contoso.com)
- Since the agent doesn’t pick up the phone for 15 seconds, it drops that session with a SIP CANCEL to the group member and it starts another session with the same agent (because there’s only one agent)
- The person calling from behind the VoIP gateway hangs up the phone after 23 seconds or so after the initial call
- So the second dialog with the agent is terminated
- The second call to the agent is made at 13:36:17
- The remote caller hangs up the phone at 13:36:22 (23 seconds after the response group answers the call)
- And the mediation server gets a SIP BYE from VoIP gateway since the remote caller hangs up the phone:
TL_INFO(TF_PROTOCOL) [0]0C4C.1410::11/23/2012-13:36:22.847.0002b91f (S4,SipMessage.DataLoggingHelper:sipmessage.cs(686))[3311788563]
<<<<<<<<<<<<Incoming SipMessage c=[<SipTlsConnection_386DD23>], 192.168.1.20:5071<-192.168.1.20:50807
BYE sip:napool.contoso.com:5071;grid;ms-fe=nafe02.contoso.com SIP/2.0
FROM: <sip:18006934501;phone-context=DefaultProfile@contoso.com;user=phone>;epid=8D443A6D50;tag=928547717d
TO: <sip:17042563016;phone-context=DefaultProfile@contoso.com;user=phone>;epid=14B9F3DDE9;tag=872d3a4ad0
CSEQ: 3828 BYE
CALL-ID: d9789202-c026-4661-88d8-adfc12c37995
MAX-FORWARDS: 69
VIA: SIP/2.0/TLS 192.168.1.20:50807;branch=z9hG4bK6FF51B0B.C998AF162BD50886;branched=FALSE
VIA: SIP/2.0/TLS 192.168.1.15:54616;branch=z9hG4bK97e1a6aa;ms-received-port=54616;ms-received-cid=726C00
CONTACT: <sip:nafe01.contoso.com@contoso.com;gruu;opaque=srvr:MediationServer:gJSXgywGXl2uoBnZnm4R7QAA;grid=1792a8d837324735acd180c81887b8b0>;isGateway
CONTENT-LENGTH: 0
SUPPORTED: ms-dialog-route-set-update
SUPPORTED: gruu-10
USER-AGENT: RTCC/4.0.0.0 MediationServer
P-ASSERTED-IDENTITY: "PBX 01"<sip:18006934501;phone-context=DefaultProfile@contoso.com;user=phone>
ms-diagnostics: 10027;source="nafe01.contoso.com";reason="Normal termination response from gateway after the call was established";component="MediationServer"
ms-diagnostics-public: 10027;reason="Normal termination response from gateway after the call was established";component="MediationServer"
ms-endpoint-location-data: NetworkScope;ms-media-location-type=intranet
------------EndOfIncoming SipMessage
Right after the SIP BYE is received from VoIP gateway, response group stops ringing the agent with a SIP CANCEL request:
We can see the same behavior in network trace as well:
=> The call starts and RTP traffic starts flowing:
=> At 15:36:23, VoIP gateway (Asteriks PBX) sends a SIP BYE to mediation server since the remote caller hangs up the phone, and the RTP traffic also stops at this point:
We advised our customer to further check their VoIP gateway to better understand why it doesn’t send a SIP BYE to mediation server when the remote caller hangs up the phone (and because the SIP BYE is not received, response group member phones keep ringing)
I was involved in a slow Sharepoint portal access through an ISA 2006 array problem and I wanted to talk about how I dealt with that issue and found the root cause of the slowness issue. The problem was that the portal home page was loaded very slowly and sometimes it wasn’ loaded at all from external clients. The performance was fine when it was accessed from internal clients
Our customer had a network setup similar to below:
We collected network traces from ISA servers while reproducing the problem. Network trace analysis revealed that ISA servers were responding to the requests immediately but there were delays in the incoming HTTP requests from the external clients and in some occasions external client didn’t complete the TCP three way handshake:
=> Examples of such huge gaps between requests from the external client seen in the network trace:
Note: 1.1.1.1 is the external client (real IP address was replaced)
Note: 192.168.1.36 is the web listener IP address to which sharepoint publishing rule is bound
17324 14:11:33.045375 11.859375 0.015625 1.1.1.1 192.168.1.36 HTTP GET /UserControls/TopMenu/TopMenu_Sep.gif HTTP/1.1 17326 14:11:33.045375 11.859375 0.000000 192.168.1.36 1.1.1.1 HTTP HTTP/1.1 304 Not Modified 17522 14:11:33.232875 12.046875 0.187500 1.1.1.1 192.168.1.36 HTTP GET /_layouts/1033/ie66up.js?rev=Ni7%2Fj2ZvA%33D HTTP/1.1 17524 14:11:33.232875 12.046875 0.000000 192.168.1.36 1.1.1.1 HTTP HTTP/1.1 304 Not Modified 17529 14:11:33.248500 12.062500 0.015625 1.1.1.1 192.168.1.36 TCP 1273 > 80 [ACK] Seq=2525698284 Ack=1206119216 Win=65664 Len=0 17533 14:11:33.264125 12.078125 0.015625 1.1.1.1 192.168.1.36 TCP 1269 > 80 [ACK] Seq=640090889 Ack=3984935419 Win=65340 Len=0 17593 14:11:33.451625 12.265625 0.187500 1.1.1.1 192.168.1.36 TCP 1268 > 80 [ACK] Seq=1670250900 Ack=2758693123 Win=65576 Len=0 There’s a huge time gap (around 83 seconds) in between the last response from the ISA server and the next request from the external client: 153758 14:12:56.998500 95.812500 83.546875 1.1.1.1 192.168.1.36 TCP 1268 > 80 [FIN, ACK] Seq=1670250900 Ack=2758693123 Win=65576 Len=0 153759 14:12:56.998500 95.812500 0.000000 192.168.1.36 1.1.1.1 TCP 80 > 1268 [ACK] Seq=2758693123 Ack=1670250901 Win=64684 [TCP CHECKSUM INCORRECT] Len=0 153760 14:12:56.998500 95.812500 0.000000 1.1.1.1 192.168.1.36 TCP 1269 > 80 [FIN, ACK] Seq=640090889 Ack=3984935419 Win=65340 Len=0 153761 14:12:56.998500 95.812500 0.000000 192.168.1.36 1.1.1.1 TCP 80 > 1269 [ACK] Seq=3984935419 Ack=640090890 Win=65535 [TCP CHECKSUM INCORRECT] Len=0 153762 14:12:56.998500 95.812500 0.000000 192.168.1.36 1.1.1.1 TCP 80 > 1268 [FIN, ACK] Seq=2758693123 Ack=1670250901 Win=64684 [TCP CHECKSUM INCORRECT] Len=0
158541 14:13:06.639125 105.453125 0.015625 192.168.1.36 1.1.1.1 HTTP HTTP/1.1 304 NOT MODIFIED 158576 14:13:06.842250 105.656250 0.203125 1.1.1.1 192.168.1.36 TCP 1290 > 80 [ACK] Seq=2442652609 Ack=3279316510 Win=65376 Len=0 Another time gap which is around 11 seconds in between the last response from the ISA server and the next request from the external client: 178031 14:13:18.014125 116.828125 11.171875 1.1.1.1 192.168.1.36 TCP 1292 > 80 [SYN] Seq=2506205172 Win=8192 Len=0 MSS=1380 WS=4 SACK_PERM=1 178032 14:13:18.014125 116.828125 0.000000 192.168.1.36 1.1.1.1 TCP 80 > 1292 [SYN, ACK] Seq=1749486699 Ack=2506205173 Win=16384 Len=0 MSS=1460 WS=1 SACK_PERM=1 178039 14:13:18.029750 116.843750 0.015625 1.1.1.1 192.168.1.36 TCP 1292 > 80 [ACK] Seq=2506205173 Ack=1749486700 Win=66240 Len=0 178042 14:13:18.029750 116.843750 0.000000 1.1.1.1 192.168.1.36 HTTP GET /Style%20Library/Images/leftCol_02.gif HTTP/1.1 178043 14:13:18.029750 116.843750 0.000000 192.168.1.36 1.1.1.1 HTTP HTTP/1.1 304 NOT MODIFIED 178202 14:13:18.248500 117.062500 0.218750 1.1.1.1 192.168.1.36 TCP 1292 > 80 [ACK] Seq=2506206023 Ack=1749486987 Win=65952 Len=0 255223 14:13:58.092250 156.906250 39.843750 1.1.1.1 192.168.1.36 TCP 1289 > 80 [RST, ACK] Seq=3712234005 Ack=2023647270 Win=0 Len=0 => Another indication of a connectivity problem between the external client and the ISA 2006 server: No. Time Time since Delta Source Destination Protocol Info 153775 14:12:57.014125 95.828125 0.000000 1.1.1.1 192.168.1.36 TCP 1288 > 80 [SYN] Seq=2013470292 Win=64860 Len=0 MSS=1380 153776 14:12:57.014125 95.828125 0.000000 192.168.1.36 1.1.1.1 TCP 80 > 1288 [SYN, ACK] Seq=535425124 Ack=2013470293 Win=16384 Len=0 MSS=1460 154809 14:12:59.342250 98.156250 2.328125 192.168.1.36 1.1.1.1 TCP 80 > 1288 [SYN, ACK] Seq=535425124 Ack=2013470293 Win=16384 Len=0 MSS=1460 158232 14:13:05.904750 104.718750 6.562500 192.168.1.36 1.1.1.1 TCP 80 > 1288 [SYN, ACK] Seq=535425124 Ack=2013470293 Win=16384 Len=0 MSS=1460 ISA receives TCP SYN from the external client and sends back a TCP SYN ACK immediately. Normally the client should have continued with a TCP ACK to finish the TCP 3-way handshake and then send the HTTP request on top of that TCP session but that doesn’t happen and ISA 2006 re-sends the TCP SYN ACK and resends one more time. For example this activity alone causes some 10 seconds to be lost.
Just to eliminate any network device issues we decided to access the same sharepoint portal just from the front of ISA array’s external interface (192.168.1.32/27 subnet) and the client was assigned an IP address from that subnet. After some testing from that client it was possible to quickly access the portal (and the test machine has no difference from an external client out in the internet from ISA perspective), it confirmed that the problem is not with ISA array but it’s a problem somewhere between ISA array’s external interface and our customer’s internet connection including the Cisco ASA device/internet routers.
Our customer’s network team was involved in and after the issue with the router/firewall was fixed, external clients also started accessing the portal very quickly.