This post is about an issue I worked on several days ago.
My customer had a TMG array with two nodes running with NLB. The problem they faced was that from time to time some TMG node couldn't process traffic anymore: requests to the virtual IP (VIP) failed and only rebooting the TMG machine eliminated the issue.
IE was configured to use the NLB virtual IP (VIP) as proxy address. When the issue happened users couldn’t browse internet pages - the browser didn't show any error page, it just stayed with a blank page.
First of all, I looked at the TMG Web proxy log and found an error logged there with the result code of 11001 (WSAHOST_NOT_FOUND) for the request for “contoso.com”, which we tried to browse in order to reproduce the issue.
Then I took a look at a captured network monitor trace in order to find an appropriate network conversation between the client and TMG:
Time Delta time Source Destination Protocol Length Info
11:07:26.566261000 1.376436000 CLIENT_IP TMG_VIP HTTP 433 GET http://contoso.com/ HTTP/1.1
11:07:26.570933000 0.004672000 TMG_VIP CLIENT_IP HTTP 1440 HTTP/1.1 502 Proxy Error ( The host was not found. ) (text/html)
In a web proxy scenario, the web proxy is supposed to resolve the name, however, by some reason the node didn't manage to resolve the remote host.
As the next step, I filtered out the trace for DNS traffic and didn't find DNS packets on the node at all – this looked very suspicious.
Therefore I looked at Netstat and its output showed around ~62000 lines similar to the followings, whereas 2580 was the pid of wspsrv.exe:
UDP 0.0.0.0:4002 *:* 2580
UDP 0.0.0.0:4003 *:* 2580
UDP 0.0.0.0:4004 *:* 2580
UDP 0.0.0.0:4005 *:* 2580
UDP 0.0.0.0:4006 *:* 2580
UDP 0.0.0.0:4007 *:* 2580
UDP 0.0.0.0:4020 *:* 2580
UDP 0.0.0.0:4021 *:* 2580
UDP 0.0.0.0:4022 *:* 2580
UDP 0.0.0.0:4023 *:* 2580
UDP 0.0.0.0:4024 *:* 2580
UDP 0.0.0.0:4025 *:* 2580
UDP 0.0.0.0:4026 *:* 2580
So TMG seemed to use up all UDP ports – hence there was no free UDP port left to use as source port for dns queries. This explained why name resolution failed and why TMG returned 502.
Why was such a great amount of ports used by TMG?
My guess was that it might be doing it on behalf of a client. Therefore, I looked at the Firewall log which showed that there were huge amount of UDP requests from a single client to different external hosts.
Luckily TMG firewall client was installed on the client machine which gave us an application name:
7:27:51 Establish CLINET2_IP 2058 EXTERNAL_HOST_1 54150 0x0 Access Internet UDP domain\user smax4pnp.exe:3:5.1 TMG1 Unidentified IP Traffic
7:27:52 Establish CLINET2_IP 2033 EXTERNAL_HOST_2 9362 0x0 Access Internet UDP domain\user smax4pnp.exe:3:5.1 TMG1 Unidentified IP Traffic
7:27:53 Terminate CLINET2_IP 2026 EXTERNAL_HOST_3 59866 0x80074e20 Access Internet UDP domain\user smax4pnp.exe:3:5.1 TMG1 Unidentified IP Traffic
So TMG used up all the available ports due to the excessive amount of request from this client/application.
In order to resolve the issue the customer disabled the client that has been causing this huge UDP traffic.
Vasily Kobylin, Senior Support Engineer, EMEA Forefront Edge Team
Balint Toth, Senior Support Escalation Engineer, EMEA Forefront Edge Team
As a side note, flood mitigation could have helped here as well, but this was not enabled on the TMG machine in question. Thanks Arpad for pointing out.