Michael here again to give some counsel on data collection when troubleshooting an issue. That title may have been a little misleading since depending on a time crunch, there may be too much, but usually, that rule holds true. Let me explain.
One of my favorite tools, due to my years in networking support, is the network capture. It can provide a plethora of information about what exactly was transpiring when servers were trying (and possibly failing) to communicate. I don’t really concern myself with the tool used, be it Network Monitor, Wireshark, Sniffer or any other tool. My biggest point to stress is what I mentioned previously – it shows the communication on the network. The important point to get from that is that collecting a trace from a single point doesn’t provide the full picture. While I will take a single-sided trace over no trace at all, the best scenario is to get it from all points involved in the transaction. With something like SharePoint, this could be a number of machines – the client running the browser, the web front end, the SQL back end and then multiple domain controllers. It sounds like a daunting task to get the captures from every location, but I would rather have too much data rather than too little. To add to that point, please don’t apply a capture filter unless absolutely necessary! By only capturing data between two particular points, you could be omitting the critical information.
Following is a perfect example of both of these points. I was engaged to troubleshoot an issue that was described as a problem with a SharePoint web front end talking to the SQL server. I got the captures from the two servers, which fortunately were not filtered. If I just went on the problem description, I would typically have opened the capture from the SQL box, applied the ipv4.address==WFE IP (ipv4 because I was using Network Monitor – it would be ip.addr== for you Wireshark aficionados) to locate the traffic from that box, and seen that all traffic to and from the WFE appeared completely normal.
9873 9:37:54 AM WFE 49346 (0xC0C2) SQL 1433 (0x599) TCP TCP:Flags=...A...., PayloadLen=0, Seq=3198716784, Ack=438404416, Win=510
10093 9:37:55 AM WFE 49346 (0xC0C2) SQL 1433 (0x599) TDS TDS:RPCRequest, SPID = 0, PacketID = 1, Flags=...AP..., PayloadLen=201, Seq=3198716784 - 3198716985, Ack=438404416, Win=510
10094 9:37:55 AM SQL 1433 (0x599) WFE 49346 (0xC0C2) TDS TDS:Response, SPID = 117, PacketID = 1, Flags=...AP..., SrcPort=1433, DstPort=49346, PayloadLen=61, Seq=438404416 - 438404477, Ack=3198716985, Win=255
10188 9:37:55 AM WFE 49346 (0xC0C2) SQL 1433 (0x599) TCP TCP:Flags=...A...., PayloadLen=0, Seq=3198716985, Ack=438404477, Win=509
To me, it looked like clean SQL traffic, moving quickly and without errors. All good so I needed to look elsewhere. To move on, it’s important to know what other types of things will happen when using SharePoint. Other than the SQL traffic, the WFE will also have to communicate with the client, perform name resolution and communicate with a domain controller. I first applied the filter “dns or nbtns” (Again, this was Network Monitor, although I typically use both in my analysis) and again, everything looked “clean.” I then moved on to examine the authentication traffic. I applied the filter “Kerberosv5” and lo and behold, I find the issue. Appearing over and over in the trace was this:
97 9:38:46 AM 0.0000000 WFE 52882 (0xCE92) DC 88 (0x58) TCP TCP:Flags=......S., SrcPort=52882, DstPort=Kerberos(88), PayloadLen=0, Seq=2542638417, Ack=0, Win=8192 ( Negotiating scale factor 0x8 ) = 8192
98 9:38:46 AM 0.0004965 DC 88 (0x58) WFE 52882 (0xCE92) TCP TCP:Flags=...A..S., SrcPort=Kerberos(88), DstPort=52882, PayloadLen=0, Seq=4098142762, Ack=2542638418, Win=65535 ( Negotiated scale factor 0x1 ) = 131070
99 9:38:46 AM 0.0000200 WFE 52882 (0xCE92) DC 88 (0x58) TCP TCP:Flags=...A...., SrcPort=52882, DstPort=Kerberos(88), PayloadLen=0, Seq=2542638418, Ack=4098142763, Win=513 (scale factor 0x8) = 131328
100 9:38:46 AM 0.0000599 WFE 52882 (0xCE92) DC 88 (0x58) KerberosV5 KerberosV5:AS Request Cname: farmsvc Realm: CONTOSO.COM Sname: krbtgt/CONTOSO.COM
102 9:38:46 AM 0.0022497 DC 88 (0x58) WFE 52882 (0xCE92) KerberosV5 KerberosV5:KRB_ERROR - KDC_ERR_CLIENT_REVOKED (18)
KRB_ERROR - KDC_ERR_CLIENT_REVOKED means that the client account has been locked out. We checked active directory and sure enough, the account used for the WFE service was locked. We then learned that they had recently changed the password for that service account, which resulted in said lockout. One thing to note about Network Monitor (and you can do this with Wireshark, as well) is that I actually had all Kerberos traffic highlighted in green so it stood out quickly.
So what did we learn? We know that if the trace had just been taken from the SQL server, I wouldn’t have found the issue. We also know that if the WFE trace had been filtered to just include SQL traffic or SQL and client traffic, I wouldn’t have found the issue. Remember, more is better! Even if I get gigabytes of captures, I can always parse them or break them into smaller, bite-sized (no pun intended) chunks for faster filtering. In a future blog, I will address that exact topic – tips and tricks to work through copious amounts of network capture data.
Until then, happy tracing!