There is a whole host of issues that are simply never seen unless you have a large distributed environment. I know that sounds startling, but here’s a hypothetical example. Imagine that you are an online retailer, and for every identity you transact business with, an object in an AD LDS/ADAM database is created or updated.

If you have many business transactions (generally a good thing from a business perspective) then the number of client connections and updates swells accordingly. In an ideal world, the IT staff has an opportunity to create a test environment and scale it out as a proof of concept, verifying that the solution won’t run into any surprise problems.

In reality sudden business success or adoption can often outpace prior testing results. This is when you discover any potential scaling issues or performance bottlenecks. Of course the moral of this introduction is to do everything you can to scale out your test environments and place them under the most severe load you can conceive of prior to going “into production”.

Let’s talk about one scenario that we’ve seen a time or two.

I have a distributed application that has to read, write, update and remove many very large (hundreds of megabytes each) files from a set of file servers. When I say distributed I generally mean that the client application could be running on many different computers. Think hundreds. In contrast, the file servers providing the server side of this scenario are much smaller in number. There is no hard and fast rule about what the threshold for the issue may be, but let’s say that in our scenario we are seeing a ratio of 20:1, twenty clients to one server.

The issue occurs during peak hours (Monday at 9 AM), when everyone in the main office arrives at work and begins to use the application to do whatever it is they are doing.

What happens is that, after connecting successfully to a particular file in order to update it, Joe User notices that he can no longer update the file and instead gets a “file not found” error. Joe dutifully calls the helpdesk and reports said error.

The helpdesk folks notice several calls at about the same time from other users. All of the calls mention the same file, around the same time, with the same “not found” behavior. When they examine logs they notice that the client connections are going to the same server. What is even more interesting is that the clients (the same clients) are able to successfully update other files on the same server at that same time, even files in the same destination directory on that server.

After a few days of seeing this issue being reported they see that it consistently goes away after a few minutes without any user or administrative action. In other words the workaround is to wait a few minutes and try accessing the file again, in which case the access would then succeed.  To add to the confusion the file could be seen to be present in the directory at that time if viewed in a local session on the server.

So what do you do in this situation?  Network captures are always a good idea in order to get a thorough understanding of what is happening in the client to server communication.  In this case though you simply see the “not found” message in its primeval form on the wire:

1681 0.509984 {SMB:190, SMBOverTCP:181, TCP:27, IPv4:26} 192.168.1.8 192.168.1.9 SMB SMB:C; Transact2, Query Path Info, Query File Basic Info, Pattern = \repository\ilovescotch\singlemalts.lnot

1682 0.510083 {SMB:190, SMBOverTCP:181, TCP:27, IPv4:26} 192.168.1.9 192.168.1.8 SMB SMB:R; Transact2, Query Path Info - NT Status: System - Error, Code = (52) STATUS_OBJECT_NAME_NOT_FOUND
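As an aside for anyone matching the trace against documentation: STATUS_OBJECT_NAME_NOT_FOUND is the NTSTATUS value 0xC0000034, and the "(52)" shown by the capture tool is simply the low 16 bits (0x34) in decimal. Here is a minimal Python sketch that pulls an NTSTATUS value apart. The function name and dictionary keys are my own, and reading facility 0 as the "System" label in the trace is an assumption on my part.

```python
# NTSTATUS layout (32 bits): 2 bits severity, 1 bit customer flag,
# 1 reserved bit, 12 bits facility, 16 bits code.
STATUS_OBJECT_NAME_NOT_FOUND = 0xC0000034

SEVERITY_NAMES = {0: "Success", 1: "Informational", 2: "Warning", 3: "Error"}


def decode_ntstatus(status):
    """Split a 32-bit NTSTATUS value into its fields (hypothetical helper)."""
    return {
        "severity": SEVERITY_NAMES[(status >> 30) & 0x3],
        "customer": bool((status >> 29) & 0x1),
        "facility": (status >> 16) & 0xFFF,
        "code": status & 0xFFFF,
    }


fields = decode_ntstatus(STATUS_OBJECT_NAME_NOT_FOUND)
print(fields["severity"], fields["facility"], fields["code"])
```

For this value the severity comes out as Error and the code as 52, which lines up with what the capture shows in the response frame.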

Some folks may have thoughts of exclusive handle locks or problems with opportunistic locking bouncing around in their heads right now.  Sadly, in this type of scenario neither is the culprit (Process Monitor’s handle functionality can reveal the former and the capture should reveal the latter).

The best thing to do in this situation is to gather Perfmon data from the server side, starting prior to the issue occurring and ending after the issue has resolved itself.  What should you gather?  The usual suspects: the disk, memory, process, processor and network objects, with all counters, are a good start.
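If you would rather script the collection than click through the Perfmon UI, the built-in typeperf tool can log the same objects to a CSV file. The following is only a sketch in Python that assembles such a command line. The function name, the 15 second sample interval and the output file name are assumptions of mine, not recommendations from the case, and the script only builds the argument list so you can review it before running it on the server.

```python
def build_typeperf_command(output_path, interval_seconds=15):
    """Build a typeperf command that logs all counters for the usual suspects.

    The object list mirrors the post (disk, memory, process, processor,
    network); the interval is an assumed default, adjust it to taste.
    """
    objects = [
        "PhysicalDisk(*)",
        "LogicalDisk(*)",
        "Memory",
        "Process(*)",
        "Processor(*)",
        "Network Interface(*)",
    ]
    # "\Object(instance)\*" asks for every counter of that object.
    counters = ["\\" + obj + "\\*" for obj in objects]
    return ["typeperf", *counters,
            "-si", str(interval_seconds),  # sample interval in seconds
            "-f", "CSV",                   # output file format
            "-o", output_path]             # output file


command = build_typeperf_command("fileserver_peak.csv")
print(" ".join('"%s"' % part if " " in part else part for part in command))
```

On the server itself the resulting list could be handed to subprocess.run, started before 9 AM and stopped (Ctrl+C) once the errors have cleared.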

There are disk bottlenecks to look out for, and rarely do you see “not found” type messages as the result of them, but here is an instance where you can.  Split IO can result in this behavior when a large number of hefty files are accessed simultaneously, ultimately leading the server side redirector (a victim as much as the client in this case) to pass along a STATUS_OBJECT_NAME_NOT_FOUND error since it truly couldn’t find that file, even though it was there.

MSDN describes this counter thus:

Split IO/Sec
Shows the rate at which I/O requests to the disk were split into multiple requests. A split I/O may result from requesting data in a size that is too large to fit into a single I/O, or from the disk being fragmented on single-disk systems.

How much is bad?  Well, zero is the optimal number. So if you see a number other than that (example below) this should raise bottleneck concerns.
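If you have the counter log as a CSV file (for example from typeperf, or converted with relog), checking for nonzero values is easy to automate. Below is a rough Python sketch. It assumes the usual Perfmon CSV layout, with the timestamp in the first column and full counter paths in the header row. The server name FS01 and the sample values are made up for illustration.

```python
import csv
import io


def find_split_io_samples(csv_text, counter_suffix="\\Split IO/Sec"):
    """Return (timestamp, counter path, value) for every sample above zero."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Pick every column whose counter path ends in "\Split IO/Sec".
    columns = [i for i, name in enumerate(header) if name.endswith(counter_suffix)]
    hits = []
    for row in reader:
        for i in columns:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # blank or missing sample, skip it
            if value > 0:
                hits.append((row[0], header[i], value))
    return hits


# Hypothetical log excerpt; FS01 and the numbers are invented.
sample = (
    '"(PDH-CSV 4.0) (Pacific Standard Time)(480)",'
    '"\\\\FS01\\PhysicalDisk(_Total)\\Split IO/Sec"\n'
    '"01/05/2009 09:00:00.000"," "\n'
    '"01/05/2009 09:00:15.000","0.000000"\n'
    '"01/05/2009 09:00:30.000","12.500000"\n'
)

for timestamp, counter, value in find_split_io_samples(sample):
    print(timestamp, counter, value)
```

Only the last sample in the excerpt is reported, since the first is blank and the second is zero.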

image

In its natural environment you will spot split IO as it appears to sympathetically rise with performance spikes in disk queue length, writes and reads, such as below.  Notice that the split IO (the thick and scary looking red line) rises when disk queue length is at its highest.  This is a bad sign, though barring additional symptoms like the “not found” errors it would not come to someone's attention as something in dire need of fixing.  And the “not found” scenario is much more likely in the case of multiple large single file requests than it would be if the files themselves contained less data each.

image

image
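If you want to put a number on that "rises together" impression rather than eyeball the graph, a plain Pearson correlation between the Split IO/Sec samples and the disk queue length samples is one simple option. This is just a sketch. The two series below are invented for illustration and are not data from the case.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sample series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of the same length (at least 2 samples)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Invented samples taken at the same timestamps.
queue_length = [1, 2, 8, 9, 2]
split_io_per_sec = [0, 0, 30, 40, 1]

print(round(pearson(queue_length, split_io_per_sec), 2))
```

A value close to 1 says the two counters spike together, which is the pattern described above. It does not prove which one is driving the other.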

In this scenario, if you are running into it, you have one of three different things occurring.  First, if you are lucky, you simply need to defragment the hosting volumes for the data on that server.

Or it could be that you have reached a milestone in your business growth and need to scale into a storage area network (SAN) or move to a better performing RAID configuration or hardware set.  The thing to keep in mind in this choice is that the I in RAID stands for Inexpensive.  The even less costly alternative is to somehow distribute or lighten the load on the servers in order to crawl along beneath the performance threshold the hardware is imposing.

Why am I blogging about something like this, where there wasn’t even an “access denied” error or AD replication mention in the entire post?  Because Directory Services is the Bermuda Triangle of difficult technical issues at Microsoft.  We are presented with the unexplained phenomena of the IT world.  It’s what we do.

Let’s christen a new category and file this in it: Unexplained Phenomena.