Understanding (the Lack of) Distributed File Locking in DFSR

Ned here again. Today’s post is probably going to generate some interesting comments. I’m going to discuss the absence of a multi-host distributed file locking mechanism within Windows, and specifically within folders replicated by DFSR.

Some Background

  • Distributed File Locking – this refers to the concept of having multiple copies of a file on several computers and when one file is opened for writing, all other copies are locked. This prevents a file from being modified on multiple servers at the same time by several users.
  • Distributed File System Replication – DFSR operates in a multi-master, state-based design. In state-based replication, each server in the multi-master system applies updates to its replica as they arrive, without exchanging log files (it instead uses version vectors to maintain “up-to-dateness” information). No one server is ever arbitrarily authoritative after initial sync, so it is highly available and very flexible across various network topologies.
  • Server Message Block – SMB is the common protocol used in Windows for accessing files over the network. In simplified terms, it’s a client-server protocol that makes use of a redirector to have remote file systems appear to be local file systems. It is not specific to Windows and is quite common – a well-known non-Microsoft example is Samba, which allows Linux, Mac, and other operating systems to act as SMB clients/servers and participate in Windows networks.
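
To make the version vector idea from the DFSR bullet a bit more concrete, here is a minimal Python sketch of how two replicas could compare their “up-to-dateness”. This is a generic textbook model, not DFSR’s actual data structures or wire protocol; the names Order, compare, and merge are made up for illustration.

```python
from enum import Enum


class Order(Enum):
    EQUAL = "equal"
    BEFORE = "before"          # a has seen nothing that b has not
    AFTER = "after"            # a has seen everything b has, and more
    CONCURRENT = "concurrent"  # each side has updates the other lacks


def compare(a: dict, b: dict) -> Order:
    """Compare two version vectors (server name -> update counter)."""
    servers = set(a) | set(b)
    a_ahead = any(a.get(s, 0) > b.get(s, 0) for s in servers)
    b_ahead = any(b.get(s, 0) > a.get(s, 0) for s in servers)
    if a_ahead and b_ahead:
        return Order.CONCURRENT
    if a_ahead:
        return Order.AFTER
    if b_ahead:
        return Order.BEFORE
    return Order.EQUAL


def merge(a: dict, b: dict) -> dict:
    """After syncing, a replica knows the highest counter seen per server."""
    return {s: max(a.get(s, 0), b.get(s, 0)) for s in set(a) | set(b)}
```

Two vectors that compare as CONCURRENT describe the interesting case for this post: changes were made on different servers and neither side had seen the other’s update yet.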

It’s important to make a clear delineation of where DFSR and SMB live in your replicated data environment. SMB allows users to access their files, and it has no awareness of DFSR. Likewise, DFSR (using the RPC protocol) keeps files in sync between servers and has no awareness of SMB. Don’t confuse distributed locking as defined in this post with Opportunistic Locking.

So here’s where things can go pear-shaped, as the Brits say.

Since users can modify data on multiple servers, and since each Windows server only knows about a file lock on itself, and since DFSR doesn’t know anything about those locks on other servers, it becomes possible for users to overwrite each other’s changes. DFSR uses a “last writer wins” conflict algorithm, so someone has to lose and the person to save last gets to keep their changes. The losing file copy is chucked into the ConflictAndDeleted folder.
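
Here is a small Python sketch of the scenario just described, assuming a toy model where each server keeps its own lock table and saved copies are compared by a save timestamp. The class and function names (Server, SavedCopy, last_writer_wins) are hypothetical, and the tie-break rule is an arbitrary choice for the sketch, not something DFSR documents here.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """A file server that only knows about locks taken on itself."""
    name: str
    locks: dict = field(default_factory=dict)  # path -> user holding the lock

    def open_for_write(self, path: str, user: str) -> bool:
        if path in self.locks:
            return False  # local sharing violation: someone here has it open
        self.locks[path] = user
        return True

    def close(self, path: str, user: str) -> None:
        if self.locks.get(path) == user:
            del self.locks[path]


@dataclass
class SavedCopy:
    server: str
    user: str
    saved_at: float  # hypothetical save timestamp, in seconds


def last_writer_wins(a: SavedCopy, b: SavedCopy):
    """Return (winner, loser); the later save wins.

    Ties go to `a`, an arbitrary choice made only for this sketch.
    """
    return (a, b) if a.saved_at >= b.saved_at else (b, a)
```

In this model, a second user on the same server is refused, but a user on a different server is not, because that server’s lock table is empty. Whoever saves last comes out of last_writer_wins as the winner, and the loser is the copy that would end up in ConflictAndDeleted.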

Now, this is far less common than people like to believe. Typically, true shared files are modified in a local environment; in the branch office or in the same row of cubicles. They are usually worked on by people on the same team, so people are generally aware of colleagues modifying data. And since they are usually in the same site, the odds are much higher that all the users working on a shared doc will be using the same server. Windows SMB handles the situation here. When a user has a file locked for modification and his coworker tries to edit it, the other user will get an error like:

[Screenshot: the file-in-use error shown to the second user]

And if the application opening the file is really clever, like Word 2007, it might give you:

[Screenshot: the Word 2007 file-in-use dialog]

DFSR does have a mechanism for locked files, but it is only within the server’s own context. As I’ve discussed in a previous post, DFSR will not replicate a file in or out if its local copy has an exclusive lock. But this doesn’t prevent anyone on another server from modifying the file.

Back on topic, the issue of shared data being modified geographically does exist, and for some folks it’s pretty gnarly. We’re occasionally asked why DFSR doesn’t handle this locking and take care of everything with a wave of the magic wand. It turns out this is an interesting and difficult scenario to solve for a multi-master replication system. Let’s explore.

Third-Party Solutions

There are some vendor solutions that take on this problem, which they typically tackle through one or more of the following methods*:

  • Use of a broker mechanism

Having a central ‘traffic cop’ allows one server to be aware of all the other servers and which files they have locked by users. Unfortunately this also means that there is often a single point of failure in the distributed locking system.
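
As a rough illustration of the ‘traffic cop’ idea, here is a minimal Python sketch of a central lock broker. It does not model any particular vendor’s product; LockBroker, BrokerUnavailable, and the online flag are all invented for this example.

```python
class BrokerUnavailable(Exception):
    """Raised when the central broker cannot be contacted."""


class LockBroker:
    """One central table of which server/user holds which file."""

    def __init__(self):
        self.online = True
        self._owners = {}  # path -> (server, user)

    def acquire(self, path: str, server: str, user: str) -> bool:
        if not self.online:
            raise BrokerUnavailable("no broker, so nobody can be granted a lock")
        # The first caller becomes the owner; later callers see the existing owner.
        holder = self._owners.setdefault(path, (server, user))
        return holder == (server, user)

    def release(self, path: str, server: str, user: str) -> None:
        if self._owners.get(path) == (server, user):
            del self._owners[path]
```

Every server has to ask the same broker before letting a user write, which is what makes the lock global. It is also why a broker outage (online set to False in this sketch) stops lock decisions for every server at once.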

  • Requirement for a fully routed network

Since a central broker must be able to talk to all servers participating in file replication, this removes the ability to handle complex network topologies. Ring topologies and multi hub-and-spoke topologies are not usually possible. In a non-fully routed network, some servers may not be able to directly contact each other or a broker, and can only talk to a partner who himself can talk to another server – and so on. This is fine in a multi-master environment, but not with a brokering mechanism.
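
The difference between “can talk directly” and “can only talk through a partner” is easy to see in a small Python sketch. The four-server ring and the function names below are assumptions for illustration only.

```python
from collections import deque


def neighbors(links, server):
    """Servers with a direct connection to `server`."""
    return {b if a == server else a for a, b in links if server in (a, b)}


def has_direct_link(links, a, b):
    """True if a and b can talk without going through anyone else."""
    return (a, b) in links or (b, a) in links


def can_reach(links, start, goal):
    """True if start can reach goal by hopping through partners (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            return True
        for nxt in neighbors(links, current):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# A ring of four servers; imagine the broker runs on A.
RING = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")}
```

In this ring, C can reach A by hopping through B or D, which is enough for partner-to-partner replication. C has no direct link to A, though, so a broker on A that requires direct contact is out of C’s reach.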

  • Are limited to a pair of servers

Some solutions limit the topology to a pair of servers in order to simplify their distributed locking mechanism. For larger environments this may not be feasible.

  • Make use of agents on clients and servers
  • Do not use multi-master replication
  • Do not make use of MS clustering
  • Make use of specialty appliances

* Note that I say typically! Please do not post death threats because you have a solution that does/does not implement one or more of those methods!

Deeper Thoughts

As you think further about this issue, some fundamental issues start to crop up. For example, if we have four servers with data that can be modified by users in four sites, and the WAN connection to one of them goes offline, what do we do? The users can still access their individual servers – but should we let them? We don’t want them to make changes that conflict, but we definitely want them to keep working and making our company money. If we arbitrarily block changes at that point, no users can work even though there may not actually be any conflicts happening! There’s no way to tell the other servers that the file is in use and you’re back at square one.
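
The dilemma can be boiled down to a choice between two policies, sketched here in Python. This is purely a thought experiment: Policy and may_open_for_write are hypothetical names, and the sketch assumes that when every peer is reachable, each has been asked and none holds the lock.

```python
from enum import Enum


class Policy(Enum):
    FAIL_CLOSED = "block writes unless every peer can be asked"
    FAIL_OPEN = "allow writes and accept possible conflicts"


def may_open_for_write(all_peers, reachable_peers, policy: Policy) -> bool:
    """Decide whether a local user may open a file for writing.

    Assumes no reachable peer currently holds a lock on the file.
    """
    unreachable = set(all_peers) - set(reachable_peers)
    if not unreachable:
        return True  # everyone could be asked, so the lock is truly global
    if policy is Policy.FAIL_CLOSED:
        return False  # safe, but users cannot work during the outage
    return True  # users keep working, but an unseen peer may conflict
```

With four servers and one WAN link down, FAIL_CLOSED refuses the write even if nobody at the unreachable site is touching the file, while FAIL_OPEN is no better than having no distributed lock at all for that site.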

Then there’s SMB itself and the error handling of reporting locks. We can’t really change how SMB reports sharing violations as we’d break a ton of applications and clients wouldn’t understand new extended error messages anyways. Applications like Word 2007 do some undercover trickery to figure out who is locking files, but the vast majority of applications don’t know who has a file in use (or even that SMB exists. Really.). So when a user gets the message ‘This file is in use’ it’s not particularly actionable – should they all call the help desk? Does the help desk have access to all the file servers to see which users are accessing files? Messy.

Since we want multi-master for high availability, a broker system is less desirable; we might need to have something running on all servers that allows them all to communicate even through non-fully routed networks. This will require very complex synchronization techniques. It will add some overhead on the network (although probably not much) and it will need to be lightning fast to make sure that we are not holding up the user in their work; it needs to outrun file replication itself - in fact, it might need to actually be tied to replication somehow. It will also have to account for server outages that are network related and not server crashes, somehow.

And then we’re back to special client software for this scenario that better understands the locks and can give the user some useful info (“Go call Susie in accounting and tell her to release that doc”, “Sorry, the file locking topology is broken and your administrator is preventing you from opening this file until it’s fixed”, etc). Getting this to play nicely with the millions of applications running in Windows will definitely be interesting. There are plenty of OS’s that would not be supported or get the software – Windows 2000 is out of mainstream support and XP soon will be. Linux and Mac clients wouldn’t have this software until they felt it was important, so the customer would have to hope their vendors made something analogous.

The Big Finish

Right now the easiest way to control this situation in DFSR is to use DFS Namespaces to guide users to predictable locations, with a consistent namespace. By correctly configuring your DFSN site topology and server links, you force users to all share the same local server and only allow them to access remote computers when their ‘main’ server is down. For most environments, this works quite well. As an alternative to DFSR, SharePoint is an option because of its check-out/check-in system. BranchCache (coming in Windows Server 2008 R2 and Windows 7) may be an option for you as it is designed for easing the reading of files in a branch scenario, but in the end the authoritative data will still live on one server only – more on this here. And again, those vendors have their solutions.
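
To show why steering everyone to one ‘main’ server sidesteps the problem, here is a simplified Python model of a cost-ordered referral list with failover. It is only a sketch of the idea, not how DFSN builds referrals internally; Target, referral_list, pick_target, and the cost numbers are all made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Target:
    server: str
    site_cost: int       # hypothetical cost from the users' site; lower is closer
    online: bool = True


def referral_list(targets):
    """Order folder targets from cheapest to most expensive."""
    return sorted(targets, key=lambda t: t.site_cost)


def pick_target(targets) -> Optional[Target]:
    """Use the first target in the referral list that is actually up."""
    for target in referral_list(targets):
        if target.online:
            return target
    return None
```

As long as the cheapest server is up, every user in that site lands on it, so ordinary SMB locking on that one server covers them all. The remote copy only comes into play when the main server is marked offline.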

We’ve heard you loud and clear on the distributed locking mechanism though, and just because it’s a difficult task does not mean we’re not going to try to tackle it. You can feel free to discuss third party solutions in our comments section, but keep in mind that I cannot recommend any for legal reasons. Plus I’d love to hear your brainstorms – it’s a fun geeky topic to discuss, if you’re into this kind of stuff.

- Ned ‘Be Gentle!’ Pyle

  • Ned, As we implemented DFS originally, then DFSR (Bravo on DFSR by the way!) at our company, our IT team had more than a few discussions with upper management about file locking and the horror of overwriting files accidentally with DFSR.

    We've been using DFSR for a few years now without any major headaches.  Every now and then, we have a user complain that his changes were overwritten on a particular file.   It's really just a matter of looking in the ConflictAndDeleted folder and patching the file back together including everyone's changes.

    The global file locking ability would be fantastic, but for the time being, we've been managing without it.

  • We are about to go live with a DFS setup spread across multiple sites, and have turned to Peerlock to solve our file locking issues. While most of our data won't be worked on collaboratively, quite a bit of our AutoCAD drawings will be.

    However I would really look forward to a solution built into DFS, although it's unfortunate that it will probably require an OS purchase and upgrade to implement.

  • Thanks jmiles. If you get a chance, we'd love to hear your feedback here in a month or so with how that implementation worked out for you.

  • Well, we've been running with Peerlock for about 3 weeks now, and everything appears to be working great. We had to ensure that the temporary locking files weren't being replicated, and we had to turn off the saving of deleted files in the Conflict&Deleted folders, but once that's done, it appears to work excellent. CPU usage is up to ~35% consistently, where it used to be 5%, on a file server serving 800 GB to about 100 users.

    I'd be more than willing to answer any questions anyone has about it.

  • @jmiles:

    Could you share some more information - especially on how many servers do you do the locking, how are these servers interconnected, and how many files do you replicate?

    There seem to be a lot of links to this product when Googling for this sort of solution, but unfortunately almost no user experiences, so your experience would be very interesting :)

  • I agree, there are hardly any user experience reports.

    We currently have 3 servers participating in DFS replication with Peerlock. This will expand to 4 in the coming months. Currently two servers are in our head office site, and the third is about 500KM away in a branch office.

    DFSR is configured in a hub/spoke topology, with Server1 (Windows Server 2003 R2) as the hub, and the two spokes currently Server 2008 x64. There is a 3Mb up/down Bonded-T1 connection in our head office, and a flaky business ADSL in the branch office.

    Peerlock must be installed on each server participating in replication. We have the locking system configured in a Full Mesh topology. Basically you specify a source folder, which is local. This would be your shared folder target for DFS. Then your peerlock targets would be the other folder target for the other DFSR partners. For us these are hidden shares (ie: \\server2\files$).

    As far as the actual operation, it is working perfectly. A user opens a file on Server1, and it is locked on Server2 and Server3. These locks are removed when the file is closed.

    There are a couple downsides to Peerlock however:

  • Whoops, hit Tab and then space, and it entered the comment.

    Downsides:

    - Peerlock can run as a GUI, or a service, but not both. This means to run on a server that is logged off, you need to use the service, but to make configuration changes, you have to stop the service, start the GUI, make changes, close the GUI, start the service. A hassle, but apparently changing in a future version.

    - Logging isn't terrible, but isn't great either. You have options to change log file size, but not type of logging, or verbosity.

    - CPU usage. As mentioned before, the program is using about 35% CPU constantly, on brand new PowerEdge 2900 servers with a single Quad Core CPU.

    - Price - it was somewhere around $990 CAD per license, and you need one for each server.

    As far as sizing goes, we are replicating about 250 GB in one folder group, and 500 GB in another. These are mostly AutoCAD/Office documents, and are accessed by around 120 users daily. There's about 5 GB of changes on a busy day according to our incrementals.

    As always, feel free to ask for more info.

  • Thank you for your information! :)

  • For me, file locking as an issue has been replaced by replication and sharing violations when using Autodesk files.

    Has anyone found a fix for this?

  • I have not, nor have any Autodesk customers that I have spoken to. The sharing violation part is pretty well understood, but there's nothing I've found to mitigate it. The other issue I've seen, where sometimes older versions of a file appear to be replicated, appears to be partially caused by how AutoCAD swaps around transactional versions of their files. From exploring their support forums, it even appears to happen sometimes without having replication in place at all... :-/

    If anyone finds out final words on this from that vendor, please let me know. I will be eternally in your debt.

    - Ned

  • I've got an issue with the file locking wrt namespace configuration, etc.

    I've tried to configure the namespace so that clients go to the main server in the office, as it is the first server and has the lowest cost.  The remote office has a higher cost as it is off-site. This works most of the time, but the following scenario has been giving me problems:

    Whenever an MS Office file is opened on the primary server, subsequent attempts at opening this file are redirected to the off-site server.  This was documented in the Win2k3R2 DFS-R help file. Can this behavior be changed or did I forget something?  I don't mind if the local copy is read-only, but this behavior of opening the file on the remote site is causing a lot of update conflicts.

  • Can you paste in the help file text that says this?

    As an interim fix, you can always explore using Target Priority. That will make your preferred server always at the top of the referral list and the other server would only be used if the prioritized server was completely unavailable (down, dead network, etc).

  • Hi Ned - Hope you're well.

    I have a question over file locking.

    Essentially the setup is relatively simple as follows:

    All servers are running 2003R2.

    There are 3 sites say A, B, C.

    A= remote site

    B= remote site

    C= data center

    Users only access servers in sites A or B only. DFS has been setup so that DFSR only replicates files between A&C or B&C so folders on A & B servers are unique and are only replicated to servers in site C.

    If two users access the same file in Site A will a file lock occur for the second (and subsequent users)?

    Essentially I am asking if any part of DFS (like DFSR) turns off file locking completely on a server!

    I hope this makes sense.

    Thanks in advance for your help.

    Best Regards

  • Yes, after the first user on A locks a file for WRITE, any subsequent users connecting to A will not be able to modify the file. At this point you are not dealing with DFSR in any way, but instead SMB behavior.

    And thanks, I too hope you are well. :)