Implementing Content Freshness protection in DFSR

Hi all, Ned here again. Starting in Windows Server 2008 and continuing in Windows Server 2008 R2, DFSR supports a protective mechanism called “Content Freshness”. Today I’ll discuss this protection, how to implement it, and what to do when it swings into operation.

Background

Content Freshness is an admin-defined setting that you can set on a per-computer basis when using DFSR on Win2008 or Win2008 R2 – it does not exist on Windows Server 2003 R2. The DFSR database has a record for each Replicated Folder (RF) called CONTENT_SET_RECORD. This record contains a timestamp called “LastConnected”. We store this record on a per-Replicated-Folder basis because it’s possible for a replicated folder to be current when it’s connected to other members in that replication group. At the same time, another replicated folder can be stale because it is not connected with other members in its replication group. Every day, DFSR updates this timestamp to show that the opportunity for replication occurred. When attempting replication for an RF between computers, the DFSR service checks whether the last time replication was allowed is older than the freshness date. If the last-allowed-replicated date is newer, it replicates. If it’s not, we block replication.
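The check described above can be sketched roughly as follows. This is a minimal illustration in Python, not DFSR's actual code: the function name and parameters are hypothetical, treating a value of 0 as "protection off" follows the default described later in this post, and how the service handles the exact boundary day is an assumption.

```python
from datetime import datetime, timedelta


def replication_allowed(last_connected: datetime, now: datetime, max_offline_days: int) -> bool:
    """Return True if the replicated folder is still considered fresh.

    Hypothetical sketch: last_connected stands in for the LastConnected
    timestamp, and max_offline_days for MaxOfflineTimeInDays.
    """
    # Assumption: 0 (or a negative value) means the protection is disabled.
    if max_offline_days <= 0:
        return True
    # Assumption: a gap of exactly max_offline_days is still treated as fresh.
    return now - last_connected <= timedelta(days=max_offline_days)


# A folder last connected 76 days ago, checked against a 60-day limit.
print(replication_allowed(datetime(2009, 10, 17), datetime(2010, 1, 1), 60))
```

With protection enabled, a folder that has been disconnected longer than the limit is blocked; with the value left at 0, the same folder would be allowed to replicate.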

By now, you’re asking yourself “why would I want to block replication?” Good question. DFSR has a JET database just like Active Directory, and it uses multi-master replication just like AD. This means that it must implement tombstones so that deletions can replicate. When a file is deleted in DFSR, the local database records the deletion as a tombstone – a logical deletion. After 60 days DFSR garbage collects the record from the database and it is truly gone – a physical deletion. Online defragmentation of the database can then reclaim that whitespace. The 60 days allows all the replication partners to learn about the deletion and act on it.

And herein lies the problem. If a DFSR server cannot replicate an RF for more than 60 days, but then replication is allowed later, it can replicate out old deletions for files that are actually live or replicate out stale data and overwrite existing files. If you’ve ever worked on an Active Directory “lingering object” issue, you have seen what can happen when a DC that was offline for months is brought back up. This is why Strict Replication Consistency was invented for AD – Content Freshness protection serves the same purpose for DFSR.

Being “unable to replicate” can mean any one of these scenarios:

  • Disabling the replication connections.
  • Deleting the replication connections (either one-way or in both directions).
  • Stopping the DFSR service.
  • Closing the schedule (i.e. setting “no replication”)
  • Keeping the server shut off.

This whole content freshness idea is novel enough that we went to the trouble of applying for a patent on it.

Implementing Content Freshness Protection

Content Freshness protection is not enabled by default. To turn it on you simply modify the DfsrMachineConfig setting for MaxOfflineTimeInDays on each DFSR server with:

wmic.exe /namespace:\\root\microsoftdfs path DfsrMachineConfig set MaxOfflineTimeInDays=<some value>

The recommendation is to set the value to 60:

wmic.exe /namespace:\\root\microsoftdfs path DfsrMachineConfig set MaxOfflineTimeInDays=60

Remember, this has to be done on all DFSR servers, as this change only affects the computer itself. This value is not stored in a central AD location, but instead in the DfsrMachineConfig.XML file that resides in the hidden operating system folder “%systemdrive%\system volume information\dfsr\config”.

You can also view your existing MaxOfflineTimeInDays with:

wmic.exe /namespace:\\root\microsoftdfs path DfsrMachineConfig get MaxOfflineTimeInDays

Remember, by default this protection is OFF, and the value is assumed to be zero if there is no entry in DfsrMachineConfig.xml.
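Since the setting has to be applied on every DFSR server, it can help to generate the command lines for a list of servers rather than typing them by hand. The sketch below only builds the strings; it does not run anything. It assumes you want to use wmic's /node switch to target remote machines (the post itself runs the command locally on each server) and that remote WMI access is permitted in your environment. The function name is made up for illustration.

```python
def build_freshness_commands(servers, days=60):
    """Build wmic command lines that set, then read back, MaxOfflineTimeInDays."""
    # Raw string so the backslashes in the WMI namespace are kept literally.
    base = r'wmic.exe /node:"{server}" /namespace:\\root\microsoftdfs path DfsrMachineConfig'
    commands = []
    for server in servers:
        prefix = base.format(server=server)
        commands.append(f"{prefix} set MaxOfflineTimeInDays={days}")
        commands.append(f"{prefix} get MaxOfflineTimeInDays")
    return commands


# Server names taken from the repro environment used later in this post.
for line in build_freshness_commands(["2008r2-fresh-01", "2008r2-fresh-02"]):
    print(line)
```

Pairing each set with a get makes it easy to confirm afterwards that no server was missed.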

Note: Sharp-eyed admins may notice that we actually have an AD attribute stamped on every Replication Group called ms-DFSR-TombstoneExpiryInMin that appears to control tombstone lifetime. It even has the value - in minutes - for 60 days. Sorry to disappoint you, but this attribute is never read by DFSR and changing it has no effect – tombstone lifetime garbage collection is always hard-coded to 60 days in the service and cannot be changed.
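For reference, the arithmetic behind that attribute value is simple. The snippet below is only a worked calculation; it does not read or write the attribute, and the helper name is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60
TOMBSTONE_LIFETIME_DAYS = 60  # hard-coded in the service, per the note above


def days_to_minutes(days: int) -> int:
    """Convert a number of days to minutes."""
    return days * MINUTES_PER_DAY


print(days_to_minutes(TOMBSTONE_LIFETIME_DAYS))  # prints 86400
```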

Protection in Action

Let’s see how all this works. My repro environment:

  • A pair of Windows Server 2008 R2 computers named 2008r2-fresh-01 and 2008r2-fresh-02
  • Replicating in a Replication Group named “RG1”
  • Using a Replicated Folder named “RF1”
  • Keeping a few user files in sync.
  • MaxOfflineTimeInDays set to 60 on 2008r2-fresh-02

Important note: I am going to simulate the offline time by rolling clocks forward. Never ever do this in production – this is for testing and demonstration purposes only. Also, I only set MaxOfflineTimeInDays on one server – you would do this on all servers.

Now I stop DFSR on 2008r2-fresh-02 and roll time forward to January 1st, 2010 on both servers - about 75 days from this writing. I then make a few changes on 2008r2-fresh-02.

And then I start the DFSR service back up on 2008r2-fresh-02.

  • My changed files do not replicate out
  • New files do not replicate in

I now have this event:

Log Name:      DFS Replication
Source:        DFSR
Date:          1/1/2010 3:37:14 PM
Event ID:      4012
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      2008r2-fresh-02.blueyonderairlines.com
Description:
The DFS Replication service stopped replication on the replicated folder at local path c:\rf1. It has been disconnected from other partners for 76 days, which is longer than the MaxOfflineTimeInDays parameter. Because of this, DFS Replication considers this data to be stale, and will replace it with data from other members of the replication group during the next replication. DFS Replication will move the stale files to the local Conflict folder. No user action is required.
Additional Information:
Error: 9061 (The replicated folder has been offline for too long.)
Replicated Folder Name: rf1
Replicated Folder ID: 5856C18F-CA72-4D2D-9D89-4CC1D8042D86
Replication Group Name: rg1
Replication Group ID: BC5976EF-997E-4149-819D-57193F21EC76
Member ID: FAEC4B17-E81F-4036-AAD9-78AA46814606

Note: this event has incorrect wording. The first two sentences in the description are good, but the following sentences are wrong. DFSR does not self-correct this situation, it does not move files into the ConflictAndDeleted folder, and you, the user, have actions you need to take. More on this later.

The DFSR Debug logs will show (edited for brevity):

20100101 15:37:14.410 1008 CSMG 5504 [WARN] ContentSetManager::CheckContentSetState This replicated folder has not connected to other partners for a long time. lastOnlineTime: [*** Logger Runtime Error:-114757888 ***]

20100101 15:37:14.410 1008 CSMG 7492 [ERROR] ContentSetManager::Initialize Failed to initialize ContentSetManager csId:{5856C18F-CA72-4D2D-9D89-4CC1D8042D86} csName:rf1 Error:

+ [Error:9061(0x2365) ContentSetManager::CheckContentSetState contentsetmanager.cpp:5596 1008 C The replicated folder has been offline for too long.]

20100101 15:37:14.410 1008 CSMG 7972 ContentSetManager::Run csId:{5856C18F-CA72-4D2D-9D89-4CC1D8042D86} csName:rf1 state:InitialBuilding

20100101 15:37:14.504 1948 SRTR 957 [WARN] SERVER_EstablishSession Failed to establish a replicated folder session. connId:{5E05AE2A-6117-4206-B745-7785DB316F74} csId:{5856C18F-CA72-4D2D-9D89-4CC1D8042D86} Error:

+ [Error:9028(0x2344) UpstreamTransport::EstablishSession upstreamtransport.cpp:808 1948 C The content set was not found]
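If you are scanning large debug logs for this condition, the error lines can be picked out with a short script. This is a rough sketch that assumes the error entries always follow the "[Error:NNNN(0xHHHH) Function ..." shape seen in the excerpt above; the function name and regular expression are my own, not part of any DFSR tooling.

```python
import re

# Matches entries such as "[Error:9061(0x2365) ContentSetManager::CheckContentSetState ..."
ERROR_PATTERN = re.compile(r"\[Error:(\d+)\(0x([0-9a-fA-F]+)\)\s+(\S+)")


def extract_errors(log_text):
    """Return (decimal_code, function_name) pairs for each error entry found."""
    results = []
    for match in ERROR_PATTERN.finditer(log_text):
        decimal_code = int(match.group(1))
        hex_code = int(match.group(2), 16)
        # Skip entries where the decimal and hex renderings disagree.
        if decimal_code != hex_code:
            continue
        results.append((decimal_code, match.group(3)))
    return results


sample = (
    "+ [Error:9061(0x2365) ContentSetManager::CheckContentSetState "
    "contentsetmanager.cpp:5596 1008 C The replicated folder has been offline for too long.]\n"
    "+ [Error:9028(0x2344) UpstreamTransport::EstablishSession "
    "upstreamtransport.cpp:808 1948 C The content set was not found]\n"
)
print(extract_errors(sample))
```

Filtering the result for code 9061 gives a quick list of places where Content Freshness blocked a replicated folder.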

The state of the replicated folder will be “In Error” – i.e. set to 5:

wmic.exe /namespace:\\root\microsoftdfs path DfsrReplicatedFolderInfo get ReplicationGroupName,ReplicatedFolderName,State

ReplicatedFolderName  ReplicationGroupName  State
rf1                   rg1                   5

The above is Content Freshness protection in action. It is protecting your DFSR environment from sending divergent data out to the rest of your working servers.
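To make the numeric state easier to read when checking several servers, the wmic output can be mapped to state names. The sketch below is an illustration only: the state names are the ones documented for the DfsrReplicatedFolderInfo WMI class as best I know them (verify against your OS version), the function name is hypothetical, and the simple whitespace split assumes folder and group names contain no spaces.

```python
# State values as documented for the DfsrReplicatedFolderInfo WMI class.
STATE_NAMES = {
    0: "Uninitialized",
    1: "Initialized",
    2: "Initial Sync",
    3: "Auto Recovery",
    4: "Normal",
    5: "In Error",
}


def parse_folder_states(wmic_output):
    """Parse tabular wmic output into (folder, group, state_name) tuples."""
    rows = [line.split() for line in wmic_output.splitlines() if line.strip()]
    header, data = rows[0], rows[1:]
    folder_idx = header.index("ReplicatedFolderName")
    group_idx = header.index("ReplicationGroupName")
    state_idx = header.index("State")
    parsed = []
    for fields in data:
        state = int(fields[state_idx])
        parsed.append((fields[folder_idx], fields[group_idx], STATE_NAMES.get(state, "Unknown")))
    return parsed


sample_output = """ReplicatedFolderName  ReplicationGroupName  State
rf1                   rg1                   5
"""
print(parse_folder_states(sample_output))
```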

Recovering DFSR from Content Freshness Protection

Important note: Before repairing the blocked replication, get a backup of the data on the affected server and its partners. Failure to do so will tempt Murphy's Law to disastrous new heights. Understand that by following the steps below, any DFSR data that was on this server and never replicated will be moved to PreExisting and/or ConflictAndDeleted - this server goes through non-authoritative sync again and loses all conflicts with other DFSR servers. You have been warned!!!

Also, whatever is being done to stop replication from working needs to be ironed out - whether it is leaving the service off for months on end or not having any connections. Otherwise this is just going to happen again.

To get things back in order, do the following:

1. Start DFSMGMT.MSC on the affected server.

2. On any affected replication groups this server is a member of, select the computer on the Membership tab and "Disable" it.

3. Accept the warning prompt.

4. If the reason for replication never occurring was the schedule being set to "no replication" on the RG or RF, or no bi-directional connections being in place between servers, fix that situation now.

5. Force AD Replication and verify it has converged.

6. On the affected server, run:

DFSRDIAG.EXE POLLAD

7. Wait for the 4008 and 4114 events to be written to the DFSR event log, confirming that the replicated folder(s) are no longer being replicated.

8. In DFSMGMT.MSC, "Enable" the replication again on the affected replicated folders for that server.

9. Force AD replication and POLLAD again.

The server goes through non-authoritative initial sync, as if it were being set up for the first time. All matching data is unchanged and does not replicate. Any files on the server that do not exist on its authoritative partner are moved to the PreExisting folder. Any files on the server that have been changed locally are moved to the ConflictAndDeleted folder and the authoritative server's copy is replicated inbound.
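The three outcomes described above can be modeled with a small comparison, which may help when estimating how much local data will be set aside before you start. This is a simplified sketch, not how DFSR itself decides: it assumes each side is represented as a hypothetical mapping of relative path to content hash, and the function name is made up for illustration.

```python
def classify_after_resync(local_files, authoritative_files):
    """Predict what happens to each local file during non-authoritative sync.

    Both arguments map a relative path to a content hash (hypothetical inputs).
    """
    outcome = {}
    for path, local_hash in local_files.items():
        if path not in authoritative_files:
            # Exists only on the recovering server.
            outcome[path] = "moved to PreExisting"
        elif authoritative_files[path] == local_hash:
            # Matching data is left alone.
            outcome[path] = "unchanged, not replicated"
        else:
            # Changed locally, so it loses the conflict.
            outcome[path] = "moved to ConflictAndDeleted, partner copy replicated inbound"
    return outcome


local = {"a.txt": "h1", "b.txt": "h2-local", "c.txt": "h3"}
partner = {"a.txt": "h1", "b.txt": "h2"}
for path, result in classify_after_resync(local, partner).items():
    print(path, "->", result)
```

Anything reported as moved to PreExisting or ConflictAndDeleted is data worth checking in your backup before re-enabling the member.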

The Sum Up

Content Freshness protection is a good thing and putting it in place may someday save you some real pain. Trust me – we work cases here where Content Freshness being enabled would have stopped huge problems. All it takes is Windows Server 2008 or later, and a few moments of your time.

- Ned “Kool and the Gang” Pyle

  • Hi Ned. I am having a problem with DFSR.

    We have a main office & 3 branches. Very recently we had power issues at 2 of the branches. Power company problem, power was up & down for a couple of days. Down long enough to drain battery backups.

    All offices have their own WinSvr 08. Replication has been all set up and running fine for several months until the power problems. Now those 2 branches are not replicating. The only error that is showing up in the event viewers is on the main server and it is event 5002, 'The DFS Replication Service Encountered An Error Communicating With Partner Svr2 For Replication Group Main'.

    There are no problems with the network connections across the WAN links, as all other programs are running fine; for example, AD replicated just fine. I have restarted the services several times. I have also disabled the replication links for each of these offices, updated AD, then re-enabled the links, etc., just like the recovery described in the blog here. It will then reconnect, but about 30 minutes later it goes into the same 5002 error.

    The only thing I can think of to try that I haven't yet is to totally remove these 2 offices from the replication sets, and then set them up all over again. Seriously trying to avoid that step.

    Any help would be greatly appreciated.

    Thanks,

    Kenny

  • Hi,

    Recreating the replication group is unlikely to help you. You are likely having a network problem, but not in totality - specific kinds of RPC-aware systems (such as firewalls, intrusion protection, and other products) can make distinctions based on the RPC UID being used. They can also block specific ports, so that only one application will be affected but others will not. The 5002 error isn't enough; I need the extended error. It may say 'access denied' or 'security specific package error has occurred' or 'no more endpoints available' or other things.

    If that extended error you respond with doesn't give an obvious answer, please open a support case to have this investigated; it will require significant analysis, network captures, port examination, installed 3rd party examination, etc etc. Prepare to spend some time on this.

    Ned

  • Thanks for the quick reply Ned.

    The following are the 2 specific errors that show up in the event viewer on the main server (Ntserver). The remote server is named Lebanon.

    This is what comes up first after restarting:

    Log Name:      DFS Replication
    Source:        DFSR
    Date:          12/8/2009 10:36:08 AM
    Event ID:      5014
    Task Category: None
    Level:         Warning
    Keywords:      Classic
    User:          N/A
    Computer:      NTSERVER.csoac.local
    Description:
    The DFS Replication service is stopping communication with partner LEBANON for replication group CSOAC Main due to an error. The service will retry the connection periodically.
    Additional Information:
    Error: 1818 (The remote procedure call was cancelled.)
    Connection ID: 2E4A3296-EACF-4FF2-87AB-A8609C2AA5A0
    Replication Group ID: E900940C-9C99-4682-82A3-873DAE217177
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
     <System>
       <Provider Name="DFSR" />
       <EventID Qualifiers="32768">5014</EventID>
       <Level>3</Level>
       <Task>0</Task>
       <Keywords>0x80000000000000</Keywords>
       <TimeCreated SystemTime="2009-12-08T15:36:08.000Z" />
       <EventRecordID>7179</EventRecordID>
       <Channel>DFS Replication</Channel>
       <Computer>NTSERVER.csoac.local</Computer>
       <Security />
     </System>
     <EventData>
       <Data>2E4A3296-EACF-4FF2-87AB-A8609C2AA5A0</Data>
       <Data>LEBANON</Data>
       <Data>CSOAC Main</Data>
       <Data>1818</Data>
       <Data>The remote procedure call was cancelled.</Data>
       <Data>E900940C-9C99-4682-82A3-873DAE217177</Data>
     </EventData>
    </Event>

    Followed shortly by this one:

    Log Name:      DFS Replication
    Source:        DFSR
    Date:          12/8/2009 11:06:24 AM
    Event ID:      5002
    Task Category: None
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      NTSERVER.csoac.local
    Description:
    The DFS Replication service encountered an error communicating with partner LEBANON for replication group CSOAC Main.
    Partner DNS address: Lebanon.csoac.local
    Optional data if available:
    Partner WINS Address: Lebanon
    Partner IP Address: 192.168.5.20
    The service will retry the connection periodically.
    Additional Information:
    Error: 9032 (The connection is shutting down)
    Connection ID: 2E4A3296-EACF-4FF2-87AB-A8609C2AA5A0
    Replication Group ID: E900940C-9C99-4682-82A3-873DAE217177
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
     <System>
       <Provider Name="DFSR" />
       <EventID Qualifiers="49152">5002</EventID>
       <Level>2</Level>
       <Task>0</Task>
       <Keywords>0x80000000000000</Keywords>
       <TimeCreated SystemTime="2009-12-08T16:06:24.000Z" />
       <EventRecordID>7181</EventRecordID>
       <Channel>DFS Replication</Channel>
       <Computer>NTSERVER.csoac.local</Computer>
       <Security />
     </System>
     <EventData>
       <Data>2E4A3296-EACF-4FF2-87AB-A8609C2AA5A0</Data>
       <Data>LEBANON</Data>
       <Data>CSOAC Main</Data>
       <Data>Lebanon.csoac.local</Data>
       <Data>Lebanon</Data>
       <Data>192.168.5.20</Data>
       <Data>9032</Data>
       <Data>The connection is shutting down</Data>
       <Data>E900940C-9C99-4682-82A3-873DAE217177</Data>
     </EventData>
    </Event>

    As for firewalls, all we are running is the one built in to Svr08, and we have it opened up for traffic within the domain.

    We haven't changed anything, just the power problems preceded this issue.

    Thanks,

    Kenny

  • What anti-virus software?

  • Windows Live One Care, and it has been on the main server since it was built.

    Kenny

  • That doesn't mean it wasn't updated automagically and causing issues. We have found 2 other AV vendors in the past that could cause your symptoms.

    Please open a support case and have one of our engineers drill into this with you.

    Good luck,

    Ned

  • Ned, if that was the case, then wouldn't it also affect the 3rd branch? It has continued to replicate fine. It is just the 2 offices that had the power issues that are having the replication problem.

    Thanks,

    Kenny

    I can't troubleshoot in a vacuum, sorry - this needs a support case to analyze it. Your error is too generic to handle through this blog.

    The power failure is probably a red herring; it just caused a mass reboot that exposed a different change taking effect.

  • OK, thanks for your time.

    Kenny

  • This looks like a handy feature. Luckily, I haven't had such problems before, but from now on I'd rather keep CF turned on just in case.

    Thanks Ned!