What's Causing that DFSR Change Storm?


[This is another guest post from our pal Mark in Oz. Even if you don’t care about DFSR, I highly recommend this post; it teaches some very clever log analysis techniques, useful in a variety of troubleshooting scenarios – The Neditor]

Hi there! It’s Mark Renoden – Premier Field Engineer in Sydney, Australia – here again. Today I’m going to talk about an issue where a customer’s DFSR environment was losing file updates and they were seeing regular SCOM alerts about replication backlog. While I was on site working with them, I came up with a few creative ways to use the DFSR debug logs that led us to the root cause.

The problem at hand was that a large DFS replication backlog would accumulate from time to time. Finding the root cause meant understanding the trend in changes to files in the replica, and to do that we needed to use the debug logs as our data source and manipulate them until they told the story.

With the aid of some custom scripts and tools, and a lab environment, I’m going to simulate their experience and talk through the investigation.

Test Lab Setup

The test lab I’ve used for this post is pretty simple: I’ve got two file servers, DPS1 and DPS2 configured with a single replication group called RG1 that replicates C:\ReplicatedFolder between the servers. There are 100,000 files in C:\ReplicatedFolder.
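If you want to build a similar lab, the 100,000-file corpus is easy to generate with a short script. Here is an illustrative Python sketch (how the original lab was actually populated isn’t stated; the file-name shape mimics the file0000021380.txt entries seen later in the debug logs):

```python
import os

def make_test_files(root, count, size=1024):
    """Create `count` small text files under `root`, named like the
    fileNNNNNNNNNN.txt entries that appear in the DFSR debug logs."""
    os.makedirs(root, exist_ok=True)
    payload = "x" * size
    for i in range(count):
        name = "file{:010d}.txt".format(i)
        with open(os.path.join(root, name), "w") as f:
            f.write(payload)

# e.g. make_test_files(r"C:\ReplicatedFolder", 100000)
```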

Prepare for Battle

In Ned’s previous post Understanding DFSR Debug Logging (Part 1: Logging Levels, Log Format, GUID’s), the various options for debug log verbosity are discussed. For this scenario, the only one I’ll change is the number of debug log files. 1000 is generally a good number to choose for troubleshooting purposes:

[Screenshot: debug log settings, with the number of debug log files set to 1000]

Harvest the Data

After reproducing the symptoms, we want to harvest the logs from the server that has pending replication. While on site with the customer, I just copied DFSR*.log.gz from the %windir%\debug folder, but best practice is to stop the DFSR service, copy the logs, and then start the service again. This prevents log rollover while you harvest the logs.

After you copy the logs for investigation, they need to be decompressed. Use your favourite gzip-aware decompression utility for that.

Understand the Log Format

Before we can mine the debug logs for interesting information, we need to look at what we’re dealing with. Opening up one of the log files, I look for a change so I can understand the log format:

20120522 12:39:57.764 2840 USNC  2450 UsnConsumer::UpdateIdRecord LDB Updating ID Record:
+       fid                             0x2000000014429
+       usn                             0x693e0ef8
+       uidVisible                      1
+       filtered                        0
+       journalWrapped                  0
+       slowRecoverCheck                0
+       pendingTombstone                0
+       internalUpdate                  0
+       dirtyShutdownMismatch           0
+       meetInstallUpdate               0
+       meetReanimated                  0
+       recUpdateTime                   20120521 01:04:21.513 GMT
+       present                         1
+       nameConflict                    0
+       attributes                      0x20
+       ghostedHeader                   0
+       data                            0
+       gvsn                            {5442ADD7-04C7-486B-B665-2CB036997A67}-v937024
+       uid                             {5442ADD7-04C7-486B-B665-2CB036997A67}-v615973
+       parent                          {8A6CF487-2D5A-456C-A235-09F312D631C8}-v1
+       fence                           Default (3)
+       clockDecrementedInDirtyShutdown 0
+       clock                           20120522 02:39:57.764 GMT (0x1cd37c42d5a9268)
+       createTime                      20120516 00:41:05.011 GMT
+       csId                            {8A6CF487-2D5A-456C-A235-09F312D631C8}
+       hash                            00000000-00000000-00000000-00000000
+       similarity                      00000000-00000000-00000000-00000000
+       name                            file0000021380.txt
+      
20120522 12:39:59.326 2840 USNC  2453 UsnConsumer::UpdateIdRecord ID record updated from USN_RECORD:
+       USN_RECORD:
+       RecordLength:        96
+       MajorVersion:        2
+       MinorVersion:        0
+       FileRefNumber:       0x2000000014429
+       ParentFileRefNumber: 0x4000000004678
+       USN:                 0x693e0ef8
+       TimeStamp:           20120522 12:39:57.764 AUS Eastern Standard Time
+       Reason:              Close Security Change
+       SourceInfo:          0x0
+       SecurityId:          0x0
+       FileAttributes:      0x20
+       FileNameLength:      36
+       FileNameOffset:      60
+       FileName:            file0000021380.txt

What I can see here is a local database update followed by the USN record update that triggered it. If I can gather together all of the date stamps for USN record updates, perhaps I can profile the change behaviour on the file server…

[Screenshot: a findstr one-liner that pulls USN_RECORD: lines out of the debug logs and into USN.csv]

The command above finds every line in every log file that contains USN_RECORD: and then excludes lines that contain a + (thereby eliminating occurrences of + USN_RECORD: as seen in the log excerpt above). Finally, it directs that output into USN.csv.
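If findstr isn’t handy (say you copied the logs to another machine for analysis), the same filter can be sketched in Python. This is illustrative; the DFSR*.log pattern and USN.csv output name follow the description above:

```python
import glob

def extract_usn_lines(pattern="DFSR*.log"):
    """Keep lines that mention USN_RECORD: but are not '+' detail lines,
    roughly mirroring a findstr "USN_RECORD:" | findstr /v "+" pipeline."""
    hits = []
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as f:
            for line in f:
                if "USN_RECORD:" in line and "+" not in line:
                    hits.append(line.rstrip("\n"))
    return hits

# Direct the output into USN.csv, as in the post:
# with open("USN.csv", "w") as out:
#     out.write("\n".join(extract_usn_lines()))
```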

Let’s open our CSV file in Excel and see what we can do with it.

Graph the Data

I'd ideally like to graph the data that I've got, to make it easy to spot trends. The data I have right now isn't super-easy to work with, so I'm going to sanitize it a bit, and then make a Pivot Table and chart from the sanitized data.

Here is a single column of USN_RECORD: timestamps:

[Screenshot: a single column of USN_RECORD: timestamp lines in Excel]

I’d like to figure out the rate of change for files in the replicated folder, so I’ll use Excel’s Text to Columns feature. I’m using a fixed-width conversion: I split the timestamp at the minute boundary (so I can count changes per minute) and I split USN_RECORD: off the end of the line so that I have something to count:

[Screenshot: the Text to Columns wizard with fixed-width column breaks]

Now I’ve got columns like this:

[Screenshot: the resulting columns after the fixed-width conversion]

I delete the columns I don’t need (A and C). My result is a column of timestamps down to the minute and a column of identical values (column B) which I can count to understand the rate of change:

[Screenshot: a column of per-minute timestamps and a column of identical values]

To do this, I insert a pivot table. I simply select columns A and B and then choose PivotTable from the Insert menu in Excel.

[Screenshot: selecting columns A and B and inserting a PivotTable]

Now I configure my PivotTable Field List as follows:

[Screenshot: the PivotTable Field List configuration]

After configuring the PivotTable, it looks like this:

[Screenshot: the configured PivotTable showing counts per minute]

All that’s left for me to do is to click on one of the row labels and to select a chart from the Insert menu. The resulting chart tells us quite a lot:

[Screenshot: chart of changes per minute across the log period]

Here I can see that there are constant changes at roughly 230 per minute, and that for a two-hour window, the changes increase to about 1500 per minute.
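If you’d rather skip Excel, the per-minute counting that Text to Columns plus the PivotTable performs can be sketched directly in Python (illustrative; it assumes the harvested lines start with the usual YYYYMMDD HH:MM:SS.mmm timestamp):

```python
from collections import Counter

def changes_per_minute(lines):
    """Each harvested line starts with 'YYYYMMDD HH:MM:SS.mmm ...'.
    Truncate the timestamp to the minute and count occurrences."""
    counts = Counter()
    for line in lines:
        counts[line[:14]] += 1     # first 14 chars: 'YYYYMMDD HH:MM'
    return counts
```

Feeding it the lines gathered into USN.csv gives the same per-minute series that the chart plots.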

Conclusions so far

For the entire duration of the logs, a roughly consistent level of change was occurring. However, for a two-hour window, lots of change was occurring. There are two possibilities here: either the cause of change has become more aggressive during this time or this chart represents two different activities.

We need more investigation …

Back to the Debug Logs

I start by skimming a log that contains the timestamps from the two-hour window where we see many changes, and look at the USN record updates. Skimming through, I can see two different types of change:

20120523 10:54:41.249 2840 USNC  2453 UsnConsumer::UpdateIdRecord ID record updated from USN_RECORD:
+       USN_RECORD:
+       RecordLength:        96
+       MajorVersion:        2
+       MinorVersion:        0
+       FileRefNumber:       0x20000000175FA
+       ParentFileRefNumber: 0x4000000004678
+       USN:                 0xbf0ad430
+       TimeStamp:           20120523 10:54:39.827 AUS Eastern Standard Time
+       Reason:              Close Security Change
+       SourceInfo:          0x0
+       SecurityId:          0x0
+       FileAttributes:      0x20
+       FileNameLength:      36
+       FileNameOffset:      60
+       FileName:            file0000031231.txt

And:

20120523 10:54:41.249 2840 USNC  2085 UsnConsumer::UpdateUsnOnly USN-only update from USN_RECORD:
+    USN_RECORD:
+    RecordLength:        96
+    MajorVersion:        2
+    MinorVersion:        0
+    FileRefNumber:       0x2000000019AD4
+    ParentFileRefNumber: 0x4000000004678
+    USN:                 0xbf0ad4f0
+    TimeStamp:           20120523 10:54:39.843 AUS Eastern Standard Time
+    Reason:              Basic Info Change Close
+    SourceInfo:          0x0
+    SecurityId:          0x0
+    FileAttributes:      0x20
+    FileNameLength:      36
+    FileNameOffset:      60
+    FileName:            file0000038828.txt

Skimming a log that covers a timeframe with a low rate of change, I can only seem to find:

20120522 23:28:54.953 2840 USNC  2453 UsnConsumer::UpdateIdRecord ID record updated from USN_RECORD:
+    USN_RECORD:
+    RecordLength:        96
+    MajorVersion:        2
+    MinorVersion:        0
+    FileRefNumber:       0x2000000022440
+    ParentFileRefNumber: 0x4000000004678
+    USN:                 0x7e7f5188
+    TimeStamp:           20120522 23:28:52.984 AUS Eastern Standard Time
+    Reason:              Close Security Change
+    SourceInfo:          0x0
+    SecurityId:          0x0
+    FileAttributes:      0x20
+    FileNameLength:      36
+    FileNameOffset:      60
+    FileName:            file0000072204.txt

Now I have a theory – Basic Info Change Close events only occur during the two-hour window where there are many changes and there’s an underlying and ongoing security change the rest of the time. I can prove this if I extract the timestamps for Basic Info Change Close changes and similarly, extract timestamps for Close Security Change changes.

Looking back at the log entries, I can see I have a timestamp line followed by a series of lines that start with a +. I need to parse the log with something (I chose PowerShell) that takes note of the timestamp line and, when a Basic Info Change Close or Close Security Change follows soon after, returns that timestamp.

Here’s my PS script:

# Collect the timestamp line of every Basic Info Change Close update
$files = Get-ChildItem *.log

$processingBlock = $False
$usnBlock = $False

foreach ($file in $files)
{
    $content = Get-Content $file
    foreach ($line in $content)
    {
        # A line without '+' is a header line carrying the timestamp
        if (!($line.ToString().Contains("+")))
        {
            $outLine = $line
            $processingBlock = $True
        }
        # Within a block, watch for the start of a USN_RECORD
        if ($processingBlock)
        {
            if ($line.ToString().Contains("+    USN_RECORD:"))
            {
                $usnBlock = $True
            }
        }
        # When the record's Reason matches, emit the remembered header line
        if ($usnBlock)
        {
            if ($line.ToString().Contains("+    Reason:              Basic Info Change Close"))
            {
                $outLine.ToString()
                $processingBlock = $False
                $usnBlock = $False
            }
        }
    }
}

And the same script again, with the Reason string changed to Close Security Change:

$files = Get-ChildItem *.log

$processingBlock = $False
$usnBlock = $False

foreach ($file in $files)
{
    $content = Get-Content $file
    foreach ($line in $content)
    {
        if (!($line.ToString().Contains("+")))
        {
            $outLine = $line
            $processingBlock = $True
        }
        if ($processingBlock)
        {
            if ($line.ToString().Contains("+    USN_RECORD:"))
            {
                $usnBlock = $True
            }
        }
        if ($usnBlock)
        {
            if ($line.ToString().Contains("+    Reason:              Close Security Change"))
            {
                $outLine.ToString()
                $processingBlock = $False
                $usnBlock = $False
            }
        }
    }
}

I run each of these (they take a while) against the debug log files and then chart the results in exactly the same way as I’ve done above.
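As an aside, the two PowerShell scripts above differ only in the Reason string they match. The same logic can be parameterized; here is an equivalent sketch in Python, for illustration:

```python
import glob

def timestamps_for_reason(pattern, reason):
    """Remember the most recent non-'+' header line; when a '+' USN_RECORD
    block follows and its Reason line contains `reason`, emit the header."""
    out = []
    for path in sorted(glob.glob(pattern)):
        header = None
        in_usn = False
        with open(path, errors="replace") as f:
            for line in f:
                if "+" not in line:          # header line with the timestamp
                    header = line.rstrip("\n")
                    in_usn = False
                elif "USN_RECORD:" in line:  # start of a USN record block
                    in_usn = True
                elif in_usn and "Reason:" in line and reason in line:
                    out.append(header)       # matching change: keep timestamp
                    in_usn = False
    return out

# timestamps_for_reason("DFSR*.log", "Basic Info Change Close")
# timestamps_for_reason("DFSR*.log", "Close Security Change")
```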

[Screenshot: charting the script output in Excel]

First, Basic Info Change Close (look at the time range covered and number plotted):

[Screenshot: chart of Basic Info Change Close updates, confined to the two-hour window]

And Close Security Change, below:

[Screenshot: chart of Close Security Change updates, ongoing across the whole log range]

This confirms the theory – Basic Info Change Close takes place in the two hours where there’s a high rate of change and Close Security Change is ongoing.

Root Cause Discovery

If this is an ongoing pattern where the high rate of change occurs during the same two hours each day, I can capture both activities using Process Monitor.

Once I have a trace, it’s time to filter it and see what’s happening:

[Screenshot: Process Monitor filter set to Operation is SetBasicInformationFile, then Include]

Here I’ve reset the filter and added Operation is SetBasicInformationFile then Include. I chose SetBasicInformationFile because it looks like a good fit for the USN record updates labelled Basic Info Change Close. After clicking OK, my filtered trace has the answer…

[Screenshot: filtered trace showing the process issuing the SetBasicInformationFile operations]

As it turns out, the backup window matches nicely with the storm of Basic Info Change Close updates.

Clearly, this is my own little application replicating the behaviour, but in my customer’s case it was their backup application causing the change. They were able to talk to their vendor and configure their backup solution so that it wouldn’t manipulate file attributes during backups.

Now all we need to do is identify the source of Close Security Change updates. Once again, I reset the filter and look for an operation that sounds like a good match. SetSecurityFile looks good.

[Screenshot: Process Monitor filter set to Operation is SetSecurityFile]

What I found this time is that no entries showed up in Process Monitor:

[Screenshot: an empty Process Monitor results pane]

What explains this? Either I chose the wrong operation, or the filter is broken in some other way. I can’t see any other sensible operation values to filter with, so I’ll consider other options. Looking at the filter, I realize that perhaps System is responsible for the change, and right now Procmon filters that activity out by default. I remove the exclusion of System activity from my filter and see what happens:

[Screenshot: removing the default exclusion of System activity from the filter]

Aha! Now I’ve got something:

[Screenshot: SetSecurityFile operations performed by System]

Now I need to understand what System is doing with these files. I right click the path for one of these entries and select “Include C:\ReplicatedFolder\file…”:

[Screenshot: right-clicking the path and choosing the Include option]

I also need to remove the filter for SetSecurityFile:

[Screenshot: removing the SetSecurityFile filter]

In summary, I’m interested in everything that happened to file0000033459.txt:

[Screenshot: all operations against file0000033459.txt]

If I look at operations on the file that took place prior to SetSecurityFile, I can see a CreateFile operation. This is where System obtained a handle to the file. Looking at this entry, adding the Details column to Process Monitor, and examining the fine print, I find:

[Screenshot: CreateFile operation details showing the impersonated account CONTOSO\ACLingApp]

System is making this change in the context of the account CONTOSO\ACLingApp that just happens to be the service account of an application used to change permissions on resources in the environment.

Conclusion

The process I've described today is a good example of the need to Understand the System from my earlier post. The Event Logs, and even the debug logs, won’t always tell you the answer straight away. Know what you’re trying to achieve, know how to use the tools in your arsenal, and know how they can be made to produce the outcome you need.

Knowing what I know now, I might have found the root cause by starting with Process Monitor but there’s a chance I’d have missed Close Security Change updates (considering that System is excluded from Process Monitor by default). I may have also missed the Basic Info Change Close updates if the tracing interval wasn’t aligned with the backup window. By mining the debug logs, I was able to establish there were two separate behaviours and the appropriate times to gather Process Monitor logs.

- Mark “Spyglass” Renoden

  • Outstanding post!  Awesome job sleuthing, Mark, and very slick use of a number of different tools to pull the data out!!

  • I tried this, but I believe my results may have been skewed because the findstr output includes other lines beyond just "UsnConsumer::UpdateIdRecord ID record updated from USN_RECORD:", for example:

        "UsnConsumer::CreateNewRecord ID record created from USN_RECORD:"

        "UsnConsumer::ProcessUsnRecord Filtered USN_RECORD:"

        "UsnConsumer::TombstoneOrDelete ID record deleted from USN_RECORD:"

        "UsnConsumer::UpdateUsnOnly USN-only update from USN_RECORD:"

    How should the above lines be handled in terms of the analysis in this blog?  Should any of these also be counted as "changes", or should I do an additional findstr on the output to omit all lines except for lines containing "UpdateIdRecord"?

    Thanks!

  • Hi johnmsanz

    The findstr command used in the example gathers all the time stamps for any USN update (it’s a pretty rough approximation). This includes new files created in replication groups, deletions, file updates and changes to files that are filtered from replication. If you want to be more specific (for example, just file updates that originate on the server where log harvesting took place), you need to do something trickier, much like the PowerShell parsing scripts later in the post or the additional findstr operation you suggest.

    Gathering what you need starts with understanding the log file format in more detail and then crafting a parser that will isolate the time stamps for the events that are important to you.

    - Mark