Hello our names are Ray Zabilla and Rick Bergman. Ray and I are Dedicated Premier Field Engineers with Microsoft and work with the same customer. We were both away from our customer during the same week attending training, when we had an issue pop up that needed our help. We were pulled into a situation where we needed to help quickly find a solution to identify what or who was changing the Active Directory Universal Principal Name, UPN, value for our customer Contoso.
The blog post for this interesting issue is going to be discussed with the solution and details in a two part blog post. Part I will include the issue definition and approach to solving the problem and in Part II we will share the details and lessons learned.
Several months ago Contoso began a Migration to Office 365 and the design requirements required the use of the Active Directory “User Principal Name” attribute for authenticating to Office 365 with ADFS. Contoso design requires that the Active Directory UPN must match the Primary SMTP Address. Initially the UPN for the Contoso user accounts did not match the Primary SMTP address, and prior to the start of the migration they ran a script to update the UPN for all users to match the Primary SMTP address.
Shortly after the first pilot users’ mailboxes were migrated to Office 365, the pilot users’ UPN value began mysteriously reverting back to the original value. This change prevented any of the migrated users from logging into their mailbox in the cloud with their UPN and accessing their mail. We spent some time trying to identify the source of the changes but we weren’t successful. This included pulling reports from their enterprise audit collection system which collected the security logs from each DC. In the interest of continuing the pilot, the decision was made to just run a script to correct the UPNs as a scheduled job.
Now they wanted to start aggressively migrating mailboxes to Office 365, but put the project on hold because they couldn’t find what was changing the UPN value.
Some unknown process, running on some unknown computer, at some unknown time was changing the UPN on the Active Directory user accounts. Let’s stop and really think about the challenge statement.
Question: Where can a user object have its UPN valued changed?
Answer: On any writeable domain controller.
Question: How domain controllers are in Contoso’s environment and how many are writable?
Answer: There are more than 60 domain controllers in the environment all being writeable.
Question: What versions are the domain controller OS and what is their Forest Functional Level and Domain Functional Level?
Answer: The domain controllers are Windows Server 2003 X64 SP2, which means the FFL and DFL cannot be above 2003.
Question: How come you couldn’t get information out of the enterprise security log collection system?
Answer: We asked the team which ran the enterprise security log collection system to provide us a report of all the users who had their UPN values changed and we never received a report that contained a proper Event ID 642.
What do we do now? What is the approach for finding who is changing the UPN values?
We are going share the approach on what we did to find the mystery machine/process/account that is changing the UPN value in AD. While we needed to be aware of the all the technical considerations, we also needed to take into account the “political considerations” which we refer to as layers 8 & 9 of the OSI model.
Top priority, “political consideration,” was to keep the pilot moving forward and keep the pilot users working as expected. The initial approach used by the IT group at Contoso was to try to modify the ACL on the UPN attribute to limit the accounts that could modify the attribute. This approached worked to some extent but proved problematic for a number of reasons, most importantly because it did not prevent the attribute from being modified. The IT group at Contoso continued to investigate the source of the UPN change and had focused on their identity provisioning system as the likely culprit, but they were unable to correlate any of the UPN changes to any events in the provisioning system.
Since Contoso is running Windows Server 2003 R2 X64 Domain Controllers, we recommended they search the Security event log for Event ID 642 which indicates a successful “User Account Change”. The Event ID includes information that identifies the attribute which was changed and the “calling account” initiating the change. This means that each domain controller will have to be scanned for the Event ID 642, because you never know on which writable DC the change is going to be made.
Contoso uses an enterprise auditing and collection system so the logical thing to do was to use the tool to search for the 642 Event ID versus searching each DC independently. Contoso IT made an inquiry to their security auditing team to give us all event ID 642 from all DC’s in the environment from their enterprise collection system and we would search through them. This effort turned out to be unsuccessful since for some reason the archived logs did not contain all the data and they were only able to provide part of the data needed, which of course did not contain any of the specific UPN change events we were hoping to find.
At this point, Ray and I needed to come up with a better plan, because the approach we recommended seemed like a relatively straight forward approach which should theoretically identify the source of the change is relatively short order. However in reality this was not the case. Contoso has over 60 domain controllers in multiple sites worldwide. Further, Contoso’s security department required significant auditing which, given the size of the security logs, only provided about 15 minutes of data before the log file wrapped. Contoso also had a tool to archive the log files and while it did discover and few isolated UPN change events and the associated accounts making the change, they were unable to provide enough detail to identify the source of the mysterious UPN changes. So, how to identify an unknown process, using an unknown account, making a change to a UPN attribute, on any one of over 60 domain controllers at some unknown time?!
NOTE: On Windows 2003 Domain Controller, the Event ID 642 in the security event log indicates the new value of the attribute and it does not show failures. By default on a Windows 2008/R2 Domain Controller in the security event log, the event numbers have changed to Event ID 4738. For more information on Window Server 2008/R2 User Account Management event IDs, go to TechNet. http://technet.microsoft.com/en-us/library/dd772693(WS.10).aspx
Below is a sample from Windows Server 2003 R2
NOTE: All the following examples are from our lab for Office 365 testing.
Figure 1 – Windows Server 2003 Event ID 642
Figure 2 – Event ID 642 in Text form
One of the topics during Ray’s Active Directory training class that week was on Active Directory replication and that provided the inspiration that led Ray to call upon the AD tool of choice, REPADMIN. Without getting into a lot of detail about how Active Directory replication works, we will save that for another blog post, at a high level changes in Active Directory occur at the attribute level on an originating Domain Controller identified with a Unique Sequence Number (USN), corresponding version number and a date/time stamp. In order to keep an attribute change in Active Directory from replicating around in an endless loop the originating USN from the originating Domain Controller for the attribute is stored in the objects’ metadata.
The light bulb went on! After a little creative thinking and with an understanding of the Active Directory replication process it occurred to me the same attributes maintained by Active Directory to manage replication would provide us the answer to when the attribute was changed in the entire AD Forest and give us the originating domain controller for the change. Yup, REPADMIN was tool to do it.
Rick’s Comment – After Ray suggested using Repadmin /showobjmetatdata, I started drinking V-8 juice, because why didn’t I think of that? J
REPADMIN is one of the primary and most helpful tools for troubleshooting Active Directory replication. REPADMIN has quite a few options to display information about Active Directory replication status , some of which most of you are probably pretty familiar with such as “/showrepl” or “/replsummary” . However, for this problem we going to use one of the options you may not be so familiar with and that is the “/showobjmetadata” option.
According the help displayed from the command line the “/showobjmeta” option “Displays the replication metadata for a specified object stored in Active Directory, such as attribute ID, version number, originating and local Update Sequence Number (USN), and originating server's GUID and Date and Time stamp.” See the full help listing below in Figure 3.
Figure 3 - REPADMIN /showobjmeta help
We could use “REPADMIN /showobjmeta” to display the metadata for a user object where UPN had been changed. This would identify the originating domain controller as well as the date/time of the change. You can see from the output of the REPADMIN /showobjmeta in the figure below, you supply the DN of the AD user and the tool provides the information in the metadata. In this example, the UPN for user “test5455” was changed on DC02NA at 14:27:31 on Feb 3, 2012. Success!
Figure 4 - REPADMIN /showobjmeta output
First we needed more time to be able to get/pull the Event ID 642 from the DC where the change was made. Since all the domain controllers in this domain were running Windows Server2003 x64, the first thing the Contoso IT team did was increase the Security Event log size to 2GB. We had done some research and found there had been a few incidents reported on Server 2003 x64 domain controllers when the log files were set to the maximum 4GB so we recommend Contoso set the log file size to 2GB which should give us enough data to capture the 642 event but be well below the max size. After increasing the log file size some quick analysis found they were now had 3.5 – 4 hours before the Security event log wrapped
With the information provided by the REPADMIN /showobjmeta we would only have to search the Security log on the domain controller where the change was made to find the Event ID 642 for the object that had changed. We would also know the precise time when the changed occurred in AD so our search would be specific and limited. Once we found the 642 event in the appropriate Security log we would know the AD account that made the change and we would now have identified 4 of the 5 key variables (who, where, when, what), which would hopefully provide enough information to lead us to the process making the change.
In the example 642 event text listed below, you can see the change was made to test5455 on DC02NA which we knew from the metadata. But now we know the other key piece of missing information, the AD account initiating the change, which in this case is Administrator.
Figure 5 - REPADMIN /showobjmeta output
Armed with this knowledge, we created a couple of PowerShell scripts to identify user accounts where the UPN that had been set to an incorrect value. The scripts we created for finding the Event Id 642, will further discussed in the second post.
The Contoso IT team was working in parallel to our efforts by tracking down all of the applications, processes and scripts that were making changes to Active Directory. This was a good exercise because based on the research the Contoso IT team had already done, once they knew the AD account making the change, they were able to identify the offending process in about 15 minutes and get the migration back on track.
The culprit was found using the tools and solid data gathering approach. Who was the culprit? It was a script that someone had written over 7 years ago to keep the UPN in the same format as sAMAccountName@contoso.com. The script and the server the script ran on had been forgotten about when there was staff turnover. In fact, the server went through hardware upgrades and had been P2V’d, moved from a hardware server to a virtualized server.
There are a number of lessons learned in this post.
1. The tools and approach that can be used to help you discover what is making the change to the UPN values or other AD attributes.
2. Keeping track of what is accessing and making updates to your Active Directory. There should be some tool in which items are tracked that makes updates to Active Directory. The tool can be a SharePoint site, a component of your Enterprise Management tool or something as simple as the “Enterprise Spreadsheet”. Tracking this information would help drive good operational discipline in the organization.
3. The last and really important lesson is using a data driven trouble shooting approach. The key to be being able to solve difficult complex problems such as this one is using a systematic, careful, detailed, data driven troubleshooting methodology. Too many times we see customers struggling to solve problems by unstructured methods and processes from guessing to trying things without any logical process or results tracking which sometimes can lead to more problems.
The second part of the post will include details on the scripts we wrote, performance issues we ran into, and methodology we used working through the issues.
Part II - How to Track the Who, What, When and Where of Active Directory Attribute Changes
We hope you found this blog post helpful and is something you can use in your environment.
Ray Zabilla and Rick Bergman, Microsoft PFE
Great post. You can't beat a good detective story like this one, especially when someone else goes through it first.
No need to develop new solution. Consider Audit Collection Servces.
Maybe to your amazement, its from MS and part of System Center Ops Manager which can keep logs for months even.
You are right and we are well aware of SCOM ACS and its capabilities. The customer we were working with did not want to implement it, which left us with few options and a short timeline. Setting up a temporary ACS environemnt would have been our preference for finding the Event ID 642's.
Thanks for bringing it up.
Good for you the UPN was being changed. In case somebody had been "playing" with mail attribute, tracking the origin would be impossible using this method, because mail is not an audited attribute.
After reading the article I thought it might be done in a different way, using psexec+logparser, which in this case are pre-requisites. It would take longer for the first run, but after thatit should require considerably less time per DC because of checpoint file usage, so even with 60 DCs it wouldn't take much time.
Schedule it every 5-10 on one machine, share some folder to write data to (commented out in a script), and thats it!
Here you are:
netdom query dc|sort|find /v "The command completed successfully"|find /v "List of domain controllers with accounts in the domain" > tmpTest.txt
for /f "tokens=1 delims= " %%x in (tmpTest.txt) do (
echo logparser -o:CSV -i:evt -iCheckpoint:tmp_%%x.lpc "select computername,timewritten,message into tmp_%%x.txt from security where eventid=4738" > tmp_%%x.bat
copy /y tmp_%%x.bat \\%%x\c$\tmp_%%x.bat
psexec \\%%x -h -s -d -w c:\temp\ C:\tmp_%%x.bat
rem type tmp_%%x.txt | find /v "ComputerName,TimeWritten,Message" >> \\SERVER\share\central-change-log.txt
what will happen if usn value reaches 2^64??