Learn about Windows PowerShell
Summary: Learn how to use Windows PowerShell to easily remove duplicates from a CSV file.
Hey, Scripting Guy! I have a comma-separated value (CSV) file that comes from a Microsoft Excel spreadsheet. We use this CSV file to create new users automatically in Active Directory. The problem is that every once in a while, duplicates end up in the CSV file. As you can well imagine, this creates havoc and the import fails. I began to think about how I could ensure that the data in the CSV file is unique. I was thinking of the old-fashioned dictionary object trick I used back in the VBScript days, and I tried some things using the Windows PowerShell hashtable. The problem is it quickly got rather complicated. Next, I decided to try a similar thing using an array, and once again, things went pear-shaped. I looked on the Scripting Guys Script Repository and found a pretty cool VBScript that uses a recordset and ADO. I am guessing I could modify it, but I am hoping for something easier.
Microsoft Scripting Guy Ed Wilson here. Some things are simple, such as today’s date: 11-1-11. (That’s how we present it in the United States anyway.) In fact, this year has been one of simple dates. It began with 1-1-11, and in 10 days, it will be 11-11-11. Other things appear complicated, and then a sudden revelation shows simplicity and elegance of design.
One of the hardest things for people who formerly used VBScript, as they move to Windows PowerShell, is to quit thinking in terms of what they would have done in VBScript. The overriding principle in Windows PowerShell is that everything is an object. “Oh yeah,” you might say, “I know that.” But until you really know that, and until you make it fundamental to your approach, you will not actually get it. After you really know that everything is an object, you begin to look for solutions in a new and different way, one that was not possible in VBScript or some other scripting language.
A perfect case in point, JB, is your problem with needing to remove duplicates from a CSV file. First, if I am going to work with a CSV file, I need to import it. I then need to see which properties are available. To do this, I use the Import-CSV cmdlet and the Get-Member cmdlet. In the output that follows, I see four NoteProperty members that correspond to the column headers from the CSV file.
PS C:\> Import-Csv C:\fso\UsersConsolidated.csv | Get-Member
   TypeName: System.Management.Automation.PSCustomObject

Name        MemberType   Definition
----        ----------   ----------
Equals      Method       bool Equals(System.Object obj)
GetHashCode Method       int GetHashCode()
GetType     Method       type GetType()
ToString    Method       string ToString()
Fname       NoteProperty System.String Fname=Ed
Group       NoteProperty System.String Group=Engineering
Lname       NoteProperty System.String Lname=Banti
OU          NoteProperty System.String OU=Atlanta
I now need to decide which duplicates I want to remove. Obviously, I do not want to remove duplicate group names or OU names. I only want to remove duplicate first name and last name combinations. It is okay to have a duplicate first name, and it is okay to have a duplicate last name—just not a duplicate first name/last name combination. I therefore need to sort the output by the lname (last name) and fname (first name) columns. To do this, I use the Sort-Object cmdlet. This command is shown here (sort is actually an alias for Sort-Object):
Import-Csv C:\fso\UsersConsolidated.csv | sort lname,fname
The command to sort an imported CSV file and the associated output is shown in the following figure.
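Because the figure does not reproduce here, the effect of the sort can be sketched on a few invented rows. The names and values below are assumptions for illustration, not the actual contents of UsersConsolidated.csv:

```powershell
# Three sample rows shaped like the CSV; the values are invented for illustration
$users = @(
    [pscustomobject]@{ Fname = 'Teresa'; Lname = 'Wilson'; Group = 'HR';          OU = 'Charlotte' }
    [pscustomobject]@{ Fname = 'Ed';     Lname = 'Banti';  Group = 'Engineering'; OU = 'Atlanta'   }
    [pscustomobject]@{ Fname = 'Ed';     Lname = 'Banti';  Group = 'Engineering'; OU = 'Atlanta'   }
)

# Same sort as the command above: last name first, then first name
$sorted = $users | Sort-Object Lname, Fname
$sorted | Format-Table -AutoSize
```

The sort brings duplicate rows next to each other, which is what makes the -Unique switch in the next step effective.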
After I have the contents of the CSV file sorted, I use the -Unique switch to return only unique rows from the file:
Import-Csv C:\fso\UsersConsolidated.csv | sort lname,fname -Unique
The command and associated output are shown in the following figure.
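With the duplicates sorted into adjacency, -Unique keeps one row per lname/fname combination, because uniqueness is judged on the sort properties. A small sketch with invented rows:

```powershell
# Two identical Ed Banti rows plus one distinct row (values invented for illustration)
$rows = @(
    [pscustomobject]@{ Fname = 'Ed';     Lname = 'Banti';  Group = 'Engineering'; OU = 'Atlanta'   }
    [pscustomobject]@{ Fname = 'Ed';     Lname = 'Banti';  Group = 'Engineering'; OU = 'Atlanta'   }
    [pscustomobject]@{ Fname = 'Teresa'; Lname = 'Wilson'; Group = 'HR';          OU = 'Charlotte' }
)

# The duplicate Ed Banti row is dropped
$unique = $rows | Sort-Object Lname, Fname -Unique
$unique.Count   # 2
```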
JB, that is all there is to using the Sort-Object cmdlet and the Import-CSV cmdlet to remove duplicates from a CSV file. Join me tomorrow when we continue to explore the cool things that Windows PowerShell offers.
I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at firstname.lastname@example.org, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.
Ed Wilson, Microsoft Scripting Guy
Sort with -Unique is a good way to remove duplicate records.
BUT: in this case it will arbitrarily eliminate entries that are duplicates with respect to only two of the four columns, and the records that remain will carry whichever Group and OU values happened to survive. This behaviour may not be desired in all cases! So I think that removing completely duplicated records is OK, but removing records based on only a few columns may not be a perfect solution to the problem.
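Klaus's point can be demonstrated with two rows that share a name but differ in Group and OU (values invented for illustration): sorting uniquely on lname and fname keeps only one of them, and which Group/OU survives is not something the command lets you choose.

```powershell
# Same person's name, two different Group/OU combinations
$rows = @(
    [pscustomobject]@{ Fname = 'Ed'; Lname = 'Banti'; Group = 'Engineering'; OU = 'Atlanta' }
    [pscustomobject]@{ Fname = 'Ed'; Lname = 'Banti'; Group = 'Sales';       OU = 'Dallas'  }
)

# Only one row survives; the other row's Group and OU are silently discarded
$kept = $rows | Sort-Object Lname, Fname -Unique
$kept.Count   # 1
```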
@Klaus Schulte you are correct ... it might be possible to have a match on more than two of the criteria :-( It is possible to continue the sort across three or even all four properties ... In addition, with more random data, it could really become problematic. Testing on a copy of the actual data to be processed is definitely called for. Thank you for pointing this out.
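Extending the sort to all four columns, as suggested, means only rows that are identical in every column collapse. A sketch with invented rows:

```powershell
# One row differs in Group and OU; two rows are exact duplicates (values invented)
$rows = @(
    [pscustomobject]@{ Fname = 'Ed'; Lname = 'Banti'; Group = 'Engineering'; OU = 'Atlanta' }
    [pscustomobject]@{ Fname = 'Ed'; Lname = 'Banti'; Group = 'Sales';       OU = 'Dallas'  }
    [pscustomobject]@{ Fname = 'Ed'; Lname = 'Banti'; Group = 'Sales';       OU = 'Dallas'  }
)

# The Engineering/Atlanta row differs in two columns, so it is retained
$deduped = $rows | Sort-Object Lname, Fname, Group, OU -Unique
$deduped.Count   # 2
```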
I recently did this same process but took a different approach.
Upon export of the users, obtain each user's SID. This is the unique value that "JB" needed during the export to ensure that the entries are never duplicates.
Sort by SID and this scenario will not be possible.
Here is an example of what I used with the Quest ActiveRoles objects.
Sorry, but the webpage will probably mess up the formatting.
If ( (Get-PSSnapin -Name Quest.ActiveRoles.ADManagement -ErrorAction SilentlyContinue) -eq $null )
{
    # Load the Quest snap-in if it is not already loaded
    Add-PSSnapin Quest.ActiveRoles.ADManagement
}
Set-QADPSSnapinSettings -DefaultSizeLimit 0
Set-QADProgressPolicy -ShowProgress $false | Out-Null
$Fetchusers = Get-QADUser -Enabled -IncludedProperties sid, samaccountname, distinguishedname |
    Select-Object sid, samaccountname, distinguishedname
$myCol = @()
foreach ($users in $Fetchusers)
{
    $myUsers = New-Object System.Object
    $sid = ($users.sid.value).ToString()
    $myUsers | Add-Member -MemberType NoteProperty -Name Sid -Value $sid
    $myUsers | Add-Member -MemberType NoteProperty -Name SamAccountName -Value $users.samaccountname
    $myUsers | Add-Member -MemberType NoteProperty -Name DN -Value $users.distinguishedname
    $myCol += $myUsers
}
# $psUsers is assumed to hold the output path for the CSV file
$myCol | Export-Csv -NoTypeInformation -Path $psUsers
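Since a SID is unique per account, deduplicating a CSV exported this way then needs only that one column. A minimal sketch with invented rows (the SID, account name, and DN values are made up):

```powershell
# Two exact-duplicate rows keyed by the same (invented) SID
$rows = @(
    [pscustomobject]@{ Sid = 'S-1-5-21-1004336348-1177238915-682003330-1001'
                       SamAccountName = 'ebanti'
                       DN  = 'CN=Ed Banti,OU=Atlanta,DC=contoso,DC=com' }
    [pscustomobject]@{ Sid = 'S-1-5-21-1004336348-1177238915-682003330-1001'
                       SamAccountName = 'ebanti'
                       DN  = 'CN=Ed Banti,OU=Atlanta,DC=contoso,DC=com' }
)

# One row per SID survives
$bySid = $rows | Sort-Object Sid -Unique
$bySid.Count   # 1
```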
Those are not duplicates. They were two different locations and departments. Could be two different people.
I have a folder in my bookmarks loaded with duplicates.
I copied the folder as an HTML file and re-saved it as a CSV file.
For example: I have www.yahoo.com in my folder. I certainly don't need www.yahoo.com/weather.
So, I am trying to remove like dupes.
I followed your example, using the following code.
When I used the following code, only one line gets printed out.
Import-Csv C:\users\blrman\downloads\ja-py-doc\bookmarks1.csv | Get-Member
Import-Csv c:\users\blrman\downloads\ja-py-doc\bookmarks1.csv | Sort-Object
Import-Csv c:\users\blrman\downloads\ja-py-doc\bookmarks1.csv | Sort-Object -Unique
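One likely cause (an assumption, since the bookmarks file itself is not shown): Sort-Object was given no column to sort on, so the imported objects are not compared on their URL values. Naming the column usually fixes exact-duplicate removal, though note that -Unique only removes exact matches; it will not treat www.yahoo.com and www.yahoo.com/weather as duplicates of each other. The column name 'URL' below is an assumption; use whatever header Get-Member actually reports for bookmarks1.csv.

```powershell
# Sample bookmark rows (invented); the 'URL' column name is an assumption
$bookmarks = @(
    [pscustomobject]@{ URL = 'www.yahoo.com' }
    [pscustomobject]@{ URL = 'www.yahoo.com' }
    [pscustomobject]@{ URL = 'www.bing.com' }
)

# Naming the column dedupes on the URL values
$deduped = $bookmarks | Sort-Object URL -Unique
$deduped.Count   # 2
```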