Use PowerShell to Remove Duplicate Lines from a CSV File


Summary: Learn how to use Windows PowerShell to easily remove duplicates from a CSV file.

 

Hey, Scripting Guy! I have a comma-separated value (CSV) file that comes from a Microsoft Excel spreadsheet. We use this CSV file to create new users automatically in Active Directory. The problem is that every once in a while, duplicates end up in the CSV file. As you can well imagine, this creates havoc and the import fails. I began to think about how I could ensure that the data in the CSV file is unique. I was thinking of the old-fashioned dictionary object trick I used back in the VBScript days, and I tried some things using the Windows PowerShell hashtable. The problem is that it quickly got rather complicated. Next, I decided to try a similar thing using an array, and once again, things went pear-shaped. I looked on the Scripting Guys Script Repository and found a pretty cool VBScript that uses a recordset and ADO. I am guessing I could modify it, but I am hoping for something easier.

—JB

 

Hello JB,

Microsoft Scripting Guy Ed Wilson here. Some things are simple, such as today’s date: 11-1-11. (That’s how we present it in the United States anyway.) In fact, this year has been one of simple dates. It began with 1-1-11, and in 10 days, it will be 11-11-11. Other things appear complicated, and then a sudden revelation shows simplicity and elegance of design.

One of the hardest things for people moving from VBScript to Windows PowerShell is to quit thinking in terms of what they would have done in VBScript. The overriding principle in Windows PowerShell is that everything is an object. “Oh yeah,” you might say, “I know that.” But until you really know that, and until you make it fundamental to your approach, you will not actually get it. After you really know that everything is an object, you begin to look for solutions in a new and different way—one that was not possible in VBScript or some other scripting language.

A perfect case in point, JB, is your problem of needing to remove duplicates from a CSV file. First, if I am going to work with a CSV file, I need to import it. I then need to see which properties are available. To do this, I use the Import-Csv cmdlet and the Get-Member cmdlet. In the output that follows, I see four NoteProperty members that correspond to the column headers from the CSV file.

PS C:\> Import-Csv C:\fso\UsersConsolidated.csv | Get-Member

   TypeName: System.Management.Automation.PSCustomObject

Name        MemberType   Definition
----        ----------   ----------
Equals      Method       bool Equals(System.Object obj)
GetHashCode Method       int GetHashCode()
GetType     Method       type GetType()
ToString    Method       string ToString()
Fname       NoteProperty System.String Fname=Ed
Group       NoteProperty System.String Group=Engineering
Lname       NoteProperty System.String Lname=Banti
OU          NoteProperty System.String OU=Atlanta

I now need to decide which duplicates I want to remove. Obviously, I do not want to remove duplicate group names or OU names. I only want to remove duplicate first name and last name combinations. It is okay to have a duplicate first name, and it is okay to have a duplicate last name, but not a duplicate first name/last name combination. I therefore need to sort the output by the lname (last name) and fname (first name) columns. To do this, I use the Sort-Object cmdlet. This command is shown here (sort is an alias for Sort-Object):

Import-Csv C:\fso\UsersConsolidated.csv | sort lname,fname

The command to sort an imported CSV file and the associated output is shown in the following figure.

Image of command and associated output
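Before removing anything, it can be useful to see which first name/last name combinations actually occur more than once. The following is a minimal sketch using Group-Object; the sample rows are hypothetical stand-ins for the contents of UsersConsolidated.csv:

```powershell
# Hypothetical rows standing in for C:\fso\UsersConsolidated.csv
$rows = "Fname,Lname,OU,Group",
        "Ed,Banti,Atlanta,Engineering",
        "Ed,Banti,Charlotte,Marketing",
        "Kim,Akers,Atlanta,Engineering" | ConvertFrom-Csv

# Group on the Lname/Fname combination; any group with Count -gt 1 is a duplicate
$rows | Group-Object Lname, Fname |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { "{0} appears {1} times" -f $_.Name, $_.Count }
# e.g. "Banti, Ed appears 2 times"
```

This inspection step changes nothing; it only reports which combinations would be collapsed by a subsequent unique sort.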

After I have the contents of the CSV file sorted, I use the -Unique switch of Sort-Object to return only unique rows from the file:

Import-Csv C:\fso\UsersConsolidated.csv | sort lname,fname -Unique

The command and associated output are shown in the following figure.

Image of command and associated output
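To complete JB's workflow, the de-duplicated rows can be written back to disk with Export-Csv. This is a minimal round-trip sketch; the input path matches the article, and the output file name UsersUnique.csv is an assumption for illustration:

```powershell
# Read, de-duplicate on the Lname/Fname combination, and write a cleaned file.
# The output path UsersUnique.csv is hypothetical.
Import-Csv C:\fso\UsersConsolidated.csv |
    Sort-Object Lname, Fname -Unique |
    Export-Csv C:\fso\UsersUnique.csv -NoTypeInformation
```

The -NoTypeInformation switch keeps the "#TYPE" header line out of the output so the cleaned file has the same shape as the original.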

JB, that is all there is to using the Sort-Object cmdlet and the Import-CSV cmdlet to remove duplicates from a CSV file. Join me tomorrow when we continue to explore the cool things that Windows PowerShell offers.

I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.

Ed Wilson, Microsoft Scripting Guy

 

 

  • Hi Ed,

    sort unique is a good thing to remove duplicate records.

    BUT: when records are duplicates with respect to only two of the four criteria, as in this case, it will eliminate entries at random, and the records that are left over keep an arbitrary Group and OU property. This behaviour may not be desired in all cases! So I think that removing completely duplicated records is OK, but removing records based on only a few criteria may not result in a perfect solution to the problem.

    Klaus.

  • @Klaus Schulte, you are correct ... it might be possible to have a match on more than two of the criteria :-( It is possible to continue sorting on three or even all four properties ... In addition, with more random data, it could really become more problematic. Testing on a copy of the actual data to be processed is definitely called for. Thank you for pointing this out.
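    The suggestion to continue the sort across all four properties can be sketched as follows (a minimal illustration; the sample rows are hypothetical). Only the record that is fully duplicated is removed:

```powershell
# Hypothetical sample data; in practice this would come from Import-Csv
$rows = "Fname,Lname,OU,Group",
        "Ed,Banti,Atlanta,Engineering",
        "Ed,Banti,Atlanta,Engineering",
        "Ed,Banti,Charlotte,Marketing" | ConvertFrom-Csv

# Sorting on all four columns drops only rows that match in every property
$unique = @($rows | Sort-Object Lname, Fname, OU, Group -Unique)
$unique.Count   # 2: the exact duplicate is gone; the Charlotte row survives
```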

  • I recently did this same process but took a different approach.

    Upon export of the users, obtain each user's SID. This is the unique value that JB needed during the export to ensure that the entries are never duplicates.

    Sort by SID and this scenario will not be possible.

    Here is an example of what I used with the Quest ActiveRoles objects.

    Sorry, but the webpage will probably mess up the formatting.

    # Load the Quest ActiveRoles snap-in if it is not already loaded
    If ((Get-PSSnapin -Name Quest.ActiveRoles.ADManagement -ErrorAction SilentlyContinue) -eq $null)
    {
        Add-PSSnapin Quest.ActiveRoles.ADManagement
    }

    Set-QADPSSnapinSettings -DefaultSizeLimit 0
    Set-QADProgressPolicy -ShowProgress $false | Out-Null

    $psUsers = "c:\Scripts\users.csv"

    Function Get-Users
    {
        # Retrieve enabled users with the properties needed for the export
        $Fetchusers = Get-QADUser -Enabled -IncludedProperties sid, samaccountname, distinguishedname |
            Select-Object sid, samaccountname, distinguishedname
        $myCol = @()
        foreach ($user in $Fetchusers)
        {
            If ($user.distinguishedname.Contains("OU=OHP Users"))
            {
                # Build a custom object keyed on the SID, which is unique per user
                $myUser = New-Object System.Object
                $sid = ($user.sid.value).ToString()
                $myUser | Add-Member -MemberType NoteProperty -Name Sid -Value $sid
                $myUser | Add-Member -MemberType NoteProperty -Name SamAccountName -Value $user.samaccountname
                $myUser | Add-Member -MemberType NoteProperty -Name DN -Value $user.distinguishedname
                $myCol += $myUser
            }
        }
        $myCol | Export-Csv -NoTypeInformation -Path $psUsers
    }

    Get-Users