Use a PowerShell Hash Table to Simplify File Backups


Summary: Learn how to use a Windows PowerShell hash table to simplify backing up unique files.

 

Hey, Scripting Guy! Your series on hash tables has been pretty interesting. I am wondering if you have some practical uses for hash tables. Can you provide a few examples of how using a hash table would be useful?

—RE

 

Hello RE,

Microsoft Scripting Guy Ed Wilson here. This morning the Scripting Wife received an email message from one of the hotels where we had stayed during our last trip to Australia. They were advertising their winter specials. The temperature around here has hovered near 100 degrees Fahrenheit (37.7 degrees Celsius according to my Windows PowerShell conversion module). One of the things I love about Australia (besides Tim Tams, Lamingtons, great scuba diving, awesome scenery, and especially wonderful people) is the fact that when the weather is oppressively hot in the Deep South, it is winter down under. A quick flight to Brisbane and one has escaped the hot sticky weather of Charlotte, North Carolina, in July. 

RE, to answer your question directly, let me begin by detailing a scenario. Suppose there is a directory structure that contains a number of folders and files. Within any single folder, each file name is, of course, unique. Across the subfolders, however, there are duplicate file names. A sample file structure containing duplicate files in nested folders is shown in the following figure.

Image of sample file structure containing duplicate files in nested folders

If I need to flatten this directory structure (move it from several nested folders, some of which have duplicate files, to a single folder with no duplicates), there are a variety of approaches I can use. One is to use the graphical user interface of the Windows Explorer tool. The problem with this approach is that it is interactive, and it involves a lot of clicking through "Yes, I know there is an existing file and I want to write over that file" prompts. This makes things proceed rather slowly. In addition, I might want a job that I can schedule on a nightly basis for backup purposes. Writing a bit of Windows PowerShell code gives me the flexibility to solve the problem in multiple ways, and I can avoid the mouse clicks. A hash table will simplify the coding that is required.

If I use the name property (of the System.IO.FileInfo object) as the key for my hash table, and the fullname property (of the same object) as the value that is associated with the key, I can filter out all of the duplicate files. This filtering occurs because the key property of a hash table must be unique.
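To see why duplicate keys drop out, here is a minimal sketch (the file names and paths are hypothetical, chosen to mirror the sample structure):

```powershell
# A hash table rejects duplicate keys, which is what filters out duplicate file names.
$hash = @{}
$hash.Add('testfile1.txt', 'C:\hsgTest\testfile1.txt')

# Adding the same key again throws an error; the first entry is kept.
try {
    $hash.Add('testfile1.txt', 'C:\hsgTest\hsgtest2\testfile1.txt')
}
catch {
    Write-Host "Duplicate key rejected: $($_.Exception.Message)"
}

$hash['testfile1.txt']   # still the original path
```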

In the command that follows, I first create an empty hash table and store it in the $hash variable. I use a semicolon so that I can continue with another command on the same line. Next, I use dir to get a directory listing of my current directory (the C:\hsgtest folder is the working directory, as indicated by the Windows PowerShell prompt). I use the recurse switch to make the command work through all of the nested directories. The dir command is an alias for the Get-ChildItem cmdlet.

I pipe the results to the Where-Object cmdlet (? is an alias for Where-Object). Inside the script block (the braces), I use the ! operator (which means not) so that Where-Object returns only items that are not containers (in other words, files). I pipe the files to the ForEach-Object cmdlet (% is an alias for ForEach-Object). Inside the script block associated with ForEach-Object, I use the add method to add the name property and the fullname property from each fileinfo object to the hash table. The name of each file becomes the key, and the fullname (which is the full path to the file) becomes the value associated with that key. When the command runs, an error appears for each duplicate file name the command attempts to add to the hash table. This is expected, and it confirms that the command works properly. The complete command is shown here:

PS C:\hsgTest> $hash = @{}; dir -recurse | ? { !$_.psiscontainer} | % { $hash.add($_.name,$_.fullname) }
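If the stream of duplicate-key errors is distracting, a variant that tests ContainsKey before calling the add method produces the same hash table silently. This is a sketch of an alternative, not the article's original command:

```powershell
# Same consolidation, but skip duplicates explicitly instead of relying on Add throwing.
$hash = @{}
Get-ChildItem -Recurse |
    Where-Object { -not $_.PSIsContainer } |
    ForEach-Object {
        if (-not $hash.ContainsKey($_.Name)) {
            $hash.Add($_.Name, $_.FullName)
        }
    }
```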

To see the consolidated list of files, I inspect the contents of the $hash variable as shown here:

PS C:\hsgTest> $hash

 

Name                           Value

----                           -----

testfile30.txt                 C:\hsgTest\testfile30.txt

testfile27.txt                 C:\hsgTest\hsgtest2\testfile27.txt

testfile21.txt                 C:\hsgTest\hsgtest2\testfile21.txt

testfile20.txt                 C:\hsgTest\testfile20.txt

testfile25.txt                 C:\hsgTest\hsgtest2\testfile25.txt

testfile10.txt                 C:\hsgTest\testfile10.txt

testfile6.txt                  C:\hsgTest\testfile6.txt

testfile35.txt                 C:\hsgTest\hsgtest2\hsgTest3\testfile35.txt

testfile34.txt                 C:\hsgTest\hsgtest2\hsgTest3\testfile34.txt

testfile28.txt                 C:\hsgTest\testfile28.txt

testfile1.txt                  C:\hsgTest\testfile1.txt

testfile23.txt                 C:\hsgTest\hsgtest2\testfile23.txt

testfile2.txt                  C:\hsgTest\testfile2.txt

testfile29.txt                 C:\hsgTest\hsgtest2\testfile29.txt

testfile24.txt                 C:\hsgTest\testfile24.txt

testfile9.txt                  C:\hsgTest\testfile9.txt

testfile38.txt                 C:\hsgTest\hsgtest2\hsgTest3\testfile38.txt

testfile40.txt                 C:\hsgTest\hsgtest2\hsgTest3\testfile40.txt

testfile8.txt                  C:\hsgTest\testfile8.txt

testfile36.txt                 C:\hsgTest\hsgtest2\hsgTest3\testfile36.txt

testfile33.txt                 C:\hsgTest\hsgtest2\hsgTest3\testfile33.txt

testfile32.txt                 C:\hsgTest\hsgtest2\hsgTest3\testfile32.txt

testfile26.txt                 C:\hsgTest\testfile26.txt

testfile22.txt                 C:\hsgTest\testfile22.txt

testfile31.txt                 C:\hsgTest\hsgtest2\hsgTest3\testfile31.txt

testfile39.txt                 C:\hsgTest\hsgtest2\hsgTest3\testfile39.txt

testfile37.txt                 C:\hsgTest\hsgtest2\hsgTest3\testfile37.txt

testfile5.txt                  C:\hsgTest\testfile5.txt

testfile4.txt                  C:\hsgTest\testfile4.txt

testfile7.txt                  C:\hsgTest\testfile7.txt

testfile3.txt                  C:\hsgTest\testfile3.txt

Of course, I could simply use the Copy-Item cmdlet to flatten the hierarchy, but unfortunately, the nested folders still get copied. The Container switch parameter causes the Copy-Item cmdlet to duplicate the file structure, including any nested folders. By default, the Container switch parameter is TRUE, and the existing hierarchy is duplicated as seen in the following figure.

Image of Container switch parameter when set to TRUE duplicating existing hierarchy

Supplying a value of FALSE to a switch parameter requires the trick of using a colon followed by the $false value. The resulting command is shown here:

PS C:\hsgTest> Copy-Item -Path . -Destination C:\hsgBackup -Recurse -Container:$false

All of the files are copied into the root of the destination, but the two nested folders are copied as well. They are empty, but still present. This is shown in the following figure.

Image of nested folders empty but still present

One of the cool things about the Copy-Item cmdlet is that it accepts an array for the path parameter. The values property of the hash table stored in the $hash variable contains the full path to each unique file in the directory structure, so I can copy the unique files to the hsgbackup directory with a single Copy-Item command. The subexpression operator $() forces evaluation of the $hash.values property before the Copy-Item command executes.

PS C:\hsgTest> Copy-Item -Path $($hash.values) -Destination C:\hsgBackup

A quick look at the hsgbackup directory reveals that the copy proceeded as expected—no nested folders appear. The backup directory is shown in the following figure.

Image of backup directory

The complete command used to back up unique files from nested directories follows this paragraph. Refer to earlier portions of this article for a complete explanation of the commands and aliases. (To simplify the command syntax, I set my working directory to the directory I wanted to back up.)

$hash = @{}; dir -recurse | ? { !$_.psiscontainer} | % { $hash.add($_.name,$_.fullname) }

Copy-Item -Path $($hash.values) -Destination C:\hsgBackup
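Keep in mind that this approach treats files with the same name as duplicates; identical names do not guarantee identical contents. A sketch of a content-based variant follows, keyed on an MD5 hash of each file instead of its name. It assumes the Get-FileHash cmdlet, which was not available until Windows PowerShell 4.0, and note that two different files that happen to share a name would still collide in the flat destination (renaming on copy would be needed, which is beyond this sketch):

```powershell
# Key on file *content* instead of name, so identically named but
# different files are both kept. Requires Get-FileHash (PowerShell 4.0+).
$byContent = @{}
Get-ChildItem -Recurse |
    Where-Object { -not $_.PSIsContainer } |
    ForEach-Object {
        $md5 = (Get-FileHash -Path $_.FullName -Algorithm MD5).Hash
        if (-not $byContent.ContainsKey($md5)) {
            $byContent.Add($md5, $_.FullName)
        }
    }
Copy-Item -Path $($byContent.Values) -Destination C:\hsgBackup
```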

RE, that is all there is to using hash tables in scripts. Hash Table Week will continue tomorrow when I will talk about using hash tables in conjunction with other Windows PowerShell commands.

 

I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.

Ed Wilson, Microsoft Scripting Guy

 

 

Leave a Comment
  • Ed, there is a typo here, Dir is an alias for Get-ChildItem, not for Copy-Item.

  • "The dir command is an alias for the Copy-Item cmdlet"....is the dir command not an alias for the Get-ChildItem cmdlet??

  • Sorry, just noticed 'Srikanth' has already commented on this.

  • Hello Ed,

    besides the typo :-) it is a very good example to use hash tables in daily work!

    Just one thing to mention:

    If you really want to use this approach in practice, you have to find out what "real" duplicates are, in general. Identical filenames do not necessarily imply identical files.

    You may have to use compare-object and rename-object before "flattening" the copy process, otherwise you may end up with a randomly picked copy of one of the identically named files, which you may not want :-)

    Klaus

  • Great article, I've been working on searching for duplicate files in my photo library. There are over 20,000 images, and I know there are some duplicates. The problem with using the name is there could be two images in two different folders that are identical but have different names. Basically I felt that if I could MD5 each image, then I could compare based on the MD5, that's when I realized that if there is something different the MD5's are different.

    That led me down a rabbit hole for a while until I found an article that showed how to MD5 the image part of the image. This worked great, but be warned it's a bit intensive on both the CPU and Disk.

    This site, http://www.out-web.net/?p=847, has the code that will load an image and MD5 it. The reason I'm mentioning this is, I would imagine that it could be modified to load the contents of any file, return the MD5, and then you can be guaranteed that each file you copied is in fact unique.

    I'm no code guru, but my Bing-fu was working well this past weekend!

  • Don't worry Ed, Klaus, Danny, Srikanth...

    We can fix that typo ourselves like this

    set-item -path alias:dir -value copy-item -force

  • @Srikanth, Danny London You are both correct, Dir is an alias for the Get-ChildItem cmdlet. I have corrected the text in the article to address this issue.

    @Klaus Schulte - you are absolutely correct, there are many more things to take into account when finding "real duplicates" than simple file names. Size, lastwritetime and things like that come into play.

    @Jeffery Patton that is a great idea for finding duplicate photo files. I know I have many duplicate images on my central storage unit.

    @JRV -- too funny. However, in general I do not recommend changing the meaning of common aliases -- it can lead to unpredictable problems for others who may use the system. But it can be done. The fact that it can be done is the reason I recommend to NEVER USE AN ALIAS in a script. I do have a friend who redefined the alias ls (the Unix compatibility alias for Get-ChildItem) to point to Set-Location (in his country and in his native language it would be common to say "location set" rather than "set location" ... he was always typing ls instead of sl).