Hey, Scripting Guy! I am trying to convert a number of text files that have been used to create old Web pages to a Web site that uses a new format. The problem is that the text files are not in a standard format. I need to identify the shortest lines in the text files so that I can go in and fix the files before moving everything over to the new Web site. The problem is that some of the text files have lines that begin with a space, and others have blank lines, and of course, there should really be none of that in these files. I would do it manually, but there are hundreds of these files and only one of me. A script would be really helpful at this point.

-- MT

Hello MT,

Microsoft Scripting Guy Ed Wilson here. Scripting is one of those skills that either you learn or you don’t. Actually, you could say the same thing for baseball, football, tennis, playing a guitar, underwater photography, and most things. Either you learn it or you don’t. Many skills you learn by doing. You learn to play guitar by practicing and playing the guitar, not by simply reading a book. You learn scripting by writing scripts, running scripts, modifying scripts, picking up new tricks, and incorporating them into your scripts. The great thing about the Scripting Games is that you get a chance to pick up new skills in a stress-free environment before you really need to use them (like learning how to put your scuba mask on underwater in a swimming pool instead of in 130 feet of water inside a wrecked ship in the Atlantic Ocean). Fortunately, we reviewed handling malformed text files during the 2009 Summer Scripting Games.

This week we will be reviewing some of the scripts that were submitted during the recently held 2009 Summer Scripting Games. The description of the 2009 Summer Scripting Games details all of the events. Each of the events was answered by a globally recognized expert in the field. There were some cool prizes, and winners were recognized from around the world. Additionally, because there was a lot going on, just as at the "real Olympics," an "if you get lost" page was created. Communication with participants was maintained via Twitter, Facebook, and a special forum. (The special forum has been taken down, but Twitter and Facebook are still used to communicate with Hey, Scripting Guy! fans.) We will be focusing on solutions that used Windows PowerShell. We have several good Hey, Scripting Guy! articles that introduce Windows PowerShell, and you will find them helpful.

The 2009 Summer Scripting Games Event 1 was the 100-meter dash. In the Advanced Event 1, you were required to read the file and determine the three shortest lines in the file. The PersonalInformationCards_ADV1.txt file is a real jumble. There are some blank lines, as well as some lines that start with blank spaces. All of these symptoms of a malformed text file cause problems in trying to manipulate the file. This is seen here:

Image of the PersonalInformationCards_ADV1.txt file

 

In looking through the 2009 Summer Scripting Games submissions over at PoshCode, I found a nice entry posted by googleuser. The submission begins by using gc, which is an alias for the Get-Content cmdlet. The Get-Content cmdlet reads the contents of the text file, and the results are piped to the Where-Object cmdlet (where is an alias). Inside the filter, the trim() method strips the spaces from the beginning and the end of each line. A blank line, or a line that contains only spaces, becomes an empty string, which Windows PowerShell treats as $false, so that line is dropped. Lines that pass the filter continue down the pipeline unchanged, including any leading spaces they had. Next, the results are piped to the Sort-Object cmdlet (sort is an alias), where the lines are sorted based upon the length property, and the Select-Object cmdlet (select is an alias) returns the first three lines. This is seen here:

ScriptingGamesAdvancedEvent1.ps1

gc '.\Personal Information Cards_ADV1.txt' |
where {$_.trim()} |
sort length |
select -first 3

To make the command a bit easier to read and to understand, you may want to look at the following, which is exactly the same command but instead uses the cmdlet names:

Get-Content -Path '.\Personal Information Cards_ADV1.txt' |
Where-Object { $_.trim() } |
Sort-Object -Property length |
Select-Object -First 3
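
To see why the Where-Object filter drops blank lines, it can help to try the same logic on a few strings in memory. The following is only a minimal sketch: the sample lines are made up for illustration and are not taken from the Games file, and the second pipeline is a variation of my own (not part of the googleuser entry) that trims each line before it is measured, so that leading spaces do not count toward the length.

```powershell
# Made-up sample lines standing in for the text file (an assumption for illustration).
$lines = 'Street', '', '   ', '  PPID', 'Claims'

# Trim() turns a blank or whitespace-only line into an empty string.
# An empty string converts to $false, so Where-Object drops that line.
# Lines that pass the filter are NOT modified; '  PPID' keeps its two leading spaces.
$kept = $lines | Where-Object { $_.Trim() }

# Variation: trim each line first, then filter, sort, and select.
# Here the length that is sorted on is the length of the trimmed text.
$shortest = $lines |
    ForEach-Object { $_.Trim() } |
    Where-Object { $_ } |
    Sort-Object -Property Length |
    Select-Object -First 3

$kept
$shortest
```

Which of the two behaviors you want depends on whether leading spaces should count when you decide which lines are the shortest.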

After the script has run, the following output is seen:

Image of the script output


LKH takes a slightly different approach, but the results, as seen here, are exactly the same:

PS C:\Users\edwils> C:\data\ScriptingGuys\HSG_8_17_09\ScriptingGamesAdvancedEvent1a.ps1
PPID
Claims
Street

The LKH script is exactly the same except for using a regular expression pattern instead of the trim() method. The pattern that LKH uses, ^\s*$, matches a line that contains nothing but white space from beginning to end, which includes a completely empty line. Because the script uses the -notmatch operator, a line that matches the pattern evaluates to false in the filter, and therefore it is not passed down the pipeline. The regular expression pattern can be tested by using the -match operator directly at the Windows PowerShell console. This is seen here:

PS C:\> " d" -match "^\s*$"
False
PS C:\> " " -match "^\s*$"
True
PS C:\> "      " -match "^\s*$"
True
PS C:\> "      der" -match "^\s*$"
False
PS C:\> "      der  " -match "^\s*$"
False
PS C:\>
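
If you would rather test several strings in one pass instead of typing each comparison, something like the following sketch should work. The sample strings are the same ones used in the console session, plus an empty string that I added to show that a completely empty line also matches; the variable names are only for illustration.

```powershell
# Sample strings: the ones from the console session, plus an empty string (my addition).
$samples = ' d', ' ', '      ', '      der', '      der  ', ''

# Collect one Boolean per sample: $true means the string is empty or white space only.
$results = $samples | ForEach-Object { $_ -match '^\s*$' }

# Show each sample in quotation marks next to its result.
for ($i = 0; $i -lt $samples.Count; $i++) {
    "'{0}' -match '^\s*$' : {1}" -f $samples[$i], $results[$i]
}
```

Only the strings that contain nothing but white space (or nothing at all) come back as True; any string with a visible character in it comes back as False.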

The LKH ScriptingGamesAdvancedEvent1a.ps1 script is seen here.

ScriptingGamesAdvancedEvent1a.ps1

Get-Content 'Personal Information Cards_ADV1.txt' `
|Where-Object {$_ -notmatch '^\s*$'} `
| Sort-Object {$_.Length} | Select-Object -First 3

One improvement to the LKH script would be to move the pipeline characters to the end of each line. When a line ends with a pipeline character, Windows PowerShell knows the command continues on the next line, so you avoid the need for the line continuation character (`) at the end of line one and line two. This is seen here:

Get-Content 'Personal Information Cards_ADV1.txt' |
Where-Object {$_ -notmatch '^\s*$'} |
Sort-Object -Property {$_.Length} |
Select-Object -First 3
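
MT mentioned having hundreds of files, so here is one way the same pipeline might be wrapped up and pointed at a whole folder. This is a rough sketch and not something that was submitted to the Games: the function name Get-ShortestLine, its parameters, and the folder path in the usage line are all hypothetical, and it assumes Windows PowerShell 3.0 or later (for the -File switch and [pscustomobject]).

```powershell
# Get-ShortestLine is a hypothetical helper name used only for this sketch.
function Get-ShortestLine {
    param(
        [Parameter(Mandatory = $true)]
        [string]$Path,      # folder that holds the text files

        [int]$Count = 3     # how many of the shortest lines to return per file
    )

    Get-ChildItem -Path $Path -Filter *.txt -File | ForEach-Object {
        $file = $_   # keep a reference to the file; $_ is reused in the inner pipeline

        Get-Content -Path $file.FullName |
            Where-Object { $_ -notmatch '^\s*$' } |   # skip empty and white-space-only lines
            Sort-Object -Property Length |
            Select-Object -First $Count |
            ForEach-Object {
                # Emit one object per line so the results can be sorted or exported later.
                [pscustomobject]@{
                    File   = $file.Name
                    Length = $_.Length
                    Line   = $_
                }
            }
    }
}
```

You could then call it with something like Get-ShortestLine -Path 'C:\data\WebPages' (the path is made up) and pipe the results to Format-Table or Export-Csv to see which files need attention.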

Well, MT, that is about all there is to cleaning up a text file with blank lines and spaces in it. LKH and googleuser, thanks for contributing to the 2009 Summer Scripting Games. It was interesting seeing the different approaches to a common problem.

If you want to be the first to know what is happening on the Script Center, follow us on Twitter or on Facebook. If you need assistance with a script, you can post questions to the Official Scripting Guys Forum, or send an e-mail to scripter@microsoft.com. The 2009 Summer Scripting Games wrap-up will continue tomorrow. Until then, peace.

Ed Wilson and Craig Liebendorfer, Scripting Guys