Hey, Scripting Guy! Question

Hey, Scripting Guy! I need to obtain a listing of unique words from a Microsoft Word document. I know that there is the Sort-Object cmdlet that can be used to retrieve unique items, and there is the Get-Content cmdlet that can read the text of a text file. However, the Get-Content cmdlet is not able to read a Microsoft Word document, and I do not think I can use the Sort-Object cmdlet to produce a unique listing of words.

-- EM

Hey, Scripting Guy! Answer

Hello EM,

Microsoft Scripting Guy Ed Wilson here. I am listening to the Bourbon Street Rag on my Zune, and daydreaming a bit about my last trip to New Orleans. The really good news is that TechEd 2010 will be held in New Orleans, and (drum roll please) the Microsoft Scripting Guys have already set aside the budget to be there! "Do you know what it means to miss New Orleans?" the song continues to amble. Now the upbeat sound of Van Halen is coming from my Zune. Quite the segue! It’s a shuffle kind of day.

I am having a great day today, and I have responded to several really cool questions sent to the scripter@microsoft.com e-mail address. EM, your question was really interesting, so I decided to write the GetUniqueWordsFromWord.ps1 script that is shown here.

GetUniqueWordsFromWord.ps1

# Path to the Word document to analyze
$document = "C:\fso\WhyUsePs2.docx"
# Start Word through COM and keep its window hidden
$app = New-Object -ComObject word.application
$app.Visible = $false
$doc = $app.Documents.Open($document)
$words = $doc.words
$outputObject = @()
"There are " + $words.count + " words in the document"
# Word collections are 1-based, so the loop starts at 1
For($i = 1 ; $i -le $words.count ; $i ++)
     {
       $object = New-Object -typeName PSObject
       $object |
       Add-Member -MemberType noteProperty -name word -value $words.item($i).text
       $outputObject += $object
     }
# Close the document and shut down Word
$doc.close()
$app.quit()
# Display each distinct word once, in sorted order
$outputObject | sort-object -property word -unique

Before jumping into the GetUniqueWordsFromWord.ps1 script, take a look at the Word document seen here:

Image of Word document with 231 words

As you can see, there are 231 words in the document. Many of these words appear only once, such as "after," but others, such as "the," are repeated. The GetUniqueWordsFromWord.ps1 script will display a list of all the unique words in the Microsoft Word document.

To display the unique words in the Microsoft Word document, the GetUniqueWordsFromWord.ps1 script begins by using the $document variable to hold the path to the Microsoft Word document that is to be analyzed. Next, the word.application COM object is used to create an instance of the application object. The application object is the main object that is used when working with the Microsoft Word automation model. The visible property is set to $false, which means the Microsoft Word application window will not be visible while the Windows PowerShell script is running. This section of the script is shown here:

$document = "C:\fso\WhyUsePs2.docx"
$app = New-Object -ComObject word.application
$app.Visible = $false

After the application object has been created, the documents property from the application object is used to obtain an instance of the documents collection object. The open method from the documents collection object is used to open the document that is specified in the $document variable. The open method from the documents collection object returns a document object that is stored in the $doc variable. This line of the script is shown here:

$doc = $app.Documents.Open($document)

The words property of the document object is used to return a words collection object that represents all of the words in the document. The words collection object is stored in the $words variable as seen here:

$words = $doc.words

After the words collection has been created, it is time to create an empty array that will be used to store the custom objects the script will create. It is also time to display a message on the Windows PowerShell console that indicates how many words are in the document. Please note that in most cases, the number reported by the count property of the words collection object will not match the word count that Microsoft Word shows at the bottom of the document window. This is because the words collection counts items such as punctuation marks and paragraph marks as separate words, so its count is usually higher. This section of the script is seen here:

$outputObject = @()
"There are " + $words.count + " words in the document"

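The gap between the two counts can be illustrated without starting Word. The following is only a rough sketch: the sample sentence is made up, and the two regular expressions merely approximate the two ways of counting. They are not the rules Microsoft Word actually uses.

```powershell
# Hypothetical sample text; it is not taken from the document in the article.
$sampleText = "After the storm, the sun came out."

# Status-bar style count: every run of non-whitespace characters is one word.
$statusBarStyle = [regex]::Matches($sampleText, '\S+')

# Words-collection style count (approximation): each run of letters or digits
# is one item, and each punctuation mark is counted as an item of its own.
$collectionStyle = [regex]::Matches($sampleText, '\w+|[^\w\s]')

"Whitespace-delimited words: " + $statusBarStyle.Count
"Words plus punctuation marks: " + $collectionStyle.Count
```

For this sentence the first count is 7 and the second is 9, because the comma and the period are each counted separately.
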
The for statement is used to set up a loop that walks through the words stored in the words collection object. The loop begins at 1 and continues as long as the value of the $i variable is less than or equal to the number of words in the collection. On each pass through the loop, the value of the $i variable is incremented by 1. This is seen here:

For($i = 1 ; $i -le $words.count ; $i ++)
     {

On each pass through the loop, a custom Windows PowerShell PSObject is created by using the New-Object cmdlet, and the returned PSObject is stored in the $object variable. This is shown here:

      $object = New-Object -typeName PSObject

The Add-Member cmdlet is used to add a noteProperty to the PSObject stored in the $object variable. The name of the noteProperty is word, and the value is the next word in the collection of words. The item method is used to retrieve the word from the words collection by index number. This is not a direct retrieval, however, because the item method returns a range object and not a word object. The range object does have a text property that is used either to get or to set the value of the text in the selected range. Because this range is a single word, the text property from the range object retrieves the next word from the words collection object. This is shown here:

       $object |
       Add-Member -MemberType noteProperty -name word -value $words.item($i).text
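
One thing to watch for, stated here as an assumption rather than something the script above handles: the text of a word range generally seems to include any spaces that follow the word, and punctuation marks come back as items of their own. If that holds for your document, "the " and "the" would be treated as two different values. A minimal sketch of trimming the text first, using made-up strings in place of real $words.item($i).text values:

```powershell
# Hypothetical stand-ins for $words.item($i).text values (trailing spaces assumed).
$rawText = "the ", "the", "storm ", ",", "the "

$trimmed = foreach ($text in $rawText) {
    # Remove leading and trailing white space.
    $clean = $text.Trim()
    # Keep only entries that contain at least one letter, digit, or underscore.
    if ($clean -match '\w') { $clean }
}

$uniqueRaw     = @($rawText | Sort-Object -Unique)
$uniqueTrimmed = @($trimmed | Sort-Object -Unique)

"Unique values before trimming: " + $uniqueRaw.Count
"Unique values after trimming: " + $uniqueTrimmed.Count
```

With these sample strings, trimming reduces the unique list from 4 entries to 2.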

After the word property has been added to the PSObject, the PSObject that is stored in the $object variable is added to the $outputObject array, as shown here:

       $outputObject += $object
     }
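
Appending with += creates a new array on every pass through the loop, which can become slow for long documents. A possible alternative, sketched here with a hypothetical $sampleWords array standing in for the words collection, is to let the loop emit the objects and capture them all in a single assignment:

```powershell
# Hypothetical stand-in for the text of each item in the words collection.
$sampleWords = "After", "the", "storm", "the", "sun"

# Capture everything the loop emits in one array.
# Note: Windows PowerShell arrays are 0-based, unlike the 1-based words collection.
$outputObject = @(
    for ($i = 0; $i -lt $sampleWords.Count; $i++) {
        New-Object -TypeName PSObject -Property @{ word = $sampleWords[$i] }
    }
)

"Collected " + $outputObject.Count + " objects"
```

The result is the same kind of array of PSObjects, each with a word property, so it can be piped to Sort-Object in the same way.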

The document object is closed by using the close method, and the Microsoft Word application is shut down by calling the quit method. This is shown here:

$doc.close()
$app.quit()

The array of objects stored in the $outputObject variable is piped to the Sort-Object cmdlet, where the objects are sorted on the word property and only unique words are displayed on the Windows PowerShell console. This line of code is shown here:

$outputObject | sort-object -property word -unique
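
It may help to know that Sort-Object compares strings without regard to case by default, so "The" and "the" count as the same word in the unique listing. A small sketch, using made-up sample objects shaped like the ones the script builds:

```powershell
# Hypothetical sample objects, each with a word property.
$sample = foreach ($text in "The", "the", "after", "THE", "after") {
    New-Object -TypeName PSObject -Property @{ word = $text }
}

# The three spellings of "the" collapse into one entry.
$unique = @($sample | Sort-Object -Property word -Unique)

"Unique words: " + $unique.Count
```

This displays a count of 2: one entry for "after" and one for "the" in whichever casing Sort-Object keeps.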

When the script is run, the output shown in the following image is displayed:

Image of output of the script

Well, EM, that is about all there is to retrieving unique words from a Microsoft Word document.

If you want to know exactly what we will be looking at tomorrow, follow us on Twitter or Facebook. If you have any questions, send e-mail to us at scripter@microsoft.com or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.

Ed Wilson and Craig Liebendorfer, Scripting Guys