How Can I Tally Up All the Words Found in a Text File?

Hey, Scripting Guy! Question

Hey, Scripting Guy! While browsing the Internet I found a script that showed me how to get a list of all the unique words in a text file. That’s useful, but I’d like to go one step further: how can I determine the number of times each of those words occurs?

-- TZ

Hey, Scripting Guy! Answer

Hey, TZ. As it turns out, that’s a lot easier than you might think. In fact, all you have to do is – hang on a second, we just got an email. And not just any email: based on the subject line – underpaid and not appreciated? – this must be a legitimate email that’s truly intended for the Scripting Guy who writes this column. Let’s see what it says:

I wanted to write and tlel you about a great new De!gree program, i tried it out and it worked!

I got my M!asters in 2 weeks ;]. Call the folowing number, this program really works great i was very surprised!

Now that’s a good deal. Although you might find this hard to believe, the Scripting Guy who writes this column already has a Masters degree, and from the University of Washington to boot. (Further proof that the value of a college education is highly overrated.) But the Scripting Guy who writes this column didn’t get his degree in just two weeks; instead, it took him almost two years. Furthermore, in his degree program he had to know spelling and grammar along with everything else. This new program sounds way better! And while we aren’t positive that a M!asters degree is the same thing as a Masters degree, as long as it enables the Scripting Guy who writes this column to get the money and appreciation he deserves, well ….

Anyway, it looks like the Scripting Guy who writes this column will be going back to college, at least for two weeks anyway. That means he has a lot to do: buy some cinder blocks to build a bookcase; stock his cupboards with Top Ramen and boxed macaroni-and-cheese; and write home to his parents asking if they can send him some money. Oh: and show you how to tally up the words found in a text file. You know, by using a script like this one:

Const ForReading = 1

Set objDictionary = CreateObject("Scripting.Dictionary")

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("c:\scripts\test.txt", ForReading)

strText = objFile.ReadAll
objFile.Close

strText = Replace(strText, ",", " ")
strText = Replace(strText, ".", " ")
strText = Replace(strText, "!", " ")
strText = Replace(strText, "?", " ")
strText = Replace(strText, ">", " ")
strText = Replace(strText, "<", " ")
strText = Replace(strText, "&", " ")
strText = Replace(strText, "*", " ")
strText = Replace(strText, "=", " ")
strText = Replace(strText, vbCrLf, " ")

arrWords = Split(strText, " ")

For Each strWord in arrWords
    If Len(strWord) > 0 Then
        If objDictionary.Exists(strWord) Then
            objDictionary.Item(strWord) = objDictionary.Item(strWord) + 1
        Else
            objDictionary.Add strWord, 1
        End If
    End If
Next

colKeys = objDictionary.Keys

For Each strKey in colKeys
    Wscript.Echo strKey & " -- " & objDictionary.Item(strKey)
Next

In case any of you are thinking that the Scripting Guy who writes this column is too old and too out-of-touch to go back to college, we can set your mind at ease by pointing out that he cheated in order to finish today’s assignment: in particular, he copied an existing script from the Internet and then just modified it slightly to meet his needs. And even though he cheated, he still waited until the last possible minute to complete the assignment. If that doesn’t sound like a college student, well, we don’t know what does.

Let’s take a few minutes to discuss how this script works; that will be good practice when it comes time for the Scripting Guy who writes this column to defend his master’s thesis. (And yes, we are a little concerned about having just two weeks to complete 45 credits of coursework and write a master’s thesis. But no doubt this school knows what it’s doing.)

The script starts out by defining a constant named ForReading and setting the value to 1; we’ll need this constant when we open our text file. Next we create an instance of the Scripting.Dictionary object. Why? We’ll get to that in just a second. For now, let’s forget about the Dictionary object and focus on the next two lines of code, which create an instance of the Scripting.FileSystemObject and use the OpenTextFile method to open the file C:\Scripts\Test.txt:

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("c:\scripts\test.txt", ForReading)

Once the file is open we use the ReadAll method to read the entire file into memory and store it in a variable named strText:

strText = objFile.ReadAll

Now that the contents of Test.txt are stored in memory we call the Close method and close the file.
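One thing to watch out for: OpenTextFile fails if the file doesn’t exist, and ReadAll raises an "Input past end of file" error if the file is empty. A minimal sketch of a more defensive version might look like this. The strPath variable, the checks, and the messages are our own additions for illustration; they aren’t part of the script above.

```vbscript
Const ForReading = 1

Set objFSO = CreateObject("Scripting.FileSystemObject")
strPath = "c:\scripts\test.txt"   ' illustrative variable; same path the script uses

' Bail out early if the file is missing.
If Not objFSO.FileExists(strPath) Then
    Wscript.Echo "File not found: " & strPath
    Wscript.Quit 1
End If

Set objFile = objFSO.OpenTextFile(strPath, ForReading)

' AtEndOfStream is True right away for an empty file, so skip ReadAll in that case.
If objFile.AtEndOfStream Then
    strText = ""
Else
    strText = objFile.ReadAll
End If

objFile.Close
```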

Our next task is to identify the individual words in the file. In general that’s fairly easy; as you’ll see, we simply use the Split function to create an array from all the words in strText. And how will we identify the individual words? By splitting on the blank space (" "), acting under the assumption that all the words in the file are separated by blank spaces.

For the most part, that will work pretty well. However, there is a potential problem here. For example, consider this sample text file:

I saw the cat. The cat was black.

How many times does the word cat appear in this file? We’d agree: it appears twice. However, our script won’t agree; instead, the script sees two different words that happen to include the letters c-a-t:

cat.

cat

See the problem? It’s the period immediately following the first instance of cat. Because our script doesn’t know anything about punctuation (which definitely makes it a candidate for a M!asters degree) it doesn’t know to ignore the period at the end of the sentence. Punctuation – and carriage return-linefeeds – can create problems in this script. Therefore, we use a series of Replace commands to find these characters and replace them with blank spaces:

strText = Replace(strText, ",", " ")
strText = Replace(strText, ".", " ")
strText = Replace(strText, "!", " ")
strText = Replace(strText, "?", " ")
strText = Replace(strText, ">", " ")
strText = Replace(strText, "<", " ")
strText = Replace(strText, "&", " ")
strText = Replace(strText, "*", " ")
strText = Replace(strText, "=", " ")
strText = Replace(strText, vbCrLf, " ")

That turns our practice file into something that looks like this:

I saw the cat  The cat was black

In turn, our script now tells us that the word cat appears twice.

Note. For a somewhat more detailed discussion of this issue, see this Hey, Scripting Guy! column.
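If that long run of Replace calls looks a bit repetitive, one possible alternative is to keep the characters in an array and loop through them. This is only a sketch: the names arrSeparators and strSeparator are ours, and the list of characters is simply the one the script already uses.

```vbscript
strText = "I saw the cat. The cat was black."

' Characters to treat as word separators (the same list the script uses).
arrSeparators = Array(",", ".", "!", "?", ">", "<", "&", "*", "=", vbCrLf)

' Swap each separator for a blank space, one separator at a time.
For Each strSeparator In arrSeparators
    strText = Replace(strText, strSeparator, " ")
Next

Wscript.Echo strText
```

The advantage of this approach is that handling another character (a semicolon, say) means adding one item to the array rather than another line of code.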

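After we’ve cleaned up our text file, we then use the Split function to create an array consisting of the individual words found in the text file. For our simple example, that means the array arrWords will contain these words (along with a couple of empty items left over wherever two blank spaces sat side by side; we’ll deal with those in a moment):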

I
saw
the
cat
The
cat
was
black
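To see what Split really hands back when two blank spaces sit side by side, you could try a quick experiment like the one below. It’s a sketch using a hard-coded string in place of the file contents; the square brackets are there only to make empty items visible.

```vbscript
strText = "I saw the cat  The cat was black"   ' note the two spaces after the first "cat"
arrWords = Split(strText, " ")

' Brackets make any empty items visible.
For Each strWord In arrWords
    Wscript.Echo "[" & strWord & "] length: " & Len(strWord)
Next
```

The item that falls between the two adjacent spaces is an empty string with a length of 0, which is exactly the kind of item a tallying script should skip.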

And now it’s time to start tallying the number of times each individual word occurs. That’s what this block of code, and the Dictionary object, is for:

For Each strWord in arrWords
    If Len(strWord) > 0 Then
        If objDictionary.Exists(strWord) Then
            objDictionary.Item(strWord) = objDictionary.Item(strWord) + 1
        Else
            objDictionary.Add strWord, 1
        End If
    End If
Next

What are we doing here? Good question. The first thing we’re doing is setting up a For Each loop that will loop through all the items in the array; in other words, through all the words in the text file. For each word we first verify that the length (Len) is greater than 0 characters. (Why? Because wherever two blank spaces sit side by side, such as the spot where the period after the first cat used to be, Split hands back an empty string, and we don’t want to count those. See our previous column on this topic for details.) Assuming that the length is greater than 0 we then use the following line of code to see if the word in question already exists in our Dictionary:

If objDictionary.Exists(strWord) Then

Let’s assume that the word can’t be found in the Dictionary. In that case, we use the Add method to add the word as a new Dictionary Key. At the same time, we set the value of the corresponding Dictionary Item to 1:

objDictionary.Add strWord, 1

Why 1? Because, so far, we’ve found 1 occurrence of that particular word.

If the word already exists in the Dictionary we don’t try adding it a second time; that would cause an error. Instead, we simply increment the value of the Item property by 1:

objDictionary.Item(strWord) = objDictionary.Item(strWord) + 1

You probably don’t need us to tell you this, but if the Item was equal to 1 then, after we execute this line of code, the Item will be equal to 2. You probably also don’t need us to tell you why we chose to use the Dictionary object: unlike an array, a Dictionary makes it easy to 1) locate a specified key; and 2) determine whether a key already exists. (See this Sesame Script article for more on how the Dictionary object works.)

From there all we do is loop around and repeat the process with the next word in the array.
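If you’d like to try the tallying logic on its own, without a text file, here’s a small sketch that wraps the same loop in a function. The function name CountWords and the hard-coded array are our own inventions for illustration; the array stands in for what Split would return for the sample sentence, empty item included.

```vbscript
' Hypothetical helper: returns a Dictionary mapping each word to its count.
Function CountWords(arrWords)
    Dim objCounts, strWord
    Set objCounts = CreateObject("Scripting.Dictionary")

    For Each strWord In arrWords
        If Len(strWord) > 0 Then                 ' skip empty items
            If objCounts.Exists(strWord) Then
                objCounts.Item(strWord) = objCounts.Item(strWord) + 1
            Else
                objCounts.Add strWord, 1         ' first time we have seen this word
            End If
        End If
    Next

    Set CountWords = objCounts
End Function

Set objResult = CountWords(Array("I", "saw", "the", "cat", "", "The", "cat", "was", "black"))

Wscript.Echo "Unique words: " & objResult.Count
Wscript.Echo "cat: " & objResult.Item("cat")
```

For this sample the Dictionary ends up with 7 keys, and the count for cat is 2; the empty item never makes it in.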

After we’ve finished our For Each loop we use this block of code to report back all the Keys and Item values in the Dictionary:

colKeys = objDictionary.Keys

For Each strKey in colKeys
    Wscript.Echo strKey & " -- " & objDictionary.Item(strKey)
Next

That’s going to give us a report similar to this:

I -- 1
saw -- 1
the -- 1
cat -- 2
The -- 1
was -- 1
black -- 1
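Notice that the and The show up as two separate entries. That’s because a Dictionary compares keys in a case-sensitive fashion by default. If you’d rather lump them together, one option (a sketch, not something the script above does) is to set the Dictionary’s CompareMode property to vbTextCompare while the Dictionary is still empty:

```vbscript
Set objDictionary = CreateObject("Scripting.Dictionary")

' CompareMode must be set while the Dictionary is still empty.
objDictionary.CompareMode = vbTextCompare

objDictionary.Add "the", 1

' With text comparison, "The" matches the existing key "the".
If objDictionary.Exists("The") Then
    objDictionary.Item("The") = objDictionary.Item("The") + 1
End If

Wscript.Echo "the -- " & objDictionary.Item("the")
```

Here the count for the comes back as 2, and the key keeps whatever capitalization it had when it was first added.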

And yes, it would be nice if those words were sorted alphabetically, wouldn’t it? But that’s a task for another day.
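For anyone who can’t wait, here’s a rough sketch of one way to order the report by number of occurrences, highest first. VBScript has no built-in sort function, so this uses a simple bubble sort on the array of keys; the sample Dictionary entries and the variables arrKeys, i, j, and strTemp are made up for illustration.

```vbscript
Set objDictionary = CreateObject("Scripting.Dictionary")
objDictionary.Add "saw", 1
objDictionary.Add "cat", 2
objDictionary.Add "the", 3

arrKeys = objDictionary.Keys

' Simple bubble sort on the keys, comparing the counts stored in the Dictionary.
' Using < here puts the largest counts first; use > for ascending order.
For i = 0 To UBound(arrKeys) - 1
    For j = 0 To UBound(arrKeys) - 1 - i
        If objDictionary.Item(arrKeys(j)) < objDictionary.Item(arrKeys(j + 1)) Then
            strTemp = arrKeys(j)
            arrKeys(j) = arrKeys(j + 1)
            arrKeys(j + 1) = strTemp
        End If
    Next
Next

For Each strKey In arrKeys
    Wscript.Echo strKey & " -- " & objDictionary.Item(strKey)
Next
```

To sort alphabetically instead, compare the keys themselves rather than their counts; for example, swap two keys whenever StrComp(arrKeys(j), arrKeys(j + 1), vbTextCompare) is greater than 0. A bubble sort is fine for a modest word list, though it will slow down on very large files.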

As for going back to college, the Scripting Guy who writes this column is actually having second thoughts. Granted, the idea that you could get a Masters degree in two weeks is a bit suspicious; it’s even more suspicious that the telephone number provided in the email is an unlisted number. But the big problem is that, as near as he can tell, the alleged school has neither a football team nor a basketball team. No football team or basketball team? Then why even have a college?

Besides, there’s no need for him to waste two weeks of his life getting a Masters degree. After all, according to another email he just received the Scripting Guy who writes this column can make $50,000 a month while working from home; that works out to $600,000 a year. Sure, that would be a bit of a pay cut, but it might be worth giving up a little money for the chance to work from home.

  • Hi,

    Any idea how to sort based on occurrence, either ascending or descending?

    Thanks

    Prateek