Hey, Scripting Guy! I just read your article about getting Word documents statistics and would like to know if there is any way of getting the number of sentences and paragraphs per document.
-- RLP
Hey, RLP. To be perfectly honest, now that the Super Bowl is over the Scripting Guy who writes this column is more relieved than he is happy. The Scripting Guy who writes this column is not a New York Giants fan, and under normal circumstances he would never ever root for the Giants.
Note. But suppose alien invaders showed up on Earth, challenged the Giants to a football game, and said they would destroy the planet if the Giants lost; wouldn’t the Scripting Guy who writes this column root for the Giants if the very fate of the Earth was at stake?
You know what? That’s probably not going to happen, so maybe we shouldn’t even speculate on that.
As many of you know, however, this year’s Super Bowl was not played under normal circumstances; instead, the New England Patriots were on the brink of being declared the greatest football team – heck, the greatest sports team – heck, the greatest collection of human beings who ever lived or ever will live. The New England Patriots? Say it isn’t so! The truth is, the Scripting Guy who writes this column just couldn’t deal with that prospect. Therefore, with desperate times calling for desperate measures, he bit his lip, held his nose, and rooted for the New York Giants.
And, surprisingly enough, the Giants actually won. Which figures. The teams that the Scripting Guy who writes this column really, truly roots for – the Washington Huskies, the Seattle Mariners, the Seattle Seahawks – never seem to win. But the team he roots for merely as the lesser of two evils, well, they win.
But, still, anyone but the Patriots, right?
Well, no, not the Dallas Cowboys; no way. The Oakland Raiders? Heavens no! And, no, not the New York Jets; we don’t really like the Jets. And definitely not ….
At any rate, it was an exciting game, and a lot of fun to watch. As for the rest of the day’s festivities, the Scripting Guy who writes this column (as always) skipped the pregame shows; he made dinner during the halftime show (although he does like Tom Petty and the Heartbreakers); and he had a lot of trouble figuring out the meaning behind most of the ... innovative … commercials that were telecast during the game. In fact, if anyone can explain why watching a dog slurp from a water bowl for 30 seconds makes you want to rush out and buy Gatorade, well, drop us a line and let us know.
While we wait for that explanation we might as well see if we can solve RLP’s problem. (We have a feeling it’s going to be a while before anyone can come up with anything.) Before we show you any code, however, we need to note that today’s task is more difficult than you might think; in fact, depending on your point of view, it might be downright impossible. For example, how many sentences would you say are in the following paragraph:
The highlight of the evening was an appearance by Dr. Ken Myer.
Most people would say there’s just one sentence here. However, Microsoft Word is going to insist that there are two sentences. Why? Because in Word’s view a sentence consists of an ending punctuation mark (like a period, question mark, or exclamation mark) followed by a blank space or paragraph return. Consequently, Word thinks the preceding paragraph contains two sentences:
The highlight of the evening was an appearance by Dr.
Ken Myer.
To be honest, there’s no good way to work around this problem, at least not until computers fully understand English. That means that Word (or any custom regular expression you try to come up with) is almost always going to overestimate the number of sentences in a document. That’s something you’ll just have to learn to live with.
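To see the overcounting in action without even starting Word, here’s a minimal sketch that approximates the rule described above (ending punctuation followed by white space or the end of the text) with a regular expression. The function name CountSentences and the pattern are our own assumptions for illustration; this is not how Word itself does the counting, just an imitation of the rule:

```vbscript
' Hypothetical helper: imitates the "ending punctuation plus space" rule
' with a regular expression. This is an approximation, not Word's own logic.
Function CountSentences(strText)
    Dim objRegEx, colMatches
    Set objRegEx = New RegExp
    objRegEx.Global = True
    ' One or more ending marks followed by white space or the end of the text
    objRegEx.Pattern = "[.!?]+(\s|$)"
    Set colMatches = objRegEx.Execute(strText)
    CountSentences = colMatches.Count
End Function

' The period after "Dr" counts as a sentence break, so this reports 2
Wscript.Echo CountSentences("The highlight of the evening was an appearance by Dr. Ken Myer.")
```

Like Word, this sketch has no idea that Dr. is an abbreviation, which is exactly the problem.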
Note. One thing you could do is run a few tests using typical Word documents. You might find that, for your documents, Word consistently says there are 5% more sentences than there really are. In that case, you could add some code that automatically makes that adjustment when reporting back the number of sentences in a document. But, needless to say, that’s up to you.
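If you do decide to make that kind of adjustment, the arithmetic might look something like the following sketch. The function name AdjustSentenceCount is hypothetical, and the 5% rate is simply the example figure from the note; you’d need to measure the rate for your own documents:

```vbscript
' Hypothetical helper: scales a raw sentence count down by a measured overcount rate.
' dblOverRate is the fraction by which Word overestimates (0.05 means 5% too many).
Function AdjustSentenceCount(intRawCount, dblOverRate)
    ' If raw = actual * (1 + rate), then actual = raw / (1 + rate)
    AdjustSentenceCount = CLng(intRawCount / (1 + dblOverRate))
End Function

' 210 reported sentences with a 5% overcount works out to 200 actual sentences
Wscript.Echo "Adjusted sentences: " & AdjustSentenceCount(210, 0.05)
```

In a real script you would pass in objDoc.Sentences.Count rather than a hard-coded number.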
The point is, sentences can pose a bit of a problem. Paragraphs can also pose a problem, albeit a completely different one. For example, how many paragraphs do you see in the following selection, with the underscore indicating each time we pressed the ENTER key:
Paragraph 1._
_
Paragraph 2._
_
Paragraph 3._
As you might have guessed, the answer is this: it depends. If you use the ComputeStatistics method to calculate the number of paragraphs (like we did in our original article), Word will tell you that there are three paragraphs here. If you use the Paragraphs collection, however (which we’re going to use today) then Word will tell you that there are five paragraphs in this document. Why? Because in that case Word is simply counting the number of times you pressed the ENTER key. How did we insert a blank line between paragraphs? That’s right: we hit the ENTER key. In fact, we hit the ENTER key five times, which is why the Paragraphs collection contains five items. So which of these two values – three or five – is correct? That really depends both on you and on the nature of your documents.
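The difference between the two counts is easy to demonstrate with a plain string standing in for the document. In this sketch each vbCr represents one press of the ENTER key; the function names are hypothetical, and the code only imitates the two counting styles rather than calling Word:

```vbscript
' Hypothetical helper: counts every press of ENTER (vbCr), blank lines included.
Function CountEnterPresses(strText)
    ' Split yields one more piece than there are separators
    CountEnterPresses = UBound(Split(strText, vbCr))
End Function

' Hypothetical helper: counts only the pieces that actually contain text.
Function CountTextParagraphs(strText)
    Dim strPart, intCount
    intCount = 0
    For Each strPart In Split(strText, vbCr)
        If Len(Trim(strPart)) > 0 Then intCount = intCount + 1
    Next
    CountTextParagraphs = intCount
End Function

' Simulated document: three paragraphs with a blank line after the first two
strSample = "Paragraph 1." & vbCr & vbCr & "Paragraph 2." & vbCr & vbCr & "Paragraph 3." & vbCr

Wscript.Echo "Paragraphs (text-only): " & CountTextParagraphs(strSample)
Wscript.Echo "Paragraphs (including blank lines): " & CountEnterPresses(strSample)
```

For the sample string this reports 3 and 5, the same two answers described above.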
But you know what? There’s no reason why you can’t use both approaches in your script. In fact, why don’t we do just that? Why don’t we use both paragraph-counting methods in our script:
Const wdStatisticParagraphs = 4

Set objWord = CreateObject("Word.Application")
objWord.Visible = True
Set objDoc = objWord.Documents.Open("C:\Scripts\Test.doc")

Wscript.Echo "Paragraphs (text-only): " & objDoc.ComputeStatistics(wdStatisticParagraphs)
Wscript.Echo "Paragraphs (including blank lines): " & objDoc.Paragraphs.Count
Wscript.Echo "Sentences: " & objDoc.Sentences.Count
As you can see, we start things off by defining a constant named wdStatisticParagraphs and setting the value to 4; that tells Word which kind of statistic we want it to compute. After defining the constant we create an instance of the Word.Application object; set the Visible property to True (just so we can see our instance of Word on screen); and then use this line of code to open the document C:\Scripts\Test.doc:
Set objDoc = objWord.Documents.Open("C:\Scripts\Test.doc")
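Incidentally, wdStatisticParagraphs is just one member of Word’s WdStatistic enumeration; passing a different value to ComputeStatistics returns a different statistic. The following sketch lists a few of the other values and assumes, as above, that C:\Scripts\Test.doc exists. Closing the document and quitting Word at the end is our own addition and isn’t part of the script we’re discussing today:

```vbscript
' Values from Word's WdStatistic enumeration
Const wdStatisticWords = 0
Const wdStatisticLines = 1
Const wdStatisticPages = 2
Const wdStatisticCharacters = 3
Const wdStatisticParagraphs = 4
Const wdStatisticCharactersWithSpaces = 5

Const wdDoNotSaveChanges = 0

Set objWord = CreateObject("Word.Application")
Set objDoc = objWord.Documents.Open("C:\Scripts\Test.doc")

Wscript.Echo "Words: " & objDoc.ComputeStatistics(wdStatisticWords)
Wscript.Echo "Lines: " & objDoc.ComputeStatistics(wdStatisticLines)
Wscript.Echo "Pages: " & objDoc.ComputeStatistics(wdStatisticPages)
Wscript.Echo "Paragraphs: " & objDoc.ComputeStatistics(wdStatisticParagraphs)

' Assumption: we are done with the document, so close it without saving and exit Word
objDoc.Close wdDoNotSaveChanges
objWord.Quit
```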
To make it a little easier for you to follow along at home, here’s what Test.doc looks like:
The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
By the way, here’s a cool little Word trick. Create a new, blank document, type the following, and then press ENTER:
=rand()
When you do that, Word will add the preceding text into your document. (In Word 2007 the text will be different, but the function still works.) In other words, =rand() will add a bunch of sample text into your document, giving you a practice document that actually has some text in it. Would you like a document that has 8 paragraphs, and would you like each of those paragraphs to have 3 sentences in it? Then type the following and press ENTER:
=rand(8,3)
And you thought all we did was write scripts. The truth is, every now and then we actually know something that doesn’t involve scripting.
And yes, that usually is something about as important as knowing how to insert sample text into a Word document.
After we’ve opened our document we’re ready to calculate the number of paragraphs and sentences. To count only the paragraphs that actually contain text we use this line of code:
Wscript.Echo "Paragraphs (text-only): " & objDoc.ComputeStatistics(wdStatisticParagraphs)
In this case, the script is going to tell us that the document contains 3 paragraphs; that’s because we have three paragraphs that actually contain text. To count the number of times we hit the ENTER key, we simply report back the value of the Paragraphs collection’s Count property, a property that tells us the number of items in the collection:
Wscript.Echo "Paragraphs (including blank lines): " & objDoc.Paragraphs.Count
This time around Word will tell us that the document contains 4 paragraphs; that’s because the blank line following paragraph 3 is considered a paragraph. Finally, we can use the Count property of the Sentences collection to determine the number of sentences in the document:
Wscript.Echo "Sentences: " & objDoc.Sentences.Count
Because this is a very straightforward little document (i.e., it doesn’t contain abbreviations or any other misleading punctuation marks) Word correctly tells us that the document contains 15 sentences. Like we said, depending on the nature of your document Word won’t always be able to tell you exactly how many sentences there are. In this case, it hit the nail right on the head. In other cases ….
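If you’d like to check where Word thinks each sentence begins and ends in one of your own documents, one option is to echo back the items in the Sentences collection one at a time. Here’s a minimal sketch, again assuming the document is C:\Scripts\Test.doc; the numbering and the cleanup at the end are our own additions:

```vbscript
Const wdDoNotSaveChanges = 0

Set objWord = CreateObject("Word.Application")
Set objDoc = objWord.Documents.Open("C:\Scripts\Test.doc")

' Each item in the Sentences collection is a Range; its Text property holds the sentence
i = 0
For Each objSentence In objDoc.Sentences
    i = i + 1
    Wscript.Echo i & ": " & objSentence.Text
Next

' Assumption: nothing was changed, so close without saving and exit Word
objDoc.Close wdDoNotSaveChanges
objWord.Quit
```

A stray fragment in the output (such as one ending in “Dr.”) is a sign that Word has split a sentence in the wrong place.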
That’s about the best we can do, RLP. It’s far from perfect, but we don’t really know of any foolproof way to get at this information. But, remember, the important thing isn’t whether or not you can count the sentences in a Word document with 100% accuracy. The important thing is that the New England Patriots lost the Super Bowl.
We’ll just try to ignore the fact that, if the Patriots lost, that must mean that the Giants won.
Note. Of course, now that the Super Bowl is over there’s only one big game left: the 2008 Winter Scripting Games. Remember, the Games start on Friday, February 15th. Whatever you do, don’t miss them; after all, nobody wants to see the Giants win the Scripting Games, too.