Use PowerShell to Parse Email Message Headers—Part 1


Summary: Guest Blogger Thiyagu teaches how to use Windows PowerShell to parse and analyze email message headers.

 

Microsoft Scripting Guy Ed Wilson here. Thiyagarajan Parthiban is our guest blogger today with an interesting article about using Windows PowerShell to analyze Exchange email. First, though, let’s learn something about Thiyagu.

I am the founder of the Singapore PowerShell User Group. I am an Exchange administrator, and I have been scripting for more than seven years now. Before Windows PowerShell, I did most of my scripting in VBScript. With Windows PowerShell, I automate Exchange/Active Directory tasks, and I am also good at WMI, ADSI, and generating custom reports. I have developed custom applications in C# for automation. I love to automate things!

You will find me on my blog.

Take it away, Thiyagu!

 

Last year, a survey was conducted to figure out the number of email messages sent every day across the globe. It estimated that approximately 294 billion messages are sent per day, which works out to roughly 3.4 million messages per second. By the way, 90 percent of all email is either spam or viruses.

Every email message you send or receive carries a piece of information called a message header. There are different ways to view this message header, depending on which email client you are using. For example, here are instructions for getting the message header of an email message if you are using Outlook 2010. The following figure shows how a message header looks.

Image of a message header


This text information contains a lot of details about the message you have received. RFC 822 tells how to place information about an email message into this header.

For now, focus on the green box in the preceding figure. This section has information about how this message got to your inbox. An email message passes through several servers on its way to your inbox. That is our main focus in today’s post: we want to get this data parsed out of this messy text and present it in a nice little table so that you can understand what really happened with that email message.

Whenever you try to make sense of an email message header, read it from the bottom up. The highlighted section contains only four Received entries, but each entry is wrapped across several lines. Here is how it looks after removing the unwanted line breaks:

Image of message header with unwanted lines

See, it already looks a lot cleaner after you remove those extra line breaks. Take a look at the boxes and circles in the image above, and read it like this:

The message was received from server “Corp.red.com ([16.25.5.17])” by server “Singapore.red.com ([15.60.22.16])” with protocol “mapi id 14.01.0323.002” at time “Wed, 13 Jul 2011 18:50:16 +0800”. The +0800 offset is UTC+8, which matches Singapore, and the name of the receiving server, “Singapore.red.com”, tells you the same thing.

If that did not make much sense, it might help to visualize it like in the following figure.

Image of a hop from one server to another

Now, this server will send the message again to another server and so on. Each trip from server to server is called a hop. The hop chain continues until it finally reaches your inbox. At times, there might be a delay when going through one of these chains. Maybe a server was busy or there was too much load, which could cause delays to email being delivered to your inbox. In this example, we have four lines, so we have four hops for the email to reach your mailbox.
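As a rough illustration of counting hops, the sketch below counts the lines that start with Received: in a header string. The header text is a made-up two-hop sample: the first entry reuses the values quoted above, while the second entry (mail.example.com, 192.0.2.10, and its timestamp) is purely hypothetical.

```powershell
# Hypothetical two-hop header. Only the first entry uses values from the
# article; the second entry is a placeholder invented for illustration.
$header = @(
    'Received: from Corp.red.com ([16.25.5.17]) by Singapore.red.com ([15.60.22.16])'
    ' with mapi id 14.01.0323.002; Wed, 13 Jul 2011 18:50:16 +0800'
    'Received: from mail.example.com ([192.0.2.10]) by Corp.red.com ([16.25.5.17])'
    ' with ESMTP; Wed, 13 Jul 2011 18:50:10 +0800'
    'Subject: test'
) -join "`r`n"

# (?m) makes ^ match at the start of every line, so each hop is counted once
# even though its text is folded across several lines.
$hopCount = [regex]::Matches($header, '(?m)^Received:').Count
$hopCount   # 2
```

The newest hop sits at the top of the header, which is why the entries are read from the bottom up.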

Enough of theory. Let’s talk about Windows PowerShell, starting with another figure.

Image of patterns in message header

We have to extract the piece of information between the above-mentioned sections to form our objects. You can see from the preceding screenshot that it has a pattern. Luckily, this is where regular expressions come to the rescue. Read this great article by PowerShell MVP Tome about regular expressions and Windows PowerShell.

We need to get four pieces of information from each line:

  1. All text after Received: from, up to the word by. This is our Received From server information.
  2. All text after by, up to the word with. This is the server that receives the email from the server above.
  3. All text after with, up to the ; (semicolon) character. This is the protocol.
  4. The next 32 to 36 characters (whitespace or not) after the ; (semicolon).
    1. This data is the date. Here is a sample date:
      1.  Wed, 20 Jul 2011 22:28:16 -0700. This is the maximum length for a standard timestamp. Sometimes there is also a space or a new line around it, so I capture a few extra characters as a buffer and remove the unwanted data from this string later.

 Here is the regular expression pattern I came up with:

$regexFrom1 = 'Received: from([\s\S]*?)by([\s\S]*?)with([\s\S]*?);([(\s\S)*]{32,36})(?:\s\S*?)'

Can you believe that the above regular expression pattern can do all four of the things I listed above? If you are good at Windows PowerShell and still haven’t used regular expressions, you are missing an important weapon in your Windows PowerShell arsenal.
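To see the four captures in action, here is a minimal sketch that runs the pattern against a single hop. The sample line is assembled from the values quoted earlier in this post; its exact wrapping and the trailing line break are assumptions, not a copy of the real header.

```powershell
# The article's pattern, unchanged.
$regexFrom1 = 'Received: from([\s\S]*?)by([\s\S]*?)with([\s\S]*?);([(\s\S)*]{32,36})(?:\s\S*?)'

# One hop built from the values quoted in the post (layout is assumed).
$line = 'Received: from Corp.red.com ([16.25.5.17]) by Singapore.red.com ([15.60.22.16])' +
        ' with mapi id 14.01.0323.002; Wed, 13 Jul 2011 18:50:16 +0800' + "`r`n"

$m = [regex]::Match($line, $regexFrom1)

# Groups 1 to 4 hold the raw captures; Trim() removes the surrounding whitespace.
$from = $m.Groups[1].Value.Trim()
$by   = $m.Groups[2].Value.Trim()
$with = $m.Groups[3].Value.Trim()
$time = $m.Groups[4].Value.Trim()

$from   # Corp.red.com ([16.25.5.17])
$by     # Singapore.red.com ([15.60.22.16])
$with   # mapi id 14.01.0323.002
$time   # Wed, 13 Jul 2011 18:50:16 +0800
```

Note that the lazy captures stop at the first occurrence of by and with, so a server name that happens to contain one of those words would split in the wrong place.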

Note   Check out this webcast by Tome. It is a great introduction to regular expressions.

Because we do not know how the text is going to be laid out in the message header, it is best to read the whole file as one long string and work with that. Here is a technique to read a file into one big string.

$text = [System.IO.File]::OpenText("C:\Scripts\msg6.txt").ReadToEnd()
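An alternative worth considering is [System.IO.File]::ReadAllText, which also returns the whole file as one string and closes the file for you. The sketch below is only an illustration: the temp file name and its contents are made up so that the snippet runs on its own.

```powershell
# Hypothetical sample file written to the temp folder so the snippet is self-contained.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'header-sample.txt'
[System.IO.File]::WriteAllText($path, "Received: from a`r`n by b`r`nSubject: test`r`n")

# ReadAllText opens the file, returns its content as a single string, and closes it.
$text = [System.IO.File]::ReadAllText($path)

# Get-Content, by contrast, returns one string per line.
$lines = @(Get-Content -Path $path)

$text.GetType().Name   # String
$lines.Count           # 3

Remove-Item -Path $path
```

Working on a single string matters here because one Received entry can span several lines, and a line-by-line read would cut it into pieces.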

The $text variable now holds the same information as the first screenshot in this post. I wanted to write a function that would take this $text as input, process the string, package all the parsed data in an array of PSObjects, and return them. I used Select-String along with the regular expression pattern and iterated through all the matches I got.

Here is how I did that:

Function Process-ReceivedFrom
{
    Param($text)

    # Captures: 1 = sending server, 2 = receiving server, 3 = protocol, 4 = date plus buffer
    $regexFrom1 = 'Received: from([\s\S]*?)by([\s\S]*?)with([\s\S]*?);([(\s\S)*]{32,36})(?:\s\S*?)'
    $fromMatches = $text | Select-String -Pattern $regexFrom1 -AllMatches

    if ($fromMatches)
    {
        $rfArray = @()
        $fromMatches.Matches | foreach{
            $from = Clean-string $_.groups[1].value
            $by = Clean-string $_.groups[2].value
            $with = Clean-string $_.groups[3].value

            # Shorten the protocol to its name when it starts with SMTP or ESMTP
            Switch -wildcard ($with)
            {
                "SMTP*" {$with = "SMTP"}
                "ESMTP*" {$with = "ESMTP"}
                default{}
            }

            $time = Clean-string $_.groups[4].value

            # One object per hop
            $fromhash = @{
                ReceivedFromFrom = $from
                ReceivedFromBy = $by
                ReceivedFromWith = $with
                ReceivedFromTime = [Datetime]$time
            }
            $fromArray = New-Object -TypeName PSObject -Property $fromhash
            $rfArray += $fromArray
        }
        $rfArray
    }
    else
    {
        return $null
    }
}
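The function relies on a Clean-String helper that is not defined in the article itself; the author posted it later in the comments on this post. The sketch below reproduces that helper with comments added, followed by a made-up capture that shows what it does to a value folded across two lines.

```powershell
# Helper as posted by the author in the comments; the inline comments are added here.
Function Clean-String
{
    Param([string]$inputString)
    $inputString = $inputString.Trim()                # drop leading and trailing whitespace
    $inputString = $inputString.Replace("`r`n", "")   # remove line breaks inside the capture
    $inputString = $inputString.Replace("`t", " ")    # turn each tab into a single space
    $inputString
}

# A capture as it might look when the header line is folded (wrapping is made up).
$raw = " Singapore.red.com`r`n`t([15.60.22.16]) "
$clean = Clean-String $raw
$clean   # Singapore.red.com ([15.60.22.16])
```

Trim alone would not be enough in this case, because it only removes whitespace at the two ends of the string and leaves the line break in the middle untouched.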

To explain the regular expression a little bit:

'Received: from([\s\S]*?)by([\s\S]*?)with([\s\S]*?);([(\s\S)*]{32,36})(?:\s\S*?)'

 

Each parenthesized part of the pattern is captured into a numbered group, and you can access these groups through the Matches property of the result. The exception is the last one: in the world of regular expressions, “?:” marks a group that is not captured. Select-String returns its result as an object of this class: Microsoft.PowerShell.Commands.MatchInfo.
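Here is a small, self-contained sketch of how groups, the non-capturing “?:” group, and the MatchInfo object fit together. The pattern and input string are simplified stand-ins chosen for illustration, not the header pattern from this post.

```powershell
# Two capturing groups followed by one non-capturing group.
$pattern = '(\w+)=(\d+)(?:;)'
$result  = 'a=1;b=22;' | Select-String -Pattern $pattern -AllMatches

$result.GetType().FullName           # Microsoft.PowerShell.Commands.MatchInfo
$result.Matches.Count                # 2
$result.Matches[0].Groups.Count      # 3 (the whole match plus two captures)
$result.Matches[1].Groups[2].Value   # 22

# Build one object per match, the same way the function builds one object per hop.
$objects = $result.Matches | ForEach-Object {
    New-Object -TypeName PSObject -Property @{
        Name  = $_.Groups[1].Value
        Value = [int]$_.Groups[2].Value
    }
}
$objects[1].Name   # b
```

Groups[0] always holds the entire match, so the first capture is Groups[1]; the non-capturing group does not get a number at all.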

I just loop through the matches and then build a PSObject for each of the matches. Now, if I output the results of the function to a gridview, I see what is shown in the following figure:

Image of results of function in gridview 

Read the next part tomorrow, where I show how I put the pieces together to get delay information from different hops and then finally to build a GUI tool for this functionality.

 

Thiyagu, this is an excellent article. Thank you for sharing your time with us and for sharing your expertise with the Windows PowerShell community. I am really looking forward to part 2 tomorrow!

I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.

Ed Wilson, Microsoft Scripting Guy

 

 

Comments
  • Why not just load it into a MAPI MailItem? All properties will become objects and it is browsable.

  • There are serious bugs with the regex described in this article. Please see http://chrisjwarwick.wordpress.com/2011/08/18/regex-toolkit-prayer-based-parsing-bad-examples/
    Chris Warwick
    @cjwarwickps

  • Where can I get the Clean-String function?

  • @nicad49,
    Here is the Clean-String function:

    Function Clean-String
    {
        Param([string]$inputString)
        $inputString = $inputString.Trim()
        $inputString = $inputString.Replace("`r`n","")
        $inputString = $inputString.Replace("`t"," ")
        $inputString
    }

    Check the second part of this article:
    http://blogs.technet.com/b/heyscriptingguy/archive/2011/08/19/analyze-email-headers-with-powershell-part-2.aspx
    Complete code can be downloaded from here:
    http://gallery.technet.microsoft.com/scriptcenter/8c15881d-c10f-4309-9900-4ff0653987a5
    thanks
    Thiyagu

  • @jrv
    Thanks jrv, that is a nice idea.
    I haven't tried this approach, but it should be possible to do it the way you have mentioned as well.
    I will try to explore this approach.
    thanks
    Thiyagu

  • Why can't you just do a clean string like this?
    ("   hello`t world    `r`n").Trim()
    The string has everything but is cleaned in one step, plus the tab is expanded. Try it. That is the way strings work in the .NET Framework. No need to fuss and bother.

  • As promised, here is my PS1 converted VBS solution for getting email headers from an MSG file.
    http://www.designedsystemsonline.com/upload/Get-MessageFromFile.ps1.txt
    You can use the same basic code to extract every piece of a message saved in either MSG or EML format.

  • @jrv,
    Thanks jrv for sharing that script. If you haven't already uploaded it to the script repository, I suggest you upload it there; it might help someone.
    As for your Trim method, I posted a comment, but it still isn't showing up; maybe there is a delay. What I wanted to say was that Trim only trims the edges, and the captures from my regex have `r and `n inside the text.

  • @Chris Warwick
    I tried to answer the questions you posted on your blog, and I also have a few questions about your solution. Please see http://www.myexchangeworld.com/2011/08/regex-pattern-for-header-parser/

  • ("   hello`t world    `r`n   hello`t world    `r`n").Trim()
    Inside, outside, around and through. The above method always works.
    Can you explain why?

  • @jrv,
    The Trim method only removes the leading and trailing characters specified.
    http://msdn.microsoft.com/en-us/library/system.string.trim%28v=VS.100%29.aspx
    ("   hello`t world    `r`n   hello`t world    `r`n").Trim()
    In the above example, it won't remove the `r`n between the two hello worlds. You can export the output before and after Trim and open it in Notepad++, which shows the different new line and tab characters: just select View->Show Symbols->Show all characters. Since I need to remove the CrLf in between as well, I need to use the Replace method.
    If this is not clear, please let me know and I will try to explain more.

  • @thiyagu
    You got me. It looked good on paper, but as you are pointing out, it is not an array.
    This is:
    ("   hello`t world    `r`n   hello`t world    `r`n").split("`r`n",1)
    My bigger question is how are you reading input such that it does not produce an array? Using PoSH methods it will produce an array. Using FS primitives it will not.

  • @jrv,
    As mentioned in the article, I use this method to read the entire file into one big string.
    $text = [System.IO.File]::OpenText("C:\Scripts\msg6.txt").ReadToEnd()
    Since this is one big string, the captures I get using the regex can contain `r, `n, or tabs anywhere in them (at the start, in between, or at the end), so I simply have to replace them.

  • @thiyagu
    [string]::join('',(cat file.msg))
    No line breaks.

  • Hello Thiyagu,
    thanks for pointing out the power of regexps to parse text with only little structure!
    But I was really stuck when I came to the regexp including the pattern ([\s\S]*?) ... because \s is the opposite of \S by definition and the square brackets define a set of characters, I came to the conclusion that it is as good as using the dot wildcard instead! (@Chris Warwick: you are right, and thanks for detailing it!)
    Your pattern works even so, in that when it encounters the keywords "by", "with" and ";" it splits the string into 4 components, which are usually the parts you wanted to extract from the header.
    @jrv: agreed! Loading the email as a MAPI mailitem would result in objects where we could extract the desired components using dot notation to retrieve the properties! But as an example of using regexps to parse text ... this is useful anyway!
    Klaus.
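On the dot wildcard question raised in the last comment, one detail is worth checking: in .NET regular expressions the dot does not match a line feed by default, so [\s\S] and the dot only behave alike when single-line mode, (?s), is switched on. A minimal sketch with a made-up two-line string:

```powershell
# Two keywords separated by a line break, as in a folded header line.
$text = "by`r`nwith"

$dotDefault    = $text -match 'by.*with'        # dot stops at the line feed
$anyChar       = $text -match 'by[\s\S]*with'   # [\s\S] matches every character
$dotSingleLine = $text -match '(?s)by.*with'    # (?s) lets the dot match line feeds too

$dotDefault      # False
$anyChar         # True
$dotSingleLine   # True
```

Since the header is read as one string containing line breaks, this difference decides whether a capture can span a folded line.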