Expert Solution for 2011 Scripting Games Advanced Event 7: Use PowerShell and Regex to Get Twitter IDs from a Web Page

Summary: Microsoft Windows PowerShell MVP, Tome Tanasovski, uses regular expressions to get Twitter IDs from a web page while solving Advanced Event 7 of the 2011 Scripting Games.

Microsoft Scripting Guy, Ed Wilson, here. Tome Tanasovski is the expert commentator for Advanced Event 7.

Tome is a Windows engineer for a market-leading, global financial services firm in New York City. He is the founder and leader of the New York City PowerShell User group, a cofounder of the NYC Techstravaganza, a blogger, a speaker, and a regular contributor to the Windows PowerShell forum at Microsoft. He is a recipient of the MVP award for Windows PowerShell.
Tome's contact information:
Blog: Tome's Land of IT
Twitter: toenuff

Worked solution

Advanced Event 7 is my type of task. The core of this challenge is text parsing with a regular expression, and it presents an opportunity to show how Windows PowerShell shines with its flexibility to filter and convert data to different formats (like CSVs). The truth is, my first stab at this completed the requirements in only a few lines of code, but it would not have been a winner in that state.

Before I talk about the solution, I have to point out that if I were handed this requirement in the real world, there are a few things that I might have pushed back on. In my opinion, the best functions are those that keep things “PowerShell-able,” that is, they keep the pipeline going, they let you use existing cmdlets for filtering and selecting, and they are flexible. Unfortunately, this task has some rigid requirements. I would not normally incorporate something like Import-CSV into my script because I would rather leave that power in the hands of the person operating it. The same goes for the filtering and displaying of the data. Without any requirements handed to me, my personal approach would have been to create a single function that gathers the data from the web page and emits Windows PowerShell objects with a property for name and a property for Twitter. I would then expect others to take that and do things like:

Get-SQLSaturdayNetworking |Export-CSV sqlsaturday.csv

Import-CSV sqlsaturday.csv | where-object {$_.name -like '*Wilson*'}|select -ExpandProperty Twitter

In my opinion, the previous is the answer, and the answer is all about knowing how to use Windows PowerShell. I am fortunate that I work for a company with extremely heavy Windows PowerShell users who would expect functions from me to do the heavy parsing so that they can take it and be flexible with it, for example, send it to a database, perform calculations on the objects, and automate direct message tweets to the people who attended the event.

In other parts of the universe, I hear that users do not want to know the ins and outs of Windows PowerShell, and they would rather be given a script or a function with very easy-to-use and intuitive parameters. So, that’s what I set out to provide. I wanted to make sure that the core cmdlet was still flexible enough to return only objects, but also with enough ability to empower the end user without having to bog them down with Windows PowerShell syntax. I wanted to do all of this while meeting the requirements set out in the challenge.

Part One - The GREP - Get-SQLSaturdayNetworking

I decided to create this one function to grab the objects from the web or from a CSV. I also decided that my script should be able to handle optional Twitter and LinkedIn accounts. It only seemed appropriate that I do a little more than the requirements to make it something that was truly useful.

I tackled the web download first by creating a very powerful regular expression that would let me pull the username, Twitter, and LinkedIn accounts from each line of the HTML returned:

$regex = '<font size="3">\s*(?<name>.*?)\s*<a(.*?twitter.com/\W*(?<twitter>\w+))*(.*?linkedin.com/in/(?<linkedin>\w+))*'

One technique in the expression above that many people do not know about is named captures. By using ?<name> within my parenthesized captures, I can access them through the $matches variable as $matches['name'], $matches['twitter'], and $matches['linkedin'] instead of relying on the order in which the captures appear in the regex. The .NET regular expression engine supports this feature, so it is available to Windows PowerShell.
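As a minimal sketch of named captures, the snippet below runs a simplified pattern against a single made-up row. The sample markup, the names in it, and the shortened pattern are assumptions for illustration and are not taken from the actual page or script.

```powershell
# Hypothetical sample row; the real page markup may differ.
$line = '<font size="3">Jane Doe <a href="http://twitter.com/janedoe">Twitter</a>'

# (?<name>...) and (?<twitter>...) label the two captures.
$pattern = '<font size="3">\s*(?<name>.*?)\s*<a.*?twitter\.com/(?<twitter>\w+)'

if ($line -match $pattern) {
    # After a successful -match, $matches is keyed by the group names.
    $name    = $matches['name']
    $twitter = $matches['twitter']
    "$name tweets as @$twitter"
}
```

Reading the captures by name keeps the code working even if another group is later added in front of them.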

Another powerful thing in the regex is the use of parentheses () with .*? to group together the sections that may or may not be in each row. Following these parentheses with a * allowed me to capture the Twitter or LinkedIn accounts optionally, that is, only if they exist. This is shown here.

(.*?twitter.com/\W*(?<twitter>\w+))*
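To see the optional groups at work, the sketch below applies the full expression to two made-up rows, one with only a Twitter link and one with only a LinkedIn link. The rows and names are hypothetical; only the pattern comes from the article.

```powershell
$regex = '<font size="3">\s*(?<name>.*?)\s*<a(.*?twitter.com/\W*(?<twitter>\w+))*(.*?linkedin.com/in/(?<linkedin>\w+))*'

# Hypothetical rows: the first has a Twitter link only, the second LinkedIn only.
$rows = @(
    '<font size="3">Jane Doe <a href="http://twitter.com/#!/janedoe">Twitter</a>',
    '<font size="3">John Smith <a href="http://www.linkedin.com/in/jsmith">LinkedIn</a>'
)

$results = foreach ($row in $rows) {
    if ($row -match $regex) {
        # A group that did not take part in the match comes back empty.
        '{0}: twitter={1}; linkedin={2}' -f $matches['name'], $matches['twitter'], $matches['linkedin']
    }
}
$results
```

Both rows match even though each is missing one of the accounts, and the \W* in front of the Twitter capture skips over punctuation such as #!/ before the account name.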

Another regular expression worth calling out is the one I used to split the HTML content into individual lines. This regular expression, which lives in my tool belt, lets me split without knowing for sure whether the line terminator is `n or `r`n. Dealing with unknown text can be tricky because of silly things like this. Fortunately, the pattern '(?m)\s*$' handles it: \s* matches spaces and any type of newline character, (?m) makes the dollar sign ($) match at the end of each line instead of only at the end of the string, and the dollar sign itself anchors the match to the end of the line.
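A small sketch of that split on made-up text with mixed line endings follows. Depending on the input, the pieces that come back may be empty or may still carry a stray line feed, so this version also trims each piece and drops the empty ones. That cleanup step is an addition for the example and is not necessarily what the original script does.

```powershell
# Hypothetical text mixing `r`n and `n terminators, plus some trailing spaces.
$text = "line one`r`nline two`nline three   `r`nline four"

$lines = @(
    $text -split '(?m)\s*$' |
        ForEach-Object { $_.Trim() } |   # remove any leftover whitespace
        Where-Object { $_ }              # drop empty pieces
)

$lines.Count
$lines
```

Because the pattern only ever consumes whitespace, the text of each line survives the split intact regardless of which terminator the source used.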

If you are a PowerShell scripter, beginner or advanced, who has been leaning on Write-Host or on a lot of string concatenation, I hope that you take this next one home with you. I am sure you will see this technique repeated over and over by a lot of the guest commentators in the advanced track:

New-Object psobject -Property @{Name=$matches['name'];Twitter=$matches['twitter'];LinkedIn=$matches['linkedin']}

This code on a line by itself will push the object out of the function as a return value. My function returns a series of these, which makes the output a collection of objects that can easily be piped to other cmdlets like Export-CSV, Out-GridView, or into my second function.
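The pattern can be sketched with a stand-in function. Get-SamplePerson and its hard-coded data are hypothetical and replace the real download and parsing loop; the point is only how objects leave the function and flow into standard cmdlets.

```powershell
function Get-SamplePerson {
    # Hypothetical stand-in for the parsing loop; the data here is made up.
    $rows = @(
        @{ Name = 'Jane Doe';   Twitter = 'janedoe'; LinkedIn = $null },
        @{ Name = 'John Smith'; Twitter = $null;     LinkedIn = 'jsmith' }
    )
    foreach ($row in $rows) {
        # An expression on a line by itself is written to the output stream.
        New-Object psobject -Property $row
    }
}

# Because the function emits objects, the usual cmdlets work on its output.
$handles = Get-SamplePerson |
    Where-Object { $_.Name -like '*Doe*' } |
    Select-Object -ExpandProperty Twitter
$handles
```

The same output could just as easily go to Export-CSV or Out-GridView without any change to the function.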

Part Two - The End-User Function - Get-SQLSaturdayPerson

While the first function is easy to use (especially for someone who knows Windows PowerShell), I wanted to ensure that I created a single entry point that would do the filtering and the gathering of the data in one SLAM of an enter key. Get-SQLSaturdayPerson is that function.

Most of the magic in this cmdlet is in the parameters. The rest is just using a Like comparison to ensure that wildcard support exists.

I created three parameter sets that let me break down this function into web requests, CSV imports, and pipeline requests (or those that use the InputObject parameter). The first two call my first function and then pipe the contents back into the Get-SQLSaturdayPerson function. The third is the filter that can work on objects in the pipeline.

Notes of interest:

  • I used parameter validation with ValidateSet to restrict users to a few options for specific parameters.
  • I ensured that my InputObject parameter could come from the pipeline.
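The sketch below shows how these pieces can fit together in a much smaller function. Select-SamplePerson is a hypothetical name, it has two parameter sets instead of three, and the ValidateSet values are guesses; only the parameter names Filter, Type, and InputObject are borrowed from the article.

```powershell
function Select-SamplePerson {
    [CmdletBinding(DefaultParameterSetName = 'Pipeline')]
    param(
        [Parameter(Position = 0)]
        [string]$Filter = '*',

        # ValidateSet rejects any value outside the list (values assumed here).
        [ValidateSet('Name', 'Twitter')]
        [string]$Type = 'Name',

        [Parameter(ParameterSetName = 'Pipeline', ValueFromPipeline = $true)]
        [psobject]$InputObject,

        [Parameter(ParameterSetName = 'Csv', Mandatory = $true)]
        [string]$Path
    )
    process {
        if ($PSCmdlet.ParameterSetName -eq 'Csv') {
            # Load the objects, then feed them back through the pipeline set.
            Import-Csv $Path | Select-SamplePerson -Filter $Filter -Type $Type
        }
        elseif ($null -ne $InputObject -and $InputObject.$Type -like $Filter) {
            # -like provides the wildcard support.
            $InputObject
        }
    }
}

# Made-up data to pipe through the filter.
$people = @(
    New-Object psobject -Property @{ Name = 'Jane Doe';   Twitter = 'janedoe' }
    New-Object psobject -Property @{ Name = 'John Smith'; Twitter = 'jsmith' }
)
$found = $people | Select-SamplePerson -Filter '*Smith*'
$found.Twitter
```

Having the file-based set call back into the pipeline set keeps the filtering logic in one place, which is the same idea the article describes for the web and CSV cases.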

One final note before we get to the payoff: I personally use inline help to make sense of the functions I write. As I begin to write notes and rationalize what each parameter does, it can make things clearer. For example, I originally called the Filter parameter, Name, and the Type parameter, NameType. After a few minutes of trying to explain what these words meant it became clear that they were the wrong words to use. Another technique I try to use on occasion is to write the inline Help before I write the cmdlet. This is probably the best way to approach a new function, but it is not always possible. Regardless of when you decide to write your inline Help, the key point to take away is that the documentation of your function from the perspective of an end-user can help you write a better function.
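For anyone who has not written inline Help before, a minimal sketch of comment-based help follows. Get-SampleGreeting is a made-up function used only to show where the help block goes and which keywords it uses.

```powershell
function Get-SampleGreeting {
    <#
    .SYNOPSIS
        Returns a greeting for the given name.
    .PARAMETER Name
        The name of the person to greet. Defaults to 'World'.
    .EXAMPLE
        Get-SampleGreeting -Name 'Tome'
    #>
    param(
        [string]$Name = 'World'
    )
    "Hello, $Name"
}

# Get-Help reads the comment block above and formats it like cmdlet help.
Get-Help Get-SampleGreeting -Full
```

Writing the .PARAMETER text first is a quick way to find out whether a parameter name explains itself.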

Part Three - The Payoff

As you can see from the last three lines of the script, the usage is simple and meets the requirements, but if you dig a bit deeper and look at the Get-Help -Full output for the Get-SQLSaturdayPerson cmdlet, you will see that it can do a lot more. The cmdlet remains flexible by returning objects, but it also allows users to return strings of the data that they want to see by using the OutputType parameter. It lets you use the CSV file to find what you are looking for (as requested), but it also lets you pull the data down from the web on the fly.

Windows PowerShell is powerful, but unless your functions maintain that versatility you may be draining its batteries. At the same time, however, requirements must be met. Even someone like me, who was very skeptical about the requirements, can see that by approaching something rigid with PowerShell flexibility in mind you can create something that not only satisfies requirements, but becomes useful in all of the ways that make Windows PowerShell the greatest scripting language in the world (platform dependency aside).

The complete script can be found on the Scripting Guys Script Repository.

Thank you, Tome; that is a great write-up, and you offer a ton of great advice.

I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.

Ed Wilson, Microsoft Scripting Guy

  • Thanks Tome,
    this is a really advanced topic with an adequate solution!
    It is at least advanced enough for me because I don't understand some aspects of it :-)
    Anyway:
    I saw that a try-catch block is provided to protect the download ... great!
    I saw some regexps with named groups ... a wonderful feature!
    I saw that it worked!
    Though I had to comment out the call to powershell (on the first line) because the ISE did "hang" otherwise.
    Well, one last thing I could do is to provide the URL to the solution here:
    http://gallery.technet.microsoft.com/scriptcenter/7e1a1991-be49-4e64-8198-3166fb6ac536
    kind regards, Klaus

  • right behind ya, klaue. On my script every time I would export my array containing the username comma twittername to a csv it would just give me the length of teh objects and not the actual values. I had to send it to a text file, import the text file as a csv, then export that back out as a csv to get it to work correctly. Would you mind expanding on where I went wrong on that part?

  • Stupid spelling mistakes. Sorry, Klaus.

  • Thanks, Klaus, for adding the URL. I went back and added it to the appropriate place in the blog.

  • Chris,
    Looking at your entry: https://2011sg.poshcode.org/1676
    Your problem is working with strings instead of objects. Learn to use the following for everything - that includes output to screen, output to csv, etc. With time this will become second nature. Learning this technique is what turns a good scripter into a good PowerShell scripter.
    I personally started with this template, but there are other ways to do it - actually there are more efficient ways, but this is the best one to start with in my opinion.

    $objects = @()
    foreach ($number in (0..100)) { #Simulating a loop you would be doing for every line or something else
        $object = new-object psobject
        $object |Add-member noteproperty -name column1 -value $number
        $object |Add-member noteproperty -name column2 -value "blah$number"
        $object |Add-member noteproperty -name column3 -value ([int]$number/10)
        $objects += $object
    }
    #now you display with
    $objects
    #you turn to a csv with
    $objects |export-csv file.csv
    #you display to a grid with
    $objects |out-gridview

  • “The truth is, my first stab at this completed the requirements in only a few lines of code…”
    I’m curious and would like to see your terse solution. I got a 233 char one-liner that solves the event’s requirements, except for the CSV, which to me is excessive.
    Nice RegEx by the way, I’m also a RegEx enthusiast; mine is 42 char long.

  • Tome, thanks for the explanation. I attempted to create a custom object earlier but didn't understand what it was completely doing, but by 10 I understood it more, and your directed explanation wraps the steak in bacon (icing on the cake is too over-used). Creating custom objects is becoming a very important part of a self-imposed challenge "No GUI: PowerShell Only Day".