Expert Solution for 2011 Scripting Games Advanced Event 6: Use PowerShell to Get Twitter IDs from a Web Page


Summary: Microsoft Windows PowerShell senior software engineer, Lee Holmes, solves 2011 Scripting Games Advanced Event 6 and gets Twitter IDs from a web page.

Microsoft Scripting Guy, Ed Wilson, here. Today we have Lee Holmes as our expert commentator for Advanced Event 6.


Lee Holmes is a senior software engineer on the Microsoft Windows PowerShell team, and he has been an authoritative source of information about Windows PowerShell since its earliest betas. He is the author of the Windows PowerShell Cookbook, Windows PowerShell Pocket Reference, and the Windows PowerShell Quick Reference.
Lee’s contact information:
Blog: Precision Computing
Twitter: http://www.twitter.com/Lee_Holmes
LinkedIn: http://www.linkedin.com/pub/lee-holmes/1/709/383

Worked solution

While getting ready to attend a SQL Saturday event, you suddenly realize that this is the perfect opportunity to simultaneously flex your scripting and social networking muscles. Fortunately, the SQL Saturday site is so well organized that it has a page for everybody who is coming. Let’s figure out their Twitter user names.

When dealing with data from the wild internet, you generally have three options: web services, highly structured data feeds (such as RSS and ATOM), or the basic HTML normally intended for web browsers. For the first two, Windows PowerShell offers some excellent tools. For web services, the New-WebServiceProxy cmdlet lets you interact with the resource as though it were a regular .NET object, working with properties, calling methods, and more. For highly structured data feeds, Windows PowerShell’s [XML] type adapter makes quick work of the content they return.
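For example (both URLs below are hypothetical placeholders, not part of the event; substitute a real WSDL endpoint or feed address):

## Web services: New-WebServiceProxy builds a live .NET proxy from the WSDL
$service = New-WebServiceProxy -Uri "http://www.example.com/calculator.asmx?WSDL"
$service.Add(2, 3)

## Structured feeds: cast the downloaded RSS to [xml], then dot-navigate it
$wc = New-Object System.Net.WebClient
$feed = [xml] $wc.DownloadString("http://www.example.com/blog/rss.aspx")
$feed.rss.channel.item | Select-Object title, link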

In this case, the SQL Saturday networking page is just a simple web page. Let’s use the System.Net.WebClient class to download it and see what it contains:

$uri = "http://www.sqlsaturday.com/70/networking.aspx"
$wc = New-Object System.Net.WebClient
$htmlContent = $wc.DownloadString($uri)
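One caveat, which a commenter also raises below: on a corporate network, DownloadString can fail unless the WebClient is routed through an authenticated proxy. Here is a minimal sketch of that configuration (the proxy address is a hypothetical placeholder, and your network may require explicit credentials rather than your current logon):

## Route the WebClient through an authenticated proxy, and trap failures
$wc = New-Object System.Net.WebClient
$proxy = New-Object System.Net.WebProxy "http://proxy.example.com:8080"
$proxy.Credentials = [System.Net.CredentialCache]::DefaultCredentials
$wc.Proxy = $proxy
try { $htmlContent = $wc.DownloadString($uri) }
catch [System.Net.WebException] { Write-Error "Download failed: $_" }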

Sometimes, web pages are written in a form called XHTML, which is much more structured than regular HTML. When that is the case, you can use Windows PowerShell’s [XML] type adapter to work with the page’s content.
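For instance, if the page were well-formed XHTML, a simple cast would let us dot-navigate it (a toy document, not the event page):

$page = [xml] '<html><body><p>Hello, XHTML</p></body></html>'
$page.html.body.p      ## Returns: Hello, XHTML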

Unfortunately for us, the SQL Saturday page is not an example of this kind of page:

Image of code

That’s OK—mankind made it to the moon without the help of XML. Surely, we can extract data from a web page without it!

When we take a look at the contents of $htmlContent, we see that all of the links to Twitter accounts follow a pattern:

PS > $htmlContent
(…)
                <span id="ctl00_ContentPlaceHolder1_DataList1_ctl35_Label1"><font size="3">Ed Wilson
<a href="http://www.twitter.com/ScriptingGuys" class="noarrrow">(…)</span>
(…)

This is the kind of pattern that lends itself to the Select-String cmdlet. The Select-String cmdlet takes text (or files) as input, applies a Regular Expression to that content, and returns objects that represent the match. If you specify the AllMatches parameter, the Select-String cmdlet returns an object for each match that it finds in the content.

Regular Expressions are, at their heart, a finely-tuned language with the sole purpose of parsing text. Writing one is part art, part science. Although Windows PowerShell does an amazing job at shielding you from the crazy world of text parsing, Regular Expressions become invaluable for those times when text is all you’ve got.
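Here is a quick illustration of AllMatches on a trivial string (an aside, not part of the event solution):

PS > ("cat bat hat" | Select-String -Pattern '\wat' -AllMatches).Matches | Foreach-Object { $_.Value }
cat
bat
hat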

Here’s the pattern we’ll use:

$pattern = '<a href="http://www.twitter.com/([^"]*)"'

The Regular Expression portion inside the single quotes says:

1) Find the literal text (<a href="http://www.twitter.com/)

2) Start remembering the stuff that comes next: (

3) Find a bunch of characters: []*

4) That are not ( ^ ) the quote character (")

5) Then stop remembering: )

6) And find another quote character: "
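Before we turn the pattern loose on the entire page, we can sanity-check it against a single sample line with the -match operator (a quick aside; the solution itself goes straight to Select-String):

PS > '<a href="http://www.twitter.com/ScriptingGuys" class="noarrrow">' -match $pattern
True
PS > $matches[1]
ScriptingGuys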

Next, we supply this pattern to the Select-String cmdlet. We use the AllMatches parameter to get all of the results:

$result = $htmlContent | Select-String -Pattern $pattern -AllMatches

Here’s what one result looks like:

PS > $result.Matches[0]

Groups   : {<a href="http://www.twitter.com/mhthomas42", mhthomas42}
Success  : True
Captures : {<a href="http://www.twitter.com/mhthomas42"}
Index    : 28613
Length   : 43
Value    : <a href="http://www.twitter.com/mhthomas42"

In Regular Expressions, the text inside parentheses is called a group, so Windows PowerShell exposes these values in the Groups property. Groups[0] represents everything that was matched, while Groups[1] and onward represent any groups that we defined. We can dig into the groups to find the information we need:

PS > $result.Matches[0].Groups[1]

Success  : True
Captures : {mhthomas42}
Index    : 28645
Length   : 10
Value    : mhthomas42

Success! But we weren’t interested in just one user name, so let’s call our good friend Foreach-Object to process them all, and then spend the rest of the day gloating:

$usernames = $result.Matches | Foreach-Object { $_.Groups[1].Value }

PS > $usernames
mhthomas42
coryloriot
JohnEGI
http://twitter.com/sqlvariant
randy_knight
adam_Jorgensen
(…)

Oh no! We’ve got a bug! Somehow, we left http://twitter.com in somebody’s user name. Let’s take a look at the content itself:

<span id="ctl00_ContentPlaceHolder1_DataList1_ctl01_Label1"><font size="3">Aaron Nelson <a href="http://www.twitter.com/http://twitter.com/sqlvariant" cl…

It turns out that we aren’t to blame…we’ve just got dirty data!

Although we could get “smarter” with our Regular Expression, it’s usually more trouble than it’s worth. Taking a second pass on the data is often easier, so let’s take that approach. We can go through the user names again, this time using Windows PowerShell’s Replace operator to remove anything up to (and including) the slash:

$usernames -replace ".*/(.*)",'$1'

The Replace operator has two parts on the right-hand side: the Regular Expression to find, and the content to replace it with.

Converting that to English, we have:

1) Find a bunch of text: .*

2) Followed by a slash: /

3) Then start remembering stuff: (

4) Find a bunch of text: .*

5) And then stop remembering stuff: )

For the replacement portion, '$1' means “the text that was captured in Group #1.” As with the Select-String example, Group #1 is the text that was matched inside the parentheses. Putting this in single quotes is important: as in other strings, Windows PowerShell will think that “$1” means a Windows PowerShell variable if you use double quotes.
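To see the difference in action (another quick aside):

PS > 'http://twitter.com/sqlvariant' -replace '.*/(.*)','$1'
sqlvariant
PS > 'http://twitter.com/sqlvariant' -replace '.*/(.*)',"$1"

The second command returns an empty string: Windows PowerShell expanded “$1” as a (normally undefined) variable before the Replace operator ever saw it.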

Although there are simpler ways to do this replacement, that wouldn’t give us an excuse to talk about capture group references, now would it?

When we put this all into a script, we’ve got a powerful little web scraper!

## .SYNOPSIS
## Retrieves the list of Twitter usernames from
## http://www.sqlsaturday.com/70/networking.aspx

param(
    ## The web page holding the twitter usernames
    [Parameter()]
    [URI] $Uri = "http://www.sqlsaturday.com/70/networking.aspx"
)

## Download the file
$wc = New-Object System.Net.WebClient
$htmlContent = $wc.DownloadString($uri)

## Find all hyperlinks that are of the form: http://www.twitter.com/<username>
$pattern = '<a href="http://www.twitter.com/([^"]*)"'
$result = $htmlContent | Select-String -Pattern $pattern -AllMatches
$usernames = $result.Matches | Foreach-Object { $_.Groups[1].Value }

## Dirty data! Welcome to the internet!
## Some of the URLs are incorrect, such as
## http://www.twitter.com/http://twitter.com/<username>
## If a username has a slash in it, just take everything after it.
$usernames -replace ".*/(.*)",'$1'
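If you save the script as, say, Get-TwitterUsernames.ps1 (a file name I’ve picked for illustration), you can run it as-is, or point it at another event’s networking page through the Uri parameter:

PS > .\Get-TwitterUsernames.ps1 -Uri http://www.sqlsaturday.com/70/networking.aspx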

 

You can also find Lee's script at the Script Repository.

Thank you, Lee.

I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.

Ed Wilson, Microsoft Scripting Guy

Comments
  • Dear Lee,

    this is a very well explained worked solution featuring my favorite: regexps :-)

    In fact, a slightly more complex regex that avoids the second pass over the results might look like this:

    ## Find all hyperlinks that are of the form: www.twitter.com/<username>
    $pattern = '<a href="http://www.twitter.com/([^"]*/)*([^"]*)"'
    $result = $htmlContent | Select-String -Pattern $pattern -AllMatches
    $usernames = $result.Matches | Foreach-Object { $_.Groups[2].Value }

    That will fix the issue with the duplicated address parts. OK, we have two groups then, and you have to concentrate on Groups[2]...

    But that's not my major concern here... the solution works fine either way!

    If I issued

    $htmlContent = $wc.DownloadString($uri)

    at work, we would see an error! (Yes, it's me again... this nasty error hunter :-) This will occur because our firewall will not let the WebClient contact the internet, at least not without credentials that allow it to. Even worse, we would have to specify our firewall proxy's IP, port, and user credentials to accomplish that! That's real life :-(

    Expect errors to happen! ... ALWAYS!

    kind regards, Klaus

  • Much cleaner than my 8-something lines of replace code.  I knew regex would help but I didn't know where to start.

  • @cseiter: www.regular-expressions.info is always a very good starting point!

    A free regexp tester makes a fine first learning tool; for example, www.ultrapico.com/Expresso.htm is a fine thing!

    There are many others ...

    kind regards, Klaus

  • Thanks!  Know any good resources for error-handling?  I don't have that down at all.

  • thank you