Use PowerShell to Download Web Page Links from a Blog

Summary: Microsoft Scripting Guy, Ed Wilson, shows how to use Windows PowerShell 3.0 to easily download web page links from a blog.

Microsoft Scripting Guy, Ed Wilson, is here. Today the weather outside is beautiful here in Charlotte, North Carolina in the United States. I opened the windows around the scripting house, and from my office, I am looking out on the green trees in our front yard. Our magnolia tree is still in bloom, as are our neighbor’s hibiscus plants. (Luckily for my neighbor, I get my hibiscus flowers from an organic grower on the Internet; otherwise, he might open his door one morning to find my teacup and me in his garden.)

The Scripting Wife continues to hammer away at the details for our three-week European tour, and the emails, Facebook posts, and tweets are flying back and forth across the big pond nearly 24 hours a day. When we have everything organized, I will post updates on the Scripting Guys Community page.

Use Invoke-WebRequest to obtain links on a page

By using the Invoke-WebRequest cmdlet in Windows PowerShell 3.0, downloading page links from a website is trivial. When I write a Windows PowerShell script using Windows PowerShell 3.0 features, I add a #Requires statement. (I did the same thing in the early days of Windows PowerShell 2.0 also. When Windows PowerShell 3.0 is ubiquitous, I will probably quit doing this.) Here is the #Requires statement.

#requires -version 3.0

The next thing I do is use the Invoke-WebRequest cmdlet to return the Hey, Scripting Guy! Blog. I store the returned object to a variable named $hsg as shown here.

$hsg = Invoke-WebRequest -Uri http://www.scriptingguys.com/blog

The object that is stored in the $hsg variable is an HtmlWebResponseObject object with a number of properties. These properties are shown here.

PS C:\> $hsg | gm -MemberType Property

   TypeName: Microsoft.PowerShell.Commands.HtmlWebResponseObject

Name              MemberType Definition
----              ---------- ----------
AllElements       Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection AllElements {...
BaseResponse      Property   System.Net.WebResponse BaseResponse {get;set;}
Content           Property   string Content {get;}
Forms             Property   Microsoft.PowerShell.Commands.FormObjectCollection Forms {get;}
Headers           Property   System.Collections.Generic.Dictionary[string,string] Headers {get;}
Images            Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection Images {get;}
InputFields       Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection InputFields {...
Links             Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection Links {get;}
ParsedHtml        Property   mshtml.IHTMLDocument2 ParsedHtml {get;}
RawContent        Property   string RawContent {get;}
RawContentLength  Property   long RawContentLength {get;}
RawContentStream  Property   System.IO.MemoryStream RawContentStream {get;}
Scripts           Property   Microsoft.PowerShell.Commands.WebCmdletElementCollection Scripts {get;}
StatusCode        Property   int StatusCode {get;}
StatusDescription Property   string StatusDescription {get;}

I decide that I need to use the Links property to return the hyperlinks from the Hey, Scripting Guy! Blog. This command is shown here.

$hsg.Links

As I looked over the returned links, I noticed that there appeared to be several different classes of links. To review the different types of links, I selected the class property, piped it to the Sort-Object cmdlet, and used the Unique switch. This command is shown here, along with the associated output.

PS C:\> $hsg.Links | select class | sort class -Unique

class
-----

external-link view-post
internal-link advanced-search
internal-link rss
internal-link view-application
internal-link view-detail-list
internal-link view-group
internal-link view-home
internal-link view-list
internal-link view-post
internal-link view-post-archive-list
internal-link view-user-profile
last
menu-title
MSTWButtonLink
page
rss-left
rss-right
selected
sidebar-tile-comments
sidebar-tile-contact
sidebar-tile-subscribe
tweet-url hashtag
tweet-url username
twtr-fav
twtr-join-conv
twtr-profile-img-anchor
twtr-reply
twtr-rt
twtr-timestamp
twtr-user
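If you also want to know how many links fall into each class, a grouping step can help. The following is only a minimal sketch: the $sampleLinks array is hypothetical data that stands in for $hsg.Links so the example runs without a network connection, and the href values are made up. It assumes that some links carry no class value at all (which is what the blank row in the output above suggests), so those are counted separately.

```powershell
# Hypothetical sample data standing in for $hsg.Links (the href values are invented).
$sampleLinks = @(
    [pscustomobject]@{ class = 'internal-link view-post'; href = '/archive/post-1.aspx' }
    [pscustomobject]@{ class = 'internal-link view-post'; href = '/archive/post-2.aspx' }
    [pscustomobject]@{ class = 'internal-link rss';       href = '/rss.aspx' }
    [pscustomobject]@{ href = '/about.aspx' }   # this link has no class property
)

# Keep only the links that have a class value, then count the links in each class.
$classCounts = $sampleLinks |
    Where-Object { $_.class } |
    Group-Object -Property class |
    Sort-Object -Property Count -Descending |
    Select-Object -Property Count, Name

# Count the links that have no class value.
$unclassified = @($sampleLinks | Where-Object { -not $_.class }).Count

$classCounts
"Links without a class: $unclassified"
```

To try this against a live page, replace $sampleLinks with $hsg.Links. The [pscustomobject] syntax used for the sample data requires Windows PowerShell 3.0 or later.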

From the list, I can see that I am interested in only the “internal-link view-post” class of links. I add a Where-Object command (using the simplified syntax) to return only the “internal-link view-post” class links, and I am greeted with the output shown here. (I have deleted all but one instance of the record.)

PS C:\> $hsg.Links |
    Where class -eq 'internal-link view-post'

innerHTML : <span></span>Use PowerShell Redirection Operators for Script Flexibility
innerText : Use PowerShell Redirection Operators for Script Flexibility
outerHTML : <a class="internal-link view-post" href="http://blogs.technet.com/b/heyscriptingguy/archive/2012/09/20/use-powersh
            ell-redirection-operators-for-script-flexibility.aspx"><span></span>Use PowerShell
            Redirection Operators for Script Flexibility</a>
outerText : Use PowerShell Redirection Operators for Script Flexibility
tagName   : A
class     : internal-link view-post
href      : /b/heyscriptingguy/archive/2012/09/20/use-powershell-redirection-operators-for-script-flex
            ibility.aspx
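For readers who have not seen the simplified Where-Object syntax before, it may help to compare it with the script block form. This is a sketch that uses hypothetical sample objects ($sampleLinks, with invented outerText and href values) in place of $hsg.Links. The last filter uses the -like operator with a wildcard, which is one possible way to match every class that ends in view-post; it is an illustration and not part of the script in this post.

```powershell
# Hypothetical sample data standing in for $hsg.Links (text and href values are invented).
$sampleLinks = @(
    [pscustomobject]@{ class = 'internal-link view-post'; outerText = 'Post A'; href = '/a.aspx' }
    [pscustomobject]@{ class = 'external-link view-post'; outerText = 'Post B'; href = 'http://example.com/b' }
    [pscustomobject]@{ class = 'internal-link rss';       outerText = 'RSS';    href = '/rss.aspx' }
)

# Simplified syntax, available in Windows PowerShell 3.0 and later.
$simplified = $sampleLinks | Where-Object class -eq 'internal-link view-post'

# The same filter written with a script block, which also works in Windows PowerShell 2.0.
$scriptBlock = $sampleLinks | Where-Object { $_.class -eq 'internal-link view-post' }

# A wildcard comparison that matches any class value ending in 'view-post'.
$anyViewPost = $sampleLinks | Where-Object class -like '*view-post'

$simplified
$anyViewPost | Select-Object outerText, class
```

The first two filters return the same single link. The wildcard filter returns both the internal and the external view-post links.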

From this output, I see that I am interested in only the outerText and href properties. I select these two properties, and am left with the script that is shown here.

Get-WebPageLinks.ps1

#requires -version 3.0

$hsg = Invoke-WebRequest -Uri http://www.scriptingguys.com/blog

$hsg.Links |
    Where class -eq 'internal-link view-post' |
    select outertext, href

The script and associated output are shown in the image that follows.

Image of command output
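The href values that the script returns are relative paths, so they cannot be opened directly. One way to turn them into full addresses is to combine each one with a base address by using the System.Uri class. This is a sketch only: the base address http://blogs.technet.com/ is an assumption taken from the outerHTML value shown earlier, the Url property name is my own choice, and $sampleLinks is hypothetical data that stands in for the filtered links.

```powershell
# Assumed base address, taken from the outerHTML value shown earlier.
$baseUri = New-Object -TypeName System.Uri -ArgumentList 'http://blogs.technet.com/'

# Hypothetical sample data standing in for the filtered $hsg.Links output.
$sampleLinks = @(
    [pscustomobject]@{
        outerText = 'Use PowerShell Redirection Operators for Script Flexibility'
        href      = '/b/heyscriptingguy/archive/2012/09/20/use-powershell-redirection-operators-for-script-flexibility.aspx'
    }
)

# Add a calculated Url property that combines the base address with the relative href.
$resolved = $sampleLinks | Select-Object -Property outerText, @{
    Name       = 'Url'
    Expression = { (New-Object -TypeName System.Uri -ArgumentList $baseUri, $_.href).AbsoluteUri }
}

$resolved | Format-List
```

The Uri(Uri, String) constructor resolves the relative path against the base address. An href that is already absolute is returned unchanged.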

One thing that might be interesting is to send the output to the Out-GridView cmdlet. This permits easier analysis of the data. Doing so requires only adding the Out-GridView command to the end of the script. The modification is shown here.

$hsg.Links |
    Where class -eq 'internal-link view-post' |
    select outertext, href | Out-GridView

Join me tomorrow when I will talk about more cool Windows PowerShell 3.0 stuff.

I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.

Ed Wilson, Microsoft Scripting Guy

Comments
  • Like everytime, always helpful. :)

  • Hi Ed,

    this is really a cool "trick" to get at the required information!

    In fact I couldn't follow your idea:  "I noticed that their appeared to be several different classes of links."

    at first ... because I really couldn't see any class properties in the output.

    Well ... having dissected the last links, I finally got the idea that not each link may have a "class" property :-)

    So ... if anybody is like me, try this before you start wondering :-)

    "There are: {0} links ... but only: {1} of them have a 'class' member!" -f  $hsg.links.count, ($hsg.links | where { $_.class } | Measure-Object).count

    Klaus.

  • @Livio von Buren -- woo hoo! I am glad you enjoy the blog, and that you find it helpful.

  • @K_Schulte I am glad you like the trick. I am sorry that at first you did not see the class property ... thank you for pointing it out to everyone so they can find it more easily.

  • last script example returns no results. Changing it from

    Where class -eq 'internal-link view-post'

    to

    Where class -ne 'internal-link view-post' |

    shows results