A blog by Jose Barreto, a member of the File Server team at Microsoft.
© Copyright 2004-2012 by Jose Barreto. All rights reserved.
Last week I was testing Visual Studio 2010 to write a C# application to export all my blog posts to a file. I described that in some detail at http://blogs.technet.com/josebda/archive/2010/03/21/experimenting-with-visual-studio-2010-and-backing-up-the-entries-on-my-blog.aspx
I am performing the exact same task, but this time using PowerShell V2. The basic idea is still the same and I am still using a Browser and HTML document objects to do most of the work. This is part of a series of blog posts I am doing on PowerShell V2, focusing mostly on programming. The sample script uses many variables, objects and collections. It also uses different types of loops and conditional statements. If you're familiar with programming, it should all make sense.
The code
Let's start with the basics of creating and running the PowerShell script. To make it simple, you can use Notepad to create a file called "BlogBackup.ps1". You can then copy and paste the code listed below under BlogBackup.ps1:
BlogBackup.ps1

$Browser = New-Object -COM "InternetExplorer.Application"
$MorePages = $True
$Page = 1
$Post = 0
$BaseURL = "http://blogs.technet.com/josebda"
$File = "./josebda.htm"

"Exporting posts from " + $BaseURL + " to a file"
"<HTML><BODY>" | Out-File $File

While ($MorePages)
{
    $URL = $BaseURL + "/default.aspx?p=" + $Page
    "Loading Page " + $Page + " (" + $URL + ")"
    $Browser.Navigate($URL)
    While ($Browser.ReadyState -ne 4) { Start-Sleep -Seconds 1 }

    "Processing Page " + $Page
    $MorePages = $False
    $Divs = $Browser.Document.getElementsByTagName("DIV")
    ForEach ($Div in $Divs)
    {
        $DivText = $Div.OuterHTML.ToString()
        If ($DivText.Length -gt 16)
        {
            If ($DivText.Substring(2,16) -eq "<DIV class=post>")
            {
                $MorePages = $True
                $Post++
                $Title = @($Div.getElementsByTagName("A"))[0].InnerHTML
                "Exporting post " + $Post + " = " + $Title
                $DivText | Out-File $File -append
            }
        }
    }
    "Processed Page " + $Page +"."
    $Page++
}

"</BODY></HTML>" | Out-File $File -append
"Processing complete!"
Execution Policy
If you simply try to run the script, you should get the following error:
PS C:\> .\blogbackup.ps1
File C:\blogbackup.ps1 cannot be loaded because the execution of scripts is disabled on this system. Please see "get-help about_signing" for more details.
At line:1 char:17
+ .\blogbackup.ps1 <<<<
    + CategoryInfo          : NotSpecified: (:) [], PSSecurityException
    + FullyQualifiedErrorId : RuntimeException
That's because the default execution policy for PowerShell is Restricted, which blocks the execution of all scripts, signed or not. You can confirm that by running:
PS C:\> Get-ExecutionPolicy
Restricted
You can change that policy by using the Set-ExecutionPolicy cmdlet. To be on the safe side, we'll change it just for the current process (if you close PowerShell and open it again, the policy will be back to Restricted).
PS C:\> Set-ExecutionPolicy -scope process Unrestricted

Execution Policy Change
The execution policy helps protect you from scripts that you do not trust. Changing the execution policy might expose
you to the security risks described in the about_Execution_Policies help topic. Do you want to change the execution
policy?
[Y] Yes  [N] No  [S] Suspend  [?] Help (default is "Y"): <ENTER>
PS C:\Users\josebda> Get-ExecutionPolicy
Unrestricted
With that, you should now be able to run the script just fine for this session. If you really intend to write scripts and you want to run them securely, you should learn how to sign them. You can start by reading http://technet.microsoft.com/en-us/magazine/2008.04.powershell.aspx.
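If you want to see which scope a policy is coming from before changing anything, here is a minimal sketch. The -List, -Scope and -Force parameters are standard, but picking RemoteSigned instead of Unrestricted is an assumption on my part (it is a narrower setting that should still run a script you created locally in Notepad).

```powershell
# Show the policy configured at each scope (MachinePolicy, UserPolicy, Process, CurrentUser, LocalMachine).
Get-ExecutionPolicy -List

# Change the policy for the current PowerShell process only; -Force skips the confirmation prompt.
# RemoteSigned is an assumption here, not what the walkthrough above uses: it runs local
# scripts but requires a signature on scripts downloaded from the Internet.
Set-ExecutionPolicy -Scope Process -ExecutionPolicy RemoteSigned -Force

# Confirm the effective policy for this session.
Get-ExecutionPolicy
```

As with the Unrestricted example, the change is gone as soon as you close the PowerShell window.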
The output
Here's what the output of the script should look like:
PS C:\> .\blogbackup.ps1
Exporting posts from http://blogs.technet.com/josebda to a file
Loading Page 1 (http://blogs.technet.com/josebda/default.aspx?p=1)
Processing Page 1
Exporting post 1 = Comparing RPC, WMI and WinRM for remote server management with PowerShell V2
Exporting post 2 = Why Hyper-V VHD Files Are So Large - And How To Efficiently Copy Them
Exporting post 3 = Experimenting with PowerShell V2 Remoting
Exporting post 4 = How DFS Replication (DFS-R) secures its communication
Exporting post 5 = Experimenting with Visual Studio 2010 and backing up the entries on my blog
Exporting post 6 = Windows Storage Server 2008 and iSCSI Software Target 3.2 documentation on TechNet
Exporting post 7 = FAST'10 Technical Sessions
Exporting post 8 = Unique Document URLs in MOSS 2007 and the new Document ID feature in SharePoint 2010
Exporting post 9 = Random thoughts and links on Storage
Exporting post 10 = Presentations from Storage Developer Conference 2009 (SDC 2009) are now available for download
Exporting post 11 = Windows Server DFS Namespaces (DFS-N) Reference
Exporting post 12 = Configuring Failover Clusters with Windows Storage Server 2008
Exporting post 13 = Automatically uploading files from File Server to SharePoint using the File Classification Infrastructure (FCI)
Exporting post 14 = Six Uses for the Microsoft iSCSI Software Target
Exporting post 15 = Download for Powershell v2 for Windows 7? No need... It's already there!
Processed Page 1.
Loading Page 2 (http://blogs.technet.com/josebda/default.aspx?p=2)
Processing Page 2
Exporting post 16 = SQL Server 2008 R2 Enterprise Evaluation November CTP available for MSDN/TechNet Subscribers
Exporting post 17 = Mistakes when configuring your Hyper-V environment
Exporting post 18 = Scary SQL Server stuff: tombstones, phantoms, blobs, ghosts and zombies
Exporting post 19 = Implementing an End-User Data Centralization Solution with Folder Redirection and Offlines Files
Exporting post 20 = SharePoint 2010 beta in November. Details and documentation right now!
Exporting post 21 = File Server Capacity Tool (FSCT) 1.0 available for download
[Lots of lines excluded here for brevity]
Exporting post 296 = EAS Support in Windows / Suporte para EAS no Windows
Exporting post 297 = Project Server 2003 Training / Treinamento para Project Server 2003
Exporting post 298 = Automating Certificates / Automaçao de Certificados
Exporting post 299 = DoNotAllowXPSP2
Exporting post 300 = MOM 2005 Preview
Processed Page 20.
Loading Page 21 (http://blogs.technet.com/josebda/default.aspx?p=21)
Processing Page 21
Exporting post 301 = Security for Applications
Exporting post 302 = Good scripting book for WSH, ADSI, WMI?
Exporting post 303 = Outlook 2003 and the OAB
Exporting post 304 = We live very interesting times
Exporting post 305 = Security, VHDs e Defragmentation
Exporting post 306 = This blog thing
Processed Page 21.
Loading Page 22 (http://blogs.technet.com/josebda/default.aspx?p=22)
Processing Page 22
Processed Page 22.
Processing complete!
After it runs, you should also find a file called josebda.htm with the entire text of all the blog posts. That's what the Out-File cmdlet used in the script does.
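To see the difference between a plain Out-File and Out-File with -Append outside of the blog script, here is a small sketch. The file name outfile-demo.htm and the $DemoFile and $Lines variables are made up for this illustration.

```powershell
# Hypothetical scratch file used only for this demonstration.
$DemoFile = Join-Path ([System.IO.Path]::GetTempPath()) "outfile-demo.htm"

# Without -Append, Out-File creates the file or overwrites an existing one.
"<HTML><BODY>" | Out-File $DemoFile

# With -Append, each call adds a line to the end of the file.
"<DIV class=post>First post</DIV>" | Out-File $DemoFile -Append
"</BODY></HTML>" | Out-File $DemoFile -Append

# Read the file back as an array of lines and count them.
$Lines = @(Get-Content $DemoFile)
$Lines.Count
```

That is the same pattern the script follows: one overwrite at the start, then appends for every post and for the closing tags. In Windows PowerShell, Out-File writes Unicode (UTF-16) by default, so adding -Encoding UTF8 is an option if another tool expects UTF-8.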
The variables
The script uses a number of variables to keep track of things. These are the items starting with a $ sign, like $Page (used to track what page we are processing), $BaseURL (the blog location), $File (the name of the output file) or $Post (the number of posts found so far). You will notice they are initialized and later updated throughout the code. One special variable called $MorePages is used to know if we have reached the end of the blog. You see, the blog system provides a set of pages starting with 1, but I cannot tell from the start how many pages with blog posts there will be. So, I use this variable to track if I have loaded a page with no posts in it, which indicates there are no more pages to process.
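The $MorePages idea can be tried without a browser. The sketch below replaces the blog with a made-up function, Get-FakePagePosts, that returns posts for pages 1 and 2 and nothing for page 3; everything else mirrors the variables described above.

```powershell
# Hypothetical stand-in for the blog: pages 1 and 2 have posts, any other page is empty.
function Get-FakePagePosts($PageNumber) {
    switch ($PageNumber) {
        1 { "Post A", "Post B" }
        2 { "Post C" }
        default { }   # no output means "no posts on this page"
    }
}

$MorePages = $True
$Page = 1
$Post = 0

While ($MorePages) {
    # Assume this page is the last one until a post proves otherwise.
    $MorePages = $False
    ForEach ($Item in @(Get-FakePagePosts $Page)) {
        $MorePages = $True
        $Post++
    }
    $Page++
}

"Found " + $Post + " posts; stopped after loading page " + ($Page - 1)
```

Note that the loop always loads one page more than the number of pages with posts, which matches the real output above, where page 22 is loaded and turns out to be empty.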
There are some variables that hold more complex information. They are actually objects. That includes the $Browser (this is an instance of an Internet Explorer browser that is used to retrieve the pages) and $Divs (a set of HTML elements using the <DIV> tag). These objects are not just basic types, but more complex ones, which include a longer list of properties and methods. For the $Browser, for instance, I use the $Browser.Navigate method (to load a specific page) and the $Browser.ReadyState property (to tell if the page is done loading). I also use $Browser.Document.getElementsByTagName (yes, that's a method of an object inside an object) to find all "DIV" tags in the resulting document. I used $Divs to store that resulting set of tags.
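If properties and methods are new to you, here is a short sketch using a System.Uri object instead of the browser. That substitution is an assumption made only so the example runs without Internet Explorer; the $Uri variable is not part of the script.

```powershell
# A .NET object used as a stand-in for the COM browser object.
$Uri = New-Object System.Uri "http://blogs.technet.com/josebda/default.aspx?p=1"

# A property holds data about the object.
$Uri.Host

# A method of an object inside an object: Host is a string, and strings have ToUpper().
$Uri.Host.ToUpper()

# Get-Member lists the properties and methods an object offers.
$Uri | Get-Member -MemberType Property | Select-Object -Property Name -First 5
```

Piping $Browser to Get-Member in the same way should show Navigate, ReadyState and the other members of the browser object.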
The control structures
Several PowerShell control structures are used, like While, ForEach and If. The main loop, for instance, keeps getting more pages until $MorePages is $False (that is, until we load a page with no posts), incrementing $Page at the end of each pass. A second loop uses ForEach to inspect each of the elements returned by getElementsByTagName("DIV"). Also, two If statements check whether there are enough characters in $Div.OuterHTML to look at (the script requires more than 16 for the text to have a chance of being a post) and then whether the text begins with "<DIV class=post>", which means it's one of the posts inside the page (there are usually a few posts in each page).
If you're not familiar with PowerShell expressions, you might be confused by the comparison operators. They might seem unintuitive at first, but you get used to them. In the script, I used -eq (equal), -ne (not equal) and -gt (greater than). The main While loop does not use an operator because $MorePages is already a boolean (contains either a $True or a $False value).
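Here are the operators from the script in isolation, plus two related ones (-ge and -ceq) that the script does not use. PowerShell uses these dash-style names because > and < are already taken by redirection. The $Count variable is just an example value.

```powershell
$Count = 17

$Count -eq 17      # True: equal
$Count -ne 4       # True: not equal
$Count -gt 16      # True: greater than
$Count -ge 18      # False: greater than or equal

# String comparisons ignore case by default; the -ceq variant is case-sensitive.
"<div class=post>" -eq  "<DIV class=post>"   # True
"<div class=post>" -ceq "<DIV class=post>"   # False

# A boolean needs no operator in a condition.
$MorePages = $False
If ($MorePages) { "keep going" } Else { "done" }   # done
```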
I must admit that moving strings around like this is probably not the most efficient way to process the document, but I was shooting for simplicity, not for extreme performance. In fact, the only reason I even bothered to check the length of the $Div.OuterHTML was because the .Substring(2,16) will fail if the string is not long enough.
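One detail worth spelling out: Substring(2,16) reads the characters at positions 2 through 17, so the string needs at least 18 characters. A string of exactly 17 characters passes the -gt 16 test and would still throw. A sketch of a stricter check is below; Test-PrefixAt is a made-up helper name, and the leading line break in the sample strings is my assumption about what the offset of 2 skips.

```powershell
# Hypothetical helper: returns $True when $Text contains $Prefix starting at position $Offset.
function Test-PrefixAt($Text, $Prefix, $Offset) {
    # Substring($Offset, $Prefix.Length) needs this many characters in total.
    If ($Text.Length -lt ($Offset + $Prefix.Length)) { return $False }
    return ($Text.Substring($Offset, $Prefix.Length) -eq $Prefix)
}

$NewLine = "`r`n"   # assumed: the two characters skipped by the offset of 2

Test-PrefixAt ($NewLine + "<DIV class=post>Hello</DIV>") "<DIV class=post>" 2   # True
Test-PrefixAt ($NewLine + "<DIV>short</DIV>") "<DIV class=post>" 2              # False
Test-PrefixAt "12345678901234567" "<DIV class=post>" 2                          # False, and no exception
```

In the script itself, the equivalent change would be to test $DivText.Length -ge 18 before calling Substring(2,16).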
A couple of tricks
There are two lines in the code that are somewhat tricky and also deserve a comment.
First, there is the line saying "While ($Browser.ReadyState -ne 4) { Start-Sleep -Seconds 1 }". This is the line that waits for the document to finish loading before we go look at it. You see, the browser control is asynchronous and it will give us back control before the page is fully loaded. This line will wait for ReadyState to become 4, which means that "the control has finished loading the new document and all its contents". The statement inside the loop waits for 1 second. See http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.readystate.aspx for details on the other states.
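The wait loop in the script never gives up, so a page that does not finish loading would hang it. Here is a sketch of the same loop with a timeout. Wait-BrowserReady and its parameters are hypothetical names, and the 30-second default is an arbitrary choice.

```powershell
# ReadyState values: 0 = uninitialized, 1 = loading, 2 = loaded, 3 = interactive, 4 = complete.

# Hypothetical helper: polls $Browser.ReadyState until it reaches 4 (complete)
# or until $TimeoutSeconds have passed. Returns $True when the page finished loading.
function Wait-BrowserReady($Browser, $TimeoutSeconds = 30) {
    $Deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    While ($Browser.ReadyState -ne 4) {
        If ((Get-Date) -gt $Deadline) { return $False }
        Start-Sleep -Milliseconds 200
    }
    return $True
}

# Usage with the script's objects (requires Internet Explorer, so it is commented out here):
# $Browser.Navigate($URL)
# If (-not (Wait-BrowserReady $Browser 60)) { "Timed out loading " + $URL }
```

The function only reads the ReadyState property, so it can be tried with any object that has one.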
Second, there is the line saying "$Title = @($Div.GetElementsByTagName("A"))[0].InnerHTML". In short, this line extracts the text inside the first "<A>" tag within the post. The $Div contains a post, which is an HTML element containing an entire "<DIV>" tag. Inside it, the first "<A>" tag contains a link to the URL of the post and the inner text of that tag is the title of the post. The statement starts by getting a list of all "<A>" tags within the $Div and then gets the first of those elements (that would be element number zero, represented by [0]). This whole section is optional, but it's interesting to show the name of the post while you are looking at the output.
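The @( ) around the call is what makes [0] safe to use. It forces the result into an array, even when there is only one item or none at all. The three functions below are made up to show the difference; this is my reading of why the line is written that way.

```powershell
# Hypothetical functions standing in for a call that may return many, one, or zero items.
function Get-Many { "first", "second" }
function Get-One  { "only" }
function Get-None { }

@(Get-Many)[0]        # first
@(Get-One)[0]         # only
@(Get-None).Count     # 0

# Without @(), a single string is not an array, so [0] picks its first character.
(Get-One)[0]          # o
```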
Conclusion
This post has more of a developer flavor to it and it shows the range of tasks you can automate with PowerShell. For instance, you could create a script to get a list of all computer objects in Active Directory with the term "file" in the name, then you could use that list to find all file shares in each of those servers that do not end with $, then you could get information about used/free space on each volume used by each share and finally you could output a nice HTML table with all the results. But that would be an entirely new blog post...
Yikes, the first thing you taught folks about PowerShell was how to disable security protection. I hope Don Jones doesn't find you...
\\Greg
To address your concern, I added some additional information on script signing, including a link.
At least I am doing this just for the current process.