by Anupadmaja Raghavan

In the last post (http://blogs.technet.com/filecab/archive/2009/08/14/using-windows-powershell-scripts-for-file-classification.aspx), we saw how simple PowerShell scripts can be used for custom file classification in Windows. This post illustrates how PowerShell scripts can classify files based on their contents. It assumes that the reader has a good understanding of the basics of the PowerShell classifier module discussed in that post. The File Classification Infrastructure (http://blogs.technet.com/filecab/archive/2009/05/11/classifying-files-based-on-location-and-content-using-the-file-classification-infrastructure-fci-in-windows-server-2008-r2.aspx) is referred to as FCI in this post.

The PowerShell classifier module can read file contents, which is what makes content classification possible. As we saw in the last post (http://blogs.technet.com/filecab/archive/2009/08/14/using-windows-powershell-scripts-for-file-classification.aspx), during classification the pipeline input from FCI to the PowerShell classifier module’s PowerShell script consists, for each file, of an IFsrmPropertyBag object. The GetStream() method of the IFsrmPropertyBag object returns a standard .NET Stream object, through which the script can read the contents of the file as a raw byte stream.
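
As a minimal sketch of this reading step, the snippet below wraps a stream in a System.IO.StreamReader and reads its first line. Outside of FCI there is no property bag to call GetStream() on, so an in-memory stream with made-up sample text stands in for it here; that stand-in and the variable names are assumptions for illustration only.

```powershell
# Stand-in for the stream returned by $PropertyBag.GetStream(): an in-memory
# stream holding hypothetical sample file contents.
$SampleText  = "Patent of X`r`nBody of the document"
$SampleBytes = [System.Text.Encoding]::UTF8.GetBytes($SampleText)
$FileStream  = New-Object System.IO.MemoryStream (,$SampleBytes)

# Wrap the raw byte stream in a StreamReader to decode it as text.
$Reader = New-Object System.IO.StreamReader($FileStream)
try
{
    # ReadLine() returns $null when the stream is empty.
    $FirstLine = $Reader.ReadLine()
}
finally
{
    # Closing the reader also closes the underlying stream.
    $Reader.Close()
}

$FirstLine
```

Closing the reader in a finally block releases the stream even if reading fails part-way.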

In the last post (http://blogs.technet.com/filecab/archive/2009/08/14/using-windows-powershell-scripts-for-file-classification.aspx), we saw how to pass the mandatory parameter named ScriptFileName from FCI to the PowerShell classifier module. The module can also receive additional parameters from FCI. These are specified in the rule definition, in the same way that we specified “ScriptFileName” in the rule parameters. The rule definition is an IFsrmClassificationRule object that is available to the PowerShell classifier module’s PowerShell script as $Rule. $Rule.Parameters contains the entries specified in the Parameters section of the FCI rule.

Let us take an example to see how these two features can be combined to do content classification. Suppose we want to classify files as either patents or copyrights of a company, and the classification criteria are based on the following known information:

  • The copyright or patent documents are stored in *.txt files.
  • The patent or copyright information is stored in the first line of the file.
  • Patents contain the term “Patent of X” with the same casing.
  • Copyrights contain the term “Copyright of X” with the same casing.
  • One file cannot both be a copyright and a patent.

To do this using the PowerShell classifier module, we will create a String property named “Document Type” and set up a rule that sets this property based on the output of the PowerShell script. The rule will be defined with the following parameters:

  • Parameter Name: ScriptFileName, Value: <Name of PowerShell script file>
  • Parameter Name: Copyright, Value: “Copyright of”
  • Parameter Name: Patent, Value: “Patent of”
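
Inside the script, these parameters can be collected into a lookup table. The sketch below assumes each parameter arrives as a "Name=Value" string; the array $RuleParameters is a hypothetical stand-in for $Rule.Parameters, and the script file name in it is made up for illustration.

```powershell
# Hypothetical stand-in for $Rule.Parameters: the three rule parameters above,
# assumed to arrive as "Name=Value" strings. The script file name is made up.
$RuleParameters = @(
    'ScriptFileName=ClassifyByContent.ps1',
    'Copyright=Copyright of',
    'Patent=Patent of'
)

$Identifiers = @{}
foreach ($RuleParam in $RuleParameters)
{
    # Split on the first "=" only, so a value may itself contain "=".
    $Key, $Value = $RuleParam -split '=', 2

    # ScriptFileName tells the module which script to run; it is not a search term.
    if ($Key -ne 'ScriptFileName')
    {
        $Identifiers[$Key] = $Value
    }
}

# List the collected identifiers, sorted by key for stable output.
$Identifiers.GetEnumerator() | Sort-Object Key |
    ForEach-Object { '{0} -> "{1}"' -f $_.Key, $_.Value }
```

Each key in the table doubles as the value to assign to “Document Type”, so adding a new document type only requires adding a rule parameter.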

The PowerShell script reads the first line of any *.txt file and, if that line contains one of the terms defined above, classifies the file by setting the “Document Type” property to either “Copyright” or “Patent”.

Process
{
    ################################
    ### Get the file name
    ################################
    $PropertyBag = $_
    $FileName = $PropertyBag.Name
    ################################
    ### If this isn't a .txt file don't process it
    ################################
    if(!($FileName -like "*.txt"))
    {
        return
    }
    ################################
    ### Collect the identifiers specified in the rule
    ################################
    $Identifiers = @{}
    foreach($RuleParam in $Rule.Parameters)
    {
        $Key,$Value = $RuleParam -split "=",2
        If ($Key -ne 'ScriptFileName')
        {
            $Identifiers[$Key] = $Value
        }
    }
    $FileStream = $PropertyBag.GetStream()
    $FileStreamReader = New-Object System.IO.StreamReader($FileStream)
    If ($FileStreamReader.EndOfStream)
    {
        ### Empty file: release the stream before leaving
        $FileStreamReader.Close()
        return
    }
    $Line = $FileStreamReader.ReadLine()
    $FileStreamReader.Close()
    $FileStream.Close()
    foreach ($Entry in $Identifiers.GetEnumerator())
    {
        If ($Line.Contains($Entry.Value))
        {
            return $Entry.Key ### return the document type
        }
    }
}
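
The matching step can also be tried outside FCI. The sketch below isolates it in a helper function; the name Get-DocumentType and the sample lines are hypothetical and are not part of the classifier module.

```powershell
# Hypothetical helper: returns the document type whose identifier appears in
# the given first line, or nothing when no identifier matches.
function Get-DocumentType
{
    param(
        [string]    $FirstLine,
        [hashtable] $Identifiers
    )

    foreach ($Entry in $Identifiers.GetEnumerator())
    {
        # String.Contains performs an ordinal, case-sensitive comparison.
        if ($FirstLine.Contains($Entry.Value))
        {
            return $Entry.Key
        }
    }
}

# The identifiers from the rule parameters in this post.
$Identifiers = @{ Copyright = 'Copyright of'; Patent = 'Patent of' }

# Made-up first lines covering a patent, a copyright, and a casing mismatch.
$PatentResult    = Get-DocumentType -FirstLine 'Patent of X, filed by the company' -Identifiers $Identifiers
$CopyrightResult = Get-DocumentType -FirstLine 'Copyright of X' -Identifiers $Identifiers
$NoMatchResult   = Get-DocumentType -FirstLine 'patent of x' -Identifiers $Identifiers
```

The last call returns nothing because the comparison is case-sensitive, which is what the “same casing” criterion calls for.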

Thus the PowerShell host classifier module provides a simple way to do content classification of files using PowerShell scripts. More details on the topics discussed in this post and other capabilities of the PowerShell classifier module are available in the SDK.