Classifying files based on their content is something we have covered before for the File Classification Infrastructure (FCI). Basically, FCI lets you create rules that search the contents of files (using the Content Classifier) for strings or patterns and, when a match is found, set a classification property to a specific value. To extract the text to search, the Content Classifier uses the same IFilter components that support search indexing. Out of the box, a series of file types are supported.

One common problem we see when searching through data for strings is that some data that is easily recognizable by humans is very difficult for software to find. The classic example is text embedded in images. Think about all the faxes or scanned documents lying around your server. All of these may contain valuable information, but up to now you have not had a good way to automate the process of finding it. In Windows 7 and Windows Server 2008 R2 there is a new optional component to help you with this problem: the Windows TIFF IFilter.

This is an optional Windows feature that you have to install (Server Manager -> Add Feature -> Windows TIFF IFilter). Once installed, the Windows TIFF IFilter can perform OCR (Optical Character Recognition) on TIFF images (the most common format for faxes and scanned documents). With this feature, the search indexer can essentially read the content of TIFF images and index these files according to the embedded text. This enables the Content Classifier to find text within images and classify the file.

Now your classification rule to find the word “Confidential” and mark those files as “Secrecy=High” can find the word in scanned documents too!
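
To make the rule above a bit more concrete, here is a minimal Python sketch of the logic it expresses: look for a pattern in the text extracted from a file and, on a match, set a property to a value. This is not the FCI API. The names `ClassificationRule` and `classify` and the sample fax text are hypothetical, made up for illustration; real rules are configured in FCI, and the text comes from the IFilter.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassificationRule:
    """Hypothetical stand-in for a content classification rule (not an FCI type)."""
    pattern: str          # regular expression to look for in the extracted text
    property_name: str    # classification property to set on a match
    property_value: str   # value assigned to that property


def classify(extracted_text: str, rules: list[ClassificationRule]) -> dict[str, str]:
    """Return the properties set by every rule whose pattern occurs in the text."""
    properties: dict[str, str] = {}
    for rule in rules:
        # Case-insensitive search, so "CONFIDENTIAL" and "Confidential" both match.
        if re.search(rule.pattern, extracted_text, flags=re.IGNORECASE):
            properties[rule.property_name] = rule.property_value
    return properties


rules = [ClassificationRule(r"\bconfidential\b", "Secrecy", "High")]

# Made-up sample of what OCR might return for a scanned fax.
scanned_fax_text = "FAX COVER SHEET - CONFIDENTIAL - Q3 forecast attached"
print(classify(scanned_fax_text, rules))  # {'Secrecy': 'High'}
```

The point of the sketch is that the rule itself does not care where the text came from. Once the TIFF IFilter supplies the text from a scanned image, the same rule that works on a Word document works on the fax.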

Languages

By default, the Windows TIFF IFilter uses the locale of the system to decide which language to recognize when performing OCR on images. However, this can be modified.

There is a Group Policy administrative template for setting preferred OCR languages. The only limitation is that the selected languages must be from the same code page.

By selecting several OCR languages (e.g. English and Spanish), administrators can enable the Windows TIFF IFilter to process documents in both languages as well as documents with mixed-language content. For more details please take a look at: http://technet.microsoft.com/en-us/library/dd744701(WS.10).aspx
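
The same-code-page limitation can be illustrated with a short Python sketch. The table below lists the Windows ANSI code pages commonly associated with a few languages. Treating those as the code pages the TIFF IFilter groups languages by is an assumption made here for illustration, and `ANSI_CODE_PAGE` and `share_code_page` are hypothetical names. Check the TechNet page above for the combinations that are actually supported.

```python
# Assumed mapping from language to Windows ANSI code page, for illustration only.
ANSI_CODE_PAGE = {
    "English": 1252,
    "Spanish": 1252,
    "French": 1252,
    "German": 1252,
    "Polish": 1250,
    "Russian": 1251,
    "Greek": 1253,
}


def share_code_page(languages: list[str]) -> bool:
    """True if every listed language maps to one and the same code page.

    Raises KeyError for a language missing from the table above.
    """
    code_pages = {ANSI_CODE_PAGE[language] for language in languages}
    return len(code_pages) == 1


print(share_code_page(["English", "Spanish"]))  # True
print(share_code_page(["English", "Russian"]))  # False
```

Under this assumption, English and Spanish can be selected together, while English and Russian cannot.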