Vedant Kulshreshtha

I have no special talents. I am only passionately curious.

FAST ESP: Document Processing Engine


Continuing the discussion on document processing, this post focuses on the document processing engine in FAST ESP.

The Document Processing Engine processes documents through customizable document processing pipelines. Each incoming document is sent through a specified pipeline, and each pipeline consists of multiple document processing stages.

Let's see what these core components are.

Document Processor Stage

A document processing stage performs a particular document processing task and can modify, remove, or add elements to a document. It takes one or more document elements as input, and the resulting output is new or modified elements that may be processed further. Because each stage focuses on one particular area of document processing, stages can be reused across many settings and pipelines.

FAST ESP is shipped with a set of document processor stages as well as pre-configured instances. The customer can write new document processor stages, either processing documents from scratch or leveraging the output of standard FAST ESP stages. For example, a custom document processor stage performing language-specific operations may leverage the automatic language detector shipped with FAST ESP.
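To make the idea concrete, here is a minimal sketch of a custom stage building on an upstream language detector. This is illustrative only: the function names are hypothetical, the real FAST ESP processor API differs, and a document is modeled here as a plain dict of elements.

```python
# Illustrative sketch only: not the FAST ESP API. A "document" is
# modeled as a plain dict of elements.

def detect_language_stub(document):
    # Stand-in for the FAST ESP language detector stage: tags the
    # document with a language code (hard-coded for this sketch).
    document["language"] = "en"
    return document

def normalize_title(document):
    # Custom stage that leverages the detector's output: applies a
    # language-specific operation only when a language was detected.
    if document.get("language") == "en":
        document["title"] = document.get("title", "").lower()
    return document

doc = {"title": "FAST ESP Overview", "body": "..."}
for stage in (detect_language_stub, normalize_title):
    doc = stage(doc)
# doc["title"] is now "fast esp overview"
```

The key point is the chaining: each stage receives the previous stage's output, so a custom stage can rely on attributes added earlier in the pipeline.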

Document Processing Pipeline

A pipeline is a sequence of document processor stages executed in order. The system administrator associates a pipeline with a collection. A pipeline can be broken into three phases:

  1. Normalization/Pre-Processing: This is where the pipeline is initialized, documents are fetched (if passed by reference), and format detection, file conversion, character normalization, and language identification are performed. In most out-of-the-box pipelines, preprocessing ends with an HTML parser stage. This stage ensures that the data blob has been parsed into title, metadata (HTML meta tags, Office document properties, PDF meta tags), and body. Different criteria are used for determining the title if one is not provided during feeding. Any text not identified as title or metadata is copied into the body.
  2. Data Manipulation: This is where most custom stages are added. During data manipulation, data from one or more document attributes can be used to compute values for other attributes. Stages can interact with external applications or data. It is important to note that all data manipulation must be complete before any linguistic or scope processing starts, which adds tags to attributes. Modifying the attributes afterwards could cause errors.
  3. Post Processing: This includes linguistics and scope tagging. The final steps in post processing are Generate FIXML and Output to Indexer. In most out-of-the-box pipelines, post-processing begins at the Tokenizer stage.
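The phase ordering rule above can be sketched as code. The stage names and registration mechanism here are hypothetical, not the FAST ESP framework; the point is only that phases run strictly in order, so data manipulation always completes before post-processing starts.

```python
# Hypothetical sketch of the three-phase pipeline structure.

PHASES = ("preprocess", "manipulate", "postprocess")

def run_pipeline(document, stages):
    # stages: iterable of (phase, callable). Sorting by phase index
    # enforces the ordering rule even if stages are registered out
    # of order.
    ordered = sorted(stages, key=lambda s: PHASES.index(s[0]))
    for _phase, fn in ordered:
        document = fn(document)
    return document

stages = [
    ("postprocess", lambda d: {**d, "tokens": d["body"].split()}),
    ("preprocess", lambda d: {**d, "body": d["raw"].strip()}),
    ("manipulate", lambda d: {**d, "body": d["body"].lower()}),
]
result = run_pipeline({"raw": "  Hello World  "}, stages)
# result["tokens"] == ["hello", "world"]
```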

Example of a Pipeline

A typical document-processing pipeline for information retrieved from the Internet consists of the following stages:

  • Format detection: Detects the MIME type of the document and determines if a format conversion is required.
  • Format conversion: A built-in document converter supports external document formats, such as PDF and Microsoft Word. The original structure of the documents, as well as meta-data if embedded in the documents (author, date, etc.), is preserved in the conversion process. An HTML parser extracts key information, such as title, body and Dublin Core metadata.
  • Language and encoding detection: FAST ESP can recognize 81 languages with their corresponding character encodings. Language detection is used for language-dependent processing and can be used to narrow the scope of a search.
  • Tokenizing: The tokenizer groups characters into words in a configurable way, for example by detecting white-space characters and other symbols that are not relevant to the matching process. Asian languages require special handling because word boundaries are difficult to detect; the tokenizer groups Asian characters into 'tokens' that can be treated as words in the processing and matching stages.
  • Character encoding: Character encoding is normalized to UTF8 representation in order to handle search across content with different encoding.
  • Extraction of Named Entities: FAST ESP is able to extract Named Entities such as person/geo/company names and addresses. These entities can typically be used for drill-down navigation within the result set.
  • Document Similarity Analysis: A Similarity Vector is extracted based on statistical document analysis. This vector can in turn be used for unsupervised clustering and ‘related topics’ navigation.
  • Summary Extraction: Extraction of document summaries to be used in the presentation of search results.
  • Lemmatization: The process performs grammatical normalization in order to enable query match regardless of grammatical forms in documents and queries.
  • Mapping: The mapping from document attributes to the FAST Index structure may be configured per collection. Different views of the content structure can be applied to queries, for instance a search within the document title or author only.
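The encoding-normalization and tokenizing steps in the list above can be sketched together. This is a naive whitespace tokenizer under the assumption of Latin-script text; FAST ESP's tokenizer is configurable and handles Asian scripts, which this sketch does not attempt.

```python
import unicodedata

def normalize_and_tokenize(text):
    # Normalize to a canonical Unicode form, then encode as UTF-8,
    # mirroring the pipeline's character-encoding normalization step.
    normalized = unicodedata.normalize("NFC", text)
    utf8_bytes = normalized.encode("utf-8")
    # Naive whitespace tokenization; real tokenizers also strip
    # symbols that are not relevant to matching.
    tokens = normalized.split()
    return utf8_bytes, tokens

raw, tokens = normalize_and_tokenize("24, rue Royale")
# tokens == ["24,", "rue", "Royale"]
```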

Some things to note here are:

  • A document processor stage can appear in several pipelines
  • A pipeline is a lightweight object, thus imposing no limit on the number of pipelines
  • Each pipeline can be used as a template for new pipelines

FAST ESP ships with many different pipelines, each with a specific usage scenario. Some examples are given below; the product documentation has the complete list and guidance on when to use each.

  • Generic: Basic support for web content and documents.
  • SiteSearch: Additional support for site search and other web search applications that need relevancy from link analysis.
  • NewsSearch: Additional support for news search and other applications that need search precision and drill-down on extracted entities.
  • Semantic: Detects and marks up selected semantic and structural entities for increased search precision; includes an extensive set of entity extractors.
  • LightweightSemantic: A performance-optimized contextual search pipeline that includes only a basic set of entity extractors.

Customizing Pipelines

Pipelines can be customized in two broad ways:

  • Adding or removing stages
  • Reconfiguring stage parameters

FAST ESP provides a Python-based document processor API for creating custom document processor stages. A custom document processor stage consists of a Python script and a deployment descriptor, and can be plugged into the FAST ESP document processing framework like any of FAST ESP's built-in stages.
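As a rough illustration of the shape such a stage takes, here is a hypothetical skeleton. The real FAST ESP API defines its own base class, method signatures, and an XML deployment descriptor; every name below is a placeholder.

```python
# Hypothetical skeleton of a custom document processor stage; the
# class name, process() signature, and dict-based document are
# placeholders, not the FAST ESP API.

class UppercaseAttributeStage:
    def __init__(self, attribute="author"):
        # In FAST ESP, configuration like this would come from the
        # stage's deployment descriptor rather than the constructor.
        self.attribute = attribute

    def process(self, document):
        value = document.get(self.attribute)
        if value is not None:
            document[self.attribute] = value.upper()
        return document

stage = UppercaseAttributeStage(attribute="author")
doc = stage.process({"author": "vedant", "title": "FAST ESP"})
# doc["author"] == "VEDANT"; other attributes are untouched
```

Separating the stage logic from its configuration is what lets the same stage be reused, reconfigured, across multiple pipelines.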

Content Distributor

The Document Processing Engine also includes a Content Distributor which is responsible for dispatching incoming documents to the right document processing pipelines by controlling processor servers. It sends the current document to the processor server along with a pipeline request, and the processor server executes the stages in the requested pipeline on the document.
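The dispatch step can be sketched as follows. The collection-to-pipeline mapping and the server interface are hypothetical; in FAST ESP the administrator associates pipelines with collections through configuration.

```python
# Sketch of Content Distributor dispatch (hypothetical names).

COLLECTION_TO_PIPELINE = {"web": "SiteSearch", "news": "NewsSearch"}

def dispatch(document, collection, processor_server):
    # Send the document to a processor server together with the
    # pipeline request; fall back to the Generic pipeline for
    # unknown collections (an assumption of this sketch).
    pipeline = COLLECTION_TO_PIPELINE.get(collection, "Generic")
    return processor_server(document, pipeline)

def fake_processor_server(document, pipeline):
    # Stand-in for a real processor server: records which pipeline
    # its stages were executed under.
    document["pipeline"] = pipeline
    return document

result = dispatch({"docid": "1"}, "news", fake_processor_server)
# result["pipeline"] == "NewsSearch"
```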

Document Log

A document processor stage can use two kinds of logging: the system log and a document log. Normally the stage itself should not use the system log, as it is reserved for messages reflecting system status. Major document processor stage decisions, as well as errors and warnings, are logged in the document log. The framework automatically inserts into the document log the name of the pipeline and of each document processor stage as it executes. The framework also catches stage failures and appends appropriate messages to the document log. Document logs are kept per document; a stand-alone program, doclog, is used to view them.

Because logging can be very detailed and verbose, the advantage of the document log is that the processing of a particular document can be analyzed without enabling attribute tracing and reprocessing the document. For example, the language detector logs which language it detected and whether the detection came from a meta tag or automatically from the content. Another example is the format converter, which logs what technology it is using to convert the input data to text.
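The per-document log behavior described above can be sketched like this. The framework names are hypothetical; the sketch shows stage entries being recorded automatically and failures being caught and logged.

```python
# Sketch of a per-document log (hypothetical names): the runner
# records each stage as it executes and catches stage failures.

def run_with_doclog(document, stages):
    doclog = []
    for name, fn in stages:
        doclog.append(f"entering stage: {name}")
        try:
            document = fn(document)
        except Exception as exc:
            # Mirror the described framework behavior: failures are
            # caught and an appropriate message is appended.
            doclog.append(f"stage {name} failed: {exc}")
    return document, doclog

stages = [("languagedetector", lambda d: {**d, "language": "en"})]
doc, log = run_with_doclog({"body": "hello"}, stages)
# log == ["entering stage: languagedetector"]
```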

Administration Interface

The Document Processing Engine can be monitored through the FAST ESP Administrator Interface. The administration GUI provides functionality for adding and removing document processors, for viewing the active ones and what they are currently doing, and for statistics over past documents that reveal potential pipeline bottlenecks.

The administrator can configure the pipelines as well as the document processor stages in the administration GUI. You can define new document processing pipelines from the interface, as well as specify the document processing stages to be involved and the sequence of execution within each pipeline.

Comments
  • Hi,

    While searching a database record we are getting only partial data of the matching document tag (not all columns / the entire row, that is, the entire <Document> tag data).

    We generate the database content XML (FAST XML) using the JDBC connector and push the FAST XML to a collection using the file traverser.

    For example, we are using the following document.

    Code:

    <documents>
      <document>
        <element name="docid">
          <value><![CDATA[17301]]></value>
        </element>
        <element name="addressid">
          <value><![CDATA[17301]]></value>
        </element>
        <element name="addressline1">
          <value><![CDATA[24, rue Royale]]></value>
        </element>
        <element name="city">
          <value><![CDATA[Saint Ouen]]></value>
        </element>
        <element name="stateprovinceid">
          <value><![CDATA[103]]></value>
        </element>
        <element name="postalcode">
          <value><![CDATA[17490]]></value>
        </element>
        <element name="rowguid">
          <value><![CDATA[248D10DE-9867-4923-932C-258A56D9C4BD]]></value>
        </element>
        <element name="modifieddate">
          <value><![CDATA[2004-01-22T10:09:29]]></value>
        </element>
      </document>
    </documents>

    We are searching for the text "24, rue Royale" and getting the following result in the body property of QueryResult.

    Code:

    ...value><![CDATA[17301]]></value></element><element name="addressline1" ><value><![CDATA[<b>24, rue Royale</b>]]></value></element><element name="city" ><value><![CDATA[Saint Ouen]]></value></element...

    Is there any way to get the entire document tag information while searching the element tag value?

    We need help in this regard.

    Thanks in advance.
