Continuing the discussion on document processing, this post focuses on the document processing engine in FAST ESP.
The Document Processing Engine processes documents through customizable document processing pipelines. The engine hosts multiple pipelines; every incoming document is sent through a specified pipeline, and each pipeline consists of multiple document processing stages.
Let's see what these core components are.
Document Processor Stage
A document processing stage performs one particular document processing task and can modify, remove, or add elements of a document. It takes one or more document elements as input, and its output is new or modified elements that may be processed further. Because each stage focuses on one particular area of document processing, stages can be reused in a multitude of settings and pipelines.
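The stage-and-pipeline model above can be sketched in a few lines of Python. This is a conceptual illustration only, not the FAST ESP API: the class names and the element representation are assumptions made for the example.

```python
# Hypothetical sketch (not the actual FAST ESP API): a stage transforms
# document elements, and a pipeline runs its stages sequentially.

class Document:
    """A document as a mutable set of named elements."""
    def __init__(self, elements):
        self.elements = dict(elements)

class LowercaseTitleStage:
    """Example stage: modifies the 'title' element, if present."""
    def process(self, doc):
        if "title" in doc.elements:
            doc.elements["title"] = doc.elements["title"].lower()
        return doc

class Pipeline:
    """Runs its stages in order; each stage sees the previous stage's output."""
    def __init__(self, stages):
        self.stages = stages
    def process(self, doc):
        for stage in self.stages:
            doc = stage.process(doc)
        return doc

pipeline = Pipeline([LowercaseTitleStage()])
doc = pipeline.process(Document({"title": "FAST ESP", "body": "..."}))
print(doc.elements["title"])  # fast esp
```

Because each stage only depends on the elements it reads, the same stage object can be reused in any pipeline whose earlier stages produce those elements.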
FAST ESP ships with a set of document processor stages as well as pre-configured instances of them. Customers can write new document processor stages, either processing documents from scratch or leveraging the output of the standard FAST ESP stages. For example, a custom stage performing language-specific operations may leverage the automatic language detector shipped with FAST ESP.
Document Processing Pipeline
A pipeline is a sequence of document processor stages that are executed in order. The system administrator associates a pipeline with a collection. A pipeline can be broken into three phases:
Example of a Pipeline
A typical document processing pipeline for information retrieved from the Internet consists of the following stages:
Some things to note here are:
FAST ESP ships with a number of different pipelines, each targeting a specific usage scenario. Some examples are given below; the product documentation has the complete list and guidance on where each should be used.
Basic support for web content and documents
Additional support for Site Search and other web search applications that need relevance from link analysis.
Additional support for News Search and other applications that need search precision and drill-down on extracted entities.
Detects and marks up selected semantic and structural entities for increased search precision. Includes an extensive set of entity extractors.
A performance-optimized contextual search pipeline which includes only a basic set of entity extractors.
Pipelines can be customized in two broad ways:
FAST ESP provides a Python-based document processor API for creating custom document processor stages. A custom stage consists of a Python script and a deployment descriptor, and can be plugged into the FAST ESP document processing framework like any of FAST ESP's built-in stages.
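A custom stage might look roughly like the sketch below. The class name, the metadata attributes standing in for the deployment descriptor, and the element names are all hypothetical; the real FAST ESP docproc API and descriptor format differ and are covered in the product documentation.

```python
# Conceptual sketch of a custom document processor stage; the names
# used here are illustrative, not the actual FAST ESP docproc API.

class Document:
    """Minimal stand-in for the framework's document object."""
    def __init__(self, elements):
        self.elements = dict(elements)

class LanguageTagger:
    """Custom stage that leverages an upstream language detector:
    it copies the primary detected language into its own element."""

    # Stand-in for the deployment descriptor: in a real deployment,
    # which elements the stage reads and writes is declared in a
    # separate descriptor file registered with the framework.
    INPUT_ELEMENTS = ("languages",)
    OUTPUT_ELEMENTS = ("language",)

    def process(self, doc):
        detected = doc.elements.get("languages", "")
        # Keep only the primary language, defaulting to "unknown".
        doc.elements["language"] = detected.split()[0] if detected else "unknown"
        return doc

doc = LanguageTagger().process(Document({"languages": "fr en"}))
```

The point of the descriptor-style metadata is that the framework can verify, before running a pipeline, that each stage's declared inputs are produced by the stages ahead of it.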
The Document Processing Engine also includes a Content Distributor which is responsible for dispatching incoming documents to the right document processing pipelines by controlling processor servers. It sends the current document to the processor server along with a pipeline request, and the processor server executes the stages in the requested pipeline on the document.
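The dispatch flow just described can be summarized as follows. Again this is a conceptual sketch under assumed names (ProcessorServer, ContentDistributor), not FAST ESP internals: the distributor pairs each document with a pipeline request, and the processor server executes the requested pipeline's stages.

```python
# Conceptual sketch of content distribution; class names are assumptions.

class ProcessorServer:
    """Executes the stages of a named pipeline on a document."""
    def __init__(self, pipelines):
        self.pipelines = pipelines  # pipeline name -> list of stage callables
    def execute(self, doc, pipeline_name):
        for stage in self.pipelines[pipeline_name]:
            doc = stage(doc)
        return doc

class ContentDistributor:
    """Sends each incoming document to a processor server
    together with a pipeline request."""
    def __init__(self, server):
        self.server = server
    def dispatch(self, doc, pipeline_name):
        return self.server.execute(doc, pipeline_name)

distributor = ContentDistributor(ProcessorServer({
    "webcluster": [lambda d: {**d, "processed": True}],
}))
result = distributor.dispatch({"title": "example"}, "webcluster")
```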
A document processor stage can use two kinds of logging: the system log and the document log. Normally a stage should not write to the system log, as it is reserved for messages reflecting system status. Major stage decisions, as well as errors and warnings, go to the document log. The framework automatically records in the document log the name of the pipeline and of each document processor stage as it is executed; it also catches stage failures and appends appropriate messages. The document log is kept per document, and a stand-alone program, doclog, is used to view it.
Because the document log can be very detailed and verbose, the processing of a particular document can be analyzed without having to enable attribute tracing and reprocess the document. For example, the language detector logs which language it detected and whether the detection came from a meta tag or automatically from the content. Similarly, the format converter logs which technology it used to convert the input data to text.
The Document Processing Engine can be monitored through the FAST ESP Administrator Interface. The administration GUI provides functionality for adding and removing document processors, viewing the active ones and what they are currently doing, and statistics over past documents, showing potential pipeline bottlenecks.
The administrator can configure the pipelines as well as the document processor stages in the administration GUI. You can define new document processing pipelines from the interface, as well as specify the document processing stages to be involved and the sequence of execution within each pipeline.
When searching database records, we are getting only partial data from the matching document (not all columns / the entire row, that is, the entire <Document> tag data).
We generate database content XML (FAST XML) using the JDBC connector and push the FAST XML to a collection using the file traverser.
We are using the following document element:
<value><![CDATA[24, rue Royale]]></value>
We search for the text "24, rue Royale" and get the following result in the body property of the QueryResult:
...value><![CDATA]></value></element><element name="addressline1" ><value><![CDATA[<b>24, rue Royale</b>]]></value></element><element name="city" ><value><![CDATA[Saint Ouen]]></value></element...
Is there any way to get the entire <Document> tag information when searching on an element tag value?
We need help in this regard.
Thanks in advance.