Vedant Kulshreshtha

I have no special talents. I am only passionately curious.

FAST ESP: Documents and Document Processing

Besides the differences in architecture and feature set, moving from SharePoint to FAST ESP also means learning different terminology. The term “document” is just one of many examples. This post is about what a document is in FAST ESP. In the following post, I will cover what “document processing” means and how it is done. Let’s start with two important concepts:

  • Content: Data that has not yet been submitted to the FAST ESP system is referred to as “content”. Examples of content are MS Office files, PDF files, HTML pages, or database entries.
  • Document: In FAST ESP a “document” denotes a single searchable entity. It is used for the unit of content that the matching engines (i.e. search engine and the alert engine) report in the result.

Roughly speaking, “content” is what gets submitted to the FAST ESP system. That content is then converted into one or more “documents”, which after processing go into the search index and become available for search.

What’s in a Document?

Expanding a bit on the definition given above, a “document” represents the “content” entity as a set of data elements or named attributes, which are also called document elements. A document may, for example, have a title and body element, each containing the title and body text respectively. FAST ESP supports textual and numeric metadata attributes, including normalized time/date and location information. A document might contain information extracted from the original data source (content) combined with additional metadata information (usually added to improve search relevancy). Each document has a document identifier that is unique across the entire set of documents handled by the FAST ESP system.

The concept of a document is independent of the type of content. For example, if the source is a database table, each row of information from a table or view may map to a searchable document. For both search and filtering, each document is treated as one searchable item and is listed as such in the result list.
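To make the row-to-document mapping concrete, here is a minimal sketch in Python. It is not the FAST ESP API: the function name `rows_to_documents`, the `products` table, and the `table/key` identifier scheme are all assumptions made for illustration. It only shows the idea that each row becomes one document made of named elements plus a unique identifier.

```python
import sqlite3


def rows_to_documents(conn, table, key_column):
    """Map each row of a table to one document (hypothetical helper).

    A document here is just a dict with a unique "id" and a dict of
    named elements, one element per column.
    """
    # Table and column names are interpolated directly; fine for a sketch,
    # but not something to do with untrusted input.
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY {key_column}")
    columns = [description[0] for description in cursor.description]
    documents = []
    for row in cursor:
        elements = dict(zip(columns, row))
        # Assumed ID scheme: table name plus key value, so identifiers
        # stay unique even when several sources feed the same index.
        doc_id = f"{table}/{elements[key_column]}"
        documents.append({"id": doc_id, "elements": elements})
    return documents


# Illustrative data source: an in-memory table with two rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT PRIMARY KEY, title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [("A1", "Desk lamp", 19.5), ("B2", "Office chair", 120.0)],
)

docs = rows_to_documents(conn, "products", "sku")
for doc in docs:
    print(doc["id"], doc["elements"])
```

Each of the two rows ends up as one searchable unit, regardless of the fact that the source was a database rather than a file.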

Document Processing

When content is converted to documents, the data in those documents is still raw and not optimized for indexing, search, or relevance. Document processing runs these documents through multiple stages that transform the data into an index of words, terms, variables, and values that is optimized for search.

Document processing involves reading document element values, doing a computation based on these values, and adding or modifying document elements as output.
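The read, compute, and write pattern described above can be sketched as a chain of small functions. This is an illustration only, not how FAST ESP implements or exposes its stages: the stage names (`normalize_title`, `tokenize_title`, `count_body_words`), the element names, and `run_pipeline` are all hypothetical.

```python
def normalize_title(doc):
    """Modify an existing element: collapse extra whitespace in the title."""
    doc["title"] = " ".join(doc.get("title", "").split())
    return doc


def tokenize_title(doc):
    """Read the title element and add a new element with lowercase tokens."""
    doc["title_tokens"] = doc.get("title", "").lower().split()
    return doc


def count_body_words(doc):
    """Read the body element and add a numeric metadata element."""
    doc["body_word_count"] = len(doc.get("body", "").split())
    return doc


def run_pipeline(doc, stages):
    """Pass the document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


# A raw document as it might look right after conversion from content.
raw = {
    "id": "doc-1",
    "title": "  FAST   ESP Documents ",
    "body": "A document is a single searchable entity.",
}

# Work on a copy so the raw document stays available for comparison.
processed = run_pipeline(dict(raw), [normalize_title, tokenize_title, count_body_words])
print(processed)
```

Each stage only reads and writes document elements, so stages can be reordered or swapped without the others needing to know, as long as the elements they depend on are present.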
