Vedant Kulshreshtha

I have no special talents. I am only passionately curious.

FAST ESP: Different ways to retrieve Content

Content retrieval is done very differently in FAST than in SharePoint 2007. FAST ESP may retrieve content from data sources using two broad approaches:

1) Content Pull: this approach leverages content connectors to retrieve information via standard APIs or interfaces provided by the source content repositories. This is the core technology of most search solutions and covers retrieval of file-server-based documents, web-based information, databases and other enterprise applications. The content connectors do not require integration programming against the target data repositories.

2) Content Push: this approach requires that the data repositories, applications or messaging middleware send the data directly to FAST ESP via its Content API. This avoids the latency of crawling, but it requires a closer coupling between the content application and the search engine. Multiple programmatic interfaces (.NET, Java, C++, XML-RPC) are available for pushing content.
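
As a rough illustration of the push model, here is a minimal Java sketch. The ContentFeeder interface and its method names are hypothetical stand-ins for the idea, not the actual FAST ESP Content API classes.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical stand-in for a push-style feed. The interface and method names
    // below are illustrative only; they are not the actual FAST ESP Content API.
    public class PushExample {

        interface ContentFeeder {
            void addDocument(String documentId, Map<String, String> fields);
            void commit();
        }

        public static void main(String[] args) {
            // Stub implementation that just logs what would be pushed.
            ContentFeeder feeder = new ContentFeeder() {
                public void addDocument(String id, Map<String, String> fields) {
                    System.out.println("Pushing " + id + " -> " + fields);
                }
                public void commit() {
                    System.out.println("Batch committed for indexing");
                }
            };

            // The source application pushes the document the moment it changes,
            // instead of waiting for a connector to crawl it.
            Map<String, String> fields = new HashMap<>();
            fields.put("title", "Quarterly report");
            fields.put("body", "Full text extracted by the source application");
            feeder.addDocument("erp://documents/2009/q1-report", fields);
            feeder.commit();
        }
    }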

Content Connectors

A content connector is a program that extracts content from some source system, maps the content from the source document model to the document model of FAST ESP, and feeds the documents to FAST ESP for indexing. FAST ESP ships with several commonly used connectors like:

  • Enterprise Crawler
  • File Traverser
  • Database Connector

Other connectors available from FAST include: Microsoft Content Management Server (MCMS) 2002, Documentum, SharePoint Portal Server (SPS) 2003, Microsoft Office SharePoint Server (MOSS) 2007, StarTeam, CaliberRM, Meridio, Oracle Content Server...

Connectors available from FAST Partners include: Kapow, Vignette VCM, OpenText LiveLink, Interwoven TeamSite, FileNet P8 Content Manager, SAP EP Connector, and SAP PLM...

The application-specific connectors mentioned above have various useful properties, one of which is that they recognize the source system's security model. This enables security trimming of the search results, i.e. users only see results for information that they have access to in the source system.
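
For illustration, the sketch below shows the principle behind security trimming: each hit carries the access information captured by the connector, and results are filtered against the groups of the user running the query. The classes, fields and group names are hypothetical and only illustrate the concept.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;

    // Illustrative only: the connector records each document's access control
    // information, and hits are trimmed against the searching user's groups.
    public class SecurityTrimmingSketch {
        static class Hit {
            final String url;
            final Set<String> allowedGroups;
            Hit(String url, Set<String> allowedGroups) {
                this.url = url;
                this.allowedGroups = allowedGroups;
            }
        }

        static List<Hit> trim(List<Hit> hits, Set<String> userGroups) {
            List<Hit> visible = new ArrayList<>();
            for (Hit h : hits) {
                // Keep the hit only if the user belongs to at least one allowed group.
                for (String group : h.allowedGroups) {
                    if (userGroups.contains(group)) {
                        visible.add(h);
                        break;
                    }
                }
            }
            return visible;
        }

        public static void main(String[] args) {
            List<Hit> hits = List.of(
                new Hit("http://intranet/hr/salaries.xls", Set.of("HR")),
                new Hit("http://intranet/news/welcome.html", Set.of("Everyone")));
            // A user who is only in "Everyone" sees just the public document.
            System.out.println(trim(hits, Set.of("Everyone")).get(0).url);
        }
    }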

In addition to all these options, the FAST ESP SDK also provides a Content Connector Toolkit which helps you create your own connector application.
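
A custom connector built with the toolkit essentially follows the extract, map and feed loop described earlier. The skeleton below is a hypothetical Java sketch of that loop; none of the interfaces shown are the actual Content Connector Toolkit classes.

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical skeleton of a custom connector: extract records from a source
    // system, map them onto the ESP document model, and feed them for indexing.
    public class CustomConnectorSketch {

        interface SourceSystem {
            List<Map<String, Object>> fetchChangedRecords();
        }

        interface Feeder {
            void feed(String docId, Map<String, String> espFields);
        }

        // Map source field names onto the field names of the target index profile.
        static Map<String, String> mapToEspDocument(Map<String, Object> record) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("title", String.valueOf(record.get("subject")));
            doc.put("body", String.valueOf(record.get("text")));
            doc.put("lastmodified", String.valueOf(record.get("updated_at")));
            return doc;
        }

        static void run(SourceSystem source, Feeder feeder) {
            for (Map<String, Object> record : source.fetchChangedRecords()) {
                feeder.feed(String.valueOf(record.get("id")), mapToEspDocument(record));
            }
        }

        public static void main(String[] args) {
            SourceSystem stubSource = () -> List.of(
                Map.of("id", "42", "subject", "Release notes", "text", "...", "updated_at", "2009-04-23"));
            Feeder printingFeeder = (id, fields) -> System.out.println(id + " -> " + fields);
            run(stubSource, printingFeeder);
        }
    }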

Enterprise Crawler

Content on web sites and web applications can be retrieved using the FAST Enterprise Crawler. The Crawler scans specified web sites, follows hyperlinks, extracts the desired information and detects duplicates. It interfaces directly with the Content API to submit the content. Document processing then converts the HTML into structured data as defined by the web representation.

Multiple web domains from Intranet, Extranet or Internet can be specified, each with an individually configured refresh rate, MIME-type support, etc. Parts of web domains can be included/excluded from the crawl using regular expression based configuration. The Crawler supports incremental crawling, dynamic pages, entitled content (cookie, SSL, password), HTTP 1.0/1.1, FTP, frames, Macromedia Flash content, robots.txt and meta robots tags.

Intelligent loop detection keeps the crawler from repeatedly traversing the same page. During incremental crawling, the Crawler can be configured to focus on retrieving new content only, or detecting modified or deleted items in previously retrieved content.
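
The sketch below illustrates two of the building blocks mentioned above, regular-expression-based include/exclude rules and a visited set for loop detection, in plain Java. The patterns and URLs are invented examples, not the Enterprise Crawler's configuration syntax.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.regex.Pattern;

    // Illustrative crawler filter: include/exclude rules plus a visited set
    // that keeps the crawl from traversing the same page repeatedly.
    public class CrawlFilterSketch {
        private final List<Pattern> include = List.of(Pattern.compile("^http://intranet\\.example\\.com/.*"));
        private final List<Pattern> exclude = List.of(Pattern.compile(".*/archive/.*"));
        private final Set<String> visited = new HashSet<>();

        boolean shouldFetch(String url) {
            boolean included = include.stream().anyMatch(p -> p.matcher(url).matches());
            boolean excluded = exclude.stream().anyMatch(p -> p.matcher(url).matches());
            // Set.add() returns false when the URL has already been seen.
            return included && !excluded && visited.add(url);
        }

        public static void main(String[] args) {
            CrawlFilterSketch f = new CrawlFilterSketch();
            System.out.println(f.shouldFetch("http://intranet.example.com/news/1.html"));     // true
            System.out.println(f.shouldFetch("http://intranet.example.com/news/1.html"));     // false, already visited
            System.out.println(f.shouldFetch("http://intranet.example.com/archive/old.html")); // false, excluded
        }
    }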

The Browser Engine is a standalone component which is used by the Enterprise Crawler to extract information from JavaScript and Flash files. A unique JavaScript parser enables the crawler to index dynamic content generated by JavaScript on the client side, and follow JavaScript generated links. The crawler includes the ability to follow hyperlinks and index textual content from Macromedia Flash files.

A multi-node Crawler architecture allows the crawl to scale out across additional hosts as the number of crawled web servers and documents grows.

File Traverser

Files from any reachable file server can be retrieved using the File Traverser. It scans specified file directories on file servers, retrieves content of various formats, and submits it to a collection in the same way as the Enterprise Crawler. More than 400 file types can be processed, including popular document types such as Microsoft Office documents, plain text files and Adobe PDF files.

The file traverser crawls all sub directories starting at a given top directory. It then processes all files that match defined extensions, such as html, pdf and doc, and generates a URL per document based on a given prefix. Documents are then sent to the ESP Content API in configurable batches. The size of the batches is limited by two factors: total file size and number of documents.
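
A rough Java sketch of that traversal and batching logic is shown below. The extension list, batch thresholds and URL prefix are assumptions for illustration only, not File Traverser defaults.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    // Walk a top directory, keep files with matching extensions, build a URL
    // from a prefix, and cut a batch when either the total size or the
    // document-count limit is reached. Thresholds are assumed values.
    public class TraverserSketch {
        static final Set<String> EXTENSIONS = Set.of("html", "pdf", "doc");
        static final long MAX_BATCH_BYTES = 10L * 1024 * 1024; // assumed size limit
        static final int MAX_BATCH_DOCS = 100;                 // assumed count limit

        public static void main(String[] args) throws IOException {
            Path top = Path.of(args.length > 0 ? args[0] : ".");
            String urlPrefix = "file://fileserver/";

            List<Path> candidates;
            try (Stream<Path> files = Files.walk(top)) {
                candidates = files.filter(Files::isRegularFile).collect(Collectors.toList());
            }

            List<String> batch = new ArrayList<>();
            long batchBytes = 0;
            for (Path p : candidates) {
                String name = p.getFileName().toString().toLowerCase();
                int dot = name.lastIndexOf('.');
                if (dot < 0 || !EXTENSIONS.contains(name.substring(dot + 1))) continue;

                batch.add(urlPrefix + top.relativize(p));   // one URL per document
                batchBytes += Files.size(p);

                // Submit a batch as soon as either limit is reached.
                if (batch.size() >= MAX_BATCH_DOCS || batchBytes >= MAX_BATCH_BYTES) {
                    System.out.println("Submitting batch of " + batch.size() + " documents");
                    batch.clear();
                    batchBytes = 0;
                }
            }
            if (!batch.isEmpty()) {
                System.out.println("Submitting final batch of " + batch.size() + " documents");
            }
        }
    }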

You may configure the file traverser with the appropriate authorization levels in order to retrieve entitled content. Several instances of the traverser may be configured with different authorization levels in order to handle multiple collections of data with different entitlements.

Database Connector

FAST ESP provides an index architecture that is well suited for both structured and unstructured information. Integrating the search engine with a relational database is performed for two main reasons:

  • Relational databases are not very efficient for handling large query volumes. Exporting the data to a dedicated search engine may dramatically off-load the database servers
  • Integrating a large number of different data sources into one index and one search bar provides a more convenient search experience

FAST ESP provides connectors for a number of relational database systems such as SQL Server, Oracle, MySQL and DB2. The connectors support flexible indexing of structured content and document attachments.

Database retrieval may be configured using SQL statements, and content from multiple tables or databases may be pre-joined prior to indexing. This enables tailoring of the content schema in the Index to frequent queries. Document attachments may also be indexed together with the database content.
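
As an illustration of the pre-join idea, the sketch below flattens several tables with a single SQL statement and treats each result row as one document. The table names, columns and JDBC connection details are invented; in a real deployment the rows would be submitted through the connector or the Content API rather than printed.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Pre-join the tables that a frequent query would otherwise join at search
    // time; each flattened row becomes one indexable document.
    public class DbRetrievalSketch {
        public static void main(String[] args) throws SQLException {
            String sql =
                "SELECT p.product_id, p.name, c.category_name, s.supplier_name " +
                "FROM products p " +
                "JOIN categories c ON c.category_id = p.category_id " +
                "JOIN suppliers  s ON s.supplier_id = p.supplier_id";

            try (Connection con = DriverManager.getConnection("jdbc:mysql://dbhost/catalog", "user", "secret");
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    // One flattened row per document.
                    System.out.printf("doc id=%s title=%s category=%s supplier=%s%n",
                            rs.getString("product_id"), rs.getString("name"),
                            rs.getString("category_name"), rs.getString("supplier_name"));
                }
            }
        }
    }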

FAST ESP is very scalable and can be sized to index billions of database records if needed. An optimized incremental update feature is also provided that takes advantage of the update notifications provided by commercial databases. In this way FAST ESP requests from the database only the content that is known to have been updated. This approach minimizes the latency from a database table change until the change is reflected in FAST ESP, while imposing minimal load on the database host system.
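
A simple way to picture incremental retrieval is a watermark query that asks only for rows changed since the last successful run, as in the hypothetical sketch below. The column names are assumptions, and the actual connector can rely on the database's own update-notification mechanisms instead of polling.

    import java.time.Instant;

    // Watermark-based incremental pull: only rows modified after the last run
    // are requested, keeping both latency and database load low.
    public class IncrementalPullSketch {
        private Instant lastRun = Instant.EPOCH;

        String nextQuery() {
            String sql = "SELECT product_id, name, last_modified FROM products " +
                         "WHERE last_modified > '" + lastRun + "' ORDER BY last_modified";
            // A real connector would advance the watermark to the largest
            // last_modified value actually fed, after the batch succeeds.
            lastRun = Instant.now();
            return sql;
        }

        public static void main(String[] args) {
            IncrementalPullSketch sketch = new IncrementalPullSketch();
            System.out.println(sketch.nextQuery());
        }
    }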

Comments

  • Hi Vedant,

    This blog helped me a lot, but I am stuck with something related to the crawler.

    Can you please post a blog with screenshots of the Enterprise Crawler configuration?

    I tried the GUI and also started the crawler from the command prompt, but each time it gives me the error message "Could not resolve host: www.##$$$.net; No data record of requested type".

    The pipeline used is the "General Pipeline".

  • I will not be able to post screenshots here.

    Please refer to the "File Traverser Guide" document in the FAST ESP product documentation. The relevant sections in it would be "Operating the File Traverser using the Admin GUI" and "Usage Examples" in "Using the File Traverser" chapter.

  • I have a SharePoint 2007 site.

    I want to fetch that site's content into FAST ESP.

    Can you please tell me the steps for that?

  • Thanks for your information.

    Can we implement all of these content connectors without purchasing additional licenses?

    Thanks

  • Is there documentation or a support blog for the FAST SharePoint 2003 connector that provides specific field information about bugs and workarounds? The product documentation does not cover the kinds of issues we are seeing. For example, we are running into several problems with 2003 content being indexed as the wrong content type, configuration parameters being ignored, etc.
