FAST ESP performs linguistic processing:
This helps it provide more relevant search results to users. On the query side, linguistic processing results in a query transformation; on the document side, it results in document enrichment prior to indexing, so that grammatical forms and synonyms are covered.
FAST ESP provides the following linguistic processing features:
Language detection and encoding - FAST ESP automatically recognizes 81 different languages in different encodings. This helps it select the language-specific dictionaries and algorithms during document processing and query processing. The automatic detection can be overridden in cases where the document language is defined by the metadata.
Configurable tokenization and character normalization – ESP detects the white space characters and symbols that separate words from each other but are not relevant to the matching process. Configuration options are available, such as which characters are treated as white space and which characters split tokens. It also provides an open tokenization framework for easy integration of customer-specific tokenizers. A comprehensive Asian language-processing feature provides tokenization and language normalization specific to the Japanese, Chinese, Korean and Thai languages.
Lemmatization - Lemmatization is the mapping of a word to its base form and/or all its other inflectional forms, which allows a query to match grammatical variations of a word. Because a word is matched against all of its inflections, lemmatization improves both recall and precision. For example, a search for ‘car’ matches both ‘car’ and ‘cars’. If enabled on the search front-end, users can turn lemmatization on or off per query.
Normalization of national spelling variations - Character normalization is the replacement of characters or character sequences to improve search results. An example is the mapping of the French é to e. This helps improve recall, but may, in some cases, have a negative impact on precision. Phonetic normalization is normalization using phonetic matching rules and is performed on both the query and the document side, so that terms that are written differently but sound the same can give the same result.
Phrase Detection and Spell Check – This feature detects names/phrases and automatically rewrites queries or provides search tips which can be displayed to the end user. Custom phrase dictionaries may be created and phrasing can be combined with spell checking. FAST ESP supports
Synonyms and Thesauri – FAST ESP enables synonym and spelling variation expansion of queries or indexed documents. Query-side expansion adds synonyms or spelling variations to the query prior to the actual matching. Document-side expansion expands the document with synonyms in a separate part of the index. This expansion can be controlled at query time in the same way as lemmatization. It is also possible to combine the two solutions.
Anti-phrasing – Anti-phrases are parts of a query that do not contribute essentially to the query's meaning, such as "Where can I find". Using a standard dictionary, the anti-phrasing feature removes such phrases: the query ‘where is X’ is transformed to ‘X’, which improves query recall, particularly in the AND (“all words”) query mode.
Entity Extraction - FAST ESP provides an extensive Entity Extraction framework to detect names, locations and other well-defined elements in unstructured text. These entities can be annotated to the indexed documents, but can also be annotated to semantic structures in the text, such as paragraphs or sentences. Such annotation enables normalized matching of entities as well as contextual navigation into detected entities from search results. FAST ESP provides:
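As an illustration of the query-side features listed above (character normalization, anti-phrasing, lemmatization and synonym expansion), here is a minimal Python sketch of a query transformation. It is not FAST ESP code and does not use its API. The dictionaries `ANTI_PHRASES`, `LEMMAS` and `SYNONYMS` and all function names are hypothetical, and the order of the steps is an assumption made for this example.

```python
import unicodedata
from typing import List

# Hypothetical toy dictionaries for illustration only; not FAST ESP data.
ANTI_PHRASES = ("where can i find", "where is")
LEMMAS = {"cars": "car"}
SYNONYMS = {"car": ["automobile"]}


def normalize_chars(text: str) -> str:
    """Lowercase the text and strip combining accents (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_anti_phrases(text: str) -> str:
    """Drop a leading anti-phrase such as 'where is' if one is present."""
    for phrase in ANTI_PHRASES:
        if text.startswith(phrase + " "):
            return text[len(phrase) + 1:]
    return text


def transform_query(query: str) -> List[List[str]]:
    """Return, for each remaining query token, the list of terms to match."""
    text = remove_anti_phrases(normalize_chars(query.strip()))
    result = []
    for token in text.split():
        base = LEMMAS.get(token, token)          # map to base form if known
        result.append([base] + SYNONYMS.get(base, []))  # add synonyms
    return result


if __name__ == "__main__":
    print(transform_query("Where can I find cars"))
    print(transform_query("Where is café"))
```

Each inner list holds the alternatives for one query token, so a search engine could treat the alternatives as OR terms while still requiring every token in the AND ("all words") mode. This sketch reduces each word to a base form only; expanding to all inflectional forms, as described above, would need a fuller dictionary.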