Vedant Kulshreshtha


FAST ESP: Linguistics Processing Features


FAST ESP performs linguistic processing:

  • at the document level – during document processing, and
  • at the query level – during query and result processing

This helps it provide more relevant search results to users. On the query side, linguistic processing results in a query transformation. On the document side, it results in document enrichment prior to indexing, so that grammatical forms and synonyms are covered.

FAST ESP provides the following linguistic processing features:

  • Language detection and encoding
  • Configurable tokenization and character normalization
  • Lemmatization
  • Normalization of national spelling variations
  • Phrase Detection and Spell Check
  • Synonyms and Thesauri
  • Anti-phrasing
  • Entity Extraction

    Language detection and encoding – FAST ESP automatically recognizes 81 different languages in different encodings. This helps it select the language-specific dictionaries and algorithms during document processing and query processing. The automatic detection can be overridden in cases where the document language is defined by the metadata.
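
    To make the idea concrete, here is a minimal sketch of language detection with a metadata override. It is not how FAST ESP implements detection: the stop-word lists, the function name detect_language and the metadata_language parameter are all assumptions made for illustration, and production detectors generally rely on much richer statistics.

```python
# Toy stop-word lists (hypothetical, far smaller than a real language model).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "und", "ist", "das", "nicht"},
    "fr": {"le", "et", "est", "les", "des"},
}


def detect_language(text, metadata_language=None):
    """Guess the language of text; an explicit metadata value always wins."""
    if metadata_language:
        return metadata_language
    words = text.lower().split()
    # Score each language by how many of its stop words occur in the text.
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


guess = detect_language("der Hund ist nicht das Problem")
forced = detect_language("the cat is here", metadata_language="fr")
```

    The detected language would then decide which dictionaries and algorithms the later processing stages use.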

    Configurable tokenization and character normalization – FAST ESP detects the whitespace characters and symbols that separate words from each other and are not relevant to the matching process. Configuration options determine, for example, which characters are treated as whitespace and which characters split tokens. It also provides an open tokenization framework for easy integration of customer-specific tokenizers. A comprehensive Asian language-processing feature provides tokenization and language normalization specific to Japanese, Chinese, Korean and Thai.
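
    As a rough illustration of configurable tokenization, the sketch below splits text on two configurable character sets. The function tokenize and its parameters whitespace_chars and split_chars are hypothetical names, not FAST ESP configuration options.

```python
import re


def tokenize(text, whitespace_chars=" \t\n", split_chars=".,;:!?()[]\"'"):
    """Split text into tokens; characters from both sets separate tokens and are dropped."""
    separators = re.escape(whitespace_chars + split_chars)
    return [tok for tok in re.split(f"[{separators}]+", text) if tok]


default_tokens = tokenize("FAST ESP: search, made simple.")
# Treating the hyphen as a splitting character changes the result.
hyphen_split = tokenize("state-of-the-art", split_chars="-")
hyphen_kept = tokenize("state-of-the-art", split_chars="")
```

    Whether a character such as the hyphen splits tokens decides whether a query for "art" can match "state-of-the-art", which is why this is worth making configurable.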

    Lemmatization - Lemmatization is the mapping of a word to its base form and/or all of its other inflectional forms. It allows a query to match grammatical variations of its words, which improves both recall and precision. For example, a search for ‘car’ matches both ‘car’ and ‘cars’. If enabled on the search front-end, users can turn lemmatization on or off per query.
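
    The ‘car’/‘cars’ example can be sketched with a small dictionary-based lemmatizer. The lexicon, the function names and the lemmatization flag are assumptions for illustration only; they stand in for the language-specific dictionaries mentioned above.

```python
# Hypothetical toy lexicon: base form -> known inflected forms.
LEXICON = {
    "car": ["car", "cars"],
    "mouse": ["mouse", "mice"],
}
FORM_TO_LEMMA = {form: lemma for lemma, forms in LEXICON.items() for form in forms}


def lemmatize(word):
    """Map a word to its base form; unknown words are returned lowercased."""
    word = word.lower()
    return FORM_TO_LEMMA.get(word, word)


def word_matches(query_word, document_words, lemmatization=True):
    """Check whether a query word matches any document word, optionally via lemmas."""
    if not lemmatization:
        return query_word.lower() in {w.lower() for w in document_words}
    target = lemmatize(query_word)
    return any(lemmatize(w) == target for w in document_words)


with_lemmas = word_matches("car", ["two", "cars"])
without_lemmas = word_matches("car", ["two", "cars"], lemmatization=False)
```

    The lemmatization flag mirrors the per-query on/off switch described above: with it off, only the exact form matches.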

    Normalization of national spelling variations - Character normalization is the replacement of characters or character sequences to improve search results. An example is the mapping of the French é to e. This helps improve recall but may, in some cases, have a negative impact on precision. Phonetic normalization applies phonetic matching rules and is performed on both the query and the document side, so that terms that are written differently but sound the same can give the same result.
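
    The é to e mapping can be sketched with Python's standard unicodedata module. This shows one common way to strip accents, not FAST ESP's own normalization rules, and the function name strip_accents is an assumption.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


normalized = strip_accents("café")
# Two different words collapse to the same form, which costs some precision.
collision = strip_accents("résumé") == strip_accents("resume")
```

    The second example shows the trade-off mentioned above: recall improves because both spellings match, but distinct words can no longer be told apart.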

    Phrase Detection and Spell Check – This feature detects names and phrases and either rewrites queries automatically or provides search tips that can be displayed to the end user. Custom phrase dictionaries may be created, and phrasing can be combined with spell checking. FAST ESP supports:

    • simple spell check: check against language-specific dictionaries,
    • phonetic spell check: check for phonetic similarities, and
    • advanced spell check: check against a custom list of words and phrases
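
    A dictionary-based check of the simple or advanced kind can be sketched with the standard difflib module. The word list, the function name suggest and the similarity cutoff are assumptions; phonetic matching is not covered by this sketch.

```python
import difflib

# Hypothetical word list standing in for a language-specific or custom dictionary.
DICTIONARY = ["search", "engine", "linguistics", "query"]


def suggest(word, dictionary=DICTIONARY, cutoff=0.75):
    """Return the closest dictionary word, or None if the word is known or nothing is close."""
    if word in dictionary:
        return None
    candidates = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None


correction = suggest("serach")
```

    A front-end could use the returned value either to rewrite the query automatically or to display it as a search tip.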

    Synonyms and Thesauri – FAST ESP enables synonym and spelling variation expansion of queries or indexed documents. The query-side expansion adds synonyms or spelling variations to the query prior to the actual matching. The document-side expansion expands the document with synonyms in a separate part of the index. It may be controlled in the same way as lemmatization at query time. It is also possible to combine the two solutions.
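
    Query-side expansion can be sketched as follows. The synonym table and the names expand_query and query_matches are hypothetical; each query term becomes a group of alternatives, and a document matches when every group is satisfied.

```python
# Hypothetical synonym dictionary: term -> alternatives added at query time.
SYNONYMS = {"car": ["automobile"], "tv": ["television"]}


def expand_query(terms, synonyms=SYNONYMS):
    """Turn each query term into a group containing the term and its synonyms."""
    return [[term] + synonyms.get(term.lower(), []) for term in terms]


def query_matches(expanded_query, document_terms):
    """AND across groups, OR within a group."""
    doc = {t.lower() for t in document_terms}
    return all(any(alt.lower() in doc for alt in group) for group in expanded_query)


expanded = expand_query(["used", "car"])
hit = query_matches(expanded, ["used", "automobile", "dealer"])
```

    Document-side expansion would instead add the alternatives to the document before indexing, leaving the query untouched.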

    Anti-phrasing – Anti-phrases are parts of a query that do not contribute essentially to the query's meaning, such as "Where can I find". Using a standard dictionary, the anti-phrasing feature removes such phrases: the query ‘where is X’ is transformed to ‘X’. This improves query recall, particularly in the AND (“all words”) query mode.
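
    The ‘where is X’ to ‘X’ transformation can be sketched with a small phrase list. The list contents and the function name remove_anti_phrases are assumptions, not the standard dictionary shipped with the product.

```python
import re

# Hypothetical anti-phrase dictionary.
ANTI_PHRASES = ["where can i find", "where is", "what is"]


def remove_anti_phrases(query, anti_phrases=ANTI_PHRASES):
    """Remove known anti-phrases (case-insensitive) and tidy up the whitespace."""
    cleaned = query
    # Try longer phrases first so a short phrase cannot break up a longer one.
    for phrase in sorted(anti_phrases, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        cleaned = re.sub(pattern, " ", cleaned, flags=re.IGNORECASE)
    return " ".join(cleaned.split())


rewritten = remove_anti_phrases("Where can I find cheap flights")
```

    In AND mode the original query would require documents to contain "where", "can", "I" and "find" as well, which is why removing the phrase improves recall.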

    Entity Extraction - FAST ESP provides an extensive entity extraction framework for detecting names, locations and other well-defined elements in unstructured text. These entities can be annotated on the indexed documents, and also on semantic structures within the text, such as paragraphs or sentences. Such annotation enables normalized matching of entities as well as contextual navigation into detected entities from search results. FAST ESP provides:

    • Language-dependent entity extractors for person, company, location and date
    • Language-independent entity extractors for price, measure, uppercase, acronym, email, filename, ISBN, university, url, newspaper (US), phone, zip code, ticker, date/time (ISO) and quotation
    • Noun Phrase Extraction Document Processor to extract phrases like “competitive advantage”, “key driver”, “sellers market”
    • Tools to create domain-specific entity extraction rules and dictionaries. Entity extraction can be based on dictionary lookup, statistical methods, rules, or a combination of these
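
    A combination of rule-based and dictionary-based extraction can be sketched as below. The regular expressions are deliberately simple, and the LOCATIONS set and the function name extract_entities are assumptions for illustration; none of this reflects FAST ESP's actual extractors.

```python
import re

# Simplified rule-based patterns (assumed, not exhaustive).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "url": re.compile(r"https?://[^\s<>\"']+"),
}
# Hypothetical dictionary for lookup-based extraction.
LOCATIONS = {"Oslo", "Boston"}


def extract_entities(text):
    """Return (label, value) pairs found by the patterns, then by dictionary lookup."""
    entities = []
    for label, pattern in PATTERNS.items():
        entities += [(label, m.group()) for m in pattern.finditer(text)]
    for word in re.findall(r"\w+", text):
        if word in LOCATIONS:
            entities.append(("location", word))
    return entities


found = extract_entities("Contact info@example.com or visit http://example.com/docs from Oslo.")
```

    The resulting pairs are the kind of annotation that could be attached to a document, or to a sentence or paragraph, before indexing.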