MOSS 2007 : Duplicate Search Results

How is a duplicate document identified in search results?

Document similarity, for the purpose of identifying duplicates, is based only on a hash of the document's content. No file properties (for example file name, type, author, or create and modify dates) are input to this hash. The MSSDuplicateHashes table in the SSP’s search database holds, for each document, all the 64-bit hashes needed to determine whether one document is a near-duplicate of another. The table is read during a search if duplicate collapsing is enabled. Reference: http://blogs.technet.com/b/harikumh/archive/2008/11/14/some-interesting-facts-about-sharepoint-2007-search.aspx
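To illustrate why copies of a document with different file names or authors can still be treated as duplicates, here is a minimal Python sketch. It is only an illustration: the actual hashing algorithm used by MOSS is not described here, so a truncated SHA-256 is used as a stand-in, and `content_hash64` is a hypothetical helper. It also models exact matches only, whereas the real engine keeps several hashes per document to detect near-duplicates.

```python
import hashlib

def content_hash64(content: bytes) -> int:
    """Hypothetical stand-in: a signed 64-bit hash computed from content only."""
    digest = hashlib.sha256(content).digest()
    # Keep the first 8 bytes so the value fits a SQL Server bigint column.
    return int.from_bytes(digest[:8], "big", signed=True)

# Made-up documents: file properties differ, but doc_a and doc_b share content.
doc_a = {"name": "Budget.docx", "author": "alice", "content": b"Q3 budget figures"}
doc_b = {"name": "Copy of Budget.docx", "author": "bob", "content": b"Q3 budget figures"}
doc_c = {"name": "Budget.docx", "author": "alice", "content": b"Q4 budget figures"}

# Same content, different properties: the hashes match.
same_content = content_hash64(doc_a["content"]) == content_hash64(doc_b["content"])
# Same properties, different content: the hashes differ.
same_name_only = content_hash64(doc_a["content"]) == content_hash64(doc_c["content"])
```

Because the file name and author never reach the hash function, renaming or re-saving a file under another author does not change whether it is treated as a duplicate.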

MSSDuplicateHashes

Reference: http://msdn.microsoft.com/en-us/library/dd956874(office.12).aspx

The MSSDuplicateHashes table stores the identifiers of items that are used for duplicate result removal. For any given item there MUST be exactly 6 rows in the MSSDuplicateHashes table. Each of these 6 rows SHOULD contain an identifier of the item.

The T-SQL syntax for the table is as follows:

TABLE MSSDuplicateHashes(
    DocId              int NOT NULL,
    HashVal            bigint NOT NULL
);

DocId: The unique identifier of an item.

HashVal: The identifier of the item.
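The "exactly 6 rows per item" rule can be checked with a simple grouped query. The sketch below uses Python's sqlite3 module with an in-memory table as a stand-in for the real search database, and the sample rows are made up. The second helper, `docs_sharing_all_hashes`, rests on an assumption that is not taken from the specification: it treats two documents as duplicate candidates only when their hash sets are identical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in, not the real search database
conn.execute(
    "CREATE TABLE MSSDuplicateHashes (DocId INTEGER NOT NULL, HashVal INTEGER NOT NULL)"
)

# Made-up sample data: documents 1 and 2 have six rows each, document 3 only five.
rows = [(1, h) for h in (11, 12, 13, 14, 15, 16)]
rows += [(2, h) for h in (11, 12, 13, 14, 15, 16)]
rows += [(3, h) for h in (21, 22, 23, 24, 25)]
conn.executemany("INSERT INTO MSSDuplicateHashes VALUES (?, ?)", rows)

def docs_violating_row_count(conn, expected=6):
    """Return DocIds whose number of hash rows differs from the expected count."""
    cur = conn.execute(
        "SELECT DocId FROM MSSDuplicateHashes "
        "GROUP BY DocId HAVING COUNT(*) <> ? ORDER BY DocId",
        (expected,),
    )
    return [doc_id for (doc_id,) in cur]

def docs_sharing_all_hashes(conn):
    """Assumption: two documents are duplicate candidates if their hash sets are equal."""
    hashes = {}
    for doc_id, hash_val in conn.execute("SELECT DocId, HashVal FROM MSSDuplicateHashes"):
        hashes.setdefault(doc_id, set()).add(hash_val)
    ids = sorted(hashes)
    return [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:] if hashes[a] == hashes[b]]
```

With the sample data, `docs_violating_row_count(conn)` reports document 3, and `docs_sharing_all_hashes(conn)` pairs documents 1 and 2.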

2.2.5.6 MSSCrawlChangedTargetDocs

Reference: [MS-SRCHTP] Search Topology Protocol Specification, v20100720 (released Tuesday, July 20, 2010), http://download.microsoft.com/download/8/5/8/858F2155-D48D-4C68-9205-29460FD7698F/%5BMS-SRCHTP%5D.pdf

The MSSCrawlChangedTargetDocs table stores all the document identifiers that the crawled documents point to.

The T-SQL syntax for the table is as follows:

TABLE MSSCrawlChangedTargetDocs(
    CrawlId            int NOT NULL,
    DocId              int NOT NULL,
    IsDuplicate        bit NOT NULL
);

CrawlId: A unique identifier of the crawl.

DocId: The identifier of a document that one or more of the other crawled documents link to.

IsDuplicate: A bit that MUST be 1 if the document is a duplicate of another document; otherwise, it MUST be 0.
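As a rough illustration of how the IsDuplicate flag could be summarized per crawl, the following Python sketch again uses an in-memory sqlite3 table as a stand-in for the real database. The sample rows and the `duplicate_counts` helper are hypothetical, and the CHECK constraint is simply one way to model the "MUST be 0 or 1" rule in SQLite, which has no bit type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the search database
conn.execute(
    "CREATE TABLE MSSCrawlChangedTargetDocs ("
    " CrawlId INTEGER NOT NULL,"
    " DocId INTEGER NOT NULL,"
    " IsDuplicate INTEGER NOT NULL CHECK (IsDuplicate IN (0, 1)))"
)

# Made-up sample data: crawl 7 touched three target documents, crawl 8 one.
conn.executemany(
    "INSERT INTO MSSCrawlChangedTargetDocs VALUES (?, ?, ?)",
    [(7, 100, 0), (7, 101, 1), (7, 102, 1), (8, 100, 0)],
)

def duplicate_counts(conn):
    """Map each CrawlId to (number of duplicate rows, total rows)."""
    cur = conn.execute(
        "SELECT CrawlId, SUM(IsDuplicate), COUNT(*) "
        "FROM MSSCrawlChangedTargetDocs GROUP BY CrawlId ORDER BY CrawlId"
    )
    return {crawl_id: (dups, total) for crawl_id, dups, total in cur}

counts = duplicate_counts(conn)
```

Because IsDuplicate is restricted to 0 and 1, summing the column gives the number of duplicate rows directly.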

Database Refactoring Sequence

Reference: http://msdn.microsoft.com/en-ca/library/dd932432(office.12).aspx

For refactoring tasks of type "PropertyStoreCopy", information that is stored in the following tables is copied from the source metadata index to the destination metadata index:

  • MSSDocSdids
  • MSSDefinitions
  • MSSDuplicateHashes
  • MSSDocResults
  • MSSDocProps

These tables are documented in [MS-SQLPQ2]. The source and destination metadata indexes are defined by the SourceComponentID and DestinationComponentID parameters of the refactoring task. Rows that correspond to the documents that satisfy both of the following two conditions MUST be copied:

For refactoring tasks of type "PropertyStoreDelete", information that is stored in the following tables is deleted:

  • MSSDocSdids
  • MSSDefinitions
  • MSSDuplicateHashes
  • MSSDocResults
  • MSSDocProps

These tables are documented in [MS-SQLPQ2]. The metadata index from which information MUST be deleted is defined by the SourceComponentID parameter of the refactoring task. Rows that correspond to the documents that satisfy both of the following two conditions MUST be deleted:

  • The document identifier is in the range defined by the StartDocID and EndDocID parameters of the refactoring task batch.
  • The document distribution identifier is in the set defined by the Refactoring Task Part Result Set returned from proc_MSS_GetRefactoringTask.
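The two conditions above amount to a range check combined with a set-membership check. The Python sketch below shows that selection logic only. The `DocRow` type, its field names, and the sample values are hypothetical, and treating the StartDocID to EndDocID range as inclusive is an assumption, since the text does not say whether the bounds are included.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DocRow:
    doc_id: int           # document identifier
    distribution_id: int  # hypothetical field for the document distribution identifier

def rows_in_batch(rows, start_doc_id, end_doc_id, distribution_ids):
    """Keep rows that satisfy BOTH conditions (range assumed inclusive)."""
    allowed = set(distribution_ids)
    return [
        r for r in rows
        if start_doc_id <= r.doc_id <= end_doc_id and r.distribution_id in allowed
    ]

# Made-up rows and batch parameters.
rows = [DocRow(5, 1), DocRow(10, 2), DocRow(15, 1), DocRow(20, 3), DocRow(25, 1)]
selected = rows_in_batch(rows, start_doc_id=10, end_doc_id=20, distribution_ids={1, 3})
```

Here document 10 is in range but its distribution identifier is not in the set, and documents 5 and 25 are outside the range, so only documents 15 and 20 are selected.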

If you do not want duplicate results to be collapsed behind the “View duplicates” link, the workaround is to disable duplicate removal in the Search Core Results Web Part:

1. Execute a search in order to get to the search results page (by default results.aspx).

2. Choose Site Actions -> Edit Page.

3. Choose Modify Shared Web Part on the Search Core Results Web Part.

4. Uncheck the checkbox “Remove Duplicate Results”.

5. Choose OK and Publish the page.
