How is a duplicate document identified in search results?
Document similarity, for the purpose of identifying duplicates, is based only on a hash of the document's content. No file properties (e.g. file name, type, author, create and modify dates) are input to this hash. The MSSDuplicateHashes table in the SSP’s search database holds, for each document, all the 64-bit hashes needed to determine whether one document is a near-duplicate of another. This table is read during a search if duplicate collapsing is enabled.
Reference: http://blogs.technet.com/b/harikumh/archive/2008/11/14/some-interesting-facts-about-sharepoint-2007-search.aspx
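To illustrate the point that only content feeds the hash, here is a minimal Python sketch. The hash function (SHA-256 truncated to 64 bits), the Document class, and the content_hash helper are all assumptions made for illustration. The actual SharePoint hashing algorithm is not described here, and this sketch only detects identical content, whereas the real system keeps several hashes per document to detect near-duplicates.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Document:
    # Hypothetical stand-in for a crawled item; not a SharePoint type.
    file_name: str
    author: str
    content: bytes


def content_hash(doc: Document) -> int:
    """Return a 64-bit hash computed from the content only.

    SHA-256 truncated to 8 bytes is an assumption for illustration,
    not the algorithm SharePoint uses.
    """
    digest = hashlib.sha256(doc.content).digest()
    return int.from_bytes(digest[:8], "big")


a = Document("report.docx", "Alice", b"Quarterly results")
b = Document("copy of report.docx", "Bob", b"Quarterly results")
c = Document("report.docx", "Alice", b"Quarterly results, revised")

# Same content, different file name and author: the hashes match.
same_content = content_hash(a) == content_hash(b)
# Same file name and author, different content: the hashes differ.
diff_content = content_hash(a) == content_hash(c)
```

Because file_name and author never reach the hash function, renaming or re-attributing a document does not change its hash in this sketch.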
MSSDuplicateHashes
Reference: http://msdn.microsoft.com/en-us/library/dd956874(office.12).aspx
The MSSDuplicateHashes table stores the item identifiers that are used for duplicate result removal. For any given item there MUST be exactly 6 rows in the MSSDuplicateHashes table. Each of these 6 rows SHOULD contain an identifier of the item.
The T-SQL syntax for the table is as follows:
TABLE MSSDuplicateHashes(
DocId int NOT NULL,
HashVal bigint NOT NULL
);
DocId: The unique identifier of an item.
HashVal: The identifier of the item.
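The "exactly 6 rows per item" rule can be checked with a simple grouping query. The sketch below uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the sample rows are made up. Only the table and column names come from the definition above. The GROUP BY / HAVING query is plain SQL and would likely carry over to T-SQL, but that is an assumption and has not been verified against a real search database.

```python
import sqlite3

# In-memory SQLite database standing in for the search database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MSSDuplicateHashes ("
    " DocId INTEGER NOT NULL,"
    " HashVal INTEGER NOT NULL)"
)

# Made-up sample data: DocId 1 has six hashes, DocId 2 has only five.
rows = [(1, h) for h in range(100, 106)] + [(2, h) for h in range(200, 205)]
conn.executemany("INSERT INTO MSSDuplicateHashes VALUES (?, ?)", rows)

# List every item whose row count differs from the required six.
violations = conn.execute(
    "SELECT DocId, COUNT(*) FROM MSSDuplicateHashes"
    " GROUP BY DocId HAVING COUNT(*) <> 6"
    " ORDER BY DocId"
).fetchall()

print(violations)  # [(2, 5)]
```

An empty result would mean every item in the table satisfies the six-row rule.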
2.2.5.6 MSSCrawlChangedTargetDocs
Reference: http://download.microsoft.com/download/8/5/8/858F2155-D48D-4C68-9205-29460FD7698F/%5BMS-SRCHTP%5D.pdf
The MSSCrawlChangedTargetDocs table stores all the document identifiers that the crawled documents point to.
The T-SQL syntax for the table is as follows:
TABLE MSSCrawlChangedSourceDocs(
CrawlId int NOT NULL,
IsDuplicate bit NOT NULL
);
CrawlId: A unique identifier of the crawl.
DocId: The document identifier containing one or more links from the other crawled documents.
IsDuplicate: A bit that MUST be 1 if the document is a duplicate of another document; otherwise, it MUST be 0.
Database Refactoring Sequence
Reference: http://msdn.microsoft.com/en-ca/library/dd932432(office.12).aspx
For refactoring tasks of type "PropertyStoreCopy", information that is stored in the following tables is copied from the source metadata index to the destination metadata index:
These tables are documented in [MS-SQLPQ2]. The source and destination metadata indexes are defined by the SourceComponentID and DestinationComponentID parameters of the refactoring task. Rows that correspond to the documents that satisfy both of the following two conditions MUST be copied:
For refactoring tasks of type "PropertyStoreDelete", information that is stored in the following tables is deleted:
These tables are documented in [MS-SQLPQ2]. The metadata index from which information MUST be deleted is defined by the SourceComponentID parameter of the refactoring task. Rows that correspond to the documents that satisfy both of the following two conditions MUST be deleted:
If you do NOT want the view duplicates behavior in search results, the workaround is to disable it in the Search Core Results Web Part:
1. Execute a search in order to get to the search results page (by default results.aspx).
2. Choose Site Actions -> Edit Page.
3. Choose Modify Shared Web Part on the Search Core Results Web Part.
4. Uncheck the checkbox “Remove Duplicate Results”.
5. Choose OK and Publish the page.