Content Index Grows from 50% to 167% of Database Size

Symptoms

1. Content Index (CI) files are usually 5-10% of the size of a mailbox database in Exchange 2007 and Exchange 2010.

For further information, click on this link:

Understanding Exchange Search
https://technet.microsoft.com/en-us/library/bb232132(EXCHG.80).aspx

2. When the number and size of the CI files grow too large, a Master Merge is performed automatically by design to consolidate the indexes and shrink the overall size of the Content Index.

3. During a Master Merge of the Content Index files, the Catalog folder containing the content index files typically grows to twice its normal size, or up to 20% of the mailbox database size.

4. You may see the Catalog folder grow well beyond 20%, to between 50% and 167% of the mailbox database size, during the Master Merge.

5. This issue has only been reported for Exchange 2007 RTM, Exchange 2007 SP1, and Exchange 2007 SP2. As of 8-8-11, there have been no reported cases of the Content Index growing larger than 20% of the database size in Exchange Server 2007 Service Pack 3 or Exchange Server 2010.

If you find a case in which the Content Index grows larger than 20% of the database size in Exchange Server 2007 Service Pack 3 or Exchange Server 2010, please check to see if Loadsim for Exchange 2007 or Loadgen for Exchange 2010 was used to generate the mail (see below). If not, please reply to this post.
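The percentages above can be checked against a real server with a few lines of script. The following is a minimal Python sketch, not an official tool: the function names and the labels it returns are illustrative assumptions, and the two thresholds are simply the 10% and 20% figures quoted in this post. It adds up the files under a Catalog folder and compares the total with the size of the database.

```python
import os

# Thresholds taken from the figures quoted in this post.
TYPICAL_MAX = 0.10       # upper end of the usual 5-10% range
MASTER_MERGE_MAX = 0.20  # upper bound quoted for a Master Merge


def dir_size(path):
    """Return the total size in bytes of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def classify(catalog_bytes, db_bytes):
    """Return (ratio, label) for a catalog size relative to its database size.

    The labels are hypothetical names used only in this sketch.
    """
    if db_bytes <= 0:
        raise ValueError("db_bytes must be positive")
    ratio = catalog_bytes / db_bytes
    if ratio <= TYPICAL_MAX:
        label = "typical"
    elif ratio <= MASTER_MERGE_MAX:
        label = "master-merge range"
    else:
        label = "larger than expected"
    return ratio, label
```

To use it, pass `dir_size()` of the Catalog folder and `os.path.getsize()` of the database file to `classify()`; both paths depend on your own server layout.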

Cause

1. This issue has only been reported for MSSearch 3.0. Exchange 2007 RTM, Exchange 2007 SP1, and Exchange 2007 SP2 use MSSearch 3.0.

2. Exchange 2007 SP3 and greater and Exchange 2010 RTM and greater use MSSearch 3.1.

3. The cause of the issue in MSSearch 3.0 is that Master Merges are not scheduled.

4. This issue has not been reported for MSSearch 3.1.

5. As of 8-8-11, there have been no reported cases of the Content Index growing larger than 20% of the database size in Exchange Server 2007 Service Pack 3 or Exchange Server 2010.


Resolution

1. Upgrade to MSSearch 3.1 - either upgrade to Exchange Server 2007 Service Pack 3 or greater, or upgrade to Exchange Server 2010 RTM or greater.

2. As of 8-8-11 (except when a database is generated using Loadsim in Exchange 2007 or Loadgen in Exchange 2010 - see below), there have been no reported cases of the Content Index growing larger than 20% of the database size in Exchange Server 2007 Service Pack 3 or Exchange Server 2010.


Loadgen and Loadsim Exception

NOTE: There is now an exception to the rule of 5% to 10% above - the issue has now been reported in Exchange 2010 SP1 when using Loadgen for Exchange 2010 to create mailboxes and send mail. Using Loadgen can cause CatalogData sizes up to 40% of the database size. This would also apply to databases created with Loadsim for Exchange 2007 SP3 or greater.

Results of Testing Content Index Generation for Exchange 2010

1. We generated mail using Loadgen for Exchange 2010, then dumped out some of the content indexing files and sorted the entries alphabetically.

Here is an example – these are words found in the index:

1ea26 : 69b9 100a : 'acvrathzwimqhj8'
1ea26 : 6aa8 100a : 'acvratia8m'
1ea26 : 6b55 100a : 'acvratiaopjygi83r92qvgch6'
1ea26 : 6ce2 100a : 'acvraticmxoapjrar'
1ea26 : 6dff 100a : 'acvratieq8lohlemt8ceww1cuenfag'
1ea26 : 6fec 100a : 'acvratietgm7vxqks'
1ea26 : 70f9 100a : 'acvratiha6gyjydiryszimlorpci6w'
1ea26 : 72e6 100a : 'acvratihakmy'
1ea26 : 7393 100a : 'acvratiho4stib4et863w0ej0b5akq'
1ea26 : 7570 100a : 'acvratihutgxpgnsrngvakjzhjzlea'
1ea26 : 774d 100a : 'acvratij6fa5a4qqsaccv7'
1ea26 : 78ba 100a : 'acvratijc4kyuqunsjmtxjtps066zg'
1ea26 : 7a97 100a : 'acvratije5hqnmi1qayaku'
1ea26 : 7bf4 100a : 'acvratije7qzx0c9reu04kayls05dg'

2. The default Loadgen word mix is an artificial Latin mix that has very few unique words concatenated together in various ways to make random sentences. This is not a good representative mix for content indexing.

3. Loadgen does generate attachments, but only .txt attachments, which are always searchable. In production, many attachments such as .jpg and .gif are not searchable.

4. Based on testing and based on experience on Microsoft Exchange 2010 servers in production, Loadgen simulations do not have a varied enough message mix to provide a realistic catalog size.

5. Loadgen based catalog sizes can be as much as 30-40% or greater of the database size, whereas in production we consistently see 5-10%.

6. The 5-10% CatalogData folder size estimate is based partially on an average of these elements:

Distribution of words
Distribution of phrases
Distribution of attachment types

7. Loadgen does not use an average distribution for any of these things.

8. This is why Loadgen based catalog sizes can be as much as 30-40% or greater of the database size, well above the 5-10% consistently found in production.
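If you want to look at the word mix in a dump like the one in item 1 yourself, a small script can pull out the indexed terms and summarize them. The Python sketch below is only an illustration under stated assumptions: it assumes each dump line has the layout shown above (a hex field, a colon, two hex fields, a colon, then the term in single quotes), it does not interpret the hex fields because their meaning is not described in this post, and the function names are hypothetical.

```python
import re

# Assumed line layout, based only on the excerpt in this post:
#   <hex> : <hex> <hex> : '<term>'
DUMP_LINE = re.compile(
    r"^\s*([0-9a-f]+)\s*:\s*([0-9a-f]+)\s+([0-9a-f]+)\s*:\s*'([^']*)'\s*$"
)


def parse_terms(lines):
    """Return the quoted term from every line that matches the assumed layout."""
    terms = []
    for line in lines:
        match = DUMP_LINE.match(line)
        if match:
            terms.append(match.group(4))
    return terms


def summarize(terms):
    """Return simple statistics about a list of indexed terms."""
    return {
        "terms": len(terms),
        "distinct": len(set(terms)),
        "avg_length": sum(len(t) for t in terms) / len(terms) if terms else 0.0,
        "sorted": terms == sorted(terms),
    }


# First three lines of the excerpt shown in item 1.
SAMPLE = [
    "1ea26 : 69b9 100a : 'acvrathzwimqhj8'",
    "1ea26 : 6aa8 100a : 'acvratia8m'",
    "1ea26 : 6b55 100a : 'acvratiaopjygi83r92qvgch6'",
]

if __name__ == "__main__":
    print(summarize(parse_terms(SAMPLE)))
```

Running the same summary over a dump from a production catalog and a dump from a Loadgen catalog is one way to compare the distribution of words between the two.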

Bob Want, Senior Support Escalation Engineer, Enterprise Communications Services, Microsoft

Mike Hendrickson, Escalation Engineer, Enterprise Communications Services, Microsoft