Many have been asking for what i need to check to make sure SCCM is healthy. According to me if I were you these are things I would have checked :)
This might not be complete but had tried to include what ever I could think of.
Daily Administrative Task
Daily Site Monitoring Tasks
To best maintain your system, perform the following monitoring tasks on a daily basis. If there is any indication of a problem, isolate and repair the problem to ensure that the site remains healthy.
Daily site monitoring tasks include:
Use the SQL Server DBCC command to check the health of the SCCM site database. Use any other tools available to test the health of the SCCM site database.
View site status summary information in the SCCM Administrator console, or create reports that summarize the server activity and status (such as the Clients that Received a Specific Advertised Program report). If necessary, check status messages of individual components. For further details, in case status messages indicate that a problem exists, view the relevant log files. Isolate and fix conditions that generate errors or warnings. If appropriate, reconfigure the status system so that only relevant and helpful messages are recorded.
Check the status of items such as:
Check the state of site systems throughout the SCCM hierarchy. Use the status system and, if necessary, use log files to determine if site systems are having problems, such as:
Check the state of clients in the hierarchy. Run queries on status messages to detect any problems that clients might be having, such as:
You can monitor a client's status only if it creates status messages, and these status messages reach the site server.To detect clients from which you are missing status messages, you need to run a query that returns all clients that have not reported a status message within the last <n> days. In this query, <n> is the length of time you would expect to receive a status message from that client (taking into account the frequency of hardware or software inventory and the regular time it takes for status messages to reach the site server.)
On key servers, check the application, system, and security system event logs. You can access those through the Event Viewer administrative tool. Look for messages that indicate error conditions or developing problems. Isolate and repair the conditions that generate error or warning messages.
When installing an SCCM site server, its default configuration is to write status messages to the event log. This helps you identify any developing problems with SCCM.
Note :When SCCM is configured to write status messages to Windows event logs, SCCM error status messages are written as Information events, not Error events.
Save instances of the most recent event log files for future comparison. When you can compare current log files with previous log files, it is easier to detect problems that are developing. After saving the log files, you can clear them from the event log so it is easier to detect new problems.
To check whether the site server and component servers have sufficient resources and that SCCM site services are running optimally, you must monitor site server and component server performance. Use performance-monitoring tools such as System Monitor in the Performance console. Check the status of critical components on the site server, on the computer running SQL Server, and other SCCM site systems.
SCCM installs many performance monitor counters, but you can add, remove and configure counters as needed. You can also use the SQL Server performance counters.
Save performance log files for future comparison. It is easier to detect performance trends, and to identify potential bottlenecks, when comparing current performance log file to previous performance log files.
Check the Inboxes to Monitor
Listed here is a list of the ConfigMgr 2007 inboxes that should be checked on a regular basis to ensure that your site(s) function as expected.Auth\Dataldr.Box : A backlog of files can indicate problems accessing the site database. Auth\Dataldr.Box\Process :A backlog of files can indicate problems accessing the site database.
Auth\Ddm.box\Bad_DDRs :A backlog of files can indicate a network corruption problem or a problem with the DDMAuth\Sinv.Box :A backlog of files can indicate that the Software Inventory Processor cannot connect to the site database or that too many files were received.Auth\Sinv.Box\Orphans :A backlog of files can indicate problems with specific clients, with management points, or with the network that could cause data corruption. Compsumm.Box : A backlog of files can indicate that the Component Status Summarizer cannot process the volume of messages. Dataldr.Box :A backlog of files can indicate problems accessing the Systems Management Server (SCCM) databaseDataldr.Box\Badmifs :A backlog of files can indicate a bad custom MIF file or that a client computer cannot transfer the file correctly. Ddm.Box :A backlog of files can indicate a bad DDR is preventing other DDR’s to process.Ddm.Box\Bad_DDRs :A backlog of files can indicate a network corruption problem or a problem with the DDMOfferSum.Box : A backlog of files can indicate a performance problem that is caused by a large number of messages. Policypv.Box :A backlog of files in the policypv.box folder indicates that the policy provider component is not running. Replmgr.Box\Ready:A backlog of files can indicate that the Scheduler is backlogged or is already processing files of the same prioritySchedule.Box:A backlog of files can indicate that the Sender cannot connect to or cannot transfer data to another site. Schedule.Box\Outboxes :A backlog of .srq files indicates that the sender cannot process the number of jobs scheduled for that sender or that the sender cannot connect to or transfer data to another site.Schedule.Box\Tosend :A backlog of files can indicate that many send requests are not completed or that the Scheduler has not yet deleted the files. Sinv.Box :A backlog of files can indicate that the Software Inventory Processor cannot connect to the site database or that too many files were received.Sinv.Box\BadSinv :A backlog of files can indicate problems with specific clients, with management points, or with the network, causing data corruption. SiteStat.Box :A backlog of files can indicate a performance problem. Examine status messages for the Site System Status Summarizer for possible problems. Statmgr.Box\Futureq :A backlog of files can indicate that some site systems' clocks are not synchronized with the site server. Statmgr.Box\Queue :A backlog of files can indicate a problem with the Status Manager or that the component is trying to process too many messages.Statmgr.Box\Retry :A backlog of files can indicate problems with the connection to the computer that is running SQL Server.Statmgr.Box\Statmsgs :A backlog of files can indicate a problem with the Status Manager or that the Status Manager is trying to process too many messagesSwmproc.Box :A backlog of .sum and .sur files can indicate that the Software Metering Processor component cannot connect to the SCCM database.
Check Daily Maintenance Task
Check and make sure that the daily Maintenance Task if any. We can use the smsdbmon.log for more details.
Weekly Administrative Task
Weekly Site Monitoring Tasks
To best maintain your system, perform the following monitoring tasks on a weekly basis. If there is any indication of a problem, isolate and repair the problem, to ensure that the site remains healthy.
Weekly site monitoring tasks include:
To find the amount of space used by database devices, run the SQL Server stored procedure sp_spaceused against the SCCM site database. For more details about sp_spaceused, see the SQL Server Help. Check the tempdb device at peak usage, when several instances of the SCCM Administrator console are using the database and the site is actively processing objects.
Check the amount of available disk space on the site server, the SCCM site database server, and other SCCM servers. Ensure that the amount of free disk space is sufficient for SCCM and SQL Server to perform properly during regular and increased activity load.
To use the Status System to view information about site system disk space
Configuration Manager > Site Database > System Status> Site Status> <site name>> Site System Status
Weekly Site Maintenance Tasks
To best maintain your system, perform the maintenance tasks in this section on a weekly basis. You can automate some tasks by scheduling predefined maintenance tasks or custom maintenance tasks, as appropriate, to run on a weekly basis.
Weekly site maintenance tasks are:
Weekly Automated Tasks
The following predefined maintenance tasks should be scheduled to run on a weekly basis. For more information about these tasks, see the "Predefined Site Maintenance Tasks" section earlier in this chapter.
If Management Information Format (MIF) files or IDMIFs are used to extend hardware inventory in your site, then any MIF files that are not valid are placed in the SCCM\inboxes\dataldr.box\BADMIFS folder and SCCM never removes them. You must empty this folder manually. If a large number of MIFs are placed in the BADMIFS folder, it is likely that a MIF generating tool is producing the MIFs with an incorrect format. Investigate and repair the cause of the bad MIFs.
Delete objects such as collections, queries, and packages that are no longer needed at the site. Deleting unnecessary objects saves disk space, reduces intersite replications, and increases performance.
Caution :When deleting a collection, any advertisements to that collection are also deleted.
Over time, disk volumes on SCCM site systems become fragmented. Site operations such as distributing large software packages might significantly increase fragmentation on site servers and distribution points. As fragmentation increases, disk operations take longer, thus, the overall site performance decreases.
Run disk defragmentation tools on the SCCM site server and all other site systems to maintain the performance level of disk operations.
Windows event logs can get full, and by default, new items will start to overwrite older items. To diagnose problems, and for other reasons, it might be necessary to refer to an older event log. It is recommended that you back up Windows event logs, and store the backups in a safe and accessible location. If necessary, increase default logs file size to accommodate larger amounts of data.
Periodic Administrative Task
Periodic Site Maintenance Tasks
To best maintain your system, perform the following tasks periodically. Use the predefined maintenance tasks when appropriate.
Periodic site maintenance tasks include:
To properly recover a site server, you must have information about the accounts that SCCM used before the site failed. Account data is stored in domain controllers.
Use Microsoft tools, such as the NTBackup.exe tool that comes with Windows Server, or third-party tools to back up account data as follows:
To maintain the level of security in your hierarchy, you must periodically change the passwords and the accounts that SCCM sites use. Report any changes to the security staff so that security administrators know that these changes are planned and authorized.
To develop an effective security maintenance plan for your SCCM hierarchy, you must thoroughly understand how security is deployed in your hierarchy and make the following decisions:
Check the available bandwidth and error rates on the networks used by the SCCM hierarchy. Use Network Monitor to capture and analyze network frames so you can diagnose network problems and look for optimization opportunities.
SCCM evolves with time. User roles change, and people might no longer need access to some or any of the SCCM functions. Although most changes in access permission should be implemented after role or staff changes, you should also periodically review the access for all users or groups to identify and delete unauthorized access permissions.
The security plan implemented for the SCCM hierarchy in your organization needs to support the risk assessment of your organization. As your organization changes, policies can become ineffective.
Review security-related settings such as:
Use the maintenance plan document to review the SCCM maintenance plan. SCCM evolves with time, and it might be necessary to adjust the maintenance plan to accommodate growth, development, and other changes in your organization.
If there were any changes in your organization's security strategy, backup and recovery strategy, or any other strategy that affects SCCM, then determine if the maintenance plan needs to be adjusted to reflect these changes.
Review maintenance tasks configuration. Check the amount of data in the site database and evaluate the usefulness of that data against the amount of space that it occupies in the database. If necessary, adjust the settings that determine the number of days that data is retained in the database.
Update the maintenance plan document to reflect any changes to the maintenance plan, and then distribute it to all SCCM administrators that are using it.
A recovery test should follow the recovery plan developed for the production environment. Plan to perform a recovery test of the central site, and of any other systems deployed in your hierarchy. A recovery test should test all phases of recovery, including:
Periodic Site Monitoring Tasks
To best maintain your system, perform the following monitoring tasks periodically. If there is any indication of a problem, isolate and repair the problem to ensure that the site remains healthy.
Periodic site monitoring tasks include:
Even high-quality hardware occasionally fails. Sometimes, it fails gradually, so there might be early signs. Replacing hardware before it completely fails is a key step in preventing site failure. Both Windows and SCCM provide performance counters, which you can use to monitor the performance and state of the hardware used in the site.
As soon as you notice any signs of hardware-related unreliable behavior of an SCCM server, replace the hardware. To properly replace server hardware, you must use the Recovery Expert. For more information about swapping the computer of SCCM servers, see the "Swapping the Computer of a Site Server" section later in this chapter.
It is recommended that you periodically perform a more thorough health check, as follows:
If any of these tests fail, you need to diagnose the problem and repair it.
At the end of every site backup cycle you should check the validity of the backup snapshot. Periodically, you should perform a more thorough check to ensure that the site's backup snapshots can be successfully used for recovery.
Restore a recent backup snapshot to a disk and examine file continuity, file size, and other file properties to ensure that they do not seem corrupted. Check critical files by restoring these files to their respective applications to ensure that the application can use the restored file.
Very Helpful - thanx
Nice job with this... helpful. Thank you.
Excellent... every info captured
Do you have script for Daily activuty checking
Very helpful for me ..Thank you.
I assume this applies with 2012 as well, but if there are modifications or additions, could we get an updated list? It would be most appreciated.
very well documented
How to automate this
Excellent post. Now if you can explain more on the details especially on site-to-site communication healthcheck that'll be great.