When you have a cluster with a few hundred nodes, in a network environment that's often beyond your control, running applications that you didn't write, used by users who are not always predictable, you should expect to see some errors most of the time. Troubleshooting and fixing a cluster is hard with all these variables. We heard this from our customers and also experienced these pains firsthand when we did TOP500 deployments of Windows clusters.

In Windows HPC V2 (2008) we started working on a solution. V2 shipped with 16 built-in tests that we developed to cover the most common errors in the OS and network. After V2 shipped, we heard from both customers and our field engineers that these tests really helped their diagnostics work. We also got a lot of requests for additional tests that would be useful: networks, applications, drivers, GPGPUs, etc.


We are adding more tests to V3, especially to help with deployment and network troubleshooting. We also figured that another good way to increase the coverage of diagnostics tests is to allow our partners, such as ISVs, IHVs, OEMs and Solution Integrators, to develop tests on our platform. These partners have the expertise in the applications or hardware that they make, and we can help by providing an execution environment and user experience. Hence the V3 diagnostics extensibility feature was born.

The V3 diagnostics platform allows adding custom tests to Cluster Manager through an XML file. In the XML file, a test is defined by specifying the CLI commands used to run ClusRun or to create a job. When we designed this feature, we knew that many partners and customers already had scripts that they use for troubleshooting, and we wanted to make sure these could easily be used as custom tests with minimal work.

Reading returned data from a few hundred nodes is not easy. So we are providing a built-in mechanism for aggregating and formatting test results for a few common scenarios, such as checking consistency among compute nodes. The Cluster Manager UI can display any test result in HTML, which means a custom test can present its results in any HTML layout it chooses.
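
To make the consistency-check idea concrete, here is a minimal sketch in Python of what such an aggregation step might look like. It is not the built-in mechanism and does not use any HPC API: the function names (`group_by_output`, `find_outliers`, `render_html`), the node names and the sample values are all hypothetical, and it simply assumes you already have one string of output per node.

```python
from collections import defaultdict
from html import escape


def group_by_output(results):
    """Group node names by the whitespace-trimmed output each node returned."""
    groups = defaultdict(list)
    for node, output in results.items():
        groups[output.strip()].append(node)
    return dict(groups)


def find_outliers(groups):
    """Return the nodes whose output differs from the most common output."""
    if not groups:
        return []
    majority = max(groups, key=lambda out: len(groups[out]))
    return sorted(
        node
        for out, nodes in groups.items()
        if out != majority
        for node in nodes
    )


def render_html(groups):
    """Render the grouped results as a small HTML table, largest group first."""
    rows = "".join(
        "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
            escape(out), len(nodes), escape(", ".join(sorted(nodes)))
        )
        for out, nodes in sorted(groups.items(), key=lambda kv: -len(kv[1]))
    )
    header = "<tr><th>Output</th><th>Count</th><th>Nodes</th></tr>"
    return "<table>" + header + rows + "</table>"


if __name__ == "__main__":
    # Hypothetical per-node output, e.g. a version string collected from each node.
    sample = {
        "NODE01": "6.0.6001\n",
        "NODE02": "6.0.6001",
        "NODE03": "6.0.6002",
    }
    groups = group_by_output(sample)
    print("Inconsistent nodes:", find_outliers(groups))
    print(render_html(groups))
```

Grouping by output rather than listing every node keeps the report short even with hundreds of nodes: a healthy cluster collapses to a single row, and any node that disagrees with the majority stands out immediately.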


If you find this blog interesting, here are a few videos demoing how to build custom tests:

If you have more questions or thoughts or just want to tell me that you like it, send me an email at rae.wang@microsoft.com.