Hi Everyone:
I have to confess it has been almost 8 months since my last blog entry.
Apparently the world at large has not noticed, but I felt it was important to make the effort and provide a quick update on the status of Actuarial Modeling and Windows HPC Server.
We continue to make progress in gathering the support of key partners in this market space. Our partners now include Milliman, Towers Perrin, GGY, SunGard, Watson Wyatt, Polysystems, ARCVal, and a variety of Hedging codes.
We have been active at trade shows, including ERM last April in Chicago and the SOA annual show just last October in Boston.
And most importantly, we continue to help our customers in the Insurance industry save money and improve their ability to model risk. If you are curious who is using Windows HPC Server and Windows HPC OS, drop me an email and I can fill you in.
Up until this point we have focused on the Life Insurance industry, but I hope to branch out soon to Catastrophic modeling for P&C, and Web-based analytics for a variety of other applications.
Last but not least, I participated in a very content-heavy webinar with Jim Brackett of the Milliman Financial Risk Management Practice and Brian Reid of the Milliman MG-ALFA team, around Financial reporting requirements and their impact on Actuarial Modeling. The Webinar was titled "Leveraging Windows HPC Server for Financial Reporting and Financial Risk Management using Milliman Tools".
Please let me know if you have any questions about Windows HPC and Risk Modeling, Economic Capital modeling, or Enterprise Risk Management and we will do our best to help.
Your advocate for Actuaries at Microsoft
Dave Dorfman
Our first Beta release is now available! You can read the full press release at http://www.microsoft.com/presspass/press/2009/nov09/11-16SC09PR.mspx if you're into reading that kind of thing ;)
Windows HPC Server 2008 R2 delivers productivity, performance and ease-of-use improvements in several areas, including the following:
- Improved scalability, with Windows HPC Server 2008 R2 offering out-of-the-box support for deploying, running and managing clusters up to 1,000 nodes
- New configuration and deployment options such as diskless boot, mixed-version clusters and support for a remote head node database
- Improved system management, diagnostics and reporting including an enhanced heat map, multiple customizable tabs, an extensible diagnostic framework and the ability to create richer custom reports
- Improved support for service-oriented architecture (SOA) workloads including a new fire-and-recollect programming model, finalization hooks, improved Java interoperability, automatic restart and failover of broker nodes, and improved management, monitoring, diagnostics and debugging
- Message Passing Interface (MPI) and networking enhancements including optimizations for new processors, enhanced support for RDMA over Ethernet and InfiniBand, improved MPI debugging, and a pushbutton HPC LINPACK optimization wizard
- New ways to accelerate Microsoft Office Excel workbooks such as support for Cluster-Aware User-Defined Functions and the capability to run distributed Excel 2010 for the cluster
Come and join our beta program to give it a try, and you can give us feedback (positive or negative) on it: http://connect.microsoft.com/HPC/content/content.aspx?ContentID=6923
Today Clustercorp introduced their new hybrid solution, ROCKS(R)+HYBRID, for supporting Windows HPC Server and Linux on the same hardware. They've pulled together a Rocks-based deployment solution that is consistent with both Windows HPC Server 2008 and Rocks philosophies. I think it will be welcomed by our customer base and that it greatly eases the burden of deployment and management of dual-boot clusters. Many of our enterprise customers have expressed a desire to get more out of their cluster hardware by being able to use the same hardware for both Windows and Linux workloads. This new solution will be an important milestone for how easily those use cases are deployed and managed.
While the devil is in the user and admin experience details, it looks like the combination of Rocks and the Hybrid Roll deployment is consistent with the Rocks Philosophy of shoot the node and redeploy. Importantly, this offering also uses existing Windows HPC Server 2008 deployment features to deploy the Windows OS to compute nodes. It really looks like the best of both worlds.
This Rocks+Hybrid solution looks like it could be a win for both administrators and users. I've got to set aside some time to bring it up on my test cluster. Big fun and thanks Clustercorp.
"Son of a gun, we'll have big fun on the bayou." Hank Williams, 1952
Frankie
I find a lot of people are starting to build demonstration clusters on a small scale with some sort of minimal ‘enterprise’ network. In my case the enterprise network is my home router and the private network is a little 5-port GigE switch.
To make it a bit more realistic and to be able to test configurations where the head node is _not_ the Active Directory Domain controller, I created a Domain Controller for my test network. I then added the cluster head node to this domain and installed the HPC 2008 Pack. The setup worked like a champ.
Then about one month later I was trying to run some tests when I noticed that the head node reported that it was ‘Unreachable’. How can this be, I thought? All its networks were active, and it could ping itself, the router and the other private network nodes. Finally I stumbled onto trying to ping the Domain Controller. Surprise! No response from the DC. It seems that my 5-year-old NIC on the Domain Controller had crossed the digital divide to the land of failed hardware.
Replacing the NIC and installing a valid 64-bit driver for it brought the DC back into the network, and in next to no time the head node self-reported as reachable.
When a domain-joined system boots, it tries to contact its domain controller. If it can’t, it will come up and allow console logins on cached credentials, but it will periodically look for a domain controller connection to see if it can trust itself.
In my case the broken NIC prevented DC access, but the cached credentials allowed console logins.
I have seen similar Unreachable situations when a compute node cannot reach its domain controller. If you log on to a node immediately after it has rebooted and use the Event Viewer to look at the Windows Logs-> Security events you will see numerous Logon and Special Logon events. This is how the node establishes itself as being a member of the cluster, and if these fail, the node will appear as Unreachable.
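Since the fix in this story came down to pinging the right machine, here is a minimal Python sketch of that check, run from the head node or a compute node. It assumes the system `ping` command is on the path; `unreachable_hosts` is a helper name chosen purely for illustration, and the host names in the comment are placeholders. One caveat: on Windows, `ping` can return a zero exit code for a "Destination host unreachable" reply, so treat a clean result as a hint rather than proof.

```python
import platform
import subprocess


def ping_once(host):
    """Return True if one ping to host succeeds, judged by ping's exit code."""
    # Windows ping takes -n for the count, most other systems take -c.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    result = subprocess.run(
        ["ping", count_flag, "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unreachable_hosts(hosts, ping=ping_once):
    """Return, in order, the hosts that did not answer a ping."""
    return [host for host in hosts if not ping(host)]


# Typical use (hypothetical names, put the Domain Controller first):
#   unreachable_hosts(["DC-01", "192.168.0.1", "HEADNODE"])
```

Passing the `ping` function as a parameter keeps the list logic separate from the network call, which also makes it easy to try out with a stand-in function.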
Frankie
An issue came to our attention after Windows HPC Server 2008 shipped regarding the way we set affinity on the processes within a job. There are actually two places where we set affinity:
1. The Node Manager (the service on each node responsible for starting jobs and tasks) sets processor affinity on each task to prevent that task from using processors which are not assigned to it.
2. MPIEXEC (which is used to start MS-MPI applications) can, when the -affinity flag is provided, set affinity on all ranks within the MPI application.
The problem that we encountered is this: Due to the way affinity setting works on Windows Job Objects (which we use to run tasks) and the processes within them, you cannot set affinity at both layers. That means that in MPI tasks which are allocated less than an entire node, the -affinity flag on the mpiexec command line will end up being ignored, since the affinity has already been set by the scheduler and cannot be set in two places. This caused problems for some applications, especially those developed to work against the Compute Cluster Pack (which didn’t set affinity at all).
The problem is particularly serious for jobs which specify the -exclusive option. When a job specifies -exclusive it will be allocated an entire node, but the scheduler will set affinity on tasks within the job despite this. So an exclusive job with a 4-core task that is assigned an 8-core node would cause the scheduler to affinitize the task to only 4 cores. This leaves the other 4 cores idle if there are no other tasks in the job, and is awfully confusing for some people and applications! Such a job would also not have MPI rank affinity, even if the -affinity flag was specified.
Our solution is to introduce a new cluster parameter called AffinityType. AffinityType has three possible settings which work as follows:
· AllJobs: When AffinityType is set to AllJobs, the Node Manager will set affinity on any task that isn’t allocated an entire node. This is the behavior described above, and is probably the best choice for applications which may run multiple instances per node (e.g. Parameter Sweeps and SOA Jobs) and want these instances to be isolated from each other.
· NonExclusiveJobs (Default): With this setting, the Node Manager will not set affinity on jobs which are marked as exclusive. This is the ideal choice for jobs with only 1 task, since that task will be able to take advantage of all cores on the nodes that it is assigned. We’ve made this the new default since it provides what is generally the preferred behavior for MPI tasks, which are most likely to be sensitive to affinitization. With this choice selected, MPI tasks in exclusive jobs can take advantage of the -affinity flag to mpiexec even if they are not allocated an entire node.
· NoJobs: With this setting, the Node Manager will never set affinity on any task. This is an excellent choice for those running MPI tasks who want to make sure they can take advantage of the -affinity flag to mpiexec even when jobs may share nodes. This is also useful for applications which want to set their own affinity.
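The three settings can be read as a small decision table. The Python sketch below is only a model of the rules as described in this post, not the Node Manager's actual code; the function names and parameters are invented for illustration.

```python
def node_manager_sets_affinity(affinity_type, job_is_exclusive, task_cores, node_cores):
    """Model of whether the Node Manager pins a task to its allocated cores."""
    task_has_whole_node = task_cores >= node_cores
    if affinity_type == "NoJobs":
        # Never set affinity on any task.
        return False
    if affinity_type == "NonExclusiveJobs":
        # Same as AllJobs, except that exclusive jobs are left alone.
        return not job_is_exclusive and not task_has_whole_node
    if affinity_type == "AllJobs":
        # Pin any task that is allocated less than an entire node.
        return not task_has_whole_node
    raise ValueError("unknown AffinityType: %r" % (affinity_type,))


def mpiexec_affinity_is_honored(affinity_type, job_is_exclusive, task_cores, node_cores):
    """The mpiexec affinity flag only works when the Node Manager set no affinity."""
    return not node_manager_sets_affinity(
        affinity_type, job_is_exclusive, task_cores, node_cores
    )
```

For the example from this post (an exclusive job with a 4-core task on an 8-core node), the model says the task is pinned under AllJobs but left free under NonExclusiveJobs and NoJobs, which is exactly the case where the mpiexec flag starts working again.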
Note that Windows Server 2008 R2 will allow the setting of process affinity at both the Windows Job Object and Process level simultaneously. So hopefully in v3 of the HPC Pack there will no longer be an issue with the conflict between these two settings.
You can learn more about how the new AffinityMode flag works here: http://msdn.microsoft.com/en-us/library/microsoft.hpc.scheduler.properties.affinitymode(vs.85).aspx
Some customers have reported issues with the HPC Job Scheduler's event logs after installing Service Pack 1 for the HPC Pack. After some internal investigation, it looks like there is an issue with the SP1 Patch Installer which fails to correctly update the event log manifests (making it so you don't see events in the Event Viewer).
We are working on a patch for this issue. In the meantime, you can manually update the manifest as follows:
- Download the file attached to this blog post (SchedulerEvents.man)
- Install this manifest on the affected machine by running the following from an elevated command prompt: "wevtutil im SchedulerEvents.man"
That's it! This workaround should restore event log functionality. We are hard at work on a patch that will make this workaround obsolete.
UPDATE: We've managed to fix our patch installer to correct this problem. Installing any post-SP1 patch (for example, this one: http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=9d76e266-92d1-49b0-8bf3-cd811b6a5a4c) to the HPC Pack should fix this issue if you encounter it.
A common scenario is to have the Windows HPC Server 2008 Private Network DNS on the head node of the cluster. Because the HPC Pack normally manages name resolution for the private network via Hosts files it propagates to all nodes, we do not normally set the private network interfaces to register with the DNS on the head node.
I often find that I want to have other servers such as file servers or non-Windows OS servers located on the private network. These servers need to have all the names on the private network correctly resolved for them.
Getting the behavior I want from the head node private DNS is very simple, but not documented as widely as we’d like. Thus, this post seems like a good start at getting the trick documented.
On the head node, go to Start->Microsoft HPC Pack->HPC PowerShell, right-click it and select Run As Administrator. At the prompt, enter the following:
PS C:\Windows\System32> Set-HpcNetwork -PrivateDnsRegistrationType WithConnectionDnsSuffix
This executes silently. I noticed that not many new Host(A) records appeared in my DNS console. I was told, “Be patient.” Instead, I just rebooted all nodes, including the non-Windows OS node Node-08.
Voila! I got my Host(A) records for everything.
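If you would rather check than be patient, a short script can report which names still fail to resolve. This is only a sketch: the node names and addresses are placeholders, and `missing_host_records` is a helper name made up for illustration. Run it from one of the non-HPC servers on the private network, because on a node managed by the HPC Pack the lookup would also be satisfied by the propagated Hosts file.

```python
import socket


def resolve(name):
    """Return the IPv4 address for name, or None if the lookup fails."""
    try:
        return socket.gethostbyname(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def missing_host_records(names, resolver=resolve):
    """Return the names that the resolver could not turn into an address."""
    return [name for name in names if resolver(name) is None]


# Example with a stand-in resolver (hypothetical names and addresses),
# so the sketch can be tried without touching a real DNS server.
known = {"HEADNODE": "10.1.0.1", "NODE-08": "10.1.0.8"}
print(missing_host_records(["HEADNODE", "NODE-07", "NODE-08"], resolver=known.get))
# prints ['NODE-07']
```

To query the real DNS, drop the `resolver` argument and pass your own list of private network names.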
Microsoft Research has an interesting project called Dryad which is investigating programming models for distributed data-parallel problems. If you've heard some of the hype around map/reduce programming or the Hadoop framework, you might be familiar with these types of problems. If you have gigabytes, terabytes, or even petabytes of data that you need to churn through, this is a powerful tool. On top of the Dryad framework is a programming model called DryadLINQ which brings the power and familiarity of .NET's LINQ syntax to cluster computing. You can write your data analytics algorithm as a LINQ expression in your favorite .NET language and submit cluster jobs directly from Visual Studio.
I am delighted to announce that Microsoft Research has released a distribution of Dryad and DryadLINQ that runs on top of HPC Server 2008. You can download this release from here and contribute feedback and suggestions at the Microsoft Connect site here. If you want to learn more about Dryad and DryadLINQ, you can find more information on the Microsoft Research site. The source for DryadLINQ is also included in this release.
Releasing this has been a great collaboration between the HPC team and Microsoft Research. I am excited to get this work out there and see what people can do with it. So try it out and give us some feedback!
John Vert
Architect
Windows HPC Server
Update - Channel9 has a video interview with Erik Meijer and Roger Barga discussing this: http://channel9.msdn.com/posts/Charles/Expert-to-Expert-Erik-Roger-Barga-Introduction-to-Dryad-and-DryadLINQ/
If you read the documents posted by Mellanox about their new 2.0.5 build 4453 InfiniBand drivers, you may have noticed the advice to update your firmware. If so, you will need to discover your PSID. This should be pretty straightforward: just install the drivers and then use the Run a Command feature of the HPC Management console to run vstat on the node you wish to update. If you are lucky, you’ll see something like this:
NODE-08 -> Finished
-------------------------------------------------------------------------------------------------
hca_idx=0
uplink={BUS=PCI_E, SPEED=2.5 Gbps, WIDTH=x8, CAPS=2.5*x8}
vendor_id=0x08f1
vendor_part_id=0x6278
hw_ver=0xa0
fw_ver=4.08.0200
PSID=VLT0040010001
node_guid=0008:f104:0399:2054
num_phys_ports=2
port=1
port_state=PORT_ACTIVE (4)
link_speed=5.0 Gbps (2)
link_width=4x (2)
rate=20 Gbps
port_phys_state=LINK_UP (5)
active_speed=5.0 Gbps (2)
sm_lid=0x0001
port_lid=0x0009
port_lmc=0x0
max_mtu=2048 (4)
port=2
port_state=PORT_DOWN (1)
link_speed=NA
link_width=NA
rate=NA
port_phys_state=POLLING (2)
active_speed=2.5 Gbps (1)
sm_lid=0x0000
port_lid=0x0000
port_lmc=0x0
max_mtu=2048 (4)
If, like me, you are unlucky, you will not have a PSID line in the output. Like this:
NODE-07 -> Finished
--------------------------------------------------------------------------------------
hca_idx=0
uplink={BUS=PCI_E, SPEED=2.5 Gbps, WIDTH=x8, CAPS=2.5*x8}
vendor_id=0x066a
vendor_part_id=0x6274
hw_ver=0xa0
fw_ver=0x100020000
node_guid=0006:6a00:9800:f356
num_phys_ports=1
port=1
port_state=PORT_ACTIVE (4)
link_speed=5.0 Gbps (2)
link_width=4x (2)
rate=20 Gbps
port_phys_state=LINK_UP (5)
active_speed=5.0 Gbps (2)
sm_lid=0x0001
port_lid=0x000a
port_lmc=0x0
max_mtu=2048 (4)
If this happens to you, don’t waste time trying to find the PSID of your HCA. As soon as I find a way to pick the right firmware upgrade for my own HCAs without PSIDs, I’ll post how to do it. Until then, you and I must run on whatever firmware we already have.
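With more than a handful of nodes, eyeballing each vstat dump for a PSID line gets old quickly. A minimal Python sketch for pulling the value out of saved output follows. It assumes only the `key=value` layout shown in the two listings above, and `find_psid` is a name invented for this example.

```python
def find_psid(vstat_output):
    """Return the PSID value from vstat output, or None if no PSID line exists."""
    for line in vstat_output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.strip().upper() == "PSID":
            return value.strip()
    return None


# Trimmed copies of the two listings above.
lucky = "hca_idx=0\nfw_ver=4.08.0200\nPSID=VLT0040010001\nnode_guid=0008:f104:0399:2054"
unlucky = "hca_idx=0\nfw_ver=0x100020000\nnode_guid=0006:6a00:9800:f356"

print(find_psid(lucky))    # prints VLT0040010001
print(find_psid(unlucky))  # prints None
```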
Sorry,
Frankie
Mellanox has released a new version of its WHQLed WinOF drivers: v2.0.5 build 4453.
See: http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=32&menu_section=34
Click the download tab in the middle of the page and select the MLNX WinOF MSI v2.0.5 for x64 Platforms shortcut to begin the .msi download from http://www.mellanox.com/downloads/WinOF/MLNX_WinOF_2_0_5_wlh_x64_fre_2_0_5_4453.msi . I give the first link because it references other documents of interest to anyone installing these drivers.
This release includes both a .msi package mentioned above and an INF compatible package.
To update drivers on a preinstalled system, use the HPC Management console Run Command feature on a group of nodes with the command:
msiexec /quiet /forcerestart /i \\headnode\Home\LocalAdmin\MLNX_WinOF_2_0_5_wlh_x64_fre_2_0_5_4453.msi
I was able to test it on a single node, and it had the somewhat disconcerting effect of claiming in the Run a Command window that the command had failed. This was because the command never completed cleanly due to the forced restart. The same is likely to apply when running this command from clusrun.
To use the INF install, first unzip the file MLNX_WinOF INF file v2.0.5 for x64 Platforms, available from the page you reached by clicking the Download tab above: http://www.mellanox.com/downloads/WinOF/MLNX_WinOF_HPC_x64_2_0_5.zip . Then use the HPC Management console Configure->To-do List->Manage Drivers link to point to the INF directory. You may first have to remove any references to the older version of the drivers left over from your earlier insertion of InfiniBand drivers.
Have fun and compute well,
Frankie
An HPC Server 2008 user reported that his cluster was up and running and that all nodes could ping each other over all networks but the built-in MPI diagnostic was failing with an uninformative message "Failed To Run".
He had topology number three, with the head node connected to the Enterprise network and all compute nodes connected to the head node via Ethernet as the Private network and InfiniBand as the Applications network.
Please be aware that "Failed To Run" is a separate category from "Failure", and when a test doesn't succeed you may have to check both places in the Diagnostics tree Test Results branch. Once you find this tab you don't get much information beyond the result "Failed To Run". However, if you click on the line labelled with the red ! you will see the bottom pane light up, but it still only says "Test Failed to Run". Look to the right side of that banner and you will see a bright red "Result" followed by a v in a circle. Click on the v and you get more information about the failure.
The test did not run. Please navigate to 'Progress of the test' to view log and error messages.
So where is the Progress of the test to be found? Well, if like me you often don't have the Actions pane open, you'd better click on the Actions tab near the top of the console. Now near the top of the Actions pane you will see the link to "Progress of the Test". This is progress, of a sort. You'll likely see just a single line with the red ! and a State of "Reverted". Now click on that line.
Oh, boy, we're rockin' now. Here are the real error messages. This is so information rich it's almost embarrassing.
Time Message
6/29/2009 10:16:53 AM Reverted
6/29/2009 10:16:53 AM The operation failed due to errors during execution.
6/29/2009 10:16:53 AM The operation failed and will not be retried.
6/29/2009 10:16:53 AM ---- error analysis -----
6/29/2009 10:16:53 AM
6/29/2009 10:16:53 AM mpi has detected a fatal error and aborted mpipingpong.exe
6/29/2009 10:16:53 AM [2] on NODE-03
6/29/2009 10:16:53 AM
6/29/2009 10:16:53 AM ---- error analysis -----
6/29/2009 10:16:53 AM
6/29/2009 10:16:53 AM [3-6] terminated
6/29/2009 10:16:53 AM
6/29/2009 10:16:53 AM Check the local NetworkDirect configuration or set the MPICH_ND_ENABLE_FALLBACK environment variable to true.
6/29/2009 10:16:53 AM There is no matching NetworkDirect adapter and fallback to the socket interconnect is disabled.
6/29/2009 10:16:53 AM CH3_ND::CEnvironment::Connect(296): [ch3:nd] Could not connect via NetworkDirect to rank 1 with business card (port=58550 description="10.1.0.2 192.168.0.28 192.168.0.39 NODE-02 " shm_host=NODE-02 shm_queue=3204:428 nd_host="10.1.0.2:157 " ).
6/29/2009 10:16:53 AM MPIDI_CH3I_VC_post_connect(426)...: MPIDI_CH3I_Nd_connect failed in VC_post_connect
6/29/2009 10:16:53 AM MPIDI_CH3_iSendv(239).............:
6/29/2009 10:16:53 AM MPIDI_EagerContigIsend(519).......: failure occurred while attempting to send an eager message
6/29/2009 10:16:53 AM MPIC_Sendrecv(120)................:
6/29/2009 10:16:53 AM MPIR_Allgather(487)...............:
6/29/2009 10:16:53 AM MPI_Allgather(864)................: MPI_Allgather(sbuf=0x00000000001FF790, scount=128, MPI_CHAR, rbuf=0x0000000000B70780, rcount=128, MPI_CHAR, MPI_COMM_WORLD) failed
6/29/2009 10:16:53 AM Fatal error in MPI_Allgather: Other MPI error, error stack:
6/29/2009 10:16:53 AM [2] fatal error
6/29/2009 10:16:53 AM
6/29/2009 10:16:53 AM [0-1] terminated
6/29/2009 10:16:53 AM
6/29/2009 10:16:53 AM [ranks] message
6/29/2009 10:16:53 AM job aborted:
6/29/2009 10:16:53 AM
First it tells us that Node-03 had a problem. Then it tells us to look at the Node-03 local Network Direct connection. Then it tells us that the environment is set to not fall back to Winsock Direct or TCP/IP. This is because falling back when people are expecting Network Direct performance can cause applications to run very slowly and is hard to diagnose. Trust me. I missed sleep over that one.
Then we have several lines of MPI error messages, which I generally summarize as the 'eager message no business card' error. You can ignore the rest of the message, but keep in mind that whenever you see the eager message no business card error you should suspect your MPI network has a problem.
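If you save the Progress of the Test messages to a text file, the two facts that matter here (which rank on which node hit the fatal error, and whether NetworkDirect is the culprit) can be fished out with a few lines of Python. This is a sketch that assumes the message wording shown in the log above; `summarize_mpi_failure` is a hypothetical helper, not part of any HPC Pack tool.

```python
import re


def summarize_mpi_failure(log_text):
    """Pick the failing rank and node, and the NetworkDirect hint, out of the log."""
    # Matches lines such as "[2] on NODE-03"; rank ranges like "[3-6]" are skipped.
    rank_match = re.search(r"\[(\d+)\] on (\S+)", log_text)
    return {
        "rank": int(rank_match.group(1)) if rank_match else None,
        "node": rank_match.group(2) if rank_match else None,
        "network_direct_problem": "no matching NetworkDirect adapter" in log_text,
    }


sample = "\n".join([
    "6/29/2009 10:16:53 AM mpi has detected a fatal error and aborted mpipingpong.exe",
    "6/29/2009 10:16:53 AM [2] on NODE-03",
    "6/29/2009 10:16:53 AM [3-6] terminated",
    "6/29/2009 10:16:53 AM There is no matching NetworkDirect adapter and fallback to the socket interconnect is disabled.",
])
print(summarize_mpi_failure(sample))
# prints {'rank': 2, 'node': 'NODE-03', 'network_direct_problem': True}
```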
So, let's follow the advice at the beginning of the error messages, and check the InfiniBand status on Node-03. I use the Run Command feature of the Management Console to run the ndinstall tool. To make life easier, I copy the .exe for this tool to all of the nodes in C:\Windows\System32\ndinstall.exe . This tool is usually installed by the .msi install of the drivers on the head node. Search your system drive after you install the drivers and find this tool. Then put it on a head node share the compute nodes can see and use clusrun or the Run Command GUI to copy it to all the compute nodes. Here's the output from Node-03 (bad node no business card) and Node-02 (good node, pat pat).
Node-03
0000001001 - MSAFD Tcpip [TCP/IP]
0000001002 - MSAFD Tcpip [UDP/IP]
0000001003 - MSAFD Tcpip [RAW/IP]
0000001004 - MSAFD Tcpip [TCP/IPv6]
0000001005 - MSAFD Tcpip [UDP/IPv6]
0000001006 - MSAFD Tcpip [RAW/IPv6]
0000001007 - RSVP TCPv6 Service Provider
0000001008 - RSVP TCP Service Provider
0000001009 - RSVP UDPv6 Service Provider
0000001010 - RSVP UDP Service Provider
Node-02
0000001001 - MSAFD Tcpip [TCP/IP]
0000001002 - MSAFD Tcpip [UDP/IP]
0000001003 - MSAFD Tcpip [RAW/IP]
0000001004 - MSAFD Tcpip [TCP/IPv6]
0000001005 - MSAFD Tcpip [UDP/IPv6]
0000001006 - MSAFD Tcpip [RAW/IPv6]
0000001007 - RSVP TCPv6 Service Provider
0000001008 - RSVP TCP Service Provider
0000001009 - RSVP UDPv6 Service Provider
0000001010 - RSVP UDP Service Provider
0000001011 - OpenIB Network Direct Provider
Notice there is no 0000001011 OpenIB Network Direct Provider on Node-03. So the diagnostic actually got it right immediately; I just took a long time to prove it. Now let's run ndinstall -i on Node-03, again with Run Command. All we get is a Finished. Then run ndinstall -l again and verify that we get the 0000001011 OpenIB Network Direct Provider line. Yes we do, but do not be confused if, for you as for me, it is "0000001012 - OpenIB Network Direct Provider". The sequence number is not important.
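Comparing these listings by eye works for two nodes but not for forty. Below is a minimal Python sketch that scans saved ndinstall listings for a Network Direct provider line and, as noted above, ignores the sequence number. It assumes the `number - name` layout shown in the listings; the function names are invented for illustration.

```python
def has_network_direct_provider(ndinstall_listing):
    """Return True if any provider line names a Network Direct provider."""
    for line in ndinstall_listing.splitlines():
        _, sep, name = line.partition(" - ")
        if sep and "Network Direct Provider" in name:
            return True
    return False


def nodes_missing_provider(listings_by_node):
    """Given {node name: listing text}, return the sorted nodes lacking the provider."""
    return sorted(
        node
        for node, listing in listings_by_node.items()
        if not has_network_direct_provider(listing)
    )


# Trimmed copies of the two listings above.
listings = {
    "Node-03": "0000001001 - MSAFD Tcpip [TCP/IP]\n0000001010 - RSVP UDP Service Provider",
    "Node-02": "0000001001 - MSAFD Tcpip [TCP/IP]\n0000001011 - OpenIB Network Direct Provider",
}
print(nodes_missing_provider(listings))  # prints ['Node-03']
```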
And finally let's run the diagnostic again and look in Diagnostics->Test Results->Success. Ah, now that's sweet!
Test Name | Result | Test Suite | Target | Last Updated
MPI Ping-Pong: Quick Check | Success | Performance | 7 nodes | 6/29/2009 10:54:58 AM
That's it for now, "Transfer fast and prosper."
Frankie
Recently a customer asked me to create a document to briefly describe the charting and reporting functionality in V2. After completing the document I felt that it would make a good blog posting to share with the HPC community. The document is attached to this blog posting. Please feel free to provide feedback.
Thanks.
This might be of interest to HPC PowerShell users.
---------------
PowerShellCommunity.org Joins Forces with Microsoft Scripting Guys to Host 2009 Summer Scripting Games
LOS ANGELES – At Microsoft TechEd 2009, PowerShellCommunity.org, an online community where script writers connect and share knowledge, today announced a key alliance with the Microsoft Script Center (aka Scripting Guys) and PoshCode.org to host the 2009 Scripting Games, June 15–26, 2009.
“We started the Games to challenge scripters everywhere, invite them in to become part of a fun community, and to learn in a cost-effective way,” said John Merrill, IT content evangelist and publishing manager in the Windows Server Division User Assistance group. “We are pleased to work together with PowerShellCommunity.org in helping deliver a premium scripting experience for the two weeks in June when we host the Games.”
The Scripting Games are a chance for IT professionals to practice and test their scripting skills during 10 events using either Microsoft Windows PowerShell or Microsoft VBScript. The Games begin as a live event with contestants submitting entries that are judged and scored by the community.
“We are looking forward to the Scripting Games and being part of the community in helping up to 1,000 or more script writers in showcasing their craft,” said Hal Rottenberg, director of PowerShellCommunity.org. “Sponsors of PowerShellCommunity.org like Idera, Quest Software, Inc., Compellent and SAPIEN Technologies, Inc. help us provide this venue for Windows PowerShell users to collaborate and communicate.”
To enter the Scripting Games visit http://www.microsoft.com/technet/scriptcenter/funzone/games/.
If you have ever written an MPI program for a Windows HPC cluster, you are probably familiar with the MPI Cluster Debugger in Visual Studio 2005/2008, and you can find plenty of resources about it online (blogs, white papers and so on). Do you like debugging MPI programs on a cluster? Is the debugger easy to use? Visual Studio 2010 Beta 1 has now been released, and the HPC team has invested a lot of effort in improving the MPI Cluster Debugger. Let’s go through it.
The MPI Cluster Debugger is in the same place as before (the Project Property Page). The difference is that there are many more properties now. Don’t worry: although there are about 20 properties, you will get familiar with them quickly. In most cases we only need to care about 3 of them; the others use default values if left empty.

The most important thing when we want to debug a program on a cluster is to specify the head node. “Run Environment” is the first mandatory property. Click “Edit Hpc Node…” and the “Node Selector” dialog pops up. Here we can specify the head node and choose compute nodes. We can either specify the total number of MPI processes or specify precisely the number of MPI processes on each selected node. This page also shows the real-time CPU usage of each node in the cluster. If we only need to debug the program on the local machine with 4 processes, we just enter “localhost/4”.


Another mandatory property is “Working Directory”. It must be a local path. The MPI Cluster Debugger will create it for us if it doesn’t exist. The last mandatory property is “Application Command”. We can use the Visual Studio built-in macros there, such as “$(TargetFileName)”.

“Deployment Directory” is optional; its default value is \\<HeadNode>\CcpSpoolDir\<UserName>. CcpSpoolDir is created during the installation of the Windows HPC cluster. If we don’t want to use the default value, we can enter our own. Make sure it is a shared path and that we have permission to read and write files there.
We can select a different debugger engine through the “Debugger Type” property. If we want to debug an MPI .NET program, “Managed Only” is the choice.

Each property has an explanation at the bottom of that page, so I won’t go through them one by one. Let me know if anything is unclear.
Once the mandatory properties are specified, we can use the basic features of the MPI Cluster Debugger. Press “F5”. After a while, the MPI processes are launched on the selected nodes and Visual Studio attaches to them. The “Output View” in Visual Studio shows what happened; if an error occurs, detailed information is printed there. In the “Processes View” we can find the MPI processes. A breakpoint in a source file will be hit when a process reaches it.


That was a brief tour of the MPI Cluster Debugger. Some small changes may happen in Beta 2 or the RTM version. If you have any feedback or suggestions, reply to me. Thank you!