Did you think that High Performance Computing was boring? Well, think again: we spent most of our time last week on a videogame :-) At our last HPC Lab we had CCP Games, maker (and host) of Eve-Online (http://www.eve-online.com/), a massively multiplayer online game set in a fictional but extremely realistic universe. CCP has more than 200,000 subscribers, of whom 35,000 play online at the same time. So, why would they be interested in HPC?
Specifically, they were interested in MPI as a mechanism to bootstrap their cluster communication layer and, of course, to pass messages among nodes. Today, each node of their home-grown cluster handles a subset of the game universe and all the players in that subset. As players move or interact from system to system, messages are sent to coordinate the processes running on different nodes. The entire communication layer and the process scheduling have been written in-house. This represents a considerable amount of complex code that needs regular maintenance and is not part of CCP’s core business. By using our scheduler and MPI, they will be able to gradually eliminate that portion of code, gaining reliability and manageability.
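To picture the cross-node coordination described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the modulo partition, the `owner_of` and `jump` helpers, and the `Handoff` message are hypothetical and are not CCP's actual design.

```python
from dataclasses import dataclass
from typing import Optional

NUM_NODES = 4  # hypothetical cluster size, chosen only for this example


def owner_of(system_id: int, num_nodes: int = NUM_NODES) -> int:
    """Return the node that simulates a solar system (assumed modulo partition)."""
    return system_id % num_nodes


@dataclass
class Handoff:
    """Hypothetical message sent when a player crosses from one node to another."""
    player: str
    from_node: int
    to_node: int
    to_system: int


def jump(player: str, src_system: int, dst_system: int,
         num_nodes: int = NUM_NODES) -> Optional[Handoff]:
    """Return a Handoff if the jump crosses nodes, or None if it stays local."""
    src = owner_of(src_system, num_nodes)
    dst = owner_of(dst_system, num_nodes)
    if src == dst:
        return None  # same node owns both systems, no message needed
    return Handoff(player, src, dst, dst_system)
```

A jump between two systems owned by the same node produces no traffic; only jumps that cross a node boundary generate a message, which is the part a message-passing layer such as MPI would carry.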
Is it all that simple? Not really. From one maintenance window to the next, the cluster is basically running one long, large job. MPI today is not really designed for such scenarios. MPI error handling is minimal, and the loss of one process on one node may cause the termination of the whole job. Fault-tolerant MPI is in the works, but it will take some time for it to become part of the standard specification. In the meantime, CCP can work around this problem by using MPI only as a way to bootstrap and set up communication channels for its cluster nodes. Once that job is finished, the nodes can keep talking over standard sockets. Anyway, this prompts me to do some more research in the fault-tolerant MPI field. I'll keep you posted.
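For the curious, the bootstrap-then-sockets idea could look roughly like the sketch below. It is not real MPI code: to keep it runnable with the Python standard library alone, the step where every node learns every other node's address (what a collective such as `MPI_Allgather` would do in the short bootstrap job) is replaced by a plain in-process list. The helper names `open_listener`, `bootstrap`, and `ring_exchange` are hypothetical, and all "nodes" live in one process on the loopback interface.

```python
import socket


def open_listener() -> socket.socket:
    """Open a TCP listener on an OS-assigned loopback port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen()
    return srv


def bootstrap(num_nodes: int):
    """Stand-in for the short bootstrap job: build a directory of all addresses."""
    listeners = [open_listener() for _ in range(num_nodes)]
    # In a real cluster each rank would contribute its own address and a
    # collective exchange would hand every rank the full directory.
    directory = [srv.getsockname() for srv in listeners]
    return listeners, directory


def ring_exchange(listeners, directory):
    """After bootstrap, each node greets its right-hand neighbour over a plain socket."""
    n = len(listeners)
    received = {}
    for rank in range(n):
        peer = (rank + 1) % n
        out = socket.create_connection(directory[peer])
        conn, _ = listeners[peer].accept()
        out.sendall(f"hello from node {rank}".encode())
        out.close()  # closing signals end-of-message to the receiver
        chunks = []
        while True:
            chunk = conn.recv(1024)
            if not chunk:
                break
            chunks.append(chunk)
        conn.close()
        received[peer] = b"".join(chunks).decode()
    for srv in listeners:
        srv.close()
    return received
```

The point of the split is that the failure-sensitive part (the collective exchange) only has to survive for a few seconds; after that, a lost node breaks its own sockets rather than the whole job.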