Low-latency links are important for message-passing (MPI) applications. Typically, several instances of the same MPI process run on different nodes and depend on data passed from other nodes to complete their computation. Latency is therefore one of the performance-limiting factors to take into account. The latency of Gigabit Ethernet (GbE) tends to be on the order of 100 microseconds or more. In addition, GbE makes relatively efficient use of its available bandwidth only when it carries a few large packets. For parallel applications that rely on many small messages (typically 512 B to 4 KB), that is not acceptable. So, what are the alternatives?
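
To make the point about small messages concrete, here is a rough back-of-the-envelope sketch. It assumes a very simple model (delivery time = fixed latency + size / bandwidth), the 100 microsecond figure quoted above, and a nominal 1 Gb/s line rate with protocol overhead ignored. The function names are only illustrative.

```python
# Rough model: time to deliver one message = fixed latency + size / bandwidth.
GBE_LATENCY_S = 100e-6      # ~100 microseconds, as quoted above
GBE_BANDWIDTH_BPS = 1e9     # nominal 1 Gb/s line rate (assumption: overhead ignored)


def transfer_time(size_bytes, latency_s, bandwidth_bps):
    """Estimated one-way delivery time in seconds for a single message."""
    return latency_s + (size_bytes * 8) / bandwidth_bps


def latency_share(size_bytes, latency_s, bandwidth_bps):
    """Fraction of the delivery time spent on fixed latency rather than on the wire."""
    return latency_s / transfer_time(size_bytes, latency_s, bandwidth_bps)


# Compare typical small MPI messages with a large 1 MiB transfer.
for size in (512, 4096, 1024 * 1024):
    t = transfer_time(size, GBE_LATENCY_S, GBE_BANDWIDTH_BPS)
    share = latency_share(size, GBE_LATENCY_S, GBE_BANDWIDTH_BPS)
    print(f"{size:>8} B: {t * 1e6:8.1f} us total, {share:.0%} latency")
```

Under these assumptions, latency accounts for well over 90% of the delivery time of a 512-byte message, while for a 1 MiB transfer it is almost negligible. That is why the same link can look fine for bulk transfers and poor for chatty MPI codes.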

InfiniBand
InfiniBand is a switch-based point-to-point interconnect. An InfiniBand fabric is usually called a subnet and is under the control of one master subnet manager. Any other subnet managers on that fabric act as redundant standbys for high availability. At startup, an InfiniBand device registers with the subnet manager using its unique GUID. The subnet manager dynamically assigns an appropriate address to the GUID and also sets up static routes for the applications to use.

An InfiniBand adapter operates at a signaling rate that is a multiple of 2.5 Gb/s (Single Data Rate, SDR) and uses 8b/10b encoding, i.e. 8 bits of payload for every 10 bits transmitted. In practice:
  • 4X SDR is the most common implementation today. It signals at 10 Gb/s, which gives you 8 Gb/s of useful bandwidth.
  • Data travels in packets up to 4 KB in size. One or more packets make up a message.
  • Message latencies of 3-5 microseconds from application buffer to application buffer are possible.
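
The arithmetic behind these figures can be sketched in a few lines. The constants come from the text above; the function names are made up for illustration, and the packet count ignores per-packet headers.

```python
import math

SDR_LANE_SIGNALING_GBPS = 2.5   # per-lane signaling rate from the text
ENCODING_EFFICIENCY = 8 / 10    # 8b/10b: 8 payload bits per 10 bits on the wire
MAX_PACKET_BYTES = 4096         # maximum packet size (4 KB) from the text


def useful_bandwidth_gbps(lanes):
    """Payload bandwidth of an SDR link with the given number of lanes (e.g. 4 for 4X)."""
    return lanes * SDR_LANE_SIGNALING_GBPS * ENCODING_EFFICIENCY


def packets_per_message(message_bytes, packet_bytes=MAX_PACKET_BYTES):
    """Number of packets needed for one message (assumption: headers are ignored)."""
    return max(1, math.ceil(message_bytes / packet_bytes))


print("4X SDR useful bandwidth:", useful_bandwidth_gbps(4), "Gb/s")
for size in (512, 4096, 10000):
    print(f"{size} B message -> {packets_per_message(size)} packet(s)")
```

Note that the small messages mentioned earlier (512 B to 4 KB) each fit in a single packet, so for them the per-message latency matters far more than the raw link bandwidth.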
The hardware and its drivers support RDMA and reliable connection handling to achieve this performance. Several protocols can be used over InfiniBand:
  • IPoIB (IP over InfiniBand) enables standard TCP/IP applications to run unmodified.
  • SDP (Sockets Direct Protocol) or WSD (Winsock Direct): reduces latency in the software stack by using RDMA, reliable connection handling and offloading to the channel adapters.
  • SRP (SCSI RDMA Protocol): allows block storage operations over InfiniBand, taking advantage of the low latency and high bandwidth. IB / Fibre Channel gateways are often required to access the storage device.
  • uDAPL (User Direct Access Programming Library): API specification for all RDMA-capable transports. Used, for instance, by Oracle RAC.
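
As a small illustration of what "unmodified" means for IPoIB, the sketch below is an ordinary TCP echo round trip written against the plain sockets API. Nothing in it is InfiniBand-specific. It uses the loopback address so it runs anywhere; the assumption is that on a cluster you would simply substitute the IP address configured on the node's IPoIB interface. The helper names are hypothetical.

```python
import socket
import threading


def serve_once(server_sock):
    """Accept one connection and echo back whatever arrives until the peer closes."""
    conn, _ = server_sock.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)


def echo_roundtrip(host, payload):
    """Start a one-shot echo server on `host` and send `payload` through it."""
    # Port 0 lets the operating system pick a free port.
    with socket.create_server((host, 0)) as server:
        port = server.getsockname()[1]
        worker = threading.Thread(target=serve_once, args=(server,))
        worker.start()
        with socket.create_connection((host, port)) as client:
            client.sendall(payload)
            client.shutdown(socket.SHUT_WR)   # signal end of message
            chunks = []
            while True:
                chunk = client.recv(4096)
                if not chunk:
                    break
                chunks.append(chunk)
        worker.join()
    return b"".join(chunks)


# Loopback keeps the sketch runnable anywhere. On a cluster, HOST would be the
# IP address of the IPoIB interface (an assumption about your configuration).
HOST = "127.0.0.1"
reply = echo_roundtrip(HOST, b"x" * 512)
print(len(reply), "bytes echoed")
```

The trade-off is that this path still goes through the full TCP/IP stack, which is why SDP, WSD and the RDMA-based APIs in the list above exist.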

Some implementations of MPI (e.g. MVAPICH2) use uDAPL, mVAPI (Mellanox’s verbs API) or OpenFabrics Gen2 VAPI to communicate with the InfiniBand adapters (or, more precisely, with their drivers, which can do most of the processing in user mode). They therefore make more efficient use of resources than a simple socket-based implementation (e.g. “vanilla” MPICH2 or MSMPI).


Microsoft implements MPI using sockets over the standard TCP stack or Winsock Direct, for maximum flexibility. We rely on third parties (e.g. Mellanox, a manufacturer of InfiniBand adapters, or the OpenFabrics consortium) for any other protocols.

Getting Started
 
Familiarize yourself with InfiniBand at http://www.infinibandta.org
Familiarize yourself with the OpenIB stack at http://www.openfabrics.org
Check out MVAPICH at http://mvapich.cse.ohio-state.edu/overview/mvapich2/
Read about IB applications at http://www.mellanox.com/support/whitepapers.php
Read more about InfiniBand support in our performance tuning whitepaper at http://www.microsoft.com/windowsserver2003/ccs/technology.aspx