posted Sunday, February 18, 2007 1:04 PM by dongarra | 1 Comments
The Performance API (PAPI) project specifies a standard application programming interface (API) for accessing the hardware performance counters available on most modern microprocessors. PAPI provides portability across platforms by using the same routines, with similar argument lists, to control and access the counters. To be successful, however, the PAPI library needs a little help from the operating system to gain access to the information in the counters.
Presently, we have the latest version of PAPI (v3.5) running on the Cluster. Recompiling the test harness and the DLL proved to be relatively straightforward; most of the difficulty came in sorting through the assembly-level portions of the kernel driver that provides access to the counters. The AMD64 environment provides no inline assembler, and the WinPMC kernel driver relied on inline assembly to access the hardware counters. There was also some inconsistency in the availability of compiler intrinsics for the assembly instructions needed to access the PMC registers. This revolved around implementations of the cpuid instruction and the rdpmc instruction (available in Microsoft's compiler as the __readpmc intrinsic).
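To illustrate the kind of change involved, here is a minimal sketch of calling cpuid through a compiler intrinsic rather than inline assembly. It is not the WinPMC driver code: the helper names query_cpuid and cpu_vendor are hypothetical, and the non-x86 fallback exists only so the sketch compiles everywhere. The rdpmc side is left out because that instruction faults in user mode unless the operating system has enabled user-level access.

```c
#include <assert.h>
#include <string.h>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#define HAVE_CPUID 1
/* Microsoft compiler: __cpuid fills an int[4] with EAX, EBX, ECX, EDX. */
static void query_cpuid(unsigned leaf, unsigned regs[4]) {
    int r[4];
    int i;
    __cpuid(r, (int)leaf);
    for (i = 0; i < 4; i++) regs[i] = (unsigned)r[i];
}
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_CPUID 1
/* GCC/Clang: __cpuid is a macro taking the leaf and four output lvalues. */
static void query_cpuid(unsigned leaf, unsigned regs[4]) {
    __cpuid(leaf, regs[0], regs[1], regs[2], regs[3]);
}
#else
#define HAVE_CPUID 0
/* Fallback for non-x86 targets: report all-zero registers. */
static void query_cpuid(unsigned leaf, unsigned regs[4]) {
    (void)leaf;
    memset(regs, 0, 4 * sizeof regs[0]);
}
#endif

/* Hypothetical helper: copies the 12-byte vendor string from CPUID leaf 0
 * into out and NUL-terminates it. Returns 1 if CPUID was actually used. */
int cpu_vendor(char out[13]) {
    unsigned regs[4];
    assert(out != NULL);
    query_cpuid(0, regs);
    memcpy(out + 0, &regs[1], 4);  /* EBX */
    memcpy(out + 4, &regs[3], 4);  /* EDX */
    memcpy(out + 8, &regs[2], 4);  /* ECX */
    out[12] = '\0';
    return HAVE_CPUID;
}
```

Calling cpu_vendor with a 13-byte buffer fills in the processor's vendor string on an x86 machine. The same source builds for both 32-bit and 64-bit targets, which is the property the inline-assembly version lacked.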
The C test programs provided with a normal PAPI distribution were built and tested as appropriate for the Windows environment. Most converted and ran cleanly in the Windows Server 2003 environment; some had features that were no longer applicable. The Fortran test and example programs were not converted because, at the time of this work, a suitable replacement for the older Compaq Fortran compiler had not been identified.
Remaining work revolves around two areas. The first involves completing the test and example programs to bring them up to par with what’s available in other PAPI distributions. The second is significantly more involved and requires some explanation.
PAPI is primarily intended as a ‘first-person’ mechanism for attributing hardware counter events to portions of program code. In order to do that, the programmer (or a higher level tool) inserts calls into the user code to start, stop and read the hardware counters at specific points. This fundamentally assumes that the counts occurring between the start call and the stop (or read) call can all be attributed to the user’s code. Such a situation can only be approximated in a multitasking system and can be wildly inaccurate in a busy system. The only way to guarantee that counts can be properly attributed is for the operating system’s context switch routine to save and restore the state of the performance monitoring registers. This is how PAPI behaves in Linux systems. On Windows, the WinPMC driver currently simply controls the state of the counters and hopes for the best. This works acceptably well on laptop or single user systems; not so well on clusters.
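The attribution problem can be seen in a toy model. The sketch below is a simulation only, not PAPI or WinPMC code: the types cpu_t and task_t, the function names, and the event counts are all assumptions made for illustration. Task A measures its own work while task B is scheduled in between, first with a context switch that saves and restores the counter and then with one that leaves it alone.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: one hardware counter on the CPU, shared by every task. */
typedef struct { uint64_t hw_counter; } cpu_t;

/* Per-task saved copy of the counter, used only when the OS virtualizes it. */
typedef struct { uint64_t saved_counter; } task_t;

/* Whichever task is running, its events land in the same hardware counter. */
static void run_task(cpu_t *cpu, uint64_t events) {
    cpu->hw_counter += events;
}

/* With save_restore set, the counter is saved for the outgoing task and
 * reloaded for the incoming one. Without it, the counter just keeps running. */
static void context_switch(cpu_t *cpu, task_t *from, task_t *to, int save_restore) {
    if (!save_restore) return;
    from->saved_counter = cpu->hw_counter;
    cpu->hw_counter = to->saved_counter;
}

/* Task A brackets 150 events of its own work (100 + 50) with start and stop
 * reads, while task B runs 1000 events in the middle. Returns what A sees. */
uint64_t measure_task_a(int save_restore) {
    cpu_t cpu = {0};
    task_t a = {0}, b = {0};
    uint64_t start, stop;

    start = cpu.hw_counter;                     /* A: start */
    run_task(&cpu, 100);                        /* A's first slice */
    context_switch(&cpu, &a, &b, save_restore);
    run_task(&cpu, 1000);                       /* B's unrelated work */
    context_switch(&cpu, &b, &a, save_restore);
    run_task(&cpu, 50);                         /* A's second slice */
    stop = cpu.hw_counter;                      /* A: stop */

    return stop - start;
}
```

In this model measure_task_a(1) returns 150, exactly A's own events, while measure_task_a(0) returns 1150 because B's events are charged to A. The busier the system, the larger that second number grows relative to the first.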
We would like to work with Microsoft engineers to determine the feasibility of modifying the Compute Cluster kernel software to support functionality similar to the open source perfmon2 performance interface (http://sourceforge.net/projects/perfmon2), which is being incorporated into the Linux kernel and is rapidly being adopted as the standard mechanism for accessing hardware performance counters.