posted Sunday, February 18, 2007 1:04 PM by dongarra


 The Performance API (PAPI) project specifies a standard
 application programming interface (API) for accessing hardware
 performance counters available on most modern microprocessors.
 PAPI provides portability across different platforms and uses
 the same routines with similar argument lists to control and
 access the counters. To be successful, however, the PAPI library
 needs a little help from the operating system to gain access
 to the information in the counters.

Current Status

 Presently, we have the latest version of PAPI (v3.5) running on
 the Cluster. Recompiling the test harness and the DLL proved to
 be relatively straightforward; most of the difficulty came in
 sorting through the assembly-level portions of the kernel driver
 that provides access to the counters. The WinPMC kernel driver
 relied on inline assembly to access the hardware counters, but
 the AMD64 build environment provides no inline assembler. In
 addition, the compiler intrinsics that stand in for the assembly
 instructions needed to reach the PMC registers were not
 consistently available. The trouble centered on the cpuid
 instruction and the rdpmc (read performance-monitoring counter)
 instruction.
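
 To illustrate the intrinsic-based approach, here is a minimal sketch (not the
 WinPMC driver code) that reads the CPU vendor string through cpuid without any
 inline assembly. It assumes the Microsoft `__cpuid` intrinsic from `<intrin.h>`
 on Windows and falls back to `__get_cpuid` from `<cpuid.h>` under GCC or Clang;
 the helper names `cpuid_leaf0` and `cpu_vendor` are hypothetical. The counter
 read itself (the rdpmc instruction, exposed by the Microsoft compiler as
 `__readpmc`) is deliberately left out, since it normally faults outside kernel
 mode unless the operating system has enabled user-level access.

```c
#include <assert.h>
#include <string.h>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#define HAVE_CPUID 1
/* Microsoft compilers: __cpuid fills an int[4] with EAX, EBX, ECX, EDX. */
static int cpuid_leaf0(unsigned int regs[4]) {
    int r[4];
    __cpuid(r, 0);
    regs[0] = (unsigned int)r[0];
    regs[1] = (unsigned int)r[1];
    regs[2] = (unsigned int)r[2];
    regs[3] = (unsigned int)r[3];
    return 1;
}
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_CPUID 1
/* GCC/Clang: __get_cpuid returns 1 when the requested leaf is supported. */
static int cpuid_leaf0(unsigned int regs[4]) {
    return __get_cpuid(0, &regs[0], &regs[1], &regs[2], &regs[3]);
}
#else
#define HAVE_CPUID 0   /* not an x86 target: no cpuid instruction */
#endif

/* Writes the vendor string (at most 12 characters plus a terminator) into
   out. Returns 1 on success, 0 if cpuid is unavailable on this target. */
static int cpu_vendor(char out[13]) {
    assert(out != NULL);
    memset(out, 0, 13);
#if HAVE_CPUID
    unsigned int r[4];
    if (!cpuid_leaf0(r)) {
        return 0;
    }
    /* Leaf 0 returns the vendor string in EBX, EDX, ECX order. */
    memcpy(out + 0, &r[1], 4);
    memcpy(out + 4, &r[3], 4);
    memcpy(out + 8, &r[2], 4);
    return 1;
#else
    return 0;
#endif
}
```

 Calling `cpu_vendor(buf)` with a 13-byte buffer leaves the vendor identifier in
 `buf` on x86 hardware. The preprocessor guards are the main point: the same
 source builds with either toolchain because nothing depends on an inline
 assembler.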

 The C test programs provided with a normal PAPI distribution were
 built and tested as appropriate for the Windows environment.
 Most converted and ran cleanly in the Windows Server 2003 environment;
 some had features that were no longer applicable. The Fortran test and
 example programs were not converted, since at the time of this work,
 a suitable Fortran compiler replacement for the older Compaq Fortran
 compiler had not been identified.

Future Work

 Remaining work revolves around two areas. The first involves completing the
 test and example programs to bring them up to par with what’s available in
 other PAPI distributions. The second is significantly more involved and
 requires some explanation.

 PAPI is primarily intended as a ‘first-person’ mechanism for attributing
 hardware counter events to portions of program code. In order to do that,
 the programmer (or a higher-level tool) inserts calls into the user code to start,
 stop and read the hardware counters at specific points. This fundamentally assumes
 that the counts occurring between the start call and the stop (or read) call can
 all be attributed to the user’s code. Such a situation can only be approximated
 in a multitasking system and can be wildly inaccurate in a busy system. The only
 way to guarantee that counts can be properly attributed is for the operating system’s
 context switch routine to save and restore the state of the performance monitoring
 registers. This is how PAPI behaves on Linux systems. On Windows, the WinPMC driver
 currently just controls the state of the counters and hopes for the best. This
 works acceptably well on laptops or single-user systems, but not so well on clusters.
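
 The attribution problem can be seen in a small simulation. The sketch below is
 purely illustrative and touches neither PAPI nor any real counter: `sim_cpu`,
 `sim_run`, `sim_switch` and `measure_region` are hypothetical names, and the
 event counts are made-up numbers. Task 0 brackets 1000 of its own events with a
 "start" and a "stop", while task 1 is scheduled in the middle and generates
 5000 events on the same physical counter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one physical counter shared by two tasks. */
typedef struct {
    uint64_t hw_counter;  /* the single hardware counter register */
    uint64_t saved[2];    /* per-task counter values kept by the "OS" */
    int      current;     /* index of the task that is running */
    int      virtualize;  /* 1: context switch saves/restores the counter */
} sim_cpu;

static void sim_init(sim_cpu *cpu, int virtualize) {
    cpu->hw_counter = 0;
    cpu->saved[0] = 0;
    cpu->saved[1] = 0;
    cpu->current = 0;
    cpu->virtualize = virtualize;
}

/* The running task generates some countable events. */
static void sim_run(sim_cpu *cpu, uint64_t events) {
    cpu->hw_counter += events;
}

/* Context switch. With virtualization the outgoing task's count is saved
   and the incoming task's count is restored; without it the counter simply
   keeps running across tasks. */
static void sim_switch(sim_cpu *cpu, int next) {
    assert(next == 0 || next == 1);
    if (cpu->virtualize) {
        cpu->saved[cpu->current] = cpu->hw_counter;
        cpu->hw_counter = cpu->saved[next];
    }
    cpu->current = next;
}

/* Returns the count task 0 observes for a region in which it really
   generated 400 + 600 = 1000 events. */
static uint64_t measure_region(int virtualize) {
    sim_cpu cpu;
    sim_init(&cpu, virtualize);

    uint64_t start = cpu.hw_counter;  /* task 0: "start" */
    sim_run(&cpu, 400);
    sim_switch(&cpu, 1);              /* task 0 is preempted */
    sim_run(&cpu, 5000);              /* task 1's unrelated work */
    sim_switch(&cpu, 0);              /* task 0 resumes */
    sim_run(&cpu, 600);
    return cpu.hw_counter - start;    /* task 0: "stop" */
}
```

 In this model `measure_region(0)` reports 6000 events, because the other
 task's work leaks into the measurement, while `measure_region(1)` reports the
 1000 events that task 0 actually generated. The busier the system, the larger
 the gap becomes.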

 We would like to work with Microsoft engineers to determine the feasibility of
 modifying the Compute Cluster kernel software to support functionality similar
 to the open source perfmon2 performance interface
 that is being incorporated into the Linux kernel and rapidly adopted as the
 standard mechanism for accessing hardware performance counters.