The RDMSR machine instruction has to be executed in kernel mode, so you have to pay at least one kernel crossing penalty any time you have to read one or more MSRs.
You don't need to use MSRs to read the core performance counters. There is a bit in processor configuration register CR4 (CR4.PCE) that when set allows user-mode programs to execute the RDPMC instruction. Some recent versions of Linux set this bit by default, and it is pretty easy to build a loadable kernel module that will set this bit if Linux clears it by default. With this bit set, the overhead of reading two counters is about 1/100th of the overhead through PAPI -- on the order of 30 cycles per counter. With the user-mode RDPMC approach, all cores are also able to read their counters concurrently -- avoiding any possible serialization in the OS driver. (This is especially important on Xeon Phi, since there are 244 logical cores, each with two core counters.)
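For reference, here is a minimal sketch of what a user-mode counter read via RDPMC looks like once CR4.PCE is set. It assumes the counters have already been programmed (e.g., via MSR writes or perf_event_open); the counter-index convention shown (bit 30 of ECX selecting the fixed-function counters) is the standard one.

/* Minimal sketch of user-mode counter reads via RDPMC (requires CR4.PCE = 1).
 * Programmable counter n is selected with ECX = n; fixed-function counter n
 * with ECX = (1 << 30) + n.  The counters must already have been programmed. */
#include <stdio.h>

static inline unsigned long long rdpmc(unsigned int counter)
{
    unsigned int lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
    unsigned long long pmc0   = rdpmc(0);           /* programmable counter 0 */
    unsigned long long fixed0 = rdpmc(1u << 30);    /* fixed counter 0 (instructions retired) */
    printf("PMC0 = %llu, FIXED0 = %llu\n", pmc0, fixed0);
    return 0;
}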
I don't know how to build a kernel module for OS/X, but if you can build and install PCM, then you ought to be able to build and install a simple module to set CR4.PCE.
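On Linux, such a module can be as small as the sketch below (the name enable_rdpmc is my own). Note that very recent kernels pin CR4 and may warn about or block direct writes like this, so treat it as an illustration rather than production code.

/* enable_rdpmc.c -- sketch of a loadable module that sets CR4.PCE (bit 8) on
 * every CPU so that user-mode RDPMC is permitted. */
#include <linux/module.h>
#include <linux/smp.h>

static void set_pce(void *arg)
{
    unsigned long cr4;
    __asm__ volatile("mov %%cr4, %0" : "=r"(cr4));
    cr4 |= (1UL << 8);                          /* CR4.PCE */
    __asm__ volatile("mov %0, %%cr4" : : "r"(cr4));
}

static int __init enable_rdpmc_init(void)
{
    on_each_cpu(set_pce, NULL, 1);              /* run on every online CPU */
    return 0;
}

static void __exit enable_rdpmc_exit(void)
{
    /* One could clear the bit again here if desired. */
}

module_init(enable_rdpmc_init);
module_exit(enable_rdpmc_exit);
MODULE_LICENSE("GPL");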
Back to the kernel: I have not played with this on OS/X, but on recent Linux systems I have found that simple ioctl() calls to kernel device drivers have overheads in the range of 500 cycles. Unfortunately most performance monitoring hardware has a lot of additional software layers to traverse, so it is typically quite a bit slower -- on my Linux (RHEL6.4) systems with 3.1 GHz Xeon E5 processors, PAPI takes an average of over 7000 cycles (over 2.3 microseconds) to read two performance counter values.
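One way to obtain numbers like these is to wrap the counter-read call in TSC reads and average over many iterations. The sketch below times PAPI_read() this way; the event choices are arbitrary and it assumes PAPI is installed (compile with -lpapi).

/* Sketch: measuring the average cost of a PAPI counter read with the TSC. */
#include <papi.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    int evset = PAPI_NULL;
    long long values[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_TOT_CYC);
    PAPI_add_event(evset, PAPI_TOT_INS);
    PAPI_start(evset);

    /* Time many PAPI_read() calls and report the average cost in TSC cycles. */
    const int iters = 1000;
    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < iters; i++)
        PAPI_read(evset, values);
    unsigned long long t1 = __rdtsc();

    printf("average PAPI_read() cost: %llu reference cycles\n", (t1 - t0) / iters);

    PAPI_stop(evset, values);
    return 0;
}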
You definitely have the right idea about coalescing transactions -- once you are in the kernel you want to read all the MSRs in a single call. Unfortunately this probably means writing your own device driver (since none of the interfaces I know of will do this for you). The standard Linux msr device driver will accept a "length" argument, but instead of reading a set of contiguous MSRs, it just reads the one MSR multiple times.
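For illustration, this is what reads through the standard msr driver look like: one pread() per MSR, with the MSR number as the file offset, so each read is a separate kernel crossing. The MSR numbers used here (IA32_PMC0/IA32_PMC1) are just examples.

/* Sketch: reading MSRs one at a time through /dev/cpu/0/msr (requires root). */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint32_t msrs[] = { 0xC1, 0xC2 };            /* IA32_PMC0, IA32_PMC1 */
    for (unsigned i = 0; i < 2; i++) {
        uint64_t value;
        /* One 8-byte read per MSR -- the driver does not coalesce a range. */
        if (pread(fd, &value, sizeof(value), msrs[i]) != sizeof(value)) {
            perror("pread");
            return 1;
        }
        printf("MSR 0x%x = 0x%llx\n", msrs[i], (unsigned long long)value);
    }
    close(fd);
    return 0;
}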
And just to add more complexity: Starting with Sandy Bridge, Intel has put some of the performance counters into PCI configuration space or general memory-mapped IO space. These can be read using kernel drivers, or a device driver can provide an mmap() function that allows direct user-space access to the corresponding physical addresses. The resulting accesses are handled as uncached by the hardware, but this is a lot faster than going into the kernel and then doing the uncached access, and also provides the ability for multiple cores to read from different counters at the same time. On the Xeon Phi, for example, the memory controller counters are located in memory-mapped IO space, and I measured direct user-mode access latency at slightly over 200 ns (per 32-bit load). Xeon E3's also have their memory controller counters in memory-mapped IO space, but I don't think I measured the latency to read those counters.
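A heavily simplified sketch of the direct-mapping approach follows. The base address and offset are placeholders (the real values come from the platform's PCI configuration registers), and mapping /dev/mem generally requires root plus a kernel with STRICT_DEVMEM relaxed; alternatively, a small driver can expose its own mmap() for the region.

/* Sketch: direct user-space access to a memory-mapped counter via mmap().
 * COUNTER_PHYS_BASE and COUNTER_OFFSET are placeholders only. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define COUNTER_PHYS_BASE 0xD0000000UL  /* placeholder: page-aligned physical base */
#define COUNTER_OFFSET    0x50UL        /* placeholder: byte offset of a 32-bit counter */

int main(void)
{
    int fd = open("/dev/mem", O_RDONLY);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    void *map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, COUNTER_PHYS_BASE);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    volatile uint32_t *ctr = (volatile uint32_t *)((char *)map + COUNTER_OFFSET);
    uint32_t count = *ctr;               /* a single uncached 32-bit load */
    printf("counter = 0x%x\n", count);

    munmap(map, 4096);
    close(fd);
    return 0;
}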
Thanks Dr. McCalpin,
In addition to the above, supporting multiple operating systems raises more issues. On Windows I actually have an internal-to-Intel version of the WinRing0 driver that allows reading multiple MSRs, but we can't release it due to security concerns... so we are left asking users to download the public WinRing0 driver or build the PCM driver themselves.
On Linux, the /dev/cpu/*/msr (or /dev/msr*) API doesn't support reading multiple MSRs in a single call.
It is not clear to me that much would be gained in PCM performance. The MSR reads are done in parallel (I think... it has been a while since I looked), and they don't take very long relative to the 1-second sleep interval. I'm guessing that printing the results takes longer than reading the MSRs.
I think the general philosophy of PCM was to make it 'good enough' and then, if people want higher performance or features requiring many man-hours, they can use tools like Linux perf or VTune, where massive effort has gone into making them fast and extremely capable.
Other possible enhancements to PCM (present in some tools) include: 1) support for reading event files (such as the VTune event files) and letting the user pick any event in the file; 2) computing general metrics on the fly; 3) event multiplexing; 4) an event scheduler; 5) writing results to a file and replaying them; and 6) a GUI interface.