PCM: Faster way to query MSR on OS/X

Gilbert_C_ · ‎09-07-2014

Hi all:

I'm working on profiling an application in a platform-independent way. Since I've had mixed results with trying to use PAPI on OS/X, I was looking into alternatives and stumbled onto the (very cool) Intel PCM library.

However, upon trying to use the PCM code (version 2.6) to instrument my application, I've found that calls to read an MSR appear to be quite slow on OS/X. Specifically, on my Ivy Bridge i7, a call to msr->read takes somewhere between 5us and 50us, averaging around 10us (measured with mach_absolute_time across a few million calls). I have 8 cores, and I need to read 6 counters from each of them (so 8 * 6 = 48 calls total). 48 * 10us = 480us to read a complete set of performance counters. Assuming I record statistics 100 times / second, that's 48 ms of system time / second burned in msr->read. That's a *lot* of overhead for a relatively coarse recording frequency, I think.

Also, note that the above is already tuned slightly: by default, PCM::getSystemCounterState() would make a *lot* more than 6 MSR read calls, but I went ahead and ripped out the logic to simply query the counters I needed for the statistics I wanted.

So, I guess my question is where I should be going from here. I have a few ideas to make this work a little faster, most of which involve finding a way to add a bulk read method for MSRs to reduce the aggregate load on the system ... but I'd be interested to hear if / how other folks have approached this problem. Alternatively, it could be that I'm doing something completely wrong here, which I'd be interested to hear as well :)

Discussion / thoughts / insight would be appreciated!
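The overhead estimate in the post above can be written out as a small helper. This is only a sketch of the arithmetic; the function and parameter names are made up for illustration and are not part of PCM.

```c
/*
 * Microseconds of counter-read time spent per second of wall-clock time.
 * Hypothetical helper, not part of PCM: it just multiplies out the
 * estimate (cores * counters per core * cost per read * sample rate).
 */
double read_overhead_us_per_sec(int cores, int counters_per_core,
                                double us_per_read, double samples_per_sec)
{
    return cores * counters_per_core * us_per_read * samples_per_sec;
}
```

With the figures from the post, read_overhead_us_per_sec(8, 6, 10.0, 100.0) gives 48000 us, i.e. 48 ms per second, or roughly 4.8% of one core's time.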

McCalpinJohn · ‎09-08-2014

The RDMSR machine instruction has to be executed in kernel mode, so you have to pay at least one kernel crossing penalty any time you have to read one or more MSRs.

BUT...

You don't need to use MSRs to read the core performance counters. There is a bit in control register CR4 (CR4.PCE) that, when set, allows user-mode programs to execute the RDPMC instruction. Some recent versions of Linux set this bit by default, and it is pretty easy to build a loadable kernel module that will set this bit if Linux clears it by default. With this bit set, the overhead of reading two counters is about 1/100th of the overhead through PAPI -- on the order of 30 cycles per counter. With the user-mode RDPMC approach, all cores are also able to read their counters concurrently -- avoiding any possible serialization in the OS driver. (This is especially important on Xeon Phi, since there are 244 logical cores, each with two core counters.)
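As a rough illustration of the user-mode path described above, a minimal RDPMC wrapper might look like the following. It assumes GCC/Clang-style inline assembly on x86; the helper names are hypothetical. The instruction raises a general-protection fault in user mode unless CR4.PCE has been set, so the wrapper is only usable once the kernel has enabled it.

```c
#include <stdint.h>

/* Bit 30 of ECX selects the fixed-function counter bank for RDPMC. */
#define RDPMC_FIXED_BANK (1u << 30)

/* ECX value selecting programmable (general-purpose) counter n. */
uint32_t rdpmc_select_programmable(uint32_t n)
{
    return n;
}

/* ECX value selecting fixed-function counter n. */
uint32_t rdpmc_select_fixed(uint32_t n)
{
    return RDPMC_FIXED_BANK | n;
}

/* RDPMC returns the counter in EDX:EAX; join the halves into 64 bits. */
uint64_t rdpmc_combine(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

#if defined(__x86_64__) || defined(__i386__)
/* Read one counter on the current core. Faults unless CR4.PCE is set. */
uint64_t rdpmc(uint32_t select)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(select));
    return rdpmc_combine(lo, hi);
}
#endif
```

Because the read happens on whichever core the calling thread is running on, a thread would normally be pinned to a core before sampling that core's counters.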

I don't know how to build a kernel module for OS/X, but if you can build and install PCM, then you ought to be able to build and install a simple module to set CR4.PCE.

Back to the kernel: I have not played with this on OS/X, but on recent Linux systems I have found that simple ioctl() calls to kernel device drivers have overheads in the range of 500 cycles. Unfortunately most performance monitoring hardware has a lot of additional software layers to traverse, so it is typically quite a bit slower -- on my Linux (RHEL6.4) systems with 3.1 GHz Xeon E5 processors, PAPI takes an average of over 7000 cycles (over 2.3 microseconds) to read two performance counter values.

You definitely have the right idea about coalescing transactions -- once you are in the kernel you want to read all the MSRs in a single call. Unfortunately this probably means writing your own device driver (since none of the interfaces I know of will do this for you). The standard Linux msr device driver will accept a "length" argument, but instead of reading a set of contiguous MSRs, it just reads the one MSR multiple times.
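For reference, reading several MSRs through the standard Linux msr device looks roughly like the sketch below: the file offset passed to pread() selects the MSR number, and every MSR costs its own system call, which is the overhead a batching driver would avoid. The function names are illustrative; the code assumes the msr kernel module is loaded and the caller has permission to open the device, and it simply returns -1 otherwise.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Build the msr device path for one logical CPU; returns snprintf's result. */
int msr_device_path(char *buf, size_t len, int cpu)
{
    return snprintf(buf, len, "/dev/cpu/%d/msr", cpu);
}

/*
 * Read n MSRs from one CPU into values[]. Returns 0 on success, -1 on
 * any failure (device missing, no permission, short read).
 */
int read_msrs(int cpu, const uint32_t *msrs, int n, uint64_t *values)
{
    char path[64];
    msr_device_path(path, sizeof path, cpu);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = 0;
    for (int i = 0; i < n; i++) {
        /* The offset is the MSR number; each pread() is a separate kernel crossing. */
        ssize_t got = pread(fd, &values[i], sizeof values[i], (off_t)msrs[i]);
        if (got != (ssize_t)sizeof values[i]) {
            rc = -1;
            break;
        }
    }

    close(fd);
    return rc;
}
```

Keeping the descriptor open across reads saves the open/close cost, but the per-MSR kernel crossing remains.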

And just to add more complexity: Starting with Sandy Bridge, Intel has put some of the performance counters into PCI configuration space or general memory-mapped IO space. These can be read using kernel drivers, or a device driver can provide an mmap() function that allows direct user-space access to the corresponding physical addresses. The resulting accesses are handled as uncached by the hardware, but this is a lot faster than going into the kernel and then doing the uncached access, and also provides the ability for multiple cores to read from different counters at the same time. On the Xeon Phi, for example, the memory controller counters are located in memory-mapped IO space, and I measured direct user-mode access latency at slightly over 200 ns (per 32-bit load). Xeon E3's also have their memory controller counters in memory-mapped IO space, but I don't think I measured the latency to read those counters.

Patrick_F_Intel1 · ‎09-08-2014

Thanks Dr. McCalpin,

In addition to the above, supporting multiple OSes raises more issues. On Windows I actually have an Intel-internal version of the WinRing0 driver which allows reading multiple MSRs. But we can't release it due to security concerns, so we are left asking users to download the public WinRing0 driver or build the PCM driver themselves.

On Linux, the /dev/cpu/*/msr (or /dev/msr*) API doesn't support reading multiple MSRs in a single call.

It is not clear to me that much would be gained for PCM performance. The MSR reads are done in parallel (I think... it has been a while since I looked). They don't take very long relative to the 1 second sleep interval. I'm guessing that the printing of the results takes longer than the reading of the MSRs.

I think the general philosophy of PCM was to make it 'good enough' and then, if people want higher performance or 'features requiring many man-hours', they can use tools like Linux perf or VTune where massive effort has gone into making them fast and extremely capable.

Other possible enhancements to PCM (present in some tools) include 1) support for reading event files (such as the VTune event files) and letting the user pick any of the events in the file, 2) having general metrics computed on the fly, 3) supporting event multiplexing, 4) having an event scheduler, 5) writing results to a file and replaying the results, and 6) a GUI.

Pat

Gilbert_C_ · ‎09-09-2014

Hi all:

Thanks for the replies!

Dr. McCalpin: This response made for some great reading, and did offer some useful direction. Also, thanks much for the CR4 / RDPMC notes: I'll need to look into that.

Pat: I appreciate the information and the thoughts there. It's a shame that y'all can't release the updated WinRing0 driver. With that said, the PCM code that is currently publicly released has been a nice reference for me thus far, so thanks for that.

Regardless, sounds like I have some more research to do here, and probably some coding to do as well :)

-Gilbert

Gilbert_C_ · ‎09-12-2014

Hi:

As a brief update to this, I hacked on the OS/X version of the driver to batch reads. Doing this seemed to significantly improve the performance I saw.

In the case of grouping MSR reads to query a single MSR on all processors at once, this is not surprising to me: see [1] for an explanation of why I believe this to be the case. The implementation of the driver did raise a slight concern about accuracy of the readings of a function like getSystemCounterState(), though: see [2] for that.

The idea that reading multiple MSRs would be about as expensive as reading a single MSR was more surprising to me: my theory would be that the mp_rendezvous_no_intrs() call used to read the counters accounts for the majority of the expense of reading the MSRs in this case (since the synchronization involved strikes me as being a little tricky).

It seems scary to execute *too* much code with interrupts disabled and / or to be working with a large amount of data at that level, so I've limited the number of CPUs / number of MSRs that can be queried in one syscall to be 16 and 8, respectively. This is plenty to support the machine I'm using, so it works for me :)

See [3] for a few measurements I've gathered. Note that I'm using mach_absolute_time() around a *single call* to each of these methods for the numbers below, so they may or may not be representative.

In any event, 12us is probably good enough for what I'm doing, so I don't know if I'm going to keep going at this point. If I do, though, I'm thinking I'd look at:

* RDPMC (as suggested above)
* instrument mp_rendezvous_no_intrs() to see if that is indeed the source of the overhead. If so, it might be worthwhile to spend some time on a more efficient way to handle the MSR reads in the driver.

Regardless, thanks again for the guidance, all!

-Gilbert

[1] Code lifted from the PCM driver that is called by mp_rendezvous_no_intrs (read: executed on all cores in parallel):

    void cpuReadMSR(void* pIData)
    {
        pcm_msr_data_t* data = (pcm_msr_data_t*)pIData;
        volatile uint cpu = cpu_number();
        /* Runs on every core; only the requested core performs the read. */
        if (data->cpu_num == cpu)
        {
            data->value = RDMSR(data->msr_num);
        }
    }

Since that's executed on all cores, removing that if() and reading from multiple cores is more-or-less free. I say more-or-less because I'd wonder if there would be some impact on keeping caches synchronized between the cores ... but I'm also not yet familiar enough with the architecture to understand how caching works across multiple cores all that well :)

[2] One thing that concerns me a little bit about [1] is that it seems like this would mean that reading counts from cores is going to result in some loss of accuracy due to the lag between reading MSRs individually from each:

    core[0] = +0us
    core[1] = +12us+scheduling
    core[2] = +24us+scheduling
    ...
    core[7] = +84us+scheduling

That probably doesn't matter in practice ... and maybe there's a reason I'm wrong?

[3] A few measurements below. These are intended to be purely informational: YMMV

This is a timing for reading a single MSR value from a single CPU:

    msr->read (8): 11us -- 161820419390

This is a timing for reading a single MSR value from all CPUs:

    msr->readGroup (128): 12us
    msr[0] = 161820267332
    msr[1] = 45343604000
    msr[2] = 102241532524
    msr[3] = 42325588523
    msr[4] = 105899453665
    msr[5] = 41519634157
    msr[6] = 93480892200
    msr[7] = 41046934404

This is a timing for reading multiple MSR values from all CPUs at once (note: the trailing 0 is because I'm only reading 7 out of the possible 8 MSRs my driver hack supports):

    msr->readMulti (1024): 11us
    msr[0] = 161820326972,273375514808,240725126328,286534971,115867607,64045204,236594683,0
    msr[1] = 45343633095,142384088597,108762167232,17093959,51942725,61478130,70529901,0
    msr[2] = 102241655510,199008226014,162194848272,130201080,61166155,53448332,150618921,0
    msr[3] = 42325617573,141864156752,107964138048,16501878,40592217,53645809,63804252,0
    msr[4] = 105899496859,201136101301,159665507856,146597712,44129146,34726272,150211021,0
    msr[5] = 41519658544,142329032286,108198137928,16025735,27081322,38941627,57578092,0
    msr[6] = 93480947808,185000523474,145553508552,96500340,26153753,27943254,130693398,0
    msr[7] = 41046957129,141023012541,107145927168,15386137,22328146,34047050,52089244,0

As a baseline, this is a timing for how long getSystemCounterState() takes to execute:

    getSystemCounterState: 752us
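
The batching idea from [1] (drop the per-CPU check and let every core fill in its own results) can be sketched in user space as follows. This is not the actual driver hack: the struct layout, the function names, and the rdmsr_fn callback standing in for the privileged RDMSR instruction are all assumptions made for illustration. Only the 16-CPU / 8-MSR limits come from the post.

```c
#include <stdint.h>
#include <string.h>

#define MAX_CPUS 16 /* per-call limits quoted in the post */
#define MAX_MSRS 8

/* Hypothetical stand-in for the privileged RDMSR instruction. */
typedef uint64_t (*rdmsr_fn)(uint32_t cpu, uint32_t msr);

/* Hypothetical request/result block shared by all cores. */
typedef struct {
    uint32_t n_cpus;
    uint32_t n_msrs;
    uint32_t msr_num[MAX_MSRS];
    uint64_t value[MAX_CPUS][MAX_MSRS];
} msr_batch_t;

/* Validate the limits and set up a zeroed request. Returns 0 or -1. */
int batch_init(msr_batch_t *b, uint32_t n_cpus,
               const uint32_t *msrs, uint32_t n_msrs)
{
    if (n_cpus > MAX_CPUS || n_msrs > MAX_MSRS)
        return -1;
    memset(b, 0, sizeof *b);
    b->n_cpus = n_cpus;
    b->n_msrs = n_msrs;
    memcpy(b->msr_num, msrs, n_msrs * sizeof msrs[0]);
    return 0;
}

/*
 * Body that a rendezvous would run on every core. Each core writes only
 * its own row, so no core waits for another to finish its reads.
 */
void batch_read_on_cpu(msr_batch_t *b, uint32_t cpu, rdmsr_fn rdmsr)
{
    if (cpu >= b->n_cpus)
        return;
    for (uint32_t i = 0; i < b->n_msrs; i++)
        b->value[cpu][i] = rdmsr(cpu, b->msr_num[i]);
}

/* Serial simulation of the rendezvous, for exercising the logic only. */
void batch_read_simulated(msr_batch_t *b, rdmsr_fn rdmsr)
{
    for (uint32_t cpu = 0; cpu < b->n_cpus; cpu++)
        batch_read_on_cpu(b, cpu, rdmsr);
}

/* Fake reader that encodes its arguments so results are easy to check. */
uint64_t fake_rdmsr(uint32_t cpu, uint32_t msr)
{
    return ((uint64_t)cpu << 32) | msr;
}
```

Unused slots stay zero after batch_init(), which mirrors the trailing 0 column in the readMulti output above. In a real driver every core would take its samples at nearly the same moment, which would also shrink the per-core skew raised in [2].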