[PCM] Kernel panic on OS X 10.9

Danilcha · 06-11-2015

Hello,

My computer restarts every time I try to launch a simple program:

	PCM *m = PCM::getInstance();
	if (m->program(PCM::DEFAULT_EVENTS, NULL) != PCM::Success)
	{
		std::cerr << "Failed to start PCM" << std::endl;
		exit(1);
	}
	SystemCounterState before = getSystemCounterState();
	SystemCounterState after = getSystemCounterState();
	std::cout << "Instructions per Clock: " << getIPC(before, after) <<
	"\nL3 cache hit ratio: " << getL3CacheHitRatio(before, after) <<
	"\nL2 cache hit ratio: " << getL2CacheHitRatio(before, after) <<
	"\nWasted cycles caused by L3 misses: " << getCyclesLostDueL3CacheMisses(before, after) <<
	"\nBytes read from DRAM: " << getBytesReadFromMC(before, after) <<
	std::endl;
	m->cleanup();

I get a kernel panic. The same thing happened after running pcm.x for a minute. I am on OS X 10.9.5. This is the report:

Thu Jun 11 13:53:17 2015
panic(cpu 0 caller 0xffffff80010dcc1d): Kernel trap at 0xffffff7f81a97bfc, type 13=general protection, registers:
CR0: 0x000000008001003b, CR2: 0x000000076d6e3000, CR3: 0x000000006e2bd01c, CR4: 0x00000000001606e0
RAX: 0x0000000000000000, RBX: 0xffffff802995af84, RCX: 0x0000000000000c8f, RDX: 0x0000013d31b9ca72
RSP: 0xffffff80e406dec0, RBP: 0xffffff80e406ded0, RSI: 0x0000013e4fd3f19f, RDI: 0xffffff802995af84
R8:  0x0000000000000001, R9:  0x00000000cccccccd, R10: 0x00000001048519a8, R11: 0x000000076d6e3b50
R12: 0xffffff8001517415, R13: 0xffffff800165a8e0, R14: 0x0000000000000000, R15: 0xffffff80015173cd
RFL: 0x0000000000010046, RIP: 0xffffff7f81a97bfc, CS:  0x0000000000000008, SS:  0x0000000000000000
Fault CR2: 0x000000076d6e3000, Error code: 0x0000000000000000, Fault CPU: 0x0

Backtrace (CPU 0), Frame : Return Address
0xffffff80e4079c50 : 0xffffff8001023139 
0xffffff80e4079cd0 : 0xffffff80010dcc1d 
0xffffff80e4079ea0 : 0xffffff80010f4486 
0xffffff80e4079ec0 : 0xffffff7f81a97bfc 
0xffffff80e406ded0 : 0xffffff80010e402e 
0xffffff80e406df10 : 0xffffff80010e394e 
0xffffff80e406df50 : 0xffffff80010e2c96 
0xffffff80e406df80 : 0xffffff80010dc05f 
0xffffff80e406dfd0 : 0xffffff80010f4649 
0xffffff811eb33c90 : 0xffffff80010a3bc0 
0xffffff811eb33cd0 : 0xffffff800108ed72 
0xffffff811eb33d50 : 0xffffff800107977e 
0xffffff811eb33f20 : 0xffffff80010dd05c 
0xffffff811eb33fb0 : 0xffffff80010f438b 
      Kernel Extensions in backtrace:
         com.intel.driver.PcmMsr(1.0)[8E137983-87E4-37B1-8E6C-A6D8BC38C80B]@0xffffff7f81a97000->0xffffff7f81a9afff

BSD process name corresponding to current thread: clion
Boot args: -v

Mac OS version:
13F1077

Kernel version:
Darwin Kernel Version 13.4.0: Wed Mar 18 16:20:14 PDT 2015; root:xnu-2422.115.14~1/RELEASE_X86_64
Kernel UUID: 8B1A8FD1-2344-36C0-A7F5-D9D485A995FA
Kernel slide:     0x0000000000e00000
Kernel text base: 0xffffff8001000000
System model name: MacBookPro11,1 (Mac-189A3D4F975D5FFC)

Patrick_F_Intel1 · 06-11-2015

Hello Danilcha,

It looks like the PCM driver is panicking the kernel (the PCM driver is in the backtrace). In every case I've found where this happens, it is because PCM is trying to read an MSR that is not readable or to write a bit that is reserved. Unfortunately, Mac OS X doesn't provide 'safe read/write MSR' routines with exception handlers to catch GP faults, so the kernel crashes. Every other modern OS provides these safe rdmsr/wrmsr routines. So you have to ensure that PCM is not accessing any invalid MSR. Figuring out which MSRs are allowed on every platform is a daunting task.

As a side note, I have a hack which captures the invalid rdmsr/wrmsr, but I do not have permission to make it public. It works, but I don't know enough about it to know how safe or robust it is. I am also working on a script to list which MSRs are readable/writable on any platform, but even given all the information to which I have access, figuring out the MSR list is still a daunting task. Perhaps given this list we could finally provide signed Windows and Mac OS X driver binaries for PCM. The Intel security folks do not want to provide drivers which allow reading/writing arbitrary MSRs.
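To illustrate the "list of allowed MSRs" idea, here is a minimal sketch of an allowlist wrapper. It is not code from PCM or the PcmMsr driver: the class name MsrAllowlist, its methods, and the rawRead callback are all hypothetical, and the set of allowed addresses would have to come from a per-platform list like the one described above.

```cpp
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <utility>

// Hypothetical helper, for illustration only. It is NOT part of PCM or of
// the PcmMsr kernel extension.
class MsrAllowlist {
public:
    explicit MsrAllowlist(std::unordered_set<uint32_t> allowed)
        : allowed_(std::move(allowed)) {}

    bool isAllowed(uint32_t msr) const { return allowed_.count(msr) != 0; }

    // Invokes rawRead only for allowlisted MSRs. For anything else it
    // returns false without calling rawRead, so an unsupported MSR is
    // never touched and cannot raise a general protection fault.
    bool read(uint32_t msr, uint64_t &value,
              const std::function<uint64_t(uint32_t)> &rawRead) const {
        if (!isAllowed(msr)) {
            return false;
        }
        value = rawRead(msr);
        return true;
    }

private:
    std::unordered_set<uint32_t> allowed_;
};
```

The caller passes in whatever function performs the real read, so the same check could sit in user space or inside a driver. A rejected read just returns false and leaves value unchanged.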

If the above crash dump is indeed for reading/writing an invalid MSR then RCX shows which MSR you are accessing. In this case it is 0xc8f. This is the MSR IA32_PQR_ASSOC.
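The rdmsr and wrmsr instructions take the MSR address in ECX, which is why RCX in the register dump identifies the MSR. As a rough sketch, the value can be pulled out of a pasted panic report like this. The helper names parseRegister and knownMsrName are made up for this example, and the name table covers only the one MSR discussed in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <string>

// Hypothetical helper: extracts one register value (e.g. "RCX") from a line
// of the panic report's register dump. Returns false if it is not found.
bool parseRegister(const std::string &line, const std::string &name,
                   uint64_t &value) {
    std::size_t pos = line.find(name + ":");
    if (pos == std::string::npos) {
        return false;
    }
    pos += name.size() + 1;
    // Some registers in the dump are followed by two spaces ("R8:  0x...").
    while (pos < line.size() && line[pos] == ' ') {
        ++pos;
    }
    if (line.compare(pos, 2, "0x") != 0) {
        return false;
    }
    try {
        // Parsing stops at the first non-hex character (the comma).
        value = std::stoull(line.substr(pos + 2), nullptr, 16);
    } catch (const std::exception &) {
        return false;
    }
    return true;
}

// Deliberately tiny lookup: only the MSR from this thread is listed.
const char *knownMsrName(uint32_t msr) {
    return msr == 0xC8F ? "IA32_PQR_ASSOC" : "unknown";
}
```

Feeding it the RAX/RBX/RCX/RDX line from the report above with the name "RCX" yields 0xc8f.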

The MSR is accessed in 2 places in cpucounter.cpp:

1) PCM::initL3CacheOccupancyMonitoring()
2) PCM::freeRMID()

In 1), it looks like reads/writes to 0xc8f are protected by a check:

	if(!L3CacheOccupancyMetricAvailable())
	{
		return;
	}

This check is not present in 2). Can you try adding the code snippet above to freeRMID() and see if the crash goes away?
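To show the effect of the guard without reproducing the real function, here is a self-contained mock. MockPcm, writeMsr, and writtenMsrs are stand-ins invented for this sketch, and the single write to 0xC8F is only a placeholder for whatever the real freeRMID() in cpucounter.cpp does. Only the early-return check itself matches the snippet above.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for the PCM class, for illustration only. The real freeRMID()
// body is not reproduced here.
class MockPcm {
public:
    explicit MockPcm(bool l3OccupancySupported)
        : supported_(l3OccupancySupported) {}

    bool L3CacheOccupancyMetricAvailable() const { return supported_; }

    void freeRMID() {
        // The guard suggested above: bail out before any MSR access on
        // CPUs that do not support L3 cache occupancy monitoring.
        if (!L3CacheOccupancyMetricAvailable())
        {
            return;
        }
        writeMsr(0xC8F, 0); // placeholder for the real RMID cleanup
    }

    // Records which MSRs the mock would have written.
    const std::vector<uint32_t> &writtenMsrs() const { return written_; }

private:
    void writeMsr(uint32_t msr, uint64_t /*value*/) { written_.push_back(msr); }

    bool supported_;
    std::vector<uint32_t> written_;
};
```

With the guard in place, a MockPcm constructed with false records no MSR writes after freeRMID(), while one constructed with true records a single write to 0xC8F.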

Pat

Danilcha · 06-11-2015

Hey Patrick,

I added the code at the beginning of PCM::freeRMID() and it worked!

Thank you!