I've been writing a Linux kernel driver, and I've been trying various forms of performance measurement. In particular, I've been having trouble with branch counting.
To sum it up: disable interrupts, clear and enable PMC0, execute a single nop, disable and read PMC0, re-enable interrupts. The resulting branch count varies!
Since there are no branches between disabling and re-enabling interrupts, one would expect the output of this to simply be 0. However, I've seen the output on E5-2650's to be between 1 and 19 (often). On i7-6700's I've seen the output range from 118 to 119 (fairly consistently). And on an E5-1650, I've seen no errors at all.
My first thought was that the logical processors sharing a common physical core could be the culprit. So I added a global spinlock to prevent more than a single core from executing the critical code (InterruptTest). The result was the same.
My next thought was NMI's (specifically performance monitoring interrupts), since those do seem to be occurring regularly (the count climbs slowly in /proc/interrupts). If this is the case, does that mean I cannot ever expect determinism from PMC with regard to branch counting?
Finally, I'm quite confused as to why the i7 and E5 are producing drastically different numbers on the same kernel version. Shouldn't I expect similar results?
I've included the relevant code below, but if you wish to test for yourself, a driver demonstrating the issue can be found at https://github.com/gourryinverse/interruptor
Any help would be greatly appreciated. Thank you.
    local_irq_save(flags);
    __asm__ __volatile__ (
        // Disable IA32_PMC0
        "xorl %%eax, %%eax\n\t"
        "xorl %%edx, %%edx\n\t"
        "movl %[perfevtsel0], %%ecx\n\t"
        "wrmsr\n\t"
        // Clear IA32_PMC0
        "xorl %%eax, %%eax\n\t"
        "xorl %%edx, %%edx\n\t"
        "movl %[pmc0], %%ecx\n\t"
        "wrmsr\n\t"
        // Enable IA32_PMC0 with branches retired
        "movl %[perfevtsel0], %%ecx\n\t"
        "movl %[branchCounterEnable], %%eax\n\t"
        "wrmsr\n\t"
        // Spin for a little bit
        "nop\n\t"
        // Disable IA32_PMC0
        "xorl %%eax, %%eax\n\t"
        "wrmsr\n\t"
        // Read IA32_PMC0
        "xorl %%eax, %%eax\n\t"
        "xorl %%edx, %%edx\n\t"
        "movl %[pmc0], %%ecx\n\t"
        "rdmsr\n\t"
        // Write the branch count
        "movl %%eax, %[branches]\n\t"
        // Clear IA32_PMC0
        "xorl %%eax, %%eax\n\t"
        "xorl %%edx, %%edx\n\t"
        "wrmsr\n\t"
        : [branches] "=rm" (branches)
        : [perfevtsel0] "i" (IA32_PERFEVTSEL0)
        , [pmc0] "i" (IA32_PMC0)
        , [branchCounterEnable] "i" (PERFEVTSEL_BRANCH_INSTRUCTION_RETIRED
                                     | PERFEVTSEL_ENABLE
                                     | PERFEVTSEL_INTERRUPT
                                     | PERFEVTSEL_OS
                                     | PERFEVTSEL_USER )
        : "cc", "eax", "ecx", "edx"
    );
    // Re-enable interrupts
    local_irq_restore(flags);
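For readers without the driver source at hand, here is a minimal sketch of how a value like branchCounterEnable can be composed from the architectural IA32_PERFEVTSELx layout. The EVTSEL_* macro names and the branch_evtsel() helper are illustrative and not taken from the driver; the sketch assumes the driver's PERFEVTSEL_* macros map onto the same architectural bit positions.

```c
#include <stdint.h>

/* Architectural IA32_PERFEVTSELx fields (illustrative names). */
#define EVTSEL_EVENT(e)  ((uint64_t)(e) & 0xffu)          /* bits 7:0  */
#define EVTSEL_UMASK(u)  (((uint64_t)(u) & 0xffu) << 8)   /* bits 15:8 */
#define EVTSEL_USR       (1ULL << 16)  /* count at CPL > 0 */
#define EVTSEL_OS        (1ULL << 17)  /* count at CPL 0 */
#define EVTSEL_INT       (1ULL << 20)  /* raise a PMI on counter overflow */
#define EVTSEL_EN        (1ULL << 22)  /* enable the counter */

/* Architectural "Branch Instruction Retired" event: event 0xC4, umask 0x00. */
#define EVT_BR_INST_RETIRED   0xC4
#define UMASK_BR_INST_RETIRED 0x00

/* Hypothetical helper: build the event-select value for retired branches. */
static inline uint64_t branch_evtsel(int count_user, int raise_pmi)
{
    uint64_t v = EVTSEL_EVENT(EVT_BR_INST_RETIRED)
               | EVTSEL_UMASK(UMASK_BR_INST_RETIRED)
               | EVTSEL_OS
               | EVTSEL_EN;
    if (count_user)
        v |= EVTSEL_USR;
    if (raise_pmi)
        v |= EVTSEL_INT;
    return v;
}
```

With both flags set this matches the combination used in the listing above (event, ENABLE, INTERRUPT, OS, USER). Leaving raise_pmi at 0 avoids requesting an interrupt on overflow, which may be preferable when the goal is a quiet measurement window.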
I don't know what is going wrong with your measurements, but I have had reasonably good luck counting branches in user mode using the RDPMC instruction.
NMI's should only occur if you have either the NMI watchdog timer enabled or if you have otherwise enabled a programmable or fixed-function performance counter with the "interrupt on overflow" bit set.
There is no need to clear the counters -- taking differences works fine. There is no need to disable the counters before reading. The RDPMC instruction is not ordered with respect to surrounding instructions (nor are the RDMSR/WRMSR instructions), so it is possible to get surprising results when you are trying to measure intervals that are within the out-of-order capabilities of the processor.
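The difference-taking approach can be sketched as follows. This is only an illustration: pmc_delta() is a hypothetical helper, and the 48-bit width in the usage note is an assumption (the actual general-purpose counter width is reported by CPUID leaf 0AH).

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of events between two raw counter reads,
 * tolerating a single wraparound of a counter that is width_bits wide.
 */
static inline uint64_t pmc_delta(uint64_t start, uint64_t end,
                                 unsigned width_bits)
{
    uint64_t mask = (width_bits >= 64) ? ~0ULL
                                       : ((1ULL << width_bits) - 1);
    /* Unsigned subtraction wraps modulo 2^64; the mask reduces it to the counter width. */
    return (end - start) & mask;
}
```

For example, pmc_delta(before, after, 48) gives the event count for a 48-bit counter without ever writing to it, so a counter that some other agent is also using is left undisturbed.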
The clearing of counters was simply to limit the number of potential issues. If we know it's 0 to start, we should expect 0 at the end, simplifying the logic as much as possible.
I was under the impression that RD/WRMSR instructions were serializing, but we also forced a flush with cpuid/rdtscp prior to RD/WRMSR. The results were the same on each of the processors.
Next we figured the interval may be too small, so we expanded the spin section to 1,000,000 NOPS, which also produced the same results.
I can confirm that NMI's are occurring, and that they are indeed Performance Monitoring Interrupts. I can watch the count tick up in real time in /proc/interrupts. I believe there is, indeed, a watchdog enabled by default on Ubuntu, so I should check that.
Our current hypothesis is that it may be SMM/SMI's that are affecting the branch count, so I'm looking for a way to test that at the moment. Those are non-maskable, and completely transparent to the OS, so finding a way to fire SMI's during execution of the driver might provide some clues as to whether they are an issue.
You are right -- I had forgotten that WRMSR was a serializing instruction that (as a byproduct) serializes memory references. Section 8.3 of Volume 3 does not list RDMSR as a serializing instruction. CPUID is a pig -- I seem to recall overheads in the 250 cycle range. RDTSCP is only partially ordered.
You can count SMIs using MSR 0x34. I have seen SMI rates in the range of zero per day (after boot) to slightly over 8 per second on various systems that I have checked. My systems have the "freeze performance counters on SMI" bit set, but it is always hard to tell if there might be small numbers of increments around the edges of the transfer to/from SMM mode. I am not aware of any direct way to count SMM cycles, but the "reference cycles not halted" counters should stop incrementing in SMM mode, making SMM time look like halted time. Halted time also comes from frequency transitions, so those need to be accounted for as well -- you definitely want to pin the core frequencies to a fixed value before the test. I don't know if the APERF and MPERF MSRs are halted in SMM mode -- the documentation does not suggest that they are. If they continue to count in SMM mode, then the difference between "reference cycles not halted" and MPERF cycles may be an indication of SMM cycles. They should both quit counting during frequency transitions.
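A minimal user-space sketch of reading the SMI counter at MSR 0x34, assuming the Linux msr module is loaded and the caller has permission to open /dev/cpu/N/msr. The helper names read_msr_at() and read_smi_count() are made up for this example.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_SMI_COUNT 0x34

/*
 * Hypothetical helper: read one 64-bit MSR through an msr-style device file,
 * where the file offset selects the MSR address.
 * Returns 0 on success, -1 on failure.
 */
static int read_msr_at(const char *path, uint32_t reg, uint64_t *val)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = pread(fd, val, sizeof(*val), (off_t)reg);
    close(fd);
    return n == (ssize_t)sizeof(*val) ? 0 : -1;
}

/* Hypothetical helper: SMI count for one logical CPU. */
static int read_smi_count(int cpu, uint64_t *count)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    return read_msr_at(path, MSR_SMI_COUNT, count);
}
```

Sampling the count before and after a measurement and discarding any run where it changed is one way to check whether SMIs overlap the measured interval.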
I found FREEZE_WHILE_SMM_EN (bit 14) and enabled it. This reduced the i7 branch count from 118/119 to 2, consistently. It has no effect on the E5-2650 (consistent 1's). Am I missing another Freeze on SMI bit?
I'm not too worried about CPUID hogging cycles; the biggest concern for me here is that I'd like a super accurate branch count. At this point I'm convinced I'm either missing options, I'm misunderstanding interrupt disabling with regard to PMC, or it's just not reliable due to NMI's or something.
Also, fun note, you can generate an SMI by pressing the power button, and on Ubuntu it just pops up the power menu, so you can press it lots of times fast. Not great, but it's something.
Also with regard to PMI's specifically - the spikers provided should force the machine to spike to 100% on all processors and actively manage cooling (such as fan speed), which is guaranteed to generate PMI's. Given that we see this most often during heavy usage, I'm becoming increasingly convinced this is heavily PMI/NMI related.
The strange thing is enabling FREEZE_PERFMON_ON_PMI (bit 12) actually increases the branch count to between 33 and 36 on my i7-6700. Seems counter-intuitive to me.
For the i7 - confirmed it was the nmi_watchdog causing our issue. In fact, the documentation says the nmi_watchdog utilizes the first perf register. So if we're interrupted by a PMI, it stomps on our register and that produces unexpected results!
Still getting mystery results with the E5 though, which is interesting. Will update as I discover more.
Thank you for your help and hints, it got me somewhere! If you have anything else, any help is welcome!
So, ultimately the problem lay in not hitting all the right controls. The appropriate way to use these was the following:
1. save and disable IA32_PERF_GLOBAL_CTRL settings
2. save, disable IA32_PERFEVTSEL0
3. save, clear IA32_PMC0
4. save, disable IA32_PEBS_ENABLE
5. save, add "freeze_while_smm" bit (14) to the IA32_DEBUGCTL MSR
6. clear IA32_PERF_GLOBAL_STATUS overflow bits via IA32_PERF_GLOBAL_OVF_CTRL
7. enable IA32_PERFEVTSEL0 with Retired Branches, OS, and Enable
8. enable PMC0 bit in IA32_PERF_GLOBAL_CTRL
9. do stuff
10. disable all IA32_PERF_GLOBAL_CTRL
11. read IA32_PMC0
12. restore original settings
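The sequence described in this post can be sketched roughly as below. This is not the driver's actual code: struct msr_ops, count_branches(), and the fake_* in-memory backend are hypothetical names introduced for illustration, so the ordering can be exercised in user space. In a real driver the accessors would presumably wrap the kernel's rdmsrl()/wrmsrl() and run with interrupts disabled. MSR addresses are the architectural ones from the Intel SDM.

```c
#include <stdint.h>

#define IA32_PMC0                 0x0C1
#define IA32_PERFEVTSEL0          0x186
#define IA32_DEBUGCTL             0x1D9
#define IA32_PERF_GLOBAL_STATUS   0x38E
#define IA32_PERF_GLOBAL_CTRL     0x38F
#define IA32_PERF_GLOBAL_OVF_CTRL 0x390
#define IA32_PEBS_ENABLE          0x3F1

#define DEBUGCTL_FREEZE_WHILE_SMM (1ULL << 14)
/* Retired branches (event 0xC4, umask 0), OS (bit 17), Enable (bit 22). */
#define EVTSEL_BRANCHES_OS_EN     (0xC4ULL | (1ULL << 17) | (1ULL << 22))
#define GLOBAL_CTRL_PMC0          (1ULL << 0)

/* Hypothetical accessor pair; a driver would back these with rdmsrl()/wrmsrl(). */
struct msr_ops {
    uint64_t (*read)(uint32_t reg);
    void (*write)(uint32_t reg, uint64_t val);
};

static uint64_t count_branches(const struct msr_ops *m,
                               void (*work)(void *), void *arg)
{
    /* Save and disable everything we are about to disturb. */
    uint64_t global = m->read(IA32_PERF_GLOBAL_CTRL);
    m->write(IA32_PERF_GLOBAL_CTRL, 0);
    uint64_t evtsel = m->read(IA32_PERFEVTSEL0);
    m->write(IA32_PERFEVTSEL0, 0);
    uint64_t pmc0 = m->read(IA32_PMC0);
    m->write(IA32_PMC0, 0);
    uint64_t pebs = m->read(IA32_PEBS_ENABLE);
    m->write(IA32_PEBS_ENABLE, 0);
    uint64_t debugctl = m->read(IA32_DEBUGCTL);
    m->write(IA32_DEBUGCTL, debugctl | DEBUGCTL_FREEZE_WHILE_SMM);

    /* Clear any pending overflow bits. */
    m->write(IA32_PERF_GLOBAL_OVF_CTRL, m->read(IA32_PERF_GLOBAL_STATUS));

    /* Program the event, then open the global gate last. */
    m->write(IA32_PERFEVTSEL0, EVTSEL_BRANCHES_OS_EN);
    m->write(IA32_PERF_GLOBAL_CTRL, GLOBAL_CTRL_PMC0);

    work(arg);                              /* "do stuff" */

    m->write(IA32_PERF_GLOBAL_CTRL, 0);     /* stop all counters */
    uint64_t branches = m->read(IA32_PMC0);

    /*
     * Restore the original settings, global control last.
     * On real hardware a legacy write to IA32_PMC0 may not restore
     * the full counter width exactly.
     */
    m->write(IA32_DEBUGCTL, debugctl);
    m->write(IA32_PEBS_ENABLE, pebs);
    m->write(IA32_PMC0, pmc0);
    m->write(IA32_PERFEVTSEL0, evtsel);
    m->write(IA32_PERF_GLOBAL_CTRL, global);
    return branches;
}

/* In-memory stand-in for the MSR space, for user-space experiments only. */
static uint64_t fake_msrs[0x400];
static uint64_t fake_read(uint32_t reg) { return fake_msrs[reg]; }
static void fake_write(uint32_t reg, uint64_t val) { fake_msrs[reg] = val; }
static const struct msr_ops fake_ops = { fake_read, fake_write };

/* Simulated workload: bumps PMC0 by *arg only while the counter is enabled. */
static void fake_work(void *arg)
{
    if ((fake_msrs[IA32_PERF_GLOBAL_CTRL] & GLOBAL_CTRL_PMC0) &&
        (fake_msrs[IA32_PERFEVTSEL0] & (1ULL << 22)))
        fake_msrs[IA32_PMC0] += *(uint64_t *)arg;
}
```

Saving and restoring every register matters because something else, such as the NMI watchdog mentioned earlier in the thread, may already be using PMC0. The fake backend only models the ordering; it says nothing about what real hardware counts.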