
Yoav_A_ · ‎07-08-2014

I'm trying to write an extension to KVM that stops execution after a fixed number of branch instructions (for example, 1000). I've programmed PERFEVTSEL0, set PMC0 (MSR 0xc1) to -1000, and written an ISR for the PMC interrupt. The hardware raises an interrupt, which causes a vmexit, but when I read the PMC0 register the value is greater than 0. Why is that? Are the performance counters not precise? Regards, Yoav

Patrick_F_Intel1 · ‎07-08-2014

Hello Yoav,

The counter will continue to increment until it is stopped. There are probably multiple branches taken in order to service the ISR. You might be able to reduce the extra counting by only counting ring 3 events (as opposed to ring 0 events) but this may not be what you want to do.

It seems like some chips may have a 'freeze counters on overflow' bit you can set (but it has been a while since I last read a 'how to program counters' document). Setting the count so low (1000) also runs the risk of triggering your ISR a lot.

Pat

Yoav_A_ · ‎07-08-2014

The interrupt is supposed to cause a vmexit and stop the counter, but it doesn't, and the counter has a positive value. Am I missing something?

Patrick_F_Intel1 · ‎07-08-2014

I don't know if you are missing anything. There isn't enough detail. Programming the counters is hard enough without adding the complexity of a virtual machine.

Pat

Yoav_A_ · ‎07-08-2014

What details are missing?

Patrick_F_Intel1 · ‎07-08-2014

How many branches are in the ISR/vmexit code path before you get to whatever is supposed to stop the counter, and what sort of 'greater than 0' counts are you getting?

Usually, if no virtual machines are involved, you have to explicitly stop the counter. I don't know anything about what happens when VMs are involved.

Yoav_A_ · ‎07-09-2014

There are a few branches, but the vmexit is supposed to switch off the performance counters (HOST_IA32_PERF_GLOBAL_CTRL is 0). After the vmexit I read the PMC0 counter and the value is sometimes more than 0, which means that I overshot by a couple of branches.
I'll try to rephrase. Are the performance counters for branches accurate? Can the IRQ be raised not at the time of the counter overflow but a few branches later?
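
For reference, the arithmetic behind "more than 0" can be sketched in C. This is only a model under stated assumptions: a 48-bit counter (the real width is reported by CPUID leaf 0AH) and hypothetical helper names `pmc_mask`, `pmc_preload` and `pmc_after`. It touches no MSRs. Loading the counter with -N makes it wrap to 0 on the Nth event, so whatever is read afterwards is the number of events counted after the overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed counter width; the real value comes from CPUID.0AH:EAX[23:16]. */
#define PMC_WIDTH 48u

/* Mask covering the implemented counter bits. */
uint64_t pmc_mask(unsigned width)
{
    assert(width >= 1 && width <= 64);
    return width == 64 ? UINT64_MAX : (UINT64_C(1) << width) - 1;
}

/* Value to load so that the counter wraps to 0 after `events` increments
 * (this is what "-1000" means for a counter narrower than 64 bits). */
uint64_t pmc_preload(uint64_t events, unsigned width)
{
    return (UINT64_C(0) - events) & pmc_mask(width);
}

/* Model of the counter value after `n` further increments. */
uint64_t pmc_after(uint64_t start, uint64_t n, unsigned width)
{
    return (start + n) & pmc_mask(width);
}
```

With these helpers, `pmc_after(pmc_preload(1000, PMC_WIDTH), 1007, PMC_WIDTH)` models a run in which 7 extra branches retired before the counter was stopped, and it evaluates to 7.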

Patrick_F_Intel1 · ‎07-09-2014

Yes, the branch event is accurate (as far as I know). But probably the vmexit has branches and interrupts are branches. The vmexit seems to be a complex (more than one instruction) procedure. I'm guessing that the vmexit process has conditional branches in it.

Obviously I don't know much about the vmexit. Do you know if this guess is correct for the vmexit?

Yoav_A_ · ‎07-09-2014

That is not the case; I don't think that's the problem. Let's say I want to implement a branch stopper, e.g. stop after 1000 branches and gather statistics. How would I implement this without virtualization?

0. set MSR_CORE_PERF_GLOBAL_OVF_CTRL = 1
1. set PERF_GLOBAL_CTRL = 1
2. set MSR_P6_PERFCTR0 (msr 0xc1) = -1000
3. set IA32_PERFEVTSEL0:
3.1. evt_sel = 0xc4
3.2. umask = 0x00
3.3. usr = 1
3.4. int = 1
3.5. en = 1

Am I missing anything?
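
The step 3 fields can be packed into a single IA32_PERFEVTSELx value roughly as follows. This is a sketch, not a driver: the bit positions follow the architectural layout (USR bit 16, OS bit 17, INT bit 20, EN bit 22), the macro and function names are made up for illustration, and the actual WRMSR is left out.

```c
#include <assert.h>
#include <stdint.h>

/* IA32_PERFEVTSELx flag bits (architectural layout). */
#define EVTSEL_USR (UINT64_C(1) << 16)  /* count in ring 3 */
#define EVTSEL_OS  (UINT64_C(1) << 17)  /* count in ring 0 */
#define EVTSEL_INT (UINT64_C(1) << 20)  /* raise a PMI on overflow */
#define EVTSEL_EN  (UINT64_C(1) << 22)  /* enable the counter */

/* Hypothetical helper: event select in bits 7:0, unit mask in bits 15:8,
 * control flags above that. */
uint64_t evtsel_encode(uint8_t event, uint8_t umask, uint64_t flags)
{
    /* Flags must not overlap the event/umask fields. */
    assert((flags & UINT64_C(0xFFFF)) == 0);
    return (uint64_t)event | ((uint64_t)umask << 8) | flags;
}
```

For the settings listed above (event 0xc4, umask 0x00, usr, int, en), `evtsel_encode(0xC4, 0x00, EVTSEL_USR | EVTSEL_INT | EVTSEL_EN)` yields 0x5100C4. Leaving out `EVTSEL_OS` is what restricts counting to ring 3.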

Patrick_F_Intel1 · ‎07-09-2014

Looks good. It should work. If you have a test 'looping' kernel that does 1000 loops then your ISR should get invoked. Any interrupts that happen during the test will increment the count, and the long jump to the ISR will be another increment.

Patrick_F_Intel1 · ‎07-09-2014

What chip are you using? On Sandy Bridge, I don't see a 0xc4 event that uses a umask=0.

Yoav_A_ · ‎07-09-2014

I'm using a Haswell chip, but I still get a value larger than 0 when reading PMC0. Do you have a code sample that uses the performance counter interrupt feature?

Patrick_F_Intel1 · ‎07-09-2014

No, I don't know of a sample ISR driver. You might also collect BR_INST_RETIRED.FAR_BRANCH (0xc4, umask=0x40) which will count interrupts (IIRC) and see if BR_INST_RETIRED.FAR_BRANCH is equal to the over count.

Yoav_A_ · ‎07-09-2014

It doesn't account for the extra counts. Could it be a HW bug?

McCalpinJohn · ‎07-09-2014

(1) Using Umask 0x00 with Event 0xC4 seems to be inviting trouble for two reasons: First, it is an "architectural" event and "architectural" events are often less tightly specified than the machine-specific events. Second, it counts all branch instructions, which may include control transfers that you are not thinking of counting. In particular, it may count control transfers that are necessary to get to the code that stops the counter from continuing to count.

(2) You never said how many "extra" counts you are seeing. Is it 1? 10? 100? Performance counters are sometimes exact, but Chapter 19 of Volume 3 of the SW developer's guide starts with the warning:

The counter values reported by the performance-monitoring events are approximate and believed to be useful as relative guides for tuning software.

(3) Using inline RDPMC instructions I have seen that the related event BR_INST_EXEC.TAKEN_CONDITIONAL (Event 0x88, Umask 0x81) is exact on my Xeon E5-2680 (Sandy Bridge) systems -- see the comments in another forum thread at https://software.intel.com/en-us/forums/topic/405642#comment-1746797

If you want to know if the counter is correct, the only way to avoid extraneous code is to put inline RDPMC instructions right where you want them. Even that is not guaranteed in all cases because of ordering issues, but the cores tend to execute in FIFO order so it is usually correct. There are sneaky tricks for enforcing ordering of RDPMC instructions using false dependencies, but you have to avoid the register-zeroing idioms that the hardware recognizes. That is too long a topic for today.
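
A minimal sketch of what an inline RDPMC read might look like with GCC/Clang inline assembly is shown below. The function names `read_pmc` and `pmc_delta` are made up for illustration, and the 48-bit width used in the usage note is an assumption. Note that RDPMC faults in user mode unless the OS has set CR4.PCE, so `read_pmc` is only defined here, not called, and no ordering tricks are attempted.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
/* Read general-purpose counter `counter` (0 selects PMC0).
 * RDPMC takes the counter index in ECX and returns the value in EDX:EAX. */
static inline uint64_t read_pmc(uint32_t counter)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}
#endif

/* Events counted between two reads, tolerating one wrap of a counter
 * that is `width` bits wide. */
uint64_t pmc_delta(uint64_t before, uint64_t after, unsigned width)
{
    assert(width >= 1 && width <= 64);
    uint64_t mask = width == 64 ? UINT64_MAX : (UINT64_C(1) << width) - 1;
    return (after - before) & mask;
}
```

The intended use is to bracket the code of interest with two `read_pmc(0)` calls and pass both values to `pmc_delta(before, after, 48)`, so that nothing but the measured code sits between the reads.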

Yoav_A_ · ‎07-09-2014

Hi John,
I've changed the event to ROB_MISC_EVENT_LBR_INSERTS, which is machine-specific for Haswell, and activated the LBR to select the specific branches that I want, but I'm still getting extra events (up to 10 more).

McCalpinJohn · ‎07-09-2014

Ten extra branches seems like a lot, but in a virtualized environment the number of extra layers of software could include this many.

It would be interesting to compare this against the counts in a non-virtualized environment. I don't know if any tools enable this to be done directly, but it should be relatively easy to hack the kernel code that processes the performance monitor interrupt (__perf_event_overflow in kernel/events/core.c, if I am reading the code correctly) to get it to read the current value of the counters before it does anything else. For a one-time test you could just add a kernel debug print of the values obtained to see whether the counter has incremented above zero in this (presumably much shorter) code path.

Of course one would really prefer some form of user-mode interrupt support to avoid the kernel crossing entirely (since all the kernel is doing is sending the data back to the user run-time for processing), but that is a much larger topic.

Emery_A_ · ‎05-20-2016

Hi Dr. John McCalpin,

I configure the architectural performance monitoring counter INTEL_MSR_PERFMON_CRT1.

0. set MSR_CORE_PERF_GLOBAL_OVF_CTRL = 1
1. set PERF_GLOBAL_CTRL = 1
2. set INTEL_MSR_PERFMON_CRT1 = -20000
3. set IA32_PERFEVTSEL1:
3.1. evt_sel = 0xc0
3.2. umask = 0x00
3.3. usr = 1
3.4. int = 1
3.5. en = 1

My program is running in kernel mode but I am counting user program instructions.

I set the NMI vector in the local APIC to trigger an exception when the counter overflows.

The processor does not take the exception exactly when the overflow occurs. Each run shows a variable excess of 10 to 50 instructions.

How can I configure the LAPIC NMI so that the NMI is triggered sooner (at the moment the overflow occurs), without any delay?

Thank you for your attention

EKA.

McCalpinJohn · ‎05-23-2016

I don't use interrupt-based performance monitoring very often, and have never written this sort of code myself, so this is beyond my expertise.

Some processors allow "freezing" the performance counters on PMIs, which should eliminate the extra counts you are seeing. This is discussed in Section 17.4.7 of Volume 3 of the SW Developer's Manual (document 325384-058).
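
As a rough illustration, on processors that support this feature the freeze is requested through the FREEZE_PERFMON_ON_PMI bit (bit 12) of IA32_DEBUGCTL (MSR 0x1D9). The sketch below only computes the value for a read-modify-write; the macro and function names are made up for illustration, no MSR is actually accessed, and whether a given CPU or hypervisor honors the bit has to be checked separately.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_IA32_DEBUGCTL              0x1D9u
#define DEBUGCTL_FREEZE_PERFMON_ON_PMI (UINT64_C(1) << 12)

/* Given the current IA32_DEBUGCTL value, return the value to write back
 * with freeze-on-PMI requested and all other bits preserved. */
uint64_t debugctl_with_freeze(uint64_t current)
{
    uint64_t updated = current | DEBUGCTL_FREEZE_PERFMON_ON_PMI;
    /* Only bit 12 may differ from the value that was read. */
    assert((updated & ~DEBUGCTL_FREEZE_PERFMON_ON_PMI) ==
           (current & ~DEBUGCTL_FREEZE_PERFMON_ON_PMI));
    return updated;
}
```

In kernel code this value would be written back with whatever MSR accessor the environment provides, before the counter is armed.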

I don't know if it is possible to make the interrupt happen "sooner" (which is a difficult concept in an out-of-order processor), but the PEBS subsystem was designed to ensure that the processor state is captured at the time of sampled events. This is discussed in Chapters 17, 18, and 19 of Volume 3 of the Software Developer's Manual.

Linux kernels have included PEBS support for a while. It is not easy to understand, but combining the documentation in Volume 3 with the examples in the various Linux kernels should provide some insight into how to make PEBS work.

Several recent processor generations have support for a version of the INST_RETIRED event that has hardware support to reduce the "PEBS shadow in IP distribution" (Event 0xC0, Umask 0x01), but this feature only appears to be available for the retired instructions event, and not for the branch events that you are interested in.

I don't know if any of these are directly applicable to your problem, but at least they seem to be related topics!

performance counters interrupt and virtualization