John,

Jianhua_H_ · ‎05-27-2014

Hi All,

I've been working on a project to use performance monitor counters to profile cache misses. Our CPU is Xeon E5-2687w (Sandy Bridge) with Hyper-Threading turned off, running on a Linux machine. I used RDPMC instruction to read the counter values at user space. when CPUID.EAX = 0x1, EAX returns 0x206d7, so the counters I am interested are in Table 19-7 in the Developer's manual Vol.3

c1: MEM_UOP_RETIRED.ALL_LOADS D0H/81H

c2: MEM_LOAD_UOPS_RETIRED.L1_HIT D1H/01H

c3: MEM_LOAD_UOPS_RETIRED.L2_HIT D1H/02H

c4: MEM_LOAD_UOPS_RETIRED.HIT_LFB D1H/40H

I was hoping I could get L1 data cache misses from c1 - c2, and first two level cache misses c1 - c2 - c3 - c4. I tested the readings with a simple program, reading an array of int64_t data[1024], and get c1 = 1024 after removing overhead, and c3/c4 almost always 0 and c2 + c3 + c4 = 1024 = c1, as everything should fit into L1 cache.

However, when I used these counters to profile our production code, I sometimes got c2 + c3 + c4 > c1 and even c2 > c1 occasionally e.g.

AllLoadUops: 398, L1DataCacheHit: 1143, LineFillBufferHit: 10, L2CacheDataHit: 14, InstructionRetired 1319

AllLoadUops: 295, L1DataCacheHit: 283, LineFillBufferHit: 34, L2CacheDataHit: 33, InstructionRetired 792

I have ruled out the possibility of counter overflows. I only read the lower 32 bits of the counter from EAX, and the two adjacent readings of the counters in the above examples are not overflowed. (overflow once would be still OK for me)

This counter mismatch happened with and without CPUID guarding the counter reading code. I don't think out-of-order execution can cause 700 uops mismatch either?

The only reason I can think of is context switch. All code were running on a particular core set by cpu affinity with about 1.5 switch/sec, The chances seemed too high when I had code running for 10 mins period, reading counters about 40000 times, this "abnormality" showed up 5 times.

Can anyone point out to me some possible explanations for this? Do I understand what the event counters measure correctly? I want to know the reason for this, otherwise I don't know how much I should trust the rest of the profiling. Thanks a lot

Patrick_F_Intel1 · ‎05-28-2014

Hello Jianhua,

You can also get interrupts on the cpu. The interrupts can trash your cache.

The HW_INTERRUPTS event can be used to count the interrupts per cpu and you can use /proc/interrupts (on Linux) to see which interrupts are hitting your test.

Pat

Bernard · ‎05-28-2014

If you are on Windows you can set affinity mask per core to pin thread to specific core and also increase priority of your code to Real Time thus minimizing possibility of your code being swapped out.As Pat mentioned the ISRs could pose the problem , but on Windows platform their are executed on CPU0.

Jianhua_H_ · ‎05-28-2014

Thank you Pat, and iliyapolak. I've looked at the interrupts. we have about ~500/sec Local timer interrupts. I think it might be frequent enough to do something harmful and definitely worth looking into anyway. The next question then is how large is the interrupt handler body? is it possible to introduce hundreds of instructions? How do I tune down this particular interrupt in Linux to its effect on the counters, I am doing some research on this, but any help is appreciated.

I also tried to only count user space instructions (I had been counting both OS and user space instructions) in the hope of getting more consistent number, as I assume both interrupts and context switch happen in kernel space. The numbers does look better, one mismatch I get in this case is 30, and others are less than 10. Could these numbers be explained by OOO executions? Thanks.

Jianhua

Patrick_F_Intel1 · ‎05-28-2014

Are the timer interrupts on just 1 particular cpu? If so, you could pin your code to a different cpu (possibly).

It is hard to say how many instructions per interrupt and/or what the interrupt actually does. I believe Linux combines (coalesces) interrupts so as to reduce the number of interrupts.

When I was trying to do something similar to what you are trying to do, I made sure I set the affinity to a particular cpu and then raised the priority of the task (to reduce context switching) but I could never eliminate OS interrupts completely. The best I could do was use the HW_INTERRUPTS event to keep track of whether a data collection was hit with an interrupt. Also I think I stored off the rdpmc values in an array so as not to disturb the data collection. And you need to look at the disassembly to see exactly what code is getting executed.

Also you can use the code below to keep track of context switches around your loop:

struct rusage usage1, usage2;

getrusage(RUSAGE_SELF, &usage1);

/* do work here */

getrusage(RUSAGE_SELF, &usage2);
printf("vol csw= %d, invol csw= %d\n", usage2.ru_nvcsw - usage1.ru_nvcsw, usage2.ru_nivcsw - usage1.ru_nivcsw);

McCalpinJohn · ‎05-28-2014

There are a lot of problems with those particular counters... :-(

You appear to be avoiding errata 241 (since you are not looking at LLC counts) and errata 243 (since you are not running with SMT enabled).

There is still a poorly documented "feature" that AVX loads will never increment the L2 hit counter (D1/02) or the LLC hit counter (D1/04). According to Section B.3.4.1 of the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966-029):

On 32-byte Intel AVX loads, all loads that miss in the L1 DCache show up as hits in the L1 DCache or hits in the LFB. They never show hits on any other level of memory hierarchy. Most loads arise from the line fill buffer (LFB) when Intel AVX loads miss in the L1 DCache.

In my experience, seeing inconsistent results 5 times out of 40000 measurements is remarkably good --- especially given the extremely small counts in your examples. Are you sure you need to be sampling the code at such fine granularity?

The other thing I would try is running with the hardware prefetchers disabled. This is sometimes available as a BIOS option.

Bernard · ‎05-28-2014

>>>The next question then is how large is the interrupt handler body? is it possible to introduce hundreds of instructions? How do I tune down this particular interrupt in Linux to its effect on the counters, I am doing some research on this, but any help is appreciated.>>>

If you are on Windows then you can issue kernel debugger command !idt -a and walk IDT table.Disassembling Interrupt handlers will give you an exact answer about the code being executed.I do not remember exactly the number of instructions executed in case of timer interrupt.What I do remember is that execution went further to *PIC handler.

Regarding Linux please study this source code

linux/arch/i386/kernel/irq.c

You may also want to read this article related to IDT on Linux.

http://phrack.org/issues/59/4.html

*edited: should have been written PIC handler.

Bernard · ‎05-28-2014

>>> made sure I set the affinity to a particular cpu and then raised the priority of the task (to reduce context switching) but I could never eliminate OS interrupts completely.>>>

Probably because DIRQL have higher priority than user mode and even kernel mode code at least on Windows.For example Windows scheduler and dispatcher which are responsible for context switching are executiong at IRQL level 0x2 and hardware interrupt are executing at higher level thus masking lower level interrupts.Probably the same scenario is on Linux.

Jianhua_H_ · ‎05-29-2014

@Pat, you are correct, it was the hardware interrupts, mostly from APIC timers, that screw up the counters. Now I also count the HW_INTERRRUPTS, and excluding their effects, values seem to add up nicely. but I wasn't sure what Event Num/Umask I should use for Sandy Bridge. I couldn't find it listed on the Developer's manual Vol.3. I found on the Internet Event Num = 0xcb and Umask = 0x01. They seemed to work. at least my results made sense. Where I can verify this setting on Intel's document?

Also it seems APIC timer interrupts can be reduced/disabled? Linux uses it mainly for process monitoring purposes. But thank you for pointing to me this right direction of research.

@John, we are using AVX. certainly will keep in mind this nuance. Thanks.

Bernard · ‎05-29-2014

Try to read this document http://download.intel.com/design/chipsets/datashts/29056601.pdf

Bernard · ‎05-30-2014

>>>Linux uses it mainly for process monitoring purposes>>>

They are used probably as a waitable timers.

Patrick_F_Intel1 · ‎05-30-2014

Hello Jianhua,

Yes, the HW_INTERRUPTS event I used looks like the same as what you used. Event 0xcb, umask 0x01.

Pat

perfwise · ‎06-01-2014

On a similar note, I have never seen on Haswell a non-zero count for PMC 0xd1 with unit mask 0x40. This should count LFB hits for retired loads but never does. Any comments all.. but I'm quite certain this PMC is broken.

Perfwise

McCalpinJohn · ‎06-02-2014

Note that Haswell can perform 2 32-Byte loads per cycle, so if you are loading from a single contiguous range the two halves of the cache line will be loaded simultaneously. The "load hit LFB" may not be broken so much as not relevant?

A way to test this would be to have the AVX loads accessing multiple arrays in an interleaved fashion. Two arrays might not be enough -- the hardware might reorder the loads to execute two loads from the same array in each cycle. (Sandy Bridge certainly does this with 32-bit loads to avoid bank conflicts.) It is possible that Haswell could reorder for more than two input streams, but this gets more complex as the number of streams increases, so there seems to be a fair chance that the execution order will be interleaved at some point.

I really need to set up my Haswell MacBook Pro to dual boot so that I can get a Linux kernel running that will let me play with the performance counters and miscellaneous configuration bits....

perfwise · ‎06-07-2014

John,

I've never observed any counts that are non-zero for the PMC I denoted on HW, 0xd1 unit mask 0x40, on any SPEC workloads or on my contiguous or random access to memory or on strided accesses. I presume it's broken based upon these observations and other tea leaves you can read from the other PMCs and the requests to the L2.

perfwise

McCalpinJohn · ‎06-07-2014

It would not be shocking to see another broken performance counter event. If I had a nickel for every broken performance counter event I have run across in the last 20 years....

It is curious that the failure of 32-Byte AVX loads to increment the MEM_LOAD_UOPS_RETIRED.L2_HIT or MEM_LOAD_UOPS_RETIRED.LLC_HIT events on Sandy Bridge is not documented either as a "feature" in the event descriptions in Section 19.4 of Volume 3 of the SW Developer's guide or as a processor errata. Does Haswell also show this behavior?

Patrick_F_Intel1 · ‎06-09-2014

Hello Perfwise & John,

I'm looking into whether MEM_LOAD_UOPS_RETIRED.HIT_LFB, MEM_LOAD_UOPS_RETIRED.LLC_HIT, MEM_LOAD_UOPS_RETIRED.L2_HIT are reported to have the issues you describe.

Pat

Patrick_F_Intel1 · ‎06-10-2014

Hello perfwise,

I get non-zero results for MEM_LOAD_UOPS_RETIRED.HIT_LFB on every test I've tried. I'm using the non-precise version of the event. I've tested the event on an almost idle system and on while running a latency test. I can't tell if the values are I'm using an Ivy bridge system though. There aren't any reported issues with MEM_LOAD_UOPS_RETIRED.HIT_LFB.

Pat

Performance Counter Mismatch for data cache misses