Solved: Intel PCM: L3 Cache Misses Discrepancy(?)

tim_kiefer · ‎01-12-2012

Hi all,

I am using the PCM tool for some experiments on a 4-socket (WestmereEX) machine. I am puzzled by a discrepancy that I observe when counting L3 cache misses in two different ways...

I started with the PCM tool as it is and used it to monitor L3 cache misses (L3MISS column) in 1-second intervals. Looking at the code, I know that the following event is counted: MEM_LOAD_RETIRED_L3_MISS. (The description for this event from the Software Developer Manual: "Counts number of retired loads that miss the L3 cache. The load was satisfied by a remote socket, local memory or an IOH".)

I focused on the "per socket" statistics as I was interested in socket 0 only. Doing so, I ran two workloads, with the rough result that workload A has about 10 times as many L3MISS events as workload B.

Next I was interested in the (MESIF) state that cache lines are in, when they are read (shared or exclusive). I figured that this information is only provided in uncore counters so I programmed several uncore counters following the guide (Intel Xeon Processor E7 Family (Westmere EX) Uncore Performance Monitoring Programming Guide). Precisely, I programmed all C-Boxes to count the LLC_MISSES (event 0x14 on page 2-25). The LLC_MISSES for all 10 C-Boxes are summed up in the tool and I was expecting the result to reflect the number of L3 misses for a certain socket.

Before looking at any cache line states, I performed a sanity check to see whether the L3 cache miss counts agree between the two setups. Surprisingly, I see quite different results. They differ not only quantitatively (which I could understand, given the different ways of measuring) but also qualitatively. My problem: workload A now has only about half the L3 misses that workload B has (it used to have 10 times more).
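The sanity check described above can be sketched roughly as follows. This is only an illustration: the function names and all counter values are hypothetical, chosen to mirror the rough ratios mentioned in this post (about 10x with the core event, about 0.5x with the summed C-Box counts), and no real PCM API is used.

```python
def socket_llc_misses(cbox_counts):
    """Sum the per-C-Box LLC_MISSES values for one socket."""
    return sum(cbox_counts)


def workload_ratio(misses_a, misses_b):
    """Ratio of workload A's misses to workload B's misses."""
    return misses_a / misses_b


def same_ordering(ratio_core, ratio_uncore):
    """True if both methods agree on which workload misses more."""
    return (ratio_core > 1.0) == (ratio_uncore > 1.0)


# Hypothetical counts, not measured data.
core_a, core_b = 10_000_000, 1_000_000   # core event, socket 0
cbox_a = [50_000] * 10                   # 10 C-Boxes, workload A
cbox_b = [100_000] * 10                  # 10 C-Boxes, workload B

ratio_core = workload_ratio(core_a, core_b)
ratio_uncore = workload_ratio(socket_llc_misses(cbox_a),
                              socket_llc_misses(cbox_b))
agree = same_ordering(ratio_core, ratio_uncore)
print(ratio_core, ratio_uncore, agree)
```

A differing magnitude alone would not be alarming; the flag only trips when the two methods disagree about which workload misses more, which is the qualitative discrepancy in question.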

Having thought about this for a while: Am I missing something? Do these two ways to count L3 misses actually count different events? Is one way counting a subset of the events that the other way does?

Any help will be appreciated! Thanks!
- tim

PS: I also compared L3 hits counted with the core counters and with the uncore (C-Box) counters, and although they differ by a factor of 5, they at least show the same qualitative results.

Roman_D_Intel · ‎01-12-2012

Tim,

Which PCM version are you using? The recent Intel PCM version 1.7 uses ARCH_LLC_MISS to count LLC cache misses on Westmere-EX.

Could you try the recent PCM version (or the ARCH_LLC_MISS event) instead?

Thanks,
Roman

tim_kiefer · ‎01-13-2012

Hi Roman,

spot on!!! I was indeed using PCM version 1.6, where L3 misses were counted with the MEM_LOAD_RETIRED_L3_MISS event.

As you proposed, I tried the ARCH_LLC_MISS event instead and it shows pretty much the same as what the uncore counters (C-Boxes) show. Hence, my results are consistent now.

Can you explain what the MEM_LOAD_RETIRED_L3_MISS event tells me (on WestmereEX)? The numbers here are still pretty different from the L3 Misses and I am trying to make sense of it (i.e., to interpret the results here correctly).

Thanks (again :)) for your quick help!
- tim

GHui · ‎02-06-2012

I want to check LLC-Miss (last-level cache misses) on Nehalem-EP. I used the architectural event to collect LLC-Miss. Which program can I use to test the LLC-Miss event?

Patrick_F_Intel1 · ‎02-06-2012

'Which program to use' depends on what you want to do.
PCM will show the memory bandwidth due to LLC-Miss events.
You can get PCM from http://software.intel.com/en-us/articles/intel-performance-counter-monitor/
PCM will compute the memory bandwidth (and other metrics) for you.

If you want to be able to see which processes/modules/functions are causing the bandwidth, you'll need a tool like Intel VTune Amplifier.
You can get Amplifier from http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/

You can also use tools like Linux oprofile or Linux 'perf', but I'm not so familiar with these tools.
Pat

GHui · ‎02-06-2012

Thanks for your help. My English is not very good.

I just want to test the LLC-Miss event, so I am looking for a program that lets me check whether the LLC-Miss counts I collect are right or wrong.

You said "memory bandwidth due to LLC-Miss events". Is there any relationship between the two events? I see that both events are close to the C-Box.

Jay_J_ · ‎03-15-2015

Hi Roman and Fay,

Let me ask you a question about memory bandwidth. Using pcm-numa.x in PCM 2.8, I recorded the local and remote memory access counts while running the STREAM benchmark, and found a difference between the peak bandwidth reported by STREAM and the maximum bandwidth derived from the PCM counts. I calculated the maximum bandwidth by multiplying the largest per-second memory access count among the records by 64 B. For example, 100M accesses per second (local DRAM accesses + remote DRAM accesses) translates into 6.4 GB/s of throughput. Since the translated value is considerably smaller than the STREAM result (say, 24 GB/s), I am wondering whether I am missing something in my PCM-based measurement. Could you tell me whether the LLC-miss-based counts used in PCM include hardware memory prefetch events as well? If they do, do you have any thoughts on how I can interpret the difference?
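The arithmetic in this post can be written out as a small sketch. The 64-byte cache line size, the 100M accesses per second, and the 24 GB/s STREAM figure come from the post itself; the function and variable names are made up for illustration.

```python
CACHE_LINE_BYTES = 64  # one DRAM access is assumed to move one cache line


def bandwidth_gb_per_s(accesses_per_second, line_bytes=CACHE_LINE_BYTES):
    """Convert a per-second memory access count into GB/s (1 GB = 1e9 bytes)."""
    return accesses_per_second * line_bytes / 1e9


# Local + remote DRAM accesses per second, as in the example above.
counted = bandwidth_gb_per_s(100e6)

# Bandwidth reported by STREAM ("say, 24GB/s" in the post).
stream_gb_per_s = 24.0

# Share of the STREAM traffic that the counted accesses do not explain.
uncounted_fraction = 1.0 - counted / stream_gb_per_s

print(f"{counted:.1f} GB/s counted, "
      f"{uncounted_fraction:.0%} of the STREAM bandwidth unaccounted for")
```

With these example numbers, roughly three quarters of the traffic would have to come from accesses that the counted events do not cover.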

Thanks for your reply in advance.

Juyoung

Roman_D_Intel · ‎03-16-2015

Juyoung,

PCM shows "demand" cache misses: the LLC counts do not include misses generated by the HW prefetcher.

Best regards,

Roman

Jay_J_ · ‎03-16-2015

Roman, thanks for your quick response.

McCalpinJohn · ‎03-17-2015

If you have root access on the system you can disable hardware prefetchers (as described at https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors). This should improve the counts. I have not had a chance to check whether the bugs in these events in the Sandy Bridge processors have been fixed in Haswell EP -- I seem to recall that some of the counts are still wrong even with the hardware prefetchers disabled, but I don't have the details at hand.
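A minimal sketch of what disabling the prefetchers might look like on Linux is shown below. It rests on assumptions that should be verified against the linked article for the specific processor: that the control register is MSR 0x1A4, that bits 0 to 3 each disable one prefetcher when set to 1, and that the Linux `msr` driver is loaded so that `/dev/cpu/<n>/msr` exists. The helper names are hypothetical.

```python
import os
import struct

# Assumed from the linked article; verify for your processor model.
MSR_PREFETCH_CONTROL = 0x1A4
ALL_PREFETCHERS = 0b1111  # bits 0-3, one per prefetcher; 1 means disabled


def with_prefetchers_disabled(current_value):
    """Return the register value with all four disable bits set."""
    return current_value | ALL_PREFETCHERS


def write_msr(cpu, register, value):
    """Write a 64-bit MSR via the Linux msr driver (needs root, `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        # The file offset selects the MSR address; the payload is 8 bytes.
        os.pwrite(fd, struct.pack("<Q", value), register)
    finally:
        os.close(fd)


# Example (not executed here): disable the prefetchers on logical CPU 0,
# assuming the register currently reads 0.
# write_msr(0, MSR_PREFETCH_CONTROL, with_prefetchers_disabled(0))
```

The write has to be repeated for every logical CPU of interest, and the original value should be read and saved first so that the prefetchers can be re-enabled after the measurement.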