Difference between Sandy Bridge LLC miss events

Wilson_R_ · ‎02-21-2013

On a Sandy Bridge processor, I'm trying to understand the difference between the following two events. This is based on the information in the developer manual, Volume 3B, Part 2, in sections 19.1 and 19.3.

The architectural performance event "LLC Misses", which is also called LONGEST_LAT_CACHE.MISS Event Num: 0x2E Umask Value: 0x41
MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS Event Num: 0xD4 Umask Value: 0x02

I have seen one post stating that the LLC Misses event counts LLC misses due to loads and stores, but not LLC misses due to hardware prefetches.
Could someone help explain the difference between these two events?
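
For anyone who wants to program the two events listed above, here is a minimal sketch of how an (event number, umask) pair is usually packed into a raw code. It assumes the standard IA32_PERFEVTSELx layout (event select in bits 0-7, umask in bits 8-15, USR bit 16, OS bit 17, EN bit 22) and the `rUUEE` raw-event syntax of Linux perf. The function names are hypothetical helpers, not part of any tool.

```python
# Sketch only: pack (event, umask) into a perf raw code and a PERFEVTSEL value.
# perf_raw_code and perfevtsel are hypothetical helper names for illustration.

def perf_raw_code(event: int, umask: int) -> str:
    """Return the raw event string (umask byte, then event byte) for `perf stat -e`."""
    return f"r{((umask & 0xFF) << 8) | (event & 0xFF):04x}"

def perfevtsel(event: int, umask: int, usr: bool = True,
               os: bool = True, enable: bool = True) -> int:
    """Build an IA32_PERFEVTSELx value with the basic control bits."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8)
    if usr:
        value |= 1 << 16   # count while in user mode
    if os:
        value |= 1 << 17   # count while in kernel mode
    if enable:
        value |= 1 << 22   # enable the counter
    return value

# The two events from the question.
EVENTS = {
    "LONGEST_LAT_CACHE.MISS": (0x2E, 0x41),
    "MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS": (0xD4, 0x02),
}

for name, (event, umask) in EVENTS.items():
    print(f"{name}: perf {perf_raw_code(event, umask)}, "
          f"PERFEVTSEL 0x{perfevtsel(event, umask):08x}")
```

With perf, the two codes could then be measured side by side on the same run, e.g. `perf stat -e r412e,r02d4 ./a.out`, which makes the difference in counts easy to compare.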

perfwise · ‎02-23-2013

Wilson,

Event 0x2E counts only demand requests and misses to the L3, that is, requests that originated from an instruction's load, and possibly store. The other event sounds like it counts the number of ops that missed, but I suspect it's less useful. From these PMCs you really can't tell what activity is going on in SB/IB. For instance, you don't know which L2 hardware prefetches hit or missed, or which LLC prefetches were made (also originating in the L2), and you don't know the writebacks of modified data.

To get a better understanding of the L3, I suggest you use PMC 0x34. This allows you to measure each Cache Block in the SB/IB L3 and track read/write requests which hit or miss. Also, 0xB0 is very useful for getting an idea of the breakdown of the "types" of requests to the L3.

I don't work for Intel, but have learned this through a lot of trial and error. Hope it helps. I wish there was better documentation and assistance from Intel on their PMCs. The documentation is poor, or the counters don't work, and it's up to others like us to determine what works or doesn't.

Perfwise

Bernard · ‎02-23-2013

>>>I wish there was better documentation and assistance from Intel on their PMCs. The documentation is poor, or they don't work and it's up to others like us to determine what works or doesn't. >>>

Poor documentation could be intentional, in order not to expose some of the processor's features to the general public.

perfwise · ‎02-24-2013

Absolutely, but for many of the counters I've programmed, the results don't make sense or the counters don't work. Also, some counters work in one revision and don't in others. I'm just voicing the point that it would be quite customer-centric to have some decent backwards support and understanding of what's going on. For example, on my SB I can't even measure memory bandwidth. Now, I'm sure there was support for that for someone who built the chip. Also, these L3 and hw pref stats from the DC are very misleading. The question above highlights the difficulties with such limited documentation.

Bernard · ‎02-24-2013

I suppose that more in-depth information is accessible to some of Intel's partners, like Microsoft and other software companies. You stated in your post that sometimes the results do not make any sense. Can you rule out the possibility of a programming error? Regarding the poor documentation, I think it could be described as "some functionality and features are obscured intentionally by design". We are simply not given the full finite state machine representation of the PMU counter implementation, and this can lead to unexpected behaviour and strange results.

Bernard · ‎02-24-2013

@perfwise

Are you programming the PMU counters under Linux or Windows? If under Windows, how do you display your data?

Wilson_R_ · ‎02-25-2013

Thank you, Perfwise, for sharing the information. Your answer could explain why the number of events for 0xD4 is much larger than for 0x2E.

-Wilson

Adam_J_3 · ‎08-29-2013

perfwise,

Can you give any more info on how to use PMC 0x34 and/or other counters for determining what's happening in the LLC? Also, Sandy Bridge has the same counters as Ivy Bridge, right?

thanks!

McCalpinJohn · ‎08-29-2013

Caveat: The following notes assume that you are working with the Sandy Bridge "client" part (Core i3/5/7 or Xeon E3) with the CPUID signature of 06_2AH (Family 6, Model 42). The Sandy Bridge "server" parts (Xeon E5, CPUID signature 06_2DH, Family 6, Model 45) have a completely different uncore, for which the performance counters are described in the "Intel Xeon Processor E5-2600 Product Family Uncore Performance Monitoring Guide", document 327043-001 (March, 2012).
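
To check which of the two parts you have, the family and model can be decoded from the CPUID leaf 1 EAX signature. The following is a minimal sketch assuming the standard Intel field layout (model in bits 7:4, family in bits 11:8, extended model in bits 19:16, extended family in bits 27:20). The function names and the two sample signature values are illustrative assumptions, not taken from the documents above.

```python
# Sketch only: decode display family/model from a CPUID.01H:EAX signature.
# decode_signature and sandy_bridge_flavor are hypothetical helper names.

def decode_signature(eax: int) -> tuple[int, int]:
    """Return (display_family, display_model) for a CPUID leaf 1 EAX value."""
    model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    # The extended family is added only when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is prepended only for base family 6 or 0xF.
    if base_family in (0x6, 0xF):
        model |= ext_model << 4
    return family, model

def sandy_bridge_flavor(eax: int) -> str:
    """Classify a signature as Sandy Bridge client, server, or neither."""
    family, model = decode_signature(eax)
    if (family, model) == (6, 0x2A):
        return "Sandy Bridge client (06_2AH)"
    if (family, model) == (6, 0x2D):
        return "Sandy Bridge server (06_2DH)"
    return f"other (family {family}, model {model})"

# Example signature values, assumed for illustration.
for eax in (0x000206A7, 0x000206D7):
    print(f"0x{eax:08x}: {sandy_bridge_flavor(eax)}")
```

On Linux, the same family and model numbers are also reported in /proc/cpuinfo as "cpu family" and "model", so the decoding step is only needed when starting from the raw signature.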

I am pretty sure that the event 0x2E counts only demand accesses to the L3. Using the STREAM benchmark as a test, I found that this event counted only about 9% of the expected accesses under normal operation, while it counted all the expected accesses when the hardware prefetchers were disabled (plus a few percent extra -- probably for page table walks?).
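
The "expected accesses" comparison can be sketched roughly as follows. This assumes 64-byte cache lines, 8-byte array elements, arrays much larger than the L3, and that each stored array is also read once because of write-allocate. The array length and the "measured" count below are hypothetical placeholders chosen only to illustrate a 9% ratio. They are not my measurements.

```python
# Sketch only: estimate the cache-line reads one pass of a STREAM kernel
# should generate, and compare a counter value against that estimate.
# All names and sample numbers here are hypothetical.

LINE_BYTES = 64      # assumed cache line size
ELEM_BYTES = 8       # STREAM uses double precision by default

# (arrays read, arrays written) per kernel
KERNELS = {
    "copy":  (1, 1),
    "scale": (1, 1),
    "add":   (2, 1),
    "triad": (2, 1),
}

def expected_line_reads(kernel: str, n_elems: int,
                        write_allocate: bool = True) -> int:
    """Cache lines read from memory for one pass of the kernel."""
    reads, writes = KERNELS[kernel]
    lines_per_array = -(-n_elems * ELEM_BYTES // LINE_BYTES)  # ceiling division
    arrays_read = reads + (writes if write_allocate else 0)
    return arrays_read * lines_per_array

def coverage(measured: int, kernel: str, n_elems: int) -> float:
    """Fraction of the expected line reads that the counter reported."""
    return measured / expected_line_reads(kernel, n_elems)

n = 8_000_000                      # hypothetical array length
measured = 270_000                 # hypothetical counter value
print(expected_line_reads("triad", n))
print(f"{coverage(measured, 'triad', n):.0%}")
```

A ratio far below 100% with the prefetchers on, rising to roughly 100% with them off, is the pattern that points to the event counting demand accesses only.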

The description of Event 0x34 in table 19-10 does mention access types, but uses an unusual nomenclature. The phrase "core initiated cacheable read requests" is a bit unclear. Without doing experiments, I would interpret it as:

Demand Data loads that miss the L2: definitely counts these
Instruction Fetches that miss the L2: probably counts these
L1 data prefetch loads that miss the L2: maybe counts these?
L2 prefetch loads that miss the L2: almost certainly does not count these

When I tried various 0x34 events with STREAM, I got counts near zero for everything. It looks like the event is targeting hits, not misses, and I expect essentially zero L3 hits with or without prefetch. Maybe if I tried an L3-containable array size? No time now.

Event 0xD4 is a bit confusing -- notice that the description of the event suggests something rather different than the name:
Name: MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS <-- seems pretty clear
Description: Retired load uops with unknown information as data source in cache serviced the load. <--- not so clear
When I ran STREAM with this event I got very low counts -- many orders of magnitude less than the Event 0x2E values.

This sort of information should also be available from the offcore response events documented in Table 19-8. It is clear that not all of the plausible bit combinations are supported, but it is not always completely clear what the bit combinations that are defined and named in Table 19-8 actually count. No time to deal with that today....