Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Ivy Bridge LLC miss

Min_X_
Beginner

Hi,

I am trying to run a small computation on an Ivy Bridge (i7-3770) without causing any LLC misses. I preload all of the data and code into the cache by touching them beforehand, and I disable all interrupts during the computation. I have also enabled only one core in the BIOS.

Currently, the target small computation is a single iteration over a buffer that has been preloaded before the real iteration. The size of the buffer is 8192*120*2 bytes. When I measure LLC misses using the MEM_LOAD_UOPS_RETIRED.LLC_MISS (event 20d1) performance counter, the iteration occasionally causes 1 LLC miss in an unpredictable manner. I am wondering whether there are hardware features that could lead to this kind of unpredictable cache miss on preloaded data.
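For reference, the measurement pattern is roughly like the following sketch (not my exact test code; the real counter programming differs, and reading rdpmc from user space assumes CR4.PCE is set):

/* Rough sketch of the preload-then-measure pattern described above.
   Assumes PMC 0 has already been programmed for MEM_LOAD_UOPS_RETIRED.LLC_MISS
   (event 0xD1, umask 0x20) and that CR4.PCE permits rdpmc from user space. */
#include <stdint.h>
#include <stdio.h>

#define BUF_SIZE (8192 * 120 * 2)
static volatile uint8_t buf[BUF_SIZE];

static inline uint64_t read_pmc(uint32_t ctr) {
    uint32_t lo, hi;
    __asm__ volatile("mfence; rdpmc"
                     : "=a"(lo), "=d"(hi) : "c"(ctr) : "memory");
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    volatile uint64_t sum = 0;

    /* Preload: touch every cache line of the buffer once */
    for (uint32_t i = 0; i < BUF_SIZE; i += 64)
        sum += buf[i];

    /* Measure the second pass over the (now preloaded) buffer */
    uint64_t before = read_pmc(0);
    for (uint32_t i = 0; i < BUF_SIZE; i += 64)
        sum += buf[i];
    uint64_t after = read_pmc(0);

    printf("LLC misses in measured pass: %llu\n",
           (unsigned long long)(after - before));
    return 0;
}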

In addition, I am confused about LONGEST_LAT_CACHE.MISS and MEM_LOAD_UOPS_RETIRED.LLC_MISS. My understanding is that the former counts LLC miss events while the latter counts the number of uops that cause LLC misses. My observation is that LONGEST_LAT_CACHE.MISS is always larger than MEM_LOAD_UOPS_RETIRED.LLC_MISS for the same sequence of instructions, even when MEM_LOAD_UOPS_RETIRED.LLC_MISS is zero. Is this normal?

Thanks.

Min 

McCalpinJohn
Honored Contributor III

There are many features that can cause data to be evicted from the caches.  Most of these are in the operating system software.  As a practical matter, it is not generally possible to absolutely, positively ensure that the operating system will not access addresses that cause an overflow of any of the cache congruence classes that are being used to hold your data.  I have not worked much with the "client" uncore used in the Core i7 processors, but the "server" uncore uses undocumented pseudo-LRU replacement schemes and undocumented hash functions to map addresses to the L3 slices.  It is likely that at least some of these complexities also apply to the Core i3/i5/i7 processors.

The LONGEST_LAT_CACHE.MISS event is an "architectural" event.  The event is included in the table of Ivy Bridge events in Table 19-11 of Volume 3 of the Intel Architecture Software Developer's Manual.  The description of the event refers to Table 19-1, which lists the "architectural" events.   There is a longer discussion of the "architectural" events in Section 18.2.1.2 of the same document, which notes:

"This event counts each cache miss condition for references to the last level cache. The event count may include speculation and cache line fills due to the first-level cache hardware prefetcher, but may exclude cache line fills due to other hardware-prefetchers.

Because cache hierarchy, cache sizes and other implementation-specific characteristics; value comparison to estimate performance differences is not recommended."

In my experience, this event counts L3 misses due to demand loads, demand stores and L1 hardware prefetch loads.  It does not count L3 misses due to the L2 hardware prefetchers (which often generate the majority of the traffic between memory and L3).

The MEM_LOAD_UOPS_RETIRED.LLC_MISS event only counts L3 misses due to loads -- NOT L3 misses due to stores.   It *probably* does not include L3 misses due to L1 hardware prefetches, since it is triggered by load uops -- but Intel's documentation is very sparse on the subject of L1 hardware prefetches in general.  Like the LONGEST_LAT_CACHE.MISS event, this event does not show traffic due to L2 hardware prefetch operations.   So it is possible for either of these events to be zero, even if all of the data is actually coming from memory -- as long as the L2 hardware prefetchers pull the data into the L3 cache before the demand load miss from the L2 arrives at the L3 cache.  (Sometimes the L2 hardware prefetchers will pull the data all the way into the L2 -- in this case you will not even get an L3 access.)
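If you want to take the hardware prefetchers out of the picture for experiments like this, Intel has disclosed a prefetcher control MSR (0x1A4) that applies to Nehalem through Broadwell cores.  A minimal sketch of setting its four disable bits, assuming root access and the Linux msr driver ("modprobe msr"):

/* Sketch: disable the four hardware prefetchers via MSR 0x1A4.
   Bit 0: L2 HW prefetcher          Bit 1: L2 adjacent-line prefetcher
   Bit 2: DCU streaming prefetcher  Bit 3: DCU IP prefetcher */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    const uint32_t MSR_MISC_FEATURE_CONTROL = 0x1A4;
    uint64_t val = 0;

    int fd = open("/dev/cpu/0/msr", O_RDWR);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    pread(fd, &val, sizeof(val), MSR_MISC_FEATURE_CONTROL);
    val |= 0xF;   /* set bits 0-3 to disable all four prefetchers */
    pwrite(fd, &val, sizeof(val), MSR_MISC_FEATURE_CONTROL);

    close(fd);
    return 0;
}

(Remember to restore the original value afterwards; the MSR is per-core, so with more than one core enabled the write has to be applied to each one.)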

Because of the difference in counting store misses I would expect LONGEST_LAT_CACHE.MISS to be larger than MEM_LOAD_UOPS_RETIRED.LLC_MISS if there are any stores that miss in the L3 cache.

The MEM_LOAD_UOPS_RETIRED.* events have bugs in a number of processors (especially with HyperThreading enabled).  You should look in the "specification update" document for your processor model to see if any such bugs have been disclosed.

 

Min_X_
Beginner

Hi John,

I did some more tests on LONGEST_LAT_CACHE.MISS and MEM_LOAD_UOPS_RETIRED.LLC_MISS: single core, no HyperThreading, no hardware prefetching, and the interrupt flag (IF) cleared during the test.

For MEM_LOAD_UOPS_RETIRED.LLC_MISS, I consistently get zero misses for an entire array iteration over 16384*120*4 bytes of data with reads and stores.

For LONGEST_LAT_CACHE.MISS, the number of misses is around 1-7 for each test. The LONGEST_LAT_CACHE.MISS count is almost the same even if I replace the reads and stores with reads only. As an extreme case, I even compared the values of LONGEST_LAT_CACHE.MISS between consecutive rdpmc instructions (the first two rdpmc sequences are supposed to preload the instructions and the target memory addresses used for storing the results):

# Sequence 1: warm-up read of PMC 0 (preloads the code path and result addresses)
mov $0x0, %eax
mov %eax, %ecx       # ECX = 0 selects PMC 0 (programmed for LONGEST_LAT_CACHE.MISS)
mfence               # serialize preceding memory operations
rdpmc                # counter value returned in EDX:EAX
mov %eax, 0x87e000c  # store low 32 bits
mov %edx, 0x87e0010  # store high 32 bits
# Sequence 2: warm-up read into the second result slot
mov $0x0, %eax
mov %eax, %ecx
mfence
rdpmc
mov %eax, 0x87e001c
mov %edx, 0x87e0020
# Sequence 3: first measured read
mov $0x0, %eax
mov %eax, %ecx
mfence
rdpmc
mov %eax, 0x87e000c
mov %edx, 0x87e0010
# Sequence 4: second measured read; the delta between sequences 3 and 4 is the count
mov $0x0, %eax
mov %eax, %ecx
mfence
rdpmc
mov %eax, 0x87e001c
mov %edx, 0x87e0020


Even consecutive measurements like this give me non-zero LONGEST_LAT_CACHE.MISS counts. I think all of the obvious factors, i.e., loads, stores, and prefetches, have been eliminated, so I have no idea what could still cause the misses.

Thanks.

Min

Min_X_
Beginner

Hi John,

I wanted to double-check my results using the DMND_DATA_RD and DMND_RFO off-core response events on my Ivy Bridge processor.

I configured IA32_PERFEVTSEL0 to count event 0x01b7 and IA32_PERFEVTSEL1 to count event 0x01bb, and set MSR_OFFCORE_RSP0 to 1 (DMND_DATA_RD) and MSR_OFFCORE_RSP1 to 2 (DMND_RFO).

I also set response type bit 22 (LOCAL) in both MSRs to count accesses to local DRAM only.
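For reference, the MSR programming was roughly as in the following sketch (MSR addresses are from the Intel SDM: IA32_PERFEVTSEL0/1 at 0x186/0x187, MSR_OFFCORE_RSP_0/1 at 0x1A6/0x1A7; this version uses the Linux msr driver rather than my actual setup code):

/* Sketch: program the two OFFCORE_RESPONSE events described above. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static void wrmsr_cpu0(uint32_t reg, uint64_t val) {
    int fd = open("/dev/cpu/0/msr", O_WRONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return; }
    pwrite(fd, &val, sizeof(val), reg);  /* msr driver uses the offset as the MSR number */
    close(fd);
}

int main(void) {
    /* Request type in the low bits, response type LOCAL in bit 22 */
    wrmsr_cpu0(0x1A6, (1ULL << 0) | (1ULL << 22));  /* MSR_OFFCORE_RSP_0: DMND_DATA_RD */
    wrmsr_cpu0(0x1A7, (1ULL << 1) | (1ULL << 22));  /* MSR_OFFCORE_RSP_1: DMND_RFO */

    /* The 0x43 prefix sets USR, OS, and EN in IA32_PERFEVTSELx */
    wrmsr_cpu0(0x186, 0x4301B7);  /* PMC0: event 0xB7, umask 0x01 = OFFCORE_RESPONSE_0 */
    wrmsr_cpu0(0x187, 0x4301BB);  /* PMC1: event 0xBB, umask 0x01 = OFFCORE_RESPONSE_1 */
    return 0;
}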

The results show that both DMND_DATA_RD and DMND_RFO are zero during the execution of the array iteration, even when neither the 412e (LONGEST_LAT_CACHE.MISS) nor the 20d1 (MEM_LOAD_UOPS_RETIRED.LLC_MISS) count is zero. What do you think?

Thanks.

Min

McCalpinJohn
Honored Contributor III

I very seldom pay attention to single-digit counts from the hardware performance counters.  There are just too many things that the processor does that we don't know about for us to be absolutely in control.

  • For example, even if you disable interrupts in the kernel, I don't know of any way to prevent the BIOS from sending an SMI interrupt to perform SMM functions.  These are mostly invisible to the user state of the processor, but may not be completely invisible.  (One way to check whether an SMI occurred is to read MSR_SMI_COUNT around the measurement; see the sketch after this list.)
  • The Power Control Unit is also capable of interrupting the processor using a mechanism that is not visible to most software, and no one outside of Intel knows exactly what sort of "footprint" these transactions might have.
  • The LLC may do some autonomous processing that is not documented.  Some Intel processors have used "LRU replacement hints" from the L1/L2 to the L3 to tell the L3 not to evict data that is in active use at the L1/L2 level (even if the line has not been reloaded from the L3 for a long time).  Since these are mostly undocumented, it is extremely difficult to rule out a low level of unexpected behavior from such mechanisms.  For example, the L1 instruction cache might be telling the L3 to keep the cache lines corresponding to the active text at a higher priority, and this may reduce the effective associativity available for data.
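As a sanity check on the first point, recent Intel processors (Nehalem and later) provide MSR_SMI_COUNT (0x34), which increments on every SMI, so reading it before and after a measurement will at least tell you whether an SMI landed in the interval.  A minimal sketch, assuming root access and the Linux msr driver:

/* Sketch: detect SMIs across a measurement interval via MSR_SMI_COUNT (0x34). */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint64_t rdmsr_cpu0(uint32_t reg) {
    uint64_t val = 0;
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd >= 0) {
        pread(fd, &val, sizeof(val), reg);
        close(fd);
    }
    return val;
}

int main(void) {
    const uint32_t MSR_SMI_COUNT = 0x34;

    uint64_t before = rdmsr_cpu0(MSR_SMI_COUNT);
    /* ... run the measured code here ... */
    uint64_t after = rdmsr_cpu0(MSR_SMI_COUNT);

    printf("SMIs during interval: %llu\n",
           (unsigned long long)(after - before));
    return 0;
}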

 
