I am trying to run a small computation on Ivy Bridge (i7-3770) without causing any LLC misses. I preload all data and code into the cache by touching them beforehand, and I disable all interrupts during the computation. I have also enabled only one core in the BIOS.
Currently, the target computation is a single iteration over a buffer, which has been preloaded before the measured iteration. The buffer is 8192*120*2 bytes. I measure LLC misses using the MEM_LOAD_UOPS_RETIRED.LLC_MISS (event 20d1) performance counter. Occasionally, the iteration causes 1 LLC miss in an unpredictable manner. I am wondering whether there are hardware features that could lead to this kind of unpredictable cache miss on preloaded data.
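For reference, a minimal C sketch of the preload-then-measure pattern described here (the 64-byte line stride and the simple byte-sum loop are my assumptions, not the actual test code):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE (8192 * 120 * 2)  /* buffer size from the question */
#define LINE     64                /* assumed cache-line size on Ivy Bridge */

/* Touch every cache line once so the data is resident before measuring. */
static void preload(volatile const uint8_t *buf, size_t size) {
    for (size_t i = 0; i < size; i += LINE)
        (void)buf[i];
}

/* The measured pass: a single read iteration over the whole buffer. */
static uint64_t iterate(const uint8_t *buf, size_t size) {
    uint64_t sum = 0;
    for (size_t i = 0; i < size; i++)
        sum += buf[i];
    return sum;
}
```

In the real experiment the counter reads would bracket iterate(), with interrupts disabled for the whole window.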
In addition, I am confused about LONGEST_LAT_CACHE.MISS versus MEM_LOAD_UOPS_RETIRED.LLC_MISS. My understanding is that the former counts LLC miss events while the latter counts the number of load uops that cause LLC misses. My observation is that LONGEST_LAT_CACHE.MISS is always larger than MEM_LOAD_UOPS_RETIRED.LLC_MISS for the same sequence of instructions, even when MEM_LOAD_UOPS_RETIRED.LLC_MISS is zero. Is this normal?
There are many features that can cause data to be evicted from the caches. Most of these are in the operating system software. As a practical matter, it is not generally possible to absolutely, positively ensure that the operating system will not access addresses that cause an overflow of any of the cache congruence classes that are being used to hold your data. I have not worked much with the "client" uncore used in the Core i7 processors, but the "server" uncore uses undocumented pseudo-LRU replacement schemes and undocumented hash functions to map addresses to the L3 slices. It is likely that at least some of these complexities also apply to the Core i3/i5/i7 processors.
The LONGEST_LAT_CACHE.MISS event is an "architectural" event. The event is included in the table of Ivy Bridge events in Table 19-11 of Volume 3 of the Intel Architecture Software Developer's Manual. The description of the event refers to Table 19-1, which lists the "architectural" events. There is a longer discussion of the "architectural" events in the performance-monitoring chapter of the same document, which notes:
"This event counts each cache miss condition for references to the last level cache. The event count may include speculation and cache line fills due to the first-level cache hardware prefetcher, but may exclude cache line fills due to other hardware-prefetchers.
Because cache hierarchy, cache sizes and other implementation-specific characteristics; value comparison to estimate performance differences is not recommended."
In my experience, this event counts L3 misses due to demand loads, demand stores and L1 hardware prefetch loads. It does not count L3 misses due to the L2 hardware prefetchers (which often generate the majority of the traffic between memory and L3).
The MEM_LOAD_UOPS_RETIRED.LLC_MISS event only counts L3 misses due to loads -- NOT L3 misses due to stores. It *probably* does not include L3 misses due to L1 hardware prefetches, since it is triggered by load uops -- but Intel's documentation is very sparse on the subject of L1 hardware prefetches in general. Like the LONGEST_LAT_CACHE.MISS event, this event does not show traffic due to L2 hardware prefetch operations. So it is possible for either of these events to be zero, even if all of the data is actually coming from memory -- as long as the L2 hardware prefetchers pull the data into the L3 cache before the demand load miss from the L2 arrives at the L3 cache. (Sometimes the L2 hardware prefetchers will pull the data all the way into the L2 -- in this case you will not even get an L3 access.)
Because of the difference in counting store misses I would expect LONGEST_LAT_CACHE.MISS to be larger than MEM_LOAD_UOPS_RETIRED.LLC_MISS if there are any stores that miss in the L3 cache.
The MEM_LOAD_UOPS_RETIRED.* events have bugs in a number of processors (especially with HyperThreading enabled). You should look in the "specification update" document for your processor model to see if any such bugs have been disclosed.
I did some more tests on LONGEST_LAT_CACHE.MISS and MEM_LOAD_UOPS_RETIRED.LLC_MISS: single core, no Hyper-Threading, hardware prefetching disabled, and the interrupt flag (IF) cleared during the test.
For MEM_LOAD_UOPS_RETIRED.LLC_MISS, I can get consistently zero misses for an entire array iteration over 16384*120*4 bytes of data with both reads and stores.
For LONGEST_LAT_CACHE.MISS, the number of misses is around 1-7 for each test. The LONGEST_LAT_CACHE.MISS count stays almost the same even if I replace the read-and-store iteration with reads only. As an extreme test, I even compared the values of LONGEST_LAT_CACHE.MISS between two consecutive rdpmc instructions (the first two rdpmc are intended to preload the instructions and the target memory addresses used for storing the results):
mov $0x0, %eax
mov %eax, %ecx
mfence
rdpmc
mov %eax, 0x87e000c
mov %edx, 0x87e0010

mov $0x0, %eax
mov %eax, %ecx
mfence
rdpmc
mov %eax, 0x87e001c
mov %edx, 0x87e0020

mov $0x0, %eax
mov %eax, %ecx
mfence
rdpmc
mov %eax, 0x87e000c
mov %edx, 0x87e0010

mov $0x0, %eax
mov %eax, %ecx
mfence
rdpmc
mov %eax, 0x87e001c
mov %edx, 0x87e0020
Even consecutive measurements like this give me a non-zero LONGEST_LAT_CACHE.MISS count. I think all of the possible factors, i.e. loads, stores and prefetches, have been eliminated. I have no idea what could still cause the misses.
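For what it's worth, the measurement sequence above can be wrapped in C roughly as follows (a sketch: the mfence mirrors the sequence above, and the 48-bit width is the general-purpose counter width on Ivy Bridge; rdpmc requires CR4.PCE=1 or ring 0, otherwise it faults):

```c
#include <stdint.h>

#define PMC_WIDTH 48  /* general-purpose counter width on Ivy Bridge */

/* Read general-purpose counter `idx`, fencing first so earlier loads and
 * stores have completed, as in the mfence+rdpmc sequence above. */
static inline uint64_t read_pmc(uint32_t idx) {
    uint32_t lo, hi;
    __asm__ volatile("mfence; rdpmc"
                     : "=a"(lo), "=d"(hi)
                     : "c"(idx));
    return ((uint64_t)hi << 32) | lo;
}

/* Difference between two counter reads, tolerating one wrap of the
 * 48-bit counter. */
uint64_t pmc_delta(uint64_t start, uint64_t end) {
    return (end - start) & ((1ULL << PMC_WIDTH) - 1);
}
```

The two stored (eax, edx) pairs from the listing above would be combined and subtracted the same way pmc_delta does.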
I double-checked my results using the DMND_DATA_RD and DMND_RFO off-core response events on my Ivy Bridge processor.
I configured IA32_PERFEVTSEL0 to count event 0x01b7 and IA32_PERFEVTSEL1 to count event 0x01bb, and set MSR_OFFCORE_RSP0 to 1 (DMND_DATA_RD) and MSR_OFFCORE_RSP1 to 2 (DMND_RFO).
I also set response-type bit 22 (LOCAL) to count only accesses that go to local DRAM.
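For reference, the register encodings implied by this setup can be written out as follows (a sketch: the USR/OS/EN bit settings are my assumption about how the counters were enabled; the bit layout follows the IA32_PERFEVTSEL format in the SDM):

```c
#include <stdint.h>

/* IA32_PERFEVTSELx layout: event select in bits 7:0, umask in bits 15:8,
 * USR bit 16, OS bit 17, EN bit 22. */
#define PERFEVTSEL_USR (1ULL << 16)
#define PERFEVTSEL_OS  (1ULL << 17)
#define PERFEVTSEL_EN  (1ULL << 22)

uint64_t perfevtsel(uint8_t event, uint8_t umask) {
    return (uint64_t)event | ((uint64_t)umask << 8)
         | PERFEVTSEL_USR | PERFEVTSEL_OS | PERFEVTSEL_EN;
}

/* MSR_OFFCORE_RSP_x request-type bits used above. */
#define DMND_DATA_RD (1ULL << 0)
#define DMND_RFO     (1ULL << 1)
/* Response-type bit 22, used here to restrict counting to local DRAM. */
#define RSP_LOCAL    (1ULL << 22)
```

With this encoding, event 0xB7 with umask 0x01 becomes 0x4301B7: the 0x01b7 event/umask pair above plus the USR, OS and EN bits.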
The results show that both DMND_DATA_RD and DMND_RFO are zero during the execution of the array iteration, even when neither 412e (LONGEST_LAT_CACHE.MISS) nor 20d1 (MEM_LOAD_UOPS_RETIRED.LLC_MISS) is zero. What do you think?
I very seldom pay attention to single-digit counts from the hardware performance counters. There are just too many things the processor does that we don't know about for us to be absolutely in control.