Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Does operating frequency influence cache misses?

kopcarl
Beginner
2,392 Views
I ran 462.libquantum on my i5-2400 at 1.6 GHz, 2.1 GHz, 2.7 GHz, and 3.1 GHz, and I find that LLC misses increase at higher frequencies. The details are as follows: LLC misses were 5E+09, 6.9E+09, 9E+09, and 1E+10 at 1.6 GHz, 2.1 GHz, 2.7 GHz, and 3.1 GHz respectively. I am wondering why changing the frequency can influence cache misses?
0 Kudos
27 Replies
McCalpinJohn
Honored Contributor III
400 Views

The explanation I provided yesterday appears to be correct --- you are seeing fewer "L3 misses" at lower frequencies because the L2 hardware prefetchers have more time to prefetch the data into the L3 cache before the demand miss (or prefetch) reaches the L3 cache.

To test this I took a simple code that repeatedly sums a long vector (250 MB) and measures the "L3 cache misses" using exactly the same event that you used:
        perf stat -e r53412e ./ReadOnly_withStalls 0

The argument to the code is the number of times to execute a delay loop (with no memory references) between memory loads. I use the "rdtsc" instruction to create the delay loop, and add the low-order 32 bits of the TSC to a dummy variable that is printed at the end (to prevent the optimizer from removing the delay code).
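The internals of ReadOnly_withStalls are not shown in this thread, so the following is only a hypothetical C sketch of the construction described above (the function names and structure are assumptions, not the actual code): an rdtsc-based delay loop whose low-order TSC bits are folded into a dummy variable so the optimizer cannot delete it, executed between successive memory loads.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

/* Execute `delay` RDTSC reads with no memory references, accumulating
 * the low-order 32 bits of the TSC so the loop cannot be optimized away. */
static uint32_t delay_loop(int delay)
{
    uint32_t dummy = 0;
    for (int i = 0; i < delay; i++)
        dummy += (uint32_t)__rdtsc();
    return dummy;
}

/* Sum a long vector, inserting the delay loop between memory loads. */
static double sum_with_stalls(const double *v, long n, int delay,
                              uint32_t *dummy_out)
{
    double sum = 0.0;
    uint32_t dummy = 0;
    for (long i = 0; i < n; i++) {
        sum += v[i];                  /* the memory load being spread out */
        dummy += delay_loop(delay);   /* rdtsc-based stall between loads */
    }
    *dummy_out = dummy;  /* caller prints this so the compiler keeps the delays */
    return sum;
}
```

With `delay = 0` the loads issue back-to-back; larger values spread them out in time, which is what changes how often a demand load reaches the L3 before the prefetched data does.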

With an argument of 0 (no added delays), the code reported that it was reading from memory at an average bandwidth of 13.47 GB/s while "perf stat" reported an average frequency of 3.1 GHz.
The reported number of L3 cache misses was 89578602, compared to an expected value of 327680000 (250 MiB is 4,096,000 64-Byte cache lines, read 80 times), so 27.3% of the cache lines transferred were reported as cache misses.

With an argument of 1, the code reported an average memory bandwidth of 1813 MB/s, showing that the delay loop worked.  The average frequency reported by "perf stat" remained 3.1 GHz.
This time the reported number of L3 cache misses was 29396814, or 8.9% of the actual cache lines transferred.

So spreading out the loads reduced the reported L3 cache "misses" by a factor of more than three, with the CPU frequency fixed.

The lesson is that the "L3 cache miss" event increments when a load (or L1 prefetch) arrives at the L3 cache before the data arrives at the L3 cache.  This is a *subset* of the loads that got their data from beyond the L3.  The other part of the subset (which is not measured by this counter) consists of the loads whose data was prefetched (by one of the L2 prefetchers) into the L3 cache before the load (or L1 prefetch) arrived at the L3.

To finally put the nail in the coffin on this issue, I disabled the L2 prefetchers and re-ran the test cases.  In each case the number of reported "L3 misses" was 101.54% of the expected value.  The "extra" 1.54% corresponds almost exactly to the 1/64 increase in traffic required to load the TLB entries (one 64 Byte cache line read for every 4 KiB page).

Case closed.

0 Kudos
Bernard
Valued Contributor I
400 Views

>>>Actually i run this test for several times, and the results are very close>>>

I am not sure whether you can pin a counter to the specific address space of the executing thread.

0 Kudos
kopcarl
Beginner
400 Views

John D. McCalpin wrote:

To finally put the nail in the coffin on this issue, I disabled the L2 prefetchers and re-ran the test cases.  In each case the number of reported "L3 misses" was 101.54% of the expected value.  The "extra" 1.54% corresponds almost exactly to the 1/64 increase in traffic required to load the TLB entries (one 64 Byte cache line read for every 4 KiB page).

Thank you for your contributions to my question. Your tests sound very convincing. Do you know how to disable the L2 prefetchers on an i5-2400?

carl

0 Kudos
McCalpinJohn
Honored Contributor III
400 Views

I don't know of any public references to the configuration bits used to disable prefetchers on the various Intel processors. 

The ability to enable/disable hardware prefetch is available via BIOS options on many systems, so it must be documented for the BIOS writers. 

This may be a case of simple caution -- although disabling and re-enabling hardware prefetchers on a "live" system is typically safe, it is quite possible that there are corner cases in which such changes could cause the system to hang or generate incorrect results.  (That other vendor of x86_64 processors documents the MSRs required to control both the "core" and "memory controller" prefetchers.  The documentation does not address the issue of whether these are safe to modify on a "live" system.)  Enabling/disabling hardware prefetch is not a feature that could easily be considered "necessary" for customers (especially since the BIOS-based alternative exists), so the expense of exhaustive testing would have to be considered a very low priority in Intel's engineering budget.
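For what it's worth, later Intel disclosures describe per-core prefetcher control bits in MSR 0x1A4 (bit 0: L2 hardware prefetcher/streamer, bit 1: L2 adjacent-cache-line prefetcher, bit 2: DCU streamer, bit 3: DCU IP prefetcher; setting a bit disables that prefetcher). Assuming that layout holds for your part, a minimal sketch using the Linux `msr` driver might look like this (requires root and `modprobe msr`; verify the MSR layout against current Intel documentation before writing it, and note the caveats above about modifying a live system):

```c
/* Sketch: toggle the two L2 hardware prefetchers via /dev/cpu/N/msr.
 * ASSUMPTION: MSR 0x1A4 layout as in later Intel disclosures
 * (bit 0 = L2 streamer, bit 1 = L2 adjacent-line prefetcher). */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define PREFETCH_MSR     0x1A4
#define L2_STREAMER      (1ULL << 0)
#define L2_ADJ_LINE      (1ULL << 1)

/* Returns 0 on success, -1 on failure (no msr module, no root, etc.). */
int set_l2_prefetchers(int cpu, int disable)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    uint64_t val;
    /* The msr driver uses the file offset as the MSR number. */
    if (pread(fd, &val, sizeof val, PREFETCH_MSR) != (ssize_t)sizeof val) {
        close(fd);
        return -1;
    }
    if (disable)
        val |= (L2_STREAMER | L2_ADJ_LINE);   /* set bits = disable */
    else
        val &= ~(L2_STREAMER | L2_ADJ_LINE);
    int ok = pwrite(fd, &val, sizeof val, PREFETCH_MSR) == (ssize_t)sizeof val;
    close(fd);
    return ok ? 0 : -1;
}
```

This would need to be run on every core (each core has its own copy of the MSR), which is also why the BIOS route is the safer option when it is available.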

0 Kudos
kopcarl
Beginner
400 Views

John D. McCalpin wrote:

I don't know of any public references to the configuration bits used to disable prefetchers on the various Intel processors. 

The ability to enable/disable hardware prefetch is available via BIOS options on many systems, so it must be documented for the BIOS writers. 

This may be a case of simple caution -- although disabling and re-enabling hardware prefetchers on a "live" system is typically safe, it is quite possible that there are corner cases in which such changes could cause the system to hang or generate incorrect results.  (That other vendor of x86_64 processors documents the MSRs required to control both the "core" and "memory controller" prefetchers.  The documentation does not address the issue of whether these are safe to modify on a "live" system.)  Enabling/disabling hardware prefetch is not a feature that could easily be considered "necessary" for customers (especially since the BIOS-based alternative exists), so the expense of exhaustive testing would have to be considered a very low priority in Intel's engineering budget.

Thanks. One more question: Section 2.2.5.4 of the Optimization Reference Manual says that data prefetched to the L2 and last-level cache is fetched from memory into the L2 cache and last-level cache. Does this imply that the Streamer and Spatial Prefetcher (the MLC prefetchers) fetch data directly from memory?

0 Kudos
McCalpinJohn
Honored Contributor III
400 Views

Most recent Intel processors have two "L1 prefetchers" and two "L2 prefetchers".  See, for example, the discussion of the Sandy Bridge core in the Intel Software optimization guide.  (The Nehalem/Westmere cores are similar.)   The "L1 prefetchers" bring data into the L1 cache, while (if I recall the wording correctly) the "L2 prefetchers" bring data into either the L3 or L2 cache, depending on how busy the system happens to be.

If an L2 prefetch finds data in the L3 cache, then it won't go all the way to DRAM, but any of the L1 or L2 prefetches will propagate all the way out to memory if necessary to find the desired cache line.

The "Last Level Cache Miss" event discussed in this thread appears to be incremented when demand misses or L1 prefetches miss in the L3 cache, but is not incremented when L2 prefetches miss in the L3 cache.  

0 Kudos
Bernard
Valued Contributor I
400 Views

I wonder if the prefetching implementation maintains some kind of prefetch-distance history table, which could be based on application performance (e.g., the count of cache misses).

0 Kudos