Performance Counters to measure L1, L2 Cache Misses
Hi, I'm currently optimizing some algorithms in assembler using software prefetching.
Now I'd like to measure the effect of the changes. I used the performance counters below on my Xeon 5130 (Intel Core architecture). But while the execution time decreases after the optimization, the L1 and L2 cache misses seem to increase.
In normal use, prefetching must be expected to increase the number of misses; it may increase the number of requests by an even larger margin. I suspect these raw counters include plenty of duplicate counts. Even on the VTune forum, you'd be lucky to get a full explanation of how the somewhat more meaningful statistics, such as misses retired and hit ratios, are derived.
The point of prefetch is to generate an extra miss so as to begin loading data into cache at a sufficient interval before the program requires it. Unless you manage the software prefetch so as to hit each required cache line (and only those lines) exactly once, you are generating duplicate misses yourself. Those don't cost much, particularly if they don't lead to more misses retired, but you are counting them.

On most recent CPU models, software prefetch will eliminate some hardware prefetches, even if the latter are still turned on. Due to out-of-order execution, the program will still attempt to access the data before it could actually use them, generating additional misses; this can be seen as guarding against the possibility that no prefetch is under way. VTune and PTU have events available for seeing how many of these misses occur. With luck, the data will arrive before they are actually needed.

On CPUs which are designed to allow prefetching beyond the end of the data stream, it is more efficient to let the software prefetch run over than to spend time testing whether the prefetch runs beyond the loop. The latter is required on some more primitive instruction sets, such as LRBni. The former frequently results in 30% more data being fetched than is ever used, and a corresponding increase in misses; it has to get much worse than that before it becomes worthwhile to avoid. If you haven't read about the pros and cons of hardware vs. software prefetch, you will need to take that into account in your work, along with the many other available references on pertinent subjects.
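To make the "let the prefetch run over the end" point concrete, here is a minimal sketch in C using the GCC/Clang `__builtin_prefetch` builtin (the function name and the `PF_DIST` tuning value are my own assumptions, not anything from this thread). The prefetch deliberately runs past the end of the buffer: on x86 the prefetch instructions are hints and never fault, so skipping a bounds test on the prefetch address is usually cheaper than guarding it every iteration.

```c
#include <stddef.h>
#include <stdint.h>

#define PF_DIST 256  /* prefetch distance in bytes; an assumed tuning value */

/* Sum a buffer while prefetching PF_DIST bytes ahead, one prefetch per
 * 64-byte cache line.  The prefetch intentionally runs past the end of
 * the buffer rather than being bounds-checked. */
uint64_t sum_with_prefetch(const uint8_t *buf, size_t len)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i += 64) {
        __builtin_prefetch(buf + i + PF_DIST, 0, 3); /* 0 = read, 3 = high locality */
        size_t end = (i + 64 < len) ? i + 64 : len;
        for (size_t j = i; j < end; j++)
            sum += buf[j];
    }
    return sum;
}
```

Whether the over-fetch at the end is worth it depends on how large the tail is relative to the whole stream, as discussed above.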
"Unless you manage the software prefetch so as to hit each required cache line (and only those lines) just once, you are generating duplicate misses yourself."
That's exactly what I am doing... The input data for my algorithm is exactly 256 KByte, and I use a loop which works with 128 Bytes each iteration. I prefetch the first value before the loop and unrolled (peeled) the last iteration, so that all values used by the loop are prefetched without any overhead. The prefetch scheduling distance is one iteration, which seems to work fine.
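In case it helps discussion, the scheme I described looks roughly like this in C (my real code is assembler; the names here are just for illustration). The first chunk is prefetched before the loop, each iteration prefetches the next chunk, and the last iteration is peeled so no prefetch is issued for data past the input:

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK 128                 /* bytes processed per iteration */
#define INPUT_SIZE (256 * 1024)   /* total input size */

uint64_t process(const uint8_t *in /* INPUT_SIZE bytes */)
{
    uint64_t acc = 0;
    __builtin_prefetch(in, 0, 3);               /* prefetch the first chunk */
    size_t last = INPUT_SIZE - CHUNK;
    for (size_t i = 0; i < last; i += CHUNK) {
        __builtin_prefetch(in + i + CHUNK, 0, 3); /* next iteration's data */
        for (size_t j = 0; j < CHUNK; j++)
            acc += in[i + j];                     /* stand-in for the real work */
    }
    for (size_t j = 0; j < CHUNK; j++)            /* peeled last iteration: no prefetch */
        acc += in[last + j];
    return acc;
}
```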
Too bad the Xeon 5130 doesn't have the performance counters for prefetching :(
Try 64 B per software prefetch instead, since that is the L1/L2 cache-line size of your target CPU.
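Concretely: a 128-byte iteration spans two 64-byte lines, so a single prefetch per iteration only covers half of the chunk; issuing a second prefetch at base+64 covers both lines. A hedged sketch (names are illustrative, and the final prefetches run one chunk past the end, which, as noted earlier in the thread, is harmless on x86):

```c
#include <stddef.h>
#include <stdint.h>

/* Process len bytes (assumed to be a multiple of 128) in 128-byte chunks,
 * prefetching both 64-byte lines of the next chunk each iteration. */
uint64_t process_both_lines(const uint8_t *in, size_t len)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < len; i += 128) {
        __builtin_prefetch(in + i + 128, 0, 3);      /* first line of next chunk */
        __builtin_prefetch(in + i + 128 + 64, 0, 3); /* second line of next chunk */
        for (size_t j = 0; j < 128; j++)
            acc += in[i + j];                        /* stand-in for the real work */
    }
    return acc;
}
```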
Though when reading for ownership it may fetch two lines at once (depending on your BIOS settings; look for "adjacent cache line prefetch"), so maybe your timings will not improve. I don't know how this shows up in the perf counters. NB: this was already the case on the P4, and for this reason some people claimed the P4's L2 cache line size was 128 B, which was wrong.
Glad you did... That's one thing I find cool about VTune.
As for prefetch, I have no clue about it, because the hardware prefetcher is also at work. In fact, you should be happy that prefetch gave you the performance you wanted; for me, nothing really changed with prefetch, so I just dropped the idea of using it.