Performance Counters to measure L1, L2 Cache Misses
Hi, I'm currently optimizing some algorithms in assembler using software prefetching.
Now I'd like to measure the effect of the changes. I used the performance counters below on my Xeon 5130 (Intel Core architecture). But while the execution time decreases after the optimization, the L1 and L2 cache misses seem to increase.
In normal use, prefetching must be expected to increase the number of misses; it may increase the number of requests by an even larger margin. I suspect these raw counters include plenty of duplicate counts. Even on the VTune forum, you'd be lucky to get a full explanation of how the somewhat more meaningful statistics, such as misses retired and hit ratios, are derived.
The point of prefetch is to generate an extra miss so as to begin loading data into cache at a sufficient interval before the program requires it. Unless you manage the software prefetch so as to hit each required cache line (and only those lines) exactly once, you are generating duplicate misses yourself. Those don't cost much, particularly if they don't lead to more misses retired, but you are counting them.

On most recent CPU models, software prefetch will eliminate some hardware prefetches, even if the latter are still turned on. Due to out-of-order execution, the program will still attempt to access the data before it could actually use them, generating additional misses; this can be seen as guarding against the possibility that no prefetch is under way. VTune and PTU have events available for seeing how many of these misses occur. With luck, the data will arrive before they are actually needed.

On CPUs which are designed to allow prefetching beyond the end of the data stream, it is more efficient to let the software prefetch run over than to spend time testing whether the prefetch runs beyond the loop. The latter is required on some more primitive instruction sets, such as LRBni. The former frequently results in 30% more data being fetched than is ever used, and a corresponding increase in misses; it has to get much worse than that before it becomes worthwhile to avoid. If you haven't read about the pros and cons of hardware vs. software prefetch, you will need to take that into account in your work, along with the many other available references on pertinent subjects.
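To make the "let the prefetch run over the end" point concrete, here is a minimal sketch in C using the GCC/Clang `__builtin_prefetch` builtin (the function name and the `PF_DIST` tuning value are my own assumptions, not anything from this thread). The prefetch deliberately runs past the end of the buffer: on x86 the prefetch instructions are hints and never fault, so skipping a bounds test on the prefetch address is usually cheaper than guarding it every iteration.

```c
#include <stddef.h>
#include <stdint.h>

#define PF_DIST 256  /* prefetch distance in bytes; an assumed tuning value */

/* Sum a buffer while prefetching PF_DIST bytes ahead, one prefetch per
 * 64-byte cache line.  The prefetch intentionally runs past the end of
 * the buffer rather than being bounds-checked. */
uint64_t sum_with_prefetch(const uint8_t *buf, size_t len)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i += 64) {
        __builtin_prefetch(buf + i + PF_DIST, 0, 3); /* 0 = read, 3 = high locality */
        size_t end = (i + 64 < len) ? i + 64 : len;
        for (size_t j = i; j < end; j++)
            sum += buf[j];
    }
    return sum;
}
```

Whether the over-fetch at the end is worth it depends on how large the tail is relative to the whole stream, as discussed above.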
"Unless you manage the software prefetch so as to hit each required cache line (and only those lines) just once, you are generating duplicate misses yourself."
That's exactly what I am doing... The input data for my algorithm is exactly 256 KByte, and I use a loop which works with 128 Bytes each iteration. I prefetch the first value before the loop and unrolled (peeled) the last iteration, so that all values used by the loop are prefetched without any overhead. The prefetch scheduling distance is one iteration, which seems to work fine.
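In case it helps discussion, the scheme I described looks roughly like this in C (my real code is assembler; the names here are just for illustration). The first chunk is prefetched before the loop, each iteration prefetches the next chunk, and the last iteration is peeled so no prefetch is issued for data past the input:

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK 128                 /* bytes processed per iteration */
#define INPUT_SIZE (256 * 1024)   /* total input size */

uint64_t process(const uint8_t *in /* INPUT_SIZE bytes */)
{
    uint64_t acc = 0;
    __builtin_prefetch(in, 0, 3);               /* prefetch the first chunk */
    size_t last = INPUT_SIZE - CHUNK;
    for (size_t i = 0; i < last; i += CHUNK) {
        __builtin_prefetch(in + i + CHUNK, 0, 3); /* next iteration's data */
        for (size_t j = 0; j < CHUNK; j++)
            acc += in[i + j];                     /* stand-in for the real work */
    }
    for (size_t j = 0; j < CHUNK; j++)            /* peeled last iteration: no prefetch */
        acc += in[last + j];
    return acc;
}
```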
Too bad the Xeon 5130 doesn't have the performance counters for prefetching :(
Try 64 B per software prefetch instead, since that is the L1/L2 cache-line size of your target CPU.
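Concretely: a 128-byte iteration spans two 64-byte lines, so a single prefetch per iteration only covers half of the chunk; issuing a second prefetch at base+64 covers both lines. A hedged sketch (names are illustrative, and the final prefetches run one chunk past the end, which, as noted earlier in the thread, is harmless on x86):

```c
#include <stddef.h>
#include <stdint.h>

/* Process len bytes (assumed to be a multiple of 128) in 128-byte chunks,
 * prefetching both 64-byte lines of the next chunk each iteration. */
uint64_t process_both_lines(const uint8_t *in, size_t len)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < len; i += 128) {
        __builtin_prefetch(in + i + 128, 0, 3);      /* first line of next chunk */
        __builtin_prefetch(in + i + 128 + 64, 0, 3); /* second line of next chunk */
        for (size_t j = 0; j < 128; j++)
            acc += in[i + j];                        /* stand-in for the real work */
    }
    return acc;
}
```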
Though when reading for ownership it may fetch two lines at once (depending on your BIOS settings; look for "adjacent cache line prefetch"), so maybe your timings will not improve. I don't know how this shows up in the perf counters. NB: this was already the case on the P4, and for this reason some people claimed the P4's L2 cache line size was 128 B, which was wrong.
Glad you did... That's one thing I find cool about VTune.
As for prefetch, I have no clue about it, because the hardware prefetcher is also at work. In fact, you should be happy that prefetch gave you the performance you wanted; for me, nothing really changed with prefetch, so I just dropped the idea of using it.