- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I'm currently optimizing some algorithms in assembler using software prefetching.
Now I'd like to measure the changes.
I used the performance counters below on my Xeon 5130 with Intel Core Architecture. But while the execution time decreases after optimization the l1 and l2 cache misses seem to increase.
Performance counters I used:
Use Eventnum Umask
L1 Requests 0x40 0x0F
L2 Requests 0x2E 0xFF
L1 Misses 0xCB 0x02
L2 Misses 0x24 0x01
Are these the right performance counters?
Thanks in advance,
Michael
I'm currently optimizing some algorithms in assembler using software prefetching.
Now I'd like to measure the changes.
I used the performance counters below on my Xeon 5130 with Intel Core Architecture. But while the execution time decreases after optimization the l1 and l2 cache misses seem to increase.
Performance counters I used:
Use Eventnum Umask
L1 Requests 0x40 0x0F
L2 Requests 0x2E 0xFF
L1 Misses 0xCB 0x02
L2 Misses 0x24 0x01
Are these the right performance counters?
Thanks in advance,
Michael
Link Copied
11 Replies
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
In normal use, prefetch must be expected to increase number of misses. It might increase number of requests by a larger margin than misses. I suspect these raw counters include plenty of duplicate counts. Even on the VTune forum, you'd be lucky to get a full explanation of how somewhat more meaningful statistics such as misses retired and hit ratios are derived.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Why should the number of misses (and requests) increase if execution time decreases (up to 50%).
So where could I get answers? (My bachelor thesis depends on this problem...)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The point of prefetch is to generate an extra miss so as to begin loading data to cache at a sufficient interval before the program requires them. Unless you manage the software prefetch so as to hit each required cache line (and only those lines) just once, you are generating duplicate misses yourself. Those don't cost much, particularly if they don't lead to more misses retired, but you are counting them. On most recent CPU models, software prefetch would eliminate some hardware prefetch, even if the latter are still turned on. Due to out-of-order execution, the program will still attempt to access the data before it could actually use them, generating additional misses, which might be seen as guarding against the possibility that no prefetch is under way. You have available some events for VTune and PTU to see how many of these misses occur. With luck, the data will arrive before they are actually needed.
On CPUs which are designed to allow prefetch beyond the end of the data stream, it is more efficient to let the software prefetch run over than to take the time to test whether the prefetch runs beyond the loop. The latter is required on some more primitive instruction sets, such as LRBni. The former frequently results in 30% more data being fetched than are ever used, and a corresponding increase in misses. It has to get much worse than that before it becomes worth while to avoid it.
If you haven't read about the pros and cons of hardware vs. software prefetch you will need to take that into account in your work, as well as finding many other available references on pertinent subjects.
On CPUs which are designed to allow prefetch beyond the end of the data stream, it is more efficient to let the software prefetch run over than to take the time to test whether the prefetch runs beyond the loop. The latter is required on some more primitive instruction sets, such as LRBni. The former frequently results in 30% more data being fetched than are ever used, and a corresponding increase in misses. It has to get much worse than that before it becomes worth while to avoid it.
If you haven't read about the pros and cons of hardware vs. software prefetch you will need to take that into account in your work, as well as finding many other available references on pertinent subjects.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
"Unless you manage the software prefetch so as to hit each required cache line (and only those lines) just once, you are generating duplicate misses yourself."
Thats exactly what I am doing... The data input for my algorithm is exactly 256kByte large and I use a loop which works wich 128Byte each Iteration. I prefetched the first value and unrolled the last Iteration. So that all values used by that loop are being prefetched without any overhead. Prefetch Scheduling distance is one Iteration, which seems to work fine.
Too bad the Xeon 5130 doesn't have the performance counters for prefetching :(
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Maybe it's just the implementation of the prefetch instructions...
Could it be that they generate "Fake-Misses" that trigger the hardware prefetch units?
Could it be that they generate "Fake-Misses" that trigger the hardware prefetch units?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>wich 128Byte each Iteration.
try with 64B per software prefetch instead since it's the L1/L2/L3 $ line size of your target CPU
though when reading for ownership it may fetch 2 lines at once (depending on your BIOS settings, look at "adjacent cacheline prefetch") so maybe your timings will not be improved, don't know about the perf counters. NB: this was like that already on the P4 and for this reason some people were saying that the L2$ line size was 128B on P4 which was wrong
try with 64B per software prefetch instead since it's the L1/L2/L3 $ line size of your target CPU
though when reading for ownership it may fetch 2 lines at once (depending on your BIOS settings, look at "adjacent cacheline prefetch") so maybe your timings will not be improved, don't know about the perf counters. NB: this was like that already on the P4 and for this reason some people were saying that the L2$ line size was 128B on P4 which was wrong
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yep I know the adjacent cache line prefetch feature, its switched on in BIOS and thats why I use 128Byte per iteration. 64Byte is slower.
Damn ... the more performance counters I measure the less I know why prefetching causes a significant speedup!
Damn ... the more performance counters I measure the less I know why prefetching causes a significant speedup!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
"Damn ... the more performance counters I measure the less I know why prefetching causes a significant speedup!"
:-) Welcome to the world of Intel optimizations.
None really knows what is happening inside the CPU... It follows heisenburg uncertainity principle... lol..
:-) Welcome to the world of Intel optimizations.
None really knows what is happening inside the CPU... It follows heisenburg uncertainity principle... lol..
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Just measure the RS_FULL ratio and ROB_FULL ratio.. YOu will have an idea of how easily the muops r flowing through the pipe..
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yeah,
figured that out too.
Well it doesn't explain the prefetching behavior but it shows why my code is faster than before.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Glad you did.. Thats one thing that i find cool with vTune....
As far as prefetch -- I have no clue about it.... Coz, the hardware prefetcher is also at work... In fact, you should be happy that prefetch gave u the performance you wanted... for me, nothing really changed with prefetch.. I just dropped the idea of using it..
As far as prefetch -- I have no clue about it.... Coz, the hardware prefetcher is also at work... In fact, you should be happy that prefetch gave u the performance you wanted... for me, nothing really changed with prefetch.. I just dropped the idea of using it..
Reply
Topic Options
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page