I would be interested in information about the behavior of the prefetch hint instructions (prefetcht0, prefetchnta, prefetchw, ...) on modern processors such as Sandy Bridge and Ivy Bridge. I ask because there is apparently nothing about it in the optimization guide. It would arguably be useful for developers to know into which cache level data is prefetched by each variant. I'd be glad if someone could provide a pointer to a detailed explanation.
 Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026, April 2012
Sorry for the off topic, but it is nice to see you again on the IDZ forums :)
Indeed, it has been a while since I last came here. Thanks for the warm welcome.
Hey, I see that in the meantime your black belt points have gone through the roof!
Sergey Kostrov wrote:
>>...I'll be interested to have information about the behavior of prefetch hints instructions such as prefetcht0, prefetchnta, prefetchw,...
>>for modern processors such as Sandy Bridge and Ivy Bridge...
There are also some optimization tips in the Intel C++ Compiler User and Reference Guides; please take a look.
Thanks for your feedback, Sergey.
I haven't found any processor-specific details in the C++ documentation so far. Basically, I have found:
- the documentation for the "prefetch insertion optimization" /Qopt-prefetch[:n]. I noticed that /Qopt-prefetch requires /O3, so I have to test it again: the last time I tried it I rushed my tests and compiled with /O2, with no visible change in my timings
- minimal explanation for the "Cacheability Support Intrinsics" and the _MM_HINT_T0, etc. hints
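For readers who have not used the cacheability intrinsics mentioned above, here is a minimal sketch of how _mm_prefetch and the _MM_HINT_* constants are typically applied in a loop. The function names, the PREFETCH_AHEAD distance and the 64-byte line size are assumptions for illustration only; nothing here says which cache level a given hint actually targets on Sandy Bridge or Ivy Bridge.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0, _MM_HINT_NTA */

/* Hypothetical tuning values, not taken from any Intel document:
   prefetch 64 ints (256 bytes) ahead, and assume 64-byte cache lines,
   i.e. 16 four-byte ints per line. */
#define PREFETCH_AHEAD 64
#define INTS_PER_LINE  16

/* Baseline: no explicit prefetch, hardware prefetchers only. */
long long sum_plain(const int *a, size_t n)
{
    long long s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Same loop with a temporal hint (T0), issued once per assumed cache line. */
long long sum_prefetch_t0(const int *a, size_t n)
{
    long long s = 0;
    for (size_t i = 0; i < n; ++i) {
        if ((i % INTS_PER_LINE) == 0 && i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)&a[i + PREFETCH_AHEAD], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}

/* Same loop with the non-temporal hint (NTA), meant for data used once. */
long long sum_prefetch_nta(const int *a, size_t n)
{
    long long s = 0;
    for (size_t i = 0; i < n; ++i) {
        if ((i % INTS_PER_LINE) == 0 && i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)&a[i + PREFETCH_AHEAD], _MM_HINT_NTA);
        s += a[i];
    }
    return s;
}
```

The hint argument has to be a compile-time constant, which is why each variant is a separate function. All three return the same sum; only the timing can differ, and that depends on the CPU.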
Sergey Kostrov wrote:
I recently ran into an issue when using _mm_prefetch on a computer with an Intel Core i7-3840QM CPU ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 ).
A piece of code with prefetching that works perfectly on older computers, for example with Pentium 4 or Atom N270 CPUs, doesn't provide any performance gain on the computer with the Ivy Bridge CPU. I think this is because of the significantly larger L3, L2 and L1 cache lines, and it is clear that I don't fetch data properly in my for-loops. Unfortunately, I still haven't had time to investigate it completely ( with VTune ), and _mm_prefetch is commented out for that configuration ( the code runs fast with and without prefetching ).
This is pretty much what I'm experiencing too on Ivy Bridge vs. P4 and older CPUs. A while ago I removed all explicit prefetches from my loops (they are handled well by the hardware prefetchers). I have only a very few cases left with explicit prefetch, where I see at best a 5% speedup in single-thread mode, down to 0% in multithread mode (8 threads with hyperthreading enabled on a Core i7 3770K).
Btw, Linus Torvalds reports a serious slowdown in the Linux kernel due to explicit prefetch here: http://www.realworldtech.com/forum/?threadid=132668&curpostid=132772
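For anyone who wants to repeat the with/without comparison on their own machine, a rough timing harness could look like the sketch below. The names (walk_plain, walk_prefetch, time_kernel) and the PF_DISTANCE value are hypothetical, and clock() only gives coarse CPU time for the whole process, so treat any numbers as indicative at best. This is not the code used for the measurements above.

```c
#include <stddef.h>
#include <time.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

#define PF_DISTANCE 64  /* hypothetical prefetch distance, in elements */

typedef long long (*kernel_fn)(const int *, size_t);

/* Loop without explicit prefetch. */
long long walk_plain(const int *a, size_t n)
{
    long long s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Same loop with one T0 prefetch per 16 ints (assumes 64-byte lines). */
long long walk_prefetch(const int *a, size_t n)
{
    long long s = 0;
    for (size_t i = 0; i < n; ++i) {
        if ((i & 15) == 0 && i + PF_DISTANCE < n)
            _mm_prefetch((const char *)&a[i + PF_DISTANCE], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}

/* Runs fn `reps` times and returns the average CPU time per call in
   seconds. The last result is stored in *out so that the compiler
   cannot discard the calls. */
double time_kernel(kernel_fn fn, const int *a, size_t n, int reps,
                   long long *out)
{
    volatile long long sink = 0;
    clock_t start = clock();
    for (int r = 0; r < reps; ++r)
        sink = fn(a, n);
    clock_t stop = clock();
    *out = sink;
    return (double)(stop - start) / CLOCKS_PER_SEC / reps;
}
```

Call time_kernel once with walk_plain and once with walk_prefetch on a buffer much larger than the last-level cache, check that both results match, and compare the two times. For the multithread case a wall-clock timer would be needed instead of clock().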