Prefetch instructions

bronxzv · ‎04-13-2013

I'd be interested in information about the behavior of the prefetch hint instructions (prefetcht0, prefetchnta, prefetchw, ...) on modern processors such as Sandy Bridge and Ivy Bridge. I ask because there is apparently nothing about it in the optimization guide [1]. It would be useful for developers to know which cache level data is prefetched into by each variant. I'd be glad if someone could provide a pointer to a detailed explanation.

[1] Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-026, April 2012

SergeyKostrov · ‎04-16-2013

>>...I'll be interested to have information about the behavior of prefetch hints instructions such as prefetcht0, prefetchnta, prefetchw,...
>>for modern processors such as Sandy Bridge and Ivy Bridge...

There are also some optimization tips in the Intel C++ Compiler User and Reference Guides; please take a look.

I recently ran into an issue with the use of _mm_prefetch on a computer with an Intel Core i7-3840QM CPU ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 ).

A piece of code with prefetching that works perfectly on older computers, for example with Pentium 4 or Atom N270 CPUs, doesn't provide performance gains when used on the computer with the Ivy Bridge CPU. I think this is because of significantly larger L3, L2 and L1 cache lines, and it is clear that I don't fetch data properly in for-loops. Unfortunately, I still haven't had time to investigate it completely ( with VTune ), and _mm_prefetch is commented out for that configuration ( the code works fast with and without prefetching ).
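For readers who have not seen the pattern, here is a minimal sketch of the kind of for-loop prefetching being discussed. It is not Sergey's code: the function name `sum_with_prefetch`, the `PREFETCH_T0` macro and the `PREFETCH_DISTANCE` value are all hypothetical, and the sketch assumes 64-byte cache lines and a roughly line-aligned array.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>               /* _mm_prefetch, _MM_HINT_T0 */
#define PREFETCH_T0(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH_T0(p) ((void)(p))   /* no-op on targets without SSE */
#endif

/* Hypothetical tuning constant: how many elements ahead to prefetch.
   The best value depends on the CPU and on the loop body. */
#define PREFETCH_DISTANCE 64

/* Assumption: 64-byte cache lines, i.e. 16 floats per line. */
#define FLOATS_PER_LINE 16

float sum_with_prefetch(const float *data, size_t n)
{
    assert(data != NULL || n == 0);

    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        /* Issue roughly one prefetch per cache line, and never for an
           address past the end of the array. */
        if ((i % FLOATS_PER_LINE) == 0 && i + PREFETCH_DISTANCE < n)
            PREFETCH_T0(&data[i + PREFETCH_DISTANCE]);
        sum += data[i];
    }
    return sum;
}
```

A sequential walk like this one is the access pattern that hardware prefetchers are designed to detect, so the explicit prefetch may well bring no gain here; timing the loop with and without the `PREFETCH_T0` line is the only reliable way to tell.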

Bernard · ‎04-18-2013

Hi bronxzv,

Sorry for the off-topic, but it is nice to see you again on the IDZ forums :)

bronxzv · ‎04-20-2013

iliyapolak wrote:

Hi bronxzv,

Sorry for the off-topic, but it is nice to see you again on the IDZ forums :)

Hi iliyapolak,

indeed, it has been a while since I was last here, thanks for the warm welcome

hey, I see that in the meantime your black belt points have gone through the roof!

bronxzv · ‎04-20-2013

Sergey Kostrov wrote:

>>...I'll be interested to have information about the behavior of prefetch hints instructions such as prefetcht0, prefetchnta, prefetchw,...
>>for modern processors such as Sandy Bridge and Ivy Bridge...

There are also some optimization tips in the Intel C++ Compiler User and Reference Guides; please take a look.

thanks for your feedback Sergey,

I haven't found any processor-specific details in the C++ documentation so far. Basically I have found:

- the documentation for the "prefetch insertion optimization" /Qopt-prefetch[:n]. I noticed that /Qopt-prefetch requires /O3, so I have to test it again: the last time I tried it I rushed my tests and compiled with /O2, with no visible change to my timings

- a minimal explanation of the "Cacheability Support Intrinsics" and the _MM_HINT_T0, etc. hints
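As a point of reference for the hints mentioned above, the sketch below shows the four hint constants that `<xmmintrin.h>` defines for `_mm_prefetch`. The `prefetch_hint` enum and the `prefetch_with_hint` wrapper are hypothetical names made up for this example. The comments paraphrase the generic description in the instruction set reference; which cache level a given processor actually fills is implementation-dependent, and that is exactly the open question in this thread.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>   /* _mm_prefetch and the _MM_HINT_* constants */
#define HAVE_MM_PREFETCH 1
#else
#define HAVE_MM_PREFETCH 0
#endif

/* Hypothetical enum, introduced only for this sketch. */
typedef enum { HINT_T0, HINT_T1, HINT_T2, HINT_NTA } prefetch_hint;

/* Issues a prefetch for the cache line containing p.
   Returns 1 if the hint was recognized, 0 otherwise.
   The hint argument of _mm_prefetch must be a compile-time constant,
   hence the switch instead of passing a variable through. */
int prefetch_with_hint(const void *p, prefetch_hint hint)
{
    assert(p != NULL);
#if HAVE_MM_PREFETCH
    const char *addr = (const char *)p;
    switch (hint) {
    case HINT_T0:   /* temporal data, all cache levels (generic description) */
        _mm_prefetch(addr, _MM_HINT_T0);
        return 1;
    case HINT_T1:   /* temporal data, second-level cache and higher */
        _mm_prefetch(addr, _MM_HINT_T1);
        return 1;
    case HINT_T2:   /* temporal data, third-level cache and higher */
        _mm_prefetch(addr, _MM_HINT_T2);
        return 1;
    case HINT_NTA:  /* non-temporal data, aims to minimize cache pollution */
        _mm_prefetch(addr, _MM_HINT_NTA);
        return 1;
    }
    return 0;
#else
    /* No SSE available: nothing is prefetched, only the hint is validated. */
    return (int)hint >= (int)HINT_T0 && (int)hint <= (int)HINT_NTA;
#endif
}
```

A prefetch is only a hint: it does not change program results, so a wrapper like this can be swapped in and out freely while measuring.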

Sergey Kostrov wrote:

I recently experienced some issue with application of _mm_prefetch on a computer with Intel Core i7-3840QM CPU ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 ).

A piece of code with prefetching that works perfectly on older computers, for example with Pentium 4 or Atom N270 CPUs, doesn't provide performance gains when used on the computer with the Ivy Bridge CPU. I think this is because of significantly larger L3, L2 and L1 cache lines, and it is clear that I don't fetch data properly in for-loops. Unfortunately, I still haven't had time to investigate it completely ( with VTune ), and _mm_prefetch is commented out for that configuration ( the code works fast with and without prefetching ).

this is pretty much what I'm experiencing too on Ivy Bridge vs. P4 and older CPUs. A while ago I removed all explicit prefetches in loops (they are handled well by the hardware prefetchers). I still have explicit prefetches in only a very few cases, where I see at best a 5% speedup in single-thread mode, down to 0% in multithread mode (8 threads with hyperthreading enabled on a Core i7 3770K)
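The post does not say which cases still benefit. One pattern that is commonly cited as hard for hardware prefetchers is indirect indexing, where the next address depends on data rather than on a fixed stride. The sketch below illustrates that pattern only as an assumption; `gather_sum`, `PREFETCH_T0` and `LOOKAHEAD` are hypothetical names and values, not taken from bronxzv's code.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>               /* _mm_prefetch, _MM_HINT_T0 */
#define PREFETCH_T0(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
#define PREFETCH_T0(p) ((void)(p))   /* no-op on targets without SSE */
#endif

/* Hypothetical tuning constant: how many iterations ahead to prefetch. */
#define LOOKAHEAD 8

/* Sums table[idx[i]] for i in [0, n). The index array is read
   sequentially, but the table accesses jump around, so the address of
   a future table element is computed from idx and prefetched early. */
double gather_sum(const double *table, const unsigned *idx, size_t n)
{
    assert((table != NULL && idx != NULL) || n == 0);

    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (i + LOOKAHEAD < n)
            PREFETCH_T0(&table[idx[i + LOOKAHEAD]]);
        sum += table[idx[i]];
    }
    return sum;
}
```

Whether this helps at all depends on the table size relative to the caches and on how many threads compete for memory bandwidth, which would be consistent with the gain shrinking in multithread mode.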

btw Linus Torvalds reports a serious slowdown in the Linux kernel due to explicit prefetch here: http://www.realworldtech.com/forum/?threadid=132668&curpostid=132772

Bernard · ‎04-20-2013

Thanks bronxzv.

Yes, I am spending a lot of time on this forum, gaining knowledge and sharing it with other users.